Working with csv files

A common format to exchange data externally between an organization and the outside world is the CSV (short for Comma Separated Values) format. Files of this format have the .csv extension. Most spreadsheet software packages will be able to import and export csv files. Also, many database software packages will have the ability to export tables to csv files. The format of a csv file is relatively simple. A csv file is a plain text document containing multiple records, and each record begins on a new line (but it will not necessarily end on the same line). The first line of the document is not a record, but it contains a description of each column in the dataset. Each record will consist of values (or fields), each separated by a delimiter character (usually a comma). A field that is wrapped in double quotes may contain delimiter characters and newline characters. There is no documented standard that defines a proper csv file. The first line in the document might be the first record of the data. A record can span multiple lines. Some people like using tabs as their delimiter. Despite being such a deceptively simple file format, there is plenty of ambiguity. It is good practice to avoid writing your own csv library and to instead use a trusted library.

Preparing our environment

Using the cabal tool, we need to download the Text.CSV library. From the command line, type the following:

cabal install csv

We wish to use the Text.CSV library to parse our csv file. It will be helpful to learn a little more about this library. There are three Haskell types that are found in Text.CSV. The first type is CSV, which is our representation of a csv file. A CSV is composed of a list of the Record type (the second type). Each Record is a list of the Field type (the third type). Each Field is a String. Internally, a csv file is represented using these types, but you can think of a csv file as an expression of the [[String]] type.

Throughout our discussion on csv files, I will refer to the csv file format as csv and the Haskell CSV expression type as CSV.

Before we continue any further, we will perform some housekeeping on our LearningDataAnalysis02.hs file. These import statements will be necessary for our work in the following example. Any time we mention an import statement, add it to the beginning of the file after the module line:

import Data.List
import Data.Either
import Text.CSV

Describing our needs

Having addressed the ambiguity of csv files, when csv files are addressed in this book, we will assume the following:

  • The first line in the file contains a description of each column
  • Each line after the first line contains exactly one record
  • Each record field is separated by a comma character
  • The datasets may have quoted strings, which will not contain newline characters

We should get some practice with regard to performing simple data analysis-related work on a csv file. The United States Geological Survey (USGS) presents a variety of data to the general public, including recent data on worldwide earthquakes. By visiting the following site, you can download earthquake-related data that spans the past hour, day, seven days, and 30 days:

http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php

From this site, you will see that the format of the data is in csv, the columns are clearly defined on the first line of every csv file, and the data is updated every five minutes.

I downloaded the csv file representing all the earthquakes in the past week. When the file was downloaded, the file size was just over 220 KB in size and contained 1,412 records of earthquakes. Based on our discussion in the previous section, we will attempt to answer the question—What was the average magnitude of every earthquake in the past week?

Crafting our solution

Before we consider an answer to our question, we must consider how we are going to access the data. Our data is buried in a file and we need to extract this information. We will craft a function that applies a desired operation (which will be supplied by the user) to every value in a single column of a csv file. We will make this function generic with regard to a desired operation because we wish to make it possible for users to pass it in their own functions (not just the function used in this example). Our goal is to craft a function that accepts the following input parameters:

  • A function that will perform our desired operation (in this case, the desired operation will be the average function written earlier)
  • A filename representing a csv file (a FilePath)
  • A column within our csv file on which the desired operation will be applied (a String)

In addition, our function should return a value, and the type of this value type will be determined by the desired operation's function (which will require a polymorphic type—more on this will be discussed later). Because each field in our file is a string, the input type of the passed function must also be a String. Also, we need some mechanism to handle the errors that might arise while performing our work. For this, we will return our values wrapped in the Either type, which is commonly used to pass error messages. Finally, since this is going to involve file I/O, we need to wrap the Either type in the IO monad.

Having taken all this into consideration, we will now present the signature of this function:

applyToColumnInCSVFile :: ([String] -> b) -> FilePath -> String -> IO (Either String b)

This is a lot to take in and we should not expect those new to data analysis in Haskell to think of all these considerations immediately. We are going to put this aside and come back to it later. There are two design considerations that will go into this function:

  • Finding the column index of the specified column that interacts with the csv file itself
  • In typical Haskell fashion, these concepts will be demonstrated with their own functions

Finding the column index of the specified column

We will start with the easier of the two design considerations—finding the column index of the specified column. This function will accept a CSV value and a column name as input, and it should return either the integer of the column or an error message. To handle the returning of the two possible types, we will use the Data.Either library. Since this function will take a CSV value (rather than a CSV filename), we do not have to concern ourselves with the wrapping of the returned value in the IO monad. Here is our function:

getColumnInCSV :: CSV -> String -> Either String Integer
getColumnInCSV csv columnName =
  case lookupResponse of
    Nothing -> Left
      "The column does not exist in this CSV file."
    Just x -> Right (fromIntegral x)
  where
  -- This line looks to see if column is in our CSV
  lookupResponse = findIndex (== columnName) (head csv)

The heart of this function is the very last line, in which we called the findIndex function using the column name and the first line of the csv file as the input. The findIndex function is found in Data.List, and it will return a Maybe Int type constructor. Maybe is one of the basic Haskell types consisting of two type constructors—Just and Nothing. On finding the column, findIndex will return a Just type, but if the column is not found, then Nothing is returned. We will perform pattern matching on the returned Maybe type to convert this to an Either type. On finding Nothing, we will convert this pattern into an error message that explicitly states that The column does not exist in this csv file. On finding the specific index of the column, we will return that column index and convert it into an Integer in the process.

We will perform some demonstrations of our function. For this demonstration, we will use the earthquake csv file. Our precondition for this test is that the csv file exists and is properly formatted. What we do not assume is that the desired column will exist in the csv file. In this example, we know there exists a column called mag. Our function resides in a module file called LearningDataAnalysis02.hs. We will load this module in the interactive interpreter using the : l (for load) command, as follows:

> :l LearningDataAnalysis02.hs
> csv <- parseCSVFromFile "all_week.csv"
> either (error -> Left "Problem Reading File") (csv -> getColumnInCSV csv "mag") csv
Right 4

Good. Note that the column exists on index 4. (Haskell, like C, C++, and Java, begins list indices with 0) The Right wrapper identifies that the function correctly identified the column and this column is 4. We will test for something that fails, as follows:

> either (error -> Left "Problem Reading File") (csv -> getColumnInCSV csv "not a column") csv
> Left "The column does not exist in this CSV file."

There is no column named not a column in this file and our function correctly reports this using the Left wrapper. In line with the typical use of Either, Right is used for the successful execution and Left is to report error messages.

The Maybe and Either monads

If you are new to Haskell, you might ask yourself, "What are Maybe, Just, and Nothing?" Maybe is a data type with two constructors—Just and Nothing. The Just constructor takes one value and Nothing takes no values. Maybe is also an instance of a general and powerful abstraction called a monad. In order to explain the importance of Maybe, I'm going to describe the features from other programming languages. Imagine the int data type found in the C language. C's int and Haskell's Int share a few things in common. Both are bounded types with a maximum and minimum value and are only allowed to hold the integer data, and both must always have a value associated with the variable of this type. In C, if a variable of the int type isn't given a value, the initial value will be whatever the compiler decides it to be (which is why you should always give C variables an initial value). It still has a value! In fact, there's no way to express that a C int variable doesn't have any value. Contrast this with the language of Java, which allows you to create an object of the Integer class. Java will initialize the variable with the null value. Java's null specifically means that the variable has no value, which means that all Java references can have either a value of the specified type or none at all. Haskell allows us to do the same thing using Maybe. When a variable is defined as a Maybe type, it means something similar to Java's notion of a reference. It is an explicit way of saying that this variable might hold a value of the specified data (the Just value) or it might hold Nothing.

Likewise, you might be asking yourself about the Either construct that we keep using. Either is a more advanced form of Maybe. When we encounter a variable in Haskell representing a Maybe, we're saying that it maybe holds a value, but we're not saying why if you find Nothing. The Either construct was created to address this. An Either expression comprises of two expressions—Left a and Right b. Either allows us to represent an expression that can evaluate two different expressions, each with their own constructors (Left and Right). Often, the Either expression is used to report error messages in code since computations can either be successful or can return an error message on failure. When using the Either expression, we can define the Left and Right expressions with whatever we want, but by convention, the Right side of Either is used to return the results of a successful execution, and the Left side of Either is used to return a String error message outlining why a function call failed. To help you remember, try remembering that Right is right and Left is wrong.

Applying a function to a specified column

Our next function will do the actual work of applying our desired function to a column within the csv file. This will be similar to the primary function with the only difference being that this function will assume that an already opened CSV data expression will be passed instead of a filename:

applyToColumnInCSV :: ([String] -> b) -> CSV -> String -> Either 
  String b
applyToColumnInCSV func csv column = either
    Left
    Right . func . elements
  columnIndex
  where
  columnIndex = getColumnInCSV csv column
  nfieldsInFile = length $ head csv
  records = tail $
    filter (
ecord -> nfieldsInFile == length record) csv
  elements ci = map
    (
ecord -> genericIndex record ci) records

Note that this function signature requests a value of CSV expression instead of String. This allows us to focus on the task of data analysis instead of being concerned about the file's I/O. There are a few considerations here that need to be explained. Starting with the where clause, we used the getColumnInCSV function to get the column index. Next, we counted the number of fields represented on the first row of our csv file. We used the number of fields to filter the records and ignore any records that do not have one field per field heading. The line beginning with elements ci will take the records and reduce them to a single column defined by columnIndex. Since columnIndex represents either a real column index or an error message, our function will make sure that the error message is propagated back out in a Left call or the actual work of applying a function to our column is performed inside a Right call.

We need to return to the discussion of the polymorphic type used in our function call identified as b. We will apply a function to a column in the csv file, but no assumptions are made with regard to the type of data that exists in the column or the function passed. In the context of our problem, we wish to compute the average magnitude of earthquakes over the span of a week. The data in the column is the String data that needs to first be interpreted into the Double value. Here is a simple helper function to do just that. Both this function and the average function need to be in LearningDataAnalysis02.hs:

readColumn :: [String] -> [Double]
readColumn xs = map read

As always, we should test our function in the following manner:

> csv <- parseCSVFromFile "all_week.csv"
> either
 (error -> Left "Problem Reading File")
 (csv -> applyToColumnInCSV (average . readColumn) csv "mag")
 csv
Right 1.6950637393767727

Note that the average magnitude for all earthquakes over the course of a sample week is 1.70 (rounded). Of course, if you are following along, your result will vary. We need to test this function again with a bad column name, as follows:

> either
 (error -> Left "Problem Reading File")
 (csv -> applyToColumnInCSV (average . readColumn) csv "not a column")
 csv
Left "The column does not exist in this CSV file."

This function appears to be in working order. Now, we will return to the writing of a function that puts what we have learned together. This function will be responsible for the opening of a csv file and the calling of the helper functions, as follows:

-- Opens a CSV file and applies a function to a column
-- Returns Either Error Message or the function result
applyToColumnInCSVFile ::
 ([String] -> b) -> FilePath -> String -> IO (Either String b)
applyToColumnInCSVFile func inFileName column = do
  -- Open and read the CSV file
  input <- readFile inFileName
  let records = parseCSV inFileName input
  -- Check to make sure this is a good csv file
  return $ either
    handleCSVError
    (csv -> applyToColumnInCSV func csv column)
    records
  where
    handleCSVError ==
      Left "This does not appear to be a CSV file."

The only consideration that needs to be addressed in this function is the use of the return call. This call makes sure that the Either value is wrapped in an IO monad.

Now, we will test the following code:

> :l LearningDataAnalysis02.hs
> applyToColumnInCSVFile (average . readColumn) "all_week.csv" "mag"
Right 1.6950637393767727

We will test the preceding code again, but with a nonexistent column, as follows:

> applyToColumnInCSVFile (average . readColumn) "all_week.csv" "not a column"
Left "The column does not exist in this CSV file."

It takes a lot of work in Haskell to reach such a simple function call and see a result. Once you build the layers of resources in your modules, you will see us pick up speed, for example, in the Data.List module, there are the maximum and minimum functions that we can use immediately with the work we have done so far, as follows:

> applyToColumnInCSVFile (maximum . readColumn)
                         "all_week.csv" "mag"
Right 6.7
> applyToColumnInCSVFile (minimum . readColumn)
                         "all_week.csv" "mag"
Right (-0.3)

While writing this example, I learned that the Richter magnitude scale can record negative values. I readily admit that I am not a domain expert in the geological sciences. Knowledge of the domain in which you will be working is something that is essential to an accurate interpretation of the data that you will be analyzing. When it comes to the geological sciences (or any other unfamiliar domain), should you lack the same expertise as me, do not be shy about partnering with an expert-level friend.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset