Counting the number of fields in each record

To demonstrate how to find problems in data, we will examine the common problems found in csv files. The first common issue is that the number of fields in each record does not always match the number of columns in the heading line. For this example, I created a simple csv file with an inconsistent number of fields named poorFieldCounts.csv. When typing up this file, make sure that the last line ends with a newline character. Some implementations of the Haskell CSV library require the following:

Name,FavoriteColor,FavoriteFood
Fred,Orange,Ribs
Wilma,White
Barney,Brown,Pie,Bowling
Betty,Blue,Cake

We can write a simple check for this using the following function. This function assumes that Text.CSV and Data.List have been imported:

countFieldsInEachRecord :: CSV -> [Integer]
countFieldsInEachRecord csv = map genericLength (init csv)

The preceding function will take an already opened csv file and count the number of fields in each record.

Testing the following statements that shows the number of fields in the heading row and every subsequent record:

> let csv = parseCSVFromFile "poorFieldCounts.csv"
> either Left
    (csv -> Right $ countFieldsInEachRecord csv) csv
Right [3,3,2,4,3]

You may see that the function reports the correct information, but this will not be useful if you have several thousands of lines in a csv file. We should filter out records that have exactly one field per column heading so that we can quickly identify the problem areas.

Our next function does just that. The lineNumbersWithIncorrectCount function will return a list of the Integer pairs. Within each pair, the first integer will represent the line number of a file and the second will report the number of fields that exist in that row:

lineNumbersWithIncorrectCount :: CSV -> [(Integer, Integer)]
lineNumbersWithIncorrectCount (fields:csv) = filter
    ((_, thisCount) -> thisCount /= nfields)
    lineNoCountPairs
  where
    nfields = genericLength fields
    count = countFieldsInEachRecord csv
    lineNoCountPairs = zip [1..] count

Executing the preceding code reveals where we should focus our corrections:

    > either Left
      (csv -> Right $ lineNumbersWithIncorrectCount csv) csv
    Right [(3,2),(4,4)]

From the result of the previous code we can see that line 3 only has two fields (3,2), whereas line 4 has four fields (4,4). How you chose to fix your file will be based on the circumstances of your problem, for example, your situation may allow you to ignore these records by either skipping the records or deleting them (always back up your data first).

We will test our function on a csv file that we already know to be correct, such as the earthquake data file used in the previous chapter:

> input <- readFile "all_week.csv"
> csv <- parseCSVFromFile "all_week.csv"
> either Left
  (csv -> Right $ lineNumbersWithIncorrectCount csv)
  csv
Right []

Good! An empty list (which is what we expected) is returned.

Our next venture into the cleaning of data depends on whether each record in our csv files has the correct number of fields.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset