Searching fields based on a regular expression

We are going to take what we learned in the previous section and extend it to csv files. We wish to identify every field in a csv file that matches a regular expression. These functions will require the Text.CSV module (installed in the last chapter) and the getColumnInCSV function that we wrote in the LearningDataAnalysis02 module.

In Chapter 2, Getting Our Feet Wet, we listed several assumptions that we would be making about csv files. We will now add a new assumption: every csv file will have a unique identifier column somewhere in the file. A unique identifier column holds one identifier per row (none duplicated) that stands for that row's data; in a relational database, this would be the primary key field. We will make a second assumption about our data: the unique identifier column will be free from errors. Typically, this is the first column of a csv file, but this is not always the case. In the earthquake csv file, the unique identifier column was the 12th of 15 columns.

Here's a personal story. When I was younger, I created a MySQL database table for a small business with a primary key of the SMALLINT type. In the business's third year, the program failed because the business finally reached its 65,536th transaction, which is one more than an unsigned SMALLINT can hold. Even the primary key field has the potential to fail if you make foolish design choices.

Here are the import statements needed for our code to work:

    import Text.CSV
    import Data.List
    import Text.Regex.Posix ((=~))
    import LearningDataAnalysis02

It is a good programming strategy to think a little, code a little, and then test a little. The first chunk of code to think/implement/test will represent the heart of the program: a function that applies the regular expression to every field in a record. The function should return three elements for every field that matches the regular expression: the unique identifier for this record, the name of the column in which the field was found, and the text of the matching field. Since this function depends on genericIndex, I wish to remind you that it can fail if the assumption that the unique identifier column exists is not met. This can be seen in the following code snippet:

    identifyMatchingFields ::
        (String -> Bool)
        -> [String]
        -> [String]
        -> Integer
        -> [(String, String, String)]
    identifyMatchingFields
        myStringCmpFunc record headings idColumnIndex =
        filter
          (\(_, _, field) -> myStringCmpFunc field)
          keyvalue
      where
        nfields = length headings
        keyvalue = zip3
          (replicate
            nfields
            (genericIndex record idColumnIndex))
          headings
          record

What this function mostly does is juggle the necessary input parameters. The function requires a Boolean function used for string comparisons called myStringCmpFunc, a record from a csv file called record, the column headings of that file called headings, and the index position of the unique identifier column called idColumnIndex. In the where clause of the function, note that a combined list of the unique identifier (repeated once per field), the column headings, and the fields is built using the zip3 function. This list is filtered just on the field, but when a field does match, the identifier and heading are returned with it.
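To see what the where clause builds, here is a standalone sketch of the zip3 step using only base and hypothetical sample data (the buildKeyValue name is mine, not from this chapter):

```haskell
import Data.List (genericIndex)

-- Pair the record's unique identifier with every heading and every
-- field, mirroring the keyvalue binding in identifyMatchingFields.
buildKeyValue :: [String] -> [String] -> Integer -> [(String, String, String)]
buildKeyValue record headings idColumnIndex =
    zip3 (replicate nfields (genericIndex record idColumnIndex))
         headings
         record
  where
    nfields = length headings

main :: IO ()
main = print (buildKeyValue
                ["1", "Clark Kent", "Journalist", "Metropolis"]
                ["Id", "Name", "Profession", "Location"]
                0)
```

Every tuple carries the identifier "1", so a field that survives the later filter still knows which record it came from.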

In our last example, we used the =~ regular expression comparison operator to filter the lines of a text file. In this example, you can decide whether you wish to identify fields with simple string comparison functions or with the far more flexible regular expression engine. If you are already comfortable with regular expressions, we will continue using them. If you need a little more time to get used to regular expressions, I hope that you find the guide to regular expressions in Haskell in the appendix of this book helpful; read it and then return to this chapter. Either way, the function only needs to be written once.
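As a base-only sketch of the simpler option, either of the following predicates has the String -> Bool type that identifyMatchingFields expects (both function names are my own, for illustration):

```haskell
import Data.List (isInfixOf)

-- A plain substring test: no regular expression engine required.
containsJourn :: String -> Bool
containsJourn = ("Journ" `isInfixOf`)

-- An exact comparison, matching the whole field only.
isMetropolis :: String -> Bool
isMetropolis = (== "Metropolis")

main :: IO ()
main = do
  print (containsJourn "Journalist")
  print (isMetropolis "Metropolis")
  print (isMetropolis "Metropolitan")
```

Because the search predicate is just a function argument, swapping between styles never requires touching identifyMatchingFields itself.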

We will conduct a test with a small sample. A regular expression of Journ will match anything that includes this sequence of characters, which is demonstrated in the following code:

> identifyMatchingFields (\x -> x =~ "Journ") ["1", "Clark Kent", "Journalist", "Metropolis"] ["Id", "Name", "Profession", "Location"] 0
[("1","Profession","Journalist")]

A regular expression of Hero should not match anything, since this sequence does not exist in this data:

> identifyMatchingFields (\x -> x =~ "Hero") ["1", "Clark Kent", "Journalist", "Metropolis"] ["Id", "Name", "Profession", "Location"] 0
[]

Of course, we do not have to use a regular expression with our newly crafted function. If you need to search for something simple, a regular expression is overkill. You can search for fields that exactly equal the word Metropolis with a plain comparison. The predicate function passed to identifyMatchingFields allows us to be versatile in how fields are searched: a simple comparison, a regular expression, or perhaps a method that I did not consider:

> identifyMatchingFields (== "Metropolis") ["1", "Clark Kent", "Journalist", "Metropolis"] ["Id", "Name", "Profession", "Location"] 0
[("1","Location","Metropolis")]

Once you feel satisfied with your testing, you can move on to building a function that matches every field in a csv file. This function will take a string comparison function, a parsed csv file, and a string naming the unique identifier column, and it will return every field that matches the given comparison function along with that field's unique identifier and heading, as follows:

    identifyInCSV ::
        (String -> Bool) -> CSV -> String
        -> Either String [(String, String, String)]
    identifyInCSV myFieldFunc csv idColumn =
        either
          Left
          (\ci -> Right $ concatMap
            (\record ->
              identifyMatchingFields
                myFieldFunc record headings ci)
            (tail csv))
          columnIndex
      where
        headings = head csv
        columnIndex = getColumnInCSV csv idColumn

By now, the preceding code should be familiar to you. We used the getColumnInCSV function (written in the last chapter) to get the index of a column heading. Since this function call has the capacity to return an error, we must handle its result with the either function. The concatMap function does most of the work by calling the identifyMatchingFields function on every record and concatenating all the returned lists into one.
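The either/concatMap pattern can be sketched on its own with hypothetical Int data: a Left passes through untouched, while a Right has a list-producing function mapped over its contents and the results flattened into a single list:

```haskell
-- process mirrors the shape of identifyInCSV: propagate an error,
-- or flatten per-item results into one list inside Right.
process :: Either String [[Int]] -> Either String [Int]
process = either Left (Right . concatMap (map (* 2)))

main :: IO ()
main = do
  print (process (Right [[1, 2], [3]]))
  print (process (Left "no such column"))
```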

First, we will test the list with a regular expression that we know will return a value. Only one record uses the state abbreviation PA:

> csv <- parseCSVFromFile "poordata.csv"
> either (\error -> Left "CSV Problem") (\csv -> identifyInCSV (\x -> x =~ "PA") csv "Number") csv
Right [("10","State","PA")]

Next, we will test the list with a regular expression that should match multiple fields. Several records use the word male to represent a gender. Note that this also returns everything that matches female. Regular expressions do not limit themselves to matching complete strings:

> either (\error -> Left "") (\csv -> identifyInCSV (\x -> x =~ "male") csv "Number") csv
Right [("1","Gender","female"),("3","Gender","male"),("4","Gender","male"),("5","Gender","male"),("7","Gender","male"),("8","Gender","Female"),("9","Gender","female")]

Again, after we are satisfied with our testing, we can craft a primary function, which will be used to call our helper functions:

    identifyInCSVFile ::
        (String -> Bool)
        -> String
        -> String
        -> IO (Either String [(String, String, String)])
    identifyInCSVFile myStringCmpFunc inFileName idColumn = do
        records <- parseCSVFromFile inFileName
        return $ either
          (\err -> Left "This does not appear to be a CSV file")
          -- init removes the final, blank record of the parsed file
          (\csv -> identifyInCSV myStringCmpFunc (init csv) idColumn)
          records

Locating empty fields in a csv file based on a regular expression

Using our newly crafted function, we will attempt to locate each field in our csv file that is empty or nearly empty. Because a regular expression can match anywhere within a field, we must force our regular expression to start at the beginning of the field and stop at its end. To anchor an expression to the beginning of a string, we begin it with the ^ symbol; to anchor it to the end of a string, we end it with $. Thus, the regular expression ^$ matches only an empty field:

> identifyInCSVFile (\x -> x =~ "^$") "poordata.csv" "Number"
Right [("2","Gender",""),("7","State","")]

The second record contains no data for the Gender column and the seventh record contains no data for the State column.

How do we identify nearly empty data? In this case, a field contains data, but the data is not useful because it consists only of space characters. We wish to find any field made up of 0 or more whitespace characters. To catch all the forms of whitespace (including spaces, tabs, and newlines), we will use the special shortcut atom \s (for space), modified with the * modifier to represent 0 or more occurrences of that atom. We also need to add the ^ and $ positional anchors. Our final whitespace expression is ^\s*$, which translates as: at the start of the string, there should exist 0 or more whitespace characters, followed by the end of the string. (Remember that in a Haskell string literal, the backslash itself must be escaped, so the expression is written as "^\\s*$".)

> identifyInCSVFile (\x -> x =~ "^\\s*$") "poordata.csv" "Number"
Right [("2","Gender",""),("4","GivenName","    "),("6","Surname","  "),("7","State","")]
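If you want to cross-check this result without the regular expression engine, a blank field can also be detected with a base-only predicate (a sketch, not the chapter's method; the function name is mine):

```haskell
import Data.Char (isSpace)

-- A regex-free equivalent of ^\s*$: a field is blank when every
-- character is whitespace, which is trivially true for "".
isBlankField :: String -> Bool
isBlankField = all isSpace

main :: IO ()
main = mapM_ (print . isBlankField) ["", "    ", " PA "]
```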

Since our dataset is a work of fiction, we can fill in the gaps with fictional data. When working with gaps in your dataset, you should probably seek the advice of a domain expert. At the very least, filter out lines containing gaps (back up your data first) until you can fill in those gaps with accurate information.

Crafting a regular expression to match dates

There are still many problems with our csv file. Let's focus on the date column. We wish to craft a function that pulls just the malformed dates of birth from the Birthday column of our dataset. The trick is to write a regular expression that matches the accurately formatted dates, negate the match so that everything except those dates is returned, and then write a short filter to exclude every column other than the one we want. We will achieve this by building on our last function.

Without going into detail, an acceptable regular expression to identify a date in the form of MM/DD/YYYY (the American style of writing dates) is ^[1-9][0-9]?/[1-9][0-9]?/[12][0-9][0-9][0-9]$. Applying the =~ regular expression comparison operator to our data with this expression would return only the correctly formatted dates in our file, which is useless at this stage; we wish to return the poorly formatted dates. To get everything that does not appear to be a date, we will wrap the comparison in Haskell's not function. Then, we will filter for the data in our desired column.
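As a cross-check on what this expression accepts, here is a base-only sketch that mirrors the same MM/DD/YYYY rule in plain Haskell (the helper names are my own, not from this chapter):

```haskell
import Data.Char (isDigit)

-- One or two digits with no leading zero, a slash, one or two more
-- such digits, a slash, and a four-digit year starting with 1 or 2.
looksLikeDate :: String -> Bool
looksLikeDate s = case splitOnSlash s of
    [m, d, y] -> dayOrMonth m && dayOrMonth d && year y
    _         -> False
  where
    dayOrMonth p =
      not (null p) && length p <= 2 && all isDigit p && head p /= '0'
    year y = length y == 4 && all isDigit y && head y `elem` "12"

-- Split a string on every '/' character.
splitOnSlash :: String -> [String]
splitOnSlash s = case break (== '/') s of
    (a, [])       -> [a]
    (a, _ : rest) -> a : splitOnSlash rest

main :: IO ()
main = mapM_ (print . looksLikeDate)
    ["6/21/1951", "6-21-1951", "1992-08-11", "March 12, 1989"]
```

Note that, like the regular expression, this check rejects leading zeros, so "06/21/1951" would not count as well formed.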

Here is the function that is used to take the data returned by identifyInCSVFile and filter it based on a column name:

    identifyInCSVFileFromColumn ::
        (String -> Bool)
        -> String
        -> String
        -> String
        -> IO (Either String [(String, String, String)])
    identifyInCSVFileFromColumn
        myRegexFunc inFileName idColumn desiredHeading = do
        allFields <- identifyInCSVFile
          myRegexFunc inFileName idColumn
        return $ either
          Left
          (\af -> Right $
            filter
              (\(_, heading, _) -> heading == desiredHeading)
              af)
          allFields

By allowing identifyInCSVFile to perform our heavy lifting, we can reuse our old code in the following way:

> identifyInCSVFileFromColumn (\x -> not (x =~ "^[1-9][0-9]?/[1-9][0-9]?/[12][0-9][0-9][0-9]$")) "poordata.csv" "Number" "Birthday"
Right [("4","Birthday","6-21-1951"),("8","Birthday","1992-08-11"),("10","Birthday","March 12, 1989")]

Good. Having identified just the poorly formatted dates in the Birthday column, we can correct them. The majority of the dates are in the MM/DD/YYYY format, so we will fix these values to match the majority format. I edited the file so that the formerly incorrect birthdays are now 6/21/1951, 8/11/1992, and 3/12/1989. After manually correcting the data (make backups first) and naming this file poordataFixed.csv, we will test it again, as follows:

> identifyInCSVFileFromColumn (\x -> not (x =~ "^[1-9][0-9]?/[1-9][0-9]?/[12][0-9][0-9][0-9]$")) "poordataFixed.csv" "Number" "Birthday"
Right []

An empty list tells us that every field in the Birthday column now matches the date regular expression.
