Filtering data using regular expressions

It has been said that if you have a problem and your solution is to use regular expressions, you now have two problems. A regular expression is an expression written in a small language for describing patterns found in text. The language itself is terse (a single character can carry a complex meaning). In the open source community, the most recognizable example of regular expressions in use is the command-line tool, grep. The name is derived from an older text editor called ed, which had a command of the form g/re/p. Using grep, we can search for a pattern of text in a file and filter out every line that does not contain this pattern. For example, let's assume that we have a text file containing the entire text of the Mark Twain classic, The Adventures of Huckleberry Finn. You can download this book from Project Gutenberg; I have renamed my text file huckfinn.txt.

To identify each line in the file that references the character of Jim in the story, we will use grep from the Linux command line, as follows:

$ grep Jim huckfinn.txt

In this example, Jim represents a regular expression. We are looking for any instance in the file where J is followed by i, which is then followed by m.
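
The pattern does not have to be a fixed string. For example, anchoring a pattern with the ^ character restricts matches to the beginning of a line, so the following command (using the same huckfinn.txt file) should list only the lines that begin with the word CHAPTER:

$ grep '^CHAPTER' huckfinn.txt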

Creating a simplified version of grep in Haskell

Since this is a book that focuses on Haskell, we will recreate the grep tool in Haskell. To do this, we need to make sure that the regular expression library is installed on our system. We can do this using the cabal tool, as follows:

cabal install regex-posix

In its primary usage, grep takes a regular expression as its first argument, and all subsequent arguments are the names of files to be searched. We will divide the task of our program into two parts: managing the input arguments and searching files based on a pattern. First, we will search for files based on a pattern. We will begin the file with the following necessary import statements:

import Text.Regex.Posix ((=~))
import System.Environment (getArgs)

We will define our function to search lines in a file based on a regular expression. In the spirit of the original grep tool, this function will print lines rather than return matched lines. This can be seen using the following function:

myGrep :: String -> String -> IO ()
myGrep myRegex filename = do
    fileSlurp <- readFile filename
    mapM_ putStrLn $
        filter (=~ myRegex) (lines fileSlurp)

This function should be relatively straightforward. The filter function is the standard tool used to filter a list of values based on a Boolean expression. The lines function breaks a slurped file into individual lines. The =~ operator comes from the Text.Regex.Posix module and allows us to compare each line in the file to a pattern.
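
If you want to experiment with the =~ operator before wiring it into a program, a quick session in GHCi (assuming the regex-posix package we installed earlier is visible to GHCi) might look like the following. Note the :: Bool annotation, which tells the polymorphic =~ operator that we only want a yes-or-no answer:

$ ghci
Prelude> :module + Text.Regex.Posix
Prelude Text.Regex.Posix> "Tom and Huck" =~ "Jim" :: Bool
False
Prelude Text.Regex.Posix> "Jim and Huck" =~ "Jim" :: Bool
True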

Now, we will set up the call to this function, as follows:

main :: IO ()
main = do
    (myRegex:filenames) <- getArgs
    mapM_ (\filename -> myGrep myRegex filename) filenames

The first argument obtained from getArgs (found in the System.Environment module) is the regular expression, and the remainder of the list should be our filenames. We will call myGrep on each filename with the regular expression and get our results, as follows:

$ runhaskell hgrep.hs Jim huckfinn.txt
CHAPTER II. The Boys Escape Jim. Tom Sawyer's Gang. Deep-laid Plans.
Island. Finding Jim. Jim's Escape. Signs. Balum.
CHAPTER XXIII. Sold. Royal Comparisons. Jim Gets Home-sick.
CHAPTER XXIV. Jim in Royal Robes. They Take a Passenger. Getting
CHAPTER XXXI. Ominous Plans. News from Jim. Old Recollections. A Sheep
(...Remaining lines clipped...)

Using simple functional programming tools such as filter and map, we can create a command-line tool that searches any number of files for a regular expression in just a few lines of code.
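
For reference, here is the complete hgrep.hs program, assembled from the pieces above:

import Text.Regex.Posix ((=~))
import System.Environment (getArgs)

myGrep :: String -> String -> IO ()
myGrep myRegex filename = do
    fileSlurp <- readFile filename
    mapM_ putStrLn $
        filter (=~ myRegex) (lines fileSlurp)

main :: IO ()
main = do
    (myRegex:filenames) <- getArgs
    mapM_ (\filename -> myGrep myRegex filename) filenames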

Exhibit A – a horrible customer database

We need to seek out missing values in our datasets. In the next few examples, we are going to use a simple csv file. The data presented below was randomly generated thanks to the www.fakenamegenerator.com website. I have modified several fields to make this dataset purposely bad. The original csv file came with an odd Unicode character embedded as the first character, thus illustrating that even seemingly good csv files can still require cleaning. Here is the file that I named poordata.csv:

Number,Gender,GivenName,Surname,City,State,Birthday
1,female,Sue,Roberson,Monroe,LA,12/31/1791
2,,George,Chavez,Chicago,IL,11/11/1948
3,male,Dexter,Grubb,Plattsburgh,NY,6/4/1984
4,male,    ,Knight,Miami,Florida,6-21-1951
5,male,Jonathan,Thomas,Fort Wayne,IN,1/15/1967
6,MALE,Brandon,    ,pittsburgh,pa,8/3/1981
7,male,Daniel,Puga,Evansville,,8/19/1988
8,Female,Geneva,Espinoza,Springfield,MA,1992-08-11
9,female,Miriam,Levron,Hicksville,N.Y.,9/7/1965
10,F,Helen,Pitts,Gibsonia,PA,"March 12, 1989"

Imagine that you were given this file and that it represents your company's customer database. Thanks to the infinite wisdom of the developers who designed the registration system, customers were allowed to type their birthday and gender freely into the birthday and gender fields. The system was not concerned with missing fields, and there are a few blank fields in this dataset. If you inspect the State column, you will see that some people typed the two-letter capitalized abbreviation for their state, some typed the abbreviation with periods (see N.Y.), and some typed the entire state name (see Florida). Whoever was in charge of maintaining this data failed to do an adequate job.

The first thing that you should do is save your originals using version control software. Overcome the temptation to immediately start fixing the flaws.
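
Once the original file is safely under version control, the same regular expression machinery we used for hgrep can give us a quick inventory of the damage. Here is a small sketch (the helper names splitOnComma and hasCleanState are my own, not part of the book's code) that flags every row whose State field is not a plain two-letter uppercase abbreviation:

import Text.Regex.Posix ((=~))

-- A naive comma splitter; good enough for this illustration, though it does
-- not understand quoted fields such as "March 12, 1989"
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
    (field, ',' : rest) -> field : splitOnComma rest
    (field, _)          -> [field]

-- True when the sixth field (State) is exactly two capital letters
hasCleanState :: String -> Bool
hasCleanState row = case drop 5 (splitOnComma row) of
    (st:_) -> st =~ "^[A-Z][A-Z]$"
    _      -> False

main :: IO ()
main = do
    contents <- readFile "poordata.csv"
    -- skip the header row, then print every row with a suspicious State field
    mapM_ putStrLn $ filter (not . hasCleanState) (drop 1 (lines contents))

Run against poordata.csv, this should single out the rows containing Florida, pa, and N.Y., as well as the row with an empty State field.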
