Regular expressions in CSV files

We need to know the importance of using regular expressions in various file formats such as CSV and SQLite3. In this section, we will be covering the CSV format. So, let's examine a question using one of our past datasets. Using our Baseball dataset, let's try to find out the average number of runs scored by away teams in the month of March. To do this, we'll need our CSV file of data, which has the dates in the first column, but is not organized by month.

So, in order to solve this, we're going to be crafting a regular expression to match a field in the CSV file. In this case, we will be using the first column of dates. We're going to be pairing that information with another column; and in this case, the other column is going to be the runs scored by away teams. Then, we're going to filter that information to get just the information that corresponds to the regular expression.

For this section, I have created a file called MyCSV.hs. You can find it in the source code folder, which is provided with this book. Let's check the file:

As you can see, I've created a module of the functions that we utilized in Chapter 1, Descriptive Statistics, such as the noEmptyRows function and the readIndex function. I've also included a function called getIndex. Now, readIndex is a partial function that allows us to parse data from a column; whereas getIndex simply returns information from a column as is, in the string format. So, let's create a new notebook called RegexLearning-CSV utilizing MyCSV. First, we need to load up our DescriptiveStats and MyCSV libraries:

Now, let's do our imports:

We will be utilizing the Text.CSV, Data.Maybe, regular expressions, and then finally, the two libraries that are custom-made for this section. Now, we need to parse our CSV file, and we're going to use the exact same dataset that we used in Chapter 1, Descriptive Statistics, the Baseball dataset:

Now, we need to get the game dates. And, if you recall, all of the game dates were in the first column, and so that means column 0:

Now, let's look at the first entry in gameDates:

We see the date here is March 5^th 2015, but it's one long string of numbers. What we need to do now is to craft a regular expression to get just the 04 portion of that. So, let's create marchDates, and this is going to be a Boolean array of length equal to gameDates; it will be True if it's in the month of March, and it will be False otherwise:

So, we have mapped our regular expression function, and we have passed in a regular expression of 04. We have added 4 dots at the beginning because we don't care about the year, we have 04 because we care about the month of March, and then 2 more dots afterward because we're not worried about the day. We have then passed in gameDates.

We also need the awayRuns that we've used in an earlier section:

If you recall, those are going to be on the 10th column, so we have passed in 9 for the index; and parsed as Integer. Next, we need to make sure that the length of marchDates and the length of awayRuns are identical:

Well, they both match, we got 2429 games. Now, what we need to do is combine these two datasets; let's do that by running the following command:

You're going to see the pairing of all of this information. You see a series of True because this is a chronological dataset beginning in the month of March. If you scroll down, at some point they transition into False because we go into the next month.

Now, we need to filter these based on what's True in the first portion of every tuple:

So here, fst stands for first, and then we pass in our combined datasets: marchDates and awayRuns. This produces just the list of elements that are in the month of March. You'll see it's a much shorter list, and the beginning of every tuple is True. Now, we need to strip out those Trues as they are 9no longer necessary. For that, we are going to use the following command:

So, we have used map snd (snd for second), and passed in our expression. This will give us all the away runs scored in the month of March. Finally, we can take this expression and we can pass it to the mean:

So, we can say that an average of 4.2 runs was scored by away teams in the month of March. Hopefully, you can identify how this approach can be adapted to a solution that works for every month of the year. With a little more effort, we can examine the month with the highest and lowest average. In our next section, we are going to use regular expressions with SQLite3.

Table of Contents for Regular expressions in CSV files

Create new playlist

Sign In

Sign Up

Table of Contents for
Regular expressions in CSV files