SQLite3 and regular expressions

Working with regular expressions on data from our SQLite3 database is no different from working with a CSV file. In this section, we will filter timestamp data from an SQLite3 database using regular expressions, much as we did in the last section. We will load the data from the database, sift through it with a regular expression, and analyze what the regular expression finds.

The problem we will solve in this section is to determine how many earthquakes happen in each hour of the day in our 7-day database. Let's create a new Haskell notebook and name it RegexLearning-SQLite3. Let's first import our libraries:

We won't be using any descriptive statistics in this section, so there's no need to load the descriptive statistics module. The one import that we haven't covered yet is Text.Printf; we will use its printf function to solve one small problem later in this section. So, let's import our Earthquakes database that we created in Chapter 2, SQLite3:

Now, we need the readColumn function that will allow us to read data from a column:

Now that we have our database ready, we need to pull the raw timestamps:

This command will pull all of the earthquake times. Now, we need to convert this raw time data into a string, which we will call timestamps:
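As a rough sketch of these loading steps, assuming the HDBC and HDBC-sqlite3 packages are installed: the table name oneWeek, the column name time, and the helper names firstColumnStrings and loadTimestamps are hypothetical placeholders, so adjust them to match your own database.

```haskell
import Data.List (transpose)
import Database.HDBC            -- SqlValue, fromSql, quickQuery', disconnect
import Database.HDBC.Sqlite3 (connectSqlite3)

-- Take the first column of a query result and convert each cell to a String.
-- transpose turns the list of rows into a list of columns.
firstColumnStrings :: [[SqlValue]] -> [String]
firstColumnStrings rows =
  case transpose rows of
    (column : _) -> map fromSql column
    []           -> []

-- Hypothetical table and column names: change them to fit your schema.
loadTimestamps :: FilePath -> IO [String]
loadTimestamps path = do
  conn <- connectSqlite3 path
  rows <- quickQuery' conn "SELECT time FROM oneWeek" []  -- strict fetch
  disconnect conn
  return (firstColumnStrings rows)
```

Because quickQuery' fetches every row before returning, it is safe to disconnect before converting the values.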

So, we're reading the column and getting the head of the transpose of our data, and this will all be parsed into a list of strings. Now, let's examine the first timestamp:

As you can see, we have a much more complicated timestamp than what we were working with in the last section, with a lot more information to work with. What we would like to do is extract the hour from each timestamp. The hour is the two-digit number that comes right after the letter T, which is 19 in this timestamp. Here's where the printf function comes in. I would like to take any number and convert it to a zero-padded, two-digit string with T in front, and printf does this for me:

Passing in the number 7 produces a zero-padded, two-digit number with T in front: T07. Likewise, passing in the number 19 produces T19.
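A small sketch of this conversion, using printf from Text.Printf in base. The helper name hourPattern is an assumption introduced here for illustration.

```haskell
import Text.Printf (printf)

-- Build the marker for an hour: the letter T followed by the hour,
-- zero-padded to two digits (%02d).
hourPattern :: Int -> String
hourPattern hour = printf "T%02d" hour
```

Here, hourPattern 7 evaluates to "T07" and hourPattern 19 evaluates to "T19". The type signature tells printf to return a String rather than print to the screen.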

So, now that we have our raw timestamps converted to strings, we would like to be able to discern which timestamps have an hour that we are looking for, and which timestamps do not; and we can do that with regular expressions:

So, countAtHour is a function that takes an integer hour and returns an integer count of all of the earthquakes that happened during that hour. It filters the timestamps, keeping only those that match a regular expression, and then takes the length of the result. The regular expression is whatever printf returns for the format T%02d and the hour we pass in; we make sure that the result is a String, and then we match it against our timestamps. Now, let's demonstrate how this works:
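One way the function described above might be written, assuming the regex-posix package (Text.Regex.Posix) is installed. In this sketch, the list of timestamps is passed in as an explicit argument so that the block stands on its own, and the sample timestamps are made up for illustration.

```haskell
import Text.Printf (printf)
import Text.Regex.Posix ((=~))

-- Count the timestamps that match the regular expression for the given hour.
countAtHour :: [String] -> Int -> Int
countAtHour timestamps hour = length (filter matchesHour timestamps)
  where
    -- The pattern is the letter T followed by the zero-padded hour, e.g. "T07".
    hourRegex :: String
    hourRegex = printf "T%02d" hour

    matchesHour :: String -> Bool
    matchesHour timestamp = timestamp =~ hourRegex

-- Made-up sample data, not taken from the earthquake database.
sampleTimestamps :: [String]
sampleTimestamps =
  [ "2018-01-01T00:12:45.120Z"
  , "2018-01-01T19:03:10.500Z"
  , "2018-01-02T19:47:59.000Z"
  ]
```

With this sample data, countAtHour sampleTimestamps 19 evaluates to 2, and countAtHour sampleTimestamps 0 evaluates to 1.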

So, 76 earthquakes happened at hour 0. Similarly, we got 55 and 52 earthquakes for hour 1 and hour 2, respectively. Your counts will differ depending on your earthquake dataset. Now, what we would like to do is to get the counts for every hour, and that is as simple as mapping countAtHour over the range of numbers 0 to 23:
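A minimal sketch of this mapping step. The name countsByHour is hypothetical, and the counting function (such as the countAtHour described in the text) is taken as a parameter so that the block is self-contained. It also pairs each count with its hour, which makes the result easier to read than a bare list of 24 numbers.

```haskell
-- Pair every hour of the day (0 to 23) with its count,
-- given any function that returns the count for one hour.
countsByHour :: (Int -> Int) -> [(Int, Int)]
countsByHour countAt = [(hour, countAt hour) | hour <- [0 .. 23]]
```

If only the counts are needed, map countAt [0 .. 23] gives the same numbers without the hour labels.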

We now have counts for the full 24-hour period, from hour 0 to hour 23. From here, we could bring in the descriptive statistics functions and compute the range and the standard deviation of the hourly counts. Hopefully, you can see that combining regular expressions with the printf function gives us a lot of versatility in how we find data in strings.
