Slices of data

This section will be an overview of the versatility of the SELECT query in SQLite3. Most of the content here pertains to the inner workings of the SELECT query rather than the Haskell language itself. We're going to build an understanding of the following SELECT clauses: WHERE, ORDER BY, and LIMIT. Each of these has its own utility, but when they work together you can quickly see how data can be sliced into workable chunks that can be studied later. This is my preferred way of working with data: we let SQL do the slicing and Haskell do the dicing. Once we have a data slice that we're happy with, we'll spend some time at the end looking at how to parse that chunk of data into something usable by Haskell.

Let's go back to our Jupyter Notebook; we will flip over to our primary Earthquakes notebook. We will begin with a really simple question: if our dataset of worldwide earthquakes is truly worldwide, shouldn't there be a nearly equal number of northern hemisphere and southern hemisphere earthquakes? Only one way to find out. An earthquake is in the northern hemisphere if it has a positive latitude, so let's start by counting the number of earthquakes that meet the criterion of having a positive latitude. We're going to call this the northern dataset:
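The notebook cell isn't reproduced here, but as a rough sketch, the count could be obtained with the sqlite-simple package (an assumption on my part; your notebook may use a different SQLite binding), against a hypothetical earthquakes table with latitude, longitude, depth, and mag columns:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Database.SQLite.Simple

-- Count the rows with a positive latitude. The table name `earthquakes`
-- and its column names are assumptions about the dataset's schema.
countNorthern :: Connection -> IO Int
countNorthern conn = do
  [Only n] <- query_ conn
    "SELECT COUNT(*) FROM earthquakes WHERE latitude > 0"
  return n

main :: IO ()
main = do
  -- A tiny in-memory table stands in for the real database file here.
  conn <- open ":memory:"
  execute_ conn "CREATE TABLE earthquakes \
                \(latitude REAL, longitude REAL, depth REAL, mag REAL)"
  execute_ conn "INSERT INTO earthquakes VALUES (34.2, -118.5, 10.0, 4.4)"
  execute_ conn "INSERT INTO earthquakes VALUES (-12.1, 166.4, 35.0, 5.1)"
  execute_ conn "INSERT INTO earthquakes VALUES (61.0, -150.0, 40.0, 4.9)"
  countNorthern conn >>= print  -- two of the three sample rows qualify
  close conn
```

Against the real data you would open the notebook's earthquakes database file instead of ":memory:".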

All right, we have 1,530. That seems like a lot. Let's do the same for earthquakes in the southern hemisphere. This is going to be relatively straightforward, as we're just looking for earthquakes with a negative latitude:

There are only 55. That's nowhere close to equal, so what could account for the difference? At the time of writing, I personally contacted the US Geological Survey and asked them to explain this discrepancy between over 1,000 earthquakes in the northern hemisphere and fewer than 100 in the southern hemisphere. The reasoning I got back was that there are simply more earthquake monitors in North America, hence more earthquakes are recorded; and that's really all there is to it. Since we're on the topic of North America, let's count how many earthquakes are in the northwest hemisphere, which will require that we look at all of the earthquakes with a positive latitude and a negative longitude. This is an instance where we get to use the AND Boolean operator. We're going to call this northwest:
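As a sketch of the AND query under the same assumptions (sqlite-simple and a hypothetical earthquakes table), both conditions simply go into a single WHERE clause:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Database.SQLite.Simple

-- WHERE conditions combined with AND: a row counts only when
-- both conditions hold. Schema assumptions as before.
countNorthwest :: Connection -> IO Int
countNorthwest conn = do
  [Only n] <- query_ conn
    "SELECT COUNT(*) FROM earthquakes \
    \WHERE latitude > 0 AND longitude < 0"
  return n

main :: IO ()
main = do
  conn <- open ":memory:"  -- sample rows so the sketch runs standalone
  execute_ conn "CREATE TABLE earthquakes \
                \(latitude REAL, longitude REAL, depth REAL, mag REAL)"
  execute_ conn "INSERT INTO earthquakes VALUES (34.2, -118.5, 10.0, 4.4)"  -- NW
  execute_ conn "INSERT INTO earthquakes VALUES (35.7, 139.7, 30.0, 5.0)"   -- NE
  execute_ conn "INSERT INTO earthquakes VALUES (-12.1, -77.0, 35.0, 5.1)"  -- SW
  countNorthwest conn >>= print  -- only the first sample row matches
  close conn
```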

So, we got 1,477 northwest earthquakes. Since we know there aren't going to be many earthquakes in the southwest hemisphere, let's go ahead and look at the magnitude and depth of those earthquakes. First, we are going to get the southwest earthquakes, where our condition is that both latitude and longitude are less than zero:

So, we got only 28 southwest earthquakes. With the southwest dataset, we are going to do what we did at the very end of the last section, where we performed the transpose to look at the depth information. Let's go ahead and run the following command:
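The cell itself isn't shown here; a sketch of the idea, using sqlite-simple and Data.List.transpose (both assumptions about the setup), flips the result rows into columns so that a single column index pulls out one measurement across every earthquake:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (transpose)
import Database.SQLite.Simple

-- Fetch the southwest rows, then transpose rows into columns.
-- In this trimmed sample schema depth is column 2 and magnitude is
-- column 3; the notebook's wider table puts them at 3 and 4.
southwestColumns :: Connection -> IO [[Double]]
southwestColumns conn = do
  rows <- query_ conn
    "SELECT latitude, longitude, depth, mag FROM earthquakes \
    \WHERE latitude < 0 AND longitude < 0"
      :: IO [(Double, Double, Double, Double)]
  return (transpose [[la, lo, d, m] | (la, lo, d, m) <- rows])

main :: IO ()
main = do
  conn <- open ":memory:"
  execute_ conn "CREATE TABLE earthquakes \
                \(latitude REAL, longitude REAL, depth REAL, mag REAL)"
  execute_ conn "INSERT INTO earthquakes VALUES (-12.1, -77.0, 35.0, 5.1)"
  execute_ conn "INSERT INTO earthquakes VALUES (-33.4, -70.6, 90.0, 4.2)"
  cols <- southwestColumns conn
  print (cols !! 2)  -- the depths: [35.0, 90.0]
  print (cols !! 3)  -- the magnitudes: [5.1, 4.2]
  close conn
```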

Here we got the depth information, and we will read that as a list of Doubles. We will run the same command again, but now to get the magnitude information, we will have to pass 4 instead of 3:

This is all of our magnitude information for earthquakes in the southwest hemisphere. Now, you'll notice that there doesn't seem to be any order to our information here, so let's do another southwest query. We're going to use what's known as the ORDER BY clause:
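A sketch of the ORDER BY query under the same assumptions (sqlite-simple, hypothetical earthquakes table); note that with no direction given, SQLite sorts ascending:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Database.SQLite.Simple

-- ORDER BY sorts the result set; the default direction is ascending.
southwestByMag :: Connection -> IO [Double]
southwestByMag conn = do
  rows <- query_ conn
    "SELECT mag FROM earthquakes \
    \WHERE latitude < 0 AND longitude < 0 ORDER BY mag"
  return [m | Only m <- rows]

main :: IO ()
main = do
  conn <- open ":memory:"
  execute_ conn "CREATE TABLE earthquakes \
                \(latitude REAL, longitude REAL, depth REAL, mag REAL)"
  execute_ conn "INSERT INTO earthquakes VALUES (-12.1, -77.0, 35.0, 5.1)"
  execute_ conn "INSERT INTO earthquakes VALUES (-33.4, -70.6, 90.0, 4.2)"
  execute_ conn "INSERT INTO earthquakes VALUES (-20.0, -70.0, 50.0, 6.0)"
  southwestByMag conn >>= print  -- [4.2, 5.1, 6.0], smallest first
  close conn
```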

So, now our southwest is going to be ordered by magnitude.

If you read the fourth column again, you can quickly see that the smallest-magnitude earthquake in the southwest hemisphere was 4.2 and the largest was 6.0. By combining ORDER BY with the LIMIT clause, we can order by magnitude but take only the first five rows. Here is the statement:
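If it helps to think of it in pure Haskell terms, ORDER BY mag LIMIT 5 behaves like sorting a list of magnitudes and taking the first five elements (an analogy, not the notebook's code):

```haskell
import Data.List (sort)

-- ORDER BY mag LIMIT 5 over a plain list: sort ascending (the SQL
-- default direction), then keep only the first five values.
fiveSmallest :: [Double] -> [Double]
fiveSmallest = take 5 . sort

main :: IO ()
main = print (fiveSmallest [5.1, 4.2, 6.0, 4.4, 5.9, 4.3])
  -- [4.2, 4.3, 4.4, 5.1, 5.9]
```

The practical difference is that SQLite does the sorting and trimming before the rows ever reach Haskell, which matters once the result set is large.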

So now, whenever we read the columns of our southwest dataset again, we should only get five values, and they're going to be the five smallest-magnitude earthquakes, as follows:

But what if we want to flip this on its head and get the five largest earthquakes? Well, that's a simple change. We're going to expand on our query statement once again: after the magnitude column, but before the LIMIT, we will add DESC, which stands for descending:
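A sketch of the DESC variant under the same assumptions (sqlite-simple, hypothetical earthquakes table):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Database.SQLite.Simple

-- DESC reverses the sort direction, so LIMIT 5 now keeps the five
-- largest magnitudes instead of the five smallest.
fiveLargest :: Connection -> IO [Double]
fiveLargest conn = do
  rows <- query_ conn
    "SELECT mag FROM earthquakes WHERE latitude < 0 AND longitude < 0 \
    \ORDER BY mag DESC LIMIT 5"
  return [m | Only m <- rows]

main :: IO ()
main = do
  conn <- open ":memory:"
  execute_ conn "CREATE TABLE earthquakes \
                \(latitude REAL, longitude REAL, depth REAL, mag REAL)"
  mapM_ (execute_ conn)
    [ "INSERT INTO earthquakes VALUES (-12.1, -77.0, 35.0, 5.1)"
    , "INSERT INTO earthquakes VALUES (-33.4, -70.6, 90.0, 4.2)"
    , "INSERT INTO earthquakes VALUES (-20.0, -70.0, 50.0, 6.0)"
    , "INSERT INTO earthquakes VALUES (-15.0, -75.0, 20.0, 4.4)"
    , "INSERT INTO earthquakes VALUES (-30.0, -71.0, 60.0, 5.9)"
    , "INSERT INTO earthquakes VALUES (-25.0, -70.5, 45.0, 4.3)"
    ]
  fiveLargest conn >>= print  -- [6.0, 5.9, 5.1, 4.4, 4.3]
  close conn
```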

This will sort the magnitudes from largest to smallest, and then take the first five. So, we're going to run the transpose statement again and check what we get:

It works; we see the top five magnitudes.

So, in this section, we took the SELECT query and started molding it using the WHERE clause. After we experimented with a few circumstances in which we sliced up the data with WHERE, we began to sort that data based on one of the columns; in this particular instance, it was magnitude. Afterward, we looked at how we could order by the magnitude either from smallest to largest, which is the default, or from largest to smallest, which requires us to add the DESC keyword.

With WHERE we can get a finer-grained selection of our data, and with the LIMIT clause, we can prevent ourselves from getting so much data that we can't work with it. If you feel like you're getting too much data, you can add that LIMIT to your SELECT query, though Haskell can handle as much as you can throw at it, depending on the memory of your system. In the next section, we will be looking at how to incorporate our descriptive statistics module with the material that we've covered in this section.
