Working with SQLite3 and descriptive statistics

So far, we've seen how to generate an SQLite3 database using the sqlite3 command-line utility, and how to interact with that database within Haskell. This section combines the knowledge of descriptive statistics from Chapter 1, Descriptive Statistics, with our database work in Chapter 2, SQLite3. First, we will create our descriptive statistics module from functions found in our Baseball notebook. Second, we will slice up some data using SELECT queries. Third, we will pass data to our descriptive statistics functions, and discuss the results. We'll be looking at earthquakes, specifically in the region of Oklahoma. So, let's glide over to our virtual machine running our IHaskell Notebook, and what I would like to do is to discuss how we can take the functions that we wrote in Chapter 1, Descriptive Statistics, related to our baseball data, and use them here. So, let's go to our analysis folder in which we keep all of our notebooks; I have added a file called DescriptiveStats.hs:

I have highlighted the file; you should create a file called DescriptiveStats.hs, and you should take all of the functions that we wrote in our Baseball notebook and add them to this file. So, if you open and look at this file, you will see that we have our range, mean, stdev, median, countLeader, countPairs, and mode functions. So, these were the functions that we wrote. Now, in order to make this a full-fledged module, we're going to add our module line at the start of the file, as shown in the following screenshot:

As you can see, after the module line we have also written the import statements. Now, once you've completed that, you should have your DescriptiveStats module ready to go. So, let's go over to our notebook, and we are going to scroll up to the very top, and put the cursor in the very first cell up at the very top:
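To make the shape of the file concrete, here is a minimal sketch of what DescriptiveStats.hs might look like once the module line is added. The implementations below are illustrative reconstructions, not necessarily identical to the Baseball notebook's versions (countLeader, countPairs, and mode are omitted here for brevity):

```haskell
-- DescriptiveStats.hs (sketch; your notebook's versions may differ)
module DescriptiveStats (range, mean, stdev, median) where

import Data.List (sort)

-- Difference between the largest and smallest value
range :: [Double] -> Double
range xs = maximum xs - minimum xs

-- Arithmetic mean
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Population standard deviation: root of the mean squared deviation
stdev :: [Double] -> Double
stdev xs = sqrt (sum [(x - m) ^ 2 | x <- xs] / fromIntegral (length xs))
  where m = mean xs

-- Middle value of the sorted list
-- (average of the two middle values for an even-length list)
median :: [Double] -> Double
median xs
  | odd n     = sorted !! mid
  | otherwise = (sorted !! (mid - 1) + sorted !! mid) / 2
  where
    sorted = sort xs
    n      = length xs
    mid    = n `div` 2
```

The module line and export list are what turn a loose collection of notebook functions into an importable module.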

Now, we will go and click on Edit | Split Cell, and that's going to create a new cell before the import statements, as shown as follows:

In the newly added first cell, we loaded our DescriptiveStats module. In the next block, we added another import line where we import DescriptiveStats. Now that we've loaded the DescriptiveStats module and imported it, we are going to click on Kernel | Restart & Run All. This is going to reset the notebook, load our DescriptiveStats module, and re-run all of our imports. It's going to take a moment to wake up, and we're going to see it run. If you got any error messages from this step, chances are you mistyped something in the DescriptiveStats.hs file. You can always pull the DescriptiveStats.hs file that we have shared with this book. All right, we're all set to try to answer a question. So, what we would like to do is look at the state of Oklahoma:

There are many earthquakes around the world. We could simply compute the average over all of them in the past week, but we would like to focus on Oklahoma. Oklahoma is situated between 33.5 degrees north and 37 degrees north, and between 94.3 degrees west and 100 degrees west. For those of you who know the shape of Oklahoma, I'm ignoring the rectangular portion on the left, known as the Panhandle, and I'm focusing just on the mainland, which encompasses most of the earthquakes that you see in the previous screenshot.

Now, earthquakes in Oklahoma are an interesting phenomenon. Historically, major earthquakes don't happen in Oklahoma, but Oklahoma is a special case. Recent years have brought the practice called fracking to Oklahoma. Fracking damages the earth, and it's believed that this is the cause of the increase in earthquakes. While we don't have access to years of data in our 7-day database, let's look at the magnitude of Oklahoma earthquakes compared to non-Oklahoma earthquakes:

In [25]: oklahoma <- quickQuery db "SELECT * FROM earthquakes WHERE latitude > 33.5 AND latitude < 37 AND longitude < -94.3 AND longitude > -100 AND magnitude IS NOT ''" []

So, oklahoma is going to hold the rows returned from our database. We are going to select everything from earthquakes where the latitude is between 33.5 and 37 and the longitude is between -100 and -94.3. Now, what I've noticed in the US Geological Survey database is that sometimes the magnitudes will appear blank, and so we have added the condition magnitude IS NOT '' to exclude those blank values. Let's hit Shift + Enter on this query, and we're going to see how many earthquakes we get:

So, we got 33 earthquakes. What we will do next is to get everything that's not in the state of Oklahoma. So, we will copy our Oklahoma query, and will make a few changes:

In [27]: notoklahoma <- quickQuery db "SELECT * FROM earthquakes WHERE NOT(latitude > 33.5 AND latitude < 37 AND longitude < -94.3 AND longitude > -100) AND magnitude IS NOT ''" []

So, we have changed oklahoma to notoklahoma, introduced the NOT operator, and wrapped all of the latitude and longitude conditions in parentheses. We still include the AND magnitude IS NOT '' condition, but that sits outside the parenthesized latitude and longitude conditions. So, what this says is that if an earthquake falls inside the state of Oklahoma, we're going to exclude it, but if it's outside Oklahoma, we're going to include it. So, let's run that query, and see how many non-Oklahoma earthquakes we get:
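The inclusion/exclusion logic of the two queries can be mirrored as a pure Haskell predicate, which may make the negation easier to see. The function names here are illustrative, not from the book's code:

```haskell
-- Illustrative predicate mirroring the WHERE clause of the oklahoma query:
-- inside the mainland Oklahoma bounding box
inOklahoma :: Double -> Double -> Bool
inOklahoma lat lon =
  lat > 33.5 && lat < 37 && lon < (-94.3) && lon > (-100)

-- The notoklahoma query simply negates that same bounding box
outsideOklahoma :: Double -> Double -> Bool
outsideOklahoma lat lon = not (inOklahoma lat lon)
```

Every earthquake with a non-blank magnitude satisfies exactly one of the two predicates, so the two result sets partition the data.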

We got 1549 non-Oklahoma earthquakes. Now, in our dataset, we know that the magnitude sits at column index 4 - remembering that the first column is really the 0th column - and we already know how to extract a column by using transpose. We will save that column in the okmag list, reading each value as a Double:

We also know that our non-Oklahoma magnitudes, notokmag, come from notoklahoma, and they are also at column index 4:
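In case the extraction step is unclear, here is the idea in miniature. quickQuery actually returns rows of SqlValue, which you would first convert with fromSql; the sample rows below are made-up strings standing in for two query results, just to show how transpose turns rows into columns:

```haskell
import Data.List (transpose)

-- Made-up sample rows standing in for a query result:
-- the magnitude lives at column index 4 (the fifth column)
sampleRows :: [[String]]
sampleRows =
  [ ["t1", "36.1", "-97.0", "5.0", "2.5"]
  , ["t2", "35.2", "-96.4", "4.1", "3.1"]
  ]

-- transpose flips rows into columns; (!! 4) picks the magnitude column,
-- and read parses each value as a Double
magnitudes :: [Double]
magnitudes = map read (transpose sampleRows !! 4)
```

The same pattern - transpose, index the column, convert each value - is what produces okmag and notokmag from the real result sets.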

Now, here's where I believe that we could have some fun. We simply pass these lists okmag and notokmag to our functions found in the descriptive statistics module.

So, the mean of okmag and notokmag would be as follows:

The average earthquake magnitude for okmag was 2.84, and for notokmag it was 1.47. So, earthquakes inside Oklahoma have a higher average magnitude than earthquakes outside Oklahoma. Now, does this stack up with the information provided by the median? The median gets the true middle value based on the amount of data that we have. So, let's check out the median for both okmag and notokmag:

We got 2.74 inside of Oklahoma, and 1.24 outside of Oklahoma. Now, which of these two groups has the larger variation of earthquake magnitudes? Let's find out:

We see a standard deviation of 0.58 inside Oklahoma, and 1.19 outside Oklahoma. So, there's a higher standard deviation among non-Oklahoma earthquakes as opposed to Oklahoma earthquakes.

Since we are testing out different functions, let's go ahead and test out the mode function too. So, what is the most frequently seen magnitude across all of our data for the two datasets?
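As a reminder of how the mode might be computed from countPairs and countLeader, here is one possible sketch; your notebook's versions may differ in detail:

```haskell
import Data.List (sort, group, sortBy)
import Data.Ord (comparing, Down(..))

-- Pair each distinct value with its frequency
countPairs :: Ord a => [a] -> [(a, Int)]
countPairs xs = [ (head g, length g) | g <- group (sort xs) ]

-- The pair with the highest count
countLeader :: Ord a => [(a, Int)] -> (a, Int)
countLeader = head . sortBy (comparing (Down . snd))

-- The most frequent value, together with how often it occurs
mode :: Ord a => [a] -> (a, Int)
mode = countLeader . countPairs
```

Returning the count along with the value is what lets us report both the most common magnitude and how many earthquakes had it.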

So, for Oklahoma, 7 earthquakes have a magnitude of 2.5, and outside Oklahoma, 47 earthquakes have a magnitude of just 1.2. We could further slice up our data based on the depth of these earthquakes and study that information, but I believe that this is a good example of how to use descriptive statistics with an SQLite3 database.
