Chapter 2. Getting Our Feet Wet

This chapter looks at Haskell's type system by examining where it works in your favor as well as the common obstacles that you may face when trying to understand it. We will also work with csv files, a common format for storing datasets. The csv format is cumbersome to work with, so we will spend the remainder of this chapter learning how to convert csv files into SQLite3 databases.

In this chapter, we will cover the following:

  • Type is king—the implications of strict types in Haskell
  • Working with csv files
  • Converting csv files to the SQLite3 format

Type is king – the implications of strict types in Haskell

Haskell prides itself on the correctness and conciseness of its code, and on a robust collection of libraries that let you explore further while keeping the purity of a purely functional programming language. For those who are new to Haskell, there are a number of innovative features. Those coming from the world of C-style programming languages will admire Haskell's type inference capabilities. A language that is both strongly typed and type-inferred at compile time is a welcome change. A variable can be assigned a type once and passed through a variety of functions without you ever having to restate the type. Should the analyst use the variable in a context that is inappropriate for the assigned type, the compiler flags the problem at compile time rather than at run time. This is a blessing for data analysts. The analyst gets the benefits of a statically typed language without having to constantly remind the compiler of the types that are currently in play. Once the variable types are set, they are never going to change. Haskell will only change the structure of a variable with your permission.
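As a small illustration of this point, the following sketch declares a type once and lets the compiler infer the rest. The names readings, doubled, and largest are hypothetical and exist only for this example; the file can be loaded into GHCi to inspect the inferred types with :t.

```haskell
-- A hypothetical list of readings. The type is stated once, here.
readings :: [Double]
readings = [1.5, 2.5, 4.0]

-- No type annotations below: GHC infers [Double] for doubled
-- and Double for largest from the way readings is used.
doubled = map (* 2) readings
largest = maximum doubled

-- Uncommenting the next line makes compilation fail, because
-- length returns an Int and largest is a Double:
-- mismatch = length readings + largest
```

The commented-out line shows the kind of misuse that is caught before the program ever runs.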

The flip side of our strictly typed code is quickly encountered in the study of data analysis. Several popular languages will convert an integer to a rational number if the two are used together in an expression. Languages typically do this because we, as humans, think of numbers as just numbers, and if an expression works mathematically, we expect it to work in a programming language. This is not always the case in Haskell. Some types (as we will see in the following section) must be explicitly converted if an analyst wants them to be used in a context that Haskell deems potentially unsafe.

Computing the mean of a list

Allow us to demonstrate this problem with an example, which also serves as our first step into the world of data analysis problems. The mean (or the average) of a list of values is a summary statistic, which means that it lets us express a lot of information with a single value. Because the mean is so easy to calculate, it is one of the most frequently used (and misused) summary statistics.

According to the United States Census Bureau, the average sale price of a new home in the United States in 2010 was $272,900. If you are familiar with home values in the United States, this value might seem high to you. The mean of a dataset is easily skewed by outliers. In the context of home prices, a few rare new homes sold for more than $125 million. Such high prices shift the mean away from the middle value that the mean is generally believed to represent.

Let us begin by computing the mean of a list in Haskell. The mean of a list of numbers is computed by summing the list and dividing this sum by the number of elements in the list. The data presented here represents the final score made by the Atlanta Falcons in each of their games during the 2013 NFL season. Not a football fan? Don't worry. Focus only on the numbers. The purpose of this example is to let you work with a small, real-world dataset. There are 16 games in this dataset, and we can present them all in a single line, as follows:

> let falconsScores = [17,31,23,23,28,31,13,10,10,28,13,34,21,27,24,20]
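Before we work with this list, the earlier point about outliers can be made concrete with a short sketch. The prices below are made up for illustration and are not Census Bureau figures; the names typicalPrices, withOutlier, and mean are likewise hypothetical. The fromIntegral conversion used here is explained later in this chapter.

```haskell
-- Hypothetical sale prices in dollars (invented for this example).
typicalPrices :: [Double]
typicalPrices = [200000, 250000, 300000]

-- The same list with a single extreme sale added.
withOutlier :: [Double]
withOutlier = typicalPrices ++ [125000000]

-- Mean of a list of Doubles: the sum divided by the element count.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
```

With these made-up values, mean typicalPrices is 250000.0, while mean withOutlier is 3.14375e7 (that is, 31,437,500). One extreme value is enough to move the mean far away from every typical price in the list.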

Computing the sum of a list

At this point, falconsScores is just a list of numbers. We will compute the sum of these values. The Prelude module provides a set of functions that are ready for use in the GHC environment. There is no need to import this module. These functions work right out of the box (so to speak). Two of these functions are sum and length:

> let sumOfFalconsScores = sum falconsScores
> sumOfFalconsScores
353

The sum function in the Prelude does what you may expect. According to the Haskell documentation, it computes the sum of a finite list of numbers.

Computing the length of a list

So far, so good. Next, we need the length of the list, which we find with the following code:

> let numberOfFalconsGames = length falconsScores
> numberOfFalconsGames
16

The length function also does what you may expect. It returns the length of a finite list as an Int.

Attempting to compute the mean results in an error

In order to compute the mean score of the 2013 season of the Atlanta Falcons, we will divide the sum of the scores by the number of scores, as follows:

> let meanScoreofFalcons = sumOfFalconsScores / numberOfFalconsGames
<interactive>:82:61:
    Couldn't match expected type 'Integer' with actual type 'Int'
    In the second argument of '(/)', namely 'numberOfFalconsGames'
    In the expression: sumOfFalconsScores / numberOfFalconsGames
    In an equation for 'meanScoreofFalcons':
        meanScoreofFalcons = sumOfFalconsScores / numberOfFalconsGames

Oh dear. What is this? I assure you that Haskell can handle simple division.

Introducing the Fractional class

Since mathematical division involves fractions of numbers, Haskell requires us to use a Fractional type when dividing numbers. We will inspect our data types, as follows:

> :t sumOfFalconsScores
sumOfFalconsScores :: Integer
> :t numberOfFalconsGames
numberOfFalconsGames :: Int

The sum function returned an Integer based on the data (some versions of GHCi will instead report a more generic type constrained by the Num class in this instance). Integer in Haskell is an arbitrary-precision data type. We never specified a data type, so Haskell inferred the Integer type based on the data. The length function returns its result as an Int. Not to be confused with Integer, Int is a bounded type with a maximum and a minimum value. Despite their similar uses, the two types differ in this important respect. Mixing Integer and Int has the potential to fail at run time. Rather than letting that happen, the compiler notices the mismatch and flags it at compile time.
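The difference between the two types can be seen in a short sketch. The names largestInt, beyondInt, and wrapped are hypothetical. The wrap-around behavior shown for Int is what GHC does in practice; we are assuming GHC here, since the language standard does not promise it.

```haskell
-- The largest value an Int can hold on this machine.
largestInt :: Int
largestInt = maxBound

-- The same value converted to Integer, plus one.
-- Integer has no upper bound, so the result is simply larger.
beyondInt :: Integer
beyondInt = toInteger largestInt + 1

-- Adding one to the largest Int overflows silently.
-- Under GHC, the result wraps around to minBound.
wrapped :: Int
wrapped = largestInt + 1
```

The value of largestInt depends on the platform, which is why no specific number is printed here.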

The fromIntegral and realToFrac functions

The fromIntegral function is our primary tool for converting integral data to a more generic numeric type (any instance of the Num class). Since our second operand (numberOfFalconsGames) is of the Int type, we can use fromIntegral to convert it. Our first operand is of the Integer type, and fromIntegral would work there as well, but we should avoid that temptation (if this list consisted of floating-point numbers, fromIntegral would not work). Instead, we should use the realToFrac function, which converts (as the name implies) any type that belongs to the Real class to a type that belongs to the Fractional class, on which the division operator depends:

> let meanFalconsScore = (realToFrac sumOfFalconsScores) / (fromIntegral numberOfFalconsGames)
> meanFalconsScore
22.0625
> :t meanFalconsScore
meanFalconsScore :: Double
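To see why realToFrac is the safer choice for the sum, compare the two hypothetical helpers in the following sketch. The names meanIntegral and meanReal are our own and are not part of any library.

```haskell
-- Uses fromIntegral for the sum, so it accepts only integral element types.
meanIntegral :: Integral a => [a] -> Double
meanIntegral xs = fromIntegral (sum xs) / fromIntegral (length xs)

-- Uses realToFrac for the sum, so it accepts any Real element type,
-- which includes Int, Integer, and Double.
meanReal :: Real a => [a] -> Double
meanReal xs = realToFrac (sum xs) / fromIntegral (length xs)

-- Uncommenting the next line makes compilation fail,
-- because Double is not an instance of Integral:
-- rejected = meanIntegral [1.5, 2.5 :: Double]
```

Both helpers return 2.0 for the list [1, 2, 3], but only meanReal accepts [1.5, 2.5, 3.5], for which it returns 2.5.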

Creating our average function

Now that we have enjoyed exploring the Haskell type system, we should probably build an average function. We see that the type system automatically recognizes the types in our function and that we are converting a list of Real types to Fractional. Here is our average function:

> let average xs = realToFrac (sum xs) / fromIntegral (length xs)
> :t average
average :: (Fractional a, Real a1) => [a1] -> a

We see from the type description that type a is a Fractional type (our output) and that type a1 is a Real type (our input). This function should support integers, floating-point values, and lists whose literals mix the two (in a mixed list, Haskell reads the integer literals as floating-point values). Haskell isn't being creative with automatically generated type names such as a and a1, but they will do. As always, we should test:

> average [1, 2, 3]
2.0
> average [1, 2, 3.5]
2.1666666666666665
> average [1.5, 2.5, 3.5]
2.5
> let a = [1, 2, 3]
> average a
2.0
> average []
NaN

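The final test in this list reports that the average of an empty list is NaN (short for Not a Number). Both the sum and the length of an empty list are 0, and dividing 0 by 0 in floating-point arithmetic yields NaN.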
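If NaN is not a welcome answer for your analysis, one option is to make the empty case explicit in the type. The following is a sketch of that idea rather than a function used elsewhere in this chapter; the name safeAverage is hypothetical.

```haskell
-- Returns Nothing for an empty list instead of NaN,
-- and Just the mean for any non-empty list.
safeAverage :: (Real a, Fractional b) => [a] -> Maybe b
safeAverage [] = Nothing
safeAverage xs = Just (realToFrac (sum xs) / fromIntegral (length xs))
```

With this version, the caller is required by the type system to decide what an empty dataset should mean.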

The genericLength function

Things appear to be in working order. There is an additional way of finding the length of a list: the genericLength function, which is found in the Data.List module. The genericLength function does the same job as the length function, but its return value can be any type in the Num class, so it will happily work with the division operator without a conversion. Testing to see if this new version of average is working will be left as an exercise to the reader:

> import Data.List
> let average xs = realToFrac (sum xs) / genericLength xs
> :t average
average :: (Fractional a, Real b) => [b] -> a
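One possible way to approach that exercise is to place both versions side by side in a file and compare their results on the Falcons data. The names averageWithLength and averageWithGenericLength are hypothetical and are used only to keep the two versions apart.

```haskell
import Data.List (genericLength)

-- The version built on length, which needs fromIntegral.
averageWithLength :: (Real a, Fractional b) => [a] -> b
averageWithLength xs = realToFrac (sum xs) / fromIntegral (length xs)

-- The version built on genericLength, which needs no conversion.
averageWithGenericLength :: (Real a, Fractional b) => [a] -> b
averageWithGenericLength xs = realToFrac (sum xs) / genericLength xs

-- The 2013 Atlanta Falcons scores from earlier in this chapter.
falconsScores :: [Integer]
falconsScores = [17,31,23,23,28,31,13,10,10,28,13,34,21,27,24,20]
```

Loaded into GHCi, both functions applied to falconsScores should report 22.0625, matching the mean computed earlier.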

We should add our new function to average a list of numbers to the LearningDataAnalysis02.hs module:

module LearningDataAnalysis02 where
-- Compute the average of a list of values
average :: (Real a, Fractional b) => [a] -> b
average xs = realToFrac (sum xs) / fromIntegral (length xs)

As a data analyst, you will be working with various forms of quantitative and qualitative data. It is essential that you understand how the typing system in Haskell reacts to data. You will be responsible for explicitly saying that a data type needs to change should the context of your analysis change.

Metadata is just as important as data

Data comes in all sizes, from small sets typed on a command line to large sets that require data warehouses. Data also comes in a variety of formats: spreadsheets, unstructured text files, structured text files, databases, and more. If you work for an organization and are responsible for analyzing data, datasets will come to you from your management. Your first task will be to figure out the format of the data (and the software required to interact with it), get an overview of the data, and learn how quickly the dataset becomes outdated. One of my first data projects was to manage a database of United States Civil War soldier death records. This is an example of a dataset that is useful to historians and to the families of those who served in this war, but also one that is not growing. I believe that the most interesting datasets are the ones that are continually growing and changing. A helpful metadata document that accompanies the dataset will answer these questions for you, but such documents do not always exist. If there is none, pull your manager over, get him or her to answer all of these questions, and create the first metadata document for this dataset.
