This chapter looks at Haskell's type system by examining where it works in your favor as well as the common obstacles that you may face when trying to understand it. We will also work with CSV files, a common format for storing datasets. The CSV format is cumbersome to work with, so we will spend the remainder of this chapter learning how to convert CSV files into SQLite3 databases.
In this chapter, we will cover the following:
Haskell prides itself on the correctness and conciseness of its code, and on a robust collection of libraries that you can explore while maintaining the purity of a purely functional programming language. For those who are new to Haskell, there are a number of innovative features. Those coming from the world of C-style programming languages will admire Haskell's type inference capabilities. A language that is both strongly typed and type-inferred at compile time is a welcome change. A variable can be assigned a type once and passed through a variety of functions without you ever having to restate the type. Should the analyst use the variable in a context that is inappropriate for its type, the error is flagged at compile time rather than at run time. This is a blessing for data analysts: the analyst gets the benefits of a statically typed language without having to constantly remind the compiler of the types that are currently in play. Once the variable types are set, they are never going to change. Haskell will only change the structure of a variable with your permission.
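To make the idea of compile-time inference concrete, here is a minimal sketch of a source file in which only one binding carries a type annotation. The names scores, total, and doubled are illustrative assumptions and do not come from the chapter's code.

```haskell
-- A sketch of type inference; all names here are illustrative.

-- The only annotation in the file: a list of Int values.
scores :: [Int]
scores = [17, 31, 23]

-- No annotations below. GHC infers total :: Int and doubled :: [Int]
-- from the way scores is used.
total = sum scores
doubled = map (* 2) scores

-- Uncommenting the next line makes compilation fail, because (++) expects
-- lists and total is an Int. The mistake never reaches run time.
-- broken = total ++ " points"
```

Loading this file in GHCi and asking for :t total should report Int, even though the type was never written next to total.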
The flip side of our strongly typed code is quickly encountered in the study of data analysis. Several popular languages will implicitly convert an integer to a rational number if the two are used in an expression together. Languages typically do this because we, as humans, think of numbers as just numbers: if an expression works mathematically, it should work in a programming language. This is not always the case in Haskell. Some types (as we will see in the following section) must be explicitly converted if an analyst wants them to be used in a context that Haskell deems potentially unsafe.
Allow us to demonstrate this problem with an example, which also serves as our first step into the world of data analysis problems. The mean (or the average) of a list of values is considered a summary statistic, which means that it allows us to express lots of information with a single value. Because of the ease with which the mean can be calculated, it is one of the most frequently used (and misused) summary statistics. According to the United States Census Bureau, the average sale price of a new home in the United States in 2010 was $272,900. If you are familiar with home values in the United States, this value might seem high to you. The mean of a dataset is easily skewed by outliers. In the context of home prices, a few rare new homes sold for more than $125 million. Such high prices shift the mean away from the middle value that the mean is generally believed to represent.

Let us begin by computing the mean of a list in Haskell. The mean of a list of numbers is computed by summing the list and dividing this sum by the number of elements in the list. The data presented here represents the final score of the Atlanta Falcons in each of their games during the 2013 NFL football season. Not a football fan? Don't worry. Focus only on the numbers. The purpose of this example is to work with a small, real-world dataset. There are 16 games in this dataset, and we can present them all in a single line, as follows:
> let falconsScores = [17,31,23,23,28,31,13,10,10,28,13,34,21,27,24,20]
At this point, falconsScores is just a list of numbers. We will compute the sum of these values. The Prelude module consists of a handful of functions that are ready for use in the GHC environment. There is no need to import this module. These functions work right out of the box (so to speak). Two of these functions are sum and length:
> let sumOfFalconsScores = sum falconsScores
> sumOfFalconsScores
353
The sum function in the Prelude module does what you may expect: according to the Haskell documentation, the sum function computes the sum of a finite list of numbers.
So far, so good. Next, we need the length of the list. We will find it out with the help of the following code:
> let numberOfFalconsGames = length falconsScores
> numberOfFalconsGames
16
The length function also does what you may expect. It returns the length of a finite list as an Int.
In order to compute the mean score of the 2013 season of the Atlanta Falcons, we will divide the sum of the scores by the number of scores, as follows:
> let meanScoreofFalcons = sumOfFalconsScores / numberOfFalconsGames

<interactive>:82:61:
    Couldn't match expected type 'Integer' with actual type 'Int'
    In the second argument of '(/)', namely 'numberOfFalconsGames'
    In the expression: sumOfFalconsScores / numberOfFalconsGames
    In an equation for 'meanScoreofFalcons':
        meanScoreofFalcons = sumOfFalconsScores / numberOfFalconsGames
Oh dear. What is this? I assure you that Haskell can handle simple division.
Since mathematical division involves fractions of numbers, Haskell requires us to use a Fractional type when dividing numbers. We will inspect our data types, as follows:
> :t sumOfFalconsScores
sumOfFalconsScores :: Integer
> :t numberOfFalconsGames
numberOfFalconsGames :: Int
The sum function returned an Integer based on the data (some versions of GHC will instead report a more general type constrained by the Num class). Integer in Haskell is an arbitrary-precision data type. We never specified a data type, so Haskell inferred the Integer type from the data. The length function returns an Int. Not to be confused with Integer, Int is a bounded type with fixed minimum and maximum values. Despite their similar uses, the two types differ in an important way: Integer is unbounded and Int is bounded. Mixing Integer and Int in one expression could fail at run time, for example when a large Integer does not fit in an Int. Haskell therefore never converts between the two implicitly; the compiler flags the mismatch at compile time instead.
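The difference between the two types can be sketched in a few lines. This example assumes GHC, where Int is a fixed-width machine integer; the names largestInt, pastTheBound, and wrappedBack are illustrative only.

```haskell
-- Int is bounded; Integer is not. All names here are illustrative.

-- The largest value an Int can hold on this machine.
largestInt :: Int
largestInt = maxBound

-- Converting to Integer first lets the value grow past the Int range.
pastTheBound :: Integer
pastTheBound = toInteger largestInt + 1

-- Converting back with fromIntegral does not raise an error. Under GHC the
-- value silently wraps around to the smallest Int, which is the kind of
-- surprise that explicit conversions force us to think about.
wrappedBack :: Int
wrappedBack = fromIntegral pastTheBound
```

Because the conversion back to Int must be written out by hand, the place where information can be lost is visible in the code rather than hidden inside an expression.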
The fromIntegral function is our primary tool for converting integral data to any type in the more generic Num class. Since our second operand (numberOfFalconsGames) is of the Int type, we can use fromIntegral to convert it from Int to whichever Num type the expression requires. Our first operand is of the Integer type, and fromIntegral would work in this circumstance as well, but we should avoid that temptation (if this list consisted of floating-point numbers, fromIntegral would not work). Instead, we should use the realToFrac function, which converts (as the name implies) any type in the Real class to a type in the Fractional class, on which the division operator depends:
> let meanFalconsScore = (realToFrac sumOfFalconsScores) / (fromIntegral numberOfFalconsGames)
> meanFalconsScore
22.0625
> :t meanFalconsScore
meanFalconsScore :: Double
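The division of labor between the two conversion functions can also be written down with explicit type signatures. The following is a minimal sketch, not the chapter's own code; the helper names gamesAsDouble, totalAsDouble, and meanFromParts are assumptions made for illustration.

```haskell
-- fromIntegral accepts only Integral inputs (Int, Integer, and so on).
gamesAsDouble :: Int -> Double
gamesAsDouble = fromIntegral

-- realToFrac accepts any Real input, so the same helper covers both
-- integral totals and floating-point totals.
totalAsDouble :: Real a => a -> Double
totalAsDouble = realToFrac

-- Divide a total of any Real type by an Int count.
meanFromParts :: Real a => a -> Int -> Double
meanFromParts total count = totalAsDouble total / gamesAsDouble count
```

With these definitions, meanFromParts (353 :: Integer) 16 and meanFromParts (7.5 :: Double) 3 both type check, whereas a version built only on fromIntegral would reject the Double total.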
Now that we have enjoyed exploring the Haskell type system, we should probably build an average function. We see that the type system automatically recognizes the types in our function and that we are converting a list of Real types to Fractional. Here is our average function:
> let average xs = realToFrac (sum xs) / fromIntegral (length xs)
> :t average
average :: (Fractional a, Real a1) => [a1] -> a
We see from the type description that type a is a Fractional type (our output) and that type a1 is a Real type (our input). This function should support integers, floating-point values, and lists that mix integer and floating-point literals (in such a list, Haskell infers a single fractional type for every element). Haskell isn't being creative with automatically generated type names such as a and a1, but they will do. As always, we should test:
> average [1, 2, 3]
2.0
> average [1, 2, 3.5]
2.1666666666666665
> average [1.5, 2.5, 3.5]
2.5
> let a = [1, 2, 3]
> average a
2.0
> average []
NaN
The final test in this list reports that the average of an empty list is NaN (short for Not a Number), because the function divides zero by zero.
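If a NaN result is undesirable, one possible alternative is to make the empty case explicit in the return type. The sketch below is an assumption for illustration and is not part of the chapter's module; safeAverage is a hypothetical name.

```haskell
-- Hypothetical variant of average that refuses to divide by zero.
-- An empty list yields Nothing instead of NaN.
safeAverage :: (Real a, Fractional b) => [a] -> Maybe b
safeAverage [] = Nothing
safeAverage xs = Just (realToFrac (sum xs) / fromIntegral (length xs))
```

The caller must then decide what an empty dataset means for the analysis, rather than letting a NaN flow silently through later calculations.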
Things appear to be in working order. There is an additional way of finding the length of a list: the genericLength function, found in the Data.List module. The genericLength function does the same as the length function, but its return type can be any type in the Num class, so the result will happily work with the division operator without conversion. Testing to see if this new version of average is working will be left as an exercise to the reader:
> import Data.List
> let average xs = realToFrac (sum xs) / genericLength xs
> :t average
average :: (Fractional a, Real b) => [b] -> a
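One way to approach that exercise is to keep both versions side by side and compare their results on the same input. This is only a sketch; the names averageLength, averageGeneric, and agreeOn are illustrative assumptions.

```haskell
import Data.List (genericLength)

-- The version built on length and fromIntegral.
averageLength :: (Real a, Fractional b) => [a] -> b
averageLength xs = realToFrac (sum xs) / fromIntegral (length xs)

-- The version built on genericLength, which needs no conversion.
averageGeneric :: (Real a, Fractional b) => [a] -> b
averageGeneric xs = realToFrac (sum xs) / genericLength xs

-- True when both versions give the same Double for the given list.
-- Note that an empty list gives False, because NaN is not equal to NaN.
agreeOn :: [Double] -> Bool
agreeOn xs = (averageLength xs :: Double) == averageGeneric xs
```

Calling agreeOn on the lists used in the earlier tests is a quick check that swapping length for genericLength did not change the results.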
We should add our new function to average a list of numbers to the LearningDataAnalysis02.hs module:
module LearningDataAnalysis02 where

-- Compute the average of a list of values
average :: (Real a, Fractional b) => [a] -> b
average xs = realToFrac (sum xs) / fromIntegral (length xs)
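With an average function in hand, the earlier remark that the mean is easily skewed by outliers can be illustrated with a small sketch. The prices below are made-up values chosen for easy arithmetic, not real sale data, and the names typicalPrices and withOutlier are assumptions.

```haskell
-- Same definition as the chapter's average function, repeated here so
-- that the sketch stands alone.
average :: (Real a, Fractional b) => [a] -> b
average xs = realToFrac (sum xs) / fromIntegral (length xs)

-- Made-up values for illustration only (not real sale prices).
typicalPrices :: [Double]
typicalPrices = [200000, 250000, 300000]

-- The same list with a single extreme value appended.
withOutlier :: [Double]
withOutlier = typicalPrices ++ [125000000]

meanTypical :: Double
meanTypical = average typicalPrices    -- 250000.0

meanWithOutlier :: Double
meanWithOutlier = average withOutlier  -- 31437500.0
```

A single extreme value moves the mean far from every other entry in the list, which is why the mean alone can misrepresent a dataset.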
As a data analyst, you will be working with various forms of quantitative and qualitative data. It is essential that you understand how the type system in Haskell reacts to data. You will be responsible for stating explicitly when a data type needs to change, should the context of your analysis change.
Data comes in all sizes, from the small sets that are typed on a command line to the large sets that require data warehouses. Data will also come in a variety of formats: spreadsheets, unstructured text files, structured text files, databases, and more. Should you be working for an organization and responsible for analyzing data, datasets will come to you from your management. Your first task will be to figure out the format of the data (and the necessary software required to interact with that data), an overview of that data, and how quickly the dataset becomes outdated. One of my first data projects was to manage a database of United States Civil War soldier death records. This is an example of a dataset that is useful to historians and the families of those who served in this war, but also one that is not growing. Your author believes that the most interesting datasets are the ones that are continually growing and changing. A helpful metadata document that accompanies the dataset will answer these questions for you. These documents do not always exist. So pull your manager over and get him or her to answer all of these questions and create the first metadata document of this dataset.