Performing linear algebra in Haskell

Through hmatrix, you will have access to a large collection of linear algebra functions. In this section, we are going to provide a brief introduction to this library. If you have ever taken a linear algebra course, you will have learned how powerful matrix operations are.
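
The examples in this section assume that the hmatrix package has already been installed (for example, with cabal install hmatrix) and that its top-level module is in scope in GHCi. If it is not, the following should bring it into scope:

> :m + Numeric.LinearAlgebra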

To begin, let's create a 3-by-4 matrix consisting of values from 1 to 12. This can be done using the matrix function in the following way:

> let a = matrix 4 [1 .. 12]
> a
(3><4)
 [ 1.0,  2.0,  3.0,  4.0
 , 5.0,  6.0,  7.0,  8.0
 , 9.0, 10.0, 11.0, 12.0 ]

We can compute the transpose of this matrix using the tr function. Here, we will compute the transpose of a, as follows:

> tr a
(4><3)
 [ 1.0, 5.0,  9.0
 , 2.0, 6.0, 10.0
 , 3.0, 7.0, 11.0
 , 4.0, 8.0, 12.0 ]

We can also perform matrix multiplication using the mul function. Here, we will multiply a by its transpose, as follows:

> mul a $ tr a
(3><3)
 [  30.0,  70.0, 110.0
 ,  70.0, 174.0, 278.0
 , 110.0, 278.0, 446.0 ]

This set of modules also allows the standard math operators (+, -, /, *, and ^) to be used for element-wise operations on matrices:

> a + a
(3><4)
 [  2.0,  4.0,  6.0,  8.0
 , 10.0, 12.0, 14.0, 16.0
 , 18.0, 20.0, 22.0, 24.0 ]
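
As a quick sanity check that these operators act element by element (rather than performing matrix multiplication), we can multiply a by itself. Assuming the element-wise Num instance that hmatrix provides for matrices, every entry of a is simply squared:

> a * a
(3><4)
 [  1.0,   4.0,   9.0,  16.0
 , 25.0,  36.0,  49.0,  64.0
 , 81.0, 100.0, 121.0, 144.0 ]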

Computing the covariance matrix of a dataset

The function used to compute the covariance matrix of a dataset (meanCov, which comes with hmatrix) uses a slightly different formula from the one we presented in the chapter on linear regression. Rather than using the expectation function, meanCov sums the necessary values and then divides by (n-1) rather than by n. This correction is used because the data is treated as a sample drawn from a larger population (and thus carries some uncertainty) rather than the complete population. The equation can be denoted as follows:

$$\mathrm{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

To illustrate the use of meanCov, we are going to use the baseball dataset from Chapter 6, Correlation and Regression Analysis. Rather than using linear regression to estimate a line, we are going to use eigenvalue decomposition. We need to get our baseball data into a matrix data structure. I've morphed the baseball dataset, which consists of 30 pairs of runs per game and win percentage values, into a list of lists of Double values.

We've reconstructed the baseball dataset from scratch in the following statements so that you don't have to flip back to that portion of the book:

> :l LearningDataAnalysis06
> queryDatabase "winloss.sql" "SELECT COUNT(*) FROM winloss"
> queryDatabase "winloss.sql" "SELECT COUNT(*) FROM winloss WHERE awayscore == homescore;"

> homeRecord <- queryDatabase "winloss.sql" "SELECT homeTeam, SUM(homescore > awayscore), SUM(homescore), COUNT(*) FROM winloss GROUP BY homeTeam;"

> awayRecord <- queryDatabase "winloss.sql" "SELECT awayTeam, SUM(awayscore > homescore), SUM(awayscore), COUNT(*) FROM winloss GROUP BY awayTeam;"

> let totalWins = zipWith (+) (readDoubleColumn homeRecord 1) (readDoubleColumn awayRecord 1)
> let totalRuns = zipWith (+) (readDoubleColumn homeRecord 2) (readDoubleColumn awayRecord 2)
> let totalGames = zipWith (+) (readDoubleColumn homeRecord 3) (readDoubleColumn awayRecord 3)
> let winPercentage = zipWith (/) totalWins totalGames
> let runsPerGame = zipWith (/) totalRuns totalGames
> let baseball = L.map (\(a,b) -> [a,b]) $ zip winPercentage runsPerGame

> :t baseball
baseball :: [[Double]]
> baseball
[[0.6049382716049383,4.771604938271605],[0.39751552795031053,3.813664596273292],[0.4876543209876543,3.537037037037037],[0.5925925925925926,4.351851851851852], … [content clipped]

We can now compute the covariance matrix of our dataset by calling meanCov. This function will return two values. The first value is a vector consisting of the mean of each column of data. The second value is the covariance matrix:

> let (baseballMean, baseballCovMatrix) = meanCov $ fromLists baseball
> baseballMean 
fromList [4.066924826828208,0.4999948879175932]
> baseballCovMatrix 
(2><2)
 [   0.1204356428862124,  8.350015780569988e-3
 , 8.350015780569988e-3, 3.4790736953854207e-3 ]
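
Because meanCov divides by (n-1), we can double-check one entry of this matrix by hand. The helper below (sampleCov is just a name chosen here; it is not part of hmatrix) is a small sketch of the sample covariance formula given earlier. Applied to runsPerGame and winPercentage, it should reproduce the off-diagonal entry of roughly 8.35e-3, up to floating-point rounding:

> let sampleCov xs ys = let n = fromIntegral (length xs); mx = sum xs / n; my = sum ys / n in sum (zipWith (\x y -> (x - mx) * (y - my)) xs ys) / (n - 1)
> sampleCov runsPerGame winPercentage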

Discovering eigenvalues and eigenvectors in Haskell

Following this step, we will perform the eigenvalue decomposition. The eigSH function returns a vector of eigenvalues and a matrix of eigenvectors, presorted so that the largest eigenvalues come first:

> let (baseballEvalues, baseballEvectors) = eigSH baseballCovMatrix
> baseballEvalues
fromList [0.12102877720686984,2.8859393747279805e-3]
> baseballEvectors 
(2><2)
 [   -0.9974865990115773, 7.085537941692915e-2
 , -7.085537941692915e-2,  -0.9974865990115773 ]

The eigenvectors are returned as the columns of a single matrix. The largest eigenvalue (0.12) is associated with the first column of the eigenvector matrix, (-0.9975, -7.0855e-2). The second largest eigenvalue (2.89e-3) is associated with the second column, (7.0855e-2, -0.9975).
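
If you want to work with the eigenvectors individually, they can be pulled out as column vectors. We can also confirm the defining property that multiplying the covariance matrix by an eigenvector merely rescales it by the corresponding eigenvalue. The following is a minimal sketch using the hmatrix functions toColumns, asColumn, scale, and atIndex (the names firstEvector and secondEvector are simply chosen here); the last two expressions should print approximately the same column matrix:

> let [firstEvector, secondEvector] = toColumns baseballEvectors
> mul baseballCovMatrix (asColumn firstEvector)
> scale (atIndex baseballEvalues 0) (asColumn firstEvector)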

Here, we will visualize what happens when we plot a line through the mean position (4.0667, 0.5) of the dataset in the direction of the first eigenvector (-0.9975, -7.0855e-2), drawn in blue, and do the same for the second eigenvector (7.0855e-2, -0.9975), drawn in red. I call these lines eigenlines. The slope of each eigenline is the ratio of its eigenvector's second component (win percentage) to its first component (runs per game). This is done using the following statements:

> let xmean = 4.0667
> let ymean = 0.5
> let samplePoints = [3.3, 3.4 .. 4.7]

> let firstSlope = (-7.085537941692915e-2) / (-0.9974865990115773)
> let secondSlope = (-0.9974865990115773) / (7.085537941692915e-2)

> let firstIntercept = ymean - (xmean * firstSlope)
> let secondIntercept = ymean - (xmean * secondSlope)

> let firstEigenline = zip samplePoints $ L.map (\x -> x*firstSlope + firstIntercept) samplePoints
> let secondEigenline = zip samplePoints $ L.map (\x -> x*secondSlope + secondIntercept) samplePoints

> import Graphics.EasyPlot
> plot (PNG "runs_and_wins_with_eigenlines.png") [Data2D [Title "Runs Per Game VS Win % in 2014"] [] (zip runsPerGame winPercentage), Data2D [Title "First Eigenline", Style Lines, Color Blue] [] firstEigenline, Data2D [Title "Second Eigenline", Style Lines, Color Red] [] secondEigenline]
True
> plot (PNG "runs_and_wins_first_eigenline.png") [Data2D [Title "Runs Per Game VS Win % in 2014"] [] (zip runsPerGame winPercentage), Data2D [Title "First Eigenline", Style Lines, Color Blue] [] firstEigenline]

The first of these plot commands gives the following chart as a result:

[Figure: Runs Per Game VS Win % in 2014, with the first eigenline (blue) and the second eigenline (red)]

The scale of the axes causes some distortion in this image, but the red and blue lines are in fact perpendicular to each other (a quick numerical check is given after the next image). By removing the red line, the second eigenline, from the image, we get a chart that is almost identical to the one presented in Chapter 6, Correlation and Regression Analysis. Any difference between the following image and the similar image depicting linear regression can be chalked up to the difference in how the covariance is computed (dividing by n-1 rather than n):

[Figure: Runs Per Game VS Win % in 2014, with the first eigenline only]
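
Although the axis scaling distorts the angle visually, the perpendicularity of the two eigenlines can be verified numerically: the dot product of the two eigenvectors should be approximately zero, and the product of the two slopes should be approximately -1. Here is a quick check, reusing the column vectors extracted earlier (dot is hmatrix's vector dot product):

> dot firstEvector secondEvector
> firstSlope * secondSlope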