Does a home-field advantage really exist?

We perform tests on data for the following two reasons:

  • We wish to evaluate a claim
  • We have a limited amount of data

Both must be true in order to justify a test. If you have a complete picture of the data related to a claim, there will be no need to perform a test because a simple calculation will suffice. In the next example, we are going to have a look at an age-old claim in sports and use a limited amount of data to test the claim.

There are two important terms to introduce here: population and sample. A complete picture of the data is called a population dataset. Anything less than a complete picture is called a sample dataset. As discussed in Chapter 1, Tools of the Trade, sometimes the population dataset is so large that we are unable to work with it using the resources available on a single computer. One strategy for working with big data is to take a sample that is small enough to handle on a single machine and then test our claims against this limited dataset.

Baseball in America has a history dating back to the 19th century, and the data that we have on some teams is just as old. We will continue our discussion of testing using some historic professional baseball data. The Retrosheet website, http://www.retrosheet.org/, makes many of these records available in the CSV format. We will test the claim that a team scores more runs while playing at its home stadium than at an away stadium. For this, we will need data (provided by Retrosheet), and we need to establish a null hypothesis and an alternative hypothesis:

  • The null hypothesis is that there is no difference in runs, or that fewer runs were scored while playing at a home stadium than at an away stadium
  • The alternative hypothesis is that more runs were scored while playing at a home stadium than at an away stadium

I visited the Retrosheet website and found that it organizes its game logs by year. I then downloaded the data for the 2014 season. At the time of writing this book, the page where you can download the game log data is http://www.retrosheet.org/gamelogs/index.html.

Converting the data to SQLite3

On downloading the file and unzipping its contents, you will quickly realize that the CSV file has many columns and a huge amount of data. It would be fun to explore all of this data, but we are only interested in a few columns. To trim this file down to just the necessary columns, I used the Unix cut tool (with apologies to my readers using Windows) to filter the data down to something far more manageable, as follows:

$ cut -d, -f 1,4,7,10,11 GL2014.TXT > winloss2014.csv

The -d flag tells cut that we are using a comma delimiter. The -f 1,4,7,10,11 flag tells cut that we require the first (date), fourth (away team), seventh (home team), tenth (away runs), and eleventh (home runs) columns. A convenient aspect of baseball data is that whenever a baseball game is described, the away team is always listed first, followed by the home team. If you don't have the cut tool, you can load the CSV file into your favorite spreadsheet software, delete the unnecessary columns, and export the document back to the CSV format.

Once the dataset has been filtered down to just these five columns, we can import the data into a SQLite3 database using Haskell, as follows:

> convertCSVFileToSQL "winloss2014.csv" "winloss.sql" "winloss" ["date TEXT", "awayteam TEXT", "hometeam TEXT", "awayscore INTEGER", "homescore INTEGER"]
Successful
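
The convertCSVFileToSQL function was defined in an earlier chapter. In case you don't have that module at hand, the following is a minimal sketch of what such a helper might look like, assuming the csv, HDBC, and HDBC-sqlite3 packages; the version in LearningDataAnalysis02 may differ in its details:

import Data.List (intercalate)
import Database.HDBC
import Database.HDBC.Sqlite3 (connectSqlite3)
import Text.CSV (parseCSVFromFile)

-- Read a CSV file and copy its rows into a new SQLite3 table whose
-- column names and types are supplied by the caller.
convertCSVFileToSQL :: FilePath -> FilePath -> String -> [String] -> IO ()
convertCSVFileToSQL csvFile dbFile tableName columns = do
  parsed <- parseCSVFromFile csvFile
  case parsed of
    Left err   -> print err
    Right rows -> do
      conn <- connectSqlite3 dbFile
      _ <- run conn ("CREATE TABLE " ++ tableName ++ " (" ++
                     intercalate ", " columns ++ ")") []
      let questionMarks = intercalate ", " (replicate (length columns) "?")
      stmt <- prepare conn ("INSERT INTO " ++ tableName ++
                            " VALUES (" ++ questionMarks ++ ")")
      -- Skip short rows, such as a trailing blank line in the CSV
      executeMany stmt [map toSql row | row <- rows,
                                        length row == length columns]
      commit conn
      disconnect conn
      putStrLn "Successful"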

Exploring the data

We can quickly get an idea of the total away-team and home-team runs using the SUM function within an SQL query, as follows:

> queryDatabase "winloss.sql" "SELECT SUM(awayscore), sum(homescore) FROM winloss"
[[SqlByteString "9791",SqlByteString "9966"]]

We can immediately see that there is a difference between the away-team runs and the home-team runs, and that the home teams scored more. Perhaps there is some validity to the claim. First, we should get the performance of each team when playing at home, which is best done in SQL. The following query returns the number of runs each team scored while playing at their own stadium; the ORDER BY clause keeps the teams in a consistent order, as follows:

> runsAtHome <- queryDatabase "winloss.sql" "SELECT hometeam, SUM(homescore) FROM winloss GROUP BY hometeam ORDER BY hometeam"
> runsAtHome
[[SqlByteString "ANA",SqlByteString "362"],[SqlByteString "ARI",SqlByteString "343"],[SqlByteString "ATL",SqlByteString "280"],... remaining content clipped

This time, we will gather the performance of each team as they played at stadiums away from home, as follows:

> runsAway <- queryDatabase "winloss.sql" "SELECT awayteam, sum(awayscore) FROM winloss GROUP BY awayteam ORDER BY awayteam"
> runsAway
[[SqlByteString "ANA",SqlByteString "411"],[SqlByteString "ARI",SqlByteString "271"],[SqlByteString "ATL",SqlByteString "293"],... remaining content clipped

Use zip to get a pairwise comparison of each team. We will also use the readDoubleColumn function that was created in Chapter 4, Plotting, as follows:

> let runsHomeAway = zip (readDoubleColumn runsAtHome 1) (readDoubleColumn runsAway 1)
> runsHomeAway
[(362.0,411.0),(343.0,271.0),(280.0,293.0),(341.0,364.0),(324.0,310.0),(335.0,325.0),(308.0,306.0),(306.0,289.0),(323.0,346.0),(500.0,255.0),(364.0,393.0),(318.0,311.0),(300.0,351.0),(328.0,387.0),(349.0,296.0),(329.0,321.0),(368.0,347.0),(304.0,329.0),(286.0,343.0),(376.0,353.0),(303.0,316.0),(350.0,332.0),(267.0,268.0),(281.0,353.0),(325.0,340.0),(332.0,287.0),(317.0,295.0),(298.0,339.0),(387.0,336.0),(362.0,324.0)]
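
If you don't have the Chapter 4 module loaded, a minimal sketch of what readDoubleColumn might look like is shown below; it simply pulls one column out of an SQL result set and converts each value to a Double:

import Database.HDBC (SqlValue, fromSql)

-- Extract the column at the given index from each result row as a
-- Double. fromSql handles the conversion from values such as
-- SqlByteString "362".
readDoubleColumn :: [[SqlValue]] -> Int -> [Double]
readDoubleColumn sqlResult index = map (fromSql . (!! index)) sqlResult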

Plotting what looks interesting

I bet that this is some interesting data. We will plot it, as follows:

> import Graphics.EasyPlot
> plot (PNG "HomeScoreAwayScore.png") $ Data2D [Title "Runs at Home (x axis) and Runs Away (y axis)"] [] runsHomeAway
True

The following screenshot shows the result of the preceding command:

[Figure: scatter plot of runs at home (x axis) against runs away (y axis)]

The preceding screenshot would be far more useful to us if we could identify the team associated with each data point (which is not something that the EasyPlot library currently allows). Instead, we will subtract the away scores from the home scores so that the data is centered around zero. Positive values represent more runs scored at home, and negative values represent more runs scored away:

> let runsHomeAwayDiff = map (\(a,b) -> a - b) runsHomeAway
> plot (PNG "HomeScoreAwayScoreDiff.png") $ Data2D [Title "Difference in Runs at Home and Runs Away"] [] $ zip [1..] runsHomeAwayDiff
True

The following screenshot shows the result of the preceding command:

[Figure: per-team difference in runs at home and runs away]

You may note that there is a data point in this chart that represents almost 250 positive runs. This data point represents the Colorado Rockies, who play at Coors Field in Denver, a stadium located a mile above sea level. Rockies aside, you will note that most of the data points range from -50 to 50 runs. Our goal is to figure out whether this data supports the claim that a team playing in its own stadium gains any scoring advantage. At a glance, it is still hard to tell from the picture.

Returning to our test

Now is a good time to restate our claim in mathematical terms, as follows:

  • The null hypothesis is that the difference between runs scored at home and runs scored at away games is less than or equal to 0
  • The alternative hypothesis is that the difference between runs scored at home and runs scored at away games is greater than 0

The next step is to compute the mean of this dataset. Since this is a sample dataset, we call our mean a sample mean, and it is simply the average of the per-team differences in runs, which we can easily compute on a single machine. The population mean is the unknown quantity about which we are testing our claim. The following line requires you to keep the LearningDataAnalysis02 module loaded:

> average runsHomeAwayDiff 
5.833333333333333

Each team scored, on average, almost 6 more runs at home than away during the entire 2014 season. While this may seem like evidence to support our claim, we are not finished with the test.
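
For reference, the average function from LearningDataAnalysis02 is just the arithmetic mean; a minimal version looks like this:

import Data.List (genericLength)

-- The arithmetic mean: the sum of the values divided by their count.
average :: [Double] -> Double
average xs = sum xs / genericLength xs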

The standard deviation

The standard deviation can be interpreted as the spread of the data. If the standard deviation is a small value, the data is clustered around the mean; if it is a large value, the data is scattered. We compute the standard deviation by finding the adjusted average of the squared distances from the mean and taking its square root. We say adjusted average instead of just average to take into account that this is a sample and not the entire population: rather than dividing the sum by the number of elements, we divide it by the number of elements minus 1. This n-1 adjustment is known as Bessel's correction. The sample standard deviation is easily computed from the data sampled from the 2014 season; the population standard deviation remains an unknown quantity.
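
In symbols, for a sample $x_1, \ldots, x_n$ with sample mean $\bar{x}$, the sample standard deviation is:

$$ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} $$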

We need to make sure that the following import statements are found at the beginning of our LearningDataAnalysis05.hs file, followed by the standardDeviation function itself:

import Data.List
import Math.Combinatorics.Exact.Binomial
import LearningDataAnalysis02

-- Sample standard deviation with Bessel's correction: the square root
-- of the sum of squared deviations from the mean, divided by sqrt (n-1).
standardDeviation :: [Double] -> Double
standardDeviation values =
    (sqrt . sum $ map (\x -> (x - mu) * (x - mu)) values) / sqrtNm1
  where
    mu = average values
    sqrtNm1 = sqrt (genericLength values - 1)

Good. We will now calculate the standard deviation:

> standardDeviation runsHomeAwayDiff 
57.90365561286553

The sample mean is 5.83 runs and the sample standard deviation is 57.90 runs, which means that a team falling within one standard deviation of the mean may have scored anywhere from 63.73 more runs at home to 52.07 more runs at away stadiums during the 2014 season.

The standard error

We will compute the standard error of the sample by dividing the standard deviation by the square root of the number of samples. The standard error expresses the precision with which the sample mean estimates the population mean:

> import Data.List
> standardDeviation runsHomeAwayDiff / (sqrt $ genericLength runsHomeAwayDiff)
10.571712780392359

The standard error is 10.57 runs. We believe that the population mean of the run difference is 5.83 runs, plus or minus 10.57 runs. Since 5.83 minus 10.57 is -4.74 runs (which is less than 0), we already have a small reason to doubt our original claim.
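
Since we will refer to this quantity again, it may be convenient to wrap the computation in a small helper. This is a hypothetical convenience function, not one from the book's modules:

import Data.List (genericLength)

-- Standard error of the mean: the sample standard deviation divided
-- by the square root of the sample size.
standardError :: [Double] -> Double
standardError values =
  standardDeviation values / sqrt (genericLength values)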

The confidence interval

So, is the population mean greater than 0 or not? We will attempt to answer this question with a confidence interval. A confidence interval is a way of expressing, at a set confidence level, that a value lies between two end points on a number line. As with our last example, the choice of confidence level is arbitrary, but it should be a high value. In this example, we are going to use the 95 percent confidence level.

The example involving coin flips was contrived because it is possible to mathematically compute all the possible outcomes of 1,000 coin flips, and we showed how the outcomes follow a binomial distribution. In this example involving baseball data, we do not know the distribution of the data. In cases where the distribution is unknown, we have to assume one. The central limit theorem tells us that the distribution of sample means approaches a normal distribution with a defined mean and standard deviation, even when we aren't sure of the underlying distribution. The normal distribution has a curve similar to that of the binomial distribution, and it goes by another name that you are probably familiar with: the bell curve. We are going to assume that the difference between runs at home and runs at away stadiums follows a normal distribution.

The standard normal distribution has a mean of 0 and a standard deviation of 1. As with the binomial distribution, the integral of its probability density function from negative infinity to positive infinity is 1. The probability density function is as follows:

$$ f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2 / 2} $$

We will plot this formula on the range of -4 to 4. This function extends infinitely in both directions:

> plot (PNG "standardNormal.png") $ Function2D [Title "Standard Normal"] [Range (-4) 4] (\x -> exp (-(x*x)/2) / sqrt (2*pi))
True

The following screenshot shows the result of the preceding command:

[Figure: the standard normal curve plotted over the range -4 to 4]

Based on our assumption that the difference in runs at home and away stadiums follows a normal distribution, we can establish a confidence interval centered on our sample mean. To utilize the standard normal curve, we must normalize our statistic; because we use the sample standard deviation in place of the unknown population standard deviation, the normalized statistic actually follows a t-distribution, which at our sample size is closely approximated by the standard normal curve. Just as with the coin flip problem, we need to find the interval that covers 95 percent of the area under the curve, with the center of this area aligned to 0. Finding the area under a curve usually requires calculus, but there is a Haskell module designed for just this purpose, which makes the job a little easier.
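
Concretely, the normalization in question is the familiar t-statistic, where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $\mu_0$ the population mean under the null hypothesis (0 in our case):

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$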

We are going to work with the Error Function (Erf) module. You can install the Erf module using cabal, as follows:

$ cabal install erf

An introduction to the Erf module

Once this library is installed, you will have access to seven new functions that allow you to evaluate normal curves, including the normal cumulative distribution function (normcdf) and the inverse normal cumulative distribution function (invnormcdf). The normcdf function takes a parameter x (ranging from negative infinity to infinity) and returns the area under the standard normal curve from negative infinity to x. The invnormcdf function does the opposite: given an area (ranging from 0 to 1), it returns the value of x such that the area under the curve from negative infinity to x equals the specified area.

For example, the mean of the standard normal curve is 0 (it is the center of the plot). The area under the curve to the left of 0 is 0.5, and the area under the curve to the right of 0 is also 0.5. Simply calling normcdf 0 produces the area to the left, as follows:

> import Data.Number.Erf
> normcdf 0
0.5

We know that the total area under the curve is 1, which means that to produce the area to the right side of the plot of any value x, we must subtract normcdf x from 1:

> 1 - (normcdf 0)
0.5

If we are interested in the value of x that produces half the area of the chart, we can use invnormcdf. We have already established that the value of x that splits the curve exactly in half is 0, so it is no surprise that invnormcdf 0.5 produces 0.0:

> invnormcdf 0.5
0.0

Likewise, you can ask for the position of x that produces an area of 1, as follows:

> invnormcdf 1
Infinity

Haskell tells it like it is.

Using Erf to test the claim

We want to know the left and right end points of a range that covers an area of 0.95 (our confidence level) and is centered at the middle of the standard normal distribution. This is trickier than a simple normcdf 0.95. Because the range we desire has the 0.95 area in the middle, we need to compute the leftover area on each side of the range: subtract 0.95 from 1 and divide the result by 2:

> (1-0.95)/2
2.5000000000000022e-2

We then compute the position of x that has an area of 0.025 to its left, as follows:

> invnormcdf 0.025
-1.9599639845400543

-1.96 will be the left side of our interval. Since we know that the area of the unused section towards the far right of the plot is also 0.025, the right side of the interval will have the value 1.96. The full interval at the 95 percent confidence level is -1.96 to 1.96. We can demonstrate this by calling normcdf 1.96 and subtracting the area of the far left tail of the curve from it:

> normcdf 1.96 - 0.025
0.9500021048517795

Next, we will multiply the standard error by each of the interval end points and add the mean of the sample data to it:

(5.83 + -1.96 * 10.57, 5.83 + 1.96 * 10.57) = (-14.89, 26.55)
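
Putting the pieces together in GHCi, the whole interval can be computed directly from the values we already have; invnormcdf 0.975 gives the right end point of the central 95 percent region:

> let mean = average runsHomeAwayDiff
> let se = standardDeviation runsHomeAwayDiff / sqrt (genericLength runsHomeAwayDiff)
> let z = invnormcdf 0.975
> (mean - z * se, mean + z * se)

This evaluates to approximately (-14.89, 26.55), matching the hand computation above.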

A discussion of the test

The 95 percent confidence interval shows that, in the 2014 season, teams scored somewhere between -14.89 and 26.55 more runs at home stadiums than at away stadiums. As stated earlier, the null hypothesis is that the difference between runs scored at home and away games is less than or equal to 0. Since our confidence interval clearly contains 0, we have failed to reject the null hypothesis. There might be a home-field advantage, but our sample of the 2014 games failed to demonstrate it at the designated confidence level.
