Chapter 6. Correlation and Regression Analysis

In the previous chapter, we examined the classical procedure to test claims using the normal distribution curve. We also discussed the fundamental concept of variance and presented the function that computes the standard deviation of a dataset.

This chapter examines the relationship between input and output data. The idea is a truism to most sports fans: there is a relationship between scoring and winning. Teams that score more points tend to win more games, and teams that rarely score many points tend not to win very often. The craft of measuring the relationship between the input (the number of points scored) and the output (whether the team won or not) is known as correlation analysis. Regression analysis allows us to estimate an unknown output from the input by creating an equation that minimizes the errors between the independent and dependent variables that are believed to be linked. Once we have a regression equation, we can estimate an output for each known input and compare it against the known output. This approach is not without its drawbacks, so a discussion of it should also come with an understanding of the potential errors that result from using it.

In this chapter, we will cover the following:

  • The terminology of correlation and regression
  • Study – is there a relationship between scoring and winning in baseball?
  • Regression analysis
  • The pitfalls of regression analysis

The terminology of correlation and regression

Before we explore any data, we will discuss some terminology. Regression analysis requires a set of input variables and an output variable. Analysis that uses a single column of data as the input variable is known as univariate analysis, and analysis that uses multiple input variables is known as multivariate analysis. Regression analysis allows us to estimate unknown values in a single column of output. The input data is known as an independent variable; no assumptions are made about how an independent variable behaves. The output, by contrast, is known as a dependent variable, and the assumption is that the input variables influence the output variable. Returning to our example, we assume that scoring many points leads to a team winning more often than an average team, where an average team is one that wins and loses an equal number of times. A team can work hard, practice, and play to its full potential, and we expect this offensive effort to translate into scoring more points than average and, in turn, into winning more often than an average team (while ignoring equally important factors such as the defensive abilities of the opposing team). Likewise, a team that puts minimal effort into practice will tend to score fewer points, and we believe this translates into winning less often than average. In both cases, the team's offensive effort is the independent variable: the amount of hard work put toward offense is entirely under the team's control. Winning, however, is not completely under its control. The opposing team might have a better offensive or defensive plan, or more capable players; these qualities are unknown variables when performing regression. Finally, there is luck. Ignoring luck, our assumption is that winning depends on scoring, and scoring depends on offensive effort.

The expectation of a variable

An independent variable is denoted as X. If we have multiple independent variables, each one is denoted as $X_1, X_2, X_3, \ldots, X_m$. A dependent variable is denoted as Y. The X and Y variables each represent a dataset of n values (or observations). The mean of X is written as $\bar{X}$ and is read as "X-bar". If we wish to take every value of X and subtract $\bar{X}$ from it, we will write it in the following way:

$$X_i - \bar{X}, \quad i = 1, 2, \ldots, n$$

The term average (the sum of a list of values divided by the number of values) also goes by the names mean and expectation. Sometimes, we will see the preceding formula written as follows:

$$X - E[X]$$

You can think of E as a function that computes the average of a list of values.

The result of this operation will be a new dataset whose average is 0. If a value in this dataset is positive, we know that the original value is above average; likewise, a negative value tells us that it is below average.
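
To make this concrete, the following is a minimal Haskell sketch of mean-centering a dataset. The average function mirrors the helper developed in Chapter 2 (the module at the end of this chapter imports it from LearningDataAnalysis02), and center is a hypothetical name used only for this illustration:

  {- A sketch of mean-centering: subtract the mean from every value.
     average mirrors the helper from Chapter 2; center is a
     hypothetical name used for illustration only. -}
  average :: [Double] -> Double
  average xs = sum xs / fromIntegral (length xs)

  -- Subtract the mean from each value; the result has a mean of 0.
  center :: [Double] -> [Double]
  center xs = map (subtract (average xs)) xs

For example, center [1, 2, 3, 4, 5] evaluates to [-2.0, -1.0, 0.0, 1.0, 2.0], a dataset whose mean is 0.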

The variance of a variable

We would like to know the spread of this variable. To do this, we take each value in the dataset, subtract the mean, and square the result. This produces a dataset consisting of non-negative values. We then add up all of these values and divide the sum by the number of observations to get the average squared distance from the mean. This, as we discussed in Chapter 5, Hypothesis Testing, is the population variance. To find the sample variance, we divide by (n - 1) rather than n:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

We can write the same formula more cleanly by using the E function:

$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$$

To find the population standard deviation, we take the square root of the population variance. The standard deviation is signified by the Greek letter sigma (σ):

$$\sigma = \sqrt{\mathrm{Var}(X)} = \sqrt{E\left[(X - E[X])^2\right]}$$
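
As a reference point, here is a minimal sketch of the population variance and standard deviation, following the formulas above and reusing the average helper sketched earlier. The names populationVariance and populationStdDev are illustrative; the chapter itself reuses the standardDeviation function from Chapter 5:

  {- Population variance: the average squared distance from the mean.
     For the sample variance, divide the sum by (n - 1) instead of n. -}
  populationVariance :: [Double] -> Double
  populationVariance xs = average $ map (\x -> (x - xavg) ^ 2) xs
    where xavg = average xs

  -- Population standard deviation: the square root of the variance.
  populationStdDev :: [Double] -> Double
  populationStdDev = sqrt . populationVariance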

Normalizing a variable

Now that we know the typical distance of the values in the dataset from their mean, we can use it to normalize the X variable. To do this, we divide each value of $X - \bar{X}$ by the standard deviation:

$$\frac{X - \bar{X}}{\sigma}$$

This normalized version of our dataset still contains positive and negative values, but each value now measures how far it lies from the mean relative to a typical data value. A score between -1 and 1 means that the value is closer to the mean than the typical value. A score between -2 and -1, or between 1 and 2, means that the value is between one and two typical distances from the mean; most of your data should fall between -2 and 2. A score between -3 and -2, or between 2 and 3, indicates that the value is more distant still. A score greater than 3 or less than -3 means that the value is more than three times the typical distance from the mean. Values with a score in this range are considered rare and indicate special circumstances that merit investigation. Values that deviate significantly from the majority of a dataset are called outliers. An outlier may represent special circumstances and should be investigated for unique qualities, or it can mean something less special: a noisy data point that was not properly identified in the cleaning phase of data analysis.
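
The normalization step can be sketched in the same style, again assuming the average and populationStdDev helpers from the previous sketches; normalize and outliers are hypothetical names used only to illustrate the idea:

  {- Convert a dataset to scores measured in standard deviations from
     the mean, then flag values more than 3 deviations away. -}
  normalize :: [Double] -> [Double]
  normalize xs = map (\x -> (x - xavg) / stdev) xs
    where
      xavg  = average xs
      stdev = populationStdDev xs

  -- Candidate outliers: values whose score lies beyond -3 or 3.
  outliers :: [Double] -> [Double]
  outliers xs = [x | (x, z) <- zip xs (normalize xs), abs z > 3]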

The covariance of two variables

When working with two variables (in our case, an input and an output variable), we may want to study how the variables move in conjunction with each other. Covariance, like variance, is a tool that helps us measure how variables relate to each other. Instead of one variable, we now have two: X and Y.

We begin by subtracting the mean of X from each value of X, and then multiply each result by the corresponding value of Y minus the mean of Y:

$$(X_i - \bar{X})(Y_i - \bar{Y})$$

Again, we add up each of these products and divide the sum by the number of observations to find the population covariance coefficient. To find the sample covariance coefficient, we divide by (n - 1) instead of n:

$$\mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

Again, we can write the same formula by using the E notation:

$$\mathrm{cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$

The covariance coefficient is a measurement of how the variables relate to each other. If X increases as Y increases, the coefficient will be positive. Likewise, if one variable increases while the other decreases, the coefficient will be negative.

Finding the Pearson r correlation coefficient

We can normalize this value as we did before. To do this, we need the standard deviations of X and Y ($\sigma_X$ and $\sigma_Y$). The normalized covariance always falls between -1 and 1 and is known as the Pearson r correlation coefficient. The formula is as follows:

$$r = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

As with the covariance, a positive r value tells us that the variables are positively correlated (as one variable increases, the other increases), and a negative r value tells us that they are inversely correlated (as one increases, the other decreases). The closer an r value is to the extremes of -1 or 1, the stronger the correlation. An r value close to 0 tells us that the relationship between the two variables is weak.

Finding the Pearson r² correlation coefficient

The r value is often tweaked further into r² (we simply multiply r by itself). Because r ranges from -1 to 1, the value of r² will always fall between 0 and 1; for example, an r of 0.8 and an r of -0.8 both yield an r² of 0.64. A higher r² implies stronger evidence of a correlation, while a lower r² implies little or no correlation. An r² greater than 0.9 is considered an excellent correlation, while an r² less than 0.5 is considered weak. Interpreting an r² is something of an art: it is what you make of it. The r² value also goes by the name coefficient of determination.

Before we continue, we must mention a key pitfall of r²: discovering an input and output variable pair that produces a high r² value does not automatically imply that the input has an impact on the output. This is captured by the saying, correlation does not imply causation. When data mining for correlations, you are bound to find pairs of variables that produce high r² values. Some of these input variables might have a causal effect on the output variable, but at times you will find a similar pattern between two unrelated variables. This happens frequently when two output variables are compared. The measurement only tells you what correlates, not whether the discovered correlations are sensible. You might discover a link between the sale of food and the sale of beverages at a restaurant, but there is nothing interesting about this correlation because both variables depend on the number of customers who visit the restaurant.

Translating what we've learned to Haskell

We can express these formulas in Haskell in the following way:

module LearningDataAnalysis06 where

  import Data.List
  import Graphics.EasyPlot
  import LearningDataAnalysis02
  import LearningDataAnalysis04
  import LearningDataAnalysis05

  {- Covariance -}
  covariance :: [Double] -> [Double] -> Double
  covariance x y = average $ zipWith (\xi yi -> (xi - xavg) * (yi - yavg)) x y
    where
      xavg = average x
      yavg = average y

  {- Pearson r Correlation Coefficient -}
  pearsonR :: [Double] -> [Double] -> Double
  pearsonR x y = r
    where
      xstdev = standardDeviation x
      ystdev = standardDeviation y
      r = covariance x y / (xstdev * ystdev)

  {- Pearson r-squared -}
  pearsonRsqrd :: [Double] -> [Double] -> Double
  pearsonRsqrd x y = pearsonR x y ^ 2
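
To see these functions in action, an illustrative GHCi session with made-up numbers (not this chapter's baseball data) might look like the following. The exact pearsonR result depends on whether the standardDeviation function from Chapter 5 uses the population or the sample formula, so the values in the comments are only approximate:

  *Main> let x = [1, 2, 3, 4, 5] :: [Double]
  *Main> let y = [2, 4, 5, 4, 6] :: [Double]
  *Main> covariance x y       -- 1.6, the average of the products above
  *Main> pearsonR x y         -- roughly 0.85 if standardDeviation is population-based
  *Main> pearsonRsqrd x y     -- roughly 0.73, the square of pearsonR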