Regression analysis

Should we tell our coaches that scoring is not important? Of course not. A team needs to score at least one run to have a chance of winning a game. We should communicate to our coaches the importance of scoring more runs per game, even when we know that there is a weak correlation between scoring and winning. We communicate this importance by using regression analysis. With regression analysis, we create an equation that will allow us to estimate the win percentage of a team based on their runs per game value.

The approach that we will follow is known as simple linear regression. Linear regression is the simplest type of regression. The working assumption is that our data forms a straight line. While we admit that it is difficult to make out a line in our data, we shall make the assumption that a line exists. This line indicates the increase in the win percentage of a team as the team scores more runs per game. When one factor goes up, the other goes up linearly.

The regression equation line

A linear equation is as follows:

$$y = A + Bx$$

In this equation, x represents the number of runs per game, B represents the slope (or the gradient) of the line, A represents the y-intercept, and y represents the estimated winning percentage of a team that scores x runs per game. The line is a best fit: it runs through the middle of the data and keeps the total squared vertical distance between the data points and the line as small as possible.

Estimating the regression equation

The regression equation is a best fit line. The goal in crafting this equation is to produce the smallest overall error between the real data and what the equation will estimate the data to be. We can minimize the error term by computing the covariance of the X and Y variables and dividing it by the variance of X:

$$B = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$$

We can compute the value of A (our y-intercept) by computing the average of X and Y and substituting these values into our linear equation:

$$A = \bar{Y} - B\,\bar{X}$$
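For readers who want to see where these two formulas come from, here is a brief sketch of the derivation. It assumes that the "overall error" being minimized is the sum of squared vertical distances between the data and the line (ordinary least squares), which is the criterion these formulas correspond to. The symbols S, n, x_i, and y_i are introduced here for illustration only.

```latex
% Sum of squared errors for a candidate line y = A + Bx
S(A, B) = \sum_{i=1}^{n} \left( y_i - A - B x_i \right)^2

% Setting the partial derivative with respect to A to zero gives the intercept
\frac{\partial S}{\partial A} = -2 \sum_{i=1}^{n} \left( y_i - A - B x_i \right) = 0
\quad\Longrightarrow\quad
A = \bar{Y} - B\,\bar{X}

% Setting the partial derivative with respect to B to zero and substituting A
\frac{\partial S}{\partial B} = -2 \sum_{i=1}^{n} x_i \left( y_i - A - B x_i \right) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} x_i \left( (y_i - \bar{Y}) - B (x_i - \bar{X}) \right) = 0

% Because \sum (y_i - \bar{Y}) = 0 and \sum (x_i - \bar{X}) = 0,
% x_i can be replaced by (x_i - \bar{X}) in both terms
B = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{X})^2}
  = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
```

The factor used to normalize the covariance and the variance (n or n - 1) cancels in the ratio, so either convention gives the same slope as long as both quantities use the same one.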

Translate the formulas to Haskell

In Haskell, the preceding formulas will look like the following code:

{- Perform simple linear regression to find a best-fit line.
  Returns a tuple of (gradient, intercept) -} 
linearRegression :: [Double] -> [Double] -> (Double, Double)
linearRegression x y = (gradient, intercept)
  where
    xavg = average x
    yavg = average y
    xstdev = standardDeviation x
    gradient = covariance x y / (xstdev * xstdev)
    intercept = yavg - gradient * xavg

Here, I have renamed B to gradient and A to intercept.
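The function above relies on three helpers, average, covariance, and standardDeviation, whose definitions are not shown in this section. The following is a minimal sketch of how they might be written so that linearRegression can be loaded and run on its own. These definitions are assumptions for illustration, not necessarily the versions used elsewhere in this book. Both covariance and standardDeviation divide by n here; what matters for the gradient is that the two use the same convention.

```haskell
-- Hypothetical helper definitions (assumptions, not the book's own code).
-- Covariance and standard deviation both use the population form (divide by n).

-- | Arithmetic mean of a list. Yields NaN for an empty list.
average :: [Double] -> Double
average xs = sum xs / fromIntegral (length xs)

-- | Population covariance of two lists.
-- zipWith stops at the shorter list, so the lists should have equal length.
covariance :: [Double] -> [Double] -> Double
covariance xs ys = average (zipWith (\x y -> (x - xavg) * (y - yavg)) xs ys)
  where
    xavg = average xs
    yavg = average ys

-- | Population standard deviation: the square root of the covariance
-- of a list with itself (that is, of its variance).
standardDeviation :: [Double] -> Double
standardDeviation xs = sqrt (covariance xs xs)
```

With these helpers in scope, calling linearRegression [1, 2, 3] [2, 4, 6] should return a gradient of about 2 and an intercept of about 0, since those points lie exactly on the line y = 2x.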

Returning to the baseball analysis

When we execute the preceding code with our baseball data, we get the following output:

> let (gradient, intercept) = linearRegression runsPerGame winPercentage 
> gradient
6.702070689192714e-2
> intercept
0.22742671114723823

The slope is approximately 0.07 (0.067 before rounding) and the intercept is approximately 0.23. This slope indicates that if a team increases its scoring by 1 run per game, its seasonal win percentage should increase by an estimated seven percentage points. Because the relationship is linear, an increase of 2 runs per game corresponds to an estimated increase of about 13 percentage points (2 × 0.067), and so forth. Here, we estimate the win percentage for a fictional team that scored 3, 4, and 5 runs per game for an entire season:

> 3*gradient+intercept
0.42848883182301967
> 4*gradient+intercept
0.4955095387149468
> 5*gradient+intercept
0.5625302456068739

A team that scores 3 runs per game should win about 43 percent of their games. At 4 runs per game, the estimated win percentage is 50 percent. At 5 runs per game, this increases to 56 percent. In our dataset, the team with the highest win percentage won over 60 percent of their games while not quite hitting 5 runs per game.
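The three calculations above can be wrapped in a small reusable function. The sketch below hard-codes the coefficients from the GHCi output shown earlier; the names estimateWinPct and runsForWinPct are hypothetical and introduced only for this illustration. The second function inverts the line to ask how many runs per game the estimate associates with a given win percentage.

```haskell
-- Coefficients copied from the GHCi output shown earlier in this section.
gradient, intercept :: Double
gradient = 6.702070689192714e-2
intercept = 0.22742671114723823

-- | Estimated win percentage for a team scoring the given runs per game.
-- (Hypothetical helper name.)
estimateWinPct :: Double -> Double
estimateWinPct runs = runs * gradient + intercept

-- | Runs per game at which the regression line reaches a target win
-- percentage; the algebraic inverse of estimateWinPct. (Hypothetical name.)
runsForWinPct :: Double -> Double
runsForWinPct target = (target - intercept) / gradient
```

For example, estimateWinPct 4 gives the same estimate as the 4-runs calculation above (about 0.4955), and runsForWinPct 0.5 suggests that a team needs roughly 4.07 runs per game for the line to estimate a .500 season. Both are estimates from a line fitted to a weak correlation, so they are best treated as rough guides.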

Plotting the baseball analysis with the regression line

In the previous chart, which displays the runs per game and the win percentage for each team, the horizontal axis ranges between 3.2 and 4.8 runs per game. I created a new dataset of line estimates based on values of runs per game that range from 3.3 to 4.7. This way, we have a line that fits nicely within the existing chart:

> let winEstimate = map (\x -> x*gradient + intercept) [3.3, 3.4 .. 4.7]
> let regressionLine = zip [3.3, 3.4 .. 4.7] winEstimate
> plot (PNG "runs_and_wins_with_regression.png") [Data2D [Title "Runs Per Game VS Win % in 2014"] [] (zip runsPerGame winPercentage), Data2D [Title "Regression Line", Style Lines, Color Blue] [] regressionLine]

The preceding statements would give the following chart as a result:

[Figure: Runs Per Game VS Win % in 2014, scatter plot of each team with the regression line drawn in blue]

What makes the regression line so helpful in our analysis is that it allows us to quickly identify which teams performed above and below our estimate. Above the line are the teams that won more often than their runs per game would suggest. Below the line are the teams that won less often than our estimate.
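The same above-or-below reading can be done numerically by computing each team's residual, which is the actual win percentage minus the estimate from the line. The following is a minimal sketch; the function names residuals and aboveAndBelow are hypothetical, and it assumes that team names are available in a list that lines up with the runs and win percentage lists.

```haskell
import Data.List (partition)

-- | Signed vertical distance of each data point from the regression line.
-- A positive value means the team won more often than the line estimates.
-- Arguments: gradient, intercept, runs per game, actual win percentages.
residuals :: Double -> Double -> [Double] -> [Double] -> [Double]
residuals gradient intercept xs ys =
  zipWith (\x y -> y - (gradient * x + intercept)) xs ys

-- | Split named residuals into (teams above the line, teams below the line).
-- A team exactly on the line (residual of 0) is counted as below.
aboveAndBelow :: [(String, Double)] -> ([String], [String])
aboveAndBelow named = (map fst above, map fst below)
  where
    (above, below) = partition ((> 0) . snd) named
```

Assuming a list called teamNames exists alongside the data, a call such as aboveAndBelow (zip teamNames (residuals gradient intercept runsPerGame winPercentage)) would return the two groups of teams.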
