Chapter 5. Linear Regression with Python

If you have mastered the content of the last two chapters, implementing a predictive model will be a cakewalk. Remember the 80-20 split between data cleaning and wrangling on one side and modelling on the other? Why, then, dedicate a full chapter to a single model? The reason is not the mechanics of running a predictive model; it is understanding the mathematics (algorithms) behind the ready-made methods we will use, interpreting the swathe of results these models produce after implementation, and making sense of them in context. It is therefore of utmost importance to understand both the mathematics behind the algorithms and the result parameters of these models.

From this chapter onwards, we will deal with one predictive modelling algorithm per chapter. In this chapter, we will discuss a technique called linear regression. It is the most basic and generic technique for creating a predictive model out of a historical dataset with an output variable.

The aim of this chapter is to thoroughly understand the mathematics behind linear regression and the results it generates, by illustrating its implementation on various datasets. The broad agenda of this chapter is as follows:

  • The maths behind linear regression: How does the model work? How is the equation of the model created from the dataset? What are the assumptions behind this calculation?
  • Implementing linear regression with Python: There are a couple of ready-made methods to implement linear regression in Python. Instead of using these ready-made methods, one can write one's own Python code snippet for the entire calculation with custom inputs. However, as linear regression is a regularly used algorithm, the use of ready-made methods is quite common; implementation from scratch is generally used to illustrate the maths behind the algorithm.
  • Making sense of result parameters: There will be tons of result parameters, such as slope, coefficients, p-values, and so on. It is very important to understand what each parameter means and the range its values should lie in for the model to be efficient.
  • Model validation: Any predictive model needs to be validated. One common method of validation is splitting the available dataset into training and testing datasets, as discussed in the previous chapter. The training dataset is used to develop the model, while the testing dataset is used to compare the results predicted by the model to the actual values.
  • Handling issues related to linear regression: Issues such as multi-collinearity, categorical variables, and non-linear relationships come up while implementing linear regression; these need to be taken care of to ensure an efficient model.

Before we kick-start the chapter, let's discuss what a model means and entails. A mathematical/statistical/predictive model is nothing but a mathematical equation consisting of input variables yielding an output when values of the input variables are provided. For example, let us, for a moment, assume that the price (P) of a house is linearly dependent upon its size (S), amenities (A), and availability of transport (T). The equation will look like this:

P = a1*S + a2*A + a3*T

This is called the model, and a1, a2, and a3 are called the variable coefficients. The variable P is the predicted output, while S, A, and T are the input variables. Here, S, A, and T are known, but a1, a2, and a3 are not. These parameters are estimated using historical input and output data. Once the values of these parameters are found, the equation (model) is ready for testing. Now, S, A, and T can be numerical, binary, categorical, and so on, and P can also be numerical, binary, or categorical; it is this need to handle various types of variables that gives rise to a large number of models.
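To make this concrete, here is a minimal sketch of such a model as a plain Python function; the function name and the coefficient values below are made up purely for illustration and are not estimated from any data:

def predict_price(size, amenities, transport, a1=0.03, a2=1.5, a3=2.0):
    # P = a1*S + a2*A + a3*T, with illustrative (not estimated) coefficients
    return a1 * size + a2 * amenities + a3 * transport

predict_price(1200, 3, 2)   # predicted price for one hypothetical house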

Understanding the maths behind linear regression

Let us assume that we have a hypothetical dataset containing information about the costs of several houses and their sizes (in square feet):

Size (square feet) X    Cost (lakh INR) Y
1500                    45
1200                    38
1700                    48
800                     27

There are two kinds of variables in a model:

  • The input or predictor variable, which helps predict the value of the output variable
  • The output variable, which is the one being predicted

In this case, cost is the output variable and size is the input variable. The output and input variables are generally referred to as Y and X, respectively.
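For reference, the same toy dataset can be held in a pandas data frame; this is a small sketch, and the house_df name and column labels are our own choice:

import pandas as pd

# the hypothetical dataset from the preceding table, with X and Y as columns
house_df = pd.DataFrame({'Size (square feet) X': [1500, 1200, 1700, 800],
                         'Cost (lakh INR) Y': [45, 38, 48, 27]})
house_df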

In the case of linear regression, we assume that Y (Cost) is a linear function of X (Size) and to estimate Y, we write:

Ye = α + β*X

Where Ye is the estimated or predicted value of Y based on our linear equation.

The purpose of linear regression is to find statistically significant values of α and β that minimize the difference between Y and Ye. If we are able to determine values of these two parameters satisfying these conditions, we will have an equation with which we can predict the value of Y, given a value of X.

So, to summarize, linear regression (like any other supervised algorithm) requires historical data with one output variable and one or more input variables to build a model/equation, using which the output variable can be calculated/predicted when the input variables are available. In the preceding case, if we find that α = 2 and β = 0.3, then the equation will be:

Ye = 2 + 0.3*X

Using this equation, we can find the cost of a home of any size. For a 900 square feet house, the cost will be:

Ye = 2 + 0.3*900 = 272

The next question we can ask is: how do we estimate α and β? We use the method of least squares, which minimizes the sum of the squared differences between Y and Ye. The difference between Y and Ye can be represented as e:

e = Y - Ye

Thus, the objective is to minimize Σe², the sum of the squared errors, where the summation is over all the data points.

We can equivalently minimize (1/n)*Σe², the mean of the squared errors, where n is the number of data points.
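To make the objective concrete, here is a minimal sketch of the quantity that least squares minimizes, written as a small Python function; the function name and the two candidate pairs of (alpha, beta) are arbitrary and for illustration only:

import numpy as np

def sum_of_squared_errors(alpha, beta, x, y):
    # e = Y - Ye, with Ye = alpha + beta*X; return the sum of e squared
    y_est = alpha + beta * np.asarray(x)
    return np.sum((np.asarray(y) - y_est) ** 2)

# the small house dataset from the table above
x = [1500, 1200, 1700, 800]
y = [45, 38, 48, 27]

# least squares looks for the (alpha, beta) pair that makes this number smallest
sum_of_squared_errors(2, 0.3, x, y)
sum_of_squared_errors(10, 0.02, x, y)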

Using calculus, we can show that the values of the unknown parameters are as follows:

β = Σ(X - Xm)*(Y - Ym) / Σ(X - Xm)²
α = Ym - β*Xm

where Xm is the mean of the X values and Ym is the mean of the Y values.

If you are interested in knowing how these formulae come about, you can go through the following information box, which describes the derivation. The steps of this derivation can be summarized as follows:

  • Take partial derivatives of Σe² with respect to each of the variable coefficients and equate them to 0 (at a maximum or minimum, the derivative of a function is 0). This gives us as many equations as there are coefficients:

    ∂(Σe²)/∂α = -2*Σ(Y - α - β*X) = 0
    ∂(Σe²)/∂β = -2*Σ X*(Y - α - β*X) = 0

    where e = (Y - Ye), Y is the actual value, and Ye is the estimated/predicted value of Y, that is, α + β*X

  • Solve these equations to get the values of the variable coefficients:

    β = Σ(X - Xm)*(Y - Ym) / Σ(X - Xm)²
    α = Ym - β*Xm

Almost all statistical tools have ready-made programs to calculate the coefficients α and β. However, it is still very important to understand how they are calculated behind the curtain.
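As an illustration, the preceding formulae can be applied directly to the small house dataset with a few lines of NumPy; this is a sketch of the calculation itself, not one of the ready-made methods:

import numpy as np

# the hypothetical house dataset from the table above
X = np.array([1500, 1200, 1700, 800])
Y = np.array([45, 38, 48, 27])

Xm, Ym = X.mean(), Y.mean()
beta = np.sum((X - Xm) * (Y - Ym)) / np.sum((X - Xm) ** 2)
alpha = Ym - beta * Xm
alpha, beta   # the least squares estimates of the intercept and the slope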

Linear regression using simulated data

For the purpose of linear regression, we write Ye = α + β*X for the estimate; the actual Y will rarely be perfectly linear in X and will have an error component or residual, so we write Y = α + β*X + ε.

In the above equation, ε is the error component or residual. It is a random variable and is assumed to be normally distributed.

Fitting a linear regression model and checking its efficacy

Let us simulate the data for the X and Y variables and try to look at how the predicted values (Ye) differ from the actual value (Y).

For X, we generate 100 normally distributed random numbers with mean 1.5 and standard deviation 2.5 (you can take any other numbers of your choice and try). For the predicted value (Ye), we assume an intercept of 2 and a slope of 0.3 and write Ye = 2 + 0.3*x. Later, we will calculate the values of α and β from this data and see how that changes the efficacy of the model. For the actual value, we add a residual term (res), which is nothing but a normally distributed random variable with mean 0 and standard deviation 0.5.

The following is the code snippet to generate these numbers and convert these three columns in a data frame:

import pandas as pd
import numpy as np

# 100 normally distributed values with mean 1.5 and standard deviation 2.5
x = 2.5 * np.random.randn(100) + 1.5
# residuals: normally distributed with mean 0 and standard deviation 0.5
res = .5 * np.random.randn(100) + 0
# predicted values from the assumed equation Ye = 2 + 0.3*x
ypred = 2 + .3 * x
# actual values = assumed equation + residual
yact = 2 + .3 * x + res

xlist = x.tolist()
ypredlist = ypred.tolist()
yactlist = yact.tolist()
df = pd.DataFrame({'Input_Variable(X)': xlist,
                   'Predicted_Output(ypred)': ypredlist,
                   'Actual_Output(yact)': yactlist})
df.head()

The resultant data frame df output looks similar to the following table:

Fig. 5.1: Dummy dataset

Let us now plot both the actual output (yact) and the predicted output (ypred) against the input variable (x), to compare them and see what the difference between them is. This ultimately answers the bigger question of how accurately the proposed equation has been able to predict the value of the output:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

# generate the data in the same way as before
x = 2.5 * np.random.randn(100) + 1.5
res = .5 * np.random.randn(100) + 0
ypred = 2 + .3 * x
yact = 2 + .3 * x + res

plt.plot(x, ypred)        # predicted values as a line
plt.plot(x, yact, 'ro')   # actual values as red dots
plt.title('Actual vs Predicted')

The output of the snippet looks similar to the following screenshot. The red dots are the actual values (yact) while the blue line is the predicted value (ypred):

Fig. 5.2: Plot of Actual vs Predicted values from the dummy dataset

Let us add a line representing the mean of the actual values for a better perspective of the comparison. The line in green represents the mean of the actual values:

Fig. 5.3: Plot of Actual vs. Predicted values from the dummy dataset with mean actual value

This can be achieved by the following code snippet, obtained by slightly tweaking the preceding one:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

x = 2.5 * np.random.randn(100) + 1.5
res = .5 * np.random.randn(100) + 0
ypred = 2 + .3 * x
yact = 2 + .3 * x + res

# mean of the actual values, repeated so that it plots as a horizontal line
ymean = np.mean(yact)
yavg = [ymean for i in range(len(x))]

plt.plot(x, ypred)        # predicted values
plt.plot(x, yact, 'ro')   # actual values
plt.plot(x, yavg)         # mean of the actual values
plt.title('Actual vs Predicted')

Now, the question to ask is why we chose to plot the mean value of yact. This is because, when we don't have any predictive model, our best bet is to go with the mean of the observed values and quote that as the predicted value.

Another point to think about is how to judge the efficacy of our model. If you pass any data containing two variables, one input and one output, the statistical program will generate some values of alpha and beta. But how do we know whether these values give us a good model?

Fig. 5.4: Actual vs. Predicted vs. Fitted (Regressed) line from the dummy dataset (a picture to always keep in mind whenever you think of R2)

When there is no model, our best guess is the mean, and the total variability of the output around the mean is called the Total Sum of Squares, or SST:

SST = Σ(Y - Ym)²

Now, this total error is composed of two terms. One is the difference between the regressed (predicted) value and the mean value; this is the part the model seeks to explain and is called the Regression Sum of Squares, or SSR:

SSR = Σ(Ye - Ym)²

The other is the unexplained random term; let us call it the Difference Sum of Squares, or SSD:

SSD = Σ(Y - Ye)²

As you can see in the preceding figure, or can guess intuitively:

SST = SSR + SSD

where SSR is the part of the difference explained by the model, SSD is the part not explained by the model (and is random), and SST is the total error.

The larger the share of SSR in SST, the better the model is. This share is quantified by R2 (R-squared), also called the coefficient of determination:

R2 = SSR / SST

Since SST ≥ SSR, the value of R2 can range from 0 to 1. The closer it is to 1, the better the model. A model with R2 = 0.9 will be preferred over a model with R2 = 0.6, all other factors remaining the same. That being said, a good R2 alone doesn't mean that the model is efficient; there are many other factors to analyse before we come to that conclusion. But R2 is a pretty good indicator of whether a linear regression model will be effective.

Let us see what the value of R2 is for the dataset that we created in the preceding section. When we perform a linear regression, the R2 value will be calculated automatically. Nevertheless, it is great to have an understanding of how it is calculated.

In the following code snippet, SSR and SST have been calculated according to the formulae described in the preceding section and have been divided to calculate R2:

# mean of the actual output values in the data frame
ymean = np.mean(df['Actual_Output(yact)'])
# Regression Sum of Squares and Total Sum of Squares, point by point
df['SSR'] = (df['Predicted_Output(ypred)'] - ymean) ** 2
df['SST'] = (df['Actual_Output(yact)'] - ymean) ** 2
SSR = df.sum()['SSR']
SST = df.sum()['SST']
SSR / SST
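Since the same calculation recurs later in the chapter, it can be convenient to wrap it in a small helper; this is a sketch of our own, not a library function:

import numpy as np

def r_squared(y_actual, y_model):
    # R2 = SSR / SST, with SSR and SST as defined in the preceding section
    y_actual = np.asarray(y_actual)
    y_model = np.asarray(y_model)
    ymean = y_actual.mean()
    ssr = np.sum((y_model - ymean) ** 2)
    sst = np.sum((y_actual - ymean) ** 2)
    return ssr / sst

Calling r_squared(df['Actual_Output(yact)'], df['Predicted_Output(ypred)']) reproduces the ratio computed above.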

The value of R2 comes out to be 0.57, suggesting that ypred provides a decent prediction of yact. In this case, we randomly assumed values for α and β (α = 2, β = 0.3), which might or might not be the best values. We saw earlier that the least squares method is used to find optimum values of α and β. Let us use those formulae to calculate α + β*X and see whether there is an improvement in R2 or not. Hopefully, there will be.

Finding the optimum value of variable coefficients

Let us go back to the data frame df that we created a few pages back. The Input_Variable(X) column is the predictor variable from which the model coefficients (α, β) will be derived. The Actual_Output(yact) column, as the name suggests, is the actual output variable. Using these two columns, we will calculate the values of α and β according to the formulae described previously.

One thing to be cautious about while working with random numbers is that they will not produce the same result on every run. It is very likely that you will get numbers different from those mentioned in this text, and that is alright as long as you grasp the underlying concept.
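If you would rather have a repeatable run, one option is to fix NumPy's random seed once, before generating the data; the seed value used below is arbitrary:

import numpy as np

np.random.seed(42)   # any fixed integer makes the subsequent random draws repeatable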

To calculate the coefficients, we will create a few more columns in the df data frame that is already defined, as we did while calculating the value of R2. Just to reiterate, here are the formulas again:

β = Σ(X - Xm)*(Y - Ym) / Σ(X - Xm)²
α = Ym - β*Xm

We write the following code snippet to calculate these values:

# means of the input and the actual output
xmean = np.mean(df['Input_Variable(X)'])
ymean = np.mean(df['Actual_Output(yact)'])

# numerator and denominator of the formula for beta, point by point
df['beta'] = (df['Input_Variable(X)'] - xmean) * (df['Actual_Output(yact)'] - ymean)
df['xvar'] = (df['Input_Variable(X)'] - xmean) ** 2
betan = df.sum()['beta']
betad = df.sum()['xvar']
beta = betan / betad

# alpha follows from beta: alpha = Ym - beta*Xm
alpha = ymean - (betan / betad) * xmean
beta, alpha
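As an optional sanity check, NumPy's polyfit fits the same least-squares line; with degree 1 it returns the slope first and the intercept second, and these should match beta and alpha above up to floating-point noise:

# fit a degree-1 polynomial (a straight line) to the same data
slope, intercept = np.polyfit(df['Input_Variable(X)'], df['Actual_Output(yact)'], 1)
slope, intercept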

If you go through the code carefully, you will notice that betan and betad are the numerator and the denominator of the formula for beta. Once beta is calculated, getting alpha is a cakewalk. The snippet outputs the values of alpha and beta. For my run, I got the following values:

(Output: the values of beta and alpha obtained in this particular run)

As we can see, the values are a little different from what we had assumed earlier, that is, α = 2 and β = 0.3. Let us see how the value of R2 changes if we use the values predicted by the model with these calculated parameters. The equation for the model can be written as:

ymodel = α + β*X (with the calculated values of α and β)

Let us create a new column in our df data frame to accommodate the values generated by this equation and call this ymodel. To do that we write the following code snippet:

df['ymodel']=beta*df['Input_Variable(X)']+alpha

Let us calculate the value of R2 based on this column and see whether it has improved or not. To calculate that, we can reuse the code snippet we wrote earlier by replacing the Predicted_Output(ypred) variable with ymodel variable:

df['SSR']=(df['ymodel']-ymean)**2
df['SST']=(df['Actual_Output(yact)']-ymean)**2
SSR=df.sum()['SSR']
SST=df.sum()['SST']
SSR/SST

The new value of R2 comes out to be 0.667, a decent improvement over the value of 0.57 that we got when we assumed the values of α and β.
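As another optional cross-check, assuming scikit-learn is installed, its r2_score function computes 1 - SSD/SST; for a least-squares fit the identity SST = SSR + SSD holds, so the number it returns should agree with the SSR/SST value above up to floating-point noise:

from sklearn.metrics import r2_score

# r2_score(y_true, y_pred) = 1 - SSD/SST, which equals SSR/SST for the fitted ymodel
r2_score(df['Actual_Output(yact)'], df['ymodel'])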

Let us also plot this new result derived from the model equation against the actual values and our earlier assumed model, just to get a better visual understanding. We will introduce one more line to represent the model equation we just created:

%matplotlib inline
plt.plot(df['Input_Variable(X)'], df['Predicted_Output(ypred)'])    # assumed coefficients
plt.plot(df['Input_Variable(X)'], df['ymodel'])                     # calculated coefficients
plt.plot(df['Input_Variable(X)'], df['Actual_Output(yact)'], 'ro')  # actual values
plt.plot(df['Input_Variable(X)'], [ymean] * len(df))                # mean of the actual values
plt.title('Actual vs Predicted vs Model')

The graph looks similar to the following figure. As we can see, the ymodel and ypred are more or less overlapping, as the values of a and β are not very different.

Legend: blue line – ypred, green line – ymodel, red dots – yact, red line – ymean

Fig. 5.5: Actual vs Predicted vs Fitted line from the dummy dataset, where the model coefficients have been calculated rather than assumed
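Rather than keeping track of the colours by hand, you can pass a label to each plot call and let matplotlib draw the legend; a small variation of the preceding snippet:

plt.plot(df['Input_Variable(X)'], df['Predicted_Output(ypred)'], label='ypred (assumed coefficients)')
plt.plot(df['Input_Variable(X)'], df['ymodel'], label='ymodel (calculated coefficients)')
plt.plot(df['Input_Variable(X)'], df['Actual_Output(yact)'], 'ro', label='yact (actual values)')
plt.plot(df['Input_Variable(X)'], [ymean] * len(df), label='mean of yact')
plt.legend()
plt.title('Actual vs Predicted vs Model')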
