Implementing linear regression with Python

Let's now go ahead and build a simple linear regression model, see what issues we face, and look at how they can be resolved to make the model more robust. We will use the advertising data that we used earlier to illustrate correlation.

The following two methods implement linear regression in Python:

  • The ols method in the statsmodels.formula.api library
  • The scikit-learn package

Let's implement a simple linear regression using the first method and then build it up into a multiple linear regression model. We will then also look at how the second method is used to do the same.

Linear regression using the statsmodel library

Let's first import the Advertising data, as shown:

import pandas as pd
advert=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Linear Regression/Advertising.csv')
advert.head()

To reiterate, this dataset contains the advertising budget spent on TV, Radio, and Newspapers for a particular product, along with the resulting sales. We expect a positive correlation between such advertising costs and sales. We have already seen that there is a good correlation between TV advertising costs and sales. Let's see whether such a relationship is present and, if so, what it looks like. To do that, we write the following code:

import statsmodels.formula.api as smf
model1=smf.ols(formula='Sales~TV',data=advert).fit()
model1.params

In this code snippet, we have assumed a linear relationship between advertising costs on TV and sales. We have also created a best fit using the least sum of squares method. This snippet will output the values of the model parameters, that is, a and β. The following is the output:

[Output: Intercept (a) ≈ 7.03, TV coefficient (β) ≈ 0.0475]

In the notation that we have been using, a is the intercept and β is the slope. Thus:

a ≈ 7.03, β ≈ 0.0475

The equation for the model will be:

Sales = 7.03 + 0.0475 × TV

The equation implies that an increase of 100 units in TV advertising costs will increase sales by about 4.75 units (100 × 0.0475).

If you remember, we learnt that the values of these parameters are estimates and there will be a p-value associated with them. If the p-values are very small, then it can be accepted that these parameters have non-zero values and are statistically significant in the model. Let's have a look at the p-values for these parameters:

model1.pvalues
[Output: p-values for the Intercept and TV coefficients, both very close to zero]

As can be seen, the p-values are very small; hence, the parameters are significant.

Let's also check another important indicator of model efficacy, that is, R2. As we saw earlier, there is a ready-made method for doing this. It can be invoked with the following line of code:

model1.rsquared

The value comes out to be 0.61.

If we want all the important model parameters in one go, we can take a look at the model summary by writing this snippet:

model1.summary()

The result is as follows:

Fig. 5.6: Model 1 Summary

As we can see, the F-statistic for this model is very high and the associated p-value is negligible, suggesting that the parameter estimates for this model were all significant and non-zero.
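
If we are interested only in the F-statistic and its p-value rather than the full summary table, the fitted statsmodels model exposes them directly as attributes:

model1.fvalue, model1.f_pvalue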

Let's now predict the values of sales based on the equation we just derived. This can be done using the following snippet:

sales_pred=model1.predict(pd.DataFrame(advert['TV']))
sales_pred

This snippet calculates the predicted sales value for each row based on the model equation using TV costs. One can plot sales_pred against the TV advertising costs to find the line of best fit. Let's do that:

import matplotlib.pyplot as plt
%matplotlib inline
advert.plot(kind='scatter', x='TV', y='Sales')
plt.plot(pd.DataFrame(advert['TV']),sales_pred,c='red',linewidth=2)

We get the following plot as the output. The red line is the line of best fit (obtained from the model). The blue dots represent the actual data present:

Fig. 5.7: Line of best fit (obtained from the model) and the scatter plot of actual data

Now, let's calculate the RSE term for our prediction using the following code snippet:

import numpy as np
# predicted sales from the fitted equation: Sales = 7.03 + 0.047537*TV
advert['sales_pred']=0.047537*advert['TV']+7.03
# squared residual for each observation
advert['RSE']=(advert['Sales']-advert['sales_pred'])**2
RSEd=advert.sum()['RSE']
RSE=np.sqrt(RSEd/198)   # n - p - 1 = 200 - 1 - 1 = 198
salesmean=np.mean(advert['Sales'])
error=RSE/salesmean
RSE,salesmean,error

The output consists of three numbers: the first is RSE = 3.25, the second is salesmean (the mean of actual sales) = 14.02, and error is their ratio, which is equal to 0.23. Thus, on an average, this model will be off by about 23%, even if the coefficients are correctly estimated. This is a significant amount of error, and we would like to bring it down by some means. Also, the R2 value of 0.61 can be improved upon. One thing we can try is to add more columns to the model as predictors and see whether that improves the results or not.

Multiple linear regression

When linear regression involves more than one predictor variable, it is called multiple linear regression. The nature of the model remains the same (linear), except that there is a separate slope (β) coefficient associated with each of the predictor variables. The model will be represented as follows:

Y = a + β1X1 + β2X2 + ... + βpXp

Each βi will be estimated using the same least sum of squares method and hence will have a p-value associated with its estimation. The smaller the p-value associated with a variable, the more significant that variable is to the model. Variables with large p-values should be eliminated from the model, as they aren't good predictors of the output variable.

While multiple regression gives us the possibility of using more variables as predictors, and hence can increase the efficiency of the model, it also increases the complexity of the model-building process, as the selection of the variables to be kept and discarded in the model becomes tedious.

With this simple dataset of three predictor variables, there can be seven possible models. They are as follows:

  • Model 1: Sales~TV
  • Model 2: Sales~Newspaper
  • Model 3: Sales~Radio
  • Model 4: Sales~TV+Radio
  • Model 5: Sales~TV+Newspaper
  • Model 6: Sales~Newspaper+Radio
  • Model 7: Sales~TV+Radio+Newspaper

For a model with p possible predictor variables, there can be 2^p - 1 possible models; hence, as the number of predictors increases, the selection becomes tedious.
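
A quick way to convince ourselves of this count is to enumerate the non-empty subsets of the predictors; the following is a minimal sketch for our three variables:

from itertools import combinations

predictors=['TV','Radio','Newspaper']
candidate_models=[]
for k in range(1,len(predictors)+1):
    for subset in combinations(predictors,k):
        # every non-empty subset of predictors defines one candidate formula
        candidate_models.append('Sales~'+'+'.join(subset))
len(candidate_models),candidate_models   # 7 candidate models for p=3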

It would have been a tedious task to choose from so many possible models. Thankfully, there are a few guidelines to filter some of these and then navigate towards the most efficient one. The following are the guidelines:

  • Keep the variables with low p-values and eliminate the ones with high p-values
  • Inclusion of a variable in the model should ideally increase the value of R2 (although R2 alone is not a very reliable indicator of this, and looking at the adjusted R2 is preferred; the concept of adjusted R2 and why it is a better indicator than R2 will be explained later in this chapter)

Based on these guidelines, there are two kinds of approaches to select the predictor variables to go in the final model:

  • Forward selection: In this approach, we start with a null model without any predictors and then start adding predictor variables one by one. The variable whose addition results in the model with the lowest residual sum of squares is added first. If the p-value for the variable is small enough and the value of the adjusted R2 goes up, the predictor variable is included in the model; otherwise, it is not included. A minimal sketch of this approach is shown after this list.
  • Backward selection: In this approach, one starts with a model that has all the possible predictor variables and then discards some of them. If the p-value of a predictor variable is large and the value of the adjusted R2 goes up when it is removed, the predictor variable is discarded from the model; otherwise, it remains a part of the model.
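
The following is a minimal sketch of forward selection for our dataset; it uses the adjusted R2 as the selection criterion for simplicity, and the helper name forward_select is ours, not part of any library:

import statsmodels.formula.api as smf

def forward_select(data,response,candidates):
    # start with no predictors and greedily add the candidate that raises
    # adjusted R-squared the most; stop when no candidate improves it
    selected=[]
    remaining=list(candidates)
    best_adj_r2=float('-inf')
    while remaining:
        scores=[]
        for var in remaining:
            formula=response+'~'+'+'.join(selected+[var])
            fit=smf.ols(formula=formula,data=data).fit()
            scores.append((fit.rsquared_adj,var))
        adj_r2,best_var=max(scores)
        if adj_r2<=best_adj_r2:
            break
        best_adj_r2=adj_r2
        selected.append(best_var)
        remaining.remove(best_var)
    return selected

forward_select(advert,'Sales',['TV','Radio','Newspaper'])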

Many statistical programs, including Python libraries, give us an option to select one of the two preceding approaches while implementing a linear regression; the program then builds the model using the selected approach.

For now, let us manually add a few variables and see how it changes the model parameters and efficacy, so that we can get a better glimpse of what goes on behind the curtain when these approaches are implemented by the statistical program.

We have already seen one model assuming a linear relationship between sales and TV advertising costs. We can ignore the other models consisting of a single variable (that is, newspaper or radio alone, as they have a smaller correlation with sales than TV does). Let us now try to add more variables to the model we already have and see how the parameters and efficacy change.

Let us try adding the newspaper variable to the model using the following code snippet:

import statsmodels.formula.api as smf
model2=smf.ols(formula='Sales~TV+Newspaper',data=advert).fit()
model2.params

The following are the results:

Fig. 5.8: Model 2 coefficients and p-values

The p-values for the coefficients are very small, suggesting that all the estimates are significant. The equation for this model will be:

Sales = 5.77 + 0.046 × TV + 0.04 × Newspaper

The values of R2 and adjusted R2 are 0.646 and 0.642, which is just a minor improvement from the value obtained in the earlier model.
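
These two numbers can be read directly off the fitted model:

model2.rsquared, model2.rsquared_adj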

The values can be predicted using the following snippet:

sales_pred=model2.predict(advert[['TV','Newspaper']])
sales_pred

To calculate the RSE, we modify the snippet a little:

import numpy as np
advert['sales_pred']=5.77 + 0.046*advert['TV'] + 0.04*advert['Newspaper']
advert['RSE']=(advert['Sales']-advert['sales_pred'])**2
RSEd=advert.sum()['RSE']
RSE=np.sqrt(RSEd/197)
salesmean=np.mean(advert['Sales'])
error=RSE/salesmean
RSE,salesmean,error

The value of RSE comes out to be 3.12 (22%), not very different from the model with only TV. The number 197 comes from the (n-p-1) term in the formula for RSE, where n=200, p=2 for the current model. The following table is the model summary:

Fig. 5.9: Model 2 Summary

Although the F-statistic decreases compared to the previous model, its associated p-value remains negligible. Still, this is just a marginal improvement to the model, as we can see in the adj. R2 value. So, adding newspaper didn't improve the model significantly.
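
Before moving on, note that instead of hard-coding the fitted coefficients each time, the RSE can be computed for any fitted statsmodels model directly from its residual sum of squares. The following is a small sketch (the helper name rse_of is ours); it should reproduce approximately the value we just computed by hand:

import numpy as np

def rse_of(fitted_model):
    # RSE = sqrt(RSS/(n-p-1)); statsmodels exposes RSS as .ssr
    # and (n-p-1) as .df_resid for an OLS fit
    return np.sqrt(fitted_model.ssr/fitted_model.df_resid)

rse_of(model2), rse_of(model2)/advert['Sales'].mean()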

Let's try adding radio to the model instead of the newspaper. Radio had the second best correlation with the Sales variable in the correlation matrix we created earlier in the chapter. Thus, one expects some significant improvement in the model upon its addition to the model. Let's see if that happens or not:

import statsmodels.formula.api as smf
model3=smf.ols(formula='Sales~TV+Radio',data=advert).fit()
model3.params

The output parameters and the associated p-values of this model are, as follows:

Fig. 5.10: Model 3 coefficients and p-values

The model can be represented as the following:

Sales = a + β1 × TV + β2 × Radio (with the fitted values of a, β1, and β2 as shown in Fig. 5.10)

The values can be predicted based on the preceding model using the following snippet:

sales_pred=model3.predict(advert[['TV','Radio']])
sales_pred

The model summary looks something similar to the following screenshot:

Fig. 5.11: Model 3 summary

One thing to observe here is that the R2 value has improved considerably due to the addition of radio to the model. Also, the F-statistic has increased significantly from the last model indicating a very efficient model.

The RSE can be calculated using the same method described previously. The value for this model comes out to be 1.71 (around 12%), which is much better than the 23% and 22% in the previous models.
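
One way to get this number without retyping the coefficients is to use the model's own predictions, along the lines of the earlier snippets (assuming advert and model3 from above):

import numpy as np

resid_sq=(advert['Sales']-model3.predict(advert[['TV','Radio']]))**2
RSE3=np.sqrt(resid_sq.sum()/197)   # n - p - 1 = 200 - 2 - 1
RSE3,RSE3/advert['Sales'].mean()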

Thus, we can conclude that radio is a great addition to the model; TV and radio advertising costs together describe the sales very well, and this is a very efficient model. But can we improve it a bit further by combining all three predictor variables?

The last thing that we should try is using all the predictor variables together, with the following code:

import statsmodels.formula.api as smf
model4=smf.ols(formula='Sales~TV+Radio+Newspaper',data=advert).fit()
model4.params

The estimates of the coefficients and the associated p-values for this model will be as follows:

Fig. 5.12: Model 4 coefficients and p-values

The p-values for the TV and Radio coefficients are very small, suggesting that those estimates are significant; as we will see shortly, the Newspaper coefficient behaves differently. The equation for this model will be:

Sales = a + β1 × TV + β2 × Radio + β3 × Newspaper (with the fitted values as shown in Fig. 5.12)

The values of sales can be predicted using the following snippet:

sales_pred=model4.predict(advert[['TV','Radio','Newspaper']])
sales_pred

The summary of the model is shown in the following table:

Fig. 5.13: Model 4 summary

The most striking feature of this model is that the estimates of the coefficients are very similar to those in the previous model. The intercept, the coefficient for TV, and the coefficient for Radio are more or less the same. The values of R2 and adj-R2 are also similar to those of the previous model.

The value of RSE can be calculated in a similar way as before. The value comes out to be 2.57 (18%), which is higher than that of the previous model.

Other things to note about this model are the following:

  • There is a small negative coefficient for newspaper. When we considered only TV and newspaper in the model, the coefficient of newspaper was significantly positive. Something affected the coefficient of newspaper when it became a part of the model in the presence of TV and radio.
  • For this model, the F-statistic has decreased considerably to 570.3 from 859.6 in the previous model. This suggests that the partial benefit of adding newspaper to the model containing TV and radio is negative.
  • The value of RSE increases on addition of newspaper to the model.

All these point in the direction that the model actually became a little less efficient on addition of newspaper to the previous model. What is the reason?

Multi-collinearity

Multi-collinearity is the reason for the sub-optimal performance of the model when newspaper was added to the model containing TV and radio. Multi-collinearity refers to correlation between the predictor variables of the model.

These are some of the signs of a common problem encountered during regression called multi-collinearity. Go back a few pages to the correlation matrix that we created for this dataset and you will find that there is a significant correlation of 0.35 between radio and newspaper. This means that the expense on newspaper is related to that on radio. This relationship between the predictor variables increases the variability of the coefficient estimates of the related predictor variables.

The t-statistic for these coefficients is calculated by dividing the estimate by its variability (or standard error). As this error goes up, the value of the t-statistic goes down and the p-value increases. Thus, the chances that the null hypothesis for the hypothesis test associated with that coefficient will be accepted increase. This decreases the significance of the variable in the model.

t = βm / SE(βm)

Where βm=mean of estimates of β, SE(βm)=variability in the estimate of β.
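
In statsmodels, the standard errors and the resulting t-statistics for each coefficient can be inspected directly on the fitted model, which makes it easy to see how the Newspaper estimate is affected in model 4:

model4.bse       # standard errors SE(β) of the coefficient estimates
model4.tvalues   # t-statistics (params divided by bse)
model4.pvalues   # p-values of the associated hypothesis tests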

Thus, collinearity is an issue that needs to be taken care of. For highly correlated predictor variables, we need to do a deep dive into these variables and see whose inclusion in the model makes it more efficient.

It is a good practice to identify the pairs of predictor variables with high correlation using the correlation matrix and check the effect of such pairs on the model. The culprit variables should be removed from the model. The Variance Inflation Factor (VIF) is a method to tackle this issue.

Variance Inflation Factor

A fool-proof way to detect multi-collinearity is a statistic called the Variance Inflation Factor (VIF). It quantifies the rise in the variability of the coefficient estimate of a particular variable because of high correlation between two or more predictor variables. The VIF needs to be calculated for each of the variables, and if the value is very high for a particular variable, that predictor needs to be eliminated from the model. Some statistical packages calculate the VIF when given the option to do so. The following process goes on under the hood for the calculation of VIF:

  1. Write Xi as a linear function of other predictor variables:
    Xi = a0 + a1X1 + a2X2 + ... (a linear combination of all the predictor variables except Xi)
  2. Calculate the coefficient of determination for this model and call it Ri2. The VIF for Xi is given by:
    VIF(Xi) = 1 / (1 - Ri2)
  3. If the VIF=1, then the variables are not correlated. If 1<VIF<5, then the variables are moderately correlated with other predictor variables and can still be part of the model. If VIF>5, then variables are highly correlated and need to be eliminated from the model.

Let us write a short snippet to calculate the VIF to understand the calculation better:

# regress Newspaper on the other two predictors and use its R-squared
model=smf.ols(formula='Newspaper~TV+Radio',data=advert).fit()
rsquared=model.rsquared
VIF=1/(1-rsquared)
VIF

This will give a VIF for the Newspaper. By changing the formula in the snippet, we can calculate the VIF for the other variables. The following are the values:

Variable     VIF
Newspaper    1.14
Radio        1.14
TV           1.04

Newspaper and radio have almost the same VIF, indicating that they are correlated with just one another and not with TV.
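
A small loop over the three formulas reproduces these values in one go, using the same approach as the snippet above:

predictors=['TV','Radio','Newspaper']
for var in predictors:
    others=[p for p in predictors if p!=var]
    # regress each predictor on the remaining predictors and compute its VIF
    r2=smf.ols(formula=var+'~'+'+'.join(others),data=advert).fit().rsquared
    print(var,round(1/(1-r2),2))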

In this case, radio and newspaper are correlated. However, the model with TV and radio as predictor variables is far superior to the model with TV and newspaper as the predictor variables. Adding all three variables as predictors doesn't improve the model much; in fact, it increases the variability of the estimates and lowers the F-statistic. It will be a decent choice to drop the newspaper variable and pick model 3 as the best candidate for the final model.
