Model validation

Any predictive model needs to be validated to see how it performs on different sets of data, and whether its accuracy stays constant across all sources of similar data. This guards against over-fitting, wherein the model fits one set of data very well but fits another dataset poorly. One common method is to validate a model using a train-test split of the dataset. Another method is k-fold cross validation, which we will learn more about in a later chapter.

Training and testing data split

Ideally, this step should be done right at the onset of the modelling process so that there are no sampling biases in the model; in other words, the model should perform well even on a dataset that has the same predictor variables but whose means and variances are very different from those of the data the model was built on. This can happen because the dataset on which the model is built (training) and the one on which it is applied (testing) can come from different sources. A more robust way to do this is k-fold cross validation, which is covered in detail later.

Let's see how we can split the available dataset into training and testing datasets and apply the model to the testing dataset to get the results:

import numpy as np
# Draw one uniform number in [0, 1) per row; about 80% of them fall below 0.8
a=np.random.rand(len(advert))
check=a<0.8
training=advert[check]
testing=advert[~check]

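The ratio of the split between the training and testing datasets is approximately 80:20; in other words, roughly 160 of the 200 rows of the advert dataset will be in training and roughly 40 rows in testing. The exact counts vary from run to run because the rows are assigned at random.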
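
Because the rows are assigned at random, it helps to fix a seed and check the resulting sizes. The following is a minimal sketch under stated assumptions: `advert_demo` is a hypothetical stand-in DataFrame with made-up values (not the real advert data), and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the advert dataset: 200 rows of made-up values
rng = np.random.default_rng(42)  # fixed seed so the split is repeatable
advert_demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
})

# One uniform draw in [0, 1) per row; about 80% of the draws fall below 0.8
mask = rng.random(len(advert_demo)) < 0.8
training_demo = advert_demo[mask]
testing_demo = advert_demo[~mask]

# The two parts always add up to the full dataset,
# but the individual counts are only approximately 160 and 40
print(len(training_demo), len(testing_demo))
```

Re-running with a different seed gives slightly different counts, which is one reason a fixed-size splitter is often preferred.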

Let's create a model on the training data and test its performance on the testing data. We will build only the model that worked best earlier, the one with the TV and radio variables as predictors:

import statsmodels.formula.api as smf
model5=smf.ols(formula='Sales~TV+Radio',data=training).fit()
model5.summary()

Fig. 5.14: Model 5 coefficients and p-values

Most of the model parameters, such as the intercept, the coefficient estimates, and R2, are very similar to those of the model built on the entire dataset. The difference in the F-statistic can be attributed to the smaller dataset: for a similar R2, the F-statistic is proportional to the (n-p-1) term in its formula, so fewer rows give a smaller F-statistic.
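
The effect of the dataset size on the F-statistic can be checked with a little arithmetic. The sketch below assumes the standard identity F = (R2/p) / ((1-R2)/(n-p-1)), which follows from R2 = 1 - SSD/SST; the function name `f_statistic` is only illustrative, and the row counts 200 and 160 are the nominal full and training sizes.

```python
def f_statistic(r2, n, p):
    """F-statistic of a linear model with p predictors fitted on n rows."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# Same R2 (0.897, the TV + Radio model) on the full and the training data
f_full = f_statistic(0.897, n=200, p=2)
f_train = f_statistic(0.897, n=160, p=2)

# Fewer rows shrink the (n-p-1) term and therefore the F-statistic
print(round(f_full, 1), round(f_train, 1))
```

With the rounded R2 of 0.897, the full-data value comes out at about 858, close to the 859.6 reported for the same model on the entire dataset, while the 160-row value drops to about 684.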

The model can be written, as follows:

Sales = 2.86 + 0.04*TV + 0.17*Radio

Let us now predict the sales values for the testing dataset:

sales_pred=model5.predict(testing[['TV','Radio']])
sales_pred

The value of RSE for this prediction on the testing dataset can be calculated using the following snippet:

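import numpy as np
testing=testing.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
# Predictions from the fitted equation (rounded coefficients of model5)
testing['sales_pred']=2.86 + 0.04*testing['TV'] + 0.17*testing['Radio']
# Squared residual for each row of the testing data
testing['RSE']=(testing['Sales']-testing['sales_pred'])**2
RSEd=testing['RSE'].sum()
# Divide by n-p-1 with p=2 predictors (the original run hard-coded 51 here)
RSE=np.sqrt(RSEd/(len(testing)-2-1))
salesmean=np.mean(testing['Sales'])
error=RSE/salesmean
RSE,salesmean,error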

The value of RSE comes out to be 2.54 over a sales mean (in the testing data) of 14.80 amounting to an error of 17%.

We can see that the model doesn't generalize very well on the testing dataset: the RSE is 2.54 (17%) here, compared with 1.71 (12%) for the same TV and radio model built on the entire dataset. This implies some degree of over-fitting when the model was built on the entire dataset. The RSE from the training-testing split, albeit a bit higher, is the more reliable and replicable estimate.
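
The RSE calculation can also be wrapped in a small reusable function. This is a sketch, not part of the original workflow: the name `rse` and the toy numbers are made up for illustration, and the only assumption carried over is the n-p-1 divisor used for RSE throughout this chapter.

```python
import numpy as np

def rse(actual, predicted, n_predictors):
    """Residual standard error: sqrt(SSD / (n - p - 1))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ssd = np.sum((actual - predicted) ** 2)   # sum of squared differences
    dof = len(actual) - n_predictors - 1      # degrees of freedom
    return np.sqrt(ssd / dof)

# Toy values (not from the advert data): every prediction is off by 1
actual = [10.0, 12.0, 15.0, 18.0, 20.0]
predicted = [11.0, 11.0, 16.0, 17.0, 21.0]

value = rse(actual, predicted, n_predictors=2)
relative = value / np.mean(actual)   # RSE as a fraction of the mean
print(value, relative)
```

For the testing data, the call would be `rse(testing['Sales'], sales_pred, 2)`, which avoids hard-coding the degrees of freedom.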

Summary of models

We have tried four models previously. Let us summarize the major results from each of the models in one place:

Name      Definition                   R2/Adj-R2     F-statistic   F-statistic (p-value)   RSE
Model 1   Sales ~ TV                   0.612/0.610   312.1         1.47e-42                3.25 (23%)
Model 2   Sales ~ TV+Newspaper         0.646/0.642   179.6         3.95e-45                3.12 (22%)
Model 3   Sales ~ TV+Radio             0.897/0.896   859.6         4.83e-98                1.71 (12%)
Model 4   Sales ~ TV+Radio+Newspaper   0.897/0.896   570.3         1.58e-96                1.80 (13%)

Guide for selection of variables

To summarize, for a good linear model, the predictor variables should be chosen based on the following criteria:

  • R2: R2 never decreases when you add a new predictor variable to the model, so it is not a very reliable check of whether the model has improved. Rather, for an efficient model, we should check the adjusted-R2, which should increase on adding a new predictor variable.
  • p-values: The lower the p-value for the estimate of the predictor variable, the better it is to add the predictor variable to the model.
  • F-statistic: For a predictor variable to be an efficient addition, the value of the F-statistic for the model should increase after it is added. The increase in the F-statistic is a proxy for the improvement in the model brought about solely by the addition of that particular variable. Alternatively, the p-value associated with the F-statistic should decrease on the addition of a new predictor variable.
  • RSE: The value of RSE for the new model should decrease on the addition of the new predictor variable.
  • VIF: To take care of the issues arising due to multi-collinearity, one needs to eliminate the variables with large values of VIF.
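
Two of these checks are easy to compute by hand. The sketch below assumes the usual definitions, adjusted R2 = 1 - (1-R2)(n-1)/(n-p-1) and VIF = 1/(1-R2_j), where R2_j comes from regressing predictor j on the other predictors. The function names and the synthetic data are illustrative only.

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p predictors fitted on n rows."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def vif(X):
    """Variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2_j = 1 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2_j))
    return factors

# Model 1 of the summary: R2 = 0.612, one predictor, 200 rows
print(round(adjusted_r2(0.612, 200, 1), 3))

# Synthetic predictors: x3 is almost a copy of x1, x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The nearly duplicated columns get large VIF values, while the independent column stays close to 1.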

Linear regression with scikit-learn

Let's now re-implement the linear regression model using the scikit-learn package. This method is more elegant as it has more in-built methods to perform the regular processes associated with regression. For example, you might remember from the last chapter that there is a separate method for splitting the dataset into training and testing datasets:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
feature_cols = ['TV', 'Radio']
X = advert[feature_cols]
Y = advert['Sales']
trainX,testX,trainY,testY = train_test_split(X,Y, test_size = 0.2)
lm = LinearRegression()
lm.fit(trainX, trainY)

We split the advert dataset into training and testing datasets and built the model on the TV and radio variables from the training dataset. The following are the parameters of the model:

print(lm.intercept_)
print(lm.coef_)

The result is as follows: Intercept: 2.918, TV coefficient: 0.046, Radio coefficient: 0.187

A better way to look at the coefficients is to use the zip function to pair each variable name with its coefficient. The snippet and its output are as follows:

list(zip(feature_cols, lm.coef_))
[('TV', 0.045706061219705982), ('Radio', 0.18667738715568111)]

The value of R2 is calculated by typing the following code:

lm.score(trainX, trainY)

The value comes out to be around 0.89, very close to the value obtained by the method used earlier.

The model can be used to predict the value of sales using TV and radio variables from the test dataset, as follows:

lm.predict(testX)
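
The R2 computed earlier is measured on the training data, so it says little about generalization. A hedged sketch of the same steps with an evaluation on the held-out rows follows. It assumes a synthetic stand-in for the advert data (the `demo` DataFrame and its coefficients are made up) and a fixed `random_state` so that the split is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advert data (made-up values, 200 rows)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
})
demo['Sales'] = (3 + 0.045 * demo['TV'] + 0.19 * demo['Radio']
                 + rng.normal(0, 1.5, 200))

trainX, testX, trainY, testY = train_test_split(
    demo[['TV', 'Radio']], demo['Sales'], test_size=0.2, random_state=0)

lm = LinearRegression().fit(trainX, trainY)

train_r2 = lm.score(trainX, trainY)   # R2 on the data used for fitting
test_r2 = lm.score(testX, testY)      # R2 on unseen rows
test_rmse = np.sqrt(np.mean((testY - lm.predict(testX)) ** 2))
print(train_r2, test_r2, test_rmse)
```

A test R2 far below the training R2 would point to over-fitting; similar values suggest that the model generalizes.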

Feature selection with scikit-learn

As stated before, many statistical tools and packages have in-built methods to conduct a variable selection process (forward selection and backward selection). Done manually, selecting the most important variables is tedious and time-consuming, and it is easy to end up with a less efficient model.

One advantage of using the scikit-learn package for regression in Python is that it has a built-in method for feature selection. It works more or less like backward selection (though not exactly) and is called Recursive Feature Elimination (RFE). You can specify the number of variables you want in the final model.

The model is first run with all the variables and certain weights are assigned to all the variables. In the subsequent iterations, the variables with the smallest weights are pruned from the list of variables till the desired number of variables is left.
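
The pruning loop described above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: it assumes ordinary least squares as the model, standardized features so that the weights are comparable, and synthetic data in which the newspaper column has no effect on sales. The function name `eliminate_features` is hypothetical.

```python
import numpy as np

def eliminate_features(X, y, names, n_keep):
    """Repeatedly drop the feature with the smallest absolute weight."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize the columns so that weights are comparable across features
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        design = np.column_stack([np.ones(len(y)), X[:, kept]])
        # Least-squares fit; drop the intercept from the weight vector
        weights = np.linalg.lstsq(design, y, rcond=None)[0][1:]
        weakest = int(np.argmin(np.abs(weights)))
        del kept[weakest]   # prune the least important feature
    return [names[i] for i in kept]

# Synthetic data: sales depend on TV and radio only
rng = np.random.default_rng(1)
tv = rng.uniform(0, 300, 200)
radio = rng.uniform(0, 50, 200)
newspaper = rng.uniform(0, 100, 200)
sales = 3 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, 200)

selected = eliminate_features(
    np.column_stack([tv, radio, newspaper]), sales,
    ['TV', 'Radio', 'Newspaper'], n_keep=2)
print(selected)
```

On this synthetic data the newspaper column carries no signal, so it is the one that gets pruned.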

Let us see how one can do a feature selection in scikit-learn:

from sklearn.feature_selection import RFE
from sklearn.svm import SVR
feature_cols = ['TV', 'Radio','Newspaper']
X = advert[feature_cols]
Y = advert['Sales']
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=2, step=1)
selector = selector.fit(X, Y)

We use the RFE and SVR classes built into scikit-learn. We indicate that we want to estimate a linear model (kernel="linear") and that the final model should contain two variables.

To get the list of selected variables, one can write the following code snippet:

selector.support_

It results in an array mentioning whether the variables in X have been selected for the model or not. True means that the variable has been selected, while False means otherwise. In this case, the result is as follows:


Fig. 5.15: Result of feature selection process

In our case, X consists of three variables: TV, radio, and newspaper. The preceding array suggests that TV and radio have been selected for the model, while the newspaper hasn't been selected. This concurs with the variable selection we had done manually.

This method also returns a ranking, as described in the following example:

selector.ranking_

Fig. 5.16: Result of feature selection process

All the selected variables will have a ranking of 1 while the subsequent ones will be ranked in descending order of their significance. A variable with rank 2 will be more significant to the model than the one with a rank of 3 and so on.
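
The support and ranking arrays are easier to read when paired with the variable names. The sketch below is self-contained and rests on two assumptions: the data is synthetic (the newspaper column has no effect on sales), and LinearRegression is swapped in for SVR as the estimator to keep the example fast.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the advert data; sales ignore the third column
rng = np.random.default_rng(2)
feature_cols = ['TV', 'Radio', 'Newspaper']
X = np.column_stack([
    rng.uniform(0, 300, 200),
    rng.uniform(0, 50, 200),
    rng.uniform(0, 100, 200),
])
y = 3 + 0.045 * X[:, 0] + 0.19 * X[:, 1] + rng.normal(0, 1.5, 200)

selector = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X, y)

# Pair each variable name with its selection flag and its rank
selected = [name for name, keep in zip(feature_cols, selector.support_) if keep]
ranking = {name: int(rank) for name, rank in zip(feature_cols, selector.ranking_)}
print(selected)
print(ranking)
```

Note that RFE compares the estimator's raw coefficients, so features on very different scales may need to be standardized first for the ranking to be meaningful.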
