Any predictive model needs to be validated to check how it performs on different sets of data and whether its accuracy is consistent across all sources of similar data. This guards against overfitting, wherein the model fits one set of data very well but fits another dataset poorly. One common method is to validate the model using a train-test split of the dataset. Another method is k-fold cross validation, which we will cover in a later chapter.
Ideally, this step should be done right at the onset of the modelling process so that there are no sampling biases in the model; in other words, the model should perform well even on a dataset that has the same predictor variables but whose means and variances differ considerably from those of the data the model was built upon. This can happen because the dataset on which the model is built (training) and the one to which it is applied (testing) can come from different sources. K-fold cross validation, discussed in detail later, is a more robust way to guard against this.
Let's see how we can split the available dataset into training and testing datasets and then apply the model to the testing dataset:
    import numpy as np

    # Uniform random numbers in [0, 1); about 80% of them fall below 0.8
    a = np.random.rand(len(advert))
    check = a < 0.8
    training = advert[check]
    testing = advert[~check]
The split ratio between the training and testing datasets is roughly 80:20; in other words, about 160 of the 200 rows of the advert dataset will be in training and about 40 in testing. The exact counts vary from run to run because the split is random.
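Because the split is random, the row counts change on every run. A minimal sketch of a reproducible version follows, assuming NumPy's seeded `default_rng` generator and a hypothetical 200-row stand-in for advert (plain row indices rather than the real DataFrame); the seed value is arbitrary.

```python
import numpy as np

# Hypothetical stand-in for the 200-row advert dataset: just the row indices.
n_rows = 200
rows = np.arange(n_rows)

# Seeded generator so the split is identical on every run (seed is arbitrary).
rng = np.random.default_rng(42)

# Uniform draws in [0, 1): each row lands in training with probability 0.8.
mask = rng.random(n_rows) < 0.8

training_rows = rows[mask]
testing_rows = rows[~mask]

# The counts are close to 160 and 40, but not necessarily exactly that.
print(len(training_rows), len(testing_rows))
```

The same boolean mask can index a pandas DataFrame directly, as in `advert[mask]` and `advert[~mask]`.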
Let's build a model on the training data and test its performance on the testing data. We will use the model that worked best earlier, the one with TV and Radio as predictor variables:
    import statsmodels.formula.api as smf

    # Fit Sales ~ TV + Radio on the training rows only
    model5 = smf.ols(formula='Sales~TV+Radio', data=training).fit()
    model5.summary()
Most of the model parameters, such as the intercept, the coefficient estimates, and R², are very similar to those of the model built on the entire dataset. The difference in the F-statistic can be attributed to the smaller dataset: for a similar R², the F-statistic is proportional to the (n-p-1) term in its formula, so fewer rows give a smaller F-statistic value.
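The effect of the dataset size on the F-statistic can be checked numerically. The sketch below assumes the standard form of the F-statistic rewritten in terms of R², F = (R²/p) / ((1 - R²)/(n - p - 1)), and uses the R² of the TV + Radio model with two illustrative row counts (200 for the full dataset, 160 for a training subset).

```python
def f_statistic(r_squared, n, p):
    """F-statistic of a linear model with p predictors fitted on n rows."""
    return (r_squared / p) / ((1.0 - r_squared) / (n - p - 1))

# Same R-squared and number of predictors, different numbers of rows.
f_full = f_statistic(0.897, n=200, p=2)   # all rows
f_train = f_statistic(0.897, n=160, p=2)  # roughly the training subset

# The smaller dataset gives the smaller F-statistic.
print(round(f_full, 1), round(f_train, 1))
```

Because R² is rounded to three decimals here, the values only approximate those reported in a model summary.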
The model can be written as follows:

    Sales = 2.86 + 0.04*TV + 0.17*Radio
Let us now predict the sales values for the testing dataset:
    # Predict on the held-out testing rows, not the training rows
    sales_pred = model5.predict(testing[['TV', 'Radio']])
    sales_pred
The value of RSE for this prediction on the testing dataset can be calculated using the following snippet:
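    import numpy as np

    testing = testing.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
    # Predictions from the fitted equation (rounded coefficients)
    testing['sales_pred'] = 2.86 + 0.04*testing['TV'] + 0.17*testing['Radio']
    # Squared residual for each row
    testing['RSE'] = (testing['Sales'] - testing['sales_pred'])**2
    RSEd = testing['RSE'].sum()
    # Degrees of freedom: n - p - 1, with p = 2 predictors
    RSE = np.sqrt(RSEd / (len(testing) - 2 - 1))
    salesmean = np.mean(testing['Sales'])
    error = RSE / salesmean
    RSE, salesmean, error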
The RSE comes out to be 2.54 against a mean sales value (in the testing data) of 14.80, an error of about 17%.
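The RSE calculation can be wrapped in a small reusable function. This is a sketch on made-up numbers, not on the advert data; the function name and the toy values are illustrative assumptions.

```python
import numpy as np

def residual_standard_error(actual, predicted, n_predictors):
    """RSE = sqrt(SSD / (n - p - 1)) for a model with p predictors."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ssd = np.sum((actual - predicted) ** 2)          # sum of squared residuals
    dof = len(actual) - n_predictors - 1             # degrees of freedom
    return np.sqrt(ssd / dof)

# Toy values, chosen only to make the arithmetic easy to follow.
actual = [3, 5, 7, 9, 11]
predicted = [2, 5, 8, 9, 12]

rse = residual_standard_error(actual, predicted, n_predictors=2)
relative_error = rse / np.mean(actual)               # RSE as a fraction of the mean
print(rse, relative_error)
```

Dividing the RSE by the mean of the actual values gives the relative error quoted as a percentage in the text.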
We can see that the model doesn't generalize perfectly to the testing dataset: the RSE for the same model differs between the two cases (1.71, or 12%, on the entire dataset versus 2.54, or 17%, on the testing data). This implies some degree of overfitting when we built the model on the entire dataset. The RSE from the training-testing split, albeit somewhat higher, is more reliable and replicable.
We have tried four models previously. Let us summarize the major results from each of them in one place:
| Name | Definition | R²/Adj. R² | F-statistic | F-statistic (p-value) | RSE |
|---|---|---|---|---|---|
| Model 1 | Sales ~ TV | 0.612/0.610 | 312.1 | 1.47e-42 | 3.25 (23%) |
| Model 2 | Sales ~ TV+Newspaper | 0.646/0.642 | 179.6 | 3.95e-45 | 3.12 (22%) |
| Model 3 | Sales ~ TV+Radio | 0.897/0.896 | 859.6 | 4.83e-98 | 1.71 (12%) |
| Model 4 | Sales ~ TV+Radio+Newspaper | 0.897/0.896 | 570.3 | 1.58e-96 | 1.80 (13%) |
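The adjusted R² column penalizes R² for the number of predictors. As a rough check, the sketch below applies the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), to the rounded R² values from the table, assuming the advert dataset has n = 200 rows.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for a model with p predictors fitted on n rows."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)

n_rows = 200  # assumed size of the advert dataset

# (name, R-squared from the table, number of predictors)
models = [
    ("Model 1", 0.612, 1),
    ("Model 2", 0.646, 2),
    ("Model 3", 0.897, 2),
    ("Model 4", 0.897, 3),
]

for name, r2, p in models:
    print(name, round(adjusted_r_squared(r2, n_rows, p), 3))
```

Since the inputs are already rounded to three decimals, the last digit can differ slightly from the table. Note that Model 4 has the same R² as Model 3 but a lower adjusted R², because the extra Newspaper variable adds a predictor without improving the fit.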
Guide for selection of variables
To summarize, for a good linear model, the predictor variables should be chosen based on the following criteria:
Let's now re-implement the linear regression model using the scikit-learn package. This approach is more elegant, as the package has more built-in methods for the routine tasks associated with regression. For example, you might remember from the last chapter that there is a separate method for splitting the dataset into training and testing datasets:
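    from sklearn.linear_model import LinearRegression
    # train_test_split lives in sklearn.model_selection
    # (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
    from sklearn.model_selection import train_test_split

    feature_cols = ['TV', 'Radio']
    X = advert[feature_cols]
    Y = advert['Sales']
    # Hold out 20% of the rows for testing
    trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2)
    lm = LinearRegression()
    lm.fit(trainX, trainY)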
We split the advert dataset into training and testing datasets and built the model on the TV and Radio variables from the training dataset. The following are the parameters of the model:
    print(lm.intercept_)
    print(lm.coef_)
The result is as follows: intercept 2.918, TV coefficient 0.046, Radio coefficient 0.187. The exact values vary slightly with the random split.
A better way to look at the coefficients is to use the zip function to pair each variable name with its coefficient. The required snippet and its output are as follows:
    list(zip(feature_cols, lm.coef_))
    [('TV', 0.045706061219705982), ('Radio', 0.18667738715568111)]
The value of R² is calculated with the following code:
lm.score(trainX, trainY)
The value comes out to be around 0.89, very close to the value obtained by the method used earlier.
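The score method of a scikit-learn regressor returns the coefficient of determination, R² = 1 - SS_res/SS_tot, for whatever data it is given. The sketch below reproduces that calculation with plain NumPy on synthetic data (the coefficients, noise level, and 160/40 split are illustrative assumptions) and compares the training R² with the R² on held-out rows.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for two predictors and a response (not the advert data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=1.5, size=200)

train_X, test_X = X[:160], X[160:]
train_y, test_y = y[:160], y[160:]

# Ordinary least squares with an intercept column, fitted on training rows only.
A_train = np.column_stack([np.ones(len(train_X)), train_X])
coef, *_ = np.linalg.lstsq(A_train, train_y, rcond=None)

A_test = np.column_stack([np.ones(len(test_X)), test_X])
r2_train = r_squared(train_y, A_train @ coef)
r2_test = r_squared(test_y, A_test @ coef)
print(r2_train, r2_test)
```

With the scikit-learn model from the text, the held-out equivalent would be `lm.score(testX, testY)`, which is usually the more honest measure of how well the model generalizes.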
The model can be used to predict the value of sales using TV and radio variables from the test dataset, as follows:
lm.predict(testX)
As stated before, many statistical tools and packages have built-in methods to conduct a variable selection process (forward selection and backward selection). Done manually, variable selection consumes a lot of time, and picking the most important variables becomes a tedious task that compromises efficiency.
One advantage of using the scikit-learn package for regression in Python is that it has a dedicated method for feature selection. It works more or less like backward selection (though not exactly) and is called Recursive Feature Elimination (RFE). One can specify the number of variables wanted in the final model.
The model is first run with all the variables and certain weights are assigned to all the variables. In the subsequent iterations, the variables with the smallest weights are pruned from the list of variables till the desired number of variables is left.
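The elimination loop described above can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: it assumes ordinary least squares on standardized features, uses the absolute coefficient as the weight, and runs on synthetic data in which the third variable has no effect on the response. The function name and data are hypothetical.

```python
import numpy as np

def recursive_feature_elimination(X, y, names, n_keep):
    """Repeatedly drop the feature with the smallest absolute weight."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Standardize the remaining columns so the weights are comparable.
        Xs = X[:, remaining]
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
        # Fit least squares with an intercept and read off the weights.
        A = np.column_stack([np.ones(len(Xs)), Xs])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        weights = np.abs(coef[1:])
        # Prune the weakest feature and refit on the rest.
        del remaining[int(np.argmin(weights))]
    return [names[i] for i in remaining]

# Synthetic data: the third column plays no role in y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected = recursive_feature_elimination(
    X, y, ["TV", "Radio", "Newspaper"], n_keep=2
)
print(selected)
```

Refitting after every removal is what distinguishes this from simply ranking the weights once: the weights of the surviving variables can change when a correlated variable is dropped.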
Let us see how one can do feature selection in scikit-learn:
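    from sklearn.feature_selection import RFE
    from sklearn.svm import SVR

    feature_cols = ['TV', 'Radio', 'Newspaper']
    X = advert[feature_cols]
    Y = advert['Sales']
    # A linear-kernel support vector regressor supplies the feature weights
    estimator = SVR(kernel="linear")
    # Keep two features, eliminating one per iteration
    selector = RFE(estimator, n_features_to_select=2, step=1)
    selector = selector.fit(X, Y)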
We use the RFE and SVR classes built into scikit-learn. We indicate that we want a linear estimator (an SVR with a linear kernel) and that the desired number of variables in the model is two.
To get the list of selected variables, one can write the following code snippet:
selector.support_
It results in an array indicating whether each variable in X has been selected for the model. True means that the variable has been selected, while False means otherwise. In this case, the result is as follows:

    array([ True,  True, False])
In our case, X consists of three variables: TV, radio, and newspaper. The preceding array suggests that TV and radio have been selected for the model, while the newspaper hasn't been selected. This concurs with the variable selection we had done manually.
This method also returns a ranking, as described in the following example:
selector.ranking_
All the selected variables will have a ranking of 1, while the remaining ones are ranked in decreasing order of significance: a variable with rank 2 is more significant to the model than one with rank 3, and so on.
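The ranking array is easier to read when paired with the variable names. The sketch below is an assumption-laden illustration: it swaps in LinearRegression as the estimator instead of the SVR used above, runs on synthetic data rather than the advert dataset, and asks RFE to keep a single variable so that all three ranks appear.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

feature_cols = ["TV", "Radio", "Newspaper"]

# Synthetic data: effect sizes 3, 2 and 0 for the three columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Keep one variable so that every other variable gets a distinct rank.
selector = RFE(LinearRegression(), n_features_to_select=1, step=1).fit(X, y)

# Pair each variable name with its rank (1 = selected, larger = dropped earlier).
ranking = {name: int(rank) for name, rank in zip(feature_cols, selector.ranking_)}
print(ranking)
```

The variable eliminated first receives the largest rank, which is why the least useful predictor ends up at the bottom.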