Model validation and evaluation

The preceding logistic regression model was built on the entire dataset. Let us now split the data into training and testing sets, build the model using the training set, and then check its accuracy on the testing set. The goal is to see whether this improves the accuracy of the predictions:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

The preceding code snippet creates training and testing datasets for both the predictor and the outcome variables. Let us now build a logistic regression model over the training set:

from sklearn import linear_model
from sklearn import metrics
clf1 = linear_model.LogisticRegression()
clf1.fit(X_train, Y_train)

The preceding code snippet fits the model. If you recall the equation behind logistic regression, you will know that the model predicts probabilities and not classes (a binary output, that is, 0 or 1). One needs to select a threshold over these probabilities to classify observations into the two categories: if the probability is below the threshold, the outcome is 0, and if it is above the threshold, the outcome is 1.

Let us see how we can get those probabilities and classifications:

probs = clf1.predict_proba(X_test)

This gives the probability of a negative and positive outcome for each row of the data:

Fig. 6.24: Predicted probability values for each observation

The second column provides the probability of a positive outcome (the purchase of a deposit, in our case). By default, if this probability is more than 0.5, the observation is categorized as a positive outcome, and as a negative outcome if it is less than that.
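As a quick check, the order of these probability columns can be read from the fitted model's classes_ attribute; the columns returned by predict_proba follow that order, so the second column corresponds to class 1:

print(clf1.classes_)   # class labels; the columns of probs follow this order
print(probs[:5, 1])    # probability of a positive outcome for the first five observations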

The classes predicted with this default threshold can be obtained using the following code snippet:

predicted = clf1.predict(X_test)

The output is an array consisting of 0s and 1s. In the default case, probabilities less than 0.5 are categorized as 0, and those greater than 0.5 as 1. One can also use a different cutoff, such as 0.15 or 0.20, depending upon the situation. In our case, we have seen that only 10% of the customers buy the product; hence, probability=0.10 can be a good threshold. If an observation has a probability of more than 0.10, we can classify it as a positive outcome (the customer will buy the product). An observation with a probability of less than 0.10 will be classified as a negative outcome.
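Before changing the cutoff, we can apply the default 0.5 cutoff ourselves and confirm that it reproduces the output of predict; a quick sketch using the probs and predicted arrays from the earlier snippets:

import numpy as np

# Manually apply the default 0.5 cutoff to the positive-class probabilities;
# this should agree with the output of clf1.predict(X_test)
manual_pred = np.where(probs[:, 1] > 0.5, 1, 0)
print((manual_pred == predicted).all())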

The threshold can be changed using the following code snippet:

import pandas as pd
import numpy as np

# Probability of a positive outcome for each observation
prob = probs[:, 1]
prob_df = pd.DataFrame(prob)

# Classify as positive (1) when the probability is at least 0.10
prob_df['predict'] = np.where(prob_df[0] >= 0.10, 1, 0)
prob_df.head()

The number of positive and negative responses will change with the threshold value. The percentages of positive outcomes at three different threshold probabilities are as follows:

Threshold    % Positive outcome
0.10         28%
0.15         18%
0.05         65%
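These percentages can be computed directly from the predicted probabilities for any cutoff. The following is a minimal sketch using the probs array from the earlier snippet; the exact numbers depend on the data and the train-test split:

import numpy as np

# Share of observations classified as positive at each candidate cutoff
prob = probs[:, 1]
for threshold in [0.05, 0.10, 0.15]:
    positive_share = (prob >= threshold).mean()
    print(threshold, round(positive_share * 100, 1), '%')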

Let us check the accuracy of this model using the following code snippet:

print(metrics.accuracy_score(Y_test, predicted))

This model has an accuracy of 90.21%, the same as the previous model.
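Note that this figure is based on the default 0.5 cutoff, since predicted came from clf1.predict. The accuracy at the custom 0.10 cutoff can be computed in the same way from the thresholded predictions built earlier; a small sketch, assuming prob_df and Y_test are still in scope:

# Accuracy when the 0.10 cutoff stored in prob_df is used instead of the default 0.5
print(metrics.accuracy_score(Y_test, prob_df['predict']))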

Cross validation

Cross validation is performed to check how well a predictive model will generalize its results to an independent dataset.

Cross validation is required to deal with an issue common to predictive models. A model is developed on one set of data, and its parameters are calculated using the criterion of the best possible fit to the data on which the model is being built. This leads to a problem called overfitting, wherein the model fits the given (training) data very well but doesn't reproduce that good fit on another (testing) dataset. The problem is more severe for datasets with fewer observations.

Splitting a dataset into training and testing sets, as we did earlier, is the simplest way to do cross validation. This is called the holdout method, wherein the training and testing sets are chosen randomly.

The most popular way to perform cross validation is k-fold cross validation. It is done as follows:

  1. Divide the dataset into k partitions.
  2. Use one partition as the testing set and the other k-1 partitions together as the training set.
  3. Repeat the process in (2) k times, using a different partition as the testing dataset and the rest as the training dataset in each iteration.
  4. Calculate the model accuracy for each iteration; the average of these per-fold accuracies is taken as the model accuracy (a manual sketch of this loop follows the list).
  5. If the accuracy doesn't vary much across iterations and the average accuracy remains close to the accuracy calculated for the model before, it can be concluded that the model generalizes well.
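The following is a minimal sketch of this loop using scikit-learn's KFold splitter. The fold count of 10 and the conversion with np.asarray are illustrative choices; X and Y are assumed to be the same predictor and outcome objects used earlier:

from sklearn.model_selection import KFold
from sklearn import linear_model, metrics
import numpy as np

X_arr, Y_arr = np.asarray(X), np.asarray(Y)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_accuracies = []
for train_index, test_index in kf.split(X_arr):
    # Hold out one partition for testing and train on the remaining k-1 partitions
    model = linear_model.LogisticRegression()
    model.fit(X_arr[train_index], Y_arr[train_index])
    # Record the accuracy on the held-out partition
    predictions = model.predict(X_arr[test_index])
    fold_accuracies.append(metrics.accuracy_score(Y_arr[test_index], predictions))

# The average of the per-fold accuracies is the cross-validated accuracy
print(np.mean(fold_accuracies))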

Each observation is part of the testing dataset exactly once and part of the training dataset exactly k-1 times. One advantage of this method is that every observation is used for both training and testing; hence, it leads to a better assessment of generalization. k=10 is generally the norm, but this can be changed according to the situation. In scikit-learn, there is a dedicated function to perform cross validation, which can be used very easily:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(linear_model.LogisticRegression(), X, Y, scoring='accuracy', cv=8)
print(scores)
print(scores.mean())

The preceding code snippet runs an 8-fold cross validation and calculates the accuracy for each of the iterations. The average accuracy is also printed:

Fig. 6.25: The accuracy for each run (fold) of the model during cross validation

The average accuracy remains very close to the accuracy we have observed before; hence, we can conclude that the model generalizes well.
