Dealing with reducible error components

High bias:

  • Add more features
  • Apply a more complex model
  • Use fewer instances to train (adding more data does not reduce bias)
  • Reduce regularization

High variance:

  • Conduct feature selection and use fewer features
  • Get more training data
  • Use regularization to mitigate the overfitting caused by complex models

Cross validation

Cross-validation is an important step in the model validation and evaluation process. It is a technique for validating the performance of a model before we apply it to an unobserved dataset. It is not advisable to use the full training data to train the model, because in that case we would have no idea how the model is going to perform in practice. As we learnt in the previous section, a good learner should be able to generalize well to an unseen dataset; that can happen only if the model is able to extract and learn the underlying patterns or relations between the independent attributes and the dependent attribute. If we train the model on the full training data and then apply it to a test set, it is very likely that we will have very low prediction errors on the training set while the accuracy on the test set is significantly lower. This would mean that the model is overfitting: the classifier did not learn from the data, it merely memorized a specific mapping from the independent variables to the class variable, and so it was not able to generalize well to the unseen test data. Overfitting is most likely when we have few instances or a high number of parameters. How do we handle it?

Cross-validation helps in validating the model fit on an artificially created test set, which is kept aside during the training process, that is, the learner is not allowed to train the model on it. It is very common to divide the annotated data into two parts, a training set and a cross-validation (or holdout) set, typically in the ratio of 60%-40% or 70%-30% of the whole data. The training is performed on the training set and the trained model is used to predict on the cross-validation set. If the model performs well on the training data but the accuracy drops significantly on the cross-validation set, it is a typical case of overfitting.
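As a quick illustration, here is a minimal base R sketch of such a 70%-30% split; the built-in iris data and the split ratio are only stand-in assumptions, not part of any particular workflow:

set.seed(123)                                   # for reproducible sampling
n <- nrow(iris)                                 # iris is used purely as a stand-in dataset
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set   <- iris[train_idx, ]                # 70% used to fit the model
holdout_set <- iris[-train_idx, ]               # 30% kept aside for validation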


Some of the other model validation methods are:

  • Leave-one-out
  • k-fold
  • Repeated holdout
  • Bootstrap
  • Stratified

Leave-one-out

Leave-one-out is a variant of the leave-p-out method of cross-validation where p = 1. In this method, all instances except one are used during the model training process, and the model is tested on the single left-out instance. This is done exhaustively for all the instances, and the error rate is calculated. In simple words, the algorithm leaves out each instance in turn, validates the hypothesis generated by training the model on the remaining instances, and stores the correctness of the model in binary form. Once this step has run for all the instances, the results of all the iterations (typically as many as there are instances in the data) are averaged to come up with the final estimate of the error.

There are a few obvious benefits to this method. The largest possible set of data is used for training, which in an ideal scenario should increase the classifier's accuracy. This mechanism of sampling the data is unbiased; it is deterministic in nature and does not involve random sampling.

One glaring concern associated with this method is the computational cost. Leave-one-out can be viewed as n-fold cross-validation, where n is the number of instances in the data. The process repeats itself n times, leaving one instance out in turn and training on the remaining instances, and completes only after the error has been estimated with each instance left out of training exactly once. This makes the method computationally very expensive. The other issue associated with this mechanism is that we can never have stratified samples in training if we perform leave-one-out sampling. We have not yet learnt about k-fold and stratified sampling; these concerns can be understood better after we cover all the cross-validation methods.
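To make the mechanics concrete, the following base R sketch runs leave-one-out cross-validation for a simple linear model; the mtcars data, the formula, and the squared-error measure are illustrative assumptions only:

errors <- numeric(nrow(mtcars))
for (i in seq_len(nrow(mtcars))) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-i, ])           # train on all but instance i
  pred <- predict(fit, newdata = mtcars[i, , drop = FALSE])
  errors[i] <- (mtcars$mpg[i] - pred)^2                     # squared error on the left-out instance
}
mean(errors)                                                # final estimate, averaged over all n iterations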

k-Fold

For all practical purposes, we need to evaluate our model both on unbiased data and on a variety of datasets. A good way to get this estimate is k-fold cross-validation. In this process, we first randomly divide the dataset into k parts, each part having almost the same number of data points. Out of these k parts, one part is retained as the test data and the remaining parts are used for training. The entire process is repeated k times, and during each pass exactly one part is held out for testing. All the results can then be averaged to get the overall error rate. One upside to this method is that every part is used for training and each part is used for validation exactly once. The standard practice is to use 10-fold cross-validation and compute the overall error estimate. It is debatable whether this rule of thumb applies universally, although there are mathematical arguments to back the claim that 10 is, in most cases, a good choice for the number of folds. k-fold cross-validation may produce different outcomes when run on the same dataset with the same model because of the random partitioning. Although stratification reduces this variation to a good extent, there are still some differences in the outcomes. It is a common practice to repeat k-fold cross-validation k times, which means the model is trained k^2 times, each time on training data that is (k-1)/k the size of the full data. This makes it very obvious that it is a computationally expensive process.
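The following base R sketch shows one pass of k-fold cross-validation with k = 10; again, the mtcars data and the linear model formula are only illustrative assumptions:

set.seed(42)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))      # randomly assign each row to one of k folds
cv_errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]                           # k - 1 folds for training
  test  <- mtcars[folds == i, ]                           # the held-out fold for validation
  fit   <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)       # fold-level mean squared error
})
mean(cv_errors)                                           # overall error estimate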


Some R packages that can be used for k-fold cross-validation are described below.

The cvFit() function from the cvTools package can be used to estimate the prediction error of a model using k-fold cross-validation:

install.packages("cvTools")
install.packages("robustbase")
data("coleman")
fit <- lmrob(Y ~ ., data=coleman)
cvFit(fit, data = coleman, y = coleman$Y, cost = rtmspe, K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)

5-fold CV results:
       CV 
0.9867381 

install.packages("boot")
cv.glm(data, glmfit, cost, K)

The parameters to the cv.glm() function are described as follows (a minimal runnable sketch follows the list):

  • glmfit: Results of a generalized linear model fitted to data
  • data: A matrix or data frame containing the data
  • cost: A function of two vector arguments specifying the cost function for the cross-validation
  • K: The number of groups into which the data should be split to estimate the cross-validation prediction error
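Putting the pieces together, a minimal sketch with the boot package might look like the following; the mtcars data and the model formula are illustrative assumptions, and the default cost (mean squared error) is used:

library(boot)
fit <- glm(mpg ~ wt + hp, data = mtcars)     # a Gaussian glm() fit is an ordinary linear model
cv_out <- cv.glm(data = mtcars, glmfit = fit, K = 5)
cv_out$delta                                 # raw and bias-adjusted cross-validation error estimates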

Bootstrap

Both cross-validation and bootstrapping are used to estimate the generalization error of a model by performing validation on resampled data. In the bootstrap sampling procedure, we sample the dataset with replacement. In all the preceding methods, when a sample was picked from the dataset it was not replaced, that is, the same instance, once selected, could not be selected again. If the model learns on a sample that contains repetition, the results of the learning may be different; this is the idea behind bootstrap cross-validation. Since we are sampling the data with replacement, some of the instances will occur more than once in a sample, whereas some may never appear. What is the probability of any specific instance not being picked for the training set?

If the dataset has n instances, the probability of a specific instance being picked in a single draw is 1/n, and thus the probability of it not being picked in that draw is 1 - 1/n. After n draws with replacement, the probability of a specific instance never being picked for the sample is (1 - 1/n)^n, which for large n is approximately equal to 1/e, that is, about 0.368.

This implies that, for a sufficiently large dataset, approximately 63.2% of the instances will appear in the training sample, while about 36.8% will never feature in it and can therefore be used as the test set.
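This 63%/37% split can be checked empirically; the short simulation below, with an arbitrarily chosen n, counts how many distinct instances appear in a bootstrap sample of the same size:

set.seed(7)
n <- 10000
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)   # a bootstrap sample: n draws with replacement
length(unique(boot_idx)) / n                               # fraction appearing at least once, close to 0.632
(1 - 1/n)^n                                                # probability of never being picked, close to 1/e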

Stratified

Stratified sampling is another probability sampling technique, in which the entire dataset is divided into separate groups called strata, and a random sample is drawn from each group. It has many advantages over simply dividing the dataset into test and train sets, as the class distribution is preserved across the samples, which helps the model learn most of the patterns present in the data. One thing to keep in mind is that if the strata overlap, the probability of a particular data point being picked may increase.
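As a hedged sketch, the createFolds() function from the caret package can produce folds that preserve the class distribution; the iris data and the number of folds here are illustrative assumptions:

library(caret)
set.seed(99)
# Each fold keeps roughly the same proportion of the three Species levels
folds <- createFolds(iris$Species, k = 5, list = TRUE, returnTrain = FALSE)
sapply(folds, function(idx) table(iris$Species[idx]))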

The caret package in R has a function, trainControl(), which supports various resampling methods:

  • boot
  • boot632
  • cv
  • repeatedcv
  • LOOCV
  • LGOCV
  • adaptive_cv
  • adaptive_boot

For more information refer to https://cran.r-project.org/web/packages/caret/caret.pdf
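For example, a repeated k-fold setup could be configured as sketched below; the chosen model method ("rpart"), the iris data, and the fold settings are illustrative assumptions rather than recommendations:

library(caret)
set.seed(1)
ctrl  <- trainControl(method = "repeatedcv", number = 10, repeats = 3)   # 10-fold CV, repeated 3 times
model <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
model$results                                                            # resampled accuracy for each tuning value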
