Reducing overfitting with dropout

You may have noticed that we employed L2 regularization in the MXNet solution, which penalizes large weights in order to avoid overfitting, but we did not do so in this Keras solution. This accounts for the slight difference in classification accuracy on the testing set (99.30% versus 98.65%). We are now going to add regularization to our Keras solution as well, this time in the form of dropout.
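Although we will use dropout here, the same kind of L2 penalty could in principle be added in Keras through kernel regularizers. The following fragment is only an illustrative sketch of what a penalized dense layer might look like (the penalty strength of 0.001 is an arbitrary example value and model stands for a sequential model under construction); it is not part of the solution we build in this section:

> # Illustrative only: a dense layer with an L2 weight penalty in Keras
> model %>% layer_dense(
+     units = 1000,
+     kernel_regularizer = regularizer_l2(l = 0.001)
+ )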

Dropout is a regularization technique for neural networks initially proposed by Geoffrey Hinton et al. in 2012 (Improving Neural Networks by Preventing Co-adaptation of Feature Detectors). As the name implies, it ignores a randomly selected subset of neurons (hidden or visible) during training. The dropped-out neurons temporarily make no contribution to the activations of downstream neurons on the forward pass and receive no weight updates on the backward pass. So how does the dropout technique prevent overfitting?

Recall that in a standard neural network, neighboring neurons become co-dependent during training: the weights of each neuron are tuned for a particular context within the network, which limits what each neuron can do on its own. Such reliance on context may cause the model to become too specialized to the training data. When some neurons in the network are randomly ignored, their weights become less sensitive to those of other neurons. Neurons are forced to learn useful information more independently, and co-adaptation to the training data is penalized.

Employing dropout is simple. In the training phase, for a layer with dropout rate p, we randomly switch off a fraction p of its neurons in each iteration. In the testing phase, we use all neurons, but scale their activations by a factor of q = 1 - p to account for the activations that were dropped during training.
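To make the mechanics concrete, here is a minimal sketch in plain R (independent of Keras) that applies dropout to a hypothetical vector of activations; the variable names and values are illustrative only:

> set.seed(42)
> p <- 0.25                    # dropout rate
> a <- runif(8)                # hypothetical activations of one layer
> # Training phase: randomly switch off a fraction p of the neurons
> mask <- rbinom(length(a), size = 1, prob = 1 - p)
> a_train <- a * mask
> # Testing phase: keep all neurons but scale by q = 1 - p
> a_test <- a * (1 - p)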

Here is a standard neural network (first image) and the same network with dropout (second image):

In this example, dropout is applied to a visible layer, the input layer, as well as to the hidden layers.

In practice, the dropout rate is usually set between 20% and 50%. A rate that is too low makes little difference, whereas one that is too high causes underfitting.

Now let's apply dropout to our Keras solution using the layer_dropout(rate) function. For reuse, we define a function that initializes and compiles a CNN model with dropout:

> init_cnn_dropout <- function(){ 
+     model_dropout <- keras_model_sequential() 
+     model_dropout %>% 
+         # First hidden convolutional layer 
+         layer_conv_2d( 
+             filter = 32, kernel_size = c(5,5),  
+             input_shape = c(32, 32, 1) 
+         ) %>% 
+         layer_activation("relu") %>% 
+         layer_max_pooling_2d(pool_size = c(2,2)) %>% 
+          
+         # Second hidden convolutional layer 
+         layer_conv_2d(filter = 64, kernel_size = c(5,5)) %>% 
+         layer_activation("relu") %>% 
+         # Use max pooling 
+         layer_max_pooling_2d(pool_size = c(2,2)) %>% 
+         layer_dropout(0.25) %>% 
+          
+         # Flatten and feed into dense layer 
+         layer_flatten() %>% 
+         layer_dense(1000) %>% 
+         layer_activation("relu") %>% 
+         layer_dropout(0.25) %>% 
+          
+         # Outputs from dense layer  
+         layer_dense(43) %>% 
+         layer_activation("softmax") 
+      
+     opt <- optimizer_sgd(lr = 0.005, momentum = 0.9) 
+      
+     model_dropout %>% compile( 
+         loss = "categorical_crossentropy", 
+         optimizer = opt, 
+         metrics = "accuracy" 
+     ) 
+     return(model_dropout) 
+ } 

Obtain a new model:

> model_dropout <- init_cnn_dropout() 

We simply apply 25% dropout after the second max pooling layer and 25% dropout after the fully connected hidden layer. By calling summary(model_dropout), we can see the two dropout layers, one right below the second max pooling layer and one right below the first dense layer and its activation.
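For reference, the check can be performed as follows (the printed summary is omitted here; it lists each layer along with its output shape and number of parameters):

> summary(model_dropout)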

Continue with the model training:

> model_dropout %>% fit( 
+   x_train, y_train, 
+   batch_size = 100, 
+   epochs = 30, 
+   validation_data = list(x_test, y_test), 
+   shuffle = FALSE 
+ ) 
Train on 29409 samples, validate on 9800 samples 
Epoch 1/30 
29409/29409 [==============================] - 108s 4ms/step - loss: 3.1078 - acc: 0.1987 - val_loss: 1.4475 - val_acc: 0.6487 
Epoch 2/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.9772 - acc: 0.7337 - val_loss: 0.4570 - val_acc: 0.8934 
Epoch 3/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.5194 - acc: 0.8598 - val_loss: 0.3043 - val_acc: 0.9310 
Epoch 4/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.3606 - acc: 0.9037 - val_loss: 0.2058 - val_acc: 0.9529 
Epoch 5/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.2828 - acc: 0.9250 - val_loss: 0.1677 - val_acc: 0.9640 
Epoch 6/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.2272 - acc: 0.9406 - val_loss: 0.1424 - val_acc: 0.9707 
Epoch 7/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.1910 - acc: 0.9494 - val_loss: 0.1138 - val_acc: 0.9793 
Epoch 8/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.1560 - acc: 0.9602 - val_loss: 0.0986 - val_acc: 0.9797 
Epoch 9/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.1423 - acc: 0.9621 - val_loss: 0.0956 - val_acc: 0.9804 
Epoch 10/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.1256 - acc: 0.9663 - val_loss: 0.0814 - val_acc: 0.9841 
Epoch 11/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.1111 - acc: 0.9708 - val_loss: 0.0760 - val_acc: 0.9847 
Epoch 12/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0987 - acc: 0.9735 - val_loss: 0.0795 - val_acc: 0.9824 
Epoch 13/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0899 - acc: 0.9752 - val_loss: 0.0626 - val_acc: 0.9876 
Epoch 14/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0799 - acc: 0.9787 - val_loss: 0.0665 - val_acc: 0.9868 
Epoch 15/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0794 - acc: 0.9792 - val_loss: 0.0571 - val_acc: 0.9887 
Epoch 16/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0691 - acc: 0.9817 - val_loss: 0.0534 - val_acc: 0.9898 
Epoch 17/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0668 - acc: 0.9817 - val_loss: 0.0560 - val_acc: 0.9892 
Epoch 18/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0583 - acc: 0.9846 - val_loss: 0.0486 - val_acc: 0.9916 
Epoch 19/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0541 - acc: 0.9861 - val_loss: 0.0484 - val_acc: 0.9914 
Epoch 20/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0529 - acc: 0.9858 - val_loss: 0.0494 - val_acc: 0.9906 
Epoch 21/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0500 - acc: 0.9864 - val_loss: 0.0449 - val_acc: 0.9909 
Epoch 22/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0469 - acc: 0.9872 - val_loss: 0.0414 - val_acc: 0.9926 
Epoch 23/30 
29409/29409 [==============================] - 106s 4ms/step - loss: 0.0473 - acc: 0.9863 - val_loss: 0.0415 - val_acc: 0.9917 
Epoch 24/30 
29409/29409 [==============================] - 107s 4ms/step - loss: 0.0406 - acc: 0.9894 - val_loss: 0.0416 - val_acc: 0.9916 
Epoch 25/30 
29409/29409 [==============================] - 108s 4ms/step - loss: 0.0413 - acc: 0.9888 - val_loss: 0.0445 - val_acc: 0.9909 
Epoch 26/30 
29409/29409 [==============================] - 108s 4ms/step - loss: 0.0337 - acc: 0.9906 - val_loss: 0.0412 - val_acc: 0.9922 
Epoch 27/30 
29409/29409 [==============================] - 108s 4ms/step - loss: 0.0333 - acc: 0.9911 - val_loss: 0.0388 - val_acc: 0.9928 
Epoch 28/30 
29409/29409 [==============================] - 108s 4ms/step - loss: 0.0332 - acc: 0.9905 - val_loss: 0.0395 - val_acc: 0.9933 
Epoch 29/30 
29409/29409 [==============================] - 108s 4ms/step - loss: 0.0312 - acc: 0.9910 - val_loss: 0.0371 - val_acc: 0.9937 
Epoch 30/30 
29409/29409 [==============================] - 108s 4ms/step - loss: 0.0305 - acc: 0.9917 - val_loss: 0.0383 - val_acc: 0.9940 

With dropout, the prediction accuracy on the testing set is increased to 99.40%.
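
To verify the figure reported above, we can also evaluate the trained model on the testing set directly (a minimal sketch; it assumes x_test and y_test as prepared earlier, and the printed scores are omitted):

> model_dropout %>% evaluate(x_test, y_test, verbose = 0)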
