Dealing with a small training set – data augmentation

We have been fortunate so far to possess a large enough training dataset, 75% of the 39,209 samples. This is one of the reasons why we are able to achieve a 99.3% to 99.4% classification accuracy. However, in reality, obtaining a large training set is not easy in most supervised learning cases, where manual work is necessary or the cost of data collection and labeling is high. In our traffic sign classification project, can we still achieve the same performance if we are given far fewer training samples to begin with? Let's give it a shot.

We simulate a small training set with only 10% of the 39,209 samples and a testing set with the remaining 90%:

> train_perc_1 <- 0.1 
> train_index_1 <- createDataPartition(data.y, p=train_perc_1, list=FALSE) 
> train_index_1 <- train_index_1[sample(nrow(train_index_1)),] 
> data_train_1.x <- data.x[train_index_1,] 
> data_train_1.y <- data.y[train_index_1] 
> data_test_1.x <- data.x[-train_index_1,] 
> data_test_1.y <- data.y[-train_index_1] 
> x_train_1 <- data_train_1.x 
> dim(x_train_1) <- c(nrow(data_train_1.x), 32, 32, 1) 
> x_test_1 <- data_test_1.x 
> dim(x_test_1) <- c(nrow(data_test_1.x), 32, 32, 1) 
> y_train_1 <- to_categorical(data_train_1.y, num_classes = 43) 
> y_test_1 <- to_categorical(data_test_1.y, num_classes = 43) 
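
For readers less familiar with one-hot encoding, to_categorical() simply turns each integer label into a length-43 indicator vector. Here is a minimal, language-agnostic sketch in Python/NumPy (an illustration only; the helper name to_one_hot is ours, not part of Keras):

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """One-hot encode integer class labels, as Keras's to_categorical() does."""
    return np.eye(num_classes, dtype=np.float32)[labels]

labels = np.array([0, 2, 1])            # three sample labels
encoded = to_one_hot(labels, num_classes=43)
print(encoded.shape)                    # (3, 43)
print(int(encoded[1].argmax()))         # 2
```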

Initialize a new model and fit it with the new training set:

> model_1 <- init_cnn_dropout() 
> model_1 %>% fit( 
+   x_train_1, y_train_1, 
+   batch_size = 100, 
+   epochs = 30, 
+   validation_data = list(x_test_1, y_test_1), 
+   shuffle = FALSE 
+ )  

Train on 3,921 samples; validate on 35,288 samples:

Epoch 1/30 
3921/3921 [==============================] - 19s 5ms/step - loss: 3.6705 - acc: 0.0594 - val_loss: 3.5191 - val_acc: 0.0592 
Epoch 2/30 
3921/3921 [==============================] - 17s 4ms/step - loss: 3.5079 - acc: 0.0681 - val_loss: 3.4663 - val_acc: 0.0529 
...... 
...... 
Epoch 29/30 
3921/3921 [==============================] - 17s 4ms/step - loss: 0.1935 - acc: 0.9462 - val_loss: 0.2760 - val_acc: 0.9381 
Epoch 30/30 
3921/3921 [==============================] - 17s 4ms/step - loss: 0.1962 - acc: 0.9431 - val_loss: 0.2772 - val_acc: 0.9393 

Achieving 93.93% accuracy with a model trained on only 3,921 samples is not bad at all. But can we do better, at least getting close to the 99% we accomplished with sufficient training data? Yes! One solution is data augmentation.

Data augmentation simply means expanding the size of the existing data we feed to supervised learning models, by generating label-preserving variants of its samples, in order to compensate for the cost of further data collection and labeling.
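
Before turning to the Keras utilities, here is a minimal framework-free sketch of the idea in Python/NumPy (an illustration under our own naming, not the method used in this chapter's R code): a single horizontal flip already doubles the dataset without any extra labeling effort:

```python
import numpy as np

def augment_with_flips(images, labels):
    """Append horizontally flipped copies of every image.

    images: (n, height, width, channels); labels: (n,).
    Flipping preserves the label only for orientation-insensitive classes.
    """
    flipped = images[:, :, ::-1, :]      # reverse the width axis
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))

x = np.random.default_rng(0).random((10, 32, 32, 1)).astype(np.float32)
y = np.arange(10)
x_aug, y_aug = augment_with_flips(x, y)
print(x_aug.shape)                       # (20, 32, 32, 1)
```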

There are many ways to augment data in computer vision. The simplest one is probably flipping an image horizontally or vertically. Take the General caution sign as an example; we implement flipping using the function flow_images_from_data() in Keras as follows.

Load the General caution sample:

> img <- image_load(paste(training_path, "00018/00001_00004.ppm", sep="")) 
> img1 <- image_to_array(img) 
> dim(img1) <- c(1, dim(img1)) 

We generate a horizontally flipped image and save the resulting image in the augmented directory we created:

> images_iter <- flow_images_from_data(img1, generator = 
+                 image_data_generator(horizontal_flip = TRUE), 
+                 save_to_dir = 'augmented', 
+                 save_prefix = "horizontal", save_format = "png") 
> reticulate::iter_next(images_iter) 

The flipped sign (right) along with the original image (left) is displayed as follows:

The horizontally flipped sign conveys the same message as the original one. Note that flipping works only in orientation-insensitive cases, such as classifying cats versus dogs, or recognizing traffic lights. In cases where orientation matters, such as distinguishing a right turn sign from a left turn sign, flipping is off the table, but a small-to-medium rotation can still be applied. For instance, flipping the Dangerous curve to the right sign is absolutely dangerous, but rotating it by at most 20 degrees is harmless and even helpful, as we can see in the following example:

> img <- image_load(paste(training_path, "00020/00002_00017.ppm", sep="")) 
> img1 <- image_to_array(img) 
> dim(img1) <- c(1, dim(img1)) 
> images_iter <- flow_images_from_data(img1, generator = 
+                 image_data_generator(rotation_range = 20), 
+                 save_to_dir = 'augmented', 
+                 save_prefix = "rotation", save_format = "png") 
> reticulate::iter_next(images_iter) 

The rotated sign (right) and the original image (left) contain identical information:

Shifting is perhaps the most common augmentation method. Moving the image horizontally and/or vertically by a small number of pixels generates an identically functioning image. Using the same example as before, we shift it horizontally and vertically by at most 20% of the width and height:

> images_iter <- flow_images_from_data(img1, 
+                 generator = image_data_generator(width_shift_range = 0.2, 
+                                                  height_shift_range = 0.2), 
+                 save_to_dir = 'augmented', 
+                 save_prefix = "shift", save_format = "png") 
> reticulate::iter_next(images_iter) 

This results in a shifted image in the right half here:

Armed with these common augmentation approaches, let's augment our small training dataset with rotations of at most 20 degrees and shifts of at most 20% (note that we cannot apply flipping, as some signs, such as classes 19, 20, 21, 33, and 34, are not symmetric):

> datagen <- image_data_generator( 
+   rotation_range = 20, 
+   width_shift_range = 0.2, 
+   height_shift_range = 0.2, 
+   horizontal_flip = FALSE 
+ ) 
> 
> datagen %>% fit_image_data_generator(x_train_1) 
 
The augmented data generator is now defined; next, we apply it to a new CNN model with the fit_generator() function: 
 
> model_2 <- init_cnn_dropout() 
> model_2 %>% fit_generator( 
+   flow_images_from_data(x_train_1, y_train_1,  
+                         datagen, batch_size = 100), 
+   steps_per_epoch = as.integer(50000/100),  
+   epochs = 30,  
+   validation_data = list(x_test_1, y_test_1) 
+ ) 
Epoch 1/30 
500/500 [==============================] - 74s 149ms/step - loss: 3.4566 - acc: 0.0798 - val_loss: 3.2963 - val_acc: 0.1322 
Epoch 2/30 
500/500 [==============================] - 77s 153ms/step - loss: 3.0920 - acc: 0.1666 - val_loss: 2.1010 - val_acc: 0.4249 
...... 
...... 
Epoch 25/30 
500/500 [==============================] - 83s 166ms/step - loss: 0.1396 - acc: 0.9584 - val_loss: 0.0636 - val_acc: 0.9860 
Epoch 26/30 
500/500 [==============================] - 79s 158ms/step - loss: 0.1359 - acc: 0.9592 - val_loss: 0.0672 - val_acc: 0.9859 
Epoch 27/30 
500/500 [==============================] - 80s 160ms/step - loss: 0.1344 - acc: 0.9600 - val_loss: 0.0727 - val_acc: 0.9843 
Epoch 28/30 
500/500 [==============================] - 81s 163ms/step - loss: 0.1227 - acc: 0.9628 - val_loss: 0.0647 - val_acc: 0.9862 
Epoch 29/30 
500/500 [==============================] - 79s 158ms/step - loss: 0.1222 - acc: 0.9627 - val_loss: 0.0668 - val_acc: 0.9858 
Epoch 30/30 
500/500 [==============================] - 80s 160ms/step - loss: 0.1220 - acc: 0.9636 - val_loss: 0.0614 - val_acc: 0.9870 

With data augmentation, we achieve an excellent 98.70% accuracy with the same small training set.

Besides the 4.77% performance increase (from 93.93% to 98.70%) over training without data augmentation, we observe that each epoch now takes considerably longer than before (around 80 seconds instead of around 17 seconds). This is because image_data_generator() generates mini-batches of augmented image data in real time at each iteration. So, even if the same set of samples is used in two epochs, the augmented data can be very different. Such a setting adds more variation to the training set, which in turn makes the model more robust. That is why data augmentation is also considered an approach to reducing overfitting.
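
The on-the-fly behavior can be mimicked with a short NumPy sketch (an illustration of the principle, not Keras internals; np.roll wraps pixels around, whereas Keras fills the vacated border): every call draws a fresh random shift, so the same sample rarely looks the same twice:

```python
import numpy as np

def random_shift(image, max_frac, rng):
    """Shift an image by a random fraction of its height/width."""
    h, w = image.shape[:2]
    dy = int(rng.integers(-int(max_frac * h), int(max_frac * h) + 1))
    dx = int(rng.integers(-int(max_frac * w), int(max_frac * w) + 1))
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
img = np.arange(32 * 32, dtype=np.float32).reshape(32, 32, 1)
a = random_shift(img, 0.2, rng)          # one augmented view
b = random_shift(img, 0.2, rng)          # same input, independent draw
```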

Another particularly useful application of data augmentation is balancing a dataset. In most imbalanced classification cases (such as online ad click-through prediction, or banking fraud detection), we usually downsample the dominant class. However, this can be counterproductive for small datasets. An alternative solution is to augment the data of the minority class.
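
A sketch of that idea in Python/NumPy (a hypothetical helper of our own; a small random width shift stands in for any augmentation): keep generating augmented copies of the minority class until it reaches the desired size:

```python
import numpy as np

def balance_by_augmentation(x_minor, y_minor, target_count, rng):
    """Oversample the minority class with randomly shifted copies
    until it reaches target_count samples (illustrative only)."""
    xs, ys = [x_minor], [y_minor]
    total = len(x_minor)
    while total < target_count:
        idx = rng.integers(0, len(x_minor),
                           size=min(len(x_minor), target_count - total))
        shift = int(rng.integers(-3, 4))             # small random width shift
        xs.append(np.roll(x_minor[idx], shift, axis=2))
        ys.append(y_minor[idx])
        total += len(idx)
    return np.concatenate(xs), np.concatenate(ys)

rng = np.random.default_rng(1)
x_minor = np.zeros((50, 32, 32, 1), dtype=np.float32)  # 50 minority samples
y_minor = np.full(50, 7)                               # all of class 7
x_bal, y_bal = balance_by_augmentation(x_minor, y_minor, 500, rng)
print(len(x_bal))                                      # 500
```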

Last but not least, there are other approaches to augmenting image data that we have not covered. For example, rescaling multiplies pixel values by a factor and, as a result, changes the lighting condition. Shearing and zooming are useful augmentation transformations as well. All of these can be specified in image_data_generator(). If interested, try applying any of them and see whether you can beat 98.70%.
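
Rescaling is the simplest of these to sketch. Multiplying every pixel by a constant factor uniformly brightens or darkens the image (a Python/NumPy illustration with our own helper name; in the R code this corresponds to the rescale argument of image_data_generator()):

```python
import numpy as np

def rescale(image, factor):
    """Multiply pixel values by a factor, simulating a lighting change.
    Clip to the valid [0, 255] range for 8-bit images."""
    return np.clip(image.astype(np.float32) * factor, 0, 255)

img = np.full((32, 32, 1), 100, dtype=np.uint8)  # uniform gray image
brighter = rescale(img, 1.5)
darker = rescale(img, 0.5)
print(brighter[0, 0, 0], darker[0, 0, 0])        # 150.0 50.0
```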
