12

Categorizing Images of Clothing with Convolutional Neural Networks

The previous chapter wrapped up our coverage of the best practices for general and traditional machine learning. Starting from this chapter, we will dive into the more advanced topics of deep learning and reinforcement learning.

When we deal with image classification, we usually flatten the images into vectors of pixels and feed these to a neural network (or another model). Although this might do the job, we lose critical spatial information. In this chapter, we will use Convolutional Neural Networks (CNNs) to extract rich and distinguishable representations from images. You will see how CNN representations make a "9" a "9", a "4" a "4", a cat a cat, or a dog a dog.

We will start with exploring individual building blocks in the CNN architecture. Then, we will develop a CNN classifier in TensorFlow to categorize clothing images and demystify the convolutional mechanism. Finally, we will introduce data augmentation to boost the performance of CNN models.

We will cover the following topics in this chapter:

  • CNN building blocks
  • CNNs for classification
  • Implementation of CNNs with TensorFlow and Keras
  • Classifying clothing images with CNNs
  • Visualization of convolutional filters
  • Data augmentation and implementation

Getting started with CNN building blocks

Although regular hidden layers (the fully connected layers we have seen so far) do a good job of extracting features from data at certain levels, these representations might not be useful for differentiating images of different classes. CNNs can be used to extract richer and more distinguishable representations that, for example, make a car a car, a plane a plane, or a handwritten "y" a "y", a "z" a "z", and so on. CNNs are a type of neural network that is biologically inspired by the human visual cortex. To demystify CNNs, I will start by introducing the components of a typical CNN, including the convolutional layer, the nonlinear layer, and the pooling layer.

The convolutional layer

The convolutional layer is the first layer in a CNN (or one of the first few layers, if the network has multiple convolutional layers). It takes in input images or matrices and simulates the way neuronal cells respond to receptive fields by applying a convolutional operation to the input. Mathematically, it computes the dot product between the nodes of the convolutional layer and individual small regions in the input layer. The small region is the receptive field, and the nodes of the convolutional layer can be viewed as the values of a filter. As the filter slides along the input layer, the dot product between the filter and the current receptive field (sub-region) is computed. A new layer called the feature map is obtained after the filter has convolved over all the sub-regions. Let's look at a simple example, as follows:

Figure 12.1: How a feature map is generated

In this example, layer l has 5 nodes and the filter is composed of 3 nodes [w1, w2, w3]. We first compute the dot product between the filter and the first three nodes in layer l and obtain the first node in the output feature map; then, we compute the dot product between the filter and the middle three nodes and generate the second node in the output feature map; finally, the third node is generated from the convolution on the last three nodes in layer l.
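As a minimal NumPy sketch of this one-dimensional case (the five input values and three filter weights below are made up for illustration and are not the ones in Figure 12.1), we slide the filter over the input and take a dot product at each position:

>>> import numpy as np
>>> layer_l = np.array([1, 0, 1, 1, 0])   # 5 nodes in layer l (made-up values)
>>> w = np.array([1, 0, 1])               # the filter [w1, w2, w3] (made-up values)
>>> feature_map = np.array([layer_l[i:i + 3] @ w for i in range(3)])
>>> print(feature_map)
[2 1 1]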

Now, we take a closer look at how convolution works in the following example:

Figure 12.2: How convolution works

In this example, a 3*3 filter is sliding around a 5*5 input matrix from the top left sub-region to the bottom right sub-region. For each sub-region, the dot product is computed using the filter. Take the top left sub-region (in the orange rectangle) as an example: we have 1 * 1 + 1 * 0 + 1 * 1 = 2, therefore the top left node (in the upper-left orange rectangle) in the feature map is of value 2. For the next leftmost sub-region (in the blue rectangle), we calculate the convolution as 1 * 1 + 1 * 1 + 1 * 1 = 3, so the value of the next node (in the upper-middle blue rectangle) in the resulting feature map becomes 3. At the end, a 3*3 feature map is generated as a result.
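The following minimal NumPy sketch reproduces this sliding dot product. The 5 * 5 input values below are made up for illustration (they are not the exact values shown in Figure 12.2), and the filter is a backslash-shaped diagonal:

>>> import numpy as np
>>> image = np.array([[1, 1, 0, 0, 0],
...                   [0, 1, 1, 0, 0],
...                   [0, 0, 1, 1, 0],
...                   [0, 1, 0, 1, 1],
...                   [1, 0, 0, 0, 1]])   # made-up 5 * 5 input
>>> kernel = np.eye(3)                    # a backslash-shaped diagonal filter
>>> feature_map = np.zeros((3, 3))
>>> for r in range(3):
...     for c in range(3):
...         # dot product between the filter and the current 3 * 3 receptive field
...         feature_map[r, c] = np.sum(image[r:r + 3, c:c + 3] * kernel)
>>> print(feature_map)
[[3. 3. 0.]
 [0. 3. 3.]
 [1. 0. 3.]]

Note how the high values trace the backslash-shaped diagonals in the input, which is exactly the point made next.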

So what do we use convolutional layers for? They are actually used to extract features such as edges and curves. A pixel in the output feature map will have a high value if the corresponding receptive field contains an edge or curve that is recognized by the filter. For instance, in the preceding example, the filter portrays a backslash-shaped "\" diagonal edge; the receptive field in the blue rectangle contains a similar curve and hence the highest intensity, 3, is produced. However, the receptive field at the top-right corner does not contain such a backslash shape, hence it results in a pixel of value 0 in the output feature map. The convolutional layer acts as a curve detector or a shape detector.

Also, a convolutional layer usually has multiple filters detecting different curves and shapes. In the preceding simple example, we only apply one filter and generate one feature map, which indicates how well the shape in the input image resembles the curve represented by the filter. To detect more patterns from the input data, we can employ more filters, such as ones for horizontal lines, vertical curves, 30-degree angles, and right-angle shapes.

Additionally, we can stack several convolutional layers to produce higher-level representations such as the overall shape and contour. Chaining more layers results in larger receptive fields that are able to capture more global patterns; for example, two stacked 3 * 3 convolutional layers have an effective receptive field of 5 * 5 on the original input.

In fact, CNNs, and their convolutional layers in particular, mimic the way our visual cells work, as follows:

  • Our visual cortex has a set of complex neuronal cells that are sensitive to specific sub-regions of the visual field, called receptive fields. For instance, some cells only respond in the presence of vertical edges; some cells fire only when they are exposed to horizontal edges; some react more strongly to edges of a certain orientation. These cells are organized together to produce the entire visual perception, with each cell specialized in a specific component. A convolutional layer in a CNN is composed of a set of filters that act like those cells in the human visual cortex.
  • A simple cell only responds when edge-like patterns are presented within its receptive sub-region. A more complex cell is sensitive to a larger sub-region and, as a result, can respond to edge-like patterns across the entire visual field. A stack of convolutional layers acts like a group of complex cells that can detect patterns in a bigger scope.

Right after each convolutional layer, we often apply a nonlinear layer.

The nonlinear layer

The nonlinear layer is basically the activation layer we have seen in Chapter 8, Predicting Stock Prices with Artificial Neural Networks. It is used, obviously, to introduce non-linearity. Recall that in the convolutional layer, we only perform linear operations (multiplication and addition); no matter how many linear hidden layers a neural network has, it will just behave as a single-layer perceptron. Hence, we need a nonlinear activation right after the convolutional layer. Again, ReLU is the most popular candidate for the nonlinear layer in deep neural networks.
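For instance, applying ReLU to a feature map simply zeroes out the negative values; here is a quick sketch with made-up numbers:

>>> import numpy as np
>>> feature_map = np.array([[-1.2, 0.5],
...                         [2.0, -0.3]])    # made-up values
>>> print(np.maximum(0, feature_map))        # ReLU: max(0, x) element-wise
[[0.  0.5]
 [2.  0. ]]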

The pooling layer

Normally after one or more convolutional layers (along with nonlinear activation), we can directly use the derived features for classification. For example, we can apply a softmax layer in the multiclass classification case. But let's do some math first.

Given 28 * 28 input images, suppose that we apply 20 5 * 5 filters in the first convolutional layer; we will then obtain 20 output feature maps, each of size (28 – 5 + 1) * (28 – 5 + 1) = 24 * 24 = 576. This means that the number of features as inputs for the next layer increases to 11,520 (20 * 576) from 784 (28 * 28). We then apply 50 5 * 5 filters in the second convolutional layer. The size of the output grows to 50 * 20 * (24 – 5 + 1) * (24 – 5 + 1) = 400,000. This is a lot higher than our initial size of 784. We can see that the dimensionality increases dramatically with every convolutional layer before the final softmax layer. This can be problematic, as it easily leads to overfitting, not to mention the cost of training such a large number of weights.
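These numbers are easy to double-check in Python:

>>> first_conv_features = 20 * (28 - 5 + 1) ** 2        # 20 feature maps of size 24 * 24
>>> print(first_conv_features)
11520
>>> second_conv_features = 50 * 20 * (24 - 5 + 1) ** 2  # 50 filters applied to each of the 20 maps
>>> print(second_conv_features)
400000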

To address the issue of drastically growing dimensionality, we often employ a pooling layer after the convolutional and nonlinear layer. The pooling layer is also called the downsampling layer. As you can imagine, it reduces the dimensions of the feature maps. This is done by aggregating the statistics of features over sub-regions. Typical pooling methods include:

  • Max pooling, which takes the max values over all non-overlapping sub-regions
  • Mean pooling, which takes the mean values over all non-overlapping sub-regions

In the following example, we apply a 2 * 2 max-pooling filter on a 4 * 4 feature map and output a 2 * 2 one:

Figure 12.3: How max pooling works

Besides dimensionality reduction, the pooling layer has another advantage: translation invariance. This means that its output doesn't change even if the input matrix undergoes a small amount of translation. For example, if we shift the input image a couple of pixels to the left or right, as long as the highest pixels remain the same in the sub-regions, the output of the max-pooling layer will still be the same. In other words, the prediction becomes less position-sensitive with pooling layers. The following example illustrates how max pooling achieves translation invariance.

Here is the 4 * 4 original image, along with the output from max pooling with a 2 * 2 filter:

Figure 12.4: The original image and the output from max pooling

And if we shift the image 1 pixel to the right, we have the following shifted image and the corresponding output:

Figure 12.5: The shifted image and the output

We have the same output even if we horizontally move the input image. Pooling layers increase the robustness of image translation.
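Here is a minimal NumPy sketch of both points (the 4 * 4 matrix below is made up and is not the exact one in Figures 12.4 and 12.5): we max-pool a feature map with a 2 * 2 filter, shift it 1 pixel to the right, and max-pool again:

>>> import numpy as np
>>> def max_pool_2x2(x):
...     # take the max over each non-overlapping 2 * 2 sub-region
...     return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))
>>> original = np.array([[7, 0, 9, 0],
...                      [1, 0, 2, 0],
...                      [5, 0, 8, 0],
...                      [3, 0, 4, 0]])    # made-up 4 * 4 feature map
>>> shifted = np.zeros_like(original)
>>> shifted[:, 1:] = original[:, :-1]      # shift every row 1 pixel to the right
>>> print(max_pool_2x2(original))
[[7 9]
 [5 8]]
>>> print(max_pool_2x2(shifted))
[[7 9]
 [5 8]]

The pooled output is identical before and after the shift, because the maximum of each 2 * 2 sub-region stays within its sub-region.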

You've now learned all of the components of a CNN. It was easier than you thought, right? Let's see how they compose a CNN next.

Architecting a CNN for classification

Putting the three types of convolutional-related layers together, along with the fully connected layer(s), we can structure the CNN model for classification as follows:

Figure 12.6: CNN architecture

In this example, the input images are first fed into a convolutional layer (with ReLU activation) composed of a bunch of filters. The coefficients of the convolutional filters are trainable. A well-trained initial convolutional layer is able to derive good low-level representations of the input images, which will be critical to downstream convolutional layers if there are any, and also downstream classification tasks. Each resulting feature map is then downsampled by the pooling layer.

Next, the aggregated feature maps are fed into the second convolutional layer. Similarly, the second pooling layer reduces the size of the output feature maps. You can chain as many pairs of convolutional and pooling layers as you want. The second (or more, if any) convolutional layer tries to compose high-level representations, such as the overall shape and contour, through a series of low-level representations derived from previous layers.

Up until this point, the feature maps are matrices. We need to flatten them into a vector before performing any downstream classification. The flattened features are just treated as the input to one or more fully-connected hidden layers. We can think of a CNN as a hierarchical feature extractor on top of a regular neural network. CNNs are well suited to exploit strong and unique features that differentiate images.

The network ends with a logistic (sigmoid) function if we deal with a binary classification problem, a softmax function for a multiclass case, or a set of logistic functions for a multi-label case.
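As a quick illustration (this is not the model we will build later, and the 10 and 5 below are placeholder numbers of classes and labels), the three cases correspond to output layers like the following in Keras:

>>> from tensorflow.keras import layers
>>> binary_output = layers.Dense(1, activation='sigmoid')        # binary: a single logistic unit
>>> multiclass_output = layers.Dense(10, activation='softmax')   # multiclass: softmax over all classes
>>> multilabel_output = layers.Dense(5, activation='sigmoid')    # multi-label: one logistic unit per label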

By now you should have a good understanding of CNNs, and should be ready to solve the clothing image classification problem. Let's start by exploring the dataset.

Exploring the clothing image dataset

Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist) is a dataset of clothing images from Zalando (Europe's biggest online fashion retailer). It consists of 60,000 training samples and 10,000 test samples. Each sample is a 28 * 28 grayscale image, associated with a label from the following 10 classes, each representing an article of clothing:

  • 0: T-shirt/top
  • 1: Trouser
  • 2: Pullover
  • 3: Dress
  • 4: Coat
  • 5: Sandal
  • 6: Shirt
  • 7: Sneaker
  • 8: Bag
  • 9: Ankle boot

Zalando seeks to make the dataset as popular as the handwritten digits MNIST dataset (http://yann.lecun.com/exdb/mnist/) for benchmarking algorithms, and hence calls it Fashion-MNIST.

You can download the dataset from the direct links in the Get the data section of the GitHub page, or simply import it from Keras, which already includes the dataset and its API. We will take the latter approach, as follows:

>>> import tensorflow as tf
>>> fashion_mnist = tf.keras.datasets.fashion_mnist
>>> (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

We just import TensorFlow and load the Fashion-MNIST from the Keras module. We now have the training images and their labels, along with the test images and their labels. Feel free to print a few samples from these four arrays, for example, the training labels as follows:

>>> print(train_labels)
[9 0 0 ... 3 0 5]

The label arrays do not include class names. Hence, we define them as follows and will use them for plotting later on:

>>> class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

Take a look at the format of the image data as follows:

>>> print(train_images.shape)
(60000, 28, 28)

There are 60,000 training samples and each is represented as 28 * 28 pixels.

Similarly for the 10,000 testing samples, we check the format as follows:

>>> print(test_images.shape)
(10000, 28, 28)

Let's now inspect a random training sample as follows:

>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> plt.imshow(train_images[42])
>>> plt.colorbar()
>>> plt.grid(False)
>>> plt.title(class_names[train_labels[42]])
>>> plt.show()

Refer to the following image as the end result:

Figure 12.7: A training sample from Fashion-MNIST

You may run into an error similar to the following:

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Abort trap: 6

If so, please add the following lines at the beginning of your script:

>>> import os
>>> os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

In the ankle boot sample, the pixel values are in the range of 0 to 255. Hence, we need to rescale the data to a range of 0 to 1 before feeding it to the neural network. We divide the values of both training samples and test samples by 255 as follows:

>>> train_images = train_images / 255.0
>>> test_images = test_images / 255.0

Now we display the first 16 training samples after the preprocessing, as follows:

>>> for i in range(16):
...     plt.subplot(4, 4, i + 1)
...     plt.subplots_adjust(hspace=.3)
...     plt.xticks([])
...     plt.yticks([])
...     plt.grid(False)
...     plt.imshow(train_images[i], cmap=plt.cm.binary)
...     plt.title(class_names[train_labels[i]])
... plt.show()

Refer to the following image of the end result:

Figure 12.8: The end result

In the next section, we will be building our CNN model to classify these clothing images.

Classifying clothing images with CNNs

As mentioned, the CNN model has two main components: the feature extractor, composed of a set of convolutional and pooling layers, and the classifier backend, which is similar to a regular neural network.

Architecting the CNN model

As the convolutional layer in Keras expects each individual sample to have three dimensions (height, width, and channel), we first need to reshape the data into four dimensions as follows:

>>> X_train = train_images.reshape((train_images.shape[0], 28, 28, 1))
>>> X_test = test_images.reshape((test_images.shape[0], 28, 28, 1))
>>> print(X_train.shape)
(60000, 28, 28, 1)

The first dimension is the number of samples, and the fourth dimension is the appended channel dimension, which is 1 because the images are grayscale.

Before we develop the CNN model, let's specify the random seed in TensorFlow for reproducibility:

>>> tf.random.set_seed(42)

We now import the necessary modules from Keras and initialize a Keras-based model:

>>> from tensorflow.keras import datasets, layers, models, losses
>>> model = models.Sequential()

For the convolutional extractor, we are going to use three convolutional layers. We start with the first convolutional layer with 32 small-sized 3 * 3 filters. This is implemented by the following code:

>>> model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))

Note that we use ReLU as the activation function.

The convolutional layer is followed by a max-pooling layer with a 2 * 2 filter:

>>> model.add(layers.MaxPooling2D((2, 2)))

Here comes the second convolutional layer. It has 64 3 * 3 filters and comes with a ReLU activation function as well:

>>> model.add(layers.Conv2D(64, (3, 3), activation='relu'))

The second convolutional layer is followed by another max-pooling layer with a 2 * 2 filter:

>>> model.add(layers.MaxPooling2D((2, 2)))

We continue adding the third convolutional layer. It has 128 3 * 3 filters at this time:

>>> model.add(layers.Conv2D(128, (3, 3), activation='relu'))

The resulting feature maps are then flattened to provide features to the downstream classifier backend:

>>> model.add(layers.Flatten())

For the classifier backend, we just use one hidden layer with 64 nodes:

>>> model.add(layers.Dense(64, activation='relu'))

The hidden layer here is the regular fully-connected dense layer, with ReLU as the activation function.

And finally, the output layer has 10 nodes representing 10 different classes in our case, along with a softmax activation:

>>> model.add(layers.Dense(10, activation='softmax'))

Now we compile the model with Adam as the optimizer, cross-entropy as the loss function, and classification accuracy as the metric:

>>> model.compile(optimizer='adam',
...               loss=losses.sparse_categorical_crossentropy,
...               metrics=['accuracy'])

Let's take a look at the model summary as follows:

>>> model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 128)         73856
_________________________________________________________________
flatten (Flatten)            (None, 1152)              0
_________________________________________________________________
dense (Dense)                (None, 64)                73792
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650
=================================================================
Total params: 167,114
Trainable params: 167,114
Non-trainable params: 0
_________________________________________________________________

It displays each layer in the model, the shape of its output, and the number of its trainable parameters. As you may notice, the output of a convolutional layer is three-dimensional: the first two values are the dimensions of the feature maps and the third is the number of filters used in the convolutional layer. The size (the first two dimensions) of each max-pooling output is half that of its input feature map in this example; feature maps are downsampled by the pooling layer. You may want to see how many trainable parameters there would be if you took out all the pooling layers. Actually, it is 4,058,314! So, the benefits of applying pooling are obvious: avoiding overfitting and reducing training cost.
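If you want to verify this yourself, here is a quick sketch (reusing the layers and models modules imported earlier; the name model_no_pool is just for illustration) that rebuilds the same stack without the pooling layers and prints its summary:

>>> model_no_pool = models.Sequential([
...     layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
...     layers.Conv2D(64, (3, 3), activation='relu'),
...     layers.Conv2D(128, (3, 3), activation='relu'),
...     layers.Flatten(),
...     layers.Dense(64, activation='relu'),
...     layers.Dense(10, activation='softmax')
... ])
>>> model_no_pool.summary()   # reports Total params: 4,058,314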

You may wonder why the numbers of convolutional filters keep increasing over the layers. Recall that each convolutional layer attempts to capture patterns of a specific hierarchy. The first convolutional layer captures low-level patterns, such as edges, dots, and curves. Then the subsequent layers combine those patterns extracted in previous layers to form high-level patterns, such as shapes and contours. As we move forward in these convolutional layers, there are more and more combinations of patterns to capture in most cases. As a result, we need to keep increasing (or at least not decreasing) the number of filters in the convolutional layers.

Fitting the CNN model

Now it's time to train the model we just built. We train it for 10 iterations and evaluate it using the testing samples:

>>> model.fit(X_train, train_labels, validation_data=(X_test, test_labels), epochs=10)

Note that the batch size is 32 by default. Here is how the training progresses:

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 68s 1ms/sample - loss: 0.4703 - accuracy: 0.8259 - val_loss: 0.3586 - val_accuracy: 0.8706
Epoch 2/10
60000/60000 [==============================] - 68s 1ms/sample - loss: 0.3056 - accuracy: 0.8882 - val_loss: 0.3391 - val_accuracy: 0.8783
Epoch 3/10
60000/60000 [==============================] - 69s 1ms/sample - loss: 0.2615 - accuracy: 0.9026 - val_loss: 0.2655 - val_accuracy: 0.9028
Epoch 4/10
60000/60000 [==============================] - 69s 1ms/sample - loss: 0.2304 - accuracy: 0.9143 - val_loss: 0.2506 - val_accuracy: 0.9096
Epoch 5/10
60000/60000 [==============================] - 69s 1ms/sample - loss: 0.2049 - accuracy: 0.9233 - val_loss: 0.2556 - val_accuracy: 0.9058
Epoch 6/10
60000/60000 [==============================] - 71s 1ms/sample - loss: 0.1828 - accuracy: 0.9312 - val_loss: 0.2497 - val_accuracy: 0.9122
Epoch 7/10
60000/60000 [==============================] - 68s 1ms/sample - loss: 0.1638 - accuracy: 0.9386 - val_loss: 0.3006 - val_accuracy: 0.9002
Epoch 8/10
60000/60000 [==============================] - 70s 1ms/sample - loss: 0.1453 - accuracy: 0.9455 - val_loss: 0.2662 - val_accuracy: 0.9119
Epoch 9/10
60000/60000 [==============================] - 69s 1ms/sample - loss: 0.1301 - accuracy: 0.9506 - val_loss: 0.2885 - val_accuracy: 0.9057
Epoch 10/10
60000/60000 [==============================] - 68s 1ms/sample - loss: 0.1163 - accuracy: 0.9559 - val_loss: 0.3081 - val_accuracy: 0.9100
10000/1 - 5s - loss: 0.2933 - accuracy: 0.9100

We are able to achieve an accuracy of around 96% on the training set and 91% on the test set.

If you want to double-check the performance on the test set, you can do the following:

>>> test_loss, test_acc = model.evaluate(X_test, test_labels, verbose=2)
>>> print('Accuracy on test set:', test_acc)
Accuracy on test set: 0.91

Now that we have a well-trained model, we can make predictions on the test set using the following code:

>>> predictions = model.predict(X_test)

Take a look at the first sample; we have the prediction as follows:

>>> print(predictions[0])
[1.8473367e-11 1.1924335e-07 1.0303306e-13 1.2061150e-12 3.1937938e-07
 3.5260896e-07 6.2364621e-13 9.1853758e-07 4.0739218e-11 9.9999821e-01]

We have the predicted probabilities for this sample. To obtain the predicted label, we do the following:

>>> import numpy as np
>>> print('Predicted label for the first test sample: ', np.argmax(predictions[0]))
Predicted label for the first test sample: 9

And we do a fact check as follows:

>>> print('True label for the first test sample: ',test_labels[0])
True label for the first test sample: 9

We take one step further by plotting the sample image and the prediction results, including the probabilities of 10 possible classes:

>>> def plot_image_prediction(i, images, predictions, labels, class_names):
...     plt.subplot(1,2,1)
...     plt.imshow(images[i], cmap=plt.cm.binary)
...     prediction = np.argmax(predictions[i])
...     color = 'blue' if prediction == labels[i] else 'red'
...     plt.title(f"{class_names[labels[i]]} (predicted 
            {class_names[prediction]})", color=color)
...     plt.subplot(1,2,2)
...     plt.grid(False)
...     plt.xticks(range(10))
...     plot = plt.bar(range(10), predictions[i], color="#777777")
...     plt.ylim([0, 1])
...     plot[prediction].set_color('red')
...     plot[labels[i]].set_color('blue')
...     plt.show()

The original image (on the left) will have the title <true label> (predicted <predicted label>) in blue if the prediction matches the label, or in red if not. The predicted probability (on the right) will be a blue bar on the true label, or a red bar on the predicted label if the predicted label is not the same as the true label.

Let's try it with the first test sample:

>>> plot_image_prediction(0, test_images, predictions, test_labels, class_names)

Refer to the following screenshot for the end result:

Figure 12.9: A sample of the original image with its prediction result

Feel free to play around with other samples, especially those that aren't predicted accurately, such as item 17.
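For example:

>>> plot_image_prediction(17, test_images, predictions, test_labels, class_names)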

You have seen how the trained model performs, and you may wonder what the learned convolutional filters look like. You will find the answer in the next section.

Visualizing the convolutional filters

We extract the convolutional filters from the trained model and visualize them with the following steps:

  1. From the model summary, we know that the layers at indexes 0, 2, and 4 in the model are convolutional layers. Using the second convolutional layer as an example, we obtain its filters as follows:
    >>> filters, _ = model.layers[2].get_weights()
    
  2. Next, we normalize the filter values to the range of 0 to 1 so we can visualize them more easily:
    >>> f_min, f_max = filters.min(), filters.max()
    >>> filters = (filters - f_min) / (f_max - f_min)
    
  3. Recall we have 64 filters in this convolutional layer. We visualize the first 16 filters in four rows and four columns:
    >>> n_filters = 16
    >>> for i in range(n_filters):
    ...     filter = filters[:, :, :, i]
    ...     plt.subplot(4, 4, i+1)
    ...     plt.xticks([])
    ...     plt.yticks([])
    ...     plt.imshow(filter[:, :, 0], cmap='gray')
    ... plt.show()
    

    Refer to the following screenshot for the end result:

Figure 12.10: Trained convolutional filters

In a convolutional filter, the dark squares represent small weights and the white squares indicate large weights. Based on this intuition, we can see that the second filter in the second row detects a vertical line in a receptive field, while the third filter in the first row detects a gradient from light in the bottom right to dark in the top left.

In the previous example, we trained the clothing image classifier with 60,000 labeled samples. However, it is not easy to gather such a big labeled dataset in reality. Specifically, image labeling is expensive and time-consuming. How can we effectively train an image classifier with a limited number of samples? One solution is data augmentation.

Boosting the CNN classifier with data augmentation

Data augmentation means expanding the size of an existing training dataset in order to improve the generalization performance. It overcomes the cost involved in collecting and labeling more data. In TensorFlow, we use the ImageDataGenerator module (https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) from the Keras API to implement image augmentation in real time.

Horizontal flipping for data augmentation

There are many ways to augment image data. The simplest one is probably flipping an image horizontally or vertically. For instance, we will have a new image if we flip an existing image horizontally. To generate horizontally flipped images, we create an image data generator, as follows:

>>> import os
>>> from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img 
>>> datagen = ImageDataGenerator(horizontal_flip=True)

We will create manipulated images using this generator. First, let's develop a utility function that, given an augmented image generator, generates images and displays them, as follows:

>>> def generate_plot_pics(datagen, original_img, save_prefix):
...     folder = 'aug_images'
...     os.makedirs(folder, exist_ok=True)  # the folder must exist before images can be saved to it
...     i = 0
...     for batch in datagen.flow(original_img.reshape((1, 28, 28, 1)),
...                               batch_size=1,
...                               save_to_dir=folder,
...                               save_prefix=save_prefix,
...                               save_format='jpeg'):
...         i += 1
...         if i > 2:
...             break
...     plt.subplot(2, 2, 1, xticks=[],yticks=[])
...     plt.imshow(original_img)
...     plt.title("Original")
...     i = 1
...     for file in os.listdir(folder):
...         if file.startswith(save_prefix):
...             plt.subplot(2, 2, i + 1, xticks=[],yticks=[])
...             aug_img = load_img(folder + "/" + file)
...             plt.imshow(aug_img)
...             plt.title(f"Augmented {i}")
...             i += 1
...     plt.show()

The generator first randomly generates three (in this example) images, given the original image and the augmentation settings. The function then plots the original image along with the three artificial images. The generated images are also stored on the local disk in the folder named aug_images.

Let's try it out with our horizontal_flip generator using the first training image (feel free to use any other image) as follows:

>>> generate_plot_pics(datagen, train_images[0], 'horizontal_flip')

Refer to the following screenshot for the end result:

Figure 12.11: Horizontally flipped images for data augmentation

As you can see, the generated images are either horizontally flipped or not flipped. Why don't we try one with both horizontal and vertical flips simultaneously? We can do so as follows:

>>> datagen = ImageDataGenerator(horizontal_flip=True,
...                              vertical_flip=True)
>>> generate_plot_pics(datagen, train_images[0], 'hv_flip')

Refer to the following screenshot for the end result:

Figure 12.12: Horizontally and vertically flipped images for data augmentation

This time, each generated image may be flipped horizontally, vertically, both ways, or not at all.

In general, the horizontally flipped images convey the same message as the original ones. Vertically flipped images are not frequently seen. It is also worth noting that flipping only works in orientation-insensitive cases, such as classifying cats and dogs or recognizing parts of cars. On the contrary, it is dangerous to do so in cases where orientation matters, such as classifying between right and left turn signs.

Rotation for data augmentation

Instead of the drastic changes produced by flipping, a small-to-medium degree of rotation can also be applied in image data augmentation. Let's see rotation in the following example:

>>> datagen = ImageDataGenerator(rotation_range=30)
>>> generate_plot_pics(datagen, train_images[0], 'rotation')

Refer to the following screenshot for the end result:

Figure 12.13: Rotated images for data augmentation

In the preceding example, the image is rotated by any degree ranging from -30 (counterclockwise) to 30 (clockwise).

Shifting for data augmentation

Shifting is another commonly used augmentation method. It generates new images by moving the original image horizontally or vertically by a small number of pixels. In TensorFlow, you can either specify the maximal number of pixels the image will be shifted by, or a maximal fraction of the width or height. Let's take a look at the following example, where we shift the image horizontally by at most 8 pixels:

>>> datagen = ImageDataGenerator(width_shift_range=8)
>>> generate_plot_pics(datagen, train_images[0], 'width_shift')

Refer to the following screenshot for the end result:

Figure 12.14: Horizontally shifted images for data augmentation

As you can see, the generated images are horizontally shifted by no more than 8 pixels. Let's now try shifting both horizontally and vertically at the same time:

>>> datagen = ImageDataGenerator(width_shift_range=8,
...                              height_shift_range=8)
>>> generate_plot_pics(datagen, train_images[0], 'width_height_shift')

Refer to the following screenshot for the end result:

Figure 12.15: Horizontally and vertically shifted images for data augmentation
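As mentioned, the shift can also be specified as a fraction of the image width or height rather than a pixel count. For example, the following sketch (with an illustrative factor of 0.1, that is, at most 10% of each dimension, and a made-up save_prefix) shifts the image in both directions:

>>> datagen = ImageDataGenerator(width_shift_range=0.1,
...                              height_shift_range=0.1)
>>> generate_plot_pics(datagen, train_images[0], 'fraction_shift')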

Improving the clothing image classifier with data augmentation

Armed with several common augmentation methods, we now apply them to train our image classifier on a small dataset in the following steps:

  1. We start by constructing a small training set:
    >>> n_small = 500
    >>> X_train = X_train[:n_small]
    >>> train_labels = train_labels[:n_small]
    >>> print(X_train.shape)
    (500, 28, 28, 1)
    

    We only use 500 samples for training.

  2. We architect the CNN model using the Keras Sequential API:
    >>> model = models.Sequential()
    >>> model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    >>> model.add(layers.MaxPooling2D((2, 2)))
    >>> model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    >>> model.add(layers.Flatten())
    >>> model.add(layers.Dense(32, activation='relu'))
    >>> model.add(layers.Dense(10, activation='softmax'))
    

    As we have training data of a small size, we use only two convolutional layers and adjust the size of the hidden layer accordingly: the first convolutional layer has 32 small-sized 3 * 3 filters, the second convolutional layer has 64 filters of the same size, and the fully-connected hidden layer has 32 nodes.

  3. We compile the model with Adam as the optimizer, cross-entropy as the loss function, and classification accuracy as the metric:
    >>> model.compile(optimizer='adam',
    ...               loss=losses.sparse_categorical_crossentropy,
    ...               metrics=['accuracy'])
    
  4. We first train the model without data augmentation:
    >>> model.fit(X_train, train_labels, validation_data=(X_test, test_labels), epochs=20, batch_size=40)
    Train on 500 samples, validate on 10000 samples
    Epoch 1/20
    500/500 [==============================] - 6s 11ms/sample - loss: 1.8791 - accuracy: 0.3200 - val_loss: 1.3738 - val_accuracy: 0.4288
    Epoch 2/20
    500/500 [==============================] - 4s 8ms/sample - loss: 1.1363 - accuracy: 0.6100 - val_loss: 1.0929 - val_accuracy: 0.6198
    Epoch 3/20
    500/500 [==============================] - 4s 9ms/sample - loss: 0.8669 - accuracy: 0.7140 - val_loss: 0.9237 - val_accuracy: 0.6753
    ……
    ……
    Epoch 18/20
    500/500 [==============================] - 5s 10ms/sample - loss: 0.1372 - accuracy: 0.9640 - val_loss: 0.7142 - val_accuracy: 0.7947
    Epoch 19/20
    500/500 [==============================] - 5s 10ms/sample - loss: 0.1195 - accuracy: 0.9600 - val_loss: 0.6885 - val_accuracy: 0.7982
    Epoch 20/20
    500/500 [==============================] - 5s 10ms/sample - loss: 0.0944 - accuracy: 0.9780 - val_loss: 0.7342 - val_accuracy: 0.7924
    

    We train the model for 20 iterations.

  5. Let's see how it performs on the test set:
    >>> test_loss, test_acc = model.evaluate(X_test, test_labels, verbose=2)
    >>> print('Accuracy on test set:', test_acc)
       Accuracy on test set: 0.7924
    

    The model without data augmentation has a classification accuracy of 79.24% on the test set.

  6. Now we work on the data augmentation and see if it can boost the performance. We first define the augmented data generator:
    >>> datagen = ImageDataGenerator(height_shift_range=3,
    ...                              horizontal_flip=True
    ...                              )
    

    We herein apply horizontal flipping and vertical shifting. We notice that none of the clothing images are upside down, hence vertical flipping won't produce any normal-looking images. Also, most clothing images are perfectly horizontally centered, so we are not going to perform any width shift. In short, we try to avoid creating augmented images that look unlike anything in the original dataset.

  7. We clone the CNN model we used previously:
    >>> model_aug = tf.keras.models.clone_model(model)
    

    It only copies the CNN architecture and creates new weights instead of sharing the weights of the existing model.

    We compile the cloned model as before, with Adam as the optimizer, cross-entropy as the loss function, and classification accuracy as the metric:

    >>> model_aug.compile(optimizer='adam',
    ...               loss=losses.sparse_categorical_crossentropy,
    ...               metrics=['accuracy'])
    
  8. Finally, we fit this CNN model on data with real-time augmentation:
    >>> train_generator = datagen.flow(X_train, train_labels, seed=42, batch_size=40)
    >>> model_aug.fit(train_generator, epochs=50, validation_data=(X_test, test_labels))
    Epoch 1/50
    13/13 [==============================] - 5s 374ms/step - loss: 2.2150 - accuracy: 0.2060 - val_loss: 2.0099 - val_accuracy: 0.3104
    ……
    ……
    Epoch 48/50
    13/13 [==============================] - 4s 300ms/step - loss: 0.1541 - accuracy: 0.9460 - val_loss: 0.7367 - val_accuracy: 0.8003
    Epoch 49/50
    13/13 [==============================] - 4s 304ms/step - loss: 0.1487 - accuracy: 0.9340 - val_loss: 0.7211 - val_accuracy: 0.8035
    Epoch 50/50
    13/13 [==============================] - 4s 306ms/step - loss: 0.1031 - accuracy: 0.9680 - val_loss: 0.7446 - val_accuracy: 0.8109
    

    During the training process, augmented images are randomly generated on the fly to feed the model. We train the model with data augmentation for 50 iterations this time, as it takes more iterations for the model to learn the patterns.

  9. Let's see how it performs on the test set:
    >>> test_loss, test_acc = model_aug.evaluate(X_test, test_labels, verbose=2)
    >>> print('Accuracy on test set:', test_acc)
       Accuracy on test set: 0.8109
    

    The accuracy increases to 81.09% from 79.24% with data augmentation.

Feel free to fine-tune the hyperparameters as we did in Chapter 8, Predicting Stock Prices with Artificial Neural Networks, and see if you can further improve the classification performance.

Summary

In this chapter, we worked on classifying clothing images using CNNs. We started with a detailed explanation of the individual components of a CNN model and learned how CNNs are inspired by the way our visual cells work. We then developed a CNN model to categorize Fashion-MNIST clothing images from Zalando. We also talked about data augmentation and several popular image augmentation methods. We practiced implementing deep learning models again with the Keras module in TensorFlow.

In the next chapter, we will focus on another type of deep learning networks: Recurrent Neural Networks (RNNs). CNNs and RNNs are the two most powerful deep neural networks that make deep learning so popular nowadays.

Exercises

  1. As mentioned before, can you try to fine-tune the CNN image classifier and see if you can beat what we have achieved?
  2. Can you also employ dropout and early stopping techniques?