In the previous chapter, we implemented a GAN whose Generator and Discriminator were simple feed-forward neural networks with a single hidden layer. Despite this simplicity, many of the images of handwritten digits that the GAN’s Generator produced after being fully trained were remarkably convincing. Even the ones that were not recognizable as human-written numerals had many of the hallmarks of handwritten symbols, such as discernible line edges and shapes—especially when compared to the random noise used as the Generator’s raw input.
Imagine what we could accomplish with a more powerful network architecture. In this chapter, we will do just that: instead of simple two-layer feed-forward networks, both our Generator and Discriminator will be implemented as convolutional neural networks (CNNs, or ConvNets). The resulting GAN architecture is known as Deep Convolutional GAN, or DCGAN for short.
Before delving into the nitty-gritty of the DCGAN implementation, we will review the key concepts underlying ConvNets, review the history behind the discovery of the DCGAN, and cover one of the key breakthroughs that made complex architectures like DCGAN possible in practice: batch normalization.
We expect that you’ve already been exposed to convolutional networks; that said, if this technique is new to you, don’t worry. In this section, we review all the key concepts you need for this chapter and the rest of this book.
Unlike a regular feed-forward neural network whose neurons are arranged in flat, fully connected layers, layers in a ConvNet are arranged in three dimensions (width × height × depth). Convolutions are performed by sliding one or more filters over the input layer. Each filter has a relatively small receptive field (width × height) but always extends through the entire depth of the input volume.
At every step as it slides across the input, each filter outputs a single activation value: the dot product between the input values and the filter entries. This process results in a two-dimensional activation map for each filter. The activation maps produced by each filter are then stacked on top of one another to produce a three-dimensional output layer; the output depth is equal to the number of filters used.
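To make the sliding-filter mechanics concrete, here is a minimal NumPy sketch of the operation just described. It is an illustration only, not how Keras implements convolution: the function names, the toy 5 × 5 × 3 input, and the four random 3 × 3 filters are all assumptions chosen for the example, and the sketch uses no padding.

```python
import numpy as np

def conv2d_single_filter(volume, kernel, stride=1):
    """Slide one filter over an (H, W, D) volume; return a 2-D activation map."""
    h, w, d = volume.shape
    kh, kw, kd = kernel.shape
    assert kd == d, "a filter always spans the full input depth"
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = volume[i * stride:i * stride + kh,
                           j * stride:j * stride + kw, :]
            # One activation per step: dot product of input patch and filter entries
            activation_map[i, j] = np.sum(patch * kernel)
    return activation_map

def conv2d(volume, filters, stride=1):
    """Apply several filters and stack their activation maps along the depth axis."""
    maps = [conv2d_single_filter(volume, f, stride) for f in filters]
    return np.stack(maps, axis=-1)

rng = np.random.default_rng(0)
volume = rng.normal(size=(5, 5, 3))       # toy input: 5 x 5 with depth 3
filters = rng.normal(size=(4, 3, 3, 3))   # four 3 x 3 filters, each of depth 3
output = conv2d(volume, filters)
print(output.shape)  # (3, 3, 4): output depth equals the number of filters
```

Each filter collapses the full input depth into a single number per position, which is why the output depth depends only on how many filters we use.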
Importantly, filter parameters are shared by all the input values to the given filter. This has both intuitive and practical advantages. Intuitively, parameter sharing allows us to efficiently learn visual features and shapes (such as lines and edges) regardless of where they are located in the input image. From a practical perspective, parameter sharing drastically reduces the number of trainable parameters. This decreases the risk of overfitting and allows this technique to scale up to higher-resolution images without the steep growth in trainable parameters that a traditional, fully connected network would incur, because the number of weights in a convolutional layer does not depend on the width and height of the input.
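A quick back-of-the-envelope count shows how much parameter sharing saves. The sketch below compares a convolutional layer with a fully connected layer as the image resolution grows. The layer sizes (32 filters of size 3 × 3, and a dense layer with 128 units) are hypothetical values picked for illustration, and the helper names are our own.

```python
def conv_params(kernel_h, kernel_w, in_depth, n_filters):
    """Weights plus one bias per filter; independent of input width and height."""
    return (kernel_h * kernel_w * in_depth + 1) * n_filters

def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return (n_inputs + 1) * n_outputs

for side in (28, 56, 112):
    n_pixels = side * side * 1                  # grayscale image, flattened
    conv = conv_params(3, 3, 1, 32)             # hypothetical: 32 filters of 3 x 3
    dense = dense_params(n_pixels, 128)         # hypothetical: 128 dense units
    print(f"{side} x {side}: conv = {conv:,} parameters, dense = {dense:,} parameters")
```

The convolutional count stays fixed while the dense count grows in proportion to the number of pixels.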
If all this sounds confusing, let’s make these concepts a little less abstract by visualizing them. Diagrams make everything easier to understand for most people (us included!). Figure 4.1 shows a single convolution operation; figure 4.2 illustrates the convolution operation in the context of the input and output layers in a ConvNet.
(Source: “A Guide to Convolution Arithmetic for Deep Learning,” by Vincent Dumoulin and Francesco Visin, 2016, https://arxiv.org/abs/1603.07285.)
Figure 4.1 depicts the convolution operation for a single filter over a two-dimensional input. In practice, the input volume is usually three-dimensional, and we use several stacked filters. The underlying mechanics, however, remain the same: each filter produces a single value per step, regardless of the depth of the input volume. The number of filters we use determines the depth of the output volume, as their resulting activation maps are stacked on top of one another. All this is illustrated in figure 4.2.
(Source: “Convolutional Neural Network,” by Nameer Hirschkind et al., Brilliant.org, retrieved November 1, 2018, http://mng.bz/8zJK.)
If you would like to dive deeper into convolutional networks and the underlying concepts, we recommend reading the relevant chapters in François Chollet’s Deep Learning with Python (Manning, 2017), which provides an outstanding, hands-on introduction to all the key concepts and techniques in deep learning, including ConvNets. For those with a more academic bent, a great resource is Andrej Karpathy’s excellent lecture notes from his Stanford University class on Convolutional Neural Networks for Visual Recognition (http://cs231n.github.io/convolutional-networks/).
Introduced in 2016 by Alec Radford, Luke Metz, and Soumith Chintala, DCGAN marked one of the most important early innovations in GANs since the technique’s inception two years earlier.[1] This was not the first time a group of researchers tried harnessing ConvNets for use in GANs, but it was the first successful attempt to incorporate ConvNets directly into a full-scale GAN model.
[1] See “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” by Alec Radford et al., 2015, https://arxiv.org/abs/1511.06434.
The use of ConvNets exacerbates many of the difficulties plaguing GAN training, including instability and gradient saturation. Indeed, these challenges proved so daunting that some researchers resorted to alternative approaches, such as the LAPGAN, which uses a cascade of convolutional networks within a Laplacian pyramid, with a separate ConvNet being trained at each level using the GAN framework.[2] If none of this makes sense to you, don’t worry. Superseded by superior methods, LAPGAN has been largely relegated to the dustbin of history, so it is not important to understand its internals.
[2] See “Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks,” by Emily Denton et al., 2015, https://arxiv.org/abs/1506.05751.
Although inelegant, complex, and computationally taxing, LAPGAN yielded the highest-quality images to date at the time of its publication, with a fourfold improvement over the original GAN (40% versus 10% of generated images mistaken for real by human evaluators). As such, LAPGAN demonstrated the enormous potential of marrying GANs with ConvNets.
With DCGAN, Radford and his collaborators introduced techniques and optimizations that allowed ConvNets to scale up to the full GAN framework without the need to modify the underlying GAN architecture and without reducing GAN to a subroutine of a more complex model framework, like LAPGAN. One of the key techniques Radford et al. used is batch normalization, which helps stabilize the training process by normalizing inputs at each layer where it is applied. Let’s take a closer look at what batch normalization is and how it works.
Batch normalization was introduced by Google scientists Sergey Ioffe and Christian Szegedy in 2015.[3] Their insight was as simple as it was groundbreaking. Just as we normalize network inputs, they proposed to normalize the inputs to each layer, for each training mini-batch as it flows through the network.
[3] See “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” by Sergey Ioffe and Christian Szegedy, 2015, https://arxiv.org/abs/1502.03167.
It helps to remind ourselves what normalization is and why we bother normalizing the input feature values in the first place. Normalization is the scaling of data so that it has zero mean and unit variance. This is accomplished by taking each data point x, subtracting the mean μ, and dividing the result by the standard deviation, σ, as shown in equation 4.1:

\[ \hat{x} = \frac{x - \mu}{\sigma} \tag{4.1} \]
Normalization has several advantages. Perhaps most important, it makes comparisons between features with vastly different scales easier and, by extension, makes the training process less sensitive to the scale of the features. Consider the following (rather contrived) example. Imagine we are trying to predict the monthly expenditures of a family based on two features: the family’s annual income and the family size. We would expect that, in general, the more a family earns, the more they spend; and the bigger a family is, the more they spend.
However, the scales of these features are vastly different—an extra $10 in annual income probably wouldn’t influence how much a family spends, but an additional 10 members would likely wreak havoc on any family’s budget. Normalization solves this problem by scaling each feature value onto a standardized scale, such that each data point is expressed not as its face value but as a relative “score” indicating how many standard deviations the given data point is from the mean.
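The family-budget example can be sketched in a few lines of NumPy. The four data points below are made up purely for illustration; the point is only that, after normalization, both features live on the same scale.

```python
import numpy as np

# Hypothetical data: each row is a family, columns are
# annual income (USD) and family size.
X = np.array([[40_000.0, 2.0],
              [65_000.0, 4.0],
              [90_000.0, 3.0],
              [120_000.0, 5.0]])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature standard deviation
X_norm = (X - mu) / sigma  # each value becomes a "score" in standard deviations

print(X_norm.round(2))
print(X_norm.mean(axis=0).round(6), X_norm.std(axis=0).round(6))
```

After this step, an entry of 1.0 in either column means the same thing: one standard deviation above that feature’s mean.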
The insight behind batch normalization is that normalizing inputs alone may not go far enough when dealing with deep neural networks with many layers. As the input values flow through the network, from one layer to the next, they are scaled by the trainable parameters in each of those layers. And as the parameters get tuned by backpropagation, the distribution of each layer’s inputs is prone to change in subsequent training iterations, which destabilizes the learning process. Ioffe and Szegedy call this problem internal covariate shift. Batch normalization addresses it by scaling values in each mini-batch by the mean and variance of that mini-batch.
The way batch normalization is computed differs in several respects from the simple normalization equation we presented earlier. This section walks through it step by step.
Let μ_B be the mean of the mini-batch B, and σ_B² be the variance (mean squared deviation) of the mini-batch B. The normalized value x̂ is computed as shown in equation 4.2:

\[ \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{4.2} \]
The term ϵ (epsilon) is added for numerical stability, primarily to avoid division by zero. It is set to a small positive constant value, such as 0.001.
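In batch normalization, we do not use these normalized values directly. Instead, we multiply them by γ (gamma) and add β (beta) before passing them as inputs to the next layer, as shown in equation 4.3, where x̂ denotes the normalized value and y the value passed on:

\[ y = \gamma \hat{x} + \beta \tag{4.3} \]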
Importantly, the terms γ and β are trainable parameters, which—just like weights and biases—are tuned during network training. The reason for this is that it may be beneficial for the intermediate input values to be standardized around a mean other than 0 and have a variance other than 1. Because γ and β are trainable, the network can learn what values work best.
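The steps above can be sketched as a single NumPy function. This is a simplified, training-time forward pass only, written under our own assumptions: the function name and the toy mini-batch are hypothetical, γ and β are fixed rather than learned, and we leave out the moving averages that a full implementation keeps for use at inference time.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Normalize a mini-batch of shape (batch_size, n_features), then scale and shift."""
    mu_b = x.mean(axis=0)                       # mini-batch mean, per feature
    var_b = x.var(axis=0)                       # mini-batch variance, per feature
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)   # eps guards against division by zero
    return gamma * x_hat + beta                 # gamma and beta are trainable in practice

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))  # toy mini-batch, far from zero mean

# With gamma = 1 and beta = 0, the output is simply the normalized mini-batch.
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(4), y.std(axis=0).round(4))

# A feature that is constant across the mini-batch has zero variance;
# thanks to eps, the result is still finite.
constant = np.full((8, 1), 2.0)
print(batch_norm_forward(constant, gamma=np.ones(1), beta=np.zeros(1)).ravel())
```

Setting γ and β to other values shifts and rescales the output, which is how the network can settle on a mean other than 0 and a variance other than 1.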
Fortunately for us, we don’t have to worry about any of this. The Keras layer keras.layers.BatchNormalization handles all the mini-batch computations and updates behind the scenes for us.
Batch normalization limits the amount by which updating the parameters in the previous layers can affect the distribution of inputs received by the current layer. This decreases any unwanted interdependence between parameters across layers, which helps speed up the network training process and increase its robustness, especially when it comes to network parameter initialization.
Batch normalization has proven essential to the viability of many deep learning architectures, including the DCGAN, which you will see in action in the following tutorial.
In this tutorial, we will revisit the MNIST dataset of handwritten digits from chapter 3. This time, however, we will use the DCGAN architecture and represent both the Generator and the Discriminator as convolutional networks, as shown in figure 4.3. Apart from this change, the rest of the setup stays the same. At the end of the tutorial, we will compare the quality of the handwritten numerals produced by the two GANs (traditional versus DCGAN) so you can see the improvement made possible by the use of a more advanced network architecture.
As in chapter 3, much of the code in this tutorial was adapted from Erik Linder-Norén’s open source GitHub repository of GAN models in Keras (https://github.com/eriklindernoren/Keras-GAN), with numerous modifications and improvements spanning both the implementation details and network architectures. A Jupyter notebook with the full implementation, including added visualizations of the training progress, is available in the GitHub repository for this book at https://github.com/GANs-in-Action/gans-in-action, under the chapter-4 folder. The code was tested with Python 3.6.0, Keras 2.1.6, and TensorFlow 1.8.0. To speed up the training time, it is recommended to run the model on a GPU.
First, we import all the packages, modules, and libraries we need to train and run the model. Just as in chapter 3, the MNIST dataset of handwritten digits is imported directly from keras.datasets.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

from keras.datasets import mnist
from keras.layers import (
    Activation, BatchNormalization, Dense, Dropout, Flatten, Reshape)
from keras.layers.advanced_activations import LeakyReLU
from keras.layers.convolutional import Conv2D, Conv2DTranspose
from keras.models import Sequential
from keras.optimizers import Adam
We also specify the model input dimensions: the image shape and the length of the noise vector z.
img_rows = 28
img_cols = 28
channels = 1

# Input image dimensions
img_shape = (img_rows, img_cols, channels)

# Size of the noise vector, used as input to the Generator
z_dim = 100
ConvNets have traditionally been used for image classification tasks, in which the network takes in an image with the dimensions height × width × number of color channels as input and—through a series of convolutional layers—outputs a single vector of class scores, with the dimensions 1 × n, where n is the number of class labels. To generate an image by using the ConvNet architecture, we reverse the process: instead of taking an image and processing it into a vector, we take a vector and up-size it to an image.
Key to this process is the transposed convolution. Recall that regular convolution is typically used to reduce input width and height while increasing its depth. Transposed convolution goes in the reverse direction: it is used to increase the width and height while reducing depth, as you can see in the Generator network diagram in figure 4.4.
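One way to build intuition for transposed convolution is to think of each input value as “stamping” a scaled copy of the filter onto a larger output grid, with overlapping stamps added together. The NumPy sketch below implements that idea for a single channel and a single filter. It is an illustration under our own assumptions (the function name, the 2 × 2 input, and the all-ones kernel are made up), and it uses no padding, so the output is slightly larger than the input size times the stride; Keras with padding='same' trims the result to exactly that size.

```python
import numpy as np

def conv2d_transpose_single(x, kernel, stride=2):
    """Single-channel transposed convolution without padding."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            # Each input value adds a scaled copy of the kernel to the output,
            # placed `stride` cells apart; overlapping regions are summed.
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.arange(1.0, 5.0).reshape(2, 2)   # toy 2 x 2 input: [[1, 2], [3, 4]]
kernel = np.ones((3, 3))                # toy 3 x 3 filter
y = conv2d_transpose_single(x, kernel, stride=2)
print(y.shape)  # (5, 5): the base grows from 2 x 2 to 5 x 5
print(y)
```

The center of the output receives a contribution from all four input values, whereas the corners receive only one each.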
The Generator starts with a noise vector z. Using a fully connected layer, we reshape the vector into a three-dimensional hidden layer with a small base (width × height) and large depth. Using transposed convolutions, the input is progressively reshaped such that its base grows while its depth decreases until we reach the final layer with the shape of the image we are seeking to synthesize, 28 × 28 × 1. After each transposed convolution layer, we apply batch normalization and the Leaky ReLU activation function. At the final layer, we do not apply batch normalization and, instead of Leaky ReLU, we use the tanh activation function.
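For reference, the two activation functions mentioned above are easy to write out in NumPy. The sketch below is only meant to show their shapes; the helper name leaky_relu and the sample inputs are our own, and the slope of 0.01 for negative inputs is the value used in this chapter’s listings.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Pass positive values through; scale negative values by a small slope."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

hidden = leaky_relu(x)   # negative inputs keep a small, nonzero signal
output = np.tanh(x)      # squashes every value into the range (-1, 1)

print(hidden)
print(output.round(3))
```

Because tanh keeps outputs between -1 and 1, it pairs naturally with training images whose pixel values are rescaled to that same range.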
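Putting all the steps together, we do the following:

1. Take a random noise vector and reshape it into a 7 × 7 × 256 tensor through a fully connected layer.
2. Use transposed convolution, transforming the 7 × 7 × 256 tensor into a 14 × 14 × 128 tensor.
3. Apply batch normalization and the Leaky ReLU activation function.
4. Use transposed convolution, transforming the 14 × 14 × 128 tensor into a 14 × 14 × 64 tensor. The width and height remain unchanged because the stride is set to 1.
5. Apply batch normalization and the Leaky ReLU activation function.
6. Use transposed convolution, transforming the 14 × 14 × 64 tensor into the output image size, 28 × 28 × 1.
7. Apply the tanh activation function.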
The following listing shows what the Generator network looks like when implemented in Keras.
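def build_generator(z_dim):

    model = Sequential()

    # Reshape the noise vector into a 7x7x256 tensor via a fully connected layer
    model.add(Dense(256 * 7 * 7, input_dim=z_dim))
    model.add(Reshape((7, 7, 256)))

    # Transposed convolution layer, from 7x7x256 into 14x14x128 tensor
    model.add(Conv2DTranspose(128, kernel_size=3, strides=2, padding='same'))
    model.add(BatchNormalization())     # Batch normalization
    model.add(LeakyReLU(alpha=0.01))    # Leaky ReLU activation

    # Transposed convolution layer, from 14x14x128 into 14x14x64 tensor
    model.add(Conv2DTranspose(64, kernel_size=3, strides=1, padding='same'))
    model.add(BatchNormalization())     # Batch normalization
    model.add(LeakyReLU(alpha=0.01))    # Leaky ReLU activation

    # Transposed convolution layer, from 14x14x64 into 28x28x1 tensor
    model.add(Conv2DTranspose(1, kernel_size=3, strides=2, padding='same'))

    # Output layer with tanh activation
    model.add(Activation('tanh'))

    return model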
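If you want to double-check the tensor shapes in the Generator without building the model, the arithmetic can be traced by hand. The sketch below assumes the rule that, with padding='same', a transposed convolution multiplies the width and height by the stride; the helper name and the layer list are our own shorthand for the listing above.

```python
def conv_transpose_same(size, stride):
    """Output width/height of a transposed convolution with 'same' padding."""
    return size * stride

shape = (7, 7, 256)                      # after the Dense and Reshape layers
generator_layers = [(128, 2), (64, 1), (1, 2)]   # (filters, stride) per layer

trace = [shape]
for n_filters, stride in generator_layers:
    shape = (conv_transpose_same(shape[0], stride),
             conv_transpose_same(shape[1], stride),
             n_filters)
    trace.append(shape)

for step in trace:
    print(step)
```

The trace ends at 28 × 28 × 1, the shape of an MNIST image. In Keras, calling model.summary() on the built Generator reports the same per-layer shapes.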
The Discriminator is a ConvNet of the familiar kind, one that takes in an image and outputs a prediction vector: in this case, a binary classification indicating whether the input image was deemed to be real rather than fake. Figure 4.5 depicts the Discriminator network we will implement.
The input to the Discriminator is a 28 × 28 × 1 image. By applying convolutions, the image is transformed such that its base (width × height) gets progressively smaller and its depth gets progressively deeper. On all convolutional layers, we apply the Leaky ReLU activation function. Batch normalization is used on all convolutional layers except the first. For output, we use a fully connected layer and the sigmoid activation function.
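Putting all the steps together, we do the following:

1. Use a convolutional layer to transform a 28 × 28 × 1 input image into a 14 × 14 × 32 tensor.
2. Apply the Leaky ReLU activation function.
3. Use a convolutional layer, transforming the 14 × 14 × 32 tensor into a 7 × 7 × 64 tensor.
4. Apply batch normalization and the Leaky ReLU activation function.
5. Use a convolutional layer, transforming the 7 × 7 × 64 tensor into a 4 × 4 × 128 tensor.
6. Apply batch normalization and the Leaky ReLU activation function.
7. Flatten the 4 × 4 × 128 tensor into a vector of size 4 × 4 × 128 = 2,048.
8. Use a fully connected layer feeding into the sigmoid activation function to compute the probability that the input image is real.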
The following listing is a Keras implementation of the Discriminator model.
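def build_discriminator(img_shape):

    model = Sequential()

    # Convolutional layer, from 28x28x1 into 14x14x32 tensor
    # (only the first layer needs input_shape)
    model.add(
        Conv2D(32, kernel_size=3, strides=2,
               input_shape=img_shape, padding='same'))
    model.add(LeakyReLU(alpha=0.01))    # Leaky ReLU activation

    # Convolutional layer, from 14x14x32 into 7x7x64 tensor
    model.add(Conv2D(64, kernel_size=3, strides=2, padding='same'))
    model.add(BatchNormalization())     # Batch normalization
    model.add(LeakyReLU(alpha=0.01))    # Leaky ReLU activation

    # Convolutional layer, from 7x7x64 into 4x4x128 tensor
    model.add(Conv2D(128, kernel_size=3, strides=2, padding='same'))
    model.add(BatchNormalization())     # Batch normalization
    model.add(LeakyReLU(alpha=0.01))    # Leaky ReLU activation

    # Output layer with sigmoid activation
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))

    return model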
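The Discriminator’s tensor shapes can be traced the same way. The sketch below assumes the rule that, with padding='same', a strided convolution produces an output width and height equal to the input size divided by the stride, rounded up; the helper name and the layer list are our own shorthand for the listing above.

```python
import math

def conv_same(size, stride):
    """Output width/height of a convolution with 'same' padding."""
    return math.ceil(size / stride)

shape = (28, 28, 1)                                  # input image
discriminator_layers = [(32, 2), (64, 2), (128, 2)]  # (filters, stride) per layer

trace = [shape]
for n_filters, stride in discriminator_layers:
    shape = (conv_same(shape[0], stride),
             conv_same(shape[1], stride),
             n_filters)
    trace.append(shape)

flattened = shape[0] * shape[1] * shape[2]   # length of the vector after Flatten

for step in trace:
    print(step)
print(flattened)
```

Note the rounding in the last convolution: a 7 × 7 base with a stride of 2 becomes 4 × 4, so the flattened vector feeding the final dense layer has 2,048 entries.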
Aside from the network architectures used for the Generator and the Discriminator, the rest of the DCGAN network setup and implementation is the same as the one we used for the simple GAN in chapter 3. This underscores the versatility of the GAN architecture. Listing 4.5 builds the model, and listing 4.6 trains it.
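def build_gan(generator, discriminator):

    model = Sequential()

    # Combined Generator + Discriminator model
    model.add(generator)
    model.add(discriminator)

    return model


# Build and compile the Discriminator
discriminator = build_discriminator(img_shape)
discriminator.compile(loss='binary_crossentropy',
                      optimizer=Adam(),
                      metrics=['accuracy'])

# Build the Generator
generator = build_generator(z_dim)

# Keep the Discriminator's parameters constant for Generator training
discriminator.trainable = False

# Build and compile the GAN model with a fixed Discriminator to train the Generator
gan = build_gan(generator, discriminator)
gan.compile(loss='binary_crossentropy', optimizer=Adam())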
losses = []
accuracies = []
iteration_checkpoints = []


def train(iterations, batch_size, sample_interval):

    # Load the MNIST dataset
    (X_train, _), (_, _) = mnist.load_data()

    # Rescale [0, 255] grayscale pixel values to [-1, 1]
    X_train = X_train / 127.5 - 1.0
    X_train = np.expand_dims(X_train, axis=3)

    # Labels for real images: all 1s
    real = np.ones((batch_size, 1))

    # Labels for fake images: all 0s
    fake = np.zeros((batch_size, 1))

    for iteration in range(iterations):

        # Get a random batch of real images
        idx = np.random.randint(0, X_train.shape[0], batch_size)
        imgs = X_train[idx]

        # Generate a batch of fake images
        z = np.random.normal(0, 1, (batch_size, z_dim))
        gen_imgs = generator.predict(z)

        # Train the Discriminator
        d_loss_real = discriminator.train_on_batch(imgs, real)
        d_loss_fake = discriminator.train_on_batch(gen_imgs, fake)
        d_loss, accuracy = 0.5 * np.add(d_loss_real, d_loss_fake)

        # Sample fresh noise and train the Generator through the combined model
        z = np.random.normal(0, 1, (batch_size, z_dim))
        g_loss = gan.train_on_batch(z, real)

        if (iteration + 1) % sample_interval == 0:

            # Save losses and accuracies so they can be plotted after training
            losses.append((d_loss, g_loss))
            accuracies.append(100.0 * accuracy)
            iteration_checkpoints.append(iteration + 1)

            # Output training progress
            print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" %
                  (iteration + 1, d_loss, 100.0 * accuracy, g_loss))

            # Output a sample of generated images
            sample_images(generator)
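It may not be obvious why the Generator is trained with the labels for real images. The NumPy sketch below works through the binary cross-entropy arithmetic on a made-up example. The Discriminator outputs are hypothetical numbers, and the binary_crossentropy helper is our own simplified stand-in for the Keras loss, not its actual implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1.0 - y_true) * np.log(1.0 - y_pred))))

# Hypothetical Discriminator outputs (probability that an image is real)
d_on_real = np.array([0.9, 0.8])   # fairly confident the real images are real
d_on_fake = np.array([0.2, 0.1])   # fairly confident the fake images are fake

real = np.ones(2)
fake = np.zeros(2)

# Discriminator loss: average of the loss on real and on fake images
d_loss = 0.5 * (binary_crossentropy(real, d_on_real)
                + binary_crossentropy(fake, d_on_fake))

# Generator loss: the same fake images, but scored against the "real" labels
g_loss = binary_crossentropy(real, d_on_fake)

print(round(d_loss, 4), round(g_loss, 4))
```

When the Discriminator easily spots the fakes, its own loss is low while the Generator’s loss is high. Minimizing the Generator’s loss therefore pushes it toward images that the Discriminator classifies as real.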
For completeness, we are also including the sample_images() function in the following listing. Recall from chapter 3 that this function outputs a 4 × 4 grid of images synthesized by the Generator in a given training iteration.
def sample_images(generator, image_grid_rows=4, image_grid_columns=4):

    # Sample random noise
    z = np.random.normal(0, 1, (image_grid_rows * image_grid_columns, z_dim))

    # Generate images from random noise
    gen_imgs = generator.predict(z)

    # Rescale image pixel values from [-1, 1] to [0, 1]
    gen_imgs = 0.5 * gen_imgs + 0.5

    # Set up the image grid
    fig, axs = plt.subplots(image_grid_rows,
                            image_grid_columns,
                            figsize=(4, 4),
                            sharey=True,
                            sharex=True)

    cnt = 0
    for i in range(image_grid_rows):
        for j in range(image_grid_columns):
            # Output a grid of images
            axs[i, j].imshow(gen_imgs[cnt, :, :, 0], cmap='gray')
            axs[i, j].axis('off')
            cnt += 1
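The two rescaling steps in the listings, one applied to the training images and one applied to the generated images before plotting, are easy to verify on a handful of values. The pixel values below are chosen arbitrarily for illustration.

```python
import numpy as np

pixels = np.array([0.0, 51.0, 127.5, 204.0, 255.0])  # sample grayscale values

# Training preprocessing: map [0, 255] onto [-1, 1], the output range of tanh
scaled = pixels / 127.5 - 1.0

# Plotting: map [-1, 1] onto [0, 1] so the images display correctly
restored = 0.5 * scaled + 0.5

print(scaled)
print(restored)
```

The second step does not recover the original 0 to 255 values; it yields the same image expressed on a 0 to 1 scale, which is what we need for display.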
Next, the following code is used to run the model.
# Set hyperparameters
iterations = 20000
batch_size = 128
sample_interval = 1000

# Train the DCGAN for the specified number of iterations
train(iterations, batch_size, sample_interval)
Figure 4.6 shows a sample of handwritten digits produced by the Generator after the DCGAN is fully trained. For a side-by-side comparison, figure 4.7 shows a sample of digits produced by the GAN from chapter 3, and figure 4.8 shows a sample of real handwritten numerals from the MNIST dataset.
As evidenced by the preceding figures, all the extra work we put into implementing DCGAN paid off handsomely. Many of the images of handwritten digits that the network produces after being fully trained are virtually indistinguishable from the ones written by a human hand.
DCGAN demonstrates the versatility of the GAN framework. In theory, the Discriminator and Generator can be represented by any differentiable function, even one as complex as a multilayer convolutional network. However, DCGAN also demonstrates that there are significant hurdles to making more complex implementations work in practice. Without breakthroughs such as batch normalization, DCGAN would fail to train properly.
In the following chapter, we will explore some of the theoretical and practical limitations that make GAN training so challenging as well as the approaches to overcome them.