GANs work a lot like an art forger and a museum curator. Every day, the art forger tries to sell some fake art to the museum, and every day the curator tries to distinguish whether a certain piece is real or fake. The forger learns from their failures. By trying to fool the curator and observing what leads to success and failure, they become a better forger. But the curator learns too. By trying to stay ahead of the forger, they become a better curator. As time passes, the forgeries become better and so does the distinguishing process. After years of battle, the art forger is an expert that can draw just as well as Picasso and the curator is an expert that can distinguish a real painting by tiny details.
Technically, a GAN consists of two neural networks: a generator, which produces data from a random latent vector, and a discriminator, which classifies data as "real," that is, stemming from the training set, or "fake," that is, stemming from the generator.
We can visualize this scheme in the following diagram:
Once again, generative models are easier to understand when images are generated, so in this section, we will look at image data, although all kinds of data can be used.
The training process for a GAN works as follows:
Note: GAN training has a lot of similarities to the visualization of the network layers that we discussed in Chapter 3, Utilizing Computer Vision. Only this time, we don't just create one image that maximizes an activation function; instead, we create a generative network that specializes in maximizing the activation function of another network.
Mathematically, generator G and discriminator D play a minimax two-player game with the value function V(G, D):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

In this formula, x is an item drawn from the distribution of real data, $p_{\text{data}}$, and z is a latent vector drawn from the latent vector space, $p_z$. The output distribution of the generator is noted as $p_g$. It can be shown that the global optimum of this game is $p_g = p_{\text{data}}$, that is, the game is solved when the distribution of the generated data is equal to the distribution of the actual data.
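As a quick numeric check of this optimum (my own addition, not from the original text): when p_g equals p_data, the optimal discriminator outputs 0.5 everywhere, and plugging that into the value function yields -log 4:

```python
import numpy as np

# At the global optimum p_g == p_data, the optimal discriminator is
# D*(x) = p_data(x) / (p_data(x) + p_g(x)) = 0.5 everywhere.
d_star = 0.5

# V(G, D*) = E[log D*(x)] + E[log(1 - D*(G(z)))]
v = np.log(d_star) + np.log(1 - d_star)

print(round(v, 4))           # -1.3863
print(round(-np.log(4), 4))  # -1.3863
```
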
GANs get optimized following a game-theoretic value function. Solving this type of optimization problem with deep learning is an active area of research, and an area we will visit again in Chapter 8, Privacy, Debugging, and Launching Your Products, where we will discuss reinforcement learning. The fact that deep learning can be used to solve Minimax games is exciting news for the field of finance and economics, which features many such problems.
Let's now implement a GAN in order to generate MNIST characters. Before we start, we need to do some imports. GANs are large models, and in this section you will see how to combine sequential and functional API models for easy model building:
from keras.models import Model, Sequential
In this example we will be using a few new layer types:
```python
from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers import LeakyReLU, Reshape
from keras.layers import Conv2D, UpSampling2D
```
Let's look at some of the key elements:
- `Reshape` does the same as `np.reshape`: it brings a tensor into a new form.
- `UpSampling2D` scales a 2D feature map up, for example by a factor of two, by repeating all the numbers in the feature map.

We will be using the `Adam` optimizer, as we often do:
from keras.optimizers import Adam
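As a side note, the repetition behavior of `UpSampling2D` is easy to reproduce with plain numpy. This is a sketch of the effect, not Keras' actual implementation:

```python
import numpy as np

def upsample2d(feature_map, size=(2, 2)):
    """Repeat each value along height and width, mimicking UpSampling2D."""
    out = np.repeat(feature_map, size[0], axis=0)  # repeat rows
    return np.repeat(out, size[1], axis=1)         # repeat columns

fmap = np.array([[1, 2],
                 [3, 4]])
print(upsample2d(fmap))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```
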
Neural network layers get initialized randomly. Usually, the random numbers are drawn from a distribution that supports learning well. For GANs, it turns out that a normal Gaussian distribution is a better alternative:
from keras.initializers import RandomNormal
Now we're going to build the generator model:
```python
generator = Sequential()                                            #1
generator.add(Dense(128*7*7, input_dim=latent_dim,
                    kernel_initializer=RandomNormal(stddev=0.02)))  #2
generator.add(LeakyReLU(0.2))                                       #3
generator.add(Reshape((128, 7, 7)))                                 #4
generator.add(UpSampling2D(size=(2, 2)))                            #5
generator.add(Conv2D(64, kernel_size=(5, 5), padding='same'))       #6
generator.add(LeakyReLU(0.2))                                       #7
generator.add(UpSampling2D(size=(2, 2)))                            #8
generator.add(Conv2D(1, kernel_size=(5, 5), padding='same',
                     activation='tanh'))                            #9
adam = Adam(lr=0.0002, beta_1=0.5)
generator.compile(loss='binary_crossentropy', optimizer=adam)       #10
```
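Because the final layer uses a `tanh` activation, the real images fed to the discriminator must be scaled into the same [-1, 1] range. That preprocessing step is not shown in this section; a common sketch (the 127.5 constants are the usual convention for 8-bit images, not taken from the text) looks like this:

```python
import numpy as np

def scale_images(images):
    """Map 8-bit pixel values from [0, 255] to [-1, 1] to match tanh output."""
    return (images.astype('float32') - 127.5) / 127.5

pixels = np.array([0, 127.5, 255])
print(scale_images(pixels))  # [-1.  0.  1.]
```
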
Again, let's take a look at the generator model code, which consists of 10 key steps:

1. We construct the generator as a `Sequential` model.
2. The first layer is a `Dense` layer that takes the latent vector as input and maps it to a vector of size 128 * 7 * 7. Its weights are initialized from a random normal distribution with a standard deviation of 0.02.
3. The activation is a `LeakyReLU`. We need to specify how steep the slope for negative inputs is; in this case, negative inputs are multiplied with 0.2.
4. The `Reshape` layer does the opposite of the `Flatten` layer, which we used in Chapter 3, Utilizing Computer Vision. We now have a tensor with 128 channels in a 7x7-pixel image or feature map.
5. Using `UpSampling2D`, we enlarge this image to 14x14 pixels. The `size` argument specifies the multiplier factor for width and height.
6. Next comes a `Conv2D` layer. As opposed to the case with most image classifiers, we use a relatively large kernel size of 5x5 pixels.
7. Following the `Conv2D` layer is another `LeakyReLU`.
8. A second `UpSampling2D` enlarges the feature map to 28x28 pixels, the size of an MNIST image.
9. The final `Conv2D` layer reduces the output to a single channel and uses a `tanh` activation. `Tanh` squishes all values to between negative one and one. This might be unexpected, as image data usually does not feature any values below zero. Empirically, it turned out, however, that `tanh` activations work much better for GANs than `sigmoid` activations.
10. Finally, we compile the generator with the `Adam` optimizer, using a very small learning rate and smaller-than-usual momentum.

The discriminator is a relatively standard image classifier that classifies images as real or fake. There are only a few GAN-specific modifications:
```python
# Discriminator
discriminator = Sequential()
discriminator.add(Conv2D(64, kernel_size=(5, 5), strides=(2, 2),
                         padding='same', input_shape=(1, 28, 28),
                         kernel_initializer=RandomNormal(stddev=0.02)))  #1
discriminator.add(LeakyReLU(0.2))
discriminator.add(Dropout(0.3))
discriminator.add(Conv2D(128, kernel_size=(5, 5), strides=(2, 2),
                         padding='same'))
discriminator.add(LeakyReLU(0.2))
discriminator.add(Dropout(0.3))  #2
discriminator.add(Flatten())
discriminator.add(Dense(1, activation='sigmoid'))
discriminator.compile(loss='binary_crossentropy', optimizer=adam)
```
There are two key elements here:

1. Instead of pooling layers, the `Conv2D` layers use strides of (2, 2) to downsample the feature maps. As with the generator, the weights are initialized from a random normal distribution with a standard deviation of 0.02. Note the channels-first input shape of (1, 28, 28).
2. `Dropout` is applied after the convolutional layers to regularize the discriminator.
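The downsampling effect of the strided convolutions is easy to verify on paper: with `'same'` padding, each stride-2 `Conv2D` halves the spatial size, taking the 28x28 input down to 7x7 before the `Flatten` layer. A small sketch of this bookkeeping:

```python
import math

def same_conv_out(size, stride):
    """Spatial output size of a 'same'-padded convolution with a given stride."""
    return math.ceil(size / stride)

size = 28
for _ in range(2):  # two Conv2D layers with strides=(2, 2)
    size = same_conv_out(size, 2)
    print(size)  # prints 14, then 7
```
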
Now we have both a generator and a discriminator. To train the generator, we have to get the gradients from the discriminator to backpropagate through and train the generator. This is where the power of Keras' modular design comes into play.
The following code creates a GAN model that can be used to train the generator from the discriminator gradients:
```python
discriminator.trainable = False               #1
ganInput = Input(shape=(latent_dim,))         #2
x = generator(ganInput)                       #3
ganOutput = discriminator(x)                  #4
gan = Model(inputs=ganInput, outputs=ganOutput)  #5
gan.compile(loss='binary_crossentropy', optimizer=adam)  #6
```
Within that code, there are six key stages:

1. We set the `discriminator` to non-trainable. When setting the `discriminator` to non-trainable, the weights are frozen only for the model that is compiled with the non-trainable weights. That is, we can still train the `discriminator` model on its own, but as soon as it becomes part of the GAN model that is compiled again, its weights are frozen.
2. We create a new input, `ganInput`, for the latent vector that is fed into the GAN.
3. We feed the input through the generator. The model can be used just like a layer under the functional API.
4. The generator's output is fed into the discriminator.
5. We define the GAN as a model mapping the latent input to the discriminator's output.
6. We compile the GAN. Because we call `compile` here, the weights of the discriminator model are frozen for as long as they are part of the GAN model. Keras will throw a warning at training time that the weights are not frozen for the actual discriminator model.

Training our GAN requires some customization of the training process and a couple of GAN-specific tricks as well. More specifically, we have to write our own training loop, something that we'll achieve with the following code:
```python
import numpy as np
from tqdm import tqdm

dLosses, gLosses = [], []  # loss histories, defined before the loop

epochs = 50
batchSize = 128
batchCount = X_train.shape[0] // batchSize                            #1

for e in range(1, epochs+1):                                          #2
    print('-'*15, 'Epoch %d' % e, '-'*15)
    for _ in tqdm(range(batchCount)):                                 #3
        noise = np.random.normal(0, 1, size=[batchSize, latent_dim])  #4
        imageBatch = X_train[np.random.randint(0, X_train.shape[0],
                                               size=batchSize)]       #5
        generatedImages = generator.predict(noise)                    #6
        X = np.concatenate([imageBatch, generatedImages])             #7
        yDis = np.zeros(2*batchSize)                                  #8
        yDis[:batchSize] = 0.9
        labelNoise = np.random.random(yDis.shape)                     #9
        yDis += 0.05 * labelNoise + 0.05
        discriminator.trainable = True                                #10
        dloss = discriminator.train_on_batch(X, yDis)                 #11
        noise = np.random.normal(0, 1, size=[batchSize, latent_dim])  #12
        yGen = np.ones(batchSize)                                     #13
        discriminator.trainable = False                               #14
        gloss = gan.train_on_batch(noise, yGen)                       #15
    dLosses.append(dloss)                                             #16
    gLosses.append(gloss)
```
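As a quick aside, the label construction for the discriminator (the lines marked #8 and #9 above) is worth checking in isolation: real images get a smoothed target of 0.9 rather than 1, and both halves then receive a small amount of noise. A minimal sketch using the same constants as the training code:

```python
import numpy as np

batchSize = 4
yDis = np.zeros(2 * batchSize)  # first half real, second half generated
yDis[:batchSize] = 0.9          # smoothed "real" label instead of 1

labelNoise = np.random.random(yDis.shape)
yDis += 0.05 * labelNoise + 0.05

# Real labels end up in [0.95, 1.0); fake labels in [0.05, 0.1)
print(yDis[:batchSize].min() >= 0.95, yDis[batchSize:].max() < 0.1)  # True True
```
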
That was a lot of code we just introduced. So, let's take a minute to pause and think about the 16 key steps:

1. We calculate how many batches fit into one epoch.
2. We loop over the epochs.
3. Within each epoch, we loop over the batches. The `tqdm` tool helps us keep track of progress within the batch.
4. We sample a batch of random latent vectors.
5. We randomly sample a batch of real MNIST images.
6. The generator produces a batch of fake images from the latent vectors.
7. Real and generated images are stacked into a single training batch for the discriminator.
8. We create the discriminator targets: generated images are labeled zero, while real images get a smoothed label of 0.9 instead of 1.
9. We add a small amount of random noise to the labels, another GAN-specific trick.
10. We make the discriminator trainable again.
11. We train the discriminator on the combined batch of real and fake images.
12. We sample a fresh batch of latent vectors for training the generator.
13. The generator's targets are all ones, since the generator wants the discriminator to classify its output as real.
14. We freeze the discriminator so that only the generator gets updated.
15. We train the generator through the combined GAN model.
16. Finally, we record the discriminator and generator losses for later plotting.

In the following figure, you can see some of the generated MNIST characters:
Most of these characters look like identifiable numbers, although some, such as those in the bottom left and right, seem a bit off.
The training run also produces the following chart, showing the discriminator and generator loss over an increasing number of epochs.
Note that the loss in GAN training is not interpretable as it is for supervised learning. The loss of a GAN will not decrease even as the GAN makes progress.
The loss of the generator and the discriminator each depend on how well the other model does. If the generator gets better at fooling the discriminator, then the discriminator loss will stay high. If one of the losses goes to zero, it means that the other model has lost the race: it can no longer fool, or no longer properly discriminate, its opponent.
This is one of the things that makes GAN training so hard: GANs don't converge to a low loss solution; they converge to an equilibrium in which the generator fools the discriminator not all the time, but many times. That equilibrium is not always stable. Part of the reason so much noise is added to labels and the networks themselves is that it increases the stability of the equilibrium.
As GANs are unstable and difficult, yet useful, a number of tricks have been developed over time that make GAN training more stable. Knowing these tricks can help you with your GAN building process and save you countless hours, even though there is often no theoretical reason for why these tricks work.
For autoencoders, the latent space was a relatively straightforward approximation of PCA. VAEs create a latent space of distributions, which is useful but still easy to see as a form of PCA. So, what is the latent space of a GAN if we just sample randomly from it during training? As it turns out, GANs self-structure the latent space. Using the latent space of a GAN, you would still be able to cluster MNIST images by the characters they display.
Research has shown that the latent space of GANs often has some surprising features, such as "smile vectors," which arrange face images according to the width of the person's smile. Researchers have also shown that GANs can be used for latent space algebra, where adding the latent representation of different objects creates realistic, new objects. Yet, research on the latent space of GANs is still in its infancy and drawing conclusions about the world from its latent space representations is an active field of research.
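The latent space algebra mentioned above can be sketched with plain numpy. The latent codes below are hypothetical placeholders (my own illustration, not from the text); in practice they would be averages of latent vectors whose generated images show the respective attributes:

```python
import numpy as np

rng = np.random.default_rng(42)
latent_dim = 100

# Hypothetical latent codes for three attribute groups
z_smiling_woman = rng.normal(size=latent_dim)
z_neutral_woman = rng.normal(size=latent_dim)
z_neutral_man = rng.normal(size=latent_dim)

# The "smile vector" is the difference between smiling and neutral codes.
# Adding it to another code should add a smile to that generated image.
smile_vector = z_smiling_woman - z_neutral_woman
z_smiling_man = z_neutral_man + smile_vector

# generator.predict(z_smiling_man[None, :]) would then render the new image
print(z_smiling_man.shape)  # (100,)
```
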
GANs are tricky to train. They might collapse, diverge, or fail in a number of different ways. Researchers and practitioners have come up with a number of tricks that make GANs work better. It may seem odd, but it's often not known why these tricks work; all that matters to us is that they help in practice:
Under the minimax value function, the generator's objective is to minimize $\log(1 - D(G(z)))$, where D is the discriminator output. In practice, it works better if the objective of the generator is this:

$$\max_G \; \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]$$

In other words, instead of minimizing $\log(1 - D(G(z)))$, it is better to maximize the discriminator output on generated data. The reason is that the first objective often has vanishing gradients at the beginning of the GAN training process.
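A quick numeric illustration of this (my own addition): early in training, the discriminator easily spots fakes, so D(G(z)) is close to zero. The gradient magnitude of log(1 - D) with respect to D is then roughly 1, while the gradient of log D is enormous, which keeps the generator learning:

```python
d = 1e-3  # discriminator output for an obvious fake, early in training

# |d/dD log(1 - D)| = 1 / (1 - D): nearly flat when D is small
grad_saturating = 1.0 / (1.0 - d)

# |d/dD log(D)| = 1 / D: very steep when D is small
grad_nonsaturating = 1.0 / d

print(round(grad_saturating, 3), round(grad_nonsaturating))  # 1.001 1000
```
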
Another trick is to regularly inspect the generator's output during training. If the generator goes completely off track and produces only zeros, for instance, you will be able to see it before spending days of GPU time on training that will go nowhere.