GANs

GANs work a lot like an art forger and a museum curator. Every day, the art forger tries to sell some fake art to the museum, and every day the curator tries to distinguish whether a certain piece is real or fake. The forger learns from their failures. By trying to fool the curator and observing what leads to success and failure, they become a better forger. But the curator learns too. By trying to stay ahead of the forger, they become a better curator. As time passes, the forgeries become better and so does the distinguishing process. After years of battle, the art forger is an expert that can draw just as well as Picasso and the curator is an expert that can distinguish a real painting by tiny details.

Technically, a GAN consists of two neural networks: a generator, which produces data from a random latent vector, and a discriminator, which classifies data as "real," that is, stemming from the training set, or "fake," that is, stemming from the generator.

The following diagram shows the GAN scheme:

Figure: GAN scheme

Once again, generative models are easier to understand when images are generated, so in this section, we will look at image data, although all kinds of data can be used.

The training process for a GAN works as follows:

  1. A latent vector containing random numbers is created.
  2. The latent vector is fed into the generator, which produces an image.
  3. A set of fake images from the generator is mixed with a set of real images from the training set. The discriminator is trained in the binary classification of real and fake data.
  4. After the discriminator has been trained for a while, we feed in the fake images again. This time, we set the label of the fake images to "real." We backpropagate through the discriminator and obtain the loss gradient with respect to the input of the discriminator. We do not update the weights of the discriminator based on this information.
  5. We now have gradients describing how we would have to change our fake image so that the discriminator would classify it as a real image. We use these gradients to backpropagate and train the generator.
  6. With our new and improved generator, we once again create fake images, which get mixed with real images in order to train the discriminator, whose gradients are used to train the generator again.
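
Steps 4 and 5 can be illustrated with a deliberately tiny example. The sketch below is not the Keras model built later in this section. It assumes a hypothetical generator g(z) = a*z + b and a fixed logistic discriminator D(x) = sigmoid(w*x + c), both invented for illustration, and shows that the gradient obtained by labeling fakes as "real" is used to update only the generator:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def generate(z, a, b):
    # Hypothetical one-dimensional generator
    return a * z + b

def discriminate(x, w, c):
    # Hypothetical one-dimensional discriminator, returns P(x is real)
    return sigmoid(w * x + c)

def generator_loss(zs, a, b, w, c):
    # Binary cross-entropy with every fake labeled "real" (target 1)
    total = sum(math.log(discriminate(generate(z, a, b), w, c)) for z in zs)
    return -total / len(zs)

def generator_step(zs, a, b, w, c, learning_rate=0.1):
    grad_a = grad_b = 0.0
    for z in zs:
        d = discriminate(generate(z, a, b), w, c)
        # Gradient of -log D(x) with respect to the discriminator input x
        dloss_dx = -(1.0 - d) * w
        # Chain rule from the fake sample into the generator parameters
        grad_a += dloss_dx * z / len(zs)
        grad_b += dloss_dx / len(zs)
    # Only the generator parameters are returned; w and c stay frozen
    return a - learning_rate * grad_a, b - learning_rate * grad_b

w, c = 2.0, -1.0             # discriminator weights, never updated here
a, b = 0.5, 0.0              # generator weights
zs = [0.3, -0.2, 0.8, 0.1]   # a fixed batch of latent values

loss_before = generator_loss(zs, a, b, w, c)
for _ in range(50):
    a, b = generator_step(zs, a, b, w, c)
loss_after = generator_loss(zs, a, b, w, c)
```

After the loop, loss_after is lower than loss_before: the generator has moved its outputs toward the region that this particular discriminator considers real.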

Note: GAN training has a lot in common with the visualization of network layers that we discussed in Chapter 3, Utilizing Computer Vision. This time, however, we don't create just one image that maximizes an activation function; instead, we create a generative network that specializes in maximizing the activation function of another network.

Mathematically, generator G and discriminator D play a mini-max two-player game with the value function V(G,D):

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

In this formula, x is an item drawn from the distribution of real data, $p_{data}$, and z is a latent vector drawn from the latent vector space, $p_z$.

The output distribution of the generator is noted as $p_g$. It can be shown that the global optimum of this game is $p_g = p_{data}$, that is, if the distribution of the generated data is equal to the distribution of actual data.
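
To make the value function concrete, the following sketch evaluates it for small discrete distributions. It relies on the standard result that, for a fixed generator, the best possible discriminator outputs the ratio of the real density to the sum of the real and generated densities. The function names and the example probabilities are assumptions made for illustration:

```python
import math

def value_function(p_data, p_g, d):
    """V(G, D) for discrete distributions over the same finite support.

    p_data[i] and p_g[i] are the probabilities of outcome i under the real
    and the generated distribution; d[i] is the discriminator output for it.
    """
    real_term = sum(p * math.log(di) for p, di in zip(p_data, d))
    fake_term = sum(q * math.log(1.0 - di) for q, di in zip(p_g, d))
    return real_term + fake_term

def optimal_discriminator(p_data, p_g):
    # For a fixed generator, V is maximized by p_data / (p_data + p_g)
    return [p / (p + q) for p, q in zip(p_data, p_g)]

p_data = [0.5, 0.3, 0.2]

# A generator that does not yet match the data distribution
p_g_bad = [0.2, 0.2, 0.6]
v_bad = value_function(p_data, p_g_bad, optimal_discriminator(p_data, p_g_bad))

# A generator that matches the data distribution exactly
d_star = optimal_discriminator(p_data, p_data)
v_opt = value_function(p_data, p_data, d_star)
```

When the generated distribution equals the data distribution, the best discriminator can only output 0.5 everywhere and the value is -log 4. Any mismatch, as in p_g_bad, lets the discriminator push the value above that minimum.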

GANs are optimized following a game-theoretic value function. Solving this type of optimization problem with deep learning is an active area of research, and an area we will visit again in Chapter 8, Privacy, Debugging, and Launching Your Products, where we will discuss reinforcement learning. The fact that deep learning can be used to solve minimax games is exciting news for the fields of finance and economics, which feature many such problems.

A MNIST GAN

Let's now implement a GAN in order to generate MNIST characters. Before we start, we need to do some imports. GANs are large models, and in this section you will see how to combine sequential and functional API models for easy model building:

from keras.models import Model, Sequential

In this example we will be using a few new layer types:

from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers import LeakyReLU, Reshape
from keras.layers import Conv2D, UpSampling2D

Let's look at some of the key elements:

  • LeakyReLU is just like ReLU, except that the activation allows for small negative values. This prevents the gradient from ever becoming zero. This activation function works well for GANs, something we will discuss in the next section:
Figure: Leaky ReLU

  • Reshape does the same as np.reshape: it brings a tensor into a new form.
  • UpSampling2D scales a 2D feature map up, for example, by a factor of two, by repeating all numbers in the feature map.
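
As a rough illustration of what these three layers compute, here is a NumPy sketch. It mimics the layer behavior on plain arrays and is not the Keras implementation; the helper names are made up for this example:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    # Positive inputs pass through; negative inputs are scaled by alpha
    return np.where(x > 0, x, alpha * x)

def upsample_2d(feature_map, factor=2):
    # Repeat every row and every column `factor` times
    return np.repeat(np.repeat(feature_map, factor, axis=0), factor, axis=1)

# Negative inputs are scaled by 0.2 instead of being cut off at zero
activations = leaky_relu(np.array([-2.0, -0.5, 0.0, 1.5]))

# Reshape only rearranges the same eight numbers into a new shape
flat = np.arange(8)
reshaped = flat.reshape((2, 2, 2))

# Upsampling by a factor of two turns a 2x2 map into a 4x4 map
small = np.array([[1, 2],
                  [3, 4]])
large = upsample_2d(small)
```

Each value of small appears as a 2x2 block in large, which is what "repeating all numbers in the feature map" means.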

We will be using the Adam optimizer as we often do:

from keras.optimizers import Adam

Neural network layers are initialized randomly. By default, Keras draws the initial weights of Dense and Conv2D layers from a Glorot uniform distribution, which supports learning well in most cases. For GANs, it turns out that a Gaussian distribution is a better alternative:

from keras.initializers import RandomNormal

Now we're going to build the generator model:

# latent_dim, the length of the latent vector, must be defined before this point.
# The shapes below put the channels first, so this code assumes that Keras
# uses the 'channels_first' image data format.
generator = Sequential()                                       #1

generator.add(Dense(128*7*7, input_dim=latent_dim,
                    kernel_initializer=RandomNormal(stddev=0.02)))   #2
generator.add(LeakyReLU(0.2))                                  #3
generator.add(Reshape((128, 7, 7)))                            #4
generator.add(UpSampling2D(size=(2, 2)))                       #5
generator.add(Conv2D(64, kernel_size=(5, 5), padding='same'))  #6
generator.add(LeakyReLU(0.2))                                  #7
generator.add(UpSampling2D(size=(2, 2)))                       #8
generator.add(Conv2D(1, kernel_size=(5, 5), padding='same',
                     activation='tanh'))                       #9

adam = Adam(lr=0.0002, beta_1=0.5)
generator.compile(loss='binary_crossentropy', optimizer=adam)  #10

Again, let's take a look at the generator model code, which consists of 10 key steps:

  1. We construct the generator as a sequential model.
  2. The first layer takes in the random latent vector and maps it to a vector with dimensions 128 * 7 * 7 = 6,272. It already significantly expands the dimensionality of our generated data. For this fully connected layer, it is important to initialize weights from a Gaussian distribution with a relatively small standard deviation, here 0.02. A Gaussian distribution, as opposed to a uniform distribution, will have fewer extreme values, which will make training easier.
  3. The activation function for the first layer is LeakyReLU. We need to specify how steep the slope for negative inputs is; in this case, negative inputs are multiplied with 0.2.
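  4. Now we reshape our flat vector into a 3D tensor. This is the opposite of using a Flatten layer, which we did in Chapter 3, Utilizing Computer Vision. We now have a tensor with 128 channels in a 7x7-pixel image or feature map. Note that the shape (128, 7, 7) lists the channels first, so the code assumes that Keras is configured with the channels_first image data format.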
  5. Using UpSampling2D, we enlarge this image to 14x14 pixels. The size argument specifies the multiplier factor for width and height.
  6. Now we can apply a standard Conv2D layer. As opposed to the case with most image classifiers, we use a relatively large kernel size of 5x5 pixels.
  7. The activation following the Conv2D layer is another LeakyReLU.
  8. We upsample again, bringing the image to 28x28 pixels, the same dimensions as an MNIST image.
  9. The final convolutional layer of our generator outputs only a single-channel image, as MNIST images are grayscale. Notice how the activation of this final layer is a tanh activation. Tanh squishes all values to between negative one and one. This might be unexpected as image data usually does not feature any values below zero. Empirically, it turned out, however, that tanh activations work much better for GANs than sigmoid activations.
  10. Finally, we compile the generator to train with the Adam optimizer with a very small learning rate of 0.0002 and a smaller-than-usual momentum term, beta_1, of 0.5. The Keras defaults are 0.001 and 0.9, respectively.
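
The shape changes described in these steps can be checked with a few lines of plain Python. The helper functions below are hypothetical stand-ins that only track shapes in channels-first order; they do not build or run any Keras layers:

```python
import math

def dense(units):
    return (units,)

def reshape(shape, target):
    # Reshaping must keep the number of elements unchanged
    assert math.prod(shape) == math.prod(target), "element count must match"
    return target

def upsample(shape, factor=2):
    channels, height, width = shape
    return (channels, height * factor, width * factor)

def conv_same(shape, filters):
    # With padding='same' and stride 1, only the channel count changes
    _, height, width = shape
    return (filters, height, width)

shape = dense(128 * 7 * 7)             # flat vector with 6,272 entries
shape = reshape(shape, (128, 7, 7))    # 128 channels of 7x7
shape = upsample(shape)                # 128 channels of 14x14
shape = conv_same(shape, 64)           # 64 channels of 14x14
shape = upsample(shape)                # 64 channels of 28x28
shape = conv_same(shape, 1)            # 1 channel of 28x28
generator_output_shape = shape
```

The final shape, one channel of 28x28 pixels, matches a single MNIST image.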

The discriminator is a relatively standard image classifier that classifies images as real or fake. There are only a few GAN-specific modifications:

#Discriminator
# input_shape=(1, 28, 28) lists the channel first, so this code assumes
# the 'channels_first' image data format, like the generator.
discriminator = Sequential()
discriminator.add(Conv2D(64, kernel_size=(5, 5), strides=(2, 2),
                         padding='same', input_shape=(1, 28, 28),
                         kernel_initializer=RandomNormal(stddev=0.02)))  #1
discriminator.add(LeakyReLU(0.2))
discriminator.add(Dropout(0.3))
discriminator.add(Conv2D(128, kernel_size=(5, 5), strides=(2, 2),
                         padding='same'))
discriminator.add(LeakyReLU(0.2))
discriminator.add(Dropout(0.3))                          #2
discriminator.add(Flatten())
discriminator.add(Dense(1, activation='sigmoid'))
discriminator.compile(loss='binary_crossentropy', optimizer=adam)

There are two key elements here:

  1. As with the generator, the first layer of the discriminator should be initialized randomly from a Gaussian distribution.
  2. Dropout is commonly used in image classifiers. For GANs, it should also be used just before the last layer.
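
As a sanity check on this architecture, the sketch below works out the feature map sizes and the number of trainable weights by hand. It assumes the usual rule that a convolution with padding='same' produces an output of size ceil(input size / stride); the helper names are made up for this example:

```python
import math

def conv_output_size(size, stride):
    # With padding='same', the output size is ceil(input size / stride)
    return math.ceil(size / stride)

def conv_params(kernel, in_channels, filters):
    # One kernel x kernel x in_channels weight block plus one bias per filter
    return kernel * kernel * in_channels * filters + filters

size = 28
size = conv_output_size(size, 2)      # after the first strided convolution
size = conv_output_size(size, 2)      # after the second strided convolution
flattened = 128 * size * size         # inputs to the final Dense layer

params = (
    conv_params(5, 1, 64)             # first Conv2D
    + conv_params(5, 64, 128)         # second Conv2D
    + flattened + 1                   # Dense(1): one weight per input plus bias
)
```

The strided convolutions halve the image from 28 to 14 to 7 pixels per side, so the discriminator needs no pooling layers to shrink its feature maps.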

Now we have both a generator and a discriminator. To train the generator, we have to get the gradients from the discriminator to backpropagate through and train the generator. This is where the power of Keras' modular design comes into play.

Note: Keras models can be treated just like Keras layers.

The following code creates a GAN model that can be used to train the generator from the discriminator gradients:

discriminator.trainable = False                         #1
ganInput = Input(shape=(latent_dim,))                   #2
x = generator(ganInput)                                 #3
ganOutput = discriminator(x)                            #4
gan = Model(inputs=ganInput, outputs=ganOutput)         #5
gan.compile(loss='binary_crossentropy', optimizer=adam) #6

Within that code, there are six key stages:

  1. When training the generator, we do not want to train the discriminator. Setting the discriminator to non-trainable freezes its weights only in models that are compiled after the flag is set. That is, we can still train the discriminator model on its own, since it was compiled while it was trainable, but as soon as it becomes part of the GAN model, which is compiled afterward, its weights are frozen.
  2. We create a new input for our GAN, which takes in the random latent vector.
  3. We connect the generator model to the ganInput layer. The model can be used just like a layer under the functional API.
  4. We now connect the discriminator with frozen weights to the generator. Again, we call the model in the same way we would use a layer in the functional API.
  5. We create a model that maps an input to the output of the discriminator.
  6. We compile our GAN model. Since we call compile here, the weights of the discriminator model are frozen for as long as they are part of the GAN model. Keras will issue a warning at training time that the weights are not frozen for the actual discriminator model.

Training our GAN requires some customization of the training process and a couple of GAN-specific tricks as well. More specifically, we have to write our own training loop, something that we'll achieve with the following code:

import numpy as np
from tqdm import tqdm

# X_train is assumed to hold the MNIST training images, scaled to [-1, 1]
# and shaped (num_images, 1, 28, 28) to match the discriminator input.
epochs = 50
batchSize = 128
batchCount = X_train.shape[0] // batchSize                     #1
dLosses = []                 # discriminator loss at the end of each epoch
gLosses = []                 # generator loss at the end of each epoch

for e in range(1, epochs+1):                                   #2
    print('-'*15, 'Epoch %d' % e, '-'*15)
    for _ in tqdm(range(batchCount)):                          #3

        noise = np.random.normal(0, 1, size=[batchSize, latent_dim]) #4
        imageBatch = X_train[np.random.randint(0, X_train.shape[0], size=batchSize)] #5

        generatedImages = generator.predict(noise)             #6
        X = np.concatenate([imageBatch, generatedImages])      #7

        yDis = np.zeros(2*batchSize)                           #8
        yDis[:batchSize] = 0.9

        labelNoise = np.random.random(yDis.shape)              #9
        yDis += 0.05 * labelNoise + 0.05

        discriminator.trainable = True                         #10
        dloss = discriminator.train_on_batch(X, yDis)          #11

        noise = np.random.normal(0, 1, size=[batchSize, latent_dim]) #12
        yGen = np.ones(batchSize)                              #13
        discriminator.trainable = False                        #14
        gloss = gan.train_on_batch(noise, yGen)                #15

    dLosses.append(dloss)                                      #16
    gLosses.append(gloss)

That was a lot of code we just introduced. So, let's take a minute to pause and think about the 16 key steps:

  1. We have to write a custom loop to loop over the batches. To know how many batches there are, we need to make an integer division of our dataset size by our batch size.
  2. In the outer loop, we iterate over the number of epochs we want to train.
  3. In the inner loop, we iterate over the number of batches we want to train on in each epoch. The tqdm tool helps us keep track of progress within the epoch.
  4. We create a batch of random latent vectors.
  5. We randomly sample a batch of real MNIST images.
  6. We use the generator to generate a batch of fake MNIST images.
  7. We stack the real and fake MNIST images together.
  8. We create the target for our discriminator. Fake images are encoded with 0, and real images with 0.9. This technique is called soft labels. Instead of hard labels (zero and one), we use something softer in order to not train the GAN too aggressively. This technique has been shown to make GAN training more stable.
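  9. On top of using soft labels, we add some noise to the labels. Because the added term lies between 0.05 and 0.10, real labels end up between 0.95 and 1.0 and fake labels between 0.05 and 0.10. This, once again, will make the training more stable.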
  10. We make sure that the discriminator is trainable.
  11. We train the discriminator on a batch of real and fake data.
  12. We create some more random latent vectors for training the generator.
  13. The target for generator training is always one. We want the discriminator to give us the gradients that would have made a fake image look like a real one.
  14. Just to be sure, we set the discriminator to be non-trainable, so that we cannot break anything by accident.
  15. We train the GAN model. We feed in a batch of random latent vectors and train the generator part of the GAN so that the discriminator part will classify the generated images as real.
  16. We save the losses from training.
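
Steps 8 and 9 are easy to misread, so here is the label construction on its own. This is a standalone sketch that follows the arithmetic of the training loop; the seeded random generator is an addition made only so that the example is repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
batch_size = 128

# Soft labels: 0.9 for the real half of the batch, 0 for the fake half
y_dis = np.zeros(2 * batch_size)
y_dis[:batch_size] = 0.9

# Uniform noise scaled into the range 0.05 to 0.10 is added to every label
label_noise = rng.random(y_dis.shape)
y_dis += 0.05 * label_noise + 0.05

real_labels = y_dis[:batch_size]
fake_labels = y_dis[batch_size:]
```

The discriminator therefore never sees a hard 0 or an exact, constant target: real targets vary just below 1 and fake targets vary just above 0.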

In the following figure, you can see some of the generated MNIST characters:

Figure: GAN-generated MNIST characters

Most of these characters look like identifiable numbers, although some, such as those in the bottom left and right, seem a bit off.

The following chart shows the discriminative and generative loss over an increasing number of epochs.

Figure: GAN training progress

Note that the loss in GAN training is not interpretable in the way it is for supervised learning. The loss of a GAN does not necessarily decrease, even as the GAN makes progress.

The loss of a generator and discriminator is dependent on how well the other model does. If the generator gets better at fooling the discriminator, then the discriminator loss will stay high. If one of the losses goes to zero, it means that the other model lost the race and cannot fool or properly discriminate the other model anymore.

This is one of the things that makes GAN training so hard: GANs don't converge to a low loss solution; they converge to an equilibrium in which the generator fools the discriminator not all the time, but many times. That equilibrium is not always stable. Part of the reason so much noise is added to labels and the networks themselves is that it increases the stability of the equilibrium.
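
A small calculation shows what such an equilibrium looks like in terms of loss values. The sketch assumes hard labels of 0 and 1 and an idealized discriminator output, and it ignores the soft and noisy labels used in the training loop:

```python
import math

def binary_crossentropy(target, prediction):
    return -(target * math.log(prediction)
             + (1.0 - target) * math.log(1.0 - prediction))

# A discriminator that cannot tell real from fake outputs 0.5 for everything
undecided = 0.5
d_loss_real = binary_crossentropy(1.0, undecided)   # real image, label 1
d_loss_fake = binary_crossentropy(0.0, undecided)   # fake image, label 0
g_loss = binary_crossentropy(1.0, undecided)        # fake image, target "real"

# A discriminator that wins outright drives its own loss toward zero
# and the generator loss up
confident_fake = 0.01                               # output for a fake image
d_loss_when_winning = binary_crossentropy(0.0, confident_fake)
g_loss_when_losing = binary_crossentropy(1.0, confident_fake)
```

In the undecided case, all three losses equal ln 2, roughly 0.693. A loss curve that hovers around a nonzero value can therefore indicate a balanced game rather than a lack of progress.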

As GANs are unstable and difficult, yet useful, a number of tricks have been developed over time that make GAN training more stable. Knowing these tricks can help you with your GAN building process and save you countless hours, even though there is often no theoretical explanation for why these tricks work.

Understanding GAN latent vectors

For autoencoders, the latent space was a relatively straightforward approximation of PCA. VAEs create a latent space of distributions, which is useful but still easy to see as a form of PCA. So, what is the latent space of a GAN if we just sample randomly from it during training? As it turns out, GANs self-structure the latent space. Using the latent space of a GAN, you would still be able to cluster MNIST images by the characters they display.

Research has shown that the latent space of GANs often has some surprising features, such as "smile vectors," which arrange face images according to the width of the person's smile. Researchers have also shown that GANs can be used for latent space algebra, where adding the latent representation of different objects creates realistic, new objects. Yet, research on the latent space of GANs is still in its infancy and drawing conclusions about the world from its latent space representations is an active field of research.
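
Latent space algebra is ordinary vector arithmetic on latent vectors. The sketch below builds a straight-line path between two random latent vectors; the latent size of 100 and the helper name interpolate are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen arbitrarily for repeatability
latent_dim = 100                  # assumed value; use your generator's size

z_a = rng.normal(0, 1, size=latent_dim)
z_b = rng.normal(0, 1, size=latent_dim)

def interpolate(z_start, z_end, steps):
    # Evenly spaced points on the straight line from z_start to z_end
    weights = np.linspace(0.0, 1.0, steps)[:, np.newaxis]
    return (1.0 - weights) * z_start + weights * z_end

# One latent vector per row, starting at z_a and ending at z_b
path = interpolate(z_a, z_b, steps=8)

# Combining latent representations is a plain element-wise operation
z_mean = (z_a + z_b) / 2.0
```

With a trained generator, generator.predict(path) would produce one image per row. Whether the in-between images look meaningful depends on the structure that the trained model has given its latent space.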

GAN training tricks

GANs are tricky to train. They might collapse, diverge, or fail in a number of different ways. Researchers and practitioners have come up with a number of tricks that make GANs work better. It may seem odd, but it is often not known why these tricks work; all that matters to us is that they help in practice:

  • Normalize the inputs: GANs don't work well with extreme values, so make sure you always have normalized inputs between -1 and 1. This is also the reason why you should use the tanh function as your generator output.
  • Don't use the theoretically correct loss function: If you read papers on GANs, you will find that they give the generator optimization goal as the following formula:

    $$\min \log(1 - D)$$

    In this formula, D is the discriminator output. In practice, it works better if the objective of the generator is this:

    $$\max \log D$$

    In other words, instead of minimizing the negative discriminator output, it is better to maximize the discriminator output. The reason is that the first objective often has vanishing gradients at the beginning of the GAN training process.

  • Sample from a Gaussian distribution: There are two reasons to sample from normal distributions instead of uniform distributions. First, GANs don't work well with extreme values, and normal distributions have fewer extreme values than uniform distributions. Additionally, it has turned out that if high-dimensional latent vectors are sampled from a normal distribution, most of them lie close to the surface of a sphere. The relationships between latent vectors on this sphere are easier to describe than latent vectors in a cube space.
  • Use batch normalization: We've already seen that GANs don't work well with extreme values since they are so fragile. Another way to reduce extreme values is to use batch normalization, as we discussed in Chapter 3, Utilizing Computer Vision.
  • Use separate batches for real and fake data: In the beginning of this process, real and fake data might have very different distributions. As batch norm applies normalization over a batch, using the batches' mean and standard deviation, it is more advisable to keep the real and fake data separate. While this does lead to slightly less accurate gradient estimates, the gain from fewer extreme values is great.
  • Use soft and noisy labels: GANs are fragile; the use of soft labels reduces the gradients and keeps the gradients from tipping over. Adding some random noise to labels also helps to stabilize the system.
  • Use basic GANs: There is now a wide range of GAN models. Many of them claim wild performance improvements, whereas in reality they do not work much better, and are often worse, than a simple deep convolutional generative adversarial network, or DCGAN. That does not mean they have no justification for existing, but for the bulk of tasks, more basic GANs will perform better. Another GAN that works well is the adversarial autoencoder, which combines a VAE with a GAN by training the autoencoder on the gradients of a discriminator.
  • Avoid ReLU and MaxPool: ReLU activations and MaxPool layers are frequently used in deep learning, but they have the disadvantage of producing "sparse gradients." A ReLU activation will not have any gradient for negative inputs, and a MaxPool layer will not have any gradients for all inputs that were not the maximum input. Since gradients are what the generator is being trained on, sparse gradients will hurt generator training.
  • Use the Adam optimizer: This optimizer has been shown to work very well with GANs, while many other optimizers do not work well with them.
  • Track failures early: Sometimes, GANs can fail for random reasons. Just choosing the "wrong" random seed could set your training run up for failure. Usually, it is possible to see whether a GAN goes completely off track by observing outputs. They should slowly become more like real data.

    If the generator goes completely off track and produces only zeros, for instance, you will be able to see it before spending days of GPU time on training that will go nowhere.

  • Don't balance loss via statistics: Keeping the balance between the generator and discriminator is a delicate task. Many practitioners, therefore, try to help the balance by training either the generator or discriminator a bit more depending on statistics. Usually, that does not work. GANs are very counterintuitive and trying to help them with an intuitive approach usually makes matters worse. That is not to say there are no ways to help out GAN equilibriums, but the help should stem from a principled approach, such as "train the generator while the generator loss is above X."
  • If you have labels, use them: A slightly more sophisticated version of a GAN discriminator can not only classify data as real or fake but also classify the class of the data. In the MNIST case, the discriminator would have 11 outputs: one for each of the 10 digits as well as one for fake images. This allows us to create a GAN that can show more specific images. This is useful in the domain of semi-supervised learning, which we will cover in the next section.
  • Add noise to inputs, reduce it over time: Noise adds stability to GAN training so it comes as no surprise that noisy inputs can help, especially in the early, unstable phases of training a GAN. Later, however, it can obfuscate too much and keep the GAN from generating realistic images. So, we should reduce the noise applied to inputs over time.
  • Use dropout in G in both the train and test phases: Some researchers find that using dropout at inference time leads to better results for the generated data. Why that is the case is still an open question.
  • Historical averaging: GANs tend to "oscillate," with their weights moving rapidly around a mean during training. Historical averaging penalizes weights that are too far away from their historical average and reduces oscillation. It, therefore, increases the stability of GAN training.
  • Replay buffers: Replay buffers keep a number of older generated images so they can be reused for training the discriminator. This has a similar effect as historical averaging, reducing oscillation and increasing stability. It also reduces the correlation between consecutive batches of training data.
  • Target networks: Another "anti-oscillation" trick is to use target networks. That is, to create copies of both the generator and discriminator, and then train the generator with a frozen copy of the discriminator and train the discriminator with a frozen copy of the generator.
  • Entropy regularization: Entropy regularization means rewarding the network for outputting more diverse values. This can prevent the generator network from settling on a few things to produce, say, only the number seven. It is a regularization method as it prevents overfitting.
  • Use dropout or noise layers: Noise is good for GANs. Keras not only features dropout layers, but it also features a number of noise layers that add different kinds of noise to activations in a network. You can read the documentation of these layers to see whether they are helpful for your specific GAN application: https://keras.io/layers/noise/.
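
The vanishing gradient argument behind the loss function trick in this list can be checked numerically. The sketch assumes that the discriminator output D is a sigmoid of a single pre-activation value, called the logit here, as in the final Dense layer of our discriminator. The function names are made up for this example:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_saturating(logit):
    # Derivative of log(1 - D) with respect to the logit; this is the
    # objective that the generator minimizes in theory
    return -sigmoid(logit)

def grad_non_saturating(logit):
    # Derivative of -log(D) with respect to the logit; minimizing -log(D)
    # is the same as maximizing log(D)
    return -(1.0 - sigmoid(logit))

# Early in training the discriminator rejects fakes with confidence,
# so its logit for a generated image is strongly negative
early_logit = -6.0
d_early = sigmoid(early_logit)

weak_signal = abs(grad_saturating(early_logit))
strong_signal = abs(grad_non_saturating(early_logit))
```

With a discriminator output near zero, the theoretical objective passes almost no gradient back to the generator, while the practical objective passes a gradient close to one. This is also why the training loop above trains the generator with a target of one rather than flipping the sign of the discriminator loss.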