In this chapter, we provide a hands-on tutorial to build a Progressive GAN by using TensorFlow and the newly released TensorFlow Hub (TFHub). The Progressive GAN (aka PGGAN, or ProGAN) is a cutting-edge technique that has managed to generate full-HD photorealistic images. Presented at one of the top machine learning conferences, the International Conference on Learning Representations (ICLR) in 2018, this technique made such a splash that Google immediately integrated it as one of the few models to be part of the TensorFlow Hub. In fact, this technique was lauded by Yoshua Bengio—one of the grandfathers of deep learning—as “almost too good to be true.” When it was released, it became an instant favorite of academic presentations and experimental projects.
We recommend that you go through this chapter with TensorFlow 1.7 or higher; 1.8 was the latest release at the time of writing, so that is the version we used. For TensorFlow Hub, we suggest using a version no later than 0.4.0, because later versions have trouble importing due to compatibility issues with TensorFlow 1.x. After reading this chapter, you'll be able to implement all the key improvements of the Progressive GAN. These four innovations are as follows:

- Progressive growing and smoothing in of higher-resolution layers
- Mini-batch standard deviation
- Equalized learning rate
- Pixel-wise feature normalization
This chapter features two main examples:

- Code implementations of the four key innovations of the Progressive GAN listed previously
- A pretrained Progressive GAN downloaded from TFHub and used to generate new images
The reasons we decided to implement the PGGAN using TFHub rather than from the ground up as we do in all the other chapters are threefold:
See “Progressive Growing of GANs for Improved Quality, Stability, and Variation,” by Tero Karras et al., 2018, https://github.com/tkarras/progressive_growing_of_gans.
Recall from chapter 2 that we have this lower-dimensional space—called latent space—that seeds our output. As with the DCGAN from chapter 4, and indeed the Progressive GAN, the trained latent space has semantically meaningful properties. This means that we can find vector offsets that, for example, introduce eyeglasses to an image of a face, and the same offset will introduce glasses in new images. We can also pick two random vectors and then move in equal increments between them and so gradually—smoothly—get an image that matches the second vector.
This is called interpolation, and you can see this process in figure 6.1. As the author of BigGAN said, meaningful transitions from one vector to another show that the GAN has learned some underlying structure.
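To make the idea concrete, here is a minimal sketch of such an interpolation. It assumes a hypothetical trained generator that maps latent vectors to images; the names and the latent dimension are illustrative only.

import numpy as np

latent_dim = 512                                      # Size of the latent space (illustrative)
z_start = np.random.normal(size=(latent_dim,))        # First random latent vector
z_end = np.random.normal(size=(latent_dim,))          # Second random latent vector

steps = 10                                            # Number of equal increments between the two vectors
interpolated = [(1 - t) * z_start + t * z_end
                for t in np.linspace(0.0, 1.0, steps)]
# Feeding each step to the generator yields a smooth transition between the two images:
# images = [generator.predict(z[np.newaxis, :]) for z in interpolated]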
In previous chapters, you learned which results are easy to achieve with GANs and which are difficult. Moreover, things like mode collapse (showing only a few examples of the overall distribution) and lack of convergence (one of the causes of poor quality of the results) are no longer alien terms to us.
Recently, a Finnish NVIDIA team released a paper that has managed to blow many previous cutting-edge papers out of the water: “Progressive Growing of GANs for Improved Quality, Stability, and Variation,” by Tero Karras et al. This paper features four fundamental innovations, so let’s walk through them in order.
Before we dive into what the Progressive GAN does, let’s start with a simple analogy. Imagine looking at a mountain region from a bird’s-eye view: you have lots of valleys, which have nice creeks and villages—generally quite habitable. Then you have many mountain tops that are rough and generally unpleasant to live on because of weather conditions. This sort of represents the loss function landscape, where we want to minimize the loss by going down the mountain slopes and into the valleys, which are much nicer.
We can imagine training as dropping a mountaineer into a random place in this mountain region and then following their path down the slope into a valley. This is what stochastic gradient descent does, and chapter 10 revisits this in a lot more detail. Now, unfortunately, if we start with a very complex mountain range, the mountaineer will not know which direction to travel. The space around our adventurer would be jagged and rough. It would be difficult to make out where the nicest, lowest valley is with lots of habitable lands. Instead, we zoom out and reduce the complexity of the mountain range to give the mountaineer a high-level picture of this particular area.
As our mountaineer gets closer to a valley, we can start increasing the complexity by zooming in on the terrain. Then we no longer see just the coarse/pixelated texture, but instead get to see the finer details. This approach has the advantage that as our mountaineer goes down the slope, they can easily make little optimizations to make the hiking easier. For example, they can take a path through a dried-up creek to make the descent into the valley even faster. That is progressive growing: increasing the resolution of the terrain as we go.
However, if you have ever seen an open world computer game or scrolled too quickly through Google Earth with 3D on, you know that quickly increasing the resolution of the terrain around you can be startling and unpleasant. Objects all of a sudden jump into existence. So instead, we progressively smooth in and slowly introduce more complexity as the mountaineer gets closer to the objective.
In technical terms, we are going from a few low-resolution convolutional layers to many high-resolution ones as we train. Thus, we first train the early layers and only then introduce a higher-resolution layer, where it is harder to navigate the loss space. We go from something simple—for example, 4 × 4 trained for several steps—to something more complex—for example, 1024 × 1024 trained for several epochs, as shown in figure 6.2.
The problem in this scenario is that upon introducing even one more layer at a time (for example, from 4 × 4 to 8 × 8), we are still introducing a massive shock to the training. What the PGGAN authors do instead is smoothly fade in those layers, as in figure 6.3, in order to give the system time to adapt to the higher resolution.
However, rather than immediately jumping to this resolution, we smoothly fade in the new, higher-resolution layer through a parameter alpha (α), which is between 0 and 1. Alpha controls how much we use the old-but-upscaled layer versus the natively larger one. On the Discriminator's side, we similarly shrink the input by 0.5× so that we can smoothly inject the trained layer for discrimination. This is (b) in figure 6.3. When we are confident about the new layer, we keep only the 32 × 32 path—(c) in the figure—and then we get ready to grow yet again after we have trained 32 × 32 properly.
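As a small illustration (not from the paper's code, and with purely illustrative names), alpha can simply be increased linearly over the number of steps spent introducing a new resolution:

def alpha_schedule(step, fade_in_steps):
    '''
    Returns alpha in [0, 1]: 0 means we use only the upscaled old layer,
    1 means we use only the new, natively higher-resolution layer.
    '''
    return min(1.0, step / float(fade_in_steps))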
For all the innovations we’ve detailed, in this section we’ll give you working but isolated versions so that we can talk code. As an exercise, you may want to try implementing these things as one GAN network, maybe using the existing prior architectures. If you are ready, let’s load up ye olde, trusty machine learning libraries and get cracking:
import tensorflow as tf
import keras as K
import numpy as np
In the code, progressive smoothing in may look something like the following listing.
def upscale_layer(layer, upscale_factor):
    '''
    Upscales layer (tensor) by the factor (int) where
    the tensor is [group, height, width, channels]
    '''
    height = layer.get_shape()[1]
    width = layer.get_shape()[2]
    size = (upscale_factor * height, upscale_factor * width)
    upscaled_layer = tf.image.resize_nearest_neighbor(layer, size)
    return upscaled_layer

def smoothly_merge_last_layer(list_of_layers, alpha):
    '''
    Smoothly merges in a layer based on a threshold value alpha.
    This function assumes that all layers are already in RGB.
    This is the function for the Generator.
    :list_of_layers : items should be tensors ordered by resolution
    :alpha          : float in (0,1)
    '''
    # The last fully trained layer, one resolution below the new one
    last_fully_trained_layer = list_of_layers[-2]
    # Upscales the trained layer so its shape matches the new one
    last_layer_upscaled = upscale_layer(last_fully_trained_layer, 2)
    # The newly added, natively higher-resolution layer
    larger_native_layer = list_of_layers[-1]
    # Ensures that the merging below is possible
    assert larger_native_layer.get_shape() == last_layer_upscaled.get_shape()
    # Weighted sum: mostly the upscaled old layer while alpha is near 0,
    # mostly the new layer as alpha approaches 1
    new_layer = (1 - alpha) * last_layer_upscaled + alpha * larger_native_layer
    return new_layer
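Listing 6.1 covers only the Generator. On the Discriminator's side, one way to sketch the corresponding fade-in (this is our illustration, not the paper's exact code) is to shrink the native-resolution input by 0.5× with average pooling, so that it can still flow through the already trained lower-resolution path while the new layer fades in. The from_rgb_new and from_rgb_old functions are hypothetical placeholders for the layers that map images to feature maps.

def downscale_layer(layer, downscale_factor=2):
    '''
    Downscales layer (tensor) by the factor (int) where
    the tensor is [group, height, width, channels]
    '''
    return tf.layers.average_pooling2d(layer, pool_size=downscale_factor,
                                       strides=downscale_factor, padding='valid')

def smoothly_merge_discriminator_input(image, alpha, from_rgb_new, from_rgb_old):
    '''
    Fades in the Discriminator's new high-resolution input path.
    :image        : native-resolution input tensor
    :alpha        : float in (0,1), the same fade-in parameter as in the Generator
    :from_rgb_new : hypothetical layer mapping the native-resolution image to features
    :from_rgb_old : hypothetical layer mapping the 0.5x-downscaled image to features
    '''
    # 0.5x shrink of the input so the already-trained path can still process it
    downscaled = downscale_layer(image, 2)
    return (1 - alpha) * from_rgb_old(downscaled) + alpha * from_rgb_new(image)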
Now that you have an understanding of the lower-level details of progressive growing and smoothing without unnecessary complexity, hopefully you can appreciate how general this idea is. Although Karras et al. were by no means the first to come up with some way of increasing model complexity during training, this seems like by far the most promising avenue and indeed the innovation that resonated the most. As of June 2019, this paper was cited over 730 times. With that context in mind, let's move on to the second big innovation.
The next innovation introduced by Karras et al. in their paper is mini-batch standard deviation. Before we dive into it, let’s recall from chapter 5 the issue of mode collapse, which occurs when the GAN learns how to create a few good examples or only slight permutations on them. We generally want to produce the faces of all the people in the real dataset, maybe not just one picture of one woman.
Therefore, Karras et al. created a way for the Discriminator to tell whether the samples it is getting are varied enough. In essence, we calculate a single extra scalar statistic for the Discriminator. This statistic is the standard deviation of all the pixels in the mini-batch that are generated by the Generator or that come from the real data. That is an amazingly simple and elegant solution: now all the Discriminator needs to learn is that if the standard deviation is low in the images from the batch it is evaluating, the image is likely fake, because the real data has more variance.[2] The Generator has no choice but to increase the variance of the generated samples to have a chance to fool the Discriminator.
Some may object that this can also happen when the sampled real data includes a lot of very similar pictures. Though this is technically true, in practice this is easy to fix, and remember that the similarity would have to be so high that a single pass of a simple nearest neighbor clustering would reveal it.
Moving beyond the intuition, the technical implementation is straightforward as it applies only to the Discriminator. Given that we also want to minimize the number of trainable parameters, we include only a single extra number, which seems to be enough. This number is appended as a feature map—think dimension or the last number in the tf.shape list.
The exact procedure is as follows and is depicted in listing 6.2:

1. Compute the standard deviation, across the mini-batch (or a group of it, of size group_size), for each feature in each spatial location.
2. Average these estimates over all features and spatial locations to arrive at a single scalar value.
3. Replicate this value over all spatial locations and over the group, and append it to the Discriminator's input as one extra (constant) feature map.
def minibatch_std_layer(layer, group_size=4):
    '''
    Will calculate minibatch standard deviation for a layer.
    Will do so under a prespecified tf-scope with Keras.
    Assumes layer is a float32 data type. Else needs validation/casting.
    NOTE: there is a more efficient way to do this in Keras, but just for
    clarity and alignment with major implementations (for understanding)
    this was done more explicitly. Try this as an exercise.
    '''
    # The group size must not exceed the actual mini-batch size
    group_size = K.backend.minimum(group_size, tf.shape(layer)[0])

    # Gets some shape information as shorthand; assumes channels-first
    # activations of shape [batch (N), channels (C), height (H), width (W)]
    shape = list(K.backend.int_shape(layer))
    shape[0] = tf.shape(layer)[0]

    # Reshapes so that we operate at the level of the group:
    # [group (G), mini-batch (M), channels (C), height (H), width (W)]
    minibatch = K.backend.reshape(layer,
        (group_size, -1, shape[1], shape[2], shape[3]))
    # Centers each group by subtracting its mean
    minibatch -= tf.reduce_mean(minibatch, axis=0, keepdims=True)
    # Computes the variance over the group [M, C, H, W]
    minibatch = tf.reduce_mean(K.backend.square(minibatch), axis=0)
    # Computes the standard deviation over the group [M, C, H, W]
    minibatch = K.backend.sqrt(minibatch + 1e-8)
    # Averages over feature maps and pixels [M, 1, 1, 1]
    minibatch = tf.reduce_mean(minibatch, axis=[1, 2, 3], keepdims=True)
    # Tiles the scalar over the group and all pixels [N, 1, H, W]
    minibatch = K.backend.tile(minibatch, [group_size, 1, shape[2], shape[3]])
    # Appends it as a new feature map
    return K.backend.concatenate([layer, minibatch], axis=1)
Equalized learning rate is one of those deep learning dark art techniques that is probably not clear to anyone. Although the researchers do provide a short explanation in the PGGAN paper, they avoided the topic in oral presentations, suggesting that this is probably just a hack that seems to work. Frequently in deep learning this is the case.
Furthermore, many nuances about equalized learning rate require a solid understanding of the implementation of RMSProp or Adam—the optimizer used—and also of weight initialization. So don't worry if this does not make sense to you, because it probably does not really make sense to anyone.
But if you're curious, the explanation goes something like this: we need to ensure that all the weights (w) are normalized (w' = w/c) by a constant c that is different for each layer, depending on the shape of the weight matrix. This also ensures that if any parameters need to take bigger steps to reach the optimum—because they tend to vary more—those parameters can do so.
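To make this concrete, here is a small worked example of the per-layer constant from He's initializer, using the illustrative layer shape from the docstring of listing 6.3; at runtime, weights drawn from a standard normal are scaled by this constant.

import numpy as np

gain = np.sqrt(2)                   # Typical gain for ReLU-family activations
fan_in = 3 * 3 * 48                 # A 3 x 3 kernel with 48 incoming feature maps: fan_in = 432
wscale = gain / np.sqrt(fan_in)     # Roughly 0.068; every layer gets its own constant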
Karras et al. use a simple standard normal initialization and then scale the weights per layer at runtime. Some of you may be thinking that Adam already does that—yes, Adam allows learning rates to be different for different parameters, but there's a catch. Adam adjusts the backpropagated gradient by an estimate of its standard deviation, which makes the size of the update roughly independent of the scale of that parameter's gradient. Adam has different learning rates in different directions, but does not always take into account the dynamic range—how much a dimension or feature tends to vary over given mini-batches. As some point out, this seems to solve a similar problem as weight initialization.[3]
See “Progressive Growing of GANs.md,” by Alexander Jung, 2017, http://mng.bz/5A4B.
However, if this is not clear, do not worry; we highly recommend two excellent resources: Andrej Karpathy's 2016 computer science lecture for notes about weight initialization,[4] and a Distill article for details on how Adam works.[5] The following listing shows the equalized learning rate.
See “Lecture 5: Training Neural Networks, Part I,” by Fei-Fei Li et al., 2016, http://mng.bz/6wOo.
See “Why Momentum Really Works,” by Gabriel Goh, 2017, Distill, https://distill.pub/2017/momentum/.
def equalize_learning_rate(shape, gain, fan_in=None):
    '''
    This adjusts the weights of every layer by the constant from
    He's initializer so that we adjust for the variance in the
    dynamic range in different features.
    shape  : shape of tensor (layer): these are the dimensions of each layer.
             For example, [4,4,48,3]. In this case, [kernel_size, kernel_size,
             number_of_filters, feature_maps]. But this will depend slightly
             on your implementation.
    gain   : typically sqrt(2)
    fan_in : adjustment for the number of incoming connections
             as per Xavier's / He's initialization
    '''
    # Default value is the product of all the shape dimensions except the feature
    # maps dim -- this gives us the number of incoming connections per neuron
    if fan_in is None:
        fan_in = np.prod(shape[:-1])
    # This uses He's initialization constant (He et al., 2015)
    std = gain / np.sqrt(fan_in)
    # Creates a constant out of the adjustment
    wscale = K.backend.constant(std, name='wscale', dtype=np.float32)
    # Draws standard-normal weights and uses broadcasting to apply the adjustment
    adjusted_weights = K.backend.random_normal(shape, mean=0.0, stddev=1.0) * wscale
    return adjusted_weights
See “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” by Kaiming He et al., https://arxiv.org/pdf/1502.01852.pdf.
If you are still confused, rest assured that these initialization tricks and these complicated learning rate adjustments are rarely a point of differentiation in either academia or industry. Also, just because restricting weight values between –1 and 1 seems to work somewhat better in most reruns here, that does not mean this trick will generalize to other setups. So let’s move to better-proven techniques.
Let's begin with some motivation for why we would even want to normalize the features—stability of training. Empirically, the authors from NVIDIA discovered that one of the early signs of divergent training was an explosion in feature magnitudes. A similar observation was made by the BigGAN authors, whom we discuss in chapter 12. So Karras et al. introduced a technique to combat this. On a broader note, this is frequently how GAN training is done: we observe a particular problem with the training, so we introduce mechanisms to prevent that problem from happening.
Note that most networks are using some form of normalization. Typically, they use either batch normalization or a virtual version of this technique. Table 6.1 presents an overview of normalization techniques used in the GANs presented in this book so far. You saw these in chapter 4 (DCGAN) and chapter 5—where we touched on the rest of the GANs and gradient penalties (GPs). Unfortunately, in order for batch normalization and its virtual equivalent to work, we must have large mini-batches so that the individual samples average themselves out.
| Method | Authors | G normalization | D normalization |
|---|---|---|---|
| DCGAN | Radford et al., 2015 (https://arxiv.org/abs/1511.06434) | Batch | Batch |
| Improved GAN | Salimans et al., 2016 (https://arxiv.org/pdf/1606.03498.pdf) | Virtual batch | Virtual batch |
| WGAN | Arjovsky et al., 2017 (https://arxiv.org/pdf/1701.07875.pdf) | — | Batch |
| WGAN-GP | Gulrajani et al., 2017 (http://arxiv.org/abs/1704.00028) | Batch | Layer norm |
Based on the fact that all these major implementations use normalization, it is clearly important, but why not just use standard batch normalization? Unfortunately, batch normalization is too memory intensive at our resolution. We have to come up with something that allows us to work with a few examples—that fit into our GPU memory along with the two network graphs—but still works well. This is where the need for pixel-wise feature normalization comes from and why we use it.
If we jump into the algorithm, pixel-wise feature normalization normalizes the activation magnitude in each pixel (across its feature maps) at every layer, just before the input is fed into the next layer.
Figure 6.4 illustrates the process of pixel-wise feature normalization. The exact description of step 3 is shown in equation 6.1.
For each pixel (x, y), the vector of activations across the feature maps is normalized as

$$
b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N}\sum_{j=0}^{N-1}\left(a_{x,y}^{j}\right)^{2} + \epsilon}}
\tag{6.1}
$$

where $a_{x,y}$ and $b_{x,y}$ are the original and normalized activation vectors at pixel $(x, y)$, $N$ is the number of feature maps, and $\epsilon$ is a small constant.
This formula normalizes (divides by the expression under the square root) each vector constructed in step 2 of figure 6.4. The expression under the square root is just the average of the squared activations for that particular (x, y) pixel. One thing that may surprise you is the addition of the small constant ϵ; this is simply a way to ensure that we are not dividing by zero. The whole procedure is explained in greater detail in the 2012 paper “ImageNet Classification with Deep Convolutional Neural Networks,” by Alex Krizhevsky et al. (http://mng.bz/om4d).
The last thing to note is that this term is applied only to the Generator, as the explosion in the activation magnitudes leads to an arms race only if both networks participate. The following listing shows the code.
def pixelwise_feat_norm(inputs, **kwargs):
    '''
    Uses pixelwise feature normalization as proposed by
    Krizhevsky et al., 2012. Returns the input normalized.
    :inputs : Keras / TF layers
    '''
    # Mean of the squared activations over the feature-map axis,
    # plus a small constant so we never divide by zero
    normalization_constant = K.backend.sqrt(K.backend.mean(
        inputs**2, axis=-1, keepdims=True) + 1.0e-8)
    return inputs / normalization_constant
We have gone through four clever ideas on how to improve GAN training; however, without grounding them in their effects on the training, it may be difficult to isolate those effects. Thankfully, the paper’s authors provide a helpful table to help us understand just that; see figure 6.5.
The PGGAN paper's authors use the sliced Wasserstein distance (SWD), where smaller is better. Recall from chapter 5 that a smaller Wasserstein—aka earth mover's—distance means better results, as quantified by the amount of probability mass that has to be moved to make the two distributions match. The SWD approximates this distance between patches taken from the real images and from the generated samples, so a smaller value indicates that generated patches are distributed more like real ones. The nuances of this technique are explained in the paper, but as the authors said during their presentation at ICLR, better measures—such as the Fréchet inception distance (FID)—now exist. We covered the FID in greater depth in chapter 5.
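To give a flavor of the metric, the following toy sketch computes a sliced Wasserstein distance between two equally sized sets of flattened patches using random 1-D projections. The paper's actual procedure works on a multi-scale Laplacian pyramid of patches, so treat this only as an illustration of the core idea; all names are illustrative.

import numpy as np

def sliced_wasserstein(patches_a, patches_b, n_projections=64):
    '''
    patches_a, patches_b : arrays of shape [n_patches, patch_dim]
    Returns an approximate sliced Wasserstein-1 distance.
    '''
    dim = patches_a.shape[1]
    # Random unit-length directions to project the patches onto
    dirs = np.random.normal(size=(dim, n_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # In 1-D, the Wasserstein-1 distance between two equally sized samples
    # is the mean absolute difference of their sorted projections
    proj_a = np.sort(patches_a @ dirs, axis=0)
    proj_b = np.sort(patches_b @ dirs, axis=0)
    return np.mean(np.abs(proj_a - proj_b))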
One key takeaway from this table is that large mini-batches do not work well here, because at a megapixel resolution we simply cannot fit many images into GPU memory. We have to use a smaller mini-batch—which may, overall, perform worse—and reduce the mini-batch size even further as the resolution grows, which makes training difficult.
Google recently announced that, as part of TensorFlow Extended and the general move toward bringing software engineering best practices into the machine learning world, it has created a central model and code repository called TensorFlow Hub, or TFHub. Working with TFHub is almost embarrassingly easy, especially with the models that Google has put there.
After you import the hub module and call the right URL, TensorFlow downloads and imports the model all by itself, and you can start. Each model is well documented at the same URL we use to download it; just put the URL into your web browser. In fact, to get a pretrained Progressive GAN, all you need to type is an import statement and one line of code. That's it!
The following listing shows a complete example of code that should by itself generate a face—based on the random seed that you specify in latent_vector.[7] Figure 6.6 displays the output.
This example was generated with the use of TFHub and is based on the example Colab provided at http://mng.bz/nvEa.
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub

with tf.Graph().as_default():
    # Imports the Progressive GAN from TFHub
    module = hub.Module("https://tfhub.dev/google/progan-128/1")
    # Latent space dimensionality expected by this module
    latent_dim = 512
    # Change the seed to get different faces
    latent_vector = tf.random_normal([1, latent_dim], seed=1337)
    # Uses the module to generate an image from the latent space
    interpolated_images = module(latent_vector)
    # Runs the TensorFlow session and gets back the image as a NumPy array
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        image_out = session.run(interpolated_images)

plt.imshow(image_out.reshape(128, 128, 3))
plt.show()
Hopefully, this is enough to get you started with Progressive GANs! Feel free to play around with the code and extend it. Note that the TFHub version of the Progressive GAN does not use the full 1024 × 1024 resolution, but rather 128 × 128. This is probably because running the full version is computationally expensive, and model sizes can quickly grow huge in computer vision problems.
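One possible extension, sketched below under the assumption that the same TFHub module is available, reproduces the interpolation from figure 6.1: we feed the module a batch of latent vectors that move in equal steps between two random points. The variable names are illustrative.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

with tf.Graph().as_default():
    module = hub.Module("https://tfhub.dev/google/progan-128/1")
    latent_dim = 512
    steps = 6
    z_a = np.random.RandomState(1).normal(size=(latent_dim,))
    z_b = np.random.RandomState(2).normal(size=(latent_dim,))
    # One latent vector per interpolation step, in equal increments from z_a to z_b
    zs = np.stack([(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)])
    images = module(tf.constant(zs, dtype=tf.float32))
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        image_batch = session.run(images)   # Shape (6, 128, 128, 3)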
Understandably, people are curious about the practical applications and ability to generalize Progressive GANs. One great example we’ll present is from our colleagues at Kheiron Medical Technologies, based in London, England. Recently, they released a paper that is a great testament to both the generalizability and practical applications of the PGGAN.[8]
See “High-Resolution Mammogram Synthesis Using Progressive Generative Adversarial Networks,” by Dimitrios Korkinof et al., 2018, https://arxiv.org/pdf/1807.03401.pdf.
Using a large dataset of medical mammograms,[9] these researchers managed to generate realistic 1280 × 1024 synthetic images of full-field digital mammography (FFDM), as shown in figure 6.7. This is a remarkable achievement on two levels:

- It shows that the Progressive GAN generalizes to a domain very different from the human faces it is usually demonstrated on.
- It has a clear practical application: realistic synthetic mammograms can support research into breast cancer screening.
X-ray scans for the purposes of breast cancer screening.
Figure 6.8 shows how realistic these mammograms can look. These have been randomly sampled (so no cherry-picking) and then compared to one of the closest images in the dataset.
See “MammoGAN: High-Resolution Synthesis of Realistic Mammograms,” by Dimitrios Korkinof et al., 2019, https://openreview.net/pdf?id=SJeichaN5E.
GANs may be used for many applications beyond fighting breast cancer or generating human faces; one review lists 62 other medical GAN applications published through the end of July 2018.[10] We encourage you to look at them—though, of course, not all of them use PGGANs. Generally, GANs are enabling massive leaps in many research fields, but they are frequently applied in unintuitive ways. We hope to make these techniques more accessible so that they can be used by more researchers. Make GANs, not war!
See “GANs for Medical Image Analysis,” by Salome Kazeminia et al., 2018, https://arxiv.org/pdf/1809.06222.pdf.
All of the techniques we presented in this chapter represent a general approach to solving GAN problems: training a progressively more complex model. We expect this paradigm to catch on within GANs. The same is true for TensorFlow Hub: it is to TensorFlow what PyPI/Conda is to Python, and most Python programmers use those every week!
We hope that this new Progressive GAN technique opened your eyes to what GANs can do and why people are so excited about this paper. And hopefully not just for the cat meme vector that PGGAN can produce.[12] The next chapter will give you the tools so that you can start contributing to research yourself. See you then!
See Gene Kogan’s Twitter image, 2018, https://twitter.com/genekogan/status/1019943905318572033.