In this chapter, we provide a hands-on tutorial to build a Progressive GAN by using TensorFlow and the newly released TensorFlow Hub (TFHub). The Progressive GAN (aka PGGAN, or ProGAN) is a cutting-edge technique that has managed to generate full-HD photorealistic images. Presented at one of the top machine learning conferences, the International Conference on Learning Representations (ICLR) in 2018, this technique made such a splash that Google immediately integrated it as one of the few models to be part of the TensorFlow Hub. In fact, this technique was lauded by Yoshua Bengio—one of the grandfathers of deep learning—as “almost too good to be true.” When it was released, it became an instant favorite of academic presentations and experimental projects.
We recommend that you go through this chapter with TensorFlow 1.7 or higher; 1.8 was the latest release at the time of writing, so that is the version we used. For TensorFlow Hub, we suggest using a version no later than 0.4.0, because later versions have trouble importing due to compatibility issues with TensorFlow 1.x. After reading this chapter, you'll be able to implement all the key improvements of the Progressive GAN. These four innovations are as follows:

- Progressive growing and smoothing in of higher-resolution layers
- Mini-batch standard deviation
- Equalized learning rate
- Pixel-wise feature normalization
This chapter features two main examples:

- Code implementations of the four key innovations of the Progressive GAN listed previously
- A pretrained Progressive GAN downloaded from TFHub and used to generate new images
The reasons we decided to implement the PGGAN using TFHub rather than from the ground up as we do in all the other chapters are threefold:
See “Progressive Growing of GANs for Improved Quality, Stability, and Variation,” by Tero Karras et al., 2018, https://github.com/tkarras/progressive_growing_of_gans.
Recall from chapter 2 that we have this lower-dimensional space—called latent space—that seeds our output. As with the DCGAN from chapter 4, and indeed the Progressive GAN, the trained latent space has semantically meaningful properties. This means that we can find vector offsets that, for example, introduce eyeglasses to an image of a face, and the same offset will introduce glasses in new images. We can also pick two random vectors and then move in equal increments between them and so gradually—smoothly—get an image that matches the second vector.
This is called interpolation, and you can see this process in figure 6.1. As the author of BigGAN said, meaningful transitions from one vector to another show that the GAN has learned some underlying structure.
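To make the idea concrete, here is a minimal sketch of such an interpolation. It assumes a hypothetical trained generator that maps latent vectors to images; the names and the latent dimension are illustrative only.

import numpy as np

latent_dim = 512                                      # Size of the latent space (illustrative)
z_start = np.random.normal(size=(latent_dim,))        # First random latent vector
z_end = np.random.normal(size=(latent_dim,))          # Second random latent vector

steps = 10                                            # Number of equal increments between the two vectors
interpolated = [(1 - t) * z_start + t * z_end
                for t in np.linspace(0.0, 1.0, steps)]
# Feeding each step to the generator yields a smooth transition between the two images:
# images = [generator.predict(z[np.newaxis, :]) for z in interpolated]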
In previous chapters, you learned which results are easy to achieve with GANs and which are difficult. Moreover, things like mode collapse (showing only a few examples of the overall distribution) and lack of convergence (one of the causes of poor quality of the results) are no longer alien terms to us.
Recently, a Finnish NVIDIA team released a paper that has managed to blow many previous cutting-edge papers out of the water: “Progressive Growing of GANs for Improved Quality, Stability, and Variation,” by Tero Karras et al. This paper features four fundamental innovations, so let’s walk through them in order.
Before we dive into what the Progressive GAN does, let’s start with a simple analogy. Imagine looking at a mountain region from a bird’s-eye view: you have lots of valleys, which have nice creeks and villages—generally quite habitable. Then you have many mountain tops that are rough and generally unpleasant to live on because of weather conditions. This sort of represents the loss function landscape, where we want to minimize the loss by going down the mountain slopes and into the valleys, which are much nicer.
We can imagine training as dropping a mountaineer into a random place in this mountain region and then following their path down the slope into a valley. This is what stochastic gradient descent does, and chapter 10 revisits this in a lot more detail. Now, unfortunately, if we start with a very complex mountain range, the mountaineer will not know which direction to travel. The space around our adventurer would be jagged and rough. It would be difficult to make out where the nicest, lowest valley is with lots of habitable lands. Instead, we zoom out and reduce the complexity of the mountain range to give the mountaineer a high-level picture of this particular area.
As our mountaineer gets closer to a valley, we can start increasing the complexity by zooming in on the terrain. Then we no longer see just the coarse/pixelated texture, but instead get to see the finer details. This approach has the advantage that as our mountaineer goes down the slope, they can easily make little optimizations to make the hiking easier. For example, they can take a path through a dried-up creek to make the descent into the valley even faster. That is progressive growing: increasing the resolution of the terrain as we go.
However, if you have ever seen an open world computer game or scrolled too quickly through Google Earth with 3D on, you know that quickly increasing the resolution of the terrain around you can be startling and unpleasant. Objects all of a sudden jump into existence. So instead, we progressively smooth in and slowly introduce more complexity as the mountaineer gets closer to the objective.
In technical terms, we are going from a few low-resolution convolutional layers to many high-resolution ones as we train. Thus, we first train the early layers and only then introduce a higher-resolution layer, where it is harder to navigate the loss space. We go from something simple—for example, 4 × 4 trained for several steps—to something more complex—for example, 1024 × 1024 trained for several epochs, as shown in figure 6.2.
The problem in this scenario is that upon introducing even one more layer at a time (for example, from 4 × 4 to 8 × 8), we are still introducing a massive shock to the training. What the PGGAN authors do instead is smoothly fade in those layers, as in figure 6.3, in order to give the system time to adapt to the higher resolution.
However, rather than immediately jumping to this resolution, we smoothly fade in the new, higher-resolution layer through a parameter alpha (α), which is between 0 and 1. Alpha controls how much we use the old-but-upscaled layer versus the natively larger one. On the Discriminator's side, we similarly shrink the input by 0.5× so that we can smoothly inject the trained layer for discrimination. This is (b) in figure 6.3. When we are confident about the new layer, we keep only the 32 × 32 path—(c) in the figure—and then we get ready to grow yet again after we have trained 32 × 32 properly.
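As a small illustration (not from the paper's code, and with purely illustrative names), alpha can simply be increased linearly over the number of steps spent introducing a new resolution:

def alpha_schedule(step, fade_in_steps):
    '''
    Returns alpha in [0, 1]: 0 means we use only the upscaled old layer,
    1 means we use only the new, natively higher-resolution layer.
    '''
    return min(1.0, step / float(fade_in_steps))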
For all the innovations we’ve detailed, in this section we’ll give you working but isolated versions so that we can talk code. As an exercise, you may want to try implementing these things as one GAN network, maybe using the existing prior architectures. If you are ready, let’s load up ye olde, trusty machine learning libraries and get cracking:
import tensorflow as tf
import keras as K
import numpy as np
In the code, progressive smoothing in may look something like the following listing.
def upscale_layer(layer, upscale_factor):
    '''
    Upscales layer (tensor) by the factor (int) where
    the tensor is [group, height, width, channels]
    '''
    height = layer.get_shape()[1]
    width = layer.get_shape()[2]
    size = (upscale_factor * height, upscale_factor * width)
    upscaled_layer = tf.image.resize_nearest_neighbor(layer, size)
    return upscaled_layer

def smoothly_merge_last_layer(list_of_layers, alpha):
    '''
    Smoothly merges in a layer based on a threshold value alpha.
    This function assumes that all layers are already in RGB.
    This is the function for the Generator.
    :list_of_layers : items should be tensors ordered by resolution
    :alpha          : float in (0,1)
    '''
    # The last fully trained layer, one resolution below the new one
    last_fully_trained_layer = list_of_layers[-2]
    # Upscales the trained layer so its shape matches the new one
    last_layer_upscaled = upscale_layer(last_fully_trained_layer, 2)
    # The newly added, natively higher-resolution layer
    larger_native_layer = list_of_layers[-1]
    # Ensures that the merging below is possible
    assert larger_native_layer.get_shape() == last_layer_upscaled.get_shape()
    # Weighted sum: mostly the upscaled old layer while alpha is near 0,
    # mostly the new layer as alpha approaches 1
    new_layer = (1 - alpha) * last_layer_upscaled + alpha * larger_native_layer
    return new_layer
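Listing 6.1 covers only the Generator. On the Discriminator's side, one way to sketch the corresponding fade-in (this is our illustration, not the paper's exact code) is to shrink the native-resolution input by 0.5× with average pooling, so that it can still flow through the already trained lower-resolution path while the new layer fades in. The from_rgb_new and from_rgb_old functions are hypothetical placeholders for the layers that map images to feature maps.

def downscale_layer(layer, downscale_factor=2):
    '''
    Downscales layer (tensor) by the factor (int) where
    the tensor is [group, height, width, channels]
    '''
    return tf.layers.average_pooling2d(layer, pool_size=downscale_factor,
                                       strides=downscale_factor, padding='valid')

def smoothly_merge_discriminator_input(image, alpha, from_rgb_new, from_rgb_old):
    '''
    Fades in the Discriminator's new high-resolution input path.
    :image        : native-resolution input tensor
    :alpha        : float in (0,1), the same fade-in parameter as in the Generator
    :from_rgb_new : hypothetical layer mapping the native-resolution image to features
    :from_rgb_old : hypothetical layer mapping the 0.5x-downscaled image to features
    '''
    # 0.5x shrink of the input so the already-trained path can still process it
    downscaled = downscale_layer(image, 2)
    return (1 - alpha) * from_rgb_old(downscaled) + alpha * from_rgb_new(image)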
Now that you have an understanding of the lower-level details of progressive growing and smoothing without unnecessary complexity, hopefully you can appreciate how general this idea is. Although Karras et al. were by no means the first to come up with some way of increasing model complexity during training, this seems like by far the most promising avenue and indeed the innovation that resonated the most. As of June 2019, this paper was cited over 730 times. With that context in mind, let's move on to the second big innovation.
The next innovation introduced by Karras et al. in their paper is mini-batch standard deviation. Before we dive into it, let’s recall from chapter 5 the issue of mode collapse, which occurs when the GAN learns how to create a few good examples or only slight permutations on them. We generally want to produce the faces of all the people in the real dataset, maybe not just one picture of one woman.
Therefore, Karras et al. created a way for the Discriminator to tell whether the samples it is getting are varied enough. In essence, we calculate a single extra scalar statistic for the Discriminator. This statistic is the standard deviation of all the pixels in the mini-batch that are generated by the Generator or that come from the real data. That is an amazingly simple and elegant solution: now all the Discriminator needs to learn is that if the standard deviation is low in the images from the batch it is evaluating, the image is likely fake, because the real data has more variance.[2] The Generator has no choice but to increase the variance of the generated samples to have a chance to fool the Discriminator.
Some may object that this can also happen when the sampled real data includes a lot of very similar pictures. Though this is technically true, in practice this is easy to fix, and remember that the similarity would have to be so high that a single pass of a simple nearest neighbor clustering would reveal it.
Moving beyond the intuition, the technical implementation is straightforward as it applies only to the Discriminator. Given that we also want to minimize the number of trainable parameters, we include only a single extra number, which seems to be enough. This number is appended as a feature map—think dimension or the last number in the tf.shape list.
The exact procedure is as follows and is depicted in listing 6.2:

1. Compute the standard deviation, across the mini-batch (or a group of it, of size group_size), for each feature in each spatial location.
2. Average these estimates over all features and spatial locations to arrive at a single scalar value.
3. Replicate this value over all spatial locations and over the group, and append it to the Discriminator's input as one extra (constant) feature map.
def minibatch_std_layer(layer, group_size=4):
    '''
    Will calculate minibatch standard deviation for a layer.
    Will do so under a prespecified tf-scope with Keras.
    Assumes layer is a float32 data type. Else needs validation/casting.
    NOTE: there is a more efficient way to do this in Keras, but just for
    clarity and alignment with major implementations (for understanding)
    this was done more explicitly. Try this as an exercise.
    '''
    # The group size must not exceed the actual mini-batch size
    group_size = K.backend.minimum(group_size, tf.shape(layer)[0])

    # Gets some shape information as shorthand; assumes channels-first
    # activations of shape [batch (N), channels (C), height (H), width (W)]
    shape = list(K.backend.int_shape(layer))
    shape[0] = tf.shape(layer)[0]

    # Reshapes so that we operate at the level of the group:
    # [group (G), mini-batch (M), channels (C), height (H), width (W)]
    minibatch = K.backend.reshape(layer,
        (group_size, -1, shape[1], shape[2], shape[3]))
    # Centers each group by subtracting its mean
    minibatch -= tf.reduce_mean(minibatch, axis=0, keepdims=True)
    # Computes the variance over the group [M, C, H, W]
    minibatch = tf.reduce_mean(K.backend.square(minibatch), axis=0)
    # Computes the standard deviation over the group [M, C, H, W]
    minibatch = K.backend.sqrt(minibatch + 1e-8)
    # Averages over feature maps and pixels [M, 1, 1, 1]
    minibatch = tf.reduce_mean(minibatch, axis=[1, 2, 3], keepdims=True)
    # Tiles the scalar over the group and all pixels [N, 1, H, W]
    minibatch = K.backend.tile(minibatch, [group_size, 1, shape[2], shape[3]])
    # Appends it as a new feature map
    return K.backend.concatenate([layer, minibatch], axis=1)
Equalized learning rate is one of those deep learning dark art techniques that is probably not clear to anyone. Although the researchers do provide a short explanation in the PGGAN paper, they avoided the topic in oral presentations, suggesting that this is probably just a hack that seems to work. Frequently in deep learning this is the case.
Furthermore, many nuances about equalized learning rate require a solid understanding of the implementation of RMSProp or Adam—the optimizer used—and also of weight initialization. So don't worry if this does not make sense to you, because it probably does not really make sense to anyone.
But if you're curious, the explanation goes something like this: we need to ensure that all the weights (w) are normalized (w' = w/c) by a constant c that is different for each layer, depending on the shape of the weight matrix. This also ensures that if any parameters need to take bigger steps to reach the optimum—because they tend to vary more—those parameters can do so.
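To make this concrete, here is a small worked example of the per-layer constant from He's initializer, using the illustrative layer shape from the docstring of listing 6.3; at runtime, weights drawn from a standard normal are scaled by this constant.

import numpy as np

gain = np.sqrt(2)                   # Typical gain for ReLU-family activations
fan_in = 3 * 3 * 48                 # A 3 x 3 kernel with 48 incoming feature maps: fan_in = 432
wscale = gain / np.sqrt(fan_in)     # Roughly 0.068; every layer gets its own constant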
Karras et al. use a simple standard normal initialization and then scale the weights per layer at runtime. Some of you may be thinking that Adam already does that—yes, Adam allows learning rates to be different for different parameters, but there's a catch. Adam adjusts the backpropagated gradient by an estimate of its standard deviation, which makes the size of the update roughly independent of the scale of that parameter's gradient. Adam has different learning rates in different directions, but does not always take into account the dynamic range—how much a dimension or feature tends to vary over given mini-batches. As some point out, this seems to solve a similar problem as weight initialization.[3]
See “Progressive Growing of GANs.md,” by Alexander Jung, 2017, http://mng.bz/5A4B.
However, if this is not clear, do not worry; we highly recommend two excellent resources: Andrej Karpathy's 2016 computer science lecture for notes about weight initialization,[4] and a Distill article for details on how Adam works.[5] The following listing shows the equalized learning rate.
See “Lecture 5: Training Neural Networks, Part I,” by Fei-Fei Li et al., 2016, http://mng.bz/6wOo.
See “Why Momentum Really Works,” by Gabriel Goh, 2017, Distill, https://distill.pub/2017/momentum/.
def equalize_learning_rate(shape, gain, fan_in=None):
    '''
    This adjusts the weights of every layer by the constant from
    He's initializer so that we adjust for the variance in the
    dynamic range in different features.
    shape  : shape of tensor (layer): these are the dimensions of each layer.
             For example, [4,4,48,3]. In this case, [kernel_size, kernel_size,
             number_of_filters, feature_maps]. But this will depend slightly
             on your implementation.
    gain   : typically sqrt(2)
    fan_in : adjustment for the number of incoming connections
             as per Xavier's / He's initialization
    '''
    # Default value is the product of all the shape dimensions except the feature
    # maps dim -- this gives us the number of incoming connections per neuron
    if fan_in is None:
        fan_in = np.prod(shape[:-1])
    # This uses He's initialization constant (He et al., 2015)
    std = gain / np.sqrt(fan_in)
    # Creates a constant out of the adjustment
    wscale = K.backend.constant(std, name='wscale', dtype=np.float32)
    # Draws standard-normal weights and uses broadcasting to apply the adjustment
    adjusted_weights = K.backend.random_normal(shape, mean=0.0, stddev=1.0) * wscale
    return adjusted_weights
See “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” by Kaiming He et al., https://arxiv.org/pdf/1502.01852.pdf.
If you are still confused, rest assured that these initialization tricks and these complicated learning rate adjustments are rarely a point of differentiation in either academia or industry. Also, just because restricting weight values between –1 and 1 seems to work somewhat better in most reruns here, that does not mean this trick will generalize to other setups. So let’s move to better-proven techniques.
Let's begin with some motivation for why we would even want to normalize the features—stability of training. Empirically, the authors from NVIDIA discovered that one of the early signs of divergent training was an explosion in feature magnitudes. A similar observation was made by the BigGAN authors, whom we discuss in chapter 12. So Karras et al. introduced a technique to combat this. On a broader note, this is frequently how GAN training is done: we observe a particular problem with the training, so we introduce mechanisms to prevent that problem from happening.
Note that most networks are using some form of normalization. Typically, they use either batch normalization or a virtual version of this technique. Table 6.1 presents an overview of normalization techniques used in the GANs presented in this book so far. You saw these in chapter 4 (DCGAN) and chapter 5—where we touched on the rest of the GANs and gradient penalties (GPs). Unfortunately, in order for batch normalization and its virtual equivalent to work, we must have large mini-batches so that the individual samples average themselves out.
| Method | Authors | G normalization | D normalization |
|---|---|---|---|
| DCGAN | Radford et al., 2015 (https://arxiv.org/abs/1511.06434) | Batch | Batch |
| Improved GAN | Salimans et al., 2016 (https://arxiv.org/pdf/1606.03498.pdf) | Virtual batch | Virtual batch |
| WGAN | Arjovsky et al., 2017 (https://arxiv.org/pdf/1701.07875.pdf) | — | Batch |
| WGAN-GP | Gulrajani et al., 2017 (http://arxiv.org/abs/1704.00028) | Batch | Layer norm |
Based on the fact that all these major implementations use normalization, it is clearly important, but why not just use standard batch normalization? Unfortunately, batch normalization is too memory intensive at our resolution. We have to come up with something that allows us to work with a few examples—that fit into our GPU memory along with the two network graphs—but still works well. This is where the need for pixel-wise feature normalization comes from and why we use it.
If we jump into the algorithm, pixel-wise feature normalization normalizes the activation magnitude in each pixel (across its feature maps) at every layer, just before the input is fed into the next layer.
Figure 6.4 illustrates the process of pixel-wise feature normalization. The exact description of step 3 is shown in equation 6.1.
For each pixel (x, y), the vector of activations across the feature maps is normalized as

$$
b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N}\sum_{j=0}^{N-1}\left(a_{x,y}^{j}\right)^{2} + \epsilon}}
\tag{6.1}
$$

where $a_{x,y}$ and $b_{x,y}$ are the original and normalized activation vectors at pixel $(x, y)$, $N$ is the number of feature maps, and $\epsilon$ is a small constant.
This formula normalizes (divides by the expression under the square root) each vector constructed in step 2 of figure 6.4. The expression under the square root is just the average of the squared activations for that particular (x, y) pixel. One thing that may surprise you is the addition of the small constant ϵ; this is simply a way to ensure that we are not dividing by zero. The whole procedure is explained in greater detail in the 2012 paper “ImageNet Classification with Deep Convolutional Neural Networks,” by Alex Krizhevsky et al. (http://mng.bz/om4d).
The last thing to note is that this term is applied only to the Generator, as the explosion in the activation magnitudes leads to an arms race only if both networks participate. The following listing shows the code.
def pixelwise_feat_norm(inputs, **kwargs):
    '''
    Uses pixelwise feature normalization as proposed by
    Krizhevsky et al., 2012. Returns the input normalized.
    :inputs : Keras / TF layers
    '''
    # Mean of the squared activations over the feature-map axis,
    # plus a small constant so we never divide by zero
    normalization_constant = K.backend.sqrt(K.backend.mean(
        inputs**2, axis=-1, keepdims=True) + 1.0e-8)
    return inputs / normalization_constant
We have gone through four clever ideas on how to improve GAN training; however, without grounding them in their effects on the training, it may be difficult to isolate those effects. Thankfully, the paper’s authors provide a helpful table to help us understand just that; see figure 6.5.
The PGGAN paper's authors use the sliced Wasserstein distance (SWD), where smaller is better. Recall from chapter 5 that a smaller Wasserstein—aka earth mover's—distance means better results, as quantified by the amount of probability mass that has to be moved to make the two distributions match. The SWD approximates this distance between patches taken from the real images and from the generated samples, so a smaller value indicates that generated patches are distributed more like real ones. The nuances of this technique are explained in the paper, but as the authors said during their presentation at ICLR, better measures—such as the Fréchet inception distance (FID)—now exist. We covered the FID in greater depth in chapter 5.
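To give a flavor of the metric, the following toy sketch computes a sliced Wasserstein distance between two equally sized sets of flattened patches using random 1-D projections. The paper's actual procedure works on a multi-scale Laplacian pyramid of patches, so treat this only as an illustration of the core idea; all names are illustrative.

import numpy as np

def sliced_wasserstein(patches_a, patches_b, n_projections=64):
    '''
    patches_a, patches_b : arrays of shape [n_patches, patch_dim]
    Returns an approximate sliced Wasserstein-1 distance.
    '''
    dim = patches_a.shape[1]
    # Random unit-length directions to project the patches onto
    dirs = np.random.normal(size=(dim, n_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # In 1-D, the Wasserstein-1 distance between two equally sized samples
    # is the mean absolute difference of their sorted projections
    proj_a = np.sort(patches_a @ dirs, axis=0)
    proj_b = np.sort(patches_b @ dirs, axis=0)
    return np.mean(np.abs(proj_a - proj_b))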
One key takeaway from this table is that large mini-batches do not work well here, because at a megapixel resolution we simply cannot fit many images into GPU memory. We have to use a smaller mini-batch—which may, overall, perform worse—and reduce the mini-batch size even further as the resolution grows, which makes training difficult.
Google recently announced that, as part of TensorFlow Extended and the general move toward bringing software engineering best practices into the machine learning world, it has created a central model and code repository called TensorFlow Hub, or TFHub. Working with TFHub is almost embarrassingly easy, especially with the models that Google has put there.
After you import the hub module and call the right URL, TensorFlow downloads and imports the model all by itself, and you can start. Each model is well documented at the same URL we use to download it; just put the URL into your web browser. In fact, to get a pretrained Progressive GAN, all you need to type is an import statement and one line of code. That's it!
The following listing shows a complete example of code that should by itself generate a face—based on the random seed that you specify in latent_vector.[7] Figure 6.6 displays the output.
This example was generated with the use of TFHub and is based on the example Colab provided at http://mng.bz/nvEa.
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub

with tf.Graph().as_default():
    # Imports the Progressive GAN from TFHub
    module = hub.Module("https://tfhub.dev/google/progan-128/1")
    # Latent space dimensionality expected by this module
    latent_dim = 512
    # Change the seed to get different faces
    latent_vector = tf.random_normal([1, latent_dim], seed=1337)
    # Uses the module to generate an image from the latent space
    interpolated_images = module(latent_vector)
    # Runs the TensorFlow session and gets back the image as a NumPy array
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        image_out = session.run(interpolated_images)

plt.imshow(image_out.reshape(128, 128, 3))
plt.show()
Hopefully, this is enough to get you started with Progressive GANs! Feel free to play around with the code and extend it. Note that the TFHub version of the Progressive GAN does not use the full 1024 × 1024 resolution, but rather 128 × 128. This is probably because running the full version is computationally expensive, and model sizes can quickly grow huge in computer vision problems.
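One possible extension, sketched below under the assumption that the same TFHub module is available, reproduces the interpolation from figure 6.1: we feed the module a batch of latent vectors that move in equal steps between two random points. The variable names are illustrative.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

with tf.Graph().as_default():
    module = hub.Module("https://tfhub.dev/google/progan-128/1")
    latent_dim = 512
    steps = 6
    z_a = np.random.RandomState(1).normal(size=(latent_dim,))
    z_b = np.random.RandomState(2).normal(size=(latent_dim,))
    # One latent vector per interpolation step, in equal increments from z_a to z_b
    zs = np.stack([(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)])
    images = module(tf.constant(zs, dtype=tf.float32))
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        image_batch = session.run(images)   # Shape (6, 128, 128, 3)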
Understandably, people are curious about the practical applications and ability to generalize Progressive GANs. One great example we’ll present is from our colleagues at Kheiron Medical Technologies, based in London, England. Recently, they released a paper that is a great testament to both the generalizability and practical applications of the PGGAN.[8]
See “High-Resolution Mammogram Synthesis Using Progressive Generative Adversarial Networks,” by Dimitrios Korkinof et al., 2018, https://arxiv.org/pdf/1807.03401.pdf.
Using a large dataset of medical mammograms,[9] these researchers managed to generate realistic 1280 × 1024 synthetic images of full-field digital mammography (FFDM), as shown in figure 6.7. This is a remarkable achievement on two levels:

- It shows that the Progressive GAN generalizes to a domain very different from the human faces it is usually demonstrated on.
- It has a clear practical application: realistic synthetic mammograms can support research into breast cancer screening.
X-ray scans for the purposes of breast cancer screening.
Figure 6.8 shows how realistic these mammograms can look. These have been randomly sampled (so no cherry-picking) and then compared to one of the closest images in the dataset.
See “MammoGAN: High-Resolution Synthesis of Realistic Mammograms,” by Dimitrios Korkinof et al., 2019, https://openreview.net/pdf?id=SJeichaN5E.
GANs may be used for many applications beyond fighting breast cancer or generating human faces; one review lists 62 other medical GAN applications published through the end of July 2018.[10] We encourage you to look at them—though, of course, not all of them use PGGANs. Generally, GANs are enabling massive leaps in many research fields, but they are frequently applied in unintuitive ways. We hope to make these techniques more accessible so that they can be used by more researchers. Make GANs, not war!
See “GANs for Medical Image Analysis,” by Salome Kazeminia et al., 2018, https://arxiv.org/pdf/1809.06222.pdf.
All of the techniques we presented in this chapter represent a general approach to solving GAN problems: training a progressively more complex model. We expect this paradigm to catch on within GANs. The same is true for TensorFlow Hub: it is to TensorFlow what PyPI/Conda is to Python, and most Python programmers use those every week!
We hope that this new Progressive GAN technique opened your eyes to what GANs can do and why people are so excited about this paper. And hopefully not just for the cat meme vector that PGGAN can produce.[12] The next chapter will give you the tools so that you can start contributing to research yourself. See you then!
See Gene Kogan’s Twitter image, 2018, https://twitter.com/genekogan/status/1019943905318572033.