List of Figures

Chapter 1. Introduction to GANs

Figure 1.1. Progress in human face generation

Figure 1.2. The two GAN subnetworks, their inputs and outputs, and their interactions

Figure 1.3. The GAN training algorithm has two main parts. These two parts, Discriminator training and Generator training, depict the same GAN network at different time snapshots in the corresponding stages of the training process.

Figure 1.4. These photorealistic but fake human faces were synthesized by a Progressive GAN trained on high-resolution portrait photos of celebrities.

Figure 1.5. By using a GAN variant called CycleGAN, we can turn a Monet painting into a photograph or turn an image of a zebra into a depiction of a horse, and vice versa.

Chapter 2. Intro to generative modeling with autoencoders

Figure 2.1. Placing GANs and autoencoders in the AI landscape. Different researchers might draw this differently, but we will leave this argument to academics.

Figure 2.2. Using an autoencoder in our letter example follows these steps: (1) You compress all the things you know about a machine learning engineer, and then (2) compose that to the latent space (letter to your grandmother). When she, using her understanding of words as a decoder (3), reconstructs a (lossy) version of what that means, you get out a representation of an idea in the same space (in your grandmother’s head) as the original input, which was your thoughts.

Figure 2.3. Because we know from training where our examples get placed in the latent space, we can easily generate examples similar to the ones that the model has seen. Even if not, we can easily iterate or grid-search through the latent space to determine the kinds of representations that our model can generate.

Figure 2.4. How computer vision researchers think. Enough said.

Figure 2.5. As a reminder of what a multivariate (2D) distribution looks like, we’ve plotted probability density functions of bivariate (2D) Gaussians. They are uncorrelated 2D normal distributions, except with different variances. (a) has a variance of 0.5, (b) of 1, and (c) of 2. (d), (e), and (f) are the exact same distributions as (a), (b), and (c), respectively, but plotted with a set z-axis limit at 0.7. Intuitively, this is just a function that for each point says how likely it is to occur. So (a) and (d) are much more concentrated, whereas (c) and (f) are making it possible for values far away from the origin (0,0) to occur, but each given value is not as likely.

Figure 2.6. 2D projection of all the points from the test set into the latent space and their class. In this figure, we display the 2D latent space onto the graph. We then map out the classes of these generated examples and color them accordingly, as per the legend on the right. Here we can see that the classes tend to be neatly grouped together, which tells us that this is a good representation. A color version is available in the GitHub repository for this book.

Figure 2.7. We map out the values of a subset of the latent space on a grid and pass each of those latent space values through the generator to produce this figure. This gives us a sense of how much the resulting picture changes as we vary z.

Figure 2.8. Maximum likelihood, point estimates, and true distributions. The gray (theoretical) distribution is bimodal rather than having a single mode. But because we have assumed a single mode, our model is catastrophically wrong. Alternatively, we can get mode collapse, which is worth keeping in mind for chapter 5. This is especially true when we are using flavors of the KL divergence, such as the VAE or early GANs.

Figure 2.9. In these images of fake celebrity faces generated by a VAE, the edges are quite blurry and blend into the background. This is because the CelebA dataset has centered and aligned images with consistent features around eyes and mouth, but the backgrounds tend to vary. The VAE picks the safe path and makes the background blurry by choosing a “safe” pixel value, which minimizes the loss, but does not provide good images.

Chapter 3. Your first GAN: Generating handwritten digits

Figure 3.1. In this GAN architecture diagram, both the Generator and the Discriminator are trained using the Discriminator’s loss. The Discriminator strives to minimize the loss; the Generator seeks to maximize the loss for the fake examples it produces.

Figure 3.2. The bowl-shaped mesh represents the loss J in the parameter space θ1 and θ2. The black dotted line illustrates the minimization of the loss in the parameter space through optimization.

Figure 3.3. Player 1 (left) seeks to minimize V by tuning θ1. Player 2 (middle) seeks to minimize –V (maximize V) by tuning θ2. The saddle-shaped mesh (right) shows the combined loss in the parameter space V(θ1, θ2). The dotted line shows the convergence to Nash equilibrium at the center of the saddle. (Source: Goodfellow, 2019,

Figure 3.4. The Generator network G transforms the random vector z into a fake example x*: G(z) = x*. The Discriminator network D outputs a classification of whether the input example is real. For the real examples x, the Discriminator strives to output values as close to 1 as possible. For the fake examples x*, the Discriminator strives to output values as close to 0 as possible. In contrast, the Generator wants D(x*) to be as close as possible to 1, indicating that the Discriminator was fooled into classifying a fake example as real.
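
The opposing goals in the caption for figure 3.4 can be sketched with binary cross-entropy. This is a minimal illustration, not the book's listing: the function names are hypothetical, and the Generator loss shown is the commonly used form that labels fake examples as real, which is an assumption here.

```python
import math


def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for a single probability and a 0/1 target."""
    p = min(max(prediction, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(d_real, d_fake):
    """D wants D(x) close to 1 for real x and D(x*) close to 0 for fake x*."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake):
    """G wants D(x*) close to 1, that is, the Discriminator fooled.

    Assumption: fake examples are scored against the label 1.
    """
    return bce(d_fake, 1.0)


# A confident, correct Discriminator has a lower loss than an undecided one.
confident = discriminator_loss(d_real=0.9, d_fake=0.1)
undecided = discriminator_loss(d_real=0.5, d_fake=0.5)
```

The same Discriminator output D(x*) appears in both losses with opposite targets, which is what makes the two networks adversaries.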

Figure 3.5. Over the course of the training iterations, the Generator learns to turn random noise input into images that look like members of the training data: the MNIST dataset of handwritten digits. Simultaneously, the Discriminator learns to distinguish the fake images produced by the Generator from the genuine ones coming from the training dataset.

Figure 3.6. Starting from what looks to be no more than random noise, the Generator gradually learns to emulate the features of the training dataset: in our case, images of handwritten digits.

Figure 3.7. Although far from perfect, our simple two-layer Generator learned to produce realistic-looking numerals, such as 9 and 1.

Figure 3.8. Example of real handwritten digits from the MNIST dataset used to train our GAN. Although the Generator made impressive progress toward emulating the training data, the difference between the numerals it produces and the real, human-written numerals remains clear.

Chapter 4. Deep Convolutional GAN

Figure 4.1. A 3 × 3 convolutional filter as it slides over a 5 × 5 input, left to right and top to bottom. The filter moves with a stride of 2; accordingly, it makes a total of four steps, resulting in a 2 × 2 activation map. Notice how at each step, the entire filter produces a single activation value.
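
The arithmetic in the caption for figure 4.1 can be checked with a short sketch. It assumes no padding, and the helper names are hypothetical rather than taken from the book.

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Number of filter positions along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def conv2d(image, kernel, stride):
    """Slide a square kernel over a square image; one value per step."""
    k = len(kernel)
    out = conv_output_size(len(image), k, stride)
    result = []
    for i in range(out):          # top to bottom
        row = []
        for j in range(out):      # left to right
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)     # the whole filter yields a single activation
        result.append(row)
    return result


image = [[1] * 5 for _ in range(5)]    # 5 x 5 input of ones
kernel = [[1] * 3 for _ in range(3)]   # 3 x 3 filter of ones
activation_map = conv2d(image, kernel, stride=2)
steps = sum(len(row) for row in activation_map)
```

With a 5 × 5 input, a 3 × 3 filter, and a stride of 2, the formula gives (5 - 3) / 2 + 1 = 2 positions per dimension, so four steps in total.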

Figure 4.2. An activation value for a single convolutional step within the context of the activation map (feature map) and the input and output volumes. Notice that the ConvNet filter extends through the full depth of the input volume and that the depth of the output volume is determined by stacking together activation maps.

Figure 4.3. The overall model architecture for this chapter’s tutorial is the same as the GAN we implemented in chapter 3. The only differences (not visible on this high-level diagram) are the internal representations of the Generator and Discriminator networks (the insides of the Generator and Discriminator boxes). These networks are covered in detail later in this tutorial.

Figure 4.4. The Generator takes in a random noise vector as input and produces a 28 × 28 × 1 image. It does so by multiple layers of transposed convolutions. Between the convolutional layers, we apply batch normalization to stabilize the training process. (Image is not to scale.)

Figure 4.5. The Discriminator takes in a 28 × 28 × 1 image as input, applies several convolutional layers, and—using the sigmoid activation function σ—outputs a probability that the input image is real rather than fake. Between the convolutional layers, we apply batch normalization to stabilize the training process. (Image is not to scale.)

Figure 4.6. A sample of handwritten digits generated by a fully trained DCGAN

Figure 4.7. A sample of handwritten digits generated by the GAN implemented in chapter 3

Figure 4.8. A randomly generated grid of real handwritten digits from the MNIST dataset used to train our DCGAN. Unlike the images produced by the simple GAN we implemented in chapter 3, many of the handwritten digits produced by the fully trained DCGAN are essentially indistinguishable from the training data.

Chapter 5. Training and common challenges: GANing for success

Figure 5.1. Where do GANs fit in?

Figure 5.2. ACGAN failure mode. Scores on the right indicate the softmax output.

Figure 5.3. The GAN picks up on the patterns by mostly memorizing the items, which also creates an undesirable outcome indicating that the GAN has not learned much useful information and will most likely not generalize. The proof is in the images. The first two rows are pairs of duplicate samples; the last row is the nearest neighbor of the middle row in the training set. Note that these examples are very low resolution as they appear in the paper, due to a low-resolution GAN setup.

Figure 5.4. Full HD images generated by GANs. You may consider this a teaser for the next chapter, where you will be rewarded for all your hard work in this one.

Figure 5.5. A sketch of what the hypothesized relationships are meant to look like in theory. The y-axis is the loss function for the Generator, whereas D(G(z)) is the Discriminator’s “guess” for the likelihood of the generated sample. You can see that Minimax (MM) stays flat for too long, thereby giving the Generator too little information—the gradients vanish.

Figure 5.6. A moment of silence, please.

Figure 5.7. Plot (a) should be familiar from chapter 2. For extra clarity, we provide another view of a Gaussian distribution in plot (b) of the data drawn from the same distribution, but showing vertical slices of just the first distribution on the top and just the second distribution on the right. Plot (a) then is a probability density abstraction of this data, where the z-axis represents the probability of that point being sampled. Now, even though one of these is just an abstraction of the other, how would you compare the two? How would you make sure that they are the same even when we told you? What if this distribution had 3,072 possible dimensions? In this example, we have just two! We are building up to how we’d compare two heaps-of-sand-looking distributions as in (b), but remember that as our distributions get more complicated, properly matching like for like also gets harder.

Chapter 6. Progressing with GANs

Figure 6.1. We can perform latent space interpolation because the latent vector we send to the Generator produces consistent outcomes that are predictable in some ways; not only is the generative process predictable, but also the output is not jagged—or reacting sharply to small changes—considering the latent vector changes. If we, for example, want an image that is a blend of two faces, we just need to search somewhere around the average of the two vectors.
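
The "average of the two vectors" in the caption for figure 6.1 is the midpoint of a linear interpolation. A minimal sketch with a hypothetical helper follows. Linear interpolation is an assumption here, since the caption does not fix a scheme.

```python
def interpolate(z_a, z_b, t):
    """Linear blend of two latent vectors; t=0 gives z_a, t=1 gives z_b."""
    return [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]


z_face_a = [0.0, 2.0]   # toy 2D latent vectors for illustration only
z_face_b = [2.0, 4.0]
z_blend = interpolate(z_face_a, z_face_b, 0.5)   # the average of the two
```

Passing intermediate vectors such as z_blend to a trained Generator is what produces the blended image. The Generator itself is outside this sketch.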

Figure 6.2. Can you see how we start with a smooth mountain range and gradually increase the complexity by zooming in? That is effectively what adding extra layers does to the loss function. This is handy, as our mountain region (loss function) is much easier to navigate when it is less jagged. You can think of it as follows: when we have a more complex structure (b), the loss function is jagged and hard to navigate (d), because there are so many parameters—especially in early layers—that can have a massive impact and generally increase the dimensionality of the problem. However, if we initially remove some part of the complexity (a), we can early on get a loss function that is much easier to navigate (c) and increases in complexity only as we gain confidence that we are at the approximately right part of the loss space. Only then do we move from (a) and (c) into (b) and (d) versions.

Figure 6.3. When we’ve trained for enough steps with, say, 16 × 16 resolution (a), we introduce another transposed convolution in the Generator (G) and another convolution in the Discriminator (D) to get the “interface” between G and D to be 32 × 32. But we also introduce two pathways: (1 – α) simple nearest neighbor upscaling, which does not have any trained parameters, but is also quite naive; and (α) extra transposed convolution, which requires training but will ultimately perform much better.
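
The two pathways in the caption for figure 6.3 can be sketched as a weighted sum. The helper names are hypothetical, and the "learned" image stands in for the output of the new transposed convolution, which is not implemented here.

```python
def upscale_nearest(image):
    """Nearest neighbor 2x upscaling: no trained parameters."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]  # repeat each pixel
        out.append(wide)
        out.append(list(wide))                             # repeat each row
    return out


def fade_in(upscaled, learned, alpha):
    """Blend the naive path (1 - alpha) with the trained path (alpha)."""
    return [
        [(1 - alpha) * u + alpha * l for u, l in zip(u_row, l_row)]
        for u_row, l_row in zip(upscaled, learned)
    ]


low_res = [[1, 2], [3, 4]]
naive = upscale_nearest(low_res)
learned = [[0] * 4 for _ in range(4)]   # placeholder for the new layer's output
halfway = fade_in(naive, learned, alpha=0.5)
```

As alpha moves from 0 to 1 during training, the output shifts from the naive upscaling to the newly trained layer.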

Figure 6.4. We map out all the points in an image (step 1) to a set of vectors (step 2), and then we normalize them so that they are all in the same range (typically between 0 and 1 in the high-dimensional space), which is step 3.

Figure 6.5. Contributions of various techniques to score improvements. We can see that the introduction of equalized learning rate makes a big difference, and pixel-wise normalization adds to that, though what the authors do not tell us is how effective this technique would be if we had only pixel normalization and did not introduce equalized learning rate. We include this table only as an illustration of the rough magnitude of improvement we can expect from these changes—which is an interesting lesson on its own—but more detailed discussion follows.

Figure 6.6. Output of listing 6.5. Try changing the seed in the latent_vector definition to get different outputs. A word of warning: even though this random seed argument should consistently define the output we are meant to get, we have found that on reruns we sometimes get different results, depending on the version of TensorFlow. This image is obtained using 1.9.0-rc1.

Figure 6.7. Progressive growing of FFDM. This is a great figure because it not only shows the progressively increasing resolution on these mammograms (e), but also some training statistics (a)–(d) to show you that training these GANs is messy for everyone, not just you.

Figure 6.8. In comparing the real and the generated datasets, the data looks pretty realistic and generally close to an example in the training set. In their subsequent work, MammoGAN, Kheiron has shown that these images fool trained and certified radiologists. That’s generally a good sign, especially at this high resolution. Of course, in principle, we would love to have a statistical way of measuring the quality of the generation. But as we know from chapter 5, this is hard enough to do with standard images, let alone for any arbitrary GAN.

Chapter 7. Semi-Supervised GAN

Figure 7.1. This graph approximates the monthly cumulative count of unique GAN implementations published by the research community, starting from GAN’s invention in 2014 until the first few months of 2018. As the chart makes clear, the field of generative adversarial learning has been growing exponentially since its inception, and there is no end in sight to this growth in interest and popularity.

Figure 7.2. In this Semi-Supervised GAN, the Generator takes in a random noise vector z and produces a fake example x*. The Discriminator receives three kinds of data inputs: fake data from the Generator, real unlabeled examples x, and real labeled examples (x, y), where y is the label corresponding to the given example. The Discriminator then outputs a classification; its goal is to distinguish fake examples from the real ones and, for the real examples, identify the correct class. Notice that the portion of examples with labels is much smaller than the portion of the unlabeled data. In practice, the contrast is even starker than the one shown, with labeled data forming only a tiny fraction (often as little as 1–2%) of the training data.

Figure 7.3. This SGAN diagram is a high-level illustration of the SGAN we implement in this chapter’s tutorial. The Generator turns random noise into fake examples. The Discriminator receives real images with labels (x, y), real images without labels (x), and fake images produced by the Generator (x*). To distinguish real examples from fake ones, the Discriminator uses the sigmoid function. To distinguish between the real classes, the Discriminator uses the softmax function.

Chapter 8. Conditional GAN

Figure 8.1. CGAN Generator: G(z, y) = x*|y. Using random noise vector z and label y as inputs, the Generator produces a fake example x*|y that strives to be a realistic-looking match for the label.

Figure 8.2. The CGAN Discriminator receives real examples along with their labels (x, y) and fake examples along with the label used to synthesize them (x*|y, y). The Discriminator then outputs a probability (computed by the sigmoid activation function σ) indicating whether the input pair is real rather than fake.

Figure 8.3. The CGAN Generator uses a random noise vector z and a label y (one of the n possible labels) as inputs and produces a fake example x*|y that strives to be both realistic looking and a convincing match for the label y.

Figure 8.4. The steps used to combine the conditioning label (7 in this example) and the random noise vector z into a single joint representation

Figure 8.5. The steps used to combine the label (7 in this case) and the input image into a single joint representation

Figure 8.6. Starting from random noise, CGAN learns to produce realistic-looking numerals for each of the labels in the training dataset.

Figure 8.7. Each row shows a sample of images produced to match a given numeral, 0 through 9. As you can see, the CGAN Generator has successfully learned to produce every class represented in our dataset.

Chapter 9. CycleGAN

Figure 9.1. Conditional GANs provide a powerful framework for image translation that performs well across many domains.

Figure 9.2. Because the loss works both ways, we can now reproduce not just images from summer to winter, but also from winter to summer. If G is our Generator from A to B, and F is our Generator from B to A, then F(G(a)) ≈ a and G(F(b)) ≈ b.

Figure 9.3. A picture is worth a thousand words to clarify the effects of identity loss: there is a clear tint in the cases without identity loss, and since there seems to be no reason for it, we try to penalize this behavior. Even in black and white, you should be able to see the difference. However, to see the full extent of it, check out the full-color version online.

Figure 9.4. In this image of an autoencoder from chapter 2, we used the analogy of compressing (step 1) a human concept into a more compact written form in a letter (step 2) and then expanding this concept out to the (imperfect) idea of the same notion in someone else’s head (step 3).

Figure 9.5. In this simplified architecture of the CycleGAN, we start with the input image, which either (1) goes to the Discriminator for evaluation or (2) is translated to one domain, evaluated by the other Discriminator, and then translated back.

Figure 9.6. Architecture of the Generator. The Generator itself has a contraction path (d0 to d3) and an expanding path (u1 to u4). The contraction and expanding paths are sometimes referred to as encoder and decoder, respectively.

Figure 9.7. Apples translated into oranges, and oranges into apples. These are results as they appear verbatim in our Jupyter notebook. (Results may vary slightly based on random seeds, implementation of TensorFlow and Keras, and hyperparameters.)

Figure 9.8. In this information flow of the augmented CycleGAN, we have latent vectors Za and Zb that seed the Generator along with the image input, effectively reducing the problem to two CGANs joined together. This allows us to control the generation.

Figure 9.9. This structure should be somewhat familiar from earlier, so hopefully this chapter has at least given you a head start. One extra thing to point out: we now have an extra step with labels and semantic understanding that gives us the so-called task loss. This allows us to also check the produced image for semantic meaning.

Chapter 10. Adversarial examples

Figure 10.1. In this typical loss space, remember, this is the type of loss value we can feasibly get with our deep learning algorithms. On the left, you have 2D contour lines of equal loss, and on the right, you have a 3D rendering of what a loss space may look like. Remember the mountaineering analogy from chapter 6?

Figure 10.2. A bit of noise makes a lot of difference. The picture in the middle has the noise (difference) applied to it (the picture to the right). Of course, the right picture is heavily amplified—approximately 300 times—and shifted so that it can create a meaningful image.

Figure 10.3. DAWNBench is a great place to see the current state-of-the-art models and ResNet-50 dominance, at least as of early July 2019.

Figure 10.4. The numbers here denote the percentage of adversarial examples crafted to fool the classifier in that row that also fooled that column’s classifier. The methods are deep neural networks (DNNs), logistic regression (LR), support-vector machine (SVM), decision trees (DT), nearest neighbors (kNN), and ensembles (Ens.).

Figure 10.5. It is clear that we do not get a confident classification as a wrong class in most cases on just naively sampled noise. So that is plus points to ResNet-50. On the left, we include the mean and variance we used so that you can see their impact.

Figure 10.6. Projected gradient descent takes a step in the optimal direction, wherever it may be, and then uses projection to find the nearest equivalent point in the set of points. In this case, we are trying to ensure that we still end up with a valid picture: we take an example x(k) and take the optimal step to y(k + 1) to then project it to a valid set of images as x(k + 1).
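
The step-then-project loop in the caption for figure 10.6 can be sketched as follows. This is a hedged illustration, not the book's code. It assumes a signed-gradient step, an L-infinity budget epsilon around the original image, and pixel values in [0, 1] as the valid set. The function name is hypothetical.

```python
def pgd_step(x, x_orig, grad, step_size, epsilon):
    """One projected gradient step on a flat list of pixel values."""
    def sign(g):
        return (g > 0) - (g < 0)

    # Unconstrained step: x(k) -> y(k + 1)
    y = [xi + step_size * sign(gi) for xi, gi in zip(x, grad)]

    # Projection: y(k + 1) -> x(k + 1)
    projected = []
    for yi, oi in zip(y, x_orig):
        yi = min(max(yi, oi - epsilon), oi + epsilon)  # stay within the budget
        yi = min(max(yi, 0.0), 1.0)                    # stay a valid pixel
        projected.append(yi)
    return projected


x_orig = [0.5, 0.99]
x_next = pgd_step(x_orig, x_orig, grad=[1.0, 1.0], step_size=0.1, epsilon=0.05)
```

In the example, both pixels overshoot the budget after the raw step. The first is pulled back to 0.55, and the second is clipped to 1.0 so that the result is still a valid picture.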

Figure 10.7. When we run ResNet-50 on adversarial noise, we get a different story: most of the items are misclassified after applying a PGD attack—still a simple attack.

Figure 10.8. Inception V3 applied to Gaussian noise. Notice that we are not using any attacks; this noise is just sampled from the distribution.

Chapter 11. Practical applications of GANs

Figure 11.1. Techniques used to enlarge a dataset by altering existing data include scaling (zooming in and out), translations (moving left/right and up/down), and rotations. Although effective at increasing dataset sizes, classic data augmentation techniques bring only limited additional data diversity.

Figure 11.2. The DCGAN model architecture employed by Frid-Adar et al. to generate synthetic images of liver lesions to augment their dataset, aiming to improve classification accuracy. The model architecture is similar to the DCGAN in chapter 4, underscoring the applicability of GANs across a wide array of datasets and use cases. (Note that the figure shows only the GAN flow for fake examples.)

Figure 11.3. This chart shows classification accuracy as new examples are added using two dataset augmentation strategies: standard/classic data augmentation; and augmentation using synthetic examples produced by DCGAN. Using standard augmentation (dotted line), the classification performance peaks at around 80%. Using GAN-created examples (dashed line) boosts the accuracy to over 85%.

Figure 11.4. The architectures of the CGAN Generator and the Discriminator networks that Kang et al. use in their study. The label c represents the category of clothing. The researchers use it as the conditioning label to guide the Generator to synthesize an image matching the given category, and the Discriminator to identify real image-category pairs.

Figure 11.5. In the results Kang et al. present in their paper, every image is annotated with its preference score. Each row shows results for a different shopper and product category (men’s and women’s tops, men’s and women’s bottoms, and men’s and women’s shoes).

Figure 11.6. Variations on the digit 9 obtained by moving around in the latent space (image reproduced from chapter 2). Nearby vectors produce variations on the same digit. For example, notice that as we move from left to right in the first row, the numeral 9 starts off being slightly right-slanted but eventually turns fully upright. Also notice that as we move far enough away, the number 9 morphs into another, visually similar digit. Progressive variations like these apply equally to more complex datasets, where the variations tend to be more nuanced.

Figure 11.7. The personalization process for six shoppers (three male and three female) using the same starting image: a polo shirt for men and a pair of pants for women.

Chapter 12. Looking ahead

Figure 12.1. Under divergence minimization (a), the Generator is always playing catch-up with the Discriminator (because divergence is always ≥ 0). In (b), we see what “good” NS-GAN training looks like. Again, the Generator cannot win. In (c), we can see that now the Generator can win, but more importantly, the Generator always has something to strive for (and can therefore recover a useful gradient), no matter the stage of training.

Figure 12.2. The output pixel (2 × 2 patch) ignores anything except the small highlighted region. Attention helps us solve that.

Figure 12.3. Here, we can see the regions of the image that the attention mechanism pays most attention to, given a representative query location. We can see that the attention mechanism generally cares about regions of different shapes and sizes, which is a good sign, given that we want it to pick out the regions of the image that indicate the kind of object it is.

Figure 12.4. Deadwood, South Dakota, 1877. The image on the right has been colorized . . . for a black-and-white book. Trust us. If you do not believe us, check out the online liveBook on Manning’s website to see for yourself!

Figure 12.5. Every time you click the Make Children button, Ganbreeder gives you a selection of mutated images in the nearby latent space, producing the three images below. You may start from your own sample or someone else’s—thereby making it a collaborative exercise. This is what the Crossbreed section is for, where you can select another interesting sample from other parts of the space and mix the two samples. Lastly, in Edit-Genes, you can edit parameters (such as Castle and Stone Wall, in this case) and add more or less of that feature into the picture.

