Chapter 5. Training and common challenges: GANing for success

This chapter covers

  • Meeting the challenges of evaluating GANs
  • Min-Max, Non-Saturating, and Wasserstein GANs
  • Using tips and tricks to best train a GAN

When reading this chapter, please remember that GANs are notoriously hard to both train and evaluate. As with any other cutting-edge field, opinions about what is the best approach are always evolving.

Papers such as “How to Train Your DRAGAN” are a testament to both the incredible capacity of machine learning researchers to make bad jokes and the difficulty of training Generative Adversarial Networks well. Dozens of arXiv papers preoccupy themselves solely with the aim of improving the training of GANs, and numerous workshops have been dedicated to various aspects of training at top academic conferences (including Neural Information Processing Systems, or NIPS, one of the prominent machine learning conferences[1]).


[1] NIPS 2016 featured a workshop on GAN training with many important researchers in the field, which this chapter was based on. NIPS has recently changed its abbreviation to NeurIPS.

But GAN training is an evolving challenge, and so a lot of resources—including those presented through papers and conferences—now need a certain amount of updating. This chapter provides a comprehensive yet up-to-date overview of training techniques. In this chapter, you also finally get to experience something no one has ever been known to hate—math. (But we promise not to use more than strictly necessary.)

Jokes aside, however, as the first chapter in the “Advanced Topics in GANs” section of this book, this is quite a dense chapter. We recommend that you go back and try some of the models with several parameters. Then you can return to this chapter, as you should be reading it with a strong understanding of not just what each part of a GAN does, but also the challenges in training them from your own experience.

Like the other chapters in this advanced section, this chapter is here to teach you as well as to provide a useful reference for at least a couple of years to come. Therefore, this chapter is a summary of the tips and tricks from people’s experiences, blog posts, and most relevant papers. (If academia is not your cup of tea, now is the time to get out those doodling pens and scribble over the footnotes.) We look at this chapter as a short academic intermission that will give you a clear map indicating all the amazing present and future developments of GANs.

We also hope to thereby equip you with all the basic tools to understand the vast majority of new papers that may come out. In many books, this would be presented as pros and cons lists that would not give readers the full high-level understanding of the choices. But because GANs are such a new field, simple lists are not possible, as the literature has still not agreed on some aspects conclusively. GANs are also a fast-growing field, so we would much prefer to equip you with the ability to navigate it, rather than give you information that is likely to soon be outdated.

With the purpose of this chapter explained, let’s clarify where GANs sit again. Figure 5.1 expands on the diagram from chapter 2 and shows the taxonomy of the models so you can understand what other generative techniques exist and how (dis)similar they are.

Figure 5.1. Where do GANs fit in?

(Source: “Generative Adversarial Networks (GANs),” by Ian Goodfellow, NIPS 2016 tutorial,

There are two key takeaways from this diagram:

  • All of these generative models ultimately derive from Maximum Likelihood, at least implicitly.
  • The variational autoencoder introduced in chapter 2 sits in the Explicit part of the tree. Remember that we had a clear loss function (the reconstruction loss)? Well, with GANs we do not have it anymore. Rather, we now have two competing loss functions that we will cover in a lot more depth later. But as such, the system does not have a single analytical solution.

If you know any of the other techniques pictured, that’s great. The key idea is that we are moving away from explicit and tractable, into the territory of implicit approaches toward training. However, by now you should be wondering: if we do not have an explicit loss function (even though we have the two separate losses encountered implicitly in the “Conflicting objectives” section of chapter 3), how do we evaluate a GAN? What if you’re running parallel, large-scale experiments?

To clear up potential confusion, not all the techniques in figure 5.1 come from deep learning, and we certainly do not need you to know any of them, other than VAEs and GANs!

5.1. Evaluation

Let’s revisit the chapter 1 analogy about forging a da Vinci painting. Imagine that a forger (Generator) is trying to mimic da Vinci, to get the forged painting accepted at an exhibition. This forger is competing against an art critic (Discriminator) who is trying to accept only real work into the exhibition. In this circumstance, if you are the forger who is aiming to create a “lost piece” by this great artist in order to fool the critic with a flawless impersonation of da Vinci’s style, how would you evaluate how well you’re doing? How would each actor evaluate their performance?

GANs are trying to solve the problem of never-ending competition between the forger and the art critic. Indeed, given that typically the Generator is of greater interest than the Discriminator, we should think about its evaluation extra carefully. But how would we quantify the style of a great painter or how closely we imitate it? How would we quantify the overall quality of the generation?

5.1.1. Evaluation framework

The best solution would be to have da Vinci paint all the paintings that are possible to paint, using his style, and then see whether the image generated using a GAN would be somewhere in that collection. You can think of this process as a nonapproximate version of likelihood maximization. In fact, we would know that the image either is or is not in this set, so no likelihood is involved. However, in practice, this solution is never really possible.

The next best thing would be to assess the image and point to instances of what to look for and then add up the number of errors or artifacts. But these will be highly localized and ultimately would always require a human critic to look at the art piece itself. It is a fundamentally nonscalable—although probably the second best—solution.

We want to have a statistical way of evaluating the quality of the generated samples, because that would scale and would allow us to evaluate as we are experimenting. If we do not have an easy metric to calculate, we also cannot monitor progress. This is especially a problem when comparing different experiments: imagine measuring, or even backpropagating, with a human in the loop for every hyperparameter setting. It matters all the more because GANs tend to be quite sensitive to hyperparameters. Without a statistical metric, we would have to check back with humans every time we want to evaluate the quality of training.

Why don’t we just use something that we already understand, such as maximum likelihood? It is statistical and measures something vaguely desirable, and we implicitly derive from it anyway. Despite this, maximum likelihood is difficult to use because we need to have a good estimate of the underlying distribution and its likelihood—and that may mean more than billions of images.[2] There are also reasons to want to go beyond maximum likelihood, even if we just had a good sample—which is what we effectively have with the training set.


[2] We give the problems of dimensionality better treatment in chapter 10.

What else is wrong with maximum likelihood? After all, it is a well-established metric in much of the machine learning research. Generally, maximum likelihood has lots of desirable properties, but as we have touched on, using it is not tractable as an evaluation technique for GANs.

Furthermore, in practice, approximations of maximum likelihood tend to overgeneralize and therefore deliver samples that are too varied to be realistic.[3] Under maximum likelihood, we may find samples that would never occur in the real world, such as a dog with multiple heads or a giraffe with dozens of eyes but no body. But because we don’t want GAN violence to give anyone nightmares, we should probably weed out samples that are “too general,” using a loss function and/or the evaluation method.


[3] See “How (Not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?” by Ferenc Huszár, 2015,

Another way to think about overgeneralization is to start with a probability distribution of fake and real data (for example, images) and look at what the distance functions (a way to measure distance between real and fake images’ distributions) would do in cases where there should be zero probability mass. The additional loss due to these overgeneral samples could be tiny if they are not too different, for example, because these modes are close to real data in all but a few key problems such as multiple heads. An overgeneral metric would therefore allow creation of samples even when, according to the true data-generating process, there should not be any, such as a cow with multiple heads.

That is why researchers felt that we need different evaluation principles even though what we are effectively doing is always maximizing likelihood. We are just measuring it in different ways. For those curious, KL divergence and JS divergence—which we will visit in a bit—are also based on maximum likelihood, so here we can treat them as interchangeable.

Thus you now understand that we have to be able to evaluate a sample and that we cannot simply use maximum likelihood to do this. In the following pages, we will talk about the two most commonly used and accepted metrics for statistically evaluating the quality of the generated samples: the inception score (IS) and Fréchet inception distance (FID). The advantage of those two metrics is that they have been extensively validated to be highly correlated with at least some desirable property such as visual appeal or realism of the image. The inception score was designed solely around the idea that the samples should be recognizable, but it has also been shown to correlate with human intuition about what constitutes a real image, as validated by Amazon Mechanical Turkers.[4]


[4] Amazon Mechanical Turk is a service that allows you to purchase people’s time by the hour to work on a prespecified task. It’s something like on-demand freelancers or TaskRabbit, but only online.

5.1.2. Inception score

We clearly need a good statistical evaluation method. Let’s start from a high-level wish list of what our ideal evaluation method would ensure:

  • The generated samples look like some real, distinguishable thing—for example, buckets or cows. The samples look real, and we can generate samples of items in our dataset. Moreover, our classifier is confident that what it sees is an item it recognizes. Luckily, we already have computer vision classifiers that are able to classify an image as belonging to a particular class, with certain confidence. Indeed, the score itself is named after the Inception network, which is one of those classifiers.
  • The generated samples are varied and contain, ideally, all the classes that were represented in the original dataset. This point is also highly desirable because our samples should be representative of the dataset we gave it; if our MNIST-generating GAN is always missing the number 8, we would not have a good generative model. We should have no interclass (between classes) mode collapse.[5]


    [5] See “An Introduction to Image Synthesis with Generative Adversarial Nets,” by He Huang et al., 2018,

Although we might have further requirements of our generative model, this is a good start.

The inception score (IS) was first introduced in a 2016 paper that extensively validated this metric and confirmed that it indeed correlates with human perceptions of what constitutes a high-quality sample.[6] This metric has since become popular in the GAN research community.


[6] See “Improved Techniques for Training GANs,” by Tim Salimans et al., 2016,

We have explained why we want to have this metric. Now let’s dive into the technical details. Computing the IS is a simple process:

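  1. We take the Kullback–Leibler (KL) divergence between the classifier’s predicted label distribution for each generated sample and the average label distribution over all generated samples, and then average the result over the samples.[7]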


    [7] We introduced KL divergence in chapter 2.

  2. We exponentiate the result of step 1.
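As a rough sketch of these two steps, the snippet below computes the score from a list of softmax outputs. It follows the definition in Salimans et al. (2016), where the KL divergence is taken between each generated sample's predicted label distribution p(y|x) and the marginal label distribution p(y) averaged over all generated samples. The function name `inception_score` and the toy probability lists are our own illustrative assumptions; in practice, the probabilities would come from running an Inception network over thousands of generated images.

```python
import math

def inception_score(probs):
    """Inception score from per-sample class probabilities.

    probs: list of rows, each row a softmax output p(y|x) for one generated sample.
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution p(y): the average of all rows.
    marginal = [sum(row[k] for row in probs) / n for k in range(num_classes)]
    # Step 1: average KL(p(y|x) || p(y)) over the samples.
    total_kl = 0.0
    for row in probs:
        total_kl += sum(p * math.log(p / m) for p, m in zip(row, marginal) if p > 0)
    mean_kl = total_kl / n
    # Step 2: exponentiate the result.
    return math.exp(mean_kl)

# Toy two-class examples (hypothetical softmax outputs).
confident_and_varied = [[1.0, 0.0], [0.0, 1.0]]  # sharp predictions, both classes present
collapsed = [[1.0, 0.0], [1.0, 0.0]]             # sharp predictions, one class only
unsure = [[0.5, 0.5], [0.5, 0.5]]                # classifier cannot tell what it sees
```

With two classes, the score ranges from 1 (the samples are unrecognizable, or all belong to one class) to 2 (every sample is confidently classified and both classes appear equally often). Both wish-list items therefore push the score in the same direction.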

Let’s look at an example: a failure mode in an Auxiliary Classifier GAN (ACGAN),[8] where we were trying to generate examples of daisies from the ImageNet dataset. When we ran the Inception network on the following ACGAN failure mode, we saw something like figure 5.2; your results may differ, depending on your OS, TensorFlow version, and implementation details.


[8] See “Conditional Image Synthesis with Auxiliary Classifier GANs,” by Augustus Odena et al., 2017,

Figure 5.2. ACGAN failure mode. Scores on the right indicate the softmax output.

(Source: Odena, 2017,

The important thing to note here is that the Inception classifier is not certain what it’s looking at, especially among the first three categories. Humans would work out that it’s probably a flower, but even we are not sure. Overall confidence in the predictions is also quite low (scores go up to 1.00). This is an example of something that would receive a low IS, which matches our two requirements from the start of the section. Thus, our metrics journey has been a success, as this matches our intuition.

5.1.3. Fréchet inception distance

The next problem to solve is the lack of variety of examples. Frequently, GANs learn only a handful of images for each class. In 2017, a new solution was proposed: the Fréchet inception distance (FID).[9] The FID improves on the IS by making it more robust to noise and allowing the detection of intraclass (within class) sample omissions.


[9] See “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,” by Martin Heusel et al., 2017,

This is important, because if we accept the IS baseline, then producing only one type of a category technically satisfies the category-being-generated-sometimes requirement. But, for example, if we are trying to create a cat-generation algorithm, this is not actually what we want (say, if we had multiple breeds of cats represented). Furthermore, we want the GAN to output samples that present a cat from more than one angle and, generally, images that are distinct.

We equally do not want the GAN to simply memorize the images. Luckily, that is much easier to detect—we can look at the distance between images in pixel space. Figure 5.3 shows what that may look like. Technical implementation of the FID is again complex, but the high-level idea is that we are looking for a generated distribution of samples that minimizes the number of modifications we have to make to ensure that the generated distribution looks like the distribution of the true data.

Figure 5.3. The GAN picks up on the patterns by mostly memorizing the items, which also creates an undesirable outcome indicating that the GAN has not learned much useful information and will most likely not generalize. The proof is in the images. The first two rows are pairs of duplicate samples; the last row is the nearest neighbor of the middle row in the training set. Note that these examples are very low resolution as they appear in the paper, due to a low-resolution GAN setup.

(Source: “Do GANs Actually Learn the Distribution? An Empirical Study,” by Sanjeev Arora and Yi Zhang, 2017,
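The pixel-space check for memorization can be sketched as a nearest-neighbor search. This is a minimal illustration that assumes images are flattened into equal-length lists of pixel intensities; the function names and the toy data are hypothetical, and a real check would run over the full training set.

```python
import math

def pixel_distance(a, b):
    """Euclidean distance between two flattened images of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_training_image(generated, training_set):
    """Return (index, distance) of the training image closest to `generated`."""
    distances = [pixel_distance(generated, t) for t in training_set]
    best = min(range(len(training_set)), key=lambda i: distances[i])
    return best, distances[best]

# Two tiny 2 x 2 "training images" and one generated image, flattened.
training_set = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
generated = [0.0, 0.1, 1.0, 0.9]
index, distance = nearest_training_image(generated, training_set)
```

If the nearest-neighbor distance is close to zero for many generated samples, the Generator is most likely reproducing training images rather than generalizing.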

The FID is calculated by running images through an Inception network. In practice, we compare the intermediate representations—feature maps or layers—rather than the final output (in other words, we embed them). More concretely, we evaluate the distance of the embedded means, the variances, and the covariances of the two distributions—the real and the generated one.
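For reference, this distance has a closed form under the assumption, made by Heusel et al., that the embedded features of each set follow a multivariate Gaussian. The symbols are our own labels: μ_r and Σ_r are the mean and covariance of the real embeddings, and μ_g and Σ_g are those of the generated embeddings.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better: the value is 0 when the two sets of embeddings have identical means and covariances.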

To abstract away from images, if we have a domain of well-understood classifiers, we can use their predictions as a measure of whether this particular sample looks realistic. To summarize, the FID is a way of abstracting away from a human evaluator and allows us to reason statistically, in terms of distributions, even about things as difficult to quantify as the realism of an image.

Because this metric is so new, it is still worth waiting to see whether a flaw may be revealed in a later paper. But given the number of reputable authors who have already started using this metric, we decided to include it.[10]


[10] See “Is Generator Conditioning Causally Related to GAN Performance?” by Augustus Odena et al., 2018, See also S. Nowozin (Microsoft Research) talk at UCL, February 10, 2018.

5.2. Training challenges

Training a GAN can be complicated, and we will walk you through the best practices. Here we provide only a high-level, accessible set of explanations that do not dive deep into the mathematics that proves the theorems or shows the evidence, because the details are beyond the scope of this book. We encourage you to go to the sources and decide for yourself. Frequently, the authors even provide code samples to help you get started.

Here is a list of the main problems:

  • Mode collapse—In mode collapse, some of the modes (for example, classes) are not well represented in the generated samples. The mode collapses even though the real data distribution has support for the samples in this part of the distribution; for example, there will be no number 8 among the samples generated by a GAN trained on the MNIST dataset. Note that mode collapse can happen even if the network has converged. We talked about interclass mode collapse during the explanation of the IS and intraclass mode collapse when discussing the FID.
  • Slow convergence—This is a big problem with GANs and unsupervised settings, in which generally the speed of convergence and available compute are the main constraints—unlike with supervised learning, in which available labeled data is typically the first barrier. Moreover, some people believe that compute, not data, is going to be the determining factor in the AI race in the future. Plus, everyone wants fast models that do not take days to train.
  • Overgeneralization—Here, we talk especially about cases in which modes (potential data samples) that should not have support (should not exist), do. For example, you might see a cow with multiple bodies but only one head, or vice versa. This happens when the GAN overgeneralizes and learns things that should not exist based on the real data.

Note that mode collapse and overgeneralization can sometimes most naively be resolved by reinitializing the algorithm, but such an algorithm is fragile, which is bad. This list gives us, broadly, two key metrics: speed and quality. But even these two metrics are similar, as much of training is ultimately focused on closing the gap between the real and the generated distribution faster.

So how do we resolve this? When it comes to GAN training, several techniques can help us improve the training process, just as they would with any other machine learning algorithm:

  • Adding network depth
  • Changing the game setup

    • Min-Max design and stopping criteria that were proposed by the original paper
    • Non-Saturating design and stopping criteria that were proposed by the original paper[11]


      [11] See “Generative Adversarial Networks,” by Ian Goodfellow et al., 2014,

    • Wasserstein GAN as a recent improvement
  • A number of training hacks with commentary

    • Normalizing the inputs
    • Penalizing the gradients
    • Training the Discriminator more
    • Avoiding sparse gradients
    • Changing to soft and noisy labels

5.2.1. Adding network depth

As with many machine learning algorithms, the easiest way to make learning more stable is to reduce the complexity. If you can start with a simple algorithm and iteratively add to it, you get more stability during training, faster convergence, and potentially other benefits. Chapter 6 explores this idea in more depth.

You could quickly achieve stability with both a simple Generator and Discriminator and then add complexity as you train, as explained in one of the most mind-blowing GAN papers.[12] Here, the authors from NVIDIA progressively grow the two networks so that at the end of each training cycle, we double the output size of the Generator and double the input of the Discriminator. We start with two simple networks and train until we achieve good performance.


[12] See “Progressive Growing of GANs for Improved Quality, Stability, and Variation,” by Tero Karras et al., 2017,

This ensures that rather than starting with a massive parameter space, which is orders of magnitude larger than the initial input size, we start by generating an image of 4 × 4 pixels and navigating this parameter space before doubling the size of the output. We repeat this until we reach images of size 1024 × 1024.
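To make the doubling schedule concrete, the short sketch below lists the resolutions the two networks would pass through. The function name `resolution_schedule` and its defaults are our own illustrative choices, taken from the 4 × 4 starting size and 1024 × 1024 final size mentioned above; it does not reproduce the paper's training code.

```python
def resolution_schedule(start=4, final=1024):
    """Return the output resolutions visited when doubling from start to final."""
    resolutions = []
    size = start
    while size <= final:
        resolutions.append(size)
        size *= 2  # each training cycle doubles the Generator's output size
    return resolutions

schedule = resolution_schedule()
```

Under these assumptions, the networks pass through nine resolutions, which means eight doublings separate the first 4 × 4 images from the final 1024 × 1024 ones.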

See how impressive this is for yourself; both the pictures in figure 5.4 are generated. Now we are moving beyond the blurry 64 × 64 images that autoencoders can generate.

Figure 5.4. Full HD images generated by GANs. You may consider this a teaser for the next chapter, where you will be rewarded for all your hard work in this one.

(Source: Karras et al., 2017,

This approach has these advantages: stability, speed of training, and, most importantly, the quality of the samples produced as well as their scale. Although this paradigm is new, we expect more and more papers to use it. You should definitely experiment with it also, because it is a technique that can be applied to virtually any type of GAN.

5.2.2. Game setups

One way to think about the two-player competitive nature of GANs is to imagine that you are playing the game of Go or any other board game that can end at any point, including chess. (Indeed, this borrows from DeepMind’s approach to AlphaGo and its split into policy and value network.) As a player, you need to be able to not only know the game’s objective and therefore what both players are trying to accomplish, but also understand how close you are to victory. So you have rules and you have a distance (victory) metric—for example, the number of pawns lost.

But just as not every board-game victory metric applies equally well to every game, some GAN victory metrics—distances or divergences—tend to be used with particular game setups and not with others. It is worth examining each loss function (victory metrics) and the player dynamics (game setup) separately.

Here, we start to introduce some of the mathematical notation that describes the GAN problem. The equations are important, and we promise we won’t scare you with any more than necessary. The reason we introduce them is to give you a high-level understanding as well as equip you with the tools to understand what a lot of GAN researchers still do not seem to distinguish. (Maybe they should train the Discriminator in their head—oh, well.)

5.2.3. Min-Max GAN

As we explained earlier in this book, you can think of the GAN setup from a game-theoretical point of view, where you have two players trying to outplay each other. But even the original 2014 paper mentioned that there are two versions of the game. In principle, the more understandable and the more theoretically well-grounded approach is exactly the one we described: just consider the GAN problem a min-max game. Equation 5.1 describes the loss function for the Discriminator.

equation 5.1.

The Es stand for expectation over either x (true data distribution) or z (latent space), D stands for the Discriminator’s function (mapping image to probability), and G stands for the Generator’s function (mapping latent vector to an image). This first equation should be familiar from any binary classification problem. If we give ourselves some freedom and get rid of the complexity, we can rewrite this equation as follows:

This states that the Discriminator is trying to minimize the likelihood of mistaking a real sample for a fake one (first part) or a fake sample for a real one (the second part).

Now let’s turn our attention to the Generator’s loss function in equation 5.2.

equation 5.2.

Because we have only two agents and they are competing against each other, it makes sense that the Generator’s loss would be a negative of the Discriminator’s.

Putting it all together: we have two loss functions, and one is the negative value of the other. The adversarial nature is clear. The Generator is trying to outsmart the Discriminator. As for the Discriminator, remember that it is a binary classifier. The Discriminator also outputs only a single number—not the binary class—so it’s punished for its confidence or lack thereof. The rest is just some fancy math to give us nice properties such as asymptotic consistency to the Jensen–Shannon divergence (which is a great phrase to memorize if you’re trying to curse someone).
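Written out, the two Min-Max losses can be sketched as follows. We assume the convention of Goodfellow's NIPS 2016 tutorial, in which each player minimizes its own cost, and we use p_r and p_z as our own labels for the real-data and latent distributions. Sign and scaling conventions differ between sources (the original paper states the same quantity as a value function that the Discriminator maximizes), so treat this as one consistent way of writing equations 5.1 and 5.2.

```latex
J^{(D)} = -\,\mathbb{E}_{x \sim p_r}\big[\log D(x)\big]
          - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

J^{(G)} = -\,J^{(D)}
```

The first term of J^(D) penalizes the Discriminator for assigning low probability to real samples, and the second term penalizes it for assigning high probability to generated ones.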

We previously explained why we typically don’t use maximum likelihood. Instead, we use measures such as the KL divergence and the Jensen–Shannon divergence (JSD) and, more recently, the earth mover’s distance, also known as Wasserstein distance. But all these divergences help us understand the difference between the real and the generated distribution. For now, just think of the JSD as a symmetric version of the KL divergence, which we introduced in chapter 2.


Jensen–Shannon divergence (JSD) is a symmetric version of KL divergence. Whereas KL(p,q) != KL(q,p), it is the case that JSD(p,q) == JSD(q,p).
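The asymmetry of KL and the symmetry of the JSD are easy to check numerically. The sketch below uses two small discrete distributions chosen purely for illustration; the function names are ours, and the KL helper assumes q is nonzero wherever p is nonzero.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.9, 0.1]
q = [0.5, 0.5]
kl_pq = kl_divergence(p, q)  # roughly 0.37
kl_qp = kl_divergence(q, p)  # roughly 0.51
```

Swapping the arguments changes the KL value, whereas `js_divergence(p, q)` and `js_divergence(q, p)` agree because both distributions are compared against the same midpoint.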

For those of you who want more detail, KL divergence, as well as JSD, are generally regarded as what GANs are ultimately trying to minimize. These are both types of distance metrics that help us understand how different the two distributions are in a high-dimensional space. Some neat proofs connect those divergences and the min-max version of the GAN; however, these concerns are too academic for this book. If this paragraph makes little sense, you’re not having a stroke; don’t worry. It’s just statistician things.

We typically do not use the Min-Max GAN (MM-GAN) beyond the nice theoretical guarantees it gives us. It serves as a neat theoretical framework to understand GANs: both as a game-theoretical concept—stemming from the competitive nature between the two networks/players—as well as an information-theoretical one. Beyond that, there are ordinarily no advantages to the MM-GAN. Typically, only the next two setups are used.

5.2.4. Non-Saturating GAN

In practice, it frequently turns out that the min-max approach creates more problems, such as slow convergence for the Discriminator. The original GAN paper proposes an alternative formulation: Non-Saturating GAN (NS-GAN). In this version of the problem, rather than trying to put the two loss functions as direct competitors of each other, we make the two loss functions independent, as shown in equation 5.3, but directionally consistent with the original formulation (equation 5.2).

Again, let’s focus on a general understanding: the two loss functions are no longer set directly against each other. But in equation 5.3, you can see that the Generator is trying to minimize the opposite of the second term of the Discriminator in equation 5.4. Basically, it is trying not to get caught for the samples that it generates.

equation 5.3.

equation 5.4.
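One way to write the two Non-Saturating losses is shown below. It assumes the convention of Goodfellow's NIPS 2016 tutorial (each player minimizes its own cost), with p_r and p_z as our own labels for the real-data and latent distributions. Sign and scaling conventions vary between sources, so this is a sketch of equations 5.3 and 5.4 rather than their exact typeset form.

```latex
J^{(G)} = -\,\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]

J^{(D)} = -\,\mathbb{E}_{x \sim p_r}\big[\log D(x)\big]
          - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The Generator no longer minimizes the log-probability of being caught, log(1 − D(G(z))); it maximizes the log-probability of being mistaken for real, log D(G(z)).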

The intuition for the Discriminator is exactly the same as it was before: equation 5.1 and equation 5.4 are identical, but the equivalent of equation 5.2 has now changed. The main reason for the NS-GAN is that in the MM-GAN’s case, the gradients can easily saturate (get close to 0), which leads to slow convergence, because the weight updates that are backpropagated are either 0 or tiny. Perhaps a picture would make this clearer; see figure 5.5.

Figure 5.5. A sketch of what the hypothesized relationships are meant to look like in theory. The y-axis is the loss function for the Generator, whereas D(G(z)) is the Discriminator’s “guess” for the likelihood of the generated sample. You can see that Minimax (MM) stays flat for too long, thereby giving the Generator too little information—the gradients vanish.

(Source: “Understanding Generative Adversarial Networks,” by Daniel Seita, 2017,

You can see that around 0.0, the gradient of both maximum likelihood and MM-GAN is close to 0, which is where a lot of early training happens, whereas the NS-GAN has a much higher gradient there, so training should happen much more quickly at the start.
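The saturation argument can be checked numerically. The sketch below assumes the Discriminator ends in a sigmoid, so that D(G(z)) = sigmoid(a), and compares the derivative of each Generator loss with respect to the pre-sigmoid value a, which is the signal that gets backpropagated. The loss forms (log(1 − D(G(z))) for MM and −log D(G(z)) for NS) follow the original GAN paper; the function names are our own.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def mm_generator_grad(a):
    """d/da of log(1 - sigmoid(a)), the Min-Max Generator loss."""
    return -sigmoid(a)

def ns_generator_grad(a):
    """d/da of -log(sigmoid(a)), the Non-Saturating Generator loss."""
    return -(1.0 - sigmoid(a))

# Early in training the Discriminator easily rejects fakes: D(G(z)) is near 0.
early_logit = -5.0
d_of_g = sigmoid(early_logit)             # about 0.0067
mm_grad = mm_generator_grad(early_logit)  # nearly 0: almost no learning signal
ns_grad = ns_generator_grad(early_logit)  # close to -1: a strong learning signal
```

When the Discriminator confidently rejects a sample, the MM gradient is roughly as small as D(G(z)) itself, while the NS gradient stays close to 1 in magnitude.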

We don’t have a good theoretical understanding of why the NS variant should converge to the Nash equilibrium. In fact, because the NS-GAN is heuristically motivated, using this form no longer gives us any of the neat mathematical guarantees we used to get; see figure 5.6. Because of the complexity of the GAN problem, however, even in the NS-GAN’s case, there is a chance that the training might not converge at all, although it has been empirically shown to perform better than the MM-GAN.

Figure 5.6. A moment of silence, please.

But our dreadful sacrifice leads to significant improvement in performance. The neat thing about the NS approach is not only that the initial training is faster, but also, because the Generator learns faster, the Discriminator learns faster too. This is desirable, because (almost) all of us are on a tight computational and time budget, and the faster we can learn, the better. Some argue that the NS-GAN has not yet been surpassed on a fixed computational budget, and even Wasserstein GAN is not conclusively a better architecture.[13]


[13] See “Are GANs Created Equal? A Large-Scale Study,” by Mario Lucic et al., 2017,

5.2.5. When to stop training

Strictly speaking, the NS-GAN

  • Is no longer asymptotically consistent with the JSD
  • Has an equilibrium state that theoretically is even more elusive

The first point is important, because the JSD is a meaningful tool in explaining why an implicitly generated distribution should even converge at all to the real data distribution. In principle, this gives us stopping criteria; but in practice, this is almost pointless, because we can never verify when the true distribution and the generated distribution have converged. People typically decide when to stop by looking at the generated samples every couple of iterations. More recently, some people have started looking at defining stopping criteria by FID, IS, or the less popular sliced Wasserstein distance.

The second point is also important because the instability obviously causes training problems. One of the more important questions is knowing when to stop. In the two original formulations of the GAN problem, we are never given a clear set of conditions under which the training has finished in practice. In principle, we are always told that once we reach Nash equilibrium, the training is done, but in practice this is again hard to verify, because the high dimensionality makes equilibrium difficult to prove.

If you were to plot the loss functions of the Generator and the Discriminator, they would typically jump all over the place. This makes sense because they’re competing against each other, so if one gets better, the other one gets a larger loss. Just by looking at the two loss functions, it is unclear when we’ve actually finished training.
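One way to make a metric-based stopping rule concrete is a patience check on the FID. This is a hypothetical sketch rather than an established recipe: the function name, the `patience` and `min_delta` parameters, and the example values are all our own assumptions, and the FID values would have to be computed separately at regular intervals during training.

```python
def should_stop(fid_history, patience=3, min_delta=0.0):
    """Return True when FID has not improved over the last `patience` evaluations.

    fid_history: FID values recorded at successive evaluations (lower is better).
    min_delta: an improvement smaller than this amount does not count.
    """
    if len(fid_history) <= patience:
        return False  # not enough evaluations to judge yet
    best_before = min(fid_history[:-patience])
    recent_best = min(fid_history[-patience:])
    return recent_best >= best_before - min_delta

improving = [90.0, 70.0, 55.0, 48.0, 45.0]
plateaued = [90.0, 70.0, 55.0, 56.0, 57.0, 55.5]
```

Here `should_stop(improving)` is False because the latest evaluations still beat the earlier best, whereas `should_stop(plateaued)` is True because none of the last three evaluations improves on 55.0.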

In the NS-GAN’s defense, it should be said that it is still much faster than the Wasserstein GAN. As a result, the NS-GAN may get over these limitations by being able to run more quickly.

5.2.6. Wasserstein GAN

Recently, a new development in GAN training has emerged and quickly reached academic popularity: Wasserstein GAN (WGAN).[14] It is now mentioned by virtually every major academic paper and many practitioners. Ultimately, the WGAN is important for three reasons:


See “Wasserstein GAN,” by Martin Arjovsky et al., 2017.

  • It significantly improves on the loss functions, which are now interpretable and provide clearer stopping criteria.
  • Empirically, the WGAN tends to have better results.
  • Unlike a lot of research into GANs, it has clear theoretical backing that starts from the loss and shows how the KL divergence that we are trying to approximate is ultimately not well justified theoretically or practically. Based on this theory, it then proposes a better loss function that mitigates this problem.

The importance of the first point should be fairly obvious from the previous section. Given the competitive nature between Generator and Discriminator, we don’t have a clear point at which we want to stop training. The WGAN uses the earth mover’s distance as a loss function that clearly correlates with the visual quality of the samples generated. The benefits of the second and third points are somewhat obvious—we want to have higher-quality samples and better theoretical grounding.

How is this magic achieved? Let’s look at the Wasserstein loss for the Discriminator—or the critic, as the WGAN calls it—in more detail. Take a look at equation 5.5.

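equation 5.5.

$$\max_{w \in W} \; \mathbb{E}_{x \sim P_r}\left[f_w(x)\right] - \mathbb{E}_{z \sim p(z)}\left[f_w(g_\theta(z))\right]$$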

This equation is somewhat similar to what you have seen before (as a high-level simplification of equation 5.1), with some important differences. We now have the function fw, which acts as a Discriminator. The critic is trying to estimate the earth mover’s distance, and looks for the maximum difference between the real (first term) and the generated (second term) distribution under different (valid) parametrizations of the fw function. And we are now simply measuring the difference. The critic is trying to make the Generator’s life the hardest it could be by looking at different projections using fw into shared space in order to maximize the amount of probability mass it has to move.

Equation 5.6 shows the Generator, as it now has to include the earth mover’s distance.

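equation 5.6.

$$\min_{\theta} \; \mathbb{E}_{x \sim P_r}\left[f_w(x)\right] - \mathbb{E}_{z \sim p(z)}\left[f_w(g_\theta(z))\right]$$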

On a high level, in this equation we are trying to minimize the distance between the expectation of the real distribution and the expectation of the generated distribution. The paper that introduced the WGAN itself is complex, but the gist is that fw is a function satisfying a technical constraint.


The technical constraint that fw satisfies is 1-Lipschitz continuity: for all x1, x2: |f(x1) – f(x2)| ≤ |x1 – x2|.
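To make the 1-Lipschitz constraint more concrete, the following minimal sketch estimates the largest slope of a one-dimensional function over sampled pairs of points. The helper name `empirical_lipschitz` and the example functions are illustrative assumptions rather than anything from the WGAN paper, and a sampled estimate can only ever give a lower bound on the true constant.

```python
import numpy as np

def empirical_lipschitz(f, xs):
    """Largest |f(x1) - f(x2)| / |x1 - x2| over all distinct pairs in xs."""
    xs = np.asarray(xs, dtype=float)
    ys = f(xs)
    dx = np.abs(xs[:, None] - xs[None, :])  # pairwise input distances
    dy = np.abs(ys[:, None] - ys[None, :])  # pairwise output distances
    mask = dx > 0                           # skip pairs of identical points
    return float(np.max(dy[mask] / dx[mask]))

xs = np.linspace(-3.0, 3.0, 61)
half_slope = empirical_lipschitz(lambda x: 0.5 * x, xs)  # satisfies the constraint
tanh_est = empirical_lipschitz(np.tanh, xs)              # also satisfies it
triple = empirical_lipschitz(lambda x: 3.0 * x, xs)      # violates it
```

A function such as 3x changes its output three times faster than its input, so it would not be a valid choice of fw under this constraint.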

The problem that the Generator is trying to solve is similar to the one before, but let’s go into more detail anyway:

  1. We draw x from either the real distribution (x ~ Pr) or the generated distribution (x* = gθ(z), where z ~ p(z)).
  2. The generated samples are sampled from z (the latent space) and then transformed via gθ to get the samples (x*) in the same space and then evaluated using fw.
  3. We are trying to minimize our loss function—or distance function, in this case—the earth mover’s distance. The actual numbers are calculated using the earth mover’s distance, which we will explain later.
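The three steps above can be sketched with toy one-dimensional stand-ins. Everything here is an assumption made for illustration: the Generator simply shifts the latent noise by theta, the critic is a linear score w * x (which is 1-Lipschitz when |w| ≤ 1), and the real data is a Gaussian centered at 2. A real WGAN would use neural networks for both functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta):
    """Toy g_theta: shifts latent noise by theta (illustrative only)."""
    return z + theta

def critic(x, w):
    """Toy f_w: a linear score; |w| <= 1 keeps it 1-Lipschitz."""
    return w * x

def wasserstein_objective(real, z, theta, w):
    """Mean critic score on real samples minus mean score on generated ones."""
    fake = generator(z, theta)
    return float(np.mean(critic(real, w)) - np.mean(critic(fake, w)))

real = rng.normal(loc=2.0, scale=1.0, size=10_000)  # step 1: x ~ P_r
z = rng.normal(size=10_000)                         # step 2: z ~ p(z)

# Step 3: the critic prefers the w that makes this large,
# while the Generator prefers the theta that makes it small.
far = wasserstein_objective(real, z, theta=0.0, w=1.0)
near = wasserstein_objective(real, z, theta=2.0, w=1.0)
```

With the Generator far from the data (theta = 0) the estimate is close to 2, and it drops toward 0 once theta matches the real mean, which is the kind of readable signal the text attributes to the Wasserstein loss.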

The setup is also great because we have a much more understandable loss (for example, no logarithms). We also have more tunable training, because in WGAN settings, we have to set a clipping constant, which acts a lot like a learning rate in standard machine learning. This gives us an extra parameter to tune, but that can be a double-edged sword, if your GAN architecture ends up being very sensitive to it. But without going into the mathematics too much, the WGAN has two practical implications:
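As a rough illustration of the clipping constant, the sketch below clamps every critic weight into the range [-c, c] after an update. The function name `clip_weights` and the sample weight arrays are hypothetical; c = 0.01 is used here only as a commonly cited starting value, and your own architecture may need something different.

```python
import numpy as np

def clip_weights(weights, c=0.01):
    """Clamp every critic weight array into [-c, c] after an update."""
    return [np.clip(w, -c, c) for w in weights]

# Hypothetical critic parameters: one weight matrix and one bias vector.
weights = [
    np.array([[0.5, -0.002], [-0.3, 0.009]]),
    np.array([0.02, -0.02]),
]
clipped = clip_weights(weights, c=0.01)
```

Weights that were already inside the range are left untouched, while larger ones are pulled back to the boundary, which is why a poorly chosen c can either cripple the critic or fail to constrain it.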

  • We now have clearer stopping criteria because this GAN has been validated by later papers that show a correlation between the Discriminator loss and the perceptual quality. We can simply measure the Wasserstein distance, and that helps inform when to stop.
  • We can now train the WGAN to convergence. This is relevant because meta-review papers[15] showed that using the JS loss and the divergence between the Generator and the real distribution as a measure of training progress can often be meaningless.[16] To translate that into human terms, sometimes in chess, you need to lose a couple of rounds and therefore temporarily do worse in order to learn in a couple of iterations and ultimately do better.


    A meta-review is just a review of reviews. It helps researchers pool findings from across several papers.


    See “Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence at Every Step,” by William Fedus et al., 2018.

This may sound like magic. But this is partially because the WGAN is using a different distance metric than anything you’ve encountered so far. It is called the earth mover’s distance, or Wasserstein distance, and the idea behind it is clever. We will be nice for once and not torture you with more math, but let’s talk about this idea.

You implicitly understand that there are two distributions that are both very high dimensional: the real data-producing one (that we never fully see) and the samples from the Generator (the fake one). Think about how vast the sample space is for even a 32 × 32 RGB image (32 × 32 × 3 = 3,072 values, each taking one of 256 levels). Now imagine all of this probability mass for both of these distributions as being just two sets of hills. Chapter 10 revisits this in more detail. For reference, we include figure 5.7, but it builds largely on the same ideas as chapter 2.
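To get a feel for how vast that sample space is, here is a small back-of-the-envelope calculation. It assumes 8-bit values (256 levels) for each of the three color channels and counts how many decimal digits the number of distinct images has.

```python
import math

height, width, channels = 32, 32, 3
levels = 256  # assumed 8-bit intensity per value

dimensions = height * width * channels  # number of values per image
# The number of distinct images is levels ** dimensions; count its decimal digits
# with logarithms rather than printing the full integer.
num_digits = math.floor(dimensions * math.log10(levels)) + 1
```

Even this tiny image format has 3,072 dimensions, and the count of possible images runs to several thousand digits, so any dataset covers only a vanishing fraction of the space.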

Figure 5.7. Plot (a) should be familiar from chapter 2. For extra clarity, we provide another view of a Gaussian distribution in plot (b) of the data drawn from the same distribution, but showing vertical slices of just the first distribution on the top and just the second distribution on the right. Plot (a) then is a probability density abstraction of this data, where the z-axis represents the probability of that point being sampled. Now, even though one of these is just an abstraction of the other, how would you compare the two? How would you make sure that they are the same even when we told you? What if this distribution had 3,072 possible dimensions? In this example, we have just two! We are building up to how we’d compare two heaps-of-sand-looking distributions as in (b), but remember that as our distributions get more complicated, properly matching like for like also gets harder.

Imagine having to move all the ground that represents probability mass from the fake distribution so that the distribution looks exactly like the real distribution, or at least what we have seen of it. That would be like your neighbor having a super cool sandcastle, and you having a lot of sand and trying to make the exact same sandcastle. How much work would that take, to move all of that mass into just the right places? Hey, it’s okay, we’ve all been there; sometimes you just wish your sandcastle was a bit cooler and more sparkly.

Using an approximate version of the Wasserstein distance, we can evaluate how close we are to generating samples that look like they came from the real distribution. Why approximate? Well, for one because we never see the real data distribution, so it’s difficult to evaluate the exact earth mover’s distance.
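In one dimension the sand-moving picture can be computed directly: for two samples of equal size, the earth mover’s distance is the average gap between the sorted values. The sketch below uses that one-dimensional special case purely for intuition. It is not how a WGAN estimates the distance for images, and the function name `emd_1d` is an assumption.

```python
import numpy as np

def emd_1d(a, b):
    """Earth mover's distance between two equal-sized 1-D samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equally sized samples")
    # Match the i-th smallest grain of one pile to the i-th smallest of the other.
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=5000)
shifted = real + 2.0          # the same pile of sand, moved two units over

d_same = emd_1d(real, real)
d_shift = emd_1d(real, shifted)
```

Moving every grain two units costs two units of work per grain on average, so the distance between the original and the shifted sample is 2, and it shrinks smoothly as the piles get closer.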

In the end, all you need to know is that the earth mover’s distance has nicer properties than either the JS or KL, and there are already important contributions building on the WGAN as well as validating its generally superior performance.[17] Although in some cases the WGAN does not completely outperform all the others, it is generally at least as good in every case (though it should be noted that some may disagree with this interpretation).[18]


See “Improved Training of Wasserstein GANs,” by Ishaan Gulrajani et al., 2017.


See Lucic et al., 2017.

Overall, the WGAN (or the gradient penalty version, WGAN-GP) is widely used and has become the de facto standard in much of GAN research and practice—though the NS-GAN should not be forgotten anytime soon. When you see a new paper that does not have the WGAN as one of the benchmarks being compared and does not have a good justification for not including it—be careful!

5.3. Summary of game setups

We have presented the three core versions of the GAN setup: min-max, non-saturating, and Wasserstein. One of these versions will be mentioned at the beginning of every paper, and now you’ll have at least an idea of whether the paper is using the original formulation, which is more explainable but doesn’t work as well in practice; or the non-saturating version, which loses a lot of the mathematical guarantees but works much better; or the newer Wasserstein version, which has both theoretical grounding and largely superior performance.

As a handy guide, table 5.1 presents a list of the NS-GAN, WGAN, and even the improved WGAN-GP formulations we use in this book. This is here so that you have the relevant versions in one place—sorry, MM-GAN. We have included the WGAN-GP here for completeness, because these three are the academic and industry go-tos.

Table 5.1. Summary of loss functions[a]

Name: NS-GAN
  • Value function: L_D^NS = E[log(D(x))] + E[log(1 – D(G(z)))]; L_G^NS = E[log(D(G(z)))]
  • Notes: This is one of the original formulations. Typically not used in practice anymore, except as a foundational block or comparison. This is an equivalent formulation to the NS-GAN you have seen, just without the constants. But these are effectively equivalent.[b]

Name: WGAN
  • Value function: L_D^WGAN = E[D(x)] – E[D(G(z))]; L_G^WGAN = E[D(G(z))]
  • Notes: This is the WGAN with somewhat simplified loss. This seems to be creating a new paradigm for GANs. We explained this equation previously as equation 5.5 in greater detail.

Name: WGAN-GP[c] (gradient penalties)
  • Value function: L_D^W-GP = E[D(x)] – E[D(G(z))] + GPterm; L_G^W-GP = E[D(G(z))]
  • Notes: This is an example of a GAN with a gradient penalty (GP). WGAN-GP typically shows the best results. We have not discussed the WGAN-GP in this chapter in great detail; we include it here for completeness.

[a] Source: “Collection of Generative Models in TensorFlow,” by Hwalsuk Lee.

[b] We tend to use the constants in written code, and this cleaner mathematical formulation in papers.

[c] This is a version of the WGAN with gradient penalty that is commonly used in new academic papers. See Gulrajani et al., 2017.

5.4. Training hacks

We are now departing from the well-grounded academic results into the areas that academics or practitioners just “figured out.” These are simply hacks, and often you just have to try them to see if they work for you. The list in this section was inspired by Soumith Chintala’s 2016 post, “How to Train a GAN: Tips and Tricks to Make GANs Work,” but some things have changed since then.

An example of what has changed is some of the architectural advice, such as the Deep Convolutional GAN (DCGAN) being a baseline for everything. Currently, most people start with the WGAN; in the future, the Self-Attention GAN (SAGAN is touched on in chapter 12) may be a focus. In addition, some things are still true, and we regard them as universally accepted, such as using the Adam optimizer instead of vanilla stochastic gradient descent.[19] We encourage you to check out the list, as its creation was a formative moment in GAN history.


Why is Adam better than vanilla stochastic gradient descent (SGD)? Because Adam is an extension of SGD that tends to work better in practice. Adam groups several training hacks along with SGD into one easy-to-use package.

5.4.1. Normalizations of inputs

Normalizing the images to be between –1 and 1 is still typically a good idea according to almost every machine learning resource, including Chintala’s list. We generally normalize because of the easier tractability of computations, as is the case with the rest of machine learning. Given this restriction on the inputs, it is a good idea to restrict your Generator’s final output with, for example, a tanh activation function.
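A minimal sketch of this normalization, assuming 8-bit images stored as NumPy arrays: dividing by 127.5 and subtracting 1 maps pixel values from [0, 255] into [–1, 1], which matches the range of a tanh output layer. The helper names `to_tanh_range` and `from_tanh_range` are illustrative.

```python
import numpy as np

def to_tanh_range(images_uint8):
    """Map uint8 pixels in [0, 255] to floats in [-1, 1]."""
    return images_uint8.astype(np.float32) / 127.5 - 1.0

def from_tanh_range(images):
    """Map Generator outputs in [-1, 1] back to displayable uint8 pixels."""
    scaled = (images + 1.0) * 127.5
    return np.clip(scaled, 0, 255).round().astype(np.uint8)

pixels = np.array([[0, 127, 255]], dtype=np.uint8)
scaled = to_tanh_range(pixels)       # what the Discriminator would see
restored = from_tanh_range(scaled)   # what we would plot
```

Keeping the inverse mapping next to the forward one helps avoid a common slip: plotting raw tanh outputs as if they were already pixel intensities.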

5.4.2. Batch normalization

Batch normalization was discussed in detail in chapter 4. We include it here for completeness. As a note on how our perceptions of batch normalization have changed: originally batch norm was generally regarded as an extremely successful technique, but recently it has been shown to sometimes deliver bad results, especially in the Generator.[20] In the Discriminator, on the other hand, results have been almost universally positive.[21]


See Gulrajani et al., 2017.


See “Tutorial on Generative Adversarial Networks—GANs in the Wild,” by Soumith Chintala, 2017.

5.4.3. Gradient penalties

This training trick builds on point 10 in Chintala’s list, which had the intuition that if the norms of the gradients are too high, something is wrong. Even today, networks such as BigGAN are innovating in this space, as we touch on in chapter 12.[22]


See “Large-Scale GAN Training for High-Fidelity Natural Image Synthesis,” by Andrew Brock et al., 2019.

However, technical issues still remain: naive weight clipping can produce the vanishing or exploding gradients known from much of the rest of deep learning.[23] We can restrict the gradient norm of the Discriminator output with respect to its input. In other words, if you change your input a little bit, the Discriminator’s output should not change too much. Deep learning is full of magic like this. This is especially important in the WGAN setting, but can be applied elsewhere.[24] Generally, this trick has in some form been used by numerous papers.[25]


See Gulrajani et al., 2017.


Though here the authors call the Discriminator critic, borrowing from reinforcement learning, as much of that paper is inspired by it.


See “Least Squares Generative Adversarial Networks,” by Xudong Mao et al., 2016. Also see “BEGAN: Boundary Equilibrium Generative Adversarial Networks,” by David Berthelot et al., 2017.

Here, we can simply use the native implementation of your favorite deep learning framework to penalize the gradient and not focus on the implementation detail beyond what we described. Smarter methods have recently been published by top researchers (including one good fellow) and presented at ICML 2018, but their widespread academic acceptance has not been proven yet.[26] A lot of work is being done to make GANs more stable—such as Jacobian clamping, which is also yet to be reproduced in any meta-study—so we will need to wait and see which methods will make it.
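For intuition only, here is a sketch of a penalty in the style of WGAN-GP (Gulrajani et al., 2017): it samples points between real and generated examples and penalizes the critic when the norm of its input gradient differs from 1. In practice a framework’s automatic differentiation would supply that gradient. Here the critic is a hypothetical linear function so its gradient is known in closed form, and the names `gradient_penalty` and `linear_critic_grad` are assumptions.

```python
import numpy as np

def gradient_penalty(grad_fn, real, fake, lam=10.0, rng=None):
    """lam * mean((||grad_x D(x_hat)||_2 - 1)^2) on points between real and fake."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.uniform(size=(real.shape[0], 1))   # one mixing weight per sample
    x_hat = eps * real + (1.0 - eps) * fake      # interpolated inputs
    grads = grad_fn(x_hat)                       # shape: (batch, features)
    norms = np.linalg.norm(grads, axis=1)
    return float(lam * np.mean((norms - 1.0) ** 2))

def linear_critic_grad(w):
    """For D(x) = x @ w, the gradient with respect to x is w for every input."""
    return lambda x: np.tile(w, (x.shape[0], 1))

rng = np.random.default_rng(0)
real = rng.normal(size=(8, 2))
fake = rng.normal(size=(8, 2))

steep = gradient_penalty(linear_critic_grad(np.array([3.0, 0.0])), real, fake, rng=rng)
unit = gradient_penalty(linear_critic_grad(np.array([0.6, 0.8])), real, fake, rng=rng)
```

A critic whose gradient norm is 3 is penalized heavily, while one with gradient norm 1 incurs essentially no penalty, so adding this term to the critic’s loss nudges it toward the Lipschitz constraint without hard clipping.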


See Odena et al., 2018.

5.4.4. Train the Discriminator more

Training the Discriminator more is an approach that has recently gained a lot of success. In Chintala’s original list, this is labeled as being uncertain, so use it with caution. There are two broad approaches:

  • Pretraining the Discriminator before the Generator even gets the chance to produce anything.
  • Having more updates for the Discriminator per training cycle. A common ratio is five Discriminator weight updates per one of the Generator’s.
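The second approach can be sketched as a training loop that performs several Discriminator updates for each Generator update. The step functions below are placeholders that only count calls. In real code they would each run one optimizer step, and the names `train` and `n_critic` are illustrative choices.

```python
def train(num_cycles, discriminator_step, generator_step, n_critic=5):
    """Run n_critic Discriminator updates for every single Generator update."""
    for _ in range(num_cycles):
        for _ in range(n_critic):
            discriminator_step()
        generator_step()

# Placeholder steps that just count how often each network is updated.
counts = {"d": 0, "g": 0}

def fake_discriminator_step():
    counts["d"] += 1

def fake_generator_step():
    counts["g"] += 1

train(3, fake_discriminator_step, fake_generator_step, n_critic=5)
```

Treating the ratio as a parameter makes it easy to experiment, because the best value tends to depend on the architecture and the loss being used.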

In the words of deep learning researcher and teacher Jeremy Howard, this works because otherwise it is “the blind leading the blind.” You need to initially and continuously inject information about what the real-world data looks like.

5.4.5. Avoid sparse gradients

It intuitively makes sense that sparse gradients (such as the ones produced by ReLU or MaxPool) would make training harder. This is because of the following:

  • The intuition, especially behind average pooling, can be confusing, but think of it this way: if we go with standard max pooling, we lose all but the maximum value for the entire receptive field of a convolution, and that makes it much harder to use the transposed convolutions—in DCGAN’s case—to recover the information. With average pooling, we at least have a sense of what the average value is. It is still not perfect—we are still losing information—but at least less than before, because the average is more representative than the simple maximum.
  • Another problem is information loss, if we are using, say, regular rectified linear unit (ReLU) activation. A way to look at this problem is to consider how much information is lost when applying this operation, because we might have to recover it later. Recall that ReLU(x) is simply max(0,x), which means that for all the negative values, all this information is lost forever. If instead we ensure that we carry over the information from the negative regions and signify that this information is different, we can preserve all this information.

As we suggested, fortunately, a simple solution exists for both of these: we can use Leaky ReLU—which is something like 0.1 × x for negative x, and 1 × x for x that’s at least 0—and average pooling to get around a lot of these problems. Other activation functions exist (such as sigmoid, ELU, and tanh), but people tend to use Leaky ReLU most commonly.


The slope that Leaky ReLU applies to negative inputs can be any real number, typically between 0 and 1.
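The difference between these choices is easy to see on a tiny example. The sketch below, written in plain NumPy with a hypothetical `pool2x2` helper, applies Leaky ReLU with a slope of 0.1 and compares max pooling with average pooling on a single 2 × 2 patch.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Pass non-negative values through and scale negative ones by alpha."""
    return np.where(x >= 0, x, alpha * x)

def pool2x2(x, reduce_fn):
    """Apply reduce_fn (for example np.max or np.mean) over 2 x 2 blocks."""
    h, w = x.shape
    blocks = x[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))

x = np.array([[-2.0, 3.0],
              [0.0, -1.0]])

activated = leaky_relu(x)         # negatives are scaled, not zeroed
max_pooled = pool2x2(x, np.max)   # keeps only the largest value
avg_pooled = pool2x2(x, np.mean)  # summarizes all four values
```

Max pooling reports only the 3 and discards the rest, whereas average pooling reflects every value in the patch, and Leaky ReLU keeps a scaled trace of the negative inputs instead of erasing them.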

Overall, we are trying to minimize information loss and make the flow of information the most logical it can be, without asking the GAN to backpropagate the error in some strange way, where it also has to learn the mapping.

5.4.6. Soft and noisy labels

Researchers use several approaches to either add noise to labels or smooth them. Ian Goodfellow tends to recommend one-sided label smoothing (for example, using 0 and 0.9 as binary labels), but generally playing around with either adding noise or clipping seems to be a good idea.
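As one possible way to build such labels, the sketch below gives real samples a target of 0.9, leaves generated samples at 0, and can optionally flip a small random fraction of labels as noise. The function name, the flip scheme, and the default values are assumptions for illustration, not a prescribed recipe.

```python
import numpy as np

def discriminator_labels(batch_size, real, smooth=0.9, flip_prob=0.0, rng=None):
    """One-sided smoothed targets, with optional random label flips as noise."""
    rng = np.random.default_rng() if rng is None else rng
    real_target, fake_target = smooth, 0.0       # only the real side is softened
    labels = np.full(batch_size, real_target if real else fake_target)
    flip = rng.random(batch_size) < flip_prob    # which labels to swap
    labels[flip] = fake_target if real else real_target
    return labels

real_labels = discriminator_labels(4, real=True)
fake_labels = discriminator_labels(4, real=False)
noisy_labels = discriminator_labels(
    1000, real=True, flip_prob=0.1, rng=np.random.default_rng(0)
)
num_flipped = int((noisy_labels == 0.0).sum())
```

These arrays would replace the usual all-ones and all-zeros targets passed to the Discriminator’s binary cross-entropy loss.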


Summary

  • You have learned why evaluation is such a difficult topic for generative models and how we can train a GAN well with clear criteria indicating when to stop.
  • Various evaluation techniques move beyond the naive statistical evaluation of distributions and provide us with something more useful that correlates with visual sample quality.
  • Training is performed in three setups: the game-theoretical Min-Max GAN, the heuristically motivated Non-Saturating GAN, and the newest and theoretically well-founded Wasserstein GAN.
  • Training hacks that allow us to train faster include the following:

    • Normalizing inputs, which is standard in machine learning
    • Using gradient penalties that give us more stability in training
    • Helping to warm-start the Discriminator to ultimately give us a good Generator, because doing so sets a higher bar for the generated samples
    • Avoiding sparse gradients, because they lose too much information
    • Playing around with soft and noisy labels rather than the typical binary classification
