Chapter 2. Intro to generative modeling with autoencoders

This chapter covers

  • Encoding data into a latent space (dimensionality reduction) and subsequent dimensionality expansion
  • Understanding the challenges of generative modeling in the context of a variational autoencoder
  • Generating handwritten digits by using Keras and autoencoders
  • Understanding the limitations of autoencoders and motivations for GANs

I dedicate this chapter to my grandmother, Aurelie Langrova, who passed away as we were finishing the work on it. She will be missed dearly.


You might be wondering why we chose to include this chapter in the book. There are three core reasons:

  • Generative models are a new area for most. Most people who come across machine learning encounter classification tasks first and more extensively, perhaps because they tend to be more straightforward. Generative modeling, through which we are trying to produce a new example that looks realistic, is therefore less understood. So we decided to include a chapter that covers generative modeling in an easier setting before delving into GANs, especially given the wealth of resources and research on autoencoders, GANs’ closest precursor. But if you want to dive straight into the new and exciting bits, feel free to skip this chapter.
  • Generative models are very challenging. Because generative modeling has been underrepresented, most people are unaware of what a typical model looks like and its challenges. Although autoencoders are in many ways closer to the models that are most commonly taught (for example, they have an explicit objective function, as we will discuss later), they still present many of the challenges that GANs face, such as how difficult it is to evaluate sample quality. Chapter 5 covers this in more depth.
  • Generative models are an important part of the literature today. Autoencoders themselves have their own uses, as we discuss in this chapter. They are also still an active area of research, even state of the art in some areas, and are used explicitly by many GAN architectures. Other GAN architectures use them as implicit inspiration or a mental model—such as CycleGAN, covered in chapter 9.

2.1. Introduction to generative modeling

You should be familiar with how deep learning takes raw pixels and turns them into, for example, class predictions. For instance, we can take three matrices that contain the pixels of an image (one for each color channel) and pass them through a system of transformations to get a single number at the end. But what if we want to go in the opposite direction?

We start with a prescription of what we want to produce and get the image at the other end of the transformations. That is generative modeling in its simplest, most informal form; we add more depth throughout the book.

A bit more formally, we take a certain prescription (z)—for this simple case, let’s say it is a number between 0 and 9—and try to arrive at a generated sample (x*). Ideally, this x* would look as realistic as another real sample, x. The prescription, z, lives in a latent space and serves as an inspiration so that we do not always get the same output, x*. This latent space is a learned representation—hopefully meaningful to people in ways we think of it (“disentangled”). Different models will learn a different latent representation of the same data.

The random noise vector we saw in chapter 1 is often referred to as a sample from the latent space. Latent space is a simpler, hidden representation of a data point. In our context, it is denoted by z, and simpler just means lower-dimensional: for example, a vector or array of 100 numbers rather than the 784 that is the dimensionality of the samples we will use. In many ways, a good latent representation of a data point will allow you to group things that are similar in this space. We will get to what latent means in the context of an autoencoder in figure 2.3 and show you how this affects our generated samples in figures 2.6 and 2.7, but before we can do that, we’ll describe how autoencoders function.

2.2. How do autoencoders function on a high level?

As their name suggests, autoencoders help us encode data, well, automatically. Autoencoders are composed of two parts: encoder and decoder. For the purposes of this explanation, let’s consider one use case: compression.

Imagine that you are writing a letter to your grandparents about your career as a machine learning engineer. You have only one page to explain everything that you do so that they understand, given their knowledge and beliefs about the world.

Now imagine that your grandparents suffer from acute amnesia and do not remember what you do at all. This already feels a lot harder, doesn’t it? This may be because now you have to explain all the terminology. For example, they can still read and understand basic things in your letter, such as your description of what your cat did, but the notion of a machine learning engineer might be alien to them. In other words, their learned transformations from latent space z into x* have been (almost) randomly initialized. You have to first retrain these mental structures in their heads before you can explain. You have to train their autoencoder by passing in concepts x and seeing whether they manage to reproduce them (x*) back to you in a meaningful way. That way, you can measure their error, called the reconstruction loss (|| x − x* ||).

Implicitly, we compress data—or information—every day so we do not spend ages explaining known concepts. Human communication is full of autoencoders, but they are context-dependent: what we explain to our grandparents, we do not have to explain to our engineering colleagues, such as what a machine learning model is. So some human latent spaces are more appropriate than others, depending on the context. We can just jump to the succinct representation that their autoencoder will already understand.

We can compress, because it is useful to simplify certain recurring concepts into abstractions that we have agreed on—for example, a job title. Autoencoders can systematically and automatically uncover these information-efficient patterns, define them, and use them as shortcuts to increase the information throughput. As a result, we need to transmit only the z, which is typically much lower-dimensional, thereby saving us bandwidth.

From an information theory point of view, you are trying to pass as much information through the “information bottleneck” (your letter or spoken communication) as possible without sacrificing too much of the understanding. You can almost imagine this as a secret shortcut that only you and your family understand but that has been optimized for the topics you frequently discussed.[1] For simplicity and to focus on compression, we chose to ignore the fact that words are an explicit model, although most words also have tremendous context-dependent complexity behind them.


In fact, the Rothschilds, a famous European financier family, did this in their letters, which is why they were so successful in finance.


The latent space is the hidden representation of the data. Rather than expressing words or images (for example, machine learning engineer in our example, or JPEG codec for images) in their uncompressed versions, an autoencoder compresses and clusters them based on its understanding of the data.

2.3. What are autoencoders to GANs?

One of the key distinctions with autoencoders is that we end-to-end train the whole network with one loss function, whereas GANs have distinct loss functions for the Generator and the Discriminator. Let’s now look at the context in which autoencoders sit compared to GANs. As you can see in figure 2.1, both are generative models that are subsets of artificial intelligence (AI) and machine learning (ML). In the case of autoencoders (or their variational alternative, VAEs), we have an explicitly written function that we are trying to optimize (a cost function); but in the case of GANs (as you will learn), we do not have an explicit metric as simple as mean squared error, accuracy, or area under the ROC curve to optimize.[2] GANs instead have two competing objectives that cannot be written in one function.


A cost function (also known as a loss function or objective function) is what we are trying to optimize/minimize for. In statistics, for example, this would be the root mean squared error (RMSE). The RMSE is a mathematical function that gives an error by taking the square root of the mean of the squared differences between the true values of the examples and our predictions.
In statistics, we typically want to evaluate a classifier across several combinations of false positives and negatives. The area under the curve (AUC) helps us do that. For more details, Wikipedia has an excellent explanation, as this concept is beyond the scope of this book.
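To make the RMSE definition concrete, here is a minimal NumPy sketch. The function name rmse and the four true/predicted values are made up purely for illustration; they are not part of the chapter's code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical true values and predictions, chosen only for illustration.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

# The differences are 1, 0, -2, 0, so the result is sqrt((1 + 0 + 4 + 0) / 4).
error = rmse(y_true, y_pred)
print(error)
```

Note that the squaring happens per example and the root is taken only once, after averaging; that ordering is what distinguishes RMSE from a plain mean of absolute differences.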

Figure 2.1. Placing GANs and autoencoders in the AI landscape. Different researchers might draw this differently, but we will leave this argument to academics.

2.4. What is an autoencoder made of?

As we look at the structure of an autoencoder, we’ll use images as an example, but this structure also applies in other cases (for instance, language, as in our example about the letter to your grandparents). Like many advancements in machine learning, the high-level idea of autoencoders is intuitive and follows these simple steps, illustrated in figure 2.2:

  1. Encoder network: We take a representation x (for example, an image) and then reduce the dimension from x to z by using a learned encoder (typically, a one- or many-layer neural network).
    Figure 2.2. Using an autoencoder in our letter example follows these steps: (1) You compress all the things you know about a machine learning engineer, and then (2) compose that to the latent space (letter to your grandmother). When she, using her understanding of words as a decoder (3), reconstructs a (lossy) version of what that means, you get out a representation of an idea in the same space (in your grandmother’s head) as the original input, which was your thoughts.

  2. Latent space (z): As we train, here we try to establish the latent space to have some meaning. Latent space is typically a representation of a smaller dimension and acts as an intermediate step. In this representation of our data, the autoencoder is trying to “organize its thoughts.”
  3. Decoder network: We reconstruct the original object into the original dimension by using the decoder. This is typically done by a neural network that is a mirror image of the encoder. This is the step from z to x*. We apply the reverse process of the encoding to get back, for example, a 784-value reconstructed vector (of a 28 × 28 image) from a 256-value vector in the latent space.

Here’s an example of autoencoder training:

  1. We take images x and feed them through the autoencoder.
  2. We get out x*, the reconstruction of the images.
  3. We measure the reconstruction loss, the difference between x and x*.

    • This is done using a distance (for example, mean absolute error) between the pixels of x and x*.
    • This gives us an explicit objective function (|| x − x* ||) to optimize via a version of gradient descent.

So we are trying to find the parameters of the encoder and the decoder that minimize the reconstruction loss, and we update those parameters by using gradient descent.
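The three training steps above can be sketched in plain NumPy. This is a deliberately tiny illustration under stated assumptions, not the model we build later in the chapter: the data is synthetic, the encoder and decoder are a single weight matrix each with no activation function, the loss is mean squared error, and all names (w_enc, w_dec, learning_rate) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples with 8 values each
# that actually lie on a 2-D subspace, so a 2-D latent space can describe them.
codes = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
x = codes @ mixing

# Encoder and decoder: one weight matrix each, no activation functions.
w_enc = rng.normal(scale=0.1, size=(8, 2))
w_dec = rng.normal(scale=0.1, size=(2, 8))
learning_rate = 0.01

def reconstruction_loss(x, x_star):
    """Mean squared difference between the inputs and their reconstructions."""
    return float(np.mean((x - x_star) ** 2))

losses = []
for step in range(500):
    z = x @ w_enc                                   # step 1: feed x through the encoder
    x_star = z @ w_dec                              # step 2: decode to get x*
    losses.append(reconstruction_loss(x, x_star))   # step 3: measure the loss

    # Gradients of the mean squared error with respect to both weight matrices.
    grad_out = 2.0 * (x_star - x) / x.size
    grad_dec = z.T @ grad_out
    grad_enc = x.T @ (grad_out @ w_dec.T)

    # Plain gradient descent update.
    w_dec -= learning_rate * grad_dec
    w_enc -= learning_rate * grad_enc

print(losses[0], losses[-1])
```

Frameworks such as Keras compute these gradients for us, which is why the listings later in the chapter never write them out by hand.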

And that’s it! We’re done. Now you may be wondering why this is useful or important. You’d be surprised!

2.5. Usage of autoencoders

Despite their simplicity, there are many reasons to care about autoencoders:

  • First of all, we get compression for free! This is because the intermediate step (2) from figure 2.2 becomes an intelligently reduced image or object at the dimensionality of the latent space. Note that in theory, this can be orders of magnitude less than the original input. It obviously is not lossless, but we are free to use this side effect, if we wish.
  • Still using the latent space, we can think of many practical applications, such as a one-class classifier (an anomaly-detection algorithm), where we can see the items in a reduced, more quickly searchable latent space to check for similarity with the target class. This can work in search (information retrieval) or anomaly-detection settings (comparing closeness in the latent space).
  • Another use case is data denoising or colorization of black-and-white images.[3] For example, if we have an old photo or video or a very noisy one—say, World War II images—we can make them less noisy and add color back in. Hence the similarity to GANs, which also tend to excel at these types of applications.


    For more information on coloring black-and-white images, see Emil Wallner’s “Coloring Greyscale Images,” on GitHub.

  • Some GAN architectures, such as BEGAN[4], use autoencoders as part of their architecture to help them stabilize their training, which is critically important, as you will discover later.


    BEGAN is an acronym for Boundary Equilibrium Generative Adversarial Networks. This interesting GAN architecture was one of the first to use an autoencoder as part of the setup.

  • Training of these autoencoders does not require labeled data. We will get to this and why unsupervised learning is so important in the next section. This makes our lives a lot easier, because it is only self-training and does not require us to look for labels.
  • Last, but definitely not least, we can use autoencoders to generate new images. Autoencoders have been applied to anything from digits to faces to bedrooms, but usually the higher the resolution of the image, the worse the performance, as the output tends to look blurry. But for the MNIST dataset—as you will discover later—and other low-resolution images, autoencoders work great; you’ll see what the code looks like in just a moment!

The Modified National Institute of Standards and Technology (MNIST) database is a dataset of handwritten digits. Wikipedia has a great overview of this extremely popular dataset used in computer vision literature.

So all of these things can be done just because we found a new representation of the data we already had. This representation is useful because it brings out the core information, which is natively compressed, but it’s also easier to manipulate or generate new data based on the latent representation!

2.6. Unsupervised learning

In the previous chapter, we already talked about unsupervised learning without using the term. In this section, we’ll take a closer look.


Unsupervised learning is a type of machine learning in which we learn from the data itself without additional labels as to what this data means. Clustering, for example, is unsupervised—because we are just trying to discover the underlying structure of the data; but anomaly detection is usually supervised, as we need human-labeled anomalies.

In this chapter, you will learn why unsupervised machine learning is different: we can use any data without having to label it for a specific purpose. We can throw in all images from the internet without having to annotate the data about the purpose of each sample, for each representation that we might care about. For example: Is there a dog in this picture? A car?

In supervised learning, on the other hand, if you don’t have labels for that exact task, (almost) all of your labels could be unusable. If you’re trying to make a classifier that would classify cars from Google Street View, but you do not have labels of those images for animals as well, training a classifier that would classify animals with the same dataset would be basically impossible. Even if the animals frequently feature in these samples, you would need to go back and ask your labelers to relabel the same Google Street View dataset for animals.

In essence, we need to think about the application of the data before we know the use case, which is difficult! But for a lot of compression-type tasks, you always have labeled data: your data. Some researchers, such as François Chollet (research scientist at Google and author of Keras), call this type of machine learning self-supervised. For much of this book, our only labels will be either the examples themselves or any other examples from the dataset.

Since our training data also acts as our labels, training many of these algorithms becomes far easier from one crucial perspective: we now have lots more data to work with, and we do not need to wait weeks and pay millions for enough labeled data.

2.6.1. New take on an old idea

Autoencoders themselves are a fairly old idea—at least when you look at the age of machine learning as a field. But seeing as everyone is working on something deep today, it should surprise exactly no one that people have successfully applied deep learning as part of both encoder and decoder.

An autoencoder is composed of two neural networks: an encoder and a decoder. In our case, both have activation functions,[5] and we will be using just one intermediate layer for each. This means we have two weight matrices in each network—one from input to intermediate and then one from intermediate to latent. Then again, we have one from latent to different intermediate and then one from intermediate to output. If we had just one weight matrix in each, our procedure would resemble a well-established analytical technique called principal component analysis (PCA). If you have a background in linear algebra, you should be in broadly familiar territory.


We feed any output from an earlier layer’s computation through an activation function before passing it to the next one. Frequently, people pick a rectified linear unit (ReLU)—which is defined as max(0, x). We don’t go into too much depth on activation functions, because they alone could be a subject of a lengthy blog post.


Some technical differences exist in how the solutions are learned—for example, PCA is numerically deterministic, whereas autoencoders are typically trained with a stochastic optimizer. There are also differences in the final form of the solution. But we’re not going to give you a lecture about how one of them gives you an orthonormal basis and how fundamentally they still span the same vector space—though if you happen to know what that means, then more power to you.
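For readers who want to see the PCA side of this comparison, here is a small NumPy sketch that treats PCA as an encoder/decoder pair. The data is random and the variable names are our own; the only claims are the ones checked in the code: the principal directions are orthonormal, and projecting down and back up yields a reconstruction in the original dimension.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy correlated data (an assumption for illustration): 100 samples, 5 features.
x = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
x_centered = x - x.mean(axis=0)

# PCA via singular value decomposition: the rows of vt are the principal directions.
_, _, vt = np.linalg.svd(x_centered, full_matrices=False)
components = vt[:2]                      # keep two directions as a 2-D "latent space"

z = x_centered @ components.T            # "encoder": project down to 2 dimensions
x_star = z @ components                  # "decoder": project back up to 5 dimensions

basis_check = components @ components.T  # close to the 2 x 2 identity matrix
pca_loss = float(np.mean((x_centered - x_star) ** 2))
print(z.shape, x_star.shape, pca_loss)
```

Running this twice on the same data gives the same components up to numerical precision, whereas a randomly initialized autoencoder trained with a stochastic optimizer generally lands on a different set of weights each time.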

2.6.2. Generation using an autoencoder

At the beginning of this chapter, we said that autoencoders can be used to generate data. Some of you who are really keen may have been thinking about the use of the latent space and whether it can be repurposed for something else . . . and it totally can! (If you got this right, you can give yourself an official, approved self-five!)

But you probably didn’t buy this book to look silly, so let’s get to the point. If we go back to the example with your grandparents and apply a slightly different lens, using autoencoders as a generative model might start to make sense. For example, imagine that your idea of what a job is becomes the input to the decoder network. Think of the word job written down on the piece of paper as the latent space input, and the idea of a job in your grandparents’ head as the output.

In this case, we see that the latent space encoding (a written word, combined with your grandparents’ ability to read and understand concepts) becomes a generative model that generates an idea in their heads. The written letter acts as an inspiration or some sort of latent vector, and the output—the ideas—are in the same high-dimensional space as the original input. Your grandparents’ ideas are as complex—albeit slightly different—as yours.

Now let’s switch back to the domain of images. We train our autoencoder on a set of images. So we tune the parameters of the encoder and the decoder to find appropriate parameters for the two networks. We also get a sense for the way the examples are represented in the latent space. For generation, we cut off the encoder part and use only the latent space and the decoder. Figure 2.3 shows a schematic of the generation process.

Figure 2.3. Because we know from training where our examples get placed in the latent space, we can easily generate examples similar to the ones that the model has seen. Even if not, we can easily iterate or grid-search through the latent space to determine the kinds of representations that our model can generate.

(Image adapted from Mat Leonard’s simple autoencoder project on GitHub.)

2.6.3. Variational autoencoder

You may be wondering: what is the difference between a variational autoencoder and a “regular” one? It all has to do with the magical latent space. In the case of a variational autoencoder, we choose to represent the latent space as a distribution with a learned mean and standard deviation rather than just a set of numbers. Typically, we choose a multivariate Gaussian, but exactly what that is or why we choose this distribution over another is not that important right now. If you would like a refresher on what that might look like, take a look at figure 2.5.

As the more statistically inclined of you may have realized at this point, the variational autoencoder is a technique based on Bayesian machine learning. In practice, this means we have to learn the distribution, which adds further constraints. In other words, frequentist autoencoders would try to learn the latent space as an array of numbers, but Bayesian—for example, variational—autoencoders would try to find the right parameters defining a distribution.

We then sample from the latent distribution and get some numbers. We feed these numbers through the decoder. We get back an example that looks like something from the original dataset, except it has been newly created by the model. Ta-da!

2.7. Code is life

In this book, we use a popular, deep learning, high-level API called Keras. We highly suggest that you familiarize yourself with it. If you are not already comfortable with it, plenty of good free resources are available online, including outlets such as Towards Data Science, where we frequently contribute. If you want to learn more about Keras from a book, several good resources exist, including another great Manning book, Deep Learning with Python by François Chollet, the author and creator of Keras.

Keras is a high-level API for several deep learning frameworks—TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano. It is easy to use and allows you to work on a much higher level of abstraction, so you can focus on the concepts rather than recording every standard block of multiplication, biasing, activation, and then pooling[6] or having to worry about variable scopes too much.


A pooling block is an operation on a layer that allows us to pool several inputs into fewer—for example, having a matrix of four numbers and getting the maximum value as a single number. This is a common operation in computer vision to reduce complexity.
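As a quick illustration of the pooling footnote, the following NumPy sketch implements max pooling with a 2 × 2 window. The helper name max_pool_2x2 is hypothetical, and it assumes a 2-D array with even height and width; Keras provides its own pooling layers, which we are not reproducing here.

```python
import numpy as np

def max_pool_2x2(image):
    """Max pooling with a 2x2 window and stride 2 on a 2-D array of even height and width."""
    h, w = image.shape
    # Split the image into non-overlapping 2x2 blocks, then take the maximum of each block.
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# The example from the footnote: a matrix of four numbers pooled into a single number.
four_numbers = np.array([[1, 3],
                         [2, 9]])
pooled_single = max_pool_2x2(four_numbers)

# A 4x4 input is reduced to 2x2, one maximum per block.
image = np.arange(16).reshape(4, 4)
pooled_image = max_pool_2x2(image)
print(pooled_single)
print(pooled_image)
```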

To show the true power of Keras and how it simplifies the process of writing a neural network, we will look at the variational autoencoder example in its simplest form.[7] In this tutorial, we use the functional API that Keras has for a more function-oriented approach to writing deep learning code, but we will show you the sequential API (the other way) in later tutorials as things get more difficult.


This example was highly modified by the authors for simplicity, from

The goal of this exercise is to generate handwritten digits based on the latent space. We are going to create an object, generator or decoder, that can use the predict() method to generate new examples of handwritten digits, given an input seed, which is just the latent space vector. And of course, we have to use MNIST because we wouldn’t want anyone getting any ideas that there could be other datasets out there; see figure 2.4.

Figure 2.4. How computer vision researchers think. Enough said.

(Source: Artificial Intelligence Memes for Artificial Intelligence Teens on Facebook.)

In our code, we first have to import all dependencies, as shown in the following listing. For reference, this code was checked with Keras as late as 2.2.4 and TensorFlow as late as 1.12.0.

Listing 2.1. Standard imports
from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras import backend as K
from keras import objectives
from keras.datasets import mnist
import numpy as np

The next step is to set global variables and hyperparameters, as shown in listing 2.2. They should all be familiar: the original dimensions are 28 × 28, which is the standard size. We then flatten the images from the MNIST dataset, to get a vector of 784 (28 × 28) dimensions. And we will also have a single intermediate layer of, say, 256 nodes. But do experiment with other sizes; that’s why it’s a hyperparameter!

Listing 2.2. Setting hyperparameters
batch_size = 100
original_dim = 28*28          1
latent_dim = 2
intermediate_dim = 256
nb_epoch = 5                  2
epsilon_std = 1.0

  • 1 Height × width of MNIST image
  • 2 Number of epochs

In listing 2.3, we start constructing the encoder. To achieve this, we use the functional API from Keras.


The functional API uses lambda functions in Python to return constructors for another function, which takes another input, producing the final result.

The short version is that we will simply declare each layer, mentioning the previous input as a second group of arguments after the regular arguments. For example, the layer h takes x as an input. At the end, when we compile the model and indicate where it starts (x) and where it ends ([z_mean, z_log_var and z]), Keras will understand how the starting input and the final list output are linked together. Remember from the diagrams that z is our latent space, which in this case is a normal distribution defined by mean and variance. Let’s now define the encoder.[8]


This idea is inspired by Branko Blagojevic in our book forums. Thank you for this suggestion.

Listing 2.3. Creating the encoder
x = Input(shape=(original_dim,), name="input")                          1
h = Dense(intermediate_dim, activation='relu', name="encoding")(x)      2
z_mean = Dense(latent_dim, name="mean")(h)                              3
z_log_var = Dense(latent_dim, name="log-variance")(h)                   4
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])   5
encoder = Model(x, [z_mean, z_log_var, z], name="encoder")              6

  • 1 Input to our encoder
  • 2 Intermediate layer
  • 3 Defines the mean of the latent space
  • 4 Defines the log variance of the latent space
  • 5 Note that output_shape isn’t necessary with the TensorFlow backend.
  • 6 Defines the encoder as a Keras model

Now comes the tricky part, where we sample from the latent space and then feed this information through to the decoder. But think for a bit about how z_mean and z_log_var are connected: each is a dense layer of two nodes (latent_dim) attached to h, and together they supply the defining characteristics of a normal distribution: mean and variance. The sampling function used in listing 2.3 is implemented as shown in the following listing; when you run the code, it has to be defined before the encoder is built.

Listing 2.4. Creating the sampling helper function
def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0.)
    return z_mean + K.exp(z_log_var / 2) * epsilon

In other words, we learn the mean (μ) and the variance (σ²). This overall implementation, where we have one z connected through a sampling function as well as z_mean and z_log_var, allows us to both train and subsequently sample efficiently to get some neat-looking figures at the end. During generation, we sample from this distribution according to these learned parameters, and then we feed these values through the decoder to get the output, as you will see in the figures later. For those of you who are a bit rusty on distributions (or probability density functions in this case), we have included several examples of unimodal two-dimensional Gaussians in figure 2.5.
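To build some intuition for what the sampling function does, here is a NumPy analogue of listing 2.4. It is only a sketch: sampling_np is a hypothetical name, and the mean and log-variance values stand in for what a trained encoder might output. The point is that mean + exp(log_var / 2) * epsilon produces samples with the requested mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)

def sampling_np(z_mean, z_log_var, rng):
    """NumPy analogue of listing 2.4: z = mean + std * epsilon, with std = exp(log_var / 2)."""
    epsilon = rng.standard_normal(size=z_mean.shape)
    return z_mean + np.exp(z_log_var / 2) * epsilon

# Hypothetical encoder outputs, repeated many times so we can check the statistics.
n = 100_000
z_mean = np.tile([1.0, -2.0], (n, 1))
z_log_var = np.tile(np.log([0.25, 4.0]), (n, 1))   # variances 0.25 and 4, so std 0.5 and 2

z = sampling_np(z_mean, z_log_var, rng)
print(z.mean(axis=0))   # close to [1, -2]
print(z.std(axis=0))    # close to [0.5, 2]
```

Because the randomness lives entirely in epsilon, the result is an ordinary differentiable function of z_mean and z_log_var, which is what lets gradient descent update the layers that produce them.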

Figure 2.5. As a reminder of what a multivariate (2D) distribution looks like, we’ve plotted probability density functions of bivariate (2D) Gaussians. They are uncorrelated 2D normal distributions, except with different variances. (a) has a variance of 0.5, (b) of 1, and (c) of 2. (d), (e), and (f) are the exact same distributions as (a), (b), and (c), respectively, but plotted with a set z-axis limit at 0.7. Intuitively, this is just a function that for each point says how likely it is to occur. So (a) and (d) are much more concentrated, whereas (c) and (f) are making it possible for values far away from the origin (0,0) to occur, but each given value is not as likely.
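The densities in figure 2.5 can be evaluated directly. The sketch below assumes, as the caption suggests, an uncorrelated Gaussian centered at the origin with the same variance along both axes; the function name gaussian_2d_pdf is ours.

```python
import numpy as np

def gaussian_2d_pdf(x, y, variance):
    """Density of an uncorrelated 2-D Gaussian centered at the origin, equal variance on both axes."""
    return np.exp(-(x ** 2 + y ** 2) / (2.0 * variance)) / (2.0 * np.pi * variance)

for variance in (0.5, 1.0, 2.0):
    peak = gaussian_2d_pdf(0.0, 0.0, variance)
    far = gaussian_2d_pdf(3.0, 0.0, variance)
    print(f"variance={variance}: density at origin={peak:.4f}, density at (3, 0)={far:.6f}")
```

The smallest variance gives the tallest peak at the origin and almost no density at (3, 0), while the largest variance gives a lower peak but noticeably more density far from the origin, which is the trade-off the caption describes.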

Now that you understand what defines our latent space and what these distributions look like, we’ll write the decoder. In this case, we write the layers as variables first so we can reuse them later for the generation.

Listing 2.5. Writing the decoder
input_decoder = Input(shape=(latent_dim,), name="decoder_input")   1
decoder_h = Dense(intermediate_dim, activation='relu',             2
                  name="decoder_h")(input_decoder)
x_decoded = Dense(original_dim, activation='sigmoid',
                  name="flat_decoded")(decoder_h)                  3
decoder = Model(input_decoder, x_decoded, name="decoder")          4

  • 1 Input to the decoder
  • 2 Takes the latent space to the intermediate dimension
  • 3 Maps back to the original dimension, producing the decoded mean (one value per pixel)
  • 4 Defines the decoder as a Keras model

We can now combine the encoder and the decoder into a single VAE model.

Listing 2.6. Combining the model
output_combined = decoder(encoder(x)[2])      1
vae = Model(x, output_combined)               2
vae.summary()                                 3

  • 1 Grabs the output. Recall that we need to grab the third element, our sampling z.
  • 2 Links the input and the overall output
  • 3 Prints out what the overall model looks like

Next, we get to the more familiar parts of machine learning: defining a loss function so our autoencoder can train.

Listing 2.7. Defining our loss function
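def vae_loss(x, x_decoded_mean, z_log_var=z_log_var, z_mean=z_mean,
             original_dim=original_dim):
    # Keras calls the loss with only (y_true, y_pred), so the latent tensors
    # from listing 2.3 are bound here as default arguments.
    # Reconstruction term: per-pixel binary cross-entropy, scaled by the pixel count
    xent_loss = original_dim * objectives.binary_crossentropy(
        x, x_decoded_mean)
    # KL term: pulls the learned latent distribution toward a standard normal
    kl_loss = - 0.5 * K.sum(
        1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    return xent_loss + kl_loss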

vae.compile(optimizer='rmsprop', loss=vae_loss)     1

  • 1 Finally compiles our model

Here you can see how binary cross-entropy and KL divergence add together to form the overall loss. KL divergence measures the difference between distributions; imagine the two blobs from figure 2.5 and then measuring the volume of overlap. Binary cross-entropy is one of the common loss functions for two-class classification: here we simply compare each grayscale pixel value of x to the value in x_decoded_mean, which is the reconstruction we were talking about earlier. If you are still confused about this paragraph after the following definition, chapter 5 provides more details on measuring differences between distributions.


For those interested in more detail and who are familiar with information theory, the Kullback–Leibler divergence (KL divergence), aka relative entropy, is the difference between cross-entropy of two distributions and their own entropy. For everyone else, imagine drawing out the two distributions, and wherever they do not overlap will be an area proportional to the KL divergence.
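The two terms of the loss can be tried out on small arrays. The following NumPy sketch mirrors the formulas in listing 2.7 but is not the Keras implementation: the function names, the clipping constant eps, and the four-pixel example image are assumptions made for illustration.

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    """KL divergence from N(mean, exp(log_var)) to a standard normal, summed over latent dimensions."""
    return -0.5 * np.sum(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)

def binary_crossentropy(x, x_decoded, eps=1e-7):
    """Mean per-pixel binary cross-entropy between targets in [0, 1] and predictions in (0, 1)."""
    x_decoded = np.clip(x_decoded, eps, 1 - eps)   # avoid log(0)
    return -np.mean(x * np.log(x_decoded) + (1 - x) * np.log(1 - x_decoded), axis=-1)

# Identical distributions give zero divergence; shifting the mean by 1 in one dimension gives 0.5.
kl_same = kl_to_standard_normal(np.zeros(2), np.zeros(2))
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))

# A hypothetical 4-pixel image with a good and a bad reconstruction.
x = np.array([0.0, 1.0, 1.0, 0.0])
good = binary_crossentropy(x, np.array([0.1, 0.9, 0.9, 0.1]))
bad = binary_crossentropy(x, np.array([0.9, 0.1, 0.1, 0.9]))
print(kl_same, kl_shifted, good, bad)
```

The KL term grows as the learned mean and variance drift away from 0 and 1, and the cross-entropy term grows as the reconstructed pixels drift away from the input pixels, so minimizing their sum balances a tidy latent space against faithful reconstructions.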

Then we define the model to start at x and end at x_decoded_mean. The model is compiled using RMSprop, but we could use Adam or vanilla stochastic gradient descent (SGD). As with any deep learning system, we are using backpropagated errors to navigate the parameter space. We are always using some type of gradient descent, but in general, people rarely try any other than the three mentioned here: Adam, SGD, or RMSprop.


Stochastic gradient descent (SGD) is an optimization technique that allows us to train complex models by figuring out the contribution of any given weight to an error and updating this weight (no update if the prediction is 100% correct). We recommend brushing up on this in, for example, Deep Learning with Python.

We train the model by using the standard procedure of train-test split and input normalization.

Listing 2.8. Creating the train/test split
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

We normalize the data and reshape the train set and test set to be one 784-element array per example instead of a 28 × 28 matrix.

Then we apply the fit function, using shuffling to get a realistic (nonordered) dataset. We also use validation data to monitor progress as we train:

vae.fit(x_train, x_train,
        shuffle=True,
        validation_data=(x_test, x_test),verbose=1)

We’re done!

The full version of the code, available in the accompanying Jupyter/Google Colaboratory notebook, also provides a fun visualization of the latent space. Now we get to kick back, relax, and watch those pretty progress bars. After we are done, we can even take a look at what the values of the latent space look like on a 2D plane, as shown in figure 2.6.

Figure 2.6. 2D projection of all the points from the test set into the latent space and their class. In this figure, we plot the 2D latent space. We then map out the classes of these examples and color them accordingly, as per the legend on the right. Here we can see that the classes tend to be neatly grouped together, which tells us that this is a good representation. A color version is available in the GitHub repository for this book.

We can also compute the values at fixed increments of a latent space grid to take a look at the generated output. For example, going from 0.05 to 0.95 in 0.15 linear increments across both dimensions gives us the visualization in figure 2.7. Remember that we’re using a bivariate Gaussian in this case, giving us two axes to iterate over. Again, for the code for this visualization, look at the full Jupyter/Google Colab notebook.

Figure 2.7. We map out the values of a subset of the latent space on a grid and pass each of those latent space values through the generator to produce this figure. This gives us a sense of how much the resulting picture changes as we vary z.
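The grid behind a figure like this can be sketched in a few lines. The version below is not the notebook's code: the decode function is a hypothetical, untrained stand-in (a random linear map followed by a sigmoid) so that the example runs on its own. With the trained model, you would pass z_grid through the generator instead.

```python
import numpy as np

# Seven values per axis: 0.05, 0.20, ..., 0.95 (increments of 0.15).
grid_values = np.linspace(0.05, 0.95, 7)
# Every (z1, z2) combination on the 2D latent grid: 49 points in total.
z_grid = np.array([[z1, z2] for z2 in grid_values for z1 in grid_values])

# Hypothetical stand-in for the trained generator (assumption, not the real model).
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 28 * 28))

def decode(z):
    # Maps latent points of shape (n, 2) to flat "images" of shape (n, 784).
    return 1.0 / (1.0 + np.exp(-z @ weights))

# Tile the 49 decoded 28 x 28 images into a single 7 x 7 canvas for plotting.
images = decode(z_grid).reshape(7, 7, 28, 28)
canvas = images.transpose(0, 2, 1, 3).reshape(7 * 28, 7 * 28)
```

The resulting canvas can be handed to any image-plotting function; each 28 × 28 tile corresponds to one point of the latent grid.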

2.8. Why did we try a GAN?

It would seem that the book could almost stop at this point. After all, we have successfully generated images of MNIST, and that will be our test case for several examples. So before you call it quits, let us explain our motivation for the chapters to come.

To appreciate the challenges, imagine that we have a simple one-dimensional bimodal distribution—as pictured in figure 2.8. (As before, just think of it as a simple mathematical function that is bounded between 0 and 1 and that represents probability at any given point. The higher the value of the function, the more points we sampled at that exact point before.)

Figure 2.8. Maximum likelihood, point estimates, and true distributions. The gray (theoretical) distribution is bimodal rather than having a single mode. But because we have assumed a single mode, our model is catastrophically wrong. Alternatively, we can get mode collapse, which is worth keeping in mind for chapter 5. This is especially true when we are using flavors of the KL, such as the VAE or early GANs.

Suppose we draw a bunch of samples from this true distribution, but we do not know the underlying model. We are now trying to infer what distribution generated these samples, but for some reason we assume that the true distribution is a simple Gaussian and we just need to estimate the mean and variance. But because we did not specify the model correctly (in this case, we put in wrong assumptions about the modality of these samples), we get into loads of trouble. For example, if we apply a traditional statistical technique called maximum likelihood estimation to estimate this distribution as unimodal (in some ways, that is what a VAE is trying to do), we get the wrong estimate. Because we have misspecified the model,[9] it will estimate a normal distribution around the average of the two distributions, called the point estimate. Maximum likelihood is a technique that does not know and cannot figure out that there are two distinct distributions. So to minimize the error, it creates a “fat-tailed” normal around the point estimate. Here, it can seem trivial, but always remember, we are trying to specify models in very high-dimensional spaces, which is not easy!


See Pattern Recognition and Machine Learning, by Christopher Bishop (Springer, 2011).


Bimodal means having two peaks, or modes. This notion will be useful in chapter 5. In this case, we made the overall distribution to be composed of two normals with means of 0 and 5.
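We can reproduce this misspecification numerically. The sketch below draws samples from two normals with means of 0 and 5 and fits a single Gaussian by maximum likelihood. The standard deviation of 0.5 for each mode, the sample sizes, and the window width of 0.25 are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Bimodal "true" distribution: two normals with means 0 and 5 (std 0.5 assumed).
samples = np.concatenate([
    rng.normal(0.0, 0.5, size=5000),
    rng.normal(5.0, 0.5, size=5000),
])

# Maximum likelihood estimates for a single (misspecified) Gaussian.
mu_hat = samples.mean()      # lands between the two modes
sigma_hat = samples.std()    # the MLE divides by N; much wider than either mode

# Fraction of real samples near the estimated mean versus near a true mode.
near_estimate = np.mean(np.abs(samples - mu_hat) < 0.25)
near_mode = np.mean(np.abs(samples - 0.0) < 0.25)
```

The estimated mean sits around the midpoint of the two modes, where few if any real samples fall, and the estimated standard deviation is several times larger than that of either mode.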

Interestingly, the point estimate will also be wrong and can even live in an area where there is no actual data sampled from the true distribution. When you look at the samples (black crosses), no real samples occur where we have estimated our mean. This is, again, quite troubling. To tie it back to the autoencoder, see how in figure 2.6 we learned a 2D normal distribution in the latent space centered around the origin? But what if we had thrown images of celebrity faces into the training data? We would no longer have an easy center to estimate, because the combined data distribution would have more modes than we assumed. As a result, even around the center of the distribution, the VAE could produce odd hybrids of the two datasets, because the VAE would try to somehow separate the two datasets.

So far, we have discussed only the hypothetical impact of a statistical mistake. To connect this aspect all the way to autoencoder-generated images, we should think about what our Gaussian latent space z allows us to do. The VAE uses the Gaussian as a way to build representations of the data it sees. But because Gaussians have 99.7% of the probability mass within three standard deviations of the mean, the VAE will also opt for the safe middle. Because VAEs are, in a way, trying to come up directly with the underlying model based on Gaussians, but the reality can be pretty complex, VAEs do not scale up as well as GANs, which can pick up “scenarios.”

You can see what happens when your VAE opts for the “safe middle” in figure 2.9. On the CelebA dataset, which features aligned and cropped celebrity faces, the VAE models the consistently present facial features well, such as eyes or mouth, but makes mistakes in the background.

Figure 2.9. In these images of fake celebrity faces generated by a VAE, the edges are quite blurry and blend into the background. This is because the CelebA dataset has centered and aligned images with consistent features around eyes and mouth, but the backgrounds tend to vary. The VAE picks the safe path and makes the background blurry by choosing a “safe” pixel value, which minimizes the loss, but does not provide good images.

(Source: VAE-TensorFlow by Zhenliang He, GitHub,
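To see why a “safe” pixel value ends up blurry, consider a single background pixel that is dark in half of the training images and bright in the other half. The sketch below assumes a per-pixel binary cross-entropy, as in this chapter's loss (we do not know which loss the cited repository uses), and searches for the one constant prediction that minimizes the average loss.

```python
import numpy as np

rng = np.random.default_rng(0)
# One background pixel across 1,000 images: roughly half dark (0), half bright (1).
background = rng.choice([0.0, 1.0], size=1000)

def bce(target, prediction):
    # Per-example binary cross-entropy for a single pixel.
    return -(target * np.log(prediction)
             + (1 - target) * np.log(1 - prediction))

# Try every constant prediction from 0.01 to 0.99 and keep the cheapest one.
candidates = np.linspace(0.01, 0.99, 99)
losses = np.array([bce(background, c).mean() for c in candidates])
safe_value = candidates[np.argmin(losses)]
```

The cheapest constant prediction is the average of the observed values, a mid-gray that matches neither the dark nor the bright backgrounds. Averaged over many such pixels, this is what a blurry background looks like.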

On the other hand, GANs have an implicit and hard-to-analyze understanding of the real data distribution. As you will discover in chapter 5, VAEs live in the directly estimated maximum likelihood model family.

This section hopefully made you comfortable with thinking about the distributions of the target data and how these distributional implications manifest themselves in our training process. We will look into these assumptions much more in chapter 10, where the model's assumptions about how to fill in the distributions become a weakness that adversarial examples can exploit to make our machine learning models fail.


  • At a high level, autoencoders are composed of an encoder, a latent space, and a decoder. An autoencoder is trained by using a common objective function that measures the distance between the reproduced and original data.
  • Autoencoders have many applications and can also be used as a generative model. In practice, this tends not to be their primary use because other methods, especially GANs, are better at the generative task.
  • We can use Keras (a high-level API for TensorFlow) to write a simple variational autoencoder that produces handwritten digits.
  • VAEs have limitations that motivate us to move on to GANs.
