Decoder

The decoder consists of three deconvolution layers arranged in sequence. Each deconvolution reduces the number of feature maps so that the final output is an image of the same size as the original. In addition to reducing the number of features, each deconvolution also transforms the spatial shape of the images:

Data flow of the decoding phase
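
Concretely, the decoding phase transforms the tensor shapes as follows (a sketch; the shapes are taken from the layer definitions later in this section):

# Decoding flow (shapes from the layers defined below):
#   _ce3: (batch,  4,  4, 64)  --conv2d_transpose-->  _cd3: (batch,  7,  7, 32)
#   _cd3: (batch,  7,  7, 32)  --conv2d_transpose-->  _cd2: (batch, 14, 14, 16)
#   _cd2: (batch, 14, 14, 16)  --conv2d_transpose-->  _cd1: (batch, 28, 28,  1)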

We're ready to look at how to implement a convolutional autoencoder. The first step is to load the basic libraries:

import matplotlib.pyplot as plt 
import numpy as np
import math
import tensorflow as tf
import tensorflow.examples.tutorials.mnist.input_data as input_data

Then build the training and test sets:

mnist = input_data.read_data_sets("data/", one_hot=True) 
trainimgs = mnist.train.images
trainlabels = mnist.train.labels
testimgs = mnist.test.images
testlabels = mnist.test.labels
ntrain = trainimgs.shape[0]
ntest = testimgs.shape[0]
dim = trainimgs.shape[1]
nout = trainlabels.shape[1]

Define a placeholder variable for the input images:

x = tf.placeholder(tf.float32, [None, dim]) 

The data type is set to float32, and the shape is set to [None, dim], where None means that the tensor may hold an arbitrary number of images, each of which is a vector of length dim. Next, we define a placeholder variable for the output images. Its shape is set to [None, dim], the same as the input shape:

y = tf.placeholder(tf.float32, [None, dim]) 

Then we define the keepprob variable, which holds the probability of keeping each unit during dropout in neural network training:

keepprob = tf.placeholder(tf.float32) 

Also, we have to define the number of feature maps for each of the network's layers, and the size of the convolution kernel:

n1 = 16 
n2 = 32
n3 = 64
ksize = 5

The network contains a total of six layers. The first three layers are convolutional and belong to the encoding phase, while the last three layers are deconvolutional and define the decoding phase:

weights = {
    'ce1': tf.Variable(tf.random_normal([ksize, ksize, 1, n1], stddev=0.1)),
    'ce2': tf.Variable(tf.random_normal([ksize, ksize, n1, n2], stddev=0.1)),
    'ce3': tf.Variable(tf.random_normal([ksize, ksize, n2, n3], stddev=0.1)),
    'cd3': tf.Variable(tf.random_normal([ksize, ksize, n2, n3], stddev=0.1)),
    'cd2': tf.Variable(tf.random_normal([ksize, ksize, n1, n2], stddev=0.1)),
    'cd1': tf.Variable(tf.random_normal([ksize, ksize, 1, n1], stddev=0.1))
}

biases = {
    'be1': tf.Variable(tf.random_normal([n1], stddev=0.1)),
    'be2': tf.Variable(tf.random_normal([n2], stddev=0.1)),
    'be3': tf.Variable(tf.random_normal([n3], stddev=0.1)),
    'bd3': tf.Variable(tf.random_normal([n2], stddev=0.1)),
    'bd2': tf.Variable(tf.random_normal([n1], stddev=0.1)),
    'bd1': tf.Variable(tf.random_normal([1], stddev=0.1))
}

The following function, cae, builds the convolutional autoencoder. The inputs passed are the image, _X; the weight and bias data structures, _W and _b; and the _keepprob parameter:

def cae(_X, _W, _b, _keepprob): 

The initial image of 784 pixels must be reshaped into a 28×28 matrix so that it can be processed by the subsequent convolutional layers:

    _input_r = tf.reshape(_X, shape=[-1, 28, 28, 1]) 

The first convolutional layer is _ce1, which has as its input the _input_r tensor relative to the input image:

    _ce1 = tf.nn.sigmoid(
        tf.add(tf.nn.conv2d(_input_r, _W['ce1'],
                            strides=[1, 2, 2, 1],
                            padding='SAME'),
               _b['be1']))

Before moving to the second convolutional layer, we apply the dropout operation:

    _ce1 = tf.nn.dropout(_ce1, _keepprob) 

In the following two encoding layers, we apply the same convolution and dropout operations:

    _ce2 = tf.nn.sigmoid(
        tf.add(tf.nn.conv2d(_ce1, _W['ce2'],
                            strides=[1, 2, 2, 1],
                            padding='SAME'),
               _b['be2']))
    _ce2 = tf.nn.dropout(_ce2, _keepprob)

    _ce3 = tf.nn.sigmoid(
        tf.add(tf.nn.conv2d(_ce2, _W['ce3'],
                            strides=[1, 2, 2, 1],
                            padding='SAME'),
               _b['be3']))
    _ce3 = tf.nn.dropout(_ce3, _keepprob)

The number of feature maps has increased from 1 (the input image) to 64, while the spatial size of the image has been reduced from 28×28 to 14×14, then 7×7, and finally 4×4, since each stride-2 convolution with SAME padding produces an output of size ceil(input/2). In the decoding phase, the compressed (or encoded) image must be upsampled and reshaped to be as similar to the original as possible.
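
As a quick sanity check, the following minimal sketch computes these spatial sizes from the SAME-padding rule:

import math

# Spatial size after each stride-2 convolution with SAME padding:
# output = ceil(input / stride)
size = 28
for name in ('_ce1', '_ce2', '_ce3'):
    size = int(math.ceil(size / 2.0))
    print(name, size)   # _ce1 14, _ce2 7, _ce3 4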

To obtain this, we use the TensorFlow function conv2d_transpose for the next three layers:

tf.nn.conv2d_transpose(value, filter, output_shape, strides, padding='SAME')  

This operation is sometimes called deconvolution; it is simply the transpose (gradient) of conv2d.

The arguments of this function are as follows:

  • value: A 4-D tensor of type float with shape (batch, height, width, in_channels).
  • filter: A 4-D tensor with the same type as value and shape (height, width, output_channels, in_channels). The in_channels dimension must match that of value.
  • output_shape: A 1-D tensor representing the output shape of the deconvolution op.
  • strides: A list of ints. The stride of the sliding window for each dimension of the input tensor.
  • padding: A string, either 'VALID' or 'SAME'.

The function returns a tensor with the same type as the value argument, as the sketch below illustrates.
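
Here is a minimal shape check, assuming the TensorFlow 1.x API used throughout this section; the names inp and flt are illustrative, and the sizes mirror the _cd3 layer defined below:

import numpy as np
import tensorflow as tf

# Upsample a 4x4 feature map with 64 channels to 7x7 with 32 channels.
# Note the filter layout: (height, width, output_channels, in_channels).
inp = tf.placeholder(tf.float32, [None, 4, 4, 64])
flt = tf.Variable(tf.random_normal([5, 5, 32, 64], stddev=0.1))
out = tf.nn.conv2d_transpose(inp, flt,
                             output_shape=tf.stack([tf.shape(inp)[0], 7, 7, 32]),
                             strides=[1, 2, 2, 1], padding='SAME')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(out, feed_dict={inp: np.zeros((1, 4, 4, 64), np.float32)})
    print(result.shape)   # (1, 7, 7, 32)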

The first deconvolutional layer, _cd3, takes the convolutional layer _ce3 as its input. It returns the _cd3 tensor, whose shape is (batch_size, 7, 7, 32):

    _cd3 = tf.nn.sigmoid(
        tf.add(tf.nn.conv2d_transpose(_ce3, _W['cd3'],
                                      tf.stack([tf.shape(_X)[0], 7, 7, n2]),
                                      strides=[1, 2, 2, 1],
                                      padding='SAME'),
               _b['bd3']))
    _cd3 = tf.nn.dropout(_cd3, _keepprob)

To the second deconvolutional layer, _cd2, we pass the deconvolutional layer _cd3 as the input. It returns the _cd2 tensor, whose shape is (batch_size, 14, 14, 16):

    _cd2 = tf.nn.sigmoid(
        tf.add(tf.nn.conv2d_transpose(_cd3, _W['cd2'],
                                      tf.stack([tf.shape(_X)[0], 14, 14, n1]),
                                      strides=[1, 2, 2, 1],
                                      padding='SAME'),
               _b['bd2']))
    _cd2 = tf.nn.dropout(_cd2, _keepprob)

The third and final deconvolutional layer, _cd1, takes the layer _cd2 as the input. It returns the resulting _out tensor, whose shape is (batch_size, 28, 28, 1), the same as the input image:

    _cd1 = tf.nn.sigmoid(
        tf.add(tf.nn.conv2d_transpose(_cd2, _W['cd1'],
                                      tf.stack([tf.shape(_X)[0], 28, 28, 1]),
                                      strides=[1, 2, 2, 1],
                                      padding='SAME'),
               _b['bd1']))
    _cd1 = tf.nn.dropout(_cd1, _keepprob)
    _out = _cd1
    return _out

Then we define a cost function as the sum of squared errors between the target y and the network's reconstruction, pred:

pred = cae(x, weights, biases, keepprob)
cost = tf.reduce_sum(
    tf.square(pred - tf.reshape(y, shape=[-1, 28, 28, 1])))
learning_rate = 0.001
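
Strictly speaking, this cost is a sum of squared errors rather than a mean; if you prefer a true mean squared error, an optional variant (not required for training to work) simply swaps tf.reduce_sum for tf.reduce_mean:

# Optional variant: average instead of sum over all pixels and images.
cost_mse = tf.reduce_mean(
    tf.square(pred - tf.reshape(y, shape=[-1, 28, 28, 1])))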

To optimize the cost, we'll use the AdamOptimizer:

optm = tf.train.AdamOptimizer(learning_rate).minimize(cost)

In the next step, we configure the running session for our network:

init = tf.global_variables_initializer()
print ("Functions ready")
sess = tf.Session()
sess.run(init)
mean_img = np.zeros((784))

The size of the batch is set as 128:

batch_size = 128 

The number of epochs is 5:

n_epochs   = 5 

Start the training loop:

for epoch_i in range(n_epochs): 

For each epoch, we get a batch set, trainbatch:

    for batch_i in range(mnist.train.num_examples // batch_size):
        batch_xs, _ = mnist.train.next_batch(batch_size)
        trainbatch = np.array([img - mean_img for img in batch_xs])

We add random noise to the input batch, as in denoising autoencoders, to force the network to learn more robust features:

        trainbatch_noisy = trainbatch + 0.3*np.random.randn(
            trainbatch.shape[0], 784)
        sess.run(optm, feed_dict={x: trainbatch_noisy,
                                  y: trainbatch, keepprob: 0.7})
    print ("[%02d/%02d] cost: %.4f" % (epoch_i, n_epochs,
           sess.run(cost, feed_dict={x: trainbatch_noisy,
                                     y: trainbatch, keepprob: 1.})))

At the end of each training epoch, we randomly take five test examples:

    if (epoch_i % 1) == 0:
        n_examples = 5
        test_xs, _ = mnist.test.next_batch(n_examples)
        test_xs_noisy = test_xs + 0.3*np.random.randn(
            test_xs.shape[0], 784)

Then we test the trained model on this small subset:

        recon = sess.run(pred, feed_dict={x: test_xs_noisy,
                                          keepprob: 1.})
        fig, axs = plt.subplots(2, n_examples, figsize=(15, 4))
        for example_i in range(n_examples):
            axs[0][example_i].matshow(np.reshape(
                test_xs_noisy[example_i, :], (28, 28)),
                cmap=plt.get_cmap('gray'))

Finally, we display the inputs and the reconstructed images using matplotlib:

            axs[1][example_i].matshow(np.reshape(
                np.reshape(recon[example_i, ...], (784,))
                + mean_img, (28, 28)), cmap=plt.get_cmap('gray'))
        plt.show()

The execution will produce the following output:

>>>  
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Packages loaded
Network ready
Functions ready
Start training..
[00/05] cost: 8049.0332
[01/05] cost: 3706.8667
[02/05] cost: 2839.9155
[03/05] cost: 2462.7021
[04/05] cost: 2391.9460
>>>

Note that, for each epoch, we visualize the input set and the corresponding reconstructed set, as shown previously.

As you may note, after the first epoch the reconstructed images are barely recognizable:

First epoch images

The reconstructions become clearer in the second epoch:

Second epoch images

The third epoch is shown in the following figure:

Third epoch images

They are better still in the fourth epoch:

Fourth epoch images

We could probably have stopped with the previous epoch, but this is the final, fifth epoch:

Fifth epoch images