Fine-tuning implementation

Our classification task has two categories, so the new softmax layer of the network will output 2 classes instead of 1,000. Here are the input tensor, which is a 227×227×3 image, and the output tensor, which has one entry for each of the 2 classes:

import numpy as np

n_classes = 2
train_x = np.zeros((1, 227, 227, 3)).astype(np.float32)  # one 227x227 RGB image
train_y = np.zeros((1, n_classes))                        # one-hot label with 2 classes
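
The fine-tuning code that follows also refers to the placeholders x, y, and keep_prob, to the learning_rate value, and to the fc7 tensor of the pre-trained AlexNet graph, all of which are built earlier in the chapter. Here is a minimal sketch of those definitions, assuming that setup (the shapes and the learning rate value are taken from the surrounding text, not from this excerpt):

import tensorflow as tf

x = tf.placeholder(tf.float32, (None,) + train_x.shape[1:])  # image batches
y = tf.placeholder(tf.float32, [None, n_classes])            # one-hot labels
keep_prob = tf.placeholder(tf.float32)                       # dropout keep probability
learning_rate = 0.5                                          # step size quoted below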

Implementing fine-tuning consists of truncating the last layer (the softmax layer) of the pre-trained network and replacing it with a new softmax layer that is relevant to our problem.

For example, the network pre-trained on ImageNet comes with a softmax layer covering 1,000 categories.

The following code snippet defines the new softmax layer, fc8:

# New softmax layer: map the 4,096 fc7 features to our 2 classes
fc8W = tf.Variable(tf.random_normal([4096, n_classes]),
                   trainable=True, name="fc8w")
fc8b = tf.Variable(tf.random_normal([n_classes]),
                   trainable=True, name="fc8b")
fc8 = tf.nn.xw_plus_b(fc7, fc8W, fc8b)
prob = tf.nn.softmax(fc8)

Cross-entropy is the performance measure we use for classification. It is a continuous function that is always positive, and it equals zero when the predicted output of the model exactly matches the desired output. The goal of optimization is therefore to minimize the cross-entropy by changing the weights and biases of the model, so that it gets as close to zero as possible.
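
As a small illustration of this measure (a sketch added here, not part of the original example), the cross-entropy between a one-hot label and two hypothetical softmax outputs can be computed with NumPy as follows:

import numpy as np

def cross_entropy(y_true, y_pred):
    # -sum(y_true * log(y_pred)); zero only for a perfect prediction
    return -np.sum(y_true * np.log(y_pred))

label = np.array([1.0, 0.0])                       # the image is a cat
print(cross_entropy(label, np.array([0.9, 0.1])))  # ~0.105: confident, correct
print(cross_entropy(label, np.array([0.5, 0.5])))  # ~0.693: completely uncertain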

TensorFlow has a built-in function for calculating cross-entropy. In order to use cross-entropy to optimize the model's variables, we need a single scalar value, so we simply take the average of the cross-entropy over all the image classifications:

# Note: softmax_cross_entropy_with_logits_v2 expects the raw logits (fc8),
# not the softmax output, because it applies the softmax internally
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(logits=fc8, labels=y))

# We only fine-tune the variables of the new fc8 layer
opt_vars = [v for v in tf.trainable_variables()
            if v.name.startswith("fc8")]

Now that we have a cost measure to minimize, we can create an optimizer:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(
    loss, var_list=opt_vars)
correct_pred = tf.equal(tf.argmax(prob, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

In this case, we use the AdamOptimizer with the step size (learning_rate) set to 0.5. Note that optimization is not performed at this point; in fact, nothing is calculated at all. We just add the optimizer object to the TensorFlow graph for later execution. Then we run backpropagation on the network to fine-tune the pre-trained weights:

batch_size = 100
training_iters = 6000
display_step = 1
dropout = 0.85 # Dropout, probability to keep units

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    step = 1

Keep training until we reach the maximum number of iterations:

    while step * batch_size < training_iters:
        batch_x, batch_y = next(next_batch(batch_size))

Run the optimization operation (backpropagation):

        sess.run(optimizer, 
                 feed_dict={x: batch_x, 
                            y: batch_y, 
                            keep_prob: dropout})

        if step % display_step == 0:

Calculate the batch loss and accuracy:

            cost, acc = sess.run([loss, accuracy],
                                 feed_dict={x: batch_x, 
                                            y: batch_y, 
                                            keep_prob: 1.})
            print ("Iter " + str(step*batch_size) 
                   + ", Minibatch Loss= " + 
                  "{:.6f}".format(cost) + 
                   ", Training Accuracy= " + 
                  "{:.5f}".format(acc))              
            
        step += 1
    print ("Optimization Finished!")

The training of the network produces the following results:

Iter 100, Minibatch Loss= 0.555294, Training Accuracy= 0.76000
Iter 200, Minibatch Loss= 0.584999, Training Accuracy= 0.73000
Iter 300, Minibatch Loss= 0.582527, Training Accuracy= 0.73000
Iter 400, Minibatch Loss= 0.610702, Training Accuracy= 0.70000
Iter 500, Minibatch Loss= 0.583640, Training Accuracy= 0.73000
Iter 600, Minibatch Loss= 0.583523, Training Accuracy= 0.73000
…………………………………………………………………
…………………………………………………………………
Iter 5400, Minibatch Loss= 0.361158, Training Accuracy= 0.95000
Iter 5500, Minibatch Loss= 0.403371, Training Accuracy= 0.91000
Iter 5600, Minibatch Loss= 0.404287, Training Accuracy= 0.91000
Iter 5700, Minibatch Loss= 0.413305, Training Accuracy= 0.90000
Iter 5800, Minibatch Loss= 0.413816, Training Accuracy= 0.89000
Iter 5900, Minibatch Loss= 0.413476, Training Accuracy= 0.90000
Optimization Finished!

To test our model, we compare the predictions with the ground-truth labels (cat = 0, dog = 1):

    output = sess.run(prob, feed_dict = {x:imlist, keep_prob: 1.})
    result = np.argmax(output,1)
    testResult = [1,1,1,1,0,0,0,0,0,0,
                  0,1,0,0,0,0,1,1,0,0,
                  1,0,1,1,0,1,1,0,0,1,
                  1,1,1,0,0,0,0,0,1,0,
                  1,1,1,1,0,1,0,1,1,0,
                  1,0,0,1,0,0,1,1,1,0,
                  1,1,1,1,1,0,0,0,0,0,
                  0,1,1,1,0,1,1,1,1,0,
                  0,0,1,0,1,1,1,1,0,0,
                  0,0,0,1,1,0,1,1,0,0]
    count = 0
    for i in range(0, 100):
        if result[i] == testResult[i]:
            count = count + 1

    # count is directly a percentage because there are 100 test images
    print("Testing Accuracy = " + str(count) + "%")

Finally, we have the accuracy of our model:

Testing Accuracy = 82%

VGG

VGG is the name of the team that presented its neural networks during ILSVRC 2014. We speak of networks, in the plural, because more than one version of the same network was created, each with a different number of layers. Depending on the number n of weight layers a given network has, it is usually called VGG-n. All of these networks are deeper than AlexNet: they are made up of more trainable layers than AlexNet, in this case 11 to 19 trained layers in total. Often, only the trainable layers are counted, because they are the ones that affect the processing and the size of the model, as we saw earlier. However, the overall structure remains very similar: there is always an initial series of convolutional layers followed by a final series of fully connected layers, the latter being exactly the same as in AlexNet. What changes is therefore the number of convolutional layers used and, of course, their parameters. The following table shows all the variants built by the VGG team.

Each column, going from left to right, shows a particular VGG network, from the shallowest to the deepest. Bold terms indicate what has been added in each version compared to the previous one. The ReLU layer is not shown in the table, but in the network it follows every convolutional layer. All convolutional layers use a stride of 1:


Table: VGG network architectures

Note that, unlike AlexNet, the VGG networks do not have convolutional layers with a large receptive field: here, all receptive fields are 3×3, except for a couple of convolutional layers in VGG-16 that have a 1×1 receptive field. Recall that a convolutional layer with a stride of 1 (and suitable padding) does not change the spatial size of the input, while its depth becomes equal to the number of kernels used. Consequently, the VGG convolutional layers never affect the width and height of the input volumes; only the pooling layers do that. The idea of using a series of convolutional layers with a small receptive field, which overall simulates a single convolutional layer with a larger receptive field, is motivated by the fact that in this way multiple ReLU layers are used instead of just one, thereby increasing the nonlinearity of the activation and making it more discriminating. It also serves to reduce the number of parameters used. These networks are considered an evolution of AlexNet because, overall, and on the same dataset, they perform better than AlexNet. The main concept demonstrated by the VGG networks is that the deeper a convolutional neural network is, the better its performance. However, ever more powerful hardware is needed, otherwise training the network becomes problematic.
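
As a quick check of the parameter-saving argument (a sketch added here, not part of the original text), we can compare a stack of three 3×3 convolutional layers, whose effective receptive field is 7×7, against a single 7×7 layer, assuming C input and C output channels and ignoring biases:

C = 256  # example channel count

params_three_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3x3 layers: 27*C^2
params_single_7x7 = 7 * 7 * C * C       # one 7x7 layer:            49*C^2

print(params_three_3x3)   # 1769472
print(params_single_7x7)  # 3211264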

For training the VGGs, four NVIDIA Titan Black GPUs with 6 GB of memory each were used. The VGGs therefore perform better, but they need a lot of hardware for training and use a very large number of parameters: the VGG-19 model, for example, is about 550 MB (twice the size of AlexNet). The smaller VGG networks still produce models of about 507 MB.

Artistic style learning with VGG-19

In this project, we'll use a pre-trained VGG-19 to learn the style and patterns created by an artist and transfer them to an image (the project file is style_transfer.py in the GitHub repository of this book). This technique is called artistic style learning (see the paper A Neural Algorithm of Artistic Style (https://arxiv.org/pdf/1508.06576.pdf) by Gatys et al.). In the literature, artistic style learning is defined as follows: given two images as input, synthesize a third image that has the semantic content of the first image and the texture/style of the second.

For this to work properly, we need to train a deep convolutional neural network to build the following:

  • A content extractor to determine the content of image A
  • A style extractor to determine the style of image B
  • A merger to merge some arbitrary content with another arbitrary style to obtain the final result

Figure 11: Artistic style learning operational schema

Input images

The input images, each of which is 478×478 pixels, are the following images (cat.jpg and mosaic.jpg), which you will also find in the code repository for this book:

Figure 12: Input images in Artistic Style Learning

In order to be analyzed by the VGG model, these images need to be preprocessed:

  1. Adding an extra (batch) dimension
  2. Subtracting MEAN_VALUES from the input image:
    MEAN_VALUES = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))

    def preprocess(path):
        image = plt.imread(path)
        image = image[np.newaxis]    # 1. add the batch dimension
        image = image - MEAN_VALUES  # 2. subtract the ImageNet channel means
        return image

    content_image = preprocess('cat.jpg')
    style_image = preprocess('mosaic.jpg')

Content extractor and loss

To isolate the semantic content of an image, we use a pre-trained VGG-19 neural network, make some slight tweaks to adapt it to this problem, and then use the output of one of its hidden layers as a content extractor. The following figure shows the CNN used for this problem:

Figure 13: VGG-19 used for Artistic Style Learning

The pre-trained VGG is loaded using the following code:

import scipy.io
vgg = scipy.io.loadmat('imagenet-vgg-verydeep-19.mat')

The imagenet-vgg-verydeep-19.mat model should be downloaded from http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat.

This model has 43 layers, 19 of which have weights (16 convolutional layers and 3 fully connected layers). The rest are max pooling and activation layers.

We can check the shape of each convolutional and fully connected layer:

vgg_layers = vgg['layers']
for i in range(43):
    name = vgg_layers[0][i][0][0][0][0]
    if 'conv' in name or 'fc' in name:
        # index [2][0][0] holds the layer's weight tensor
        print(vgg_layers[0][i][0][0][2][0][0].shape, name)

The result of the preceding code is as follows:

(3, 3, 3, 64) conv1_1
(3, 3, 64, 64) conv1_2
(3, 3, 64, 128) conv2_1
(3, 3, 128, 128) conv2_2
(3, 3, 128, 256) conv3_1
(3, 3, 256, 256) conv3_2
(3, 3, 256, 256) conv3_3
(3, 3, 256, 256) conv3_4
(3, 3, 256, 512) conv4_1
(3, 3, 512, 512) conv4_2
(3, 3, 512, 512) conv4_3
(3, 3, 512, 512) conv4_4
(3, 3, 512, 512) conv5_1
(3, 3, 512, 512) conv5_2
(3, 3, 512, 512) conv5_3
(3, 3, 512, 512) conv5_4
(7, 7, 512, 4096) fc6
(1, 1, 4096, 4096) fc7
(1, 1, 4096, 1000) fc8

Each shape is represented in the following way: [kernel height, kernel width, number of input channels, number of output channels].

The first layer has 3 input channels because the input is an RGB image. The number of output channels of the convolutional layers goes from 64 to 512, and all convolutional kernels are 3×3 matrices.

Then we apply the transfer learning technique in order to adapt the VGG-19 network to our problem:

  1. The fully connected layers are not needed, because they are only used for the final object classification.
  2. The max pooling layers are replaced with average pooling layers in order to achieve better results. An average pooling layer slides a window over the input just like a convolution kernel does, but outputs the average of the values it covers (a sketch of the conv2d_relu and avgpool helpers used below follows this code block):
    IMAGE_WIDTH = 478
    IMAGE_HEIGHT = 478
    INPUT_CHANNELS = 3
    model = {}
    model['input'] = tf.Variable(np.zeros((1, IMAGE_HEIGHT,
                                     IMAGE_WIDTH,
                                     INPUT_CHANNELS)),
                                   dtype = 'float32')
    
    model['conv1_1']  = conv2d_relu(model['input'], 0, 'conv1_1')
    model['conv1_2']  = conv2d_relu(model['conv1_1'], 2, 'conv1_2')
    model['avgpool1'] = avgpool(model['conv1_2'])
    
    model['conv2_1']  = conv2d_relu(model['avgpool1'], 5, 'conv2_1')
    model['conv2_2']  = conv2d_relu(model['conv2_1'], 7, 'conv2_2')
    model['avgpool2'] = avgpool(model['conv2_2'])
    
    model['conv3_1']  = conv2d_relu(model['avgpool2'], 10, 'conv3_1')
    model['conv3_2']  = conv2d_relu(model['conv3_1'], 12, 'conv3_2')
    model['conv3_3']  = conv2d_relu(model['conv3_2'], 14, 'conv3_3')
    model['conv3_4']  = conv2d_relu(model['conv3_3'], 16, 'conv3_4')
    model['avgpool3'] = avgpool(model['conv3_4'])
    
    model['conv4_1']  = conv2d_relu(model['avgpool3'], 19,'conv4_1')
    model['conv4_2']  = conv2d_relu(model['conv4_1'], 21, 'conv4_2')
    model['conv4_3']  = conv2d_relu(model['conv4_2'], 23, 'conv4_3')
    model['conv4_4']  = conv2d_relu(model['conv4_3'], 25,'conv4_4')
    model['avgpool4'] = avgpool(model['conv4_4'])
    
    model['conv5_1']  = conv2d_relu(model['avgpool4'], 28, 'conv5_1')
    model['conv5_2']  = conv2d_relu(model['conv5_1'], 30, 'conv5_2')
    model['conv5_3']  = conv2d_relu(model['conv5_2'], 32, 'conv5_3')
    model['conv5_4']  = conv2d_relu(model['conv5_3'], 34, 'conv5_4')
    model['avgpool5'] = avgpool(model['conv5_4'])
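
The conv2d_relu and avgpool helpers are defined in the book's project file rather than in this excerpt. A minimal sketch of what they might look like, assuming vgg_layers = vgg['layers'] and the .mat weight layout used in the shape-printing snippet above (the exact bias indexing is an assumption):

def get_weights(layer_index):
    # Weights and biases of a layer in the MatConvNet .mat structure
    W = vgg_layers[0][layer_index][0][0][2][0][0]
    b = vgg_layers[0][layer_index][0][0][2][0][1]
    return tf.constant(W), tf.constant(b.reshape(b.size))

def conv2d_relu(prev_layer, layer_index, layer_name):
    # layer_name is kept only to make the calls above easier to read
    W, b = get_weights(layer_index)
    conv = tf.nn.conv2d(prev_layer, W, strides=[1, 1, 1, 1], padding='SAME')
    return tf.nn.relu(conv + b)

def avgpool(prev_layer):
    return tf.nn.avg_pool(prev_layer, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')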

Here we define the contentloss function, which measures the difference in content between two images, p and x:

def contentloss(p, x):
    size = np.prod(p.shape[1:])
    loss = (1./(2*size)) * tf.reduce_sum(tf.pow((x - p),2))
    return loss

This function tends to be 0 when the input images are very close to each other in terms of content and grows as their content deviates.

We'll apply contentloss to the conv5_4 layer: we take the activations this layer produces for the content image as the target, and compare them with the activations it produces for the image being generated, using the contentloss function:

# This assumes model['input'] has already been assigned the content image,
# so sess.run(model['conv5_4']) returns the content image's activations
content_loss = contentloss(sess.run(model['conv5_4']), model['conv5_4'])

Minimizing the content_loss means that the mixed image has feature activation in the given layers that is very similar to the activation of the content image.

Style extractor and loss

The style extractor uses the Gram matrix of the filter activations of a given hidden layer. Simply speaking, this matrix discards the spatial (semantic) arrangement of the image while preserving the correlations between its basic components, which makes it a good texture descriptor:

def gram_matrix(F, N, M):
    Ft = tf.reshape(F, (M, N))
    return tf.matmul(tf.transpose(Ft), Ft)
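
In the notation of this code, F is reshaped into the M×N matrix $F_t$, with one column per filter, so the Gram matrix returned by tf.matmul is:

$$G = F_t^{\top} F_t, \qquad G_{ij} = \sum_{k=1}^{M} F_{ki} F_{kj}, \quad i, j = 1, \dots, N$$

That is, each entry $G_{ij}$ is the inner product between the flattened responses of filters i and j, which captures which filters tend to activate together regardless of where in the image they do so.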

The style_loss function measures how close two images are to one another in style. It is the (normalized) sum of the squared differences between the elements of the Gram matrices produced by the style image and by the input noise_image:

noise_image = np.random.uniform(-20, 20,
                                (1, IMAGE_HEIGHT,
                                 IMAGE_WIDTH,
                                 INPUT_CHANNELS)).astype('float32')

def style_loss(a, x):
    N = a.shape[3]
    M = a.shape[1] * a.shape[2]
    A = gram_matrix(a, N, M)
    G = gram_matrix(x, N, M)
    result = (1/(4 * N**2 * M**2))* tf.reduce_sum(tf.pow(G-A,2))
    return result

style_loss grows as its two input images (a and x) tend to deviate in style.

Merger and total loss

We can merge the content and style losses so that the input noise_image is trained to produce, in the chosen layers, a style similar to that of the style image, along with features similar to those of the content image:

alpha = 1
beta = 100
total_loss = alpha * content_loss + beta * styleloss
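
The excerpt does not show how the styleloss term used above is built from the style_loss function. Here is a minimal sketch, assuming that a session sess already exists with its variables initialized, that model['input'] is temporarily assigned the style image, and that the style is taken from the first convolutional layer of each block (the layer choice and the equal weighting are illustrative assumptions, not taken from the original code):

# Hypothetical choice of style layers; the original project may use a different set
STYLE_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']

sess.run(model['input'].assign(style_image))  # feed the style image through the network
styleloss = sum(
    style_loss(sess.run(model[layer]), model[layer])
    for layer in STYLE_LAYERS) / len(STYLE_LAYERS)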

Training

Minimize the loss in the network so that the style loss (the difference between the output image's style and the style of the style image) and the content loss (the difference between the content of the output image and that of the content image) are as low as possible:

train_step = tf.train.AdamOptimizer(1.5).minimize(total_loss)

The output image generated by such a network should resemble the input (content) image and carry the stylistic attributes of the style image.

Finally, we can prepare the network for training:

sess.run(tf.global_variables_initializer())
sess.run(model['input'].assign(noise_image))  # start the optimization from the noise image
for it in range(2001):
    sess.run(train_step)
    if it % 100 == 0:
        # Print the cost and save an intermediate result every 100 iterations
        mixed_image = sess.run(model['input'])
        print('iteration:', it, 'cost: ', sess.run(total_loss))
        filename = 'out2/%d.png' % (it)
        deprocess(filename, mixed_image)
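
The deprocess function, like preprocess, is defined in the book's project file rather than in this excerpt. A minimal sketch of what it might look like, simply reversing the preprocessing steps (adding back MEAN_VALUES, dropping the batch dimension, and clipping to valid pixel values before saving):

def deprocess(path, image):
    image = image + MEAN_VALUES                     # undo the mean subtraction
    image = image[0]                                # drop the batch dimension
    image = np.clip(image, 0, 255).astype('uint8')  # keep pixels in [0, 255]
    plt.imsave(path, image)                         # write the image to disk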

Training can be very time-consuming, but the results can be very interesting:

iteration: 0 cost:  8.14037e+11
iteration: 100 cost:  1.65584e+10
iteration: 200 cost:  5.22747e+09
iteration: 300 cost:  2.72995e+09
iteration: 400 cost:  1.8309e+09
iteration: 500 cost:  1.36818e+09
iteration: 600 cost:  1.0804e+09
iteration: 700 cost:  8.83103e+08
iteration: 800 cost:  7.38783e+08
iteration: 900 cost:  6.28652e+08
iteration: 1000 cost:  5.41755e+08

After 1,000 iterations, we have created a new mosaic:

Figure 14: Output image in Artistic Style Learning

That's really amazing! You can finally train your neural network to paint like Picasso...have fun!
