Emotion recognition with CNNs

One of the hardest problems in deep learning has nothing to do with neural networks: it is the problem of getting the right data in the right format. A valuable resource for finding new problems, and new datasets to study, is the Kaggle platform (https://www.kaggle.com/).

Kaggle was founded in 2010 as a platform for predictive modeling and analytics competitions, on which companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best models.

In this section, we show how to build a CNN for emotion detection from facial images. The training and test sets for this example can be downloaded from https://inclass.kaggle.com/c/facial-keypoints-detector/data. To download the data, you can log in with a Facebook, Google+, or Yahoo account; alternatively, you can create a Kaggle account.

The Kaggle competition page

The training set consists of 3,761 grayscale images of 48x48 pixels and 3,761 labels, each of which is a vector of seven elements.

Each element corresponds to one emotion: 0 = anger, 1 = disgust, 2 = fear, 3 = happy, 4 = sad, 5 = surprise, 6 = neutral. The element for the image's emotion is set to 1 and all the others to 0.
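The label encoding can be illustrated with a few lines of NumPy. This is only a sketch: the names EMOTIONS, one_hot, and decode are hypothetical helpers introduced here, not part of EmotionDetectorUtils.

```python
import numpy as np

# Class order as listed above; index i is the position of the 1 in the label.
EMOTIONS = ["anger", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def one_hot(index, num_labels=len(EMOTIONS)):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = np.zeros(num_labels, dtype=np.float32)
    vec[index] = 1.0
    return vec


def decode(label):
    """Map a one-hot (or probability) vector back to an emotion name."""
    return EMOTIONS[int(np.argmax(label))]


label = one_hot(3)
print("%s -> %s" % (label, decode(label)))
```

Because decode uses argmax, the same helper also works on a vector of predicted class probabilities.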

In a classic Kaggle competition, the labels predicted for the test set are evaluated by submitting them to the platform. In this example, we will instead train a neural network on the training set and then evaluate the model on a single image.

Before starting the CNN implementation, we'll look at the downloaded data by implementing a simple procedure.

Import the libraries with the following code:

import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
import EmotionDetectorUtils

Please note that this code depends on EmotionDetectorUtils, which in turn uses the pandas package. To install pandas on Ubuntu, run the following commands in a terminal:

sudo apt-get update  
sudo apt-get install python-pip 
sudo pip install numpy 
sudo pip install pandas
sudo apt-get install python-pandas

The read_data function builds all the datasets from the downloaded data. You can find it in the EmotionDetectorUtils library, which you can download from the code repository for this book:

FLAGS = tf.flags.FLAGS
tf.flags.DEFINE_string("data_dir", "EmotionDetector/", "Path to data files")

# The backslash continues the assignment onto the next line
train_images, train_labels, valid_images, valid_labels, test_images = \
    EmotionDetectorUtils.read_data(FLAGS.data_dir)

Then print the shapes of the training images, the training labels, and the test images:

print "train images shape = ",train_images.shape
print "train labels shape = ",train_labels.shape
print "test labels shape = ",test_images.shape

Display the first image of the training set and its correct label:

image_0 = train_images[0] 
label_0 = train_labels[0] 
print "image_0 shape = ",image_0.shape 
print "label set = ",label_0 
image_0 = np.resize(image_0,(48,48)) 
 
plt.imshow(image_0, cmap='Greys_r') 
plt.show()

There are 3,761 grayscale images of 48x48 pixels:

train images shape = (3761, 48, 48, 1)

There are 3,761 labels, each of which is a vector of seven elements:

train labels shape = (3761, 7)

The test set consists of 1,312 grayscale images of 48x48 pixels (the listing prints this line as test labels shape, but the value is the shape of test_images):

test labels shape = (1312, 48, 48, 1)

A single image has the following shape:

image_0 shape = (48, 48, 1)

The label set for the first image is:

label set = [ 0.  0.  0.  1.  0.  0.  0.]

It corresponds to the happy class, which we visualize in the following matplotlib figure:

The first image from the emotion detection face dataset

We shall now proceed to study the CNN architecture. The following figure shows how the data flows in the CNN that will be implemented:

The first two convolutional layers of the implemented CNN

The network has two convolutional layers, two fully-connected layers, and finally a softmax classification layer. The input image (48x48 pixels) is processed in the first convolutional layer using 5x5 convolutional kernels. This results in 32 feature maps, one for each filter used. The feature maps are then downsampled by a max-pooling operation, from 48x48 to 24x24 pixels. These 32 smaller images are processed by a second convolutional layer, which uses 3x3 kernels and produces 64 new feature maps (see the preceding figure). The resulting images are downsampled again, to 12x12 pixels, by a second pooling operation.

The output of this second pooling layer consists of 64 images of 12x12 pixels each. These are flattened to a single vector of length 12x12x64 = 9,216, which is used as the input to a fully-connected layer with 256 neurons. This feeds into another fully-connected layer with 7 neurons, one for each of the classes, whose output determines the class of the image, that is, the emotion depicted in it.
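The length of the flattened vector follows from the pooling arithmetic. The following is a minimal sketch, assuming that the convolutions keep the image size (stride 1 with "SAME" padding, as in the model code later in this section) and that each pooling step halves it; pooled_size is a hypothetical helper introduced only for this check.

```python
IMAGE_SIZE = 48       # input height and width in pixels
NUM_FILTERS_2 = 64    # feature maps produced by the second convolution


def pooled_size(size, num_pools):
    """Side length after `num_pools` 2x2 max-pooling steps with stride 2.

    The convolutions are assumed to preserve the side length, so only the
    pooling steps shrink the image.
    """
    for _ in range(num_pools):
        size = (size + 1) // 2   # ceil(size / 2); exact halving for even sizes
    return size


side = pooled_size(IMAGE_SIZE, 2)
flat_length = side * side * NUM_FILTERS_2
print("side after two poolings: %d" % side)
print("flattened vector length: %d" % flat_length)
```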

The last two layers of the implemented CNN

Let's move on to the definition of the weights and biases. The following data structure defines the network's weights and summarizes what we have described so far:

weights = {
    'wc1': weight_variable([5, 5, 1, 32], name="W_conv1"),
    'wc2': weight_variable([3, 3, 32, 64], name="W_conv2"),
    # Integer division keeps the flattened size an int
    'wf1': weight_variable(
        [(IMAGE_SIZE // 4) * (IMAGE_SIZE // 4) * 64, 256], name="W_fc1"),
    'wf2': weight_variable([256, NUM_LABELS], name="W_fc2")
}

Note that the convolutional filters are initialized randomly (here from a truncated normal distribution with a standard deviation of 0.02), so before training the classification is essentially random:

def weight_variable(shape, stddev=0.02, name=None): 
    initial = tf.truncated_normal(shape, stddev=stddev) 
    if name is None: 
        return tf.Variable(initial) 
    else: 
        return tf.get_variable(name, initializer=initial)
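To make the initialization concrete, here is a rough NumPy imitation of what tf.truncated_normal does: it draws values from a normal distribution and re-draws any value that lies more than two standard deviations from the mean. The function below is an illustrative stand-in, not the TensorFlow implementation, and the seed argument is an assumption added for reproducibility.

```python
import numpy as np


def truncated_normal(shape, stddev=0.02, seed=0):
    """Draw N(0, stddev^2) samples, re-drawing any beyond 2 * stddev."""
    rng = np.random.RandomState(seed)
    values = rng.normal(0.0, stddev, size=shape)
    outside = np.abs(values) > 2 * stddev
    while outside.any():
        # Replace only the out-of-range entries with fresh samples
        values[outside] = rng.normal(0.0, stddev, size=int(outside.sum()))
        outside = np.abs(values) > 2 * stddev
    return values


w = truncated_normal([5, 5, 1, 32])   # same shape as the first filter bank
print(w.shape)
print(bool(np.abs(w).max() <= 2 * 0.02))
```

With stddev = 0.02, every initial weight therefore lies in the interval [-0.04, 0.04].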

In a similar way, we have defined bias_variable:

biases = { 
    'bc1': bias_variable([32], name="b_conv1"), 
    'bc2': bias_variable([64], name="b_conv2"), 
    'bf1': bias_variable([256], name="b_fc1"), 
    'bf2': bias_variable([NUM_LABELS], name="b_fc2") 
} 
 
def bias_variable(shape, name=None): 
    initial = tf.constant(0.0, shape=shape) 
    if name is None: 
        return tf.Variable(initial) 
    else: 
        return tf.get_variable(name, initializer=initial)

An optimizer must propagate the error back through the CNN using the chain rule of differentiation and update the filter weights to reduce the classification error. The error between the predicted and true class of the input image is measured by the loss function. It takes as input the output of the model, pred, and the desired output, label:

def loss(pred, label):
    # TensorFlow 1.x requires the logits and labels as named arguments
    cross_entropy_loss = tf.nn.softmax_cross_entropy_with_logits(
        logits=pred, labels=label)
    cross_entropy_loss = tf.reduce_mean(cross_entropy_loss)
    reg_losses = tf.add_n(tf.get_collection("losses"))
    return cross_entropy_loss + REGULARIZATION * reg_losses

The tf.nn.softmax_cross_entropy_with_logits function computes the cross-entropy between the labels and the softmax of the model output, but it fuses the two steps in a numerically stable way. Conceptually, it is equivalent to:

a = tf.nn.softmax(x) 
b = cross_entropy(a)
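The equivalence can be checked numerically. The following NumPy sketch compares the two-step computation with a fused version based on the log-sum-exp trick; the function names and the example logits are made up for illustration and are not TensorFlow APIs.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    """Per-example cross-entropy between one-hot labels and probabilities."""
    return -(labels * np.log(probs)).sum(axis=1)


def softmax_cross_entropy(logits, labels):
    """Fused form: works on log-probabilities directly (log-sum-exp)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)


logits = np.array([[2.0, 1.0, 0.1, 0.0, -1.0, 0.5, 0.3]])
labels = np.array([[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]])  # "happy"

two_step = cross_entropy(softmax(logits), labels)
fused = softmax_cross_entropy(logits, labels)
print(bool(np.allclose(two_step, fused)))
```

The fused form never takes the logarithm of a probability that has underflowed to zero, which is why it is preferred for very large or very small logits.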

We compute the cross-entropy for each of the classified images, so that we have a measure of how well the model performs on each image individually.

We then take the average of the cross-entropy over the classified images:

cross_entropy_loss= tf.reduce_mean(cross_entropy_loss)

To prevent overfitting, we use L2 regularization that consists of inserting an additional term to the cross_entropy_loss function:

reg_losses = tf.add_n(tf.get_collection("losses")) 
return cross_entropy_loss + REGULARIZATION * reg_losses

Where:

def add_to_regularization_loss(W, b): 
    tf.add_to_collection("losses", tf.nn.l2_loss(W)) 
    tf.add_to_collection("losses", tf.nn.l2_loss(b))
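As a small worked example of the penalty term, the following NumPy sketch reproduces the arithmetic. tf.nn.l2_loss(t) computes sum(t ** 2) / 2; the toy values of W, b, and the data loss are made up, and REGULARIZATION uses the value set in the parameter list of this section.

```python
import numpy as np

REGULARIZATION = 1e-2  # value used in this section's parameter list


def l2_loss(t):
    """NumPy equivalent of tf.nn.l2_loss: sum(t ** 2) / 2."""
    return np.sum(np.square(t)) / 2.0


def total_loss(cross_entropy_loss, params):
    """Add the weighted sum of the L2 penalties to the data loss."""
    reg_losses = sum(l2_loss(p) for p in params)
    return cross_entropy_loss + REGULARIZATION * reg_losses


W = np.array([[1.0, -2.0], [0.5, 0.0]])   # toy weight matrix
b = np.array([0.1, -0.1])                 # toy bias vector
print(total_loss(1.5, [W, b]))
```

Here the penalties are 2.625 for W and 0.01 for b, so the regularization adds 0.01 * 2.635 = 0.02635 to the data loss of 1.5.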

See http://www.kdnuggets.com/2015/04/preventing-overfitting-neural-networks.html/2 for further reference.

We have now defined the network's weights and biases and the loss to be optimized. As with all the networks implemented so far, however, the script must begin by importing the necessary libraries:

import tensorflow as tf 
import numpy as np 
import os, sys, inspect 
from datetime import datetime 
import EmotionDetectorUtils

We set the paths for storing the dataset on the computer, and the network parameters with the following code:

FLAGS = tf.flags.FLAGS 
tf.flags.DEFINE_string("data_dir",  
                           "EmotionDetector/", "Path to data files") 
tf.flags.DEFINE_string("logs_dir", "logs/EmotionDetector_logs/", 
                           "Path to where log files are to be saved") 
tf.flags.DEFINE_string("mode", "train", "mode: train (Default)/ test") 
 
BATCH_SIZE = 128 
LEARNING_RATE = 1e-3 
MAX_ITERATIONS = 1001 
REGULARIZATION = 1e-2 
IMAGE_SIZE = 48 
NUM_LABELS = 7 
VALIDATION_PERCENT = 0.1
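The sizes reported in the training log at the end of this section (3,761 training and 417 validation images) are consistent with holding out 10% of 4,178 labelled images. How read_data actually performs the split is not shown here, so the following is only a sketch under that assumption; split_sizes is a hypothetical helper.

```python
VALIDATION_PERCENT = 0.1


def split_sizes(num_labelled, validation_percent=VALIDATION_PERCENT):
    """Return (train_size, validation_size) for a simple hold-out split."""
    validation_size = int(num_labelled * validation_percent)
    return num_labelled - validation_size, validation_size


# 4178 = 3761 + 417, the sizes printed in the training log
train_size, valid_size = split_sizes(4178)
print("train: %d, validation: %d" % (train_size, valid_size))
```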

The emotion_cnn function implements our model:

def emotion_cnn(dataset): 
    with tf.name_scope("conv1") as scope:
        tf.summary.histogram("W_conv1", weights['wc1'])
        tf.summary.histogram("b_conv1", biases['bc1'])
        conv_1 = tf.nn.conv2d(dataset, weights['wc1'], 
                              strides=[1, 1, 1, 1], padding="SAME")                                                  
        h_conv1 = tf.nn.bias_add(conv_1, biases['bc1']) 
        h_1 = tf.nn.relu(h_conv1) 
        h_pool1 = max_pool_2x2(h_1) 
        add_to_regularization_loss(weights['wc1'], biases['bc1']) 
 
    with tf.name_scope("conv2") as scope:
        tf.summary.histogram("W_conv2", weights['wc2'])
        tf.summary.histogram("b_conv2", biases['bc2'])
        conv_2 = tf.nn.conv2d(h_pool1, weights['wc2'], 
                              strides=[1, 1, 1, 1], padding="SAME") 
        h_conv2 = tf.nn.bias_add(conv_2, biases['bc2']) 
        h_2 = tf.nn.relu(h_conv2) 
        h_pool2 = max_pool_2x2(h_2) 
        add_to_regularization_loss(weights['wc2'], biases['bc2']) 
 
    with tf.name_scope("fc_1") as scope: 
        prob=0.5 
        image_size = IMAGE_SIZE // 4
        h_flat = tf.reshape(h_pool2, [-1, image_size * image_size * 64]) 
        tf.summary.histogram("W_fc1", weights['wf1'])           
        tf.summary.histogram("b_fc1", biases['bf1'])        
        h_fc1 = tf.nn.relu(tf.matmul 
                     (h_flat, weights['wf1']) + biases['bf1']) 
        h_fc1_dropout = tf.nn.dropout(h_fc1, prob) 
         
    with tf.name_scope("fc_2") as scope:
        tf.summary.histogram("W_fc2", weights['wf2'])          
        tf.summary.histogram("b_fc2", biases['bf2'])
        pred = tf.matmul(h_fc1_dropout, weights['wf2']) + biases['bf2'] 
    return pred
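The max_pool_2x2 helper called in emotion_cnn is not listed in this section; it presumably wraps tf.nn.max_pool with a 2x2 window and a stride of 2. To show what such a pooling step does to the data, here is a NumPy sketch (max_pool_2x2_numpy is an illustrative name, not the helper from the book's repository).

```python
import numpy as np


def max_pool_2x2_numpy(x):
    """2x2 max pooling with stride 2 on a batch shaped [N, H, W, C].

    H and W are assumed to be even, as they are for 48x48 and 24x24 inputs.
    """
    n, h, w, c = x.shape
    # Split each spatial axis into (blocks, 2) and take the max in each block
    blocks = x.reshape(n, h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(2, 4))


x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
pooled = max_pool_2x2_numpy(x)
print(pooled.shape)
print(pooled[0, :, :, 0])
```

Each output pixel is the maximum of a non-overlapping 2x2 block, so the height and width are halved while the number of channels is unchanged.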

We define a main function in which we set up the dataset, the input and output placeholder variables, and the main session that runs the training procedure:

def main(argv=None):

The first operation in this function is to load the dataset for the training and validation phases. We'll use the training set to teach the classifier to recognize the to-be-predicted labels, and then we'll use the validation set to estimate the classifier's performance:

    # The backslash continues the assignment onto the next line
    train_images, train_labels, valid_images, valid_labels, test_images = \
        EmotionDetectorUtils.read_data(FLAGS.data_dir)
    print "Train size: %s" % train_images.shape[0]
    print 'Validation size: %s' % valid_images.shape[0]
    print "Test size: %s" % test_images.shape[0]

We define the placeholder variable for the input images. This allows us to change the images that are input to the TensorFlow graph. The datatype is set to float32 and the shape is set to [None, IMAGE_SIZE, IMAGE_SIZE, 1], where None means that the tensor may hold an arbitrary number of images, each image is IMAGE_SIZE pixels high and IMAGE_SIZE pixels wide, and 1 is the number of color channels:

    input_dataset = tf.placeholder(tf.float32,  
                                   [None,  
                                    IMAGE_SIZE,  
                                    IMAGE_SIZE, 1],name="input")

Next, we have the placeholder variable for the true labels associated with the images that were input in the input_dataset placeholder variable. The shape of this placeholder variable is [None, NUM_LABELS] which means it may hold an arbitrary number of labels and each label is a vector of length NUM_LABELS, which is 7 in this case:

    input_labels = tf.placeholder(tf.float32, 
                                  [None, NUM_LABELS])

The global_step variable keeps track of the number of optimization iterations performed so far. We want to save this variable with all the other TensorFlow variables in the checkpoints. Note that trainable=False which means that TensorFlow will not try to optimize this variable:

    global_step = tf.Variable(0, trainable=False)

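The following placeholder, dropout_prob, is meant to hold the dropout probability. Note that, in the listing shown here, emotion_cnn uses a fixed value of prob = 0.5 and dropout_prob does not appear in any feed_dict: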

    dropout_prob = tf.placeholder(tf.float32)
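For reference, tf.nn.dropout(x, keep_prob), which emotion_cnn calls, keeps each element with probability keep_prob, scales the kept elements by 1 / keep_prob, and sets the rest to zero. The NumPy sketch below mimics that behaviour; it is an illustration, not the TensorFlow implementation, and passing the random generator explicitly is a choice made here for reproducibility.

```python
import numpy as np


def dropout(x, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob and
    scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.uniform(size=x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


rng = np.random.RandomState(0)
activations = np.ones((1000, 256), dtype=np.float32)

train_out = dropout(activations, 0.5, rng)   # training: about half dropped
eval_out = dropout(activations, 1.0, rng)    # evaluation: nothing dropped

print(np.unique(train_out))
print(bool(np.array_equal(eval_out, activations)))
```

Feeding a keep probability of 1.0 at validation and test time is the usual reason for making it a placeholder rather than a constant.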

Now create the neural network. The emotion_cnn() function returns pred, the unnormalized class scores (logits) for the images in input_dataset:

    pred = emotion_cnn(input_dataset)

The output_pred variable applies the softmax function to pred; it holds the class probabilities predicted for the test and validation images, which we'll compute in the running session:

    output_pred = tf.nn.softmax(pred,name="output")

The loss_val variable contains the error between the pred predicted class and the true class of the input image (input_labels):

    loss_val = loss(pred, input_labels)

The train_op variable defines the optimizer used to minimize the cost function. In this case again we use AdamOptimizer:

    train_op = tf.train.AdamOptimizer(LEARNING_RATE).minimize(
        loss_val, global_step=global_step)

And summary_op for TensorBoard visualizations:

    summary_op = tf.summary.merge_all()

Once the graph has been created, we need to create a TensorFlow session which is used to execute the graph:

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        summary_writer = tf.summary.FileWriter(FLAGS.logs_dir, sess.graph)

We define a saver variable to restore the model:

        saver = tf.train.Saver() 
        ckpt = tf.train.get_checkpoint_state(FLAGS.logs_dir) 
        if ckpt and ckpt.model_checkpoint_path: 
            saver.restore(sess, ckpt.model_checkpoint_path) 
            print "Model Restored!"

We then get a batch of training examples: batch_image holds a batch of images and batch_label holds the true labels for those images:

        for step in xrange(MAX_ITERATIONS): 
            batch_image, batch_label = get_next_batch(train_images, 
                                                      train_labels, 
                                                      step)
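The get_next_batch function is not listed in this section. One common way to write such a helper is to slide a window of BATCH_SIZE examples over the training set and wrap around when the end is reached. The following sketch shows that approach under this assumption; the name get_next_batch_sketch and its batch_size argument are hypothetical, and the version in the book's repository may differ.

```python
import numpy as np

BATCH_SIZE = 128


def get_next_batch_sketch(images, labels, step, batch_size=BATCH_SIZE):
    """Return the batch for iteration `step`, wrapping around the data set."""
    num_examples = images.shape[0]
    # The modulo keeps offset + batch_size inside the array
    offset = (step * batch_size) % (num_examples - batch_size)
    return (images[offset:offset + batch_size],
            labels[offset:offset + batch_size])


# Dummy arrays with the same shapes as the training data
images = np.zeros((3761, 48, 48, 1), dtype=np.float32)
labels = np.zeros((3761, 7), dtype=np.float32)

batch_image, batch_label = get_next_batch_sketch(images, labels, step=30)
print(batch_image.shape)
print(batch_label.shape)
```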

We put the batch into a dict variable with the proper names for placeholder variables in the TensorFlow graph:

            feed_dict = {input_dataset: batch_image,  
                         input_labels: batch_label}

We run the optimizer using this batch of training data; TensorFlow assigns the values in feed_dict to the placeholder variables and then runs the optimizer. Every 10 steps, we also compute the training loss and write the summaries for TensorBoard:

            sess.run(train_op, feed_dict=feed_dict)
            if step % 10 == 0:
                train_loss, summary_str = sess.run(
                    [loss_val, summary_op], feed_dict=feed_dict)
                summary_writer.add_summary(summary_str,
                                           global_step=step)
                print "Training Loss: %f" % train_loss

When the running step is a multiple of 100, we run the trained model on the validation set:

            if step % 100 == 0:
                valid_loss = sess.run(
                    loss_val,
                    feed_dict={input_dataset: valid_images,
                               input_labels: valid_labels})

Print the Loss value:

                print "%s Validation Loss: %f" % (datetime.now(),
                                                  valid_loss)

A checkpoint of the model is then saved; because this call sits inside the same if block, it runs every 100 steps, including the last one:

                saver.save(sess, FLAGS.logs_dir 
                           + 'model.ckpt',  
                           global_step=step) 
 
if __name__ == "__main__":
    tf.app.run()

Now we report the resulting output. As you can see, the validation loss decreases over most of the run:

>>>  
Train size: 3761 
Validation size: 417 
Test size: 1312 
2016-11-05 22:39:36.645682 Validation Loss: 1.962719 
2016-11-05 22:42:58.951699 Validation Loss: 1.822431 
2016-11-05 22:46:55.144483 Validation Loss: 1.335237 
2016-11-05 22:50:17.677074 Validation Loss: 1.111559 
2016-11-05 22:53:30.999141 Validation Loss: 0.999061 
2016-11-05 22:56:53.256991 Validation Loss: 0.931223 
2016-11-05 23:00:06.530139 Validation Loss: 0.911489 
2016-11-05 23:03:15.351156 Validation Loss: 0.818303 
2016-11-05 23:06:26.575298 Validation Loss: 0.824178 
2016-11-05 23:09:40.136353 Validation Loss: 0.803449 
2016-11-05 23:12:50.769527 Validation Loss: 0.851074 
>>>

However, the model can be improved by acting on hyperparameters or changing its architecture. In the next section, we will see how to effectively test the model on your own images.
