Building the model

So far, we haven't started to build our computational graph for this classifier. Let's start off by creating the session variable that will be responsible for executing the computational graph we are going to build:

sess = tf.Session()

Next up, we are going to define our model's placeholders, which will be used to feed data into the computational graph:

input_values = tf.placeholder(tf.float32, shape=[None, 784])

When we specify None in our placeholder's first dimension, it means the placeholder can be fed as many examples as we like. In this case, our placeholder can be fed any number of examples, where each example is a vector of 784 values (a flattened 28×28 image).
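
For instance, here is a quick sketch (reusing the sess and input_values we just defined) showing that the None dimension accepts batches of different sizes without any changes:

import numpy as np

# Sketch: the None dimension accepts any batch size.
print(sess.run(input_values, feed_dict={input_values: np.zeros((1, 784), dtype=np.float32)}).shape)    # (1, 784)
print(sess.run(input_values, feed_dict={input_values: np.zeros((128, 784), dtype=np.float32)}).shape)  # (128, 784)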

Now, we need to define another placeholder for feeding the image labels. We'll also be using this placeholder later on to compare the model's predictions with the actual labels of the images:

output_values = tf.placeholder(tf.float32, shape=[None, 10])

Next, we will define the weights and biases. These two variables will be the trainable parameters of our network and they will be the only two variables needed to make predictions on unseen data:

weights = tf.Variable(tf.zeros([784,10]))
biases = tf.Variable(tf.zeros([10]))

I like to think of these weights as 10 cheat sheets, one for each digit. This is similar to how a teacher uses a cheat sheet to grade a multiple-choice exam.
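
If you want to look at these "cheat sheets" yourself, one option (a sketch, assuming matplotlib is installed; it is only meaningful after initialization and training, since right now the weights are all zeros) is to reshape a column of weights back into a 28×28 image:

import matplotlib.pyplot as plt

# Sketch: each column of `weights` is a 784-value template for one digit.
# Run this after the variables have been initialized and trained.
learned_weights = sess.run(weights)                     # shape (784, 10)
plt.imshow(learned_weights[:, 3].reshape(28, 28), cmap='seismic')
plt.title('Weight template for the digit 3')
plt.show()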

We will now define our softmax regression, which is our classifier function. This particular classifier is called multinomial logistic regression, and we make the prediction by multiplying the flattened version of the digit by the weights and then adding the biases:

softmax_layer = tf.nn.softmax(tf.matmul(input_values,weights) + biases)

First, let's ignore the softmax and look at what's inside the softmax function. matmul is the TensorFlow function for multiplying matrices. If you know matrix multiplication (https://en.wikipedia.org/wiki/Matrix_multiplication), you'll see that the shapes line up: multiplying the [None, 784] batch of flattened images by the [784, 10] weight matrix results in a matrix of shape number of training examples fed (m) × number of classes (n):

Figure 13: Simple matrix multiplication.
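
To convince yourself of the shapes involved, here is a quick NumPy sketch (the values are arbitrary; only the shapes matter):

import numpy as np

m = 3                                  # number of examples fed
x_batch = np.random.rand(m, 784)       # flattened 28x28 images, shape (m, 784)
W = np.zeros((784, 10))                # weights, shape (784, 10)
b = np.zeros(10)                       # biases, shape (10,)
logits = np.matmul(x_batch, W) + b
print(logits.shape)                    # (3, 10): m examples x n classes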

You can confirm the output shape by printing softmax_layer (the ? corresponds to the None batch dimension):

print(softmax_layer)
Output:
Tensor("Softmax:0", shape=(?, 10), dtype=float32)

Now, let's experiment with the computational graph that we have defined so far by feeding it three samples from the training set and seeing how it behaves. To execute the computational graph, we need to use the session variable that we defined before, and we need to initialize the variables using tf.global_variables_initializer().

Let's go ahead and only feed three samples to the computational graph:

input_values_train, target_values_train = train_size(3)
sess.run(tf.global_variables_initializer())
#If using TensorFlow prior to 0.12 use:
#sess.run(tf.initialize_all_variables())
print(sess.run(softmax_layer, feed_dict={input_values: input_values_train}))
Output:

[[ 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]]

Here, you can see the model's predictions for the three training samples that were fed to it. At the moment, the model has learned nothing about our task because we haven't gone through the training process yet, so it just outputs a 10% probability for each digit being the correct class for the input samples.

As we mentioned previously, softmax is an activation function that squashes the output to be between 0 and 1, and the TensorFlow implementation of softmax ensures that all the probabilities of a single input sample sum up to one.
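
For reference, here is a minimal NumPy sketch of what softmax computes (exponentiate the inputs, then normalize so they sum to one). This is only an illustration of the formula, not TensorFlow's actual implementation:

import numpy as np

def softmax_sketch(logits):
    # Subtracting the max is a standard trick for numerical stability.
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

print(softmax_sketch(np.array([0.1, 0.005, 2.0])))        # ~[0.116, 0.106, 0.778]
print(softmax_sketch(np.array([0.1, 0.005, 2.0])).sum())  # 1.0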

Let's experiment a bit with the softmax function of TensorFlow:

sess.run(tf.nn.softmax(tf.zeros([4])))
sess.run(tf.nn.softmax(tf.constant([0.1, 0.005, 2])))

Output:
array([0.11634309, 0.10579926, 0.7778576 ], dtype=float32)

Next up, we need to define our loss function for this model, which measures how good or bad our classifier is at assigning a class to the input images. The accuracy of our model is calculated by comparing the actual values that we have in the dataset with the predictions that we got from the model.

The goal will be to reduce any misclassifications between the actual and predicted values.

Cross-entropy is defined as:

H_{y'}(y) = -Σ_i y'_i · log(y_i)

Where:

  • y is our predicted probability distribution
  • y' is the true distribution (the one-hot vector with the digit labels)

In some rough sense, cross-entropy measures how inefficient our predictions are for describing the actual input.

We can implement the cross-entropy function:

model_cross_entropy = tf.reduce_mean(-tf.reduce_sum(output_values * tf.log(softmax_layer), reduction_indices=[1]))

This function takes the log of all our predictions from softmax_layer (whose values range from 0 to 1) and multiplies them element-wise (https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) by the example's true values, output_values. If a predicted probability is close to zero, its negative log is a large number (-np.log(0.01) ≈ 4.6), and if it is close to one, its negative log is a small number (-np.log(0.99) ≈ 0.01):

Figure 15: Visualization of y = log(x)

We are essentially penalizing the classifier with a very large number if the prediction is confidently incorrect and a very small number if the prediction is confidently correct.

Here is a simple Python example of a softmax prediction that is very confident that the digit is a 3:

j = [0.03, 0.03, 0.01, 0.9, 0.01, 0.01, 0.0025, 0.0025, 0.0025, 0.0025]

Let's create a one-hot label for the digit 3 as the ground truth to compare with our softmax prediction:

k = [0,0,0,1,0,0,0,0,0,0]

Can you guess what value our loss function gives us? Can you see how taking -np.log(j) turns the low probabilities into large penalties? Try this to understand:

import numpy as np

-np.log(j)                   # per-class penalties: large where the probability is small
-np.multiply(np.log(j), k)   # keep only the penalty for the true class

This will return nine zeros and a value of 0.1053; when they are all summed up, the total cross-entropy is only 0.1053, which we can consider a good prediction. Notice what happens when we make the same prediction for what is actually a 2:

k = [0,0,1,0,0,0,0,0,0,0]
np.sum(-np.multiply(np.log(j),k))

Now, our cross-entropy calculation gives us 4.6051, which shows a heavily penalized, poorly made prediction. It was heavily penalized due to the fact that the classifier was very confident that the digit was a 3 when it was actually a 2.
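
If you want to check the same number through TensorFlow itself, a quick sketch (reusing the sess from earlier) looks like this:

# Sketch: the same hand-worked cross-entropy, computed by TensorFlow.
# With k set to the label 2, this prints roughly 4.6052.
j_tensor = tf.constant(j, dtype=tf.float32)
k_tensor = tf.constant(k, dtype=tf.float32)
print(sess.run(-tf.reduce_sum(k_tensor * tf.log(j_tensor))))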

Next, we begin to train our classifier. In order to train it, we have to develop appropriate values for the weights and biases that will give us the lowest possible loss.

The following is where we can now assign custom values for training if we wish. These values are designed to be changed and experimented with; in fact, I encourage it! First, use these values, and then notice what happens when you use too few training examples or too high or too low a learning rate:

input_values_train, target_values_train = train_size(5500)
input_values_test, target_values_test = test_size(10000)
learning_rate = 0.1
num_iterations = 2500
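
As a reminder, train_size and test_size are the helper functions defined earlier in the chapter. Purely as a hedged sketch of the idea, assuming the MNIST data was loaded with TensorFlow's input_data.read_data_sets (with one-hot labels) into a variable called mnist, they could look roughly like this:

# Hypothetical sketch only; the real helpers were defined earlier in the chapter.
def train_size(num):
    # Return the first `num` flattened training images and their one-hot labels.
    return mnist.train.images[:num], mnist.train.labels[:num]

def test_size(num):
    # Same idea for the test split.
    return mnist.test.images[:num], mnist.test.labels[:num]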

We can now initialize all variables so that they can be used by our TensorFlow graph:

init = tf.global_variables_initializer()
#If using TensorFlow prior to 0.12 use:
#init = tf.initialize_all_variables()
sess.run(init)

Next, we need to train the classifier using the gradient descent algorithm. So, we first define our training method and some variables for measuring the model's accuracy. The train operation runs the gradient descent optimizer with the chosen learning rate in order to minimize the model's loss function, model_cross_entropy:

train = tf.train.GradientDescentOptimizer(learning_rate).minimize(model_cross_entropy)
model_correct_prediction = tf.equal(tf.argmax(softmax_layer,1), tf.argmax(output_values,1))
model_accuracy = tf.reduce_mean(tf.cast(model_correct_prediction, tf.float32))
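
To see what these two accuracy lines compute, here is a small toy sketch (reusing sess; the values are made up just for illustration):

# Toy sketch: two fake prediction rows and their one-hot labels.
fake_predictions = tf.constant([[0.1, 0.8, 0.1],    # argmax = class 1
                                [0.6, 0.3, 0.1]])   # argmax = class 0
fake_labels = tf.constant([[0., 1., 0.],            # true class 1 -> correct
                           [0., 0., 1.]])           # true class 2 -> wrong
correct = tf.equal(tf.argmax(fake_predictions, 1), tf.argmax(fake_labels, 1))
print(sess.run(tf.reduce_mean(tf.cast(correct, tf.float32))))  # 0.5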