A restricted Boltzmann machine looks a lot like a single layer of a neural network. There are a set of input nodes that have connections to another set of output nodes:
The way these output nodes are activated is also identical to an autoencoder. There is a weight between each input node and output node, the activation of each input nodes multiplied by this matrix of weight mappings, and a bias vector is then applied, and the sum for each output node is then put through a sigmoid function.
What makes a restricted Boltzmann machine different is what the activations represent, how we think about them, and the way in which they are trained. To begin with, when talking about RBMs rather than talking about input and output layers, we refer to the layers as visible and hidden. This is because, when training, the visible nodes represent the known information we have. The hidden nodes will aim to represent some variables that generated the visible data. This contrasts with an autoencoder, where the output layer doesn't explicitly represent anything, is just a constrained space through which the information is passed.
The basis of learning the weights of a restricted Boltzmann machine comes from statistical physics and uses an energy-based model (EBM). In these, every state is put through an energy function, which relates to the probability of a state occurring. If an energy function returns a high value, we expect this state to be unlikely, rarely occurring. Conversely, a low result from an energy function means a state that is more stable and will occur more frequently.
A good intuitive way of thinking about an energy function is to imagine a huge number of bouncy balls being thrown into a box. At first, all the balls have high energy and so will be bouncing very high. A state here would be a single snapshot in time of all the balls' positions and their associated velocities. These states, when the balls are bouncing, are going to be very transitory; they will only exist for moments and because of the range of movement of the balls, are very unlikely to reoccur. But as the balls start to settle, as the energy leaves the system, some of the balls will start to be increasingly stationary. These states are stable once it occurs once it never stops occurring. Eventually, once the balls have stopped bouncing and all become stationary, we have a completely stable state, which has high probability.
To learn the weights of our Boltzmann machine or Hopfield network, we want to maximize the likelihood of the dataset, given the W, which is simply the product of the likelihood for each data item:
Here, W is the weights matrix, and x(n) is the nth sample from the dataset x of size N. Let's now replace with the actual likelihood from our Boltzmann machine:
Here, Z is as shown in the following equation:
If you look at the original definition of our energy function and Z, then x' should be every possible configuration of x based on the probability distribution p(x). We now have W as part of our model, so the distribution will change to . Unfortunately, is, if not completely intractable, at the very least, far too computationally expensive to compute. We would need to take every possible configuration of x across all possible W.
One approach to computing an intractable probability distribution such as this is what's called Monte Carlo sampling. This involves taking lots of samples from the distribution and using the average of these samples to approximate the true value. The more samples we take from the distribution, the more accurate it will tend to be. A hypothetical infinite number of samples would be exactly the quantity we want, while 1 would be a very poor approximation.
Since the products of probabilities can get very small, we will instead use the log probability; also, let's also include the definition of Z:
Here, x' is a sample of state of the network, as taken from the probability distribution learned by the network. If we take the gradient of this with respect to a single weight between nodes i and j, it looks like this:
Here, across all N samples is simply the correlation between the nodes i and j. Another way to write this across all N samples, for each weight i and j, would be this:
This equation can be understood as being two phases of learning, known as positive and negative or, more poetically, waking and sleeping. In the positive phase, increases the weights based on the data we are given. In the negative phase, , we draw samples from the model as per the weights we currently have, then move the weights away from that distribution. This can be thought of as reducing the probability of items generated by the model. We want our model to reflect the data as closely as possible, so we want to reduce the selection generated by our model. If our model was producing images exactly like the data, then the two terms would cancel each other out, and equilibrium would be reached.
Boltzmann machines and Hopfield networks can be useful for tasks such as optimization and recommendation systems. They are very computationally expensive. Correlation must be measured between every single node, and then a range of Monte Carlo samples must be generated from the model for every training step. Also, the kinds of patterns it can learn are limited. If we are training on images to learn shapes, it cannot learn position invariant information. A butterfly on the left-hand side of an image is a completely different beast to a butterfly on the right-hand side of an image. In Chapter 5, Image Recognition, we will take a look at convolutional neural networks, which offers a solution to this problem.
Restricted Boltzmann machines make two changes from the Boltzmann machines: the first is that we add in hidden nodes, each of which is connected to every visible node, but not to each other. The second is that we remove all connections between visible nodes. This has the effect of making each node in the visible layer conditionally independent of each other if we are given the hidden layer. Nodes in the hidden layer are also conditionally independent, given the visible layer. We will also now add bias terms to both the visible and hidden nodes. A Boltzmann machine can also be trained with a bias term for each node, but this was left out of the equations for ease of notation.
As before, is approximated by taking the Monte Carlo samples from the distribution. These final three equations give us the complete way to iteratively train all the parameters for a given dataset. Training will be a case of updating our parameters by some learning rate by these gradients.
It is worth restating on a conceptual level what is going on here. v denotes the visible variables, the data from the world on which we are learning. h denotes the hidden variables, the variables we will train to generate visible variables. The hidden variables do not explicitly represent anything, but through training and minimizing the energy in the system, they should eventually find important components of the distribution we are looking at. For example, if the visible variables are a list of movies, with a value of 1 if a person likes the movie and 0 if they do not, the hidden variables may come to represent genres of movie, such as horror or comedy, because people may have genre preferences, so this is an efficient way to encode people's tastes.
If we generate random samples of hidden variables and then activate the visible variables based on this, it should give us a plausible looking set of human tastes in movies. Likewise, if we set the visible variables to a random selection of movies over successive activations of the hidden and visible nodes, it should move us to find a more plausible selection.
Now that we have gone through the math, let's see what an implementation of it looks like. For this, we will use TensorFlow. TensorFlow is a Google open source mathematical graph library that is popular for deep learning. It does not have built-in neural network concepts, such as network layers and nodes, which a higher-level library such as Keres does; it is closer to a library such as Theano. It has been chosen here because being able to work directly on the mathematical symbols underlying the network allows the user to get a better understanding of what they are doing.
TensorFlow can be installed directly via pip
using the command pip install tensorflow
for the CPU version or pip install tensorflow-gpu
if you have NVidea GPU-enabled machine.
We will build a small restricted Boltzmann machine and train it on the MNIST collection of handwritten digits. We will have a smaller number of hidden nodes than visible nodes, which will force the RBM to learn patterns in the input. The success of the training will be measured in the network's ability to reconstruct the image after putting it through the hidden layer; for this, we will use the mean squared error between the original and our reconstruction. The full code sample is in the GitHub repo https://github.com/DanielSlater/PythonDeepLearningSamples in the restricted_boltzmann_machine.py
file.
Since the MNIST dataset is used so ubiquitously, TensorFlow has a nice built-in way to download and cache the MNIST dataset. It can be done by simply calling the following code:
from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets("MNIST_data/")
This will download all the MNIST data into MNIST_data
into the "MNIST_data/"
directory, if it is not already there. The mnist
object has properties, train and test
, which allow you to access the data in NumPy arrays. The MNIST
images are all sized 28 by 28, which means 784 pixels per image. We will need one visible node in our RBM for each pixel:
input_placeholder = tf.placeholder("float", shape=(None, 784))
A placeholder object in TensorFlow represents values that will be passed in to the computational graph during usage. In this case, the input_placeholder
object will hold the values of the MNIST
images we give it. The "float"
specifies the type of value we will be passing in, and the
shape
defines the dimensions. In this case, we want 784 values, one for each pixel, and the None
dimension is for batching. Having a None dimension means that it can be of any size; so, this will allow us to send variable-sized batches of 784-length arrays:
weights = tf.Variable(tf.random_normal((784, 300), mean=0.0, stddev=1./784))
tf.variable
represents a variable on the computational graph. This is the W from our preceding equations. The argument passed to it is how the variable values should first be initialized. Here, we are initializing it from a normal distribution of size 784 by 300, the number of visible nodes to hidden nodes:
hidden_bias = tf.Variable(tf.zeros([300])) visible_bias = tf.Variable(tf.zeros([784]))
These variables will be the a
and b
from our preceding equation; they are initialised to all start with a value of 0. Now we will program in the activations of our network:
hidden_activation = tf.nn.sigmoid(tf.matmul(input_placeholder, weights) + hidden_bias)
This represents the activation of the hidden nodes, , in the preceding equations. After applying the sigmoid
function, this activation could be put into a binomial distribution so that all values in the hidden layer go to 0 or 1, with the probability given; but it turns out an RBM trains just as well as the raw probabilities. So, there's no need to complicate the model by doing this:
visible_reconstruction = tf.nn.sigmoid(tf.matmul(hidden_activation, tf.transpose(weights)) + visible_bias)
Now we have the reconstruction of the visible layer, . As specified by the equation, we give it the hidden_activation
, and from that, we get our sample from the visible layer:
final_hidden_activation = tf.nn.sigmoid(tf.matmul(visible_reconstruction, weights) + hidden_bias)
We now compute the final sample we need, the activation of the hidden nodes from our visible_reconstruction
. This is equivalent to in the equations. We could keep going with successive iterations of hidden and visual activation to get a much more unbiased sample from the model. But it doing just one rotation works fine for training:
Positive_phase = tf.matmul(tf.transpose(input_placeholder), hidden_activation) Negative_phase = tf.matmul(tf.transpose(visible_reconstruction), final_hidden_activation)
Now we compute the positive and negative phases. The first phase is the correlation across samples from our mini-batch of the input_placeholder
,
and the first hidden_activation
, . Then the negative phase gets the correlation between the visible_reconstruction
, and the final_hidden_activation
, :
LEARING_RATE = 0.01 weight_update = weights.assign_add(LEARING_RATE * (positive_phase – negative_phase))
Calling assign_add
on our weights
variable creates an operation that, when run, adds the given quantity to the variable. Here, 0.01 is our learning rate, and we scale the positive and negative phases by that:
visible_bias_update = visible_bias.assign_add(LEARING_RATE * tf.reduce_mean(input_placeholder - visible_reconstruction, 0)) hidden_bias_update = hidden_bias.assign_add(LEARING_RATE * tf.reduce_mean(hidden_activation - final_hidden_activation, 0))
Now we create the operations for scaling the hidden and visible biases. These are also scaled by our 0.01 learning rate:
train_op = tf.group(weight_update, visible_bias_update, hidden_bias_update)
Calling tf.group
creates a new operation than when called executes all the operation arguments together. We will always want to update all the weights in unison, so it makes sense to create a single operation for them:
loss_op = tf.reduce_sum(tf.square(input_placeholder - visible_reconstruction))
This loss_op
will give us feedback on how well we are training, using the MSE. Note that this is purely used for information; there is no backpropagation run against this signal. If we wanted to run this network as a pure autoencoder, we would create an optimizer here and activate it to minimize the loss_op
:?
session = tf.Session()
session.run(tf.initialize_all_variables())
Then we create a session object that will be used for running the computational graph. Calling tf.initialize_all_variables()
is when everything gets initialized on to the graph. If you are running TensorFlow on the GPU, this is where the hardware is first interfaced with. Now that we have created every step for the RBM, let's put it through a few epochs of running against MNIST and see how well it learns:
current_epochs = 0 for i in range(10): total_loss = 0 while mnist.train.epochs_completed == current_epochs: batch_inputs, batch_labels = mnist.train.next_batch(100) _, reconstruction_loss = session.run([train_op, loss_op], feed_dict={input_placeholder: batch_inputs}) total_loss += reconstruction_loss print("epochs %s loss %s" % (current_epochs, reconstruction_loss)) current_epochs = mnist.train.epochs_completed
Every time we call mnist.train.next_batch(100)
, 100 images are retrieved from the mnist
dataset. At the end of each epoch, the mnist.train.epochs_completed
is incremented by 1, and all the training data is reshuffled. If you run this, you may see results something like this:
epochs 0 loss 1554.51 epochs 1 loss 792.673 epochs 2 loss 572.276 epochs 3 loss 479.739 epochs 4 loss 466.529 epochs 5 loss 415.357 epochs 6 loss 424.25 epochs 7 loss 406.821 epochs 8 loss 354.861 epochs 9 loss 410.387 epochs 10 loss 313.583
We can now see what an image reconstruction looks like by running the following command on the mnist
data:
reconstruction = session.run(visible_reconstruction, feed_dict={input_placeholder:[mnist.train.images[0]]})
Here are some examples of what the reconstructed images with 300 hidden nodes look like:
As you can see, with 300 hidden nodes, less than half the number of pixels, it can still do an almost perfect reconstruction of the image, with only a little blurring around the edges. But as the number of hidden nodes decreases, so does the quality of the reconstruction. Going down to just 10 hidden nodes, the reconstructions can produce images that, to the human eye, look like the wrong digit, such as the 2 and 3 in Figure 3.
If we imagine our RBM is learning a set of latent variables that generated our visible data and we were feeling inquisitive, we might wonder: can we then learn a second layer of latent variables that generated the latent variables for the hidden layer? The answer is yes, we can stack RBMs on top of previously trained RBMs to be able to learn second, third, fourth, and so on, order information about the visible data. These successive layers of RBMs allow the network to learn increasingly invariant representations of the underlying structure:
These stacked RBMs are known as deep belief networks and were the deep networks used by Geoffrey Hinton in his 2002 paper Training Products of Experts by Minimizing Contrastive Divergence, to first produce the record-breaking results on MNIST. The exact technique he found useful was to train successive RBMs on data with only a slight reduction in the size of the layers. Once a layer was trained to the point where the reconstruction error was no longer improving, its weights were frozen, and a new RBM was stacked on top and again trained until error rate convergence. Once the full network was trained, a final supervised layer was put at the end in order to map the final RBM's hidden layer to the labels of the data. Then the weights of the whole network were used to construct a standard deep feed-forward neutral network, allowing those precalculated weights of the deep belief network to be updated by backpropagation.
At first, these had great results, but over time, the techniques for training standard feed-forward networks have improved, and RBMs are no longer considered the best for image or speech recognition. They also have the problem that because of their two-phase nature, they can be a lot slower to train. But they are still very popular for things such as recommender systems and pure unsupervised learning. Also, from a theoretical point of view, using the energy-based model to learn deep representations is a very interesting approach and leaves the door open for many extensions that can be built on top of this approach.