Introduction to RBMs

By their textbook definition, RBMs are probabilistic graphical models, which—given what we've already covered regarding the structure of neural networks—simply means a bunch of neurons that have weighted connections to another bunch of neurons.

These networks have two layers: a visible layer and a hidden layer. The visible layer is the layer into which you feed the data, while the hidden layer is not exposed to your data directly, but has to develop a meaningful representation of it for the task at hand. These tasks include dimensionality reduction, collaborative filtering, binary classification, and others. The restricted in the name means that the connections are not lateral (that is, between nodes of the same layer); rather, each hidden unit is connected to each visible unit across the layers of the network. The graph is undirected, meaning that data is not fixed into flowing in one direction. This is illustrated as follows:
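The bipartite structure described above can also be sketched in code. As a minimal sketch, assuming a toy network of six visible and three hidden units (these sizes, and the variable names, are illustrative assumptions, not fixed by the text):

```python
import numpy as np

# A "restricted" architecture needs only one weight matrix: rows index
# visible units, columns index hidden units. There is no visible-visible
# or hidden-hidden weight matrix, which is exactly the "no lateral
# connections" restriction.
n_visible, n_hidden = 6, 3
rng = np.random.default_rng(42)
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))

# W[i, j] is the weight between visible unit i and hidden unit j; because
# the graph is undirected, the same matrix is used in both directions.
```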

The training process is fairly straightforward, and it differs from that of our vanilla neural networks in that we are not only making a prediction, testing the strength of that prediction, and then backpropagating the error through the network. In the case of our RBM, that is only half of the story.

To break the training process down further, a forward pass on an RBM looks like this:

  • Visible layer node values are multiplied by the connection weights
  • The hidden unit bias is added to the sum of these weighted values (forcing activations)
  • The activation function is applied
  • The hidden node outputs its value (the activation probability)
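The steps above can be sketched as follows. This is a minimal illustration, assuming a toy network of six visible and three hidden units with a sigmoid activation (a common choice for binary RBMs); all names and sizes are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))   # connection weights
b_hidden = np.zeros(3)                   # hidden unit biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(v):
    # 1. Multiply the visible node values by the connection weights.
    # 2. Add the hidden unit bias to each weighted sum.
    # 3. Apply the activation function.
    return sigmoid(v @ W + b_hidden)     # activation probabilities

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)
h_prob = forward_pass(v)                 # one probability per hidden unit
```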

Were this a deep network, the output of the hidden layer would be passed on as input to another layer. An example of this kind of architecture is the Deep Belief Network (DBN), another important piece of work by Geoff Hinton and his group at the University of Toronto, which stacks multiple RBMs on top of each other.

Our RBM is not, however, a deep network. Thus, we will do something different with the hidden unit output. We will use it to attempt to reconstruct the input (visible units) of the network. We will do this by using the hidden units as input for the backward or reconstruction phase of network training.

The backward pass looks similar to the forward pass, and is performed by following these steps:

  1. The activations of the hidden layer, used as input, are multiplied by the connection weights
  2. The visible unit bias is added to the sum of these weighted values
  3. The reconstruction error is calculated: the difference between the predicted input and the actual input (known to us from our forward pass)
  4. The error is used to update the weights in an effort to minimize the reconstruction error
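The steps above can be sketched in the same toy setup. Note that the weight update shown here is a deliberately simplified, gradient-style nudge based on the reconstruction error; it is not the full contrastive divergence rule used to train RBMs in practice. All names and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(6, 3))
b_visible = np.zeros(6)
b_hidden = np.zeros(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)
h = sigmoid(v @ W + b_hidden)            # forward pass, as before

# Backward pass: hidden activations times the transposed weights,
# plus the visible unit bias, gives the reconstructed input.
v_recon = sigmoid(h @ W.T + b_visible)

# Reconstruction error, and a simplified error-driven weight update.
error = v - v_recon
learning_rate = 0.1
W += learning_rate * np.outer(error, h)  # nudge weights toward lower error
```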

Together, the two states (the predicted activation of the hidden layer and the predicted input of the visible layer) form a joint probability distribution.

If you're mathematically inclined, the formulas for both passes are given as follows:

  • Forward pass: The probability of a (a hidden node activation) given a weighted input, x:

p(a|x; w)

  • Backward pass: The probability of x (the visible layer input) given a weighted activation, a:

p(x|a; w)

  • The joint probability distribution is therefore given simply by the following:

p(a, x)

Reconstruction can thus be thought of differently from the kinds of techniques we have discussed so far. It is neither regression (predicting a continuous output for a given set of inputs) nor classification (applying a class label to a given set of inputs). This is made clear by the way in which we calculate the error in the reconstruction phase. We do not merely measure the difference between the input and the predicted input as a single real number; rather, we compare the probability distribution over all values of the input, x, with the probability distribution over all values of the reconstructed input. We use a method called Kullback-Leibler (KL) divergence to perform this comparison. Essentially, this approach measures the area under each probability distribution's curve that does not overlap with the other. We then adjust the weights and rerun the training loop in an attempt to reduce this divergence (error), bringing the curves closer together, as shown in the following diagram:
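For discrete distributions, the KL divergence has a simple closed form: the sum of p(i) * log(p(i) / q(i)) over all values i. A minimal sketch, using made-up toy distributions for illustration:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i).
    # Assumes p and q are proper discrete distributions, with q nonzero
    # wherever p is nonzero; terms where p_i == 0 contribute nothing.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy distributions over four discrete values.
p = [0.4, 0.3, 0.2, 0.1]
q = [0.25, 0.25, 0.25, 0.25]
d = kl_divergence(p, q)   # positive; zero only when the distributions match
```

The divergence is asymmetric (D_KL(P || Q) is generally not D_KL(Q || P)), which is why the direction of the comparison matters when it is used as a training objective.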


At the end of the training, when this error has been minimized, we are then able to make a prediction about what other films a given user might give the thumbs-up to.
