Backpropagation

This is a simple example. For complicated models with thousands, or even millions, of parameters spread across a number of layers, we need to be more intelligent about how we propagate these updates back through our network. This becomes even more important as networks get deeper (and the number of parameters increases accordingly), with new research that, in an extreme example, includes CNNs of 10,000 layers.

So, how can we go about this? The easiest way is to build your neural network out of functions for which we know the derivative. We can do this symbolically or, more practically, by composition: if we build the network out of functions where we know how to apply the function (the forward pass) and where we know how to backpropagate through it (by virtue of knowing how to write a function for its derivative), we can assemble an entire neural network out of these building blocks.
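To make that idea concrete, here is a minimal sketch in Go (this is not Gorgonia's actual API; the Op interface and the Sigmoid type are names invented purely for this illustration) of a building block that knows both how to apply itself and how to report its own derivative:

package main

import (
	"fmt"
	"math"
)

// Op is a hypothetical building block: it knows how to compute its
// forward pass and how to compute its own derivative, which is all
// that backpropagation needs from it.
type Op interface {
	Forward(x float64) float64
	Backward(x float64) float64 // derivative of Forward at x
}

// Sigmoid is one such block, with a well-known closed-form derivative.
type Sigmoid struct{}

func (Sigmoid) Forward(x float64) float64 {
	return 1 / (1 + math.Exp(-x))
}

func (s Sigmoid) Backward(x float64) float64 {
	y := s.Forward(x)
	return y * (1 - y) // sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
}

func main() {
	var op Op = Sigmoid{}
	fmt.Println("forward:   ", op.Forward(0.5))
	fmt.Println("derivative:", op.Backward(0.5))
}

A network built from blocks like these can be run forwards by chaining Forward calls and backwards by chaining Backward calls, which is exactly the property the chain rule exploits.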

Of course, building all of these functions by hand can be time-consuming. Fortunately, Gorgonia already provides them, which is what allows us to do what we call automatic differentiation. As I have mentioned previously, we create a directed graph for the computation; this allows us to run not only the forward pass but the backward pass as well!
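As a rough sketch of what this looks like in practice, the following Go program builds a tiny two-layer chain as a Gorgonia expression graph and asks Gorgonia to differentiate it symbolically. The scalar shapes, the concrete values, and the toy error E = o * o are assumptions made for this example rather than anything prescribed by this chapter:

package main

import (
	"fmt"
	"log"

	G "gorgonia.org/gorgonia"
)

func main() {
	graph := G.NewGraph()

	// i is the input; w1 and w2 are the weights of the two layers.
	i := G.NewScalar(graph, G.Float64, G.WithName("i"))
	w1 := G.NewScalar(graph, G.Float64, G.WithName("w1"))
	w2 := G.NewScalar(graph, G.Float64, G.WithName("w2"))

	// f = w1 * i, o = w2 * f, and a toy "error" E = o * o
	// (the squared output; purely illustrative).
	f := G.Must(G.Mul(w1, i))
	o := G.Must(G.Mul(w2, f))
	E := G.Must(G.Mul(o, o))

	// Symbolic differentiation: Gorgonia adds the gradient nodes
	// dE/dw1 and dE/dw2 to the same graph.
	grads, err := G.Grad(E, w1, w2)
	if err != nil {
		log.Fatal(err)
	}

	// Running the machine performs both the forward and the
	// backward pass over the graph.
	vm := G.NewTapeMachine(graph)
	defer vm.Close()

	G.Let(i, 1.0)
	G.Let(w1, 2.0)
	G.Let(w2, 3.0)
	if err := vm.RunAll(); err != nil {
		log.Fatal(err)
	}

	fmt.Println("E      =", E.Value())
	fmt.Println("dE/dw1 =", grads[0].Value())
	fmt.Println("dE/dw2 =", grads[1].Value())
}

Because the gradient nodes live in the same graph, a single run of the tape machine gives us both the output and the gradients we need for a weight update.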

For example, let's consider something with more layers (although still simple), like the following chain, where i is the input, f is the first layer with weight w1, g is the second layer with weight w2, and o is the output:

i → f (w1) → g (w2) → o

First, we have the error, which is a function of o. Let's call this E.

In order to update our weights in g, we need to know the derivative of the error with respect to w2, the weights of g.

From the chain rule when dealing with derivatives, we know that this is actually equivalent to the following:

dE_dw2 = dE_do * do_dg * dg_dw2

That is to say, the derivative of the error with respect to w2 (dE_dw2) is equivalent to the derivative of the error with respect to the output (dE_do), multiplied by the derivative of the output with respect to the function g (do_dg), and then multiplied by the derivative of the function g with respect to w2 (dg_dw2).

This gives us the gradient we need to update the weights in g.

We now need to do the same for f. How? It is a matter of repeating the process. This time, we need the derivative of the error with respect to w1, the weights of f. Using the chain rule again, we know that the following is true:

dE_dw1 = dE_do * do_dg * dg_df * df_dw1

You'll notice that there is something in common here with the previous derivative, dE_do * do_dg.

This presents us with an opportunity for further optimization: we don't have to calculate the entirety of each derivative from scratch. At every step, we only need the derivative of the layer we are backpropagating from and the derivative of the layer we are backpropagating to, and this holds all the way through the network. This is the backpropagation algorithm. It lets us update the weights throughout the entire network without repeatedly recalculating the derivative of the error with respect to each specific weight from scratch, because we can reuse the results of previous calculations. A small worked example of this reuse follows.
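The following is a minimal hand-worked sketch of that reuse for the two-layer chain above, written in Go. The linear layers (f = w1*i, g = w2*f), the squared error, and the specific numbers are assumptions made purely for illustration, and o is taken to be g's output, so do_dg is simply 1 here.

package main

import "fmt"

func main() {
	// Illustrative values; the chapter does not fix these.
	i, w1, w2, target := 1.0, 2.0, 3.0, 0.5

	// Forward pass through two linear layers (an assumption made
	// here to keep the arithmetic simple): f = w1*i, o = w2*f.
	f := w1 * i
	o := w2 * f
	E := (o - target) * (o - target) // a simple squared error

	// Backward pass. The shared factor dE_do * do_dg is computed
	// once and reused for both weight gradients; that reuse is the
	// saving the backpropagation algorithm gives us.
	dE_do := 2 * (o - target) // derivative of the squared error
	shared := dE_do           // do_dg = 1 here, since o is g's output

	dg_dw2 := f // g = w2*f, so dg/dw2 = f
	dE_dw2 := shared * dg_dw2

	dg_df := w2 // g = w2*f, so dg/df = w2
	df_dw1 := i // f = w1*i, so df/dw1 = i
	dE_dw1 := shared * dg_df * df_dw1

	fmt.Println("E      =", E)
	fmt.Println("dE/dw2 =", dE_dw2)
	fmt.Println("dE/dw1 =", dE_dw1)
}

Note that shared is computed once and then used for both weight gradients; in a deeper network, each earlier layer would extend this running product rather than recompute it.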
