CNN training and backpropagation

We have seen the process of feedforward propagation as a CNN executes. Training a CNN relies on backpropagation of errors and gradients: deriving a new result and correcting errors over and over. The same network, including all pooling layers, activation functions, and weight matrices, is used as the backward propagation flows through it in an attempt to optimize or correct the weights:

 
CNN forward propagation during training and inference.

Backpropagation is short for "backward propagation of errors". Here, an error (loss) function is used to calculate the gradient of the error with respect to the neural network's weights. The gradient calculation is then propagated backward through all the hidden layers. Shown below is the backpropagation process:

CNN backward propagation during training.
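To make the backward flow of gradients concrete, the following is a minimal sketch of backpropagation on a tiny fully connected network (not a full CNN) using NumPy. The layer sizes, toy data, and learning rate are illustrative assumptions, chosen only to show the chain-rule flow of gradients from the output back to the first layer.

```python
# Minimal backpropagation sketch: two inputs, one hidden layer of three
# neurons, one output. The data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy training sample: one input vector and its known label.
x = np.array([[0.5, -1.2]])      # shape (1, 2)
y = np.array([[1.0]])            # shape (1, 1)

# Randomly initialized weights for the two layers.
W1 = rng.normal(scale=0.5, size=(2, 3))
W2 = rng.normal(scale=0.5, size=(3, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.1
for step in range(100):
    # Forward pass through the hidden and output layers.
    h = sigmoid(x @ W1)          # hidden activations, shape (1, 3)
    y_hat = sigmoid(h @ W2)      # prediction, shape (1, 1)
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: propagate the error gradient layer by layer (chain rule).
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # gradient at the output pre-activation
    dW2 = h.T @ d_out                           # gradient for the output weights
    d_h = (d_out @ W2.T) * h * (1 - h)          # gradient flowing into the hidden layer
    dW1 = x.T @ d_h                             # gradient for the hidden weights

    # Gradient descent weight update.
    W2 -= learning_rate * dW2
    W1 -= learning_rate * dW1

print(f"final loss: {loss:.4f}")
```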

We will now explore the training process. First, we must provide a training set for the network to learn from. The training set will be discussed later in this chapter, but it is crucial in developing a well-behaved system in the field. Each training sample consists of an image and a known label.

Second, the neural network's weights are initialized, either with identical values or with random values for each weight on each neuron that needs to be trained. The first forward pass results in substantial errors that feed into a loss function:
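Written out from the description that follows, the weight update rule is:

W(t) = W(t - 1) - λ · ∂E/∂W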

Here, the new weights are based on the previous weights W(t - 1) minus the partial derivative of the error (loss function) with respect to the weight W. This term is also called the gradient. In the equation, lambda (λ) refers to the learning rate, which is up to the designer to tune. If the rate is high (greater than 1), the algorithm will take larger steps in the trial process. This may allow the network to converge to an optimal answer faster, or it could produce a poorly trained network that never converges to a solution. Alternatively, if lambda is set low (less than 0.01), the training will take very small steps and much longer to converge, but the accuracy of the model may be better. In the following example, the optimal convergence is the very bottom of the curve representing error versus weights. If the learning rate is too high, we may never reach the bottom and will instead settle near the bottom on one of the sides:

Global minimum. This illustration shows the basis of a learning function. The goal is to find the minimum value through gradient descent. The accuracy of the learning model is proportional to the number of steps (time) taken to converge to a minimum.
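The following is a minimal sketch of the effect of the learning rate on gradient descent, using a simple one-dimensional error curve. The curve error(w) = (w - 3)^2 and the chosen rates are illustrative assumptions; the point is to show a rate that overshoots and never settles, one that converges quickly, and one that creeps toward the minimum very slowly.

```python
# Gradient descent on error(w) = (w - 3)^2, which has its minimum at w = 3.
# The learning rates below are illustrative: too high, reasonable, too low.

def error(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)        # d(error)/dw

def train(learning_rate, steps=25, w=0.0):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)   # W(t) = W(t-1) - lambda * dE/dW
    return w, error(w)

for rate in (1.1, 0.5, 0.01):
    w, err = train(rate)
    print(f"learning rate {rate:5.2f}: w = {w:10.3f}, error = {err:.5f}")
```

Running the sketch, the rate of 1.1 diverges (each step overshoots the bottom by more than the last), 0.5 lands on the minimum almost immediately, and 0.01 is still far from the minimum after 25 steps.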

Finding a global minimum of an error function isn't guaranteed. That is, a local minimum may be found and mistaken for the global minimum. The algorithm will often have trouble escaping a local minimum once it is found. In the next graph, you see the correct global minimum, and how a local minimum may be settled upon:

Errors in training. We see the true global minimum and maximum. Depending on factors such as the training step size or even the initial starting point of the descent, a CNN could be trained to a false minimum.
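A minimal sketch of the same effect in code is shown below, using a non-convex one-dimensional error curve. The specific polynomial and starting points are illustrative assumptions; the sketch only shows that the starting point of the descent decides whether training settles in the shallow local minimum or the deeper global one.

```python
# Gradient descent on the non-convex curve error(w) = w^4 - 3w^2 + w,
# which has a shallow local minimum near w = +1.1 and a deeper global
# minimum near w = -1.3. The curve and starting points are illustrative.

def error(w):
    return w**4 - 3.0 * w**2 + w

def gradient(w):
    return 4.0 * w**3 - 6.0 * w + 1.0

def descend(w, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

for start in (2.0, -2.0):
    w = descend(start)
    print(f"start {start:+.1f} -> settles at w = {w:+.3f}, error = {error(w):+.3f}")
```

Starting the descent on the right-hand side of the curve traps the weight in the shallow local minimum, while starting on the left-hand side reaches the true global minimum.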

The loss will be especially heavy during the initial runs of the network. We can visualize this with the TensorFlow Playground. Here again, we are training a neural network to identify spirals. Early in training, the loss is high, at 0.425. After 1531 epochs, we arrive at this network's weights and a loss of 0.106.

We can see that there is still some degree of error that the training could not resolve:

TensorFlow training example. Courtesy of Daniel Smilkov and TensorFlow Playground.

Here, we see the training progressing from left to right with increasing accuracy. The left illustration clearly shows the heavy influence of the horizontal and vertical primitive features. After a number of epochs, the training starts converging on the true solution. Even after 1531 epochs, there are still some error cases where the training didn't converge on the correct answer.
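The following is a minimal Keras sketch in the spirit of the Playground spiral experiment: a small dense network trained on a synthetic two-spiral dataset, with the loss reported early and late in training. The spiral generator, layer sizes, optimizer settings, and epoch count are illustrative assumptions rather than the Playground's exact configuration, so the loss values will differ from the 0.425 and 0.106 seen above.

```python
# Train a small dense network on a two-spiral dataset and report how the
# loss drops between the first and last epoch. All settings are illustrative.
import numpy as np
import tensorflow as tf

def make_spirals(n_per_class=500, noise=0.15, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.5, 3.5 * np.pi, n_per_class)
    points, labels = [], []
    for label, phase in ((0, 0.0), (1, np.pi)):        # two interleaved spirals
        x = t * np.cos(t + phase) + rng.normal(scale=noise, size=n_per_class)
        y = t * np.sin(t + phase) + rng.normal(scale=noise, size=n_per_class)
        points.append(np.stack([x, y], axis=1))
        labels.append(np.full(n_per_class, label))
    return np.concatenate(points), np.concatenate(labels)

X, y = make_spirals()
X = X / np.abs(X).max()            # scale inputs to roughly [-1, 1]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="tanh"),
    tf.keras.layers.Dense(8, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.03),
              loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(X, y, epochs=200, batch_size=32, verbose=0)
print(f"loss after first epoch: {history.history['loss'][0]:.3f}")
print(f"loss after final epoch: {history.history['loss'][-1]:.3f}")
```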
