Stochastic gradient descent

We can further optimize the training process with a simple change. With basic (or batch) gradient descent, we calculate the adjustment by looking at the entire dataset. Therefore, the next obvious step for optimization is: can we calculate the adjustment by looking at less than the entire dataset?

As it turns out, the answer is yes! Because we train the network over many iterations, the gradient will be recomputed many times anyway, so we can afford to estimate it from fewer examples each time, even from a single example. By performing fewer calculations for each network update, we significantly reduce the amount of computation required, which means faster training times. This is essentially a stochastic approximation to gradient descent, hence the name.
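To make the idea concrete, here is a minimal sketch of the single-example variant, using an assumed toy linear model fitted with a squared-error loss (the dataset, learning rate, and variable names are illustrative, not taken from the text). The key difference from batch gradient descent is that each update uses the gradient of one example only, rather than a sum over the whole dataset.

import numpy as np

# Toy dataset: fit y = 3x + 2 with a single linear unit (assumed example).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)

w, b = 0.0, 0.0   # parameters
lr = 0.1          # learning rate
epochs = 20

for epoch in range(epochs):
    # Shuffle so each pass visits the examples in a different order.
    order = rng.permutation(len(X))
    for i in order:
        x_i, y_i = X[i, 0], y[i]
        pred = w * x_i + b
        err = pred - y_i
        # Gradient of the squared error for this single example only --
        # this is the "stochastic" part: no sum over the entire dataset.
        grad_w = 2.0 * err * x_i
        grad_b = 2.0 * err
        w -= lr * grad_w
        b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should approach w=3, b=2

Each individual update is noisier than a full-batch update, but there are many more of them per pass over the data, which is the trade-off the paragraph above describes.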
