We have already mentioned, in Chapter 3, Deep Learning Fundamentals, the paper published in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton titled ImageNet Classification with Deep Convolutional Neural Networks. Though the genesis of convolutional networks may be traced back to the '80s, that was one of the first papers to highlight the importance of deep convolutional networks in image processing and recognition, and currently almost no deep neural network used for image recognition works without some convolutional layers.
An important problem that we have seen when working with classical feed-forward networks is that they may overfit, especially when working with medium to large images. This is often due to the fact that neural networks have a very large number of parameters: in classical neural nets, all the neurons in a layer are connected to each and every neuron in the next. When the number of parameters is large, overfitting is more likely. Let's look at the following images: we can fit the data by drawing a line that goes exactly through all the points, or, better, a line that does not match the data exactly but is more likely to predict future examples.
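The argument above can be sketched numerically. In this minimal example (the dataset and polynomial degrees are illustrative assumptions, not from the text), a curve with many parameters passes almost exactly through every training point, while a curve with only three parameters fits the data loosely but more plausibly:

```python
import numpy as np

# Hypothetical 1-D dataset: a noisy quadratic trend (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = 2 * x**2 - x + 1 + rng.normal(scale=0.2, size=x.size)

# A degree-2 polynomial has only 3 parameters (a, b, c);
# a degree-9 polynomial has 10 parameters, enough to pass
# through all 10 points, i.e. to overfit the noise.
simple = np.polyfit(x, y, deg=2)    # 3 parameters
complex_ = np.polyfit(x, y, deg=9)  # 10 parameters

# Mean squared error on the training points: the many-parameter
# curve drives it to almost zero, the simple curve does not.
train_err_simple = np.mean((np.polyval(simple, x) - y) ** 2)
train_err_complex = np.mean((np.polyval(complex_, x) - y) ** 2)
print(train_err_simple, train_err_complex)
```

The near-zero training error of the high-degree fit is exactly the symptom of overfitting: it reflects the noise in the sample, not the underlying trend.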
In the first of the two pictures represented, we overfit the data. In the second, we have matched our prediction to the data in such a way that it is more likely to predict future data well. In the second case, we need just three parameters to describe the curve: y = ax² + bx + c, while in the first case we would need many more than three parameters to write the equation for that curve. This gives an intuitive explanation of why having too many parameters may not be a good thing and may lead to over-fitting. A classical feed-forward network for an image as small as those in the cifar10 examples (cifar10 is an established computer-vision dataset consisting of 60,000 32 x 32 images divided into 10 classes; we will see a couple of examples from this dataset in this chapter) has inputs of size 3 x 32 x 32, which is already about four times as large as a simple mnist digit image. Larger images, say 3 x 64 x 64, would have about 16 times as many input neurons, multiplying the number of connection weights accordingly:
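The size comparison can be checked with a few lines of arithmetic. The hidden-layer size here is an illustrative assumption (the text does not fix one); the point is how the weight count of a fully connected layer scales with the input:

```python
# Number of weights in one fully connected layer, ignoring biases:
# every input neuron connects to every hidden neuron.
def dense_weights(n_inputs, n_hidden):
    return n_inputs * n_hidden

mnist_inputs = 28 * 28       # 784 grayscale pixels
cifar_inputs = 3 * 32 * 32   # 3,072 values: roughly 4x mnist
large_inputs = 3 * 64 * 64   # 12,288 values: roughly 16x mnist

hidden = 1000  # hypothetical hidden-layer size, chosen for illustration
for name, n in [("mnist", mnist_inputs),
                ("cifar10", cifar_inputs),
                ("3x64x64", large_inputs)]:
    print(name, n, dense_weights(n, hidden))
```

With this assumed hidden layer, the 3 x 64 x 64 input already needs over twelve million weights in the very first layer.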
Convolutional networks reduce the number of parameters needed, since they require neurons to connect only locally, to neurons corresponding to neighboring pixels, and therefore help avoid overfitting. In addition, reducing the number of parameters also helps computationally. In the next section, we will introduce some convolutional layer examples to help build intuition, and then we will move on to defining them formally.
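To make the parameter saving concrete, here is a back-of-the-envelope comparison on a cifar10-sized input. The hidden-layer size and the filter configuration are assumptions for illustration, not values from the text:

```python
# Parameter counts for one layer on a 3 x 32 x 32 cifar10 image, biases ignored.

# Fully connected: every one of the 3*32*32 inputs connects to each of,
# say, 1000 hidden neurons (a hypothetical size).
dense_params = 3 * 32 * 32 * 1000   # 3,072,000 weights

# Convolutional: each output neuron connects only to a small local patch,
# and the same weights are shared across all spatial positions. A layer
# with 32 filters of size 3x3 over the 3 input channels needs only:
conv_params = 32 * (3 * 3 * 3)      # 864 weights
print(dense_params, conv_params)
```

Under these assumptions, the convolutional layer uses several thousand times fewer weights than the fully connected one, which is exactly why local connectivity and weight sharing curb overfitting.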