Once you've understood how convolutional layers work, pooling layers are quite easy to grasp. A pooling layer typically works on every input channel independently, so the output depth is the same as the input depth. Alternatively, you can pool over the depth dimension, as we will see next, in which case the image's spatial dimensions (height and width) remain unchanged but the number of channels is reduced. Let's see a formal definition of pooling layers from the well-known TensorFlow website:
In summary, just as in a convolutional layer, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. As before, we must define the field's size, the stride, and the padding type. Unlike a convolutional neuron, however, a pooling neuron has no weights; it only aggregates its inputs with a function such as max or mean. The output can be computed as follows:
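output[i] = reduce(value[strides * i:strides * i + ksize])
Here, reduce is the aggregation function (for example, max or mean), ksize is the size of the pooling window, strides is the step between consecutive windows, and the indices also take the padding values into consideration.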
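The formula above can be sketched in a few lines of plain Python for the one-dimensional, unpadded case. This is only an illustration of the indexing, not TensorFlow's implementation; the helper names pool1d and mean and the sample values are assumptions made for this example.

```python
def pool1d(value, ksize, strides, reduce=max):
    """1-D pooling without padding (hypothetical helper, not a TensorFlow API).

    Implements output[i] = reduce(value[strides * i : strides * i + ksize]).
    """
    if len(value) < ksize:
        return []
    # Number of complete windows; trailing elements that do not fill a window are dropped.
    n_out = (len(value) - ksize) // strides + 1
    return [reduce(value[strides * i: strides * i + ksize]) for i in range(n_out)]


def mean(window):
    """Average of a window, usable as the reduce function."""
    return sum(window) / len(window)


signal = [1, 3, 2, 5, 4, 0, 7]                                 # made-up input values
max_pooled = pool1d(signal, ksize=2, strides=2)                # [3, 5, 4]
avg_pooled = pool1d(signal, ksize=2, strides=2, reduce=mean)   # [2.0, 3.5, 2.0]
```

Note that the last input value (7) does not appear in either result: with seven inputs, a window of 2, and a stride of 2, only three complete windows fit.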
In other words, the goal of pooling is to subsample (that is, shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters in the layers that follow, which in turn limits the risk of overfitting during training. Reducing the input image size also makes the neural network tolerant of small image shifts. As with the convolution ops, the spatial semantics of the pooling ops depend on the padding scheme chosen.
Padding is an operation that increases the size of the input data. For one-dimensional data, you append or prepend a constant to the array; for two-dimensional data, you surround the matrix with the constant; and in n dimensions, you surround the n-dimensional hypercube with it. In most cases the constant is zero, which is called zero padding. The two padding schemes are as follows:
- VALID padding: Adds no padding; if the windows do not fit the input exactly, the rightmost columns (or bottommost rows) are dropped
- SAME padding: Tries to pad evenly left and right, but if the number of columns to be added is odd, it adds the extra column to the right (and, likewise, an extra row to the bottom)
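The two schemes can be made concrete with a little arithmetic along one dimension. The following sketch uses the output-size rules commonly documented for TensorFlow (ceil((n - ksize + 1) / stride) for VALID and ceil(n / stride) for SAME); the function names and the sample sizes are assumptions chosen for illustration.

```python
import math


def output_size(n, ksize, stride, padding):
    """Output length along one dimension (hypothetical helper)."""
    if padding == "VALID":
        # No padding: only complete windows count.
        return max(math.ceil((n - ksize + 1) / stride), 0)
    if padding == "SAME":
        # Enough padding is added so that every input position is covered.
        return math.ceil(n / stride)
    raise ValueError("padding must be 'VALID' or 'SAME'")


def same_padding(n, ksize, stride):
    """Zeros added (before, after) along one dimension under SAME padding."""
    out = math.ceil(n / stride)
    total = max((out - 1) * stride + ksize - n, 0)
    before = total // 2          # left (or top)
    after = total - before       # right (or bottom) receives the extra zero if total is odd
    return before, after


# Example: input width 13, window 6, stride 5
valid_out = output_size(13, 6, 5, "VALID")   # 2 windows; the last two columns are dropped
same_out = output_size(13, 6, 5, "SAME")     # 3 windows
pads = same_padding(13, 6, 5)                # (1, 2): one zero on the left, two on the right
```

In this example three zeros are needed in total, which is odd, so the extra zero goes to the right, as described for SAME padding above.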
Let's explain the preceding definition graphically, in the following figure. If we want a layer to have the same height and width as the previous layer (with a stride of 1), it is common to add zeros around the inputs, as shown in the diagram. This is called SAME or zero padding.
With SAME, zeros are added as needed to make the shapes match, as evenly as possible on every side of the input map. VALID, on the other hand, means no padding: the rightmost columns (or bottommost rows) that do not fill a complete window are simply dropped:
In the following example (Figure 5), we use a 2 × 2 pooling kernel and a stride of 2 with no padding. Only the maximum input value in each 2 × 2 window makes it to the next layer; the other inputs are dropped (we will see this later on).
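As a rough sketch of this operation, the following plain-Python function applies 2 × 2 max pooling with a stride of 2 and no padding to a single channel. The function name max_pool2d and the 4 × 4 input values are made up for illustration and are not taken from the figure.

```python
def max_pool2d(image, ksize=2, stride=2):
    """Max pooling over one channel with no (VALID) padding; hypothetical helper."""
    rows, cols = len(image), len(image[0])
    out_rows = (rows - ksize) // stride + 1
    out_cols = (cols - ksize) // stride + 1
    pooled = []
    for i in range(out_rows):
        pooled_row = []
        for j in range(out_cols):
            # Collect the ksize x ksize window whose top-left corner is (stride*i, stride*j).
            window = [
                image[stride * i + di][stride * j + dj]
                for di in range(ksize)
                for dj in range(ksize)
            ]
            # Keep only the largest value; the rest are dropped.
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


image = [
    [1, 5, 3, 4],
    [2, 9, 1, 0],
    [7, 3, 8, 2],
    [4, 6, 5, 1],
]
pooled = max_pool2d(image)   # [[9, 4], [7, 8]]
```

The 4 × 4 input shrinks to 2 × 2: each output value summarizes four inputs, so three quarters of the values are discarded.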