Convolutional layers

A convolutional layer (sometimes referred to in the literature as a "filter") is a particular type of neural network layer that transforms an image to highlight certain features. Before we get into the details, let's introduce a convolutional filter using some code and some examples. This will build intuition and make the theory easier to understand. To do this we can use the keras datasets module, which makes it easy to load the data.

We will import numpy, then the mnist dataset, and matplotlib to show the data:

import numpy 
from keras.datasets import mnist  
import matplotlib.pyplot as plt 
import matplotlib.cm as cm

Let's define our main function, which takes an integer, corresponding to the position of an image in the mnist dataset, and a filter. In this case we will use a blur filter, which we define shortly:

def main(image, im_filter):
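      # Work in floating point so that filter weights (including negative ones)
      # combine safely with the 8-bit pixel values of the mnist images
      im = X_train[image].astype(float)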

Now we define a new image imC, of size (im.width-2, im.height-2):

      width = im.shape[0]       
      height = im.shape[1]
      imC = numpy.zeros((width-2, height-2))

At this point we do the convolution, which we will explain soon (as we will see, there are in fact several types of convolutions depending on different parameters, for now we will just explain the basic concept and get into the details later):

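      # Visit every position where the 3 x 3 filter fits entirely inside the image
      for row in range(1,width-1):
          for col in range(1,height-1):
              # Multiply each filter entry by the pixel underneath it and sum the results
              for i in range(len(im_filter)):
                  for j in range(len(im_filter[0])):
                      imC[row-1][col-1] += im[row-1+i][col-1+j]*im_filter[i][j]
              # Clamp the result to the valid pixel range [0, 255]
              if imC[row-1][col-1] > 255:
                  imC[row-1][col-1] = 255
              elif imC[row-1][col-1] < 0:
                  imC[row-1][col-1] = 0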

Now we are ready to display the original image and the new image:

      plt.imshow( im, cmap = cm.Greys_r )         
      plt.show()
      plt.imshow( imC/255, cmap = cm.Greys_r )       
      plt.show()

Now we are ready to load the mnist dataset using Keras as we did in Chapter 3, Deep Learning Fundamentals. Also, let's define a filter. A filter is a small region (in this case 3 x 3) in which each entry holds a real value. In this case we define a filter with the same value in every entry:

    blur = [[1./9, 1./9, 1./9], [1./9, 1./9, 1./9], [1./9, 1./9, 1./9]]

Since we have nine entries, we set each value to 1/9, so that the entries sum to 1 and each output value is the average of the nine inputs it covers.

And we can call the main function on any image (expressed by an integer that indicates the position) in such a dataset:

if __name__ == '__main__':          
    (X_train, Y_train), (X_test, Y_test) = mnist.load_data()
    blur = [[1./9, 1./9, 1./9], [1./9, 1./9, 1./9], [1./9, 1./9, 1./9]]
    main(3, blur)

Let's look at what we did. We multiplied each entry of the filter by the corresponding entry of the original image, and then we summed the products to get a single value. Since the filter is smaller than the image, we then moved the filter by 1 pixel and repeated the process until we had covered the whole image. Since the filter was composed of values that are all equal to 1/9, we have in fact replaced each input value with the average of itself and the values close to it, and this has the effect of blurring the image.
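The multiply-and-sum step described above can be sketched on a tiny array, without loading mnist. This is only an illustration: the helper name apply_filter and the 4 x 4 toy "image" are assumptions made for this example and are not part of the chapter's code.

```python
import numpy as np

def apply_filter(image, im_filter):
    """Slide im_filter over image one pixel at a time (no padding)."""
    image = np.asarray(image, dtype=float)
    im_filter = np.asarray(im_filter, dtype=float)
    f_rows, f_cols = im_filter.shape
    out_rows = image.shape[0] - f_rows + 1
    out_cols = image.shape[1] - f_cols + 1
    out = np.zeros((out_rows, out_cols))
    for row in range(out_rows):
        for col in range(out_cols):
            # Patch of the image currently covered by the filter
            patch = image[row:row + f_rows, col:col + f_cols]
            # Multiply entry by entry, then sum to a single value
            out[row, col] = np.sum(patch * im_filter)
    return out

blur = np.full((3, 3), 1.0 / 9)
toy = np.arange(16, dtype=float).reshape(4, 4)  # hypothetical 4 x 4 "image"
blurred = apply_filter(toy, blur)
print(blurred.shape)  # (2, 2)
print(blurred)        # each entry is the mean of the 3 x 3 patch it covers
```

Each of the four output values is the mean of the nine input values under the filter, which is exactly the averaging that produces the blur.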

This is what we get:

Figure: On top is the original mnist image; on the bottom is the new image after we applied the blur filter.

In the choice of the filter we can use any values we want; in this case we have used values that are all the same. However, we can instead use different values, for example values that add up the neighboring values of the input and subtract eight times the value of the center input, so that a region of uniform intensity gives a result of zero. Let's define a new filter, and let's call it edges, in the following way:

    edges = [[1, 1, 1], [1, -8, 1], [1, 1, 1]]

If we now apply this filter, instead of the filter blur defined earlier, we get the following images:

Figure: On top is the original mnist image; on the bottom is the new image after we applied the edges filter.

It is clear, therefore, that filters can alter the images, and show "features" that can be useful to detect and classify images. For example, to classify digits, the color of the inside is not important, and a filter such as "edges" helps identify the general shape of the digit which is what is important for a correct classification.
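To see why a filter such as edges responds to outlines and ignores flat interiors, it can help to apply it to a toy image with one sharp boundary. The following is a minimal sketch; the 5 x 6 array is a made-up example, not mnist data.

```python
import numpy as np

edges = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=float)

# Hypothetical 5 x 6 image: dark left half (0), bright right half (100)
toy = np.zeros((5, 6))
toy[:, 3:] = 100.0

rows = toy.shape[0] - 2
cols = toy.shape[1] - 2
response = np.zeros((rows, cols))
for r in range(rows):
    for c in range(cols):
        response[r, c] = np.sum(toy[r:r + 3, c:c + 3] * edges)

# Clamp to the displayable range, as the main function above does
clipped = np.clip(response, 0, 255)
print(response[0])  # zero in the flat regions, non-zero next to the boundary
print(clipped[0])
```

The weights of edges sum to zero, so any patch of constant intensity gives 0; only patches that straddle the boundary produce a response.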

We can think of filters in the same way we think about neural networks: the filter we have defined is a set of weights, and the final value represents the activation value of a neuron in the next layer. In fact, even though we chose particular weights to discuss these examples, we will see that the weights will be learned by the neural network using back-propagation:

The filter covers a fixed region, and for each neuron in that region, it defines a connection weight to a neuron in the next layer. The neuron in the next layer will then have an input value equal to the regular activation value calculated by summing the contributions of all input neurons mediated by the corresponding connection weights.

We then keep the same weights and we slide the filter across, generating a new set of neurons, which correspond to the filtered image:

We can keep repeating the process until we have moved across the whole image, and we can repeat this process with as many filters as we like, creating a new set of images, each of which will have different features or characteristics highlighted. While we have not used a bias in our examples, it is also possible to add a bias to the filter, which is added to the weighted sum before the activation is computed, and we can also define different activation functions. In our code example you will notice that we have forced the value to be in the range [0, 255], which can be thought of as a simple threshold function:

As the filter moves across the image, we define new activation values for the neurons in the output image.
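The role of the bias and of the activation function can be sketched for a single output neuron as follows. The helper names (clip_0_255, relu, neuron_output) and the numeric values are assumptions chosen for illustration; relu in particular is not used in the chapter's code and is shown only as a common alternative.

```python
import numpy as np

def clip_0_255(x):
    """Threshold-style activation: clamp the value to [0, 255]."""
    return np.clip(x, 0, 255)

def relu(x):
    """A common alternative activation (an assumption, not from the text)."""
    return np.maximum(x, 0)

def neuron_output(patch, weights, bias=0.0, activation=clip_0_255):
    """Weighted sum of one patch plus a bias, passed through an activation."""
    total = np.sum(np.asarray(patch) * np.asarray(weights)) + bias
    return activation(total)

patch = np.full((3, 3), 90.0)     # hypothetical 3 x 3 region of an image
blur = np.full((3, 3), 1.0 / 9)

print(neuron_output(patch, blur))                               # about 90
print(neuron_output(patch, blur, bias=200.0))                   # clamped to 255
print(neuron_output(patch, blur, bias=200.0, activation=relu))  # about 290
```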

Since one may define many filters, we should think of the output not as a single image, but as a set of images, one for each filter defined. If we used just the "edges" and the "blur" filters, the output layer would therefore have two images, one per filter chosen. The output will therefore have, besides a width and a height, also a depth equal to the number of filters chosen. In fact, the input layer can also have a depth if we use color images as input; images usually comprise three channels, which in computer graphics are represented by RGB: the red channel, the green channel, and the blue channel.

In our example, the filter is represented by a two-dimensional matrix (for example, the blur filter is a 3 x 3 matrix with all entries equal to 1/9). However, if the input is a color image, the filter will also have a depth (in this case equal to three, the number of color channels), and it will therefore be represented by three 3 x 3 matrices, one per color channel. In general, the filter will therefore be represented by a three-dimensional array, with a width, a height, and a depth; such arrays are sometimes called "volumes". In the preceding example, since the mnist images are gray-scale only, the filter had depth 1. A general filter of depth d is therefore composed of d two-dimensional filters of the same width and height. Each of those d filters is called a "slice" or a "leaf":

As before, for each "leaf" or "slice", we connect each neuron in the small sub-region, as well as a bias, to a neuron in the next layer, we calculate its activation value using the connection weights set in the filter, and we slide the filter across the whole area. It is easy to calculate that such a procedure requires a number of parameters equal to the number of weights defined by the filter (in our example above, this would be 3 x 3 = 9), multiplied by the number of "leaves", that is, the depth of the layer, plus one bias. This defines a feature map, so called because it highlights specific features of the input. In our code above we defined two feature maps, a "blur" and an "edges". Therefore, we need to multiply the number of parameters by the number of feature maps. Note that the weights for each filter are fixed; when we slide the filter across the region we do not change weights.

Therefore, if we start with a layer of size (width, height, depth), and a filter of dimension (filter_w, filter_h), the output layer after having applied the convolution has size (width - filter_w + 1, height - filter_h + 1). The depth of the new layer depends on how many feature maps we want to create. In our mnist code example earlier, if we applied both the blur and edges filters, we would have an input layer of size (28 x 28 x 1), since there is only one channel because the digits are gray-scale images, and an output layer of dimension (26 x 26 x 2), since our filters had dimension (3 x 3) and we used two filters. The number of parameters is only 18 (3 x 3 x 2), or 20 (3 x 3 x 2 + 2) if we add a bias. This is far fewer than we would need with a classical feed-forward network: since the input is 784 pixels, a simple hidden layer with just 50 neurons would need 784 x 50 = 39200 parameters, or 39250 if we add the biases.
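The parameter counts quoted above can be checked with a few lines of arithmetic. The function names conv_parameters and dense_parameters are hypothetical helpers introduced only for this check.

```python
def conv_parameters(filter_w, filter_h, depth, feature_maps, bias=True):
    """Weights of one filter (one slice per input channel), plus an optional
    bias, multiplied by the number of feature maps."""
    per_map = filter_w * filter_h * depth + (1 if bias else 0)
    return per_map * feature_maps

def dense_parameters(n_inputs, n_neurons, bias=True):
    """Fully connected layer: one weight per input-neuron pair, plus an
    optional bias per neuron."""
    return n_inputs * n_neurons + (n_neurons if bias else 0)

# Two 3 x 3 filters (blur and edges) on a gray-scale 28 x 28 image
print(conv_parameters(3, 3, 1, 2, bias=False))    # 18
print(conv_parameters(3, 3, 1, 2))                # 20

# Hidden layer with 50 neurons on the same 784-pixel input
print(dense_parameters(28 * 28, 50, bias=False))  # 39200
print(dense_parameters(28 * 28, 50))              # 39250
```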

We slide the filter across the image over all the "leaves" comprising the layer.
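A filter with depth can be sketched in the same style as the earlier two-dimensional code. This is an illustration under stated assumptions: the random array stands in for a color image, apply_volume_filter is a hypothetical helper, and the arrays are indexed as (rows, columns, channels).

```python
import numpy as np

def apply_volume_filter(volume, vol_filter, bias=0.0):
    """Apply one filter of shape (f_rows, f_cols, depth) to an input of
    shape (rows, cols, depth). All slices feed the same output neuron,
    so the result is a single two-dimensional feature map."""
    volume = np.asarray(volume, dtype=float)
    vol_filter = np.asarray(vol_filter, dtype=float)
    f_rows, f_cols, depth = vol_filter.shape
    if volume.shape[2] != depth:
        raise ValueError("filter depth must match input depth")
    out = np.zeros((volume.shape[0] - f_rows + 1, volume.shape[1] - f_cols + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = volume[r:r + f_rows, c:c + f_cols, :]
            out[r, c] = np.sum(patch * vol_filter) + bias
    return out

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(28, 28, 3))   # hypothetical color image

blur_rgb = np.full((3, 3, 3), 1.0 / 27)        # one 3 x 3 slice per channel
edges_slice = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=float)
edges_rgb = np.stack([edges_slice] * 3, axis=-1)

# One feature map per filter, stacked along the depth axis
feature_maps = np.stack(
    [apply_volume_filter(rgb, f) for f in (blur_rgb, edges_rgb)], axis=-1)
print(feature_maps.shape)  # (26, 26, 2)
```

The output depth equals the number of filters (two here), regardless of the depth of the input.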

Convolutional layers can moreover work better on images, since each neuron receives its input only from a small neighborhood of nearby neurons, rather than collecting input from neurons that are far apart.

Stride and padding in convolutional layers

The examples we have shown, aided by pictures, illustrate only one particular way of applying filters (as we mentioned earlier, there are different types of convolutions, depending on the parameters chosen). In fact, the size of the filter may vary, as well as how it moves across the image and how it behaves at the edges of the image. In our example, we moved the filter across the image 1 pixel at a time. The number of pixels (neurons) by which we move the filter at each step is called the stride. In the above example, we used a stride of 1, but it is not unusual to use larger strides, of 2 or even more. In this case the output layer would have a smaller width and height:

Figure: A filter applied with stride 2; the filter is moved by two pixels at a time.

In addition, we might also decide to apply the filter partially outside of the original picture. In that case, we assume that the missing neurons have value 0. This is called padding; that is, we add 0-valued neurons outside the original image. This can be useful if, for example, we want the output image to be the same size as the input image. Above, we wrote the formula for the size of the output image when no padding is used, and that was (width - filter_w + 1, height - filter_h + 1) for an input of size (width, height) and a filter of dimensions (filter_w, filter_h). If we use a padding P all around the image, the output size will be (width + 2P - filter_w + 1, height + 2P - filter_h + 1). To summarize, let the size of the input slice be I = (I_w, I_h), the size of the filter F = (F_w, F_h), the size of the stride S = (S_w, S_h), and the size of the padding P = (P_w, P_h); then the size O = (O_w, O_h) of the output slice is given by:

\[ O_w = \frac{I_w + 2P_w - F_w}{S_w} + 1 \]

\[ O_h = \frac{I_h + 2P_h - F_h}{S_h} + 1 \]

This of course identifies one of the constraints on S: it must divide (I + 2P - F) in both the width direction and the height direction. The size of the final volume is obtained by multiplying the size of the output slice by the number of desired feature maps.
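The relation between input size, filter size, stride, and padding, O = (I + 2P - F) / S + 1 in each dimension, can be written as a small helper that also enforces the divisibility constraint on S. The function name output_size is an assumption made for this sketch.

```python
def output_size(i, f, s=1, p=0):
    """Output size along one dimension for input size i, filter size f,
    stride s and padding p."""
    span = i + 2 * p - f
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % s != 0:
        raise ValueError("stride must divide I + 2P - F")
    return span // s + 1

print(output_size(28, 3))        # 26: the mnist example, no padding
print(output_size(28, 3, p=1))   # 28: padding 1 keeps the size
print(output_size(28, 2, s=2))   # 14: a 2 x 2 filter moved two pixels at a time
```

Calling output_size(28, 3, s=2) raises a ValueError, because 28 - 3 = 25 is not divisible by 2.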

The number of parameters W used, instead, is independent of the stride and padding, and it is just a function of the (square) size of the filter, the depth D (number of slices) of the input, and the number of feature maps M chosen:

\[ W = (D \cdot F_w \cdot F_h + 1) \cdot M \]

The use of padding (also called zero-padding, as we are padding the image with zeros) is sometimes useful if we want the output dimension to be the same as the input dimension. If we use a filter of dimension (3 x 3), applying a padding of 1 and a stride of 1 gives an output slice of the same size as the input slice, since I + 2P - F + 1 = I + 2 - 3 + 1 = I.
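Stride and zero-padding can be added to the loop-based convolution with only small changes. The following is a minimal sketch, not the chapter's own code: convolve2d is a hypothetical helper, numpy.pad supplies the zero border, and the all-ones image is a stand-in for real data.

```python
import numpy as np

def convolve2d(image, im_filter, stride=1, padding=0):
    """Slide im_filter over a zero-padded image, moving `stride` pixels
    at a time. Assumes the stride divides the padded span exactly."""
    image = np.pad(np.asarray(image, dtype=float), padding, mode="constant")
    im_filter = np.asarray(im_filter, dtype=float)
    f_rows, f_cols = im_filter.shape
    out_rows = (image.shape[0] - f_rows) // stride + 1
    out_cols = (image.shape[1] - f_cols) // stride + 1
    out = np.zeros((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            top, left = r * stride, c * stride
            patch = image[top:top + f_rows, left:left + f_cols]
            out[r, c] = np.sum(patch * im_filter)
    return out

image = np.ones((28, 28))                 # hypothetical 28 x 28 input

blur3 = np.full((3, 3), 1.0 / 9)
same = convolve2d(image, blur3, stride=1, padding=1)
print(same.shape)      # (28, 28): a 3 x 3 filter with padding 1 keeps the size

blur2 = np.full((2, 2), 1.0 / 4)
strided = convolve2d(image, blur2, stride=2)
print(strided.shape)   # (14, 14): stride 2 halves each dimension here
```

At the corners of the padded result, five of the nine values under the filter are padding zeros, so the blurred value there is 4/9 rather than 1.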
