Chapter 3. Deep Learning Fundamentals

In Chapter 1, Machine Learning – An Introduction, we introduced machine learning and some of its applications, and we briefly talked about a few different algorithms and techniques that can be used to implement machine learning. In Chapter 2, Neural Networks, we concentrated on neural networks; we showed that 1-layer networks are too simple and can only work on linear problems, and we introduced the Universal Approximation Theorem, which shows how 2-layer neural networks, with just one hidden layer, are able to approximate any continuous function on a compact subset of R^n to any degree of accuracy.

In this chapter, we will introduce deep learning and deep neural networks, that is, neural networks with two or more hidden layers. The reader may wonder what the point of using more than one hidden layer is, given the Universal Approximation Theorem, and this is in no way a naïve question, since for a long period the neural networks in use were very shallow, with just one hidden layer. The answer is that, while 2-layer neural networks can indeed approximate any continuous function to any degree, adding layers adds levels of complexity that may be much harder, and may require many more neurons, to simulate with shallow networks. There is also another, more important, reason behind the "deep" of deep learning: it refers not just to the depth of the network, or how many layers the neural net has, but to the level of "learning". In deep learning, the network does not simply learn to predict an output Y given an input X; it also understands basic features of the input. The neural network is able to make abstractions of the features that comprise the input examples, to understand the basic characteristics of the examples, and to make predictions based on those characteristics. In deep learning, there is a level of abstraction that is missing in other basic machine learning algorithms and in shallow neural networks.

In this chapter, we will cover the following topics:

  • What is deep learning?
  • Fundamental concepts of deep learning
  • Applications of deep learning
  • GPU versus CPU
  • Popular open source libraries

What is deep learning?

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton published an article titled ImageNet Classification with Deep Convolutional Neural Networks in Proceedings of Neural Information Processing Systems (NIPS) (2012) and, at the end of their paper, they wrote:

"It is notable that our network's performance degrades if a single convolutional layer is removed. For example, removing any of the middle layers results in a loss of about 2% for the top-1 performance of the network. So the depth really is important for achieving our results."

In this milestone paper, they clearly mention the importance of the number of hidden layers present in deep networks. Krizhevsky, Sutskever, and Hinton talk about convolutional layers, which we will not discuss until Chapter 5, Image Recognition, but the basic question remains: what do those hidden layers do?

A typical English saying is that a picture is worth a thousand words. Let's use this approach to understand what deep learning is. In H. Lee, R. Grosse, R. Ranganath, and A. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in Proceedings of International Conference on Machine Learning (ICML) (2009) (refer to http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf), the authors use a few images, which we copy here.

[Image 1: low-level features (lines and edges) learned by the network, from Lee et al. (2009)]

In their example, they showed the neural network pictures of different categories of objects and animals, and the network learned some basic features for each category. For instance, the network can learn some very basic shapes, such as lines or edges, which are common to every category. In the next layer, however, the network learns how those lines and edges fit together for each category to make images featuring the eyes of a face or the wheels of a car. This is similar to how the visual cortex in humans works, where our brain recognizes more and more complex features, starting from simple lines and edges.

[Image 2: higher-level features (object parts) learned by the network, from Lee et al. (2009)]

The hidden layers in a deep neural network work similarly, understanding more and more complex features in each successive hidden layer. If we want to define what makes a face, we need to define its parts: the eyes, the nose, the mouth; then we need to go a level up and define their position with respect to each other: the two eyes are in the top middle part, at the same height, the nose is in the middle, and the mouth is in the lower middle part, below the nose. Deep neural networks capture these features by themselves, first learning the components of the image, then their relative positions, and so on, similar to how, in Images 1 and 2, we can see deeper levels of abstraction at work in each layer. Some deep learning networks can, in fact, be considered generative algorithms rather than simply predictive ones, as in the case of Restricted Boltzmann Machines (RBMs): they learn to generate a signal, and then make predictions based on the generation assumptions they have learned. As we progress through this chapter, we will make this concept clearer.

Fundamental concepts

In 1801, Joseph Marie Charles, known as Jacquard (hence the name of his invention), invented the Jacquard loom. Jacquard was not a scientist, but simply a merchant. The Jacquard loom used a set of punched cards, where each punched card represented a pattern to be reproduced on the loom; each card was, in other words, an abstract representation of a design. Punched cards were used afterwards, for example, in the tabulating machine invented by Herman Hollerith in 1890, and in the first computers, where they were used to feed code to the machine. However, in the tabulating machine, punched cards were simply abstractions of samples to be fed into the machine to calculate statistics on a population. In the Jacquard loom, the use of punched cards was subtler: each card represented the abstraction of a pattern that could then be combined with others to create more complex patterns. The punched card is an abstract representation of a feature of a reality, the final woven design.

In a way, the Jacquard loom had the seed of what makes deep learning today: the definition of a reality through the representations of its features. In deep learning, the neural network does not simply recognize what makes a cat a cat, or a squirrel a squirrel; it understands what features are present in a cat and what features are present in a squirrel, and it learns to design a cat or a squirrel using those features. If we were to design a weaving pattern in the shape of a cat using a Jacquard loom, we would need to use punched cards that produce whiskers on the nose, like those of a cat, and an elegant and slender body. Instead, if we were to design a squirrel, we would need to use the punched card that makes a furry tail, for example. A deep network that learns basic representations of its output can make classifications using the assumptions it has made; therefore, if there is no furry tail, it will probably not be a squirrel, but rather a cat. This has many implications, as we will see, not least that the amount of information that the network learns is much more complete and robust. By learning to generate the model (in technical parlance, by learning the joint probability p(x, y) rather than simply p(y|x)), the network is much less sensitive to noise, and it learns to recognize images even when there are other objects present in the scene or the object is partially obscured. The most exciting part is that deep neural networks learn to do this automatically.
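To make the distinction between the two probabilities concrete, here is a minimal Python sketch of how a model of the joint probability p(x, y) also yields the conditional p(y|x). All the feature names and numbers are made up for illustration:

    import numpy as np

    # A toy joint distribution p(x, y) over a binary feature x = furry_tail
    # and a label y in {cat, squirrel}; the numbers are invented.
    labels = ["cat", "squirrel"]
    p_xy = {
        (0, "cat"): 0.45, (1, "cat"): 0.05,         # cats rarely have furry tails
        (0, "squirrel"): 0.10, (1, "squirrel"): 0.40,
    }

    def p_y_given_x(furry_tail):
        # Derive the conditional p(y | x) from the joint p(x, y).
        joint = np.array([p_xy[(furry_tail, y)] for y in labels])
        return dict(zip(labels, joint / joint.sum()))

    print(p_y_given_x(furry_tail=0))   # {'cat': ~0.82, 'squirrel': ~0.18}

A model that only learned p(y|x) has no notion of how plausible an input itself is; a model of p(x, y) does, and this is what lets generative networks fill in noisy or partially obscured examples, as we will see shortly.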

Feature learning

The Ising model was invented by the physicist Wilhelm Lenz in 1920, and he gave it as a problem to his student Ernst Ising. The model consists of discrete variables that can be in two states (positive or negative) and that represent magnetic dipoles.

In Chapter 4, Unsupervised Feature Learning, we will introduce Restricted Boltzmann Machines and auto-encoders, and we will start going deeper into how to build multi-layer neural networks. The types of neural networks that we have seen so far all have a feed-forward architecture, but we will see that we can define networks with a feedback loop to help tune the weights that define the neural network. Ising models, though not directly used in deep learning, are a good physical example that helps us understand the basic inner workings of tuning deep neural architectures, including Restricted Boltzmann Machines, and in particular they help us understand the concept of representation.

What we are going to discuss in this section is a simple adaptation (and simplification) of the Ising model to deep learning. In Chapter 2, Neural Networks, we discussed how important it is to tune the weights of the connections between neurons. In fact, it is the weights in a neural network that make the network learn. Given a (fixed) input, this input propagates to the next layer and sets the internal state of the neurons in that layer based on the weights of their connections. Then, these neurons will fire and move the information over to the following layer through new connections defined by new weights, and so on. The weights are the only variables of the network, and they are what make the network learn. In general, if our activation function were a simple threshold function, a large positive weight would tend to make two neurons fire together. By firing together, we mean that, if one neuron fires, and the connecting weight is high, then the other neuron will also fire (since the input times the large connecting weight will likely put it over the chosen threshold). In fact, in 1949, in The Organization of Behavior (http://s-f-walker.org.uk/pubsebooks/pdfs/The_Organization_of_Behavior-Donald_O._Hebb.pdf), the Canadian psychologist Donald Hebb proposed that the converse should also be true. The rule that goes by his name, the Hebb rule, says that when neurons fire together, their connection strengthens; when they do not fire together, their connection weakens.
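To make the rule concrete, here is a minimal Python sketch of one Hebbian weight update between a layer of presynaptic neurons and a layer of postsynaptic neurons. The learning rate and the exact penalty for not firing together are made-up choices for illustration; Hebb's original formulation only specifies the strengthening part:

    import numpy as np

    def hebb_update(w, pre, post, lr=0.1):
        # One step of a simple signed Hebbian rule: the weight between two
        # neurons grows when both fire, and shrinks when only one of them
        # fires (an illustrative variant, not Hebb's exact formulation).
        pre, post = np.asarray(pre, float), np.asarray(post, float)
        together = np.outer(pre, post)    # 1 where both neurons fired
        alone = np.outer(pre, 1 - post) + np.outer(1 - pre, post)
        return w + lr * (together - alone)

    w = np.zeros((2, 2))
    w = hebb_update(w, pre=[1, 0], post=[1, 0])
    print(w)   # [[ 0.1 -0.1] [-0.1  0. ]]: the co-firing pair strengthened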

In the following example, we think of an Ising model as a network of neurons that act in a binary way, that is, they can only activate (fire) or not, and in which the stronger their connection, the more likely they are to fire together. We also assume that the network is stochastic: even if two neurons are strongly connected, they are only very likely, not certain, to fire together.

Tip

Stochastic means probabilistic. In a stochastic network, we define the probability that a neuron fires: the higher the probability, the more likely the neuron is to fire. When two neurons are strongly connected, that is, connected by a large weight, the probability that one firing will induce the other to fire as well is very high (and, vice versa, a weak connection gives a low probability). However, the neuron only fires according to a probability, and therefore we cannot know with certainty whether it will fire.

On the other hand, if they are inversely correlated (a large negative weight), they are very likely not to fire together. Let's show some examples:

[Figure: two examples of a pair of neurons with strong positive connections to a third neuron]

In the first figure, the first two neurons are active, and their connections with the third neuron are large and positive, so the third neuron will very likely be active too. In the second figure, the first two neurons are off, and their connections with the third neuron are positive, so the third neuron will very likely be off as well.

There are several possible combinations; we will show only a few of them. The idea is that the state of the neurons in the first layer will probabilistically determine the state of the neurons in the following layer, depending on the sign and strength of the connections. If the connections are weak, the connected neurons in the following layer may have an equal, or almost equal, probability of being in either state. But if the connections are very strong, then the sign of the weights will make the connected neurons act in a similar or opposite way. Of course, if a neuron in the second layer has more than one input neuron, we weigh all the input connections as usual. And if the input neurons are not all on or off, and their connections are equally strong, then, again, the connected neuron may have an equal or almost equal chance of being on or off.

[Figure: two examples of a pair of neurons with strong negative connections to a third neuron]

In the first figure, the first two neurons are active, and their connections with the third neuron are large and negative, so the third neuron will likely be off. In the second figure, the first two neurons are off, and their connections with the third neuron are large and negative, so the third neuron will likely be on.

It is then clear that, for the state of the neurons in the first layer to determine the state of the neurons in the following layer with high probability, the neurons in the first layer should all be in similar states (on or off) and should all be connected by strong connections (that is, large weights). Let's see some more examples:

[Figure: three examples with opposite, mixed, and weak connections to a third neuron]

In the first figure, the first two neurons are active, and their connections with the third neuron are large but of opposite signs, so the third neuron is equally likely to be on or off. In the second figure, one of the first two neurons is on and the other is off, and their connections with the third neuron are both large and positive, so the third neuron is again equally likely to be on or off. In the last figure, the first two neurons are active, but their connections with the third neuron are small, so the third neuron is slightly more likely to be on, but has a relatively high chance of being off as well.
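These scenarios are easy to reproduce in code. The following is a minimal Python sketch, assuming Ising-style states (+1 for on, -1 for off) and a logistic sigmoid firing probability, the convention used in Boltzmann machines; all the weight values are made up:

    import numpy as np

    def fire_probability(states, weights):
        # Probability that a stochastic binary neuron fires, given the
        # states of its input neurons (+1 = on, -1 = off) and the weights
        # of their connections: the sigmoid of the weighted sum.
        return 1.0 / (1.0 + np.exp(-np.dot(states, weights)))

    print(fire_probability([+1, +1], [5.0, 5.0]))    # ~1.0: very likely fires
    print(fire_probability([-1, -1], [5.0, 5.0]))    # ~0.0: very likely off
    print(fire_probability([+1, +1], [-5.0, -5.0]))  # ~0.0: negative weights
    print(fire_probability([+1, +1], [5.0, -5.0]))   # 0.5: opposite weights cancel
    print(fire_probability([+1, -1], [5.0, 5.0]))    # 0.5: one on, one off
    print(fire_probability([+1, +1], [0.5, 0.5]))    # ~0.73: slightly favors on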

The point of introducing this adaptation of the Ising model is to understand how representation learning works in deep neural networks. We have seen that setting the correct weights can make a neural network turn certain neurons on or off, or, in general, affect their output. Picturing neurons in just two states, however, helps our intuition and our visual description of what goes on in a neural network. It also helps our visual intuition to represent our network layers in two dimensions, rather than as 1-dimensional layers. Let's picture a neural network layer as a 2-dimensional plane. We can then imagine that each neuron represents a pixel in a 2-dimensional image, where an "on" neuron is a (visible) dark dot on a white plane, while an "off" neuron blends in (invisibly) with the white background. Our input layer of on/off neurons can then be seen as a simple 2-dimensional black-and-white image. For example, suppose we want to represent a smiley face, or a sad face; we would just turn on (activate) the correct neurons to get the following figures:

A happy and a sad face: the difference lies in a few neurons on the side of the mouth that can be on or off.
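A minimal sketch of this picture in code, with a made-up pixel layout: each entry of the array below is one binary neuron, where 1 is on (a dark dot) and 0 is off (the white background). Flipping the four neurons at the sides of the mouth turns the happy face into a sad one:

    import numpy as np

    happy = np.array([
        [0, 1, 0, 0, 0, 1, 0],   # the two eyes
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],   # the nose
        [0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 1],   # corners of the mouth, turned up
        [0, 1, 1, 1, 1, 1, 0],   # middle of the mouth
    ])

    sad = happy.copy()
    sad[4, [0, 6]] = 0           # the raised corners of the mouth switch off...
    sad[5, [0, 6]] = 1           # ...and the lowered corners switch on

    print(int(np.sum(happy != sad)))   # 4: only four neurons differ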

Now let's suppose that this image corresponds to the input layer, and that this layer is connected to another layer, one of the hidden layers. There would then be connections between each pixel in this image (both black and white) and each neuron in the following layer. In particular, each black (on) pixel would be connected to each neuron in the following layer. Let's now assume that each neuron making up the left eye has a strong (large positive weight) connection to one particular neuron in the hidden layer, but a large negative connection to every other neuron in the hidden layer:

On the left, a smiley face; on the right, the same smiley face and the connections between its left eye and a hidden neuron.

What this means is that, if we set large positive weights between the left eye and one particular hidden neuron, and large negative connections between the left eye and all the other hidden neurons, then whenever we show the network a face that contains a left eye (which means those neurons are on), this particular hidden neuron will activate, while all the other hidden neurons will tend to stay off. This means that this particular neuron will be able to detect whether a left eye is present. We can similarly create connections between the right eye, the nose, and the main part of the mouth and other hidden neurons, so that we can start detecting all those face features.

Each face feature (the eyes, the nose, and the mouth) has large positive connections with certain hidden neurons and large negative connections with the others.

This shows how we can select the weights of our connections so that the hidden neurons start recognizing features of our input.
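As a concrete illustration, the following Python sketch builds such a hand-crafted detector. The image layout, pixel indices, and weight values are all made up and, as the tip below reminds us, in a real network these weights would be learned rather than set by hand:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    n_pixels, n_hidden = 64, 4    # an 8x8 image and 4 hidden feature neurons
    left_eye = [18, 19]           # hypothetical indices of the left-eye pixels

    # Hand-crafted weights: the left-eye pixels connect with a large positive
    # weight to hidden neuron 0 and a large negative weight to all the others.
    W = np.zeros((n_pixels, n_hidden))
    W[left_eye, 0] = 5.0
    W[left_eye, 1:] = -5.0

    image = np.zeros(n_pixels)
    image[left_eye] = 1           # show a face whose left eye is present

    p_fire = sigmoid(image @ W)   # firing probability of each hidden neuron
    print(p_fire.round(3))        # [1. 0. 0. 0.]: neuron 0 detects the left eye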

Tip

As an important reminder, we want to point out that we do not, in fact, select the weights of our connections by hand to start recognizing features of the input. Rather, those weights are selected automatically by the network using back-propagation or other tuning methods.

In addition, we can have more hidden layers that can recognize features of the features (is the mouth in our face smiling or is it sad?) and therefore get more precise results.

There are several advantages to deep learning. The first, as we have seen, is that it can recognize features. Another, even more important, advantage is that it recognizes features automatically. In the example above, we set the weights ourselves to recognize the features we chose; this reflects one of the disadvantages of many machine learning algorithms, namely that the user must rely on his or her own experience to select what he or she thinks are the best features. A lot of time therefore goes into feature selection, which must still be performed by a human. Deep learning algorithms, instead, select the best features automatically. This can be done, as we saw in the previous chapter, using back-propagation, but other techniques also exist to select those weights, and these will be important points treated in the next chapter, with techniques such as auto-encoders and Restricted Boltzmann Machines (or Harmoniums, as Paul Smolensky, who invented them in 1986, called them). We should, however, also caution the reader that the advantage we gain from automatic feature selection comes at the price of having to choose the correct architecture for the neural network.

In some deep learning systems (for example, Restricted Boltzmann Machines, as we will see in the next chapter), the neural network can also learn to "repair" itself. As mentioned in the previous example, we could produce a general face by activating the four neurons we have associated with the right eye, the left eye, the nose, and the mouth, respectively. Because of the large positive weights between them and the neurons in the previous layer, those neurons will turn on, and we will have the neurons corresponding to those features activated, generating a general image of a face. At the same time, if the neurons defining the face are turned on, the four neurons corresponding to the eyes, the nose, and the mouth will also activate. What this means is that, even if not all the neurons defining the face are on, if the connections are strong enough, they may still turn on the four corresponding neurons, which, in turn, will activate the missing neurons of the face.

This has one more extra advantage: robustness. Human vision can recognize objects even when the view is partly obscured; we can recognize people even when they wear a hat, or a scarf that covers their mouth, and we are not sensitive to noise in the image. Similarly, when we create this correspondence, if we alter the face slightly, for example by modifying the mouth by one or two pixels, the signal would still be strong enough to turn the "mouth" neuron on, which, in turn, would turn on the correct pixels and turn off the wrong pixels making up the modified mouth. This system is not sensitive to noise and can make auto-corrections.

Let's say, for example, that the mouth has a couple of pixels off (in the figure those are the pixels that have an x).

In this image, a couple of the pixels making up the mouth are not turned on.

However, the mouth may still have enough neurons in the right place to be able to turn on the corresponding neuron representing it:

Even though a couple of neurons are off, the connections with the other neurons are strong enough that the neuron representing the mouth in the next layer will turn on anyway.

On the other hand, we could now travel the connections backwards: whenever the neuron representing the mouth is on, it would turn on all the neurons comprising the mouth, including the two neurons that were previously off:

The two neurons have been activated by the neuron on top.
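The whole repair loop is easy to simulate. The following is a toy Python sketch, with made-up indices, weights, and thresholds: a forward pass in which the damaged mouth still fires its hidden neuron, followed by a backward pass along the same (symmetric) connections that switches the missing pixels back on. This up-and-down pass is, in spirit, what Restricted Boltzmann Machines do, as we will see in the next chapter:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    n_pixels = 64
    mouth = [40, 41, 42, 43, 44]   # hypothetical indices of the mouth pixels
    w = np.zeros(n_pixels)         # weights to a single hidden "mouth" neuron
    w[mouth] = 2.0

    image = np.zeros(n_pixels)
    image[mouth] = 1
    image[[41, 43]] = 0            # corrupt the mouth: two pixels are off

    # Forward pass: three of the five mouth pixels are still on, which is
    # enough to fire the mouth neuron (-5.0 acts as a bias/threshold).
    mouth_on = sigmoid(image @ w - 5.0) > 0.5    # 3 * 2.0 - 5.0 = 1.0 -> fires

    # Backward pass: the active mouth neuron pushes all of its pixels back
    # on through the same connections, repairing the two damaged ones.
    repaired = sigmoid(w * float(mouth_on) - 1.0) > 0.5
    print(repaired[mouth])         # [ True  True  True  True  True]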

In summary, the advantages of deep learning with respect to many other machine learning algorithms, and shallow neural networks in particular, are as follows:

  • Deep learning can learn representations
  • Deep learning is less sensitive to noise
  • Deep learning can be a generative algorithm (more on this in the next chapter)

To further understand why many hidden layers may be necessary, let's consider the task of recognizing a simple geometric figure, a cube. Say that each possible line in 3D is associated with a neuron (let's forget for a moment that this will require an infinite number of neurons).

Each line in the same visual field is associated with a different neuron.

If we restrict ourselves to a single eye, lines at different angles in 3D space will project onto the same line on a 2-dimensional plane. Each line we see could therefore be given by any of the corresponding 3D lines that project onto the same line on the retina. Assume that each possible 3D line is associated with a neuron. Two distinct lines that make up the cube are therefore each associated with a family of neurons. However, the fact that these two lines intersect allows us to create a connection between two neurons, each belonging to a different family. We have many neurons for the line making up one of the edges of the cube, and many other neurons for the line making up another of the edges, but since those two lines intersect, there are two neurons that will be connected. Similarly, each of these lines connects to the other lines that make up the cube, allowing us to further refine our representation. At a higher level, our neural net can also start to identify that these lines are not connected at just any angle: they are connected at exactly 90-degree angles. This way, we can make increasingly abstract representations that allow us to identify the set of lines drawn on a piece of paper as a cube.

Neurons in different layers, organized hierarchically, represent different levels of abstraction of basic elements in the image and of how they are structured. This toy example shows that each layer can, in an abstract system, link together what different neurons at a lower level are seeing, making connections between them, similar to how we can make connections between abstract lines. Using those connections, it can realize that those abstract lines meet at a point, and, in a layer further up, that they in fact meet at 90 degrees and make up a cube, in the same way we described how we can learn to recognize a face by recognizing the eyes, the nose, and the mouth, and their relative positions.

Each line is associated with a neuron, and we can create basic representations by associating neurons that represent lines that intersect, and more complex representations by associating neurons that represent lines at specific angles.

Deep learning algorithms

In the previous section, we gave an intuitive introduction to deep learning. In this section, we will give a more precise definition of the key concepts that will be thoroughly introduced in the next chapters. Deep neural networks with many layers also have a biological reason to exist: through our study of how humans understand speech, it has become clear that we are endowed with a layered hierarchical structure that transforms the information from the audible sound input up to the linguistic level. Similarly, the visual system and the visual cortex have a layered structure, from V1 (the striate cortex) to the V2, V3, and V4 visual areas of the brain. Deep neural networks mimic the nature of our brains, though in a very primitive way. We should warn the reader, however, that while understanding our brain can help us create better artificial neural networks, in the end we may be creating a completely different architecture, in the same way that we may have started designing airplanes by trying to mimic birds, but ended up with a very different model.

In Chapter 2, Neural Networks, we introduced the back-propagation algorithm as a popular training algorithm. In practice, when we have many layers, back-propagation may be a slow and difficult algorithm to use: it is mainly based on the gradient of the function being minimized, and the existence of local minima may often prevent convergence of the method. However, the term deep learning applies to a class of deep neural network algorithms that may use different training algorithms and weight-tuning techniques; they are not limited to back-propagation and classical feed-forward architectures. We should therefore define deep learning more generally as a class of machine learning techniques where information is processed in hierarchical layers in order to understand representations and features from the data at increasing levels of complexity. In this class of algorithms, we can generally include the following:

  • Multi-Layer Perceptrons (MLP): A neural network with many hidden layers and feed-forward propagation. As discussed, this is one of the first examples of a deep learning network, but not the only possible one.
  • Boltzmann Machines (BM): A stochastic, symmetric network with a well-defined energy function.
  • Restricted Boltzmann Machines (RBM): Similar to the Ising model example above, restricted Boltzmann machines consist of symmetric connections between two layers, one visible and one hidden, but, unlike general Boltzmann machines, neurons have no intra-layer connections. They can be stacked together to form DBNs.
  • Deep Belief Networks (DBN): A stochastic generative model in which the top layers have symmetric connections between them (undirected, unlike feed-forward networks), while the bottom layers receive the processed information through directed connections from the layers above them.
  • Autoencoders: A class of unsupervised learning algorithms in which the output has the same shape as the input, which allows the network to learn basic representations better.
  • Convolutional Neural Networks (CNN): Convolutional layers apply filters to the input image (or sound) by sliding the filter across the incoming signal to produce a two-dimensional activation map. CNNs allow the enhancement of features hidden in the input.

Each of these deep learning implementations has its own advantages and disadvantages, and they can be easier or harder to train depending on the number of layers and neurons in each layer. While simple feed-forward Deep Neural Networks can generally be trained using the back-propagation algorithm discussed in the second chapter, different techniques exist for the other types of networks, as will be discussed further in the next chapter.
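As a point of reference for what "information processed in hierarchical layers" means in code, here is a minimal numpy sketch of the forward pass of a simple feed-forward deep network (an MLP). The layer sizes, random weights, and dummy input are all made up; training such weights, with back-propagation or the other techniques mentioned above, is the subject of the coming chapters:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    sizes = [64, 32, 16, 10]       # input, two hidden layers, output
    weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]

    x = rng.random(64)             # a dummy input, e.g. a flattened 8x8 image
    for W, b in zip(weights, biases):
        x = sigmoid(x @ W + b)     # each layer re-represents the previous one
    print(x.shape)                 # (10,): one activation per output class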
