Chapter 5. Baby steps with neural networks (perceptrons and backpropagation)

This chapter covers

  • Learning the history of neural networks
  • Stacking perceptrons
  • Understanding backpropagation
  • Seeing the knobs to turn on neural networks
  • Implementing a basic neural network in Keras

In recent years, a lot of hype has developed around the promise of neural networks and their ability to classify and identify input data, and more recently the ability of certain network architectures to generate original content. Companies large and small are using them for everything from image captioning and self-driving car navigation to identifying solar panels from satellite images and recognizing faces in security camera videos. And luckily for us, many NLP applications of neural nets exist as well. While deep neural networks have inspired a lot of hype and hyperbole, our robot overlords are probably further off than any clickbait cares to admit. Neural networks are, however, quite powerful tools, and you can easily use them in an NLP chatbot pipeline to classify input text, summarize documents, and even generate novel works.

This chapter is intended as a primer for those with no experience in neural networks. We don’t cover anything specific to NLP in this chapter, but gaining a basic understanding of what is going on under the hood in a neural network is important for the upcoming chapters. If you’re familiar with the basics of neural networks, you can rest easy and skip ahead to the next chapter, where you dive back into processing text with the various flavors of neural nets. Although the mathematics of the underlying algorithm, backpropagation, are outside this book’s scope, a high-level grasp of its basic functionality will help you understand language and the patterns hidden within.

Tip

Manning publishes two other tremendous resources on deep learning:

  • Deep Learning with Python, by François Chollet (Manning, 2017), is a deep dive into the wonders of deep learning by the creator of Keras himself.
  • Grokking Deep Learning, by Andrew Trask (Manning, 2017), is a broad overview of deep learning models and practices.

5.1. Neural networks, the ingredient list

As the availability of processing power and memory has exploded over the course of the decade, an old technology has come into its own again. First proposed in the 1950s by Frank Rosenblatt, the perceptron[1] offered a novel algorithm for finding patterns in data.

1

Rosenblatt, Frank (1957), “The perceptron—a perceiving and recognizing automaton.” Report 85-460-1, Cornell Aeronautical Laboratory.

The basic concept lies in a rough mimicry of the operation of a living neuron cell. As electrical signals flow into the cell through the dendrites (see figure 5.1) and on into the nucleus, an electric charge begins to build up. When the cell reaches a certain level of charge, it fires, sending an electrical signal out through the axon. However, the dendrites aren’t all created equal. The cell is more “sensitive” to signals through certain dendrites than others, so it takes less of a signal in those paths to fire the axon.

Figure 5.1. Neuron cell

The biology that controls these relationships is most certainly beyond the scope of this book, but the key concept to notice here is the way the cell weights incoming signals when deciding when to fire. The neuron dynamically changes those weights in the decision-making process over the course of its life. You are going to mimic that process.

5.1.1. Perceptron

Rosenblatt’s original project was to teach a machine to recognize images. The original perceptron was a conglomeration of photo-receptors and potentiometers, not a computer in the current sense. But implementation specifics aside, Rosenblatt’s concept was to take the features of an image and assign a weight, a measure of importance, to each one. The features of the input image were each a small subsection of the image.

A grid of photo-receptors would be exposed to the image. Each receptor would see one small piece of the image. The brightness of the image that a particular photo-receptor could see would determine the strength of the signal that it would send to the associated “dendrite.”

Each dendrite had an associated weight in the form of a potentiometer. Once enough signal came in, it would pass the signal into the main body of the “nucleus” of the “cell.” Once enough of those signals from all the potentiometers passed a certain threshold, the perceptron would fire down its axon, indicating a positive match on the image it was presented with. If it didn’t fire for a given image, that was a negative classification match. Think “hot dog, not hot dog” or “iris setosa, not iris setosa.”

5.1.2. A numerical perceptron

So far there has been a lot of hand waving about biology and electric current and photo-receptors. Let’s pause for a second and pull out the most important parts of this concept.

Basically, you’d like to take an example from a dataset, show it to an algorithm, and have the algorithm say yes or no. That’s all you’re doing so far. The first piece you need is a way to determine the features of the sample. Choosing appropriate features turns out to be a surprisingly challenging part of machine learning. In “normal” machine learning problems, like predicting home prices, your features might be square footage, last sold price, and ZIP code. Or perhaps you’d like to predict the species of a certain flower using the Iris dataset.[2] In that case your features would be petal length, petal width, sepal length, and sepal width.

2

The Iris dataset is frequently used to introduce machine learning to new students. See the Scikit-Learn docs (http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).

In Rosenblatt’s experiment, the features were the intensity values of each pixel (subsections of the image), one pixel per photo receptor. You then need a set of weights to assign to each of the features. Don’t worry yet about where these weights come from. Just think of them as a percentage of the signal to let through into the neuron. If you’re familiar with linear regression, then you probably already know where these weights come from.[3]

3

The weights for the inputs to a single neuron are mathematically equivalent to the slopes in a multivariate linear regression or logistic regression.

Tip

Generally, you’ll see the individual features denoted as xi, where i is a reference integer. And the collection of all features for a given example is denoted as X, representing a vector:

  • X = [x1, x2, ..., xi, ..., xn]

And similarly, you’ll see the associated weights for each feature as wi, where i corresponds to the index of the feature x associated with that weight. And the weights are generally represented as a vector W:

  • W = [w1, w2, ..., wi, ..., wn]

With the features in hand, you just multiply each feature (xi) by the corresponding weight (wi) and then sum up:

  • (x1 * w1) + (x2 * w2) + ... + (xi * wi) + ... + (xn * wn)

The one piece you’re missing here is the neuron’s threshold to fire or not. And it’s just that, a threshold. Once the weighted sum is above a certain threshold, the perceptron outputs 1. Otherwise it outputs 0.

You can represent this threshold with a simple step function (labeled “Activation Function” in figure 5.2).

Figure 5.2. Basic perceptron

5.1.3. Detour through bias

Figure 5.2 and this example reference bias. What is this? The bias is an “always on” input to the neuron. The neuron has a weight dedicated to it just as with every other element of the input, and that weight is trained along with the others in the exact same way. This is represented in two ways in the neural network literature. You may see the input represented as the base input vector, say of n elements, with a 1 appended to the beginning or the end of the vector, giving you an n+1 dimensional vector. The position of the 1 is irrelevant to the network, as long as it’s consistent across all of your samples. Other times people presume the existence of the bias term and leave it off the input in a diagram; the weight associated with it exists separately, is always multiplied by 1, and is added to the dot product of the sample input’s values and their associated weights. Both are effectively the same; this is just a heads-up to notice the two common ways of displaying the concept.

The reason for having the bias weight at all is that you need the neuron to be resilient to inputs of all zeros. It may be the case that the network needs to learn to output 0 in the face of inputs of 0, but it may not. Without the bias term, the neuron would output 0 * weight = 0 for any weights you started with or tried to learn. With the bias term, you won’t have this problem. And if the neuron needs to learn to output 0, it can learn to decrement the weight associated with the bias term enough to keep the dot product below the threshold.
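Both conventions are numerically identical. Here’s a minimal numpy sketch (the variable names and values are ours, just for illustration) showing that appending a constant 1 to the input vector and folding the bias weight into the weight vector produces the same activation level as keeping the bias separate:

>>> import numpy as np
>>> x = np.array([1, .2, .1, .05, .2])
>>> w = np.array([.2, .12, .4, .6, .9])
>>> bias_weight = .2
>>> x_aug = np.append(x, 1)            # bias input appended as a constant 1
>>> w_aug = np.append(w, bias_weight)  # bias weight folded into the weight vector
>>> np.isclose(np.dot(x, w) + bias_weight * 1, np.dot(x_aug, w_aug))
True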

Figure 5.3 is a rather neat visualization of the analogy between some of the signals within a biological neuron in your brain and the signals of an artificial neuron used for deep learning. If you want to get deep, think about how you are using a biological neuron to read this book about natural language processing to learn about deep learning.[4]

4

Natural language understanding (NLU) is a term often used in academic circles to refer to natural language processing when that processing appears to demonstrate that the machine understands natural language text. Word2vec embeddings are one example of a natural language understanding task. Question answering and reading comprehension tasks also demonstrate understanding. Neural networks in general are very often associated with natural language understanding.

Figure 5.3. A perceptron and a biological neuron

And in mathematical terms, the output of your perceptron, denoted f(x), looks like

Equation 5.1. Threshold activation function

f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i w_i > \text{threshold} \\ 0 & \text{otherwise} \end{cases}

Tip

The sum of the pairwise multiplications of the input vector (X) and the weight vector (W) is exactly the dot product of the two vectors. This is the most basic reason linear algebra factors so heavily in the development of neural networks. Another side effect of this matrix multiplication structure of a perceptron is that modern GPUs, which are hyper-optimized for linear algebra operations, turn out to be super-efficient at running neural networks.
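A quick sketch of that equivalence, with arbitrary values:

>>> import numpy as np
>>> X = np.array([1, .2, .1])
>>> W = np.array([.2, .12, .4])
>>> np.isclose(sum(x * w for x, w in zip(X, W)), np.dot(X, W))
True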

Your perceptron hasn’t learned anything just yet. But you have achieved something quite important. You’ve passed data into a model and received an output. That output is likely wrong, given you said nothing about where the weight values come from. But this is where things will get interesting.

Tip

The base unit of any neural network is the neuron. And the basic perceptron is a special case of the more generalized neuron. We refer to the perceptron as a neuron for now.

A Pythonic neuron

Calculating the output of the neuron described earlier is straightforward in Python. You can also use the numpy dot function to multiply your two vectors together:

>>> import numpy as np
 
>>> example_input = [1, .2, .1, .05, .2]
>>> example_weights = [.2, .12, .4, .6, .90]
 
>>> input_vector = np.array(example_input)
>>> weights = np.array(example_weights)
>>> bias_weight = .2
 
>>> activation_level = np.dot(input_vector, weights) + \
...     (bias_weight * 1)                                1
>>> activation_level
0.674

  • 1 The multiplication by one (* 1) is just to emphasize that the bias_weight is like all the other weights: it’s multiplied by an input value; it’s just that the bias input feature value is always 1.

With that, if you use a simple threshold activation function and choose a threshold of .5, your next step is the following:

>>> threshold = 0.5
>>> if activation_level >= threshold:
...    perceptron_output = 1
... else:
...    perceptron_output = 0
>>> perceptron_output
1

Given the example_input, and that particular set of weights, this perceptron will output 1. But if you have several example_input vectors and the associated expected outcomes with each (a labeled dataset), you can decide if the perceptron is correct or not for each guess.

Class is in session

So far you have set up a path toward making predictions based on data, which sets the stage for the main act: machine learning. The weight values have been brushed off as arbitrary so far. In reality, they are the key to the whole structure, and you need a way to “nudge” the weights up and down based on the result of the prediction for a given example.

The perceptron learns by altering the weights up or down as a function of how wrong the system’s guess was for a given input. But from where does it start? The weights of an untrained neuron start out random! Random values, near zero, are usually chosen from a normal distribution. In the preceding example, you can see why starting the weights (including the bias weight) at zero would lead only to an output of zero. But by establishing slight variations, without giving any single path through the neuron too much power, you give the neuron a foothold: somewhere to start being right and somewhere to start being wrong.

And from there you can start to learn. Many different samples are shown to the system, and each time the weights are readjusted a small amount based on whether the neuron output was what you wanted or not. With enough examples (and under the right conditions), the error should tend toward zero, and the system learns.

The trick is, and this is the key to the whole concept, that each weight is adjusted by how much it contributed to the resulting error. A larger weight (which lets that data point affect the result more) should be blamed more for the rightness/wrongness of the perceptron’s output for that given input.

Let’s assume that your earlier example_input should have resulted in a 0 instead:

>>> expected_output = 0
>>> new_weights = []
>>> for i, x in enumerate(example_input):
...     new_weights.append(weights[i] + (expected_output -
...         perceptron_output) * x)                          1
>>> weights = np.array(new_weights)
 
>>> example_weights                                          2
[0.2, 0.12, 0.4, 0.6, 0.9]
>>> weights                                                  3
[-0.8  -0.08  0.3   0.55  0.7]

  • 1 For example, in the first index above: new_weight = .2 + (0 - 1) * 1 = -0.8
  • 2 Original weights
  • 3 New weights

This process of exposing the network over and over to the same training set can, under the right circumstances, lead to an accurate predictor even on input that the perceptron has never seen.

Logic is a fun thing to learn

So the preceding example was just some arbitrary numbers to show how the math goes together. Let’s apply this to a problem. It’s a trivial toy problem, but it demonstrates the basics of how you can teach a computer a concept, by only showing it labeled examples.

Let’s try to get the computer to understand the concept of logical OR. If either one side or the other of the expression is true (or both sides are), the logical OR statement is true. Simple enough. For this toy problem, you can easily model every possible example by hand (this is rarely the case in reality). Each sample consists of two signals, each of which is either true (1) or false (0). See the following listing.

Listing 5.1. OR problem setup
>>> sample_data = [[0, 0],  # False, False
...                [0, 1],  # False, True
...                [1, 0],  # True, False
...                [1, 1]]  # True, True
 
>>> expected_results = [0,  # (False OR False) gives False
...                     1,  # (False OR True ) gives True
...                     1,  # (True  OR False) gives True
...                     1]  # (True  OR True ) gives True
 
>>> activation_threshold = 0.5

You need just one tool to get started: numpy, which handles both the vector (array) multiplication and the random initialization of the weights:

>>> import numpy as np
 
>>> weights = np.random.random(2)/1000  # Small random float 0 < w < .001
>>> weights
[5.62332144e-04 7.69468028e-05]

You need a bias as well:

>>> bias_weight = np.random.random() / 1000
>>> bias_weight
0.0009984699077277136

Then you can pass it through your pipeline and get a prediction for each of your four samples. See the following listing.

Listing 5.2. Perceptron random guessing
>>> for idx, sample in enumerate(sample_data):
...     input_vector = np.array(sample)
...     activation_level = np.dot(input_vector, weights) + \
...         (bias_weight * 1)
...     if activation_level > activation_threshold:
...         perceptron_output = 1
...     else:
...         perceptron_output = 0
...     print('Predicted {}'.format(perceptron_output))
...     print('Expected: {}'.format(expected_results[idx]))
...     print()
Predicted 0
Expected: 0
 
Predicted 0
Expected: 1
 
Predicted 0
Expected: 1
 
Predicted 0
Expected: 1

Your random weight values didn’t help your little neuron out that much—one right and three wrong. Let’s send it back to school. Instead of just printing 1 or 0, you’ll update the weights at each iteration. See the following listing.

Listing 5.3. Perceptron learning
>>> for iteration_num in range(5):
...     correct_answers = 0
...     for idx, sample in enumerate(sample_data):
...         input_vector = np.array(sample)
...         activation_level = np.dot(input_vector, weights) + \
...             (bias_weight * 1)
...         if activation_level > activation_threshold:
...             perceptron_output = 1
...         else:
...             perceptron_output = 0
...         if perceptron_output == expected_results[idx]:
...             correct_answers += 1
...         new_weights = []
...         for i, x in enumerate(sample):                                1
...             new_weights.append(weights[i] + (expected_results[idx] -
...                 perceptron_output) * x)
...         bias_weight = bias_weight + ((expected_results[idx] -
...             perceptron_output) * 1)                                   2
...         weights = np.array(new_weights)
...     print('{} correct answers out of 4, for iteration {}'
...         .format(correct_answers, iteration_num))
3 correct answers out of 4, for iteration 0
2 correct answers out of 4, for iteration 1
3 correct answers out of 4, for iteration 2
4 correct answers out of 4, for iteration 3
4 correct answers out of 4, for iteration 4

  • 1 This is where the magic happens. There are more efficient ways of doing this, but you broke it out into a loop to reinforce that each weight is updated in proportion to its input (xi). If an input was small or zero, the effect on that weight would be minimal, regardless of the magnitude of the error. And conversely, the effect would be large if the input was large.
  • 2 The bias weight is updated as well, just like those associated with the inputs.

Haha! What a good student your little perceptron is. By updating the weights in the inner loop, the perceptron is learning from its experience of the dataset. After the first iteration, it got two more correct (three out of four) than it did with random guessing (one out of four).

In the second iteration, it overcorrected the weights (changed them too much) and had to learn to backtrack with its adjustment of the weights. By the time the fourth iteration completed, it had learned the relationships perfectly. The subsequent iterations do nothing to update the network, as there is an error of 0 at each sample, so no weight adjustments are made.

This is what is known as convergence. A model is said to converge when its error function settles to a minimum, or at least a consistent value. Sometimes you’re not so lucky. Sometimes a neural network bounces around looking for optimal weights to satisfy the relationships in a batch of data and never converges. Later in the book, you’ll see how an objective function or loss function affects what your neural net “thinks” are the optimal weights.

Next step

The basic perceptron has an inherent flaw. If the data isn’t linearly separable, that is, if no straight line (or flat plane) can divide the classes, the model won’t converge, and it won’t have any useful predictive power. It won’t be able to predict the target variable accurately.

Early experiments were successful at learning to classify images based solely on example images and their classes. The initial excitement of the concept was quickly tempered by the work of Minsky and Papert,[5] who showed the perceptron was severely limited in the kinds of classifications it can make. Minsky and Papert showed that if the data samples weren’t linearly separable into discrete groups, the perceptron wouldn’t be able to learn to classify the input data.

5

Perceptrons by Minsky and Papert, 1969

Linearly separable data points (as shown in figure 5.4) are no problem for a perceptron. Crossed-up data will cause a single-neuron perceptron to forever spin its wheels without ever learning to predict anything better than a coin flip. It’s not possible to draw a single line between your two classes (dots and Xs) in figure 5.5.

Figure 5.4. Linearly separable data

Figure 5.5. Nonlinearly separable data

A perceptron finds a linear equation that describes the relationship between your dataset’s features and its target variable. A perceptron is just doing linear regression; it cannot describe a nonlinear equation or a nonlinear relationship.

Local vs global minimum

When a perceptron converges, it can be said to have found a linear equation that describes the relationship between the data and the target variable. It doesn’t, however, say anything about how good this descriptive linear equation is, or how “minimum” the cost is. If there are multiple solutions, multiple possible cost minimums, it will settle on one particular minimum determined by where its weights started. This is called a local minimum because it’s the best (smallest cost) that could be found near where the weights started. It may not be the global minimum, which is the best you could ever find by searching all the possible weights. In most cases it’s not possible to know if you’ve found the global minimum.

A lot of relationships between data values aren’t linear, and there’s no good linear regression or linear equation that describes those relationships. And many datasets aren’t linearly separable into classes with lines or planes. Because most data in the world isn’t cleanly separable with lines and planes, the “proof” Minsky and Papert published relegated the perceptron to the storage shelves.

But the perceptron idea didn’t die easily. It resurfaced when the Rumelhart-McClelland collaboration (which Geoffrey Hinton was involved in)[6] showed you could use the idea to solve the XOR problem with multiple perceptrons in concert.[7] The problem you solved earlier with a single perceptron and no multilayer backpropagation was the simpler OR problem. The key breakthrough by Rumelhart and McClelland was the discovery of a way to allocate the error appropriately to each of the perceptrons. The way they did this was to use an old idea called backpropagation. With this idea for backpropagation across layers of neurons, the first modern neural network was born.

6

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature, 323, 533–536.

7

See the Wikipedia article “The XOR affair” (https://en.wikipedia.org/wiki/Perceptrons_(book)#The_XOR_affair).


Note

The code in listing 5.3 solved the OR problem with a single perceptron. The table of 1s and 0s in listing 5.1 that our perceptron learned was the output of binary OR logic. The XOR problem slightly alters that table to try to teach the perceptron how to mimic an Exclusive OR logic gate. If you changed the correct answer for the last example from a 1 (True) to a 0 (False) to represent XOR logic, that makes the problem a lot harder. The examples in each class (0 or 1) aren’t linearly separable without adding an additional neuron to our neural network. The classes are diagonal from each other in our two-dimensional feature vector space (similar to figure 5.5), so there’s no line you can draw that separates 1s (logic Trues) from 0s (logic Falses).
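To see the problem concretely, here’s a sketch that reruns the training loop from listing 5.3 with the XOR labels (condensed a little; the iteration count is an arbitrary choice). Because no line can separate the classes, the correct-answer count bounces around and never settles at 4 out of 4, no matter how long you let it run:

>>> import numpy as np
>>> sample_data = [[0, 0], [0, 1], [1, 0], [1, 1]]
>>> expected_results = [0, 1, 1, 0]    # XOR: only the last label differs from OR
>>> activation_threshold = 0.5
>>> weights = np.random.random(2) / 1000
>>> bias_weight = np.random.random() / 1000
>>> for iteration_num in range(10):
...     correct_answers = 0
...     for idx, sample in enumerate(sample_data):
...         input_vector = np.array(sample)
...         activation_level = np.dot(input_vector, weights) + bias_weight
...         perceptron_output = int(activation_level > activation_threshold)
...         if perceptron_output == expected_results[idx]:
...             correct_answers += 1
...         error = expected_results[idx] - perceptron_output
...         weights = weights + error * input_vector    # same update rule as listing 5.3
...         bias_weight += error
...     print('{} correct answers out of 4, for iteration {}'
...         .format(correct_answers, iteration_num))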

Even though they could solve complex (nonlinear) problems, neural networks were, for a time, too computationally expensive. It was seen as a waste of precious computational power to require two perceptrons and a bunch of fancy backpropagation math to solve the XOR problem, a problem that can be solved with a single logic gate or a single line of code. They proved impractical for common use, and they found their way back to the dusty shelves of academia and supercomputer experimentation. This began the second “AI Winter”[8] that lasted from around 1990 to about 2010.[9] But eventually computing power, backpropagation algorithms, and the proliferation of raw data, like labeled images of cats and dogs,[10] caught up. Computationally expensive algorithms and limited datasets were no longer show-stoppers. Thus the third age of neural networks began.

8

9

See the web page titled “Philosophical Transactions of the Royal Society B: Biological Sciences” (http://rstb.royalsocietypublishing.org/content/365/1537/177.short).

10

See the PDF “Learning Multiple Layers of Features from Tiny Images” by Alex Krizhevsky (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222.9220&rep=rep1&type=pdf).

But back to what they found.

Emergence from the Second AI winter

As with most great ideas, this one bubbled back to the surface eventually. It turns out that the basic idea behind the perceptron can be extended to overcome the basic limitation that doomed it at first. The idea is to gather multiple perceptrons together and feed the input into one (or several) perceptrons. Then you can feed the output of those perceptrons into more perceptrons before finally comparing the output to the expected value. This system (a neural network) can learn more complex patterns and overcome the challenge of classes that aren’t linearly separable, like in the XOR problem. The key question is: how do you update the weights in the earlier layers?

Let’s pause for a moment and formalize an important part of the process. So far we’ve discussed errors and how much the prediction was off base for a perceptron. Measuring this error is the job of a cost function, or loss function. A cost function, as you have seen, quantifies the mismatch between the correct answer the network should output (y) and the actual output f(x) for the corresponding “question” (x) put into the network. The loss function tells us how often our network output the wrong answer and how wrong those answers were. Equation 5.2 is one example of a cost function, just the error between the truth and your model’s prediction:

Equation 5.2. Error between truth and prediction

\mathrm{err}(x) = |y - f(x)|

The goal in training a perceptron, or a neural network in general, is to minimize this cost function across all available input samples:

Equation 5.3. Cost function you want to minimize

J(W) = \sum_{i=1}^{n} |y_i - f(x_i)| \quad \text{with the training goal} \quad \min_W J(W)

You’ll soon see other cost functions, such as mean squared error, but you won’t have to decide on the best cost function. It’s usually already decided for you within most neural network frameworks. The most important thing to grasp is the idea that minimizing a cost function across a dataset is your ultimate goal. Then the rest of the concepts presented here will make sense.
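As a concrete sketch (the function name is ours), here’s the cost of equation 5.3 for the random guesses your perceptron made in listing 5.2, computed in plain Python:

>>> def cost(targets, predictions):
...     return sum(abs(t - p) for t, p in zip(targets, predictions))
>>> cost([0, 1, 1, 1], [0, 0, 0, 0])    # expected vs. predicted, from listing 5.2
3

Training is the process of nudging the weights to drive that number down.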

Backpropagation

Hinton and his colleagues decided there was a way to use multiple perceptrons at the same time with one target. They showed that this approach could solve problems that weren’t linearly separable. They could now approximate nonlinear functions as well as linear ones.

But how in the world do you update the weights of these various perceptrons? What does it even mean to have contributed to an error? Say two perceptrons sit next to each other and each receives the same input. No matter what you do with the output (concatenate it, add it, multiply it), when you try to push the error back to the initial weights, it will be a function of the input (which was identical for both), so the two would be updated by the same amount at each step and you’d never get anywhere. Your neurons would be redundant. They’d both end up with the same weights, and your network wouldn’t learn very much.

The concept gets even more mind bending when you imagine a perceptron that feeds into a second perceptron as the second’s input. Which is exactly what you’re going to do.

Backpropagation helps you solve this problem, but you have to tweak your perceptron a little to get there. Remember, the weights were updated based on how much they contributed to the overall error. But if a weight is affecting an output that becomes the input for another perceptron, you no longer have a clear idea of what the error is at the beginning of that second perceptron.

You need a way to calculate the amount a particular weight (w1i in figure 5.6) contributed to the error given that it contributed to the error via other weights (w1j) and (w2j) in the next layer. And the way to do that is with backpropagation.

Now is a good time to stop using the term “perceptron,” because you’re going to change how the weights in each neuron are updated. From here on out, we’ll refer to the more general neuron that includes the perceptron, but also its more powerful relatives. You’ll also see neurons referred to as cells or nodes in the literature, and in most cases the terms are interchangeable.

Figure 5.6. Neural net with hidden weights

A neural network, regardless of flavor, is nothing more than a collection of neurons with connections between them. We often organize them into layers, but that’s not required. Once you have an architecture where the output of a neuron becomes the input of another neuron, you begin to talk about hidden neurons and layers versus an input or output layer or neuron.

This is called a fully connected network. Though not all the connections are shown in figure 5.7, in a fully connected network each input element has a connection to every neuron in the next layer. And every connection has an associated weight. So in a network that takes a four-dimensional vector as input and has 5 neurons, there will be 20 total weights in the layer (4 weights for the connections to each of the 5 neurons).

Figure 5.7. Fully connected neural net
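You can sanity-check that weight count with Keras, which you’ll meet properly in section 5.1.7; note that count_params also includes the 5 bias weights on top of the 20 connection weights:

>>> from keras.models import Sequential
>>> from keras.layers import Dense
>>> model = Sequential()
>>> model.add(Dense(5, input_dim=4))
>>> model.count_params()    # 4 * 5 connection weights + 5 bias weights
25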

As with the input to the perceptron, where there was a weight for each input, the neurons in the second layer of a neural network have a weight assigned not to the original input, but to each of the outputs from the first layer. So now you can see the difficulty in calculating the amount a first-layer weight contributed to the overall error. The first-layer weight has an effect that is passed through not just a single other weight but through one weight in each of the next layer’s neurons. The derivation and mathematical details of the algorithm itself, although extremely interesting, are beyond the scope of this book, but we take a brief moment for an overview so you aren’t left completely in the dark about the black box of neural nets.

Backpropagation, short for backpropagation of the errors, describes how you can discover the appropriate amount to update a specific weight, given the input, the output, and the expected value. Propagation, or forward propagation, is an input flowing “forward” through the net and computing the output for the network for that input. To get to backpropagation, you first need to change the perceptron’s activation function to something that is slightly more complex.

Until now, you have been using a step function as your artificial neuron’s activation function. But as you’ll see in a moment, backpropagation requires an activation function that is nonlinear and continuously differentiable.[11] With an activation like the commonly used sigmoid function shown in equation 5.4, each neuron outputs a value anywhere in a range between two values, such as 0 and 1:

11

A continuously differentiable function is even more smooth than a differentiable function. See the Wikipedia article “Differentiable function” (https://en.wikipedia.org/wiki/Differentiable_function#Differentiability_and_continuity).

Equation 5.4. Sigmoid function

S(x) = \frac{1}{1 + e^{-x}}
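In numpy, the sigmoid and its derivative take only a couple of lines. This is a sketch of the convenient identity S'(x) = S(x) * (1 - S(x)), which is what makes sigmoid derivatives so cheap to compute during backpropagation:

>>> import numpy as np
>>> def sigmoid(x):
...     return 1 / (1 + np.exp(-x))
>>> def sigmoid_prime(x):
...     s = sigmoid(x)
...     return s * (1 - s)
>>> sigmoid(0), sigmoid_prime(0)
(0.5, 0.25)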

Why does your activation function need to be nonlinear?

Because you want your neurons to be able to model nonlinear relationships between your feature vectors and the target variable. If all a neuron could do is multiply inputs by weights and add them together, the output would always be a linear function of the inputs and you couldn’t model even the simplest nonlinear relationships.

But the threshold function you used for your neurons earlier was a nonlinear step function. So the neurons you used before could theoretically be trained to work together to model nearly any nonlinear relationship... as long as you had enough neurons.

That’s the advantage of a nonlinear activation function; it allows a neural net to model a nonlinear relationship. And a continuously differentiable nonlinear function, like a sigmoid, allows the error to propagate smoothly back through multiple layers of neurons, speeding up your training process. Sigmoid neurons are quick learners.

There are many other activation functions, such as hyperbolic tangent and rectified linear units; they all have benefits and downsides. Each shines in different ways for different neural network architectures, as you’ll learn in later chapters.

So why differentiable? If you can calculate the derivative of the function, you can also do partial derivatives of the function, with respect to various variables in the function itself. The hint of the magic is “with respect to various variables.” You have a path toward updating a weight with respect to the amount of input it received!

Differentiate all the things

You’ll start with the error of the network and apply a cost function, say squared error, as shown in equation 5.5:

Equation 5.5. Mean squared error

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2

You can then lean on the chain rule of calculus to calculate the derivative of compositions of functions, as in equation 5.6. And the network itself is nothing but a composition of functions (specifically dot products followed by your new nonlinear activation function at each step):

Equation 5.6. Chain rule

\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)

You can now use this formula to find the derivative of the activation function of each neuron with respect to the input that fed it. You can calculate how much that weight contributed to the final error and adjust it appropriately.

If the layer is the output layer, the update of the weights is rather straightforward, with the help of your easily differentiable activation function. For the weight w_ij connecting the output y_i of neuron i in the previous layer to output neuron j, the update is

Equation 5.7. Error derivative

\Delta w_{ij} = \alpha \, (t_j - f(net_j)) \, f'(net_j) \, y_i

where net_j = \sum_i w_{ij} y_i is the total input to neuron j and t_j is the target value for the j-th output.

If you’re updating the weights of a hidden layer, things are a little more complex, as you can see in equation 5.8:

Equation 5.8. Derivative of the previous layer

\delta_i = f'(net_i) \sum_j w_{ij} \, \delta_j \qquad \Delta w_{ki} = \alpha \, \delta_i \, y_k

where \delta_j = (t_j - f(net_j)) \, f'(net_j) is the output-layer delta from equation 5.7 and y_k is the output feeding weight w_ki.

The f in these equations is the activation function, and net is the weighted sum of a neuron’s inputs. So the update for any weight is α (the learning rate) times the output of the earlier layer (the input that the weight carries) times a delta term from the later layer: the derivative of the activation function scaled by that neuron’s share of the error. A hidden neuron can’t see the output error directly, so the sum in equation 5.8 gathers its share of the blame from all the next-layer neurons it feeds, through the weights that connect them.

It’s important to be specific about when the changes are applied to the weights themselves. As you calculate each weight update in each layer, the calculations all depend on the network’s state during the forward pass. Once the error is calculated, you then calculate the proposed change to each weight in the network. But do not apply any of them, at least not until you get all the way back to the beginning of the network. Otherwise, as you update weights toward the end of the net, the derivatives calculated for the lower layers will no longer be the appropriate gradient for that particular input. You can also aggregate all the ups and downs for each weight across every training sample, without updating any of the weights, and instead update them once at the end of all the training; we discuss that choice further in section 5.1.6.

And then to train the network, pass in all the inputs. Get the associated error for each input. Backpropagate those errors to each of the weights. And then update each weight with the total change in error. After all the training data has gone through the network once, and the errors are backpropagated, we call this an epoch of the neural network training cycle. The dataset can then be passed in again and again to further refine the weights. Be careful, though, or the weights will overfit the training set and no longer be able to make meaningful predictions on new data points from outside the training set.

In equations 5.7 and 5.8, α is the learning rate. It determines how much of the observed error in the weight is corrected during a particular training cycle (epoch) or batch of data. It usually remains constant during a single training cycle, but some sophisticated training algorithms will adjust it adaptively to speed up the training and ensure convergence. If α is too large, you could easily overcorrect. Then the next error, presumably larger, would itself lead to a large weight correction the other way, but even further from the goal. Set α too small and the model will take too long to converge to be practical, or worse, it will get stuck in a local minimum on the error surface.
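Here’s a minimal numpy sketch of the whole backpropagation loop applied to the XOR problem: one hidden layer of sigmoid neurons, squared-error cost, and all weight updates applied only after every delta has been computed from the forward-pass state. The layer size, learning rate, and epoch count are arbitrary choices for illustration, and a given random start may need more epochs (or a restart) to converge:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

alpha = 0.5                                     # learning rate
w1 = np.random.normal(scale=0.5, size=(2, 10))  # input -> hidden weights
b1 = np.zeros(10)
w2 = np.random.normal(scale=0.5, size=(10, 1))  # hidden -> output weights
b2 = np.zeros(1)

for epoch in range(5000):
    hidden = sigmoid(x.dot(w1) + b1)            # forward propagation
    output = sigmoid(hidden.dot(w2) + b2)

    delta_output = (output - y) * output * (1 - output)            # error * S'(x)
    delta_hidden = delta_output.dot(w2.T) * hidden * (1 - hidden)  # equation 5.8

    w2 -= alpha * hidden.T.dot(delta_output)    # apply the updates only after
    b2 -= alpha * delta_output.sum(axis=0)      # all the deltas are computed
    w1 -= alpha * x.T.dot(delta_hidden)
    b1 -= alpha * delta_hidden.sum(axis=0)

print(output.round(2))    # should be close to [[0], [1], [1], [0]]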

5.1.4. Let’s go skiing—the error surface

The goal of training in neural networks, as we stated earlier, is to minimize a cost function by finding the best parameters (weights). Keep in mind, this isn’t the error for any one particular data point. You want to minimize the cost for all the various errors taken together.

Creating a visualization of this side of the problem can help build a mental model of what you’re doing when you adjust the weights of the network as you go.

From earlier, mean squared error is a common cost function (shown back in equation 5.5). If you imagine plotting the error as a function of the possible weights, given a set of inputs and a set of expected outputs, a point exists where that function is closest to zero. That point is your minimum—the spot where your model has the least error.

This minimum will be the set of weights that gives the optimal output for a given training example. You will often see this represented as a three-dimensional bowl with two of the axes being a two-dimensional weight vector and the third being the error (see figure 5.8). That description is a vast simplification, but the concept is the same in higher dimensional spaces (for cases with more than two weights).

Figure 5.8. Convex error curve

Similarly, you can graph the error surface as a function of all possible weights across all the inputs of a training set. But you need to tweak the error function a little. You need something that represents the aggregate error across all inputs for a given set of weights. For this example, you’ll use mean squared error as the z axis (see equation 5.5).

Here again, you’ll get an error surface with a minimum located at the set of weights that best fits the entire training set.

5.1.5. Off the chair lift, onto the slope

What does this visualization represent? At each epoch, the algorithm is performing gradient descent in trying to minimize the error. Each time, you adjust the weights in a direction that will hopefully reduce your error the next time. A convex error surface is great. Stand on the ski slope, look around, find out which way is down, and go that way!

But you’re not always so lucky as to have such a smoothly shaped bowl. The error surface may have some pits and divots scattered about. This situation is what is known as a nonconvex error curve. And, as in skiing, if these pits are big enough, they can suck you in and you might not reach the bottom of the slope.

Again, the diagrams are representing weights for two-dimensional input. But the concept is the same if you have a 10-dimensional input, or 50, or 1,000. In those higher dimensional spaces, visualizing it doesn’t make sense anymore, so you trust the math. Once you start using neural networks, visualizing the error surface becomes less important. You get the same information from watching (or plotting) the error or a related metric over the training time and seeing if it’s trending toward 0. That will tell you if your network is on the right track or not. But these 3D representations are a helpful tool for creating a mental model of the process.

But what about the nonconvex error space? Aren’t those divots and pits a problem? Yes, yes they are. Depending on where you randomly start your weights, you could end up at radically different weights and the training would stop, as there’s no other way to go down from this local minimum (see figure 5.9).

And as you get into even higher dimensional space, the local minima will follow you there as well.

Figure 5.9. Nonconvex error curve

5.1.6. Let’s shake things up a bit

Up until now, you have been aggregating the error for all the training examples and skiing down the slope as best you could. This training approach, as described, is batch learning: the entire training set is treated as one big batch, and the weights are updated once per pass over it. But batch learning has a static error surface for the entire batch. With this single static surface, if you only head downhill from a random starting point, you could end up in some local minimum (divot or hole) and not know that better options exist for your weight values. Two other options for training can help you skirt these traps.

The first option is stochastic gradient descent. In stochastic gradient descent, you update the weights after each training example, rather than after looking at all the training examples. And you reshuffle the order of the training examples each time through. By doing this, the error surface is redrawn for each example, as each different input could have a different expected answer. So the error surface for most examples will look different. But you’re still just adjusting the weights based on gradient descent, for that example. Instead of gathering up the errors and then adjusting the weights once at the end of the epoch, you update the weights after every individual example. The key point is that you’re moving toward the presumed minimum (not all the way to that presumed minimum) at any given step.

And as you move toward the various minima on this fluctuating surface, with the right data and right hyperparameters, you can more easily bumble toward the global minimum. If your model isn’t tuned properly or the training data is inconsistent, the model won’t converge, and you’ll just spin and turn over and over and the model never learns anything. But in practice stochastic gradient descent proves quite effective in avoiding local minima in most cases. The downfall of this approach is that it’s slow. Calculating the forward pass and backpropagation, and then updating the weights after each example, adds that much time to an already slow process.

The more common approach, your second training option, is mini-batch. In mini-batch training, a small subset of the training set is passed in and the associated errors are aggregated as in full batch. Those errors are then backpropagated as with batch and the weights updated for each subset of the training set. This process is repeated with the next batch, and so on until the training set is exhausted. And that again would constitute one epoch. This is a happy medium; it gives you the benefits of both batch (speedy) and stochastic (resilient) training methods.
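If you’re wondering how you’ll choose among these three options in practice: in Keras, which you’ll meet in section 5.1.7, the choice comes down to the batch_size argument of model.fit. Roughly, reusing the names from the upcoming listing 5.5:

>>> model.fit(x_train, y_train, epochs=100, batch_size=1)             # stochastic
>>> model.fit(x_train, y_train, epochs=100, batch_size=32)            # mini-batch
>>> model.fit(x_train, y_train, epochs=100, batch_size=len(x_train))  # full batch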

Although the details of how backpropagation works are fascinating,[12] they aren’t trivial, and as noted earlier they’re outside the scope of this book. But a good mental image to keep handy is that of the error surface. In the end, a neural network is just a way to walk down the slope of the bowl as fast as possible until you’re at the bottom. From a given point, look around you in every direction, find the steepest way down (not a pleasant image if you’re scared of heights), and go that way. At the next step (batch, mini-batch, or stochastic), look around again, find the steepest way, and now go that way. Soon enough, you’ll be by the fire in the ski lodge at the bottom of the valley.

12

5.1.7. Keras: Neural networks in Python

Writing a neural network in raw Python is a fun experiment and can be helpful in putting all these pieces together, but Python is at a disadvantage regarding speed, and the sheer number of calculations you’re dealing with can make even moderately sized networks intractable. Many Python libraries, though, get you around this speed barrier: PyTorch, Theano, TensorFlow, Lasagne, and many more. The examples in this book use Keras (https://keras.io/).

Keras is a high-level wrapper with an accessible API for Python. The exposed API can be used with three different backends almost interchangeably: Theano, TensorFlow from Google, and CNTK from Microsoft. Each has its own low-level implementation of the basic neural network elements and highly tuned linear algebra libraries that handle the dot products and matrix multiplications of neural networks as efficiently as possible.

Let’s look at the simple XOR problem and see if you can train a network using Keras.

Listing 5.4. XOR Keras network
>>> import numpy as np
>>> from keras.models import Sequential                 1
>>> from keras.layers import Dense, Activation          2
>>> from keras.optimizers import SGD                    3
>>> # Our examples for an exclusive OR.
>>> x_train = np.array([[0, 0],
...                     [0, 1],
...                     [1, 0],
...                     [1, 1]])                        4
>>> y_train = np.array([[0],
...                     [1],
...                     [1],
...                     [0]])                           5
>>> model = Sequential()
>>> num_neurons = 10                                    6
>>> model.add(Dense(num_neurons, input_dim=2))          7
>>> model.add(Activation('tanh'))
>>> model.add(Dense(1))                                 8
>>> model.add(Activation('sigmoid'))
>>> model.summary()
Layer (type)                 Output Shape              Param #
=================================================================
dense_18 (Dense)             (None, 10)                30
_________________________________________________________________
activation_6 (Activation)    (None, 10)                0
_________________________________________________________________
dense_19 (Dense)             (None, 1)                 11
_________________________________________________________________
activation_7 (Activation)    (None, 1)                 0
=================================================================
Total params: 41.0
Trainable params: 41.0
Non-trainable params: 0.0

  • 1 The base Keras model class
  • 2 Dense is a fully connected layer of neurons.
  • 3 Stochastic gradient descent, but there are others
  • 4 x_train is a list of samples of 2D feature vectors used for training.
  • 5 y_train is the desired outcomes (target values) for each feature vector sample.
  • 6 The fully connected hidden layer will have 10 neurons.
  • 7 input_dim is only necessary for the first layer; subsequent layers will calculate the shape automatically from the output dimensions of the previous layer. We have 2D feature vectors for our 2-input XOR gate examples.
  • 8 The output layer has one neuron to output a single binary classification value (0 or 1).

The model.summary() gives you an overview of the network parameters and number of weights (Param #) at each stage. Some quick math: 10 neurons, each with two weights (one for each value in the input vector), and one weight for the bias gives you 30 weights to learn. The output layer has a weight for each of the 10 neurons in the first layer and one bias weight for a total of 11 in that layer.

The next bit of code is a bit opaque:

>>> sgd = SGD(lr=0.1)
>>> model.compile(loss='binary_crossentropy', optimizer=sgd,
...     metrics=['accuracy'])

SGD is the stochastic gradient descent optimizer you imported. This is just how the model will try to minimize the error, or loss. lr is the learning rate, the fraction applied to the derivative of the error with respect to each weight. Higher values will speed learning but may force the model away from the global minimum by shooting past the goal; smaller values will be more precise but increase the training time and leave the model more vulnerable to local minima. The loss function itself is also defined as a parameter; here it’s binary_crossentropy. The metrics parameter is a list of options for the output stream during training. The compile method builds, but doesn’t yet train, the model. The weights are initialized, and you can use this random state to try to predict from your dataset, but you’ll only get random guesses:

>>> model.predict(x_train)
[[ 0.5       ]
 [ 0.43494844]
 [ 0.50295198]
 [ 0.42517585]]

The predict method gives the raw output of the last layer, which would be generated by the sigmoid function in this example.

Not much to write home about. But remember this has no knowledge of the answers just yet; it’s just applying its random weights to the inputs. So let’s try to train this. See the following listing.

Listing 5.5. Fit model to the XOR training set
>>> model.fit(x_train, y_train, epochs=100)                1
Epoch 1/100
4/4 [==============================] - 0s - loss: 0.6917 - acc: 0.7500
Epoch 2/100
4/4 [==============================] - 0s - loss: 0.6911 - acc: 0.5000
Epoch 3/100
4/4 [==============================] - 0s - loss: 0.6906 - acc: 0.5000
...
Epoch 100/100
4/4 [==============================] - 0s - loss: 0.6661 - acc: 1.0000

  • 1 This is where you train the model.
Tip

The network might not converge on the first try. The initial random weights might land the network in a spot from which the global minimum is difficult or impossible to find. If you run into this situation, you can call model.fit again with the same parameters (or add even more epochs) and see if the network finds its way eventually. Or reinitialize the network with a different random starting point and try fit from there. If you try the latter, make sure that you don’t set a random seed, or you’ll just repeat the same experiment over and over.

As it looked at this tiny dataset over and over, it finally figured out what was going on. It “learned” what exclusive OR (XOR) was, just from being shown examples! That is the magic of neural networks, and it’s what will guide you through the next few chapters:

>>> model.predict_classes(x_train)
4/4 [==============================] - 0s
[[0]
 [1]
 [1]
 [0]]
>>> model.predict(x_train)
4/4 [==============================] - 0s
[[ 0.0035659 ]
 [ 0.99123639]
 [ 0.99285167]
 [ 0.00907462]]

Calling predict again (and predict_classes) on the trained model yields better results. It gets 100% accuracy on your tiny dataset. Of course, accuracy isn’t necessarily the best measure of a predictive model, but for this toy example it will do. So in the following listing you save your ground-breaking XOR model for posterity.

Listing 5.6. Save the trained model
>>> import h5py
>>> model_structure = model.to_json()                1
 
>>> with open("basic_model.json", "w") as json_file:
...     json_file.write(model_structure)
 
>>> model.save_weights("basic_weights.h5")           2

  • 1 Export the structure of the network to a JSON blob for later use using Keras' helper method.
  • 2 The trained weights must be saved separately. The first part just saves the network structure. You must re-instantiate the same model structure to reload them later.

And there are similar methods to re-instantiate the model, so you don’t have to retrain every time you want to make a prediction, which will be huge going forward. Although this model takes a few seconds to run, in the coming chapters that will quickly grow to minutes, hours, even in some cases days depending on the hardware and the complexity of the model, so get ready!
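Reloading looks roughly like the following sketch, using Keras’ model_from_json helper and the filenames from listing 5.6; the structure is rebuilt first, and then the trained weights are poured back in:

>>> from keras.models import model_from_json
>>> with open("basic_model.json", "r") as json_file:
...     json_string = json_file.read()
>>> model = model_from_json(json_string)
>>> model.load_weights("basic_weights.h5")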

5.1.8. Onward and deepward

As neural networks have spread and spawned the entire deep learning field, much research has been done (and continues to be done) into the details of these systems:

  • Different activation functions (such as sigmoid, rectified linear units, and hyperbolic tangent)
  • Choosing a good learning rate, to dial up or down the effect of the error
  • Dynamically adjusting the learning rate using a momentum model to find the global minimum faster
  • Application of dropout, where a randomly chosen set of weights is ignored in a given training pass to prevent the model from becoming too attuned to its training set (overfitting)
  • Regularization of the weights to artificially dampen a single weight from growing or shrinking too far from the rest of the weights (another tactic to avoid overfitting)

The list goes on and on.

5.1.9. Normalization: input with style

Neural networks want a vector input and will do their best to work on whatever is fed to them, but one key thing to remember is input normalization. This is true of many machine learning models. Imagine the case of trying to classify houses, say on their likelihood of selling in a given market. You have only two data points: number of bedrooms and last selling price. This data could be represented as a vector. Say, for a two-bedroom house that last sold for $275,000:

input_vec = [2, 275000]

As the network tries to learn anything about this data, the weights associated with bedrooms in the first layer would need to grow huge quickly to compete with the large values associated with price. So it’s common practice to normalize the data so that each element retains its useful information from sample to sample. Normalization also ensures that each neuron works within a similar range of input values as the other elements within a single sample vector. Several approaches exist for normalization, such as mean normalization, feature scaling, and coefficient of variation. But the goal is to get the data in some range like [-1, 1] or [0, 1] for each element in each sample without losing information.
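Here’s a minimal min-max scaling (feature scaling) sketch in numpy; the housing numbers are made up, and scikit-learn’s MinMaxScaler does the same bookkeeping for you. Each column is squashed into [0, 1] independently, so bedrooms and price end up on equal footing:

>>> import numpy as np
>>> samples = np.array([[2, 275000.],
...                     [4, 505000.],
...                     [3, 385000.]])
>>> mins = samples.min(axis=0)
>>> maxs = samples.max(axis=0)
>>> (samples - mins) / (maxs - mins)
array([[0.        , 0.        ],
       [1.        , 1.        ],
       [0.5       , 0.47826087]])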

You won’t have to worry too much about this with NLP, as TF-IDF, one-hot encoding, and word2vec (as you’ll soon see) are normalized already. Keep it in mind for when your input feature vectors aren’t normalized (such as with raw word frequencies or counts).

Finally, a last bit of terminology. Not a great deal of consensus exists on what constitutes a perceptron versus a multi-neuron layer versus deep learning, but we’ve found it handy to stop saying “perceptron” once you have to use the activation function’s derivative to properly update the weights. In this book, we use neural network and deep learning in that context and save the term “perceptron” for its (very) important place in history.

Summary

  • Minimizing a cost function is a path toward learning.
  • A backpropagation algorithm is the means by which a network learns.
  • The amount a weight contributes to a model’s error is directly related to the amount it needs to be updated.
  • Neural networks are, at their heart, optimization engines.
  • Watch out for pitfalls (local minima) during training by monitoring the gradual reduction in error.
  • Keras helps make all of this neural network math accessible.