Introduction to artificial intelligence and deep learning
This chapter describes essential concepts of deep learning (DL) and neural networks. It also describes the most widely used and best-known neural network architectures, their purposes, and how they originated.
This chapter contains the following topics:
1.1, "Deep learning"
1.2, "Neural networks overview"
1.3, "Deep learning frameworks"
1.1 Deep learning
Humans have always searched for ways to describe reality and find solutions to all kinds of problems. From mechanical machines that facilitate and perform the hardest physical tasks to magnificent mathematical systems that solve problems that initially existed only in the mind and can be expressed only through complex mathematical language, humans have found their way to solutions along three distinct paths: experimentation, equation solving (analytical solutions), and mechanical computation (numerical solutions).
Although these paths seem distant from each other, in every age since the development of mathematics a few humans have dreamed of finding a way to merge them, mainly through human-created devices that can develop a certain degree of intelligence.
The 20th century brought the dawn of a new machine. Computers could perform calculations at a rate that let humans solve complex mathematical systems at a previously unimaginable scale.
The most promising approach to these problems is artificial intelligence (AI). AI is a scientific concept that has been glorified and promoted in literature and films. However, in recent decades, science fiction has steadily moved closer to becoming science fact in everyday places in our modern world. Although AI remains a field of science, certain capabilities that were previously labeled as science fiction have become the subjects of well-defined fields and areas of scientific study.
Probably the biggest field in AI is machine learning (ML), which can be described as a scientific representation of decision making, that is, the ability to represent a human decision process in mathematical form so that it can be implemented as a program (or algorithm) that a computer can run against a given data set. To prove and establish the algorithm, a computer must be taught or trained so that learning is based on experience (as is the case in the human context), enabling the system to interpret results and improve their accuracy. ML is closely related to statistics, and it has grown as large amounts of information have become available through digitalization and computer storage.
For example, imagine a data set of thousands of images from a traffic camera. An algorithm can be written to count the number of vehicles that are captured over time. The algorithm can be improved to count only the number of cars (ignoring trucks and bikes). The results can then be processed again to determine whether a certain car can be considered blue.
Before you consider the detection of color, an initial problem is to define an algorithm that determines whether a certain image depicts a vehicle (and if so, what type). Although the answer is obvious to the human eye, the more combinations and permutations that can be handled with more training and algorithms, the more accurate the results become. The size of the vehicle, the time of day of the photograph, the weather conditions, the density of objects (think of peak rush hour), and other variables can reduce accuracy. Consider a photograph of a road junction that is taken at peak morning rush hour, in winter, during a snow storm. The initial decision of "is there a vehicle in the photograph?" becomes a more challenging task even for human eyes.
DL is a subset of ML that attempts to model decision making by using neural networks, which mimic the way the human brain itself processes senses (sight, sound, and others).
This book covers several topics about DL, in particular the deployment, use, and optimization of IBM PowerAI, which is a set of tools that enables data scientists and system administrators to access the power of DL with little effort and get results in a shorter time.
Figure 1-1 shows how these fields relate to each other.
Figure 1-1 Deep learning in the artificial intelligence landscape
1.1.1 Artificial intelligence milestones and the development of deep learning
Deep Learning1 suggests that there are three eras in the history of DL:
1940s - 1960s: Cybernetics
1980s - 1990s: Connectionism
2006 - present: DL
These three eras show that DL is an established discipline with roots at the beginning of computer science, and one that was subject to trends during its history. Each wave was pushed by certain developments of its era.
The current wave of DL grew because of the availability of large amounts of digital data (closely related to the big data discipline) and the availability of inexpensive computing power in the form of graphics processing units (GPUs), which are highly specialized in solving complex algebraic problems.
To put this idea in a wider context, Table 1-1 provides a timeline of some of the most important milestones in this discipline. The boundaries between these disciplines are difficult to perceive clearly, so many of the articles or books can apply to one or more of the fields.
Table 1-1 Milestones for each discipline. Each milestone is listed with its date (CE) and the discipline column it belongs to: artificial intelligence (AI), machine learning (ML), or deep learning (DL).
c. 1300 (AI): Ramon Llull builds the first mechanical machine that can perform basic calculations.
1623 (AI): Wilhelm Schickard builds the first calculating machine.
c. 1700 (AI): Gottfried Leibniz extends the concept of the calculating machine.
1937 (AI): Alan Turing's paper On Computable Numbers, with an Application to the Entscheidungsproblem, establishes the foundation of the Theory of Computation.1
1943 (AI): McCulloch and Pitts' formal design for Turing-complete artificial neurons is the first published paper that covers an AI topic.2 This text marks the beginning of the development of theories of biological learning.
1956 (AI): The AI research field is born as a workshop at Dartmouth College.
1958 (AI): The perceptron algorithm is invented by Frank Rosenblatt,3 enabling the training of a single neuron.
1965 (DL): The first general, working learning algorithm for supervised, deep, feed forward, multilayer perceptrons (MLPs) is published by Ivakhnenko and Lapa.4
1980 (AI): Commercial use of expert systems grows.
1986 (DL): Rina Dechter introduces the term deep learning.5
1986 (DL): Rumelhart, et al., propose the concept of back-propagation6 to train neural networks with one or two hidden layers.
1990 (ML): A simple ML algorithm can determine whether a Cesarean section is recommended.7
1990s (ML): The ML discipline flourishes because of the increasing availability of digitalized information.
1997 (AI): IBM Deep Blue defeats the reigning Chess World Champion Garry Kasparov.
2000 (DL): Igor Aizenberg, et al., introduce the term deep learning in the context of artificial neural networks.8
2005 (DL): Gomez and Schmidhuber publish a paper on reinforcement learning with neural networks.9
2006 (DL): Hinton, et al., show how to train deep neural networks effectively by combining layer-wise pretraining with supervised fine-tuning through back-propagation.
2010 (ML): IBM Watson® defeats human contestants in a Jeopardy! quiz exhibition show.10
2015 (DL): AlphaGo beats the reigning Go World Champion.11

4 Ivakhnenko, A. G., “Cybernetic Predicting Devices”, CCM Information Corporation, 1973
8 Aizenberg, et al., "Multi-Valued and Universal Binary Neurons: Theory, Learning, and Applications", Springer Science & Business Media, 2000
1.2 Neural networks overview
This section describes neural networks.
1.2.1 A brief history of neural networks
Artificial neural networks have been around for some time, and they have evolved over time. Today, neural networks play an important role in the AI realm, as described in 1.2.2, “Why neural networks are an important subject” on page 6.
According to Yadav,2 the history of neural networks can be divided into four major eras, starting in the 1940s, when the first article about the subject was published in 1943. That article showed that neural networks can compute any arithmetic or logical function. Researchers then looked into brain-like methods of learning as a promising way to create learning algorithms.
The 1950s and 1960s were marked by the first neurocomputer, called the Snark, which was built by Marvin Minsky in 1951. What characterized these years as the first golden age of neural networks was the first successful neurocomputer, built in 1957 - 58 by Frank Rosenblatt, Charles Wightman, and others, which changed the direction of the field into what we know today.
Unfortunately, by the end of the 1960s, neural network studies faltered because most researchers were working on experimental studies that were not applicable to real scenarios.
During the 1970s, researchers continued experimenting in the areas of adaptive signal processing, biological modeling, and pattern recognition. Many of the scientists who started their work in this period became the ones who revived the neural network field about a decade later.
In the 1980s, after years of research, new proposals that focused on neurocomputers and neural network applications started to be submitted. Between 1983 and 1986, the physicist John Hopfield became involved with neural networks and published two papers that motivated several experts in the area to join the field.
In 1986, when the Parallel Distributed Processing (PDP) books3 were published, neural networks became a top research subject, and they have been researched constantly since then.
1.2.2 Why neural networks are an important subject
Most neural network proposals have centered on theories and how to improve them, even when research into their applicability produced successful results. Until recently, there was not enough machine, processing, and storage power to support neural network training.
To train a neural network to recognize and classify different objects, you must use many images of the objects to achieve good accuracy, which is even more important in real-life scenarios. For example, in the image recognition field, large quantities of images representing types of dogs are required, and equally important are images that are not dogs, so that the neural network is trained in a suitable way.
Figure 1-2 shows how storage capacity grew two times from the 1950s to the 1980s and more than 20,000 times from the 1980s to today.
Figure 1-2 Storage capacity growth
Also, the price dropped exponentially (Figure 1-3). For example, the IBM 305 RAMAC was leased for USD 3,200 a month in the late 1950s, which is about USD 27,287 in 2016 dollars.4 These changes in storage contributed to today's scenario, where you can start using neural networks and extract their benefits in real-world applications.
Figure 1-3 Graph showing the decrease of hard disk drive costs
Besides storage, you must run several epochs over the data (loops through the data) to adjust all the neural network's weights and biases so that you can ensure a higher accuracy rate. Processing power has become increasingly accessible and cheaper.
Figure 1-4 shows how processing has grown and is about to surpass human brain capacity.
Figure 1-4 Processing capacity growth
In the current context, neural networks use CPU and GPU processing in a distributed architecture, which enables the system to scale horizontally by using on-premises or cloud environments. Storage can also be used in a distributed fashion, which increases the speed of data access by exploiting the principle of locality and memory, reducing disk I/O and network traffic.
1.2.3 Types of neural networks and their usage
There are several different types or architectures of neural networks. This section describes fundamentals and briefly reviews the main architectures.
As the name implies, neural networks were inspired by the human brain, and the idea is to imitate how the brain works. Scientists still have much to learn about the human brain, and the little knowledge that is available so far is enough to produce some interesting theories and implementations for an artificial neural network.
In the human brain, there are billions of neurons, and they signal each other through a structure that is called a synapse. Every neuron is a single cell that is composed of a cell body (which holds the cell nucleus), an axon (a long tail that is insulated from the environment), and many dendrites (smaller branches of the cell wall that are around the cell body and at the end of the axon). The synapse can be compared to a chemical bridge between two neurons: it is the gap between the branches of two different neurons, and it works as a chemical gate that can be opened or closed to allow or prevent the electrical signal from jumping from one neuron to another.
Neural networks are inspired by the synapse structure and how it passes and processes signals. By using this structure as a base, scientists can implement a new approach to data processing, where simple elements (neurons) are part of a bigger and more complex structure (a neural network) that can perform much more complex tasks.
In this scenario, a neural network is an artificial structure that is composed of artificial neurons that are arranged so that they form layers with a level of specialization. A neuron in a certain layer receives the inputs from the previous layer, processes them (that is, weights all input signals to generate an output signal), and passes the result to the neurons in the next layer.
Figure 1-5 shows how an artificial neuron receives several inputs and processes them according to a function (f(S)) to produce an output that is propagated to the next layer.
Figure 1-5 Signal processing in a neuron
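To make the idea concrete, the following is a minimal sketch of such an artificial neuron in Python with NumPy. The input values, weights, bias, and the choice of a sigmoid as the function f(S) are illustrative assumptions rather than values from any framework described in this book.
```python
# A minimal sketch of the artificial neuron in Figure 1-5: it weights its
# inputs, sums them, and applies an activation function f(S).
# All values below are arbitrary illustrative choices.
import numpy as np

def neuron(inputs, weights, bias):
    s = np.dot(inputs, weights) + bias      # weighted sum of the input signals
    return 1.0 / (1.0 + np.exp(-s))         # f(S): a sigmoid activation

x = np.array([0.2, 0.7, 0.1])               # signals from the previous layer
w = np.array([0.4, -0.6, 0.9])              # one weight per input connection
print(neuron(x, w, bias=0.1))               # output passed to the next layer
```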
The network expects some values as an input, so there must be input neurons. All neural networks must have output, so there also must be output neurons.
The set of input neurons is called the input layer, and the set of output neurons is called the output layer. In the brain, many neurons communicate with each other and pass information among themselves; in an artificial neural network, there are neurons between the input and the output layers, which are collectively known as the hidden layer. The following section describes the hidden layers.
Figure 1-6 illustrates a simple neural network with one hidden layer and eight neurons.
Figure 1-6 Simple feed forward neural network
After you have your neural network structure defined, you adjust each neuron's weight after each training step to conform to the provided label. For each type of neural network, there is a specific behavior and way to train and handle its neurons' weights and biases. The following section outlines the most common neural network architectures.
1.2.4 Neural network architectures
This section describes some of the most common neural network architectures.
Feed forward neural network
A feed forward neural network (FFNN) is one of the most common neural network architectures. Data comes into the input layer and is propagated forward through the hidden layers until it reaches the output layer, which produces a value based on the network's training.
An FFNN is trained by using a back-propagation algorithm. Basically, for each input, an expected value is provided to the neural network and the output is compared against this value. The difference between the expected value and the one delivered by the neural network (measured by the Mean Square Error or another cost function) is then back-propagated across all the neurons, and an optimization algorithm is used to adjust each neuron's weight with the aim of reducing the error that is found.
This process happens for the number of epochs that is defined for the training phase or until the error rate converges, which is decided by the data scientist.
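To make this training loop concrete, here is a minimal sketch in Python with NumPy, written from the description above rather than taken from any framework covered later in this book. The toy XOR data set, the 2-4-1 layer sizes, the sigmoid activation, the learning rate, and the number of epochs are illustrative assumptions.
```python
# Minimal FFNN trained with back-propagation and gradient descent.
# Everything here (data, sizes, learning rate, epochs) is illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: learn XOR with a 2-4-1 network.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)   # hidden -> output

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

lr = 1.0
for epoch in range(10000):
    # Forward pass: propagate the inputs through the hidden layer to the output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Error between the network output and the expected values.
    error = out - y

    # Back-propagate the error and compute the gradient for each weight.
    d_out = error * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent step: adjust the weights to reduce the error.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

# Approaches [[0], [1], [1], [0]]; convergence depends on the initialization.
print(np.round(out, 2))
```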
For more information about FFNN, see “The Perceptron: A probabilistic model for information storage and organization in the brain” by Rosenblatt.5
Recurrent neural network
Recurrent neural networks (RNNs) use sequential information. In an FFNN, the inputs and outputs are independent of each other. RNNs are different because they need to know the previous inputs to understand a sequence and make suggestions about it. RNNs are called recurrent because they go through the same processing (layers) for every element in a sequence. In a certain way, RNNs can store the data that has been calculated so far.
An RNN has two kinds of input: the new input, also known as the present one, and the recent past, which arrives through a feedback loop that returns the network's previous decision to the neurons in the hidden layer. These inputs are combined to produce the new output value.
Figure 1-7 illustrates how an RNN works.
Figure 1-7 How a recurrent neural network works
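The following is a minimal sketch, in Python with NumPy, of the single recurrent step described above: the present input and the previous hidden state (the recent past) are combined through the feedback weights to produce the new state. The layer sizes, random weights, and tanh activation are illustrative assumptions.
```python
# One recurrent step: the new hidden state combines the present input with
# the previous hidden state through a feedback loop. Values are illustrative.
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # Combine the present input with the recent past (previous hidden state).
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 5))          # input -> hidden weights
W_hh = rng.normal(size=(5, 5))          # hidden -> hidden (feedback) weights
b_h = np.zeros(5)

h = np.zeros(5)                         # initial hidden state
for x_t in rng.normal(size=(4, 3)):     # a toy sequence of 4 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)                                # summarizes the whole sequence so far
```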
RNNs are used in many fields, especially natural language processing (NLP), because you need to know the sequence of words in a sentence to take actions such as predicting the next word or understanding the sentence's semantic content. RNNs also allow the creation of language models that can be used to generate text, help with word suggestions, and support algorithms for language conversion.
One of the limitations of this neural network model is that it can remember only short sequences, and it cannot output values that are based on inputs from the beginning of a long sequence. This limitation was addressed by a variation of the RNN called long short-term memory (LSTM), which is described in "Long and Short Term Memory" on page 12.
For more information, see “Finding structure in time” by Elman.
Long and Short Term Memory
The LSTM network is a specific type of RNN. LSTM is more robust and aims to address the limitations of classical RNNs. LSTM stores long-term memory and can provide relevant outputs about information that was provided far back in a sequence.
With LSTM, it is possible to preserve the error signal across several steps in the loop so that the RNN keeps learning from inputs that are far back in the sequence. This behavior enables the neural network to make relationships and point out connections between remote things.
Using what are called gates, information flowing through the neural network can be stored by creating a memory within each cell. Those cells enable the network to perform reads and writes, or even erase its content, to manage what is useful for the given learning context.
The gates that are responsible for deciding whether the information is kept, altered, or removed use a sigmoid function, similar to a classical neural network, and each gate activates based on its signal strength, which manages the cells' weights. This process is repeated, and the weights are adjusted by the RNN learning method.
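As an illustration only, the following sketch implements one step of a standard LSTM cell in Python with NumPy, with forget, input, and output gates controlled by sigmoid functions as described above. The layer sizes and random weights are assumptions for the example, not part of any framework covered in this book.
```python
# One LSTM cell step with forget (f), input (i), and output (o) gates.
# Sizes and weights are illustrative.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    z = x_t @ W + h_prev @ U + b                   # all gate pre-activations at once
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates: keep, write, read
    g = np.tanh(g)                                 # candidate memory content
    c = f * c_prev + i * g                         # erase and write the cell memory
    h = o * np.tanh(c)                             # read from the cell memory
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(n_in, 4 * n_hid))
U = rng.normal(size=(n_hid, 4 * n_hid))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):             # a toy sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)
```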
With this type of neural network, it is possible to understand and infer words or sentences based on the writing style of a complete book, which leads to several other applications, such as handwriting recognition, music composition, and grammar learning.
For more on LSTM, see "Long short-term memory" by Hochreiter and Schmidhuber.6
Convolutional neural networks
Convolutional neural networks (CNNs) have become more prominent because of their effectiveness in areas such as image recognition and classification. CNNs can identify different objects within the same image and determine where in the image each object is.
In a CNN, there are a few basic operations that are essential to how they work:
The convolutional phase focuses on extracting features from the input object (in most cases, these objects are images) by creating a filter (or kernel) that goes through the object piece by piece, multiplies the elements, sums the result, and puts it into a new, smaller matrix that is called the feature map. For a simple grayscale image (a numerical matrix), imagine the filter as a smaller matrix that is applied over the original one starting at the upper left corner: it performs the calculation, moves one pixel to the right, and repeats the calculation until the filter has covered the whole matrix.
Figure 1-8 and Figure 1-9 illustrate the process.
Figure 1-8 Image pixel matrix (simplified) and filter
Figure 1-9 Convolutional process example
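The following is a minimal sketch of this sliding-filter computation in Python with NumPy. The 5 x 5 image and the 3 x 3 filter values are illustrative; in a real CNN, the filter values are learned during training.
```python
# Slide a small filter over a grayscale image, multiply element by element,
# and sum each result into a smaller feature map. Values are illustrative.
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

def convolve2d(img, k):
    kh, kw = k.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the covered patch by the filter and sum.
            feature_map[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return feature_map

print(convolve2d(image, kernel))   # a 3 x 3 feature map
```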
The filter is initialized randomly before a CNN is trained and is adjusted during the training process. Its size is defined by whoever designs the CNN, based on their needs.
After the convolution, it is necessary to add non-linearity to the result because most data and applications are non-linear, whereas the convolutional process generates linear results. In Figure 1-10, all the negative pixels are turned to 0 by using an operation that is called a rectified linear unit (ReLU).
Figure 1-10 Rectified linear unit graph
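As a small illustration, the ReLU operation can be applied to a feature map as follows; the values are arbitrary.
```python
# ReLU on a feature map: every negative value becomes 0, which adds
# non-linearity after the convolution. Values are illustrative.
import numpy as np

feature_map = np.array([[-2.0, 3.0], [0.5, -1.0]])
print(np.maximum(feature_map, 0))   # [[0.  3. ] [0.5 0. ]]
```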
The next phase, pooling, reduces the dimensions of the feature map that was generated in the convolutional phase and rectified in the previous step. This process is known as downsampling. Even though the dimensionality is reduced, the most important information is retained.
There are different methods of pooling, such as taking the average of a specific group or keeping the highest number. The process is similar to the convolutional process: a specific area is defined and then moved across the feature map (without overlapping), and the highest number in each area is kept (in the max pooling scenario), as shown in Figure 1-11.
Figure 1-11 Max pooling process example
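The following is a minimal sketch of 2 x 2 max pooling with a stride of 2 (no overlap) in Python with NumPy; the feature map values are illustrative.
```python
# 2 x 2 max pooling with stride 2: each block of the feature map is
# replaced by its highest value. Values are illustrative.
import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [3, 1, 1, 0],
                        [1, 2, 4, 8]], dtype=float)

# Reshape into 2 x 2 blocks and keep the maximum of each block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 5.] [3. 8.]]
```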
Finally comes the last phase, the fully connected layer. This is a classical FFNN with a specific activation function at the output layer (usually a softmax one). The idea is to learn and classify the features that were extracted in the previous phases. It is called fully connected because as in an FFNN, all neurons from a layer are connected to all neurons in the next layer.
It is worth noting that the convolutional, rectified, and pooling phases can be repeated before reaching the fully connected phase to extract even more features from the object.
During the training phase, a back-propagation algorithm is used to apply gradient descent across all the fully connected neurons and into the convolutional filters to adjust their weights and values.
1.2.5 Difference between classical and deep neural networks
The answer to the question "What is the difference between classical and deep neural networks?" is straightforward. A deep neural network has more than one hidden layer in its architecture, often even more than two or three.
It is possible to have deep neural networks in many different neural network models, such as a deep FFNN or a deep convolutional neural network, as shown in Figure 1-12.
Figure 1-12 Difference between a classical neural network and a deep neural network
So, why use more than one hidden layer, and what are the advantages of doing so?
It has been shown that a neural network with only one hidden layer can approximate any function to within a given error.7 However, the point is that in most approximations of more complex situations, neural networks with one hidden layer do not achieve accuracies greater than 99%. Even though there can be many reasons for that situation, and there is much study and research in this area, one approach that has been tried and has shown successful results is using extra hidden layers.
Adding more hidden layers to a neural network enables it to generalize better to unknown data (data that is not in the training set). This characteristic is essential and helps greatly in the image and object recognition areas; with more hidden layers, it is possible to recognize a set of features in a better way.
However, deep neural networks require much more processing power and powerful hardware that can reach those levels of processing.
1.2.6 Neural networks versus classical machine learning algorithms
Both neural networks (and DL) and classical ML offer ways to train models and classify data.
If you take pictures of a car and a motorcycle and show them to any human being, you expect them to be able to tell which one is the car and which one is the motorcycle. The reason is that they have seen many cars and motorcycles in their lives, so they know the specific characteristics and shape of each one. A machine tries to learn in the same way, by studying the characteristics and behaviors of objects and becoming aware of them the way humans do. However, both machines and humans can sometimes fail to identify things because they are not familiar with them or do not know their specific characteristics.
For a machine to perform a classical ML classification of an image, you must manually select the relevant features that are used to train that specific model (logistic regression or any other). Those features are used by the model to identify and classify the image during training, and also to perform predictions on it later. The data (images in this case) must be prepared by extracting all of those features beforehand to feed the model.
By using a neural network, there is no need to write code to extract those features from the image beforehand, because you provide the image directly to the neural network algorithm (although the image layers, for example, red, green, and blue (RGB), must be prepared before they go into the model).
Even though neural networks seem more promising, they require much more computing power and large data sets to make the model effective.
When choosing classical ML algorithms, it is possible to train your model with several classifiers, check which one produces the best results, and, if needed, combine more than one to achieve even better results. You also need to know the best features to extract so that the model performs its best for your data set.
Usually, for more common applications where you already have most of the features, for example, revenue prediction, it might be easier and faster to work with classical ML algorithms. For unstructured data such as sound, text, and images, neural networks can be more straightforward, even though they require more processing and, in many cases, a great deal of data.
It is possible to use neural networks for classical regression and classification because they provide better accuracy8 when trained with the expected amount of data. However, in several cases of simple classification or regression, there is not much data to use in a training set, which makes neural network performance insufficient.
If you do not have that much data or you lack machine power, it is better to stick with classical ML algorithms; otherwise, add neural network options to your stack and check which one performs better for your specific scenario.
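As an illustration of this trade-off, the following sketch trains a classical ML classifier and a small neural network on the same data by using scikit-learn (assuming it is installed). The data set, model choices, and hyperparameters are illustrative, not a benchmark.
```python
# Compare a classical ML classifier with a small neural network on the
# same data. The digits data set and settings are illustrative only.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classical ML: logistic regression on the raw pixel features.
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
print("Logistic regression accuracy:", clf.score(X_test, y_test))

# A small feed forward neural network trained on the same features.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print("Neural network accuracy:", mlp.score(X_test, y_test))
```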
1.3 Deep learning frameworks
Implementing neural network algorithms from scratch is not an easy task: it usually requires many lines of code to create even a simple neural network, and the result probably lacks several features. Besides the neural network logic, with all its neurons and connections, it is necessary to think about how to distribute the heavy training process across different machines and make it run on CPUs and GPUs.
This is where DL frameworks come into play. They take care of most of the complexities that are involved in running neural network algorithms, and also all the code that is necessary to interact with the GPU libraries. They also do many other things besides implementing those algorithms, such as providing higher-level APIs that make DL code easier to implement.
1.3.1 Most popular deep learning frameworks
There are several DL frameworks, and each one has its own specific characteristics, advantages, and disadvantages. This section introduces and describes the most popular DL frameworks.
Caffe
Caffe is one of the first DL frameworks, even though it is not that old. Caffe was created at the University of California, Berkeley by Yangqing Jia during his PhD years. Caffe was developed in C++, and it supports Python.
Today, Caffe is stable and suited for production, although it is also used in research. Caffe performs well in computer vision areas, such as image recognition, but it falls short in other DL areas.
The framework also works with GPUs by using the CUDA Deep Neural Network (cuDNN) library. The code can be found on GitHub.9 Unfortunately, it still lacks documentation.
Chainer
Chainer is a flexible framework that enables building complex neural network architectures in a simpler way.
Chainer implements a Define-by-Run methodology, which means that the network that Chainer creates is defined dynamically at run time. This approach is similar to writing a simple for loop in any language: the loop runs the same way within the network definition as it does in normal code.
The framework was developed in Python, and there are no other interfaces. As mentioned on the official website,10 Chainer works by representing a network as an execution path on a computational graph, where the computational graph is a series of function applications. Each function is a layer of a neural network, and the layers are connected by links that hold the variables, which are updated during the training process.
Chainer is a successful framework, although other frameworks can build comparable models. Chainer has a well-documented website11 and several examples on the web. It is possible to run Chainer on GPUs by using the CuPy library, which can be easily installed.
TensorFlow
TensorFlow started in 2011 as a proprietary system called DistBelief, which was developed by Google Brain. Due to its success across the company's applications, Google focused on improving DistBelief, which then became TensorFlow. TensorFlow was released as open source in 2015, and version 1.0 followed in 2017.
TensorFlow is one of the best-known DL frameworks, and its usage has increased over time because it is sponsored by Google and comes with several useful features.
TensorFlow works in a Define-and-Run scheme where the compiler goes through all the code to create a static graph and then runs that graph. An advantage of this approach is that during graph creation, TensorFlow can optimize all the operations that are defined and determine the best way to run each one of them.
TensorFlow was developed in C++ and has an interface for Python. TensorFlow has an interesting feature that is called TensorBoard, which is a GUI that you can use to visualize the graphs that you created and how data flows through them.
The framework supports GPUs and runs in a distributed mode. Running in Define-and-Run mode introduces a lack of flexibility because the graph is static, but this mode also enables a better way to distribute the code across different nodes, and it ensures that the graph can be shipped and run identically on all of the GPUs.
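As an illustration of the Define-and-Run style, the following minimal sketch uses the TensorFlow 1.x graph API (placeholders and sessions) to define a tiny graph first and then run it; the values that are fed into the graph are arbitrary.
```python
# Define-and-Run with the TensorFlow 1.x graph API: build a static graph,
# then execute it in a session. Values are illustrative.
import tensorflow as tf

# Define the static graph.
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
c = a * b + 2.0

# Run the graph.
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))   # 14.0
```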
Theano
Theano is a numerical computation library like TensorFlow. Its development began at the University of Montreal in 2007, so it is one of the older frameworks. It is developed in Python, which is its only supported interface.
Similar to TensorFlow, Theano also works in a Define-and-Run scheme where the graph is built first during compile time, and then all the processes flow through the graph during run time. Theano supports working with GPUs and can run in a distributed mode.
Although the framework is stable and ready for use in production, in September 2017, the development team announced that Theano’s development was stopping after release 1.0.12 At the time of writing, the future of Theano is unknown. Theano is open source, so it is possible that its development might continue under new management.
Torch and PyTorch
Even though these two frameworks look alike and have the same C/C++ engine, they are not the same.
Torch came first, and its core engine was developed in C/C++ with interfaces for Lua and LuaJIT as its scripting languages. Torch was developed by research teams at Facebook, Twitter, and Google.
PyTorch was developed by Facebook and released in 2017. The main purpose of PyTorch is to make Torch available with a Python interface because Lua is not often used among DL researchers and developers. Even though that was the main purpose, PyTorch was designed so that Python integrates tightly with the Torch core engine that is developed in C, improving memory management and optimization.
PyTorch works in a Define-by-Run scheme, similar to Chainer, and focuses on processing speed.
Torch and PyTorch both work in a modular way, their components are easy to combine, they are not difficult to run on GPUs, and they have several pre-trained models available.
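As an illustration of the Define-by-Run style, the following minimal sketch builds a small network with PyTorch and runs a forward and backward pass as ordinary Python code; the layer sizes and random inputs are illustrative.
```python
# Define-by-Run with PyTorch: the graph is built dynamically as the code runs.
# Layer sizes and inputs are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),   # input -> hidden
    nn.ReLU(),
    nn.Linear(8, 1),   # hidden -> output
)

x = torch.randn(2, 4)      # a batch of 2 illustrative inputs
y = model(x)               # forward pass runs like normal Python code
y.sum().backward()         # gradients through automatic differentiation
print(y)
```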
1.3.2 A final word on deep learning frameworks
There is no single best framework. The choice of framework depends on your needs and context. At a high level, PyTorch and TensorFlow have been growing and becoming general-purpose frameworks. Caffe has been around for some time, is stable, and is good for production, but it specializes in computer vision. Theano is stable and one of the oldest, but it might be going away; unless you already have Theano implementations, you should think carefully before starting projects with this framework. Chainer has a great deal of documentation, has been the inspiration for PyTorch, and looks promising, but it is not growing as fast as TensorFlow.
The frameworks that are mentioned in this chapter are all supported by IBM PowerAI and are optimized to run on IBM Power Systems machines.

2 Yadav, et al, “History of Neural Networks. In: An Introduction to Neural Network Methods for Differential Equations”, Springer Briefs in Applied Sciences and Technology, Springer, Dordrecht, 2015