Deep learning, a not-so-deep overview

So, what is this deep learning that is grabbing our attention and headlines? Let's turn to Wikipedia again for a working definition: "Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations." That sounds as if a lawyer wrote it. The defining characteristic of deep learning is that it is based on ANNs in which machine learning techniques, primarily unsupervised learning, are used to create new features from the input variables.

We will dig into some unsupervised learning techniques in the next couple of chapters, but for now one can think of it as finding structure in data where no response variable is available. A simple analogy is the Periodic Table of the Elements, a classic case of finding structure where no response is specified. Pull up this table online and you will see that it is organized based on atomic structure, with metals on one side and non-metals on the other; it was created based on latent classification/structure. This identification of latent structure/hierarchy is what separates deep learning from your run-of-the-mill ANN.

Deep learning, loosely speaking, addresses the question of whether there is a representation of the data that predicts the outcome better than the raw inputs alone. In other words, can our model learn to classify pictures using something more informative than the raw pixels as the only input? This can be of great help in a situation where you have a small set of labeled responses but vast amounts of unlabeled input data: you can train your deep learning model using unsupervised learning and then apply it in a supervised fashion to the labeled data, iterating back and forth.

Identification of these latent structures is not trivial mathematically, but one example is the concept of regularization that we looked at in Chapter 4, Advanced Feature Selection in Linear Models. In deep learning, one can penalize the weights with regularization methods such as L1 (penalize non-zero weights), L2 (penalize large weights), and dropout (randomly drop units and their connections during training, zeroing out their contribution). In a standard ANN, none of these regularization methods are typically applied.
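
As a minimal sketch of where these penalties appear in practice, assuming the h2o package and using the built-in iris data purely as a stand-in, the weights could be regularized as follows (the parameter values are illustrative, not a tuned model):

    # L1, L2, and dropout regularization in a small deep learning model
    library(h2o)
    h2o.init(nthreads = -1)
    iris_h2o <- as.h2o(iris)
    fit <- h2o.deeplearning(
      x = 1:4, y = 5,                       # predictors and the Species response
      training_frame = iris_h2o,
      hidden = c(50, 50),                   # two hidden layers
      activation = "RectifierWithDropout",  # hidden dropout requires a ...WithDropout activation
      input_dropout_ratio = 0.1,            # randomly ignore 10% of the inputs
      hidden_dropout_ratios = c(0.3, 0.3),  # randomly drop 30% of each hidden layer's units
      l1 = 1e-5,                            # penalize non-zero weights
      l2 = 1e-5,                            # penalize large weights
      epochs = 20
    )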

Another way is to reduce the dimensionality of the data. One such method is the autoencoder. This is a neural network in which the inputs are transformed into a reduced-dimension representation in a hidden layer and then reconstructed at the output. In the following diagram, notice that Feature A is not connected to one of the hidden nodes:

[Figure: a simple autoencoder network in which Feature A is not connected to one of the hidden nodes]
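
To make the idea concrete, here is a minimal sketch of an autoencoder, again assuming the h2o package and the iris data purely for illustration; the two-unit hidden layer is an arbitrary choice that compresses the four measurements into two learned features:

    # an unsupervised autoencoder: no response variable is supplied
    library(h2o)
    h2o.init(nthreads = -1)
    iris_h2o <- as.h2o(iris)
    ae <- h2o.deeplearning(
      x = 1:4,               # inputs only; the network learns to reconstruct them
      training_frame = iris_h2o,
      autoencoder = TRUE,
      hidden = c(2),         # compress four inputs into a two-dimensional representation
      activation = "Tanh",
      epochs = 50
    )
    # the reduced-dimension representation learned in the hidden layer
    codes <- h2o.deepfeatures(ae, iris_h2o, layer = 1)
    head(codes)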

This can be applied recursively, and learning can take place over many hidden layers. What happens in this case is that the network develops features of features as the layers are stacked on top of each other. Deep learning first learns the weights between each pair of layers in sequence and only then uses backpropagation to fine-tune these weights. Other feature selection methods include the Restricted Boltzmann Machine and the Sparse Coding Model.
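
Here is a short sketch of the stacking idea, again assuming h2o and the iris data as placeholders: the middle layer of a three-layer autoencoder holds features of the first layer's features, and those learned features can then be handed to a supervised model, which is also one way to put the unsupervised-then-supervised workflow mentioned earlier into practice:

    # a stacked autoencoder and a supervised model on its learned features
    library(h2o)
    h2o.init(nthreads = -1)
    iris_h2o <- as.h2o(iris)
    stacked_ae <- h2o.deeplearning(
      x = 1:4,
      training_frame = iris_h2o,
      autoencoder = TRUE,
      hidden = c(10, 3, 10),   # the middle layer learns features of the first layer's features
      activation = "Tanh",
      epochs = 50
    )
    # pull out the middle (second) layer's representation
    deep_feats <- h2o.deepfeatures(stacked_ae, iris_h2o, layer = 2)
    # attach the labels and fit a supervised model on the learned features
    deep_feats$Species <- iris_h2o$Species
    fit <- h2o.deeplearning(
      y = "Species",           # x defaults to all of the remaining columns
      training_frame = deep_feats,
      hidden = c(5),
      epochs = 20
    )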

The details are beyond our scope, and many resources are available online if you wish to learn about the specifics.

Deep learning has performed well on many classification problems, including winning a Kaggle contest or two. It still suffers from the problems of ANNs, especially the black box problem. However, it is appropriate for problems where an explanation of how the prediction was made is not required and the important question is simply what the prediction is. Additionally, the Python community has had a bit of a head start on the R community in deep learning usage and packages, but as we will see in the practical exercise, this gap has closed, if not been eliminated altogether.

While deep learning is an exciting undertaking, be aware that to achieve the full benefit of its capabilities, you will need a high degree of computational power, along with taking the time to train the best model by fine-tuning the hyperparameters. Here is a list of some of the hyperparameters that you will need to consider; a sketch of where each appears in code follows the list:

  • An activation function
  • Size of the hidden layers
  • Dimensionality reduction, that is, Restricted Boltzmann Machine versus autoencoder, and so on
  • The number of epochs
  • The gradient descent learning rate
  • The loss function
  • Regularization
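
As a sketch of where each of these shows up in code, again assuming the h2o package (which offers autoencoders rather than Restricted Boltzmann Machines for its unsupervised component) and placeholder values rather than a tuned model:

    library(h2o)
    h2o.init(nthreads = -1)
    iris_h2o <- as.h2o(iris)
    fit <- h2o.deeplearning(
      x = 1:4, y = 5,
      training_frame = iris_h2o,
      activation = "Rectifier",  # the activation function
      hidden = c(100, 100),      # the size and number of the hidden layers
      epochs = 10,               # the number of epochs
      adaptive_rate = FALSE,     # turn off adaptive learning so that rate is used
      rate = 0.01,               # the gradient descent learning rate
      loss = "CrossEntropy",     # the loss function
      l1 = 1e-5, l2 = 1e-5       # regularization
    )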

You can imagine that this is no small feat, but enough of the overview; let's move on to some practical applications.
