7
Machine Learning

Samuel H. Huddleston and Gerald G. Brown

Operations Research Department, Naval Postgraduate School, Monterey, CA, USA

7.1 Introduction

This chapter provides an overview of machine learning, that is, using automated algorithms to discover descriptive, predictive, and prescriptive results from data. Machine learning is closely related to artificial intelligence (in computer science) and statistical learning (in statistics), incorporating algorithms and procedures developed in both fields. Due to this amalgam of multiple fields, the lexicon is very broad, with many synonyms used to describe the same algorithms, procedures, and model parameters. Throughout this chapter, we seek to capture the lexicon employed by the various tribes within this emerging discipline and provide a map to terms formalized in the operations research and statistical literature.

The defining characteristic of machine learning is the focus on using algorithmic methods to improve descriptive, predictive, and prescriptive performance in real-world contexts. An older, but perhaps more accurate, synonym for this approach from the statistical literature is algorithmic modeling [1]. This algorithmic approach to problem solving often entails sacrificing the interpretability of the resulting models. Therefore, machine learning is best applied when this trade-off makes business sense, but is not appropriate for situations such as public policy decision-making, where the ability to explain how decisions about public resources are made is often essential.

There are many nonparametric and heuristic approaches employed in machine learning for which full statistical and theoretic interpretations have not yet been developed. These methods are often widely employed for years with a high degree of success in practice before formal statistical explanations for their performance are developed and validated. Machine learning practitioners balance the risk of using these emerging techniques by conducting a formal evaluation (competition) of the many available algorithms that might apply to a problem in a way designed to identify and validate the performance of the best algorithm for the problem in a real-world context. Much of this chapter is devoted to summarizing this procedure and its underlying motivation.

Machine learning is a very large and rapidly expanding field, and a full description of the discipline would require much more space than is available in this chapter. Here we describe the major paradigms of machine learning, review the modeling process in very general terms, and briefly describe some of the most popular machine learning algorithms. The reader is encouraged to conduct further investigation of the many terms and ideas summarized here.

Throughout this chapter, we have italicized terms for which Wikipedia provides more in-depth descriptions and frequently provide links to these pages in footnotes. For more in-depth technical treatments of most of the discussed algorithms, we recommend An Introduction to Statistical Learning [2] and The Elements of Statistical Learning [3], which are freely available online.1 Additional references for specific techniques and applications are provided in the following section.

7.2 Supervised, Unsupervised, and Reinforcement Learning

There are three broad classes of machine learning problems, with classes defined based on the data provided for algorithmically “training” the models: supervised learning problems,2 unsupervised learning problems,3 and reinforcement learning problems.4 These broad classes are not mutually exclusive, and learning methods designed for different classes of machine learning problems are often combined to solve real-world analytics challenges.

Supervised learning applies when the data set used to train models includes a labeled response variable. The goal in supervised learning is to use available observations to predict the values of the response variable associated with new observations (where the responses will not be known beforehand). Sometimes the term semisupervised learning is used to distinguish situations where the labeled response is available for some, but not all, observations in the training data set. Supervised learning problems are further subcategorized into regression problems and classification problems.2

Regression problems have response variables that take quantitative values. One classic example of a regression problem is predicting the future sale price of a home based on its age, zip code, square footage, and so on. Another example would be to predict the sales volume of a product in the next quarter, given historical sales records (an example of a time series forecast). The response variable in statistical regression is often referred to as the dependent variable, and the features used to predict the response variable are referred to as independent or predictor variables. Note that the term regression describes both statistical regression (an algorithm) and a subclass of supervised learning problems. However, statistical regression is not the only machine learning algorithm available to address regression problems.

Classification problems have a response variable that is categorical and unordered. The goal in a classification problem is to use the data at hand to predict the class or category for new observations (or in some cases, those for the next time period) where the class or category is not known. One classic example of a classification problem is predicting the probability of someone having (or developing) a particular disease based on risk factors (age, body mass index, cholesterol, family disease history, etc.). This is an example of binary or binomial classification. While the output is a probability (a continuous numerical quantity), this is a classification rather than a regression because the goal is to predict the probability of being in the “disease class.” Multiclass or multinomial classification problems classify observations into three or more classes or groups–for example, predicting which of three products a customer is most likely to buy (and not buying any product could make up a fourth class). Most classification methods will provide a predicted probability estimate for each class for each observation (with ties possible).

Unsupervised learning seeks to identify latent (underlying or hidden) structures in a data set. Therefore, unsupervised learning requires no labeled response variable. There can be many latent structures in a data set, and there are several unsupervised learning tasks that are quite common. Density methods (from statistics) construct an estimate for an underlying probability distribution for the data based on the observations in the data set.5 Clustering methods partition a data set into groups or clusters where each cluster (group) is a set of observations in the data that are more similar to each other in some way than they are to the other observations.6 Dimension reduction seeks to provide a more compact (i.e., lower dimensional) representation of the data in a data set.7 Dimension reduction is frequently used to provide a more compact feature set (i.e., the set of independent variables in statistical regression, also often described as the predictors in machine learning) to use for supervised learning.
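To make these tasks concrete, the following minimal sketch (in Python with the scikit-learn library; the synthetic data and all parameter choices are our own illustrative assumptions, not drawn from any particular application) applies k-means clustering and principal component analysis (PCA), a widely used dimension reduction method:

```python
# A minimal sketch of two common unsupervised tasks on unlabeled, synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # 200 observations, 10 features, no response labels

# Clustering: partition observations into three groups of similar points.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimension reduction: compress the 10 features into 2 derived features.
X_2d = PCA(n_components=2).fit_transform(X)

print(clusters[:10])  # cluster assignments (a latent grouping) for the first 10 rows
print(X_2d.shape)     # (200, 2): a more compact representation of the data
```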

Reinforcement learning (Figure 7.1) addresses a very broad category of problems in which scoring or response functions are applied iteratively over time to improve the performance of models or policies. Often, reinforcement learning is applied in situations in which there is no labeled response data available now, but over time, feedback about the model outputs will provide the opportunity to improve performance. For example, one might automatically cluster documents together using topic modeling (an unsupervised approach) and then ask humans to randomly inspect the clusters made, providing feedback on which clusters are well grouped (winners) and which clusters are less useful (losers). This feedback (scoring function) provides the opportunity to apply supervised learning and then repeat the sampling and feedback (iteratively applying the scoring function). Another application is the highly generalizable multiarmed bandit problem in which the goal is to determine which slot machines (in a row of slot machines) should be played in what order to maximize profit (or minimize loss). The win–lose results (scoring function) provide the needed feedback for developing an optimal policy over time.


Figure 7.1 Overview of machine learning paradigms: supervised, unsupervised, and reinforcement learning. This illustration provides a summary of the data needed to train the model, the format of the resulting model output, and a list of algorithms often used for this class of problem. Algorithms listed are addressed in this chapter, except Q Learning, which is omitted due to space limitations.

Artificial Intelligence (AI) systems are often trained using reinforcement learning to iteratively improve the ability of the AI system to accomplish the desired tasks (e.g., find cats in Internet images or avoid collisions while driving). One classic example of reinforcement learning is the training of an AI system (machine) to play strategy games such as Chess or Go. At the onset, the machine will make random or uninformed moves, but games or points won and lost will provide a scoring function that will allow for algorithmic evaluation of policies or decisions made. Reinforcement learning requires balancing exploitation of already-known information and exploration of new policies. In the case of the recent, and largely unexpected, success in training an AI system to defeat a Go grandmaster, the AI system competed against itself for millions of games to provide the opportunity to explore the vast decision-making space.8

Machine learning is an emerging field, with new algorithms regularly developed and fielded. Specific machine learning techniques are usually designed to address one of the three types of machine learning problems introduced here, but in practice several methods are often combined for real-world application. As a general rule, unsupervised learning methods are designed to be descriptive, supervised learning methods are designed to be predictive, and reinforcement learning methods are designed to be prescriptive. Because reinforcement learning is a more advanced machine learning topic, it will not be further expanded upon in this short overview chapter.

7.3 Model Development, Selection, and Deployment for Supervised Learning

While machine learning relies heavily on automated algorithms for model development, this does not mean that machine learning is completely automated. Rather, the development of machine learning models is an involved and iterative process that requires considerable engagement and judgment on the part of an analyst. The first step, while obvious, is often overlooked: understand the intended use for the model in the context of its “business use case” (see Chapter 6 for further discussion). Only then should you proceed with developing a model designed to meet this business need.

7.3.1 Goals and Guiding Principles in Machine Learning

In contrast to traditional approaches to modeling, machine learning does not strive to produce a specific model with static coefficients. Machine learning strives to produce an algorithmic procedure with demonstrated predictive power in the business context in which it will be applied. In fact, in many machine learning applications, such as time series forecasting, we expect that the coefficients of the model(s) will change with every time period in which we apply the algorithm because we will refit the model with any new data that have arrived in the interim and use this new information to improve our forecast for the next time period. The goal is to have a useful modeling algorithm (i.e., sequence of steps that produce a model) to solve the real-world problem. Readers are referred to Ref. [1] for an in-depth discussion about the difference in goals between algorithmic and more traditional statistical modeling.

This difference in modeling focus yields the following list of guiding principles for algorithmic modeling, some of which are based on Breiman's influential article:

  • There are likely to be many models with demonstrated predictive power.
  • Analysts should investigate and compete as many models as possible.
  • Analysts should measure the performance of models on out-of-sample test data sets using a procedure that mimics the real-world situation in which the model will be applied.
  • Predictive accuracy on out-of-sample test data, not goodness of fit on training data, is the primary criterion for how good a model is (a minimal sketch of this principle follows this list).
  • Predictive power is not the only criterion upon which model selection is based; we routinely also consider model interpretability, speed, deployability, and parsimony.
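To make the out-of-sample principle concrete, the minimal sketch below (in Python with scikit-learn; the synthetic data and the two competing algorithms are our own illustrative choices) fits two candidate models and judges them on a held-out test set rather than on the data used to fit them:

```python
# Compete two models, scoring each on held-out data rather than on training data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)  # quadratic truth plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for name, model in [("linear regression", LinearRegression()),
                    ("k-nearest neighbors", KNeighborsRegressor(n_neighbors=5))]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))  # out-of-sample score
    print(name, round(mse, 3))
```

A linear model cannot capture the quadratic relationship here, so the out-of-sample comparison, not the training fit, reveals the better algorithm.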

7.3.2 Algorithmic Modeling Overview

The following steps describe a standard workflow for developing an algorithmic solution for prediction in the context of a supervised learning problem:

  • Data acquisition and cleaning
  • Feature engineering and scaling
  • Model fitting (training) and feature selection
  • Model selection
  • Model performance assessment
  • Model implementation

7.3.3 Data Acquisition and Cleaning

Algorithmic modeling begins with data acquisition and cleaning. This is often the most time-consuming step. An oft-cited rule of thumb is that this step will consume as much as 80% of an analyst's time (with the punch line that subsequent modeling consumes the other 50%). This step requires considerable domain expertise within the area of application to ensure that data for any and all features with predictive power are collected and included in the analysis. For this reason, analysts should work closely with domain experts and practitioners in the application area to identify all possible data sources that are relevant to the problem.

Real-world data are often messy, with missing values, varied formatting, duplicated records, and so on. Analysts can expect to spend considerable time reformatting and standardizing any data acquired and will often require domain expert advice for dealing with data problems. For most supervised learning problems, data acquisition and cleaning will end with the data stored in a flat table where the rows of the table represent observations and the columns represent the features of those observations. For supervised learning problems, at least one column must represent the response variable or class (the item that will be predicted). This response variable is often referred to as the dependent variable in statistical regression and as the target variable in artificial intelligence.

7.3.4 Feature Engineering

The next step is feature engineering.9 A feature in machine learning is an attribute describing an observation that all observations share (although the values of the attribute vary across observations). These features are also variously described as independent variables, predictor variables, or explanatory variables. The premise is that we can use the differences in attributes between observations to better classify or predict the response (target variable) for new observations. Feature engineering uses both algorithms and domain knowledge to generate features that provide predictive power for a machine learning algorithm. Because feature engineering usually has more impact on predictive performance than the algorithmic procedure you decide to implement, it is widely considered the most important step in generating predictive models. Feature engineering is often divided in the lexicon between feature construction (human-conducted) and feature learning (machine-automated) tasks.9

While feature engineering tasks overlap with data acquisition and cleaning, feature engineering also often involves the construction of new data features from existing ones. One simple example of feature engineering is constructing interaction variables by multiplying features with each other: Consider the case where the length (l), width (w), and height (h) of a product are not correlated with shipping cost (sc), but the volume estimated by the multiplication of these variables is highly predictive, that is,

$$ \mathit{sc} \approx \beta_0 + \beta_1 \left( l \times w \times h \right) $$

Another simple example of feature engineering might be to raise a feature to a power (e.g., square it) to model a nonlinear (e.g., quadratic) relationship between the predictor variable and the response (e.g., see the regression example provided in Chapter 6).
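A minimal sketch of these two constructions follows (Python with NumPy; the product dimensions are made-up values used purely for illustration):

```python
# Hypothetical sketch: constructing interaction and polynomial features by hand.
import numpy as np

# Made-up length, width, and height measurements for three shipped products.
length = np.array([10.0, 20.0, 15.0])
width = np.array([5.0, 4.0, 6.0])
height = np.array([2.0, 3.0, 2.5])

volume = length * width * height  # interaction feature: l * w * h
volume_sq = volume ** 2           # polynomial feature for a nonlinear relationship

X = np.column_stack([length, width, height, volume, volume_sq])
print(X.shape)  # (3, 5): the engineered feature matrix passed on to model fitting
```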

One example of more involved feature engineering is the process often used in crime prediction [4]. Each grid point in a city is tagged with an indication of whether or not a crime has occurred (this is usually done separately for each type of crime), and this binary variable acts as the response variable. Demographic information from sources such as the census (each point is given all of the demographic features of its census block) is then added to each of the points. Finally, distances between each point and geographic features that either are associated with crime (called crime attractors or generators) or repel crime (called crime detractors) are calculated. The premise is that crimes such as assault and battery are more likely to occur in areas near crime attractors such as bar districts and are less likely to occur near geographic features such as police stations. Feature engineering would include the identification of any and all demographic and geographic features that might have an influence on the crime type an analyst is trying to predict. These feature data sets routinely involve hundreds of potential predictor variables. Huddleston and Brown [5] provide a detailed example of feature engineering conducted to predict gang crime.

Another common task in feature engineering is data scaling. While some machine learning algorithms are scale-invariant, the performance of others can be adversely affected when the scale of the predictor (i.e., independent) variables varies a great deal. Feature scaling, also called data normalization, involves transforming all features to a standard scale such as unit length.
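The short sketch below shows two common scaling conventions, standardization and min-max normalization (our own illustrative choices; rescaling feature vectors to unit length, as mentioned above, is a third alternative):

```python
# A sketch of two common feature-scaling conventions on made-up data.
import numpy as np

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])  # two features on wildly different scales

# Standardization: zero mean and unit variance for each feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: rescale each feature to the [0, 1] interval.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std.round(2))
print(X_minmax.round(2))
```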

7.3.5 Modeling Overview

Once an appropriate feature set has been constructed, analysts can begin algorithmically “fitting” predictive models. As already suggested, one of the principles of algorithmic modeling is that analysts should investigate and compete as many models as possible. This has two implications. First, for any particular modeling method (e.g., linear regression), there are many models that can be built from the same feature set by using different combinations of predictive features. Second, we should consider many different modeling methods (algorithms) for the task (e.g., linear regression, regularized regression, k-nearest neighbors), so we now need to compare the performance of these different modeling methods against each other.

This involves a multistep “contest” for choosing a model from the many (perhaps thousands when considering all the different combinations of features) available models:

  • Model Fitting: We perform model fitting, feature selection, and parameter optimization for each of the modeling methods (algorithms) under consideration on a training data set. The output of this step is a list of “best of breed” models that we will compete against each other in the next step.
  • Model (Algorithm) Selection: We compete the “best of breed” models against each other on an out-of-sample validation data set. The best performing algorithm (on a range of criteria) is chosen for implementation.
  • Model Performance Assessment: We assess the performance of our selected approach on an out-of-sample test data set. This gives us an unbiased estimate for how well the algorithm will perform in practice.
  • Model (Algorithm) Implementation: The selected algorithm is applied to the full data set (i.e., training, validation, and test data sets, now combined) so that the model we deploy uses all available information.

Figure 7.2 provides an illustration of how data are partitioned into training, validation, and test data sets and then used in the context of this “contest.”10 Note that it is absolutely critical to ensure that the validation and test data sets are representative samples of the data set so that they provide realistic performance evaluation. As a general rule, the training data set is also designed to be a representative data set, but there are some exceptions. One example concerns model fitting for classification problems where the percentage of “positive” cases is very small (e.g., rare disease prediction, or flight screening for weapons and explosives). In these cases, one might build a training data set with a much higher percentage of “positive” cases than one would expect to find in practice in order to leverage classification algorithms that struggle to fit models when the number of positive versus negative cases is highly unbalanced. Some algorithms, such as artificial neural networks, also require the use of a tuning data set; in this case, the tuning data set should be a subset of the training data. The remainder of this section elaborates on each of the steps already described.


Figure 7.2 Illustration of the machine learning process as a successive algorithmic competition. The data set is partitioned into training, validation, and test data sets. Many models are trained (and tuned if necessary) on the training data set and a competition is held to predict the observations in the validation data set. The winning algorithm is trained on the combined training and validation data sets and then predictive performance on the test data set provides an estimate for real-world performance. The selected algorithm is applied to the full data set (so that the model deployed leverages all available information) and deployed.
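The following minimal sketch shows one common way to produce such a partition (Python with scikit-learn; the 60/20/20 proportions and synthetic data are arbitrary illustrative choices):

```python
# Sketch: partition data into training (60%), validation (20%), and test (20%) sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)  # made-up binary response labels

# First split off 40% of the data, then halve it into validation and test sets.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=2)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=2)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```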

7.3.6 Model Fitting (Training) and Feature Selection

Model fitting sets all parameters of a model, for example, the model coefficients in a linear regression model. Feature selection selects the best set (combination) of features for a particular algorithm for a particular problem. These two processes are necessarily intertwined in machine learning model fitting.

One way to think about model fitting in machine learning is as a highly nonlinear optimization problem. Machine learning algorithms provide a working framework (via the model structure) for defining and solving this optimization problem. Specifically, each machine learning algorithm provides a model structure, defined procedures for feature selection and parameter optimization, and an objective function (or scoring function) to be optimized. Any particular machine learning method can have a wide variety of extensions and variations to the basic algorithmic procedure, and so the decision space to be explored for even one algorithm is often very large. Therefore, solving this problem is often very computationally expensive. The “learning” activity performed in machine learning is the highly iterative process of having a machine find the “optimal” solution (or perhaps a near-optimal solution) to the problem it has been given.

One simple example of how feature selection might be conducted for linear regression is known as stepwise feature selection.11 In stepwise feature selection, features are iteratively added (forward stepwise procedure) or dropped (backward stepwise procedure) based on some predefined criteria that measure model quality. After a feature is added or dropped, the model parameters are fit and model goodness of fit is assessed using statistics such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).12 These statistics are used as the objective function in the optimization, and we choose the feature set (and model) that optimizes the chosen objective function (for AIC and BIC, lower values indicate better models, so we select the feature set that minimizes the criterion).
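A minimal sketch of the forward stepwise procedure follows (Python with the statsmodels library; the synthetic data, in which only two of six candidate features carry signal, is our own construction). The loop stops when no remaining feature lowers the AIC:

```python
# Sketch: forward stepwise feature selection using AIC as the objective function.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200)  # only features 0 and 2 matter

selected, remaining = [], list(range(X.shape[1]))
best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic  # start from intercept-only model

while remaining:
    # AIC for each candidate model formed by adding one more feature.
    scores = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic
              for j in remaining}
    j_best = min(scores, key=scores.get)
    if scores[j_best] >= best_aic:  # stop when no addition lowers the AIC
        break
    best_aic = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print(selected, round(best_aic, 1))  # typically recovers features [0, 2]
```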

However, in fitting the model for a given feature set, you also need to estimate all of the parameters for that model. In linear regression, we can use closed-form linear algebra to solve for the parameters of a model, but many machine learning algorithms require a full numeric optimization procedure such as stochastic gradient descent to fit the parameters at this stage.13 Thus, fitting machine learning models is a highly iterative and computationally intensive activity.
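To give a flavor of such numeric optimization, the sketch below fits a linear model by plain batch gradient descent (a simplified stand-in for the stochastic variant; the data, learning rate, and iteration count are illustrative assumptions):

```python
# Sketch: fitting linear-model parameters by gradient descent on mean squared error.
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept plus one feature
y = 3.0 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

beta = np.zeros(2)      # start from arbitrary parameter values
learning_rate = 0.1
for _ in range(500):    # iteratively step downhill on the error surface
    gradient = 2 * X.T @ (X @ beta - y) / len(y)  # gradient of the MSE
    beta -= learning_rate * gradient

print(beta.round(3))    # approaches the closed-form least-squares solution [3.0, 0.5]
```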

Feature selection and parameter estimation are performed iteratively on a training data set for each machine learning algorithm under consideration (e.g., linear regression, k-nearest neighbors, regularized regression). Thus, if you are comparing three different algorithms for a regression problem, you would conduct this highly iterative model fitting and feature selection process three times, each time using the model structure, objective function, and optimization procedures defined by a particular machine learning algorithm. The result of the model fitting stage is a set of “best of breed” models that will be entered into a competition in the model selection stage.

7.3.7 Model (Algorithm) Selection

In this step, the goal is to decide which model fitting algorithm provides the best predictive performance in an out-of-sample (i.e., blind) test. We apply the “best of breed” models developed in the previous step to predict the response variable in the validation data set (the data set we have set aside as the “blind” for this purpose). Performance is measured by comparing the predicted values from each modeling algorithm with the known response variable. It is critically important to use the correct performance function to choose the winning model (and there are many candidate performance functions). The performance function should measure performance in a way that is meaningful for the true application of the data. This performance measure often differs from the statistic used as the objective function for model fitting.

If one model or algorithm is better than the others in predictive performance (i.e., more accurately classifies or predicts the response variable), this does not imply that it will be the model or algorithm selected for implementation. Predictive performance describes the “payoff” for employing a model, but there may also be a figurative (for example, “political”) or literal (in the case of computational resources required) cost. Some algorithms such as regression or decision trees are highly interpretable (meaning one can describe how the model arrives at its predicted value).14 Others, such as artificial neural networks, function as “black box” oracles.15 In many public policy applications, for example, political requirements may dictate that only interpretable models are employed so that the public can be reasonably informed about how decisions are being made.

Similarly, there may be large differences in the computational resources needed to regularly fit and deploy a model; some algorithms can be much more expensive than others (since much of this computation is performed on cloud architecture, it can cost substantially more money to train and deploy one algorithm versus another). Some models can be trained and executed quickly, while others require substantially more time. In short, just because a model predicts well does not mean that it is the “best” for every application.

The final output from this step is an algorithm that provides the best performance on the range of criteria considered. This could include the decision to use many machine learning models together in a federation or ensemble.

7.3.8 Model Performance Assessment

The final step before model implementation is to assess the predictive performance of the selected algorithm(s) on an out-of-sample test data set. The selected algorithm is applied to the combined training and validation data sets to build a new predictive model. Then, this model is used to predict outcomes on the out-of-sample test data set. This testing provides estimated performance for how the modeling procedure will perform when fielded. Note that predictive performance for an algorithm on the test data set is usually worse than that observed on the training data set due to some level of overfitting (a more detailed discussion of this issue will be provided later in this chapter), but test data set performance should be similar to that seen on the validation data set. A substantial drop in predictive performance between the validation and test data sets may indicate a problem with the model or algorithm.

7.3.9 Model Implementation

The final modeling step consists of combining all of the data available and the selected algorithm(s) to fit the model that will be deployed. Note that the model coefficients (e.g., for a linear regression model) may change from those that were fit in the training, model selection, and model performance evaluation steps. This is because we are seeking the algorithmic procedure that is best able to use the data available to predict future observations. The model we deploy is an implementation of the winning algorithmic approach and often a completely new model. In applications where there is a plethora of data available for training, it may not be necessary to train a new model by combining the training, validation, and test data sets together, because adding the information from the test data set will have little impact on the resulting model. In this case, the model used for performance assessment is directly fielded.

Note that model implementation often requires much more than simply identifying the model for implementation. Model implementation often requires the development of interactive visualizations, automated reports, or the identification of procedures so that the model can be employed for its intended purpose. In addition, it is wise to set up monitoring procedures so that the “online” performance of the model can be evaluated as it is used. The model performance evaluation step above provides an estimate of how well the model should perform in practice. If model performance begins to diverge from the expected performance, this can indicate that something has changed to diminish the effectiveness of the current model.

7.4 Model Fitting, Model Error, and the Bias-Variance Trade-Off

We seek to minimize predictive error, the difference between predicted and observed values, when a model is applied to new cases. This section describes the procedures used to minimize predictive modeling error in practice.

7.4.1 Components of (Regression) Model Error

In the context of regression, model error is based on the difference between predicted values and observed values. Figure 7.3 shows three different models fit to some sample data. In this figure, the model error is the difference between the observed responses and the fitted line (the predicted values). The points in black represent data known at the time the model is fit (i.e., the training data set), while the points in gray depict points that “arrive” after the model is fit (i.e., during the model implementation phase). It is clear that the model in the middle provides the best overall performance because it minimizes the error of the gray points (i.e., it minimizes the prediction error in practice). The models depicted in the left and right panels do not fit as well. We say that the model depicted in the left panel is biased (or underfit), while the model depicted in the right panel is overfitted and suffers from high variance.16


Figure 7.3 An illustration of model under- and overfitting. The model in the left panel is biased (underfit) because it misrepresents the relationship between the predictor and the response variable. The model in the middle provides a good model fit because it represents the true quadratic relationship between the predictor and the response. The model in the right panel is overfitted to the training data because it used transformations of the predictor variable to minimize the error in the training data set, resulting in a model that does not generalize.

Regression model error consists of bias, variance, and noise components according to the following formula:

$$ E\left[\left(y - \hat{f}(x)\right)^2\right] = \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2, $$

where $\hat{f}(x)$ is the model's prediction for input $x$ and $\sigma^2$ is the variance of the noise.

The noise component of error describes the inherent uncertainty and randomness of the data and is also called irreducible error, because it is never possible to reduce modeling error below the threshold of the inherent randomness that exists in the real world for that data.

The bias component of error is due to incorrectly representing the relationship between predictor variables and the response variable. For example, in Figure 7.3, the left panel depicts a linear model fitted to nonlinear data, producing a biased estimate for many of the data points. The variance component of error is due to oversensitivity to noise or randomness represented in the data used to train the model. The model depicted in the right panel suffers from high variance. It is easy to observe that this model provides a superb fit to the training data (depicted in black) but poor performance on data it has not seen before. This occurs because the model is too complex–in trying to minimize the bias observed in the training data, the model now represents noise (inherent randomness in the data) observed in the training data with high-order terms that do not reflect the true underlying relationship. This model has adapted to idiosyncrasies in the training data and so is overfitted (too specific) to the training data.
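The following sketch reproduces this pattern numerically (Python with NumPy; the quadratic truth, noise level, and polynomial degrees are our own illustrative choices). Training error typically keeps falling as the polynomial degree grows, while out-of-sample error eventually rises:

```python
# Sketch: under- and overfitting with polynomials of increasing degree.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=30)
y = x ** 2 + rng.normal(scale=1.0, size=30)        # quadratic truth plus noise
x_new = rng.uniform(-3, 3, size=30)                # out-of-sample ("gray") points
y_new = x_new ** 2 + rng.normal(scale=1.0, size=30)

for degree in (1, 2, 9):                           # underfit, good fit, overfit
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    print(degree, round(train_mse, 2), round(test_mse, 2))
```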

The variance effect of model overfitting is best explained by illustrating the effect of fitting a new model on a different set of training points, as shown in Figure 7.4: overfitting produces high-variance predictions because different models, and as a result different predictions, can be generated by the same algorithm. This figure updates Figure 7.3 to illustrate a new model produced by using the same feature selection and algorithm but substituting a different training data set. In the panel on the left, the effect of a different set of training points is very small. In the panel on the right, the “fit” to the training points is better than in the left panel in both cases, but the prediction for the out-of-sample points (plotted in gray) is much worse. This effect is produced by not properly penalizing the addition of polynomial terms of the predictor variable in the model.


Figure 7.4 An illustration of the variance effect when using different training data sets. Using a different training data set has little effect on the fitted model (and therefore the predictions) in the panel on the left. In the panel on the right, the model fit to the data (and therefore the predictions) is very different when a different training data set is used. The models (and therefore the predictions) in the panel on the right exhibit greater variance than those on the left.

7.4.2 Model Fitting: Balancing Bias and Variance

Machine learning seeks to find the optimal trade-off between bias and variance as depicted in Figure 7.5, which provides a simple sketch depicting this trade-off. Note that it isn't possible to completely eliminate error, but there exists a theoretical level of model complexity that balances bias and variance in a way that minimizes the total prediction error. Brighton and Gigerenzer [6] provide an excellent in-depth discussion of this trade-off, while Huddleston et al. [7] provide a real-world example of this effect in the development of models for forecasting crime in a major U.S. city.


Figure 7.5 The bias–variance trade-off illustrated. This sketch illustrates how adding model complexity (by adding features etc.) reduces model bias (on the training data set) but increases variance (i.e., reduces the generalization of the model fit). There exists a theoretical “sweet spot” that balances bias and variance in a way that reduces overall error. The machine learning process in Figure 7.2 and machine learning techniques such as cross-validation and regularization are designed to find the optimal amount of complexity–the amount that reduces overall prediction error when the model is deployed.

The modeling procedure illustrated in Figure 7.2, which provides an overview of the algorithmic modeling “competition,” is designed to help identify the algorithm for model fitting that best identifies this “sweet spot” in the bias-variance trade-off. By splitting data sets into the training data set to fit each model and then a validation data set to evaluate each model, we are able to see which modeling procedures best “generalize.” Choosing our models in this way provides us a picture of modeling performance similar to that presented in Figure 7.3.

Using a test data set gives us a “second look” at how the model is likely to perform on data that was not used to either train the model or choose the model. This protects against the situation in which the specific sample chosen for the validation set favors a particular model by random chance. The final performance observed on the test data set during performance evaluation should be very close to that observed on the validation data set. When this is not the case, it is an indication that something has gone wrong, and further investigation is needed.

There are two additional techniques commonly applied in the model-fitting stage to address the bias-variance trade-off: k-fold cross-validation and regularization. k-Fold cross-validation extends the practice of splitting data into training, validation, and test data sets by changing the way training and validation data sets are built.17 To perform k-fold cross-validation, you repeatedly (i.e., k times) perform a sampling procedure (without replacement) to build a validation data set out of the data put aside for training, tuning, and validation (i.e., all of the data that is not put “into the vault” to be used as the test data set for performance evaluation on the algorithm selected for implementation). One practical way to do this is to randomly order the records in your data set and then evenly split your data into the k folds. Figure 7.6 illustrates how this sampling procedure structures the data into k = 5 different folds.


Figure 7.6 Illustration of k-fold cross-validation. Data observations are randomly ordered and then split into k sections (in this case, five sections), with each section used as a validation data set in sequence. Each model-fitting algorithm is applied k times, with 1/k of the data held out as the validation data set and the rest used as the training data set. This provides k independent estimates of performance for each algorithm competed. These k estimates are combined (usually averaged) to provide an overall estimate for each algorithm's performance.

Models are fit on each fold's training data set and model performance is estimated for that fold based on predictions made on that fold's validation data set. In this fivefold case, each of the algorithms applied would have five different validation performance estimates, which are then combined (usually averaged) to provide an overall cross-validation performance summary for that algorithm. The winning algorithm provides the best comprehensive performance over all k folds. This procedure improves on the basic modeling procedure illustrated in Figure 7.2, but it requires considerably more computation and so is not always feasible.
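A minimal sketch of this procedure follows (Python with scikit-learn; the synthetic regression data and the two competing algorithms are illustrative assumptions):

```python
# Sketch: 5-fold cross-validation to compare two competing algorithms.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(250, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(size=250)

cv = KFold(n_splits=5, shuffle=True, random_state=6)  # randomly ordered folds
for name, model in [("linear regression", LinearRegression()),
                    ("k-nearest neighbors", KNeighborsRegressor(n_neighbors=7))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(name, round(-scores.mean(), 3))  # average MSE over the five folds
```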

Regularization attempts to prevent overfitting by imposing a cost (or a penalty) on the complexity of a model.18 Returning to the models compared in Figure 7.3, you can see that as one progresses from the left panel toward the right panel, the model complexity (i.e., number of predictor variables formed by various transformations on the feature x) increases. At the time when the model is fit, only the information provided in the black dots is known and so a good model-fitting algorithm will try to minimize the error between the fitted line and the known points.

In this case, the model error for this regression problem is represented using a statistic such as the sum of squared errors (SSE), where an error as depicted in Figure 7.3 is the vertical distance between a known black dot and the fitted line. Here, the SSE serves as the objective function for the model selection and optimization procedure. Without some kind of penalty on the complexity of the model, any algorithm designed to reduce the error between the fitted line and the observed training data will continue to reward the introduction of increasingly complex terms (e.g., higher-order transformations of the predictor variables as in the right panel of Figure 7.3) into the model in order to reduce the distance between the fitted line and each known point. The result is a badly overfitted model. Regularization techniques improve upon basic error measures such as SSE by incorporating penalty functions for adding complexity that help to balance bias and variance more appropriately.

While there are many methods for regularization, the most commonly applied methods use an information criterion such as the AIC or the BIC as the target function for optimizing model fit.19 All approaches for regularization operate in the same manner: They modify the objective function for optimizing model fit by including both some measure of goodness of fit (such as SSE) and a component that penalizes either the addition of more features (and parameters) into the model or the magnitude of the coefficients fit for those parameters.
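The sketch below contrasts ordinary least squares with ridge regression, one common regularization method whose penalty is the sum of squared coefficients scaled by a weight alpha (the data and penalty weight here are illustrative assumptions):

```python
# Sketch: regularization shrinks coefficients relative to an unpenalized fit.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 20))                 # many features, few observations
y = X[:, 0] + rng.normal(scale=0.5, size=50)  # only one feature truly matters

ols = LinearRegression().fit(X, y)            # objective: SSE alone
ridge = Ridge(alpha=10.0).fit(X, y)           # objective: SSE + alpha * sum(beta_j^2)

# The penalized fit keeps spurious coefficients much closer to zero.
print(np.abs(ols.coef_).max().round(2), np.abs(ridge.coef_).max().round(2))
```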

7.5 Predictive Performance Evaluation

The objective functions used for model fitting and feature selection (e.g., AIC and BIC) will often differ from those used to conduct model selection (on the validation data set) and evaluate model performance (on the test data set). The statistics used for model fitting and feature selection are designed to find an optimal trade-off point in the bias-variance trade-off space as already discussed. They perform regularization.

The statistics used for final model selection and performance evaluation provide meaningful information about the real-world predictive performance of the fitted models when applied in a “business” context. It is critical to test models and algorithms in the same way the model or algorithm will be used in practice.

In addition, analysts must understand that any measure of performance calculated on the training data set is not a valid estimate of predictive performance; performance measures calculated in this way usually (and often significantly) overestimate true predictive performance. Once again, Figure 7.3 provides an illustration of this phenomenon. The training error (the difference between the black points and the fitted line) for the model in the right panel is very small, while the true predictive performance (the difference between the gray points and the fitted line) is much worse. Model selection and performance evaluation always require the use of out-of-sample data sets (i.e., data not used to fit the model). Additional considerations for model performance evaluation for regression problems, classification problems, and problems with time dependency are briefly discussed in the following sections.

7.5.1 Regression Performance Evaluation

Regression problems have response data that are numeric, either continuous or integer valued, and therefore predictive performance for regression problems is based on the numerical difference between a real-world observed response (in the validation or test set) and the predicted value for that observation. The numerical difference between model prediction and observed value (in the validation or test set) is known as a model residual or a model error. The term residual is preferred as the difference between an observation and a predicted value could be due to noise rather than a “modeling error.” However, the term model error is widely used in practice, including in the names of statistics often used to summarize model performance, such as the mean squared error (MSE):

$$ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 $$

root mean squared error (RMSE):

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} $$

or the mean absolute deviation (MAD), also known as the mean absolute error (MAE):

$$ \mathrm{MAD} = \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i - \hat{y}_i \right\rvert, $$

where $y_i$ and $\hat{y}_i$ denote the observed and predicted values for observation $i$, and $n$ is the number of observations in the validation or test data set.

Note that these measures of performance all use the residuals, but each measure may favor a different model. MSE and RMSE penalize large residuals more heavily than does MAD. If the business use case for a model is a situation in which a few large prediction errors would be devastating, RMSE or MSE should be chosen over MAD for model selection and performance evaluation.
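The short sketch below computes all three statistics from the same residuals (the observed and predicted values are made-up illustrative numbers):

```python
# Sketch: MSE, RMSE, and MAD computed from validation- or test-set residuals.
import numpy as np

y_obs = np.array([10.0, 12.0, 9.0, 15.0, 11.0])  # made-up observed responses
y_hat = np.array([11.0, 11.5, 9.5, 12.0, 11.0])  # made-up model predictions

residuals = y_obs - y_hat
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
mad = np.mean(np.abs(residuals))

print(round(mse, 3), round(rmse, 3), round(mad, 3))  # 2.1 1.449 1.0
# The single 3-unit miss (fourth point) inflates MSE and RMSE far more than MAD.
```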

However, the best practice is to convert statistical measures of error (such as regression residuals) into real-world costs, profits, or other measures of real-world performance and then evaluate models in the context of real-world business effect. The model that you implement should be the model that provides the best business case, not the model that provides the best performance of a statistical measure. Structuring model performance in real-world terms facilitates model adoption and implementation because it contextualizes the models in a way that allows for informed decision-making about the effect of employing them.

7.5.2 Classification Performance Evaluation

The goal in a classification problem is to accurately predict the category or class of new observations given their observable features. As previously discussed, a classic example of a classification problem is predicting whether or not a person has (or will develop) a disease based on medical diagnostics. Another is predicting future loan defaults based on credit history. These are binary classification problems. Binary classifiers will usually return the probability of the “positive” case (i.e., the probability that the person develops the disease or defaults on the loan), which is usually denoted with the numerical value 1 for the response variable. A prediction of 51% would imply that the classifier believes it is more likely than not that the observation will present the positive (1) class (i.e., disease or default occurs). However, the threshold used to assert a classification can be varied and the threshold used for binary classifiers in practice can and often does vary from 50%, especially for problems with very low numbers of “positives” such as drug tests and fraud detection. If one infant in 250,000 develops isovaleric acidemia, a model for predicting this genetic disorder that is based on a 50% threshold would perform far worse than simply guessing that no child would develop the genetic disorder (which would be correct in 249,999 of every 250,000 infants or in 99.9996% of infants).

Binary classification performance measures are based around the confusion matrix, also known as a truth table, which is a table that records the counts of occurrence for each of the four outcomes for prediction in a classification problem at a given threshold as shown in Figure 7.7.20


Figure 7.7 Confusion matrix (truth table) for binary classification. (Data from https://en.wikipedia.org/wiki/Confusion_matrix, public domain.)

As shown in Figure 7.7, there are two ways to make a correct classification in a binary classification problem (denoted in green) and two ways to make an error (denoted in red). False positive errors are also known as Type I Errors, and false negative errors are also known as Type II Errors (see Chapter 6 for more discussion).21 There are a wide variety of performance statistics derived from the confusion matrix, but the two most widely used are sensitivity and specificity. Sensitivity measures how well a classifier predicts the presence of disease (or of default) when it actually occurs. Classifier sensitivity is also known as recall, true positive rate (TPR), or detection probability and is calculated from the confusion matrix for a given classification threshold as follows:

$$ \text{Sensitivity} = \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $$

Classifier specificity measures how well a classifier identifies the cases where the disease or default does not actually occur. Specificity is also known as the true negative rate (TNR) and is calculated as

$$ \text{Specificity} = \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} $$

Good classifiers perform well on both measures (high scores are desirable in both cases). However, as a classifier's threshold changes, the performance statistics of the classifier change as well. This creates a trade-off space that can be explored for a given classifier.
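The following sketch computes the confusion-matrix counts, sensitivity, and specificity at a 50% threshold (the labels and predicted probabilities are made-up illustrative values):

```python
# Sketch: confusion-matrix counts and the derived sensitivity and specificity.
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # made-up actual classes
p_hat = np.array([0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.3, 0.8, 0.95, 0.05])

y_pred = (p_hat >= 0.5).astype(int)  # apply a 50% classification threshold
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
print(tp, tn, fp, fn, round(sensitivity, 2), round(specificity, 2))  # 3 4 2 1 0.75 0.67
```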

Receiver operating characteristic (ROC) curves are another tool often used to evaluate the performance of binary classifiers because they allow investigation of sensitivity and specificity over the full range of possible thresholds. ROC curves plot a classifier's TPR (i.e., sensitivity or recall) against its false positive rate (FPR) as the classification threshold is adjusted. The FPR is calculated as

$$ \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = 1 - \text{Specificity} $$

Figure 7.8 provides an illustration of ROC curves plotted for three different classifiers applied to a problem. The optimal point of performance on an ROC curve is the extreme top left position at coordinate (0,1), which indicates perfect prediction (i.e., a 100% TPR and a 0% FPR). Since this level of prediction is rarely (if ever) achieved, you must choose one of the available trade-off points between TPR and FPR performance (i.e., you must select a classification threshold that best balances these objectives).


Figure 7.8 Example ROC curve. This ROC curve provides a visual summary of the available TPR and FPR trade-offs at different classification thresholds for three different models. In this case, no one model provides the best performance for all classification thresholds. For example, at FPRs below 0.15, the model depicted in black provides the best performance, while for FPRs between 0.25 and 0.4, the model depicted in red provides the best performance. Because all three lines plot above the dashed line, which illustrates the predictive performance of random guessing, all three models exhibit predictive power. (Source: BOR at English Wikipedia, https://en.wikipedia.org/wiki/Receiver_operating_characteristic. Used under CC BY-SA 3.0, https://creativecommons.org/licenses/by-sa/3.0/deed.en.)

In Figure 7.8, the trade-off point where you could get a 60% TPR for a 30% FPR might be ideal (and you would select the classifier depicted in red). For a different application, where false positives are more of a concern, you might select the black classifier at the threshold that provides a TPR of 35% with an FPR of 10%. The dashed line illustrates the performance of random guessing, and any classifier whose ROC curve lies above the dashed line outperforms random guessing (i.e., has some predictive power).
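The following sketch traces the (FPR, TPR) pairs that underlie such a curve by sweeping the classification threshold (it reuses made-up labels and probabilities; a real analysis would sweep many more thresholds):

```python
# Sketch: computing the points of an ROC curve by sweeping the threshold.
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.3, 0.8, 0.95, 0.05])

for threshold in np.arange(0.0, 1.01, 0.25):
    y_pred = (p_hat >= threshold).astype(int)
    tpr = np.mean(y_pred[y_true == 1])  # fraction of actual positives predicted 1
    fpr = np.mean(y_pred[y_true == 0])  # fraction of actual negatives predicted 1
    print(round(threshold, 2), round(tpr, 2), round(fpr, 2))
# Plotting the (fpr, tpr) pairs over all thresholds yields a curve like Figure 7.8.
```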

Performance for multinomial classification problems (i.e., when there are three or more classes considered) generalizes in a straightforward manner from the binary case. A binary confusion matrix can be built for each of the classes by consolidating all other classes as shown in Figure 7.9. More frequently, statistics such as overall classification accuracy are used to summarize classifier performance. Classification accuracy is calculated as

$$ \text{Accuracy} = \frac{\text{number of correct classifications}}{\text{total number of classifications}} $$

The overall classification accuracy for the multinomial case depicted in Figure 7.9 can be calculated directly from the number of correct (numbers on diagonal in green) and incorrect (numbers off-diagonal in red) classifications:

$$ \text{Accuracy} = \frac{\sum_{i} n_{ii}}{\sum_{i}\sum_{j} n_{ij}}, $$

where $n_{ij}$ is the number of observations whose actual class is $i$ and whose predicted class is $j$ (so the diagonal entries $n_{ii}$ count the correct classifications).
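A short numeric sketch follows (the three-class confusion-matrix counts are made-up values, not those depicted in Figure 7.9):

```python
# Sketch: overall accuracy from a 3x3 confusion matrix with made-up counts.
import numpy as np

cm = np.array([[50, 3, 2],    # rows: actual class; columns: predicted class
               [4, 40, 6],
               [1, 5, 39]])

accuracy = np.trace(cm) / cm.sum()  # diagonal (correct) over all classifications
print(round(accuracy, 3))           # 129 / 150 = 0.86
```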

In practice, the cost of misclassification is often asymmetric. It may cost considerably more money or resources to commit a false positive error than a false negative error (or vice versa) in a binary classification or to misclassify one particular case in a multinomial classification. For example, when evaluating loan applicants, the cost for falsely predicting a default, and therefore losing the potential profit from the loan due to denying the application, might be much smaller than the cost of failing to predict the default and incurring the much larger costs of processing a defaulted loan. In such cases, it should be possible to develop a “business case” statistic for evaluating classifier performance, translating classification performance into real-world effect. You could evaluate performance by applying the classifiers against a representative test data set of previous long-term customers (some of whom defaulted and others who did not) and calculate the profit or loss generated by different classifiers. This “real-world” assessment is likely to provide a more useful performance evaluation for models than a “statistical” measure of performance.


Figure 7.9 Collapse of a multinomial confusion matrix (truth table) to a binomial confusion matrix (truth table). Class 2 and Class 3 are collapsed into the “Not Class 1” case in order to build a binomial truth table that summarizes classification performance for Class 1.

7.5.3 Performance Evaluation for Time-Dependent Data

Another critical consideration for performance evaluation is time-dependency. There are many real-world applications of machine learning where ignoring the effects of time on the analysis will result in very poor performance for predictive models in practice. When the factors affecting a prediction are likely to change over time, or there are other processes taking place that affect the response and exhibit time dependency, then the design of your model-fitting procedure and performance evaluation must change to account for these time dependencies. While this consideration applies obviously and directly to time series forecasting problems (such as predicting sales in future quarters based on previous history), it is also frequently important in many other regression or classification contexts.

One classic example is in the development of crime prediction models (i.e., crime hot-spot maps that predict where future crimes are likely to take place).22 In this application, the question is whether or not crime hot-spots for future years can be predicted based upon features (or locations) associated with crime in past years. It may or may not be the case (depending on the city) that the locations and features associated with crime change over time (there are many factors such as urban development that can affect the spatial distribution of crime over time). Prior to conducting model fitting and performance evaluation, it is critical to check for time dependency in your data so that specific procedures can account for this dependency in your model fitting, selection, and performance evaluation process.

A rolling-horizons design should be employed for model fitting and performance evaluation when there is time dependency. In a rolling-horizons design, rather than employing cross-validation procedures or splitting data sets between training, validation, and testing subsets, algorithms are evaluated based on their ability to use information from the training period to predict outcomes in a future validation or testing period. This is repeated many times, thus providing multiple estimates for how well a particular algorithm learns from previous data to predict future outcomes. Figure 7.10 illustrates a rolling-horizons design. Performance would be estimated in this case by averaging performance over periods four through seven.

Figure 7.10 Illustration of a rolling-horizons design. Model performance is estimated by averaging predictive performance over periods four through seven. Red portions denote testing periods and blue portions denote training periods.

It is not always necessary to implement a full rolling-horizons design in order to include consideration for time dependency. For many applications, one might simply reserve the most recent year (or month) of results as the test set and use previous years (or months) for training, tuning, and validating models. The key consideration is that model selection decisions and model performance evaluation should not be made on the basis of only one observation. So, in time series forecasting applications (e.g., predicting sales volume for the next time period based on previous periods), it is necessary to use a rolling-horizons design to evaluate the performance over many time periods because each time period contains only one prediction that can be used to evaluate model performance. In the case of the spatial crime prediction example, we could use the most recent time period (the last month or year of data) as the test period provided we had a large enough sample of observations (i.e., crime event locations) that occurred during that test period to develop a reasonable estimate for model predictive performance. One rule of thumb is to include both a minimum of 15 observations and at least 10% of your available data (not applicable in a rolling-horizons design) in your test data set.
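The following sketch implements a simple rolling-horizons evaluation (the synthetic random-walk series and the moving-average forecaster are our own illustrative assumptions):

```python
# Sketch: rolling-horizons evaluation of a simple time series forecaster.
import numpy as np

rng = np.random.default_rng(8)
y = 100 + np.cumsum(rng.normal(size=40))  # made-up series covering 40 periods

start, errors = 30, []
for t in range(start, len(y)):            # train on periods [0, t), predict period t
    history = y[:t]
    forecast = history[-4:].mean()        # moving-average forecast from recent periods
    errors.append(abs(y[t] - forecast))   # out-of-sample error for this horizon

print(round(float(np.mean(errors)), 3))   # MAD averaged over the ten rolling horizons
```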

7.6 An Overview of Supervised Learning Algorithms

Now that we have discussed machine learning paradigms in general terms, we turn to a discussion of the machine learning algorithms used to develop predictive models. This section provides brief summaries for many of the most frequently employed supervised learning algorithms, that is, methods that require labeled response data to train models for regression (numerical prediction) or classification (categorical prediction) problems. Methods described in this section include the following:

  • k-nearest neighbors (KNN)
  • Extensions to regression
  • Classification and regression trees (CART)
  • Time series forecasting
  • Support vector machines (SVM)
  • Artificial neural networks
  • Ensemble methods

This is not a comprehensive list of all of the supervised learning algorithms but rather a representative sample of procedures frequently used. For each of these methods, we briefly describe how the method works, describe some relevant extensions where applicable, and provide some example use cases.

7.6.1 k-Nearest Neighbors (KNN)

The KNN algorithm's simplicity belies its excellent performance for many real-world regression and classification problems.23 In this approach, the training data provide the model. Any new observation is simply compared to the k-nearest observations (measured in high-dimensional feature space) in the training data set. For regression problems, the response values for the k-nearest training observations are usually averaged to provide the prediction for the new observation. In a classification problem, each of the k-nearest observations provides a vote for its own class, which yields an empirical estimate of the probability of belonging to each class. In the example provided in Figure 7.11, when the parameter k is set to three, the predicted class would be “red triangle,” because two out of three of the nearest neighbors are from the red triangle class. When the parameter k is set to five, the predicted class would be “blue square” with a probability of 60% (i.e., 3/5), because three out of five of the nearest neighbors are blue squares.

img

Figure 7.11 Illustration of the k-nearest neighbors algorithm. In this illustration, when k = 3, the classification would be “red triangle” because two out of three of the nearest observations are from the red triangle class. When k = 5, the classification would be “blue square” with a probability of 60% (i.e., 3/5) because three out of five of the nearest neighbors are blue squares. (Source: Ajanki, https://commons.wikimedia.org/wiki/File:KnnClassification.svg. Used under CC BY-SA 3.0, https://creativecommons.org/licenses/by-sa/3.0/deed.en.)

Key modeling decisions to be made in “fitting” these models include

  • selection of the parameter k (i.e., the number of neighbors to consider),
  • the features to include in the feature space (i.e., feature selection), and
  • consideration of various approaches for weighting the contribution of neighbors as a function of their distance from the target observation (i.e., near neighbor votes can factor more heavily than far neighbors).

Calculating the distance between a new (out-of-sample) observation and each observation in the training data set (so that the k-nearest neighbors can be found) can be computationally expensive on large data sets with many features, so unsupervised dimension-reduction algorithms are often used to shrink the considered feature space. One common use of the KNN algorithm is in online shopping recommendation systems that suggest products to customers that “near-neighbor” customers bought.
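
As a minimal sketch of KNN in practice, the following Python example uses the scikit-learn library; the data set (iris), the choice of k = 5, and the distance weighting are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # k = 5 neighbors; "distance" weighting counts near neighbors more
    # heavily than far ones, as discussed above.
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))       # classification accuracy
    print(knn.predict_proba(X_test[:1]))   # empirical class "votes"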

7.6.2 Extensions to Regression

Regression is widely used in machine learning for prediction, time series forecasting, and classification problems. In most applications, extensions or variations on basic linear regression are applied to improve predictive performance. Commonly applied extensions to linear regression include ridge regression,24 least absolute shrinkage and selection operator (LASSO) regression,25 and logistic regression.26 Ridge regression and LASSO provide regularization (shrinkage) procedures used to improve the performance of regression models; LASSO can also perform feature selection by shrinking some coefficients to exactly zero. Logistic regression (and other generalized linear models) provides the opportunity to fit models for classification problems and other problems for which the assumptions of linear regression are not met.27
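
The following Python sketch (using scikit-learn, with simulated data and illustrative regularization parameters) contrasts these three extensions:

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

    # Ridge shrinks all coefficients; LASSO can zero some out entirely,
    # performing feature selection as a by-product of regularization.
    print(Ridge(alpha=1.0).fit(X, y).coef_)
    print(Lasso(alpha=0.1).fit(X, y).coef_)

    # Logistic regression for a binary classification version of the task
    y_class = (y > 0).astype(int)
    print(LogisticRegression().fit(X, y_class).predict_proba(X[:2]))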

Figure 7.12 provides one example of a relationship described using logistic regression (note that the fitted relationship is nonlinear). This figure shows the modeled change in the relative likelihood (i.e., probability) of observing a crime committed by a criminal gang as distance from a claimed gang territory increases [7]. As can be seen in the figure, the probability of observing a gang crime ½ mile from a gang's territory is only about 25% of the probability of observing a gang crime if you are actually in a gang territory. The circles at the top of the figure plot actual crime occurrence (i.e., the response variable = 1), while the plotted circles at the bottom plot the distance from a gang territory for the locations where crimes did not occur (i.e., the response variable = 0).

img

Figure 7.12 An example of a relationship model using logistic regression. This figure shows the fitted relationship asserting that the probability of observing a gang crime ½ mile from a gang's territory is only about 25% of the probability of observing a gang crime if you are actually in a gang territory. The circles at the top of the figure plot actual crime occurrence in the training data set (i.e., the response variable = 1), while the plotted circles at the bottom plot the distance from a gang territory for the locations where crimes did not occur in the training data set (i.e., the response variable = 0).

7.6.3 Classification and Regression Trees

Classification and regression tree (CART) models provide a highly visual and human-interpretable method for regression and classification problems.28 They are closely related to the decision trees used for probabilistic decision-making. CART models are built by splitting the data from the top down into subgroups that are less “impure,” meaning that each subsequent feature-based split as you move down the tree produces a grouping of observations that is more homogeneous in class (for classification) or has less variance (for regression) than the parent node above it.

A completely naïve application of decision trees partitions the training data based on its features until each of the terminal leaves (i.e., the nodes at the end of the branches) of the tree is absolutely pure (i.e., all observations in the leaf have the same class or value). Such a tree is usually overfitted, so the focus during model fitting is on selecting an appropriate algorithm for partitioning the tree and choosing the method used to prune the tree back to an appropriate level such that the tree describes what is generally true rather than simply reflecting the data (and the noise) in the training data set.

Figure 7.13 provides an illustration of a classification tree fitted for predicting survival in a shipwreck using data from the Titanic. As this figure illustrates, in a pruned tree the terminal leaves will usually include observations from multiple classes. Classification trees predict by identifying the correct terminal node for a new observation and returning the class probability derived via “class voting” by the training observations in that node. Figure 7.13 asserts that if you are a male passenger over the age of 9.5, you are estimated to have a 17% probability of survival in a similar shipwreck. Regression trees predict in a similar manner by returning the average of the response variable for the training observations in each terminal node.

img

Figure 7.13 Illustration of a classification tree describing the probability of survival on the Titanic based on the features of the passengers. Each terminal node provides the probability of survival (i.e., the probability of being in Class 1) as a number between 0 and 1 and the percentage of observations in the data set described by that node. A male passenger over the age of 9.5 is predicted to have a 17% probability of survival under similar circumstances. (Source: Dlary, https://commons.wikimedia.org/wiki/File:Titanic_Survival_Decison_Tree_SVG.png. Used under CC BY-SA 4.0, https://creativecommons.org/licenses/by-sa/4.0/deed.en.)

CART models are highly interpretable and therefore are an excellent method for applications in public policy decision-making where it is necessary to explain how employed models make their recommendations (i.e., CART models are not unexplainable “black boxes”). However, other machine learning methods often outperform them in practice. CART models are often used in ensemble methods such as random forest modeling that combine the predictions of many (perhaps thousands) of fitted CART models together. Ensemble procedures such as boosting and bagging can greatly improve upon CART modeling's predictive performance at the cost of reducing interpretability.
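
A minimal scikit-learn sketch of fitting and pruning a classification tree appears below; the data set and the pruning parameters (max_depth and ccp_alpha) are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)

    # max_depth and ccp_alpha are two common ways to "prune" the tree so
    # it captures general structure rather than noise in the training set.
    tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,
                                  random_state=0)
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))
    print(export_text(tree))   # the fitted tree in human-readable form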

7.6.4 Time Series Forecasting

Although many machine learning and statistical learning texts do not provide a discussion of time series forecasting methods, these problems occur frequently in business, military, and public policy domains and so are briefly discussed here because analytics teams often spend a considerable amount of time working on these problems. A time series is a series of numerical values indexed by time.29 This could include many different values of interest such as product sales volume, call volumes for a call center, or crime counts in a city.

In the context of time series forecasting, we make a distinction between a prediction and a forecast. A prediction is defined as an assertion (often probabilistic) that a specific event will take place, whereas a forecast is an assertion of how much of something will occur over a specified geographic area and period of time. In this context, the nightly weather “forecast” might include a prediction about the high temperature for the following day and a forecast for the amount of rain. We provide a brief discussion of some frequently used approaches here and recommend Forecasting: Principles and Practice [8], which is freely available online, as a first reference for the techniques presented here.30 There are many more in-depth treatments of time series forecasting in the statistical literature.

Figure 7.14 provides an example use case for time series forecasting methods. This figure illustrates the results obtained by applying a rolling-horizon design to forecast the number of weekly burglaries in Pittsburgh in the year 2008. The first forecast is plotted for week 3 (with weeks 1 and 2 used as a training data set), and then each week the model is refit using the new data and a new forecast is made for the following week. This figure illustrates the intended use of the model: forecasting the burglary count for future periods, which in this case would be week 51, where the model forecasts an expected burglary count of 51.4.

img

Figure 7.14 Illustration of a one-step ahead forecast for burglaries in Pittsburgh. Actual observations are depicted with gray bars and the forecast fit with an exponential smoothing model is depicted with the black line. A rolling-horizon design is used to make predictions for weeks 3 through 51.

Time series model fitting is similar in many ways to predictive regression modeling but is unique in that previous observations of the response variable are often the most important predictive feature for the forecast. The statistical term for this use of previous observations of the response variable is autoregression.31 Fitting time series models often involves modeling the four components of a time series: the trend, seasonality effects (rises and falls in the data on a fixed pattern, often due to the effects of weather), cycles (rises and falls in the data that fall outside of a fixed pattern), and noise.32 Fitting and evaluating time series models also requires the use of a rolling-horizon design due to the time dependency inherent in these problems.

The three most common methods used for time series forecasting are time series regression, exponential smoothing, and autoregressive integrated moving average (ARIMA) models.33 Time series regression extends basic regression to include autoregression against previous observations of the response variable (i.e., burglary counts in previous weeks as illustrated in Figure 7.14). Time series regression also facilitates the use of other variables (i.e., other time series) as predictive features. For example, various studies have related temperature and weather effects to crime occurrence, so predicted temperature over the next week could be incorporated into a forecasting model using time series regression. ARIMA models extend basic autoregressive modeling to account for trends and other effects.

Exponential smoothing is a nonparametric technique that develops a forecast for the next period by using the immediately previous forecast and the immediately previous observation as the predictor variables according to the following formula:

\( F_{t+1} = \alpha y_t + (1 - \alpha) F_t \)

Here \( y_t \) is the observation and \( F_t \) the forecast for period t. The parameter \( \alpha \) is a tuning parameter, taking values between 0 and 1, that places more weight either on the previous observation or the previous forecast. This model form results in a recursive relationship with previous forecasts, with the effects of previous forecasts decaying exponentially backward in time; hence the algorithm's name. Fitting an exponential smoothing model is relatively simple and straightforward, as it requires only the optimization of the weighting parameter \( \alpha \). Basic exponential smoothing has been extended to account for trend and seasonal effects in an algorithm now known as Holt–Winters exponential smoothing [9,10].
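
A minimal Python implementation of this recursion, using hypothetical weekly counts and an assumed value of \( \alpha \), might look as follows:

    def exponential_smoothing(observations, alpha):
        """Simple exponential smoothing following
        F[t+1] = alpha * y[t] + (1 - alpha) * F[t]."""
        forecast = observations[0]    # a common initialization choice
        forecasts = [forecast]
        for y in observations[1:]:
            forecast = alpha * y + (1 - alpha) * forecast
            forecasts.append(forecast)
        return forecasts              # last entry forecasts the next period

    weekly_counts = [48, 52, 47, 55, 51, 49, 53]  # hypothetical data
    print(exponential_smoothing(weekly_counts, alpha=0.3))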

7.6.5 Support Vector Machines

Support vector machines are a machine learning technique originally developed for classification problems, although extensions have now been developed to facilitate their use for regression problems. The motivating principle for support vector machines is to find a maximum separating hyperplane in feature space that separates classes. This hyperplane is illustrated for a two-dimensional feature space in the right panel of Figure 7.15.

img

Figure 7.15 Illustration of support vector machine classification. Observations are remapped into a higher dimensional space where a linear separator can be drawn between classes. The dashed lines depict the margins, which define the edge of each class. The red line depicts the maximum separating hyperplane, which is used to define the decision rule for classification. (Source: Alisneaky, https://commons.wikimedia.org/wiki/File:Kernel_Machine.svg, Used under CC BY-SA 4.0, https://creativecommons.org/licenses/by-sa/4.0/deed.en.)

The idea behind finding the maximum separating hyperplane is that if we can find the plane (i.e., line in the two-dimensional case) with the greatest space between the two classes, then the decision rule for classification that corresponds to that plane will be most generalizable to new cases, because we have separated the two classes by as much distance as possible. The linear boundaries in high-dimensional space that define the “edges” of each class are known as the margins. The margins in the right panel of Figure 7.15 are depicted as dashed lines, and the central red line depicts the classification threshold (the separating hyperplane) that defines the decision rule. The points from each class that lie on and define the margins are known as the support vectors, which give the algorithm its name.

Support vector machines can also be applied to nonlinear classification by mapping training observations into a new, usually higher dimensional feature space where a linear separating hyperplane can be drawn between the classes in the training data. An example of this remapping is illustrated in Figure 7.15, where the nonlinear relationship in the left panel is transformed and projected into a new space (the right panel), where a linear hyperplane exists that separates the two classes. New out-of-sample observations are projected into this same higher dimensional space and classified based on which side of the linear separating hyperplane they fall.
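
The following Python sketch (scikit-learn, with simulated nonlinearly separable data and an assumed RBF kernel) illustrates this remapping idea:

    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Two classes that are not linearly separable in the original space,
    # analogous to the left panel of Figure 7.15.
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4,
                        random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)

    # The RBF kernel implicitly remaps observations into a higher
    # dimensional space where a linear separating hyperplane exists.
    svm = SVC(kernel="rbf", C=1.0)
    svm.fit(X_train, y_train)
    print(svm.score(X_test, y_test))
    print(len(svm.support_))   # number of support vectors found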

7.6.6 Artificial Neural Networks

Artificial neural networks are motivated by trying to mimic the way neurons in the brain fire to influence human learning and decision-making. This modeling framework is very flexible and can be adapted for classification problems, regression problems, and unsupervised learning applications such as clustering. They are also applied extensively in reinforcement learning, and they underpin artificial intelligence systems used for image recognition and self-driving cars.

In the basic operation of artificial neural networks, input data (such as predictive features) are fed into hidden layers of nodes (neurons), which perform transformations of the input values and pass them on to other hidden layers of nodes, until eventually the network of connected nodes emits the output values (predictions or recommendations). In this way, at each hidden layer, a neural network automatically generates a new set of features that are functions of the features in the preceding layer. The features in these hidden layers are often better able to predict the response variable than the original set of input features. The automated process of building these hidden layers is an example of the feature learning previously discussed. Figure 7.16 provides an illustration of how hidden layers of neurons transform input values (i.e., predictive features) into output values (predictions).

img

Figure 7.16 Illustration of a multilayer neural network that translates input values (predictors) into output values (predictions). (Source: John Salatas, https://commons.wikimedia.org/wiki/File:Multilayer_Neural_Network.png. Used under CC BY-SA 3.0, https://creativecommons.org/licenses/by-sa/3.0/deed.en.)

Model fitting for neural networks involves making decisions about the number of hidden layers to include in the network, the number of nodes in each hidden layer, which nodes should be linked together (i.e., defining all of the network paths), and the transformation functions for every node in the network. Most of these decisions are made iteratively during the model-fitting stage based on minimization of some penalty function, which is sometimes referred to as a loss function, cost function, or error function. The most common method for iterative improvement is backpropagation: feeding the errors made by the neural network back into the network and allowing it to iteratively improve its performance (or learn from its errors) by revising the weights placed on the connections between nodes.
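
As a minimal illustration, the following scikit-learn sketch fits a small network by backpropagation; the architecture (two hidden layers of 32 nodes each) and the data set are illustrative assumptions:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)

    # Two hidden layers of 32 nodes each; connection weights are revised
    # by iterating backpropagation until the loss stops improving.
    net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000,
                        random_state=0)
    net.fit(X_train, y_train)
    print(net.score(X_test, y_test))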

The strength of artificial neural networks lies in their incredible adaptability to a vast array of problems. However, the decision space to be explored for fitting an artificial neural network is very large, and so fitting artificial neural networks often requires considerably more training data and computational resources than other machine learning algorithms. Artificial neural networks also suffer from being almost completely uninterpretable. The use of artificial neural networks and the many extensions to the basic framework, such as recurrent neural networks and deep learning, has exploded in recent years due to the availability of large cloud-based computing clusters that can be leveraged to fit these highly adaptable but complex models.34,35

7.6.7 Ensemble Methods

One of the most powerful techniques in machine learning is the ensemble, a model federation that combines predictions from several (or many) machine learning algorithms.36 As a general rule, ensembles improve predictive performance at the cost of making the process for producing predictions far less interpretable. There are many techniques for combining the predictions of machine learning models. We briefly discuss several of the most frequently used, including the following:

  • Simple averaging
  • Stacking
  • Bootstrap aggregating (i.e., bagging)
  • Boosting

The simplest approach to ensembles involves averaging the outputs of several different models to form a consensus prediction or classification. Figure 7.17 provides an illustration of the results of combining the forecasts from three different time series forecasting algorithms into an ensemble forecast. As can be seen in Figure 7.17, this method is a variance-reduction technique: the resulting ensemble forecast, as an average of the other models, reins in the most extreme forecasts of each of the different algorithms.

img

Figure 7.17 Illustration of the performance of an ensemble model developed from three time series forecasting models for forecasting the weekly homicide count in Chicago during the year 2014. The ensemble formed by the average of the three other models reins in the extreme predictions of each method, reducing the variance. Although the improvement is difficult to see by visual inspection, this averaging improves predictive performance.

One can often improve on simple averaged ensembles by developing performance-based weights for the contributions of different models. This is known as stacking because you are stacking one machine learning algorithm on top of others. The default stacking approach uses linear regression to assign coefficient weights to the predictions of other machine learning algorithms, but there are many other schemes for implementing this; any supervised machine learning algorithm can serve as the model at the top of the stack.

Bootstrap aggregating, more frequently referred to as bagging, involves iteratively sampling (with replacement) from the training data set to create many training samples of the same size. A model is built from each sample, resulting in an ensemble of many prediction or classification models that have been fit with the same machine learning algorithm using slightly different training data sets. The bagged ensemble's predictions are averaged (for regression problems) or serve as class votes (for classification problems). The random forest algorithm, one of the most frequently employed machine learning techniques, is a bagging ensemble formed by iteratively fitting many (i.e., a forest of) classification or regression trees.37 The random forest performs the basic bagging procedure for creating training data sets but extends it by also sampling which features will be included in each training data set. Bagging procedures tend to reduce model error variance and improve model generalization.
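
The following Python sketch illustrates both a hand-rolled bagging procedure and the scikit-learn random forest; the data set and the number of trees are illustrative assumptions:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)

    # Hand-rolled bagging: fit trees on bootstrap samples (drawn with
    # replacement), then let the ensemble vote on each test observation.
    rng = np.random.default_rng(0)
    n_trees = 100
    votes = np.zeros(len(y_test))
    for _ in range(n_trees):
        idx = rng.integers(0, len(y_train), size=len(y_train))
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes += tree.predict(X_test)
    bagged_pred = (votes / n_trees > 0.5).astype(int)
    print((bagged_pred == y_test).mean())

    # Random forest: bagging plus random feature sampling
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    print(rf.fit(X_train, y_train).score(X_test, y_test))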

Boosting is designed to reduce model bias; it was originally intended for classification problems and later extended to regression problems. Boosting builds a sequence of models in which each new model is designed to improve performance on observations in the training data set that were misclassified by the previous model. This is done by applying weights to misclassified instances from the previous step and then refitting the model (in essence, items that were misclassified in the previous step are prioritized more heavily in the next model). The predictions of all of these models are aggregated to form a final prediction. A common approach to boosting is an algorithm known as AdaBoost, which has been referred to as the “best off-the-shelf classifier in the world” when applied to decision trees ([3], p. 302) [11].38
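
A minimal AdaBoost sketch using scikit-learn follows; the data set and the number of estimators are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)

    # AdaBoost fits a sequence of shallow trees, reweighting the
    # misclassified training points before each refit.
    boosted = AdaBoostClassifier(n_estimators=200, random_state=0)
    print(boosted.fit(X_train, y_train).score(X_test, y_test))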

7.7 Unsupervised Learning Algorithms

Unsupervised methods are designed to identify the latent (i.e., underlying or hidden) structures in data. These structures may be groups, underlying probability distributions, or variance structures. This section provides a brief description of several of the most frequently employed unsupervised learning techniques, although there are many more than are discussed here. Models discussed here include the following:

  • Kernel density estimation
  • Association rule mining
  • Clustering methods
  • Principal components analysis (PCA)
  • Bag-of-words and vector space models

7.7.1 Kernel Density Estimation

Kernel density estimation is a nonparametric statistical method for estimating an underlying probability distribution.39 It is in essence a smoothing technique: observations assumed to have been generated by some underlying parent distribution are used to estimate that parent distribution. Figure 7.18 illustrates how kernel density estimation provides a density estimate in a two-dimensional space given some sample observations. A kernel smoothing function (in this case, a two-dimensional Gaussian density function) is applied to each of the observations. All of the functions are then combined, providing a smooth estimate of the underlying probability density function, which is often plotted as a “heat map” or hot-spot map.

img

Figure 7.18 Fitting a kernel density estimate in two dimensions. Kernel functions are placed on top of each observation and then summed. The resulting density estimate is often plotted as a heat map or hot-spot map to depict the estimate for the underlying density function. This is often employed for crime prediction.

Fitting kernel density estimates requires the selection of the kernel function, which is a nonnegative function that integrates to 1.0 (so probability density functions are natural kernel functions). Using a kernel function requires selection of parameters such as a bandwidth and the parameters that define the probability density function used as the kernel function.40 The optimal setting of these parameters remains an open research question, but there are some widely employed “rules of thumb” for automatically selecting these parameters based on the observed data.
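
As a minimal sketch, the following Python example fits a two-dimensional kernel density estimate with SciPy's gaussian_kde, which by default selects the bandwidth via Scott's rule of thumb; the simulated event locations are an illustrative assumption:

    import numpy as np
    from scipy.stats import gaussian_kde

    # Hypothetical 2-D event locations (e.g., crime incident coordinates)
    rng = np.random.default_rng(0)
    events = rng.normal(loc=[[0.0, 0.0]], scale=0.5, size=(200, 2))

    # gaussian_kde places a Gaussian kernel on each observation and sums
    # them; it expects data with shape (n_dimensions, n_observations).
    kde = gaussian_kde(events.T)
    query = np.array([[0.0, 1.0],    # x-coordinates of two query points
                      [0.0, 1.0]])   # y-coordinates of two query points
    print(kde(query))                # density estimates at those points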

Although kernel density estimation is an unsupervised learning technique designed to reveal underlying probability distributions, it is widely employed in predictive policing to predict future “crime hot-spots” and make resourcing decisions about the deployment of security forces (i.e., police and military patrols) [12,13]. However, other machine learning algorithms, such as logistic regression, have been shown to substantially outperform kernel density estimation when an appropriate feature set can be built, albeit at the cost of a much more involved model-fitting process [14,15]. Kernel density estimation remains widely used in predictive military and police applications because its relatively simple model-fitting procedure can be performed using mapping software that virtually every police or military organization fields.

7.7.2 Association Rule Mining

Association rule mining is often colloquially referred to as “market basket analysis” because one of its primary applications is analyzing the shopping cart activity of retail customers using point-of-sale data. An association rule is a statement such as, “On Friday nights, customers who buy beer also frequently buy diapers.” More formally, such a rule might be written as

\( \{\text{beer}\} \Rightarrow \{\text{diapers}\} \)

The goal of association rule mining is to discover relationships that are useful or important in the desired business context but for which there is no target or response variable against which to model. Rather, the available data are investigated for rules that meet criteria for “interestingness” such as support (frequency of appearance), confidence (how often the rule is true), lift (a measure of how much more frequently the combination of items occurs than it would under conditions of statistical independence), and others. Usually, due to the vast number of rules that can be developed on a large data set of point-of-sale records, analysts define minimum thresholds on multiple criteria that must be met simultaneously for rules to be considered for further investigation.
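
The following Python sketch computes support, confidence, and lift for one candidate rule over a tiny transaction set; the transactions and the rule itself are illustrative assumptions:

    # Hypothetical point-of-sale transactions
    transactions = [
        {"beer", "diapers", "chips"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "chips"},
        {"diapers", "milk"},
    ]

    def support(itemset):
        """Fraction of transactions containing the entire itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Metrics for the candidate rule {beer} => {diapers}
    sup = support({"beer", "diapers"})     # frequency of the combination
    conf = sup / support({"beer"})         # how often the rule holds
    lift = conf / support({"diapers"})     # vs. statistical independence
    print(sup, conf, lift)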

7.7.3 Clustering Methods

Clustering methods are designed to reveal hidden groupings of the observations in a data set, a task sometimes referred to as data segmentation.41 A cluster is a set of observations that are more similar to each other than they are to the other observations in the data set. Clustering methods typically require a means of measuring similarity or dissimilarity (there are a variety of approaches for this) and an algorithm for grouping similar items. Two frequently employed techniques for clustering are k-means clustering and hierarchical clustering.

k-means clustering is an iterative algorithm for identifying clusters that requires that all features in the data set be numerical and that the analyst specify the number of clusters, k. The algorithm starts with an initial guess for the cluster center points and then iteratively minimizes the sum of the squared distances between each cluster center point and the data points assigned to it. After several (perhaps many) iterations, the feature space is partitioned based upon the locations of the k center points, with clusters defined by grouping every observation with its nearest center point, usually measured in Euclidean distance.42
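
A minimal scikit-learn sketch of k-means on simulated two-dimensional data follows; the value k = 3 and the data-generating choices are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical data with three loose groupings
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
                   for c in ([0, 0], [3, 0], [1.5, 2.5])])

    # The analyst must specify k; the algorithm alternates between
    # assigning points to the nearest center and recomputing each center.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)   # the three fitted center points
    print(km.labels_[:10])       # cluster assignments for the first points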

Hierarchical clustering partitions the data into a format similar to that shown for a decision tree. The top-level cluster includes all of the data, while the deepest leaves of the hierarchy contain individual observations. Hierarchical clustering algorithms require the analyst to define a measure of dissimilarity to be used for partitioning the clusters. There are different strategies for partitioning the data: some use a top-down splitting approach (divisive clustering), while others form clusters by building from the bottom up (agglomerative clustering). The result is a complete mapping of the “similarity” of all items presented in a hierarchical format.

7.7.4 Principal Components Analysis (PCA)

PCA is a dimension-reduction technique used to represent data in a more compact yet more descriptive form. The key idea behind principal components analysis is to remap observations in a high-dimensional space (i.e., a feature space) into a new space in which the first principal component is a vector that describes the direction of maximum variance in the data and subsequent components are linearly uncorrelated with each other. In the lexicon of linear algebra, the first principal component is the eigenvector of the data's covariance matrix with the highest eigenvalue.43 Each subsequent component is orthogonal to all preceding components and describes the direction of the highest remaining variance under that condition. Figure 7.19 provides an illustration of the first (largest) principal component (the long vector pointing toward the top right) and the second (smallest, by definition, in two dimensions) principal component, which lies orthogonal to the first. As can be seen in the figure, the first principal component points in the direction of maximum variance.

img

Figure 7.19 An illustration of principal component mapping in two dimensions. The first principal component maps the direction of highest variance and in this figure points to the top right. The second principal component (by definition the smallest in this case) lies orthogonal to the first. (Source: Nicoguaro, https://commons.wikimedia.org/wiki/File:GaussianScatterPCA.svg. Used under CC BY 4.0, https://creativecommons.org/licenses/by/4.0/deed.en.)

PCA produces a new feature space with the same number of components as the original feature space has dimensions, and together the PCA dimensions explain all of the variation in the original data. PCA works best when most of the variance in the data set can be described with relatively few components, so that the data set can be summarized using these most descriptive principal components. Dimensions with high variance often provide distinguishing features for regression and classification problems, and thus these dimensions provide a concise and useful summary of the data set. These few highly descriptive principal components are often used as the predictive features for subsequent applications of supervised machine learning algorithms, as in principal components regression.44 Ridge regression is closely related: it can be interpreted as shrinking regression coefficients most along the principal component directions with the least variance.
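
As a minimal sketch, the following scikit-learn example computes the full decomposition and then retains only the components needed to explain an assumed 95% of the variance:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # Full decomposition: as many components as original dimensions
    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_)   # variance share per component

    # Keep only the components needed to explain 95% of the variance
    X_reduced = PCA(n_components=0.95).fit_transform(X)
    print(X_reduced.shape)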

7.7.5 Bag-of-Words and Vector Space Models

Bag-of-words is employed for a frequent analytics problem known as “find similar items.”45 This applies when the items to be found and matched contain blocks of text or other characters that make up “words” such as phone numbers or emojis. This technique is frequently applied to Web sites, tweets, e-mails, or text documents like books and articles. Bag-of-words is a dimension-reduction technique for representing text in a more compact form that also facilitates a matching procedure with other items of interest.

The basic idea is to convert the item of interest into the set of all of the unique “words” in the document. This set of words is known as its “bag-of-words.” If every document in a large corpus (such as a library's holdings) is converted into a bag-of-words, then similarity measures such as the Jaccard Index can be used to match documents in the corpus to new documents (such as student thesis proposals) based on word similarity in the bag-of-words list. In this way, you can provide a list of items that are similar to the item offered as the “query” (i.e., help students find relevant references for their research in the library).

The Jaccard Index for this application is calculated as follows:

\( J(A, B) = \frac{|A \cap B|}{|A \cup B|} \), where \( A \) and \( B \) are the bag-of-words sets of the two documents.

A vector space model extends the basic set representation of bag-of-words to include dimensions that represent how frequently a word appears.46 A document can be represented as a vector in a large “global” feature space in which the dimensions (i.e., features) of the space are words and the location of a document along a word's dimension is given by the number of appearances of that word. When a 0 appears in a document's vector, it indicates that the word representing that dimension does not appear anywhere in the document. Vector-based similarity measures such as cosine similarity can be used to find similar items based on the angle between any two documents' vector representations in this high-dimensional space according to the following formula (which uses vector notation):

\( \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} \), where \( \mathbf{a} \) and \( \mathbf{b} \) are the two documents' word-count vectors.

There are many extensions to this basic construct that vary the way documents are represented in this feature space.
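
The following self-contained Python sketch computes both the Jaccard Index on bag-of-words sets and cosine similarity on word-count vectors; the two toy documents are illustrative assumptions:

    import math

    doc_a = "the cat sat on the mat".split()
    doc_b = "the cat sat on the hat".split()

    # Jaccard Index on the bag-of-words sets
    a, b = set(doc_a), set(doc_b)
    jaccard = len(a & b) / len(a | b)

    # Cosine similarity on word-count vectors over the shared vocabulary
    vocab = sorted(a | b)
    va = [doc_a.count(w) for w in vocab]
    vb = [doc_b.count(w) for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = (math.sqrt(sum(x * x for x in va))
            * math.sqrt(sum(x * x for x in vb)))
    print(jaccard, dot / norm)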

7.8 Conclusion

As the many examples in this chapter have shown, training machines (i.e., computers) to “learn” via the application of algorithmic modeling has a wide variety of very diverse applications. This chapter has also demonstrated that even though the algorithmic procedures employed to “fit” these models can be automated to a certain extent, the machines still require considerable input from analysts for their “training.” While individual machine learning algorithms provide a framework for approaching a particular class of problem, choosing the right machine learning algorithm for any particular problem is a highly complex and iterative process that requires considerable expertise, judgment, and often the active participation of domain experts and users of your results. Often, for best results, multiple machine learning algorithms, as well as best practices for data storage, data engineering, and computing, will be needed. Practitioners are well advised to algorithmically model in teams that incorporate statisticians, operations research analysts, computer scientists, data engineers, data scientists, and domain experts to form a comprehensive unit dedicated to training the machines to “learn” to solve the right problems in the best way.

7.9 Acknowledgments

The authors are grateful for helpful comments and suggested examples from colleagues Lyn Whitaker (modeling process, support vector machines, neural networks, and the bias-variance trade-off) and Ruriko Yoshida (principal components analysis).

Notes

References

  1. Breiman L (2001) Statistical modeling: the two cultures. Stat. Sci. 16 (3): 199–231.
  2. James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical Learning with Applications in R (Springer, New York, NY).
  3. Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer).
  4. Huddleston SH, Brown DE (2009) A statistical threat assessment. IEEE Trans. Syst. Man Cybern. A Syst. Hum. 39 (6): 1307–1315.
  5. Huddleston SH, Brown DE (2012) Mapping gang spheres of influence. Crime Mapp. J. Res. Pract. 4 (2): 39–67.
  6. Brighton H, Gigerenzer G (2015) The bias bias. J. Bus. Res. 68 (8): 1772–1784.
  7. Huddleston SH, Porter JH, Brown DE (2015) Improving forecasts for noisy geographic time series. J. Bus. Res. 68 (8): 1810–1818.
  8. Hyndman RJ, Athanasopoulos G (2014) Forecasting: Principles and Practice. OTexts.com.
  9. Holt CC (1957) Forecasting seasonals and trends by exponentially weighted moving averages. Int. J. Forecast. 20: 5–10.
  10. Winters PR (1960) Forecasting sales by exponentially weighted moving averages. Manage. Sci. doi: https://doi.org/10.1287/mnsc.6.3.324.
  11. Breiman L (1998) Arcing classifiers. Ann. Stat. 26 (3): 801–849.
  12. Eck JE, Chainey S, Cameron JG, Leitner M, Wilson RE (2005) Mapping Crime: Understanding Hot Spots (National Institute of Justice).
  13. Boba R (2005) Crime Analysis and Crime Mapping (Sage Publications, London).
  14. Liu H, Brown DE (2004) A new point process transition density model for space-time event prediction. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 34 (3): 310–324.
  15. Smith MA, Brown DE (2007) Discrete choice analysis of spatial attack sites. Inform. Syst. e-Bus. Manage. 5 (3): 255–274.