Preface

In order to make effective use of the tools provided in SAS Enterprise Miner, you need to be able to do more than just set the software in motion. While it is essential to know the mechanics of how to use each tool (or node, as Enterprise Miner tools are called), you should also understand the methodology behind each one, be able to interpret the output each produces, and be familiar with the many options each offers. Although this book will appeal to beginners because of its step-by-step, screen-by-screen introduction to the basic tasks that can be accomplished with Enterprise Miner, it also provides the depth and background needed to master many of the more complex but rewarding concepts contained within this versatile product.

The book begins by introducing the basics of creating a project, manipulating data sources, choosing the right property values for each node, and navigating through different results windows. It then demonstrates various pre-processing tools required for building predictive models before treating the three main predictive modeling tools: Decision Tree, Neural Network, and Regression. These are addressed in considerable detail, with numerous examples of practical business applications that are illustrated with tables, charts, displays, equations, and even manual calculations that let you see the essence of what Enterprise Miner is doing as it estimates or optimizes a given model. By the time you finish with this book, Enterprise Miner will no longer be a “black box”: you will have an in-depth understanding of the product’s inner workings. This book strives to show the link between the output generated by Enterprise Miner and the statistical theory underlying the business analyses for which Enterprise Miner is used. I also examine the SAS code generated by each node and show the correspondence between the theory and the results produced by Enterprise Miner. In many places, however, I give intuitive explanations of the way that various nodes such as Decision Tree, Neural Network, Regression, and Variable Selection operate and how different options such as Model Selection Criteria and Model Assessment are implemented. These explanations are intended not to replicate the exact steps that SAS uses internally to make these computations, but to give a good practical sense of how these tools work. Overall, I believe this approach will help you use the tools in Enterprise Miner with greater comprehension and confidence.

Several examples of business questions drawn from the insurance and banking industries, based on simulated but realistic data, are used to illustrate the Enterprise Miner tools. However, the procedures discussed are relevant for any industry. I also include tables and graphs from the output data sets created by various nodes to show how to make custom tables from the results produced by Enterprise Miner. In the end, you should have gained enough understanding of Enterprise Miner to become comfortable and innovative in adapting the applications discussed here to solve your own business problems.

Chapter Details

Chapter 1 discusses research strategy. This includes general issues such as defining the target population, defining the target (or dependent) variable, collecting data, cleaning the data, and selecting an appropriate model.

Chapter 2 shows how to open Enterprise Miner, start a new project, and create data sources. It shows various components of the Enterprise Miner window and shows how to create a process flow diagram. In this chapter, I use example data sets to demonstrate in detail how to use the Input Data, Data Partition, Filter, File Import, Time Series, Merge, Append, StatExplore, MultiPlot, Graph Explore, Variable Clustering, Cluster, Variable Selection, Drop, Replacement, Impute, Interactive Binning, Principal Components, Transform Variables, and SAS Code nodes. I also discuss the output and SAS code generated by some of these nodes. I manually compute certain statistics, such as Cramer’s V, and compare the results with those produced by StatExplore. Finally, I explain the details of how to compute Eigenvalues, Eigenvectors, and Principal Components.
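As a preview of the kind of hand calculation Chapter 2 walks through, the standard textbook formula for Cramer's V for an r × c contingency table with n observations is

V = \sqrt{ \chi^2 / ( n \cdot \min(r - 1,\, c - 1) ) }

where \chi^2 is the Pearson chi-square statistic for the table. (This is the general definition, not a reproduction of the StatExplore node's internal code; the chapter compares a manual calculation of this kind with the value the node reports.)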

Chapter 3 covers the Variable Selection and Transform Variables nodes in detail. When using the Variable Selection node you have many options, depending on the type of target and the measurement scale of the inputs. To help clarify the concepts, I illustrate each situation with a separate data set. This chapter also shows how to perform variable selection using the Variable Clustering and Decision Tree nodes.

Chapter 4 discusses decision trees and regression trees. First, I present the general tree methodology and, using a simple example, I manually work through the sequence of steps—growing the tree, classifying the nodes, and pruning—that the Decision Tree node performs. I then show how decision tree models are built for predicting response and risk by presenting two examples based on a hypothetical auto insurance company. The first model predicts the probability of response to a mail order campaign. The second model predicts risk as measured by claim frequency, and since claim frequency is measured as a continuous variable, the model built in this case is a regression tree. A detailed discussion of the SAS code generated by the Decision Tree node is included at the end of the chapter. This chapter also shows how to develop decision trees interactively.

Chapter 5 provides an introduction to neural networks. Here I try to demystify the neural networks methodology by giving an intuitive explanation using simple algebra. I show how to configure neural networks to be consistent with economic and statistical theory and how to interpret the results correctly. The neural network architecture—input layer, hidden layers, and output layer—is illustrated algebraically with numerical examples. Although the formulas presented here may look complex, they do not require a high-level knowledge of mathematics, and patience in working through them will be rewarded with a thorough understanding of neural networks and their applications.
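To give a sense of the algebra involved, a generic single-hidden-layer network with a binary target can be written as

h_j = \tanh\left( w_{j0} + \sum_i w_{ji} x_i \right), \qquad P(y = 1 \mid x) = \frac{1}{1 + \exp\left( -\left( b_0 + \sum_j b_j h_j \right) \right)}

where the x_i are the inputs, the h_j are the hidden units, and the w's and b's are weights estimated from the training data. (This is a standard textbook sketch with a tanh hidden layer and a logistic output; the book's own notation, and the particular activation and combination functions chosen in the Neural Network node, may differ.)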

In this chapter, I first intuitively discuss the iterative processes of estimating the model using the training data set as well as selecting the optimal weights for the model using the validation data set. Next, explicit numerical examples are given to clarify each step. As in Chapter 4, two models are developed using the hypothetical insurance data: a response model with a binary target, and a risk model with (in this case) an ordinal target representing accident frequency. Line by line, I examine the SAS code generated by each node and show the correspondence between the theory and the results produced by Enterprise Miner.

The calculations behind the Receiver Operating Characteristic (ROC) Charts are illustrated using the results produced by the Model Comparison node.
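For reference, each point on an ROC chart corresponds to one cutoff applied to the predicted probabilities and plots, in the usual definitions,

sensitivity = TP / (TP + FN) against 1 − specificity = FP / (FP + TN)

where TP, FN, FP, and TN are the counts of true positives, false negatives, false positives, and true negatives at that cutoff. (These are the general definitions; the chapter derives the points from the actual tables produced by the Model Comparison node.)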

Alternative specifications of neural networks, including the multilayer perceptron (MLP), radial basis function (RBF) networks, and the various built-in architectures of the Neural Network node, are illustrated through mathematical representations and the SAS code generated by the Neural Network node.

The AutoNeural, DMNeural, and Dmine Regression nodes are also illustrated, and the models they develop are compared.

Chapter 6 demonstrates how to develop logistic regression models for targets with different measurement scales: binary, categorical with more than two categories, ordinal, and continuous (interval-scaled). Using an example data set with a binary target, I demonstrate various model selection criteria and model selection methods. I also present business applications from the banking industry involving two predictive models: one with a binary target, and one with a continuous target. The model with a binary target predicts the probability of response to a mail campaign, while the model with a continuous target predicts the increase in deposits attributable to an interest rate increase. This chapter also shows how to calculate the lift and capture rates of the models when the target is continuous.
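As a reminder of the standard form (the chapter works from Enterprise Miner's own output rather than from this notation), a logistic regression with a binary target models

P(y = 1 \mid x) = \frac{1}{1 + \exp\left( -(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) \right)}

and, in the usual definitions, the lift of a decile is the ratio of the mean of the target in that decile to its mean in the whole data set, while the capture rate of a decile is the share of the total of the target accounted for by that decile.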

In Chapter 7, I compare the results of three modeling tools—Decision Tree, Neural Network, and Regression—that were presented in earlier chapters. For this purpose, I develop two predictive models and apply each of the three tools in turn to both. The first model has a binary target and predicts the probability of customer attrition for a fictitious bank. The second model has an ordinal target, which is a discrete version of a continuous variable, and predicts risk (as measured by loss frequency) for a fictitious auto insurance company. This chapter also provides a method of computing the lift and capture rates of these models using the expected value of the target variable.

This chapter also illustrates methods of boosting and combining predictive models using the Gradient Boosting and Ensemble nodes, and compares the predictive performance of these two approaches.

Chapter 8 shows how to calculate profitability for each of the ten deciles created when a data set of prospective customers is scored using the output of the modeling process. It then shows how to use these profitability estimates to address questions such as how to choose an optimum cut-off point for a mailing campaign. Here my objective is to introduce the notion of the marginal cost and marginal revenue associated with risk and response and to show how they can be used to make rational quantitative decisions in the marketing sphere.
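The underlying decision rule is simple. Under the stylized assumptions of a fixed cost per piece mailed and a fixed expected revenue per responder (both hypothetical quantities used only for illustration), it pays to keep mailing deeper into the scored list as long as, for the next decile,

(expected response rate of the decile) × (revenue per responder) ≥ (cost per piece mailed)

and the optimum cut-off point is the decile at which this marginal revenue first falls below the marginal cost.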

Chapter 9 gives an introduction to predictive modeling using unstructured textual data. Quantifying textual data and putting it into a spreadsheet or SAS table is an important prerequisite for developing predictive models with textual data. Quantifying textual data involves several steps: parsing the documents, filtering them, and reducing the dimensionality of the resulting term-document matrix. Dimension reduction is done by Singular Value Decomposition (SVD). I first illustrate the quantification of textual data, the Boolean retrieval method, and dimension reduction using SVD with a simplified example. Then I show how to use the Text Parsing, Text Filter, Text Topic, and Text Cluster nodes, and how to use the output data set produced by the Text Topic node to estimate a logistic regression equation. Using a simple example, I demonstrate Expectation-Maximization (EM) clustering, and I explain the hierarchical clustering method with simple algebra.
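For readers unfamiliar with SVD, the idea in its generic textbook form is to approximate the term-by-document frequency matrix A by a product of lower-rank matrices,

A \approx U_k \Sigma_k V_k^{T}

where k is much smaller than the number of distinct terms. Each document is then represented by its k SVD coordinates, which become the numeric inputs to a predictive model such as the logistic regression estimated in this chapter. (This is the standard formulation, not necessarily the exact computation performed by the Text Topic or Text Cluster nodes.)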

Exercises are included at the end of Chapters 2–7 and 9.

How to Use the Book

• To get the most out of this book, open Enterprise Miner and follow the sequence of tasks performed in each chapter, using either the data sets stored on the CD included with this book or, even better, your own data sets.

• Work through the manual calculations as well as the mathematical derivations presented in the book to get an in-depth understanding of the logic behind different models.

• To learn predictive modeling, read the general explanation and the theory, and then follow the steps given in the book to develop models using either the data sets provided on the CD or your own data sets. Try variations of what is done in the book to strengthen your understanding of the topics covered.

• If you already know Enterprise Miner and want to get a good understanding of decision trees and neural networks, focus on the examples and detailed derivations given in Chapters 4, 5, and 6. These derivations are not as complex as they appear to be.

Prerequisites

• Elementary algebra and basic training (equivalent to one to two semesters of course work) in statistics covering inference, hypothesis testing, probability, and regression

• Familiarity with measurement scales of variables—continuous, categorical, ordinal, etc.

• Experience with Base SAS software and some understanding of simple SAS macros and macro variables
