Chapter 7

Neural Networks in Data Mining

As with many similar business data mining applications, the ability to predict customer success would make decision making much easier. While perfect prediction models cannot be expected, a number of data mining techniques can improve predictive ability. Neural network models are applied to data that can also be analyzed by alternative models. The normal data mining process is to try all alternative models and see which works best for a specific type of data over time. But there are some types of data where neural network models usually outperform the alternatives, such as regression or decision trees. Neural networks tend to work better when there are complicated relationships in the data, such as high degrees of nonlinearity. Thus, they tend to be viable models in problem domains with high levels of unpredictability. Commercial banking is one such area.

Neural networks can be applied to a variety of data types. One of the early applications of neural networks was deciphering letters of the alphabet in character recognition. This involved 26 different letters: a finite number of outcomes, but many more than two. Many business prediction problems involve more than two outcomes, such as categories of employee performance. Often, however, two outcome categories will do nicely, such as on-time repayment or not. Neural networks can deal with both continuous and categorical data input, making them flexible models applicable to a number of data mining applications. The same is true for regression models and decision trees, all three of which support the data mining process of modeling.
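To make the categorical-input point concrete, here is a minimal R sketch (R is the language used for the software demonstration later in this chapter) showing how categorical ratings such as the loan data's Credit and Risk values can be recoded as numeric indicator columns. The tiny data frame here is illustrative only:

# Categorical inputs must be given numeric codings before a neural
# network can use them; model.matrix() expands factors into 0/1
# indicator columns (illustrative values only).
loan_sample <- data.frame(Credit = factor(c("Amber", "Red", "Green")),
                          Risk   = factor(c("High", "Low", "Medium")))
inputs <- model.matrix(~ Credit + Risk, data = loan_sample)
print(inputs)   # one indicator column per non-reference factor level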

Neural networks are the most widely used method in data mining. They are computer programs that take the previously observed cases to build a system of relationships within a network of nodes connected by arcs. Figure 7.1 gives a simple sketch of a neural network.

Figure 7.1 Simple neural network

The idea of neural networks came from the operation of neurons in the brain. Real neurons are connected to each other and accept electrical charges across synapses (small gaps between neurons), and in turn, pass on the electrical charge to other neighboring neurons. The relationship between real neural systems and artificial neural networks probably ends at that point. Human brains contain billions of synaptic connections, each of which contributes only a tiny bit to the overall transformation of the electrical signals that encode knowledge.1 This provides a tremendous amount of storage capacity and makes the loss of a few thousand synaptic connections (due to minor damage or cell death) immaterial. Artificial neural networks are usually arranged in at least three layers and have a defined and constant structure that is capable of reflecting complex nonlinear relationships, although they do not have anything near the capacity of the human brain. Each variable of input data (akin to the independent variables in regression analysis) has a node in the first layer. The last layer represents the output. For classification neural network models, this output layer has one node for each classification category (in the simplest case, an output such as predicted success is either true or false). Neural networks almost always have at least one middle (hidden) layer of nodes that adds complexity to the model. (Two-layer neural networks have not proven to be very successful.)

Each node is connected by an arc to the nodes in the next layer. These arcs have weights, which are multiplied by the value of incoming nodes and summed. The input node values are determined by the variable values in the dataset. The middle-layer node values are the sum of the incoming node values multiplied by the arc weights. These middle node values, in turn, are multiplied by the outgoing arc weights to successor nodes. Neural networks “learn” through feedback loops. For a given input, the output for starting weights is calculated. The output is compared with target values, and the difference between the attained and target output is fed back to the system to adjust the weights on the arcs.
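The forward calculation just described can be written down in a few lines. The following R sketch, with purely illustrative weights, pushes two input values through three middle-layer nodes to a single output node; a sigmoid activation function is assumed, which is one common choice:

sigmoid <- function(z) 1 / (1 + exp(-z))

x  <- c(0.5, 0.9)                    # two input node values
W1 <- matrix(c( 0.4, -0.2,
                0.1,  0.6,
               -0.3,  0.8),
             nrow = 3, byrow = TRUE) # arc weights: inputs -> 3 middle nodes
w2 <- c(0.7, -0.5, 0.2)              # arc weights: middle nodes -> output

hidden <- sigmoid(W1 %*% x)          # middle-layer node values
output <- sigmoid(sum(w2 * hidden))  # network output, between 0 and 1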

This process is repeated until the network correctly classifies the proportion of learning data specified by the user (tolerance level). Ultimately, a set of weights might be encountered that explains the learning (training) dataset very well. The better the fit that is specified, the longer the neural network will take to train, although there is really no way to accurately predict how long a specific model will take to learn. The resulting set of weights from a model that satisfies the set tolerance level is retained within the system for application to future data.

Neural Network Operation

These programs can be used to apply the learned experience to new cases, for decisions, classifications, and forecasts. Because they can take datasets with many inputs and relate them to a set of categorical outputs, they require little modeling. This is not to say that they are simply a black box into which the data miner can throw data and expect good output. Neural networks have relative advantages, in that they make no assumptions about the data properties or statistical distributions. They also tend to be more accurate when dealing with complex data patterns, such as nonlinear relationships.

There is modeling required when using neural networks, in the sense of input variable selection, manipulation of input data, and selection of neural network parameters, such as the number of hidden layers used. But the computer software performs the complex calculations, in effect applying nonlinear regression to relate inputs to outputs.

There are many neural network models. About 95 percent of reported business applications use multilayered feedforward neural networks with the backpropagation learning rule. This model supports prediction and classification when fed inputs and known outputs. Backpropagation is a supervised learning technique, in that it uses a training set to fit relationships (or learn). The model uses one or more hidden layers of neurons between inputs and outputs. Each element in each layer is connected to all elements of the next layer, and each connecting arc has a weight, which is adjusted until the rate of explanation is at or above a prescribed level of accuracy. The hidden layers reflect nonlinearities quite well relative to regression models. The neural network model is, however, computationally intensive.
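The weight adjustment at a single sigmoid output node can be sketched as follows (this is the delta rule, the building block of backpropagation; all values are illustrative):

sigmoid <- function(z) 1 / (1 + exp(-z))
x      <- c(1, 0.4, 0.7)     # inputs (the leading 1 acts as a bias term)
w      <- c(0.1, -0.3, 0.2)  # current arc weights
target <- 1                  # known outcome for this training case
rate   <- 0.5                # learning rate

out   <- sigmoid(sum(w * x))
delta <- (target - out) * out * (1 - out)  # error signal fed back
w     <- w + rate * delta * x              # adjusted arc weights

Repeating this update over the training cases, and propagating the error signals back through the hidden layers, is what the backpropagation learning rule does.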

Many business applications do not have as much data as would be ideal; neural network software takes whatever data is available and works from there. Backpropagation is a means to explore the vector space of the hidden nodes and to find effective linear or nonlinear transformations. Philosophers of artificial intelligence view this feature as a potential means for artificial neural network models to learn, by identifying a complex set of weights that we never could have identified a priori.

While multilayered feedforward neural networks are analogous to regression and discriminant analysis in dealing with cases where training data is available, self-organizing neural networks are analogous to the clustering techniques used when there is no training data. The intent is to classify data to maximize the similarity of patterns within clusters while minimizing the similarity of patterns across clusters. Kohonen self-organizing feature maps were developed to detect strong features of large datasets.2
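For readers who want to experiment, a self-organizing map can be fit in R with the third-party kohonen package (an assumption on our part; it is not part of the Rattle demonstration below). A minimal sketch on a standard numeric dataset:

library(kohonen)                  # assumes the kohonen package is installed
data <- scale(iris[, 1:4])        # numeric inputs, standardized
map  <- som(data, grid = somgrid(4, 4, "hexagonal"))  # 4-by-4 map of nodes
plot(map)                         # visualize the resulting feature map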

Neural Networks in Data Mining

Artificial neural networks are the most common form of data mining models. They are extremely attractive because they can be fed data without a starting model estimation. This does not mean that they are best applied by automatically letting them operate on the data without model design. However, they are capable of going a long way toward the idea of the computer generating its own predictive model.

Neural network applications span most data mining activity, except for rule-based systems (applied when explanation of model results is emphasized) and the more exploratory data mining operations of market basket analysis. Neural networks have also been applied to stock market trading, electricity trading, and many other transactional environments. A common theme is to classify a new case, for which multiple measures are available, into a finite set of classes, such as on-time repayment, late repayment, or default.

Artificial neural networks operate much like regression models, except that they try many different coefficient values to fit the training set of data until they obtain a fit as good as the modeler specifies. Artificial neural network models have the added benefit of capturing variable interactions, giving them the ability to fit the training data with effects contingent upon other independent variable values. (This could also be done with regression, but would require a tremendous amount of computational effort.)
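As an illustration of that computational effort, a regression that enumerates all two-way interactions among even four loan variables already carries many extra terms, and the count grows combinatorially with more predictors. A hedged R sketch (column names taken from Table 7.2; `train` is a hypothetical training data frame introduced below):

# Logistic regression with every two-way interaction spelled out; a neural
# network's hidden layer captures such interactions without enumerating them.
fit_lr <- glm(I(On.time == "Problem") ~ (Age + Income + Assets + Debts)^2,
              family = binomial, data = train)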

Software Demonstration

We demonstrate open source software neural networks with the loan application dataset. The 650 observations in this dataset were divided; the first 400 observations were used for training and the model tested on the last 250 observations.
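In script form, the split might look like this (a sketch; the training file name is hypothetical):

loan <- read.csv("LoanRaw.csv")       # hypothetical file name for the 650 cases
loan$On.time <- factor(loan$On.time)  # ensure the outcome is a factor
train <- loan[1:400, ]                # first 400 observations for training
test  <- loan[401:650, ]              # last 250 observations for testing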

R (Rattle)

Data is loaded as it was for the regression models. We prune the data by selecting Ignore for intermediate variables, so that Figure 7.2 matches Figure 6.1.

Figure 7.2 Rattle screen to select the loan data

It is necessary to click on the Execute button to enact these choices. To run a neural network model, select the Model tab and click on the Neural Net radio button. This yields Figure 7.3.

Figure 7.3 Rattle neural net opening screen

Again, Execute needs to be clicked. This yields Figure 7.4, which shows the fitted weights. The user control Rattle provides for neural networks is the number of hidden nodes.
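Rattle builds its neural network with the R nnet package; a minimal equivalent script, under the assumption that Rattle's defaults are roughly as shown, is:

library(nnet)
set.seed(42)                      # weights start at random values
fit <- nnet(On.time ~ ., data = train,
            size = 10,            # 10 hidden nodes
            maxit = 200, trace = FALSE)
summary(fit)                      # prints the fitted arc weights of Figure 7.4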

Figure 7.4 R Neural network model for the loan data

Neural networks are black boxes in the sense that, while you could in principle take the given weights and apply them yourself, there are far too many for that to be practical. The next step is to evaluate this model (built using 400 observations) on the test set of 250 observations. Select the Evaluate tab, as shown in Figure 7.5.

Figure 7.5 R evaluate tab

To obtain fit in the form of a coincidence matrix (R calls it a confusion matrix), select the CSV File radio button, which allows you to link the test file. This yielded a degenerate model, calling all cases “OK.” Such a model is not useful, and it often arises with unbalanced data (as we have here, where one outcome predominates). The cure is either to balance the data (the easiest ways are to replicate the rare “Problem” cases in the training data or to delete some of the majority “OK” cases) or to change model parameters, although figuring out how to adjust R neural network parameters is an advanced practice. When we raised the number of hidden nodes from 10 to 20 and reran, we still obtained a degenerate model; the number of hidden nodes is the only parameter Rattle provides for neural network models. To balance the training data, we replicated the 45 “Problem” cases six times, yielding a more balanced training set with 355 “OK” cases and 315 “Problem” cases. This yielded the results shown in Table 7.1.
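In script form, the balancing step might look like the following sketch (reusing the hypothetical `train` data frame from above):

problem_rows <- which(train$On.time == "Problem")           # the 45 rare cases
balanced <- rbind(train, train[rep(problem_rows, 6), ])     # add six copies
table(balanced$On.time)                  # 355 "OK" versus 315 "Problem"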


Table 7.1 R neural net coincidence matrix

                  Model OK    Model problem    Total
Actual OK            177            53           230
Actual problem         8            12            20
Total                185            65           250



This outcome has a correct classification rate of 0.756. To obtain forecasts for new cases, you can select the Score radio button and attach a file of new cases (here, we used file LoanRawNew.csv), which contained the data in Table 7.2.
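Scripted, the evaluation and scoring steps might look like this sketch (again reusing the hypothetical objects from above; LoanRawNew.csv is the file named in the text):

pred <- predict(fit, test, type = "class")
cm   <- table(actual = test$On.time, predicted = pred)  # coincidence matrix
sum(diag(cm)) / sum(cm)                  # (177 + 12) / 250 = 0.756

new_cases <- read.csv("LoanRawNew.csv")
predict(fit, new_cases, type = "class")  # classifications as in Table 7.3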


Table 7.2 File LoanRawNew.csv

Age   Income   Assets    Debts     Want    Credit   Risk     On-time
55    75,000   80,605    90,507    3,000   Amber    High     OK
30    23,506   22,300    18,506    2,200   Amber    Low      OK
48    48,912   72,507    123,541   7,600   Red      High     OK
22    8,106    0         1,205     800     Red      High     OK
31    28,571   20,136    30,625    2,500   Amber    High     OK
36    61,322   108,610   80,542    6,654   Green    Low      OK
41    70,486   150,375   120,523   5,863   Green    Low      OK
22    22,400   32,512    12,521    3,652   Green    Low      OK
25    27,908   12,582    8,654     4,003   Amber    Medium   OK
28    41,602   18,366    12,587    2,875   Green    Low      OK



These can be scored in R by selecting Evaluate, then Score, and loading the CSV file with the data in Table 7.2, asking for a report of Class. Figure 7.6 shows how to do this.


Table 7.3 R neural net classifications

Age   Income   Assets    Debts     Want    Credit   Risk     On-time   nnet
55    75,000   80,605    90,507    3,000   Amber    High     ?         OK
30    23,506   22,300    18,506    2,200   Amber    Low      ?         OK
48    48,912   72,507    123,541   7,600   Red      High     ?         OK
22    8,106    0         1,205     800     Red      High     ?         OK
31    28,571   20,136    30,625    2,500   Amber    High     ?         OK
36    61,322   108,610   80,542    6,654   Green    Low      ?         OK
41    70,486   150,375   120,523   5,863   Green    Low      ?         OK
22    22,400   32,512    12,521    3,652   Green    Low      ?         OK
25    27,908   12,582    8,654     4,003   Amber    Medium   ?         OK
28    41,602   18,366    12,587    2,875   Green    Low      ?         OK



Figure 7.6 Selecting neural net classification in R

This yielded the output in Table 7.3.

In this case, all 10 cases were categorized as “OK.”

KNIME

A useful KNIME feature is that a workflow built for one model can be modified to build other models. We can take the workflow from Figure 6.10 and replace the regression learner and predictor with their neural network counterparts. Thus, the sequence is to load the training set with Browse, Configure, and Execute, yielding Figure 7.7.

Figure 7.7 File reader output for KNIME neural network model

KNIME has two neural network algorithms. RProp trains a multilayer feedforward network using resilient backpropagation. If the expected outcome is nominal, the output will be a class assignment; if it is numeric, a regression value is computed. The PNN algorithm trains a probabilistic neural network using the dynamic decay adjustment method; this algorithm needs numeric data, and its model output port contains the PNN model, which can be used for prediction in the PNN Predictor node. In our case, we have nominal output, so we need the RProp algorithm. The data for neural network analysis in KNIME needs to be normalized. We select the Normalizer (PMML) icon and drag it to the workflow. Configure and Execute yields Figure 7.8.
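KNIME's Normalizer performs this rescaling internally; for comparison, min-max normalization to the [0, 1] interval can be sketched in R as follows (column names taken from Table 7.2, reusing the hypothetical `train` data frame):

min_max <- function(v) (v - min(v)) / (max(v) - min(v))
numeric_cols <- c("Age", "Income", "Assets", "Debts")
loan_norm <- as.data.frame(lapply(train[, numeric_cols], min_max))
summary(loan_norm)   # every numeric input now lies between 0 and 1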

Figure 7.8 KNIME normalizer screen

The next operation is to drag in an RProp MLP Learner node. Configure allows you to set the number of hidden layers and neurons, and Execute runs the neural network model. To test the model, we bring in another File Reader node, which is linked to the file LoanTest250NN. We have to normalize this data in the same manner as the training file. Executing both then feeds a MultiLayerPerceptron Predictor node, which in turn feeds a Scorer node. Figure 7.9 shows the workflow.

Figure 7.9 KNIME workflow for neural network model

In configuring the Scorer node, make sure that the On-time variables are selected. If you score a numerical model, you get 250 rows; with categorical On-time, you get a coincidence matrix. Unfortunately, in this case, the neural network again yields a degenerate model. The RProp MLP Learner node can be configured with more hidden layers and nodes, but it still generates a degenerate model. Thus, the training data needs to be balanced and fed into node 1 of Figure 7.9. File LoanRaw670.csv adds 270 extra “Problem” outcomes by duplicating the original 45 “Problem” cases in the training set six times, resulting in 355 “OK” cases and 315 “Problem” cases. Going through the path from node 1 through node 15 then yields the coincidence matrix in Table 7.4.


Table 7.4 KNIME coincidence matrix

                  Model OK    Model problem    Total
Actual OK            193            37           230
Actual problem        18             2            20
Total                211            39           250



This has a correct classification rate of 0.780. This model can be applied to new cases as with R. Figure 7.10 shows the output.

Figure 7.10 KNIME neural network forecasts—New cases

Figure 7.10 shows that four new cases are identified as potentially problematic. The graphic also includes a probabilistic assessment that amplifies the information content of the binary prediction. The variable values shown are normalized numbers; categorical variables were given numerical values for neural network input and then normalized by KNIME. (Note that the On-time “OK” column carries no information here; it is just the placeholder entered for the unknown outcomes.) Low risk ratings seem to play a role in the predictions, as does low income.

WEKA

Figure 7.11 shows the loading of the data file in WEKA.

Figure 7.11 WEKA opening screen

To run a neural network model, select Classify, Functions, and MultilayerPerceptron. This yields Figure 7.12, which allows the user to change the parameters.

Figure 7.12 WEKA MultiLayerPerceptron display

We use the defaults, with 10-fold cross-validation. This results in the coincidence matrix given in Table 7.5.
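For reference, the 10-fold cross-validation WEKA performs can be sketched in R with nnet (an analogy to WEKA's procedure, not its internal code; `loan` is the hypothetical data frame loaded earlier, and the nnet library is assumed loaded):

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(loan)))   # random fold labels
acc <- sapply(1:10, function(k) {
  fit  <- nnet(On.time ~ ., data = loan[folds != k, ],
               size = 10, maxit = 200, trace = FALSE)
  pred <- predict(fit, loan[folds == k, ], type = "class")
  mean(pred == loan$On.time[folds == k])              # holdout accuracy, fold k
})
mean(acc)                                             # average over 10 folds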


Table 7.5 WEKA neural network coincidence matrix

                  Model OK    Model problem    Total
Actual OK            222             9           231
Actual problem        16             3            19
Total                238            12           250



This model thus had a correct classification rate of 0.90, better than the other nondegenerate models. (A degenerate model would have a correct classification rate of 0.92 here, but would provide no useful output.) Applying this model requires appending the new cases to the bottom of the input file and going through the sequence More options …, obtaining Figure 7.13.

Figure 7.13 WEKA process to obtain predictions

We might consider 10 new cases, as shown in Table 7.6.


Table 7.6 New cases for WEKA

Age   Income   Credit   Risk     Unknown outcome
55    75,000   Amber    High     OK
30    23,506   Amber    Low      OK
48    48,912   Red      High     OK
22    8,106    Red      High     OK
31    28,571   Amber    High     OK
36    61,322   Green    Low      OK
41    70,486   Green    Low      OK
22    22,400   Green    Low      OK
25    27,908   Amber    Medium   OK
28    41,602   Green    Low      OK



The premise is that we don’t know the outcomes, but we need to enter some value that appeared in the training set, so we arbitrarily enter “OK.” To obtain predictions, select the Use training set radio button, select More options …, click on the Output predictions box, and rerun the model. The resulting forecasts are shown in Table 7.7.


Table 7.7 WEKA predictions for the loan data

Instance   Phony actual   Predicted   Error   Probability distribution
651        1:OK           1:OK                *0.931   0.069
652        1:OK           1:OK                *0.828   0.172
653        1:OK           1:OK                *0.724   0.276
654        1:OK           1:OK                *0.582   0.418
655        1:OK           1:OK                *0.731   0.269
656        1:OK           1:OK                *0.991   0.009
657        1:OK           1:OK                *0.993   0.007
658        1:OK           1:OK                *0.976   0.024
659        1:OK           1:OK                *0.938   0.062
660        1:OK           1:OK                *0.985   0.01


Note that the model has a high propensity to call cases OK (like the model obtained from R).
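WEKA's probability distribution column has a counterpart in the R sketch used earlier: for a two-class nnet fit, predict() without type = "class" returns the raw probability of the second factor level ("Problem" here, since levels sort alphabetically):

probs <- as.vector(predict(fit, new_cases))  # probability of class "Problem"
round(cbind(OK = 1 - probs, Problem = probs), 3)  # distribution, both classes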


Neural Network Products

Many data mining software products include neural network technology. These are black boxes, in that much of the control is internal to the software, although some allow parameters, such as the number of layers, to be controlled by the user. There are also many neural network products listed on the Web. The site www.kdnuggets.com has a section on software, including one on neural network products. This dynamic market includes products that are free to download.

Summary

Regression models have been widely used in classical modeling. They continue to be very useful in data mining environments, which differ primarily in the scale of observations and the number of variables used. Neural networks have the very important strength that they can be applied to most data mining applications, and they require minimal model building. They provide good results in complicated applications, especially when there are complex interactions among variables in the data. Neural networks can deal with both categorical and continuous data, and many software packages are available.

There are some weaknesses to the method. The data needs to be massaged a bit, but that is not a major defect. The primary problem is that neural network output tends to have a black-box quality, in that explanations in the form of a model are not available. Neural networks also have the technical defect of potentially converging to an inferior solution (a local optimum). However, this defect is detectable when the model is applied to the test set of data.

Neural networks are, therefore, very attractive for problems where explanation of conclusions is not needed. This is often the case in classification and prediction problems. Neural networks should not be applied with excessive numbers of variables; decision tree methods can be used to prune variables in that case. Genetic algorithms can also be applied to improve neural network performance.
