Chapter 7

Neural Networks in Data Mining

As with many similar business data mining applications, the ability to predict customer success would make decision making much easier. While perfect prediction models cannot be expected, a number of data mining techniques can improve predictive ability. Neural network models are applied to data that can also be analyzed by alternative models. The normal data mining process is to try all alternative models and see which works best for a specific type of data over time. But there are some types of data where neural network models usually outperform the alternatives, such as regression or decision trees. Neural networks tend to work better when there are complicated relationships in the data, such as high degrees of nonlinearity. Thus, they tend to be viable models in problem domains with high levels of unpredictability. Commercial banking is one such area.

Neural networks can be applied to a variety of data types. One of the early applications of neural networks was deciphering letters of the alphabet in character recognition. This involved 26 different letters: a finite number of outcomes, but many more than two. Many business prediction problems involve more than two outcomes, such as categories of employee performance. Often, however, two outcome categories will do nicely, such as on-time repayment or not. Neural networks can deal with both continuous and categorical data input, making them flexible models applicable to a number of data mining applications. The same is true for regression models and decision trees, all three of which support the data mining process of modeling.
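To make the categorical-input point concrete, here is a minimal R sketch (R is the language used for the software demonstration later in this chapter) showing how categorical ratings such as the loan data's Credit and Risk values can be recoded as numeric indicator columns. The tiny data frame here is illustrative only:

# Categorical inputs must be given numeric codings before a neural
# network can use them; model.matrix() expands factors into 0/1
# indicator columns (illustrative values only).
loan_sample <- data.frame(Credit = factor(c("Amber", "Red", "Green")),
                          Risk   = factor(c("High", "Low", "Medium")))
inputs <- model.matrix(~ Credit + Risk, data = loan_sample)
print(inputs)   # one indicator column per non-reference factor level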

Neural networks are the most widely used method in data mining. They are computer programs that take the previously observed cases to build a system of relationships within a network of nodes connected by arcs. Figure 7.1 gives a simple sketch of a neural network.

Figure 7.1 Simple neural network

The idea of neural networks came from the operation of neurons in the brain. Real neurons are connected to each other and accept electrical charges across synapses (small gaps between neurons), and in turn, pass on the electrical charge to other neighboring neurons. The relationship between real neural systems and artificial neural networks probably ends at that point. Human brains contain billions of synaptic connections, each of which contributes only a tiny bit to the overall transformation of the electrical signals that encode knowledge.1 This provides a tremendous amount of storage capacity and makes the loss of a few thousand synaptic connections (due to minor damage or cell death) immaterial. Artificial neural networks are usually arranged in at least three layers and have a defined and constant structure that is capable of reflecting complex nonlinear relationships, although they do not have anything near the capacity of the human brain. Each variable of input data (akin to the independent variables in regression analysis) has a node in the first layer. The last layer represents the output. For classification neural network models, this output layer has one node for each classification category (in the simplest case, an output such as predicted success is either true or false). Neural networks almost always have at least one middle (hidden) layer of nodes that adds complexity to the model. (Two-layer neural networks have not proven to be very successful.)

Each node is connected by an arc to the nodes in the next layer. These arcs have weights, which are multiplied by the value of incoming nodes and summed. The input node values are determined by the variable values in the dataset. The middle-layer node values are the sum of the incoming node values multiplied by the arc weights. These middle node values, in turn, are multiplied by the outgoing arc weights to successor nodes. Neural networks “learn” through feedback loops. For a given input, the output for starting weights is calculated. The output is compared with target values, and the difference between the attained and target output is fed back to the system to adjust the weights on the arcs.
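The forward calculation just described can be written down in a few lines. The following R sketch, with purely illustrative weights, pushes two input values through three middle-layer nodes to a single output node; a sigmoid activation function is assumed, which is one common choice:

sigmoid <- function(z) 1 / (1 + exp(-z))

x  <- c(0.5, 0.9)                    # two input node values
W1 <- matrix(c( 0.4, -0.2,
                0.1,  0.6,
               -0.3,  0.8),
             nrow = 3, byrow = TRUE) # arc weights: inputs -> 3 middle nodes
w2 <- c(0.7, -0.5, 0.2)              # arc weights: middle nodes -> output

hidden <- sigmoid(W1 %*% x)          # middle-layer node values
output <- sigmoid(sum(w2 * hidden))  # network output, between 0 and 1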

This process is repeated until the network correctly classifies the proportion of learning data specified by the user (tolerance level). Ultimately, a set of weights might be encountered that explains the learning (training) dataset very well. The better the fit that is specified, the longer the neural network will take to train, although there is really no way to accurately predict how long a specific model will take to learn. The resulting set of weights from a model that satisfies the set tolerance level is retained within the system for application to future data.

Neural Network Operation

These programs can be used to apply the learned experience to new cases, for decisions, classifications, and forecasts. Because they can take datasets with many inputs and relate them to a set of categorical outputs, they require little modeling. This is not to say that they are simply a black box into which the data miner can throw data and expect good output. Neural networks have relative advantages, in that they make no assumptions about the data properties or statistical distributions. They also tend to be more accurate when dealing with complex data patterns, such as nonlinear relationships.

There is modeling required when using neural networks, in the sense of input variable selection, manipulation of input data, and selection of neural network parameters, such as the number of hidden layers used. But the computer software performs the complex calculations, in effect applying nonlinear regression to relate inputs to outputs.

There are many neural network models. About 95 percent of reported business applications use multilayered feedforward neural networks with the backpropagation learning rule. This model supports prediction and classification when fed inputs and known outputs. Backpropagation is a supervised learning technique, in that it uses a training set to fit relationships (or learn). The model uses one or more hidden layers of neurons between inputs and outputs. Each element in each layer is connected to all elements of the next layer, and each connecting arc has a weight, which is adjusted until the rate of explanation is at or above a prescribed level of accuracy. The hidden layers reflect nonlinearities quite well relative to regression models. The neural network model is, however, computationally intensive.
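The weight adjustment at a single sigmoid output node can be sketched as follows (this is the delta rule, the building block of backpropagation; all values are illustrative):

sigmoid <- function(z) 1 / (1 + exp(-z))
x      <- c(1, 0.4, 0.7)     # inputs (the leading 1 acts as a bias term)
w      <- c(0.1, -0.3, 0.2)  # current arc weights
target <- 1                  # known outcome for this training case
rate   <- 0.5                # learning rate

out   <- sigmoid(sum(w * x))
delta <- (target - out) * out * (1 - out)  # error signal fed back
w     <- w + rate * delta * x              # adjusted arc weights

Repeating this update over the training cases, and propagating the error signals back through the hidden layers, is what the backpropagation learning rule does.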

Many business applications do not have as much data as would be ideal; neural network software takes whatever data is available and works from there. Backpropagation is a means to explore the vector space of the hidden nodes and to find effective linear or nonlinear transformations. Philosophers of artificial intelligence view this feature as a potential means for artificial neural network models to learn, by identifying a complex set of weights that we never could have identified a priori.

While multilayered feedforward neural networks are analogous to regression and discriminant analysis in dealing with cases where training data is available, self-organizing neural networks are analogous to the clustering techniques used when there is no training data. The intent is to classify data to maximize the similarity of patterns within clusters while minimizing the similarity of patterns across clusters. Kohonen self-organizing feature maps were developed to detect strong features of large datasets.2
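For readers who want to experiment, a self-organizing map can be fit in R with the third-party kohonen package (an assumption on our part; it is not part of the Rattle demonstration below). A minimal sketch on a standard numeric dataset:

library(kohonen)                  # assumes the kohonen package is installed
data <- scale(iris[, 1:4])        # numeric inputs, standardized
map  <- som(data, grid = somgrid(4, 4, "hexagonal"))  # 4-by-4 map of nodes
plot(map)                         # visualize the resulting feature map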

Neural Networks in Data Mining

Artificial neural networks are the most common form of data mining models. They are extremely attractive because they can be fed data without a starting model estimation. This does not mean that they are best applied by automatically letting them operate on the data without model design. However, they are capable of going a long way toward the idea of the computer generating its own predictive model.

Neural network applications span most data mining activity, except for rule-based systems (applied when explanation of model results is emphasized) and the more exploratory data mining operations of market basket analysis. Neural networks have also been applied to stock market trading, electricity trading, and many other transactional environments. A common theme is to classify a new case, for which multiple measures are available, into a finite set of classes, such as on-time repayment, late repayment, or default.

Artificial neural networks operate much like regression models, except that they try many different coefficient values to fit the training set of data until they obtain a fit as good as the modeler specifies. Artificial neural network models have the added benefit of capturing variable interactions, giving them the ability to fit the training data with effects contingent upon other independent variable values. (This could also be done with regression, but would require a tremendous amount of computational effort.)
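As an illustration of that computational effort, a regression that enumerates all two-way interactions among even four loan variables already carries many extra terms, and the count grows combinatorially with more predictors. A hedged R sketch (column names taken from Table 7.2; `train` is a hypothetical training data frame introduced below):

# Logistic regression with every two-way interaction spelled out; a neural
# network's hidden layer captures such interactions without enumerating them.
fit_lr <- glm(I(On.time == "Problem") ~ (Age + Income + Assets + Debts)^2,
              family = binomial, data = train)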

Software Demonstration

We demonstrate open source software neural networks with the loan application dataset. The 650 observations in this dataset were divided; the first 400 observations were used for training and the model tested on the last 250 observations.
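In script form, the split might look like this (a sketch; the training file name is hypothetical):

loan <- read.csv("LoanRaw.csv")       # hypothetical file name for the 650 cases
loan$On.time <- factor(loan$On.time)  # ensure the outcome is a factor
train <- loan[1:400, ]                # first 400 observations for training
test  <- loan[401:650, ]              # last 250 observations for testing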

R (Rattle)

Data is loaded as it was for the regression models. We prune the data by selecting Ignore for intermediate variables, so that Figure 7.2 matches Figure 6.1.

Figure 7.2 Rattle screen to select the loan data

It is necessary to click on the Execute button to enact these choices. To run a neural network model, select the Model tab and click on the Neural Net radio button. This yields Figure 7.3.

Figure 7.3 Rattle neural net opening screen

Again, Execute needs to be clicked. This yields Figure 7.4, which shows the fitted weights. The user control Rattle provides for neural networks is the number of hidden nodes.
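Rattle builds its neural network with the R nnet package; a minimal equivalent script, under the assumption that Rattle's defaults are roughly as shown, is:

library(nnet)
set.seed(42)                      # weights start at random values
fit <- nnet(On.time ~ ., data = train,
            size = 10,            # 10 hidden nodes
            maxit = 200, trace = FALSE)
summary(fit)                      # prints the fitted arc weights of Figure 7.4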

Figure 7.4 R Neural network model for the loan data

Neural networks are black boxes in the sense that, while you could in principle take the given weights and apply them yourself, there are far too many for that to be practical. The next step is to evaluate this model (built using 400 observations) on the test set of 250 observations. Select the Evaluate tab, as shown in Figure 7.5.

Figure 7.5 R evaluate tab

To obtain fit in the form of a coincidence matrix (R calls it a confusion matrix), select the CSV File radio button, which allows you to link the test file. This yielded a degenerate model, calling all cases “OK.” Such a model is not useful, and it often arises with unbalanced data (as we have here, where one outcome predominates). The cure is either to balance the data (the easiest ways are to replicate the rare “Problem” cases in the training data or to delete some of the majority “OK” cases) or to change model parameters, although figuring out how to adjust R neural network parameters is an advanced practice. When we raised the number of hidden nodes from 10 to 20 and reran, we still obtained a degenerate model; the number of hidden nodes is the only parameter Rattle provides for neural network models. To balance the training data, we replicated the 45 “Problem” cases six times, yielding a more balanced training set with 355 “OK” cases and 315 “Problem” cases. This yielded the results shown in Table 7.1.
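In script form, the balancing step might look like the following sketch (reusing the hypothetical `train` data frame from above):

problem_rows <- which(train$On.time == "Problem")           # the 45 rare cases
balanced <- rbind(train, train[rep(problem_rows, 6), ])     # add six copies
table(balanced$On.time)                  # 355 "OK" versus 315 "Problem"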


Table 7.1 R neural net coincidence matrix

                  Model OK    Model problem    Total
Actual OK            177            53           230
Actual problem         8            12            20
Total                185            65           250



This outcome has a correct classification rate of 0.756. To obtain forecasts for new cases, you can select the Score radio button and attach a file of new cases (here, we used file LoanRawNew.csv), which contained the data in Table 7.2.
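Scripted, the evaluation and scoring steps might look like this sketch (again reusing the hypothetical objects from above; LoanRawNew.csv is the file named in the text):

pred <- predict(fit, test, type = "class")
cm   <- table(actual = test$On.time, predicted = pred)  # coincidence matrix
sum(diag(cm)) / sum(cm)                  # (177 + 12) / 250 = 0.756

new_cases <- read.csv("LoanRawNew.csv")
predict(fit, new_cases, type = "class")  # classifications as in Table 7.3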


Table 7.2 File LoanRawNew.csv

Age   Income   Assets    Debts     Want    Credit   Risk     On-time
55    75,000   80,605    90,507    3,000   Amber    High     OK
30    23,506   22,300    18,506    2,200   Amber    Low      OK
48    48,912   72,507    123,541   7,600   Red      High     OK
22    8,106    0         1,205     800     Red      High     OK
31    28,571   20,136    30,625    2,500   Amber    High     OK
36    61,322   108,610   80,542    6,654   Green    Low      OK
41    70,486   150,375   120,523   5,863   Green    Low      OK
22    22,400   32,512    12,521    3,652   Green    Low      OK
25    27,908   12,582    8,654     4,003   Amber    Medium   OK
28    41,602   18,366    12,587    2,875   Green    Low      OK



These can be scored in R by selecting Evaluate, then Score, and loading the CSV file with the data in Table 7.2, asking for a report of Class. Figure 7.6 shows how to do this.


Table 7.3 R neural net classifications

Age   Income   Assets    Debts     Want    Credit   Risk     On-time   nnet
55    75,000   80,605    90,507    3,000   Amber    High     ?         OK
30    23,506   22,300    18,506    2,200   Amber    Low      ?         OK
48    48,912   72,507    123,541   7,600   Red      High     ?         OK
22    8,106    0         1,205     800     Red      High     ?         OK
31    28,571   20,136    30,625    2,500   Amber    High     ?         OK
36    61,322   108,610   80,542    6,654   Green    Low      ?         OK
41    70,486   150,375   120,523   5,863   Green    Low      ?         OK
22    22,400   32,512    12,521    3,652   Green    Low      ?         OK
25    27,908   12,582    8,654     4,003   Amber    Medium   ?         OK
28    41,602   18,366    12,587    2,875   Green    Low      ?         OK



Figure 7.6 Selecting neural net classification in R

This yielded the output in Table 7.3.

In this case, all 10 cases were categorized as “OK.”

KNIME

A useful KNIME feature is that a workflow built for one model can be modified to build other models. We can take the workflow from Figure 6.10 and replace the regression learner and predictor with their neural network counterparts. Thus, the sequence is to load the training set with Browse, Configure, and Execute, yielding Figure 7.7.

Figure 7.7 File reader output for KNIME neural network model

KNIME has two neural network algorithms. RProp trains a multilayer feedforward network using resilient backpropagation. If the expected outcome is nominal, the output will be a class assignment; if it is numeric, a regression value is computed. The PNN algorithm trains a probabilistic neural network using the dynamic decay adjustment method; this algorithm needs numeric data, and its model output port contains the PNN model, which can be used for prediction in the PNN Predictor node. In our case, we have nominal output, so we need the RProp algorithm. The data for neural network analysis in KNIME needs to be normalized. We select the Normalizer (PMML) icon and drag it to the workflow. Configure and Execute yields Figure 7.8.
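KNIME's Normalizer performs this rescaling internally; for comparison, min-max normalization to the [0, 1] interval can be sketched in R as follows (column names taken from Table 7.2, reusing the hypothetical `train` data frame):

min_max <- function(v) (v - min(v)) / (max(v) - min(v))
numeric_cols <- c("Age", "Income", "Assets", "Debts")
loan_norm <- as.data.frame(lapply(train[, numeric_cols], min_max))
summary(loan_norm)   # every numeric input now lies between 0 and 1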

Figure 7.8 KNIME normalizer screen

The next operation is to drag in an RProp MLP Learner node. Configure allows you to set the number of hidden layers and neurons, and Execute runs the neural network model. To test the model, we bring in another File Reader node, which is linked to the file LoanTest250NN. We have to normalize this data in the same manner as the training file. Executing both then feeds a MultiLayerPerceptron Predictor node, which in turn feeds a Scorer node. Figure 7.9 shows the workflow.

Figure 7.9 KNIME workflow for neural network model

In configuring the Scorer node, make sure that the On-time variables are selected. If you score a numerical model, you get 250 rows; with categorical On-time, you get a coincidence matrix. Unfortunately, in this case, the neural network again yields a degenerate model. The RProp MLP Learner node can be configured with more hidden layers and nodes, but it still generates a degenerate model. Thus, the training data needs to be balanced and fed into node 1 of Figure 7.9. File LoanRaw670.csv adds 270 extra “Problem” outcomes by duplicating the original 45 “Problem” cases in the training set six times, resulting in 355 “OK” cases and 315 “Problem” cases. Going through the path from node 1 through node 15 then yields the coincidence matrix in Table 7.4.


Table 7.4 KNIME coincidence matrix

                  Model OK    Model problem    Total
Actual OK            193            37           230
Actual problem        18             2            20
Total                211            39           250



This has a correct classification rate of 0.780. This model can be applied to new cases as with R. Figure 7.10 shows the output.

Figure 7.10 KNIME neural network forecasts—New cases

Figure 7.10 shows that four new cases are identified as potentially problematic. The graphic also includes a probabilistic assessment that amplifies the information content of the binary prediction. The variable values shown are normalized numbers; categorical variables were given numerical values for neural network input and then normalized by KNIME. (Note that the On-time “OK” column carries no information here; it is just the placeholder entered for the unknown outcomes.) Low risk ratings seem to play a role in the predictions, as does low income.

WEKA

Figure 7.11 shows the loading of the data file in WEKA.

Figure 7.11 WEKA opening screen

To run a neural network model, select Classify, Functions, and MultilayerPerceptron. This yields Figure 7.12, which allows the user to change the parameters.

Figure 7.12 WEKA MultiLayerPerceptron display

We use the defaults, with 10-fold cross-validation. This results in the coincidence matrix given in Table 7.5.
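For reference, the 10-fold cross-validation WEKA performs can be sketched in R with nnet (an analogy to WEKA's procedure, not its internal code; `loan` is the hypothetical data frame loaded earlier, and the nnet library is assumed loaded):

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(loan)))   # random fold labels
acc <- sapply(1:10, function(k) {
  fit  <- nnet(On.time ~ ., data = loan[folds != k, ],
               size = 10, maxit = 200, trace = FALSE)
  pred <- predict(fit, loan[folds == k, ], type = "class")
  mean(pred == loan$On.time[folds == k])              # holdout accuracy, fold k
})
mean(acc)                                             # average over 10 folds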


Table 7.5 WEKA neural network coincidence matrix

                  Model OK    Model problem    Total
Actual OK            222             9           231
Actual problem        16             3            19
Total                238            12           250



This model thus had a correct classification rate of 0.90, better than the other nondegenerate models. (A degenerate model would have a correct classification rate of 0.92 here, but would provide no useful output.) Applying this model requires appending the new cases to the bottom of the input file and going through the sequence More options …, obtaining Figure 7.13.

Figure 7.13 WEKA process to obtain predictions

We might consider 10 new cases, as shown in Table 7.6.


Table 7.6 New cases for WEKA

Age   Income   Credit   Risk     Unknown outcome
55    75,000   Amber    High     OK
30    23,506   Amber    Low      OK
48    48,912   Red      High     OK
22    8,106    Red      High     OK
31    28,571   Amber    High     OK
36    61,322   Green    Low      OK
41    70,486   Green    Low      OK
22    22,400   Green    Low      OK
25    27,908   Amber    Medium   OK
28    41,602   Green    Low      OK



The premise is that we don’t know the outcomes, but we need to enter some value that appeared in the training set, so we arbitrarily enter “OK.” To obtain predictions, select the Use training set radio button, select More options …, click on the Output predictions box, and rerun the model. The resulting forecasts are shown in Table 7.7.


Table 7.7 WEKA predictions for the loan data

Instance   Phony actual   Predicted   Error   Probability distribution
651        1:OK           1:OK                *0.931   0.069
652        1:OK           1:OK                *0.828   0.172
653        1:OK           1:OK                *0.724   0.276
654        1:OK           1:OK                *0.582   0.418
655        1:OK           1:OK                *0.731   0.269
656        1:OK           1:OK                *0.991   0.009
657        1:OK           1:OK                *0.993   0.007
658        1:OK           1:OK                *0.976   0.024
659        1:OK           1:OK                *0.938   0.062
660        1:OK           1:OK                *0.985   0.01


Note that the model has a high propensity to call cases OK (like the model obtained from R).
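WEKA's probability distribution column has a counterpart in the R sketch used earlier: for a two-class nnet fit, predict() without type = "class" returns the raw probability of the second factor level ("Problem" here, since levels sort alphabetically):

probs <- as.vector(predict(fit, new_cases))  # probability of class "Problem"
round(cbind(OK = 1 - probs, Problem = probs), 3)  # distribution, both classes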


Neural Network Products

Many data mining software products include neural network technology. These are black boxes, in that much of the control is internal to the software, although some allow parameters, such as the number of layers, to be controlled by the user. There are also many neural network products listed on the Web. The site www.kdnuggets.com has a section on software, including one on neural network products. This dynamic market includes products that are free to download.

Summary

Regression models have been widely used in classical modeling. They continue to be very useful in data mining environments, which differ primarily in the scale of observations and the number of variables used. Neural networks have the very important strength that they can be applied to most data mining applications, and they require minimal model building. They provide good results in complicated applications, especially when there are complex interactions among variables in the data. Neural networks can deal with both categorical and continuous data, and many software packages are available.

There are some weaknesses to the method. The data needs to be massaged a bit, but that is not a major defect. The primary problem is that neural network output tends to have a black-box quality, in that explanations in the form of a model are not available. Neural networks also have the technical defect of potentially converging to an inferior solution (a local optimum). However, this defect is detectable when the model is applied to the test set of data.

Neural networks are, therefore, very attractive for problems where explanation of conclusions is not needed. This is often the case in classification and prediction problems. Neural networks should not be applied with excessive numbers of variables; decision tree methods can be used to prune variables in that case. Genetic algorithms can also be applied to improve neural network performance.
