Chapter 9

Scalability

Previous chapters have focused on explaining models, using a small dataset (loan application approval) for demonstration. Data mining, and especially big data, implies much larger datasets, with key data increasingly collected in real time for applications of the kind we have presented. The major operational difference is scalability. R works with massive datasets and scales well, while KNIME and WEKA are potentially limited in their ability to deal with large sets of data.

In this chapter, we will demonstrate some data characteristics with R. The data mining process presented in Chapter 3 needs to consider the type of outcome. Some targets are continuous, such as the proportion of income spent on groceries or any other category of spending, while many data mining applications call for classification (fraud or not; repayment expected or not; high, medium, or low performance). In both cases, the usual practice is to try all three of the major predictive models (regression, neural networks, and decision trees). We will show the different data requirements of each model for both prediction and classification.

Another data issue is balance. In classification models, one outcome is sometimes much rarer than the other. One hopes that the incidence of cancer is very low, just as insurance companies hope that filed claims will not be fraudulent, and bankers hope that loans will be repaid. When a high degree of imbalance is present, algorithms have a propensity to produce degenerate models that simply assign every case to the majority class. That yields a correct classification rate equal to the proportion of the majority outcome, which may exceed 0.99, but it does not provide a useful classification model.

Balancing was described for the fraud dataset. Essentially, there are two easy ways to balance data (more involved methods exist, but they are essentially variants of these two approaches). One is to delete majority cases, which is problematic in that the cases to be deleted must be selected fairly, for example at random. The other approach is to increase the number of minority cases. Advanced approaches are needed to obtain a precise number of cases for each outcome, but it is straightforward to simply replicate all of the minority cases as many times as needed to obtain roughly the same number of cases for each category.
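For readers who want to balance data directly in R rather than through a GUI, the following is a minimal sketch of both approaches. The data frame name train and the outcome column AutoBin are assumed names used for illustration, not objects produced by Rattle.

# Assume a data frame 'train' with a binary outcome 'AutoBin' ("low" = majority, "high" = minority)
majority <- train[train$AutoBin == "low", ]
minority <- train[train$AutoBin == "high", ]

# Option 1: undersample the majority class down to the size of the minority class
set.seed(123)
under <- rbind(minority, majority[sample(nrow(majority), nrow(minority)), ])

# Option 2: oversample by replicating the minority cases until the counts roughly match
times <- floor(nrow(majority) / nrow(minority))
over <- rbind(majority, minority[rep(seq_len(nrow(minority)), times), ])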

This chapter demonstrates Rattle models for a bigger dataset. The expenditure data described in earlier chapters poses a situation where the objective is to predict the proportion of expendable income spent on a particular category, such as automotive expenditures. As stated earlier, we will demonstrate regression, neural networks, and decision trees for both continuous and categorical targets.

Expenditure Data

The dataset we will use has 10,000 observations, by no means massive, but a bit larger than the loan data we have used in prior chapters to demonstrate concepts. Expenditure data was described in Chapter 4. Here, we will begin by dividing the data into a training set of 8,000 observations and a test set of 2,000 observations. We also have a small set of 20 new cases for predictive purposes, as given in Table 9.1.
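Outside the Rattle GUI, a split of this kind can be reproduced in a few lines of base R. This is only a sketch; the file name ExpenditureAuto.csv is an assumed name for the full 10,000-observation file.

# Read the full expenditure dataset (file name assumed; substitute your own copy)
expend <- read.csv("ExpenditureAuto.csv")

# Reserve 8,000 of the 10,000 observations for training and the remaining 2,000 for testing
set.seed(42)
train_rows <- sample(nrow(expend), 8000)
train <- expend[train_rows, ]
test <- expend[-train_rows, ]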


Table 9.1 ExpenditureAutoNewBinary.csv

Age  Gender  Marital  Dependents  Income  YrJob  YrTown  YrEd  DrivLic  OwnHome  CredC  Churn  AutoBin
22   0       0        0           18      0      3       16    1        0        4      0      ?
24   1       1        1           21      1      1       12    1        0        3      1      ?
26   0       0.5      3           25      5      3       16    1        1        6      0      ?
27   1       1        2           24      2      2       14    1        0        2      0      ?
29   0       0        0           27      4      4       11    1        0        3      1      ?
31   1       0        0           28      0      3       12    1        0        4      0      ?
33   0       1        1           29      3      16      14    1        1        2      1      ?
33   1       0.5      0           31      2      33      18    0        0        3      0      ?
34   0       1        3           44      8      24      16    1        1        6      1      ?
36   1       0        0           27      12     12      14    1        0        1      0      ?
37   0       0        0           35      19     5       12    1        0        2      0      ?
39   1       1        2           52      2      2       18    1        1        5      1      ?
41   0       0.5      0           105     16     4       16    1        1        3      0      ?
42   1       1        3           70      15     14      12    1        1        4      0      ?
44   0       0        0           66      12     8       16    1        0        3      1      ?
47   1       1        1           44      6      6       12    1        0        3      1      ?
52   0       0.5      0           65      20     52      12    1        1        2      0      ?
55   1       1        0           72      21     22      18    1        1        2      1      ?
56   0       1        0           87      16     3       12    1        1        1      0      ?
66   1       0        0           55      0      18      12    0        1        1      0      ?



R (Rattle) Calculations

We took the original data and made the outcome binary by assigning the continuous variable ProAuto a value of “high” if it was above 0.12 (12 percent of expendable income spent on automotive items) and “low” otherwise. The majority of cases were “low.” Figure 9.1 gives the R screen for linking the training file ExpenditureAutoTrainBinary.csv.


Figure 9.1 Data input
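The same recoding can be done in one line of R. The cutoff of 0.12 is the one described above; the data frame name expend is an assumption carried over from the earlier sketch.

# Recode the continuous proportion ProAuto into a binary class at the 0.12 cutoff
expend$AutoBin <- ifelse(expend$ProAuto > 0.12, "high", "low")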

Figure 9.2 displays the R correlation graphic, indicating a strong positive relationship between automotive expenditures and holding a driver’s license, as well as strong negative relationships between automotive expenditures and both income and age.


Figure 9.2 Rattle correlation graphic

Credit cards and churn have a notable relationship, indicating that they might contain some overlapping information, as do driver’s license and age.

To run the correlation over all variables, you need to set AutoCat to Input and re-execute the data. This yields the correlation graphic shown in Figure 9.2. The strongest correlations with AutoCat are those of Income and Age, with DrivLic also appearing to have something to contribute. However, Age and DrivLic are correlated nearly as strongly with Income as they are with AutoCat, suggesting redundancy, so we might leave them out and prune the model down to Income (and possibly DrivLic) as independent variables.
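The same correlations can be inspected from the command line with cor(). This sketch uses the assumed names from earlier (the train data frame and the AutoBin class) and creates a 0/1 helper column, AutoNum, so the class can enter a Pearson correlation matrix.

# Numeric version of the class so it can be included in the correlation matrix
train$AutoNum <- ifelse(train$AutoBin == "high", 1, 0)

# Correlations among selected inputs and the outcome, rounded for readability
round(cor(train[, c("Age", "Income", "DrivLic", "CredC", "Churn", "AutoNum")]), 2)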

Logistic Regression

The logistic regression model with all the available independent variables is shown in Figure 9.3.


Figure 9.3 Rattle logistic regression model for auto expenditures

R reports a Pseudo R-squared for logistic regression models. In this case, it was 0.776 for the model using all independent variables.

The pruned model, using only Income and DrivLic, is shown in Figure 9.4. It has a pseudo R-squared of 0.774, nearly as high as that of the model in Figure 9.3, which used 12 independent variables. We expect the pruned model to be much more stable, and to apply it we only need values for income and driver’s license.


Figure 9.4 Pruned regression model
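Rattle’s logistic regression is a front end to glm() in base R. A minimal sketch of the full and pruned models, under the assumed data frame names used earlier, is:

# Code the target as 1 = high, 0 = low so glm() models the probability of "high"
train$High <- ifelse(train$AutoBin == "high", 1, 0)

# Full model: all available inputs
full <- glm(High ~ Age + Gender + Marital + Dependents + Income + YrJob + YrTown +
              YrEd + DrivLic + OwnHome + CredC + Churn,
            data = train, family = binomial)

# Pruned model: Income and driver's license only
pruned <- glm(High ~ Income + DrivLic, data = train, family = binomial)
summary(pruned)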

We can test both models against the test set of 2,000 observations; Table 9.2 gives the resulting coincidence matrices.


Table 9.2 Coincidence matrix for full and pruned logistic regression models

FULL        Full 0   Full 1   Total      PRUNED      Pru 0   Pru 1   Total
Actual 0    1709     52       1761       Actual 0    1710    51      1761
Actual 1    66       173      239        Actual 1    62      177     239
Total       1775     225      2000       Total       1772    228     2000



The full model’s overall accuracy was 0.941, while the pruned model’s was 0.943. In predicting low automotive expenditure, the full model was 0.970 accurate over the test set (1709/1761), while the pruned model was 0.971 (1710/1761). In predicting high automotive expenditure, the full model was 0.724 accurate (173/239) and the pruned model 0.741 (177/239). Thus the pruned model was slightly more accurate on all three metrics and provides a much more stable predictive model.
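The coincidence matrices above can be reproduced by scoring the test set and cross-tabulating predictions against actual outcomes; the 0.5 cutoff matches Rattle’s default. This sketch continues the assumed object names from the earlier sketches.

# Predicted probability of "high" for the 2,000 test cases, classified at the 0.5 cutoff
prob <- predict(pruned, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "high", "low")

# Coincidence (confusion) matrix and overall accuracy
conf <- table(Actual = test$AutoBin, Predicted = pred)
conf
sum(diag(conf)) / sum(conf)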

The models can be applied to the new cases from the Evaluate tab. Note that you need to click on the Score button, link the CSV file (ExpenditureAutoNewBin.csv), and select the Class button because this data has a categorical output. (If you instead select the Probability button, Rattle reports the value of the logistic function, between 0 and 1; the Class output applies Rattle’s default cutoff of 0.5.)

This creates an Excel-readable file that can be stored on your hard drive. Table 9.3 shows the inputs used by the logistic regression and the output in terms of classification. The full model provided the same classification predictions, with very similar probability predictions.
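Scoring the new cases from the command line follows the same pattern, using the file named in the Evaluate tab. Again, this is a sketch under the assumed object names.

# Score the 20 new cases with the pruned model; the result mirrors Table 9.3
newcases <- read.csv("ExpenditureAutoNewBin.csv")
newprob <- predict(pruned, newdata = newcases, type = "response")
data.frame(newcases[, c("Marital", "Income", "DrivLic")],
           Class = ifelse(newprob > 0.5, "High", "Low"))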


Table 9.3 Logistic regression application

NewCase   Marital   Income   DrivLic   Class
1         0         18       1         High
2         1         21       1         High
3         0.5       25       1         Low
4         1         24       1         High
5         0         27       1         Low
6         0         28       1         Low
7         1         29       1         Low
8         0.5       31       0         Low
9         1         44       1         Low
10        0         27       1         Low
11        0         35       1         Low
12        1         52       1         Low
13        0.5       105      1         Low
14        1         70       1         Low
15        0         66       1         Low
16        1         44       1         Low
17        0.5       65       1         Low
18        1         72       1         Low
19        1         87       1         Low
20        0         55       0         Low



Neural Networks

The training, test, and new case files are loaded in the same way as in the logistic regression section. Figure 9.5 simply shows the selection of the Neural Net button.


Figure 9.5 Running neural network model

The neural network model is run by clicking on Execute. The default model, with 10 hidden nodes, yielded a degenerate model that classified every test case as low. Raising the number of hidden nodes to 25 gave a more useful model.
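Rattle’s neural network model is built with the nnet package, which fits a single hidden layer; the number of hidden nodes is the size argument. A sketch with 25 hidden nodes, under the same assumed object names, is:

library(nnet)

# Single-hidden-layer network with 25 hidden nodes (raised from the default of 10)
set.seed(7)
nn <- nnet(factor(AutoBin) ~ Age + Gender + Marital + Dependents + Income + YrJob +
             YrTown + YrEd + DrivLic + OwnHome + CredC + Churn,
           data = train, size = 25, maxit = 200)

# Classify the test set and build the coincidence matrix of Table 9.4
nn_pred <- predict(nn, newdata = test, type = "class")
table(Actual = test$AutoBin, Predicted = nn_pred)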

The resulting coincidence matrix is shown in Table 9.4.


Table 9.4 Neural network model coincidence matrix

              NN low   NN high   Total
Actual low    1713     48        1761
Actual high   66       173       239
Total         1779     221       2000



This yields a correct classification rate of 0.943. High cases were correctly predicted 173/239 of the time, or 0.724, while low cases were correctly predicted 1,713/1,761 of the time, or 0.973. Application to the new cases gave predictions nearly identical to those of the logistic regression models (of the cases shown later in Table 9.6, only case 3 differs).

Decision Tree

The inputs are the same as with the other models. The decision tree is reported in Figure 9.6.


Figure 9.6 Decision tree model

The asterisks in Figure 9.6 indicate the seven rules obtained. Figure 9.7 shows the R graphic of this tree, obtained by selecting the Draw button.


Figure 9.7 R decision tree
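Rattle builds its decision trees with the rpart package. A minimal equivalent, again under the assumed object names, is the following; print() marks terminal nodes with asterisks, much like the Rattle output.

library(rpart)

# Classification tree over all inputs; rpart keeps only the useful splits (here Income and DrivLic)
tree <- rpart(factor(AutoBin) ~ Age + Gender + Marital + Dependents + Income + YrJob +
                YrTown + YrEd + DrivLic + OwnHome + CredC + Churn,
              data = train, method = "class")

# Print the rules, draw a basic version of the tree, and build the coincidence matrix of Table 9.5
print(tree)
plot(tree); text(tree, use.n = TRUE)
tree_pred <- predict(tree, newdata = test, type = "class")
table(Actual = test$AutoBin, Predicted = tree_pred)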

Note that only Income and DrivLic were used, even though all variables in the training set were available as inputs. The coincidence matrix is given in Table 9.5.


Table 9.5 Decision tree model coincidence matrix

DT          DT 0    DT 1    Total
Actual 0    1715    46      1761
Actual 1    70      169     239
Total       1785    215     2000



This yields a correct classification rate of 0.943. The correct prediction rate for high cases was 0.707 (169/239), while the correct prediction rate for low cases was 0.974 (1,715/1,761). Application to the new cases resulted in cases 1 through 4 being predicted as high and the rest low.

Comparison

Table 9.6 compares these models.


Table 9.6 Model comparisons in predicting auto expenditures

                   Logistic regression   Neural network   Decision tree
Overall accuracy   0.943                 0.943            0.943
High accuracy      0.741                 0.724            0.707
Low accuracy       0.971                 0.973            0.974
Case 1             High                  High             High
Case 2             High                  High             High
Case 3             Low                   High             High
Case 4             High                  High             High
Case 5             Low                   Low              Low
Case 6             Low                   Low              Low
Case 7             Low                   Low              Low
Case 8             Low                   Low              Low
Case 9             Low                   Low              Low
Case 10            Low                   Low              High



All three models had similar accuracy on all three measures, although the decision tree was slightly better at predicting low expenditure and correspondingly weaker at predicting high expenditure. The three models also broadly agreed on the new cases, differing on only a few of them. These results are typical and to be expected: different models will yield different results, and their relative advantages are liable to change with new data. That is why automated systems applied to big data should probably utilize all three types of models, with data scientists focusing their attention on refining the parameters of each model type to seek better fits for specific applications.

Summary

We conclude this short book by demonstrating R (Rattle) computations for the basic classification algorithms of data mining on a slightly larger dataset than we have used in prior chapters. The advent of big data has created an environment in which datasets of billions of records are possible. This book has not demonstrated that scope by any means, but it has demonstrated small-scale versions of the basic algorithms. The intent is to make data mining less of a black-box exercise, thus hopefully enabling users to be more intelligent in their application of data mining.

We have demonstrated three open source software products in earlier chapters of the book. KNIME has a very interesting GUI that presents the data mining process in an object-oriented format, but it is a little more involved for the user. WEKA is a very good tool, especially for users who want to compare different algorithms within each category; however, it can be problematic in dealing with test files and new case files (it can work, it just doesn’t always). Thus, R is the recommended software. R is widely used in industry and has all of the benefits of open source software: many eyes monitoring it, leading to fewer bugs; it is free; and it is scalable. Further, the R system enables extensive data manipulation and management.
