Chapter 9

Scalability

Previous chapters have focused on explaining models, using a small dataset (loan application approval) for demonstration. Data mining, and especially big data, implies much larger datasets, with key data increasingly collected in real time for applications of the kind we have presented. The major operational difference is scalability. R works with massive datasets and scales well, while KNIME and WEKA are potentially limited in their ability to deal with large sets of data.

In this chapter, we will demonstrate some data characteristics with R. The data mining process presented in Chapter 3 needs to consider the type of outcome. Some targets are continuous, such as the proportion of income spent on groceries or any other category of spending, while many data mining applications call for classification (fraud or not; repayment expected or not; high, medium, or low performance). In both cases, the usual practice is to try all three of the major predictive models (regression, neural networks, and decision trees). We will show the different data requirements of each model for both prediction and classification.

Another data issue is balance. In classification models, one outcome is sometimes much rarer than the other. One hopes that the incidence of cancer is very low, just as insurance companies hope that filed claims will not be fraudulent, and bankers hope that loans will be repaid. When a high degree of imbalance is present, algorithms have a propensity to produce degenerate models that simply assign every case to the majority class. That yields a correct classification rate equal to the proportion of the majority outcome, which may exceed 0.99, but it does not provide a useful classification model.

Balancing was described for the fraud dataset. Essentially, there are two easy ways to balance data (more involved methods exist, but they are essentially variants of these two approaches). One is to delete majority cases, which is problematic in that the cases to be deleted must be selected fairly, for example at random. The other approach is to increase the number of minority cases. Advanced approaches are needed to obtain a precise number of cases for each outcome, but it is straightforward to simply replicate all of the minority cases as many times as needed to obtain roughly the same number of cases for each category.
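For readers who want to balance data directly in R rather than through a GUI, the following is a minimal sketch of both approaches. The data frame name train and the outcome column AutoBin are assumed names used for illustration, not objects produced by Rattle.

# Assume a data frame 'train' with a binary outcome 'AutoBin' ("low" = majority, "high" = minority)
majority <- train[train$AutoBin == "low", ]
minority <- train[train$AutoBin == "high", ]

# Option 1: undersample the majority class down to the size of the minority class
set.seed(123)
under <- rbind(minority, majority[sample(nrow(majority), nrow(minority)), ])

# Option 2: oversample by replicating the minority cases until the counts roughly match
times <- floor(nrow(majority) / nrow(minority))
over <- rbind(majority, minority[rep(seq_len(nrow(minority)), times), ])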

This chapter demonstrates Rattle models for a bigger dataset. The expenditure data described in earlier chapters poses a situation where the objective is to predict the proportion of expendable income spent on a particular category, such as automotive expenditures. As stated earlier, we will demonstrate regression, neural networks, and decision trees for both continuous and categorical targets.

Expenditure Data

The dataset we will use has 10,000 observations, by no means massive, but a bit larger than the loan data we have used in prior chapters to demonstrate concepts. Expenditure data was described in Chapter 4. Here, we will begin by dividing the data into a training set of 8,000 observations and a test set of 2,000 observations. We also have a small set of 20 new cases for predictive purposes, as given in Table 9.1.
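Outside the Rattle GUI, a split of this kind can be reproduced in a few lines of base R. This is only a sketch; the file name ExpenditureAuto.csv is an assumed name for the full 10,000-observation file.

# Read the full expenditure dataset (file name assumed; substitute your own copy)
expend <- read.csv("ExpenditureAuto.csv")

# Reserve 8,000 of the 10,000 observations for training and the remaining 2,000 for testing
set.seed(42)
train_rows <- sample(nrow(expend), 8000)
train <- expend[train_rows, ]
test <- expend[-train_rows, ]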


Table 9.1 ExpenditureAutoNewBinary.csv

Age  Gender  Marital  Dependents  Income  YrJob  YrTown  YrEd  DrivLic  OwnHome  CredC  Churn  AutoBin
22   0       0        0           18      0      3       16    1        0        4      0      ?
24   1       1        1           21      1      1       12    1        0        3      1      ?
26   0       0.5      3           25      5      3       16    1        1        6      0      ?
27   1       1        2           24      2      2       14    1        0        2      0      ?
29   0       0        0           27      4      4       11    1        0        3      1      ?
31   1       0        0           28      0      3       12    1        0        4      0      ?
33   0       1        1           29      3      16      14    1        1        2      1      ?
33   1       0.5      0           31      2      33      18    0        0        3      0      ?
34   0       1        3           44      8      24      16    1        1        6      1      ?
36   1       0        0           27      12     12      14    1        0        1      0      ?
37   0       0        0           35      19     5       12    1        0        2      0      ?
39   1       1        2           52      2      2       18    1        1        5      1      ?
41   0       0.5      0           105     16     4       16    1        1        3      0      ?
42   1       1        3           70      15     14      12    1        1        4      0      ?
44   0       0        0           66      12     8       16    1        0        3      1      ?
47   1       1        1           44      6      6       12    1        0        3      1      ?
52   0       0.5      0           65      20     52      12    1        1        2      0      ?
55   1       1        0           72      21     22      18    1        1        2      1      ?
56   0       1        0           87      16     3       12    1        1        1      0      ?
66   1       0        0           55      0      18      12    0        1        1      0      ?



R (Rattle) Calculations

We took the original data and made the outcome binary by assigning the continuous variable ProAuto a value of “high” if it was above 0.12 (12 percent of expendable income spent on automotive items) and “low” otherwise. The majority of cases were “low.” Figure 9.1 gives the R screen for linking the training file ExpenditureAutoTrainBinary.csv.


Figure 9.1 Data input
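The same recoding can be done in one line of R. The cutoff of 0.12 is the one described above; the data frame name expend is an assumption carried over from the earlier sketch.

# Recode the continuous proportion ProAuto into a binary class at the 0.12 cutoff
expend$AutoBin <- ifelse(expend$ProAuto > 0.12, "high", "low")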

Figure 9.2 displays the R correlation graphic, indicating a strong positive relationship between automotive expenditures and holding a driver’s license, as well as strong negative relationships between automotive expenditures and both income and age.


Figure 9.2 Rattle correlation graphic

Credit cards and churn have a notable relationship, indicating that they might contain some overlapping information, as do driver’s license and age.

To run the correlation over all variables, you need to set AutoCat to Input and re-execute the data. This yields the correlation graphic shown in Figure 9.2. The strongest correlations with AutoCat are those of Income and Age, with DrivLic also appearing to have something to contribute. However, Age and DrivLic are correlated nearly as strongly with Income as they are with AutoCat, suggesting redundancy, so we might leave them out and prune the model down to Income (and possibly DrivLic) as independent variables.
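The same correlations can be inspected from the command line with cor(). This sketch uses the assumed names from earlier (the train data frame and the AutoBin class) and creates a 0/1 helper column, AutoNum, so the class can enter a Pearson correlation matrix.

# Numeric version of the class so it can be included in the correlation matrix
train$AutoNum <- ifelse(train$AutoBin == "high", 1, 0)

# Correlations among selected inputs and the outcome, rounded for readability
round(cor(train[, c("Age", "Income", "DrivLic", "CredC", "Churn", "AutoNum")]), 2)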

Logistic Regression

The logistic regression model with all the available independent variables is shown in Figure 9.3.


Figure 9.3 Rattle logistic regression model for auto expenditures

R reports a Pseudo R-squared for logistic regression models. In this case, it was 0.776 for the model using all independent variables.

The pruned model, using only Income and DrivLic, is shown in Figure 9.4. It has a pseudo R-squared of 0.774, nearly as high as that of the model in Figure 9.3, which used 12 independent variables. We expect the pruned model to be much more stable, and to apply it we only need values for income and driver’s license.


Figure 9.4 Pruned regression model
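Rattle’s logistic regression is a front end to glm() in base R. A minimal sketch of the full and pruned models, under the assumed data frame names used earlier, is:

# Code the target as 1 = high, 0 = low so glm() models the probability of "high"
train$High <- ifelse(train$AutoBin == "high", 1, 0)

# Full model: all available inputs
full <- glm(High ~ Age + Gender + Marital + Dependents + Income + YrJob + YrTown +
              YrEd + DrivLic + OwnHome + CredC + Churn,
            data = train, family = binomial)

# Pruned model: Income and driver's license only
pruned <- glm(High ~ Income + DrivLic, data = train, family = binomial)
summary(pruned)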

We can test both models against the test set of 2,000 observations; Table 9.2 gives the resulting coincidence matrices.


Table 9.2 Coincidence matrix for full and pruned logistic regression models

FULL        Full 0   Full 1   Total      PRUNED      Pru 0   Pru 1   Total
Actual 0    1709     52       1761       Actual 0    1710    51      1761
Actual 1    66       173      239        Actual 1    62      177     239
Total       1775     225      2000       Total       1772    228     2000



The full model’s overall accuracy was 0.941, while the pruned model’s was 0.943. In predicting low automotive expenditure, the full model was 0.970 accurate over the test set (1709/1761), while the pruned model was 0.971 (1710/1761). In predicting high automotive expenditure, the full model was 0.724 accurate (173/239) and the pruned model 0.741 (177/239). Thus the pruned model was slightly more accurate on all three metrics and provides a much more stable predictive model.
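The coincidence matrices above can be reproduced by scoring the test set and cross-tabulating predictions against actual outcomes; the 0.5 cutoff matches Rattle’s default. This sketch continues the assumed object names from the earlier sketches.

# Predicted probability of "high" for the 2,000 test cases, classified at the 0.5 cutoff
prob <- predict(pruned, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "high", "low")

# Coincidence (confusion) matrix and overall accuracy
conf <- table(Actual = test$AutoBin, Predicted = pred)
conf
sum(diag(conf)) / sum(conf)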

The models can be applied to the new cases from the Evaluate tab. Note that you need to click on the Score button, link the CSV file (ExpenditureAutoNewBin.csv), and select the Class button because this data has a categorical output. (If you instead select the Probability button, Rattle reports the value of the logistic function, between 0 and 1; the Class output applies Rattle’s default cutoff of 0.5.)

This creates an Excel-readable file that can be stored on your hard drive. Table 9.3 shows the inputs used by the logistic regression and the output in terms of classification. The full model provided the same classification predictions, with very similar probability predictions.
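Scoring the new cases from the command line follows the same pattern, using the file named in the Evaluate tab. Again, this is a sketch under the assumed object names.

# Score the 20 new cases with the pruned model; the result mirrors Table 9.3
newcases <- read.csv("ExpenditureAutoNewBin.csv")
newprob <- predict(pruned, newdata = newcases, type = "response")
data.frame(newcases[, c("Marital", "Income", "DrivLic")],
           Class = ifelse(newprob > 0.5, "High", "Low"))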


Table 9.3 Logistic regression application

NewCase   Marital   Income   DrivLic   Class
1         0         18       1         High
2         1         21       1         High
3         0.5       25       1         Low
4         1         24       1         High
5         0         27       1         Low
6         0         28       1         Low
7         1         29       1         Low
8         0.5       31       0         Low
9         1         44       1         Low
10        0         27       1         Low
11        0         35       1         Low
12        1         52       1         Low
13        0.5       105      1         Low
14        1         70       1         Low
15        0         66       1         Low
16        1         44       1         Low
17        0.5       65       1         Low
18        1         72       1         Low
19        1         87       1         Low
20        0         55       0         Low



Neural Networks

The training, test, and new case files are loaded in the same way as in the logistic regression section. Figure 9.5 simply shows the selection of the Neural Net button.


Figure 9.5 Running neural network model

The neural network model is run by clicking on Execute. The default model, with 10 hidden nodes, yielded a degenerate model that classified every test case as low. Raising the number of hidden nodes to 25 gave a more useful model.
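Rattle’s neural network model is built with the nnet package, which fits a single hidden layer; the number of hidden nodes is the size argument. A sketch with 25 hidden nodes, under the same assumed object names, is:

library(nnet)

# Single-hidden-layer network with 25 hidden nodes (raised from the default of 10)
set.seed(7)
nn <- nnet(factor(AutoBin) ~ Age + Gender + Marital + Dependents + Income + YrJob +
             YrTown + YrEd + DrivLic + OwnHome + CredC + Churn,
           data = train, size = 25, maxit = 200)

# Classify the test set and build the coincidence matrix of Table 9.4
nn_pred <- predict(nn, newdata = test, type = "class")
table(Actual = test$AutoBin, Predicted = nn_pred)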

The resulting coincidence matrix is shown in Table 9.4.


Table 9.4 Neural network model coincidence matrix

              NN low   NN high   Total
Actual low    1713     48        1761
Actual high   66       173       239
Total         1779     221       2000



This yields a correct classification rate of 0.943. High cases were correctly predicted 173/239 of the time, or 0.724, while low cases were correctly predicted 1,713/1,761 of the time, or 0.973. Application to the new cases gave predictions nearly identical to those of the logistic regression models (of the cases shown later in Table 9.6, only case 3 differs).

Decision Tree

The inputs are the same as with the other models. The decision tree is reported in Figure 9.6.


Figure 9.6 Decision tree model

The asterisks in Figure 9.6 indicate the seven rules obtained. Figure 9.7 shows the R graphic of this tree, obtained by selecting the Draw button.


Figure 9.7 R decision tree
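Rattle builds its decision trees with the rpart package. A minimal equivalent, again under the assumed object names, is the following; print() marks terminal nodes with asterisks, much like the Rattle output.

library(rpart)

# Classification tree over all inputs; rpart keeps only the useful splits (here Income and DrivLic)
tree <- rpart(factor(AutoBin) ~ Age + Gender + Marital + Dependents + Income + YrJob +
                YrTown + YrEd + DrivLic + OwnHome + CredC + Churn,
              data = train, method = "class")

# Print the rules, draw a basic version of the tree, and build the coincidence matrix of Table 9.5
print(tree)
plot(tree); text(tree, use.n = TRUE)
tree_pred <- predict(tree, newdata = test, type = "class")
table(Actual = test$AutoBin, Predicted = tree_pred)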

Note that only Income and DrivLic were used, even though all variables in the training set were available as inputs. The coincidence matrix is given in Table 9.5.


Table 9.5 Decision tree model coincidence matrix

DT          DT 0    DT 1    Total
Actual 0    1715    46      1761
Actual 1    70      169     239
Total       1785    215     2000



This yields a correct classification rate of 0.943. The correct prediction rate for high cases was 0.707 (169/239), while the correct prediction rate for low cases was 0.974 (1,715/1,761). Application to the new cases resulted in cases 1 through 4 being predicted as high and the rest low.

Comparison

Table 9.6 compares these models.


Table 9.6 Model comparisons in predicting auto expenditures

                   Logistic regression   Neural network   Decision tree
Overall accuracy   0.943                 0.943            0.943
High accuracy      0.741                 0.724            0.707
Low accuracy       0.971                 0.973            0.974
Case 1             High                  High             High
Case 2             High                  High             High
Case 3             Low                   High             High
Case 4             High                  High             High
Case 5             Low                   Low              Low
Case 6             Low                   Low              Low
Case 7             Low                   Low              Low
Case 8             Low                   Low              Low
Case 9             Low                   Low              Low
Case 10            Low                   Low              High



All three models had similar accuracy on all three measures, although the decision tree was slightly better at predicting low expenditure and correspondingly weaker at predicting high expenditure. The three models also broadly agreed on the new cases, differing on only a few of them. These results are typical and to be expected: different models will yield different results, and their relative advantages are liable to change with new data. That is why automated systems applied to big data should probably utilize all three types of models, with data scientists focusing their attention on refining the parameters of each model type to seek better fits for specific applications.

Summary

We conclude this short book by demonstrating R (Rattle) computations for the basic classification algorithms of data mining on a slightly larger dataset than we have used in prior chapters. The advent of big data has created an environment in which datasets of billions of records are possible. This book has not demonstrated that scope by any means, but it has demonstrated small-scale versions of the basic algorithms. The intent is to make data mining less of a black-box exercise, thus hopefully enabling users to be more intelligent in their application of data mining.

We have demonstrated three open source software products in earlier chapters of the book. KNIME has a very interesting GUI that presents the data mining process in an object-oriented format, but it is a little more involved for the user. WEKA is a very good tool, especially for users who want to compare different algorithms within each category; however, it can be problematic in dealing with test files and new case files (it can work, it just doesn’t always). Thus, R is the recommended software. R is widely used in industry and has all of the benefits of open source software: many eyes monitoring it, leading to fewer bugs; it is free; and it is scalable. Further, the R system enables extensive data manipulation and management.
