Chapter 8

Decision Tree Algorithms

Decision trees provide a means to obtain product-specific forecasting models in the form of if-then rules that are easy for users to implement. This data mining approach can be used by grocery retailers in a number of policy decisions, including inventory replenishment ordering as well as evaluation of alternative promotion campaigns.

As was the case with regression models and neural networks, decision tree models support the data mining process of modeling. Decision trees in the context of data mining refer to the tree structure of rules (often referred to as association rules). The data mining decision tree process involves collecting those variables that the analyst thinks might bear on the decision at issue and analyzing these variables for their ability to predict the outcome. Decision trees are useful for gaining further insight into customer behavior, as well as leading to ways to profitably act on the results. The algorithm automatically determines which variables are most important, based on their ability to sort the data into the correct output category. The method has a relative advantage over neural networks and genetic algorithms in that a reusable set of rules is provided, thus explaining model conclusions. There are many examples where decision trees have been applied to business data mining, including classifying loan applicants, screening potential consumers, and rating job applicants.

Decision trees provide a way to implement rule-based system approaches. The ID3 system selects an attribute as a root, with branches for different values of that attribute. All objects in the training set are classified into these branches. If all objects in a branch belong to the same output class, the node is labeled and that branch is terminated. If there are multiple classes on a branch, another attribute is selected as a node, with all of its possible values branched. An entropy heuristic is used to select the attribute with the highest information content. Other data mining tools often use other bases for selecting branches.
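To make this concrete, the following is a minimal sketch of the ID3 idea in R. It is not the code of any particular package; the function names (entropy, weighted_entropy, id3) and the assumed data frame layout are hypothetical, and real implementations add pruning, handling of numeric attributes, and more careful stopping rules.

entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))      # 0 * log2(0) treated as 0
}

weighted_entropy <- function(x, y) {
  # entropy of outcome y after splitting on attribute x, weighted by branch size
  sum(sapply(split(y, x), function(ys) length(ys) / length(y) * entropy(ys)))
}

id3 <- function(data, outcome, attrs, depth = 0) {
  y <- data[[outcome]]
  indent <- strrep("  ", depth)
  if (length(unique(y)) == 1 || length(attrs) == 0) {
    cat(indent, "predict:", names(which.max(table(y))), "\n")
    return(invisible(NULL))
  }
  scores <- sapply(attrs, function(a) weighted_entropy(data[[a]], y))
  best <- names(which.min(scores))         # attribute with the lowest entropy (most information)
  for (v in unique(data[[best]])) {
    cat(indent, best, "=", v, "\n")
    id3(data[data[[best]] == v, ], outcome, setdiff(attrs, best), depth + 1)
  }
}

# Hypothetical use, with a data frame of categorical columns and a class outcome:
# id3(loans, outcome = "Result", attrs = c("Age", "Income", "Risk"))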

Decision Tree Operation

A bank may have a database of past applicants for short-term loans. This database (a simplification of Table 4.4) consists of the applicant's age, the applicant's income, and a risk rating, with the outcome being whether the loan was paid on time or not. The bank's policy treats applicants differently by age group, income level, and risk. Age groups are less than or equal to 30 years (young), over 30 but less than or equal to 60 (middle), and over 60 (old). Income levels are less than or equal to $30,000 per year (low), over $30,000 but less than or equal to $80,000 per year (average), and over $80,000 per year (high). High risk is defined as applicants with greater debt than assets. If an applicant's assets minus the requested loan amount exceed debt, the applicant is classified as low risk. Applicants with asset–debt relationships between these values are classified as medium risk. A tree sorts the possible combinations of these variables. An exhaustive tree enumerates all combinations of variable values, as shown in Table 8.1.


Table 8.1 Enumeration of appliance loan variable combinations

Age = young
    Income = low:      Risk = low; Risk = medium; Risk = high
    Income = average:  Risk = low; Risk = medium; Risk = high
    Income = high:     Risk = low; Risk = medium; Risk = high

Age = middle
    Income = low:      Risk = low; Risk = medium; Risk = high
    Income = average:  Risk = low; Risk = medium; Risk = high
    Income = high:     Risk = low; Risk = medium; Risk = high

Age = old
    Income = low:      Risk = low; Risk = medium; Risk = high
    Income = average:  Risk = low; Risk = medium; Risk = high
    Income = high:     Risk = low; Risk = medium; Risk = high
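The age, income, and risk groupings just described can also be written directly as code. Below is a minimal sketch in R; the applicant fields (age, income, assets, debts, want) are hypothetical column names, with want assumed to be the requested loan amount, and the cut points follow the definitions given above.

applicant <- data.frame(age = 35, income = 42000, assets = 50000,
                        debts = 20000, want = 5000)

age_group <- cut(applicant$age, breaks = c(-Inf, 30, 60, Inf),
                 labels = c("young", "middle", "old"))
income_group <- cut(applicant$income, breaks = c(-Inf, 30000, 80000, Inf),
                    labels = c("low", "average", "high"))
risk <- with(applicant,
             ifelse(debts > assets, "high",                # more debt than assets
                    ifelse(assets - want > debts, "low",   # assets minus requested loan exceed debt
                           "medium")))
c(as.character(age_group), as.character(income_group), risk)   # "middle" "average" "low"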



The tree for this set of rules is partially shown in Figure 8.1.


Figure 8.1 Partial tree for loan application data

A rule-based system model would require that bank loan officers whose judgment is respected be interviewed to classify (and justify) the decision for each of these combinations of variables. Some trees may be simplified. For instance, if income is high, some loan officers might grant the loan automatically. This would reduce the tree, as age and risk would not matter in that case. Further, for those without low income, if risk were high, the loan might not be granted. The first rule would be as shown in Table 8.2.


Table 8.2 Initial rule

Age | Income | Risk | Rule
(any) | High | (any) | Grant loan
(any) | Not high | (any) | Deny loan



This structure uses classification data, because there is a clear, finite number of outcomes (do not grant the loan, grant the loan, and possibly authorize part of the loan). This form of tree is referred to as a classification tree. Trees can also be used on data with continuous outcomes, in an estimating or predicting mode. In those cases, the tree is referred to as a regression tree.

Rule Interestingness

Data, even categorical data, can involve potentially many rules. For instance, in the loan application data just presented, there were three variables, each of which could take on three values, giving 3 × 3 × 3 = 27 combinations, each enumerated in Table 8.1. If there were 10 variables, each with 4 possible values, the number of combinations would be over a million (4^10 = 1,048,576). Clearly, brute-force generation of rule outcomes is unreasonable. Fortunately, decision tree methods identify the most useful rules in terms of predicting outcomes. Rule effectiveness is measured in terms of confidence and support. Confidence is the degree of accuracy of a rule. Support is the degree to which the antecedent conditions occur in the data; support for an association rule indicates the proportion of records covered by the set of attributes in the rule. For example, if a particular book-purchase rule's antecedent appeared in only 10 of 10 million purchase records, its support would be 10/10,000,000, a very small support measure of 0.000001. These concepts are often used in the form of threshold levels in machine learning systems: minimum confidence and support levels can be specified so that only rules meeting them are retained by decision tree (or other association rule) methods.
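As a small illustration of how support and confidence are computed, consider the following sketch in R. The transaction data frame trans and its logical columns (eggs, bacon, jam) are hypothetical.

trans <- data.frame(
  eggs  = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  bacon = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  jam   = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

antecedent <- trans$eggs & trans$bacon                        # records matching the IF part
support    <- mean(antecedent)                                # proportion of all records covered
confidence <- sum(antecedent & trans$jam) / sum(antecedent)   # accuracy of "eggs AND bacon -> jam"
cat("support =", support, " confidence =", confidence, "\n")

In a machine learning package, minimum thresholds on these two measures are what keep the generated rule list manageable.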

For rules to be interesting, they must identify something useful (a high confidence level and sufficiently high support) and novel. For instance, a grocer applying data mining who finds that eggs and bacon are purchased together at a confidence level of 0.9 and a support level of 0.2 might not be impressed; the grocer knew that prior to data mining. Interestingness is the idea that the data mining analysis found something unexpected (knowledge discovery). It is still useful to confirm the hypothesis that eggs and bacon sell together, but it is more useful to discover the knowledge that blackberry jam and eggs go together at a similar rate. (Please note that we do not know of such a real relationship.) Such information could lead to rearrangement of store displays, promotions, or both, which would make it interesting.

Machine Learning

Rule induction algorithms have been developed to automatically process data of this type. For this approach to work, a clear outcome is needed; in this case, a clear outcome exists: two categories of payoff results. Rule induction works by searching through data for patterns and relationships, and the records can be clustered into specific categories. Machine learning starts with no assumptions, looking only at the input data and results. The judgment developed by human experts is not considered, which might sound inefficient, but human biases can thus be eliminated. Recursive partitioning algorithms split the data (the original data, not grouped as above) into finer and finer subsets, leading to a decision tree.

For instance, let us consider the 20 past loan application cases in Table 8.3, with known outcomes.


Table 8.3 Twenty past loan application cases

Age | Income | Assets | Debts | Want | Risk | Result
20 (young) | 17,152 (low) | 11,090 | 20,455 | 400 | High | On-time
23 (young) | 25,862 (low) | 24,756 | 30,083 | 2,300 | High | On-time
28 (young) | 26,169 (low) | 47,355 | 49,341 | 3,100 | High | Late
23 (young) | 21,117 (low) | 21,242 | 30,278 | 300 | High | Default
22 (young) | 7,127 (low) | 23,903 | 17,231 | 900 | Low | On-time
26 (young) | 42,083 (average) | 35,726 | 41,421 | 300 | High | Late
24 (young) | 55,557 (average) | 27,040 | 48,191 | 1,500 | High | On-time
27 (young) | 34,843 (average) | 0 | 21,031 | 2,100 | High | On-time
29 (young) | 74,295 (average) | 88,827 | 100,599 | 100 | High | On-time
23 (young) | 38,887 (average) | 6,260 | 33,635 | 9,400 | Low | On-time
28 (young) | 31,758 (average) | 58,492 | 49,268 | 1,000 | Low | On-time
25 (young) | 80,180 (high) | 31,696 | 69,529 | 1,000 | High | Late
33 (middle) | 40,921 (average) | 91,111 | 90,076 | 2,900 | Average | Late
36 (middle) | 63,124 (average) | 164,631 | 144,697 | 300 | Low | On-time
39 (middle) | 59,006 (average) | 195,759 | 161,750 | 600 | Low | On-time
39 (middle) | 125,713 (high) | 382,180 | 315,396 | 5,200 | Low | On-time
55 (middle) | 80,149 (high) | 511,937 | 21,923 | 1,000 | Low | On-time
62 (old) | 101,291 (high) | 783,164 | 23,052 | 1,800 | Low | On-time
71 (old) | 81,723 (high) | 776,344 | 20,277 | 900 | Low | On-time
63 (old) | 99,522 (high) | 783,491 | 24,643 | 200 | Low | On-time



There are three variables, each with three possible levels. In practice, we would expect thousands of observations, making it unlikely that any combination would be empty. We use only 20 observations to demonstrate the principles of the calculations, so many of the 27 combinations here are empty. The combinations with representative observations are given in Table 8.4.


Table 8.4 Grouped data

Age | Income | Risk | Total | On-time | Not on-time | Probability
Young | Low | High | 4 | 2 | 2 | 0.50
Young | Low | Low | 1 | 1 | 0 | 1.00
Young | Average | High | 4 | 3 | 1 | 0.75
Young | Average | Low | 2 | 2 | 0 | 1.00
Young | High | High | 1 | 0 | 1 | 0.00
Middle | Average | Average | 1 | 0 | 1 | 0.00
Middle | Average | Low | 2 | 2 | 0 | 1.00
Middle | High | Low | 2 | 2 | 0 | 1.00
Old | High | Low | 3 | 3 | 0 | 1.00



Automatic machine learning begins with identifying those variables that offer the greatest likelihood of distinguishing between the possible outcomes. For each of the three variables, we can identify the outcome probabilities in Table 8.5.


Table 8.5 Combination outcomes

Variable | Value | Cases | On-time | Late | Prob (on-time)
Age | Young | 12 | 8 | 4 | 0.67
Age | Middle | 5 | 4 | 1 | 0.80
Age | Old | 3 | 3 | 0 | 1.00
Income | Low | 5 | 3 | 2 | 0.60
Income | Average | 9 | 7 | 2 | 0.78
Income | High | 6 | 5 | 1 | 0.83
Risk | High | 9 | 5 | 4 | 0.55
Risk | Average | 1 | 0 | 1 | 0.00
Risk | Low | 10 | 10 | 0 | 1.00



Most data mining packages use an entropy measure to gauge the discriminating power of each variable, selecting the variable with the greatest discriminating power as the first on which to split the data. (Chi-square measures can also be used to select variables.) Three of the nine categories here have all observations in one outcome class: age = old and risk = low both have all cases in the on-time category, while risk = average has only one observation, which is in the late category (suspicious both logically and because of the minimal sample size, but remember that we are demonstrating the procedure with a small dataset).

One formula for entropy, where p is the number of positive examples and n is the number of negative examples in the training set for a given value of the attribute, is:

Inform(p, n) = −[p/(p + n)] × log2[p/(p + n)] − [n/(p + n)] × log2[n/(p + n)]

The entropy for an attribute is then the sum, over that attribute's values, of Inform(p, n) for each value weighted by the proportion of training cases taking that value.

This formula has a problem: if either p or n is 0 (which happens when a category has unanimous outcomes), the log to base 2 of 0 is undefined, and the formula cannot be evaluated directly. However, as the proportion approaches 0, the corresponding term of the Inform formula converges to 0, so such terms are treated as 0. For the variable Age, there are three categories. Entropy for each Age category by this formula is shown in Table 8.6.


Table 8.6 Entropy calculation for Age

Category | p/(p+n) | Log(base 2) | n/(p+n) | Log(base 2) | Sum of products | Probability | Product
Young | 8/12 | −0.585 | 4/12 | −1.585 | 0.918 | 12/20 | 0.551
Middle | 4/5 | −0.322 | 1/5 | −2.322 | 0.722 | 5/20 | 0.180
Old | 3/3 | 0* | 0/3 | 0* | 0* | 3/20 | 0*
Sum | | | | | | | 0.731

*Formula strictly undefined, but converges to zero.


For category Young, the calculation is [−(8/12) × (−0.585) − (4/12) × (−1.585)] × (12/20) = 0.551.
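The same arithmetic can be scripted. Below is a minimal sketch in R using the counts from Table 8.5; the function names inform and weighted_entropy are hypothetical, and the calls should reproduce the entropy values listed next.

inform <- function(p, n) {
  f <- function(x) ifelse(x > 0, x * log2(x), 0)   # treat 0 * log2(0) as 0
  -f(p / (p + n)) - f(n / (p + n))
}

weighted_entropy <- function(p, n) {               # p, n: on-time and late counts by category
  total <- sum(p) + sum(n)
  sum((p + n) / total * inform(p, n))
}

weighted_entropy(p = c(8, 4, 3), n = c(4, 1, 0))    # Age:    about 0.731
weighted_entropy(p = c(3, 7, 5), n = c(2, 2, 1))    # Income: about 0.782
weighted_entropy(p = c(5, 0, 10), n = c(4, 1, 0))   # Risk:   about 0.446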

The lower this entropy measure, the greater the information content (the greater the agreement probability). The entropy measures for the three variables are:

    Age       0.731
    Income    0.782
    Risk      0.446

By this measure, Risk has the greatest information content. If Risk is low, the data indicates a 1.0 probability (10 of 10 cases) that the applicant will pay the loan back on-time. If Risk is not low, this data indicates a 0.5 probability that the applicant will pay the loan back on time. This would be the first rule selected by the machine learning algorithm.


IF (Risk = Low)

THEN Predict On-time payment

ELSE

Predict Late


This rule is subject to two types of errors. First, those applicants rated as low risk may actually not pay on time. (From the data, the probability of this happening is 0.0.) Second, those applicants rated as high or average risk may actually have paid if given a loan. (From the data, the probability of this happening is 0.5.) The expectation of this is the probability of an applicant being rated as high or average in risk (10/20, or 0.5) times the probability of being wrong (0.5), yielding an expected error of 0.25. To test this rule, it is best to apply it to a second set of data, different from the dataset used to develop the rule.

The set of rules can be examined further to see if greater accuracy can be obtained. The entropy formula for Age, given that risk was not low, is 0.991, while the same calculation for income is 1.971. This indicates that Age has greater discriminating power. With this data, if Age is middle, the one case did not pay on time. There were no old cases in this group. Therefore, the second rule is:


IF (Risk is NOT low)

AND (Age = Middle)

THEN Predict Late

ELSE

Predict On-time


In this case, the data would indicate probabilities shown in Table 8.7.


Table 8.7 Probabilities by case

Risk | Age | Count | Probability on-time
Low | (any) | 10/20 | 1.0
NOT low | Middle | 1/20 | 0.0
NOT low | Young | 9/20 | 0.56



All 10 cases in this subset of the data with a Low risk rating paid on time. Of the other 10 cases, the 1 that was middle aged did not pay on time (as stated earlier), while 5 of the 9 young cases paid on time. The expected error here is the 4/9 error probability applied to those 9 of the 20 total cases, (9/20) × (4/9) = 0.2. This is an improvement over the prior case, where the expected error was 0.25.

For the last variable, Income, given that Risk was not low, and Age was not Middle, there are nine cases left, shown in Table 8.8.


Table 8.8 Probabilities recalculated

Income | On-time | Late | Probability on-time
Low | 2 | 2 | 0.50
Average | 3 | 1 | 0.75
High | 0 | 1 | 0.00



Four of these nine cases were Low income, with two paying on time and two not. Four were Average income, three of which paid on time and one not. The last case was High income, which was not paid on time.

A third rule, taking advantage of the case with unanimous outcome, is:


IF (Risk NOT low)

AND (Age NOT middle)

AND (Income high)

THEN predict Late

ELSE

Predict on-time


The expected accuracy of the three rules together is shown in Table 8.9.


Table 8.9 Expected accuracy of three rules

Risk | Age | Income | Count | Probability (on-time)
Low | (any) | (any) | 10/20 | 1.0
NOT low | Middle | (any) | 1/20 | 0.0
NOT low | Young | High | 1/20 | 0.0
NOT low | Young | NOT high | 8/20 | 0.625



The expected error here is 8/20 times 0.375 = 0.15.

An additional rule could be generated. For the case of Risk NOT low, Age = Young, and Income NOT high, there are 4 cases with low income (probability of on-time = 0.5) and 4 cases with average income (probability of on-time = 0.75). The greater discrimination is provided by average income, resulting in the rule:


IF (Risk NOT low)

AND (Age NOT middle)

AND (Income average)

THEN predict on-time

ELSE

Predict either


There is no added accuracy obtained with this rule, shown in Table 8.10.


Table 8.10 Accuracy of fourth rule

Risk | Age | Income | Count | Probability (on-time)
Low | (any) | (any) | 10/20 | 1.0
NOT low | Middle | (any) | 1/20 | 0.0
NOT low | Young | High | 1/20 | 0.0
NOT low | Young | Average | 4/20 | 0.75
NOT low | Young | Low | 4/20 | 0.50



The expected error is 4/20 times 0.25 plus 4/20 times 0.5, which equals 0.15, the same as without the rule. When machine learning methods encounter no improvement, they generally stop. (Here, we have exhausted all combinations anyway.) Therefore, the last rule would not be added to the first three. The machine learning algorithm would stop at:


IF (Risk low)

THEN predict On-time

ELSE

IF (Risk NOT low)

AND (Age middle)

THEN predict Late

ELSE

IF (Risk NOT low)

AND (Age NOT middle)

AND (Income high)

THEN predict Late

ELSE

Predict On-time
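Expressed as code, this final rule set is just a short nested test. Here is a minimal sketch in R; the function name predict_loan and the example applicants are hypothetical, but the logic follows the rules above.

predict_loan <- function(risk, age, income) {
  ifelse(risk == "Low", "On-time",
         ifelse(age == "Middle", "Late",
                ifelse(income == "High", "Late", "On-time")))
}

# Hypothetical example applicants, one per rule branch:
cases <- data.frame(Risk   = c("Low",  "High",    "High",  "High"),
                    Age    = c("Old",  "Middle",  "Young", "Young"),
                    Income = c("High", "Average", "High",  "Average"))
cases$Prediction <- predict_loan(cases$Risk, cases$Age, cases$Income)
print(cases)   # predictions: On-time, Late, Late, On-time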


This model can be tested against a separate set of 10 test cases. Table 8.11 shows the results.


Table 8.11 Rules applied to test cases

Record | Actual | Rule prediction | By rule | Result
1 | On-time | On-time | 1 | Match
2 | On-time | On-time | 1 | Match
3 | On-time | On-time | 1 | Match
4 | On-time | On-time | Else | Match
5 | On-time | On-time | 1 | Match
6 | On-time | On-time | Else | Match
7 | Late | On-time | Else | Miss
8 | On-time | On-time | 1 | Match
9 | On-time | Late | 3 | Miss
10 | On-time | On-time | 1 | Match



Figure 8.2 shows the decision tree for this set of rules.


Figure 8.2 Decision tree for rule

In this case, the coincidence matrix is shown in Table 8.12.


Table 8.12 Coincidence matrix for rule set

Actual \ Model | Late | On-time | Total
Late | 0 | 1 | 1
On-time | 1 | 8 | 9
Total | 1 | 9 | 10



In this case, the accuracy of the model is 8/10, or 80 percent. There were errors of both types (denying a good loan application, accepting one bad application).

Software Demonstrations

We use the loan application dataset to demonstrate decision tree operation, both manually and with software. The full loan application dataset has 650 observations over four variables. We use the first 400 observations for training, reserving the last 250 observations for testing. The data, described in Chapter 4, is grouped by age category in Tables 8.13 through 8.15, beginning with the young applicants in Table 8.13.


Table 8.13 Grouped data—young

Age | Income | Risk | Credit rating | On-time | Not on-time | Probability
Young | Low | High | Red | 3 | 1 | 0.750
Young | Low | High | Amber | 7 | 1 | 0.875
Young | Low | High | Green | 17 | 1 | 0.944
Young | Low | Average | Red | 1 | 1 | 0.500
Young | Low | Average | Green | 1 | 0 | 1.000
Young | Low | Low | Green | 7 | 0 | 1.000
Young | Average | High | Red | 9 | 9 | 0.500
Young | Average | High | Amber | 24 | 10 | 0.706
Young | Average | High | Green | 42 | 6 | 0.875
Young | Average | Average | Amber | 1 | 0 | 1.000
Young | Average | Average | Green | 4 | 0 | 1.000
Young | Average | Low | Red | 2 | 0 | 1.000
Young | High | High | Red | 1 | 1 | 0.500
Young | High | High | Amber | 5 | 3 | 0.625
Young | High | High | Green | 12 | 0 | 1.000



Table 8.14 gives the categorical counts for middle-aged loan applicants.


Table 8.14 Grouped data—middle aged

Age | Income | Risk | Credit rating | On-time | Not on-time | Probability
Middle | Low | Average | Green | 3 | 0 | 1.000
Middle | Low | Low | Red | 0 | 1 | 0.000
Middle | Low | Low | Amber | 1 | 0 | 1.000
Middle | Low | Low | Green | 2 | 0 | 1.000
Middle | Average | High | Red | 1 | 1 | 0.500
Middle | Average | High | Amber | 4 | 2 | 0.667
Middle | Average | High | Green | 17 | 1 | 0.944
Middle | Average | Average | Amber | 4 | 0 | 1.000
Middle | Average | Average | Green | 4 | 0 | 1.000
Middle | Average | Low | Red | 13 | 1 | 0.929
Middle | Average | Low | Amber | 37 | 4 | 0.902
Middle | Average | Low | Green | 64 | 1 | 0.985
Middle | High | High | Amber | 1 | 0 | 1.000
Middle | High | High | Green | 7 | 0 | 1.000
Middle | High | Low | Red | 8 | 0 | 1.000
Middle | High | Low | Amber | 14 | 0 | 1.000
Middle | High | Low | Green | 30 | 0 | 1.000



Finally, we have the Old applicants in Table 8.15.


Table 8.15 Grouped data—age equals old

Age | Income | Risk | Credit rating | On-time | Not on-time | Probability
Old | Average | Low | Red | 0 | 1 | 0.000
Old | Average | Low | Amber | 1 | 0 | 1.000
Old | Average | Low | Green | 2 | 0 | 1.000
Old | High | Low | Red | 2 | 0 | 1.000
Old | High | Low | Amber | 1 | 0 | 1.000
Old | High | Low | Green | 3 | 0 | 1.000



Most of the possible old categories were vacant. Furthermore, all of the old applicant categories had unanimity in the outcome. Automatic machine learning begins with identifying those variables that offer the greatest likelihood of distinguishing between the possible outcomes. For each of the four variables, we can identify the outcome probabilities in Table 8.16.


Table 8.16 Combination outcomes

Variable | Value | Cases | On-time | Late | Prob (on-time)
Age | Young | 169 | 136 | 33 | 0.805
Age | Middle | 221 | 210 | 11 | 0.950
Age | Old | 10 | 9 | 1 | 0.900
Income | Low | 47 | 42 | 5 | 0.894
Income | Average | 265 | 229 | 36 | 0.864
Income | High | 88 | 84 | 4 | 0.955
Risk | High | 186 | 150 | 36 | 0.806
Risk | Average | 19 | 18 | 1 | 0.947
Risk | Low | 195 | 187 | 8 | 0.959
Credit rating | Red | 56 | 40 | 16 | 0.714
Credit rating | Amber | 120 | 100 | 20 | 0.833
Credit rating | Green | 224 | 215 | 9 | 0.960



For the variable Age, there are three categories. Entropy for Age by this formula is shown in Table 8.17.


Table 8.17 Entropy calculation for Age

Category | p/(p+n) | Log(base 2) | n/(p+n) | Log(base 2) | Sum | Probability | Product
Young | 136/169 | −0.313 | 33/169 | −2.356 | 0.712 | 169/400 | 0.301
Middle | 210/221 | −0.074 | 11/221 | −4.328 | 0.285 | 221/400 | 0.158
Old | 9/10 | −0.152 | 1/10 | −3.322 | 0.469 | 10/400 | 0.012
Sum | | | | | | | 0.470



Under the heading Sum is the following calculation. For Young, the first term of the Inform equation multiplies the ratio 136/169 by −0.313 and by −1. The second term multiplies the ratio 33/169 by −2.356 and by −1. The sum of these terms is 0.712. This category represents 169 of the 400 training cases. The heading Product is the product of Sum and Probability; for Young, this is 0.712 times 169/400, yielding 0.301. The sum of the three products yields the entropy measure. The lower this entropy measure, the greater the information content (the greater the agreement probability). The entropy measures for the four variables are:

[Entropy values for Age, Income, Risk, and Credit rating: figure not reproduced; Age is 0.470 per Table 8.17, and Credit rating has the lowest value.]

By this measure, Credit rating has the greatest information content. Among the Credit rating categories, we want to select the one with the greatest ability to accurately categorize data (the probability closest to 1.0 for granting the loan, or to 0.0 for denying it). We can also consider the relative cost of an error. An error of granting a loan that is not paid on time may cost nine times as much ($900) as denying an applicant who would have paid on time ($100). We can, therefore, use a cutoff limit of 0.90 probability (with these costs, granting is preferred exactly when the probability of on-time payment exceeds $900/($900 + $100) = 0.9). If the category being considered has a probability of 0.9 or greater, we will create a rule granting the loan; if the probability is less than 0.9, we will deny the loan. If Credit rating is green, the data indicates a 0.960 probability (215 of 224 cases) that the applicant will pay the loan back on time. If Credit rating is not green, the data indicates a 0.795 probability (140 of 176 cases) that the applicant will pay the loan back on time. This would be the first rule selected by the machine learning algorithm.


IF (Credit = Green)

THEN Predict On-time Payment

ELSE

Predict Late



This rule is subject to two types of errors. First, those applicants rated as green may actually not pay on time (from the data, the probability of this happening is 0.040). Second, those applicants rated as amber or red may actually have paid if given a loan (from the data, the probability of this happening is 0.795). The expected error is the probability of an applicant being rated as green (224/400, or 0.560) times the probability of being wrong (0.040), plus the probability of not being green (176/400, or 0.440) times the probability of being wrong (0.795), yielding an expected error of 0.022 + 0.350 = 0.372. We can also calculate the error cost function: $900 times 0.022 plus $100 times 0.350 or, using unrounded values, $55.14 per case. To test this rule, it is best to apply it to a second set of data, different from the dataset used to develop the rule. Using the last 250 observations as the test set, Table 8.18 shows the coincidence matrix.


Table 8.18 Coincidence matrix for first rule

Actual \ Model | Late | On-time | Total
Late | 13 | 7 | 20
On-time | 88 | 142 | 230
Total | 101 | 149 | 250



The correct classification rate was 0.620. The cost function for this result would be $900 times 7 plus $100 times 88, or $15,100. Since we had 250 cases, the average error cost would have been $60.40 per case.
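The error-cost arithmetic used throughout this section is simple enough to script. Below is a minimal sketch in R under the chapter's assumed costs of $900 for an accepted loan that goes bad and $100 for a denied loan that would have been repaid; the counts come from Tables 8.16 and 8.18, and the variable names are hypothetical.

cost_bad_accept <- 900      # granted, but not paid on time
cost_good_deny  <- 100      # denied, but would have paid on time

# Expected cost per case for the first rule, from the training proportions:
p_green  <- 224 / 400;  p_green_late   <- 1 - 215 / 224
p_other  <- 176 / 400;  p_other_ontime <- 140 / 176
expected_cost <- cost_bad_accept * p_green * p_green_late +
                 cost_good_deny  * p_other * p_other_ontime    # roughly $55 per case

# Realized cost per case on the 250 test observations (Table 8.18):
# 7 late cases were predicted on-time, 88 on-time cases were predicted late.
test_cost <- (cost_bad_accept * 7 + cost_good_deny * 88) / 250  # $60.40 per case
cat(expected_cost, test_cost, "\n")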

The set of rules can be examined further to see if greater accuracy can be obtained. The entropy formula for Age, given that Credit rating was not green, is 0.672; for Income it is 0.720, for Risk 0.657, and for the distinction between red and amber Credit ratings 0.718. This indicates that Risk has the greatest discriminating power at this point. With this data, if Risk is low, 79 of the 86 remaining cases paid on time (0.919); the other two risk states had 61 of 90 pay on time (0.678). Therefore, the second rule is:


IF (Credit Rating is NOT green)

AND (Risk = Low)

THEN Predict On-Time

ELSE

Predict Late



In this case, the data would indicate probabilities shown in Table 8.19.


Table 8.19 Probabilities by case

Credit rating | Risk | Count | Probability on-time
Green | (any) | 224/400 | 0.960
NOT green | Low | 86/400 | 0.919
NOT green | NOT low | 90/400 | 0.678



The expected error here is 224/400 times (1 − 0.960) + 86/400 times (1 − 0.919) + 90/400 times (1 − 0.678), or 0.112. The cost function is $900 times 224/400 times 0.040 ($20.16), plus $900 times 86/400 times 0.081 ($15.67), plus $100 times 90/400 times 0.678 ($15.26), or $51.09 per case. This is an improvement over the prior case, where the expected error was 0.372 and the test-set cost was $60.40 per case. The coincidence matrix is as shown in Table 8.20.


Table 8.20 Coincidence matrix for first two rules

Actual \ Model | Late | On-time | Total
Late | 11 | 9 | 20
On-time | 43 | 187 | 230
Total | 54 | 196 | 250



The correct classification rate was 0.792, much better than in Table 8.18. The cost function here would be $900 times 9 plus $100 times 43, or $12,400, or $49.60 per case.

We recalculate entropy, eliminating the cases with a green Credit rating or a low Risk rating. The values obtained were 0.902 for Age, 0.893 for Income, 0.897 for Risk, and 0.877 for Credit rating. The lowest of these values was for Credit rating. Given that Credit rating was not green and Risk was not low, there are 90 training cases left, shown in Table 8.21.


Table 8.21 Probabilities recalculated

Credit rating | On-time | Late | Probability on-time
Red | 15 | 13 | 0.536
Amber | 46 | 16 | 0.742



While below the specified cutoff limit of 0.9, those with Amber credit ratings clearly had a higher probability of on-time payment than did those with credit ratings of red. A third rule is:


IF (Credit Rating NOT green)

AND (Risk NOT low)

AND (Credit Rating amber)

THEN predict On-Time

ELSE

Predict Late



The expected accuracy of the three rules together is shown in Table 8.22.


Table 8.22 Expected accuracy of three rules

Credit rating | Risk | Credit rating | Count | Probability (on-time)
Green | (any) | (any) | 224/400 | 0.960
NOT green | Low | (any) | 86/400 | 0.919
NOT green | NOT low | Amber | 62/400 | 0.742
NOT green | NOT low | Red | 28/400 | 0.536



The expected error here is 0.117, with an expected cost function per case of $75.57 (due to the high error rate for the third rule). The expected classification error went down, although the expected cost function went up. Table 8.23 gives the coincidence matrix for this set of rules using the test data.


Table 8.23 Coincidence matrix for three rules

Actual \ Model | Late | On-time | Total
Late | 3 | 17 | 20
On-time | 11 | 219 | 230
Total | 14 | 236 | 250



Here, the correct classification rate increased to 0.888 (it was 0.792 after two rules), while the average cost per case increased to $65.60 from $49.60. When machine learning methods encounter no improvement (here, no improvement in the cost function), they generally stop. Therefore, the last rule would not be added to the first two. The machine learning algorithm would stop at:


IF (Credit Rating is green)

THEN Predict On-time

IF (Credit Rating is NOT green)

AND (Risk = Low)

THEN Predict On-time

ELSE

Predict Late



Note that this approach does not guarantee optimality. In that sense, it is heuristic, which means it tends to give good models, but there might be better ones.

The model was also run using See5, a very good decision tree package. The initial model, using the 400 categorical training observations, degenerated to classifying all cases as on-time. Due to the highly skewed dataset (45 training cases were late and 355 were on-time), this still had a fairly high correct classification rate of 0.920 on the test set. However, it was very bad at predicting applications that would end up late. Balancing the data can improve results. We pruned observations from the training set to control the proportion of late cases, as shown in Table 8.24. Using 325 observations again yielded a model forecasting every case as on-time. The model based on 225 observations (as well as the model based on 180 observations) classified all cases as on-time unless risk was high and credit was either red or amber. The model using 150 training cases classified all cases as on-time unless the applicant was young and credit was either red or amber. The models obtained were tested on the same test set of 250 observations.


Table 8.24 Results from the balancing dataset

Train set | Late | On-time | Proportion late | Predict 0 = 1 (late predicted on-time) | Predict 1 = 0 (on-time predicted late) | Correct rate | Cost
400 | 45 | 355 | 0.1125 | 20 | 0 | 0.920 | $18,000
325 | 45 | 280 | 0.1385 | 20 | 0 | 0.920 | $18,000
225 | 45 | 180 | 0.2000 | 9 | 39 | 0.808 | $12,000
180 | 45 | 135 | 0.2500 | 9 | 39 | 0.808 | $12,000
150 | 45 | 105 | 0.3000 | 9 | 32 | 0.956 | $11,300



These results show a clear trend toward more useful models as the training data is balanced.
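Balancing of this kind is easy to do directly in code. The sketch below shows two simple options in R: undersampling the majority class (as was done above by pruning on-time cases) and oversampling the minority class (as is done with WEKA later in this chapter). The train data frame here is a synthetic stand-in with the same class mix as the 400-case training set.

set.seed(1)
train  <- data.frame(Result = c(rep("Late", 45), rep("On-time", 355)))   # stand-in data
late   <- train[train$Result == "Late", , drop = FALSE]
ontime <- train[train$Result == "On-time", , drop = FALSE]

# (a) Undersample the majority class (keep 105 on-time cases, giving 150 in all):
train_under <- rbind(late, ontime[sample(nrow(ontime), 105), , drop = FALSE])

# (b) Oversample the minority class (replicate each late case, roughly balancing the classes):
late_big   <- late[rep(seq_len(nrow(late)), times = 9), , drop = FALSE]
train_over <- rbind(late_big, ontime)

table(train_under$Result)
table(train_over$Result)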

R Decision Tree Model

We now demonstrate decision trees on the loan application dataset with R (Rattle). Figure 8.3 shows the modeling screen from Rattle for decision trees (the Tree radio button is the default).

Figure 8.4 shows the rules obtained when Execute is selected in Figure 8.3.


Figure 8.3 Opening Rattle decision tree screen


Figure 8.4 Rattle decision tree model for loan training file

This tree has six rules (the asterisked lines in Figure 8.4). Selecting the Draw button on Figure 8.4 yields the graphic tree in Figure 8.5.


Figure 8.5 R decision tree

Table 8.25 gives the coincidence matrix for this decision tree.


Table 8.25 Coincidence matrix for Rattle decision tree model

Actual \ Model | Model OK | Model problem | Total
Actual OK | 219 | 11 | 230
Actual problem | 17 | 3 | 20
Total | 236 | 14 | 250



Correct classification here is 0.888 (222 of 250), quite good. For the actual "OK" cases, the correct classification rate is 219/230 = 0.952, while for the actual "Problem" cases, it is only 3/20, or 0.15. This model categorized case 4 as a problem, and all others as "OK."
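For readers who prefer scripting R directly, roughly the same model can be fit with the rpart package, which Rattle's tree option is built on. The sketch below is hypothetical: the column names and the synthetic data frame merely stand in for the actual loan training file, which would normally be loaded with read.csv.

library(rpart)

set.seed(42)
train <- data.frame(                     # synthetic stand-in for the 400-case training file
  Age     = sample(c("Young", "Middle", "Old"), 400, replace = TRUE),
  Income  = sample(c("Low", "Average", "High"), 400, replace = TRUE),
  Risk    = sample(c("Low", "Average", "High"), 400, replace = TRUE),
  Credit  = sample(c("Red", "Amber", "Green"), 400, replace = TRUE),
  On_time = sample(c("On-time", "Late"), 400, replace = TRUE, prob = c(0.89, 0.11)),
  stringsAsFactors = TRUE
)

fit <- rpart(On_time ~ Age + Income + Risk + Credit, data = train, method = "class")
print(fit)                                          # text listing of the tree's rules
pred <- predict(fit, newdata = train, type = "class")
table(Actual = train$On_time, Predicted = pred)     # coincidence matrix

With the real training and test files in place of the synthetic frame, the printed tree and the table correspond to the kind of output shown in Figure 8.4 and Table 8.25.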

KNIME

Figure 8.6 displays the workflow for decision trees using KNIME.


Figure 8.6 KNIME decision tree workflow

This is very similar to the modeling workflows for regression and neural networks. The File Reader nodes are identical. Here you will need a Decision Tree Learner node (node 2) and Decision Tree Predictor nodes (nodes 4 and 8). The Scorer is the same. To forecast new cases, the new cases are input through File Reader node 11, run through the Predictor in node 8, and sent to the Interactive Table attached as node 10. Figure 8.7 shows the resulting tree from the Decision Tree Learner.


Figure 8.7 KNIME decision tree

The output from KNIME’s Scorer node is given in Table 8.26.


Table 8.26 Coincidence matrix for KNIME decision tree model

Actual \ Model | Model OK | Model problem | Total
Actual OK | 221 | 9 | 230
Actual problem | 19 | 1 | 20
Total | 240 | 10 | 250



Here the correct classification rate is 222/250, or 0.888 (the same as the R model's). The decision tree from KNIME was very good at predicting the actual "OK" cases (221/230, or 0.961), but not very good at predicting the "Problem" cases (1/20 = 0.05). The prediction of new cases is obtained from the Interactive Table node (see Figure 8.8).


Figure 8.8 KNIME decision tree new case predictions

Note that the decision tree from Figure 8.7 classifies all 10 new cases as “OK.”

WEKA

WEKA is opened as usual, and under Preprocess, the file with all 650 cases (the training and test sets used in Rattle and KNIME) is linked. We selected 61.5 percent of the cases for training. J48 run with a minimum support of 2 yielded a degenerate model, as did a minimum support of 20. Thus, we balanced the dataset by replicating the smaller set of "Problem" cases 8 times, yielding a balanced dataset with 585 cases each of "OK" and "Problem." We ran the J48 decision tree model using 50 percent of this balanced data for testing. With a minimum support of 2, the model contained 49 rules, too many to be practical. This model did have an overall correct classification rate of 0.911, which is actually very good, but a 0 correct rate in predicting the negatives. Raising the minimum support to 20 yielded 21 rules and a correct classification rate of 0.802 (0.725 for positives, 0.879 for negatives). This is still too many rules, so the minimum support was raised to 50.

These eight rules (Figure 8.9) yielded the coincidence matrix in Table 8.27.


Figure 8.9 WEKA rules


Table 8.27 WEKA decision tree coincidence matrix

Actual \ Model | Model OK | Model problem | Total
Actual OK | 440 | 145 | 585
Actual problem | 146 | 439 | 585
Total | 586 | 584 | 1170



The correct classification rate for positive cases was 0.752, for negative cases 0.750, and 0.751 overall. A nice feature of decision tree models is that they are easy to apply to new cases. They consist of if-then rules that are very easy to interpret. By adjusting minimum support, you can balance the tradeoff between accuracy and smaller rule sets.
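A comparable accuracy-versus-size tradeoff can be explored in R's rpart by raising the minimum number of cases allowed in a leaf, which is roughly analogous to WEKA's minimum-support setting. A minimal sketch, reusing the hypothetical train frame from the rpart example earlier in this chapter:

library(rpart)
small_tree <- rpart(On_time ~ Age + Income + Risk + Credit, data = train,
                    method = "class",
                    control = rpart.control(minbucket = 50))
print(small_tree)   # each leaf must contain at least 50 cases, forcing a smaller, coarser rule set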

As to new cases, this model classified "Problem" for the case in row 1 (see Figure 8.8 for the input data) by rule 7, the case in row 3 by rule 1, the case in row 4 by rule 7, the case in row 8 by rule 7, and the case in row 9 by rule 3. Thus, this was a much more thorough (and probably more useful) model than the decision tree models obtained from R and KNIME in this case.

Summary

Decision trees are very effective and useful data mining models. They are automatic, an application of machine learning, which in itself makes them attractive. They are relatively robust, in that they are not adversely affected by noisy or missing data. They can handle very large datasets and can deal with both categorical and numeric data. They are also very strong in their ability to explain their conclusions (rules can be expressed in natural language, and communication to managers is excellent). As shown in the preceding sections, there are many software products available to support decision tree modeling.
