Decision trees provide a means to obtain product-specific forecasting models in the form of if-then rules that are easy for users to understand and implement. This data mining approach can support a number of grocery policy decisions, including inventory ordering and replenishment, as well as the evaluation of alternative promotion campaigns.
As was the case with regression models and neural networks, decision tree models support the data mining process of modeling. In the context of data mining, decision trees refer to a tree structure of rules (often referred to as association rules). The decision tree process involves collecting the variables that the analyst thinks might bear on the decision at issue and analyzing these variables for their ability to predict the outcome. Decision trees are useful for gaining further insight into customer behavior, as well as for finding ways to profitably act on the results. The algorithm automatically determines which variables are most important, based on their ability to sort the data into the correct output category. The method has a relative advantage over neural networks and genetic algorithms in that a reusable set of rules is provided, explaining the model's conclusions. Decision trees have been applied to business data mining in many contexts, including classifying loan applicants, screening potential consumers, and rating job applicants.
Decision trees provide a way to implement rule-based systems. The ID3 algorithm selects an attribute as the root, with a branch for each value of that attribute. All objects in the training set are classified into these branches. If all objects in a branch belong to the same output class, the node is labeled with that class, and the branch is terminated. If a branch contains multiple classes, another attribute is selected as a node, with branches for all of its possible values. An entropy heuristic is used to select the attribute with the highest information content. Other data mining tools often use other bases for selecting branches.
Decision Tree Operation
A bank may have a database of past applicants for short-term loans. This database (a simplification of Table 4.4) consists of the applicant's age, income, and risk rating; the outcome is whether the loan was paid on time. The bank's policy treats applicants differently by age group, income level, and risk. Age groups are 30 or younger (young), over 30 up to 60 (middle), and over 60 (old). Income levels are $30,000 per year or less (low), over $30,000 up to $80,000 (average), and over $80,000 (high). High risk is defined as applicants with greater debt than assets. If an applicant's assets minus the requested loan amount exceed debt, the applicant is classified as low risk. Applicants with asset-debt relationships between these values are classified as medium risk. A tree sorts the possible combinations of these variables. An exhaustive tree enumerates all combinations of variable values, as shown in Table 8.1.
Table 8.1 Enumeration of appliance loan variable combinations
Age = young
    Income = low:     Risk = low, Risk = medium, Risk = high
    Income = average: Risk = low, Risk = medium, Risk = high
    Income = high:    Risk = low, Risk = medium, Risk = high
Age = middle
    Income = low:     Risk = low, Risk = medium, Risk = high
    Income = average: Risk = low, Risk = medium, Risk = high
    Income = high:    Risk = low, Risk = medium, Risk = high
Age = old
    Income = low:     Risk = low, Risk = medium, Risk = high
    Income = average: Risk = low, Risk = medium, Risk = high
    Income = high:    Risk = low, Risk = medium, Risk = high
The tree for this set of rules is partially shown in Figure 8.1.
A rule-based system model would require that bank loan officers, whose judgment is respected, be interviewed to classify (and justify) the decision for each of these combinations of variables. Some trees may be simplified. For instance, if income is high, some loan officers might grant the loan automatically. This would reduce the tree, as age and risk would not matter in that case. Further, for those without high income, if risk were high, the loan might not be granted. The first rules would be as shown in Table 8.2.
Table 8.2 Initial rule
Age | Income   | Risk | Rule
any | High     | any  | Grant loan
any | Not high | High | Deny loan
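The initial rules in Table 8.2 can be sketched as a small function (an illustrative sketch; the function name and category labels are our own):

```python
def initial_rule(income: str, risk: str) -> str:
    """Initial loan rules from Table 8.2 (illustrative sketch)."""
    if income == "high":
        return "grant"         # high income: grant regardless of age or risk
    if risk == "high":
        return "deny"          # income not high, risk high: deny
    return "undetermined"      # remaining combinations need further rules

print(initial_rule("high", "high"))   # grant
print(initial_rule("low", "high"))    # deny
```

Combinations falling through to "undetermined" are exactly those the loan officers would still need to classify.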
This structure uses classification data, because there is a clear, finite number of outcomes (deny the loan, grant the loan, and possibly authorize part of the loan). This form of tree is referred to as a classification tree. Trees can also be used on data with continuous outcomes, in an estimating or predicting mode; in these cases, the terminology is a regression tree.
Rule Interestingness
Data, even categorical data, can involve potentially many rules. For instance, the loan application data just presented had three variables, each of which could take on three values, giving 3 × 3 × 3 = 27 combinations, each shown in Table 8.1. If there were 10 variables, each with 4 possible values, the number of combinations would be over a million (4^10 = 1,048,576). Clearly, brute-force generation of rule outcomes is unreasonable. Fortunately, decision tree methods identify the most useful rules in terms of predicting outcomes. Rule effectiveness is measured in terms of confidence and support. Confidence is the degree of accuracy of a rule. Support is the degree to which the antecedent conditions occur in the data; support for an association rule indicates the proportion of records covered by the rule's set of attributes. For example, if a rule's antecedent conditions appeared in only 10 of 10 million book purchases, its support would be 10/10,000,000, a very small support measure of 0.000001. These concepts are often used as threshold levels in machine learning systems: minimum confidence and support levels can be specified for retaining the rules identified by decision tree (or other association rule) methods.
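Support and confidence can be computed directly from transaction data. A minimal sketch (the helper names and sample baskets are hypothetical):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Accuracy of the rule antecedent -> consequent."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

baskets = [{"eggs", "bacon"}, {"eggs", "bacon", "milk"},
           {"eggs"}, {"milk"}, {"bacon", "milk"}]
print(support(baskets, {"eggs", "bacon"}))       # 0.4
print(confidence(baskets, {"eggs"}, {"bacon"}))  # 2/3: bacon appears in 2 of the 3 egg baskets
```

A rule generator would keep only those rules whose support and confidence clear the specified thresholds.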
For rules to be interesting, they must identify something useful (have a high confidence level and sufficiently high support) and novel. For instance, a grocer applying data mining who finds that eggs and bacon are purchased together at a confidence level of 0.9 and a support level of 0.2 might not be impressed; the grocer knew that prior to data mining. Interestingness is the idea that the data mining analysis found something unexpected (knowledge discovery). It is still useful to confirm the hypothesis that eggs and bacon sell together, but it is more useful to discover that, say, blackberry jam and eggs go together at a similar rate. (Please note that we do not know of such a real relationship.) Such information could lead to rearranged store displays, promotions, or both, which would make it interesting.
Machine Learning
Rule induction algorithms have been developed to automatically process data of this type. For this approach to work, a clear outcome is needed, and here one exists: two categories of payoff results. Rule induction works by searching through data for patterns and relationships, clustering records into specific categories. Machine learning starts with no assumptions, looking only at the input data and results. The judgment developed by human experts is not considered, which might sound inefficient, but human biases are thereby eliminated. Recursive partitioning algorithms split the data (the original data, not grouped as above) into finer and finer subsets, leading to a decision tree.
For instance, let us consider the 20 past loan application cases in Table 8.3, with known outcomes.
Table 8.3 Twenty past loan application cases
Age         | Income           | Assets  | Debts   | Want  | Risk    | Result
20 (young)  | 17,152 (low)     | 11,090  | 20,455  | 400   | High    | On-time
23 (young)  | 25,862 (low)     | 24,756  | 30,083  | 2,300 | High    | On-time
28 (young)  | 26,169 (low)     | 47,355  | 49,341  | 3,100 | High    | Late
23 (young)  | 21,117 (low)     | 21,242  | 30,278  | 300   | High    | Default
22 (young)  | 7,127 (low)      | 23,903  | 17,231  | 900   | Low     | On-time
26 (young)  | 42,083 (average) | 35,726  | 41,421  | 300   | High    | Late
24 (young)  | 55,557 (average) | 27,040  | 48,191  | 1,500 | High    | On-time
27 (young)  | 34,843 (average) | 0       | 21,031  | 2,100 | High    | On-time
29 (young)  | 74,295 (average) | 88,827  | 100,599 | 100   | High    | On-time
23 (young)  | 38,887 (average) | 6,260   | 33,635  | 9,400 | Low     | On-time
28 (young)  | 31,758 (average) | 58,492  | 49,268  | 1,000 | Low     | On-time
25 (young)  | 80,180 (high)    | 31,696  | 69,529  | 1,000 | High    | Late
33 (middle) | 40,921 (average) | 91,111  | 90,076  | 2,900 | Average | Late
36 (middle) | 63,124 (average) | 164,631 | 144,697 | 300   | Low     | On-time
39 (middle) | 59,006 (average) | 195,759 | 161,750 | 600   | Low     | On-time
39 (middle) | 125,713 (high)   | 382,180 | 315,396 | 5,200 | Low     | On-time
55 (middle) | 80,149 (high)    | 511,937 | 21,923  | 1,000 | Low     | On-time
62 (old)    | 101,291 (high)   | 783,164 | 23,052  | 1,800 | Low     | On-time
71 (old)    | 81,723 (high)    | 776,344 | 20,277  | 900   | Low     | On-time
63 (old)    | 99,522 (high)    | 783,491 | 24,643  | 200   | Low     | On-time
There are three variables, each with three possible levels. In practice, we would expect thousands of observations, making it unlikely that any combination would be empty. We use 20 observations to demonstrate the principles of the calculations. Here, many of the 27 combinations are empty. The combinations with representative observations are given in Table 8.4.
Table 8.4 Grouped data
Age    | Income  | Risk    | Total | On-time | Not on-time | Probability
Young  | Low     | High    | 4     | 2       | 2           | 0.50
Young  | Low     | Low     | 1     | 1       | 0           | 1.00
Young  | Average | High    | 4     | 3       | 1           | 0.75
Young  | Average | Low     | 2     | 2       | 0           | 1.00
Young  | High    | High    | 1     | 0       | 1           | 0.00
Middle | Average | Average | 1     | 0       | 1           | 0.00
Middle | Average | Low     | 2     | 2       | 0           | 1.00
Middle | High    | Low     | 2     | 2       | 0           | 1.00
Old    | High    | Low     | 3     | 3       | 0           | 1.00
Automatic machine learning begins with identifying those variables that offer the greatest likelihood of distinguishing between the possible outcomes. For each of the three variables, we can identify the outcome probabilities in Table 8.5.
Table 8.5 Combination outcomes
Variable | Value   | Cases | On-time | Late | Prob (on-time)
Age      | Young   | 12    | 8       | 4    | 0.67
Age      | Middle  | 5     | 4       | 1    | 0.80
Age      | Old     | 3     | 3       | 0    | 1.00
Income   | Low     | 5     | 3       | 2    | 0.60
Income   | Average | 9     | 7       | 2    | 0.78
Income   | High    | 6     | 5       | 1    | 0.83
Risk     | High    | 9     | 5       | 4    | 0.56
Risk     | Average | 1     | 0       | 1    | 0.00
Risk     | Low     | 10    | 10      | 0    | 1.00
Most data mining packages use an entropy measure to gauge the discriminating power of each variable, selecting the variable with the greatest discriminating power as the first on which to split the data. (Chi-square measures can also be used to select variables.) Three of the nine categories here have all observations in a single outcome category: age = old and risk = low both have all cases on-time, while risk = average has only one observation, which is late (suspicious logically and given the minimal possible sample size, but remember that we are demonstrating the procedure with a small dataset).
One formula for entropy, where p is the number of positive examples and n is the number of negative examples in the training set for each value of the attribute, is as follows:

Inform(p, n) = −[p/(p + n)] × log2[p/(p + n)] − [n/(p + n)] × log2[n/(p + n)]

The entropy of an attribute is then the sum, over its values, of Inform(p, n) for each value, weighted by the fraction of training cases taking that value.
This formula has a problem: if either p or n is 0 (which happens when a category's outcomes are unanimous), the log to base 2 is undefined, and the formula does not work. However, as either ratio approaches 0, the corresponding term of the Inform formula converges to 0. For the variable Age, there are three categories. Entropy for each Age category by this formula is shown in Table 8.6.
Table 8.6 Entropy calculation for Age
       | p/(p+n) | Log(base 2) | n/(p+n) | Log(base 2) | Sum of products | Probability | Product
Young  | 8/12    | −0.585      | 4/12    | −1.585      | 0.918           | 12/20       | 0.551
Middle | 4/5     | −0.322      | 1/5     | −2.322      | 0.722           | 5/20        | 0.180
Old    | 3/3     | 0*          | 0/3     | 0*          | 0*              | 3/20        | 0*
Sum    |         |             |         |             |                 |             | 0.731
*Formula strictly undefined, but converges to zero.
For category Young, the calculation is [−(8/12) × (−0.585) − (4/12) × (−1.585)] × (12/20) = 0.551.
The lower this entropy measure, the greater the information content (the greater the agreement probability). The entropy measures for the three variables are:
Age 0.731
Income 0.782
Risk 0.446
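These three values can be verified with a short script (a sketch using the counts from Table 8.5; the function names are our own):

```python
from math import log2

def inform(p, n):
    """Inform(p, n); taken as 0 when either count is 0 (the convergent value)."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def entropy(groups, total):
    """Weighted sum of Inform over an attribute's (on-time, late) counts."""
    return sum((p + n) / total * inform(p, n) for p, n in groups)

# Category counts (on-time, late) from Table 8.5
age    = [(8, 4), (4, 1), (3, 0)]    # young, middle, old
income = [(3, 2), (7, 2), (5, 1)]    # low, average, high
risk   = [(5, 4), (0, 1), (10, 0)]   # high, average, low

print(round(entropy(age, 20), 3))     # 0.731
print(round(entropy(income, 20), 3))  # 0.782
print(round(entropy(risk, 20), 3))    # 0.446
```

The same function, applied to progressively smaller subsets of the data, drives the recursive partitioning described below.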
By this measure, Risk has the greatest information content. If Risk is low, the data indicates a 1.0 probability (10 of 10 cases) that the applicant will pay the loan back on-time. If Risk is not low, this data indicates a 0.5 probability that the applicant will pay the loan back on time. This would be the first rule selected by the machine learning algorithm.
IF (Risk = Low)
    THEN Predict On-time payment
ELSE
    Predict Late
This rule is subject to two types of errors. First, those applicants rated as low risk may actually not pay on time. (From the data, the probability of this happening is 0.0.) Second, those applicants rated as high or average risk may actually have paid if given a loan. (From the data, the probability of this happening is 0.5.) The expectation of this is the probability of an applicant being rated as high or average in risk (10/20, or 0.5) times the probability of being wrong (0.5), yielding an expected error of 0.25. To test this rule, it is best to apply it to a second set of data, different from the dataset used to develop the rule.
The set of rules can be examined further to see if greater accuracy can be obtained. The entropy formula for Age, given that risk was not low, is 0.991, while the same calculation for income is 1.971. This indicates that Age has greater discriminating power. With this data, if Age is middle, the one case did not pay on time. There were no old cases in this group. Therefore, the second rule is:
IF (Risk is NOT low)
    AND (Age = Middle)
    THEN Predict Late
ELSE
    Predict On-time
In this case, the data would indicate probabilities shown in Table 8.7.
Table 8.7 Probabilities by case
Risk    | Age    | Count | Probability on-time
Low     | any    | 10/20 | 1.0
NOT low | Middle | 1/20  | 0.0
NOT low | Young  | 9/20  | 0.56
All 10 cases with a Low risk rating paid on time. Of the other 10 cases, the 1 that was middle-aged did not pay on time (as stated earlier), while 5 of the 9 young cases paid on time. The expected error here is the 4/9 error probability applied to those 9 of the 20 total cases, or (9/20) × (4/9) = 0.2. This is an improvement over the prior case, where the expected error was 0.25.
For the last variable, Income, given that Risk was not low, and Age was not Middle, there are nine cases left, shown in Table 8.8.
Table 8.8 Probabilities recalculated
Income  | On-time | Late | Probability on-time
Low     | 2       | 2    | 0.50
Average | 3       | 1    | 0.75
High    | 0       | 1    | 0.00
Four of these nine cases were Low income, with two paying on time and two not. Four were Average income, three of which paid on time and one not. The last case was High income, which was not paid on time.
A third rule, taking advantage of the case with unanimous outcome, is:
IF (Risk NOT low)
    AND (Age NOT middle)
    AND (Income high)
    THEN Predict Late
ELSE
    Predict On-time
The expected accuracy of the three rules together is shown in Table 8.9.
Table 8.9 Expected accuracy of three rules
Risk    | Age    | Income   | Count | Probability (on-time)
Low     | any    | any      | 10/20 | 1.0
NOT low | Middle | any      | 1/20  | 0.0
NOT low | Young  | High     | 1/20  | 0.0
NOT low | Young  | NOT high | 8/20  | 0.625
The expected error here is 8/20 times 0.375 = 0.15.
An additional rule could be generated. For the case of Risk NOT low, Age = Young, and Income NOT high, there are 4 cases with low income (probability of on-time = 0.5) and 4 cases with average income (probability of on-time = 0.75). The greater discrimination is provided by average income, resulting in the rule:
IF (Risk NOT low)
    AND (Age NOT middle)
    AND (Income average)
    THEN Predict On-time
ELSE
    Predict either
There is no added accuracy obtained with this rule, shown in Table 8.10.
Table 8.10 Accuracy of fourth rule
Risk    | Age    | Income  | Count | Probability (on-time)
Low     | any    | any     | 10/20 | 1.0
NOT low | Middle | any     | 1/20  | 0.0
NOT low | Young  | High    | 1/20  | 0.0
NOT low | Young  | Average | 4/20  | 0.75
NOT low | Young  | Low     | 4/20  | 0.50
The expected error is 4/20 × 0.25 + 4/20 × 0.5 = 0.15, the same as without the rule. When machine learning methods encounter no improvement, they generally stop. (Here, we have exhausted all combinations anyway.) Therefore, the last rule would not be added to the first three. The machine learning algorithm would stop at:
IF (Risk low)
    THEN Predict On-time
ELSE
    IF (Risk NOT low)
        AND (Age middle)
        THEN Predict Late
    ELSE
        IF (Risk NOT low)
            AND (Age NOT middle)
            AND (Income high)
            THEN Predict Late
        ELSE
            Predict On-time
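The final rule set can be expressed as a function and checked against the grouped counts in Table 8.4 (an illustrative sketch; labels are lowercase here):

```python
def predict(risk, age, income):
    """The three-rule model derived in the text."""
    if risk == "low":
        return "on-time"
    if age == "middle":
        return "late"
    if income == "high":
        return "late"
    return "on-time"

# Grouped training counts from Table 8.4: (age, income, risk, on-time, late)
groups = [
    ("young", "low", "high", 2, 2), ("young", "low", "low", 1, 0),
    ("young", "average", "high", 3, 1), ("young", "average", "low", 2, 0),
    ("young", "high", "high", 0, 1), ("middle", "average", "average", 0, 1),
    ("middle", "average", "low", 2, 0), ("middle", "high", "low", 2, 0),
    ("old", "high", "low", 3, 0),
]
correct = sum(on if predict(risk, age, income) == "on-time" else late
              for age, income, risk, on, late in groups)
print(correct / 20)  # 0.85, matching the 0.15 expected training error
```

Seventeen of the 20 training cases are classified correctly, reproducing the 0.15 expected error computed above.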
This model can be tested against a separate set of 10 applications. Table 8.11 shows the results.
Table 8.11 Rules applied to test cases
Record | Actual  | Rule prediction | By rule | Result
1      | On-time | On-time         | 1       | Match
2      | On-time | On-time         | 1       | Match
3      | On-time | On-time         | 1       | Match
4      | On-time | On-time         | Else    | Match
5      | On-time | On-time         | 1       | Match
6      | On-time | On-time         | Else    | Match
7      | Late    | On-time         | Else    | Miss
8      | On-time | On-time         | 1       | Match
9      | On-time | Late            | 3       | Miss
10     | On-time | On-time         | 1       | Match
Figure 8.2 shows the decision tree for this set of rules.
In this case, the coincidence matrix is shown in Table 8.12.
Table 8.12 Coincidence matrix for rule set
Actual  | Late | On-time | Total
Late    | 0    | 1       | 1
On-time | 1    | 8       | 9
Total   | 1    | 9       | 10
In this case, the accuracy of the model is 8/10, or 80 percent. There were errors of both types (denying a good loan application, accepting one bad application).
Software Demonstrations
We use the loan application dataset to demonstrate decision tree operation, both manually and with software. The full loan application dataset has 650 observations over four variables. We use the first 400 observations for training, reserving the last 250 observations for testing. The data, covered in Chapter 4, are grouped in Table 8.13, beginning with the young applicants.
Table 8.13 Grouped data—young
Age   | Income  | Risk    | Credit rating | On-time | Not on-time | Probability
Young | Low     | High    | Red           | 3       | 1           | 0.750
Young | Low     | High    | Amber         | 7       | 1           | 0.875
Young | Low     | High    | Green         | 17      | 1           | 0.944
Young | Low     | Average | Red           | 1       | 1           | 0.500
Young | Low     | Average | Green         | 1       | 0           | 1.000
Young | Low     | Low     | Green         | 7       | 0           | 1.000
Young | Average | High    | Red           | 9       | 9           | 0.500
Young | Average | High    | Amber         | 24      | 10          | 0.706
Young | Average | High    | Green         | 42      | 6           | 0.875
Young | Average | Average | Amber         | 1       | 0           | 1.000
Young | Average | Average | Green         | 4       | 0           | 1.000
Young | Average | Low     | Red           | 2       | 0           | 1.000
Young | High    | High    | Red           | 1       | 1           | 0.500
Young | High    | High    | Amber         | 5       | 3           | 0.625
Young | High    | High    | Green         | 12      | 0           | 1.000
Table 8.14 gives the categorical counts for middle-aged loan applicants.
Table 8.14 Grouped data—middle aged
Age    | Income  | Risk    | Credit rating | On-time | Not on-time | Probability
Middle | Low     | Average | Green         | 3       | 0           | 1.000
Middle | Low     | Low     | Red           | 0       | 1           | 0.000
Middle | Low     | Low     | Amber         | 1       | 0           | 1.000
Middle | Low     | Low     | Green         | 2       | 0           | 1.000
Middle | Average | High    | Red           | 1       | 1           | 0.500
Middle | Average | High    | Amber         | 4       | 2           | 0.667
Middle | Average | High    | Green         | 17      | 1           | 0.944
Middle | Average | Average | Amber         | 4       | 0           | 1.000
Middle | Average | Average | Green         | 4       | 0           | 1.000
Middle | Average | Low     | Red           | 13      | 1           | 0.929
Middle | Average | Low     | Amber         | 37      | 4           | 0.902
Middle | Average | Low     | Green         | 64      | 1           | 0.985
Middle | High    | High    | Amber         | 1       | 0           | 1.000
Middle | High    | High    | Green         | 7       | 0           | 1.000
Middle | High    | Low     | Red           | 8       | 0           | 1.000
Middle | High    | Low     | Amber         | 14      | 0           | 1.000
Middle | High    | Low     | Green         | 30      | 0           | 1.000
Finally, we have the Old applicants in Table 8.15.
Table 8.15 Grouped data—age equals old
Age | Income  | Risk | Credit rating | On-time | Not on-time | Probability
Old | Average | Low  | Red           | 0       | 1           | 0.000
Old | Average | Low  | Amber         | 1       | 0           | 1.000
Old | Average | Low  | Green         | 2       | 0           | 1.000
Old | High    | Low  | Red           | 2       | 0           | 1.000
Old | High    | Low  | Amber         | 1       | 0           | 1.000
Old | High    | Low  | Green         | 3       | 0           | 1.000
Most of the possible old-applicant categories were vacant, and all of those present had unanimous outcomes. Automatic machine learning begins by identifying the variables that offer the greatest likelihood of distinguishing between the possible outcomes. For each of the four variables, the outcome probabilities are given in Table 8.16.
Table 8.16 Combination outcomes
Variable      | Value   | Cases | On-time | Late | Prob (on-time)
Age           | Young   | 169   | 136     | 33   | 0.805
Age           | Middle  | 221   | 210     | 11   | 0.950
Age           | Old     | 10    | 9       | 1    | 0.900
Income        | Low     | 47    | 42      | 5    | 0.894
Income        | Average | 265   | 229     | 36   | 0.864
Income        | High    | 88    | 84      | 4    | 0.955
Risk          | High    | 186   | 150     | 36   | 0.806
Risk          | Average | 19    | 18      | 1    | 0.947
Risk          | Low     | 195   | 187     | 8    | 0.959
Credit rating | Red     | 56    | 40      | 16   | 0.714
Credit rating | Amber   | 120   | 100     | 20   | 0.833
Credit rating | Green   | 224   | 215     | 9    | 0.960
For the variable Age, there are three outcomes. Entropy for Age by this formula would be as shown in Table 8.17.
Table 8.17 Entropy calculation for Age
       | p/(p+n) | Log(base 2) | n/(p+n) | Log(base 2) | Sum   | Probability | Product
Young  | 136/169 | −0.313      | 33/169  | −2.356      | 0.712 | 169/400     | 0.301
Middle | 210/221 | −0.074      | 11/221  | −4.328      | 0.285 | 221/400     | 0.158
Old    | 9/10    | −0.152      | 1/10    | −3.322      | 0.469 | 10/400      | 0.012
Sum    |         |             |         |             |       |             | 0.470
Under the heading Sum is the following calculation. For Young, the first term of the Inform equation multiplies the ratio 136/169 by −0.313, and by −1. The second term multiplies the ratio 33/169 by −2.356, and by −1. The sum of these terms is 0.712, representing 169 of the 400 training cases. The heading Product is the product of the sum and the probability: for Young, 0.712 times 169/400, yielding 0.301. The sum of the three products yields the entropy measure. The lower this entropy measure, the greater the information content (the greater the agreement probability). The entropy measures for the four variables are:

Age 0.470
Income 0.496
Risk 0.464
Credit rating 0.452
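The four entropy values can be computed directly from the counts in Table 8.16; a short script (the function names are our own):

```python
from math import log2

def inform(p, n):
    """Inform(p, n); taken as 0 when either count is 0."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def entropy(groups, total=400):
    """Weighted entropy over an attribute's (on-time, late) category counts."""
    return sum((p + n) / total * inform(p, n) for p, n in groups)

# (on-time, late) counts per category, from Table 8.16
age    = [(136, 33), (210, 11), (9, 1)]    # young, middle, old
income = [(42, 5), (229, 36), (84, 4)]     # low, average, high
risk   = [(150, 36), (18, 1), (187, 8)]    # high, average, low
credit = [(40, 16), (100, 20), (215, 9)]   # red, amber, green

for name, groups in [("Age", age), ("Income", income),
                     ("Risk", risk), ("Credit rating", credit)]:
    print(f"{name}: {entropy(groups):.3f}")
```

The weighted sums come to roughly 0.470 for Age, 0.496 for Income, 0.464 for Risk, and 0.452 for Credit rating.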
By this measure, Credit rating has the greatest information content. Among the Credit rating categories, we want to select the one with the greatest ability to accurately categorize data (the probability closest to 1.0 for granting the loan, or 0.0 for denying it). We can also consider the relative cost of an error: granting a loan that is not paid on time may cost nine times as much ($900) as denying a loan that would have been paid on time ($100). We can, therefore, use a cutoff of 0.90 probability. If the category being considered has a probability of on-time payment of at least 0.90, we create a rule granting the loan; if the probability is less than 0.90, we deny the loan. If Credit rating is green, the data indicates a 0.960 probability (215 of 224 cases) that the applicant will pay the loan back on time. If Credit rating is not green, the data indicates a 0.795 probability (140 of 176 cases) of on-time payment. This would be the first rule selected by the machine learning algorithm.
IF (Credit = Green)
    THEN Predict On-time Payment
ELSE
    Predict Late
This rule is subject to two types of errors. First, applicants rated green may actually not pay on time (from the data, the probability of this is 0.040). Second, applicants rated amber or red may actually have paid if given a loan (from the data, the probability of this is 0.795). The expected error is the probability of an applicant being rated green (224/400, or 0.560) times the probability of being wrong (0.040), plus the probability of not being green (176/400, or 0.440) times the probability of being wrong (0.795), yielding 0.022 + 0.350 = 0.372. We can also calculate the error cost function: $900 times 0.022 plus $100 times 0.350, or $55.14 per case. To test this rule, it is best to apply it to a second set of data, different from the dataset used to develop the rule. Using the last 250 observations as the test set, Table 8.18 shows the coincidence matrix.
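The expected-error and cost arithmetic for this first rule can be checked with a short script (counts from Table 8.16; exact arithmetic gives 0.3725 and $55.25, the slightly different figures in the text reflecting rounding of the intermediate probabilities):

```python
# Expected error and cost for the first rule
# (IF Credit = Green THEN On-time ELSE Late), training counts from Table 8.16.
green_total, green_on = 224, 215   # rated green, predicted on-time
other_total, other_on = 176, 140   # red + amber, predicted late

p_wrong_green = 1 - green_on / green_total   # granted, but pays late
p_wrong_other = other_on / other_total       # denied, but would have paid

expected_error = (green_total / 400) * p_wrong_green + \
                 (other_total / 400) * p_wrong_other
cost_per_case = 900 * (green_total / 400) * p_wrong_green + \
                100 * (other_total / 400) * p_wrong_other

print(round(expected_error, 4))   # 0.3725
print(round(cost_per_case, 2))    # 55.25
```

The $900 and $100 weights are the asymmetric error costs specified above.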
Table 8.18 Coincidence matrix for first rule
Actual  | Late | On-time | Total
Late    | 13   | 7       | 20
On-time | 88   | 142     | 230
Total   | 101  | 149     | 250
The correct classification rate was 0.620. The cost function for this result would be $900 times 7 plus $100 times 88, or $15,100. Since we had 250 cases, the average error cost would have been $60.40 per case.
The set of rules can be examined further to see if greater accuracy can be obtained. The entropy formula for Age, given that Credit rating was not green, is 0.672, for Income 0.720, for Risk 0.657, and for the difference between red and amber Credit rating is 0.718. This indicates that Risk has the greater discriminating power at this point. With this data, if Risk is low, 79 of the 86 cases paid on time (0.919). The other two states had 61 of 90 pay on time (0.678). Therefore, the second rule is:
IF (Credit Rating is NOT green)
    AND (Risk = Low)
    THEN Predict On-time
ELSE
    Predict Late
In this case, the data would indicate probabilities shown in Table 8.19.
Table 8.19 Probabilities by case
Credit rating | Risk    | Count  | Probability on-time
Green         | any     | 224/400 | 0.960
NOT green     | Low     | 86/400  | 0.919
NOT green     | NOT low | 90/400  | 0.678
The expected error here is 224/400 times (1 − 0.960), plus 86/400 times (1 − 0.919), plus 90/400 times 0.678 (the on-time cases in the group predicted late), or 0.192. The cost function is $900 times 224/400 times 0.040 ($20.16), plus $900 times 86/400 times 0.081 ($15.67), plus $100 times 90/400 times 0.678 ($15.26), or $51.09 per case. This is an improvement over the prior rule, where the expected error was 0.372 with an expected cost of $55.14 per case. The coincidence matrix is shown in Table 8.20.
Table 8.20 Coincidence matrix for first two rules
Actual  | Late | On-time | Total
Late    | 11   | 9       | 20
On-time | 43   | 187     | 230
Total   | 54   | 196     | 250
The correct classification rate was 0.792, much better than in Table 8.18. The cost function here would be $900 times 9 plus $100 times 43, or $12,400, or $49.60 per case.
We recalculate entropy, eliminating the cases with Credit rating green and Risk rating low. The values obtained were 0.902 for Age, 0.893 for Income, 0.897 for Risk, and 0.877 for Credit rating. The lowest of these values was for Credit. Given that Credit rating was not green and Risk was not low, there are 90 training cases left, shown in Table 8.21.
Table 8.21 Probabilities recalculated
Credit rating | On-time | Late | Probability
Red           | 15      | 13   | 0.536
Amber         | 46      | 16   | 0.742
While below the specified cutoff limit of 0.9, those with Amber credit ratings clearly had a higher probability of on-time payment than did those with credit ratings of red. A third rule is:
IF (Credit Rating NOT green)
    AND (Risk NOT low)
    AND (Credit Rating amber)
    THEN Predict On-time
ELSE
    Predict Late
The expected accuracy of the three rules together is shown in Table 8.22.
Table 8.22 Expected accuracy of three rules
Credit rating | Risk    | Credit rating | Count   | Probability
Green         | any     | any           | 224/400 | 0.960
NOT green     | Low     | any           | 86/400  | 0.919
NOT green     | NOT low | Amber         | 62/400  | 0.742
NOT green     | NOT low | Red           | 28/400  | 0.536
The expected error here is 0.117, with an expected cost function per case of $75.57 (due to the high error rate for the third rule). The expected classification error went down, although the expected cost function went up. Table 8.23 gives the coincidence matrix for this set of rules using the test data.
Table 8.23 Coincidence matrix for three rules
Actual  | Late | On-time | Total
Late    | 3    | 17      | 20
On-time | 11   | 219     | 230
Total   | 14   | 236     | 250
Here, the correct classification rate increased to 0.888 (it was 0.792 after two rules), while the average cost per case increased to $65.60 from $49.60. When machine learning methods encounter no improvement, they generally stop. Therefore, the last rule would not be added to the first two. The machine learning algorithm would stop at:
IF (Credit Rating is green)
    THEN Predict On-time

IF (Credit Rating is NOT green)
    AND (Risk = Low)
    THEN Predict On-time
ELSE
    Predict Late
Note that this approach does not guarantee optimality. In that sense, it is heuristic, which means it tends to give good models, but there might be better ones.
The model was also run using See5, a very good decision tree software. The initial model, using the 400 categorical training observations, degenerated to classifying all cases as on-time. Because the dataset is highly skewed (45 late cases against 355 on-time), this still had a fairly high correct classification rate of 0.920, but it was very bad at predicting applications that would end up late. Balancing the data can improve results. We pruned on-time observations from the training set to control the proportion of late cases, as shown in Table 8.24. Using 325 observations again yielded all-on-time forecasts. The models based on 225 and on 180 observations classified all cases as on-time unless risk was high and credit was either red or amber. The model using 150 training cases classified all cases as on-time unless the applicant was young and credit was either red or amber. The models obtained were tested on the same test set of 250 observations.
Table 8.24 Results from the balancing dataset
Train set | Late | On-time | Proportion late | Late as on-time | On-time as late | Correct rate | Cost
400       | 45   | 355     | 0.1125          | 20              | 0               | 0.920        | $18,000
325       | 45   | 280     | 0.1385          | 20              | 0               | 0.920        | $18,000
225       | 45   | 180     | 0.2000          | 9               | 39              | 0.808        | $12,000
180       | 45   | 135     | 0.2500          | 9               | 39              | 0.808        | $12,000
150       | 45   | 105     | 0.3000          | 9               | 32              | 0.836        | $11,300
These results show a clear trend toward more useful models as the data is balanced.
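The balancing used above, pruning majority-class cases, can be sketched as an undersampling step (the record structure and counts here are hypothetical):

```python
import random

def undersample(records, label_key, majority, keep):
    """Drop majority-class records until only `keep` of them remain (sketch)."""
    major = [r for r in records if r[label_key] == majority]
    minor = [r for r in records if r[label_key] != majority]
    random.seed(0)  # reproducible illustration
    return minor + random.sample(major, keep)

# Hypothetical records: 8 on-time, 2 late
data = [{"result": "on-time"}] * 8 + [{"result": "late"}] * 2
balanced = undersample(data, "result", "on-time", 4)
print(len(balanced))                                 # 6
print(sum(r["result"] == "late" for r in balanced))  # 2
```

The alternative, replicating the rare class, is used in the WEKA demonstration below.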
R Decision Tree Model
We now demonstrate decision trees on the loan application dataset with R (Rattle). Figure 8.3 shows the modeling screen from Rattle for decision trees (the Tree radio button is the default).
Figure 8.4 shows the rules obtained from Execute being selected in Figure 8.3.
This tree has six rules (the asterisked lines in Figure 8.4). Selecting the Draw button on Figure 8.4 yields the graphic tree in Figure 8.5.
Table 8.25 gives the coincidence matrix for this decision tree.
Table 8.25 Coincidence matrix for Rattle decision tree model
               | Model OK | Model problem | Total
Actual OK      | 219      | 11            | 230
Actual problem | 17       | 3             | 20
Total          | 236      | 14            | 250
Correct classification here is 222/250, or 0.888, quite good. For the actual "OK" cases, the correct classification rate is 219/230 = 0.952, while for the actual "Problem" cases, it is only 3/20, or 0.15. Applied to the new cases, this model categorized case 4 as a problem loan and all others as "OK."
KNIME
Figure 8.6 displays the workflow for decision trees using KNIME.
This is very similar to the modeling workflows for regression and neural networks. File reader nodes are identical. Here you will need a Decision Tree Learner node (node 2) and Decision Tree Predictor nodes (nodes 4 and 8). The scorer is the same. To forecast new cases, the new cases are input on the File Reader node 11, run through the Predictor in node 8, and an Interactive Table attached in node 10. Figure 8.7 shows the resulting tree from the Decision Tree Learner.
The output from KNIME’s Scorer node is given in Table 8.26.
Table 8.26 Coincidence matrix for KNIME decision tree model
               | Model OK | Model problem | Total
Actual OK      | 221      | 9             | 230
Actual problem | 19       | 1             | 20
Total          | 240      | 10            | 250
Here the correct classification rate is 222/250, or 0.888, the same as the R model's overall rate. The decision tree from KNIME was very good at predicting the actual "OK" cases (221/230, or 0.961), but not very good at predicting the "Problem" cases (1/20 = 0.05). The prediction of new cases is obtained from the Interactive Table node (see Figure 8.8).
Note that the decision tree from Figure 8.7 classifies all 10 new cases as “OK.”
WEKA
WEKA is opened as usual, and under Preprocess, the file with all 650 cases (the combined training and test sets used in Rattle and KNIME) is linked. We selected 61.5 percent of the cases for training. J48 run with a minimum support of 2 yielded a degenerate model, as did a minimum support of 20. We therefore balanced the dataset by replicating the scarcer "Problem" cases eight additional times, yielding a balanced dataset with 585 cases each of "OK" and "Problem." We ran the J48 decision tree model using 50 percent of this balanced data for testing. With a minimum support of 2, the model contained 49 rules, too many. This model did have an overall correct classification rate of 0.911, which is very good, but a correct rate of 0 in predicting the negatives. Raising the minimum support to 20 yielded 21 rules and a correct classification rate of 0.802 (0.725 for positives, 0.879 for negatives). This was still too many rules, so the minimum support was raised to 50.
These eight rules (Figure 8.9) yielded the coincidence matrix in Table 8.27.
Table 8.27 WEKA decision tree coincidence matrix
               | Model OK | Model problem | Total
Actual OK      | 440      | 145           | 585
Actual problem | 146      | 439           | 585
Total          | 586      | 584           | 1170
The correct classification rate for positive cases was 0.752, for negative cases 0.750, and 0.751 overall. A nice feature of decision tree models is that they are easy to apply to new cases. They consist of If/Then rules that are very easy to interpret. By adjusting minimum support, you can balance the tradeoff between accuracy and smaller rule sets.
As to new cases, this model classified "Problem" for the case in row 1 (see Figure 8.8 for the input data) by rule 7, the case in row 3 by rule 1, the case in row 4 by rule 7, the case in row 8 by rule 7, and the case in row 9 by rule 3. This was thus a much more thorough (and probably more useful) model than the decision tree models obtained from R and KNIME in this case.
Summary
Decision trees are very effective and useful data mining models. They are automatic, an application of machine learning, which in itself makes them attractive. They are relatively robust, in that they are not adversely affected by noisy or missing data. They can handle very large datasets and can deal with data in categorical or numeric form. They are also very strong in their ability to explain their conclusions (rules can be expressed in natural language, and communication to managers is excellent). As shown in the preceding sections, many software products are available to support decision tree modeling.
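For readers working in Python, the same kind of tree can be fit with scikit-learn (an additional option, not used in the demonstrations above; the encoding and parameter choices here are our own). Fit to the 20 grouped cases of Table 8.4, a fully grown entropy-based tree reproduces the 0.85 training accuracy found manually:

```python
# A scikit-learn sketch of the decision tree demonstrations.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

rows, labels = [], []
# Expand the grouped counts of Table 8.4 into individual cases
for age, income, risk, on, late in [
    ("young", "low", "high", 2, 2), ("young", "low", "low", 1, 0),
    ("young", "average", "high", 3, 1), ("young", "average", "low", 2, 0),
    ("young", "high", "high", 0, 1), ("middle", "average", "average", 0, 1),
    ("middle", "average", "low", 2, 0), ("middle", "high", "low", 2, 0),
    ("old", "high", "low", 3, 0),
]:
    rows += [[age, income, risk]] * (on + late)
    labels += ["on-time"] * on + ["late"] * late

X = OrdinalEncoder().fit_transform(rows)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, labels)
print(tree.score(X, labels))  # 0.85: the two mixed-outcome groups cap accuracy
```

Two groups (young/low/high and young/average/high) contain identical attribute values with mixed outcomes, so no tree can exceed 17/20 on this training data.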