Decision trees provide a means to obtain product-specific forecasting models in the form of if-then rules that are easy for users to understand and implement. This data mining approach can support a number of grocery policy decisions, including inventory ordering and replenishment, as well as the evaluation of alternative promotion campaigns.
As was the case with regression models and neural networks, decision tree models support the data mining process of modeling. In the context of data mining, decision trees refer to a tree structure of rules (often referred to as association rules). The decision tree process involves collecting the variables that the analyst thinks might bear on the decision at issue and analyzing these variables for their ability to predict the outcome. Decision trees are useful for gaining further insight into customer behavior, as well as for finding ways to profitably act on the results. The algorithm automatically determines which variables are most important, based on their ability to sort the data into the correct output category. The method has a relative advantage over neural networks and genetic algorithms in that a reusable set of rules is provided, explaining the model's conclusions. Decision trees have been applied to business data mining in many contexts, including classifying loan applicants, screening potential consumers, and rating job applicants.
Decision trees provide a way to implement rule-based systems. The ID3 algorithm selects an attribute as the root, with a branch for each value of that attribute. All objects in the training set are classified into these branches. If all objects in a branch belong to the same output class, the node is labeled with that class, and the branch is terminated. If a branch contains multiple classes, another attribute is selected as a node, with branches for all of its possible values. An entropy heuristic is used to select the attribute with the highest information content. Other data mining tools often use other bases for selecting branches.
Decision Tree Operation
A bank may have a database of past applicants for short-term loans. This database (a simplification of Table 4.4) consists of the applicant's age, income, and risk rating; the outcome is whether the loan was paid on time. The bank's policy treats applicants differently by age group, income level, and risk. Age groups are 30 or younger (young), over 30 up to 60 (middle), and over 60 (old). Income levels are $30,000 per year or less (low), over $30,000 up to $80,000 (average), and over $80,000 (high). High risk is defined as applicants with greater debt than assets. If an applicant's assets minus the requested loan amount exceed debt, the applicant is classified as low risk. Applicants with asset-debt relationships between these values are classified as medium risk. A tree sorts the possible combinations of these variables. An exhaustive tree enumerates all combinations of variable values, as shown in Table 8.1.
Table 8.1 Enumeration of appliance loan variable combinations
Age = young
    Income = low:     Risk = low, Risk = medium, Risk = high
    Income = average: Risk = low, Risk = medium, Risk = high
    Income = high:    Risk = low, Risk = medium, Risk = high
Age = middle
    Income = low:     Risk = low, Risk = medium, Risk = high
    Income = average: Risk = low, Risk = medium, Risk = high
    Income = high:    Risk = low, Risk = medium, Risk = high
Age = old
    Income = low:     Risk = low, Risk = medium, Risk = high
    Income = average: Risk = low, Risk = medium, Risk = high
    Income = high:    Risk = low, Risk = medium, Risk = high
The tree for this set of rules is partially shown in Figure 8.1.
A rule-based system model would require that bank loan officers, whose judgment is respected, be interviewed to classify (and justify) the decision for each of these combinations of variables. Some trees may be simplified. For instance, if income is high, some loan officers might grant the loan automatically. This would reduce the tree, as age and risk would not matter in that case. Further, for those without high income, if risk were high, the loan might not be granted. The first rules would be as shown in Table 8.2.
Table 8.2 Initial rule
Age | Income   | Risk | Rule
any | High     | any  | Grant loan
any | Not high | High | Deny loan
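The initial rules in Table 8.2 can be sketched as a small function (an illustrative sketch; the function name and category labels are our own):

```python
def initial_rule(income: str, risk: str) -> str:
    """Initial loan rules from Table 8.2 (illustrative sketch)."""
    if income == "high":
        return "grant"         # high income: grant regardless of age or risk
    if risk == "high":
        return "deny"          # income not high, risk high: deny
    return "undetermined"      # remaining combinations need further rules

print(initial_rule("high", "high"))   # grant
print(initial_rule("low", "high"))    # deny
```

Combinations falling through to "undetermined" are exactly those the loan officers would still need to classify.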
This structure uses classification data, because there is a clear, finite number of outcomes (deny the loan, grant the loan, and possibly authorize part of the loan). This form of tree is referred to as a classification tree. Trees can also be used on data with continuous outcomes, in an estimating or predicting mode; in these cases, the terminology is a regression tree.
Rule Interestingness
Data, even categorical data, can involve potentially many rules. For instance, the loan application data just presented had three variables, each of which could take on three values, giving 3 × 3 × 3 = 27 combinations, each shown in Table 8.1. If there were 10 variables, each with 4 possible values, the number of combinations would be over a million (4^10 = 1,048,576). Clearly, brute-force generation of rule outcomes is unreasonable. Fortunately, decision tree methods identify the most useful rules in terms of predicting outcomes. Rule effectiveness is measured in terms of confidence and support. Confidence is the degree of accuracy of a rule. Support is the degree to which the antecedent conditions occur in the data; support for an association rule indicates the proportion of records covered by the rule's set of attributes. For example, if a rule's antecedent conditions appeared in only 10 of 10 million book purchases, its support would be 10/10,000,000, a very small support measure of 0.000001. These concepts are often used as threshold levels in machine learning systems: minimum confidence and support levels can be specified for retaining the rules identified by decision tree (or other association rule) methods.
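Support and confidence can be computed directly from transaction data. A minimal sketch (the helper names and sample baskets are hypothetical):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Accuracy of the rule antecedent -> consequent."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

baskets = [{"eggs", "bacon"}, {"eggs", "bacon", "milk"},
           {"eggs"}, {"milk"}, {"bacon", "milk"}]
print(support(baskets, {"eggs", "bacon"}))       # 0.4
print(confidence(baskets, {"eggs"}, {"bacon"}))  # 2/3: bacon appears in 2 of the 3 egg baskets
```

A rule generator would keep only those rules whose support and confidence clear the specified thresholds.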
For rules to be interesting, they must identify something useful (have a high confidence level and sufficiently high support) and novel. For instance, a grocer applying data mining who finds that eggs and bacon are purchased together at a confidence level of 0.9 and a support level of 0.2 might not be impressed; the grocer knew that prior to data mining. Interestingness is the idea that the data mining analysis found something unexpected (knowledge discovery). It is still useful to confirm the hypothesis that eggs and bacon sell together, but it is more useful to discover that, say, blackberry jam and eggs go together at a similar rate. (Please note that we do not know of such a real relationship.) Such information could lead to rearranged store displays, promotions, or both, which would make it interesting.
Machine Learning
Rule induction algorithms have been developed to automatically process data of this type. For this approach to work, a clear outcome is needed, and here one exists: two categories of payoff results. Rule induction works by searching through data for patterns and relationships, clustering records into specific categories. Machine learning starts with no assumptions, looking only at the input data and results. The judgment developed by human experts is not considered, which might sound inefficient, but human biases are thereby eliminated. Recursive partitioning algorithms split the data (the original data, not grouped as above) into finer and finer subsets, leading to a decision tree.
For instance, let us consider the 20 past loan application cases in Table 8.3, with known outcomes.
Table 8.3 Twenty past loan application cases
Age         | Income           | Assets  | Debts   | Want  | Risk    | Result
20 (young)  | 17,152 (low)     | 11,090  | 20,455  | 400   | High    | On-time
23 (young)  | 25,862 (low)     | 24,756  | 30,083  | 2,300 | High    | On-time
28 (young)  | 26,169 (low)     | 47,355  | 49,341  | 3,100 | High    | Late
23 (young)  | 21,117 (low)     | 21,242  | 30,278  | 300   | High    | Default
22 (young)  | 7,127 (low)      | 23,903  | 17,231  | 900   | Low     | On-time
26 (young)  | 42,083 (average) | 35,726  | 41,421  | 300   | High    | Late
24 (young)  | 55,557 (average) | 27,040  | 48,191  | 1,500 | High    | On-time
27 (young)  | 34,843 (average) | 0       | 21,031  | 2,100 | High    | On-time
29 (young)  | 74,295 (average) | 88,827  | 100,599 | 100   | High    | On-time
23 (young)  | 38,887 (average) | 6,260   | 33,635  | 9,400 | Low     | On-time
28 (young)  | 31,758 (average) | 58,492  | 49,268  | 1,000 | Low     | On-time
25 (young)  | 80,180 (high)    | 31,696  | 69,529  | 1,000 | High    | Late
33 (middle) | 40,921 (average) | 91,111  | 90,076  | 2,900 | Average | Late
36 (middle) | 63,124 (average) | 164,631 | 144,697 | 300   | Low     | On-time
39 (middle) | 59,006 (average) | 195,759 | 161,750 | 600   | Low     | On-time
39 (middle) | 125,713 (high)   | 382,180 | 315,396 | 5,200 | Low     | On-time
55 (middle) | 80,149 (high)    | 511,937 | 21,923  | 1,000 | Low     | On-time
62 (old)    | 101,291 (high)   | 783,164 | 23,052  | 1,800 | Low     | On-time
71 (old)    | 81,723 (high)    | 776,344 | 20,277  | 900   | Low     | On-time
63 (old)    | 99,522 (high)    | 783,491 | 24,643  | 200   | Low     | On-time
There are three variables, each with three possible levels. In practice, we would expect thousands of observations, making it unlikely that any combination would be empty. We use 20 observations to demonstrate the principles of the calculations. Here, many of the 27 combinations are empty. The combinations with representative observations are given in Table 8.4.
Table 8.4 Grouped data
Age    | Income  | Risk    | Total | On-time | Not on-time | Probability
Young  | Low     | High    | 4     | 2       | 2           | 0.50
Young  | Low     | Low     | 1     | 1       | 0           | 1.00
Young  | Average | High    | 4     | 3       | 1           | 0.75
Young  | Average | Low     | 2     | 2       | 0           | 1.00
Young  | High    | High    | 1     | 0       | 1           | 0.00
Middle | Average | Average | 1     | 0       | 1           | 0.00
Middle | Average | Low     | 2     | 2       | 0           | 1.00
Middle | High    | Low     | 2     | 2       | 0           | 1.00
Old    | High    | Low     | 3     | 3       | 0           | 1.00
Automatic machine learning begins with identifying those variables that offer the greatest likelihood of distinguishing between the possible outcomes. For each of the three variables, we can identify the outcome probabilities in Table 8.5.
Table 8.5 Combination outcomes
Variable | Value   | Cases | On-time | Late | Prob (on-time)
Age      | Young   | 12    | 8       | 4    | 0.67
Age      | Middle  | 5     | 4       | 1    | 0.80
Age      | Old     | 3     | 3       | 0    | 1.00
Income   | Low     | 5     | 3       | 2    | 0.60
Income   | Average | 9     | 7       | 2    | 0.78
Income   | High    | 6     | 5       | 1    | 0.83
Risk     | High    | 9     | 5       | 4    | 0.56
Risk     | Average | 1     | 0       | 1    | 0.00
Risk     | Low     | 10    | 10      | 0    | 1.00
Most data mining packages use an entropy measure to gauge the discriminating power of each variable, selecting the variable with the greatest discriminating power as the first on which to split the data. (Chi-square measures can also be used to select variables.) Three of the nine categories here have all observations in a single outcome category: age = old and risk = low both have all cases on-time, while risk = average has only one observation, which is late (suspicious logically and given the minimal possible sample size, but remember that we are demonstrating the procedure with a small dataset).
One formula for entropy, where p is the number of positive examples and n is the number of negative examples in the training set for each value of the attribute, is as follows:

Inform(p, n) = −[p/(p + n)] × log2[p/(p + n)] − [n/(p + n)] × log2[n/(p + n)]

The entropy of an attribute is then the sum, over its values, of Inform(p, n) for each value, weighted by the fraction of training cases taking that value.
This formula has a problem: if either p or n is 0 (which happens when a category's outcomes are unanimous), the log to base 2 is undefined, and the formula does not work. However, as either ratio approaches 0, the corresponding term of the Inform formula converges to 0. For the variable Age, there are three categories. Entropy for each Age category by this formula is shown in Table 8.6.
Table 8.6 Entropy calculation for Age
       | p/(p+n) | Log(base 2) | n/(p+n) | Log(base 2) | Sum of products | Probability | Product
Young  | 8/12    | −0.585      | 4/12    | −1.585      | 0.918           | 12/20       | 0.551
Middle | 4/5     | −0.322      | 1/5     | −2.322      | 0.722           | 5/20        | 0.180
Old    | 3/3     | 0*          | 0/3     | 0*          | 0*              | 3/20        | 0*
Sum    |         |             |         |             |                 |             | 0.731
*Formula strictly undefined, but converges to zero.
For category Young, the calculation is [−(8/12) × (−0.585) − (4/12) × (−1.585)] × (12/20) = 0.551.
The lower this entropy measure, the greater the information content (the greater the agreement probability). The entropy measures for the three variables are:
Age 0.731
Income 0.782
Risk 0.446
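These three values can be verified with a short script (a sketch using the counts from Table 8.5; the function names are our own):

```python
from math import log2

def inform(p, n):
    """Inform(p, n); taken as 0 when either count is 0 (the convergent value)."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def entropy(groups, total):
    """Weighted sum of Inform over an attribute's (on-time, late) counts."""
    return sum((p + n) / total * inform(p, n) for p, n in groups)

# Category counts (on-time, late) from Table 8.5
age    = [(8, 4), (4, 1), (3, 0)]    # young, middle, old
income = [(3, 2), (7, 2), (5, 1)]    # low, average, high
risk   = [(5, 4), (0, 1), (10, 0)]   # high, average, low

print(round(entropy(age, 20), 3))     # 0.731
print(round(entropy(income, 20), 3))  # 0.782
print(round(entropy(risk, 20), 3))    # 0.446
```

The same function, applied to progressively smaller subsets of the data, drives the recursive partitioning described below.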
By this measure, Risk has the greatest information content. If Risk is low, the data indicates a 1.0 probability (10 of 10 cases) that the applicant will pay the loan back on-time. If Risk is not low, this data indicates a 0.5 probability that the applicant will pay the loan back on time. This would be the first rule selected by the machine learning algorithm.
IF (Risk = Low)
    THEN Predict On-time payment
ELSE
    Predict Late
This rule is subject to two types of errors. First, those applicants rated as low risk may actually not pay on time. (From the data, the probability of this happening is 0.0.) Second, those applicants rated as high or average risk may actually have paid if given a loan. (From the data, the probability of this happening is 0.5.) The expectation of this is the probability of an applicant being rated as high or average in risk (10/20, or 0.5) times the probability of being wrong (0.5), yielding an expected error of 0.25. To test this rule, it is best to apply it to a second set of data, different from the dataset used to develop the rule.
The set of rules can be examined further to see if greater accuracy can be obtained. The entropy formula for Age, given that risk was not low, is 0.991, while the same calculation for income is 1.971. This indicates that Age has greater discriminating power. With this data, if Age is middle, the one case did not pay on time. There were no old cases in this group. Therefore, the second rule is:
IF (Risk is NOT low)
    AND (Age = Middle)
    THEN Predict Late
ELSE
    Predict On-time
In this case, the data would indicate probabilities shown in Table 8.7.
Table 8.7 Probabilities by case
Risk    | Age    | Count | Probability on-time
Low     | any    | 10/20 | 1.0
NOT low | Middle | 1/20  | 0.0
NOT low | Young  | 9/20  | 0.56
All 10 cases with a Low risk rating paid on time. Of the other 10 cases, the 1 that was middle-aged did not pay on time (as stated earlier), while 5 of the 9 young cases paid on time. The expected error here is the 4/9 error probability applied to those 9 of the 20 total cases, or (9/20) × (4/9) = 0.2. This is an improvement over the prior case, where the expected error was 0.25.
For the last variable, Income, given that Risk was not low, and Age was not Middle, there are nine cases left, shown in Table 8.8.
Table 8.8 Probabilities recalculated
Income  | On-time | Late | Probability on-time
Low     | 2       | 2    | 0.50
Average | 3       | 1    | 0.75
High    | 0       | 1    | 0.00
Four of these nine cases were Low income, with two paying on time and two not. Four were Average income, three of which paid on time and one not. The last case was High income, which was not paid on time.
A third rule, taking advantage of the case with unanimous outcome, is:
IF (Risk NOT low)
    AND (Age NOT middle)
    AND (Income high)
    THEN Predict Late
ELSE
    Predict On-time
The expected accuracy of the three rules together is shown in Table 8.9.
Table 8.9 Expected accuracy of three rules
Risk    | Age    | Income   | Count | Probability (on-time)
Low     | any    | any      | 10/20 | 1.0
NOT low | Middle | any      | 1/20  | 0.0
NOT low | Young  | High     | 1/20  | 0.0
NOT low | Young  | NOT high | 8/20  | 0.625
The expected error here is 8/20 times 0.375 = 0.15.
An additional rule could be generated. For the case of Risk NOT low, Age = Young, and Income NOT high, there are 4 cases with low income (probability of on-time = 0.5) and 4 cases with average income (probability of on-time = 0.75). The greater discrimination is provided by average income, resulting in the rule:
IF (Risk NOT low)
    AND (Age NOT middle)
    AND (Income average)
    THEN Predict On-time
ELSE
    Predict either
There is no added accuracy obtained with this rule, shown in Table 8.10.
Table 8.10 Accuracy of fourth rule
Risk    | Age    | Income  | Count | Probability (on-time)
Low     | any    | any     | 10/20 | 1.0
NOT low | Middle | any     | 1/20  | 0.0
NOT low | Young  | High    | 1/20  | 0.0
NOT low | Young  | Average | 4/20  | 0.75
NOT low | Young  | Low     | 4/20  | 0.50
The expected error is 4/20 × 0.25 + 4/20 × 0.5 = 0.15, the same as without the rule. When machine learning methods encounter no improvement, they generally stop. (Here, we have exhausted all combinations anyway.) Therefore, the last rule would not be added to the first three. The machine learning algorithm would stop at:
IF (Risk low)
    THEN Predict On-time
ELSE
    IF (Risk NOT low)
        AND (Age middle)
        THEN Predict Late
    ELSE
        IF (Risk NOT low)
            AND (Age NOT middle)
            AND (Income high)
            THEN Predict Late
        ELSE
            Predict On-time
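The final rule set can be expressed as a function and checked against the grouped counts in Table 8.4 (an illustrative sketch; labels are lowercase here):

```python
def predict(risk, age, income):
    """The three-rule model derived in the text."""
    if risk == "low":
        return "on-time"
    if age == "middle":
        return "late"
    if income == "high":
        return "late"
    return "on-time"

# Grouped training counts from Table 8.4: (age, income, risk, on-time, late)
groups = [
    ("young", "low", "high", 2, 2), ("young", "low", "low", 1, 0),
    ("young", "average", "high", 3, 1), ("young", "average", "low", 2, 0),
    ("young", "high", "high", 0, 1), ("middle", "average", "average", 0, 1),
    ("middle", "average", "low", 2, 0), ("middle", "high", "low", 2, 0),
    ("old", "high", "low", 3, 0),
]
correct = sum(on if predict(risk, age, income) == "on-time" else late
              for age, income, risk, on, late in groups)
print(correct / 20)  # 0.85, matching the 0.15 expected training error
```

Seventeen of the 20 training cases are classified correctly, reproducing the 0.15 expected error computed above.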
This model can be tested against a separate set of 10 applications. Table 8.11 shows the results.
Table 8.11 Rules applied to test cases
Record | Actual  | Rule prediction | By rule | Result
1      | On-time | On-time         | 1       | Match
2      | On-time | On-time         | 1       | Match
3      | On-time | On-time         | 1       | Match
4      | On-time | On-time         | Else    | Match
5      | On-time | On-time         | 1       | Match
6      | On-time | On-time         | Else    | Match
7      | Late    | On-time         | Else    | Miss
8      | On-time | On-time         | 1       | Match
9      | On-time | Late            | 3       | Miss
10     | On-time | On-time         | 1       | Match
Figure 8.2 shows the decision tree for this set of rules.
In this case, the coincidence matrix is shown in Table 8.12.
Table 8.12 Coincidence matrix for rule set
Actual  | Late | On-time | Total
Late    | 0    | 1       | 1
On-time | 1    | 8       | 9
Total   | 1    | 9       | 10
In this case, the accuracy of the model is 8/10, or 80 percent. There were errors of both types (denying a good loan application, accepting one bad application).
Software Demonstrations
We use the loan application dataset to demonstrate decision tree operation, both manually and with software. The full loan application dataset has 650 observations over four variables. We use the first 400 observations for training, reserving the last 250 observations for testing. The data, covered in Chapter 4, are grouped in Table 8.13, beginning with the young applicants.
Table 8.13 Grouped data—young
Age   | Income  | Risk    | Credit rating | On-time | Not on-time | Probability
Young | Low     | High    | Red           | 3       | 1           | 0.750
Young | Low     | High    | Amber         | 7       | 1           | 0.875
Young | Low     | High    | Green         | 17      | 1           | 0.944
Young | Low     | Average | Red           | 1       | 1           | 0.500
Young | Low     | Average | Green         | 1       | 0           | 1.000
Young | Low     | Low     | Green         | 7       | 0           | 1.000
Young | Average | High    | Red           | 9       | 9           | 0.500
Young | Average | High    | Amber         | 24      | 10          | 0.706
Young | Average | High    | Green         | 42      | 6           | 0.875
Young | Average | Average | Amber         | 1       | 0           | 1.000
Young | Average | Average | Green         | 4       | 0           | 1.000
Young | Average | Low     | Red           | 2       | 0           | 1.000
Young | High    | High    | Red           | 1       | 1           | 0.500
Young | High    | High    | Amber         | 5       | 3           | 0.625
Young | High    | High    | Green         | 12      | 0           | 1.000
Table 8.14 gives the categorical counts for middle-aged loan applicants.
Table 8.14 Grouped data—middle aged
Age    | Income  | Risk    | Credit rating | On-time | Not on-time | Probability
Middle | Low     | Average | Green         | 3       | 0           | 1.000
Middle | Low     | Low     | Red           | 0       | 1           | 0.000
Middle | Low     | Low     | Amber         | 1       | 0           | 1.000
Middle | Low     | Low     | Green         | 2       | 0           | 1.000
Middle | Average | High    | Red           | 1       | 1           | 0.500
Middle | Average | High    | Amber         | 4       | 2           | 0.667
Middle | Average | High    | Green         | 17      | 1           | 0.944
Middle | Average | Average | Amber         | 4       | 0           | 1.000
Middle | Average | Average | Green         | 4       | 0           | 1.000
Middle | Average | Low     | Red           | 13      | 1           | 0.929
Middle | Average | Low     | Amber         | 37      | 4           | 0.902
Middle | Average | Low     | Green         | 64      | 1           | 0.985
Middle | High    | High    | Amber         | 1       | 0           | 1.000
Middle | High    | High    | Green         | 7       | 0           | 1.000
Middle | High    | Low     | Red           | 8       | 0           | 1.000
Middle | High    | Low     | Amber         | 14      | 0           | 1.000
Middle | High    | Low     | Green         | 30      | 0           | 1.000
Finally, we have the Old applicants in Table 8.15.
Table 8.15 Grouped data—age equals old
Age | Income  | Risk | Credit rating | On-time | Not on-time | Probability
Old | Average | Low  | Red           | 0       | 1           | 0.000
Old | Average | Low  | Amber         | 1       | 0           | 1.000
Old | Average | Low  | Green         | 2       | 0           | 1.000
Old | High    | Low  | Red           | 2       | 0           | 1.000
Old | High    | Low  | Amber         | 1       | 0           | 1.000
Old | High    | Low  | Green         | 3       | 0           | 1.000
Most of the possible old-applicant categories were vacant, and all of those present had unanimous outcomes. Automatic machine learning begins by identifying the variables that offer the greatest likelihood of distinguishing between the possible outcomes. For each of the four variables, the outcome probabilities are given in Table 8.16.
Table 8.16 Combination outcomes
Variable      | Value   | Cases | On-time | Late | Prob (on-time)
Age           | Young   | 169   | 136     | 33   | 0.805
Age           | Middle  | 221   | 210     | 11   | 0.950
Age           | Old     | 10    | 9       | 1    | 0.900
Income        | Low     | 47    | 42      | 5    | 0.894
Income        | Average | 265   | 229     | 36   | 0.864
Income        | High    | 88    | 84      | 4    | 0.955
Risk          | High    | 186   | 150     | 36   | 0.806
Risk          | Average | 19    | 18      | 1    | 0.947
Risk          | Low     | 195   | 187     | 8    | 0.959
Credit rating | Red     | 56    | 40      | 16   | 0.714
Credit rating | Amber   | 120   | 100     | 20   | 0.833
Credit rating | Green   | 224   | 215     | 9    | 0.960
For the variable Age, there are three outcomes. Entropy for Age by this formula would be as shown in Table 8.17.
Table 8.17 Entropy calculation for Age
       | p/(p+n) | Log(base 2) | n/(p+n) | Log(base 2) | Sum   | Probability | Product
Young  | 136/169 | −0.313      | 33/169  | −2.356      | 0.712 | 169/400     | 0.301
Middle | 210/221 | −0.074      | 11/221  | −4.328      | 0.285 | 221/400     | 0.158
Old    | 9/10    | −0.152      | 1/10    | −3.322      | 0.469 | 10/400      | 0.012
Sum    |         |             |         |             |       |             | 0.470
Under the heading Sum is the following calculation. For Young, the first term of the Inform equation multiplies the ratio 136/169 by −0.313, and by −1. The second term multiplies the ratio 33/169 by −2.356, and by −1. The sum of these terms is 0.712, representing 169 of the 400 training cases. The heading Product is the product of the sum and the probability: for Young, 0.712 times 169/400, yielding 0.301. The sum of the three products yields the entropy measure. The lower this entropy measure, the greater the information content (the greater the agreement probability). The entropy measures for the four variables are:

Age 0.470
Income 0.496
Risk 0.464
Credit rating 0.452
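The four entropy values can be computed directly from the counts in Table 8.16; a short script (the function names are our own):

```python
from math import log2

def inform(p, n):
    """Inform(p, n); taken as 0 when either count is 0."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def entropy(groups, total=400):
    """Weighted entropy over an attribute's (on-time, late) category counts."""
    return sum((p + n) / total * inform(p, n) for p, n in groups)

# (on-time, late) counts per category, from Table 8.16
age    = [(136, 33), (210, 11), (9, 1)]    # young, middle, old
income = [(42, 5), (229, 36), (84, 4)]     # low, average, high
risk   = [(150, 36), (18, 1), (187, 8)]    # high, average, low
credit = [(40, 16), (100, 20), (215, 9)]   # red, amber, green

for name, groups in [("Age", age), ("Income", income),
                     ("Risk", risk), ("Credit rating", credit)]:
    print(f"{name}: {entropy(groups):.3f}")
```

The weighted sums come to roughly 0.470 for Age, 0.496 for Income, 0.464 for Risk, and 0.452 for Credit rating.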
By this measure, Credit rating has the greatest information content. Among the Credit rating categories, we want to select the one with the greatest ability to accurately categorize data (the probability closest to 1.0 for granting the loan, or 0.0 for denying it). We can also consider the relative cost of an error: granting a loan that is not paid on time may cost nine times as much ($900) as denying a loan that would have been paid on time ($100). We can, therefore, use a cutoff of 0.90 probability. If the category being considered has a probability of on-time payment of at least 0.90, we create a rule granting the loan; if the probability is less than 0.90, we deny the loan. If Credit rating is green, the data indicates a 0.960 probability (215 of 224 cases) that the applicant will pay the loan back on time. If Credit rating is not green, the data indicates a 0.795 probability (140 of 176 cases) of on-time payment. This would be the first rule selected by the machine learning algorithm.
IF (Credit = Green)
    THEN Predict On-time Payment
ELSE
    Predict Late
This rule is subject to two types of errors. First, applicants rated green may actually not pay on time (from the data, the probability of this is 0.040). Second, applicants rated amber or red may actually have paid if given a loan (from the data, the probability of this is 0.795). The expected error is the probability of an applicant being rated green (224/400, or 0.560) times the probability of being wrong (0.040), plus the probability of not being green (176/400, or 0.440) times the probability of being wrong (0.795), yielding 0.022 + 0.350 = 0.372. We can also calculate the error cost function: $900 times 0.022 plus $100 times 0.350, or $55.14 per case. To test this rule, it is best to apply it to a second set of data, different from the dataset used to develop the rule. Using the last 250 observations as the test set, Table 8.18 shows the coincidence matrix.
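The expected-error and cost arithmetic for this first rule can be checked with a short script (counts from Table 8.16; exact arithmetic gives 0.3725 and $55.25, the slightly different figures in the text reflecting rounding of the intermediate probabilities):

```python
# Expected error and cost for the first rule
# (IF Credit = Green THEN On-time ELSE Late), training counts from Table 8.16.
green_total, green_on = 224, 215   # rated green, predicted on-time
other_total, other_on = 176, 140   # red + amber, predicted late

p_wrong_green = 1 - green_on / green_total   # granted, but pays late
p_wrong_other = other_on / other_total       # denied, but would have paid

expected_error = (green_total / 400) * p_wrong_green + \
                 (other_total / 400) * p_wrong_other
cost_per_case = 900 * (green_total / 400) * p_wrong_green + \
                100 * (other_total / 400) * p_wrong_other

print(round(expected_error, 4))   # 0.3725
print(round(cost_per_case, 2))    # 55.25
```

The $900 and $100 weights are the asymmetric error costs specified above.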
Table 8.18 Coincidence matrix for first rule
Actual  | Late | On-time | Total
Late    | 13   | 7       | 20
On-time | 88   | 142     | 230
Total   | 101  | 149     | 250
The correct classification rate was 0.620. The cost function for this result would be $900 times 7 plus $100 times 88, or $15,100. Since we had 250 cases, the average error cost would have been $60.40 per case.
The set of rules can be examined further to see if greater accuracy can be obtained. The entropy formula for Age, given that Credit rating was not green, is 0.672, for Income 0.720, for Risk 0.657, and for the difference between red and amber Credit rating is 0.718. This indicates that Risk has the greater discriminating power at this point. With this data, if Risk is low, 79 of the 86 cases paid on time (0.919). The other two states had 61 of 90 pay on time (0.678). Therefore, the second rule is:
IF (Credit Rating is NOT green)
    AND (Risk = Low)
    THEN Predict On-time
ELSE
    Predict Late
In this case, the data would indicate probabilities shown in Table 8.19.
Table 8.19 Probabilities by case
Credit rating | Risk    | Count  | Probability on-time
Green         | any     | 224/400 | 0.960
NOT green     | Low     | 86/400  | 0.919
NOT green     | NOT low | 90/400  | 0.678
The expected error here is 224/400 times (1 − 0.960), plus 86/400 times (1 − 0.919), plus 90/400 times 0.678 (the on-time cases in the group predicted late), or 0.192. The cost function is $900 times 224/400 times 0.040 ($20.16), plus $900 times 86/400 times 0.081 ($15.67), plus $100 times 90/400 times 0.678 ($15.26), or $51.09 per case. This is an improvement over the prior rule, where the expected error was 0.372 with an expected cost of $55.14 per case. The coincidence matrix is shown in Table 8.20.
Table 8.20 Coincidence matrix for first two rules
Actual  | Late | On-time | Total
Late    | 11   | 9       | 20
On-time | 43   | 187     | 230
Total   | 54   | 196     | 250
The correct classification rate was 0.792, much better than in Table 8.18. The cost function here would be $900 times 9 plus $100 times 43, or $12,400, or $49.60 per case.
We recalculate entropy, eliminating the cases with Credit rating green and Risk rating low. The values obtained were 0.902 for Age, 0.893 for Income, 0.897 for Risk, and 0.877 for Credit rating. The lowest of these values was for Credit. Given that Credit rating was not green and Risk was not low, there are 90 training cases left, shown in Table 8.21.
Table 8.21 Probabilities recalculated
Credit rating | On-time | Late | Probability
Red           | 15      | 13   | 0.536
Amber         | 46      | 16   | 0.742
While below the specified cutoff limit of 0.9, those with Amber credit ratings clearly had a higher probability of on-time payment than did those with credit ratings of red. A third rule is:
IF (Credit Rating NOT green)
    AND (Risk NOT low)
    AND (Credit Rating amber)
    THEN Predict On-time
ELSE
    Predict Late
The expected accuracy of the three rules together is shown in Table 8.22.
Table 8.22 Expected accuracy of three rules
Credit rating | Risk    | Credit rating | Count   | Probability
Green         | any     | any           | 224/400 | 0.960
NOT green     | Low     | any           | 86/400  | 0.919
NOT green     | NOT low | Amber         | 62/400  | 0.742
NOT green     | NOT low | Red           | 28/400  | 0.536
The expected error here is 0.117, with an expected cost function per case of $75.57 (due to the high error rate for the third rule). The expected classification error went down, although the expected cost function went up. Table 8.23 gives the coincidence matrix for this set of rules using the test data.
Table 8.23 Coincidence matrix for three rules
Actual  | Late | On-time | Total
Late    | 3    | 17      | 20
On-time | 11   | 219     | 230
Total   | 14   | 236     | 250
Here, the correct classification rate increased to 0.888 (it was 0.792 after two rules), while the average cost per case increased to $65.60 from $49.60. When machine learning methods encounter no improvement, they generally stop. Therefore, the last rule would not be added to the first two. The machine learning algorithm would stop at:
IF (Credit Rating is green)
    THEN Predict On-time

IF (Credit Rating is NOT green)
    AND (Risk = Low)
    THEN Predict On-time
ELSE
    Predict Late
Note that this approach does not guarantee optimality. In that sense, it is heuristic, which means it tends to give good models, but there might be better ones.
The model was also run using See5, a very good decision tree software. The initial model, using the 400 categorical training observations, degenerated to classifying all cases as on-time. Because the dataset is highly skewed (45 late cases against 355 on-time), this still had a fairly high correct classification rate of 0.920, but it was very bad at predicting applications that would end up late. Balancing the data can improve results. We pruned on-time observations from the training set to control the proportion of late cases, as shown in Table 8.24. Using 325 observations again yielded all-on-time forecasts. The models based on 225 and on 180 observations classified all cases as on-time unless risk was high and credit was either red or amber. The model using 150 training cases classified all cases as on-time unless the applicant was young and credit was either red or amber. The models obtained were tested on the same test set of 250 observations.
Table 8.24 Results from the balancing dataset
Train set | Late | On-time | Proportion late | Late as on-time | On-time as late | Correct rate | Cost
400       | 45   | 355     | 0.1125          | 20              | 0               | 0.920        | $18,000
325       | 45   | 280     | 0.1385          | 20              | 0               | 0.920        | $18,000
225       | 45   | 180     | 0.2000          | 9               | 39              | 0.808        | $12,000
180       | 45   | 135     | 0.2500          | 9               | 39              | 0.808        | $12,000
150       | 45   | 105     | 0.3000          | 9               | 32              | 0.836        | $11,300
These results show a clear trend toward more useful models as the data is balanced.
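The balancing used above, pruning majority-class cases, can be sketched as an undersampling step (the record structure and counts here are hypothetical):

```python
import random

def undersample(records, label_key, majority, keep):
    """Drop majority-class records until only `keep` of them remain (sketch)."""
    major = [r for r in records if r[label_key] == majority]
    minor = [r for r in records if r[label_key] != majority]
    random.seed(0)  # reproducible illustration
    return minor + random.sample(major, keep)

# Hypothetical records: 8 on-time, 2 late
data = [{"result": "on-time"}] * 8 + [{"result": "late"}] * 2
balanced = undersample(data, "result", "on-time", 4)
print(len(balanced))                                 # 6
print(sum(r["result"] == "late" for r in balanced))  # 2
```

The alternative, replicating the rare class, is used in the WEKA demonstration below.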
R Decision Tree Model
We now demonstrate decision trees on the loan application dataset with R (Rattle). Figure 8.3 shows the modeling screen from Rattle for decision trees (the Tree radio button is the default).
Figure 8.4 shows the rules obtained from Execute being selected in Figure 8.3.
This tree has six rules (the asterisked lines in Figure 8.4). Selecting the Draw button on Figure 8.4 yields the graphic tree in Figure 8.5.
Table 8.25 gives the coincidence matrix for this decision tree.
Table 8.25 Coincidence matrix for Rattle decision tree model
               | Model OK | Model problem | Total
Actual OK      | 219      | 11            | 230
Actual problem | 17       | 3             | 20
Total          | 236      | 14            | 250
Correct classification here is 222/250, or 0.888, quite good. For the actual "OK" cases, the correct classification rate is 219/230 = 0.952, while for the actual "Problem" cases, it is only 3/20, or 0.15. Applied to the new cases, this model categorized case 4 as a problem loan and all others as "OK."
KNIME
Figure 8.6 displays the workflow for decision trees using KNIME.
This is very similar to the modeling workflows for regression and neural networks. File reader nodes are identical. Here you will need a Decision Tree Learner node (node 2) and Decision Tree Predictor nodes (nodes 4 and 8). The scorer is the same. To forecast new cases, the new cases are input on the File Reader node 11, run through the Predictor in node 8, and an Interactive Table attached in node 10. Figure 8.7 shows the resulting tree from the Decision Tree Learner.
The output from KNIME’s Scorer node is given in Table 8.26.
Table 8.26 Coincidence matrix for KNIME decision tree model
               | Model OK | Model problem | Total
Actual OK      | 221      | 9             | 230
Actual problem | 19       | 1             | 20
Total          | 240      | 10            | 250
Here the correct classification rate is 222/250, or 0.888, the same as the R model's overall rate. The decision tree from KNIME was very good at predicting the actual "OK" cases (221/230, or 0.961), but not very good at predicting the "Problem" cases (1/20 = 0.05). The prediction of new cases is obtained from the Interactive Table node (see Figure 8.8).
Note that the decision tree from Figure 8.7 classifies all 10 new cases as “OK.”
WEKA
WEKA is opened as usual, and under Preprocess, the file with all 650 cases (the combined training and test sets used in Rattle and KNIME) is linked. We selected 61.5 percent of the cases for training. J48 run with a minimum support of 2 yielded a degenerate model, as did a minimum support of 20. We therefore balanced the dataset by replicating the scarcer "Problem" cases eight additional times, yielding a balanced dataset with 585 cases each of "OK" and "Problem." We ran the J48 decision tree model using 50 percent of this balanced data for testing. With a minimum support of 2, the model contained 49 rules, too many. This model did have an overall correct classification rate of 0.911, which is very good, but a correct rate of 0 in predicting the negatives. Raising the minimum support to 20 yielded 21 rules and a correct classification rate of 0.802 (0.725 for positives, 0.879 for negatives). This was still too many rules, so the minimum support was raised to 50.
These eight rules (Figure 8.9) yielded the coincidence matrix in Table 8.27.
Table 8.27 WEKA decision tree coincidence matrix
               | Model OK | Model problem | Total
Actual OK      | 440      | 145           | 585
Actual problem | 146      | 439           | 585
Total          | 586      | 584           | 1170
The correct classification rate for positive cases was 0.752, for negative cases 0.750, and 0.751 overall. A nice feature of decision tree models is that they are easy to apply to new cases. They consist of If/Then rules that are very easy to interpret. By adjusting minimum support, you can balance the tradeoff between accuracy and smaller rule sets.
As to new cases, this model classified "Problem" for the case in row 1 (see Figure 8.8 for the input data) by rule 7, the case in row 3 by rule 1, the case in row 4 by rule 7, the case in row 8 by rule 7, and the case in row 9 by rule 3. This was thus a much more thorough (and probably more useful) model than the decision tree models obtained from R and KNIME in this case.
Summary
Decision trees are very effective and useful data mining models. They are automatic, an application of machine learning, which in itself makes them attractive. They are relatively robust, in that they are not adversely affected by noisy or missing data. They can handle very large datasets and can deal with data in categorical or numeric form. They are also very strong in their ability to explain their conclusions (rules can be expressed in natural language, and communication to managers is excellent). As shown in the preceding sections, many software products are available to support decision tree modeling.
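For readers working in Python, the same kind of tree can be fit with scikit-learn (an additional option, not used in the demonstrations above; the encoding and parameter choices here are our own). Fit to the 20 grouped cases of Table 8.4, a fully grown entropy-based tree reproduces the 0.85 training accuracy found manually:

```python
# A scikit-learn sketch of the decision tree demonstrations.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

rows, labels = [], []
# Expand the grouped counts of Table 8.4 into individual cases
for age, income, risk, on, late in [
    ("young", "low", "high", 2, 2), ("young", "low", "low", 1, 0),
    ("young", "average", "high", 3, 1), ("young", "average", "low", 2, 0),
    ("young", "high", "high", 0, 1), ("middle", "average", "average", 0, 1),
    ("middle", "average", "low", 2, 0), ("middle", "high", "low", 2, 0),
    ("old", "high", "low", 3, 0),
]:
    rows += [[age, income, risk]] * (on + late)
    labels += ["on-time"] * on + ["late"] * late

X = OrdinalEncoder().fit_transform(rows)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, labels)
print(tree.score(X, labels))  # 0.85: the two mixed-outcome groups cap accuracy
```

Two groups (young/low/high and young/average/high) contain identical attribute values with mixed outcomes, so no tree can exceed 17/20 on this training data.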