Classification trees are used to separate the data into classes of the response variable. The response variable usually has two classes, such as Yes or No (1 or 0), or sunny or rainy. If the target variable has more than two categories, then C4.5 can be applied. C4.5 improves on the ID3 algorithm in its handling of continuous attributes, discrete attributes, and the post-construction pruning process.
Like most learning algorithms, the classification tree algorithm analyzes a training set and builds a classifier from it, so that in the future it can correctly classify both the training data and new, unseen data. A test example is an input object for which the algorithm must predict an output value. Classification trees are used when the response or target variable is categorical in nature.
In contrast, regression trees are needed when the response variable is continuous rather than discrete: for example, the predicted price of a product. A regression tree is built through binary partitioning, an iterative process that splits the data into partitions or branches and then continues splitting each partition into smaller groups as the method moves down each branch. In other words, regression trees are applicable when the problem involves prediction as opposed to classification. For more details on this, we recommend that you refer to books on classification and regression trees.
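As a minimal sketch of the regression case, we can fit scikit-learn's `DecisionTreeRegressor` on a small made-up dataset (the weights and prices here are purely illustrative, not from the text):

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: product weight (kg) -> price.
# The response is continuous, so a regression tree applies.
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [10.0, 18.0, 31.0, 42.0, 49.0]

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)

# The prediction is the mean response of the matching leaf,
# so it always falls within the range of the training prices
print(reg.predict([[2.5]]))
```

Because a regression tree predicts the average of the training responses in a leaf, its predictions never extrapolate beyond the observed range.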
When the relationship between the predictors and the response is linear, a standard regression tree is more appropriate, and when the relationship is nonlinear, C4.5 should be used. Furthermore, to summarize, when the response variable has only two categories, the classification tree algorithm should be used.
For a decision tree algorithm deciding whether to play tennis or golf, one can easily work through the decision process by asking a question (is it raining out there, or is it sunny?) and drawing the decision diagram, branching out at every question based on the answers. The playing nature of the two games, tennis and golf, is much the same: in any sporting event, if it is windy and raining, chances are that there is not going to be a game.
For tennis, if the outlook is sunny but the humidity is high, then it is recommended not to play. Similarly, if it is raining and windy, then the whole dynamics of the tennis game will be pretty bad, so chances are that it is no fun playing under those conditions either. The following diagram shows all the possible conditions:
We can also add discrete attributes (such as temperature): for what range of temperatures does it not make sense to play tennis? Probably when the temperature is greater than 70 degrees Fahrenheit, that is, when it is hot. We can write the rules combining all of these as follows:
```
If (Outlook = Sunny) and (Humidity = High) then play = No
If (Outlook = Rain) and (Wind = Strong) then play = No
If (Outlook = Sunny and Humidity = Normal) or (Outlook = Overcast)
   or (Outlook = Rain and Wind = Weak) then play = Yes
```
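These rules can be encoded directly as a small Python function (a sketch; the name `play_decision` is ours, not from any library):

```python
def play_decision(outlook, humidity, wind):
    """Apply the play/no-play rules above; returns 'Yes' or 'No'."""
    if outlook == 'Sunny':
        return 'Yes' if humidity == 'Normal' else 'No'
    if outlook == 'Rain':
        return 'Yes' if wind == 'Weak' else 'No'
    return 'Yes'  # Overcast: always play

print(play_decision('Sunny', 'High', 'Weak'))   # No
print(play_decision('Rain', 'Mild', 'Weak'))    # Yes
```

Note that the third argument of `play_decision('Rain', ...)` is the wind, since humidity does not appear in the rain rule.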
With the following training set, we can run the algorithm to select the next best classifier:
| Outlook | Temperature | Humidity | Wind | Play? |
|---|---|---|---|---|
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
The top-down induction of decision trees (ID3) is a method that follows these rules: if all the training examples at a node belong to a single class, label the node with that class and stop; otherwise, select the attribute that yields the highest information gain, split the training data on that attribute, and repeat the process on each branch until the training data is classified within the threshold.
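ID3's attribute selection can be sketched in a few lines of Python using the Play? column of the table above; the helper names `entropy` and `info_gain` are ours:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def info_gain(attr, labels):
    """Entropy reduction from splitting labels on attribute values."""
    n = len(labels)
    remainder = 0.0
    for v in set(attr):
        subset = [l for a, l in zip(attr, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Play? column from the table (1 = Yes, 0 = No), in row order
play = [0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
outlook = ['Sunny', 'Sunny', 'Overcast', 'Overcast', 'Sunny', 'Sunny',
           'Rain', 'Rain', 'Rain', 'Rain', 'Sunny', 'Overcast',
           'Overcast', 'Rain']

print(round(entropy(play), 3))          # 0.94 (9 Yes, 5 No)
print(round(info_gain(outlook, play), 3))  # 0.247
```

ID3 computes this gain for every attribute and splits on the one with the highest value; for this table, Outlook wins the first split.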
One clear distinction between linear regression and a decision tree algorithm is that the decision tree's boundaries are parallel to the axes; for example, if we have two features (x1 and x2), it can only create rules such as x1 >= 5.2 or x2 >= 7.2. An advantage of the decision tree algorithm is that it is robust to errors, which means that the training set can contain errors without affecting the algorithm much.
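The axis-parallel nature of the splits can be seen directly on a fitted tree: scikit-learn exposes each node's split feature and threshold via the `tree_` attribute. A small sketch with made-up two-feature data, where the class depends only on whether x1 exceeds 5:

```python
from sklearn import tree

# Toy data: class is 1 exactly when the first feature (x1) exceeds 5,
# an axis-parallel boundary a depth-1 tree can capture perfectly
X = [[1, 10], [2, 1], [4, 7], [6, 3], [7, 9], [9, 2]]
y = [0, 0, 0, 1, 1, 1]

clf = tree.DecisionTreeClassifier(max_depth=1).fit(X, y)

# The root node stores a rule of the form "feature <= threshold",
# i.e. a cut parallel to one axis (here we expect feature 0, x1)
print(clf.tree_.feature[0])
print(clf.tree_.threshold[0])
```

No matter how the data is arranged, every internal node encodes one such single-feature threshold; oblique boundaries can only be approximated by stacking many of these cuts.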
Using the sklearn package from scikit-learn (scikit-learn.org) and the following code, we can plot the decision tree classifier:
```python
import io  # io.StringIO replaces the deprecated sklearn.externals.six.StringIO

from sklearn import tree
import pydot

# Four columns from the table above with values:
# 1st col - 1 for Sunny, 2 for Overcast, and 3 for Rain
# 2nd col - 1 for Hot, 2 for Mild, and 3 for Cool
# 3rd col - 1 for High and 2 for Normal
# 4th col - 0 for Weak and 1 for Strong
X = [[1,1,1,0],[1,1,1,1],[2,1,1,0],[2,3,2,1],[1,2,1,0],[1,3,2,0],
     [3,2,1,0],[3,3,2,0],[3,3,2,1],[3,2,2,0],[1,2,2,1],[2,2,1,1],
     [2,1,2,0],[3,2,1,1]]  # last row is Rain, Mild, High, Strong

# 1 for Play and 0 for Don't Play
Y = [0,0,1,1,0,1,1,1,0,1,1,1,1,0]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

dot_data = io.StringIO()
tree.export_graphviz(clf, out_file=dot_data)
# recent pydot versions return a list of graphs
(graph,) = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("game.pdf")
```
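Once fitted, the classifier can also be queried directly, without exporting a diagram. A quick sketch using the same encoding as the comments above (the query day, Sunny/Cool/High/Strong, is our own example):

```python
from sklearn import tree

# Same encoding as above: Outlook (1=Sunny, 2=Overcast, 3=Rain),
# Temperature (1=Hot, 2=Mild, 3=Cool), Humidity (1=High, 2=Normal),
# Wind (0=Weak, 1=Strong)
X = [[1,1,1,0],[1,1,1,1],[2,1,1,0],[2,3,2,1],[1,2,1,0],[1,3,2,0],
     [3,2,1,0],[3,3,2,0],[3,3,2,1],[3,2,2,0],[1,2,2,1],[2,2,1,1],
     [2,1,2,0],[3,2,1,1]]
Y = [0,0,1,1,0,1,1,1,0,1,1,1,1,0]

clf = tree.DecisionTreeClassifier().fit(X, Y)

# A fully grown tree reproduces a consistent training set exactly
print(clf.score(X, Y))  # 1.0

# Query an unseen day: Sunny, Cool, High humidity, Strong wind
print(clf.predict([[1, 3, 1, 1]]))
```

A perfect training score is expected here because the table contains no contradictory rows; on real, noisy data you would limit the tree's depth instead.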
Use the export functionality of sklearn to convert the tree diagram into a graph that looks similar to the following diagram:
In order to create your own tree structure, you can use the plotting methods from matplotlib. To display a tree-like diagram, matplotlib provides annotation functions that allow you to create a tree-shaped structure with labels, as shown in the following code:
```python
import matplotlib.pyplot as plt

# Box styles for the three kinds of nodes
branchNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
startNode = dict(boxstyle="sawtooth", fc="0.9")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # Draw a labeled box at centerPt with an arrow pointing from parentPt
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=dict(arrowstyle="<-"))

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False)
    plotNode('from here', (0.3, 0.8), (0.3, 0.8), startNode)
    plotNode('a decision node', (0.5, 0.1), (0.3, 0.8), branchNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()
```
This gives an idea of how you can create a tree structure from scratch using matplotlib. Basically, the preceding example creates three nodes and connects them to form a small tree. The results of this code are shown as follows: