Decision trees

Decision trees are less of a black box than most other machine learning algorithms. They can easily explain how they produce a prediction, a property called interpretability. The main concept is that they produce rules by splitting the training set using the provided features. Iteratively splitting the data produces a tree-like structure, which is where their name derives from. Let's consider a dataset where the instances are individual people deciding on their vacation.

The dataset features consist of the person's age and available money, while the target is their preferred destination, one of either Summer Camp, Lake, or Bahamas. A possible decision tree model is depicted in the following figure:

Decision tree model for the vacation destination problem

As is evident, the model can explain how it produces any of its predictions. The model itself is built by repeatedly selecting the feature and threshold that maximize the information gained. Roughly, this means that the model iteratively splits the dataset in the way that best separates the remaining instances by their target values.
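Scikit-learn's actual splitting code is more involved (and uses Gini impurity by default), but a minimal sketch of choosing a single split by information gain, with entropy as the impurity measure and invented helper names, could look like the following:

import numpy as np

def entropy(labels):
    # Impurity of a set of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    # Try each midpoint between consecutive feature values and keep the
    # threshold that yields the highest information gain.
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(x)):
        threshold = (x[i - 1] + x[i]) / 2
        left, right = y[x <= threshold], y[x > threshold]
        gain = entropy(y) - (len(left) * entropy(left) +
                             len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

Applied to the money feature of the vacation example, best_split would return the money threshold that best separates the three destinations.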

Although intuitive to understand, decision trees can produce unreasonable models, the extreme case being the generation of so many rules that each rule combination eventually leads to a single instance. To avoid such models, we can restrict the tree by requiring that it does not exceed a specific depth (the maximum number of consecutive rules), or that each node contains at least a minimum number of instances before it can be split further.
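Both restrictions map directly to constructor parameters in scikit-learn; as a brief, hedged sketch (the parameter values here are arbitrary):

from sklearn.tree import DecisionTreeClassifier

# Without restrictions, the tree keeps splitting until its leaves are pure,
# which on noisy data can mean one instance per leaf.
unrestricted = DecisionTreeClassifier()

# max_depth caps the number of consecutive rules, while min_samples_split
# requires a node to contain at least that many instances before it is
# split further.
restricted = DecisionTreeClassifier(max_depth=2, min_samples_split=10)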

In scikit-learn, decision trees are implemented in the sklearn.tree package, as the DecisionTreeClassifier and DecisionTreeRegressor classes. In our examples, using DecisionTreeRegressor with dtr = DecisionTreeRegressor(max_depth=2), we achieve an R² of 0.52 and an MSE of 2655. On the breast cancer dataset, using dtc = DecisionTreeClassifier(max_depth=2), we achieve 89% accuracy and the following confusion matrix:

n = 169           | Predicted: Malignant | Predicted: Benign
Target: Malignant | 37                   | 2
Target: Benign    | 17                   | 113
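The exact train/test split comes from earlier in the chapter and is not reproduced here; a sketch of the classification experiment, with an assumed split (so the numbers may differ slightly), could look like this:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the breast cancer dataset and split it; the split used in the
# chapter's earlier examples is assumed, not known.
bc = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=0)

# Restrict the tree to a depth of two, as in the text.
dtc = DecisionTreeClassifier(max_depth=2)
dtc.fit(x_train, y_train)
predictions = dtc.predict(x_test)

print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))

The regression experiment is analogous, using DecisionTreeRegressor(max_depth=2) and metrics such as r2_score and mean_squared_error.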

 

Although this is not the best-performing algorithm so far, we can clearly see how each instance was classified by exporting the tree to the Graphviz format with export_graphviz(dtc, feature_names=bc.feature_names, class_names=bc.target_names, impurity=False):

The decision tree generated for the breast cancer dataset
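To render a figure like the one above outside a notebook, the exported dot data can be passed to the graphviz Python package; a sketch, assuming a recent scikit-learn (where export_graphviz returns the dot data as a string when no output file is given) and an installed Graphviz:

import graphviz
from sklearn.tree import export_graphviz

# Export the fitted classifier to dot format and render it to a PNG file.
dot_data = export_graphviz(dtc, feature_names=bc.feature_names,
                           class_names=bc.target_names, impurity=False)
graph = graphviz.Source(dot_data)
graph.render('breast_cancer_tree', format='png')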