Decision trees

The basic idea behind decision trees is to use the training dataset to build a tree of decisions, which is then used to make predictions.

It recursively splits the training dataset into subsets based on the value of a single feature. Each split corresponds to a node in the decision tree. The splitting process continues until every subset is pure; that is, until all of its elements belong to a single class. The only exception is when duplicate training examples fall into different classes; in that case, the majority class wins.
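As a rough sketch of these two rules, the purity test and the majority-vote fallback might look as follows in plain Python (this is illustrative, not scikit-learn's internal code):

    from collections import Counter

    def is_pure(labels):
        # A subset is pure when all of its elements share one class label.
        return len(set(labels)) <= 1

    def majority_class(labels):
        # Fallback for subsets that cannot be purified, for example,
        # duplicate examples with conflicting labels: the most common
        # class wins.
        return Counter(labels).most_common(1)[0][0]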

The end result is a ruleset for making predictions on the test dataset.

Decision trees encode a sequence of binary choices in a process that mimics how a human might classify things, but they decide which question is most useful at each step by using an information criterion, such as entropy.
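As a sketch of how such a criterion works, the entropy of a subset and the information gain of a candidate split can be computed as follows (illustrative helper functions, not the library's implementation):

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels: 0 for a pure
        # subset, maximal when the classes are evenly mixed.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(parent, left, right):
        # Reduction in entropy achieved by splitting `parent` into
        # `left` and `right`; the tree asks the question with the
        # highest gain.
        n = len(parent)
        child_entropy = (len(left) / n) * entropy(left) \
                        + (len(right) / n) * entropy(right)
        return entropy(parent) - child_entropy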

For example, suppose we wished to determine whether animal x is a mammal, bird, fish, reptile, or amphibian; in this case, we would ask the following questions:

- Does x have fur?
  - Yes: x is a mammal
  - No: Does x have feathers?
    - Yes: x is a bird
    - No: Does x have scales?
      - Yes: Does x have gills?
        - Yes: x is a fish
        - No: x is a reptile
      - No: x is an amphibian

This generates a decision tree similar to the one illustrated at https://labs.opendns.com/wp-content/uploads/2013/09/animals.gif.
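Such a ruleset is straightforward to encode directly; here is a minimal sketch using nested dictionaries (the question strings are illustrative stand-ins for real features):

    # Each internal node maps one yes/no question to two subtrees;
    # leaves are class labels.
    ANIMAL_TREE = {'Does x have fur?': {
        'yes': 'mammal',
        'no': {'Does x have feathers?': {
            'yes': 'bird',
            'no': {'Does x have scales?': {
                'yes': {'Does x have gills?': {'yes': 'fish',
                                               'no': 'reptile'}},
                'no': 'amphibian'}}}}}}

    def classify(animal, tree=ANIMAL_TREE):
        # Descend the tree, answering one binary question per node,
        # until a leaf (a class label) is reached.
        if isinstance(tree, str):
            return tree
        question, branches = next(iter(tree.items()))
        answer = 'yes' if animal.get(question, False) else 'no'
        return classify(animal, branches[answer])

    print(classify({'Does x have scales?': True,
                    'Does x have gills?': True}))   # fish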

The binary splitting on a question at each node is the essence of a decision tree algorithm. A major drawback of decision trees is that they can overfit the data: they are so flexible that, given sufficient depth, they can memorize the inputs, which leads to poor performance when they are used to classify unseen data.
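One can observe this overfitting directly by comparing training and test accuracy for an unconstrained tree against a depth-limited one; the following sketch uses a synthetic dataset for illustration:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in (None, 3):
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
        clf.fit(X_tr, y_tr)
        # The unconstrained tree (depth=None) typically scores ~1.0 on
        # the training data but noticeably lower on held-out data;
        # limiting the depth narrows that gap.
        print(depth, clf.score(X_tr, y_tr), clf.score(X_te, y_te))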

The way to mitigate this is to combine multiple decision trees, which is known as using an ensemble estimator. An example of an ensemble estimator is the random forest algorithm, which we will address next.

To use a decision tree in scikit-learn, we import the tree module:

    from sklearn import tree

We then instantiate a decision tree classifier, fit the model, and prepare the test data as follows:

    import patsy

    model = tree.DecisionTreeClassifier(criterion='entropy',
                                        max_depth=3, min_samples_leaf=5)
    dt_model = model.fit(X_train, y_train)
    # Build the test design matrix with patsy's dmatrix, using the
    # same formula as for training.
    X_test = patsy.dmatrix(formula, test_df_filled)
    # . . .
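The elided steps generate the actual predictions; a sketch of what they might look like, assuming test_df_filled still carries the Titanic test set's PassengerId column (the output filename is illustrative):

    import pandas as pd

    # Predict survival for the test set and write a Kaggle
    # submission file.
    y_pred = dt_model.predict(X_test)
    submission = pd.DataFrame({'PassengerId': test_df_filled['PassengerId'],
                               'Survived': y_pred.astype(int)})
    submission.to_csv('titanic_dt.csv', index=False)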

Upon submitting our predictions to Kaggle, the following scores are obtained for different formulas:

    Formula                                                    Kaggle Score
    C(Pclass) + C(Sex) + Fare                                  0.77033
    C(Pclass) + C(Sex)                                         0.76555
    C(Sex)                                                     0.76555
    C(Pclass) + C(Sex) + Age + SibSp + Parch                   0.76555
    C(Pclass) + C(Sex) + Age + Parch + C(Embarked)             0.78947
    C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)     0.79426
