Transitioning from classification trees to regression trees

In classification, a decision tree is constructed by recursive binary splitting: each node is grown into left and right children. At each partition, the algorithm greedily searches for the best pair of feature and value to use as the splitting point. The quality of a split is measured by the weighted purity of the labels in the two resulting children, typically via Gini impurity or information gain. In regression, the tree construction process is almost identical to the classification one, with two differences arising from the fact that the target becomes continuous:

  • The quality of a splitting point is now measured by the weighted mean squared error (MSE) of the two children; the MSE of a child is equivalent to the variance of the target values it contains, and the smaller the weighted MSE, the better the split (see the formula right after this list).
  • The average of the target values in a terminal node becomes the leaf value, instead of the majority vote of labels as in a classification tree.
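Concretely, if a split partitions the N samples of a node into children C_1, ..., C_k, the quantity we minimize is:

    weighted MSE = (|C_1| / N) * Var(C_1) + ... + (|C_k| / N) * Var(C_k)

where |C_i| is the number of samples in child C_i and Var(C_i) is the variance of the target values it contains (in binary splitting, k is simply 2).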

To make sure we understand regression trees, let's work through a small example of house price estimation based on the following five samples (house type, number of bedrooms, and price):

  Type        Bedrooms    Price
  semi        3           600
  detached    2           700
  detached    3           800
  semi        2           400
  semi        4           700

We first define the MSE and weighted MSE computation functions as they'll be used in our calculation:

>>> import numpy as np
>>> def mse(targets):
...     # An empty set of targets contributes zero error
...     if targets.size == 0:
...         return 0
...     # The MSE of a node equals the variance of its targets
...     return np.var(targets)
>>> def weighted_mse(groups):
...     """ Calculate weighted MSE of children after a split
...     Args:
...         groups (list of children, where a child is a list of
...             targets)
...     Returns:
...         float, weighted impurity
...     """
...     total = sum(len(group) for group in groups)
...     weighted_sum = 0.0
...     for group in groups:
...         # Weight each child's MSE by its share of the samples
...         weighted_sum += len(group) / float(total) * mse(group)
...     return weighted_sum

Test things out by executing the following commands:

>>> print('{0:.4f}'.format(mse(np.array([1, 2, 3]))))
0.6667
>>> print('{0:.4f}'.format(weighted_mse([np.array([1, 2, 3]),
...                                      np.array([1, 2])])))
0.5000
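
We can also check the edge case of an empty child, which the mse function guards against and which will appear later when a candidate split leaves one side empty:

>>> print('{0:.4f}'.format(weighted_mse([np.array([]),
...                                      np.array([1, 2, 3])])))
0.6667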

To build the house price regression tree, we first exhaust all possible pairs of feature and value and compute the corresponding weighted MSE:

MSE(type, semi) = weighted_mse([[600, 400, 700], [700, 800]]) = 10333
MSE(bedroom, 2) = weighted_mse([[700, 400], [600, 800, 700]]) = 13000
MSE(bedroom, 3) = weighted_mse([[600, 800], [700, 400, 700]]) = 16000
MSE(bedroom, 4) = weighted_mse([[700], [600, 700, 800, 400]]) = 17500
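As a quick sanity check, we can reproduce a couple of these numbers with the weighted_mse function we just defined:

>>> print('{0:.4f}'.format(weighted_mse([np.array([600, 400, 700]),
...                                      np.array([700, 800])])))
10333.3333
>>> print('{0:.4f}'.format(weighted_mse([np.array([700, 400]),
...                                      np.array([600, 800, 700])])))
13000.0000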

The lowest weighted MSE is achieved with the (type, semi) pair, so the root node is formed by this splitting point: the non-semi houses ([700, 800]) form the left branch and the semi houses ([600, 400, 700]) form the right branch.

If we are satisfied with a one-level-deep regression tree, we can stop here by turning both branches into leaf nodes whose values are the averages of the targets of the samples they contain. Alternatively, we can keep the left branch as a leaf and construct a second level from the right branch:

MSE(bedroom, 2) = weighted_mse([[], [600, 400, 700]]) = 15556
MSE(bedroom, 3) = weighted_mse([[400], [600, 700]]) = 1667
MSE(bedroom, 4) = weighted_mse([[400, 600], [700]]) = 6667
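Again, the winning candidate can be verified with our helper function:

>>> print('{0:.4f}'.format(weighted_mse([np.array([400]),
...                                      np.array([600, 700])])))
1666.6667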

With the second splitting point specified by the (bedroom, 3) pair, which achieves the least weighted MSE, the right branch is split into the two-bedroom house ([400]) on one side and the remaining semi houses ([600, 700]) on the other.

We can finish up the tree by assigning average target values to the leaf nodes: (700 + 800) / 2 = 750 for the left branch, and 400 and (600 + 700) / 2 = 650 for the two leaves under the right branch.
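
These leaf values are simply the means of the targets in each leaf, which we can confirm with NumPy:

>>> print(np.mean(np.array([700, 800])))
750.0
>>> print(np.mean(np.array([600, 700])))
650.0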
