Understanding and implementing regression trees

An algorithm very similar to the decision tree is the regression tree. The difference between the two is that the target variable of a regression tree is a continuous numerical variable, whereas the target variable of a decision tree is a categorical variable.

Regression tree algorithm

Regression trees are particularly useful when there are multiple features in the training dataset that interact in complicated and non-linear ways. In such cases, a simple linear regression, or even a linear regression with some tweaks, either will not be feasible or will produce a model so complex that it is of little use. An alternative to non-linear regression is to partition the dataset into smaller nodes (local partitions) where the interactions are more manageable. We keep partitioning until the non-linear interactions are non-existent or the observations in a partition/node are very similar to each other. This is called recursive partitioning.

A regression tree is similar to a decision tree because the algorithm behind them is more or less the same: a node is split into subnodes based on a certain criterion. The splitting criterion, in this case, is the maximum reduction in variance; as we discussed earlier, this approach is used when the target variable is a continuous numerical variable. The nodes are partitioned based on this criterion until a stopping criterion is met. This process is called recursive partitioning. One of the common stopping criteria is the one we described earlier for the decision tree: the depth (level of nodes) after which the accuracy of the model stops improving is generally the stopping point for a regression tree. Also, the predictor variables that are continuous numerical variables are categorized into classes using the approach described earlier.
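To make the criterion concrete, here is a small illustrative sketch (not from the original text; the DataFrame, feature, and threshold arguments are placeholders) of how the reduction in variance for one candidate split could be computed:

import numpy as np

def variance_reduction(df, feature, threshold, target):
    # Reduction in the target's variance achieved by splitting the node
    # on `feature <= threshold` (population variance, ddof=0)
    parent_var = np.var(df[target])
    left = df[df[feature] <= threshold][target]
    right = df[df[feature] > threshold][target]
    if len(left) == 0 or len(right) == 0:
        return 0.0   # the candidate does not actually partition the node
    # Variance after the split: size-weighted average of the child variances
    child_var = (len(left) * np.var(left) + len(right) * np.var(right)) / len(df)
    return parent_var - child_var

The candidate (variable, threshold) pair with the largest value of this quantity becomes the next split.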

Once a leaf (terminal) node is decided, a local model is fit for all the observations falling under that node. The local model is nothing but the average of the output values of all the observations falling under that leaf node. If the observations (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn) fall under the leaf node l, then the output value, y, for this node is given by:

y = (y1 + y2 + ... + yn) / n

A stepwise summary of the regression tree algorithm is as follows:

  1. Start with a single node containing all the observations, and calculate the mean and variance of the target variable.
  2. Calculate the reduction in variance caused by each of the variables that are potential candidates for being the next node, using the approach described earlier in this chapter. Choose the variable that provides the maximum reduction in the variance as the node.
  3. For each leaf node, check whether the maximum reduction in the variance provided by any of the variables is less than a set threshold, or whether the number of observations in the node is less than a set threshold. If one of these criteria is satisfied, stop. If not, repeat step 2. A small code sketch of this recursion follows the list.
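The sketch below (illustrative only; the thresholds min_reduction and min_samples are hypothetical names, and it reuses the variance_reduction helper sketched above) shows how these three steps can be expressed as a recursive function:

def build_tree(df, features, target, min_reduction=0.01, min_samples=20):
    # Step 3 (size check): stop and create a leaf when the node is too small
    if len(df) < min_samples:
        return {'leaf': True, 'value': df[target].mean()}
    # Step 2: try every (feature, threshold) pair and keep the best split;
    # every observed value is a candidate threshold (fine for a sketch, slow for large data)
    best = (None, None, 0.0)
    for f in features:
        for t in df[f].unique():
            gain = variance_reduction(df, f, t, target)
            if gain > best[2]:
                best = (f, t, gain)
    f, t, gain = best
    # Step 3 (gain check): stop when the best reduction in variance is below the threshold
    if gain < min_reduction:
        return {'leaf': True, 'value': df[target].mean()}
    return {'leaf': False, 'feature': f, 'threshold': t,
            'left': build_tree(df[df[f] <= t], features, target, min_reduction, min_samples),
            'right': build_tree(df[df[f] > t], features, target, min_reduction, min_samples)}

Note that each leaf stores the mean of the target values in that node, which is exactly the local model described above.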

Some advantages of using the regression tree are as follows:

  • It can take care of non-linear and complicated relations between predictor and target variables. Non-linear models often become difficult to comprehend, while the regression trees are simple to implement and understand.
  • Even if some of the attributes of an observation, or the entire observation, are missing, so that the observation cannot reach a leaf node, we can still get an output value for it by averaging the output values of the observations at the deepest subnode it does reach.
  • Regression trees are also very useful for feature selection; that is, selecting the variables that are important to make a prediction. The variables that are a part of the tree are important variables to make a prediction.

Implementing a regression tree using Python

Let us see an implementation of regression trees in Python on a commonly used dataset called Boston. This dataset has information about housing and median prices in Boston. Most of the predictor variables are continuous numerical variables. The target variable, the median price of the house, is also a continuous numerical variable. The purpose of fitting a regression tree is to predict these prices.

Let us take a look at the dataset and then see what the variables mean:

import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/Boston.csv')
data.head()

Fig. 8.13: A look at the Boston dataset

Here is a brief data dictionary describing the meaning of all the columns in the dataset:

Note

  • CRIM: This is the per-capita crime rate by town
  • ZN: This is the proportion of residential land zoned for lots over 25,000 sq.ft
  • INDUS: This is the proportion of non-retail business acres per town
  • CHAS: This is the Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
  • NOX: This is the nitric oxide concentration (parts per 10 million)
  • RM: This is the average number of rooms per dwelling
  • AGE: This is the proportion of owner-occupied units built prior to 1940
  • DIS: This is the weighted distance to five Boston employment centers
  • RAD: This is the index of accessibility to radial highways
  • TAX: This is the full-value property tax rate per $10,000
  • PTRATIO: This is the pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2: Here, Bk is the proportion of blacks by town
  • LSTAT: This is the % of lower status of the population
  • MEDV: This is the median value of owner-occupied homes in $1000s

Let us perform the required preprocessing before we build the regression tree model. In the following code snippet, we just assign the first 13 variables of the preceding dataset as predictor variables and the last one (MEDV) as the target variable:

colnames=data.columns.values.tolist()
predictors=colnames[:13]
target=colnames[13]
X=data[predictors]
Y=data[target]

Let us now build the regression tree model:

from sklearn.tree import DecisionTreeRegressor
regression_tree = DecisionTreeRegressor(min_samples_split=30, min_samples_leaf=10, random_state=0)
regression_tree.fit(X,Y)

The min_samples_split specifies the minimum number of observations required in a node for it to be qualified for a split. The min_samples_leaf specifies the minimum number of observations required to classify a node as a leaf.
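As a quick illustrative check of what these thresholds do (not part of the original text), we can fit a tree without any minimums and compare its size with the constrained tree fitted above:

# An unconstrained tree grows many more nodes than the one fitted above
# with min_samples_split=30 and min_samples_leaf=10
unconstrained_tree = DecisionTreeRegressor(random_state=0)
unconstrained_tree.fit(X, Y)
print(unconstrained_tree.tree_.node_count, regression_tree.tree_.node_count)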

Let us now use the model to make some predictions on the same dataset and see how close they are to the actual value of the target variable:

reg_tree_pred=regression_tree.predict(data[predictors])
data['pred']=reg_tree_pred
cols=['pred','medv']
data[cols]

Fig. 8.14: Comparing the actual and predicted values of the target variable

One point to observe here is that many of the observations have the same predicted values. This was expected because, if you remember, the output of a regression tree is nothing but the average of the output values of all the observations falling under a particular node. Thus, all the observations falling under the same node will have the same predicted output value.
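We can verify this with a small illustrative check (assuming the model fitted above): the number of distinct predicted values should equal the number of leaf nodes in the tree.

import numpy as np
# Leaves are the nodes with no children in scikit-learn's tree structure
n_leaves = (regression_tree.tree_.children_left == -1).sum()
print(len(np.unique(reg_tree_pred)), n_leaves)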

Let us now cross-validate our model and see how accurate the cross-validated model is:

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
import numpy as np
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)
# The scorer returns the negative MSE, so we flip the sign to get the MSE itself
score = -np.mean(cross_val_score(regression_tree, X, Y, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1))
score
score

In this case, the score we are interested in is the mean squared error. Its cross-validated mean comes out to be about 20.10. The cross_val_predict function can be used, just like cross_val_score, to obtain the predicted values of the output variable from the cross-validated model.
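A minimal sketch of that usage, reusing the KFold object defined above (the column name cv_pred is just an example):

from sklearn.model_selection import cross_val_predict
# Each prediction comes from a model that did not see that observation during training
cv_pred = cross_val_predict(regression_tree, X, Y, cv=crossvalidation, n_jobs=1)
data['cv_pred'] = cv_pred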

Let us now have a look at the other outputs of the regression tree that can be useful. One important attribute of a regression tree is the feature importance of various variables. The importance of a feature is measured by the total reduction it has brought to the variance. The feature importance for the variables of a regression tree can be calculated as follows:

regression_tree.feature_importances_

Fig. 8.15: Feature importance scores for regression tree on the Boston dataset

The higher the value of the feature importance for a variable, the more important it is. In this case, the three most important variables are age, lstat, and rm, in ascending order of importance.
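The raw array is easier to read when paired with the predictor names; the following short snippet (illustrative, not from the original text) does that with pandas:

import pandas as pd
# Map each importance score to its predictor name and sort
importances = pd.Series(regression_tree.feature_importances_, index=predictors)
print(importances.sort_values(ascending=False))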

A regression tree can be drawn in the same way as the decision tree to understand the results and predictions better.
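One way to do this, assuming Graphviz is installed on the system, is to export the fitted tree to a .dot file with sklearn.tree.export_graphviz (the output filename below is just an example):

from sklearn.tree import export_graphviz
# Render the file later with, for example: dot -Tpng boston_tree.dot -o boston_tree.png
export_graphviz(regression_tree, out_file='boston_tree.dot', feature_names=predictors)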
