Modelling data with Random Forest

Random Forest is an algorithm that can be applied to different problems - binomial classification, as we showed in the previous chapter, regression, or multiclass classification. The beauty of Random Forest is that it combines multiple weak learners, represented by decision trees, into one ensemble.
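The following is a minimal sketch of this idea using scikit-learn on synthetic data rather than the H2O toolbox used in this book; the dataset, parameters, and names are illustrative assumptions, not part of the text.

```python
# Illustrative sketch (scikit-learn, not H2O): a Random Forest combines many
# decision trees into one ensemble and handles multiclass classification out
# of the box; RandomForestRegressor covers the regression case.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for real data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An ensemble of 100 decision trees combined by majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```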

Furthermore, to reduce the variance of individual decision trees, the algorithm uses the concept of bagging (bootstrap aggregation). Each decision tree is trained on a subset of the data generated by random sampling with replacement.
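The bootstrap step itself is simple enough to sketch directly; the numbers below are only an assumed illustration of sampling with replacement, not code from the book.

```python
# A minimal sketch of the bootstrap step behind bagging: each tree sees a
# sample of the same size as the original data, drawn with replacement, so
# roughly a third of the rows are left out of any given tree's training set.
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000
data_indices = np.arange(n_rows)

# Indices of the bootstrap sample used to train one tree
bootstrap = rng.choice(data_indices, size=n_rows, replace=True)
out_of_bag = np.setdiff1d(data_indices, bootstrap)

print("Unique rows in bootstrap sample:", np.unique(bootstrap).size)  # ~632
print("Out-of-bag rows:", out_of_bag.size)                            # ~368
```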

Do not confuse bagging with boosting. Boosting builds an ensemble incrementally by training each new model to emphasize the observations that the previous models misclassified. Typically, after a weak model is added to the ensemble, the data is reweighted: observations that are misclassified gain weight, and vice versa. Furthermore, bagging can be run in parallel, while boosting is a sequential process. Nevertheless, the goal of boosting is the same as that of bagging - to combine the predictions of several weak models in order to improve generalization and robustness over a single model.
An example of a boosting method is the Gradient Boosting Machine (GBM), which uses boosting to combine weak models (decision trees) into an ensemble; however, it generalizes the approach by allowing the use of an arbitrary loss function. Instead of reweighting the observations misclassified by the previous weak model, GBM lets you minimize a specified loss function (for example, mean squared error for regression) by fitting each new model to the gradient of that loss.
There are different variations of GBM - for example, stochastic GBM, which combines boosting with bagging. Both the regular GBM and the stochastic GBM are available in H2O's machine learning toolbox. Furthermore, it is important to mention that GBM (as well as RandomForest) is an algorithm that builds reasonably good models without extensive tuning.
More information about GBM is available in the original paper by J.H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine: http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf.
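A brief sketch of these two points - choosing the loss and combining boosting with bagging - in scikit-learn terms follows; it is an assumed illustration rather than H2O's GBM API, and the parameter names assume a recent scikit-learn release.

```python
# Illustrative GBM sketch (scikit-learn, not H2O): the loss parameter selects
# the loss function being minimized, and subsample < 1.0 turns the model into
# a stochastic GBM, i.e. each tree is fit on a random fraction of the rows.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbm = GradientBoostingRegressor(
    loss="squared_error",   # mean squared error; other losses can be plugged in
    n_estimators=200,
    learning_rate=0.05,
    subsample=0.8,          # < 1.0 => stochastic GBM (boosting combined with bagging)
    random_state=42,
)
gbm.fit(X_train, y_train)
print("Test R^2:", gbm.score(X_test, y_test))
```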

Moreover, RandomForest employs so-called "feature bagging" - while building a decision tree, it selects only a random subset of features to consider for each split decision. The motivation is to build weaker, more diverse learners and enhance generalization. For example, if one of the features is a strong predictor of the target variable, it would be selected by the majority of trees, resulting in highly similar trees. By selecting features at random, the algorithm can avoid the strong predictor and build trees that discover finer-grained structure in the data.
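In scikit-learn terms, this corresponds to the max_features parameter (H2O's Random Forest exposes a comparable option); the short sketch below is an assumed illustration of that knob, not the book's code.

```python
# Feature bagging sketch: each split considers only sqrt(20) ~ 4 randomly
# chosen features, which decorrelates the trees even when one feature
# would otherwise dominate every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
```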

RandomForest also makes it easy to select the most predictive features, since it allows variable importance to be computed in different ways. For example, computing a feature's overall impurity gain across all trees gives a good estimate of how strong the feature is.
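As an assumed illustration of this impurity-based importance, scikit-learn exposes it directly as feature_importances_ (H2O reports a similar ranking in its variable importance output):

```python
# Variable importance sketch: feature_importances_ aggregates each feature's
# impurity gain over all trees in the ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by their mean impurity gain across the ensemble
ranking = np.argsort(rf.feature_importances_)[::-1]
for idx in ranking[:3]:
    print(f"feature_{idx}: importance {rf.feature_importances_[idx]:.3f}")
```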

From an implementation point of view, RandomForest can be easily parallelized, since each tree is built independently. On the other hand, distributing the RandomForest computation is a slightly harder problem, since each tree needs to explore almost the full dataset.
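In a single-machine setting, this independence is exposed as a simple parallelism knob; the snippet below is an assumed scikit-learn illustration, not H2O's distributed implementation.

```python
# Because each tree is grown independently, the forest can be built in
# parallel; here n_jobs=-1 uses all available CPU cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X, y)
```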

The disadvantage of RandomForest is its limited interpretability. The resulting ensemble is hard to explore, and it is difficult to explain the interactions between individual trees. However, it is still one of the best models to use when we need a good model without advanced parameter tuning.

A good source of information about RandomForest is the original description by Leo Breiman and Adele Cutler, available, for example, here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.