Analyzing forests

Random Forests provide information about the underlying dataset that most other methods cannot easily provide. A prominent example is the importance of each individual feature in the dataset. One way to estimate feature importance is to compute the Gini index at each node of each tree and compare each feature's cumulative contribution. Another method uses the out-of-bag samples. First, the out-of-bag accuracy is recorded for each base learner. Then, a single feature is chosen and its values are shuffled within the out-of-bag samples. This yields out-of-bag sample sets with the same statistical properties as the originals, but with any predictive power the chosen feature might have removed (there is now zero correlation between the feature's values and the target). The difference in accuracy between the original and the shuffled datasets is used as a measure of the selected feature's importance.
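
The following sketch shows both measures with scikit-learn. Note that scikit-learn's permutation_importance shuffles feature values on a supplied held-out set rather than on the true out-of-bag samples, so it only approximates the out-of-bag method described above; the dataset and parameter choices here are illustrative:

# Both importance measures with scikit-learn (illustrative dataset and parameters).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Gini-based importances (mean decrease in impurity, accumulated over all trees)
print(forest.feature_importances_)

# Permutation importances: the drop in accuracy after shuffling each feature.
# This uses a held-out test set, not the out-of-bag samples themselves.
result = permutation_importance(forest, X_test, y_test,
                                n_repeats=10, random_state=0)
print(result.importances_mean)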

Concerning bias and variance, although random forests seem to cope well with both, they are certainly not immune. Bias can appear when the available features are great in number, but only a few are correlated with the target. When using the recommended number of features to consider at each split (for example, the square root of the total number of features), the probability that a relevant feature will be selected at any given split can be small. The following graph shows the probability that at least one relevant feature will be selected, as a function of the number of relevant and irrelevant features (when the square root of the total number of features is considered at each split):

[Figure: Probability of selecting at least one relevant feature as a function of the number of relevant and irrelevant features]
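
The quantity plotted above can be computed directly. If m features are sampled without replacement at a split, the probability that at least one is relevant is one minus the probability that all m are irrelevant. A minimal sketch, assuming m is the square root of the total feature count rounded down:

# Probability that at least one relevant feature appears among the
# sqrt(total) features sampled at a single split.
from math import comb, isqrt

def prob_relevant_selected(relevant, irrelevant):
    total = relevant + irrelevant
    m = max(1, isqrt(total))  # features considered at each split
    if irrelevant < m:
        return 1.0  # every sample of size m must include a relevant feature
    # P(at least one relevant) = 1 - P(all m sampled features are irrelevant)
    return 1 - comb(irrelevant, m) / comb(total, m)

# With 5 relevant features buried among 995 irrelevant ones, a split
# sees a relevant feature only about 15% of the time.
print(prob_relevant_selected(5, 995))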

The Gini index measures how often a randomly sampled instance would be incorrectly classified if it were labeled at random according to the label distribution of a specific node.
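
For concreteness, a node's Gini index can be computed directly from its label distribution as 1 minus the sum of the squared class probabilities. A minimal sketch (the gini function name is our own, and labels is assumed to hold the class labels of the samples reaching the node):

# Gini index of a node from the labels of the samples that reach it.
from collections import Counter

def gini(labels):
    counts = Counter(labels)
    n = len(labels)
    # Probability of misclassifying a random sample when labeling it
    # according to the node's class distribution: 1 - sum(p_k ** 2)
    return 1 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 1, 1]))  # 0.5: maximally impure two-class node
print(gini([1, 1, 1, 1]))  # 0.0: pure node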

Variance can also appear in Random Forests, although the method is fairly resistant to it. Variance usually appears when the individual trees are allowed to grow fully. We have previously mentioned that, as the number of trees increases, the error converges to a certain limit. Although this claim still holds, it is possible that the limit itself overfits the data. Restricting the tree size (by increasing the minimum number of samples per leaf or reducing the maximum depth) can help in such circumstances.
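
A minimal sketch of such restrictions with scikit-learn follows; the specific parameter values are illustrative assumptions, not recommendations, and should be tuned by validation:

# Restricting tree growth to reduce variance (values are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,         # cap the depth instead of growing trees fully
    min_samples_leaf=5,  # require at least 5 samples in each leaf
    random_state=0,
)
forest.fit(X, y)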
