Summary

In this chapter, we presented the main concept of creating bootstrap samples and estimating bootstrap statistics. Building on this foundation, we introduced bootstrap aggregating, or bagging, which uses a number of bootstrap samples to train many base learners that utilize the same machine learning algorithm. Later, we provided a custom implementation of bagging for classification, as well as the means to parallelize it. Finally, we showcased the use of scikit-learn's own implementation of bagging for regression and classification problems.

The chapter can be summarized as follows:

- Bootstrap samples are created by resampling with replacement from the original dataset. The main idea is to treat the original sample as the population, and each bootstrap sample as an original sample drawn from that population.
- If the bootstrap sample has the same size as the original dataset, each instance has a probability of approximately 63.2% of being included in it at least once (see the sketch below).
- Bootstrap methods are useful for calculating statistics such as confidence intervals and standard errors, without making assumptions about the underlying distribution.
- Bagging generates a number of bootstrap samples and trains one base learner on each of them.
- Bagging benefits unstable learners, where small variations in the training set induce large variations in the generated model.
- Bagging is a suitable ensemble learning method for reducing variance.
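
The 63.2% figure follows from the fact that each instance is missed by a single draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The following minimal sketch, assuming NumPy and an arbitrary dataset size of 10,000, checks this empirically by drawing one bootstrap sample and measuring the fraction of original instances it contains:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n = 10_000                      # assumed size of the original dataset (and of the bootstrap sample)
original_indices = np.arange(n)

# Draw a bootstrap sample: resample n indices with replacement.
bootstrap_indices = rng.choice(original_indices, size=n, replace=True)

# Fraction of original instances that appear at least once in the bootstrap sample.
included = np.unique(bootstrap_indices).size / n
print(f"Included fraction: {included:.3f}")   # close to 1 - 1/e ~= 0.632
```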


Bagging allows for easy parallelization, as each bootstrap sample and base learner can be generated, trained, and tested independently. As with all ensemble learning methods, using bagging reduces interpretability, making it harder to explain the motivation behind individual predictions.
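
As a brief sketch of this parallelism, scikit-learn's BaggingClassifier accepts an n_jobs parameter that trains and queries the base learners across multiple CPU cores. The synthetic dataset and the parameter values below are illustrative assumptions, not taken from the chapter's experiments:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset; any classification dataset works the same way.
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 50 base learners, each trained on its own bootstrap sample;
# n_jobs=-1 distributes training over all available CPU cores.
ensemble = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    n_jobs=-1,
    random_state=1,
)
ensemble.fit(x_train, y_train)
print(f"Test accuracy: {ensemble.score(x_test, y_test):.3f}")
```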

In the next chapter, we will introduce the second generative method, Boosting.
