Bagging to improve results

Bootstrap aggregating, or bagging, is an ensemble algorithm introduced by Leo Breiman in 1994 that applies bootstrapping to machine learning problems. Bagging was also mentioned in the Learning with random forests recipe.

The algorithm aims to reduce the chance of overfitting with the following steps (a minimal from-scratch sketch follows the list):

  1. Generate new training sets from the input training data by sampling with replacement.
  2. Fit models to each generated training set.
  3. Combine the results of the models by averaging or majority voting.
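A minimal from-scratch sketch of these three steps, with a decision tree as the model, might look as follows (the dataset and variable names here are illustrative and not part of the recipe's code bundle):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    rng = np.random.default_rng(0)

    # Steps 1 and 2: resample rows with replacement and fit one model
    # per bootstrap sample.
    models = []
    for _ in range(25):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

    # Step 3: combine the models' results by majority voting.
    votes = np.array([model.predict(X) for model in models])
    majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    print('Accuracy of the majority vote:', (majority == y).mean())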

The scikit-learn BaggingClassifier class allows us to bootstrap training examples, and we can also bootstrap features, as in the random forests algorithm. When we perform a grid search, we refer to hyperparameters of the base estimator with the base_estimator__ prefix. We will use a decision tree as the base estimator so that we can reuse some of the hyperparameter configuration from the Learning with random forests recipe.
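To see the exact nested names this prefix convention produces, you can inspect the classifier's get_params() method. Note that this snippet assumes a scikit-learn version that still accepts the base_estimator keyword; from version 1.2 onward, the parameter and the corresponding prefix are named estimator instead:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # On scikit-learn >= 1.2, use estimator= and the estimator__ prefix.
    clf = BaggingClassifier(base_estimator=DecisionTreeClassifier())
    print([name for name in clf.get_params()
           if name.startswith('base_estimator__')])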

How to do it...

The code is in the bagging.ipynb file in this book's code bundle:

  1. The imports are as follows:
    import ch9util
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier
    import numpy as np
    import dautil as dl
    from IPython.display import HTML
  2. Load the data and create a BaggingClassifier:
    X_train, X_test, y_train, y_test = ch9util.rain_split()
    clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(
        min_samples_leaf=3, max_depth=4), random_state=43)
  3. Grid search, fit, and predict as follows:
    params = {
        'n_estimators': [320, 640],
        'bootstrap_features': [True, False],
        'base_estimator__criterion': ['gini', 'entropy']
    }
    
    gscv = GridSearchCV(estimator=clf, param_grid=params,
                        cv=5, n_jobs=-1)
    
    gscv.fit(X_train, y_train)
    preds = gscv.predict(X_test)
  4. Plot the rain forecast confusion matrix as follows:
    # The Subplotter needs a dautil Context, named after this notebook.
    context = dl.nb.Context('bagging')
    sp = dl.plotting.Subplotter(2, 2, context)
    html = ch9util.report_rain(preds, y_test, gscv.best_params_, sp.ax)
  5. Plot a validation curve for a range of ensemble sizes (the sketch after the recipe shows the scikit-learn calls these plotting helpers likely wrap):
    ntrees = 2 ** np.arange(4, 11)
    ch9util.plot_validation(sp.next_ax(), gscv.best_estimator_, 
                            X_train, y_train, 'n_estimators', ntrees)
  6. Plot a validation curve for the max_samples parameter:
    nsamples = 2 ** np.arange(4, 14)
    ch9util.plot_validation(sp.next_ax(), gscv.best_estimator_, 
                            X_train, y_train, 'max_samples', nsamples)
  7. Plot the learning curve as follows:
    ch9util.plot_learn_curve(sp.next_ax(), gscv.best_estimator_, 
                             X_train, y_train)
    HTML(html + sp.exit())

Refer to the following screenshot for the end result:

(Screenshot: a two-by-two grid showing the rain forecast confusion matrix, the validation curves for n_estimators and max_samples, and the learning curve.)
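The plot_validation() and plot_learn_curve() functions are helpers from this book's code bundle; conceptually, they are likely to wrap scikit-learn's validation_curve() and learning_curve() utilities. The following standalone sketch shows the underlying calls, with a synthetic dataset standing in for the rain data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import learning_curve, validation_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, random_state=0)
    # On scikit-learn >= 1.2, use estimator= instead of base_estimator=.
    clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=4),
                            random_state=43)

    # Validation curve: cross-validated scores for a range of ensemble sizes.
    ntrees = 2 ** np.arange(4, 8)
    train_scores, test_scores = validation_curve(
        clf, X, y, param_name='n_estimators', param_range=ntrees, cv=5)
    print('Mean test scores:', test_scores.mean(axis=1))

    # Learning curve: cross-validated scores for growing training set sizes.
    sizes, train_scores, test_scores = learning_curve(clf, X, y, cv=5)
    print('Training sizes:', sizes)
    print('Mean test scores:', test_scores.mean(axis=1))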


See also

  * The Learning with random forests recipe
  * The documentation of the BaggingClassifier class at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
