Bagging to improve results

Bootstrap aggregating, or bagging, is an ensemble algorithm introduced by Leo Breiman in 1994 that applies bootstrapping to machine learning problems. Bagging was also mentioned in the Learning with random forests recipe.

The algorithm aims to reduce the chance of overfitting with the following steps (a minimal from-scratch sketch follows the list):

  1. Generate new training sets from the input training data by sampling with replacement.
  2. Fit models to each generated training set.
  3. Combine the results of the models by averaging or majority voting.
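A minimal from-scratch sketch of these three steps, with a decision tree as the model, might look as follows (the dataset and variable names here are illustrative and not part of the recipe's code bundle):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    rng = np.random.default_rng(0)

    # Steps 1 and 2: resample rows with replacement and fit one model
    # per bootstrap sample.
    models = []
    for _ in range(25):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

    # Step 3: combine the models' results by majority voting.
    votes = np.array([model.predict(X) for model in models])
    majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    print('Accuracy of the majority vote:', (majority == y).mean())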

The scikit-learn BaggingClassifier class allows us to bootstrap training examples, and we can also bootstrap features, as in the random forests algorithm. When we perform a grid search, we refer to hyperparameters of the base estimator with the base_estimator__ prefix. We will use a decision tree as the base estimator so that we can reuse some of the hyperparameter configuration from the Learning with random forests recipe.
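To see the exact nested names this prefix convention produces, you can inspect the classifier's get_params() method. Note that this snippet assumes a scikit-learn version that still accepts the base_estimator keyword; from version 1.2 onward, the parameter and the corresponding prefix are named estimator instead:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # On scikit-learn >= 1.2, use estimator= and the estimator__ prefix.
    clf = BaggingClassifier(base_estimator=DecisionTreeClassifier())
    print([name for name in clf.get_params()
           if name.startswith('base_estimator__')])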

How to do it...

The code is in the bagging.ipynb file in this book's code bundle:

  1. The imports are as follows:
    import ch9util
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier
    import numpy as np
    import dautil as dl
    from IPython.display import HTML
  2. Load the data and create a BaggingClassifier:
    X_train, X_test, y_train, y_test = ch9util.rain_split()
    clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(
        min_samples_leaf=3, max_depth=4), random_state=43)
  3. Grid search, fit, and predict as follows:
    params = {
        'n_estimators': [320, 640],
        'bootstrap_features': [True, False],
        'base_estimator__criterion': ['gini', 'entropy']
    }
    
    gscv = GridSearchCV(estimator=clf, param_grid=params,
                        cv=5, n_jobs=-1)
    
    gscv.fit(X_train, y_train)
    preds = gscv.predict(X_test)
  4. Plot the rain forecast confusion matrix as follows:
    # The Subplotter needs a dautil Context, named after this notebook.
    context = dl.nb.Context('bagging')
    sp = dl.plotting.Subplotter(2, 2, context)
    html = ch9util.report_rain(preds, y_test, gscv.best_params_, sp.ax)
  5. Plot a validation curve for a range of ensemble sizes (the sketch after the recipe shows the scikit-learn calls these plotting helpers likely wrap):
    ntrees = 2 ** np.arange(4, 11)
    ch9util.plot_validation(sp.next_ax(), gscv.best_estimator_, 
                            X_train, y_train, 'n_estimators', ntrees)
  6. Plot a validation curve for the max_samples parameter:
    nsamples = 2 ** np.arange(4, 14)
    ch9util.plot_validation(sp.next_ax(), gscv.best_estimator_, 
                            X_train, y_train, 'max_samples', nsamples)
  7. Plot the learning curve as follows:
    ch9util.plot_learn_curve(sp.next_ax(), gscv.best_estimator_, 
                             X_train, y_train)
    HTML(html + sp.exit())

Refer to the following screenshot for the end result:

(Screenshot: a two-by-two grid showing the rain forecast confusion matrix, the validation curves for n_estimators and max_samples, and the learning curve.)
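The plot_validation() and plot_learn_curve() functions are helpers from this book's code bundle; conceptually, they are likely to wrap scikit-learn's validation_curve() and learning_curve() utilities. The following standalone sketch shows the underlying calls, with a synthetic dataset standing in for the rain data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import learning_curve, validation_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, random_state=0)
    # On scikit-learn >= 1.2, use estimator= instead of base_estimator=.
    clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=4),
                            random_state=43)

    # Validation curve: cross-validated scores for a range of ensemble sizes.
    ntrees = 2 ** np.arange(4, 8)
    train_scores, test_scores = validation_curve(
        clf, X, y, param_name='n_estimators', param_range=ntrees, cv=5)
    print('Mean test scores:', test_scores.mean(axis=1))

    # Learning curve: cross-validated scores for growing training set sizes.
    sizes, train_scores, test_scores = learning_curve(clf, X, y, cv=5)
    print('Training sizes:', sizes)
    print('Mean test scores:', test_scores.mean(axis=1))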


See also

  * The Learning with random forests recipe
  * The documentation of the BaggingClassifier class at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
