Stacking and majority voting for multiple models

It is generally believed that two heads are better than one, and that a democracy should work better than a dictatorship. In machine learning, the decision makers are algorithms rather than people; when multiple classifiers or regressors work together on the same problem, we speak of ensemble learning.

There are many ensemble learning schemes. The simplest setup performs majority voting for classification and averaging for regression. As of scikit-learn 0.17, you can use the VotingClassifier class for majority voting; it also lets you emphasize or suppress individual classifiers through weights.
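
For example, here is a minimal, self-contained sketch of weighted majority voting on the Iris dataset (the dataset and base classifiers are illustrative assumptions, not the ones used in this recipe):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()

    # With hard voting, each classifier casts one vote per sample;
    # weights=[2, 1, 1] counts the logistic regression vote twice.
    voter = VotingClassifier([('lr', LogisticRegression()),
                              ('tree', DecisionTreeClassifier()),
                              ('nb', GaussianNB())],
                             voting='hard', weights=[2, 1, 1])
    voter.fit(iris.data, iris.target)
    print(voter.predict(iris.data[:3]))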

Stacking takes the outputs of machine learning estimators and uses them as inputs for another algorithm. You can, of course, feed the output of the higher-level algorithm to yet another predictor. Any arbitrary topology is possible, but for practical reasons you should try a simple setup first.
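
As a minimal, self-contained illustration of two-level stacking (the synthetic data and the logistic regression meta-learner are assumptions for this sketch, not this recipe's setup; on scikit-learn 0.17 the train_test_split import comes from sklearn.cross_validation instead):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Level 0: fit several base classifiers on the training data.
    base = [DecisionTreeClassifier(max_depth=d, random_state=0)
            for d in (2, 4, 6)]
    for clf in base:
        clf.fit(X_train, y_train)

    # Level 1: the base models' predictions become the meta-model's features.
    meta_train = np.array([clf.predict(X_train) for clf in base]).T
    meta_test = np.array([clf.predict(X_test) for clf in base]).T

    meta = LogisticRegression()
    meta.fit(meta_train, y_train)
    print('Stacked accuracy: {:.3f}'.format(meta.score(meta_test, y_test)))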

How to do it...

  1. The imports are as follows:
    import dautil as dl
    from sklearn.tree import DecisionTreeClassifier
    import numpy as np
    import ch9util
    from sklearn.ensemble import VotingClassifier
    from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+
    from IPython.display import HTML
  2. Load the data and create three decision tree classifiers:
    X_train, X_test, y_train, y_test = ch9util.rain_split()
    default = DecisionTreeClassifier(random_state=53, min_samples_leaf=3,
                                     max_depth=4)
    entropy = DecisionTreeClassifier(criterion='entropy',
                                     min_samples_leaf=3, max_depth=4,
                                     random_state=57)
    random = DecisionTreeClassifier(splitter='random', min_samples_leaf=3,
                                    max_depth=4, random_state=5)
  3. Use the classifiers to take a vote:
    clf = VotingClassifier([('default', default), 
                            ('entropy', entropy), ('random', random)])
    params = {'voting': ['soft', 'hard'],
              'weights': [None, (2, 1, 1), (1, 2, 1), (1, 1, 2)]}
    gscv = GridSearchCV(clf, param_grid=params, n_jobs=-1, cv=5)
    gscv.fit(X_train, y_train)
    votes = gscv.predict(X_test)
    
    preds = []
    preds_train = []

    for clf in [default, entropy, random]:
        clf.fit(X_train, y_train)
        preds.append(clf.predict(X_test))
        preds_train.append(clf.predict(X_train))

    preds = np.array(preds)
    preds_train = np.array(preds_train)
  4. Plot the confusion matrix for the voting-based forecast:
    %matplotlib inline
    context = dl.nb.Context('stacking_multiple')
    dl.nb.RcWidget(context)
    
    sp = dl.plotting.Subplotter(2, 2, context)
    html = ch9util.report_rain(votes, y_test, gscv.best_params_, sp.ax)
    sp.ax.set_title(sp.ax.get_title() + ' | Voting')
  5. Plot the confusion matrix for the stacking-based forecast:
    default.fit(preds_train.T, y_train)
    stacked_preds = default.predict(preds.T)
    html += ch9util.report_rain(stacked_preds, 
                                y_test, default.get_params(), sp.next_ax())
    sp.ax.set_title(sp.ax.get_title() + ' | Stacking')
  6. Plot the learning curves of the voting and stacking classifiers:
    ch9util.plot_learn_curve(sp.next_ax(), gscv.best_estimator_, X_train,
                             y_train, title='Voting')
    
    ch9util.plot_learn_curve(sp.next_ax(), default, preds_train.T,
                             y_train, title='Stacking')
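
To see that hard voting amounts to a simple majority over the base predictions, you can compare the ensemble's output with a manual vote. This quick check is a sketch with assumptions: it expects the recipe's preds and votes arrays to still be in scope, and the agreement will only be exact if the grid search selected hard voting:

    from scipy.stats import mode

    # Majority vote across the three base classifiers' test predictions.
    manual_votes = mode(preds, axis=0)[0].ravel()
    print('Agreement with VotingClassifier: {:.2%}'.format(
        np.mean(manual_votes == votes)))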

Refer to the following screenshot for the end result:

The code is in the stacking_multiple.ipynb file in this book's code bundle.
