After the short introduction to ensemble learning in the previous section, let's start with a warm-up exercise and implement a simple ensemble classifier for majority voting in Python. Although the following algorithm also generalizes to multi-class settings via plurality voting, we will use the term majority voting for simplicity, as is often done in the literature.
The algorithm that we are going to implement will allow us to combine different classification algorithms associated with individual weights for confidence. Our goal is to build a stronger meta-classifier that balances out the individual classifiers' weaknesses on a particular dataset. In more precise mathematical terms, we can write the weighted majority vote as follows:
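$$\hat{y} = \arg\max_i \sum_{j=1}^{m} w_j \, \chi_A\left(C_j(x) = i\right)$$

Here, $w_j$ is a weight associated with a base classifier, $C_j$, $\hat{y}$ is the predicted class label of the ensemble, $\chi_A$ (Greek chi) is the characteristic function $\left[C_j(x) = i \in A\right]$, and $A$ is the set of unique class labels. For equal weights, we can simplify this equation and write it as follows:

$$\hat{y} = \text{mode}\{C_1(x), C_2(x), \dots, C_m(x)\}$$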
To better understand the concept of weighting, we will now take a look at a more concrete example. Let's assume that we have an ensemble of three base classifiers $C_j$ ($j \in \{1, 2, 3\}$) and want to predict the class label of a given sample instance $x$. Two out of three base classifiers predict the class label 0, and one, $C_3$, predicts that the sample belongs to class 1. If we weight the predictions of each base classifier equally, the majority vote will predict that the sample belongs to class 0:

$$C_1(x) \rightarrow 0, \quad C_2(x) \rightarrow 0, \quad C_3(x) \rightarrow 1$$

$$\hat{y} = \text{mode}\{0, 0, 1\} = 0$$
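Now let's assign a weight of 0.6 to $C_3$ and weight $C_1$ and $C_2$ by a coefficient of 0.2 each:

$$\hat{y} = \arg\max_i \sum_{j=1}^{3} w_j \, \chi_A\left(C_j(x) = i\right) = \arg\max_i \left[0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1\right] = 1$$

More intuitively, since $3 \times 0.2 = 0.6$, we can say that the prediction made by $C_3$ carries three times the weight of the prediction by $C_1$ or $C_2$. We can write this as follows:

$$\hat{y} = \text{mode}\{0, 0, 1, 1, 1\} = 1$$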
To translate the concept of the weighted majority vote into Python code, we can use NumPy's convenient argmax and bincount functions:
>>> import numpy as np
>>> np.argmax(np.bincount([0, 0, 1],
...           weights=[0.2, 0.2, 0.6]))
1
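The same call extends from one sample to a whole batch. The following is a minimal sketch rather than part of the original listing: the array `predictions` and the helper `weighted_vote` are hypothetical names introduced for illustration, and the labels are assumed to be the integers 0 and 1.

```python
import numpy as np

# Hypothetical label predictions: one row per sample, one column per classifier
predictions = np.array([[0, 0, 1],
                        [1, 0, 1],
                        [0, 0, 0]])
weights = [0.2, 0.2, 0.6]


def weighted_vote(row, weights):
    """Return the label with the largest summed weight for one sample."""
    # minlength=2 keeps one slot per class even if a class receives no vote
    counts = np.bincount(row, weights=weights, minlength=2)
    return np.argmax(counts)


# Summed weights per class for the first sample (0.4 for class 0, 0.6 for class 1)
first_counts = np.bincount(predictions[0], weights=weights, minlength=2)

# Apply the vote row by row, that is, once per sample
maj_vote = np.apply_along_axis(weighted_vote, 1, predictions, weights)
print(first_counts)
print(maj_vote)
```

For the first two samples, the heavily weighted third classifier decides the vote in favor of class 1; for the third sample, all classifiers agree on class 0.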
As discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, certain classifiers in scikit-learn can also return the probability of a predicted class label via the predict_proba method. Using the predicted class probabilities instead of the class labels for majority voting can be useful if the classifiers in our ensemble are well calibrated. The modified version of the majority vote for predicting class labels from probabilities can be written as follows:

$$\hat{y} = \arg\max_i \sum_{j=1}^{m} w_j \, p_{ij}$$

Here, $p_{ij}$ is the predicted probability of the $j$th classifier for class label $i$.
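To continue with our previous example, let's assume that we have a binary classification problem with class labels $i \in \{0, 1\}$ and an ensemble of three classifiers $C_j$ ($j \in \{1, 2, 3\}$). Let's assume that the classifiers $C_j$ return the following class membership probabilities for a particular sample $x$:

$$C_1(x) \rightarrow [0.9, 0.1], \quad C_2(x) \rightarrow [0.8, 0.2], \quad C_3(x) \rightarrow [0.4, 0.6]$$

We can then calculate the individual class probabilities as follows:

$$p(i_0 \mid x) = 0.2 \times 0.9 + 0.2 \times 0.8 + 0.6 \times 0.4 = 0.58$$

$$p(i_1 \mid x) = 0.2 \times 0.1 + 0.2 \times 0.2 + 0.6 \times 0.6 = 0.42$$

$$\hat{y} = \arg\max_i \left[p(i_0 \mid x), \, p(i_1 \mid x)\right] = 0$$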
To implement the weighted majority vote based on class probabilities, we can again make use of NumPy, using np.average and np.argmax:
>>> ex = np.array([[0.9, 0.1],
...                [0.8, 0.2],
...                [0.4, 0.6]])
>>> p = np.average(ex, axis=0, weights=[0.2, 0.2, 0.6])
>>> p
array([0.58, 0.42])
>>> np.argmax(p)
0
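With more than one sample, the per-classifier probability matrices can be stacked along a new first axis and averaged in a single call. This is a sketch under stated assumptions: the array `probas` is hypothetical and stands in for the stacked `predict_proba` outputs of three classifiers on two samples.

```python
import numpy as np

# Hypothetical stacked predict_proba outputs with shape
# (n_classifiers, n_samples, n_classes) = (3, 2, 2)
probas = np.array([[[0.9, 0.1], [0.3, 0.7]],
                   [[0.8, 0.2], [0.4, 0.6]],
                   [[0.4, 0.6], [0.1, 0.9]]])
weights = [0.2, 0.2, 0.6]

# Weighted average over the classifier axis; the result has
# shape (n_samples, n_classes)
avg_proba = np.average(probas, axis=0, weights=weights)

# One predicted label per sample: the class with the largest average probability
labels = np.argmax(avg_proba, axis=1)
print(avg_proba)
print(labels)
```

The first row of `avg_proba` reproduces the 0.58 and 0.42 computed by hand above, so the first sample is assigned to class 0.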
Putting everything together, let's now implement a MajorityVoteClassifier in Python:
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator

try:
    # six is bundled with older scikit-learn releases
    from sklearn.externals import six
except ImportError:
    # later releases dropped the bundled copy; fall back to the
    # standalone six package
    import six


class MajorityVoteClassifier(BaseEstimator, ClassifierMixin):
    """A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
        Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'}
        Default: 'classlabel'
        If 'classlabel' the prediction is based on
        the argmax of class labels. Else if
        'probability', the argmax of the sum of
        probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers]
        Optional, default: None
        If a list of `int` or `float` values are
        provided, the classifiers are weighted by
        importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers,
                 vote='classlabel', weights=None):
        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        # Use LabelEncoder to ensure class labels start
        # with 0, which is important for np.argmax
        # call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            # Fit a fresh copy so the estimators passed in stay untouched
            fitted_clf = clone(clf).fit(X,
                                        self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self
The code contains a lot of comments to make the individual parts easier to understand. However, before we implement the remaining methods, let's take a quick break and discuss some of the code that may look confusing at first. We used the parent classes BaseEstimator and ClassifierMixin to get some base functionality for free: the methods get_params and set_params to return and set the classifier's parameters, and the score method to calculate the prediction accuracy. Also note that we imported six to make the MajorityVoteClassifier compatible with Python 2.7.
Next, we will add the predict method to predict the class label via majority vote based on the class labels if we initialize a new MajorityVoteClassifier object with vote='classlabel'. Alternatively, we will be able to initialize the ensemble classifier with vote='probability' to predict the class label based on the class membership probabilities. Furthermore, we will also add a predict_proba method to return the average probabilities, which is useful for computing the Receiver Operating Characteristic area under the curve (ROC AUC).
    def predict(self, X):
        """Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Matrix of training samples.

        Returns
        -------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.

        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote
            # Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in
                                      self.classifiers_]).T
            maj_vote = np.apply_along_axis(
                lambda x:
                np.argmax(np.bincount(x,
                                      weights=self.weights)),
                axis=1,
                arr=predictions)
        # Map the encoded labels back to the original class labels
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Training vectors, where n_samples is
            the number of samples and
            n_features is the number of features.

        Returns
        -------
        avg_proba : array-like,
            shape = [n_samples, n_classes]
            Weighted average probability for
            each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas,
                               axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier,
                         self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(
                        step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out
Also, note that we defined our own modified version of the get_params method, which uses the _name_estimators function in order to access the parameters of individual classifiers in the ensemble. This may look a little bit complicated at first, but it will make perfect sense when we use grid search for hyperparameter tuning in later sections.
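The `name__parameter` keys that this get_params builds follow the same double-underscore convention that scikit-learn's own Pipeline uses for its steps. As a brief illustration (the step names 'sc' and 'clf' and the parameter values are chosen here only for the example):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('sc', StandardScaler()),
                 ('clf', LogisticRegression(C=0.001))])

# deep=True also lists the parameters of every step,
# prefixed with the step name and two underscores
params = pipe.get_params(deep=True)
c_before = params['clf__C']

# The same keys can be used to update a nested parameter
pipe.set_params(clf__C=10.0)
c_after = pipe.named_steps['clf'].C
print(c_before, c_after)
```

Grid search relies on exactly these keys to address the hyperparameters of nested estimators, which is why the ensemble has to expose its members' parameters in this form.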
Now it is about time to put the MajorityVoteClassifier that we implemented in the previous section into action. But first, let's prepare a dataset that we can test it on. Since we are already familiar with techniques to load datasets from CSV files, we will take a shortcut and load the Iris dataset from scikit-learn's dataset module. Furthermore, we will only select two features, sepal width and petal length, to make the classification task more challenging. Although our MajorityVoteClassifier generalizes to multiclass problems, we will only classify flower samples from the two classes, Iris-Versicolor and Iris-Virginica, to compute the ROC AUC. The code is as follows:
>>> from sklearn import datasets
>>> # train_test_split lives in sklearn.model_selection;
>>> # releases before 0.18 provide it in sklearn.cross_validation
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.preprocessing import LabelEncoder
>>> iris = datasets.load_iris()
>>> X, y = iris.data[50:, [1, 2]], iris.target[50:]
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)
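The label-encoding step matters because the two remaining Iris classes carry the integer targets 1 and 2, whereas the argmax-based voting expects labels that start at 0. The effect can be sketched with NumPy alone; this illustrates the idea and is not a description of how LabelEncoder is implemented, and `y_raw` is a hypothetical label vector.

```python
import numpy as np

# Hypothetical raw targets that start at 1 instead of 0
y_raw = np.array([1, 2, 2, 1, 2])

# classes holds the sorted unique labels; y_encoded holds, for each
# sample, the index of its label in classes
classes, y_encoded = np.unique(y_raw, return_inverse=True)

# Indexing classes with the encoded labels restores the original labels
y_decoded = classes[y_encoded]
print(classes, y_encoded, y_decoded)
```

Encoding maps the labels 1 and 2 to 0 and 1, and the lookup in the last step plays the role of the inverse transform.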
Note that scikit-learn uses the predict_proba method (if applicable) to compute the ROC AUC score. In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we saw how the class probabilities are computed in logistic regression models. In decision trees, the probabilities are calculated from a frequency vector that is created for each node at training time. The vector collects the frequency values of each class label computed from the class label distribution at that node. Then the frequencies are normalized so that they sum up to 1. Similarly, the class labels of the k-nearest neighbors are aggregated to return the normalized class label frequencies in the k-nearest neighbors algorithm. Although the normalized probabilities returned by both the decision tree and k-nearest neighbors classifier may look similar to the probabilities obtained from a logistic regression model, we have to be aware that these are actually not derived from probability mass functions.
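The frequency-based probabilities described above amount to a simple normalization of class counts. The following sketch uses made-up numbers; `leaf_counts` and `neighbor_labels` are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical class counts of the training samples that ended up in
# one decision tree node: three samples of class 0, one of class 1
leaf_counts = np.array([3, 1])
tree_proba = leaf_counts / leaf_counts.sum()

# Hypothetical class labels of the k = 5 nearest neighbors of a sample
neighbor_labels = np.array([1, 1, 0, 1, 1])
knn_proba = np.bincount(neighbor_labels, minlength=2) / len(neighbor_labels)

print(tree_proba)
print(knn_proba)
```

Both vectors sum to 1, but they are relative frequencies. With a single neighbor, for instance, the frequencies can only be 0 or 1.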
Next, we split the Iris samples into 50 percent training and 50 percent test data:
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y,
...     test_size=0.5,
...     random_state=1)
Using the training dataset, we will now train three different classifiers (a logistic regression classifier, a decision tree classifier, and a k-nearest neighbors classifier) and look at their individual performances via 10-fold cross-validation on the training dataset before we combine them into an ensemble classifier:
>>> # cross_val_score lives in sklearn.model_selection;
>>> # releases before 0.18 provide it in sklearn.cross_validation
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> clf1 = LogisticRegression(penalty='l2',
...                           C=0.001,
...                           random_state=0)
>>> clf2 = DecisionTreeClassifier(max_depth=1,
...                               criterion='entropy',
...                               random_state=0)
>>> clf3 = KNeighborsClassifier(n_neighbors=1,
...                             p=2,
...                             metric='minkowski')
>>> pipe1 = Pipeline([['sc', StandardScaler()],
...                   ['clf', clf1]])
>>> pipe3 = Pipeline([['sc', StandardScaler()],
...                   ['clf', clf3]])
>>> clf_labels = ['Logistic Regression', 'Decision Tree', 'KNN']
>>> print('10-fold cross validation: ')
>>> for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
...     scores = cross_val_score(estimator=clf,
...                              X=X_train,
...                              y=y_train,
...                              cv=10,
...                              scoring='roc_auc')
...     print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
...           % (scores.mean(), scores.std(), label))
The output, shown in the following snippet, indicates that the predictive performances of the individual classifiers are almost equal:
10-fold cross validation:
ROC AUC: 0.92 (+/- 0.20) [Logistic Regression]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.93 (+/- 0.10) [KNN]
You may be wondering why we trained the logistic regression and k-nearest neighbors classifier as part of a pipeline. The reason behind it is that, as discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, both the logistic regression and k-nearest neighbors algorithms (using the Euclidean distance metric) are not scale-invariant in contrast with decision trees. Although the Iris features are all measured on the same scale (cm), it is a good habit to work with standardized features.
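What the standardization step computes can be sketched in a few lines of NumPy. The feature matrix `X_demo` is hypothetical, with deliberately mismatched scales to make the effect visible; inside a pipeline, the mean and standard deviation would be estimated on the training folds only.

```python
import numpy as np

# Hypothetical features measured on very different scales
X_demo = np.array([[2.0, 200.0],
                   [3.0, 400.0],
                   [4.0, 600.0]])

# Column-wise standardization: subtract the mean and divide by the
# (population) standard deviation of each feature
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
X_demo_std = (X_demo - mean) / std

# Euclidean distance between the first two samples before and after
dist_raw = np.linalg.norm(X_demo[0] - X_demo[1])
dist_std = np.linalg.norm(X_demo_std[0] - X_demo_std[1])
print(X_demo_std)
print(dist_raw, dist_std)
```

Before scaling, the distance between the first two samples is dominated almost entirely by the second feature; after scaling, both features contribute equally, which is what a distance-based method such as k-nearest neighbors needs.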
Now let's move on to the more exciting part and combine the individual classifiers for majority rule voting in our MajorityVoteClassifier:
>>> mv_clf = MajorityVoteClassifier(
...     classifiers=[pipe1, clf2, pipe3])
>>> clf_labels += ['Majority Voting']
>>> all_clf = [pipe1, clf2, pipe3, mv_clf]
>>> for clf, label in zip(all_clf, clf_labels):
...     scores = cross_val_score(estimator=clf,
...                              X=X_train,
...                              y=y_train,
...                              cv=10,
...                              scoring='roc_auc')
...     print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
...           % (scores.mean(), scores.std(), label))
ROC AUC: 0.92 (+/- 0.20) [Logistic Regression]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.93 (+/- 0.10) [KNN]
ROC AUC: 0.97 (+/- 0.10) [Majority Voting]
As we can see, the MajorityVoteClassifier performs substantially better than the individual classifiers in the 10-fold cross-validation evaluation.