Stacking for regression

Here, we create a stacking ensemble for the diabetes regression dataset. The ensemble consists of three base learners: a 5-neighbor k-Nearest Neighbors (k-NN) regressor, a decision tree limited to a maximum depth of four, and a ridge regression (a regularized form of least squares regression). The meta-learner is a simple Ordinary Least Squares (OLS) linear regression.

First, we import the required libraries and data. Scikit-learn provides a convenient way to split data into K folds: the KFold class from the sklearn.model_selection module. As in previous chapters, we use the first 400 instances for training and the remaining instances for testing:

# --- SECTION 1 ---
# Libraries and data loading
from sklearn.datasets import load_diabetes
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn import metrics
import numpy as np
diabetes = load_diabetes()

train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]

In the following code, we instantiate the base and meta-learners. In order to have ease of access to the individual base learners later on, we store each base learner in a list, called base_learners:

# --- SECTION 2 ---
# Create the ensemble's base learners and meta-learner
# Append base learners to a list for ease of access
base_learners = []

knn = KNeighborsRegressor(n_neighbors=5)
base_learners.append(knn)

dtr = DecisionTreeRegressor(max_depth=4, random_state=123456)
base_learners.append(dtr)

ridge = Ridge()
base_learners.append(ridge)

meta_learner = LinearRegression()

After instantiating our learners, we need to create the metadata for the training set. We split the training set into five folds by first creating a KFold object, specifying the number of splits (K) with KFold(n_splits=5), and then calling KF.split(train_x). This, in turn, returns a generator for the train and test indices of the five splits. For each of these splits, we use the data indicated by train_indices (four folds) to train our base learners and create metadata on the data corresponding to test_indices (the remaining fold). We store the metadata for each base learner in the meta_data array, one row per learner, and the corresponding targets in the meta_targets array. Finally, we transpose meta_data in order to get an (instance, feature) shape:

# --- SECTION 3 ---
# Create the training metadata

# Create variables to store metadata and their targets
meta_data = np.zeros((len(base_learners), len(train_x)))
meta_targets = np.zeros(len(train_x))

# Create the cross-validation folds
KF = KFold(n_splits=5)
meta_index = 0
for train_indices, test_indices in KF.split(train_x):
  # Train each learner on the K-1 folds 
  # and create metadata for the Kth fold
  for i in range(len(base_learners)):
    learner = base_learners[i]
    learner.fit(train_x[train_indices], train_y[train_indices])
    predictions = learner.predict(train_x[test_indices])
    meta_data[i][meta_index:meta_index+len(test_indices)] = predictions

  meta_targets[meta_index:meta_index+len(test_indices)] = train_y[test_indices]
  meta_index += len(test_indices)

# Transpose the metadata to be fed into the meta-learner
meta_data = meta_data.transpose()
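
The meta_index bookkeeping above writes each fold's predictions into the next free slice of meta_data. This lines up with the original instance order only because KFold does not shuffle by default, so the test folds come back as contiguous, ordered blocks. The following is a minimal sketch of that behavior on a hypothetical 10-instance toy array (toy_x is not part of the chapter's data):

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 instances with a single feature
toy_x = np.arange(10).reshape(-1, 1)

# shuffle defaults to False, so folds are taken in order
kf = KFold(n_splits=5)
test_folds = [test_idx.tolist() for _, test_idx in kf.split(toy_x)]
print(test_folds)
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# Concatenating the test folds recovers the original instance order
concatenated = np.concatenate(test_folds)
in_original_order = np.array_equal(concatenated, np.arange(len(toy_x)))
print(in_original_order)
# True
```

If the folds were shuffled (for example, with shuffle=True), slice-based bookkeeping would still pair each prediction with its own target, but the rows would no longer follow the original instance order.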

For the test set, we do not need to split it into folds. We simply train the base learners on the whole train set and predict on the test set. Furthermore, we evaluate each base learner and store the evaluation metrics, in order to compare them with the ensemble's performance. As this is a regression problem, we use R-squared and Mean Squared Error (MSE) as evaluation metrics:

# --- SECTION 4 ---
# Create the metadata for the test set and evaluate the base learners
test_meta_data = np.zeros((len(base_learners), len(test_x)))
base_errors = []
base_r2 = []
for i in range(len(base_learners)):
  learner = base_learners[i]
  learner.fit(train_x, train_y)
  predictions = learner.predict(test_x)
  test_meta_data[i] = predictions

  err = metrics.mean_squared_error(test_y, predictions)
  r2 = metrics.r2_score(test_y, predictions)

  base_errors.append(err)
  base_r2.append(r2)

test_meta_data = test_meta_data.transpose()
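
For reference, the two metrics used above can be written out directly from their definitions. The following is a small sketch in NumPy, not scikit-learn's implementation; the helper names mse and r_squared and the toy values are assumptions made for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy targets and predictions
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 9.0]

toy_mse = mse(toy_true, toy_pred)       # squared errors 1, 0, 1, 0 -> mean 0.5
toy_r2 = r_squared(toy_true, toy_pred)  # 1 - 2/20 = 0.9
print(f'{toy_mse:.2f} {toy_r2:.2f}')
# 0.50 0.90
```

Lower MSE is better, while R-squared is better the closer it gets to 1.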

Now that we have the metadata for both the train and test sets, we can train our meta-learner on the train set and evaluate it on the test set:

# --- SECTION 5 ---
# Fit the meta-learner on the train set and evaluate it on the test set
meta_learner.fit(meta_data, meta_targets)
ensemble_predictions = meta_learner.predict(test_meta_data)

err = metrics.mean_squared_error(test_y, ensemble_predictions)
r2 = metrics.r2_score(test_y, ensemble_predictions)

# --- SECTION 6 ---
# Print the results 
print('ERROR R2 Name')
print('-'*20)
for i in range(len(base_learners)):
  learner = base_learners[i]
  print(f'{base_errors[i]:.1f} {base_r2[i]:.2f} {learner.__class__.__name__}')
print(f'{err:.1f} {r2:.2f} Ensemble')

We get the following output:

ERROR R2 Name
--------------------
2697.8 0.51 KNeighborsRegressor
3142.5 0.43 DecisionTreeRegressor
2564.8 0.54 Ridge
2066.6 0.63 Ensemble

As is evident, the ensemble's R-squared is over 16% higher (in relative terms) than that of the best base learner (ridge regression), while its MSE is almost 20% lower. This is a considerable improvement.
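
The relative improvements quoted above follow from the rounded values in the printed table. As a quick check, using only those printed figures (the variable names are chosen here for illustration):

```python
# Values copied from the printed results table
best_base_mse, best_base_r2 = 2564.8, 0.54   # Ridge
ensemble_mse, ensemble_r2 = 2066.6, 0.63     # Ensemble

# Relative change with respect to the best base learner, in percent
r2_gain = (ensemble_r2 - best_base_r2) / best_base_r2 * 100
mse_reduction = (best_base_mse - ensemble_mse) / best_base_mse * 100

print(f'{r2_gain:.1f}% {mse_reduction:.1f}%')
# 16.7% 19.4%
```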
