We will present a simple regression example with XGBoost, using the diabetes dataset. As we will show, its usage is straightforward and similar to that of the scikit-learn estimators. XGBoost implements regression with XGBRegressor. The constructor has a fairly large number of parameters, all of which are well documented in the official documentation. In our example, we will use the n_estimators, n_jobs, max_depth, and learning_rate parameters. Following scikit-learn's conventions, they define the ensemble size, the number of parallel processes, the maximum tree depth, and the learning rate, respectively:
# --- SECTION 1 ---
# Libraries and data loading
from sklearn.datasets import load_diabetes
from xgboost import XGBRegressor
from sklearn import metrics
import numpy as np
diabetes = load_diabetes()
train_size = 400
train_x, train_y = diabetes.data[:train_size], diabetes.target[:train_size]
test_x, test_y = diabetes.data[train_size:], diabetes.target[train_size:]
np.random.seed(123456)
# --- SECTION 2 ---
# Create the ensemble
ensemble_size = 200
ensemble = XGBRegressor(n_estimators=ensemble_size, n_jobs=4,
                        max_depth=1, learning_rate=0.1,
                        objective='reg:squarederror')
The rest of the code trains the ensemble on the training data and evaluates it on the test set, and is similar to the previous examples:
# --- SECTION 3 ---
# Evaluate the ensemble
ensemble.fit(train_x, train_y)
predictions = ensemble.predict(test_x)
# --- SECTION 4 ---
# Print the metrics
r2 = metrics.r2_score(test_y, predictions)
mse = metrics.mean_squared_error(test_y, predictions)
print('Gradient Boosting:')
print('R-squared: %.2f' % r2)
print('MSE: %.2f' % mse)
XGBoost achieves an R-squared of 0.65 and an MSE of 1932.9, the best performance of all the boosting methods we have tested and implemented in this chapter. Furthermore, we did not fine-tune any of its parameters, which further demonstrates its modeling power.
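Although the out-of-the-box results are already strong, there is usually additional performance to be gained from tuning. As XGBRegressor follows the scikit-learn estimator API, it can be plugged directly into utilities such as GridSearchCV. The following is a minimal sketch of how such a search could look; the grid values are illustrative choices on our part, not recommendations from the library:
# --- Optional: hyperparameter tuning (illustrative sketch) ---
# XGBRegressor is scikit-learn compatible, so GridSearchCV can
# search over its parameters. The grid below is an example set
# of values, not a recommended configuration.
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 200, 400],
              'max_depth': [1, 2, 3],
              'learning_rate': [0.05, 0.1, 0.2]}
search = GridSearchCV(XGBRegressor(objective='reg:squarederror',
                                   n_jobs=4),
                      param_grid,
                      scoring='r2',  # select the best R-squared
                      cv=5)
search.fit(train_x, train_y)
print('Best parameters: %s' % search.best_params_)
print('Best CV R-squared: %.2f' % search.best_score_)
The best cross-validated combination can then be refit on the full training set and evaluated on the test data, exactly as before.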