Let's see whether the gradient-boosted trees model can beat the preceding result:
gbt_obj = rg.GBTRegressor(
    labelCol='Elevation'
    , minInstancesPerNode=10
    , minInfoGain=0.1
)
pip = Pipeline(stages=[vectorAssembler, gbt_obj])
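As a side note, vectorAssembler, rg, and ev were created and imported earlier in the chapter. If you want to run this snippet in isolation, a minimal, hypothetical setup might look as follows (the inputCols list is a placeholder; substitute the numeric predictor columns of your forest DataFrame):
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
import pyspark.ml.regression as rg
import pyspark.ml.evaluation as ev

# Hypothetical assembler: packs the predictor columns into a single
# 'features' vector column, which GBTRegressor expects by default
vectorAssembler = VectorAssembler(
    inputCols=['Aspect', 'Slope']  # placeholder column names
    , outputCol='features'
)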
The only change compared to the random forest regressor is that we now use the .GBTRegressor(...) class to fit the gradient-boosted trees model to our data. The most notable parameters of this class include the following (see the sketch after this list):
- maxDepth: This specifies the maximum depth of the built trees, which by default is set to 5
- maxBins: This specifies the maximum number of bins used to discretize continuous features when searching for splits, which by default is set to 32
- minInfoGain: This specifies the minimum information gain a split must achieve at a tree node for the split to be considered
- minInstancesPerNode: This specifies the minimum number of instances each child must contain after a split; splits that would leave fewer instances in a child are not performed
- lossType: This specifies the loss function, and accepts the squared or absolute values
- impurity: This is, by default, set to variance, and for now (in Spark 2.3) is the only option allowed
- maxIter: This specifies the maximum number of iterations, a stopping criterion for the algorithm; since gradient boosting builds one tree per iteration, this is also the number of trees in the ensemble
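To illustrate, here is a minimal sketch showing how these parameters could be set on the estimator; the gbt_tuned name and the parameter values are purely illustrative:
gbt_tuned = rg.GBTRegressor(
    labelCol='Elevation'
    , maxDepth=5              # maximum depth of each tree (the default)
    , maxBins=32              # bins for discretizing continuous features (the default)
    , minInfoGain=0.1         # minimum gain required for a split
    , minInstancesPerNode=10  # minimum instances in each child after a split
    , lossType='squared'      # loss function: 'squared' or 'absolute'
    , maxIter=20              # number of boosting iterations, that is, trees (the default)
)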
Let's check the performance now:
results = (
    pip
    .fit(forest)
    .transform(forest)
    .select('Elevation', 'prediction')
)
evaluator = ev.RegressionEvaluator(labelCol='Elevation')
evaluator.evaluate(results, {evaluator.metricName: 'r2'})
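The same evaluator can report other metrics by overriding metricName at evaluation time; for example, the root mean squared error:
evaluator.evaluate(results, {evaluator.metricName: 'rmse'})
Note that, as the preceding code shows, we fit and evaluate on the same forest DataFrame, so the reported score is an in-sample estimate; splitting the data with .randomSplit(...) into training and testing sets would give a more honest figure.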
As you can see from the resulting R², we have still improved over the random forest regressor, if only ever so slightly.