Let's see whether the gradient-boosted trees model can beat the preceding result:
gbt_obj = rg.GBTRegressor(
    labelCol='Elevation'
    , minInstancesPerNode=10
    , minInfoGain=0.1
)
pip = Pipeline(stages=[vectorAssembler, gbt_obj])
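As a side note, vectorAssembler, rg, and ev were created and imported earlier in the chapter. If you want to run this snippet in isolation, a minimal, hypothetical setup might look as follows (the inputCols list is a placeholder; substitute the numeric predictor columns of your forest DataFrame):
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
import pyspark.ml.regression as rg
import pyspark.ml.evaluation as ev

# Hypothetical assembler: packs the predictor columns into a single
# 'features' vector column, which GBTRegressor expects by default
vectorAssembler = VectorAssembler(
    inputCols=['Aspect', 'Slope']  # placeholder column names
    , outputCol='features'
)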
The only change compared to the random forest regressor is that we now use the .GBTRegressor(...) class to fit the gradient-boosted trees model to our data. The most notable parameters of this class include the following (see the sketch after this list):
- maxDepth: This specifies the maximum depth of the built trees, which by default is set to 5
- maxBins: This specifies the maximum number of bins used to discretize continuous features when searching for splits, which by default is set to 32
- minInfoGain: This specifies the minimum information gain a split must achieve at a tree node for the split to be considered
- minInstancesPerNode: This specifies the minimum number of instances each child must contain after a split; splits that would leave fewer instances in a child are not performed
- lossType: This specifies the loss function, and accepts the squared or absolute values
- impurity: This is, by default, set to variance, and for now (in Spark 2.3) is the only option allowed
- maxIter: This specifies the maximum number of iterations, a stopping criterion for the algorithm; since gradient boosting builds one tree per iteration, this is also the number of trees in the ensemble
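To illustrate, here is a minimal sketch showing how these parameters could be set on the estimator; the gbt_tuned name and the parameter values are purely illustrative:
gbt_tuned = rg.GBTRegressor(
    labelCol='Elevation'
    , maxDepth=5              # maximum depth of each tree (the default)
    , maxBins=32              # bins for discretizing continuous features (the default)
    , minInfoGain=0.1         # minimum gain required for a split
    , minInstancesPerNode=10  # minimum instances in each child after a split
    , lossType='squared'      # loss function: 'squared' or 'absolute'
    , maxIter=20              # number of boosting iterations, that is, trees (the default)
)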
Let's check the performance now:
results = (
    pip
    .fit(forest)
    .transform(forest)
    .select('Elevation', 'prediction')
)
evaluator = ev.RegressionEvaluator(labelCol='Elevation')
evaluator.evaluate(results, {evaluator.metricName: 'r2'})
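The same evaluator can report other metrics by overriding metricName at evaluation time; for example, the root mean squared error:
evaluator.evaluate(results, {evaluator.metricName: 'rmse'})
Note that, as the preceding code shows, we fit and evaluate on the same forest DataFrame, so the reported score is an in-sample estimate; splitting the data with .randomSplit(...) into training and testing sets would give a more honest figure.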
As you can see from the resulting R², we have still improved over the random forest regressor, if only ever so slightly.