There's more...

Let's see whether the gradient-boosted trees model can beat the preceding result:

gbt_obj = rg.GBTRegressor(
    labelCol='Elevation'
    , minInstancesPerNode=10
    , minInfoGain=0.1
)

pip = Pipeline(stages=[vectorAssembler, gbt_obj])
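
Note that, as in the main recipe, rg and (later) ev are assumed to be aliases for PySpark's ML modules, and vectorAssembler is the VectorAssembler built earlier; if you run this section in isolation, the following imports bring the module aliases into scope:

import pyspark.ml.regression as rg
import pyspark.ml.evaluation as ev
from pyspark.ml import Pipeline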

The only change compared to the random forest regressor is that we now use the .GBTRegressor(...) class to fit a gradient-boosted trees model to our data. The most notable parameters of this class are listed below; a sketch that sets all of them explicitly follows the list:

  • maxDepth: This specifies the maximum depth of the built trees, which by default is set to 5
  • maxBins: This specifies the maximum number of bins used to discretize continuous features
  • minInfoGain: This specifies the minimum information gain a split has to achieve for a node to be split
  • minInstancesPerNode: This specifies the minimum number of instances each child must have for a split to be performed
  • lossType: This specifies the loss function, and accepts either the squared or the absolute value
  • impurity: This is, by default, set to variance, and for now (in Spark 2.3) that is the only option allowed
  • maxIter: This specifies the maximum number of iterations, a stopping criterion for the algorithm; for gradient-boosted trees, it is also the number of trees in the ensemble

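To see how these parameters fit together, here is a sketch that sets each of them explicitly; gbt_full is just an illustrative name, and the values shown are either the defaults or arbitrary choices rather than tuned recommendations:

gbt_full = rg.GBTRegressor(
    labelCol='Elevation'
    , maxDepth=5                # maximum depth of each tree (the default)
    , maxBins=32                # bins for discretizing continuous features (the default)
    , minInfoGain=0.1           # minimum gain required to split a node
    , minInstancesPerNode=10    # minimum instances in each child after a split
    , lossType='squared'        # 'squared' or 'absolute'
    , impurity='variance'       # the only value allowed in Spark 2.3
    , maxIter=20                # number of boosting iterations, that is, trees (the default)
)
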
Let's check the performance now:

results = (
    pip
    .fit(forest)
    .transform(forest)
    .select('Elevation', 'prediction')
)

evaluator = ev.RegressionEvaluator(labelCol='Elevation')
evaluator.evaluate(results, {evaluator.metricName: 'r2'})

As the resulting R² score shows, we have once again improved, even if only ever so slightly, over the random forest regressor.

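As a side note, the evaluator object can be reused with a different metric; RegressionEvaluator also supports rmse, mse, and mae, so a quick check of the error expressed in the units of Elevation looks like this:

# reuse the same evaluator, this time asking for the root mean squared error
evaluator.evaluate(results, {evaluator.metricName: 'rmse'})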