First, as always, we collate all the features we want to use in our model with the .VectorAssembler(...) transformer. Note that we only use the columns from the second one onward, as the first column is our target: the Elevation feature.
Next, we specify the .RandomForestRegressor(...) object. It accepts an almost identical list of parameters to .RandomForestClassifier(...).
The last step is to build the Pipeline object, pip, which has only two stages: vectorAssembler and rf_obj.
Next, let's see how our model is performing compared to the linear regression model we estimated in the Introducing Estimators recipe:
results = (
    pip
    .fit(forest)
    .transform(forest)
    .select('Elevation', 'prediction')
)
evaluator = ev.RegressionEvaluator(labelCol='Elevation')
evaluator.evaluate(results, {evaluator.metricName: 'r2'})
The .RegressionEvaluator(...) object calculates the performance metrics of regression models. By default, it returns rmse, the root mean-squared error, but it can also return:
- mse: This is the mean-squared error
- r2: This is the R2 metric
- mae: This is the mean-absolute error
From the preceding code, we got:
This is better than the linear regression model we built earlier, suggesting that the relationship between the features and the target might not be as linear as we initially thought.