There's more...

Let's see whether the random forest model can do any better:

rf_obj = cl.RandomForestClassifier(
    labelCol='CoverType'
    , featuresCol=selector.getOutputCol()
    , minInstancesPerNode=10
    , numTrees=10
)

pipeline = Pipeline(
    stages=[vectorAssembler, selector, rf_obj]
)

pModel = pipeline.fit(forest_train)

As you can see from the preceding code, we reuse most of the objects we already created for the logistic regression model: all we introduced here is the .RandomForestClassifier(...) object, while the vectorAssembler and selector objects are reused as-is. This is one example of how simple it is to work with Pipelines.

The .RandomForestClassifier(...) object will build the random forest model for us. In this example, we specified only four parameters; you are most likely already familiar with labelCol and featuresCol. minInstancesPerNode specifies the minimum number of records each child node must contain for a split to be allowed, while numTrees specifies how many trees to estimate in the forest. Other notable parameters include:

  • impurity: This specifies the criterion used for information gain. By default, it is set to gini but can also be entropy.
  • maxDepth: This specifies the maximum depth of any of the trees.
  • maxBins: This specifies the maximum number of bins used when discretizing continuous features.
  • minInfoGain: This specifies the minimum information gain required for a split to be considered at a tree node.
For a full specification of the class, see http://bit.ly/2sgQAFa.

Having estimated the model, let's see how it performs so we can compare it to the logistic regression one:

results_rf = (
    pModel
    .transform(forest_test)
    .select('CoverType', 'probability', 'prediction')
)

(
    evaluator.evaluate(results_rf)
    , evaluator.evaluate(
        results_rf
        , {evaluator.metricName: 'weightedPrecision'}
    )
    , evaluator.evaluate(
        results_rf
        , {evaluator.metricName: 'accuracy'}
    )
)

The preceding code should produce results similar to the following:

The results are exactly the same as for the logistic regression model, indicating that the two models perform equally well. To potentially achieve better results, we might increase the number of features retained in the selector stage.
