How it works...

There's a lot happening here, so let's unpack it step-by-step.

We already know the .VectorAssembler(...), .ChiSqSelector(...), and .LogisticRegression(...) classes, so we will not repeat ourselves here.

Check out previous recipes if you are not familiar with the preceding concepts.

The core of this recipe starts with the logReg_grid object. This is an instance of the .ParamGridBuilder() class, which allows us to add parameters and lists of values to a grid; the algorithm will then loop through every combination of the specified values and estimate a model for each.

A word of caution: the more parameters you include and the more levels you specify for each, the more models you will have to estimate. The total number of models is the product of the number of levels specified for each parameter, so it grows exponentially with the number of parameters. Beware!

In this example, we loop through two parameters: regParam and elasticNetParam. For each of the parameters, we specify two levels, so we will need to build 2 x 2 = 4 models, as in the sketch below.
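Here is a minimal sketch of how such a grid can be built. The value lists, the CoverType label column, and the selected features column are illustrative assumptions, not necessarily the exact ones used in the recipe:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

# Column names are assumptions; 'selected' is the output of the ChiSqSelector stage
logReg_obj = LogisticRegression(
    labelCol='CoverType',
    featuresCol='selected')

logReg_grid = (
    ParamGridBuilder()
    .addGrid(logReg_obj.regParam, [0.01, 0.1])        # 2 levels
    .addGrid(logReg_obj.elasticNetParam, [0.0, 0.5])  # 2 levels
    .build())                                         # 2 x 2 = 4 parameter maps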

As an evaluator, we once again use .MulticlassClassificationEvaluator(...).
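A one-line sketch of such an evaluator (the CoverType label column is an assumption; predictionCol defaults to 'prediction'):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

logReg_ev = MulticlassClassificationEvaluator(labelCol='CoverType')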

Next, we specify the .CrossValidator(...) object, which binds all these things together: our estimator will be logReg_obj, estimatorParamMaps will be equal to the built logReg_grid, and evaluator is going to be logReg_ev.

The .CrossValidator(...) object splits the training data into a set of folds (three by default) and these are used as separate training and validation datasets to fit the models. Therefore, we not only need to fit the four models defined by the parameters grid, but for each of those four models we also build three models on the different training and validation splits, for a total of 4 x 3 = 12 fits.
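Put together, the cross-validator can be constructed roughly as follows, reusing the objects named in this recipe:

from pyspark.ml.tuning import CrossValidator

cross_v = CrossValidator(
    estimator=logReg_obj,
    estimatorParamMaps=logReg_grid,
    evaluator=logReg_ev,
    numFolds=3)  # three folds is also the default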

Note that we first build a Pipeline that is purely data-transformative, that is, it only collates the features into the full features vector and then selects the five features with the most predictive power; we do not fit logReg_obj at this stage.
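A sketch of such a purely transformative Pipeline follows. The forest_train DataFrame, the column layout, and the CoverType label column are assumptions; keeping the top five features maps to numTopFeatures=5:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

vec_assembler = VectorAssembler(
    inputCols=forest_train.columns[:-1],  # every column except the label
    outputCol='features')

selector = ChiSqSelector(
    numTopFeatures=5,
    featuresCol='features',
    outputCol='selected',
    labelCol='CoverType')

pipeline = Pipeline(stages=[vec_assembler, selector])
data_trans = pipeline.fit(forest_train)  # a fitted, purely transformative PipelineModel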

The model-fitting starts when we use the cross_v object to fit the transformed data. Only then will Spark estimate four distinct models and select the one that performs best.
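In code, this step might look as follows; forest_train is assumed by analogy with the forest_test DataFrame used below:

# Transform the training data with the fitted pipeline, then cross-validate
data_trans_train = data_trans.transform(forest_train)
logReg_modelTest = cross_v.fit(data_trans_train)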

Having now estimated the models and selected the best-performing one, let's see whether the selected model performs better than the one we estimated in the Predicting forest coverage types recipe:

# Transform the test data with the fitted, purely transformative pipeline
data_trans_test = data_trans.transform(forest_test)

# Score the test data with the best model found by the cross-validator
results = logReg_modelTest.transform(data_trans_test)

print(logReg_ev.evaluate(results, {logReg_ev.metricName: 'weightedPrecision'}))
print(logReg_ev.evaluate(results, {logReg_ev.metricName: 'weightedRecall'}))
print(logReg_ev.evaluate(results, {logReg_ev.metricName: 'accuracy'}))

Running the preceding code prints the weighted precision, weighted recall, and accuracy of the cross-validated model.

As you can see, the new model does slightly worse than the previous one, but this is most likely because we selected only the top five features (versus ten before) with our selector.
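If you want to inspect which parameter combination won, the fitted CrossValidatorModel exposes the average metric for each grid entry as well as the winning model, which Spark refits on the full training data:

print(logReg_modelTest.avgMetrics)       # one average metric per parameter map
best_model = logReg_modelTest.bestModel  # the refit best-performing model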
