There's more...

Another approach to finding the best-performing model is the train-validation split. This method splits the training data into two smaller subsets: one that is used to train the model, and another that is used to check whether the model is overfitting. Because the split is performed only once, it is less expensive than cross-validation:

train_v = tune.TrainValidationSplit(
    estimator=logReg_obj
    , estimatorParamMaps=logReg_grid
    , evaluator=logReg_ev
    , parallelism=4
)

logReg_modelTrainV = (
    train_v
    .fit(data_trans.transform(forest_train))
)

results = logReg_modelTrainV.transform(data_trans_test)

print(logReg_ev.evaluate(results, {logReg_ev.metricName: 'weightedPrecision'}))
print(logReg_ev.evaluate(results, {logReg_ev.metricName: 'weightedRecall'}))
print(logReg_ev.evaluate(results, {logReg_ev.metricName: 'accuracy'}))

The preceding code is not that dissimilar from what we saw with the .CrossValidator(...) approach. The only additional parameter we specify for .TrainValidationSplit(...) is parallelism, which controls how many threads are spun up when selecting the best model.
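You can also control the proportion of the split itself. The following is a minimal sketch, assuming the same objects as above, that uses the trainRatio parameter (it defaults to 0.75):

train_v = tune.TrainValidationSplit(
    estimator=logReg_obj
    , estimatorParamMaps=logReg_grid
    , evaluator=logReg_ev
    , trainRatio=0.8    # use 80% of the data to train, 20% to validate
    , parallelism=4
)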

Using the .TrainValidationSplit(...) method produces the same results as the .CrossValidator(...) approach.
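If you want to confirm which hyperparameter combination was selected, the fitted TrainValidationSplitModel exposes the validation metric recorded for each combination in the grid, as well as the winning model itself. Here is a minimal sketch, assuming the logReg_modelTrainV object fitted above:

# validation metric obtained for each combination in logReg_grid;
# the combination with the best metric is kept as .bestModel
for params, metric in zip(
    logReg_modelTrainV.getEstimatorParamMaps()
    , logReg_modelTrainV.validationMetrics
):
    print({p.name: v for p, v in params.items()}, metric)

# the model selected as the best performer
print(logReg_modelTrainV.bestModel)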
