Reusing models with joblib

The joblib Memory class is a utility class that caches function or method results to disk. We create a Memory object by specifying a caching directory. We can then decorate the function to cache, or specify the methods to cache in a class constructor. Optionally, you can name arguments to ignore when the cache key is computed. By default, Memory recomputes and stores a new result whenever the input values change, and it invalidates the cached results when the function's code is modified. You can also remove the cache manually by moving or deleting the cache directories and files.
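
As a minimal sketch of the decorator approach, the following caches a small function and ignores one argument. The function slow_square, its verbose argument, and the calls list are hypothetical names made up for this illustration; the cache lives in a temporary directory.

```python
from tempfile import mkdtemp

from joblib import Memory

# Cache directory; a temporary directory is used here for illustration.
memory = Memory(mkdtemp(), verbose=0)

calls = []  # records each real execution of the function


@memory.cache(ignore=['verbose'])
def slow_square(x, verbose=False):
    # Hypothetical stand-in for an expensive computation.
    calls.append(x)
    if verbose:
        print('computing', x)
    return x * x


first = slow_square(4)                 # computed and written to disk
second = slow_square(4, verbose=True)  # cache hit, 'verbose' is ignored
third = slow_square(5)                 # new input, so computed again
print(first, second, third, calls)

memory.clear(warn=False)  # remove the cached results manually
```

The calls list only grows when the function body really runs, so it gives a quick way to see which calls were served from the cache.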

In this recipe, I describe how to reuse a scikit-learn regressor or classifier. The naïve method would be to store the object in a standard Python pickle or use joblib. However, in most cases, it is better to store the hyperparameters of the estimator.
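
The two options can be sketched side by side as follows. This is only an illustration under assumed settings: the file names and the n_estimators and min_samples_leaf values are arbitrary. One reason to prefer the second option is that a pickled estimator is tied to the scikit-learn version that produced it, whereas the dictionary returned by get_params() is plain Python data.

```python
import os
from tempfile import mkdtemp

import joblib
from sklearn.ensemble import ExtraTreesRegressor

# Arbitrary settings chosen for the illustration.
est = ExtraTreesRegressor(n_estimators=10, min_samples_leaf=3,
                          random_state=41)

tmp_dir = mkdtemp()
model_path = os.path.join(tmp_dir, 'model.pkl')
params_path = os.path.join(tmp_dir, 'params.pkl')

# Option 1: persist the whole estimator object.
joblib.dump(est, model_path)

# Option 2: persist only the hyperparameters, a plain dictionary.
joblib.dump(est.get_params(), params_path)

# Rebuild an unfitted estimator with the same configuration.
restored = ExtraTreesRegressor(**joblib.load(params_path))
same_params = restored.get_params() == est.get_params()
print('Same hyperparameters', same_params)
```

Note that the restored estimator is unfitted, so it has to be trained again before it can predict.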

We will use the ExtraTreesRegressor class as the estimator. Extra trees (extremely randomized trees) are a variation of the random forest algorithm, which is covered in the Learning with random forests recipe.
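
To illustrate how closely the two estimators are related, the following sketch fits both on synthetic data (an assumption, not the data used in this recipe). They share the same interface; one visible difference in scikit-learn is the default for bootstrap, which is off for extra trees and on for random forests.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic regression data, used only for this illustration.
X, y = make_regression(n_samples=100, n_features=4, noise=0.1,
                       random_state=0)

et = ExtraTreesRegressor(n_estimators=10, random_state=41)
rf = RandomForestRegressor(n_estimators=10, random_state=41)

# Both estimators expose the same fit/predict interface.
et_preds = et.fit(X, y).predict(X)
rf_preds = rf.fit(X, y).predict(X)

# Default resampling behavior differs between the two classes.
print('Extra trees bootstrap', et.bootstrap)
print('Random forest bootstrap', rf.bootstrap)
```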

How to do it...

  1. The imports are as follows:
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import ExtraTreesRegressor
    import ch9util
    from tempfile import mkdtemp
    import os
    import joblib
  2. Load the data and define a hyperparameter grid search dictionary:
    X_train, X_test, y_train, y_test = ch9util.temp_split()
    params = {'min_samples_split': [2, 3],
              'bootstrap': [True, False],
              'min_samples_leaf': [3, 4]}
  3. Do a grid search as follows:
    gscv = GridSearchCV(ExtraTreesRegressor(random_state=41),
                        param_grid=params, cv=5)
  4. Fit and predict as follows:
    gscv.fit(X_train, y_train)
    preds = gscv.predict(X_test)
  5. Store the best parameters found by the grid search:
    tmp_dir = mkdtemp()
    pkl = os.path.join(tmp_dir, 'params.pkl')
    joblib.dump(gscv.best_params_, pkl)
    params = joblib.load(pkl)
    print('Best params', gscv.best_params_)
    print('From pkl', params)
  6. Create a new estimator and compare the predictions:
    est = ExtraTreesRegressor(random_state=41)
    est.set_params(**params)
    est.fit(X_train, y_train)
    preds2 = est.predict(X_test)
    print('Max diff', abs(preds - preds2).max())
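
The steps above can also be sketched as one self-contained script. The ch9util module belongs to this book's code bundle, so this sketch swaps in synthetic data from make_regression as a stand-in (an assumption), and it uses n_estimators=10 only to keep the run short.

```python
import os
from tempfile import mkdtemp

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for ch9util.temp_split() (an assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = {'min_samples_split': [2, 3],
        'bootstrap': [True, False],
        'min_samples_leaf': [3, 4]}
gscv = GridSearchCV(ExtraTreesRegressor(n_estimators=10, random_state=41),
                    param_grid=grid, cv=5)
gscv.fit(X_train, y_train)
preds = gscv.predict(X_test)

# Persist only the winning hyperparameters.
pkl = os.path.join(mkdtemp(), 'params.pkl')
joblib.dump(gscv.best_params_, pkl)
loaded = joblib.load(pkl)

# Rebuild an equivalent estimator from the stored dictionary.
est = ExtraTreesRegressor(n_estimators=10, random_state=41)
est.set_params(**loaded)
est.fit(X_train, y_train)
preds2 = est.predict(X_test)

max_diff = np.abs(preds - preds2).max()
print('Max diff', max_diff)
```

Because the rebuilt estimator has the same hyperparameters and random_state as the one refitted by the grid search, and is trained on the same data, the two sets of predictions should agree.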

Refer to the following screenshot for the end result:

[Screenshot: output of the script]

The code is in the reusing_models.py file in this book's code bundle.
