The joblib Memory class is a utility class that caches function or method results to disk. We create a Memory object by specifying a caching directory; we can then decorate the functions to cache, or specify methods to cache in a class constructor. Optionally, we can list arguments that the cache should ignore. By default, the Memory class invalidates the cache whenever the function code is modified or the input values change. Of course, you can also clear the cache manually by moving or deleting the cache directories and files.
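The decorator workflow described above can be sketched as follows; the `square()` function is a hypothetical example, and the cache directory is a throwaway temporary directory:

```python
from tempfile import mkdtemp
from joblib import Memory

# Create a Memory object backed by a temporary caching directory.
cachedir = mkdtemp()
memory = Memory(cachedir, verbose=0)

# Decorate the function whose results we want cached to disk.
@memory.cache
def square(x):
    return x ** 2

print(square(3))  # computed on the first call, then written to the cache
print(square(3))  # loaded from the disk cache on repeated calls

# Remove the cache programmatically instead of deleting files by hand.
memory.clear(warn=False)
```

Changing the body of `square()` would invalidate the cached results, matching the default behavior described above.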
In this recipe, I describe how to reuse a scikit-learn regressor or classifier. The naïve approach is to store the fitted object with the standard Python pickle module or with joblib. However, in most cases, it is better to store the hyperparameters of the estimator.
We will use the ExtraTreesRegressor class as the estimator. Extra trees (extremely randomized trees) are a variation of the random forest algorithm, which is covered in the Learning with random forests recipe.
# GridSearchCV moved from sklearn.grid_search to
# sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesRegressor
import ch9util
from tempfile import mkdtemp
import os
import joblib

X_train, X_test, y_train, y_test = ch9util.temp_split()
# min_samples_split must be at least 2 in recent scikit-learn versions
params = {'min_samples_split': [2, 3],
          'bootstrap': [True, False],
          'min_samples_leaf': [3, 4]}

gscv = GridSearchCV(ExtraTreesRegressor(random_state=41),
                    param_grid=params, cv=5)
gscv.fit(X_train, y_train)
preds = gscv.predict(X_test)

dir = mkdtemp()
pkl = os.path.join(dir, 'params.pkl')
joblib.dump(gscv.best_params_, pkl)
params = joblib.load(pkl)
print('Best params', gscv.best_params_)
print('From pkl', params)

est = ExtraTreesRegressor(random_state=41)
est.set_params(**params)
est.fit(X_train, y_train)
preds2 = est.predict(X_test)
print('Max diff', (preds - preds2).max())
Refer to the following screenshot for the end result:
The code is in the reusing_models.py file in this book's code bundle.
The Memory class documentation at https://pythonhosted.org/joblib/memory.html (retrieved November 2015)