Achieving better performance in feature engineering

Throughout this book, we have relied on a basic definition of better when evaluating the various feature engineering methods we put into place. Our implicit goal was to achieve better predictive performance, measured purely with simple metrics: accuracy for classification tasks and RMSE for regression tasks (mostly accuracy). There are other metrics we may measure and track to gauge predictive performance. For classification, for example, we will use the following metrics:

  • Sensitivity (AKA true positive rate) and specificity (AKA true negative rate)
  • False positive and false negative rate

and for regression, the metrics that will be applied are:

  • Mean absolute error
  • R2
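
All of these metrics are available in scikit-learn. As a brief sketch (the label and prediction arrays below are made up purely for illustration), the classification rates can be read off a confusion matrix, while the regression metrics have dedicated functions:

from sklearn.metrics import confusion_matrix, mean_absolute_error, r2_score

# hypothetical classification labels and predictions, purely for illustration
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# unpack the binary confusion matrix into its four cells
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sensitivity (true positive rate): {}".format(float(tp) / (tp + fn)))
print("Specificity (true negative rate): {}".format(float(tn) / (tn + fp)))
print("False positive rate: {}".format(float(fp) / (fp + tn)))
print("False negative rate: {}".format(float(fn) / (fn + tp)))

# hypothetical regression targets and predictions, purely for illustration
y_true_reg = [2.5, 0.0, 2.1, 7.8]
y_pred_reg = [3.0, -0.5, 2.0, 8.0]
print("Mean absolute error: {}".format(mean_absolute_error(y_true_reg, y_pred_reg)))
print("R2: {}".format(r2_score(y_true_reg, y_pred_reg)))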

These lists go on, and while we will not abandon the idea of quantifying performance through metrics such as the ones listed previously, we can also measure meta metrics: metrics that do not directly describe the predictive performance of the model, but instead attempt to measure the performance around the prediction. These include ideas such as:

  • The time the model needs to fit/train to the data
  • The time it takes a fitted model to predict new instances of data
  • The size of the data in case the data must be persisted (stored for later use); a quick sketch of measuring this follows the list
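
As a hypothetical sketch of that last idea (the toy feature matrix below is made up for illustration), we can use an array's in-memory footprint as a proxy for how much space it would take to persist, and watch that footprint shrink as columns are removed:

from sklearn.datasets import make_classification

# hypothetical toy feature matrix, standing in for the chapter's data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# in-memory size of the full feature matrix, a proxy for its persisted size
print("Full feature matrix (bytes): {}".format(X.nbytes))

# keeping only half of the columns, as a feature selection step might, shrinks it proportionally
X_reduced = X[:, :10]
print("Reduced feature matrix (bytes): {}".format(X_reduced.nbytes))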

These ideas add to our definition of better machine learning, as they encompass a much larger picture of our machine learning pipeline than predictive performance alone. To help us track these metrics, let's create a function that is generic enough to evaluate several models but specific enough to give us metrics for each one. We will call our function get_best_model_and_accuracy, and it will do several jobs, such as:

  • It will search across all given parameters in order to optimize the machine learning pipeline
  • It will spit out some metrics that will help us assess the quality of the pipeline entered

Let's go ahead and define such a function with the help of the following code:

# import our grid search module
from sklearn.model_selection import GridSearchCV


def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model,          # the model to grid search
                        params,         # the parameter set to try
                        error_score=0.) # if a parameter set raises an error, continue and set the performance as a big, fat 0
    grid.fit(X, y)  # fit the model and parameters
    # our classical metric for performance
    print("Best Accuracy: {}".format(grid.best_score_))
    # the best parameters that caused the best accuracy
    print("Best Parameters: {}".format(grid.best_params_))
    # the average time it took a model to fit to the data (in seconds)
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # the average time it took a model to predict out-of-sample data (in seconds)
    # this metric gives us insight into how this model will perform in real-time analysis
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))

The overall goal of this function is to act as a ground truth: we will use it to evaluate every feature selection method in this chapter, giving us a standardized way of comparing them. This is not really any different from what we have been doing already, but we are now formalizing our work as a function, and also using metrics other than accuracy to grade our feature selection modules and machine learning pipelines.
