D.8. Pro tips

Once you have the basics down, some simple tricks will help you build good models faster:

  • Work with a small random sample of your dataset to get the kinks out of your pipeline.
  • When you’re ready to deploy to production, train your model on all the data you have.
  • The first approach you should try is the one you know best. This goes for both the feature extractors and the model itself.
  • Use scatter plots and scatter matrices on low-dimensional features and targets to make sure you aren’t missing some obvious patterns.
  • Plot high-dimensional data as a raw image to discover shifting across features.[1]

    [1] Time series training sets will often be generated with a time shift or lag. Discovering this can help you on Kaggle competitions that hide the source of the data, like the Santander Value Prediction competition (www.kaggle.com/c/santander-value-prediction-challenge/discussion/61394).

  • Try PCA on high-dimensional data (LSA on NLP data) when you want to maximize the differences between pairs of vectors (separation).
  • Use nonlinear dimension reduction, like t-SNE, when you want to find matches between pairs of vectors or perform regression in the low-dimensional space.
  • Build an sklearn.Pipeline object to improve the maintainability and reusability of your models and feature extractors (see the sketch after this list).
  • Automate the hyperparameter tuning so your model can learn about the data and you can spend your time learning about machine learning.
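
A minimal sketch of the Pipeline tip above on text data, chaining a TF-IDF feature extractor, an LSA step (TruncatedSVD), and a Ridge regressor. The particular steps and parameter values here are placeholder choices, not recommendations:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import Ridge

    # Chain the feature extractor, the LSA dimension reduction, and the model
    # so they can be fit, scored, and reused as a single object.
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(min_df=2, max_df=0.5)),
        ('lsa', TruncatedSVD(n_components=100)),
        ('model', Ridge(alpha=1.0)),
    ])
    # pipe.fit(train_texts, train_targets)   # train_texts and train_targets are hypothetical names
    # predictions = pipe.predict(test_texts)

Because every step lives in one object, you can pickle the whole pipeline, version it, and swap out individual steps without touching the rest of your code.
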
Hyperparameter tuning

Hyperparameters are all the values that determine the performance of your pipeline, including the model type and how it’s configured. They include things like the number of neurons and layers in a neural network, or the value of alpha in a sklearn.linear_model.Ridge regressor. Hyperparameters also include the values that govern any preprocessing steps, like the tokenizer type, the list of words to ignore, the minimum and maximum document frequency for the TF-IDF vocabulary, whether or not to use a lemmatizer, the TF-IDF normalization approach, and so on.
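
In sklearn, the hyperparameters of every step in a Pipeline can be addressed by the step name plus a double underscore, so the preprocessing settings and the model settings live in one search space. The step names below assume the pipeline sketched earlier, and the value ranges are only illustrative:

    # Hyperparameters of the preprocessing steps and the model, spelled out together.
    # Keys follow sklearn's '<step>__<parameter>' naming convention.
    param_distributions = {
        'tfidf__min_df': [1, 2, 5],                # minimum document frequency
        'tfidf__max_df': [0.5, 0.75, 1.0],         # maximum document frequency
        'tfidf__norm': ['l1', 'l2'],               # TF-IDF normalization approach
        'lsa__n_components': [50, 100, 200],       # LSA dimensionality
        'model__alpha': [0.01, 0.1, 1.0, 10.0],    # Ridge regularization strength
    }
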

Hyperparameter tuning can be a slow process, because each experiment requires you to train and validate a new model. So it pays to reduce your dataset size to a minimum representative sample while you’re searching a broad range of hyperparameters. When your search gets close to the final model that you think is going to meet your needs, you can increase the dataset size to use as much of the data as you need.
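
One minimal way to carve out that sample with pandas, assuming your data lives in a DataFrame named df (a hypothetical name):

    # Search hyperparameters on a small, reproducible random sample ...
    sample_df = df.sample(frac=0.1, random_state=42)
    # ... then refit the winning configuration on the full DataFrame before deploying.
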

Tuning the hyperparameters of your pipeline is how you improve the performance of your model. Automating the hyperparameter tuning can give you more time to spend reading books like this or visualizing and analyzing your results. You can still guide the tuning with your intuition by setting the hyperparameter ranges to try.

Tip

The most efficient algorithms for hyperparameter tuning are (from best to worst)

  1. Bayesian search
  2. Genetic algorithms
  3. Random search
  4. Multi-resolution grid searches
  5. Grid search

But any algorithm that lets your computer do this searching at night while you sleep is better than manually guessing new parameters one by one.
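
As a sketch of option 3 (random search), sklearn’s RandomizedSearchCV can run the whole experiment unattended over the pipeline and parameter ranges sketched above (pipe and param_distributions are the hypothetical names used earlier):

    from sklearn.model_selection import RandomizedSearchCV

    # Try 50 random hyperparameter combinations with 5-fold cross-validation,
    # using every CPU core, and keep the best-scoring pipeline.
    search = RandomizedSearchCV(
        pipe,
        param_distributions=param_distributions,
        n_iter=50,
        cv=5,
        n_jobs=-1,
        random_state=42,
    )
    # search.fit(sample_texts, sample_targets)   # hypothetical training data names
    # print(search.best_params_, search.best_score_)
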
