Chapter 7. Tuning Machine Learning Models

Tuning an algorithm or a machine learning application is simply the process of optimizing the parameters that impact the model in order to enable the algorithm to perform at its best (in terms of runtime and memory usage). This chapter aims to guide the reader through model tuning and covers the main techniques used to optimize an ML algorithm's performance, explained from both the MLlib and the Spark ML perspective. In Chapter 5, Supervised and Unsupervised Learning by Examples, and Chapter 6, Building Scalable Machine Learning Pipelines, we described how to develop complete machine learning applications and pipelines, from data collection to model evaluation. In this chapter, we will reuse some of those applications and improve their performance through techniques such as hyperparameter tuning, grid search parameter tuning with MLlib and Spark ML, random search parameter tuning, and cross-validation. Hypothesis testing, an important statistical tool, will also be discussed. In summary, the following topics will be covered in this chapter:

  • Details about machine learning model tuning
  • Typical challenges in model tuning
  • Evaluating machine learning models
  • Validation and evaluation techniques
  • Parameter tuning for machine learning
  • Hypothesis testing
  • Machine learning model selection

Details about machine learning model tuning

Fundamentally, one can argue that the ultimate goal of machine learning (ML) is to make a machine that can automatically build models from data without requiring tedious and time-consuming human involvement. One of the difficulties in ML is that learning algorithms such as decision trees, random forests, and clustering techniques require you to set parameters before you use the models for practical purposes, or at least to set some constraints on those parameters.

How you set those parameters depends on a number of factors, including the specification of the problem at hand. Your goal is usually to set them to the optimal values that enable you to complete a learning task in the best possible way. Thus, tuning an algorithm or ML technique can simply be thought of as the process of optimizing the parameters that impact the model's performance so that the algorithm performs at its best.

In Chapter 3, Understanding the Problem by Understanding the Data, and Chapter 5, Supervised and Unsupervised Learning by Examples, we discussed some techniques for choosing the best algorithm based on your data and surveyed the most widely used algorithms. To get the best result out of your model, you first have to define what best actually means. We will discuss tuning in both abstract and concrete terms.

In the abstract sense of machine learning, tuning involves working with variables or parameters that have been identified as affecting system performance, as evaluated by some appropriate metric. The improved performance reveals which parameter settings are more favorable (that is, tuned) or less favorable (that is, untuned). In common-sense terms, tuning is essentially selecting the best parameters for an algorithm to optimize its performance given the working environment of hardware, specific workloads, and so on. Tuning in machine learning is an automated process for doing this.

Well, let's get to the point and make the discussion more concrete through some examples. If you take a clustering algorithm such as K-Means, or a neighborhood-based algorithm such as K-Nearest Neighbors (KNN), then as a developer/data scientist/data engineer you need to specify K, the number of centroids (or neighbors), in your model.

So the question is: how can you do this? Technically, there is no shortcut around the need to tune the model. A naive but computationally straightforward approach is to build a model for several different values of K and observe how the inter-group and intra-group error changes as you vary K.
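To make this concrete, here is a minimal, framework-agnostic sketch of that naive approach (plain Python on toy one-dimensional data, not Spark code): run a tiny K-Means for several values of K and watch the intra-group error, measured as the within-cluster sum of squares (WCSS), shrink as K grows.

```python
import random

def kmeans_wcss(points, k, iters=20, seed=0):
    """Run a tiny Lloyd's K-Means on 1-D points and return the
    within-cluster sum of squares (the intra-group error)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)              # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # assign to nearest centroid
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]   # recompute means
                     for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

# Toy data with three obvious groups around 1, 5, and 9
points = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
for k in (1, 2, 3, 4):
    print(k, round(kmeans_wcss(points, k), 3))
```

With well-separated groups like these, the error typically drops sharply up to K = 3 (the true number of groups) and only marginally afterwards; that "elbow" is one common heuristic for picking K.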

A second example is using the Support Vector Machine (SVM) for classification tasks. As you know, SVM classification requires an initial learning phase in which the training data are used to adjust the classification parameters. This denotes an initial parameter-tuning phase in which you try to tune the model in order to achieve high-quality results.

The third practical example suggests that there is no such thing as a perfect set of optimizations for all deployments of an Apache web server. A sysadmin learns from the data on the job, so to speak, and optimizes their own Apache web server configuration as appropriate for its specific environment. Now imagine an automated process for doing those three things, that is, a system that can learn from data on its own: the very definition of machine learning. A system that tunes its own parameters in such a data-based fashion would be an instance of tuning in machine learning. Now, let us summarize the main reasons why we evaluate the predictive performance of a model:

  • We want to estimate the generalization error, the predictive performance of our model on future (unseen) data.
  • We want to increase the predictive performance by tweaking the learning algorithm and selecting the best-performing model from a given hypothesis space.
  • We want to identify the machine learning algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best-performing model from the algorithm's hypothesis space.

In a nutshell, there are four steps in the process of finding the best parameter set:

  • Define the parametric space: First, we need to decide the exact parameter values we would like to consider for the algorithm.
  • Define the cross-validation settings: Secondly, we need to decide how to choose the cross-validation folds for the data (to be discussed later in this chapter).
  • Define the metric: Thirdly, we need to decide which metric to use for determining the best combination of parameters: for example, accuracy, root mean squared error, precision, recall, or F-score.
  • Train, evaluate and compare: Fourthly, for each unique combination of the parameter values, cross-validation is carried out and based on the error metric defined by the user in the third step, the best-performing model can be chosen.
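The four steps above can be sketched in a few lines of framework-agnostic Python. This toy example (synthetic data and a hypothetical one-parameter ridge model, chosen purely for illustration; Spark's tooling automates the same loop at scale) defines a parameter grid, a 5-fold split, an RMSE metric, and then trains and compares every combination:

```python
import random
from statistics import mean

# Synthetic 1-D regression data for illustration: y = 2x + noise
rnd = random.Random(42)
data = [(x, 2.0 * x + rnd.gauss(0, 0.5)) for x in [i / 10 for i in range(50)]]

def k_fold_indices(n, k):
    """Step 2: split row indices into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def fit_slope(rows, ridge):
    """A toy one-parameter model: least-squares slope through the
    origin with a ridge penalty (the hyperparameter being tuned)."""
    return sum(x * y for x, y in rows) / (sum(x * x for x, _ in rows) + ridge)

def rmse(slope, rows):
    """Step 3: the error metric -- root mean squared error."""
    return mean((y - slope * x) ** 2 for x, y in rows) ** 0.5

# Step 1: the parametric space -- candidate ridge penalties
param_grid = [0.0, 0.1, 1.0, 10.0]

# Step 4: 5-fold cross-validation for each combination; keep the best
folds = k_fold_indices(len(data), 5)
scores = {}
for ridge in param_grid:
    fold_errors = []
    for i in range(len(folds)):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        fold_errors.append(rmse(fit_slope(train, ridge), test))
    scores[ridge] = mean(fold_errors)           # average error across folds

best = min(scores, key=scores.get)
print("best ridge penalty:", best)
```

The winning parameter value is the one with the lowest cross-validated error; the same skeleton generalizes to any grid of parameter combinations and any metric.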

There are many techniques and algorithms available for model tuning, such as hyperparameter optimization (also known as model selection), grid search parameter tuning, random search parameter tuning, and Cross-Validation (CV). Unfortunately, the current implementation of Spark provides only a few of them, namely CrossValidator and TrainValidationSplit.

Therefore, we will use these two tuning tools to tune different models, including Random Forest, Linear Regression, and Logistic Regression. Some applications from Chapter 5, Supervised and Unsupervised Learning by Examples, and Chapter 6, Building Scalable Machine Learning Pipelines, will be reused without repeating many details, to make model tuning easier to follow.
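As a rough mental model of the difference (plain Python, not the actual Spark API): cross-validation averages the metric over k folds, while a train-validation split evaluates each parameter combination against a single random train/validation split, which is cheaper but noisier. A hedged sketch of the single-split idea, reusing the same toy ridge model from before to isolate the validation strategy:

```python
import random
from statistics import mean

rnd = random.Random(7)
# Synthetic illustration data: y = 3x + noise
data = [(x, 3.0 * x + rnd.gauss(0, 1.0)) for x in [i / 20 for i in range(80)]]

# One random 75/25 split instead of k folds
rnd.shuffle(data)
cut = int(0.75 * len(data))
train, validation = data[:cut], data[cut:]

def fit_slope(rows, ridge):
    """Toy model: least-squares slope through the origin with a ridge penalty."""
    return sum(x * y for x, y in rows) / (sum(x * x for x, _ in rows) + ridge)

def rmse(slope, rows):
    return mean((y - slope * x) ** 2 for x, y in rows) ** 0.5

# Each candidate is trained once and scored once on the held-out set
param_grid = [0.0, 0.5, 5.0]
errors = {r: rmse(fit_slope(train, r), validation) for r in param_grid}
best_ridge = min(errors, key=errors.get)
print("best ridge penalty:", best_ridge)
```

Each candidate here is trained and evaluated exactly once, versus k times under cross-validation, which is why the single-split strategy is preferred when training is expensive and the dataset is large enough for one split to be representative.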
