Validation and evaluation techniques

Several widely used terms in machine learning application development are easy to confuse, so let's define them one by one. These terms are target function, hypothesis, learning algorithm, model, induction algorithm, model deployment, cross-validation, classifier, regressor, and hyperparameters:

  • Target function: In supervised learning or predictive modeling, the ultimate goal is to learn or approximate a specific but unknown function. This target function is denoted f(x) = y, where x is the input variable, y is the target value we want to predict, and f is the true function that we want to model.
  • Hypothesis: A statistical hypothesis (not to be confused with a research hypothesis proposed by a researcher) is a generalized function that is testable on the basis of observing a process. That process is assumed to resemble the true function, and it is modeled through a set of random variables drawn from the training dataset. In hypothesis testing, a hypothesis is proposed about the statistical relationship between two datasets, for example the training set and the test set, and that relationship is judged for statistical significance against an idealized model (not a randomized or normalized one).
  • Learning algorithm: As already stated, the ultimate goal of a machine learning application is to find or approximate the target function. In this continuous process, the learning algorithm is a set of instructions that models the target function using the training dataset. Technically, a learning algorithm often comes with a hypothesis space and formulates the final hypothesis.
  • Model: A statistical model is a mathematical model that embodies a set of assumptions about how sample data are generated from a larger population. The model usually represents a considerably idealized form of the data-generating process. In machine learning, the terms hypothesis and model are often used interchangeably.
  • Induction algorithm: An induction algorithm takes specific instances as input and produces a model that generalizes beyond those instances.
  • Model deployment: Model deployment usually denotes applying an already built and developed model to the real data in order to make a prediction for an example.
  • Cross-validation: This is a method for estimating the error (and hence the accuracy) of a machine learning model by dividing the data into K mutually exclusive subsets, or folds, of approximately equal size. The model is then trained and tested K times: in each iteration it is trained on all of the data except one fold and then tested on that held-out fold.
  • Classifier: A classifier is a special case of a hypothesis: a discrete-valued function that assigns a categorical class label to each data point.
  • Regressor: A regressor is also a special case of a hypothesis: it maps unlabeled features to a continuous value within a predefined metric space.
  • Hyperparameters: Hyperparameters are the tuning parameters of a machine learning algorithm. They are set before training and control how the learning algorithm fits the training data, rather than being learned from that data.
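The cross-validation procedure described in the list above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the Spark API: the helper names (k_fold_indices, cross_validate, train_mean, mse) and the toy dataset are assumptions made for this example, and the "learner" is deliberately trivial (it always predicts the mean of the training targets).

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Split the indices 0..n_samples-1 into k mutually exclusive folds
    of approximately equal size (hypothetical helper for this sketch)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, train_fn, error_fn):
    """Train and test k times, each time holding out a different fold."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        # Train on everything except the held-out fold.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = train_fn(train_x, train_y)
        # Test only on the held-out fold.
        test_x = [xs[i] for i in held_out]
        test_y = [ys[i] for i in held_out]
        fold_errors.append(error_fn(model, test_x, test_y))
    return mean(fold_errors), fold_errors


def train_mean(train_x, train_y):
    """Toy learner (assumption): always predict the mean training target."""
    target_mean = mean(train_y)
    return lambda x: target_mean


def mse(model, test_x, test_y):
    """Mean squared error of the model on the held-out fold."""
    return mean((model(x) - y) ** 2 for x, y in zip(test_x, test_y))


# Toy data generated from a made-up target function f(x) = 2x + 1.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

avg_error, fold_errors = cross_validate(xs, ys, k=5, train_fn=train_mean, error_fn=mse)
print("per-fold errors:", fold_errors)
print("cross-validated error:", avg_error)
```

Here K (the k argument) is itself a hyperparameter of the evaluation procedure: it is chosen up front rather than learned from the data.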

According to Pedro Domingos, in A Few Useful Things to Know about Machine Learning (http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf), a learner also has three components that need to be outlined:

  • Representation: A classifier or regressor must be represented in some formal language that a computer can process. Choosing a representation amounts to choosing the hypothesis space of the learner.
  • Evaluation: An evaluation function (also called an objective function or scoring function) is needed to distinguish a good classifier or regressor from a bad one. It is used internally by the algorithm with which the model is built or trained.
  • Optimization: We also need a method for searching among the classifiers or regressors for the highest-scoring one. The optimization technique is therefore key to the efficiency of the learner, and it determines the optimal parameters.

In a nutshell, the key formula for learning in an ML algorithm is:

Learning = Representation + Evaluation + Optimization
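To make the three components concrete, here is a minimal sketch in plain Python, assuming a one-dimensional linear hypothesis space, mean squared error as the evaluation function, and batch gradient descent as the optimizer. These choices, the function names, and the toy data are assumptions made for illustration; they are not taken from the paper or from the Spark APIs.

```python
# Representation: the hypothesis space is the set of lines h(x) = w * x + b.
def predict(params, x):
    w, b = params
    return w * x + b


# Evaluation: mean squared error scores how well a hypothesis fits the data
# (lower is better).
def mse(params, xs, ys):
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Optimization: batch gradient descent searches the hypothesis space for
# the parameters with the lowest error. learning_rate and steps are
# hyperparameters: they are set by hand, not learned from the data.
def fit(xs, ys, learning_rate=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy training data generated from a made-up target function f(x) = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

params = fit(xs, ys)
print("learned parameters (w, b):", params)
print("training error:", mse(params, xs, ys))
```

The learned pair (w, b) is the final hypothesis: the learner's approximation of the target function, picked out of the hypothesis space by the optimizer under the guidance of the evaluation function.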

Consequently, to validate and evaluate a trained model, you need to understand the preceding terms clearly so that you can conceptualize the ML problem and use the Spark ML and MLlib APIs properly.
