Before we discuss what testing means in data science, let's summarize a few concepts.
First of all, what is a model in science, in general terms? We can cite the following definitions:
In science, a model is a representation of an idea, an object or even a process or a system that is used to describe and explain phenomena that cannot be experienced directly.
Scientific Modelling, Science Learning Hub, http://sciencelearn.org.nz/Contexts/The-Noisy-Reef/Science-Ideas-and-Concepts/Scientific-modelling
And this:
A scientific model is a conceptual, mathematical or physical representation of a real-world phenomenon. A model is generally constructed for an object or process when it is at least partially understood, but difficult to observe directly. Examples include sticks and balls representing molecules, mathematical models of planetary movements or conceptual principles like the ideal gas law. Because of the infinite variations actually found in nature, all but the simplest and most vague models are imperfect representations of real-world phenomena.
What is a model in science?, Reference: https://www.reference.com/science/model-science-727cde390380e207
We need a model in order to simplify the complexity of a system in the form of a hypothesis. We have seen that deep neural networks can describe complex non-linear relationships. Even though we approximate a real system with something more expressive than a shallow model, in the end this is still just another approximation. I doubt any real system actually works like a neural network. Neural networks were inspired by the way our brain processes information, but they are a huge simplification of it.
A model is defined according to some parameters (a parametric model). On one hand, we have the definition of a model as a function mapping an input space to an output space. On the other hand, we have a set of parameters that the function needs in order to apply the mapping, for instance, the weights matrix and biases.
Model fitting and training are two terms referring to the process of estimating the parameters of that model so that it best describes the underlying data. Model fitting happens via a learning algorithm that defines a loss function depending on both the model parameters and the data, and tries to minimize it by estimating the best set of values for the model parameters. One of the most common algorithms is Gradient Descent, with all its variants; see the previous Training section. For an auto-encoder, you would minimize the reconstruction error plus the regularization penalty, if any.
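As a rough illustration, the following sketch (assuming Keras/TensorFlow is available; the layer sizes and training data are purely illustrative) fits a small auto-encoder by minimizing the mean squared reconstruction error plus an L1 sparsity penalty on the encoding, and then uses the per-sample reconstruction error as an anomaly score:

# Minimal sketch: fitting an auto-encoder by minimizing the reconstruction
# error plus an L1 sparsity penalty (assumes TensorFlow/Keras; sizes are illustrative).
import numpy as np
from tensorflow.keras import layers, models, regularizers

input_dim = 30                                   # number of input features (illustrative)
inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(8, activation='relu',
                       activity_regularizer=regularizers.l1(1e-4))(inputs)
decoded = layers.Dense(input_dim, activation='linear')(encoded)

autoencoder = models.Model(inputs, decoded)
# MSE is the reconstruction error; the L1 penalty is added to the loss automatically.
autoencoder.compile(optimizer='adam', loss='mse')   # Adam is a gradient-descent variant

X_train = np.random.rand(1000, input_dim)        # placeholder for real training data
autoencoder.fit(X_train, X_train, epochs=10, batch_size=32, verbose=0)

# Per-sample reconstruction error, usable as an anomaly score.
scores = np.mean((autoencoder.predict(X_train) - X_train) ** 2, axis=1)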
Validation is sometimes confused with testing and evaluation. Validation and testing often use the same techniques and/or methodology but they serve two different purposes.
Model validation corresponds to a type of hypothesis validation. We consider our data to be well described by a model. The hypothesis is that, if that model is correct, after having been trained (parameter estimation), it will describe unseen data the same way it describes the training set. We hypothesize that the model generalizes enough given the limits of the scenario in which we will use it. Model validation aims to find a measure (often referred to as a metric) that quantifies how well the model fits the validation data. For labeled data, we might derive a few metrics from either the Receiver Operating Characteristic (ROC) or Precision-Recall (PR) curve computed from the anomaly scores on the validation data. For unlabeled data, we could for instance use the Excess-Mass (EM) or Mass-Volume (MV) curve.
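For example, with a labeled validation set, the ROC and PR summaries can be derived from the anomaly scores as in the following sketch (assuming scikit-learn; the labels and scores are illustrative):

# Minimal sketch: validation metrics from anomaly scores on labeled data (assumes scikit-learn).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

y_val = np.array([0, 0, 1, 0, 1, 0, 0, 1])                    # 1 = anomaly (illustrative labels)
scores = np.array([0.1, 0.3, 0.9, 0.2, 0.7, 0.4, 0.1, 0.8])   # anomaly scores from the model

roc_auc = roc_auc_score(y_val, scores)            # area under the ROC curve
pr_auc = average_precision_score(y_val, scores)   # area under the PR curve

# A threshold can also be chosen from the PR curve to meet a precision requirement.
precision, recall, thresholds = precision_recall_curve(y_val, scores)
print("ROC AUC = %.3f, PR AUC = %.3f" % (roc_auc, pr_auc))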
Although model validation can be a way to evaluate performance, it is widely used for model selection and tuning.
Model selection is the process of selecting, among a set of candidates, the model that scores highest in validation. The set of candidates could be different configurations of the same model, many different models, a selection of different features, different normalization and/or transformation techniques, and so on.
In deep neural networks, feature selection could be omitted because we delegate to the network itself the role of figuring out and generating relevant features. Moreover, features are also discarded via regularization during learning.
The hypothesis space (the model parameters) depends on the choice of topology, the activation functions, size and depth, pre-processing (for example, whitening of an image or data cleansing), and post-processing (for example, using an auto-encoder to reduce the dimensionality and then running a clustering algorithm). We might see the whole pipeline (the set of components in a given configuration) as the model, even though the fitting could happen independently for each piece.
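The following sketch illustrates this "whole pipeline as the model" view (assuming scikit-learn; PCA is used here only as a stand-in for an auto-encoder's encoder, and the data is synthetic): pre-processing, dimensionality reduction, and clustering are fitted one after the other, but can be validated as a single unit:

# Minimal sketch: the whole pipeline seen as the model (assumes scikit-learn).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pipeline = Pipeline([
    ('scale', StandardScaler()),                    # pre-processing (whitening-like)
    ('reduce', PCA(n_components=2)),                # stand-in for an auto-encoder's encoder
    ('cluster', KMeans(n_clusters=3, n_init=10)),   # post-processing: clustering
])

X = np.random.rand(500, 10)    # placeholder data
pipeline.fit(X)                # each piece is fitted in sequence
labels = pipeline.predict(X)   # but the pipeline is used (and validated) as one model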
Analogously, the learning algorithm will introduce a few parameters (for example, the learning rate or decay rate). In particular, since we want to maximize the generalization of the model, we generally introduce a regularization technique during learning, and that will introduce additional parameters (for example, the sparsity coefficient, noise ratio, or regularization weight).
Moreover, the particular implementation of the algorithm also has a few parameters (for example, epochs, number of samples per iteration).
We can use the same validation technique to quantify the performance of the model and learning algorithm together. We can imagine having a single big vector of parameters that includes the model parameters plus the hyper-parameters. We can then tune everything in order to optimize the validation metric.
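A minimal sketch of this idea is a plain grid search over the joint space of model, learning, and regularization hyper-parameters (the build_and_validate helper is hypothetical and stubbed out here; in practice it would train the whole pipeline and return the validation metric, for example the PR AUC):

# Minimal sketch: tuning model and learning hyper-parameters as one search space.
from itertools import product

def build_and_validate(config):
    # Hypothetical helper: train the pipeline with this configuration and
    # return the validation metric. Stubbed out here for illustration only.
    return -abs(config['learning_rate'] - 1e-3)

search_space = {
    'hidden_units': [4, 8, 16],       # model hyper-parameter
    'learning_rate': [1e-2, 1e-3],    # learning-algorithm hyper-parameter
    'l1_penalty': [0.0, 1e-4],        # regularization hyper-parameter
}

best_score, best_config = float('-inf'), None
for values in product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    score = build_and_validate(config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)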
At the end of the model selection and tuning via validation, we have obtained a system that:
The selected point will formalize our final theory. The theory says that our observations are generated from a model that is the outcome of the pipeline corresponding to the selected point.
Evaluation is the process of verifying that the final theory is acceptable and quantifying its quality from both technical and business perspectives.
Scientific literature shows how, during the course of history, one theory has succeeded another. Choosing the right theory without introducing a cognitive bias requires rationality, accurate judgment, and logical interpretation.
Confirmation theory, the study that guides scientific reasoning other than reasoning of the deductive kind, can help us define a few principles.
In our context, we want to quantify the quality of our theory and verify that it is good enough and that it offers an evident advantage with respect to a much simpler theory (the baseline). A baseline could be a naïve implementation of our system. In the case of an anomaly detector, it could simply be a rule-based threshold model where an anomaly is flagged for each observation whose feature values are above a static set of thresholds. Such a baseline is probably the simplest theory we can implement and maintain over time. It will probably not satisfy the full acceptance criteria, but it will help us to justify why we need another theory, that is, a more advanced model.
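As an illustration, the baseline could look like the following sketch (the thresholds and data are illustrative): an observation is flagged as anomalous if any of its feature values exceeds its static threshold:

# Minimal sketch: a rule-based threshold baseline for anomaly detection.
import numpy as np

def threshold_baseline(X, thresholds):
    # X: (n_samples, n_features) array; thresholds: one static threshold per feature.
    return np.any(X > thresholds, axis=1).astype(int)    # 1 = flagged as anomaly

X_val = np.array([[0.2, 0.10],
                  [0.9, 0.40],
                  [0.3, 0.95]])
thresholds = np.array([0.8, 0.9])                        # illustrative static thresholds
print(threshold_baseline(X_val, thresholds))             # -> [0 1 1]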
Colyvan, in his book The Indispensability of Mathematics, summarized four major criteria for accepting a good theory as a replacement for another:
These criteria, in the case of neural networks, are translated into the following:
From a business perspective, we really need to think carefully about what the acceptance criteria are.
We would like to answer at least the following questions:
We will use an intrusion detection system as an example and try to answer these questions.
We would like to monitor network traffic in real time, taking individual network connections and marking them as normal or suspicious. This will allow the business to have enhanced protection against intruders. The flagged connections will be stopped and will go into a queue for manual inspection. A team of security experts will look into those connections and determine whether each one is a false alarm and, in the case of a confirmed attack, will mark the connection under one of the available labels. Thus, the model has to provide a real-time list of connections sorted by their anomaly score. The list cannot contain more elements than the inspection capacity of the security team. Moreover, we need to balance the cost of permitting an attack, the cost of damages in the case of an attack, and the cost required for inspection. A minimum requirement on precision and recall is a must in order to probabilistically limit the worst-case scenario.
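As a sketch of the queue requirement (the connection identifiers and scores are illustrative), the inspection list can be built by ranking connections by anomaly score and truncating it to the team's inspection capacity:

# Minimal sketch: building the manual-inspection queue from anomaly scores.
import heapq

def inspection_queue(connections, scores, capacity):
    # Keep only the `capacity` highest-scoring connections, most anomalous first.
    ranked = heapq.nlargest(capacity, zip(scores, connections))
    return [conn for score, conn in ranked]

conns = ['c1', 'c2', 'c3', 'c4']
scores = [0.2, 0.95, 0.7, 0.4]
print(inspection_queue(conns, scores, capacity=2))   # -> ['c2', 'c3']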
All of these evaluation strategies have so far been defined qualitatively rather than quantitatively. It would be quite hard to compare and report something that is not numerically measurable.
Bryan Hudson, a Data Science practitioner, said:
If you can't define it, you can't measure it. If it can't be measured, it shouldn't be reported. Define, then measure, then report.
Define, then measure, then report. But be careful: we might be tempted to define a new evaluation metric that takes into account every possible aspect and scenario discussed so far.
Whilst many data scientists may attempt to quantify the evaluation of a model using a single utility function, as is done during validation, this is not advised for a real production system. As also expressed in the Professional Data Science Manifesto:
A product needs a pool of measures to evaluate its quality. A single number cannot capture the complexity of reality.
The Professional Data Science Manifesto, www.datasciencemanifesto.org
And even after we have defined our Key Performance Indicators (KPIs), their real meaning is only relative to a baseline. We must consider why we need this solution with respect to a much simpler or existing one.
The evaluation strategy requires defining test cases and KPIs so that we can cover the most scientific aspects and business needs. Some of them are aggregated numbers, others can be represented in charts. We aim to summarize and efficiently present all of them in a single evaluation dashboard.
In the following sections, we will see a few techniques used for model validation with both labeled and unlabeled data.
Then we will see how to tune the parameter space using some parallel search techniques.
Lastly, we will give an example of a final evaluation for a network intrusion use case using A/B testing techniques.