Testing

Before we discuss what testing means in data science, let's summarize a few concepts.

First, and in general terms, what is a model in science? We can cite the following definitions:

In science, a model is a representation of an idea, an object or even a process or a system that is used to describe and explain phenomena that cannot be experienced directly.

Scientific Modelling, Science Learning Hub, http://sciencelearn.org.nz/Contexts/The-Noisy-Reef/Science-Ideas-and-Concepts/Scientific-modelling

And this:

A scientific model is a conceptual, mathematical or physical representation of a real-world phenomenon. A model is generally constructed for an object or process when it is at least partially understood, but difficult to observe directly. Examples include sticks and balls representing molecules, mathematical models of planetary movements or conceptual principles like the ideal gas law. Because of the infinite variations actually found in nature, all but the simplest and most vague models are imperfect representations of real-world phenomena.

What is a model in science?, Reference.com, https://www.reference.com/science/model-science-727cde390380e207

We need a model in order to simplify the complexity of a system in the form of a hypothesis. We have shown that deep neural networks can describe complex non-linear relationships. Even though they approximate a real system with something more complex than shallow models, in the end it is still just an approximation. I doubt any real system actually works the way a neural network does. Neural networks were inspired by the way our brain processes information, but they are a huge simplification of it.

A model is defined according to some parameters (a parametric model). On the one hand, we have the definition of the model as a function mapping an input space to an output space. On the other hand, we have the set of parameters that the function needs in order to apply the mapping, for instance, the weight matrices and biases.

Model fitting and training are two terms referring to the process of estimating the parameters of that model so that it best describes the underlying data. Model fitting happens via a learning algorithm that defines a loss function depending on both the model parameters and the data, and tries to minimize this function by estimating the best set of values for the model parameters. One of the most common algorithms is gradient descent, with all its variants; see the previous Training section. For an auto-encoder, you would minimize the reconstruction error plus the regularization penalty, if any.
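As a minimal sketch of these ideas (not the actual training code for our use case), the following snippet assumes a hypothetical `model(params, X)` function that reconstructs its input, and shows the general shape of an auto-encoder loss, reconstruction error plus an optional L2 penalty, together with a single plain gradient descent update:

```python
import numpy as np

def autoencoder_loss(params, X, model, l2_weight=1e-4):
    """Mean squared reconstruction error plus an (optional) L2
    regularization penalty on the parameters."""
    X_hat = model(params, X)                       # reconstruct the input
    mse = np.mean((X - X_hat) ** 2)                # reconstruction error
    penalty = l2_weight * sum(np.sum(p ** 2) for p in params)
    return mse + penalty

def gradient_descent_step(params, grads, learning_rate=0.01):
    """One plain gradient descent update: move each parameter against
    its gradient, scaled by the learning rate."""
    return [p - learning_rate * g for p, g in zip(params, grads)]
```

In practice, a deep learning framework computes the gradients for us and offers many variants of this update rule (momentum, adaptive learning rates, and so on).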

Validation is sometimes confused with testing and evaluation. Validation and testing often use the same techniques and/or methodology but they serve two different purposes.

Model validation corresponds to a type of hypothesis validation. We consider our data to be well described by a model. The hypothesis is that, if that model is correct, after having been trained (parameter estimation), it will describe unseen data the same way it describes the training set. We hypothesize that the model generalizes well enough given the limits of the scenario in which we will use it. Model validation aims to find a measure (often referred to as a metric) that quantifies how well the model fits the validation data. For labeled data, we might derive a few metrics from either the Receiver Operating Characteristic (ROC) or Precision-Recall (PR) curve computed from the anomaly scores on the validation data. For unlabeled data, we could, for instance, use the Excess-Mass (EM) or Mass-Volume (MV) curve.
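For the labeled case, a minimal sketch with scikit-learn (the labels and scores below are made-up toy values) computes the area under the ROC and PR curves directly from the anomaly scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy validation data: y_true marks the true anomalies (1) versus normal
# points (0); scores are the anomaly scores produced by the fitted model,
# for example the reconstruction errors of an auto-encoder.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.10, 0.30, 0.90, 0.20, 0.70, 0.15, 0.40, 0.85])

roc_auc = roc_auc_score(y_true, scores)            # area under the ROC curve
pr_auc = average_precision_score(y_true, scores)   # area under the PR curve
print(f"ROC AUC: {roc_auc:.3f}, PR AUC: {pr_auc:.3f}")
```

The EM and MV curves for the unlabeled case are not part of scikit-learn and would need a dedicated implementation.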

Although model validation can be a way to evaluate performance, it is widely used for model selection and tuning.

Model selection is the process of selecting, from a set of candidates, the model that scores highest in validation. The set of candidates could include different configurations of the same model, many different models, different selections of features, different normalization and/or transformation techniques, and so on.

In deep neural networks, feature selection can often be omitted because we delegate to the network itself the task of figuring out and generating relevant features. Moreover, irrelevant features are also discarded via regularization during learning.

The hypothesis space (the model parameters) depends on the choice of topology, the activation functions, size and depth, pre-processing (for example, whitening of an image or data cleansing), and post-processing (for example, using an auto-encoder to reduce the dimensionality and then running a clustering algorithm). We might see the whole pipeline (the set of components in a given configuration) as the model, even though the fitting could happen independently for each piece.

Analogously, the learning algorithm will introduce a few parameters (for example, the learning rate or decay rate). In particular, since we want to maximize the generalization of the model, we generally introduce a regularization technique during learning, and that will introduce additional parameters (for example, the sparsity coefficient, noise ratio, or regularization weight).

Moreover, the particular implementation of the algorithm also has a few parameters (for example, the number of epochs or the number of samples per iteration).

We can use the same validation technique to quantify the performance of the model and learning algorithm together. We can imagine having a single big vector of parameters that includes the model parameters plus the hyper-parameters, and tuning everything in order to optimize the validation metric.
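For instance, such a search space could be expressed as a single dictionary that mixes model, learning, and implementation parameters. The names and values below are purely illustrative:

```python
# A hypothetical search space: topology, learning-algorithm, and
# implementation parameters are all treated as one big set of knobs
# to be tuned against the validation metric.
search_space = {
    # model / topology parameters
    "hidden_layers": [[32], [64, 32], [128, 64, 32]],
    "activation": ["relu", "tanh"],
    # learning-algorithm parameters
    "learning_rate": [0.1, 0.01, 0.001],
    "l2_weight": [0.0, 1e-4, 1e-2],
    # implementation parameters
    "epochs": [20, 50],
    "batch_size": [64, 128],
}
```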

At the end of the model selection and tuning via validation, we have obtained a system that:

  • Takes some of the available data
  • Divides it into training and validation sets, making sure not to introduce biases or imbalance
  • Creates a search space made up of the set of different models, or different configurations, learning parameters, and implementation parameters
  • Fits each model on the training set by using the training data and learning algorithm with a given loss function, including regularization, according to the specified parameters
  • Computes the validation metric by applying the fitted model on the validation data
  • Selects the point in the search space that optimizes the validation metric
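A minimal sketch of this selection loop is shown below. It assumes a search space like the one above, plus hypothetical `fit_model` and `validation_metric` functions, and it treats higher metric values as better (as with ROC AUC):

```python
import itertools

def select_best_configuration(X_train, X_valid, y_valid,
                              search_space, fit_model, validation_metric):
    """Exhaustively enumerate the (small) search space, fit each candidate
    on the training data, score it on the validation data, and keep the
    best one. fit_model and validation_metric are placeholders for the
    actual training routine and the chosen metric."""
    best_score, best_config, best_model = -float("inf"), None, None
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        config = dict(zip(keys, values))
        model = fit_model(X_train, config)                   # fit on the training set
        score = validation_metric(model, X_valid, y_valid)   # score on the validation set
        if score > best_score:
            best_score, best_config, best_model = score, config, model
    return best_config, best_model, best_score
```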

The selected point will formalize our final theory. The theory says that our observations are generated from a model that is the outcome of the pipeline corresponding to the selected point.

Evaluation is the process of verifying that the final theory is acceptable and quantifying its quality from both technical and business perspectives.

Scientific literature shows how, during the course of history, one theory has succeeded another. Choosing the right theory without introducing a cognitive bias requires rationality, accurate judgment, and logical interpretation.

Confirmation theory, the study of the non-deductive reasoning that guides science, can help us define a few principles.

In our context, we want to quantify the quality of our theory and verify that it is good enough and that it has an evident advantage over a much simpler theory (the baseline). A baseline could be a naïve implementation of our system. In the case of an anomaly detector, it could simply be a rule-based threshold model that flags as anomalous every observation whose feature values are above a static set of thresholds. Such a baseline is probably the simplest theory we can implement and maintain over time. It will probably not satisfy the full acceptance criteria, but it will help us justify why we need another theory, that is, a more advanced model.
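A minimal sketch of such a baseline is shown below; picking the thresholds as the 99th percentile of each feature on (made-up) training data is just one possible choice:

```python
import numpy as np

def baseline_detector(X, thresholds):
    """Naive rule-based baseline: flag an observation as anomalous if any
    of its feature values exceeds the corresponding static threshold."""
    return np.any(np.asarray(X) > thresholds, axis=1)   # one boolean flag per observation

# Hypothetical usage: thresholds derived from per-feature quantiles
# of toy training data.
X_train = np.random.RandomState(0).normal(size=(1000, 3))
thresholds = np.quantile(X_train, 0.99, axis=0)          # 99th percentile per feature
flags = baseline_detector(X_train, thresholds)
```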

Colyvan, in his book The Indispensability of Mathematics, summarized the conditions for accepting a good theory as a replacement for another in terms of four major criteria:

  1. Simplicity/Parsimony: Simple is better than complex if the empirical results are comparable. Complexity is only required when you need to overcome some limitation. Otherwise, simplicity should be preferred in both its mathematical form and its ontological commitments.
  2. Unification/Explanatory Power: The capacity of consistently explaining both existing and future observations. Moreover, unification means minimizing the number of theoretical devices needed for the explanation. A good theory offers an intuitive way of explaining why a given prediction is expected.
  3. Boldness/Fruitfulness: A bold theory is an idea that, if it were true, would be able to predict and/or explain a lot more about the system we are modeling. Boldness helps us reject theories that would contribute very little to what we already know. We are allowed to formulate something new and innovative and then try to contradict it with known evidence. Even if we cannot prove a theory is correct, we can demonstrate that the evidence does not prove the contrary. Another aspect is heuristic potential: a good theory can enable more theories. Between two theories, we want to favor the more fruitful one, that is, the one with more potential for being reused or extended in the future.
  4. Formal elegance: A theory must have an aesthetic appeal and should not rely on ad-hoc modifications to patch a failing theory. Elegance is the quality of explaining something in a clear, economical, and concise way. Elegance also enables better scrutiny and maintainability.

These criteria, in the case of neural networks, are translated into the following:

  1. Shallow models with a few layers and small capacity are preferred. As we discussed in the Network design section, we start with something simpler and incrementally increase complexity if we need to. Eventually the complexity will converge, and any further increase will not give any benefit.
  2. We will distinguish between explanatory power and unificatory power:
    • Explanatory power is evaluated similarly to model validation but with a different dataset. We mentioned earlier that we split the data into three groups: training, validation, and testing. We use the training and validation sets to formulate the theory (the model and hyper-parameters); the model is then retrained on the union of the training and validation sets, which becomes the new training set; and ultimately the final, already validated, model is evaluated against the test set (see the sketch after this list). It is important at this stage to compare the validation metrics on the training set and the test set. We would expect the model to perform better on the training set, but too wide a gap between the two means the model does not explain unseen observations very well.
    • Unificatory power can be represented by the model sparsity. Explaining means mapping input to output; unifying means reducing the number of elements required to apply that mapping. By adding a regularization penalty, we make the features sparser, which means we can explain an observation and its prediction using fewer regressors (theoretical devices).
  3. Boldness and fruitfulness can also be split into two aspects:
    • Boldness is represented by our test-driven approach. In addition to point 2, where we try to make clear what a model does and why, in the test-driven approach we treat the system as a black box and check its responses under different conditions. For an anomaly detector, we can systematically create failing scenarios with different degrees of anomalousness and measure the level at which the system is able to detect and react. For time-responsive detectors, we could measure how long it takes to detect a drift in the data. If the tests pass, then we have gained confidence that the system works, regardless of how it does so. This is probably one of the most common approaches in machine learning: we try everything that we believe can work, we carefully evaluate, and we tentatively accept when our critical efforts are unsuccessful (that is, the tests pass).
    • Fruitfulness comes from the reusability of a given model and system. Is it too strongly coupled to the specific use case? Auto-encoders work independently of what the underlying data represent; they use very little domain knowledge. Thus, if the theory is that a given auto-encoder can be used to explain a system in its working conditions, then we could extend it and re-use it for detection in any kind of system. If we introduce a pre-processing step (such as image whitening), then we are assuming the input data are pixels of an image; even if this theory fits our use case superbly, it contributes less to broader reusability. Nevertheless, if the domain-specific pre-processing improves the final result noticeably, then we will consider it an important part of the theory. But if the contribution is negligible, it is recommended to reject it in favor of something more reusable.
  4. One aspect of elegance in deep neural networks could implicitly be represented by the capacity to learn features from the data rather than hand-crafting them. If that is the case, we can measure how well the same model is able to self-adapt to different scenarios by learning relevant features. For example, we could test that, given any dataset we consider normal, we can always construct an auto-encoder that learns the normal distribution. We can either add or remove features from the same dataset or partition it according to some external criteria, generating datasets with different distributions. Then we can inspect the learned representations and measure the reconstruction ability of the model. Instead of describing the model as a function of the specific input features and weights, we describe it in terms of neurons: entities with learning capabilities. Arguably, this is a good example of elegance.
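The following sketch illustrates the evaluation flow described in point 2: split the data, retrain the selected configuration on the union of training and validation, and score the final model once on the held-out test set. The `fit_model` and `metric` arguments are placeholders for the actual routines, and the 60/20/20 split is an illustrative choice:

```python
import numpy as np

def split_data(X, y, rng):
    """Split the data 60/20/20 into training, validation, and test sets."""
    idx = rng.permutation(len(X))
    n_train, n_valid = int(0.6 * len(X)), int(0.2 * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_valid])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

def evaluate_explanatory_power(X, y, best_config, fit_model, metric, rng):
    """Retrain the selected configuration on training + validation data,
    then score it once on the test set and report both scores."""
    (X_tr, y_tr), (X_va, y_va), (X_te, y_te) = split_data(X, y, rng)
    X_fit, y_fit = np.concatenate([X_tr, X_va]), np.concatenate([y_tr, y_va])
    final_model = fit_model(X_fit, best_config)   # train + validation become the new training set
    train_score = metric(final_model, X_fit, y_fit)
    test_score = metric(final_model, X_te, y_te)
    # A wide gap between train_score and test_score means the model does
    # not explain unseen observations well.
    return final_model, train_score, test_score
```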

From a business perspective, we really need to think carefully about what the acceptance criteria are.

We would like to answer at least the following questions:

  • What problem are we trying to solve?
  • How is the business going to benefit from it?
  • In which way will the model be integrated within an existing system from a practical and technical point of view?
  • What is the final deliverable so that it is consumable and actionable?

We will use an intrusion detection system as an example and try to answer these questions.

We would like to monitor network traffic in real time, taking individual network connections and marking them as normal or suspicious. This will give the business enhanced protection against intruders. The flagged connections will be stopped and placed in a queue for manual inspection. A team of security experts will look into those connections and determine whether each is a false alarm and, in the case of a confirmed attack, will mark the connection with one of the available labels. Thus, the model has to provide a real-time list of connections sorted by their anomaly score. The list cannot contain more elements than the capacity of the security team. Moreover, we need to balance the cost of permitting an attack, the cost of damages in the case of an attack, and the cost required for inspection. A minimum requirement on precision and recall is a must in order to probabilistically limit the worst-case scenario.
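A minimal sketch of how the model's output could be consumed and checked against such requirements is shown below; `capacity`, `min_precision`, and `min_recall` are hypothetical business parameters:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def flag_connections(scores, capacity):
    """Return the indices of the most anomalous connections, capped to the
    number of cases the security team can inspect."""
    order = np.argsort(scores)[::-1]       # most anomalous first
    return order[:capacity]

def meets_requirements(y_true, scores, capacity, min_precision, min_recall):
    """Check the minimum precision/recall requirement on labeled data."""
    y_pred = np.zeros_like(y_true)
    y_pred[flag_connections(scores, capacity)] = 1
    return (precision_score(y_true, y_pred) >= min_precision and
            recall_score(y_true, y_pred) >= min_recall)
```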

All of these evaluation strategies have been defined mainly qualitatively rather than quantitatively. It would be quite hard to compare and report something that is not numerically measurable.

Bryan Hudson, a Data Science practitioner, said:

If you can't define it, you can't measure it. If it can't be measured, it shouldn't be reported. Define, then measure, then report.

Define, then measure, then report. But be careful: we might be tempted to define a single new evaluation metric that takes into account every possible aspect and scenario discussed so far.

Whilst many data scientists may attempt to quantify the evaluation of a model using a single utility function, as is done during validation, for a real production system this is not advised. As also expressed in the Professional Data Science Manifesto:

A product needs a pool of measures to evaluate its quality. A single number cannot capture the complexity of reality.

The Professional Data Science Manifesto, www.datasciencemanifesto.org

And even after we have defined our Key Performance Indicators (KPIs), their real meaning is relative when compared to a baseline. We must consider why we need this solution with respect to a much simpler or existing one.

The evaluation strategy requires defining test cases and KPIs so that we can cover the most relevant scientific aspects and business needs. Some of them are aggregated numbers; others can be represented in charts. We aim to summarize and efficiently present all of them in a single evaluation dashboard.
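As an illustration, such a dashboard could be fed by a pool of measures like the hypothetical ones below, some aggregated into single numbers and others kept as curves to be plotted; the names and groupings are illustrative, not a prescribed standard:

```python
# A hypothetical pool of measures for the intrusion detection use case.
evaluation_dashboard = {
    "detection": {"roc_auc": 0.0, "pr_auc": 0.0, "recall_at_capacity": 0.0},
    "operations": {"mean_detection_delay_s": 0.0, "daily_inspection_load": 0},
    "business": {"estimated_cost_of_missed_attacks": 0.0,
                 "estimated_inspection_cost": 0.0},
    "charts": {"roc_curve": None, "precision_recall_curve": None},
}
```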

In the following sections, we will see a few techniques for model validation using both labeled and unlabeled data.

Then we will see how to tune the parameter space using some parallel search techniques.

Lastly we will give an example of a final evaluation for a network intrusion use case using A/B testing techniques.
