Model validation

The goal of model validation is to evaluate whether the estimations/predictions produced by the trained model are an acceptable description of an independent dataset. The main reason is that any measure computed on the training set would be biased and optimistic, since the model has already seen those observations. If we don't have a separate dataset for validation, we can hold one fold of the data out from training and use it as a benchmark. Another common technique is cross-fold validation, and its stratified version, where the whole historical dataset is split into multiple folds. For simplicity, we will discuss the hold-out method; the same criteria also apply to cross-fold validation.

The splitting into training and validation sets cannot be purely random. The validation set should represent the future hypothetical scenario in which we will use the model for scoring. It is important not to contaminate the validation set with information that is highly correlated with the training set (leakage).

A bunch of criteria can be considered. The easiest is time: if your data is chronological, then you'll want the validation set to always come after the training set.

If your deployment plan is to retrain once a day and score all the observations of the next 24 hours, then your validation set should cover exactly 24 hours. Observations beyond those 24 hours would never be scored with that model, but with a model retrained on the additional past 24 hours' observations.

Of course, only using 24 hours of observations for validation is too restrictive. We will have to perform a few validations, where we select a number of time split points; for each split point, we train the model up to that point and validate on the data in the following validation window.

The choice of the number of split points depends on the amount of available resources. Ideally, we would like to match the exact frequency at which the model will be retrained, that is, one split point per day for the last year or so.
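
As an illustrative sketch, the rolling split points could be generated as follows, assuming the observations live in a pandas DataFrame and that a (hypothetical) feature_ready_at column holds the time at which each observation became available as a feature vector:

```python
import pandas as pd

def rolling_time_splits(df, timestamp_col, n_splits, validation_window):
    """Yield (split_point, train_index, validation_index) tuples.

    For each split point, the training set contains everything available
    strictly before the split point, and the validation set covers the
    following `validation_window` (for example, 24 hours).
    """
    ts = df[timestamp_col]
    last = ts.max()
    # Place the split points at the end of the history, one validation
    # window apart, mimicking the daily retraining schedule.
    split_points = sorted(last - (i + 1) * validation_window
                          for i in range(n_splits))
    for split in split_points:
        train_idx = df[ts < split].index
        valid_idx = df[(ts >= split) & (ts < split + validation_window)].index
        yield split, train_idx, valid_idx

# Hypothetical usage:
# for split, tr, va in rolling_time_splits(connections, "feature_ready_at",
#                                          n_splits=30,
#                                          validation_window=pd.Timedelta("24h")):
#     ...train on connections.loc[tr], validate on connections.loc[va]...
```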

There are a bunch of operational aspects to consider when splitting into training and validation sets:

  • Regardless of the timestamp attached to the data, the chronological order used for splitting should be based on when the data would actually have been available. In other words, suppose there is a 6-hour delay between data generation and the time it is turned into a feature space for training; you should use the latter time to decide what falls before or after a given split point.
  • How long does the training procedure take? Suppose our model requires 1 hour to be retrained; we would schedule its training one hour before the expiration of the previous model. Scores generated during that training interval are covered by the previous model, which means the new model cannot be asked to predict any observation that happens within the hour following the last data collected for training. This introduces a gap between the training set and the validation set.
  • How does the model perform for day-0 malware (the cold start problem)? During validation, we want to project the model onto the worst-case scenario rather than being over-optimistic. If we can find a partitioning attribute, such as device ID or network card MAC address, we can divide users into buckets representing different validation folds and perform a cross-fold validation where we iteratively select one fold of users to validate the model trained on the remaining folds. By doing so, we always validate predictions for users whose history we have never seen before. That helps with truly measuring generalization in those cases where the training set already contains a strong anomaly signal for the same device over past connections. In that case, it would be very easy for the model to spot the anomalies, but they would not necessarily match a real use case.
  • The choice of the attribute (primary key) on which to apply the partitioning is not simple. We want to reduce the correlation among folds as much as possible. If we naively partition on the device ID, how will we cope with the same user or the same machine having multiple devices, each registered with a different identifier? The choice of partitioning key is an entity resolution problem. The correct way of solving this issue is to first cluster the data belonging to the same entity and then partition so that data of the same entity is never split among different folds. The definition of the entity depends on the particular use case context.
  • When performing cross-fold validation, we still need to enforce the time constraint. That is, for each validation fold, we need to find a time split point in the intersection with the other training folds. We filter the training set both on the entity ID and on the timestamp; we then filter the data in the validation fold according to the validation window and the gap (a sketch combining these constraints follows this list).
  • Cross-fold validation introduces a problem with class unbalancing. By definition, anomalies are rare; thus, our dataset is highly skewed. If we randomly sample entities, we would probably end up with a few folds without any anomalies and a few with too many. Thus, we need to apply a stratified cross-fold validation that preserves roughly the same distribution of anomalies in each fold. This is a tricky problem in the case of unlabeled data, but we can still run some statistics on the whole feature space and partition in such a way as to minimize the distribution differences among folds.
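
The following sketch puts together the entity-based partitioning, the chronological split point, and the retraining gap discussed in this list. The column names, the window lengths, and the random (non-stratified) assignment of entities to folds are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np
import pandas as pd

def entity_time_folds(df, entity_col, timestamp_col, split_point,
                      gap, validation_window, n_folds=5, seed=42):
    """Yield (train_index, validation_index) pairs for an
    entity-partitioned, time-aware cross-fold validation.

    Entities (resolved devices/users) are randomly assigned to `n_folds`
    buckets; each fold validates only on entities never seen in training.
    The training data stops at `split_point`; the validation window starts
    after an extra `gap` that models the retraining time.
    """
    rng = np.random.RandomState(seed)
    entities = df[entity_col].unique()
    assignment = dict(zip(entities, rng.randint(0, n_folds, size=len(entities))))
    fold = df[entity_col].map(assignment)
    ts = df[timestamp_col]
    valid_start = split_point + gap

    for k in range(n_folds):
        train_mask = (fold != k) & (ts < split_point)
        valid_mask = ((fold == k) & (ts >= valid_start) &
                      (ts < valid_start + validation_window))
        yield df[train_mask].index, df[valid_mask].index
```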

We have just listed a few of the common pitfalls to consider when defining the splitting strategy. Now we need to compute some metrics. The choice of the validation metric should be meaningful for the real operational use case.

We will see in the following sections a few possible metrics defined for both labeled and unlabeled data.

Labeled Data

Anomaly detection on labeled data can be treated just like a standard binary classification problem.

Let s(x) be our anomaly scoring function, where the higher the score, the higher the probability of the observation being an anomaly. For auto-encoders, it could simply be the MSE of the reconstruction error rescaled to the range [0, 1]. We are mainly interested in the relative ordering rather than the absolute values.
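
For instance, a minimal way of turning per-observation reconstruction errors into such a score could be a min-max rescaling (one arbitrary choice among many, since only the ordering matters):

```python
import numpy as np

# mse: per-observation reconstruction error of the auto-encoder (toy values)
mse = np.array([0.02, 0.01, 0.35, 0.03, 0.50])

# Rescale to [0, 1]; only the relative ordering of the scores matters.
s = (mse - mse.min()) / (mse.max() - mse.min() + 1e-12)
```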

We can now validate using either the ROC or PR curve.

In order to do so, we need to set a threshold a on the scoring function s and consider all of the points x with score s(x) ≥ a to be classified as anomalies.

For each value of a, we can calculate the confusion matrix as:

Number of observations n | Predicted anomaly (s(x) ≥ a) | Predicted non-anomaly (s(x) < a)
True anomaly             | True Positive (TP)           | False Negative (FN)
True non-anomaly         | False Positive (FP)          | True Negative (TN)

From each confusion matrix corresponding to a value of a, we can derive the measures of True Positive Rate (TPR) and False Positive Rate (FPR) as:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$
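
As a minimal numpy sketch, given the labels y_true (1 for anomaly, 0 for normal) and the anomaly scores s, the two rates at a threshold a can be computed as:

```python
import numpy as np

def tpr_fpr_at_threshold(y_true, s, a):
    """TPR and FPR when all points with s(x) >= a are flagged as anomalies."""
    y_pred = s >= a
    tp = np.sum(y_pred & (y_true == 1))
    fn = np.sum(~y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    tn = np.sum(~y_pred & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)
```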

We can draw, for each value of a, the point (FPR(a), TPR(a)) in a two-dimensional space; the resulting plot is the ROC curve.

The way we interpret the plot is as follows: each cut-off point tells us on the y-axis the fraction of anomalies that we have spotted among the full set of anomalies in the validation data (Recall). The x-axis is the false alarm ratio, the fraction of observations marked as anomalies among the full set of normal observations.

If we set the threshold close to 0, it means we are flagging everything as anomaly but all the normal observations will produce false alarms. If we set it close to 1, we will never fire any anomaly.

Let's suppose for a given value of a the corresponding TPR = 0.9 and FPR = 0.5; this means that we detected 90% of anomalies but the anomaly queue contained half of the normal observations as well.

The best threshold point would be the one located at coordinates (0, 1), which corresponds to 0 false positives and 0 false negatives. This never happens in practice, so we need to find a trade-off between the Recall and the false alarm ratio.

One of the issues with the ROC curve is that it does not show very well what happens with a highly skewed dataset. If anomalies represent only 1% of the data, the values on the x axis are very likely to be small, and we might be tempted to relax the threshold in order to increase the Recall without any visible effect on the x axis.

The Precision-Recall (PR) plot swaps the axes and replaces the FPR with the Precision, defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Precision is a more meaningful metric and represents the fraction of anomalies among the list of detected ones.

The idea now is to maximize both axes. On the y axis, we can observe the expected precision of the portion that will be inspected, while the x axis (the Recall) tells us what fraction of the anomalies we will catch, and, by complement, how many we will miss; both are on a scale that depends only on the anomaly probability.

Having a two-dimensional plot can help us understand how the detector would behave in different scenarios, but in order to apply model selection, we need to reduce it to a single utility function to optimize.

A bunch of measures can be used to synthesize this. The most common one is the area under the curve (AUC), which is an indicator of the average performance of the detector under any threshold. For the ROC curve, the AUC can be interpreted as the probability that a uniformly drawn random anomalous observation is ranked higher than a uniformly drawn random normal observation. Since it averages over every threshold, including ones we would never operate at, it is not very useful for anomaly detection on highly skewed data.
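
If scikit-learn is available, both curves and their areas can be obtained directly from the labels and the anomaly scores; a small sketch with toy values:

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# y_true: 1 for anomaly, 0 for normal; s: anomaly scores (toy values)
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
s = np.array([0.1, 0.3, 0.9, 0.2, 0.7, 0.4, 0.05, 0.6])

fpr, tpr, roc_thresholds = roc_curve(y_true, s)
roc_auc = auc(fpr, tpr)

precision, recall, pr_thresholds = precision_recall_curve(y_true, s)
pr_auc = auc(recall, precision)

print("ROC AUC: %.3f, PR AUC: %.3f" % (roc_auc, pr_auc))
```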

Since Precision and Recall are defined on the same scale, their absolute values can be aggregated using a weighted harmonic mean, also known as the F-score:

$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$

Here, β is a coefficient that weights to what extent Recall is more important than Precision.

The term (1 + β²) is added in order to scale the score between 0 and 1.

In the case of symmetry (β = 1), we obtain the F1-score:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Security analysts can also set preferences based on minimum requirements for the values of Precision and Recall. In that situation, we can define the Preference-Centric score as:

$$\mathrm{PC}(a) = \begin{cases} 1 + F_1(a) & \text{if } \mathrm{Precision}(a) \ge P_{\min} \text{ and } \mathrm{Recall}(a) \ge R_{\min} \\ F_1(a) & \text{otherwise} \end{cases}$$

The PC-score allows us to select a range of acceptable thresholds and to optimize within that range based on the F1-score. The unit term in the first case is added so that any threshold meeting the requirements will always outperform one that does not.
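
A possible transcription of this threshold-selection rule, assuming the piecewise definition reconstructed above and taking min_precision and min_recall as the analysts' requirements, could look like the following sketch:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold_pc(y_true, s, min_precision, min_recall, beta=1.0):
    """Pick the threshold that maximizes the Preference-Centric score."""
    precision, recall, thresholds = precision_recall_curve(y_true, s)
    best_a, best_pc = None, -np.inf
    # precision/recall have one extra trailing entry with no threshold.
    for p, r, a in zip(precision[:-1], recall[:-1], thresholds):
        if p + r == 0:
            continue
        f_beta = (1 + beta**2) * p * r / (beta**2 * p + r)
        pc = 1 + f_beta if (p >= min_precision and r >= min_recall) else f_beta
        if pc > best_pc:
            best_a, best_pc = a, pc
    return best_a, best_pc
```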

Unlabeled Data

Unfortunately, most of the time data comes without labels, and it would require too much human effort to categorize each observation.

We propose two alternatives to the ROC and PR curves that do not require labels: the Mass-Volume (MV) and the Excess-Mass (EM) curves.

Let s(x) be our inverse anomaly scoring function this time, where the smaller the score, the higher the probability of the observation being an anomaly. In the case of an auto-encoder, we can use the inverse of the reconstruction error:

$$s(x) = \frac{1}{\mathrm{RE}(x) + \epsilon}$$

Here, RE(x) is the reconstruction error of observation x and ϵ is a small term added to stabilize the score in the case of a near-zero reconstruction error.
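
As a tiny sketch, assuming the per-observation reconstruction errors have already been computed:

```python
import numpy as np

# reconstruction_error: per-observation error of the auto-encoder (toy values)
reconstruction_error = np.array([0.02, 0.01, 0.35, 0.03, 0.50])
eps = 1e-6  # stabilizer for near-zero reconstruction errors

s = 1.0 / (reconstruction_error + eps)  # the smaller s, the more anomalous
```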

The scoring function induces an ordering of the observations.

Let f be the probability density function of the underlying distribution of the normal (non-anomalous) data, from which the i.i.d. observations X1, …, Xn are drawn, and F its cumulative distribution function.

The function f would return a value very close to 0 for any observation that does not belong to the normal distribution. We want a measure of how close the scoring function s is to f; the ideal scoring function would simply coincide with f. We will call such a performance criterion C(s).

Given the set S of scoring functions integrable with respect to the Lebesgue measure λ, the MV-curve of a scoring function s in S is the plot of the mapping:

$$\alpha \in (0, 1) \mapsto \mathrm{MV}_s(\alpha) = \inf_{u \ge 0} \lambda\big(\{x : s(x) \ge u\}\big) \quad \text{subject to} \quad \mathbb{P}\big(s(X) \ge u\big) \ge \alpha$$

Here, λ denotes the Lebesgue measure (the volume) of a set.

The Lebesgue measure of a set X is obtained by dividing the set into buckets (sequences of open intervals) and summing the n-volume of each bucket. The n-volume is the product of the lengths along each dimension, each defined as the difference between the max and min values. If X is a set of d-dimensional points, its projection on each axis gives the lengths, and the product of those lengths gives the d-dimensional volume.

The MV measure at α therefore corresponds to the smallest n-volume of a level set {x : s(x) ≥ u} that still contains at least a fraction α of the probability mass of s(X).

[Figure: Mass-Volume curve, from "Mass Volume Curves and Anomaly Ranking", S. Clémençon, UMR LTCI No. 5141, Telecom ParisTech/CNRS]

The optimal MV curve is the one calculated on f. We would like to find the scoring function s that minimizes the L1 norm of the point-wise difference with MV_f on an interval of interest I^MV representing the large density level-sets (for example, [0.9, 1]).

It is proven that MV_f(α) ≤ MV_s(α) for any α in (0, 1). Since MV_f always lies below MV_s, the L1 norm of the point-wise difference corresponds to the area between the two curves over the interval. Our performance criterion for MV is C^MV(s) = ‖MV_s − MV_f‖, the L1 norm computed on I^MV. The smaller the value of C^MV, the better the scoring function.

One problem with the MV-curve is that the area under the curve (AUC) diverges for α = 1 if the support of the distribution is infinite (the set of possible values is not bounded).

One workaround is to cap the interval below 1, for example I^MV = [0.9, 0.999].

A better variant is the Excess-Mass (EM) curve defined as the plot of the mapping:

$$t \in (0, \infty) \mapsto \mathrm{EM}_s(t) = \sup_{u \ge 0} \Big\{ \mathbb{P}\big(s(X) \ge u\big) - t \cdot \lambda\big(\{x : s(x) \ge u\}\big) \Big\}$$

The performance criterion is now C^EM(s) = ‖EM_s − EM_f‖, the L1 norm computed on the interval I^EM = [0, EM_s⁻¹(0.9)], where EM_s⁻¹(0.9) = inf{t ≥ 0 : EM_s(t) ≤ 0.9}. EM_s(t) is now always finite.

[Figure: Excess-Mass curve, from "On Anomaly Ranking and Excess-Mass Curves", N. Goix, A. Sabourin, S. Clémençon, UMR LTCI No. 5141, Telecom ParisTech/CNRS]

One problem with EM is that the interval of large level sets is of the same order of magnitude as the inverse of the total support volume, which becomes problematic for datasets with large dimensions. Moreover, for both EM and MV, the distribution f of the normal data is not known and must be estimated. In practice, the Lebesgue volume can be estimated via Monte Carlo approximation, which applies only to small dimensions.

In order to scale to large-dimensional data, we can iteratively sub-sample the training and validation data, with replacement, along a randomly fixed number of features d', and compute the EM or MV performance criterion on each draw. Replacement is done only after we have drawn the samples for each subset of features.

The final performance criterion is obtained by averaging these partial criteria over the different feature draws. The drawback is that we cannot validate combinations of more than d' features. On the other hand, this feature sampling allows us to estimate EM or MV for large dimensions, and to compare models built on spaces of different dimensions, in case we want to select among models that consume different views of the input data.
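
Neither EM nor MV is available off the shelf in the usual Python libraries. The sketch below combines the Monte Carlo estimate of the Lebesgue volume with the feature sub-sampling just described; since f is unknown, each draw is summarized by the area under the estimated EM curve (a common practical proxy, where higher is better). The fit_score_fn helper (which trains a detector on the selected features and returns its scoring function), the uniform sampling in the bounding box, and the grid sizes are all illustrative assumptions, not the exact procedure from the cited papers:

```python
import numpy as np

def em_curve_area(score_fn, X_valid, n_mc=10000, t_max=0.9, n_t=100,
                  n_levels=100, rng=None):
    """Rough estimate of the area under the Excess-Mass curve of score_fn.

    EM_s(t) = sup_u { P(s(X) >= u) - t * volume({x : s(x) >= u}) }:
    the mass term is estimated on the validation scores, the Lebesgue
    volume by Monte Carlo with uniform samples in the data bounding box.
    """
    rng = rng or np.random.RandomState(0)
    s_valid = score_fn(X_valid)

    # Uniform samples in the axis-aligned bounding box of the data.
    lo, hi = X_valid.min(axis=0), X_valid.max(axis=0)
    box_volume = np.prod(hi - lo)
    s_unif = score_fn(rng.uniform(lo, hi, size=(n_mc, X_valid.shape[1])))

    # Candidate level-set thresholds u, with the mass and volume of each level set.
    levels = np.quantile(s_valid, np.linspace(0.0, 1.0, n_levels))
    mass = np.array([(s_valid >= u).mean() for u in levels])
    volume = np.array([(s_unif >= u).mean() * box_volume for u in levels])

    # EM curve on a grid of t values (t_max approximates the EM^-1(0.9) bound).
    t_grid = np.linspace(0.0, t_max, n_t)
    em = np.maximum(np.array([np.max(mass - t * volume) for t in t_grid]), 0.0)
    return np.trapz(em, t_grid)

def em_score_feature_subsampling(fit_score_fn, X_train, X_valid,
                                 d_prime=5, n_draws=20, seed=0):
    """Average the EM area over random subsets of d' features.

    fit_score_fn(X) is assumed to train a detector on X and return a
    function mapping observations to anomaly scores.
    """
    rng = np.random.RandomState(seed)
    d = X_train.shape[1]
    areas = []
    for _ in range(n_draws):
        feats = rng.choice(d, size=min(d_prime, d), replace=False)
        # Sub-sample observations with replacement, features without.
        rows_tr = rng.choice(len(X_train), size=len(X_train), replace=True)
        rows_va = rng.choice(len(X_valid), size=len(X_valid), replace=True)
        score_fn = fit_score_fn(X_train[np.ix_(rows_tr, feats)])
        areas.append(em_curve_area(score_fn, X_valid[np.ix_(rows_va, feats)],
                                   rng=rng))
    return np.mean(areas)
```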

Summary of validation

We have seen how we can plot curve diagrams and compute aggregated measures in the case of both labeled and unlabeled data.

We have shown how to select sub-ranges of the threshold of the scoring function in order to make the aggregated metric more significant for anomaly detection. For the PR-curve, we can set minimum requirements on Precision and Recall; for EM or MV, we arbitrarily select the interval corresponding to the large level-sets, even though it has no directly corresponding meaning.

In our example of network intrusion, we score anomalous points and store them in a queue for further human inspection. In that scenario, we also need to consider the throughput of the security team. Let's suppose they can inspect only 50 connections per day; our performance metrics should then be computed only on the top 50 elements of the queue. Even if the model is able to reach a Recall of 100% within the first 1,000 elements, inspecting 1,000 elements is simply not feasible in a real scenario.

This situation actually simplifies the problem, because we automatically select the threshold that gives us the expected number of predicted anomalies, independently of how many turn out to be true or false positives. The result is the best the model can do given the top N observations most likely to be anomalies.
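
A metric restricted to such an inspection budget can be as simple as the precision over the top N scored observations; a sketch (N = 50 matches the example above):

```python
import numpy as np

def precision_at_top_n(y_true, s, n=50):
    """Precision computed only on the n highest-scoring observations."""
    top = np.argsort(s)[::-1][:n]
    return np.asarray(y_true)[top].mean()
```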

There is also another issue with this kind of threshold-based validation metric in the case of cross-fold validation: the aggregation technique. There are two major ways of aggregating: micro and macro.

Macro aggregation is the most common one; we compute thresholds and metrics in each validation fold and then we average them. Micro aggregation consists of storing the results of each validation fold, concatenating them together and computing one single threshold and metric at the end.

The macro aggregation technique also gives a measure of stability, that is, of how much the performance of our system changes when we perturb it by using different samples. On the other hand, macro aggregation introduces more bias into the estimates, especially for rare classes as in anomaly detection. Thus, micro aggregation is generally preferred.
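
The difference between the two aggregations can be sketched as follows, assuming a per-fold metric callable such as the precision_at_top_n of the previous sketch:

```python
import numpy as np

def macro_and_micro(folds, metric):
    """Aggregate a threshold-based metric over validation folds.

    `folds` is a list of (y_true, scores) pairs, one per validation fold,
    and `metric` is a callable such as precision_at_top_n above.
    """
    # Macro: compute the metric inside each fold, then average the results.
    macro = np.mean([metric(y, s) for y, s in folds])
    # Micro: concatenate all the folds first, then compute one single metric.
    y_all = np.concatenate([y for y, _ in folds])
    s_all = np.concatenate([s for _, s in folds])
    micro = metric(y_all, s_all)
    return macro, micro
```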
