End-to-end evaluation

From a business point of view, what really matters is the final end-to-end performance. None of your stakeholders will be interested in your training error, parameter tuning, model selection, and so on. What matters are the KPIs computed on top of the final model. Evaluation can be seen as the ultimate verdict.

Also, as we anticipated, evaluating a product cannot be done with a single metric. Generally, it is a good and effective practice to build an internal dashboard that can report, or measure in real time, a set of performance indicators for our product, in the form of aggregated numbers or easy-to-interpret visualization charts. At a single glance, we would like to understand the whole picture and translate it into the value we are generating for the business.

The evaluation phase can, and generally does, include the same methodology as model validation. In previous sections, we have seen a few techniques for validating in the case of labeled and unlabeled data. Those can be the starting points.

In addition to those, we ought to include a few specific test scenarios. For instance:

  • Known versus unknown detection performance: This means measuring the performance of the detector for both known and unknown attacks. We can use the labels to create different training sets, some with no attacks at all and some with a small percentage of them; remember that having too many anomalies in the training set would go against the very definition of an anomaly. We could then measure the precision on the top N elements as a function of the percentage of anomalies in the training set. This gives us an indicator of how well the detector generalizes from past anomalies to hypothetical novel ones. Depending on what we are trying to build, we might be more interested in novel anomalies or in known ones.
  • Relevance performance: Scoring high enough to hit the threshold or to be selected in the top-priority queue is important, but the ranking also matters. We would like the most relevant anomalies to always score at the top of the queue. Here we could either define priorities for the different labels and compute a ranking coefficient (for example, Spearman's), or borrow an evaluation technique used for recommender systems. One example of the latter is mean average precision at k (MAP@k), used in Information Retrieval to score a query engine with regard to the relevance of the returned documents (see the sketch after this list).
  • Model stability: We select the best model during validation. If we sample the training data differently or use a slightly different validation dataset (containing different types of anomalies), we would like the best model to always be the same, or at least among the top selected models. We can create histogram charts showing how frequently each model is selected. If there is no obvious winner or small subset of frequent candidates, then the model selection is somewhat unstable. Selecting a different model every day may be good for reacting to new attacks, but it comes at the price of instability.
  • Attack outcome: If the model detects an attack with a very high score and the attack is confirmed by the analysts, is the model also able to detect whether the system has been compromised or has returned to normalcy? One way of testing this is to measure the distribution of the anomaly score right after an alert is raised, compare the new distribution with the previous one, and measure any gap. A good anomaly detector should be able to tell you about the state of the system. The evaluation dashboard could visualize this information for the most recently detected anomalies.
  • Failure case simulations: Security analysts can define some scenarios and generate synthetic data for them. One business target could be "being able to protect against these future types of attack". Dedicated performance indicators can then be derived from this artificial dataset. For example, an increasing ramp of network connections to the same host and port could be a sign of a Denial of Service (DoS) attack.
  • Time to detect: The detector generally scores each point independently. For contextual and time-based anomalies, the same entity might generate many points. For example, if we open a new network connection, we can start scoring it against the detector while it is still open, and every few seconds generate a new point with the features collected over a different time interval. Likely, you will also aggregate multiple sequential connections into a single point to score. We would like to measure how long it takes the detector to react: if the first connection is not considered anomalous, the detector might only react after, say, 10 consecutive attempts. We can pick a known anomaly, break it down into sequentially growing data points, and report after how many of those the contextual anomaly is raised (see the sketch after this list).
  • Damage cost: If we are somehow able to quantify the impact of attack damage, or the savings due to detection, we should incorporate this into the final evaluation. We could use the last month or year as a benchmark and estimate the savings we would have obtained had the current solution been deployed then, or the real savings if it was actually deployed during that period; hopefully this balance will be positive.
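
To make the relevance performance point concrete, here is a minimal sketch of MAP@k for a prioritized anomaly queue. The function names and the example queues are hypothetical; the only input assumed is, for each day (or query), the list of confirmed/dismissed flags of the alerts ordered by descending anomaly score. Note that libraries normalize average precision in slightly different ways.

```python
import numpy as np

def average_precision_at_k(ranked_relevance, k):
    """Average precision at k for a single ranked queue.

    ranked_relevance: 0/1 flags ordered by descending anomaly score,
    where 1 means the analysts confirmed the alert as a relevant anomaly.
    Normalization conventions vary; here we divide by the number of hits."""
    top = ranked_relevance[:k]
    hits, total = 0, 0.0
    for i, rel in enumerate(top, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at this cut-off
    return total / max(hits, 1)

def map_at_k(queues, k):
    """Mean average precision at k over several queues (for example, days)."""
    return float(np.mean([average_precision_at_k(q, k) for q in queues]))

# Two hypothetical daily queues of confirmed (1) / dismissed (0) alerts:
daily_queues = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
print(map_at_k(daily_queues, k=5))
```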
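For the time-to-detect scenario, a sketch along these lines could replay a known anomaly as a sequence of growing snapshots and report when the detector first reacts; `score_fn` and the threshold are assumptions standing in for whatever fitted detector and alerting rule is in place.

```python
def time_to_detect(incremental_points, score_fn, threshold):
    """Replay a known anomaly as sequentially growing feature snapshots and
    return the 1-based index of the first snapshot whose anomaly score
    exceeds the threshold, or None if the detector never reacts."""
    for i, point in enumerate(incremental_points, start=1):
        if score_fn(point) > threshold:
            return i
    return None

# Hypothetical usage: snapshots of the same connection aggregated over
# growing time windows, scored by an already fitted detector.
# steps = time_to_detect(snapshots, detector.anomaly_score, threshold=0.9)
```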

We would like to summarize all of this information within a single dashboard, from which we can make statements such as: our anomaly detector is able to detect previously seen anomalies with a precision of 76% (+/- 5%) and an average reaction time of 10 seconds, and novel anomalies with a precision of 68% (+/- 15%) and a reaction time of 14 seconds. We observed an average of 10 anomalies per day. Given a capacity of 1,000 inspections per day, we can cover 80% of the most relevant detections, corresponding to 6 anomalies, within just the top 120 elements of the queue. Of these, only the 2 out of 10 that compromise the system are included in this list. We can then divide the inspections into two tiers; the first tier will respond immediately to the top 120 elements and the second tier will take care of the tail. Based on the currently simulated failure scenarios, we are protected in 90% of them. The total saving since last year corresponds to 1.2 million dollars.

A/B Testing

So far, we have only considered evaluation based on past historical data (retrospective analysis) and/or on simulations with synthetic datasets. The latter relies on the assumption that a particular failure scenario will happen in the future. Evaluating only on historical data assumes that the system will keep behaving under those conditions and that the current data distribution also describes the stream of future data. Moreover, any KPI or performance metric should be evaluated relative to a baseline: the product owner wants to justify the investment in the project. What if the same problem could have been solved in a much cheaper way?

For this reason, the ultimate source of truth for evaluating any machine learning system is A/B testing. A/B testing is statistical hypothesis testing with two variants (the control and the variation) in a controlled experiment. The goal of A/B testing is to identify performance differences between the two groups. It is a technique widely used in user experience design for websites, and for advertising and/or marketing campaigns. In the case of anomaly detection, we can use a baseline (the simplest rule-based detector) as the control version and the currently selected model as the variation candidate.

The next step is to find a meaningful evaluation metric that quantifies the return on investment.

"We have to find a way of making the important measurable, instead of making the measurable important."

Robert McNamara, former US Secretary of Defense

The return on investment will be represented by the uplift, defined as:

uplift = KPI(variation) - KPI(control)

It is the difference between the two KPIs that quantifies the effectiveness of the treatment.
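
As an illustration, here is a minimal sketch of how the uplift could be estimated, assuming we collect a per-period KPI (for example, daily precision on the top N alerts) for each group; a bootstrap confidence interval is one simple way to judge whether the uplift is meaningful rather than noise. All names and numbers below are illustrative.

```python
import numpy as np

def uplift(control_kpi, variation_kpi, n_boot=10_000, seed=0):
    """Point estimate and bootstrap 95% confidence interval of
    uplift = mean KPI(variation) - mean KPI(control)."""
    control = np.asarray(control_kpi, dtype=float)
    variation = np.asarray(variation_kpi, dtype=float)
    rng = np.random.default_rng(seed)
    point = variation.mean() - control.mean()
    boot = [
        rng.choice(variation, size=variation.size, replace=True).mean()
        - rng.choice(control, size=control.size, replace=True).mean()
        for _ in range(n_boot)
    ]
    low, high = np.percentile(boot, [2.5, 97.5])
    return point, (low, high)

# Hypothetical daily precision@N for the rule-based baseline (control)
# and the currently selected model (variation):
print(uplift([0.61, 0.58, 0.64, 0.60], [0.72, 0.69, 0.75, 0.70]))
```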

In order to make the comparison fair, we must ensure that the two groups share the same distribution of the population. We want to remove any bias introduced by the choice of individuals (data samples). In the case of the anomaly detector, we could, in principle, feed the same stream of data to both models. This is not recommended, though. By applying one model, you can influence the behavior of a given process. A typical example is an intruder who is first detected by one model; the system would react by dropping their open connections. A smart intruder would realize that they have been discovered and would not attempt to connect again. In that case, the second model may never observe an expected pattern because of the influence of the first model.

By separating the two models over two disjoint subsets of data, we make sure the two models cannot influence each other. Moreover, if our use case requires the anomalies to be further investigated by our analysts, then the same anomalies cannot be duplicated across the two queues.

Here, we must split according to the same criteria we have seen for data validation: no data leakage and entity sub-sampling (a minimal sketch of an entity-based split follows). The final test that can confirm whether the two groups are actually identically distributed is A/A testing.
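
One simple way to implement such a split, sketched below under the assumption that each event carries a stable entity identifier (for example, the source host), is to hash that identifier so that all of an entity's traffic lands in the same group and the two models cannot influence each other.

```python
import hashlib

def assign_group(entity_id, groups=("control", "variation")):
    """Deterministically assign an entity (for example, a source host) to one
    group by hashing its identifier, so that all of its events fall into the
    same group and there is no leakage between control and variation."""
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    return groups[int(digest, 16) % len(groups)]

print(assign_group("10.0.0.42"))  # always the same group for this host
```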

As the name suggests, A/A testing consists of reusing the control version on both groups. We expect the performance to be very similar, equivalent to an uplift close to 0. A/A testing is also an indicator of the variance of the performance measure. If the A/A uplift is significantly non-zero, then we have to redesign the controlled experiment to make it more stable.
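
Reusing the bootstrap uplift sketch from earlier, an A/A check could look like the following; the KPI values are, again, purely illustrative.

```python
# Both groups are served by the control model, so the measured uplift
# should be close to zero and its confidence interval should contain 0.
group_1_control_kpi = [0.60, 0.63, 0.59, 0.62]
group_2_control_kpi = [0.61, 0.58, 0.64, 0.60]

point, (low, high) = uplift(group_1_control_kpi, group_2_control_kpi)
if not (low <= 0.0 <= high):
    print("A/A uplift is significantly non-zero: redesign the group split")
```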

A/B testing is great for measuring the difference in performance between two models, but the model is not the only factor that influences the final performance. If we take into account the damage cost model, which is the core business concern, the model must be accurate at generating a prioritized list of anomalies to investigate, but the analysts must also be good at identifying, confirming, and reacting upon them.

Hence, we have two factors: the model accuracy and the security team effectiveness.

We can divide the controlled experiment into an A/B/C/D test where four independent groups are created, as follows:

 

                                    Base model    Advanced model
No action from security team        Group A       Group B
Intervention from security team     Group C       Group D

We can compute a number of uplift measures that quantify both the model accuracy and the security team effectiveness (a sketch computing them follows this list). In particular:

  • uplift(A,B): The effectiveness of the advanced model alone
  • uplift(D,C): The effectiveness of the advanced model in case of security intervention
  • uplift(D,A): The effectiveness of both advanced model and security intervention joint together
  • uplift(C,A): The effectiveness of the security intervention on the low-accuracy queue
  • uplift(D,B): The effectiveness of the security intervention on the high-accuracy queue
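
As a sketch of how these measures could be reported, assume we have one aggregated KPI per group (for example, the estimated damage avoided per month); the values below are made up for illustration.

```python
# Hypothetical KPI per group (for example, estimated damage avoided per month):
kpi = {"A": 0.40, "B": 0.55, "C": 0.62, "D": 0.78}

def uplift_between(kpi, treated, control):
    """Uplift of the treated group's KPI over the control group's KPI."""
    return kpi[treated] - kpi[control]

measures = {
    "advanced model alone (B vs A)": uplift_between(kpi, "B", "A"),
    "advanced model with intervention (D vs C)": uplift_between(kpi, "D", "C"),
    "advanced model + intervention jointly (D vs A)": uplift_between(kpi, "D", "A"),
    "intervention on the low-accuracy queue (C vs A)": uplift_between(kpi, "C", "A"),
    "intervention on the high-accuracy queue (D vs B)": uplift_between(kpi, "D", "B"),
}
for name, value in measures.items():
    print(f"{name}: {value:+.2f}")
```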

This is just an example of the meaningful experiments and evaluations you may want to carry out in order to quantify, in numbers, what the business really cares about.

Furthermore, there are a number of advanced techniques for A/B testing. Just to name a popular one, the multi-armed bandit algorithm allows you to dynamically adjust the size of the different testing groups in order to adapt to their observed performance and minimize the loss caused by low-performing groups.
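
As a flavor of how a bandit could reallocate traffic, here is a minimal epsilon-greedy sketch (one of the simplest bandit strategies, not necessarily the one you would deploy); the group names and rewards are illustrative.

```python
import random

def epsilon_greedy(rewards_by_group, epsilon=0.1):
    """Pick the next group to receive traffic: with probability epsilon,
    explore a random group; otherwise, exploit the best-performing one.

    rewards_by_group: dict mapping group name to observed rewards, for
    example 1 if an investigated alert was a confirmed anomaly, else 0."""
    if random.random() < epsilon:
        return random.choice(list(rewards_by_group))
    mean = lambda rewards: sum(rewards) / len(rewards) if rewards else 0.0
    return max(rewards_by_group, key=lambda g: mean(rewards_by_group[g]))

# Illustrative usage: decide which group receives the next inspection slot.
observed = {"control": [1, 0, 0, 1], "variation": [1, 1, 0, 1]}
print(epsilon_greedy(observed))
```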

A summary of testing

To summarize, for an anomaly detection system using neural networks and labeled data, we can define the following:

  • Model as the definition of the network topology (number and size of hidden layers), activation functions, pre-processing and post-processing transformations.
  • Model parameters as the weights of hidden units and biases of hidden layers.
  • Fitted model as the model with an estimated value of parameters and able to map samples from the input layer to the output.
  • Learning algorithm (also training algorithm) as SGD or its variants (HOGWILD!, adaptive learning) + the loss function + regularization.
  • Training set, validation set and test set are three disjoint and possibly independent subsets of the available data where we preserve the same distribution.
  • Model validation as the maximum F-measure score over the score thresholds of the ROC curve, computed on the validation set using the model fitted on the training set (a minimal sketch follows this list).
  • Model selection as the best validated model among a set of possible configurations (1 hidden layer vs. 3 hidden layers, 50 neurons vs. 1,000 neurons, Tanh vs. Sigmoid, Z-scaling vs. Min/Max normalization, and so on).
  • Hyper-parameter tuning as the extension of model selection with algorithm and implementation parameters, such as learning parameters (epochs, batch size, learning rate, decay factor, momentum...), distributed implementation parameters (samples per iteration), regularization parameters (lambda in L1 and L2, noise factor, sparsity constraint...), initialization parameters (weights distribution), and so on.
  • Model evaluation, or testing, as the final business metrics and acceptance criteria computed on the test set using the model fitted on the training and validation sets merged together. Some examples are precision and recall for just the top N test samples, time to detection, and so on.
  • A/B testing as the uplift of evaluation performances of a model with respect to a baseline computed on two different, but homogeneous, subsets of the live data population (the control and variation groups).
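
For the model validation step, a minimal sketch of picking the maximum F-measure over all score thresholds could look like the following, using scikit-learn's precision-recall utilities (the detector and validation data are assumed to exist; your own pipeline may compute this differently).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def max_f1(y_true, anomaly_scores):
    """Maximum F1 over all score thresholds on the validation set,
    together with the threshold that achieves it."""
    precision, recall, thresholds = precision_recall_curve(y_true, anomaly_scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = int(np.argmax(f1[:-1]))  # the last point has no associated threshold
    return f1[best], thresholds[best]

# Hypothetical usage with validation labels (1 = anomaly) and detector scores:
# best_f1, best_threshold = max_f1(y_val, model.anomaly_score(X_val))
```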

We hope we have clarified the essential steps to consider when testing a production-ready deep learning intrusion detection system. The exact techniques, metrics, and tuning parameters may not be the same for your use case, but we hope this methodology can serve as a guideline for any data product.

A great resource of guidelines and best practices for building Data Science systems that are both scientifically correct and valuable for the business is the Professional Data Science Manifesto: www.datasciencemanifesto.org. We recommend reading and reasoning about the listed principles.
