From a business point of view, what really matters is the final end-to-end performance. None of your stakeholders will be interested in your training error, parameter tuning, or model selection. What matters are the KPIs computed on top of the final model. Evaluation can be seen as the ultimate verdict.
Also, as we anticipated, evaluating a product cannot be done with a single metric. Generally, it is good and effective practice to build an internal dashboard that can report, or measure in real time, a set of performance indicators for our product in the form of aggregated numbers or easy-to-interpret visualization charts. At a single glance, we would like to understand the whole picture and translate it into the value we are generating for the business.
The evaluation phase can, and generally does, include the same methodology as model validation. We have seen in previous sections a few techniques for validating with labeled and unlabeled data. Those can be the starting points.
In addition to those, we ought to include a few specific test scenarios. For instance:
We would like to summarize all of this information within a single dashboard from which we can make statements such as: our anomaly detector is able to detect previously seen anomalies with a precision of 76% (±5%) and an average reaction time of 10 seconds, and novel anomalies with a precision of 68% (±15%) and a reaction time of 14 seconds. We observed an average of 10 anomalies per day. Considering the capacity of 1,000 inspections per day, we can fit 80% of the most relevant detections, corresponding to 6 anomalies, within just the top 120 elements of the queue. Of these, only the 2 out of 10 that compromise the system are included in this list. We can then divide the inspections into two tiers; the first tier will respond immediately to the top 120 elements, and the second tier will take care of the tail. Under the currently simulated failure scenarios, we are protected in 90% of them. Total savings since last year amount to 1.2 million dollars.
So far, we have only considered evaluation based on past historical data (retrospective analysis) and/or on simulations with synthetic datasets. The latter assumes that a particular failure scenario will happen in the future. Evaluating only on historical data assumes that the system will always behave under those conditions and that the current data distribution also describes the stream of future data. Moreover, any KPI or performance metric should be evaluated relative to a baseline. The product owner wants to justify the investment in the project. What if the same problem could have been solved in a much cheaper way?
For this reason, the only truth for evaluating any machine learning system is A/B testing. A/B testing is a statistical hypothesis test with two variants (the control and the variation) in a controlled experiment. The goal of A/B testing is to identify performance differences between the two groups. It is a technique widely used in user experience design for websites and in advertising and/or marketing campaigns. In the case of anomaly detection, we can use a baseline (the simplest rule-based detector) as the control version and the currently selected model as the variation candidate.
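For a precision-style KPI, the control-versus-variation comparison can be framed as a two-proportion z-test. The following sketch is illustrative only: the function name and the inspection counts are invented, not taken from the text.

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical results: control (rule-based detector) confirms 380 of 500
# inspections; variation (selected model) confirms 430 of 500.
z, p = two_proportion_ztest(380, 500, 430, 500)
```

A small p-value would indicate that the observed difference between the two groups is unlikely to be due to chance alone.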
The next step is to find a meaningful evaluation that quantifies the return on investment.
"We have to find a way of making the important measurable, instead of making the measurable important."
Robert McNamara, former US Secretary of Defense
The return on investment will be represented by the uplift, defined as:

uplift = KPI(variation) − KPI(control)

It is the difference between the two KPIs that quantifies the effectiveness of the treatment.
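As a minimal sketch, the uplift can be computed directly from the two KPIs; the precision values below are invented for illustration.

```python
def uplift(kpi_variation, kpi_control):
    """Absolute uplift: difference between the variation and control KPIs."""
    return kpi_variation - kpi_control

def relative_uplift(kpi_variation, kpi_control):
    """Uplift expressed as a fraction of the control KPI."""
    return (kpi_variation - kpi_control) / kpi_control

# Hypothetical precision KPIs for the two detectors.
print(uplift(0.86, 0.76))           # absolute improvement
print(relative_uplift(0.86, 0.76))  # improvement relative to the baseline
```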
In order to make the comparison fair, we must ensure that the two groups share the same population distribution. We want to remove any bias introduced by the choice of individuals (data samples). In the case of the anomaly detector, we could, in principle, feed the same stream of data to both models. This is not recommended, though. By applying one model, you can influence the behavior of a given process. A typical example is an intruder who is first detected by one model; the system would react by dropping his open connections. A smart intruder would realize that he has been discovered and would not attempt to connect again. In that case, the second model may never observe a given expected pattern because of the influence of the first model.
By separating the two models over two disjoint subsets of data, we make sure the two models cannot influence each other. Moreover, if our use case requires anomalies to be further investigated by our analysts, then the investigations cannot be duplicated across groups.
Here, we must split according to the same criteria we saw for data validation: no data leakage and entity sub-sampling. The final test that can confirm whether the two groups are actually identically distributed is A/A testing.
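One common way to implement an entity-level split without leakage is to hash the entity identifier deterministically, so that every event from the same entity always lands in the same group. The function name, the salt value, and the host IDs below are assumptions made for the sake of the example.

```python
import hashlib

def assign_group(entity_id, salt="ab-test-salt"):
    """Deterministically assign an entity (e.g. a host or user) to a group.

    Hashing the entity ID keeps all events from the same entity in the
    same group, which avoids leakage between control and variation.
    """
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode()).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "variation"

# The same entity always receives the same assignment.
assert assign_group("host-01") == assign_group("host-01")
```

Changing the salt reshuffles the assignment, which is handy for running independent experiments over the same population.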
As the name suggests, A/A testing consists of re-using the control version on both groups. We expect the performance to be very similar, equivalent to an uplift close to 0. It is also an indicator of the performance variance. If the A/A uplift is significantly non-zero, then we have to redesign the controlled experiment to be more stable.
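A small simulation can illustrate what a healthy A/A test looks like; the outcome rate of 0.76 and the sample sizes are invented for the example.

```python
import random
import statistics

random.seed(0)

# Simulated per-inspection outcomes (1 = confirmed detection) produced by
# ONE detector, i.e. the control version.
outcomes = [1 if random.random() < 0.76 else 0 for _ in range(10_000)]

# A/A test: randomly split the traffic into two groups that are both
# served by the SAME model.
random.shuffle(outcomes)
group_1, group_2 = outcomes[:5_000], outcomes[5_000:]

aa_uplift = statistics.mean(group_2) - statistics.mean(group_1)
# With identical treatment, the uplift should be close to 0; its spread
# across repeated runs gives a feel for the experiment's natural variance.
```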
A/B testing is great for measuring the difference in performance between the two models, but the model is not the only factor that influences the final performance. If we take into account the damage cost model, which is the core of the business, the model must be accurate at generating a prioritized list of anomalies to investigate, but the analysts must also be good at identifying, confirming, and reacting to them.
Hence, we have two factors: the model accuracy and the security team effectiveness.
We can divide the controlled experiment into an A/B/C/D test where four independent groups are created, as follows:
| | Base model | Advanced model |
| --- | --- | --- |
| No action from security team | Group A | Group B |
| Intervention from security team | Group C | Group D |
We can compute a bunch of uplift measures that quantify both the model accuracy and security team effectiveness. In particular:
- uplift(B, A): The effectiveness of the advanced model alone
- uplift(D, C): The effectiveness of the advanced model in case of security intervention
- uplift(D, A): The effectiveness of the advanced model and the security intervention joined together
- uplift(C, A): The effectiveness of the security intervention on the low-accuracy queue
- uplift(D, B): The effectiveness of the security intervention on the high-accuracy queue

This is just an example of the meaningful experiments and evaluations you may want to carry out in order to quantify in numbers what the business really cares about.
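Given per-group KPIs, the five measures reduce to simple differences, with uplift(X, Y) read as the gain of group X over group Y. The KPI values below are entirely hypothetical.

```python
# Hypothetical per-group KPIs (e.g. damage cost avoided, in $/day) for the
# four groups of the A/B/C/D experiment.
kpi = {"A": 100.0, "B": 130.0, "C": 160.0, "D": 210.0}

def uplift(treated, control):
    """uplift(X, Y) = KPI(X) - KPI(Y): gain of group X over group Y."""
    return kpi[treated] - kpi[control]

print(uplift("B", "A"))  # advanced model alone
print(uplift("D", "C"))  # advanced model under security intervention
print(uplift("D", "A"))  # model and intervention jointly
print(uplift("C", "A"))  # intervention on the low-accuracy queue
print(uplift("D", "B"))  # intervention on the high-accuracy queue
```

Note that the joint effect decomposes: uplift(D, A) = uplift(D, B) + uplift(B, A), which is a useful sanity check on the dashboard numbers.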
Furthermore, there are several advanced techniques for A/B testing. To name a popular one, the multi-armed bandit algorithm allows you to dynamically adjust the size of the different testing groups so as to adapt to their measured performance and minimize the loss due to the low-performing groups.
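One of the simplest members of the bandit family is the epsilon-greedy strategy: mostly route traffic to the best-looking variant, but keep exploring with a small probability. The true success rates and all numbers below are hypothetical; this is a sketch of the idea, not a production allocation policy.

```python
import random

random.seed(42)

# Hypothetical true success rates of the two variants (unknown to the bandit).
true_rates = {"control": 0.76, "variation": 0.86}

counts = {arm: 0 for arm in true_rates}    # pulls per arm
values = {arm: 0.0 for arm in true_rates}  # running mean reward per arm

epsilon = 0.1  # exploration probability
for _ in range(5_000):
    # Explore with probability epsilon, otherwise exploit the best arm so far.
    if random.random() < epsilon:
        arm = random.choice(list(true_rates))
    else:
        arm = max(values, key=values.get)
    reward = 1 if random.random() < true_rates[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
```

Over time, the allocation shifts toward the better-performing variant, which reduces the cost of keeping a weak group in the experiment.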
To summarize, for an anomaly detection system using neural networks and labeled data, we can define the following:
We hope that we've clarified the essential and most important steps to consider when testing a production-ready deep learning intrusion detection system. These techniques, metrics, or tuning parameters may not be the same for your use case, but we hope this thoughtful methodology can serve as a guideline for any data product.
A great resource of guidelines and best practices for building data science systems that are both scientifically correct and valuable for the business is the Professional Data Science Manifesto (www.datasciencemanifesto.org). Reading it and reasoning about the listed principles is recommended.