Chapter 10. Building a Production-Ready Intrusion Detection System

In the previous chapter, we explained in detail what anomaly detection is and how it can be implemented using auto-encoders. We proposed a semi-supervised approach to novelty detection. We introduced H2O and showed a couple of examples (MNIST digit recognition and ECG pulse signals) implemented on top of the framework and running in local mode. Those examples used small datasets that had already been cleaned and prepared to serve as proofs of concept.

Real-world data and enterprise environments work very differently. In this chapter, we will leverage H2O and common industry practices to build a scalable distributed system ready for deployment in production.

Our working example will be an intrusion detection system, whose goal is to detect intrusions and attacks in a network environment.

We will raise a few practical and technical issues that you would probably face when building a data product for intrusion detection.

In particular, you will learn:

  • What a data product is
  • How to better initialize the weights of a deep network
  • How to parallelize Stochastic Gradient Descent across multiple threads with HOGWILD!
  • How to distribute computation with Map/Reduce on top of Apache Spark using Sparkling Water
  • A few rules of thumb for tweaking scalability and implementation parameters
  • A comprehensive list of techniques for adaptive learning
  • How to validate a model both in the presence and in the absence of ground truth
  • How to pick the right trade-off between precision and a reduced false alarm rate
  • An example of an exhaustive evaluation framework considering both technical and business aspects
  • A summary of model hyper-parameters and tuning techniques
  • How to export your trained model as a POJO and deploy it in an anomaly detection API

What is a data product?

The final goal in data science is to solve problems by adopting data-intensive solutions. The focus is not only on answering questions but also on satisfying business requirements.

Just building data-driven solutions is not enough. Nowadays, any app or website is powered by data. Building a web platform for listing items on sale does consume data but is not necessarily a data product.

Mike Loukides gives an excellent definition:

A data application acquires its value from the data itself, and creates more data as a result; it's not just an application with data; it's a data product. Data science enables the creation of data products.

From "What is Data Science" (https://www.oreilly.com/ideas/what-is-data-science)

The fundamental requirement is that the system is able to derive value from data, not merely consume it as-is, and generate knowledge (in the form of data or insights) as output. A data product is the automation that lets you extract information from raw data, build knowledge, and consume it effectively to solve a specific problem.

The two examples in the anomaly detection chapter show precisely what a data product is not. We opened a notebook, loaded a snapshot of data, started analyzing and experimenting with deep learning, and ultimately produced some plots showing that auto-encoders can detect anomalies. Although the whole analysis is reproducible, in the best case we built a proof of concept or a toy model. Would this be suitable for solving a real-world problem? Is this a Minimum Viable Product (MVP) for your business? Probably not.

Machine learning, statistics, and data analysis techniques are not new. The origins of mathematical statistics date back to the 17th century; machine learning is a subset of Artificial Intelligence (AI), a field whose central question Alan Turing famously framed with his Turing Test in 1950. You might argue that the data revolution started with the increase in data collection and advances in technology. I would say those are what enabled the data revolution to happen smoothly. The real shift probably happened when companies started realizing they could create new products, offer better services, and significantly improve their decision-making by trusting their data. Nevertheless, the innovation is not in manually looking for answers in data; it is in integrating streams of information generated by data-driven systems that can extract and provide insights able to drive human actions.

A data product is the result of the intersection of science and technology: artificial intelligence that can scale and make unbiased decisions on our behalf.

Because a data product grows and gets better by consuming more data, and because it generates data itself, this generative effect could theoretically establish an infinite stream of information. For this reason, a data product must also be self-adapting and able to incrementally incorporate new knowledge as new observations are collected. A statistical model is just one component of the final data product. For instance, in an intrusion detection system, the observations inspected during anomaly triage can be fed back as labeled data and re-used for training the model in subsequent generations.
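A minimal sketch of such a feedback loop, assuming an H2O auto-encoder like the one from the previous chapter, might look as follows. The helpers collect_batch() and ask_analysts() are hypothetical placeholders for your ingestion and labeling steps, and the 0.02 threshold is an arbitrary example value:

    import h2o
    from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

    h2o.init()

    def retrain(frame, features):
        # train a fresh auto-encoder generation on the accumulated data
        model = H2OAutoEncoderEstimator(activation="Tanh",
                                        hidden=[20, 8, 20], epochs=10)
        model.train(x=features, training_frame=frame)
        return model

    training = collect_batch()                      # hypothetical: initial data
    features = training.col_names
    model = retrain(training, features)

    while True:
        batch = collect_batch()                     # hypothetical: new events
        scored = batch.cbind(model.anomaly(batch))  # per-row reconstruction error
        flagged = scored[scored["Reconstruction.MSE"] > 0.02, :]
        confirmed = ask_analysts(flagged)           # hypothetical: analyst labels
        # fold the inspected observations back into the training set
        training = training.rbind(confirmed.drop("Reconstruction.MSE"))
        model = retrain(training, features)

The important property is the closed loop: each model generation is trained on a dataset enriched by the previous generation's inspected detections.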

Nevertheless, data analytics remains extremely important in every organization, and it is quite common to find hybrid teams of data scientists and analysts. Manual supervision, inspection, and visualization of intermediate results are essential for building successful solutions. What we aim to remove is manual intervention in the final product. In other words, the development stage involves a lot of exploratory analysis and manual checkpoints, but the final deliverable is generally an end-to-end pipeline (or a set of independent micro-services) that receives data as input and produces data as output. The whole workflow should preferably be automated, tested, and scalable. Ideally, we would like real-time predictions integrated within the enterprise system so that it can react upon each detection.
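To make this concrete, here is a minimal sketch of such an end-to-end pipeline; every stage body below is a hypothetical placeholder, and the only point is that data flows in and data flows out with no manual checkpoint in between:

    from functools import reduce

    def ingest(source):
        # hypothetical: read raw events from a log stream or message queue
        return list(source)

    def preprocess(events):
        # hypothetical: parsing, cleaning, feature extraction
        return [e.strip().lower() for e in events]

    def score(events):
        # hypothetical: model inference attaching an anomaly score to each event
        return [(e, min(len(e) / 40.0, 1.0)) for e in events]

    def alert(scored, threshold=0.5):
        # hypothetical: alerting policy consumed by downstream systems
        return [e for e, s in scored if s > threshold]

    def pipeline(source):
        # compose the stages into a single automated data-in/data-out flow
        stages = [ingest, preprocess, score, alert]
        return reduce(lambda data, stage: stage(data), stages, source)

Each stage can be developed and tested in isolation (or deployed as an independent micro-service), while the composed pipeline stays fully automated.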

An example could be a large screen in a factory showing a live dashboard with real-time measurements coming from the active machines and firing alerts whenever something goes wrong. This data product would not fix the machine for you but would be a support tool for human intervention.

Human interaction should generally be limited to:

  • Providing domain expertise, for example by setting priors based on experience
  • Developing and testing the system
  • Consuming the final product

In our intrusion detection system, we will use the data to recommend actions to a team of security analysts so that they can prioritize and make better decisions.
