Chapter 9. Anomaly Detection

In Chapter 4, Unsupervised Feature Learning, we saw the mechanisms of feature learning and in particular the use of auto-encoders as an unsupervised pre-training step for supervised learning tasks.

In this chapter, we are going to apply similar concepts to a different use case: anomaly detection.

One of the key requirements of a good anomaly detector is finding data representations that make deviations from the normal distribution easy to spot. Deep auto-encoders work very well at learning high-level abstractions and non-linear relationships in the underlying data. We will show why deep learning is a great fit for anomaly detection.

In this chapter, we will start by explaining the differences and commonalities between outlier detection and anomaly detection. The reader will be guided through an imaginary fraud case study, followed by examples showing the danger of having anomalies in real-world applications and the importance of automated and fast detection systems.

Before moving on to the deep learning implementations, we will cover a few families of techniques widely used in traditional machine learning and their current limitations.

We will apply the deep auto-encoder architectures seen in Chapter 4, Unsupervised Feature Learning, to a particular kind of semi-supervised learning known as novelty detection. We will propose two powerful approaches: one based on reconstruction errors and another based on low-dimensional feature compression.

We will introduce H2O, one of the most popular open source frameworks for building simple but scalable feed-forward multi-layer neural networks.

Lastly, we will code a couple of examples of anomaly detection using the Python API of the H2O auto-encoder model.

The first example will reuse the MNIST digit dataset that you saw in Chapter 3, Deep Learning Fundamentals, and Chapter 4, Unsupervised Feature Learning, but this time for detecting badly written digits. A second example will show how to detect anomalous pulsations in electrocardiogram time series.

To summarize, this chapter will cover the following topics:

  • What is anomaly and outlier detection?
  • Real-world applications of anomaly detection
  • Popular shallow machine learning techniques
  • Anomaly detection using deep auto-encoders
  • H2O overview
  • Code examples:
    • MNIST digit anomaly recognition
    • Electrocardiogram pulse detection

What is anomaly and outlier detection?

Anomaly detection, often related to outlier detection and novelty detection, is the identification of items, events, or observations that deviate considerably from an expected pattern observed in a homogeneous dataset.

Anomaly detection is about predicting the unknown.

Whenever we find a discordant observation in the data, we could call it an anomaly or outlier. Although the two words are often used interchangeably, they actually refer to two different concepts, as Ravi Parikh describes in one of his blog posts (https://blog.heapanalytics.com/garbage-in-garbage-out-how-anomalies-can-wreck-your-data/):

"An outlier is a legitimate data point that's far away from the mean or median in a distribution. It may be unusual, like a 9.6-second 100-meter dash, but still within the realm of reality. An anomaly is an illegitimate data point that's generated by a different process than whatever generated the rest of the data."

Let's try to explain the difference using a simple example of fraud detection.

In a log of transactions, we observe that a particular customer spends an average of $10 for their lunch every weekday. Suddenly, one day they spend $120. This is certainly an outlier, but perhaps that day they decided to pay the whole bill with their credit card. If a few of those transactions are orders of magnitude higher than the expected amount, then we could identify an anomaly. An anomaly is when the singular rare-event justification no longer holds, for instance, transactions of $120 or higher over three consecutive orders. In this scenario, we are talking about anomalies because a pattern of repeated and linked outliers has been generated by a different process, possibly credit card fraud, than the usual behavior.

While threshold rules can solve many detection problems, discovering complicated anomalies requires more advanced techniques.

What if a cloned credit card makes a lot of micro-payments of $10 each? A rule-based detector would probably fail.
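To make this concrete, here is a minimal sketch of such a single-dimension threshold rule; the transaction amounts and the threshold are hypothetical, and the cloned-card micro-payments stay below it without triggering a single alert:

# A minimal sketch of a single-dimension threshold rule.
# Amounts and threshold are illustrative, not from a real system.
daily_transactions = [9.5, 10.0, 11.2, 10.0, 9.8, 10.0, 10.5]  # cloned card pays ~$10

THRESHOLD = 100.0  # flag any single transaction above $100

def rule_based_alerts(amounts, threshold=THRESHOLD):
    """Return the transactions exceeding the fixed threshold."""
    return [amount for amount in amounts if amount > threshold]

print(rule_based_alerts(daily_transactions))  # [] -> the fraud goes undetected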

By simply looking at each dimension independently, the anomaly-generating process could still be hidden within the average distribution; a signal in a single dimension would not trigger any alert. Let's see what happens if we add a few extra dimensions to the credit card fraud example: the geo-location, the time of the day in the local time zone, and the day of the week.

Let's analyze the same fraud example in more detail. Our customer is a full-time employee based in Milan, but resident in Rome. Every Monday morning, he takes the train to go to work, and he comes back to Rome on Saturday morning to see his friends and family. He loves cooking at home and only goes out for dinner a few times during the week. In Rome, he lives near his relatives, so he never has to prepare lunch during weekends, but he often enjoys spending the night out with friends. The distributions of his expected behavior would be as follows:

  • Amount: Between $5 and $40
  • Location: Milan 70% and Rome 30%
  • Time of the day: 70% between noon and 2 P.M. and 30% between 9 P.M. and 11 P.M.
  • Day of the week: Uniform over the week

One day, his credit card is cloned. The fraudulent person lives near his workplace and in order not to get caught, they systematically make small payments of $25 every night around 10 P.M. in an accomplice's corner shop.

If we look at the single dimensions, the fraudulent transactions would be just slightly outside the expected distribution, but still acceptable. The effect on the distributions of the amount and the day of the week would stay more or less the same while the location and time of the day would slightly increase toward Milan at evening time.

Even if systematically repeated, a little change in his lifestyle would be a reasonable explanation. The fraudulent activity would soon turn into the newer expected behavior, the normality.

Let's consider the joint distribution instead:

  • 70% of transactions: around $10, in Milan, at lunch time, only on weekdays
  • 30% of transactions: around $30, in Rome, at dinner time, only at weekends

In this scenario, the fraudulent activity would immediately be flagged as an outlier at its first occurrence since transactions in Milan at night above $20 are very rare.
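The following minimal sketch illustrates this idea with a handful of synthetic transactions (the amounts, cities, and hours are invented for this example): each marginal check passes on its own, while a simple joint check against previously seen combinations flags the fraudulent payment immediately:

# A minimal sketch contrasting marginal checks with a joint check.
# The historical transactions are synthetic and purely illustrative.
history = [
    # (amount, city, hour): weekday lunches in Milan, weekend dinners in Rome
    (10, "Milan", 13), (11, "Milan", 12), (9, "Milan", 13),
    (28, "Rome", 21), (32, "Rome", 22), (30, "Rome", 21),
]
suspect = (25, "Milan", 22)  # cloned-card payment: $25, in Milan, at night

def marginal_ok(tx, history):
    """Each dimension, taken alone, falls inside the observed range."""
    amounts, cities, hours = zip(*history)
    amount, city, hour = tx
    return (min(amounts) <= amount <= max(amounts)
            and city in cities
            and min(hours) <= hour <= max(hours))

def joint_ok(tx, history, tol=5):
    """The full combination must be close to a previously seen transaction."""
    amount, city, hour = tx
    return any(city == c and abs(amount - a) <= tol and abs(hour - h) <= 1
               for a, c, h in history)

print(marginal_ok(suspect, history))  # True  -> each marginal looks normal
print(joint_ok(suspect, history))     # False -> the joint combination is an outlier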

Given the preceding example, we might think that considering more dimensions together makes our anomaly detection smarter. Just like any other machine learning algorithm, you need to find a trade-off between complexity and generalization.

Having too many dimensions would project all of the observations into a space where every point is roughly equidistant from every other. As a consequence, everything would be an "outlier", which, in the way we defined an outlier, intrinsically makes the whole dataset "normal". In other words, if every point looks just the same, then you can't distinguish between the two cases. Having too few dimensions, on the other hand, would not allow the model to spot an outlier in the haystack and may let it hide in the mass distribution for longer, or maybe forever.
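A quick numerical sketch of this effect (random uniform data, NumPy only, purely illustrative) shows how the spread of pairwise distances shrinks relative to their mean as the number of dimensions grows, which is exactly what makes every point look equally far from every other:

# A rough numerical sketch of distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
n_points = 100
for dims in (2, 10, 100, 1000):
    points = rng.random((n_points, dims))
    # all pairwise Euclidean distances between the points
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)[np.triu_indices(n_points, k=1)]
    # the relative spread (std / mean) shrinks as dimensionality grows
    print(dims, round(float(dists.std() / dists.mean()), 3))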

Nevertheless, identifying outliers alone is not enough. Outliers can happen due to rare events, errors in data collection, or noise. Data is always dirty and full of inconsistencies. The first rule is "never assume your data is clean and correct". Finding outliers is just a standard routine. What would be surprising, instead, is finding unexplainable behaviors that repeat themselves consistently:

"Data scientists realize that their best days coincide with discovery of truly odd features in the data."

Haystacks and Needles: Anomaly Detection, by Gerhard Pilcher and Kenny Darrell, Data Mining Analyst, Elder Research, Inc.

The persistence of a given outlier pattern is the signal that something has changed in the system we are monitoring. The real anomaly detection happens when observing systematic deviations in the underlying data generation process.

This also has an implication for the data preprocessing step. Contrary to what you would do for many machine learning problems, in anomaly detection you can't just filter out all of the outliers! Nevertheless, you should be careful to distinguish between the different natures of those outliers. You do want to filter out wrong data entries, remove the noise, and normalize what remains. Ultimately, you want to detect novelties in your cleaned dataset.
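As a rough illustration of this distinction, the following sketch (hypothetical column names and validity rules) removes obviously invalid records and normalizes the rest, while deliberately keeping legitimate extreme values for the detector to judge:

# A minimal preprocessing sketch: drop invalid entries, normalize,
# but keep legitimate extreme values for the anomaly detector to judge.
# Column names and validity rules are hypothetical.
import pandas as pd

def preprocess(transactions: pd.DataFrame) -> pd.DataFrame:
    # 1. Filter out clearly wrong data entries (not outliers!):
    #    missing or negative amounts are data-collection errors.
    clean = transactions.dropna(subset=["amount"])
    clean = clean[clean["amount"] > 0].copy()

    # 2. Normalize the remaining values (z-score) without clipping,
    #    so genuine extremes stay visible to the downstream detector.
    clean["amount_scaled"] = (clean["amount"] - clean["amount"].mean()) / clean["amount"].std()
    return clean

df = pd.DataFrame({"amount": [10.0, 12.0, None, -3.0, 120.0]})
print(preprocess(df))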
