Anomaly detection using deep auto-encoders

The proposed deep learning approach is semi-supervised; it can be broadly described in the following three steps:

  1. Identify a set of data that represents the normal distribution. In this context, the word "normal" refers to a set of points that we are confident mostly represents non-anomalous entities; it is not to be confused with the Gaussian normal distribution.

    The identification is generally historical: we select a period in which we know that no anomalies were officially recognized. This is why the approach is not purely unsupervised; it relies on the assumption that the majority of observations are anomaly-free. We can use external information (even labels, if available) to improve the quality of the selected subset.

  2. Learn what "normal" means from this training dataset. The trained model provides a metric in its mathematical definition; that is, a function mapping every point to a real number representing its distance from the normal distribution.

  3. Detect anomalies based on a threshold on the anomaly score. By selecting the right threshold, we can achieve the desired trade-off between precision (fewer false alarms) and recall (fewer missed detections), as shown in the sketch after this list.
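As a minimal sketch of step 3, assuming we already have an array of anomaly scores from the trained model and, for validation only, ground-truth labels, the thresholding and the precision/recall trade-off could look like the following (the example arrays and the threshold value are illustrative assumptions):

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    # Hypothetical anomaly scores produced by the trained model (step 2)
    # and ground-truth labels used only for validation (1 = anomaly).
    scores = np.array([0.02, 0.05, 0.03, 0.90, 0.04, 0.75, 0.06])
    labels = np.array([0,    0,    0,    1,    0,    1,    0])

    # Step 3: flag every point whose score exceeds the chosen threshold.
    threshold = 0.5
    predictions = (scores > threshold).astype(int)

    # Raising the threshold increases precision (fewer false alarms);
    # lowering it increases recall (fewer missed detections).
    print("precision:", precision_score(labels, predictions))
    print("recall:   ", recall_score(labels, predictions))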

One of the pros of this approach is robustness to noise. We can accept a small portion of outliers in the normal data used for training, since the model will try to generalize the main distribution of the population rather than fit single observations. This property gives us an enormous advantage in terms of generalization with respect to the supervised approach, which is limited to what has already been observed in the past.

Moreover, this approach can be extended to labeled data as well, making it suitable for every class of anomaly detection problems. Since the label information is not taken into account in the modeling, we can discard it from the feature space and consider everything to be under the same label. Labels can still be used as ground truth during the validation phase. We could then treat the anomaly score as a binary classification score and use the ROC curve, and related measures, as benchmarks.
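If labels are available, validating the anomaly score against them could look like this minimal sketch, where we treat the score as a binary classification score (the score and label arrays below are illustrative placeholders):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # Labels are ignored during training but reused here as ground truth.
    scores = np.array([0.1, 0.2, 0.15, 0.8, 0.3, 0.95, 0.25])
    labels = np.array([0,   0,   0,    1,   0,   1,    0])

    # ROC curve and AUC computed directly from the anomaly scores.
    fpr, tpr, thresholds = roc_curve(labels, scores)
    print("AUC:", roc_auc_score(labels, scores))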

For our use cases, we will make use of the auto-encoder architecture to learn the distribution of the training data. As we have seen in Chapter 4, Unsupervised Feature Learning, the network is designed to have arbitrary, but symmetric, hidden layers, with the same number of neurons in both the input and output layers. The whole topology has to be symmetric in the sense that the encoding topology on the left side is mirrored by the decoding part on the right, and both share the same number of hidden units and the same activation functions:

Figure: Auto-encoder simple representation from the H2O training book (https://github.com/h2oai/h2o-training-book/blob/master/hands-on_training/images/autoencoder.png)

The loss function generally used is the MSE (mean squared error) between the input and the corresponding neurons in the output layer. This way, the network is forced to approximate an identity function via a non-linear and compressed representation of the original data.
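As a concrete illustration of such a symmetric topology trained with an MSE loss, here is a minimal sketch using Keras; the layer sizes, the tanh activation, and the 30-dimensional input are arbitrary assumptions, not values taken from this chapter's examples:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    input_dim = 30  # assumed number of input features

    # Encoder and decoder mirror each other: 30 -> 16 -> 8 -> 16 -> 30,
    # with the same activations on both sides of the central layer.
    autoencoder = Sequential([
        Dense(16, activation="tanh", input_shape=(input_dim,)),
        Dense(8,  activation="tanh"),   # compressed central layer
        Dense(16, activation="tanh"),
        Dense(input_dim, activation="linear"),
    ])

    # MSE between the input and the reconstructed output forces the network
    # to approximate the identity function through the bottleneck.
    autoencoder.compile(optimizer="adam", loss="mse")
    # autoencoder.fit(X_normal, X_normal, epochs=50, batch_size=32)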

Deep auto-encoders are also frequently used as a pre-training step for supervised learning models and for dimensionality reduction. In fact, the central layer of the auto-encoder could be used to represent the points in a reduced dimensionality, as we will see in the last example.

We can then start our analysis from the fully reconstructed representation, that is, the result of encoding and decoding in cascade. A perfect identity auto-encoder would reconstruct exactly the same values as the original point, which would not be very useful. In practice, auto-encoders reconstruct based on intermediate representations that minimize the training error. Thus, we learn those compression functions from the training set so that a normal point is very likely to be reconstructed correctly, while an outlier will have a higher reconstruction error (the mean squared error between the original point and the reconstructed one).

We can then use the reconstruction error as an anomaly score.
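Assuming the (trained) autoencoder from the previous sketch and a NumPy matrix X of points to score, computing the per-point reconstruction error as an anomaly score could look like this (the placeholder data and the 95th-percentile cut-off are assumptions for illustration):

    import numpy as np

    X = np.random.rand(100, 30).astype("float32")  # placeholder data

    # Reconstruct every point through the encoder/decoder cascade.
    reconstructed = autoencoder.predict(X)

    # Per-point mean squared error between original and reconstruction:
    # normal points should score low, outliers higher.
    anomaly_score = np.mean((X - reconstructed) ** 2, axis=1)

    # Flag, for instance, the points above the 95th percentile of the scores.
    threshold = np.percentile(anomaly_score, 95)
    outliers = np.where(anomaly_score > threshold)[0]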

Alternatively, we can use the trick of making the middle layer of the network small enough that every point is transformed into a low-dimensional compressed representation. If we set it equal to two or three, we can even visualize the points. Hence, we can use auto-encoders to reduce the dimensionality, followed by standard machine learning techniques for detection.
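A minimal sketch of this second option, reusing the autoencoder defined earlier (its central layer has eight units here; you would set it to two or three if you wanted to plot the points directly), and feeding the compressed representation to a standard detector such as scikit-learn's IsolationForest:

    from tensorflow.keras.models import Model
    from sklearn.ensemble import IsolationForest

    # Build a model that stops at the central (bottleneck) layer of the
    # autoencoder defined earlier; here it is the second Dense layer.
    encoder = Model(inputs=autoencoder.input,
                    outputs=autoencoder.layers[1].output)

    # Compressed representation of every point.
    compressed = encoder.predict(X)

    # Standard detection technique on the reduced space; the contamination
    # value is an assumed fraction of expected outliers.
    detector = IsolationForest(contamination=0.05).fit(compressed)
    flags = detector.predict(compressed)   # -1 marks suspected outliers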
