Outlier detection in MNIST

All right, so admittedly our previous application has had nothing to do with fraud or outlier detection so far. With a small modification to that setup, however, we can show how a similar framework works. Let's treat the digit 7 as an outlier class and try to identify it among the rest of our normal digits: 0, 1, 2, 3, 4, 5, 6, 8, 9.

We will train the autoencoder on the normal data only and then apply it to the test set. The aim is to capture as many features of the normal situation as possible. This requires knowledge of what "normal" looks like, which translates into the availability of labeled data. It is therefore an idealized scenario: in many practical applications, such as credit card fraud or intrusion detection, we often (or rather, usually) lack such labeled data.

We begin as before:

library(keras)
mnist <- dataset_mnist()
X_train <- mnist$train$x
y_train <- mnist$train$y
X_test <- mnist$test$x
y_test <- mnist$test$y

But now we will exclude 7 from the training set, as it will be the outlier in our example.

## Exclude "7" from the training set; "7" will be the outlier
normal_idxs <- which(y_train != 7)
X_train <- X_train[normal_idxs, , ]
y_test <- sapply(y_test, function(x) ifelse(x == 7, "outlier", "normal"))
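To see what the filtering and relabeling steps do, here is a quick sanity check on a toy label vector (`y_toy` is a stand-in for the full MNIST label arrays):

```r
# Toy sanity check of the filtering and relabeling logic
y_toy <- c(0, 7, 3, 7, 9)
normal_idxs <- which(y_toy != 7)   # indices of the non-7 digits: 1, 3, 5
labels <- sapply(y_toy, function(x) ifelse(x == 7, "outlier", "normal"))
# labels: "normal" "outlier" "normal" "outlier" "normal"
```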

We continue as before, with re-scaling and reshaping before defining our autoencoder:

# reshape
dim(X_train) <- c(nrow(X_train), 784)
dim(X_test) <- c(nrow(X_test), 784)
# rescale
X_train <- X_train / 255
X_test <- X_test / 255
input_dim <- 28*28 #784
inner_layer_dim <- 32
# Create the autoencoder
input_layer <- layer_input(shape=c(input_dim))
encoder <- layer_dense(units=inner_layer_dim, activation='relu')(input_layer)
decoder <- layer_dense(units=784)(encoder)
autoencoder <- keras_model(inputs=input_layer, outputs = decoder)
autoencoder %>% compile(
  optimizer = 'adam',
  loss = 'mean_squared_error',
  metrics = c('accuracy')
)
history <- autoencoder %>% fit(
  X_train, X_train,
  epochs = 50, batch_size = 256,
  validation_split = 0.2
)
plot(history)

Once the autoencoder is trained, we can start looking at the performance, using the reconstruction of the test set:

# Reconstruct on the test set
preds <- autoencoder %>% predict(X_test)
error <- rowSums((preds - X_test)^2)
eval <- data.frame(error = error, class = as.factor(y_test))
library(ggplot2)
library(dplyr)
eval %>%
  ggplot(aes(x = class, fill = class, y = error)) +
  geom_boxplot()
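The per-image reconstruction error computed above is simply the sum of squared pixel differences between the reconstruction and the original. A minimal illustration with toy four-pixel "images" (the matrices here are made up for demonstration):

```r
# Two toy "images" of four pixels each, reconstructed vs. actual
preds  <- matrix(c(0.1, 0.2, 0.9, 0.8,
                   0.0, 0.0, 1.0, 1.0), nrow = 2, byrow = TRUE)
actual <- matrix(c(0.0, 0.0, 1.0, 1.0,
                   0.0, 0.0, 1.0, 1.0), nrow = 2, byrow = TRUE)
error <- rowSums((preds - actual)^2)
# error[1] = 0.01 + 0.04 + 0.01 + 0.04 = 0.10; error[2] = 0 (perfect reconstruction)
```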

Let's look at the reconstruction error in our different classes:

[Figure: Distribution of reconstruction error in the test set]

From the plot, we see that we can set the threshold at 15; that is, observations with a reconstruction error above 15 will be flagged as outliers:

threshold <- 15
y_preds <- sapply(error, function(x) ifelse(x > threshold, "outlier", "normal"))
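Incidentally, since `ifelse()` is already vectorized in R, the `sapply()` wrapper is not strictly needed; an equivalent one-liner (shown here on made-up toy error values):

```r
# ifelse() is vectorized, so it can label the whole error vector at once
error <- c(3.2, 18.5, 14.9, 40.1)   # toy reconstruction errors
threshold <- 15
y_preds <- ifelse(error > threshold, "outlier", "normal")
# "normal" "outlier" "normal" "outlier"
```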

Once this is done, we can calculate the confusion matrix. This is a useful way of visualizing what the model is doing:

# Confusion matrix
table(y_preds,y_test)

This gives us the following:

          y_test
y_preds    normal outlier
  normal     5707     496
  outlier    3265     532

So clearly we could do better. Perhaps the vertical stroke shared by digits 1 and 7 contributes to the large number of normal observations flagged as outliers. Still, we caught a bit over 50% of the outlier cases with this simple architecture. One way to improve this would be to add more hidden layers, a trick we will use later in this chapter.
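The counts in the confusion matrix above let us quantify this trade-off directly:

```r
# Metrics derived from the confusion matrix counts reported above
tp <- 532    # outliers (7s) correctly flagged
fn <- 496    # outliers missed
fp <- 3265   # normal digits wrongly flagged
tn <- 5707   # normal digits correctly passed

recall    <- tp / (tp + fn)   # fraction of 7s caught, ~0.52
fpr       <- fp / (fp + tn)   # fraction of normals flagged, ~0.36
precision <- tp / (tp + fp)   # fraction of flags that are real 7s, ~0.14
```

The high false positive rate confirms that the threshold of 15 is aggressive: we pay for every second 7 caught with more than a third of the normal digits misclassified.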
