Autoencoder on the matrix representation

Once the text is in matrix form, we can continue training the autoencoder as in the previous sections.
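For reference, the matrix representation from the previous section could have been built along the following lines. This is only a sketch using the tm package; the data frame emails, with text and responsive columns, is a hypothetical stand-in for whatever objects were created earlier:

library(tm)
# Hypothetical recap: build a document-term matrix from the email bodies
corpus <- VCorpus(VectorSource(emails$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.97)   # keep only reasonably frequent terms
df <- as.data.frame(as.matrix(dtm))
df$responsive <- emails$responsive    # reattach the label column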

Note that our autoencoder will see only non-responsive emails during training. This turns out to be quite helpful in this dataset, which has only a few hundred samples.
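One way such a split could be arranged is sketched below; df is the hypothetical labeled document-term matrix from the sketch above, and the proportions are illustrative:

set.seed(1234)
# Keep only non-responsive emails for training; all responsive emails go to the test set
normal <- df[df$responsive == 0, ]
abnormal <- df[df$responsive == 1, ]
idx <- sample(nrow(normal), size = floor(0.75 * nrow(normal)))
train <- normal[idx, ]
test <- rbind(normal[-idx, ], abnormal)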

Once this is done, we split our training and testing sets into X and y components as before:

X_train <- subset(train,select=-responsive)
y_train <- train$responsive
X_test <- subset(test,select=-responsive)
y_test <- test$responsive

Now we are ready to define our autoencoder. We will use a single inner layer of size 32:

library(keras)
input_dim <- ncol(X_train)
inner_layer_dim <- 32
input_layer <- layer_input(shape=c(input_dim))
encoder <- layer_dense(units=inner_layer_dim, activation='relu')(input_layer)
decoder <- layer_dense(units=input_dim)(encoder)
autoencoder <- keras_model(inputs=input_layer, outputs = decoder)
autoencoder %>% compile(optimizer = 'adam',
                        loss = 'mean_squared_error',
                        metrics = c('accuracy'))
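
We can optionally inspect the resulting architecture with the standard summary() function:

summary(autoencoder)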

Then, for the training:

X_train <- as.matrix(X_train)
X_test <- as.matrix(X_test)
history <- autoencoder %>% fit(
  X_train, X_train,
  epochs = 100, batch_size = 32,
  validation_data = list(X_test, X_test)
)
plot(history)
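
If we want a single number summarizing the reconstruction quality on the held-out data, we can also call Keras' evaluate() function; the loss reported is the mean squared reconstruction error:

# Overall reconstruction loss (and accuracy metric) on the test set
autoencoder %>% evaluate(X_test, X_test)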

We reconstruct the test set and examine the distribution of the reconstruction errors across both classes:

# Reconstruct the test set and compute the squared reconstruction error per email
preds <- autoencoder %>% predict(X_test)
error <- rowSums((preds - X_test)**2)
library(tidyverse)
# Collect the errors and the true labels in a data frame for plotting
eval <- data.frame(error = error, class = as.factor(y_test))
eval %>%
  filter(error < 1000) %>%
  ggplot(aes(x = error, color = class)) + geom_density()

As usual, let's take a look at the distribution of error per class, this time with a density plot:

Distribution of the reconstruction error per class.
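
Besides the visual check, a quick numeric summary of the error per class can help when choosing a threshold. A small sketch using dplyr, which is already loaded as part of the tidyverse:

# Median and 90th percentile of the reconstruction error per class
eval %>%
  group_by(class) %>%
  summarise(median_error = median(error), q90_error = quantile(error, 0.9))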

Note that we filtered on the reconstruction error. As before, this helps us focus on the scale where the majority of the observations lie. Our goal is to set a threshold on the reconstruction error above which an email is flagged as an outlier (which in this context means that it is not an ordinary email communication). Visually, 100 seems a reasonable threshold, although we will get a high number of false positives:

threshold <- 100
y_preds <- sapply(error, function(x){ifelse(x>threshold,"outlier","normal")})
# Confusion matrix
table(y_preds,y_test)
         y_test
y_preds     0   1
  normal  142   7
  outlier  73  35

We do a reasonably good job of catching the suspicious emails, at the cost of 73 false positives. There is always a trade-off between accepting more false positives and missing true positives. The model could be improved by adding more data: we used only around 800 emails out of the 500,000 available, so there is clearly room for improvement. Nonetheless, the model works reasonably well, as confirmed by an AUC of 0.79 and the ROC plot:

library(ROCR)
pred <- prediction(error, y_test)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- unlist(performance(pred, measure = "auc")@y.values)
plot(perf, col=rainbow(10))
auc
[1] 0.7951274

The ROC curve for the Enron dataset is shown as follows. This kind of diagnostic applies to binary classifiers in general, not only to this particular case:

ROC curve for the Enron dataset.
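
If we want to revisit the choice of threshold in light of the ROC analysis, the ROCR performance object also stores the cutoffs alongside the corresponding false and true positive rates. A small sketch using its S4 slots:

# Candidate thresholds together with their false and true positive rates
cutoffs <- data.frame(
  threshold = perf@alpha.values[[1]],
  fpr = perf@x.values[[1]],
  tpr = perf@y.values[[1]]
)
# For instance, thresholds that catch at least 80% of the responsive emails
head(cutoffs[cutoffs$tpr >= 0.8, ])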