Fraud detection with H2O

Let's try a slightly different tool that might help us in real-life deployments. It is often useful to try different tools in the ever-growing data science landscape, if only for sanity-check purposes.

H2O is open-source software for big data analytics. The young start-up behind it counts top researchers in mathematical optimization and statistical learning theory among its advisors. It runs in standard environments (Linux/Mac/Windows) as well as on big data systems and cloud computing environments.

You can run H2O in R, but you need to install the package first:

install.packages("h2o")

Once this is done, you can load the library:

library(h2o)

You will then see a welcome message, along with some warnings about objects that are masked from other packages:

Your next step is to start H2O:
> h2o.init()
For H2O package documentation, ask for help:
> ??h2o
After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

So let's do that, and then we will be ready for work:

h2o.init()
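
By default, h2o.init starts (or connects to) a local cluster with default resources. If you want more control, it also accepts arguments such as nthreads and max_mem_size; the values below are purely illustrative, so adapt them to your machine:

# Use all available cores and cap the JVM heap at 4 GB (illustrative values)
h2o.init(nthreads = -1, max_mem_size = "4G")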

Now we need to read our data into H2O. As the computations work somewhat differently, we cannot use the vanilla data frame structure from R. We can either read the file as usual and then coerce it:

# Read with base R, then coerce the data frame into an H2OFrame
df <- read.csv("./data/creditcard.csv", stringsAsFactors = F)
df <- as.h2o(df)

Or we can read it directly with the h2o.uploadFile function:

df2 <- h2o.uploadFile("./data/creditcard.csv")

Either way, the resulting structure is no longer a data frame but an H2OFrame, a reference to data held by the H2O cluster (under the hood, R represents it as an environment).
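
If you want to confirm what you are working with, a quick inspection is enough; h2o.describe returns a per-column summary computed on the H2O side:

# The object is an H2OFrame, not a data.frame
class(df2)
# Dimensions and per-column summary, computed by the H2O cluster
dim(df2)
h2o.describe(df2)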

Let's set aside one portion of the data for training and one for testing, as usual. In H2O, we can use the h2o.splitFrame function:

# 80% of the rows go to training, the remaining 20% to testing
splits <- h2o.splitFrame(df, ratios = c(0.8), seed = 1)
train <- splits[[1]]
test <- splits[[2]]
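
As a quick sanity check, we can verify that the split is roughly 80/20; nrow works on H2OFrames just as it does on data frames (note that h2o.splitFrame splits probabilistically, so the proportions are approximate):

# Roughly 80% and 20% of the original rows, respectively
nrow(train)
nrow(test)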

Now let's separate the features from the label, which will be useful in a minute:

label <- "Class"
features <- setdiff(colnames(train), label)
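
For the credit card dataset, the frame has 31 columns (Time, V1 to V28, Amount, and Class), so we expect 30 features once the label is removed:

# Should be 30 for this dataset
length(features)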

We are ready to start the training of our autoencoder:

autoencoder <- h2o.deeplearning(x = features,
                                training_frame = train,
                                autoencoder = TRUE,
                                seed = 1,
                                hidden = c(10, 2, 10),
                                epochs = 10,
                                activation = "Tanh")

Some comments are in order. The autoencoder parameter is set to TRUE, as you would expect. We will use a slightly different architecture this time, just for illustration purposes: the hidden parameter specifies the structure of the layers, here a 10-2-10 bottleneck. We will also use a different activation function. In practice, it is sometimes useful to use bounded activation functions, such as tanh, instead of ReLU, which can be numerically unstable.
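
Once training finishes, we can inspect how the reconstruction error evolved across epochs with h2o.scoreHistory:

# Per-epoch training metrics, including the reconstruction MSE
h2o.scoreHistory(autoencoder)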

We can generate the reconstructions in a similar way to what we did with Keras:

# Use the predict function as before
preds <- h2o.predict(autoencoder, test)

We get something like this:

> head(preds)
reconstr_Time reconstr_V1 reconstr_V2 reconstr_V3 reconstr_V4 reconstr_V5 reconstr_V6 reconstr_V7
1 380.1466 -0.3041237 0.2373746 1.617792 0.1876353 -0.7355559 0.3570959 -0.1331038
2 1446.0211 -0.2568674 0.2218221 1.581772 0.2254702 -0.6452812 0.4204379 -0.1337738
3 1912.0357 -0.2589679 0.2212748 1.578886 0.2171786 -0.6604871 0.4070894 -0.1352975
4 1134.1723 -0.3319681 0.2431342 1.626862 0.1473913 -0.8192215 0.2911475 -0.1369512
5 1123.6757 -0.3194054 0.2397288 1.619868 0.1612631 -0.7887480 0.3140728 -0.1362253
6 1004.4545 -0.3589335 0.2508191 1.643208 0.1196120 -0.8811920 0.2451117 -0.1380364

And from here on, we can proceed as before. However, H2O has a built-in function, h2o.anomaly, that simplifies part of our work.
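
As an aside, h2o.anomaly also accepts a per_feature argument: with per_feature = TRUE, it returns the squared reconstruction error per input column instead of a single MSE per row, which can help to see which features drive an anomaly:

# Squared reconstruction error broken down by input column
per_feature_err <- h2o.anomaly(autoencoder, test, per_feature = TRUE)
head(per_feature_err)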

Another simplification: instead of importing ggplot2 and dplyr separately, we can import the tidyverse package, which brings these (and other) packages useful for data manipulation into our environment.

We call this function and do a bit of formatting: we turn the row names into a column of their own and add the label with the real class:

library(tidyverse)
anomaly <- h2o.anomaly(autoencoder, test) %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  mutate(Class = as.vector(test[, 31]))

Let's calculate the average mean squared error for each class:

# Type coercion useful for plotting later
anomaly$Class <- as.factor(anomaly$Class)
mean_mse <- anomaly %>%
  group_by(Class) %>%
  summarise(mean = mean(Reconstruction.MSE))

And finally, let's visualize our test data according to the reconstruction error.

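One way to draw such a figure, sketched here with ggplot2 (already loaded via the tidyverse), is to use the rowname column created earlier as the x axis and overlay the per-class mean error from mean_mse:

# Reconstruction MSE per test observation, colored by the true class,
# with the per-class mean error drawn as a horizontal reference line
ggplot(anomaly, aes(x = as.numeric(rowname), y = Reconstruction.MSE, color = Class)) +
  geom_point(alpha = 0.3) +
  geom_hline(data = mean_mse, aes(yintercept = mean, color = Class)) +
  labs(x = "observation", y = "reconstruction MSE")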

We see that the autoencoder does a decent, though far from perfect, job: a good proportion of the fraud cases have a relatively high reconstruction error. How could you improve it?

Figure: Results from our architecture using H2O. The autoencoder does a good job flagging the fraud cases, but it could still be improved.
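
One natural next step, sketched below, is to turn the reconstruction error into a classifier by thresholding it; the cutoff value here is purely illustrative and would normally be tuned on a validation set:

# Flag observations whose reconstruction error exceeds a threshold
# (the 0.02 cutoff is illustrative; tune it on validation data)
threshold <- 0.02
anomaly$predicted <- as.factor(ifelse(anomaly$Reconstruction.MSE > threshold, 1, 0))
table(actual = anomaly$Class, predicted = anomaly$predicted)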