Exploratory data analysis

An often overlooked step is exploratory data analysis. Before jumping straight into modeling and trying fancy deep learning architectures, let's step back and look at the data we have.

Let's begin by downloading the dataset from Kaggle (https://www.kaggle.com/dalpozz/creditcardfraud) and importing it into R:

# Read the raw CSV; keep strings as plain character vectors
df <- read.csv("./data/creditcard.csv", stringsAsFactors = FALSE)
# Peek at the first rows
head(df)

Before moving on, we should do a basic sanity check. Some of the things we should look for are:

  • Verify that there are indeed only two classes (0 for normal transactions, 1 for fraudulent)
  • Verify that the timestamp corresponds to two days
  • Check that there are no missing values
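
The three checks above can be sketched in a few lines of base R. This is a minimal sketch, not the book's own code: the helper name sanity_check is hypothetical, and it assumes that Time is recorded in seconds elapsed since the first transaction. The toy data frame holds made-up values so that the snippet runs without the CSV file.

```r
# Hypothetical helper: collects the basic checks for a data frame with
# Time, Amount and Class columns, as in the Kaggle credit card file.
sanity_check <- function(df) {
  list(
    # Distinct class labels actually present
    classes = sort(unique(df$Class)),
    # TRUE only if the labels are exactly 0 and 1
    only_two = setequal(as.character(unique(df$Class)), c("0", "1")),
    # Assumption: Time is in seconds, so dividing by 86400 gives days
    time_span_days = (max(df$Time) - min(df$Time)) / (60 * 60 * 24),
    # Total number of missing cells across all columns
    n_missing = sum(is.na(df))
  )
}

# Toy data with made-up values, standing in for creditcard.csv
toy <- data.frame(
  Time   = c(0, 3600, 86400, 172000),
  Amount = c(12.5, 300, 99.9, 1.0),
  Class  = c(0, 0, 1, 0)
)

checks <- sanity_check(toy)
str(checks)
```

On the real data, calling sanity_check(df) right after read.csv should report two classes, a time span close to two days, and zero missing values if the file matches its description.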

Once this is done, we can run a couple of quick checks. One idea is to see whether there is an obvious pattern between the time of day and the amount. Perhaps fraudulent transactions happen at a certain time, when our system is vulnerable? We should check this first:

library(ggplot2)
library(dplyr)
# Scatter plot of amount against time, one panel per class
df %>% ggplot(aes(Time, Amount)) + geom_point() + facet_grid(Class ~ .)

To see whether there is some seasonality pattern, we simply plot the time variable against the amount, one panel per class:

Quick inspection for fraud: class 0 corresponds to normal transactions and class 1 to fraudulent transactions.
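
Because the question is about the time of day, it can help to fold the raw Time column into an hour bucket before looking for patterns. The following is a rough sketch under one assumption: Time is the number of seconds elapsed since the first transaction in the file, so hour 0 is the hour of that first transaction and not necessarily midnight. The helper name add_hour and the toy values are made up for illustration.

```r
# Hypothetical helper: adds an Hour column (0 to 23) derived from Time.
# Assumption: Time is in seconds since the first recorded transaction.
add_hour <- function(df) {
  df$Hour <- (df$Time %/% 3600) %% 24
  df
}

# Toy data with made-up values, standing in for creditcard.csv
toy <- data.frame(
  Time  = c(0, 3599, 3600, 90000, 172799),
  Class = c(0, 0, 1, 0, 1)
)
toy <- add_hour(toy)

# Number of transactions per hour bucket and class
counts <- table(Hour = toy$Hour, Class = toy$Class)
print(counts)
```

On the real data, the same table (or a bar chart of Hour split by Class) shows whether fraudulent transactions cluster in particular hours, which is harder to judge from the raw scatter plot.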

Nothing jumps out of the scatter plot. Interestingly, the amounts involved in fraudulent transactions are much lower than in normal transactions. This suggests we should filter out the large transactions and look at the rest on the right scale. For this, let's use dplyr to keep only the transactions below 300:

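# Treat the class label as a categorical variable
df$Class <- as.factor(df$Class)
# Violin plot of the amount per class, for transactions below 300
df %>% filter(Amount < 300) %>% ggplot(aes(Class, Amount)) + geom_violin()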

How does the distribution look by class? The following plot tells us something:

First insight on the data: The amount involved in fraudulent transactions seems more likely to be around 100 than in non-fraudulent transactions.

Aha! So we get our first insight on the data! Fraudulent transactions, although much smaller, are anomalously centered around 100. This might be part of the fraudster's strategy: instead of moving large amounts at regular times, they hide small amounts more or less uniformly in time.
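
A visual impression like this one is worth cross-checking with numbers. The sketch below summarizes the amount per class for transactions under the same cutoff used in the violin plot. The function name summarise_amount_by_class is hypothetical, and the toy values are made up so the snippet runs on its own; they say nothing about the real dataset.

```r
library(dplyr)

# Hypothetical helper: per-class summary of Amount below a cutoff.
summarise_amount_by_class <- function(df, max_amount = 300) {
  df %>%
    filter(Amount < max_amount) %>%
    group_by(Class) %>%
    summarise(
      n      = n(),
      q25    = quantile(Amount, 0.25, names = FALSE),
      median = median(Amount),
      q75    = quantile(Amount, 0.75, names = FALSE),
      mean   = mean(Amount)
    )
}

# Toy data with made-up values, standing in for creditcard.csv
toy <- data.frame(
  Class  = c(0, 0, 0, 1, 1, 1),
  Amount = c(10, 20, 500, 5, 50, 120)
)

res <- summarise_amount_by_class(toy)
print(res)
```

Running summarise_amount_by_class(df) on the real data gives the quartiles behind each violin, which makes it easier to judge whether the concentration of fraudulent amounts is as pronounced as the plot suggests.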

Sure, this was fun to find out, but it is definitely not a scalable approach and requires domain knowledge and intuition. It is time to try something more sophisticated.
