An often overlooked step is exploratory data analysis. Before jumping straight into modeling and trying fancy deep learning architectures, let's step back and look at what we actually have.
Let's begin by downloading the dataset from Kaggle (https://www.kaggle.com/dalpozz/creditcardfraud) and importing it into R:
df <- read.csv("./data/creditcard.csv", stringsAsFactors = FALSE)
head(df)
Before moving on, we should do a basic sanity check. Some of the things we should look for are:
- Verify that there are indeed only two classes (0 for normal transactions, 1 for fraudulent)
- Verify that the timestamps span two days
- Check that there are no missing values
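The three checks above can be sketched in a few lines of base R. This is only a minimal illustration: `sanity_check` is a hypothetical helper name, the `toy` data frame is made up so the snippet runs on its own, and the conversion to hours assumes that the Time column is measured in seconds since the first transaction.

```r
# Hypothetical helper (not part of the original text): one value per check.
sanity_check <- function(df) {
  list(
    classes        = sort(unique(df$Class)),       # expect exactly 0 and 1
    time_span_hrs  = diff(range(df$Time)) / 3600,  # roughly 48 if Time is in seconds (assumption)
    n_missing      = sum(is.na(df))                # expect 0
  )
}

# Made-up stand-in for creditcard.csv, used only to make the sketch self-contained
toy <- data.frame(
  Time   = c(0, 3600, 86400, 172800),
  Amount = c(10.5, 99.9, 250, 1.0),
  Class  = c(0, 0, 1, 0)
)

res <- sanity_check(toy)
str(res)
```

On the real data, calling `sanity_check(df)` right after `read.csv` should be enough to confirm or refute each point in the list.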
Once this is done, we can perform two quick checks. The first idea is to see whether there is an obvious pattern between the time of day and the amount. Perhaps fraudulent transactions happen at a certain time, when our system is vulnerable? We should check this first:
library(ggplot2)
library(dplyr)
# Scatter plot of Amount over Time, one panel per class
df %>% ggplot(aes(Time, Amount)) + geom_point() + facet_grid(Class ~ .)
This checks for a seasonality pattern: we simply plot the time variable against the amount, with one panel per class.
So nothing jumps out. Interestingly, the amounts involved in fraudulent transactions are much lower than in normal transactions. This suggests we should drop the largest transactions and look at the rest on a more suitable scale. For this, let's use dplyr to keep only the transactions below 300:
# Treat Class as categorical so that each class gets its own violin
df$Class <- as.factor(df$Class)
df %>% filter(Amount < 300) %>% ggplot(aes(Class, Amount)) + geom_violin()
How does the distribution look by class? The following plot tells us something:
Aha! So we get our first insight into the data! Fraudulent transactions, although much smaller, are anomalously concentrated around 100. This might be part of the fraudsters' strategy: instead of moving large amounts at regular times, they hide small amounts more or less uniformly in time.
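A visual impression like this is worth backing with numbers. The following is a minimal sketch, not part of the original analysis, that summarizes the amount per class with dplyr after the same filter used for the violin plot. The `toy` data frame and the column names `median_amount` and `mean_amount` are made up for illustration, so the values it produces say nothing about the real dataset.

```r
library(dplyr)

# Made-up stand-in for the real data, used only to make the sketch self-contained
toy <- data.frame(
  Amount = c(10, 20, 400, 90, 100, 110),
  Class  = factor(c(0, 0, 0, 1, 1, 1))
)

# Same filter as for the violin plot, then one summary row per class
by_class <- toy %>%
  filter(Amount < 300) %>%
  group_by(Class) %>%
  summarise(
    n             = n(),             # transactions kept in this class
    median_amount = median(Amount),  # robust measure of the center
    mean_amount   = mean(Amount)
  )

print(by_class)
```

Replacing `toy` with `df` gives the per-class summary for the real transactions, which can then be compared with what the violin plot suggests.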
Sure, this was fun to find out, but it is definitely not a scalable approach and requires domain knowledge and intuition. It is time to try something more sophisticated.