Credit card fraud detection with autoencoders

Fraud is a multi-billion dollar industry, and credit card fraud is probably the kind closest to our daily lives. Fraud begins with the theft of the physical credit card, or of data that could compromise the security of the account, such as the credit card number, expiration date, and security codes. If the victim knows that their card has been stolen, it can be reported immediately; when only the data is stolen, however, a compromised account can take weeks or even months to be exploited, and the victim learns only from their bank statement that the card has been used.

Traditionally, fraud detection systems rely on the creation of manually engineered features by subject matter experts, working either directly with financial institutions or with specialized software vendors. 

One of the biggest challenges in fraud detection is the availability of labelled datasets, which are often hard or even impossible to come by.

Our first fraud example comes from a dataset made public on Kaggle (https://www.kaggle.com/dalpozz/creditcardfraud) by researchers from the Université Libre de Bruxelles in Belgium (for the full work, you can read their paper: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi, Calibrating Probability with Undersampling for Unbalanced Classification, in Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015).

The dataset contains transactions made with credit cards over two days in September 2013 by European cardholders. We have 492 frauds out of 284,807 transactions. Unlike toy datasets (I am looking at you, Iris), real-life datasets are highly unbalanced; in this example, the positive class (frauds) accounts for only 0.172% of all transactions.

The dataset contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the authors cannot provide the original features or more background information about the data. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are Time and Amount.

The Time feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The Amount feature is the transaction amount; this feature can be used for example-dependent, cost-sensitive learning. The Class feature is the response variable, and it takes the value 1 in case of fraud and 0 otherwise. Given the class imbalance ratio, the authors recommend measuring performance with the area under the precision-recall curve (AUPRC) rather than with accuracy from the confusion matrix. Note that the precision-recall curve is not the same as the ROC (receiver operating characteristic) curve; under heavy class imbalance, the precision-recall curve is the more informative of the two.
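To make these numbers concrete, here is a minimal sketch in Python, assuming the Kaggle file has been downloaded locally as creditcard.csv and that pandas and scikit-learn are available. It loads the data, checks the class imbalance, and shows how the area under the precision-recall curve can be computed with average_precision_score; the score column used here is a placeholder, not a real model output.

```python
# A minimal sketch (assumes the Kaggle dataset is saved as creditcard.csv):
# load the data, inspect the class imbalance, and compute AUPRC for a dummy score.
import pandas as pd
from sklearn.metrics import average_precision_score

df = pd.read_csv("creditcard.csv")   # columns: Time, V1..V28, Amount, Class

# Class imbalance: frauds are a tiny fraction of all transactions
n_fraud = int((df["Class"] == 1).sum())
print("frauds:", n_fraud, "out of", len(df), "transactions")
print("fraud ratio: {:.3%}".format(n_fraud / len(df)))

# Area under the precision-recall curve (average precision) for some score.
# `scores` would normally come from a model; Amount is used purely for illustration.
scores = df["Amount"]
print("AUPRC (placeholder score):", average_precision_score(df["Class"], scores))
```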

At this point you might be thinking: well, why should I bother with autoencoders, since this is clearly a binary classification problem and we already have the labeled data? Sure, you can go the traditional route and apply standard supervised learning algorithms, such as random forests or support vector machines; just be careful to either oversample the fraud class or undersample the normal class so that these methods can perform well, as in the sketch that follows. However, in many real-life settings we do not have labeled data beforehand, and in complex fraud scenarios it can be very tricky to get an accurate label.
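Purely as a point of comparison for that supervised route, here is a hedged sketch of a random forest trained after undersampling the normal class; the 1:1 undersampling ratio, the file name, and the variable names are illustrative assumptions, not a prescription.

```python
# A sketch of the supervised baseline: undersample the normal class to a 1:1 ratio,
# then fit a random forest and score it with AUPRC on a held-out split.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

df = pd.read_csv("creditcard.csv")
frauds = df[df["Class"] == 1]
normals = df[df["Class"] == 0].sample(n=len(frauds), random_state=42)  # undersample
balanced = pd.concat([frauds, normals])

X = balanced.drop(columns=["Class"])
y = balanced["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
print("AUPRC on the balanced test split:", average_precision_score(y_test, probs))
```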

Suppose you are a criminal intent on committing fraud. Before the fraud (or even after), you may have completely normal activity in your account. So should we flag all of your transactions as rogue? Or only a certain subset? Some people in the business may argue that, since the transactions were committed by a criminal, they are all tainted somehow and we should flag all of your activity, which introduces bias into the model. Instead of relying on the label, we will treat the problem as an anomaly detection (or outlier detection) problem and use autoencoders, as before.
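Below is a minimal sketch of that autoencoder route in Python, assuming a Keras/TensorFlow setup; the layer sizes, the number of epochs, and the 99th-percentile threshold are illustrative choices, not tuned values. The idea is simply to train on normal transactions only and flag transactions whose reconstruction error is unusually high.

```python
# A minimal autoencoder sketch for anomaly detection (illustrative architecture).
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import layers, models

df = pd.read_csv("creditcard.csv")
features = df.drop(columns=["Class", "Time"])
X = StandardScaler().fit_transform(features)
X_normal = X[df["Class"].values == 0]        # train on non-fraud transactions only

n_inputs = X.shape[1]
autoencoder = models.Sequential([
    layers.Input(shape=(n_inputs,)),
    layers.Dense(16, activation="relu"),     # encoder
    layers.Dense(8, activation="relu"),      # bottleneck
    layers.Dense(16, activation="relu"),     # decoder
    layers.Dense(n_inputs, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_normal, X_normal, epochs=5, batch_size=256, shuffle=True)

# Reconstruction error as the anomaly score: frauds should reconstruct poorly.
reconstructions = autoencoder.predict(X)
errors = np.mean((X - reconstructions) ** 2, axis=1)
flagged = errors > np.quantile(errors, 0.99)  # flag the worst-reconstructed 1%
```

In practice, the threshold on the reconstruction error would be chosen on a validation set, trading precision against recall along the precision-recall curve discussed earlier.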
