Chapter 2. Applying Machine Learning to Structured Data

Structured data is a term used for any data that resides in a fixed field within a record or file; relational databases and spreadsheets are two common examples. Usually, structured data is presented in a table in which each column represents a type of value and each row represents a new entry. Its structured format means that this type of data lends itself to classical statistical analysis, which is also why most data science and analysis work is done on structured data.

In day-to-day life, structured data is also the most common type of data available to businesses, and most machine learning problems that need to be solved in finance deal with structured data in one way or another. The fundamentals of any modern company's day-to-day running are built around structured data: transactions, order books, option prices, and supplier records are all examples of information usually collected in spreadsheets or databases.

This chapter will walk you through a structured data problem involving credit card fraud, where we will use feature engineering to identify fraudulent transactions in a dataset. We'll also introduce the basics of an end-to-end (E2E) approach so that we can solve common financial problems.

Fraud is an unfortunate reality that all financial institutions have to deal with. It's a constant race between companies trying to protect their systems and fraudsters trying to defeat the protection in place. For a long time, fraud detection has relied on simple heuristics. For example, a large transaction made while you're in a country other than the one you live in will likely be flagged.

Yet, as fraudsters continue to understand and circumvent these rules, credit card providers are deploying increasingly sophisticated machine learning systems to counter them.

In this chapter, we'll look at how a real bank might tackle the problem of fraud. It's a real-world exploration of how a team of data scientists starts with a heuristic baseline, then develops an understanding of its features, and from that, builds increasingly sophisticated machine learning models that can detect fraud. While the data we will use is synthetic, the process of development and tools that we'll use to tackle fraud are similar to the tools and processes that are used every day by international retail banks.

So where do you start? To put it in the words of one anonymous fraud detection expert that I spoke to, "I keep thinking about how I would steal from my employer, and then I create some features that would catch my heist. To catch a fraudster, think like a fraudster." Yet even the most ingenious feature engineers cannot pick up on all the subtle and sometimes counterintuitive signs of fraud, which is why the industry is slowly shifting toward entirely E2E-trained systems. Both approaches, feature-based machine learning and E2E-trained systems, are focuses of this chapter, in which we will explore several commonly used approaches to flag fraud.

This chapter will act as an important baseline for Chapter 6, Using Generative Models, where we will revisit the credit card fraud problem and build a full E2E model using auto-encoders.

The data

The dataset we will work with is a synthetic dataset of transactions generated by a payment simulator. The goal of this case study and the focus of this chapter is to find fraudulent transactions within a dataset, a classic machine learning problem many financial institutions deal with.

Note

Before we go further, a digital copy of the code and an interactive notebook for this chapter are accessible online via the following two links:

An interactive notebook containing the code for this chapter can be found under https://www.kaggle.com/jannesklaas/structured-data-code

The code can also be found on GitHub, in this book's repository: https://github.com/PacktPublishing/Machine-Learning-for-Finance

The dataset we're using stems from the paper PaySim: A financial mobile money simulator for fraud detection, by E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. The dataset can be found on Kaggle under this URL: https://www.kaggle.com/ntnu-testimon/paysim1.

Before we break it down, let's take a minute to look at the dataset that we'll be using in this chapter. Remember, you can download the data with the preceding link.

step	type	amount	nameOrig	oldbalanceOrig	newbalanceOrig	nameDest	oldbalanceDest	newbalanceDest	isFraud
1	PAYMENT	9839.64	C1231006815	170136.0	160296.36	M1979787155	0.0	0.0	0
1	PAYMENT	1864.28	C1666544295	21249.0	19384.72	M2044282225	0.0	0.0	0
1	TRANSFER	181.0	C1305486145	181.0	0.0	C553264065	0.0	0.0	1
1	CASH_OUT	181.0	C840083671	181.0	0.0	C38997010	21182.0	0.0	1
1	PAYMENT	11668.14	C2048537720	41554.0	29885.86	M1230701703	0.0	0.0	0
1	PAYMENT	7817.71	C90045638	53860.0	46042.29	M573487274	0.0	0.0	0
1	PAYMENT	7107.77	C154988899	183195.0	176087.23	M408069119	0.0	0.0	0
1	PAYMENT	7861.64	C1912850431	176087.23	168225.59	M633326333	0.0	0.0	0
1	PAYMENT	4024.36	C1265012928	2671.0	0.0	M1176932104	0.0	0.0	0
1	DEBIT	5337.77	C712410124	41720.0	36382.23	C195600860	41898.0	40348.79	0

The dataset has 11 columns. The table above shows the first 10; the last one, isFlaggedFraud, is omitted there. Let's explain what each one represents before we move on:

step: Maps time, with each step corresponding to one hour.
type: The type of the transaction, which can be CASH_IN, CASH_OUT, DEBIT, PAYMENT, or TRANSFER.
amount: The amount of the transaction.
nameOrig: The origin account that started the transaction. Names starting with C are customer accounts, while names starting with M are merchant accounts.
oldbalanceOrig: The old balance of the origin account.
newbalanceOrig: The new balance of the origin account after the transaction has been applied.
nameDest: The destination account.
oldbalanceDest: The old balance of the destination account. This information is not available for merchant accounts whose names start with M.
newbalanceDest: The new balance of the destination account. This information is not available for merchant accounts.
isFraud: Whether the transaction was fraudulent.
isFlaggedFraud: Whether the old system has flagged the transaction as fraud.
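
To make the column descriptions concrete, here is a minimal sketch that parses three rows copied from the sample table and derives two values from them: the account kind from the name prefix and an hour of day from step. It uses only Python's standard library. The helper names (account_kind, parse_rows) and the derived keys (destKind, hourOfDay) are hypothetical, and treating step modulo 24 as a clock hour is an assumption rather than something the dataset guarantees. The column names follow the list above; the header of the downloaded file may spell some of them differently, so check it before reusing this.

```python
import csv
import io

# Three rows copied from the sample table above (tab-separated).
SAMPLE = (
    "step\ttype\tamount\tnameOrig\toldbalanceOrig\tnewbalanceOrig\t"
    "nameDest\toldbalanceDest\tnewbalanceDest\tisFraud\n"
    "1\tPAYMENT\t9839.64\tC1231006815\t170136.0\t160296.36\t"
    "M1979787155\t0.0\t0.0\t0\n"
    "1\tTRANSFER\t181.0\tC1305486145\t181.0\t0.0\t"
    "C553264065\t0.0\t0.0\t1\n"
    "1\tDEBIT\t5337.77\tC712410124\t41720.0\t36382.23\t"
    "C195600860\t41898.0\t40348.79\t0\n"
)

NUMERIC_COLUMNS = (
    "amount",
    "oldbalanceOrig",
    "newbalanceOrig",
    "oldbalanceDest",
    "newbalanceDest",
)


def account_kind(name):
    """Map the first letter of an account name to a readable label."""
    return {"C": "customer", "M": "merchant"}.get(name[:1], "unknown")


def parse_rows(text):
    """Parse tab-separated rows and cast each column to a useful type."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row = dict(raw)
        row["step"] = int(raw["step"])
        row["isFraud"] = raw["isFraud"] == "1"
        for col in NUMERIC_COLUMNS:
            row[col] = float(raw[col])
        # Derived values (illustrative only, not part of the dataset).
        row["destKind"] = account_kind(raw["nameDest"])
        # Assumption: one step is one hour, so step % 24 acts as hour of day.
        row["hourOfDay"] = row["step"] % 24
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE)
for r in rows:
    print(r["type"], r["destKind"], r["hourOfDay"], r["isFraud"])
```

In the first sample row, the old origin balance minus the amount equals the new origin balance, which is a quick way to check that the balance columns have been read correctly.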

In the preceding table, we can see 10 rows of data. The full dataset contains about 6.3 million transactions, so what we've seen is only a tiny fraction of it. Because the fraud we're looking at occurs only in transactions marked as either TRANSFER or CASH_OUT, all other transactions can be dropped, leaving us with around 2.8 million examples to work with.
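
The filtering step described above can be sketched as follows. This is a minimal illustration using only Python's standard library on a tiny inline sample, not the code from the chapter's notebook. The names FRAUD_PRONE_TYPES and keep_fraud_prone are hypothetical, and the inline sample keeps only four of the dataset's columns for brevity.

```python
import csv
import io

# Transaction types in which fraud occurs in this dataset, per the text above.
FRAUD_PRONE_TYPES = {"TRANSFER", "CASH_OUT"}


def keep_fraud_prone(rows):
    """Yield only the rows whose type is TRANSFER or CASH_OUT."""
    for row in rows:
        if row["type"] in FRAUD_PRONE_TYPES:
            yield row


# A tiny stand-in for the real file, with values taken from the sample table.
sample = io.StringIO(
    "step,type,amount,isFraud\n"
    "1,PAYMENT,9839.64,0\n"
    "1,TRANSFER,181.0,1\n"
    "1,CASH_OUT,181.0,1\n"
    "1,DEBIT,5337.77,0\n"
)

kept = list(keep_fraud_prone(csv.DictReader(sample)))
print(len(kept), [row["type"] for row in kept])
```

Because keep_fraud_prone is a generator, it can stream over the full file row by row without holding all 6.3 million transactions in memory. If you work with pandas instead, the same selection is a boolean mask built with the isin method on the type column.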
