Chapter 2. Applying Machine Learning to Structured Data

Structured data is a term used for any data that resides in a fixed field within a record or file; relational databases and spreadsheets are two examples. Usually, structured data is presented in a table in which each column represents a type of value and each row represents a new entry. This structured format lends itself to classical statistical analysis, which is also why most data science and analysis work is done on structured data.

In day-to-day life, structured data is also the most common type of data available to businesses, and most machine learning problems that need to be solved in finance deal with structured data in one way or another. The day-to-day running of any modern company is built around structured data: transactions, order books, option prices, and suppliers are all examples of information usually collected in spreadsheets or databases.

This chapter will walk you through a structured data problem involving credit card fraud, where we will use feature engineering to identify the fraudulent transactions in a dataset. We'll also introduce the basics of an end-to-end (E2E) approach so that we can solve common financial problems.

Fraud is an unfortunate reality that all financial institutions have to deal with. It's a constant race between companies trying to protect their systems and fraudsters who are trying to defeat the protection in place. For a long time, fraud detection has relied on simple heuristics. For example, a large transaction made while you're in a country you usually don't live in will likely result in that transaction being flagged.

Yet, as fraudsters continue to understand and circumvent the rules, credit card providers are deploying increasingly sophisticated machine learning systems to counter this.

In this chapter, we'll look at how a real bank might tackle the problem of fraud. It's a real-world exploration of how a team of data scientists starts with a heuristic baseline, then develops an understanding of its features, and from that, builds increasingly sophisticated machine learning models that can detect fraud. While the data we will use is synthetic, the process of development and tools that we'll use to tackle fraud are similar to the tools and processes that are used every day by international retail banks.

So where do you start? To put it in the words of one anonymous fraud detection expert whom I spoke to, "I keep thinking about how I would steal from my employer, and then I create some features that would catch my heist. To catch a fraudster, think like a fraudster." Yet even the most ingenious feature engineers are not able to pick up on all the subtle and sometimes counterintuitive signs of fraud, which is why the industry is slowly shifting toward entirely E2E-trained systems. Both these systems and machine learning more generally are focuses of this chapter, in which we will explore several commonly used approaches to flagging fraud.

This chapter will act as an important baseline to Chapter 6, Using Generative Models, where we will again be revisiting the credit card fraud problem for a full E2E model using auto-encoders.

The data

The dataset we will work with is a synthetic dataset of transactions generated by a payment simulator. The goal of this case study and the focus of this chapter is to find fraudulent transactions within a dataset, a classic machine learning problem many financial institutions deal with.

Note

Before we go further, a digital copy of the code and an interactive notebook for this chapter are accessible online via the following two links:

An interactive notebook containing the code for this chapter can be found under https://www.kaggle.com/jannesklaas/structured-data-code

The code can also be found on GitHub, in this book's repository: https://github.com/PacktPublishing/Machine-Learning-for-Finance

The dataset we're using stems from the paper PaySim: A financial mobile money simulator for fraud detection, by E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. The dataset can be found on Kaggle under this URL: https://www.kaggle.com/ntnu-testimon/paysim1.

Before we break it down, let's take a minute to look at the dataset that we'll be using in this chapter. Remember, you can download the data using the preceding link.

step  type      amount    nameOrig     oldBalanceOrig  newBalanceOrig  nameDest     oldBalanceDest  newBalanceDest  isFraud  isFlaggedFraud
1     PAYMENT   9839.64   C1231006815  170136.0        160296.36       M1979787155  0.0             0.0             0        0
1     PAYMENT   1864.28   C1666544295  21249.0         19384.72        M2044282225  0.0             0.0             0        0
1     TRANSFER  181.0     C1305486145  181.0           0.0             C553264065   0.0             0.0             1        0
1     CASH_OUT  181.0     C840083671   181.0           0.0             C38997010    21182.0         0.0             1        0
1     PAYMENT   11668.14  C2048537720  41554.0         29885.86        M1230701703  0.0             0.0             0        0
1     PAYMENT   7817.71   C90045638    53860.0         46042.29        M573487274   0.0             0.0             0        0
1     PAYMENT   7107.77   C154988899   183195.0        176087.23       M408069119   0.0             0.0             0        0
1     PAYMENT   7861.64   C1912850431  176087.23       168225.59       M633326333   0.0             0.0             0        0
1     PAYMENT   4024.36   C1265012928  2671.0          0.0             M1176932104  0.0             0.0             0        0
1     DEBIT     5337.77   C712410124   41720.0         36382.23        C195600860   41898.0         40348.79        0        0

As seen in the first row, the dataset has 11 columns. Let's explain what each one represents before we move on:

  • step: Maps time, with each step corresponding to one hour.
  • type: The type of the transaction, which can be CASH_IN, CASH_OUT, DEBIT, PAYMENT, or TRANSFER.
  • amount: The amount of the transaction.
  • nameOrig: The origin account that started the transaction. C relates to customer accounts, while M is the account of merchants.
  • oldbalanceOrig: The old balance of the origin account.
  • newbalanceOrig: The new balance of the origin account after the transaction has been applied.
  • nameDest: The destination account.
  • oldbalanceDest: The old balance of the destination account. This information is not available for merchant accounts whose names start with M.
  • newbalanceDest: The new balance of the destination account. This information is not available for merchant accounts.
  • isFraud: Whether the transaction was fraudulent.
  • isFlaggedFraud: Whether the old system has flagged the transaction as fraud.
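To make these column descriptions concrete, here is a minimal sketch that loads three rows copied from the sample table with pandas and derives a few quantities from them. The header spellings follow the list above and are an assumption about the file; the published CSV may spell some of them slightly differently, so check `df.columns` after loading the real data. The derived column names (`hour_in_cycle`, `dest_is_merchant`, `orig_balance_change`) are hypothetical and used only for illustration.

```python
import io

import pandas as pd

# Three rows copied from the sample table. Header names are assumed to match
# the column list in this chapter; verify them against the real file.
SAMPLE_CSV = """step,type,amount,nameOrig,oldbalanceOrig,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
1,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Each step is one hour, so step modulo 24 gives a position within a
# 24-hour cycle (assumption: we do not know which clock hour step 1 maps to).
df["hour_in_cycle"] = df["step"] % 24

# Account names start with C for customers and M for merchants.
df["dest_is_merchant"] = df["nameDest"].str.startswith("M")

# Change in the origin account's balance caused by the transaction.
df["orig_balance_change"] = (df["newbalanceOrig"] - df["oldbalanceOrig"]).round(2)

print(df[["type", "amount", "dest_is_merchant", "orig_balance_change"]])
```

In these three rows the origin balance falls by exactly the transaction amount, and the one merchant destination shows zero balances, which matches the note that balance information is not available for merchant accounts.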

In the preceding table, we can see 10 rows of data. It's worth noting that there are about 6.3 million transactions in our total dataset, so what we've seen is a small fraction of the total amount. As the fraud we're looking at only occurs in transactions marked as either TRANSFER or CASH_OUT, all other transactions can be dropped, leaving us with around 2.8 million examples to work with.
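The filtering step described above can be sketched as follows. This is an illustration on five rows taken from the sample table, reduced to the three columns needed here; the variable names `fraud_types`, `fraud_by_type`, and `df_filtered` are hypothetical. On the full dataset, the same two operations would be applied to the DataFrame loaded from the downloaded CSV.

```python
import io

import pandas as pd

# Five rows from the sample table, keeping only the columns used below.
SAMPLE_CSV = """type,amount,isFraud
PAYMENT,9839.64,0
TRANSFER,181.0,1
CASH_OUT,181.0,1
PAYMENT,11668.14,0
DEBIT,5337.77,0
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count fraud cases per transaction type to check where fraud occurs.
fraud_by_type = df.groupby("type")["isFraud"].sum()
print(fraud_by_type)

# Keep only the transaction types in which fraud occurs.
fraud_types = ["TRANSFER", "CASH_OUT"]
df_filtered = df[df["type"].isin(fraud_types)].reset_index(drop=True)
print(df_filtered)
```

Running the per-type count before dropping rows is a cheap way to confirm the claim on the real data rather than take it on trust.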
