The feature engineering approach

The objective of feature engineering is to exploit the qualitative insight of humans in order to create better machine learning models. A human engineer usually uses three types of insight: intuition, expert domain knowledge, and statistical analysis. Quite often, it's possible to come up with features for a problem just from intuition.

As an example, in our fraud case, it seems intuitive that fraudsters will create new accounts for their fraudulent schemes and won't be using the same bank account that they pay for their groceries with.

Domain experts are able to use their extensive knowledge of a problem to come up with other such intuitions. They know more about how fraudsters behave and can craft features that indicate such behavior. These intuitions are then usually confirmed through statistical analysis, which can also uncover entirely new features.

Statistical analysis can sometimes turn up quirks that can be turned into predictive features. However, engineers using this method must beware of the data trap: features found in the data might only exist in that data, because any dataset will yield a seemingly predictive feature if it is wrangled with for long enough.

The data trap refers to engineers digging through the data for features endlessly, without ever questioning whether the features they find are actually relevant.

Data scientists stuck in the data trap keep euphorically finding features, only to realize later that their model, with all those features, does not work well. Finding strong predictive features in the training set is like a drug for data science teams. Yes, there's an immediate reward, a quick win that feels like a validation of one's skills. However, as with many drugs, the data trap can lead to an after-effect in which teams find that weeks' or months' worth of work spent finding those features was actually useless.

Take a minute to ask yourself: are you in that position? If you ever find yourself applying analysis after analysis and transforming the data in every possible way while chasing correlation values, you might very well be stuck in the data trap.

To avoid the data trap, it is important to establish a qualitative rationale for why a statistically predictive feature exists and why it should hold outside of the dataset as well. Establishing this rationale keeps both you and your team alert to the danger of crafting features that merely represent noise. The data trap is the human form of overfitting: finding patterns in noise, a problem that afflicts models just as much.

Humans can use their qualitative reasoning skills to avoid fitting noise, which is a big advantage humans have over machines. If you're a data scientist, you should use this skill to create more generalizable models.

The goal of this section is not to showcase every feature that could be engineered from this dataset, but to highlight the three types of insight and how they can be turned into features.

A feature from intuition – fraudsters don't sleep

Without knowing much about fraud, we can intuitively describe fraudsters as shady people that operate in the dark. In most cases, genuine transactions happen during the day, as people sleep at night.

Each time step in our dataset represents one hour. Therefore, we can derive the hour of the day by simply taking the remainder of the step count divided by 24, as seen in this code:

# Each step is one hour, so step modulo 24 gives the hour of the day
df['hour'] = df['step'] % 24

From there, we can then count the number of fraudulent and genuine transactions at different times. To calculate this, we must run the following code:

# Count fraudulent and genuine transactions for each hour of the day
frauds = []
genuine = []
for i in range(24):
    # Fraudulent transactions in hour i
    f = len(df[(df['hour'] == i) & (df['isFraud'] == 1)])
    # Genuine transactions in hour i
    g = len(df[(df['hour'] == i) & (df['isFraud'] == 0)])
    frauds.append(f)
    genuine.append(g)
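
As a side note, the same counts can be computed without an explicit loop. The following is a minimal sketch of one idiomatic alternative, assuming pandas is imported as pd; it is not part of the original analysis:

# Cross-tabulate the hour of the day against the fraud flag
counts = pd.crosstab(df['hour'], df['isFraud'])
genuine = counts[0].tolist()  # column 0: genuine transactions per hour
frauds = counts[1].tolist()   # column 1: fraudulent transactions per hour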

Finally, we can plot the share of genuine and fraudulent transactions over the course of the day. To do this, we must run the following code:

import numpy as np
import matplotlib.pyplot as plt

# Normalize each series by its total so the two shares are comparable
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(np.divide(genuine, np.sum(genuine)), label='Genuine')
ax.plot(np.divide(frauds, np.sum(frauds)), dashes=[5, 2], label='Fraud')
plt.xticks(np.arange(24))
legend = ax.legend(loc='upper center', shadow=True)

[Figure: The share of fraudulent and genuine transactions conducted throughout each hour of the day]

As we can see in the preceding chart, far fewer genuine transactions take place at night, while fraudulent activity continues around the clock. To be sure that night is a time when we can hope to catch fraud, we can also plot the number of fraudulent transactions as a share of all transactions. To do this, we must run the following code:

# Plot fraud as a share of all transactions in each hour of the day
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(np.divide(frauds, np.add(genuine, frauds)), label='Share of fraud')
plt.xticks(np.arange(24))
legend = ax.legend(loc='upper center', shadow=True)

[Figure: The share of transactions that are fraudulent per hour of the day]

Once we run that code, we can see that at around 5 AM, over 60% of all transactions seem to be fraudulent, which appears to make this a great time of the day to catch fraud.
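
To turn this insight into a model input, we could, for example, flag transactions in the low-traffic night hours. The following is a minimal sketch; the column name isNight and the cutoff hours are illustrative assumptions, not part of the original analysis:

# Hypothetical feature: flag transactions made at night (10 PM to 6 AM),
# when genuine traffic is low and the share of fraud is high
df['isNight'] = ((df['hour'] >= 22) | (df['hour'] <= 6)).astype(int)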

Expert insight – transfer, then cash out

The dataset's description also explains the behavior expected of fraudsters: first, they transfer money to a bank account they control, and then they cash out that money from an ATM.

We can check whether there are fraudulent transfer destination accounts that appear as the origin of fraudulent cash-outs by running the following code:

# Fraudulent transfers and fraudulent cash-outs
dfFraudTransfer = df[(df.isFraud == 1) & (df.type == 'TRANSFER')]
dfFraudCashOut = df[(df.isFraud == 1) & (df.type == 'CASH_OUT')]
# Does any transfer destination appear as the origin of a cash-out?
dfFraudTransfer.nameDest.isin(dfFraudCashOut.nameOrig).any()
out: False

According to the output, no fraudulent transfer destination appears as the origin of a fraudulent cash-out. The behavior the experts expected is not visible in our data. This could mean one of two things: either fraudsters now behave differently, or our data does not capture their behavior. Either way, we cannot use this insight for predictive modeling here.

Statistical quirks – errors in balances

A closer examination of the data shows that there are some transactions where the old and new balances of the destination account are zero, although the transaction amount is not. This is odd, a quirk even, so we want to investigate whether this type of oddity yields predictive power.

To begin with, we can calculate the share of fraudulent transactions with this property by running the following code:

# Transactions where the destination balances remain zero despite a
# non-zero amount
dfOdd = df[(df.oldBalanceDest == 0) &
           (df.newBalanceDest == 0) &
           (df.amount != 0)]
# Share of these odd transactions that are fraudulent
len(dfOdd[(dfOdd.isFraud == 1)]) / len(dfOdd)
out: 0.7046398891966759

As you can see, the share of fraudulent transactions stands at 70%, so this quirk seems to be a good feature for detecting fraud. However, it is important to ask ourselves how this quirk got into our data in the first place. One possibility is that these transactions never went through.

This could happen for a number of reasons: there might be another fraud prevention system in place that blocks the transactions, or the origin account might simply have insufficient funds.

While we have no way of verifying if there's another fraud prevention system in place, we can check to see if the origin accounts have insufficient funds. To do this, we have to run the following code:

# Share of odd transactions whose origin balance cannot cover the amount
len(dfOdd[(dfOdd.oldBalanceOrig <= dfOdd.amount)]) / len(dfOdd)
out: 0.8966412742382271

As we can see in the output, close to 90% of the odd transactions have insufficient funds in their origin accounts. From this, we can now construct a rationale in which fraudsters try to drain a bank account of all its funds more often than regular people do.

We need this rationale to avoid the data trap. Once established, the rationale must be constantly scrutinized. In our case, it fails to explain about 10% of the odd transactions, and if this number rises, it could end up hurting the performance of our model in production.
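
With the rationale established, the quirk itself can be encoded as a binary feature. The following is a minimal sketch; the column name isOddDest is our own choice and not part of the original analysis:

# Hypothetical feature: flag transactions whose destination balances are
# both zero despite a non-zero amount, the quirk examined above
df['isOddDest'] = ((df.oldBalanceDest == 0) &
                   (df.newBalanceDest == 0) &
                   (df.amount != 0)).astype(int)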
