Fraud is deception that is perpetrated with either financial or personal gain in mind. Many people think about identity theft or credit card theft when thinking about fraud, but that’s only a very small part of the picture. A person claiming or being credited with another’s accomplishments or qualities is another form of fraud. The key word when thinking about fraud is deception, which takes many forms – for example, disinformation and hypocrisy. Consequently, it’s important to come up with a solid definition of what fraud is when working with ML applications and data, which is the goal of the first part of this chapter.
Determining fraud sources is essential with ML because fraud sources generate data – deceptive data. Yes, the data looks just fine, but when you study it in depth, it contains one or more of the five mistruths of data described in the Defining the human element section of Chapter 1, Defining Machine Learning Security. Using this data will cause hard-to-locate issues within your ML models and make it difficult to ensure a secure environment.
In the ML realm, fraud either occurs in the background or in real time. Background fraud is the type that a hacker or another individual hopes will go unnoticed while the perpetrator makes small changes a little at a time. This is the kind of fraud that eventually improves sales for one company, while starving another, because the data used for comparisons is just a little off or resources are being siphoned off. Real-time fraud (the kind that most people think about when the term fraud is mentioned) occurs quickly with the intent of obtaining a fast gain, such as identity theft. With these issues in mind, this chapter discusses these topics:
This chapter requires that you have access to either Google Colab or Jupyter Notebook to work with the example code. The Requirements to use this book section of Chapter 1, Defining Machine Learning Security, provides additional details on how to set up and configure your programming environment. When testing the code, use a test site, test data, and test APIs to avoid damaging production setups and to improve the reliability of the testing process. Testing over a non-production network is highly recommended, but not necessary. Using the downloadable source is always highly recommended. You can find the downloadable source on the Packt GitHub site at https://github.com/PacktPublishing/Machine-Learning-Security-Principles or my website at http://www.johnmuellerbooks.com/source-code/.
Fraud entails deception. The kind of deception depends on the fraud being perpetrated. For example, a compelling product message could result in identity theft, stolen credentials, or other resource gains for the perpetrator. Creating a condition in which one entity receives the blame for another entity’s actions is another form of fraud. When considering fraud, it pays to have a Machiavellian mindset because the deception can become quite complex. However, the majority of fraud is quite simple: someone is deceived into giving someone else a resource the other wants for no apparent return. It amounts to a kind of theft.
There are many types of fraud. Some are watched by professionals, while others aren’t, possibly because no one thinks to monitor them. With this in mind, here are a few common types of fraud that professionals do monitor:
Email is fraud’s friend
This chapter could become a book if it were to cover all of the creative ways in which fraud occurs through email. One of the most famous forms of fraud is Nigerian Fraud. There are whole websites devoted to just this one form of fraud (see https://consumer.georgia.gov/consumer-topics/nigerian-fraud-scams as an example). There are also debt elimination, advanced fee, cashier check, high-yield investment, and a great many other email-based fraud attacks. They all have one thing in common: they use email to perform their tasks to take money or other resources from people. As such, monitoring the email stream to your organization using a combination of off-the-shelf and customized ML applications can reduce the problem by reducing these emails. Of course, there still has to be a mechanism in place for dealing with false positives, but targeting your defenses is a good way to reduce false positives as well. Oddly enough, one of the suggested methods for dealing with this problem that seems effective is to mark all emails that come from outside the organization with the word [EXTERNAL] in the subject line.
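The [EXTERNAL] tagging idea above can be sketched in a few lines. This is a minimal illustration only; the domain name and the message dictionaries are hypothetical stand-ins for whatever your mail gateway actually provides:

```python
# Minimal sketch of tagging external email subjects. The domain
# "example.com" and the message format are hypothetical.
INTERNAL_DOMAIN = "example.com"

def tag_external(message):
    """Prepend [EXTERNAL] to the subject of mail from outside the org."""
    sender_domain = message["from"].rsplit("@", 1)[-1].lower()
    if sender_domain != INTERNAL_DOMAIN:
        message["subject"] = "[EXTERNAL] " + message["subject"]
    return message

inbox = [
    {"from": "ceo@example.com", "subject": "Quarterly review"},
    {"from": "prince@fraudmail.test", "subject": "Urgent business proposal"},
]
tagged = [tag_external(m) for m in inbox]
```

A real gateway would apply this rule before delivery, which is why the marker is hard for an external fraudster to remove.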
This list doesn’t include some types of common fraud that are unlikely to affect a business or its applications directly, such as adoption fraud, holiday scams, or money mules. However, because these kinds of fraud do affect your employees and make them less efficient because they’re thinking about something other than work, you also need to be aware of them and provide the required support. There are many other kinds of fraud that can take place and it’s important to remember that the people who perpetrate fraud can be quite creative. The simple rule to follow is that if something seems too simple, too good to be true, or simply too convenient, then it’s likely a fraud of some sort. A fraudster uses human weaknesses to their advantage.
Now that you have a better understanding of fraud types, it’s time to look at the source of fraud. Often, the source of fraud is the entity that is most trusted by the target. It’s easier to perpetrate fraud when there is a trust relationship. However, it’s important to understand that organizations have trust relationships with other organizations, organizations trust a particular application or service, and people trust other people. To understand the scope of fraud sources, it’s important to stretch the concept of trust to fit all sorts of scenarios.
A fraud source is an entity that generates deception. People are most often considered fraud sources, but organizations and even non-entities (things that someone would normally not consider an entity, such as a smart device) can become fraud sources as well. An application that is fed misinformation can become a fraud source by outputting an incorrect analysis of the data: the user is deceived into believing one thing is true when another is. The deception isn't intentional, but fraud occurs just the same. However, if the application contains a bug, then it's broken and needs to be fixed; it isn't a fraud source in that case. Organizational fraud usually involves a group of people working together to develop the deception. For example, an organization could try to attract investors for a product that doesn't exist yet and never will (in the computer industry, such non-existent hardware or software is called vaporware).
Distinguishing between types of fraud and fraud sources
Most articles online on this topic cover the different kinds of fraud. For example, identity theft is a type of fraud because it focuses on what an entity is doing. Unfortunately, without some knowledge of the potential sources for identity theft, the task of locating the culprit and ending the activity is made much harder. A fraudster is the primary perpetrator of identity theft, followed by hackers. You already have detailed knowledge of other organizations, company insiders, and customers (or fraudsters posing as customers; you don’t know until you ferret out their intentions), so the potential for identity theft from these sources is much less (or possibly non-existent) when compared to other sources. So, it’s important to know the type of fraud, but also important to know the potential fraud source so that you don’t spend a lot of time looking for a particular type of fraud from a source that’s unlikely to commit it.
A fraudster is someone who perpetrates fraud solely for some type of gain, usually immediate gain. Fraud is just business to them. They aren’t out for revenge, to gain access to your organization, or any of the other motivations that can and do affect others who commit fraud. A fraudster uses some technique to gain a person’s or organization’s trust and then robs them of everything possible. However, some fraudsters take years to set up their fraud or they engage in it continually. There is a form of fraud called long-term fraud where a fraudster places small orders with a business and promptly pays for the goods to win the business’s trust. The fraudster then places one or more large orders and absconds with the goods without paying. ML techniques can look for patterns in the fraudster’s buying habits. For example, would any one organization need the eclectic assortment of goods that the fraudster is buying? Anomalies (as discussed in Chapter 6, Detecting and Analyzing Anomalies) can help detect these sorts of frauds because the fraudster isn’t buying for a business, but rather to create a reputation.
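As a sketch of the anomaly idea, an isolation forest trained on normal buying patterns will flag an order profile that doesn't resemble any legitimate business. The feature choices here (order value and product-category diversity) and the synthetic data are assumptions for illustration, not a production feature set:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-customer order features: [avg_order_value, category_diversity].
# Legitimate businesses tend to buy within a narrow range of categories;
# the suspect row mimics a fraudster buying an eclectic mix of goods.
rng = np.random.default_rng(42)
normal_orders = np.column_stack([
    rng.normal(500, 50, 100),   # fairly consistent order values
    rng.normal(3, 0.5, 100),    # few distinct product categories
])
suspect = np.array([[5000, 25]])  # large order spanning many unrelated categories

model = IsolationForest(random_state=42).fit(normal_orders)
# predict() returns 1 for inliers and -1 for anomalies
flags = model.predict(np.vstack([normal_orders[:2], suspect]))
```

Chapter 6 covers anomaly detection in depth; this is only the shape of the approach.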
Of these schemes, the Ponzi scheme (https://www.investor.gov/protect-your-investments/fraud/types-fraud/ponzi-scheme) is the most common threat to businesses because it affects the people managing the businesses' assets directly. The use of virtual currencies has made Ponzi schemes incredibly easy to perpetrate, and it may take a long time for anyone to see that there is any sort of problem. ML can help avoid this particular type of fraud by automating a paperwork review. A Ponzi scheme often provides payments that are too regular and hides details in copious paperwork that is hard to analyze without a computer. Given that a human would take far too long to spot the necessary patterns, ML is the only practical way to augment human capability when dealing with many high-speed transactions in virtual currency.
Hackers have a different set of goals from fraudsters when it comes to fraud. As discussed previously in this book, a hacker is usually trying to gain entry to your organization to perform some task, such as installing ransomware or stealing credentials. The hacker's motivations can include personal gain, or it may simply be business on behalf of someone else who wants access to your organization. Unlike fraudsters, you won't have any actual contact with a hacker in most cases. The organization may receive an email, but that email won't be personalized for the most part unless the hacker is perpetrating some type of personal attack. The goal is to blanket an organization in the hopes that someone will take the bait and leave your organization wide open. While a fraudster usually makes targeted attacks based on social engineering, a hacker can use any of the techniques described in Chapter 5, Keeping Your Network Clean, through Chapter 7, Dealing with Malware, to commit fraud based on automation, hitting as many targets as possible in the shortest time possible. These differences between fraudsters and hackers are clues that you can use when trying to determine which kind of fraud source you're dealing with.
When dealing with fraud from other organizations, the prime motivators are espionage, sabotage, or stealing trade secrets (sometimes, there is a bit of revenge involved as well). The organization is using fraud to obtain something that your organization has that gives you some sort of advantage or that can be sold to someone else for a profit. Fraud between organizations tends to be more personalized than that provided by fraudsters because the members of each organization know each other. With this fact in mind, the kind of fraud is usually very personal based on what one party knows about the other.
Considering only electronic interactions (and setting aside personal interactions), it’s possible to use ML to look for patterns in data exchanges with the other organization. In this case, the main means of detection is fake data. The other organization will have enough information to avoid potential issues with anomalies or making randomized purchases. However, it’s still possible to use ML to detect fake data that the other organization generates to skew the target organization’s perception of it. This is one of the situations where using unsupervised learning techniques may be necessary, as described in Unsupervised Deep Learning for Fake Content Detection in Social Media at https://scholarspace.manoa.hawaii.edu/items/6d7560aa-2aff-4439-a884-35994e242c06. The same techniques work when reviewing other sorts of textual data exchanged between organizations.
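One much simpler screen for fabricated figures (a classic auditing technique, not the unsupervised deep learning approach from the paper above) is Benford's law: genuine financial amounts tend to follow a known first-digit distribution, while invented numbers usually don't. The data below is hypothetical and the function assumes positive integer values:

```python
from collections import Counter
import math

def first_digit_distribution(values):
    # Assumes positive integers; a real implementation would normalize input.
    digits = [int(str(abs(v))[0]) for v in values if v]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def benford_deviation(values):
    """Total absolute deviation from Benford's expected first-digit frequencies."""
    observed = first_digit_distribution(values)
    return sum(abs(observed[d] - math.log10(1 + 1 / d))
               for d in range(1, 10))

# Hypothetical data: organically growing figures versus a fabricated series
# in which every figure starts with the digit 4.
organic = [int(1.3 ** k) for k in range(1, 40)]
fabricated = list(range(400, 500))
```

A series with a large deviation isn't proof of fraud, but it's a cheap first pass before applying heavier ML techniques.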
Oddly enough, company insiders are often the most eager perpetrators of fraud. A company insider generally has good access to company assets and is familiar enough with company policies and procedures to avoid detection. However, the kinds of fraud that the company insiders perform are different because the focus of the fraud is different. For example, a company insider is unlikely to try to use someone else’s credit card to make a purchase and may not make any purchases at all (depending on what the company sells). Consequently, consider these kinds of fraud when looking at company insiders as a potential source:
As you can see, these are still kinds of fraud, but they’re different because they would be harder for someone outside the organization to perpetrate due to a lack of knowledge and/or access. Make sure that you also consider the role third parties can play in company insider fraud. In addition, employees can be blackmailed into committing fraud by a third party, which complicates the situation even further. Given modern ML methods, it’s entirely possible that the target of a fraud investigation isn’t even involved, but is being blamed for the fraud activity by someone else (so, look for things such as misuse of credentials).
Customers are the lifeblood of any business, so learning how to avoid fraud when dealing with customers is essential. A customer that wants to perpetrate fraud against your business will usually do so in the form of a scam. The idea is to get whatever is desired as quickly as possible and then disappear without a trace. With this in mind, here are typical scammer tactics you need to be aware of:
Customer or fraudster?
Some fraudsters start as customers and only perpetrate fraud when they see weakness on the part of the business, so there is a great deal of ambiguity here: you don't always know when someone goes from being a customer to being a fraudster. A well-prepared business will avoid certain kinds of fraud through a show of strength. For example, adding a sign saying "You're on candid camera!" immediately below a surveillance camera has been shown to reduce theft, and that camera could be attached to an ML application with facial recognition capabilities to make it easier to detect repeat offenders. With this ambiguity in mind, the following section uses the term customer because what you see is a customer until you find out otherwise.
There are potential ways to use ML to help reduce the potential for customer fraud in your organization. One of the most important methods is to create and maintain a database of customer contacts with detailed information. A customer who wants to perpetrate fraud will often try multiple employees, looking for one that is more open to creating a trust environment. You can also use this database to search for odd customer contact patterns that don’t match other customers that you deal with. ML applications can also validate customers in several ways:
The actual techniques that customers use to defraud your business vary, but they tend to fall into one of several areas. The following list provides ideas of what you should look for:
As you can see, customers use several highly successful approaches to defrauding businesses. This list doesn’t even include the social engineering, phishing, and ransomware attacks discussed in other chapters. Many of these scams can be tracked down with the assistance of an ML application. In addition, using expert systems to create smart scripts so that employees know what process to follow when working with customers can make a huge difference in ensuring that things go smoothly and don’t necessarily require a lot of management time to solve.
To create effective ML models for your organization, you need to train the model using real data or something that approximates real data. The best choice for a dataset is one that your organization gathers, but this is a time-consuming process and you may not generate enough data to produce a good dataset. With this in mind, you can use any of several fraud datasets available online. One of the best places to get this sort of dataset if you need several different types is the FDB: Fraud Dataset Benchmark at https://github.com/amazon-research/fraud-dataset-benchmark. This one download provides you with access to nine different fraud datasets as listed on the website. You can read the goals of creating this dataset at https://www.linkedin.com/posts/groverpr_fdb-fraud-dataset-benchmark-activity-6970921322067427328-fNCo.
It’s tough, perhaps impossible, to create a single grouping of datasets that answers all fraud detection needs. The Building a fraud detection example section of this chapter uses an entirely different dataset to demonstrate actual credit card fraud based on sanitized data from a real-world dataset. Any public dataset you get will likely be sanitized already (and you need to exercise care in downloading or using any dataset that isn’t already sanitized). Here are some other dataset sites you should visit:
Note that older datasets are often removed without much comment from the provider. Two examples that are often cited by researchers and data scientists are the Kaggle dataset at https://www.kaggle.com/dalpozz/creditcardfraud and the one at http://weka.8497.n7.nabble.com/file/n23121/credit_fruad.arff. Both of these datasets are gone and it isn’t worth pursuing them because there are others. The reason this is important is that the previous list may become outdated at some point, but because fraud is such a common and consistent problem, there will be others to take their place.
Now that you’ve considered all of the kinds of fraud sources and perhaps reviewed a few fraud datasets for use in your research, it’s time to look at the slow type of fraud that occurs in the background. A background fraud scenario is the most dangerous kind of fraud because it happens slowly over an extended period. In some cases, this sort of fraud is never discovered, except accidentally as part of an audit or other research that has nothing to do with the fraud in question. Of course, this kind of fraud also requires patience on the part of the deceiver, so it can also be the hardest kind of fraud to perpetrate successfully.
Niccolò Machiavelli, from whose name the term Machiavellian is derived, observed human nature with great patience and, in many cases, in great depth. Contrary to the popular perception of him as a scoundrel of the worst sort, he was a philosopher who described human nature in politics as it actually was in his time. Background fraud is often Machiavellian in nature: it takes an "ends justify the means" view of the world and can be utterly immoral in its approach to obtaining some goal through the use of deception. In other words, yelling that something isn't fair is unlikely to garner any sort of response in this situation. The following sections define the kinds of background fraud and how to detect them.
There are many different definitions for the difference between background, or long-term, fraud and real-time, or short-term, fraud. When viewing long-term fraud strictly from a financial perspective, many experts view it as a scenario where a fraudster makes lots of small purchases and pays for them to build trust and then makes one or two large purchases, but doesn’t pay for them. By the time anyone thinks to look for the fraudster, the products have been sold for cash, and the fraudster is gone.
Because the term long-term fraud has a specific meaning in many financial circles, this book uses the term background fraud to indicate fraud that happens over weeks or months in a manner that the fraudster hopes no one will notice until it's too late to do anything about it. The key difference between fraud and an attack is that fraud relies on deception – it appears to be one thing when it's another. For example, the Using supervised learning example section of Chapter 5, Keeping Your Network Clean, represents an attack because no deception is employed during the probing process. To turn that example into fraud, the hacker would need to make exclusively legitimate calls from what appears to be a legitimate caller during the learning process, and only then commit an act that derives some tangible benefit – for example, using API calls to manipulate the database, perhaps to obtain company secrets to sell to a competitor. Consequently, the detection technique shown in Chapter 5, Keeping Your Network Clean, would have only a slim chance of working because the model wouldn't necessarily have time to recognize a change in pattern.
Detecting background fraud often involves more detailed detective work that relies on some sort of targeted knowledge. Looking again at the example from Chapter 5, Keeping Your Network Clean, a form of fraud might be detected by reviewing the IP addresses making calls to the API and comparing them to a list of IP addresses that could make legitimate requests. As an alternative, it might be possible to trace the IP addresses to a specific caller. Reviewing the history and background of the caller might reveal some anomalies that would point to fraud. Yes, automation plays a very big role in helping a human track the caller down and check into the caller’s information, but eventually, a human has to decide on the viability of the caller because that’s something automation can’t do.
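The IP comparison step can be automated with a small allowlist check. The networks and log entries here are hypothetical; in practice, the addresses would come from your API server's logs and the allowlist from your records of legitimate callers:

```python
import ipaddress

# Hypothetical allowlist of networks that may legitimately call the API.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),        # internal callers
    ipaddress.ip_network("203.0.113.0/24"),    # a partner's documented range
]

def is_allowed(ip_string):
    """Return True when the caller's address falls inside an allowed network."""
    addr = ipaddress.ip_address(ip_string)
    return any(addr in net for net in ALLOWED_NETWORKS)

# In practice, these entries would come from the web server's access log.
callers = ["10.1.2.3", "198.51.100.7"]
suspicious = [ip for ip in callers if not is_allowed(ip)]
```

The automation stops here: a human still has to investigate the callers this filter surfaces.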
There are some types of fraud that can be tracked using pattern recognition models such as the one described in Chapter 5, Keeping Your Network Clean. For example, if a fraudster wanted to slowly manipulate the price of a product to obtain financial gain, the act of manipulating the price would create a pattern of some sort that an ML application could detect, given a model that is trained using enough examples (there are a lot of caveats here, so make sure you pursue this approach with care).
It’s also important to think small in some cases when it comes to fraud. For example, loyalty rewards may seem like a very small sort of fraud to commit, but according to Kount (https://kount.com/blog/how-to-prevent-loyalty-fraud/), the value of just the unredeemed loyalty rewards in a given year may amount to $160 billion. In many cases, fraudsters perform an Account Takeover (ATO) attack using stolen credentials to gain access to loyalty rewards. The reason this is such a perfect fraud is that many people never check their loyalty rewards information; they just assume that the business will automatically apply a loyalty reward at the cash register when one is available. Consequently, the fraudster has a significantly reduced chance of being caught when perpetrating this form of fraud. An ML application could help in this case by flagging customer accounts that usually see no access but are currently experiencing a surge in access. In addition, the ML application could monitor the customer’s loyalty rewards for large withdrawals of benefits in a short time.
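The access-surge idea can be sketched with a simple baseline test: flag any account whose latest activity far exceeds its own history. The account IDs, weekly login counts, and threshold below are hypothetical, and a real system would learn the threshold from labeled data rather than hard-coding it:

```python
from statistics import mean, pstdev

# Hypothetical weekly login counts per loyalty account; the final week of
# the second account shows the kind of sudden surge an ATO attack produces.
accounts = {
    "A100": [0, 1, 0, 0, 1],
    "A101": [0, 0, 0, 0, 14],
}

def surge_flagged(history, threshold=3.0):
    """Flag an account whose latest activity far exceeds its own baseline."""
    baseline, latest = history[:-1], history[-1]
    spread = pstdev(baseline) or 1.0  # avoid dividing by zero on flat history
    return (latest - mean(baseline)) / spread > threshold

flagged = [acct for acct, hist in accounts.items() if surge_flagged(hist)]
```

The same per-account baseline approach works for monitoring large, rapid withdrawals of rewards.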
Now that you have a better idea of how background fraud works, it’s time to look at something with a faster turnaround: real-time fraud. Real-time fraud is marked by impatience: the quick use of deception to make a fast gain, even if that gain isn’t substantial. The idea is that even a small gain is worthwhile if there are enough small gains.
Real-time fraud is marked by a certain level of impatience and often requires quick thinking to pull off. It usually implies a kind of interaction performed by an entity that can perpetrate the fraud and then become inaccessible (usually by changing venue). In addition, real-time fraud normally relies on social engineering, a lack of knowledge, or some type of artifice. The following sections provide insights into real-time fraud and its detection.
Real-time fraud, a term this book uses to indicate a kind of fraud that occurs within hours or possibly days (or sometimes even seconds), targets quick gains with little effort on the part of the fraudster. Here are a few real-time fraud types to consider:
There are many other kinds of real-time fraud. Fraudsters are extremely skilled at coming up with the next confidence trick (con) to play on unwitting victims. So, the people that are protecting others from the fraudster have to be equally skilled and quite fast. The next section provides some ideas on the tools that someone can use to aid in detection.
The act of detecting real-time fraud as it occurs is really hard because everything happens so quickly. Humans have developed intuitive approaches to knowing when fraud is occurring based on clues that ML has yet to pick up on (and may never pick up on). For example, there is some evidence to suggest that odors can indicate that a person is lying. According to the article, Artificial networks learn to smell like the brain (https://news.mit.edu/2021/artificial-networks-learn-smell-like-the-brain-1018), this approach is still a work in progress. Gamblers often rely on a person’s tell (behavior change) to know what appears in the person’s hand. What this amounts to is a kind of facial recognition over time. Chapter 10 will tell you that ML is getting closer to good facial recognition, but this too is a work in progress. So, what can ML do for you today to help mitigate real-time fraud?
The set of features commonly used for real-time fraud detection is reduced compared to those used for background fraud detection, partly because less data is available and partly because time is of the essence. Here are the four features most commonly used to detect real-time fraud:
The effect of using multiple devices
The use of different devices by a single individual today commonly triggers an email to the authorized person, an indicator that tracking the device does work.
As you can see from this list, it’s not a perfect setup for detecting absolutely every kind of fraud because there is a potential for fraudsters to slip through the cracks. A highly motivated fraudster could verify that the identity used is perfect, make purchases only within the range that the target would make, spoof the location of the purchase, and ensure that they followed the person’s habits within a reasonable range. Of course, that’s a lot of ifs to consider, but it could happen. The point is that observing these four characteristics as part of an ML application will greatly reduce the potential for fraud.
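A rule-of-thumb sketch can show how these four characteristics combine into a simple risk score. The field names, profile structure, and thresholds below are hypothetical illustrations, not the feature set of any particular fraud product, and a real system would weight the features with a trained model rather than counting them equally:

```python
def fraud_score(txn, profile):
    """Count how many of the four characteristics a transaction violates."""
    score = 0
    if txn["device_id"] not in profile["known_devices"]:   # unfamiliar device
        score += 1
    if not (profile["min_amount"] <= txn["amount"] <= profile["max_amount"]):
        score += 1                                          # unusual amount
    if txn["country"] != profile["home_country"]:           # unusual location
        score += 1
    if txn["hour"] not in profile["usual_hours"]:           # unusual habit
        score += 1
    return score  # 0 = matches the customer's pattern, 4 = matches nothing

profile = {
    "known_devices": {"phone-1", "laptop-1"},
    "min_amount": 5.0, "max_amount": 250.0,
    "home_country": "US",
    "usual_hours": range(8, 23),
}
normal_txn = {"device_id": "phone-1", "amount": 40.0,
              "country": "US", "hour": 12}
odd_txn = {"device_id": "burner-9", "amount": 1999.0,
           "country": "RO", "hour": 3}
```

A transaction scoring high on several characteristics at once is what triggers the review, not any single mismatch.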
Now that you have some basis for understanding the nature of fraud detection, the next section looks at a specific fraud detection example. In this case, you will see how to detect fraud in credit card purchases.
This section will show you how to build a simple fraud detection example using real, sanitized credit card data available on Kaggle. The transactions occurred in September 2013 and there are 492 frauds out of 284,807 transactions, which makes the dataset highly unbalanced – the number of frauds is quite low for training a model. The data has been transformed by Principal Component Analysis (PCA) using the techniques demonstrated in the Relying on Principal Component Analysis section of Chapter 6, Detecting and Analyzing Anomalies. Only the Amount column retains its original value. The Class column has been added to label the data. You can also find the source code for this example in the MLSec; 08; Perform Fraud Detection.ipynb file of the downloadable source.
The dataset used in this example appears at https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download. The data is in a 69 MB .zip file. Download the file manually and unzip it into the source code directory. Note that you must have a Kaggle account to download this dataset. Getting an account is easy and free; check out https://www.kaggle.com/subscribe.
This example begins with importing the data, which requires a little work in this case. The following steps show you how to do so:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
cardData=pd.read_csv("creditcard.csv")
total_transactions = len(cardData)
normal = len(cardData[cardData.Class == 0])
fraudulent = len(cardData[cardData.Class == 1])
fraud_percentage = fraudulent/normal
print(f'Total Number Transactions: {total_transactions}')
print(f'Normal Transactions: {normal}')
print(f'Fraudulent Transactions: {fraudulent}')
print(f'Fraudulent Transactions Percent: '
      f'{fraud_percentage:.2%}')
The number of actual fraudulent transactions is smaller, accounting for only 0.17% of the transactions, as shown in Figure 8.1:
Figure 8.1 – Output showing the transaction statistics
cardData.info()
Figure 8.2 shows that there are no null values in the dataset. If there had been null values, you would need to clean the data by replacing them with a specific value, such as the mean of the other entries in the column. The same applies to missingness – data that is simply absent from the dataset – which can itself be an indicator of fraud. You need to replace the missing value with some useful alternative:
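To illustrate the cleanup that would be needed if nulls were present, here is a small sketch using a hypothetical frame (creditcard.csv itself has no missing values):

```python
import pandas as pd

# Hypothetical frame with a missing amount; replace the missing value
# with the column mean, the strategy described above.
df = pd.DataFrame({"Amount": [10.0, None, 30.0], "Class": [0, 0, 1]})
df["Amount"] = df["Amount"].fillna(df["Amount"].mean())
```

Other strategies (median, a sentinel value, or dropping the row) may fit better depending on why the value is missing.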
Figure 8.2 – Dataset output showing a lack of missing or null values
print(f'Minimum Value: {min(cardData.Amount)}')
print(f'Mean Value: '
      f'{sum(cardData.Amount)/total_transactions}')
print(f'Maximum Value: {max(cardData.Amount)}')
print(cardData['Amount'])
If the Amount column has too great a range, as shown in Figure 8.3, then the model will become skewed. In this case, the standard practice is to scale the data to obtain better results:
Figure 8.3 – Output showing variance in the Amount column
cardData.drop(['Time'], axis=1, inplace=True)
print(cardData.shape)
scaler = MinMaxScaler(feature_range=(0,1))
col = cardData.columns
cardData = pd.DataFrame(
    scaler.fit_transform(cardData),
    columns = col)
print(cardData)
Figure 8.4 shows the partial results of the scaling operation:
Figure 8.4 – Presentation of data values after scaling
cardData.drop_duplicates(inplace=True)
print(cardData.shape)
print(f'Minimum Value: {min(cardData.Amount)}')
print(f'Mean Value: '
      f'{sum(cardData.Amount)/len(cardData)}')  # recount after deduplication
print(f'Maximum Value: {max(cardData.Amount)}')
print(cardData['Amount'])
Now that the dataset has been massaged, it’s time to see the result, as shown in Figure 8.5:
Figure 8.5 – Output of the Amount column after massaging
A lot of ML comes down to ensuring that the data you use is prepared properly to create a good model. Now that the data has been prepared, you can split it into training and testing sets. This process ensures that you have enough data to train the model, and then test it using data the model hasn’t seen before so that you can ascertain the goodness of the model (something you will see later in this process).
Splitting the data into training and testing sets makes it possible to train the model on one set of data and test it using another set of data that the model hasn’t seen. This approach ensures that you can validate the model concerning its goodness in locating credit card fraud (or anything else for that matter). The following steps show how to split the data in this case:
from sklearn.model_selection import train_test_split
X = cardData.drop('Class', axis=1).values
y = cardData['Class'].values
print(X)
print(y)
When you print the result, you will see that X is indeed a matrix and y is indeed a vector, as shown in Figure 8.6:
Figure 8.6 – Contents of the training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
print(f"X training data size: {X_train.shape}")
print(f"X testing data size: {X_test.shape}")
print(f"y training data size: {y_train.shape}")
print(f"y testing data size: {y_test.shape}")
Figure 8.7 shows that the data is split according to the 80:20 ratio and that the training and testing variables have the same number of entries:
Figure 8.7 – The output shows that the data is split according to an 80:20 ratio
Now that the data is in the correct form, you can finally build a model. Of course, that means selecting a model and configuring it. For this example, you will use DecisionTreeClassifier. However, there is a wealth of other models that could give you an edge when working with various kinds of data.
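For instance, here is a minimal sketch of how you might try an alternative model for comparison. It uses LogisticRegression and a synthetic dataset from make_classification purely as stand-ins; the chapter's real dataset and whichever models you want to compare would slot in instead:

```python
# Hedged sketch: compare two candidate classifiers on the same split.
# make_classification stands in for the credit card dataset here.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Imbalanced synthetic data, loosely mimicking rare fraud cases
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

scores = {}
for name, model in [('DecisionTree', DecisionTreeClassifier(max_depth=5)),
                    ('LogisticRegression', LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)                       # same training data for both
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)
```

Because both models see identical training and testing splits, any difference in the scores reflects the models themselves rather than the data.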
Considering the importance of testing model goodness
The whole issue of data splitting, selecting the correct model, configuring the model correctly, and so on, comes down to getting a model with very high accuracy. This is especially important when doing things such as looking for malware or detecting fraud. Unfortunately, no method or rule of thumb provides a high level of accuracy in every case. The only real way to tweak your model is to change one item at a time, rebuild the model, and then test it.
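The one-item-at-a-time tuning loop described above can be sketched as follows. This example varies only max_depth while holding everything else constant; the synthetic data and the particular depth values are illustrative stand-ins, not the chapter's exact setup:

```python
# Hedged sketch: change one hyperparameter at a time, rebuild, and test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

results = {}
for depth in (3, 5, 7, 9):                       # the only item that changes
    dtc = DecisionTreeClassifier(max_depth=depth, random_state=1)
    dtc.fit(X_train, y_train)                    # rebuild the model
    results[depth] = accuracy_score(y_test, dtc.predict(X_test))  # test it

best_depth = max(results, key=results.get)
print(results, best_depth)
```

Keeping every other setting fixed is what makes the comparison meaningful: a change in accuracy can only come from the one item you varied.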
As you’ve seen in other chapters, building the model involves fitting it to the data. The following steps show how to build the model using a minimal number of configuration changes:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth = 5)
dtc.fit(X_train, y_train)
You’ll know that the process is complete when you see the output shown in Figure 8.8:
Figure 8.8 – Description of the DecisionTreeClassifier model
You now have a model to use to detect credit card fraud. Of course, you have no idea of how good that model is at its job. Perhaps it doesn't generalize well and is overfitted to the data. Then again, it might be underfitted. The next section shows how to verify the goodness of the model in this case.
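One quick, rough check for overfitting versus underfitting is to compare training accuracy against testing accuracy; a model that scores far better on data it has seen than on data it hasn't is likely overfitted. The data and depth value here are illustrative stand-ins, not the chapter's exact setup:

```python
# Hedged sketch: compare train vs. test accuracy as an overfitting signal.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

dtc = DecisionTreeClassifier(max_depth=5, random_state=1)
dtc.fit(X_train, y_train)

train_acc = accuracy_score(y_train, dtc.predict(X_train))  # seen data
test_acc = accuracy_score(y_test, dtc.predict(X_test))     # unseen data
gap = train_acc - test_acc   # a large positive gap suggests overfitting
print(train_acc, test_acc, gap)
```

Low scores on both sets, by contrast, would point toward underfitting: the model is too simple to capture the data at all.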
Having a model to use means that you can start detecting fraud. Of course, you don’t know how well you can detect fraud until you test it using the following steps:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix
# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# in newer versions, use ConfusionMatrixDisplay.from_estimator instead.
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
dtc_yHat = dtc.predict(X_test)
print(f"Accuracy score: "
      f"{accuracy_score(y_test, dtc_yHat)}")
Figure 8.9 shows that the accuracy is very high:
Figure 8.9 – Output of the DecisionTreeClassifier model accuracy
precision, recall, fbeta_score, support = \
    precision_recall_fscore_support(y_test, dtc_yHat)
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-beta score: {fbeta_score}")
print(f"Support: {support}")
Figure 8.10 – Output of the precision, recall, F-beta, and support statistics
print(confusion_matrix(y_test, dtc_yHat, labels=[0, 1]))
Figure 8.11 shows the output. The number of true negatives (legitimate transactions correctly classified) appears in the upper-left corner, which has a value of 55025. Only 29 of the records generated a false positive. There were 73 true positives (correctly detected fraud cases) and six false negatives in the dataset. So, the chances of finding credit card fraud are excellent, but not perfect:
Figure 8.11 – The confusion matrix output for the DecisionTreeClassifier model prediction
matrix = plot_confusion_matrix(dtc, X=X_test,
    y_true=y_test, cmap=plt.cm.Blues)
plt.title('Confusion Matrix for Fraud Detection')
plt.show()
Figure 8.12 shows the output in this case:
Figure 8.12 – A view of the graphic version of the confusion matrix
This section has shown you one complete model building and testing cycle. However, you don’t know that this is the best model to use. Testing other models is important, as described in the next section.
The decision tree classifier does an adequate job of separating fraudulent credit purchases from those that aren’t, but it could do better. A random forest classifier is a group of decision tree classifiers. In other words, you put multiple algorithms to work on the same problem. When the classification process is complete, the trees vote and the classification with the most votes wins.
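The voting idea can be illustrated by polling the individual trees of a fitted forest directly. Note that this is a simplification for illustration: scikit-learn's RandomForestClassifier actually averages the per-class probabilities of its trees rather than counting hard votes, and the data here is synthetic rather than the chapter's dataset:

```python
# Hedged sketch: poll each tree in a fitted forest and take the majority.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=1)
rfc = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=1)
rfc.fit(X, y)

sample = X[:1]                                  # one sample to classify
# estimators_ holds the individual fitted decision trees
votes = [int(tree.predict(sample)[0]) for tree in rfc.estimators_]
majority = max(set(votes), key=votes.count)     # the class with the most votes
print(votes.count(majority), 'of', len(votes), 'trees vote for class', majority)
```

Because each tree is trained on a different bootstrap sample of the data, their individual errors tend to cancel out when the results are combined, which is why the forest usually outperforms any single tree.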
In the previous example, you used the max_depth argument to determine how far the tree should go to reach a classification. Now that you have a whole forest, rather than an individual tree, at your disposal, you also need to define the n_estimators argument to define how many trees to use. There are a lot of other arguments that you can use to tune your model in this case, as described in the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. The following steps help you create a random forest classifier model so that you can compare it to the decision tree classifier used earlier:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=9,
    n_estimators=100, n_jobs=-1)
rfc.fit(X_train, y_train)
rfc_yHat = rfc.predict(X_test)
print(f"Accuracy score: "
      f"{accuracy_score(y_test, rfc_yHat)}")
Figure 8.13 shows that even though it took longer to build this model (using all of the processors no less), it performs only slightly better than the decision tree classifier. In this case, the contributing factors are the small dataset and the fact that the number of fraud cases is small. However, even a small difference is better than no difference at all when it comes to fraud and you need to consider that a real-world scenario will be dealing with far more entries in the dataset:
Figure 8.13 – The accuracy of the random forest classifier
precision, recall, fbeta_score, support = \
    precision_recall_fscore_support(y_test, rfc_yHat)
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-beta score: {fbeta_score}")
print(f"Support: {support}")
As with accuracy, the differences (as shown in Figure 8.14) between the two models are very small, but important when dealing with fraud. After all, you don’t want to claim a customer has committed fraud unless it’s true:
Figure 8.14 – The precision, recall, F-beta, and support scores for the random forest classifier
matrix = plot_confusion_matrix(rfc, X=X_test,
    y_true=y_test, cmap=plt.cm.Blues)
plt.title('Confusion Matrix for RFC Fraud Detection')
plt.show()
Figure 8.15 shows what it all comes down to in the end. The random forest classifier has the same prediction rate for transactions that aren’t fraudulent as the decision tree classifier in this case. However, it also finds one more case of fraud, which is important:
Figure 8.15 – The confusion matrix for the random forest classifier
This example has looked at credit card fraud, but the same techniques work on other sorts of fraud as well. The main things to consider when looking for fraud are to obtain a large enough amount of data, train the model using your best assumptions about the data, and then test the model for accuracy. Tweak the model as needed to obtain the required level of accuracy. Of course, no model is going to be completely accurate. While ML will greatly reduce the burden on human detectives looking for fraud, it can’t eliminate the need for a human to look at the data entirely.
A Receiver Operating Characteristic (ROC) curve and an Area Under the Curve (AUC) calculation help you determine where to set thresholds in your ML model. These methods look at the performance of a model at all classification thresholds. The X axis shows the false positive rate, while the Y axis shows the true positive rate. As the true positive rate increases, so does the false positive rate. The goal is to determine where to place the threshold for deciding whether a particular sample is fraudulent based on its score during analysis. A score indicates the model's confidence that a particular sample is fraud, but the model doesn't determine where to place the line between fraud and legitimate transactions; that line is the threshold. Therefore, a ROC curve helps a human user of a model determine where to set the threshold so that the model detects fraud as effectively as possible.
The True Positive Rate (TPR) is the ratio of true positives (TP) (as shown in Figure 8.11, Figure 8.12, and Figure 8.15) to the sum of true positives and false negatives (FN): TPR = TP / (TP + FN). The False Positive Rate (FPR) is the ratio of false positives (FP) to the sum of false positives and true negatives (TN): FPR = FP / (FP + TN). Essentially, what you're trying to determine is how many false positives are acceptable in exchange for a given number of true positives when working with a model.
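These formulas can be verified in a few lines of code against a confusion matrix. The counts below are illustrative, not the chapter's exact figures:

```python
# Computing TPR and FPR from a confusion matrix using the formulas above.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # six legitimate, four fraud
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]   # one FP, one FN

# With labels=[0, 1], ravel() yields the counts in TN, FP, FN, TP order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
tpr = tp / (tp + fn)   # true positive rate (also called recall)
fpr = fp / (fp + tn)   # false positive rate
print(tpr, fpr)        # 0.75 and roughly 0.167 for these counts
```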
As part of plotting a ROC curve, you also calculate the AUC, which is essentially another good measure of the model. It's a measure of overall performance across all classification thresholds; the higher the number, the better the model. The AUC is a probability measure that tells you how likely it is that the model will rank a random positive example higher than a random negative example. Consequently, an AUC of 0 means the model ranks every negative example above every positive one and never produces a correct detection at any threshold, which is how the MultinomialNB classifier used in an earlier example performed. With all of these things in mind, use the following steps to create two ROC curves comparing DecisionTreeClassifier to the RandomForestClassifier classifier used earlier:
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from numpy import argmax
from numpy import sqrt
dtc_y_scores = dtc.predict_proba(X_test)
dtc_fpr, dtc_tpr, dtc_thresholds = roc_curve(
    y_test, dtc_y_scores[:, 1])
dtc_roc_auc = auc(dtc_fpr, dtc_tpr)
rfc_y_scores = rfc.predict_proba(X_test)
rfc_fpr, rfc_tpr, rfc_thresholds = roc_curve(
    y_test, rfc_y_scores[:, 1])
rfc_roc_auc = auc(rfc_fpr, rfc_tpr)
dtc_gmeans = sqrt(dtc_tpr * (1-dtc_fpr))
dtc_ix = argmax(dtc_gmeans)
print('Best DTC Threshold=%f, G-Mean=%.3f' %
      (dtc_thresholds[dtc_ix], dtc_gmeans[dtc_ix]))
rfc_gmeans = sqrt(rfc_tpr * (1-rfc_fpr))
rfc_ix = argmax(rfc_gmeans)
print('Best RFC Threshold=%f, G-Mean=%.3f' %
      (rfc_thresholds[rfc_ix], rfc_gmeans[rfc_ix]))
Figure 8.16 shows the result of the calculation for each model:
Figure 8.16 – Calculated best threshold and G-mean values for each model
plt.plot(dtc_fpr, dtc_tpr, 'g',
         label = 'DTC AUC = %0.2f' % dtc_roc_auc)
plt.plot(rfc_fpr, rfc_tpr, 'b',
         label = 'RFC AUC = %0.2f' % rfc_roc_auc)
plt.plot([0, 1], [0, 1], 'r--', label = 'No Skill')
plt.scatter(dtc_fpr[dtc_ix], dtc_tpr[dtc_ix],
            marker='o', color='g', label='DTC Best')
plt.scatter(rfc_fpr[rfc_ix], rfc_tpr[rfc_ix],
            marker='o', color='b', label='RFC Best')
plt.legend(loc = 'lower right')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve Comparison DTC to RFC')
plt.show()
The output of this example appears in Figure 8.17. Notice that the RFC model outperforms the DTC in this particular case by a small margin. In addition, the plot shows where you’d place the threshold for each model. Given the data and other characteristics of this example, once the model has achieved a maximum value, there is little advantage in increasing the threshold further:
Figure 8.17 – The ROC curve and AUC calculation for each model
Of course, you won’t likely always see this result. The main takeaway from this example is that you need to compare models and settings to determine how best to configure your ML application to detect as much fraud as possible without creating an overabundance of false positives.
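Once you have chosen a threshold from a ROC analysis, applying it means cutting the predict_proba scores yourself rather than relying on the 0.5 default that predict uses internally. The synthetic data and the 0.3 threshold below are hypothetical stand-ins for the dataset and the G-mean optimum found in your own run:

```python
# Hedged sketch: classify with a custom threshold instead of predict().
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

rfc = RandomForestClassifier(n_estimators=50, random_state=1)
rfc.fit(X_train, y_train)
scores = rfc.predict_proba(X_test)[:, 1]   # confidence that a sample is fraud

threshold = 0.3                            # e.g., a G-mean optimum from the ROC curve
y_hat = (scores >= threshold).astype(int)  # flag anything at or above the threshold
print(y_hat.sum(), 'samples flagged of', len(y_hat))
```

Lowering the threshold below 0.5 catches more potential fraud (higher TPR) at the cost of more false positives, which is exactly the trade-off the ROC curve visualizes.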
This chapter introduced you to the topic of fraud as it applies to ML. The key takeaway from this chapter is that fraud involves deception for some type of gain. Often, this deception is completely hidden and subtle; sometimes, the gain is even hard to decipher unless you know how the gain is used. Fraud affects ML security by introducing flawed data into the dataset, which produces unreliable or unpredictable results that are skewed to the perpetrator’s goals. In addition, because the data is unreliable, it also presents a security risk.
When reviewing the security needs of an organization, it’s important to consider both background and real-time fraud. Depending on your organization, one form of fraud or the other may take precedence. For example, a marketing company with no direct consumer interaction would need to consider background fraud more strongly. Likewise, an online seller would need to consider real-time fraud more strongly. Tailoring the type of fraud detection used is incredibly important to make detection both precise and efficient.
Chapter 9, Defending Against Hackers, moves on to defending against direct hacker attacks. Previous chapters have considered what might be termed hacker agents, such as network interference, the introduction of anomalies, the reliance on malware, and now the use of fraud in this chapter. In Chapter 9, the hacker will become actively engaged in what could be termed either sabotage or espionage. It’s the sort of attack seen in movies, but not in the way that movies depict them. You may be amazed at just how subtle a hacker’s machinations against your organization can become.
The following bullets provide you with some additional reading that you may find useful in understanding the materials in this chapter: