We use a dataset from the 2017 Black Hat conference. We will be doing some basic statistical testing to better understand the data:
import pandas as pd

data = pd.read_csv("https://s3-us-west-1.amazonaws.com/blackhat-us-2017/creditcard.csv")
data.head()
The preceding code loads the data, which has 31 columns in total.
We check the distribution of the target classes with a histogram, where the x axis depicts the class and the y axis depicts the frequency, as shown in the following code:
import matplotlib.pyplot as plt

count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
Here is the output for the preceding code:
This histogram clearly shows that the data is highly imbalanced: fraudulent transactions make up only a tiny fraction of the total.
This imbalance is why a plain accuracy score is a poor way to evaluate our classification algorithm. For example, if we simply assigned the majority class to every record, we would still achieve high accuracy, but we would misclassify every fraudulent transaction.
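To make the point concrete, here is a minimal sketch of a majority-class predictor on a 99:1 imbalance (the counts are illustrative, not taken from the credit-card data):

```python
# Illustrative 99:1 imbalance -- counts are made up, not from the dataset.
y_true = [0] * 990 + [1] * 10   # 990 legitimate records, 10 fraudulent
y_pred = [0] * 1000             # "classifier" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / sum(y_true)

print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- yet every fraud case is missed
```

The 99% accuracy is meaningless here; the recall of zero is what actually describes the classifier's behavior on fraud.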
There are several ways to approach this classification problem while taking the imbalance into account. Collecting more data would be a nice strategy, but it is not applicable in this case. Instead, we can:
- We can approach the problem by changing the performance metric:
- Use the confusion matrix to calculate precision and recall
- Use the F1 score (the harmonic mean of precision and recall)
- Use Cohen's kappa, a classification accuracy normalized by the class imbalance in the data
- Use ROC curves, which plot the sensitivity/specificity trade-off
- We can also resample the dataset:
- Essentially, this processes the data to achieve an approximately 50:50 class ratio.
- One way to achieve this is oversampling, which adds copies of the under-represented class (better when you have little data).
- Another is under-sampling, which deletes instances from the over-represented class (better when you have lots of data).
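Both resampling strategies can be sketched in a few lines of pandas. Here `data` is a small synthetic stand-in with a binary `Class` column; the real frame loaded earlier would work the same way:

```python
import pandas as pd

# Synthetic stand-in for the imbalanced frame: 990 normal rows, 10 fraud rows.
data = pd.DataFrame({'Class': [0] * 990 + [1] * 10})

fraud = data[data['Class'] == 1]
normal = data[data['Class'] == 0]

# Under-sampling: keep only as many majority rows as there are minority rows.
under = pd.concat([fraud, normal.sample(n=len(fraud), random_state=42)])

# Oversampling: duplicate minority rows (with replacement) to match the majority count.
over = pd.concat([normal, fraud.sample(n=len(normal), replace=True, random_state=42)])

print(under['Class'].value_counts())  # 10 rows of each class
print(over['Class'].value_counts())   # 990 rows of each class
```

In practice, a library such as imbalanced-learn (with its `RandomUnderSampler` and `RandomOverSampler` classes) is usually preferable to hand-rolling this, since it integrates with scikit-learn pipelines.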