We use a dataset from the 2017 Black Hat conference. We will be doing some basic statistical testing to better understand the data:
import pandas as pd

data = pd.read_csv("https://s3-us-west-1.amazonaws.com/blackhat-us-2017/creditcard.csv")
data.head()
The preceding code loads the data, which has 31 columns in total.
We check the distribution of the target classes with a histogram, where the x axis depicts the class and the y axis depicts the frequency, as shown in the following code:
import matplotlib.pyplot as plt

count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
Here is the output for the preceding code:
This histogram clearly shows that the data is highly imbalanced: fraudulent transactions make up only a tiny fraction of the total.
This imbalance is why a plain accuracy score is a poor way to evaluate our classification algorithm. For example, if we simply assigned the majority class to every record, we would still achieve high accuracy, but we would misclassify every fraudulent transaction.
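To make the point concrete, here is a minimal sketch of a majority-class predictor on a 99:1 imbalance (the counts are illustrative, not taken from the credit-card data):

```python
# Illustrative 99:1 imbalance -- counts are made up, not from the dataset.
y_true = [0] * 990 + [1] * 10   # 990 legitimate records, 10 fraudulent
y_pred = [0] * 1000             # "classifier" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / sum(y_true)

print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- yet every fraud case is missed
```

The 99% accuracy is meaningless here; the recall of zero is what actually describes the classifier's behavior on fraud.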
There are several ways to approach this classification problem while taking the imbalance into account. Collecting more data would be a nice strategy, but it is not applicable in this case. Instead, we can:
- We can approach the problem by changing the performance metric:
- Use the confusion matrix to calculate precision and recall
- Use the F1 score (the harmonic mean of precision and recall)
- Use Cohen's kappa, a classification accuracy normalized by the class imbalance in the data
- Use ROC curves, which plot the sensitivity/specificity trade-off
- We can also resample the dataset:
- Essentially, this processes the data to achieve an approximately 50:50 class ratio.
- One way to achieve this is oversampling, which adds copies of the under-represented class (better when you have little data).
- Another is under-sampling, which deletes instances from the over-represented class (better when you have lots of data).
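Both resampling strategies can be sketched in a few lines of pandas. Here `data` is a small synthetic stand-in with a binary `Class` column; the real frame loaded earlier would work the same way:

```python
import pandas as pd

# Synthetic stand-in for the imbalanced frame: 990 normal rows, 10 fraud rows.
data = pd.DataFrame({'Class': [0] * 990 + [1] * 10})

fraud = data[data['Class'] == 1]
normal = data[data['Class'] == 0]

# Under-sampling: keep only as many majority rows as there are minority rows.
under = pd.concat([fraud, normal.sample(n=len(fraud), random_state=42)])

# Oversampling: duplicate minority rows (with replacement) to match the majority count.
over = pd.concat([normal, fraud.sample(n=len(normal), replace=True, random_state=42)])

print(under['Class'].value_counts())  # 10 rows of each class
print(over['Class'].value_counts())   # 990 rows of each class
```

In practice, a library such as imbalanced-learn (with its `RandomUnderSampler` and `RandomOverSampler` classes) is usually preferable to hand-rolling this, since it integrates with scikit-learn pipelines.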