Exploratory analysis

One important characteristic of the dataset is that it has no missing values, as the count statistic indicates: all features have the same number of values. Another important aspect is that most features are already normalized. This is a result of the PCA applied to the data, since the data is centered (and, commonly, scaled) before it is decomposed into principal components. The only two features that are not normalized are Time and Amount. The following figure shows a histogram for each feature:

Histograms for the dataset's features
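
The missing-value check described above can be sketched with pandas. This is a minimal illustration on a small made-up frame: the column names mirror the dataset's Time, V1, and Amount features, but the values are invented and the frame is only a stand-in for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the values are made up.
df = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 4.0],
    "V1": [-1.36, 1.19, -1.36, -0.97],
    "Amount": [149.62, 2.69, 378.66, 123.50],
})

# describe() reports the number of non-missing values per column as "count".
counts = df.describe().loc["count"]

# If every count equals the number of rows, no column has missing values.
no_missing = bool((counts == len(df)).all())

# An equivalent, more direct check: missing values per column.
missing_per_column = df.isna().sum()
```

If any column had a gap, its count would be lower than the number of rows and the matching entry in missing_per_column would be greater than zero.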

It is interesting to examine the Time and Amount of each transaction more closely. In the Time histogram, we notice a sudden drop in transaction frequency between 75,000 and 125,000 seconds after the first transaction (a span of roughly 14 hours). This is probably due to daily time cycles (for example, the night hours, when most stores are closed). The following histogram shows the transaction amounts, with a logarithmic scale on the y-axis. It is evident that most transactions involve small amounts, with the average being almost €88.00:

Histogram for amount, logarithmic scale for y-axis
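
To make the Time axis easier to read, the offsets quoted above can be converted from seconds to hours. The following is a small worked sketch; the helper name seconds_to_hours is introduced here for illustration only. Note that these are offsets from the first transaction, not clock times, because the clock time of the first transaction is not given.

```python
SECONDS_PER_HOUR = 3600


def seconds_to_hours(seconds):
    """Convert an offset in seconds to an offset in hours."""
    return seconds / SECONDS_PER_HOUR


# Boundaries of the drop in transaction frequency, as read from the histogram.
drop_start = seconds_to_hours(75_000)    # about 20.8 hours after the first transaction
drop_end = seconds_to_hours(125_000)     # about 34.7 hours after the first transaction
drop_duration = drop_end - drop_start    # about 13.9 hours
```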

To avoid problems with an uneven distribution of weights between features, we will standardize the Amount and Time features. Algorithms that rely on distance metrics (such as K-Nearest Neighbors), for example, can underperform when features are not scaled correctly. The histograms of the standardized features are provided as follows. Note that standardization transforms each variable so that it has a mean close to 0 and a standard deviation of 1:

Standardized amount histogram
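
Standardization as described above amounts to subtracting the mean and dividing by the standard deviation. The following is a minimal sketch with NumPy, assuming made-up amounts rather than the real data; the function name standardize is hypothetical, and the text does not state which tool was actually used to scale the features.

```python
import numpy as np


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Made-up transaction amounts, for illustration only.
amount = np.array([149.62, 2.69, 378.66, 123.50, 69.99])
amount_std = standardize(amount)

# After standardization the mean is (numerically) 0 and the standard deviation is 1.
mean_after = amount_std.mean()
std_after = amount_std.std()
```

scikit-learn's StandardScaler performs an equivalent transformation per column and is a common choice when the same scaling must later be applied to new data.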

The following plot depicts the histogram for the standardized time. We can see that standardization does not affect the drop in transactions during the night:

Standardized time histogram
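
The reason the drop survives is that standardization is a linear transformation: it shifts and rescales the axis but does not move values relative to one another. A rough sketch of this, using synthetic two-peaked values as a stand-in for the real Time feature (the data and the number of bins are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "time" values in seconds with two peaks and a dip in between;
# this is made-up data, not the real dataset.
time = np.concatenate([
    rng.normal(50_000, 10_000, size=500),
    rng.normal(150_000, 10_000, size=500),
])

# Standardize: subtract the mean, divide by the standard deviation.
time_std = (time - time.mean()) / time.std()

# Histogram both versions with the same number of equal-width bins.
counts_raw, _ = np.histogram(time, bins=20)
counts_std, _ = np.histogram(time_std, bins=20)

# The bin counts match, so the shape (including the dip) is unchanged.
same_shape = np.array_equal(counts_raw, counts_std)
```

Only the labels on the x-axis change; the heights of the bars, and therefore the dip, stay where they were.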