One important characteristic of the dataset is that there are no missing values, as indicated by the count statistic: all features have the same number of values. Another important aspect is that most features are normalized. This is a result of the PCA applied to the data, since the data is typically centered, and often scaled, before it is decomposed into principal components. The only two features that are not normalized are Time and Amount. The following figure depicts a histogram for each feature:
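The count check described above can be sketched as follows. This is a minimal illustration rather than the exact code used for the analysis: the small `df` DataFrame is a hypothetical stand-in for the real dataset, built from random numbers, and only the column names `Time` and `Amount` come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset: two PCA-like columns plus Time and Amount.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V1": rng.normal(0.0, 1.0, 1000),
    "V2": rng.normal(0.0, 1.0, 1000),
    "Time": rng.uniform(0, 172_800, 1000),
    "Amount": rng.lognormal(3.0, 1.5, 1000),
})

# describe() reports, among other statistics, the number of non-missing values per column.
summary = df.describe()
counts = summary.loc["count"]

# If every count equals the number of rows, no column has missing values.
no_missing = bool((counts == len(df)).all())

# An equivalent, more direct check: count the nulls in each column.
missing_per_column = df.isnull().sum()

print(summary.loc[["count", "mean", "std"]])
print("No missing values:", no_missing)
```

The `mean` and `std` rows of the same summary also make it easy to see which columns are on a very different scale from the rest.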
It is interesting to examine the Time and Amount of each transaction more closely. In the Time histogram, we notice a sudden drop in transaction frequency between 75,000 and 125,000 seconds after the first transaction, a span of roughly 14 hours. This is probably due to the daily cycle (for example, the night, when most stores are closed). The histogram of transaction amounts is shown next on a logarithmic scale. It is evident that most transactions involve small amounts, with the average being approximately €88:
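As a rough sketch of the two ideas in this paragraph, the snippet below converts the quoted window from seconds to hours and bins a set of amounts on a logarithmic scale. The `amounts` array is hypothetical random data, not the real transactions, and dropping zero-valued amounts before taking the logarithm is an assumption made here for illustration.

```python
import numpy as np

SECONDS_PER_HOUR = 3600

# Length of the low-activity window quoted in the text, in hours.
drop_start, drop_end = 75_000, 125_000
drop_length_h = (drop_end - drop_start) / SECONDS_PER_HOUR
print(f"Window length: {drop_length_h:.1f} hours")

# Hypothetical, right-skewed amounts standing in for the real Amount column.
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

# The logarithm is undefined at zero, so keep strictly positive amounts only.
positive = amounts[amounts > 0]

# Histogram of log10(amount): equal-width bins here are logarithmic bins in the original units.
counts, edges = np.histogram(np.log10(positive), bins=20)
print("Mean amount:", positive.mean())
print("Median amount:", np.median(positive))
```

For skewed data like this, the mean sits well above the median, which is why a logarithmic axis makes the bulk of small transactions easier to see.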
To avoid problems with an uneven distribution of weights between features, we will standardize the Amount and Time features. Algorithms that rely on distance metrics (such as K-Nearest Neighbors) can underperform when features are not scaled correctly. The histograms of the standardized features are provided next. Note that standardization transforms each variable so that it has a mean close to 0 and a standard deviation of 1:
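Standardization itself is a one-line transformation: subtract the mean and divide by the standard deviation. The following is a minimal NumPy sketch under stated assumptions: `standardize` is a helper name introduced here for illustration, and the `amount` and `time_s` arrays are hypothetical random data rather than the real columns.

```python
import numpy as np

def standardize(values):
    """Return (values - mean) / std, using the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical stand-ins for the Amount and Time columns.
rng = np.random.default_rng(0)
amount = rng.lognormal(3.0, 1.5, 5000)
time_s = rng.uniform(0, 172_800, 5000)

amount_std = standardize(amount)
time_std = standardize(time_s)

# After standardization, both features are on the same scale.
print("Amount mean/std:", amount_std.mean(), amount_std.std())
print("Time mean/std:  ", time_std.mean(), time_std.std())
```

scikit-learn's `StandardScaler` applies the same formula and additionally stores the fitted mean and scale, which is convenient when the same transformation must later be applied to unseen data.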
The following plot depicts the histogram for the standardized Time feature. We can see that standardization does not affect the drop in transactions during the night:
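The reason the drop survives is that standardization only shifts and rescales the axis, so the shape of the histogram is unchanged. The sketch below illustrates this with hypothetical data: `time_s` is a synthetic array built to have few observations between 75,000 and 125,000 seconds, not the real Time column.

```python
import numpy as np

# Hypothetical timestamps with a sparse region between 75,000 and 125,000 seconds.
rng = np.random.default_rng(42)
time_s = np.concatenate([
    rng.uniform(0, 75_000, 4_500),
    rng.uniform(75_000, 125_000, 500),
    rng.uniform(125_000, 172_800, 5_000),
])

# Standardize: subtract the mean and divide by the standard deviation.
time_std = (time_s - time_s.mean()) / time_s.std()

# Use the same number of equal-width bins for both versions of the feature.
counts_raw, _ = np.histogram(time_s, bins=48)
counts_std, _ = np.histogram(time_std, bins=48)

# The counts should agree bin for bin, up to floating-point rounding at bin edges.
mismatch = int(np.abs(counts_raw - counts_std).sum())
print("Total bin-count difference:", mismatch)
```

Only the tick labels on the horizontal axis change; the sparse region stays in the same relative position.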