Getting familiar with your data

Although we would strongly discourage such behavior, you can build a model without knowing your data; it will most likely take you longer, and the quality of the resulting model might be less than optimal, but it is doable.

Note

In this section, we will use the dataset we downloaded from http://packages.revolutionanalytics.com/datasets/ccFraud.csv. We did not alter the dataset itself, but it was GZipped and uploaded to http://tomdrabas.com/data/LearningPySpark/ccFraud.csv.gz. Please download the file first and save it in the same folder that contains your notebook for this chapter.

The head of the dataset looks as follows:

[Figure: the first few rows of the ccFraud dataset]

Thus, any serious data scientist or data modeler will become acquainted with the dataset before starting any modeling. Normally, we start with some descriptive statistics to get a feel for what we are dealing with.

Descriptive statistics

Descriptive statistics, in the simplest sense, tell you the basic facts about your dataset: how many non-missing observations there are, the mean and the standard deviation of each column, as well as its minimum and maximum values.

However, first things first—let's load our data and convert it to a Spark DataFrame:

import pyspark.sql.types as typ

First, we load the only module we will need. The pyspark.sql.types module exposes all the data types we can use, such as IntegerType() or FloatType().

Next, we read the data in and remove the header line using the .filter(...) method. This is followed by splitting the row on each comma (since this is a .csv file) and converting each element to an integer:

fraud = sc.textFile('ccFraud.csv.gz')
header = fraud.first()

fraud = fraud \
    .filter(lambda row: row != header) \
    .map(lambda row: [int(elem) for elem in row.split(',')])

Next, we create the schema for our DataFrame; since the column names in the header are wrapped in quotation marks, we strip them (hence the h[1:-1] slice):

fields = [
    typ.StructField(h[1:-1], typ.IntegerType(), True)
    for h in header.split(',')
]
schema = typ.StructType(fields)

Finally, we create our DataFrame:

fraud_df = spark.createDataFrame(fraud, schema)

Having created our fraud_df DataFrame, we can calculate the basic descriptive statistics for our dataset. However, you need to remember that even though all of our features are numeric, some of them are actually categorical (for example, gender or state).
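
One quick way to spot such columns is to count the distinct values in each of them; genuinely categorical features, such as gender, will have only a handful. The following is a minimal sketch along those lines, reusing the fraud_df DataFrame created above:

# A minimal sketch: count the distinct values in every column; columns
# with only a few distinct values are most likely categorical.
import pyspark.sql.functions as fn

fraud_df.agg(*[
    fn.countDistinct(c).alias(c) for c in fraud_df.columns
]).show()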

Here's the schema of our DataFrame:

fraud_df.printSchema()

The representation is shown here:

[Figure: the output of fraud_df.printSchema(), showing every column as an integer]

Also, no information would be gained from calculating the mean and standard deviation of the custId column, so we will not be doing that.

For a better understanding of categorical columns, we will count the frequencies of their values using the .groupby(...) method. In this example, we will count the frequencies of the gender column:

fraud_df.groupby('gender').count().show()

The preceding code will produce the following output:

[Figure: the frequency counts for the gender column]

As you can see, we are dealing with a fairly imbalanced dataset; you would expect to see a roughly equal distribution of both genders.
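
To quantify the imbalance, a small sketch like the following (reusing the fraud_df DataFrame) expresses each gender's count as a fraction of all records:

# A minimal sketch: express each gender's count as a share of all rows.
import pyspark.sql.functions as fn

total = fraud_df.count()

fraud_df \
    .groupby('gender') \
    .count() \
    .withColumn('fraction', fn.col('count') / total) \
    .show()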

Note

It goes beyond the scope of this chapter, but if you were building a statistical model, you would need to take care of these kinds of biases. You can read more at http://www.va.gov/VETDATA/docs/SurveysAndStudies/SAMPLE_WEIGHT.pdf.

For the truly numerical features, we can use the .describe() method:

numerical = ['balance', 'numTrans', 'numIntlTrans']
desc = fraud_df.describe(numerical)
desc.show()

The .show() method will produce the following output:

[Figure: count, mean, stddev, min, and max for balance, numTrans, and numIntlTrans]

Even from these relatively few numbers we can tell quite a bit:

  • All of the features are positively skewed; the maximum values are several times larger than the averages.
  • The coefficient of variation (the ratio of the standard deviation to the mean) is very high (close to, or greater than, 1), suggesting a wide spread of observations (see the sketch after this list).
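
As a quick check of that second point, the following minimal sketch (assuming the fraud_df DataFrame and the numerical list defined above) computes the coefficient of variation for each numerical column directly:

# A minimal sketch: compute the coefficient of variation
# (standard deviation divided by the mean) for each numerical column.
import pyspark.sql.functions as fn

fraud_df.agg(*[
    (fn.stddev(c) / fn.mean(c)).alias(c + '_cv')
    for c in numerical
]).show()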

Here's how you check the skewness (we will do it for the 'balance' feature only):

fraud_df.agg({'balance': 'skewness'}).show()

The preceding code produces the following output:

[Figure: the skewness of the balance column]

A list of aggregation functions (the names are fairly self-explanatory) includes: avg(), count(), countDistinct(), first(), kurtosis(), max(), mean(), min(), skewness(), stddev(), stddev_pop(), stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp() and variance().
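
The same statistics are also exposed as functions in the pyspark.sql.functions module; the following short sketch (an illustration, not the only way to do it) computes several of them for the balance column in a single pass:

# A minimal sketch: several aggregations of the balance column computed
# at once via pyspark.sql.functions.
import pyspark.sql.functions as fn

fraud_df.agg(
    fn.skewness('balance'),
    fn.kurtosis('balance'),
    fn.variance('balance')
).show()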

Correlations

Another highly useful measure of mutual relationships between features is correlation. Your model would normally include only those features that are highly correlated with your target. However, it is almost equally important to check the correlations between the features themselves; including features that are highly correlated with each other (that is, collinear) may lead to unpredictable behavior of your model, or might unnecessarily complicate it.

Note

I talk more about multicollinearity in my other book, Practical Data Analysis Cookbook, Packt Publishing (https://www.packtpub.com/big-data-and-business-intelligence/practical-data-analysis-cookbook), in Chapter 5, Introducing MLlib, under the section titled Identifying and tackling multicollinearity.

Calculating correlations in PySpark is very easy once your data is in a DataFrame form. The only difficulties are that, at the moment, the .corr(...) method supports only the Pearson correlation coefficient, and that it can only calculate correlations pairwise, as follows:

fraud_df.corr('balance', 'numTrans')

In order to create a correlation matrix, you can use the following script:

n_numerical = len(numerical)

corr = []

for i in range(n_numerical):
    # Pad the lower triangle with None so each row lines up with the columns.
    temp = [None] * i

    for j in range(i, n_numerical):
        # .corr(...) is symmetric, so we only compute the upper triangle.
        temp.append(fraud_df.corr(numerical[i], numerical[j]))
    corr.append(temp)

The preceding code will create the following output:

[Figure: the correlation matrix for balance, numTrans, and numIntlTrans]

As you can see, the correlations between the numerical features in the credit card fraud dataset are pretty much non-existent. Thus, all these features can be used in our models, should they turn out to be statistically sound in explaining our target.
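
If you want to inspect the nested corr list as a labelled table, a small sketch like the following will do it, assuming pandas is available (as it commonly is alongside PySpark):

# A minimal sketch: view the correlation list computed above as a
# labelled table (assumes pandas is installed).
import pandas as pd

corr_matrix = pd.DataFrame(corr, index=numerical, columns=numerical)
print(corr_matrix)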

Having checked the correlations, we can now move on to visually inspecting our data.
