Categorical features

For categorical features, simple descriptive statistics such as the mean do not apply. Instead, we will calculate the frequency of each distinct value in every categorical column. Here's a code snippet that will achieve this:

rdd_cat = (
    census_subset
    .select(cols_cat + ['label'])       # keep only the categorical columns and the label
    .rdd                                # drop down to the underlying RDD
    .map(lambda row: [e for e in row])  # convert each Row into a plain list
)

results_cat = {}

for i, col in enumerate(cols_cat + ['label']):
    results_cat[col] = (
        rdd_cat
        .groupBy(lambda row: row[i])          # group the rows by the column's value
        .map(lambda el: (el[0], len(el[1])))  # emit (value, frequency) pairs
        .collect()
    )

First, we repeat for the categorical columns what we just did for the numerical ones: we subset census_subset to the categorical columns plus the label, access the underlying RDD, and transform each Row into a plain list. We store the results in the results_cat dictionary. Next, we loop through all the categorical columns and aggregate the data using the .groupBy(...) transformation. Finally, we map each group to a tuple whose first element is the distinct value (el[0]) and whose second element is its frequency (len(el[1])).

The .groupBy(...) transformation outputs an RDD of tuples in which the first element is the grouping value and the second is a pyspark.resultiterable.ResultIterable object, which is effectively a list of all the rows from the RDD that contain that value.
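To illustrate what these transformations compute without a running Spark cluster, here is a pure-Python sketch of the same per-column frequency counting. The rows and column names below are made up for the example and are not taken from the census data:

```python
# Pure-Python sketch of the (value, frequency) aggregation.
# The sample rows and column names are hypothetical, not real census data.
from collections import Counter

cols_cat = ['workclass', 'sex']  # hypothetical categorical columns
rows = [
    ['Private', 'Male', '<=50K'],
    ['Private', 'Female', '<=50K'],
    ['Self-emp', 'Male', '>50K'],
]

results_cat = {}
for i, col in enumerate(cols_cat + ['label']):
    # Counter does the per-value counting that groupBy + len achieves in Spark
    results_cat[col] = sorted(Counter(row[i] for row in rows).items())

print(results_cat['workclass'])  # [('Private', 2), ('Self-emp', 1)]
```

Note that Counter produces each distinct value paired with its count directly, whereas the Spark version first materializes the grouped rows and then takes their length.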

Now that we have our data aggregated, let's see what we are dealing with:

The preceding list is abbreviated for brevity. Check (or run the code in) the 5. Machine Learning with MLlib.ipynb notebook present in our GitHub repository.

As you can see, we are dealing with an imbalanced sample: it is heavily skewed toward males and mostly white respondents. Also, only about a quarter of the people surveyed in 1994 earned more than $50,000.
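If you want to quantify such an imbalance, you can turn the collected (value, frequency) pairs into proportions. A minimal sketch, with counts that are made up purely for illustration:

```python
# Turn (value, frequency) pairs into proportions to quantify class imbalance.
# The counts below are hypothetical, purely for illustration.
label_counts = [('<=50K', 7500), ('>50K', 2500)]

total = sum(count for _, count in label_counts)
proportions = {value: count / total for value, count in label_counts}

print(proportions)  # {'<=50K': 0.75, '>50K': 0.25}
```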
