For categorical data, descriptive statistics such as the mean or standard deviation are not meaningful. Instead, we are going to calculate the frequency of each distinct value in each categorical column. Here's a code snippet that will achieve this:
rdd_cat = (
    census_subset
    .select(cols_cat + ['label'])
    .rdd
    .map(lambda row: [e for e in row])
)

results_cat = {}

for i, col in enumerate(cols_cat + ['label']):
    results_cat[col] = (
        rdd_cat
        .groupBy(lambda row: row[i])
        .map(lambda el: (el[0], len(el[1])))
        .collect()
    )
First, we repeat for the categorical columns what we have just done for the numerical ones: we subset census_subset to only the categorical columns and the label, access the underlying RDD, and transform each row into a list. We're going to store the results in the results_cat dictionary, keyed by column name. We loop through all the categorical columns and, for each one, group the rows by that column's value using the .groupBy(...) transformation. Finally, we map each group to a tuple where the first element is the value (el[0]) and the second element is its frequency, that is, the number of rows in the group (len(el[1])), and collect the resulting list of tuples.
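To make the shape of results_cat concrete, here is a minimal plain-Python sketch of the same group-and-count logic that runs without Spark. The sample rows, the column names, and the count_by_column helper are invented for illustration and are not part of the census data or the code above. The sorting step is also an addition, since the order of the list returned by .collect() is not guaranteed.

```python
from collections import Counter

# Hypothetical rows standing in for the contents of rdd_cat; values are invented.
cols = ['sex', 'race', 'label']
rows = [
    ['Male', 'White', '<=50K'],
    ['Female', 'White', '<=50K'],
    ['Male', 'Black', '>50K'],
    ['Male', 'White', '<=50K'],
]

def count_by_column(rows, cols):
    """Return {column: [(value, frequency), ...]}, the same shape as results_cat."""
    results = {}
    for i, col in enumerate(cols):
        # Count how many rows hold each distinct value in position i
        counts = Counter(row[i] for row in rows)
        # Sort by descending frequency, then by value, for a stable order
        results[col] = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return results

results_demo = count_by_column(rows, cols)
print(results_demo['sex'])  # [('Male', 3), ('Female', 1)]
```

Each dictionary entry is a list of (value, frequency) tuples, which is what the Spark version produces for every categorical column.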
Now that we have our data aggregated, let's see what we are dealing with:
As you can see, we are dealing with an imbalanced sample: it is heavily skewed toward males, and most of the people in it are white. Also, in 1994 relatively few people earned more than $50,000: only about a quarter of the sample.
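One way to put numbers on this kind of imbalance is to convert a list of (value, frequency) tuples into shares of the total. The following is a minimal sketch, assuming the list-of-tuples structure stored in results_cat; to_shares is a hypothetical helper, and the counts are placeholders rather than the actual census figures.

```python
def to_shares(freqs):
    """Convert [(value, count), ...] into {value: share of total}."""
    total = sum(count for _, count in freqs)
    if total == 0:
        # No rows at all: there is nothing to normalize
        return {}
    return {value: count / total for value, count in freqs}

# Placeholder counts, not taken from the census data
example = [('Male', 3), ('Female', 1)]
shares = to_shares(example)
print(shares)  # {'Male': 0.75, 'Female': 0.25}
```

With the real data, calling to_shares(results_cat['label']) would give the share of each income class directly.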