We will reduce the dimensionality of our dataset by roughly half, so we first need to extract the number of distinct values in each categorical column:
len_ftrs = []

# Count the distinct values in each categorical column and
# store the result as a {column: count} dictionary.
for col in cols_cat:
    len_ftrs.append(
        (col,
         census
             .select(col)
             .distinct()
             .count()
        )
    )

len_ftrs = dict(len_ftrs)
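Note that this launches one Spark job per column. If approximate counts are acceptable, the same mapping can be built in a single aggregation. This is a minimal sketch, assuming the same census DataFrame and cols_cat list; approx_len_ftrs is a name introduced here, and approx_count_distinct returns estimates rather than exact counts:

import pyspark.sql.functions as F

# Build the {column: distinct count} mapping in a single pass over the data;
# approx_count_distinct uses HyperLogLog++, so the counts are estimates.
approx_len_ftrs = (
    census
    .agg(*[F.approx_count_distinct(c).alias(c) for c in cols_cat])
    .collect()[0]
    .asDict()
)

For sizing hash buckets, a small estimation error makes no practical difference, which is why the approximate variant is a reasonable trade-off.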
Next, for each categorical feature, we will use the HashingTF(...) class to encode our data:
import pyspark.mllib.feature as feat
final_data = (
    census
    .select(cols_to_keep)
    .rdd
    .map(lambda row: [
        # Columns from index 5 onward are categorical: hash each value
        # into a vector roughly half the length of the column's
        # distinct-value count.
        list(
            feat.HashingTF(int(len_ftrs[col] / 2.0))
            .transform([row[i]])  # wrap the value in a list so the whole
                                  # category is hashed, not its characters
            .toArray()
        ) if i >= 5
        else [row[i]]  # numeric columns pass through unchanged
        for i, col in enumerate(cols_to_keep)
    ])
)

final_data.take(3)
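To see what HashingTF does to a single value, you can run it locally on a one-element document. This is a minimal sketch for illustration only; the 'Private' value and the bucket count of 4 are made-up examples:

from pyspark.mllib.feature import HashingTF

# Hash one categorical value into 4 buckets: the result is a vector
# with a single non-zero entry at a position determined by the hash.
htf = HashingTF(4)
print(htf.transform(['Private']).toArray())
# e.g. [0. 0. 1. 0.] -- the exact bucket depends on the hash function

Because distinct values can collide in the same bucket, halving the dimensionality trades a small amount of information for a much smaller feature space.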