How to do it...

We will reduce the dimensionality of each categorical feature roughly by half, so first we need to extract the number of distinct values in each categorical column:

len_ftrs = []

for col in cols_cat:
    # Record the cardinality (number of distinct values)
    # of each categorical column
    len_ftrs.append(
        (col, census.select(col).distinct().count())
    )

len_ftrs = dict(len_ftrs)
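The same dictionary can also be built in a single comprehension; this equivalent sketch assumes the census DataFrame and the cols_cat list defined earlier in this recipe:

len_ftrs = {
    col: census.select(col).distinct().count()
    for col in cols_cat
}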

Next, for each categorical feature, we will use the HashingTF class from the pyspark.mllib.feature module to encode our data via the hashing trick:

import pyspark.mllib.feature as feat

final_data = (
    census
    .select(cols_to_keep)
    .rdd
    .map(lambda row: [
        # Hash each categorical value into a vector of roughly
        # half the feature's cardinality; the value is wrapped
        # in a list so HashingTF treats it as a single term
        # rather than iterating over its characters
        list(
            feat.HashingTF(int(len_ftrs[col] / 2.0))
            .transform([row[i]])
            .toArray()
        )
        # the first five columns of cols_to_keep are numeric
        # and are passed through unchanged
        if i >= 5 else [row[i]]
        for i, col in enumerate(cols_to_keep)
    ])
)

Let's check the outcome by looking at the first three transformed records; each element of a record is now either a one-element list (a numeric feature) or the hashed vector of a categorical feature:

final_data.take(3)
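To see what the hashing trick does in isolation, here is a minimal, self-contained sketch; the bucket count of 4 and the sample value 'State-gov' are illustrative only:

import pyspark.mllib.feature as feat

# Map a single categorical value into a 4-bucket frequency vector.
# transform() expects an iterable of terms; each term is counted
# at index hash(term) % numFeatures
htf = feat.HashingTF(4)
print(htf.transform(['State-gov']).toArray())
# e.g. [0. 0. 1. 0.] -- the populated bucket depends on the hash seed

Because distinct values can collide in the same bucket, this encoding trades a small amount of information loss for the roughly 50% reduction in dimensionality.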