How it works...

First, we loop through all the categorical columns and append a tuple of the column name (col) and the count of distinct values found in that column. The latter is obtained by selecting the column of interest, running the .distinct() transformation, and counting the resulting values. At this point, len_ftrs is a list of tuples. Passing it to dict(...) makes Python build a dictionary that takes the first element of each tuple as the key and the second element as the corresponding value, so len_ftrs ends up mapping each column name to its number of distinct values.
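
A minimal sketch of that step follows; the census DataFrame is the one used in this recipe, while categorical_cols is a hypothetical placeholder for your list of categorical column names:

```python
# Build (column_name, distinct_count) tuples for each categorical column;
# `census` is the recipe's DataFrame, `categorical_cols` is a placeholder.
len_ftrs = []

for col in categorical_cols:
    # count the distinct values found in this column
    len_ftrs.append((col, census.select(col).distinct().count()))

# dict(...) turns the list of 2-tuples into {column_name: distinct_count}
len_ftrs = dict(len_ftrs)
```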

Now that we know the total number of distinct values in each feature, we can use the hashing trick. First, we import the feature component of MLlib, as that is where .HashingTF(...) lives. Next, we subset the census DataFrame to only the columns we want to keep. We then use the .map(...) transformation on the underlying RDD: for each element, we enumerate all the columns and, if the index of the column is greater than or equal to five, we create a new instance of .HashingTF(...), which we use to transform the value and convert it into a NumPy array. The only thing you need to specify for .HashingTF(...) is the desired number of output elements; in our case, we roughly halve the number of distinct values, so we will have some hashing collisions, but that is fine.
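
A sketch of what that transformation can look like is shown below; it reuses the census DataFrame, the cols_to_keep list, and the len_ftrs dictionary from the previous step, and assumes the numeric columns occupy the first five positions of cols_to_keep:

```python
# A sketch of the hashing step under the assumptions stated above.
import pyspark.mllib.feature as feat

final_data = (
    census
    .select(cols_to_keep)
    .rdd
    .map(lambda row: [
        # hash each categorical value into a vector roughly half as long
        # as the column's distinct-value count, then convert the resulting
        # SparseVector into a plain list via a NumPy array
        list(
            feat.HashingTF(int(len_ftrs[cols_to_keep[i]] / 2.0))
            .transform([row[i]])
            .toArray()
        )
        if i >= 5
        else [row[i]]  # numeric columns pass through unchanged
        for i in range(len(row))
    ])
)
```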

For your reference, our cols_to_keep list holds the selected column names in the order the .map(...) transformation sees them, with the numeric columns occupying the first five positions.

After applying the preceding transformation, each record of final_data is a list that mixes raw numeric values with hashed NumPy arrays; the format might look a bit odd, but we will soon get it ready for creating the training RDD.
