How it works...

As always, we first load the modules we will use throughout: pyspark.sql.functions, which allows us to calculate the minimum and maximum values of the Horizontal_Distance_To_Hydrology feature; pyspark.ml.feature, which exposes the .Bucketizer(...) transformer; and NumPy, which helps us create an equispaced list of thresholds.
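As a minimal sketch, these are the kinds of imports the walkthrough assumes (the aliases func and feat are illustrative, not mandated by the libraries):

```python
import pyspark.sql.functions as func   # min/max aggregations
import pyspark.ml.feature as feat      # exposes the Bucketizer transformer
import numpy as np                     # np.arange(...) builds the split points
```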

We want to bucketize our numerical variable into 10 buckets, hence our buckets_no is equal to 10. Next, we calculate the minimum and maximum values for the Horizontal_Distance_To_Hydrology feature and return these two values to the driver. On the driver, we create the list of thresholds (the splits list); the first argument to np.arange(...) is the minimum, the second is the maximum, and the third defines the size of each step.
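Here is a hedged sketch of this step, assuming the DataFrame is called forest (the aggregation style and variable names are illustrative). Note that np.arange(...) stops short of the maximum, so the top edge has to be added explicitly for the splits to cover the whole range:

```python
buckets_no = 10

# Pull the minimum and maximum of the feature back to the driver
row = forest.agg(
    func.min('Horizontal_Distance_To_Hydrology').alias('min'),
    func.max('Horizontal_Distance_To_Hydrology').alias('max')
).first()
dist_min, dist_max = row['min'], row['max']

# Equispaced thresholds; append the maximum so the last bucket is closed
step = (dist_max - dist_min) / buckets_no
splits = list(np.arange(dist_min, dist_max, step))
if splits[-1] < dist_max:
    splits.append(dist_max)
```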

Now that we have the splits list defined, we pass it to the splits parameter of the .Bucketizer(...) transformer.
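A sketch of that call follows; the output column name is an assumption made for illustration:

```python
bucketizer = feat.Bucketizer(
    splits=splits,
    inputCol='Horizontal_Distance_To_Hydrology',
    outputCol='Horizontal_Distance_To_Hydrology_bucket'
)
```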

Every Transformer (Estimators work similarly) exposes a largely uniform API, and two parameters almost always need to be specified: inputCol and outputCol, which name the column to consume and the column to produce, respectively. Both Transformers and Estimators also implement the .getOutputCol() method, which returns the name of the output column.
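For example, continuing with the bucketizer object defined above:

```python
# Returns the name that was passed as outputCol when the transformer was created
print(bucketizer.getOutputCol())
# Horizontal_Distance_To_Hydrology_bucket
```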

Finally, we use the bucketizer object to transform our DataFrame; the result gains a new column that holds, for each record, the index of the bucket its distance value falls into.
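A hedged sketch of the final step; the column selection and the number of rows shown are only for illustration:

```python
(
    bucketizer
    .transform(forest)
    .select(
        'Horizontal_Distance_To_Hydrology',
        bucketizer.getOutputCol()
    )
    .show(5)
)
```

Each original distance is paired with a bucket index, a double between 0.0 and buckets_no - 1.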
