First, let's learn how to use the .Bucketizer(...) transformer. Here's the snippet that transforms the Horizontal_Distance_To_Hydrology column into 10 equal-width (equidistant) buckets:
import pyspark.sql.functions as f
import pyspark.ml.feature as feat
import numpy as np
buckets_no = 10
dist_min_max = (
    forest.agg(
        f.min('Horizontal_Distance_To_Hydrology')
            .alias('min')
        , f.max('Horizontal_Distance_To_Hydrology')
            .alias('max')
    )
    .first()  # a Row; positional indexing works, no need to drop to the RDD API
)
# buckets_no buckets require buckets_no + 1 split points; np.linspace includes
# both endpoints, so the maximum value lands in the last bucket (Bucketizer's
# last bucket is closed on the right)
splits = list(np.linspace(
    dist_min_max[0]
    , dist_min_max[1]
    , buckets_no + 1))
bucketizer = feat.Bucketizer(
splits=splits
, inputCol= 'Horizontal_Distance_To_Hydrology'
, outputCol='Horizontal_Distance_To_Hydrology_Bkt'
)
(
bucketizer
.transform(forest)
.select(
'Horizontal_Distance_To_Hydrology'
,'Horizontal_Distance_To_Hydrology_Bkt'
).show(5)
)
Any ideas why we could not use .QuantileDiscretizer(...) to achieve this?
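As a hint, here is a plain NumPy sketch of the difference (the skewed sample below is a hypothetical stand-in for the real column): .QuantileDiscretizer(...) places its split points at quantiles, producing buckets with roughly equal row counts, so on skewed data the boundaries are generally not equidistant.

```python
import numpy as np

# Hypothetical skewed sample standing in for Horizontal_Distance_To_Hydrology
rng = np.random.default_rng(42)
sample = rng.exponential(scale=100.0, size=1000)

# Equal-WIDTH boundaries: what our Bucketizer splits produce
equal_width = np.linspace(sample.min(), sample.max(), 11)

# Equal-FREQUENCY boundaries: what a quantile-based discretizer computes
equal_freq = np.quantile(sample, np.linspace(0.0, 1.0, 11))

# On skewed data only the first set has constant gaps between boundaries
print(np.allclose(np.diff(equal_width), np.diff(equal_width)[0]))  # True
print(np.allclose(np.diff(equal_freq), np.diff(equal_freq)[0]))    # False
```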