Building a clustering model in Spark does not deviate significantly from the classification and regression examples we have already seen:
import pyspark.ml.feature as feat
import pyspark.ml.clustering as clust
from pyspark.ml import Pipeline

# Assemble all columns except the last (the label) into a single vector
vectorAssembler = feat.VectorAssembler(
    inputCols=forest.columns[:-1]
    , outputCol='features')

# k-means with seven clusters and a fixed seed for reproducibility
kmeans_obj = clust.KMeans(k=7, seed=666)

pip = Pipeline(stages=[vectorAssembler, kmeans_obj])