How it works...

We start, as always, by importing the relevant module; in this case, it is pyspark.ml.clustering.

Next, we collate all the features that we will use to build the model with the well-known .VectorAssembler(...) Transformer.

This is followed by instantiating the .KMeans(...) object. We only specified two parameters, but the list of the most notable ones is as follows:

  • k: This specifies the expected number of clusters and is the key parameter of the k-means model; if you do not set it, Spark falls back to a default of 2
  • initMode: This specifies how the cluster centroids are initialized; k-means|| uses a parallel variant of k-means++, while random picks random points as the initial centroids
  • initSteps: This specifies the number of steps used by the k-means|| initialization mode
  • maxIter: This specifies the maximum number of iterations after which the algorithm stops, even if it has not converged
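To make the role of these parameters more concrete, here is a minimal pure-Python sketch of the classic k-means loop. It is not Spark's implementation: the function names (squared_distance, kmeans), the toy points, and the tol and seed arguments are assumptions made for illustration, and only the random initialization mode is mimicked.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=20, tol=1e-4, seed=0):
    """Toy k-means: returns (centroids, iterations actually run)."""
    rng = random.Random(seed)
    # "random" initialization: pick k distinct input points as centroids
    centroids = rng.sample(points, k)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if a cluster ends up empty
                new_centroids.append(centroids[i])
        # Largest squared move of any centroid in this iteration
        shift = max(squared_distance(c, n)
                    for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        # Stop early once no centroid moved by more than tol
        if shift <= tol ** 2:
            break
    return centroids, iteration


# Two obvious groups of toy points (made up for this sketch)
toy_points = [(0, 0), (0, 1), (1, 0), (50, 50), (50, 51), (51, 50)]
toy_centroids, iterations_run = kmeans(toy_points, k=2, max_iter=20)
print(sorted(toy_centroids), iterations_run)
```

Lowering max_iter caps the number of passes through the loop regardless of whether the centroids have settled, which is the behavior the maxIter bullet describes.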

Finally, we build the Pipeline with two stages only.

Once the results are calculated, we can look at what we got. Our aim was to see whether there are any underlying patterns found in the type of forest coverage:

results = (
    pip
    .fit(forest)
    .transform(forest)
    .select('features', 'CoverType', 'prediction')
)

results.show(5)

Here's what we got from running the preceding code:

As you can see, there do not seem to be many patterns that would differentiate the forest cover types. However, let's check whether this is because our segmentation simply performs poorly, or because it finds patterns that do not align with CoverType:

clustering_ev = ev.ClusteringEvaluator()
clustering_ev.evaluate(results)

.ClusteringEvaluator(...) is a new evaluator, available since Spark 2.3 and still experimental. It calculates the Silhouette measure for the clustering results, using the squared Euclidean distance.

Here's what we got for our k-means model:

As you can see, we got a decent model: the Silhouette score ranges from -1 to 1, and anything around 0.5 or more indicates well-separated clusters.
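To see what the Silhouette score measures, here is a minimal pure-Python sketch that computes it with the squared Euclidean distance. It is an illustration only, not the code Spark runs: the function names (squared_distance, silhouette) and the toy points and labels are assumptions, and the sketch expects at least two clusters.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def silhouette(points, labels):
    """Mean Silhouette score over all points (needs >= 2 clusters)."""
    # Group the points by their cluster label
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a single-point cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so it drops out of the sum)
        a = sum(squared_distance(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(squared_distance(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up for this sketch): two tight, distant groups
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_score = silhouette(toy_points, [0, 0, 1, 1])  # sensible clustering
bad_score = silhouette(toy_points, [0, 1, 0, 1])   # groups mixed up
print(good_score, bad_score)
```

A clustering that keeps the two groups apart scores close to 1, while one that mixes them scores below 0, which is why a value of around 0.5 or more is read as a sign of reasonable separation.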
