There's more...

Once the model is estimated, we can use it to predict the clusters and see how well our model actually performs. However, at the moment, Spark does not provide the means to evaluate clustering models. Thus, we will use the metrics provided by scikit-learn:

import sklearn.metrics as m

# Predict a cluster for each observation; the features sit in the
# second element of each (label, features) tuple in final_data
predicted = (
    model
        .predict(
            final_data.map(lambda row: row[1])
        )
)
predicted = predicted.collect()

# The true class labels sit in the first element of each tuple
true = final_data.map(lambda row: row[0]).collect()

print(m.homogeneity_score(true, predicted))
print(m.completeness_score(true, predicted))

The clustering metrics are located in the sklearn.metrics module. We are using two of the available metrics: homogeneity and completeness. Homogeneity measures whether each cluster contains only points from a single class, whereas completeness measures whether all the points from a given class end up in the same cluster; a value of 1 for either score indicates a perfect model.
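To build intuition for the difference between the two scores, consider the following toy example (hypothetical labels, not part of this recipe): one cluster mixes two classes, which hurts homogeneity, and one class is split across two clusters, which hurts completeness:

import sklearn.metrics as m

# Hypothetical labels: cluster 0 mixes classes 'a' and 'b',
# and class 'a' is split between clusters 0 and 1
true_toy = ['a', 'a', 'b', 'b']
predicted_toy = [0, 1, 0, 0]

# Both scores fall below 1: the impure cluster lowers homogeneity,
# the split class lowers completeness
print(m.homogeneity_score(true_toy, predicted_toy))   # ~0.31
print(m.completeness_score(true_toy, predicted_toy))  # ~0.38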

Let's see what we get. In our run, the homogeneity score came out around 0.15 and the completeness score around 0.12.

Well, our clustering model did not do so well: a homogeneity score of roughly 15% means that most clusters mix observations from several different classes, and a completeness score of roughly 12% means that observations belonging to the same class are scattered across multiple clusters.
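If you want to see where the model goes wrong, one option (not covered in this recipe) is to cross-tabulate the true classes against the predicted clusters using scikit-learn's contingency_matrix; the true and predicted lists below are the ones collected earlier:

from sklearn.metrics.cluster import contingency_matrix

# Rows correspond to true classes, columns to predicted clusters;
# a good model concentrates each row's mass in a single column
print(contingency_matrix(true, predicted))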
