As described earlier, the k-means and k-medians algorithms are best suited to particular distance metrics, the squared Euclidean and Manhattan distances respectively. This is because the cluster mean minimizes the total squared Euclidean distance to the members of a cluster, and the median minimizes the total Manhattan distance, which are exactly the statistics these algorithms attempt to minimize. For other distance metrics (such as those based on correlations), we might instead use the k-medoids method (Theodoridis, Sergios, and Konstantinos Koutroumbas. Pattern Recognition. 2003), which consists of the following steps:
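To make the idea concrete, here is a minimal sketch of one common formulation of k-medoids, in which we alternate between assigning points to their nearest medoid and re-selecting each medoid from among its cluster's members. It is an illustration under our own assumptions, not necessarily the exact procedure of the cited text or the implementation in pyclust: the function name k_medoids, its parameters, and the toy data are hypothetical, and the step numbering in the comments is ours. The function takes a precomputed matrix of pairwise distances, so any distance metric can be plugged in.

```python
import numpy as np

def k_medoids(dist, k, max_iter=100, seed=0):
    """Cluster n points given an (n, n) matrix of pairwise distances.

    Hypothetical helper for illustration; returns (medoid indices, labels).
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    # Step 1: pick k distinct datapoints as the initial medoids.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        # Step 3: within each cluster, move the medoid to the member with
        # the smallest total distance to the other members.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if its cluster is empty
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # no medoid changed, so the assignment is stable
        medoids = new_medoids
    # Final assignment against the converged medoids.
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy data (made up for this example): two well-separated groups of points.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
# Pairwise Manhattan distances; any other distance matrix would work here.
dist = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)
medoids, labels = k_medoids(dist, k=2)
```

Because the function only ever sees pairwise distances, switching from Manhattan distance to another metric changes only how dist is computed, not the clustering loop itself.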
This is clearly not an exhaustive search (since we don't repeat step 1), but it has the advantage that the optimality criterion is not tied to a specific objective function: we simply improve the compactness of the clusters under whatever distance metric we choose. Can k-medoids improve our clustering of the concentric circles? Let's try it by running the following commands and plotting the result:
>>> from pyclust import KMedoids
>>> kmedoids_clusters = KMedoids(2).fit_predict(np.array(df)[:,1:])
>>> df.plot(kind='scatter', x='x_coord', y='y_coord', c=kmedoids_clusters)
There isn't much improvement over k-means, so perhaps we need to change our clustering algorithm entirely. Instead of computing a similarity between datapoints in a single stage, we could examine hierarchical measures of similarity and clustering, which is the goal of the agglomerative clustering algorithms we will examine next.