K-means clustering

The idea behind clustering is to group together similar points in a dataset based on a given criterion, hence finding clusters in the data.

The K-means algorithm aims to partition a set of data points into K clusters such that each data point belongs to the cluster with the nearest mean point or centroid.
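To make the alternation between the two steps concrete, here is a minimal NumPy sketch of the standard (Lloyd's) K-means iteration: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. This is an illustrative toy version, not scikit-learn's implementation (which adds smarter initialization and empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Toy K-means: alternate between assigning points to their
    nearest centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        # (Assumes no cluster goes empty, which can happen in general.)
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer change
        centroids = new_centroids
    return labels, centroids

# Quick sanity check on two well-separated synthetic blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(5.0, 0.3, (20, 2))])
labels, centroids = kmeans(X, 2)
```

Each iteration can only decrease the within-cluster sum of squared distances, which is why the loop is guaranteed to terminate (though possibly at a local, not global, optimum).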

To illustrate K-means clustering, we apply it to the PCA-reduced iris data that we obtained earlier. Note that, unlike in supervised learning, we do not pass the actual labels to the fit(..) method:

    In [142]: from sklearn.cluster import KMeans
              k_means = KMeans(n_clusters=3, random_state=0)
              k_means.fit(X_red)
              y_pred = k_means.predict(X_red)

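Once fitted, the KMeans estimator exposes the learned centroids (cluster_centers_) and the within-cluster sum of squared distances (inertia_), which are useful for inspecting the result. A self-contained sketch follows; since the X_red variable from the earlier PCA step is not reproduced in this excerpt, it is rebuilt here from the iris data with a 2-component PCA:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

iris = load_iris()
# Rebuild the PCA-reduced data (the earlier X_red) from the iris features.
X_red = PCA(n_components=2).fit_transform(iris.data)

k_means = KMeans(n_clusters=3, random_state=0, n_init=10)
k_means.fit(X_red)              # unsupervised: no labels passed
y_pred = k_means.predict(X_red)

print(k_means.cluster_centers_)  # one 2-D centroid per cluster
print(k_means.inertia_)          # within-cluster sum of squared distances
```

Lower inertia means tighter clusters, but inertia always decreases as K grows, so it should not be used on its own to choose K.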
We now display the clustered data as follows:

    In [145]: figsize(8,6)
              fig = plt.figure()
              fig.suptitle("K-Means clustering on PCA-reduced iris data, K=3")
              ax = fig.add_subplot(1,1,1)
              ax.scatter(X_red[:, 0], X_red[:, 1], c=y_pred);

Note that the clusters found by our K-means algorithm do not correspond exactly to the actual species groupings in the data. The source code is available at https://github.com/jakevdp/sklearn_pycon2014.
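Because K-means assigns arbitrary integer IDs to its clusters, comparing them to the true species labels requires a permutation-invariant measure. One common choice, sketched here with scikit-learn's adjusted_rand_score (again rebuilding X_red from the iris data, since the earlier variable is not shown in this excerpt):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X_red = PCA(n_components=2).fit_transform(iris.data)
y_pred = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_red)

# Cluster IDs are arbitrary, so score agreement with the true species
# labels using a measure that ignores label permutations.
score = adjusted_rand_score(iris.target, y_pred)
print(score)  # 1.0 would mean a perfect match with the species labels
```

A score well above 0 but below 1 is expected here: two of the iris species overlap in feature space, so an unsupervised method cannot separate them perfectly.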

More information on K-means clustering, both in scikit-learn and in general, can be found at http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html and http://en.wikipedia.org/wiki/K-means_clustering.