K-means is a clustering algorithm that bears some similarity to k-NN, in that both rely on distances between instances. The algorithm starts with a number of cluster centers; each instance is assigned to its nearest center, and each center is then moved to the centroid of the instances assigned to it. These two steps are repeated until the algorithm converges to a stable solution. In scikit-learn, this algorithm is implemented in sklearn.cluster.KMeans. We can try to cluster the first two features of the breast cancer dataset: the mean radius and the mean texture of the FNA imaging.
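To make the assign-then-recompute loop concrete, here is a minimal NumPy sketch of a single K-means iteration on toy data (the `kmeans_step` helper and the toy points are illustrative, not part of scikit-learn):

```python
import numpy as np

def kmeans_step(X, centers):
    # Assign each instance to its nearest center (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each center to the centroid of its assigned instances
    # (a production implementation would also handle empty clusters)
    new_centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, new_centers

# Toy data: two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.0, 0.1], [5.0, 5.0]])
labels, centers = kmeans_step(X, centers)
```

Running this step repeatedly until `centers` stops changing is the whole algorithm; scikit-learn additionally handles initialization and restarts for us.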
First, we load the required data and libraries, while retaining only the first two features of the dataset:
# --- SECTION 1 ---
# Libraries and data loading
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
bc = load_breast_cancer()
bc.data = bc.data[:, :2]
Then, we fit the clusterer on the data. Note that we don't need to split the data into train and test sets, as clustering is unsupervised:
# --- SECTION 2 ---
# Instantiate and train
km = KMeans(n_clusters=3)
km.fit(bc.data)
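Once fitted, the model exposes the learned centers and assignments through its `cluster_centers_`, `labels_`, and `inertia_` attributes. A self-contained sketch on synthetic two-dimensional blobs (standing in for the two retained features; `n_init` is set explicitly to keep behavior consistent across scikit-learn versions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two synthetic blobs, 50 points each, in two dimensions
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one (x, y) center per cluster
print(km.labels_[:5])       # cluster index assigned to each instance
print(km.inertia_)          # sum of squared distances to nearest center
```

The same attributes are available on the model fitted on the breast cancer features above.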
Following that, we create a two-dimensional mesh and cluster every point, in order to plot the cluster areas and boundaries:
# --- SECTION 3 ---
# Create a point mesh to plot cluster areas
# Step size of the mesh.
h = .02
# Plot the cluster boundaries. For that, we will assign a color to each
# point in the mesh
x_min, x_max = bc.data[:, 0].min() - 1, bc.data[:, 0].max() + 1
y_min, y_max = bc.data[:, 1].min() - 1, bc.data[:, 1].max() + 1
# Create the actual mesh and cluster it
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = km.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           aspect='auto', origin='lower')
Finally, we plot the actual data, color-mapped to its respective clusters:
# --- SECTION 4 ---
# Plot the actual data
c = km.predict(bc.data)
# Boolean masks, one per cluster
r = c == 0
b = c == 1
g = c == 2
plt.scatter(bc.data[r, 0], bc.data[r, 1], label='cluster 1')
plt.scatter(bc.data[b, 0], bc.data[b, 1], label='cluster 2')
plt.scatter(bc.data[g, 0], bc.data[g, 1], label='cluster 3')
plt.title('K-means')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.xlabel(bc.feature_names[0])
plt.ylabel(bc.feature_names[1])
plt.legend()
plt.show()
The result is a two-dimensional plot with the color-coded area of each cluster, as well as the individual instances:
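The choice of n_clusters=3 above is a hyperparameter the model does not validate for us. One common sanity check is the elbow method: compare the inertia (within-cluster sum of squares) across several values of k and look for the point where further increases stop paying off. A minimal sketch, again on synthetic blobs rather than the breast cancer features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two well-separated blobs, so the "elbow" should appear at k = 2
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# Inertia always falls as k grows; look for where the drop levels off
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 6)]
```

Plotting `inertias` against k would show a sharp drop from k = 1 to k = 2 and only marginal gains afterwards.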