Using scikit-learn

The scikit-learn library has a number of clustering techniques available for use. Here, we briefly present how to use K-means. The algorithm is implemented in the KMeans class of the sklearn.cluster package, which contains all the clustering algorithms available in scikit-learn. In this chapter, we will mainly use K-means, as it is one of the most intuitive algorithms; furthermore, the techniques used in this chapter can be applied to almost any clustering algorithm. For this experiment, we will try to cluster breast cancer data, in order to explore the possibility of distinguishing malignant cases from benign ones. To better visualize the results, we will first perform a t-Distributed Stochastic Neighbor Embedding (t-SNE) decomposition and use the two-dimensional embeddings as features. To proceed, we first load the required data and libraries, and set the seed for the NumPy random number generator:

You can read more about t-SNE at https://lvdmaaten.github.io/tsne/.
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE

np.random.seed(123456)

# Load the breast cancer dataset; bc.data holds the features,
# bc.target the labels (0: malignant, 1: benign)
bc = load_breast_cancer()

Following this, we instantiate t-SNE and transform our data. We then plot the data in order to visually inspect its structure:

tsne = TSNE()
data = tsne.fit_transform(bc.data)

# Boolean masks for the malignant (target == 0) and benign (target == 1) cases
reds = bc.target == 0
blues = bc.target == 1

plt.scatter(data[reds, 0], data[reds, 1], label='malignant')
plt.scatter(data[blues, 0], data[blues, 1], label='benign')
plt.xlabel('1st Component')
plt.ylabel('2nd Component')
plt.title('Breast Cancer data')
plt.legend()

The preceding code generates the following plot. We observe two distinct areas; the area populated by the blue points corresponds to embedding values that imply a high risk of the tumor being malignant:

Plot of the two embeddings (components) of the breast cancer data

As we have identified that some structure exists in the data, we will try to model it using K-means clustering. Intuitively, we assume that two clusters will suffice, as we are trying to separate two distinct regions and we know that there are two classes in the dataset. Nonetheless, we will also experiment with four and six clusters, as they might provide more insight into the data. We will gauge the quality of the clusters by measuring the percentage of each class assigned to each cluster. We do this by populating the classified dictionary: each key corresponds to a cluster and points to a second dictionary, where the numbers of malignant and benign cases for that specific cluster are recorded. Furthermore, we plot the cluster assignments, as we want to see how the data is distributed among the clusters:

plt.figure()
plt.suptitle('2, 4, and 6 clusters.')  # figure-level title, not overwritten by the subplots
for clusters in [2, 4, 6]:
    # Fit K-means and get each instance's cluster assignment
    km = KMeans(n_clusters=clusters)
    preds = km.fit_predict(data)
    # Plot the assignments in the corresponding subplot
    plt.subplot(1, 3, clusters // 2)
    plt.scatter(*zip(*data), c=preds)

    # Count the malignant ('m') and benign ('b') cases in each cluster
    classified = {x: {'m': 0, 'b': 0} for x in range(clusters)}

    for i in range(len(data)):
        cluster = preds[i]
        label = bc.target[i]
        label = 'm' if label == 0 else 'b'
        classified[cluster][label] = classified[cluster][label] + 1

    # Report the counts and the malignant percentage of each cluster
    print('-'*40)
    for c in classified:
        print('Cluster %d. Malignant percentage: ' % c, end=' ')
        print(classified[c], end=' ')
        print('%.3f' % (classified[c]['m'] /
                        (classified[c]['m'] + classified[c]['b'])))

The results are depicted in the following table and figure:

Cluster    Malignant    Benign    Malignant percentage
2 clusters
0          206          97        0.680
1          6            260       0.023
4 clusters
0          2            124       0.016
1          134          1         0.993
2          72           96        0.429
3          4            136       0.029
6 clusters
0          2            94        0.021
1          81           10        0.890
2          4            88        0.043
3          36           87        0.293
4          0            78        0.000
5          89           0         1.000

Distribution of malignant and benign cases among the clusters

We observe that the algorithm is able to separate the instances belonging to each class quite effectively, even though it has no information about the labels:

Cluster assignment of each instance; 2, 4, and 6 clusters

Furthermore, we see that, as we increase the number of clusters, the number of instances assigned to predominantly malignant or benign clusters does not increase substantially, but the regions become better separated. This enables greater granularity and a more accurate estimate of the probability that a selected instance belongs to either class. If we repeat the experiment without transforming the data, we get the following results:

Cluster    Malignant    Benign    Malignant percentage
2 clusters
0          82           356       0.187
1          130          1         0.992
4 clusters
0          6            262       0.022
1          100          1         0.990
2          19           0         1.000
3          87           94        0.481
6 clusters
0          37           145       0.203
1          37           0         1.000
2          11           0         1.000
3          62           9         0.873
4          5            203       0.024
5          60           0         1.000

Clustering results on the data without the t-SNE transform
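These results can be reproduced with the same loop as before; the following is a minimal sketch, assuming the earlier session, where K-means is simply fit on the original 30-dimensional features (bc.data) instead of the two-dimensional embeddings:

# Repeat the clustering on the raw features instead of the embeddings;
# the counting and reporting logic is unchanged.
for clusters in [2, 4, 6]:
    km = KMeans(n_clusters=clusters)
    preds = km.fit_predict(bc.data)

    classified = {x: {'m': 0, 'b': 0} for x in range(clusters)}
    for i in range(len(bc.data)):
        label = 'm' if bc.target[i] == 0 else 'b'
        classified[preds[i]][label] += 1

    print('-' * 40)
    for c in classified:
        total = classified[c]['m'] + classified[c]['b']
        print('Cluster %d. Malignant percentage: %.3f'
              % (c, classified[c]['m'] / total))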

There are also two metrics that can be used to determine cluster quality. For data where the ground truth is known (essentially, labeled data), homogeneity measures the degree to which each cluster is dominated by a single class. For data where the ground truth is not known, the silhouette coefficient measures intra-cluster cohesiveness and inter-cluster separability. These metrics are implemented in scikit-learn's metrics package, as the silhouette_score and homogeneity_score functions. The two metrics for each method are depicted in the following table. Homogeneity is higher for the transformed data, but the silhouette score is lower.

This is expected, as the transformed data has only two dimensions, thus reducing the possible distances between the instances:

Metric         Clusters    Raw data    Transformed data
Homogeneity    2           0.422       0.418
               4           0.575       0.603
               6           0.620       0.648
Silhouette     2           0.697       0.500
               4           0.533       0.577
               6           0.481       0.555

Homogeneity and silhouette scores for clusterings of the raw and transformed data
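As a minimal sketch of how such scores can be computed (assuming the data, bc, and KMeans objects from the earlier session), both functions take the cluster assignments; note that homogeneity_score also requires the ground-truth labels, while silhouette_score requires the features instead:

from sklearn.metrics import homogeneity_score, silhouette_score

# Score a two-cluster K-means run on the transformed data; homogeneity
# compares the assignments against the true labels, while the
# silhouette coefficient uses only the features and the assignments.
km = KMeans(n_clusters=2)
preds = km.fit_predict(data)

print('Homogeneity: %.3f' % homogeneity_score(bc.target, preds))
print('Silhouette: %.3f' % silhouette_score(data, preds))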