K-means clustering

The K-means algorithm is also referred to as vector quantization. The algorithm finds the cluster (centroid) positions that minimize the distances to all points in the cluster. This is done iteratively; the problem with the algorithm is that it is greedy, meaning that it converges quickly to the nearest local minimum. This is generally addressed with some kind of basin-hopping approach, where the best solution found so far is randomly perturbed and the algorithm restarted. Because of this, the algorithm depends on good initial guesses for the centroid positions.
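
To make the iteration concrete, here is a minimal, illustrative sketch of a single k-means pass in plain NumPy. This is not the SciPy implementation we use below; the names points and centroids are placeholders, and empty clusters are not handled:

import numpy as np

def kmeans_step(points, centroids):
    """One assignment-and-update step of the k-means iteration."""
    # Distance from every point to every centroid (Euclidean).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Assign each point to its nearest centroid.
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of the points assigned to it.
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return new_centroids, labels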

Suicide rate versus GDP versus absolute latitude
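
The examples in this section assume the standard imports carried over from the earlier chapters. If you are starting a fresh Notebook, the following covers everything used below:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import vq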

As mentioned in Chapter 4, Regression, we will analyze the data of suicide rates versus GDP versus absolute latitude, or Degrees From Equator (DFE), for clusters. Our hypothesis from the visual inspection was that there were at least two distinct clusters: one with a higher suicide rate, GDP, and absolute latitude, and one with lower values. We saved an HDF file in Chapter 4, Regression, which we now read in as a DataFrame. This time, we want to discard all the rows where one or more column entries are NaN or empty, so we use the dropna method of the DataFrame:

TABLE_FILE = 'data/data_ch4.h5' 
d2 = pd.read_hdf(TABLE_FILE) 
d2 = d2.dropna() 

Next, while the DataFrame is a very handy format that we will utilize later on, the clustering functions in SciPy do not handle Pandas DataFrames natively. Thus, we transfer the data to a NumPy array:

rates = d2[['DFE','GDP_CD','Both']].values.astype('float')

Next, to recap, we visualize the data with a histogram of the GDP and a scatter plot of all the data. We do this to aid us in making the initial guesses for the cluster centroid positions:

plt.figure(figsize=(8,3.5))
plt.subplot(121)
plt.hist(rates.T[1], bins=20, color='SteelBlue')
plt.xticks(rotation=45, ha='right')
plt.yscale('log')
plt.xlabel('GDP')
plt.ylabel('Counts')
plt.subplot(122)
plt.scatter(rates.T[0], rates.T[2],
            s=2e5*rates.T[1]/rates.T[1].max(),
            color='SteelBlue', edgecolors='0.3')
plt.xlabel("Absolute Latitude (Degrees, 'DFE')")
plt.ylabel('Suicide Rate (per 100 000)')
plt.subplots_adjust(wspace=0.25); 

Suicide rate versus GDP versus absolute latitude

The scatter plot on the right shows the suicide rate on the y-axis and the absolute latitude on the x-axis; the size of each point is proportional to the country's GDP. The k-means clustering function expects normalized input: each data column has to be divided by its standard deviation. Although this is straightforward to do by hand, the scipy.cluster.vq module includes a function called whiten that does exactly this scaling:

w = vq.whiten(rates) 
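
If you want to convince yourself of what whiten does, a quick check (using the rates array from the preceding step) shows that it is equivalent to dividing each column by its standard deviation:

print(np.allclose(w, rates / rates.std(axis=0)))  # expected to print True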

To show what it does to the data, we plot the preceding plots again, but with the output from the whiten function:

plt.figure(figsize=(8,3.5))
plt.subplot(121)
plt.hist(w[:,1], bins=20, color='SteelBlue')
plt.yscale('log')
plt.subplot(122)
plt.scatter(w.T[0], w.T[2], s=2e5*w.T[1]/w.T[1].max(),
            color='SteelBlue', edgecolors='0.3')
plt.xticks(rotation=45, ha='right')

Suicide rate versus GDP versus absolute latitude

As you can see, all the data has been rescaled compared with the previous figure. However, as mentioned, the scaling factor is simply the standard deviation of each column. Let's calculate these factors and save them in the sc variable so that we can convert the results back later:

sc = rates.std(axis=0) 

Now we are ready to estimate the initial guesses for the cluster centroids. Reading off the first plot of the data, we guess one centroid to be at 20 DFE, 20,000 GDP, and a suicide rate of 10, and the second at 45 DFE, 100,000 GDP, and a rate of 15. We put these in an array and divide by our scale parameter so that they are on the same scale as the output of the whiten function. This is then sent to the kmeans2 function of SciPy:

init_guess = np.array([[20,20E3,10],[45,100E3,15]]) 
init_guess /= sc
z2_cb, z2_lbl = vq.kmeans2(w, init_guess, minit='matrix',
                           iter=500) 

There is another function, kmeans (without the 2), which is a less complex version; it does not stop after a fixed number of iterations, but rather when the change between two iterations goes below a given threshold. Thus, the standard k-means algorithm is represented in SciPy by the kmeans2 function. The function outputs the centroids' scaled positions (here, z2_cb) and a lookup table (z2_lbl) telling us which row belongs to which centroid. To get the centroid positions in units we understand, we simply multiply by our scaling factors:

z2_cb_sc = z2_cb * sc 
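
For comparison, this is roughly how the simpler kmeans function could be used together with vq.vq to obtain a similar result. Note that here kmeans only receives the number of clusters and picks its own starting points, so the cluster numbering and exact centroid positions may differ from the kmeans2 run above; this is only a sketch, not part of the main analysis:

# kmeans returns the codebook (centroids) and the mean distortion,
# and vq assigns each row to its nearest centroid.
codebook, distortion = vq.kmeans(w, 2, iter=20)
labels, dists = vq.vq(w, codebook)
print(codebook * sc)   # centroid positions in physical units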

At this point, we can plot the results. The following code section is rather long and contains many different parts, so we will go through it part by part. However, the code should be run in a single cell of the Notebook:

# K-means clustering figure START 
plt.figure(figsize=(6,4)) 
plt.scatter(z2_cb_sc[0,0], z2_cb_sc[0,2],
            s=5e2*z2_cb_sc[0,1]/rates.T[1].max(),
            marker='+', color='k',
            edgecolors='k', lw=2, zorder=10, alpha=0.7);
plt.scatter(z2_cb_sc[1,0], z2_cb_sc[1,2],
            s=5e2*z2_cb_sc[1,1]/rates.T[1].max(),
            marker='+', color='k', edgecolors='k', lw=3,
            zorder=10, alpha=0.7); 

The first steps are quite simple; we set up the figure size and plot the points of the cluster centroids. We hypothesized two clusters, thus we plot them with two separate calls to plt.scatter. Here, z2_cb_sc[1,0] gets the second cluster's x coordinate (DFE) from the array; switching the 0 for a 2 gives us the y coordinate (rate), and index 1 gives the GDP, which we use to scale the size of the marker. We also do this further down for the data points, just as in the previous plots, so that it is easier to compare and differentiate the clusters. The zorder keyword gives the drawing order of the plotted elements; a high zorder will put them on top of everything else, and a negative zorder will send them to the back.

s0 = z2_lbl == 0
s1 = z2_lbl == 1
pattern1 = 5*'x' 
pattern2 = 4*'/' 
plt.scatter(w.T[0][s0]*sc[0],  
            w.T[2][s0]*sc[2],  
            s=5e2*rates.T[1][s0]/rates.T[1].max(), 
            lw=1, 
            hatch=pattern1, 
            edgecolors='0.3', 
            color=plt.cm.Blues_r( 
                rates.T[1][s0]/rates.T[1].max())); 
plt.scatter(rates.T[0][s1], 
            rates.T[2][s1],  
            s=5e2*rates.T[1][s1]/rates.T[1].max(), 
            lw=1, 
            hatch=pattern2, 
            edgecolors='0.4', 
            marker='s', 
            color=plt.cm.Reds_r( 
                rates.T[1][s1]/rates.T[1].max()+0.4)) 

In this section, we plot the points of the clusters. First, we get the selection arrays. They are simply Boolean arrays, where the values that correspond to cluster 0 or cluster 1 are True. Thus, s0 is True where the cluster id is 0, and s1 is True where the cluster id is 1. Next, we define the hatch patterns for the scatter plot markers, which we later pass to the plotting function; the multiplier for the hatch pattern sets the density of the pattern. The scatter plots for the points are created in a similar fashion to the centroids, except that the markers are a bit more complex. They are color-coded, as in the previous example with cholera deaths, but in a gradient instead of the exact same color for all points. The gradient is defined by the GDP, which also defines the size of the points. The x and y data sent to the two calls look different between the clusters, but they access the same underlying data in the end, because we multiply the whitened values by our scaling factors.

p1 = plt.scatter([],[],
                 s=20E3*5e2/rates.T[1].max(),
                 color='k', edgecolors='None')
p2 = plt.scatter([],[],
                 s=40E3*5e2/rates.T[1].max(),
                 color='k', edgecolors='None')
p3 = plt.scatter([],[],
                 s=60E3*5e2/rates.T[1].max(),
                 color='k', edgecolors='None')
p4 = plt.scatter([],[],
                 s=80E3*5e2/rates.T[1].max(),
                 color='k', edgecolors='None')
labels = ["20'", "40'", "60'", ">80'"] 
plt.legend([p1, p2, p3, p4], labels, ncol=1,
           frameon=True, #fontsize=12,
           handlelength=1, loc=1,
           borderpad=0.75, labelspacing=0.75,
           handletextpad=0.75, title='GDP', scatterpoints=1)
plt.ylim((-4,40)) 
plt.xlim((-4,80)) 
plt.title('K-means clustering') 
plt.xlabel("Absolute Latitude (Degrees, 'DFE')")
plt.ylabel('Suicide Rate (per 100 000)')

The last tweak to the plot is the custom legend. We want to show the different sizes of the points and the GDP they correspond to. As the points span a continuous gradient from low to high, we cannot reuse the plotted points directly. Thus, we create our own, but leave the x and y input coordinates as empty lists; this draws nothing in the plot, but the returned handles can still be registered in the legend. The various keyword arguments to the legend function control different aspects of the legend layout, and I encourage you to experiment with them to see what happens:

Suicide rate versus GDP versus absolute latitude
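
Before drawing conclusions, it can also help to back the figure up with numbers. The following is a small sketch (assuming the d2 DataFrame and the z2_lbl labels from above are still in memory) that summarizes each cluster:

# Per-cluster summary: mean latitude (DFE), GDP, and suicide rate, plus counts.
summary = d2[['DFE', 'GDP_CD', 'Both']].copy()
summary['cluster'] = z2_lbl   # row order matches the rates array
print(summary.groupby('cluster').mean())
print(summary['cluster'].value_counts())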

As for the final analysis, two different clusters are identified. Just as in our earlier hypothesis, there is a cluster with a clear linear trend, relatively higher GDP, and higher absolute latitude. Although the identification is rather weak, it is clear that the two groups are separated: countries with low GDP are clustered closer to the equator. What happens when you add more clusters? Try to add a cluster for the low-DFE, high-rate countries, visualize it, and think about what this could mean for the conclusion(s); a starting sketch follows.
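
As a starting point for that exercise, here is a sketch of how a third initial centroid could be added and the clustering re-run. The third guess (a low-latitude, low-GDP, high-rate group) is only an illustrative assumption, so adjust it after inspecting the scatter plot:

# Re-run kmeans2 with three initial centroids; the third guess is hypothetical.
init_guess3 = np.array([[20, 20E3, 10],
                        [45, 100E3, 15],
                        [10, 5E3, 20]]) / sc
z3_cb, z3_lbl = vq.kmeans2(w, init_guess3, minit='matrix', iter=500)
print(z3_cb * sc)   # centroid positions in physical units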
