Chapter 7. Clustering with Python

In the previous two chapters, we discussed two important algorithms used in predictive analytics: linear regression and logistic regression. Both are very widely used, and both are supervised algorithms. If you cast your mind back to the earlier chapters of the book, you will remember that a supervised algorithm is one where the historical values of an output variable are known from the data. A supervised algorithm uses these values to train and build a model that forecasts the value of the output variable for a future dataset. An unsupervised algorithm, on the other hand, doesn't have the luxury (or the constraint, depending on how you look at it) of an output variable. It builds a model from the values of the predictor variables alone.

Clustering, the algorithm that we are going to discuss in this chapter, is an unsupervised algorithm. Clustering or segmentation, as the name suggests, categorizes entries into clusters or segments in which the entries are more similar to each other than to entries outside the cluster. Once the clusters are defined, one can identify the properties of each cluster and devise plans or strategies for each cluster separately. This results in more efficient strategizing and planning for each cluster.

The broad focus of this chapter will be clustering and segmentation, and by the end of this chapter, you will have learned the following:

  • Math behind the clustering algorithms: This section will talk about the various kinds of measures of similarity or dissimilarity between observations. The similarity or dissimilarity is measured in something called distances. We will look at different types of distances and create distance metrics.
  • Different types of clustering algorithms: This section has information about two kinds of clustering algorithms, namely, hierarchical clustering and k-means clustering. The details of the two algorithms will be illustrated using tables and code simulations.
  • Implementing clustering using Python: This section will deal with implementing k-means clustering algorithm on a dataset from scratch, analyzing and making sense of the output, generating plots showing the clusters, and making contextual sense of the clusters.
  • Fine-tuning the clustering: In this section, we will cover topics such as finding the optimum number of clusters and calculating a few statistics to check the efficiency of the clustering we performed.

Introduction to clustering – what, why, and how?

Now let us discuss the various aspects of clustering in greater detail.

What is clustering?

Clustering basically means the following:

  • Creating groups with a high similarity among the members of the same cluster
  • Creating groups with a significant distinction or dissimilarity between the members of two different clusters

Clustering algorithms work by calculating the similarity or dissimilarity between observations in order to group them into clusters.
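The most common way to quantify this similarity or dissimilarity is as a distance between observations. As a minimal sketch, assuming a few made-up (Monthly Income, Monthly Expense) observations, the pairwise Euclidean distances can be computed with NumPy as follows:

```python
import numpy as np

# Hypothetical observations: each row is (Monthly Income, Monthly Expense)
points = np.array([
    [2000.0, 1500.0],
    [2100.0, 1600.0],
    [9000.0, 7500.0],
])

# Pairwise differences between every pair of rows, via broadcasting
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]

# Euclidean distance matrix: entry (i, j) is the distance between rows i and j
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(dist_matrix, 1))
```

Observations 0 and 1 are close to each other and both are far from observation 2, which is exactly the pattern a clustering algorithm exploits. Euclidean distance is only one of several distance measures we will look at in this chapter.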

How is clustering used?

Let us look at the plot of Monthly Income and Monthly Expense for a group of 400 people. As one can see, there are visible clusters of people whose earnings and expenses differ from those of people in other clusters, but are very similar to those of the people in their own cluster:

Fig. 7.1: Illustration of clustering plotting Monthly Income vs Monthly Expense

In the preceding plot, the visible clusters of the people can be identified based on their income and expense levels, as follows:

  • 1 (low income, low expense): The cluster marked as 1 has low income and low expense levels
  • 2 (medium income, low expense): The cluster marked as 2 has a medium level of income, but its members spend less, only a little more than the people in cluster 1 with low income
  • 3 (medium income, medium or high expense): The cluster marked as 3 also has a medium level of income, in almost the same range as cluster 2, but its members spend more than those in cluster 2
  • 4 (high income, high expense): The cluster marked as 4 has a high level of income and a high level of expense

This analysis can be very helpful if, let's say, an organization is trying to target potential customers for its different ranges of products. Once the clusters are known, the organization can target different clusters with different ranges of products. For example, it can target cluster 4 to sell its premium products, and clusters 1 and 2 to sell its low-end products. This results in higher conversion rates for the advertisement campaigns.

This was one illustration of how clustering can be advantageous. It was a very simple case with just two attributes of the potential customers, so we were able to plot the data on a 2D graph and see the clusters. However, this is not the case most of the time; in general, we need to define a generalized metric for the similarity or dissimilarity of the observations. We will discuss this in detail later in this chapter.
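To make the idea concrete, here is a minimal from-scratch sketch of k-means, the algorithm covered in detail later in this chapter, run on synthetic two-attribute data loosely modelled on the income/expense example. The data, the `kmeans` helper, and all numbers are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two income/expense groups (made-up numbers)
low = rng.normal([2000, 1500], 200, size=(50, 2))
high = rng.normal([9000, 7500], 400, size=(50, 2))
X = np.vstack([low, high])

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct random observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

labels, centroids = kmeans(X, k=2)
print(centroids)
```

Because the two synthetic groups are far apart relative to their spread, the algorithm recovers them regardless of which observations are picked as the initial centroids.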

Some of the properties of a good cluster can be listed as follows:

  • Clusters should be identifiable and significant in size so that they can be classified as one.
  • Points within a cluster should be compactly placed, and there should be minimal overlap with points in other clusters.
  • Clusters should make business sense. The observations in the same cluster should exhibit similar properties when it comes to the business context.
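The first two properties can be checked numerically. As a minimal sketch, with two small made-up clusters, we can compare within-cluster compactness (average pairwise distance inside a cluster) against between-cluster separation (distance between the centroids); the helper names here are illustrative, not a standard API:

```python
import numpy as np

# Two hypothetical, well-separated clusters (made-up points)
a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([[10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])

def mean_intra_distance(cluster):
    """Average pairwise distance within one cluster (compactness)."""
    diffs = cluster[:, None, :] - cluster[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(cluster)
    # Exclude the zero self-distances on the diagonal
    return d.sum() / (n * (n - 1))

# Separation: distance between the two cluster centroids
separation = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

compactness = max(mean_intra_distance(a), mean_intra_distance(b))
print(compactness, separation)
```

For a good clustering, compactness should be much smaller than separation. We will look at more formal versions of such statistics in the fine-tuning section of this chapter.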

Why do we do clustering?

Clustering can have a variety of applications. The following are some of the cases where clustering is used:

  • Clustering and segmentation are the bread and butter of marketing professionals. The advent of digital marketing has made clustering indispensable. The goal here is to find customers who think, behave, and make decisions along similar lines, and then reach out to and persuade them in a fashion tailor-made for them. Think of Facebook and sponsored posts. How do they use demographic, age group, and preference data to show you the most relevant posts?
  • Remember those taxonomy charts in Biology books from high school? Well, that is one of the most widely used applications of clustering—a particular type called hierarchical clustering. The clustering, in this case, happens on the basis of the similarity between the amino acid sequences of two genera or species.
  • Clustering is used in seismology to find the expected epicenter of earthquakes and identify earthquake-prone zones.
  • Clustering is also used to impute missing values in a dataset. Recall that we earlier imputed missing values with the mean of the rest of the observations. Some forms of clustering require calculating or assuming the centroids of the clusters, and these centroids can be used to impute the missing values of the observations belonging to a cluster.
  • Clustering is used in urban planning to group together houses according to their geography, value, and amenities. It can also be used to identify a spot for public amenities such as public transport stops, mobile signal towers, and so on.
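The imputation idea mentioned in the list above can be sketched in a few lines. Assuming cluster labels have already been obtained (they are hard-coded here for illustration, as are all the numbers), each missing entry is replaced by the corresponding component of its own cluster's centroid:

```python
import numpy as np

# Toy data: rows are observations; np.nan marks a missing expense value
X = np.array([
    [2000.0, 1500.0],
    [2100.0, 1600.0],
    [2050.0, np.nan],   # missing value in the low-income cluster
    [9000.0, 7500.0],
    [9100.0, 7600.0],
])

# Assumed cluster labels (e.g. obtained by clustering the complete columns)
labels = np.array([0, 0, 0, 1, 1])

X_imputed = X.copy()
for j in np.unique(labels):
    members = X[labels == j]
    # Centroid of this cluster, computed while ignoring missing entries
    centroid = np.nanmean(members, axis=0)
    # Replace each missing entry in this cluster with the centroid value
    rows, cols = np.where(np.isnan(X_imputed) & (labels == j)[:, None])
    X_imputed[rows, cols] = centroid[cols]

print(X_imputed[2])
```

The missing expense in row 2 is filled with the cluster-0 mean expense rather than the overall mean, which keeps the imputed value consistent with similar observations.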