How it works...

Density-based clustering uses the ideas of density reachability and density connectivity, which makes it well suited to discovering clusters with nonlinear shapes. Before discussing the process of density-based clustering, some background concepts must be explained. Density-based clustering takes two parameters into account: eps and MinPts. eps is the maximum radius of the neighborhood; MinPts is the minimum number of points required within the eps-neighborhood. With these two parameters, we can define a core point (also called a core object) as a point, p, whose eps-neighborhood contains at least MinPts points. We can define a border point as a point that has fewer than MinPts points in its eps-neighborhood but lies within the eps-neighborhood of a core point. A point that is neither a core point nor a border point is treated as noise.
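To make these definitions concrete, the following is a minimal base R sketch that labels each point of a small toy dataset as core, border, or noise. The data, the values of eps and MinPts, and the variable names are all assumptions chosen for illustration, and the sketch assumes that a point counts as a member of its own eps-neighborhood:

```r
# Toy data (made up for illustration): four tightly packed points,
# one point on their fringe, and one isolated point.
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(0.25, 0.1), c(2, 2))
eps <- 0.2     # neighborhood radius (assumed value)
MinPts <- 4    # minimum neighborhood size (assumed value; the point itself counts)

d <- as.matrix(dist(pts))           # pairwise Euclidean distances
n_neighbors <- rowSums(d <= eps)    # size of each eps-neighborhood, self included

# Core point: the eps-neighborhood holds at least MinPts points.
is_core <- n_neighbors >= MinPts

# Border point: not core, but within eps of at least one core point.
near_core <- apply(d <= eps, 1, function(row) any(row & is_core))
is_border <- !is_core & near_core

role <- ifelse(is_core, "core", ifelse(is_border, "border", "noise"))
print(data.frame(x = pts[, 1], y = pts[, 2], n_neighbors, role))
```

With these toy values, the first four points come out as core points, the fifth as a border point, and the isolated sixth point as noise.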

Furthermore, we have to define the reachability between two points. We say that a point, p, is directly density reachable from another point, q, if p is within the eps-neighborhood of q and q is a core object. Then, we say that a point, p, is density reachable from the point, q, if there exists a chain of points, p₁, p₂, ..., p_n, where p₁ = q, p_n = p, and p_(i+1) is directly density reachable from p_i with regard to eps and MinPts for 1 <= i < n:

Point p is density reachable from point q
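The two reachability definitions can be sketched in base R as follows. The helper names directly_reachable and density_reachable, the toy points, and the parameter values are hypothetical and exist only for this illustration; points are referred to by their row index:

```r
# Toy data (made up): five points along a line.
pts <- rbind(c(0, 0), c(0.1, 0), c(0.2, 0), c(0.3, 0), c(0.45, 0))
eps <- 0.16    # assumed value
MinPts <- 3    # assumed value; a point counts toward its own neighborhood

d <- as.matrix(dist(pts))
is_core <- rowSums(d <= eps) >= MinPts

# p is directly density reachable from q when q is a core object
# and p lies within the eps-neighborhood of q.
directly_reachable <- function(p, q) {
  is_core[[q]] && d[q, p] <= eps
}

# p is density reachable from q when a chain of directly reachable
# steps leads from q to p. Only core points can extend the chain.
density_reachable <- function(p, q) {
  reached <- q
  frontier <- q
  while (length(frontier) > 0) {
    cur <- frontier[1]
    frontier <- frontier[-1]
    if (!is_core[[cur]]) next
    nxt <- setdiff(unname(which(d[cur, ] <= eps)), reached)
    reached <- c(reached, nxt)
    frontier <- c(frontier, nxt)
  }
  p %in% reached
}

directly_reachable(1, 2)   # point 2 is core and point 1 is within eps of it
directly_reachable(2, 1)   # point 1 is not core
density_reachable(5, 2)    # chain 2 -> 3 -> 4 -> 5
density_reachable(2, 5)    # point 5 is not core, so no chain starts from it
```

The last two calls illustrate that density reachability is not symmetric: a chain can end at a border point but cannot start from one.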

With these preliminary concepts of density-based clustering in place, we can illustrate the process of DBSCAN, the most popular density-based clustering algorithm, in these steps:

Randomly select a point, p.
Retrieve all the points that are density-reachable from p with regard to eps and MinPts.
If p is a core point, then a cluster is formed. Otherwise, p is a border point or noise and no points are density reachable from it, so the process marks p as noise for now and continues with the next point; a border point is absorbed into a cluster later, when it is reached from a core point.
Repeat the process until all points have been visited.
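The steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the implementation used by the fpc package: the function name simple_dbscan is hypothetical, points are visited in row order rather than at random, a point counts toward its own neighborhood, and a label of 0 stands for noise:

```r
simple_dbscan <- function(x, eps, MinPts) {
  d <- as.matrix(dist(x))
  n <- nrow(x)
  cluster <- integer(n)    # 0 = noise or not yet assigned
  visited <- logical(n)
  current <- 0L
  for (p in seq_len(n)) {  # row order instead of random order (simplification)
    if (visited[p]) next
    visited[p] <- TRUE
    neighbors <- unname(which(d[p, ] <= eps))
    if (length(neighbors) < MinPts) next   # not core: stays noise for now
    current <- current + 1L                # p is core: start a new cluster
    cluster[p] <- current
    queue <- setdiff(neighbors, p)
    while (length(queue) > 0) {
      q <- queue[1]
      queue <- queue[-1]
      # Any unassigned point reached from a core point joins the cluster,
      # including points that were provisionally left as noise.
      if (cluster[q] == 0L) cluster[q] <- current
      if (visited[q]) next
      visited[q] <- TRUE
      q_neighbors <- unname(which(d[q, ] <= eps))
      if (length(q_neighbors) >= MinPts) {
        # q is core as well, so the cluster grows through its neighborhood.
        queue <- union(queue, q_neighbors[cluster[q_neighbors] == 0L])
      }
    }
  }
  cluster
}

# Toy data (made up): two tight groups of three points and one outlier.
x <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1),
           c(1, 1), c(1.1, 1), c(1, 1.1),
           c(5, 5))
simple_dbscan(x, eps = 0.2, MinPts = 3)
```

On this toy input, the first three points form cluster 1, the next three form cluster 2, and the outlier keeps the noise label 0.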

In this recipe, we demonstrate how to use the DBSCAN density-based method to cluster simulated data. First, we have to install and load the mlbench and fpc packages. The mlbench package provides many functions to generate simulated data with different shapes and sizes. In this example, we generate a Cassini problem graph.

Next, we perform dbscan on the Cassini dataset to cluster the data. We set the reachability distance (eps) to 0.2 and the minimum number of points (MinPts) to 2, turn off progress reporting (countmode = NULL), and tell dbscan to treat the input as a distance matrix (method = "dist"). The clustering method successfully separates the data into three clusters with sizes of 200, 200, and 100. By plotting the points together with their cluster labels, we see that the three sections of the Cassini graph are drawn in different colors.
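The data generation and clustering described above could look like the following sketch. The sample size of 500 is inferred from the cluster sizes reported in the text, and the seed is an arbitrary assumption added only to make the run repeatable:

```r
# install.packages(c("mlbench", "fpc"))  # run once if the packages are missing
library(mlbench)
library(fpc)

set.seed(2)                   # arbitrary seed (assumption), for repeatability
p <- mlbench.cassini(500)     # simulated Cassini data; coordinates are in p$x
plot(p$x)

# Cluster on the pairwise distance matrix with eps = 0.2 and MinPts = 2.
# fpc::dbscan is spelled out in case another loaded package also defines dbscan.
ds <- fpc::dbscan(dist(p$x), eps = 0.2, MinPts = 2,
                  countmode = NULL, method = "dist")
ds                            # prints a summary table of cluster sizes
plot(ds, p$x)                 # points colored by cluster label
```

Passing dist(p$x) together with method = "dist" means the distances are computed once up front; for larger datasets, the full distance matrix can become expensive to hold in memory.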

The fpc package also provides a predict function, which you can use to predict the cluster labels of the rows of an input matrix. In this example, point c(0,0) is classified into cluster 3, point c(0, -1.5) is classified into cluster 1, and point c(1,1) is classified into cluster 2.
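A prediction call along these lines is sketched below. The variable names y and pred, the seed, and the sample size are assumptions, and the labels returned on a given run depend on the generated data, so they may not match the numbering quoted above:

```r
library(mlbench)
library(fpc)

set.seed(2)                   # arbitrary seed (assumption)
p <- mlbench.cassini(500)
ds <- fpc::dbscan(dist(p$x), eps = 0.2, MinPts = 2,
                  countmode = NULL, method = "dist")

# One new point per row: c(0, 0), c(0, -1.5), and c(1, 1).
y <- rbind(c(0, 0), c(0, -1.5), c(1, 1))

# predict needs the fitted object, the original coordinates, and the new points.
pred <- predict(ds, p$x, y)
pred                          # one cluster label per row of y; 0 means noise
```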
