How it works...

Instead of taking a heuristic approach to building clusters, model-based clustering uses a probability-based approach. It assumes that the data is generated by an underlying probability distribution and tries to recover that distribution from the data. One common model-based approach uses finite mixture models, which provide a flexible modeling framework for the analysis of the probability distribution. A finite mixture model is a linearly weighted sum of component probability distributions. Assume the data y = (y₁, y₂, ..., yₙ) contains n independent multivariate observations and G is the number of components; the likelihood of a finite mixture model can then be formulated as:

$$L_{MIX}(\theta_1, \ldots, \theta_G; \tau_1, \ldots, \tau_G \mid y) = \prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k f_k(y_i \mid \theta_k)$$

Where $f_k$ and $\theta_k$ are the density and parameters of the kth component in the mixture, and $\tau_k$ ($\tau_k \geq 0$ and $\sum_{k=1}^{G} \tau_k = 1$) is the probability that an observation belongs to the kth component.
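To make the mixture likelihood concrete, the following is a minimal sketch in base R that evaluates the log-likelihood of a univariate Gaussian mixture: for each observation it sums the weighted component densities, then sums the logs over all observations. The function name mixture_loglik, the data vector, and the parameter values are illustrative assumptions, not part of the recipe.

```r
# Sketch: log-likelihood of a univariate Gaussian mixture (base R only).
# mixture_loglik is a hypothetical helper; y, tau, mu, sigma are made-up values.
mixture_loglik <- function(y, tau, mu, sigma) {
  # Mixing proportions must be non-negative and sum to 1
  stopifnot(all(tau >= 0), abs(sum(tau) - 1) < 1e-8)
  # n x G matrix whose (i, k) entry is tau_k * f_k(y_i | theta_k)
  weighted <- sapply(seq_along(tau), function(k) {
    tau[k] * dnorm(y, mean = mu[k], sd = sigma[k])
  })
  weighted <- matrix(weighted, nrow = length(y))
  # Sum over components for each observation, then sum the logs
  sum(log(rowSums(weighted)))
}

y <- c(-1.2, -0.8, 0.1, 3.9, 4.3, 5.0)
loglik <- mixture_loglik(y, tau = c(0.5, 0.5), mu = c(0, 4), sigma = c(1, 1))
print(loglik)
```

Working on the log scale turns the product over observations into a sum, which avoids numerical underflow when n is large.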

The process of model-based clustering has several steps. First, the process selects the number and types of component probability distributions. Then, it fits a finite mixture model and calculates the posterior probabilities of component membership. Lastly, it assigns each observation to the component with the maximum posterior probability.
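The last two steps can be sketched in base R for a univariate Gaussian mixture. This sketch assumes the mixture parameters are already known; in practice they are estimated from the data (mclust uses the EM algorithm for this). The helper name posterior_membership and all numeric values are hypothetical.

```r
# Sketch: posterior membership probabilities and maximum-probability assignment.
# posterior_membership is a hypothetical helper; parameters are assumed known.
posterior_membership <- function(y, tau, mu, sigma) {
  # n x G matrix of weighted component densities tau_k * f_k(y_i | theta_k)
  weighted <- sapply(seq_along(tau), function(k) {
    tau[k] * dnorm(y, mean = mu[k], sd = sigma[k])
  })
  weighted <- matrix(weighted, nrow = length(y))
  # Normalise each row so the probabilities for one observation sum to 1
  weighted / rowSums(weighted)
}

y <- c(-1.2, -0.8, 0.1, 3.9, 4.3, 5.0)
z <- posterior_membership(y, tau = c(0.5, 0.5), mu = c(0, 4), sigma = c(1, 1))

# Assign each observation to the component with the highest posterior probability
cluster <- max.col(z, ties.method = "first")
# One minus the largest posterior probability measures how uncertain each assignment is
uncertainty <- 1 - apply(z, 1, max)
print(cluster)
```

With these made-up values, the three observations near 0 fall in the first component and the three near 4 fall in the second.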

In this recipe, we demonstrate how to use model-based clustering to cluster data. We first install and load the mclust package into R. We then fit a model-based clustering to the customer data by using the Mclust function.
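A minimal sketch of this fitting step follows. Because the recipe's customer data is not reproduced here, the built-in faithful dataset is used as a stand-in, and the object name mb is an assumption.

```r
# install.packages("mclust")  # uncomment if the package is not yet installed
library(mclust)

# 'faithful' is a built-in R dataset standing in for the customer data
mb <- Mclust(faithful)

head(mb$classification)  # cluster label assigned to each observation
head(mb$z)               # posterior membership probabilities, one column per component
```

When called with only the data, Mclust tries a range of component counts and covariance structures and keeps the model with the best BIC.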

After the model is fit to the data, we plot the clustering results. There are four different plots: BIC, classification, uncertainty, and density. The BIC plot shows the BIC value of each candidate model against the number of components, and one can use it to choose the number of clusters; mclust defines BIC so that larger values indicate a better model. The classification plot shows how the data is clustered for each pair of dimensions. The uncertainty plot shows the classification uncertainty for each pair of dimensions. The density plot shows the estimated density as contours.
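The four plots can be requested one at a time through the what argument of the plot method for Mclust objects. The sketch below again uses the built-in faithful dataset as a stand-in for the customer data, and the object name mb is an assumption.

```r
library(mclust)

# Stand-in data; replace 'faithful' with the customer data from the recipe
mb <- Mclust(faithful)

plot(mb, what = "BIC")             # BIC of each candidate model by number of components
plot(mb, what = "classification")  # observations coloured by assigned cluster
plot(mb, what = "uncertainty")     # larger symbols mark less certain assignments
plot(mb, what = "density")         # contours of the estimated mixture density
```

Calling plot(mb) without the what argument in an interactive session shows a menu to pick among the same plots.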

You can also use the summary function to obtain the most likely model and the most likely number of clusters. For this example, the most likely number of clusters is five, with a BIC value equal to -556.1142.
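As a rough illustration, the same information can be read from the fitted object directly. The sketch uses the built-in faithful dataset as a stand-in, so its output will differ from the five clusters and BIC value reported for the customer data.

```r
library(mclust)

# Stand-in data; the customer data from the recipe is not reproduced here
mb <- Mclust(faithful)

summary(mb)    # selected model, number of components, log-likelihood, BIC
mb$modelName   # covariance structure of the selected model
mb$G           # number of components (clusters) in the selected model
mb$bic         # BIC value of the selected model
```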
