Gaussian mixture models (GMMs) are a popular technique for soft clustering. A GMM models the data as a finite mixture of Gaussian distributions: every point is assumed to be drawn from one of K Gaussian component distributions, each with its own mixing probability, and the model computes, for each point, the probability that it belongs to each cluster along with the per-cluster statistics. In short, the functionality of GMM can be described in a three-step pseudocode:
- Objective function: Compute and maximize the log-likelihood, using expectation–maximization (EM) as the framework
- EM algorithm:
- E step: Compute the posterior probability of membership (the responsibility) of each component for each data point
- M step: Optimize the model parameters given those probabilities
- Assignment: Perform soft assignment as part of the E step
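The steps above can be sketched in plain Python for a one-dimensional mixture. This is a toy illustration of the E and M steps, not Spark MLlib's implementation; the function name `em_gmm_1d` and its quantile-based initialization are assumptions made for the example.

```python
import math
import random

def em_gmm_1d(xs, k=2, iters=50):
    """Illustrative EM fit of a 1-D Gaussian mixture (toy sketch, not MLlib)."""
    xs_sorted = sorted(xs)
    n = len(xs)
    # Initialization (an assumption for this sketch): quantile-spaced means,
    # unit variances, uniform mixing weights
    mu = [xs_sorted[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    var = [1.0] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E step: posterior membership probability (responsibility) of each
        # component for each point -- this is the soft assignment
        resp = []
        for x in xs:
            dens = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                    / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: re-estimate weights, means, and variances from the
        # responsibilities, which maximizes the expected log-likelihood
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, xs)) / nj, 1e-6)
    return w, mu, var

# Two well-separated synthetic clusters around 0 and 10
rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(300)]
        + [rng.gauss(10.0, 1.0) for _ in range(300)])
weights, means, variances = em_gmm_1d(data, k=2)
```

Running the sketch on the synthetic data recovers component means close to 0 and 10 and mixing weights close to 0.5 each.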
Technically, given a statistical model, its parameters are estimated from a data set using maximum-likelihood estimation (MLE). The EM algorithm, in turn, is an iterative procedure for finding such a maximum-likelihood estimate.
The Spark MLlib implementation uses the expectation–maximization algorithm to induce the maximum-likelihood model from a given set of data points. The current implementation uses the following parameters:
- k is the number of desired clusters into which to group the data points
- convergenceTol is the maximum change in log-likelihood at which convergence is considered achieved
- maxIterations is the maximum number of iterations to perform if convergence is not reached first
- initialModel is an optional starting point from which to start the EM algorithm. If this parameter is omitted, a random starting point will be constructed from the data
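The parameters above map directly onto MLlib's RDD-based `GaussianMixture.train` call. The following is a usage sketch assuming a running Spark environment; the input path `data/gmm_points.txt` is a placeholder, and `initialModel` is left at its default so a random starting point is constructed from the data.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import GaussianMixture

sc = SparkContext(appName="GMMExample")

# Parse a whitespace-separated file of numeric feature vectors
# (the path is a placeholder for this sketch)
data = sc.textFile("data/gmm_points.txt") \
         .map(lambda line: [float(t) for t in line.split()])

# k, convergenceTol, and maxIterations correspond to the parameters
# described above; initialModel is omitted here
model = GaussianMixture.train(data, k=3,
                              convergenceTol=1e-3,
                              maxIterations=100)

# Inspect the fitted mixture: one weight and one Gaussian per component
for i in range(model.k):
    print("weight=%f, mu=%s, sigma=%s" %
          (model.weights[i], model.gaussians[i].mu, model.gaussians[i].sigma))
```

Calling `model.predictSoft(data)` then returns, for each point, its membership probabilities across the k components, which is the soft assignment discussed earlier.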