Unsupervised learning

If we strip the labels from the Iris dataset, it would be nice if an algorithm could recover the original grouping, even without the exact label names (setosa, versicolor, and virginica). Unsupervised learning has many applications in compression and encoding, CRM, recommendation engines, and security, where it uncovers internal structure without access to the labels. Labels can sometimes be assigned afterwards based on peculiarities in the attribute value distributions; for example, Iris setosa could be described as a flower with small leaves.

While a supervised learning problem can always be cast as unsupervised by disregarding the label, the reverse is also true: an unsupervised problem can be reduced to a supervised one by assigning label 1 to all observed vectors and label 0 to randomly generated ones, which turns density estimation into classification (The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer Series in Statistics). The distinction between the two is thus largely formal, and it becomes even fuzzier with unstructured and nested data. Often, running unsupervised algorithms on labeled datasets leads to a better understanding of the dependencies, and thus to a better selection and performance of the supervised algorithm.
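
To make this reduction concrete, here is a minimal plain-Scala sketch (with a tiny made-up sample, not part of MLlib) of turning an unlabeled set of vectors into a two-class training set: the observed rows get label 1 and uniformly generated rows within the same bounding box get label 0, so that any classifier trained on it effectively learns where the real data is dense.

import scala.util.Random

// observed, unlabeled vectors (a tiny made-up sample in the spirit of the Iris data)
val observed = Array(
  Array(5.1, 3.5, 1.4, 0.2),
  Array(6.7, 3.1, 5.6, 2.4),
  Array(5.9, 3.0, 4.2, 1.5))

// bounding box of the observed data: one (min, max) pair per attribute
val mins = observed.transpose.map(_.min)
val maxs = observed.transpose.map(_.max)

// real vectors get label 1.0, uniform random vectors in the same box get label 0.0
val rng = new Random(42)
val positives = observed.map(v => (1.0, v))
val negatives = Array.fill(observed.length) {
  (0.0, mins.zip(maxs).map { case (lo, hi) => lo + rng.nextDouble() * (hi - lo) })
}
val trainingSet = positives ++ negatives
// any supervised classifier trained on trainingSet now approximates
// the regions where the original, unlabeled data is dense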

One of the most popular algorithms for clustering and unsupervised learning is k-means (its variants, such as k-medians and k-medoids, are described later in this section):

$ bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> val iris = sc.textFile("iris.txt")
iris: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at textFile at <console>:23

scala> val vectors = iris.map(s => Vectors.dense(s.split('\t').map(_.toDouble))).cache()
vectors: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[5] at map at <console>:25

scala> val numClusters = 3
numClusters: Int = 3
scala> val numIterations = 20
numIterations: Int = 20
scala> val clusters = KMeans.train(vectors, numClusters, numIterations)
clusters: org.apache.spark.mllib.clustering.KMeansModel = org.apache.spark.mllib.clustering.KMeansModel@5dc9cb99
scala> val centers = clusters.clusterCenters
centers: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.005999999999999,3.4180000000000006,1.4640000000000002,0.2439999999999999], [6.8538461538461535,3.076923076923076,5.715384615384614,2.0538461538461537], [5.883606557377049,2.740983606557377,4.388524590163936,1.4344262295081966])
scala> val SSE = clusters.computeCost(vectors)
SSE: Double = 78.94506582597859
scala> vectors.collect.map(x => clusters.predict(x))
res18: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2)
scala> println("Sum of Squared Errors = " + SSE)
Sum of Squared Errors = 78.94506582597859
scala> clusters.save(sc, "model")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
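
The saved model can be read back later, for example in a new shell session, with KMeansModel.load (already imported above); a quick sketch, with the output omitted:

scala> val sameModel = KMeansModel.load(sc, "model")
scala> sameModel.clusterCenters.foreach(println)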

One can see that the first center, the one with index 0, has a petal length and width of 1.464 and 0.244, which are much smaller than those of the other two centers (5.715 and 2.054, and 4.389 and 1.434, respectively). The predictions match the first cluster, corresponding to Iris setosa, perfectly, but there are a few mispredictions between the other two.
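
To quantify the mispredictions, one can tally the predicted cluster indices against the known species order; the sketch below assumes that iris.txt is sorted by species with 50 rows per class (the standard ordering of the dataset), and the output is omitted:

scala> val species = Array.fill(50)("setosa") ++ Array.fill(50)("versicolor") ++ Array.fill(50)("virginica")
scala> val predictions = vectors.collect.map(v => clusters.predict(v))
scala> species.zip(predictions).groupBy(identity).mapValues(_.length).foreach(println)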

The measure of cluster quality could depend on the (desired) labels if we wanted to reproduce a particular classification, but since the algorithm has no information about the labeling, a more common measure is the within-set sum of squared errors (WSSSE): the sum of squared distances from the points to the centroid of their cluster. Here is how WSSSE behaves depending on the number of clusters:

scala> 1.to(10).foreach(i => println("i: " + i + " WSSSE: " + KMeans.train(vectors, i, numIterations).computeCost(vectors)))
i: 1 WSSSE: 680.8244
i: 2 WSSSE: 152.3687064773393
i: 3 WSSSE: 78.94506582597859
i: 4 WSSSE: 57.47327326549501
i: 5 WSSSE: 46.53558205128235
i: 6 WSSSE: 38.9647878510374
i: 7 WSSSE: 34.311167589868646
i: 8 WSSSE: 32.607859500805034
i: 9 WSSSE: 28.231729411088438
i: 10 WSSSE: 29.435054384424078

As expected, the total WSSSE decreases as more clusters are configured (apart from an occasional fluctuation, since k-means only converges to a local optimum). A common method to determine the optimal number of clusters (in our example, we know that there are three types of flowers) is to add a penalty function. A common penalty is the logarithm of the number of clusters, since we expect the penalized metric to be a convex function. What should the coefficient in front of the log be? If each vector is assigned its own cluster, the sum of all distances is zero, so if we want a metric that takes approximately the same value at both ends of the range of possible values, 1 to 150, the coefficient should be 680.8244/log(150):

scala> for (i <- 1.to(10)) println(i + " -> " + ((KMeans.train(vectors, i, numIterations).computeCost(vectors)) + 680 * scala.math.log(i) / scala.math.log(150)))
1 -> 680.8244
2 -> 246.436635016484
3 -> 228.03498068120865
4 -> 245.48126639400738
5 -> 264.9805962616268
6 -> 285.48857890531764
7 -> 301.56808340425164
8 -> 315.321639004243
9 -> 326.47262191671723
10 -> 344.87130979355675

Here is how the sum of the squared distances with penalty looks as a graph:


Figure 04-2. The measure of the clustering quality as a function of the number of clusters

Besides k-means clustering, MLlib also has implementations of the following:

  • Gaussian mixture
  • Power Iteration Clustering (PIC)
  • Latent Dirichlet Allocation (LDA)
  • Streaming k-means

The Gaussian mixture is another classical mechanism, particularly known for spectral analysis. A Gaussian mixture decomposition is appropriate when the attributes are continuous and we know that they are likely to come from a set of Gaussian distributions. For example, two groups of points corresponding to two clusters may have the same mean in each attribute, say Var1 and Var2, while the points are actually spread around two intersecting hyperplanes, as shown in the following diagram:


Figure 04-3. A mixture of two Gaussians that cannot be properly described by k-means clustering

This renders the k-means algorithm ineffective, as it will not be able to distinguish between the two (of course, a simple non-linear transformation, such as the distance to one of the hyperplanes, would solve the problem, but this is where domain knowledge and data science expertise come in handy).
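
For reference, fitting a Gaussian mixture in MLlib follows the same pattern as k-means; here is a minimal sketch on the same iris vectors (setting the number of components to 3 is our assumption, and the output is omitted):

scala> import org.apache.spark.mllib.clustering.GaussianMixture
scala> val gmm = new GaussianMixture().setK(3).run(vectors)
scala> gmm.weights.zip(gmm.gaussians).foreach { case (w, g) => println("weight: " + w + " mu: " + g.mu) }   // g.sigma holds the covariance matrix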

PIC clusters the vertices of a graph given pairwise similarity measures as edge properties. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via power iteration and uses it to cluster the vertices. MLlib includes an implementation of PIC using GraphX as its backend. It takes an RDD of (srcId, dstId, similarity) tuples and outputs a model with the clustering assignments. The similarities must be non-negative. PIC assumes that the similarity measure is symmetric. A pair (srcId, dstId), regardless of ordering, should appear at most once in the input data. If a pair is missing from the input, its similarity is treated as zero.
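
A minimal sketch of invoking PIC in MLlib; the similarity triples below are made-up toy values, only meant to show the expected (srcId, dstId, similarity) input format:

scala> import org.apache.spark.mllib.clustering.PowerIterationClustering
scala> val similarities = sc.parallelize(Seq((0L, 1L, 0.9), (1L, 2L, 0.8), (2L, 3L, 0.1), (3L, 4L, 0.9)))
scala> val picModel = new PowerIterationClustering().setK(2).setMaxIterations(10).run(similarities)
scala> picModel.assignments.collect.foreach(a => println(a.id + " -> " + a.cluster))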

LDA can be used for clustering documents based on keyword frequencies. Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
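
Here is a minimal sketch of running LDA in MLlib; documents are keyed by an ID and represented as vectors of term counts, and the tiny two-document corpus below is made up purely for illustration:

scala> import org.apache.spark.mllib.clustering.LDA
scala> val corpus = sc.parallelize(Seq((0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)), (1L, Vectors.dense(0.0, 1.0, 4.0, 0.0))))
scala> val ldaModel = new LDA().setK(2).run(corpus)
scala> println(ldaModel.topicsMatrix)   // term x topic matrix of weights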

Finally, streaming k-means is a modification of the k-means algorithm, where the clusters can be adjusted as new batches of data arrive. For each batch of data, we assign all points to their nearest cluster, compute the new cluster centers based on this assignment, and then update each cluster's parameters using the following equations:

$$c_{t+1} = \frac{\alpha\, n_t c_t + n'_t c'_t}{\alpha\, n_t + n'_t}$$

$$n_{t+1} = n_t + n'_t$$

Here, c_t and c'_t are the centers from the old model and the ones computed for the new batch, and n_t and n'_t are the number of vectors in the old model and in the new batch. By changing the α parameter, we can control how much information from the old runs influences the clustering: 0 means the new cluster centers are based entirely on the points in the new batch, while 1 means that we account for all the points seen so far.
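
In MLlib, this is available as StreamingKMeans, where the decay factor plays the role of α. The sketch below assumes a DStream of vectors called trainingStream defined elsewhere (for instance, from a directory or socket source), so the training calls are shown commented out:

scala> import org.apache.spark.mllib.clustering.StreamingKMeans
scala> val streamingModel = new StreamingKMeans().setK(3).setDecayFactor(1.0).setRandomCenters(4, 0.0)
scala> // streamingModel.trainOn(trainingStream)                      // update the centers with each new batch
scala> // streamingModel.latestModel.clusterCenters.foreach(println)  // inspect the current centers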

k-means clustering has many modifications. For example, k-medians computes the cluster centers as the medians of the attribute values rather than the means, which works much better for some distributions and with the L1 distance metric (the absolute value of the difference) as opposed to L2 (the sum of squares). The k-medians centers are not necessarily actual points of the dataset. K-medoids is another algorithm from the same family, where the resulting cluster center has to be an actual instance of the input set; here we do not need the attribute values themselves, only the pairwise distances between the points. Many variations of these techniques exist for choosing the initial seed cluster centers and for converging on the optimal number of clusters (besides the simple log trick shown earlier).
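
The difference is easiest to see in the center-update step; the following is a minimal plain-Scala sketch (not an MLlib API) of the coordinate-wise median update for a single cluster:

// points currently assigned to one cluster (made-up 2-attribute values)
val members = Array(Array(5.1, 3.5), Array(4.9, 3.0), Array(4.7, 3.2), Array(5.0, 3.6))

// coordinate-wise median of one attribute column
def median(column: Array[Double]): Double = {
  val sorted = column.sorted
  val n = sorted.length
  if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}

// the k-medians center update; k-means would use the mean of each column instead
val medianCenter = members.transpose.map(median)
val meanCenter = members.transpose.map(col => col.sum / col.length)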

Another big class of clustering algorithms is hierarchical clustering. Hierarchical clustering is done either from the top, akin to the decision tree algorithms, or from the bottom: we first find the closest neighbors, pair them, and continue the pairing process up the hierarchy until all records are merged. The advantage of hierarchical clustering is that it can be made deterministic and is relatively fast, even though the cost of a single k-means iteration is likely to be lower. However, as mentioned, an unsupervised problem can be converted to a density-estimation supervised problem, with all the supervised learning techniques at hand. So have fun understanding the data!
