In unsupervised learning, data points have no labels related to them; therefore, we need to put labels on them algorithmically. In other words, the correct classes of the training dataset in unsupervised learning are unknown.
Consequently, classes have to be inferred from the unstructured datasets which implies that the goal of an unsupervised learning algorithm is to pre-process the data in some structured ways by describing its structure. The main objective of the unsupervised learning algorithms or techniques is to explore the unknown patterns of the input data that are mostly unlabeled. In this way, it is closely related to the problem of density estimation used in theoretical and applied statistics.
Unsupervised learning, however, also comprehends many other techniques to summarize and explain the key features of the data including exploratory data analysis for finding these hidden patterns, even grouping the data points or features and applying the unsupervised learning technique based on data mining methods for the data pre-processing.
To overcome this obstacle in unsupervised learning, clustering techniques are used typically to group the unlabeled samples based on certain similarity measures, mining hidden patterns towards feature learning.
For in-depth theoretical knowledge, how the unsupervised algorithms work, please refer to these three books: Bousquet, O.; von Luxburg, U.; Raetsch, G., eds. (2004). Advanced Lectures on Machine Learning. Springer-Verlag. ISBN 978-3540231226. Or Duda, Richard O.; Hart, Peter E.; Stork, David G. (2001). Unsupervised Learning and Clustering. Pattern Classification (2nd Ed.). Wiley. ISBN 0-471-05669-3 and Jordan, Michael I.; Bishop, Christopher M. (2004). Neural Networks. In Allen B. Tucker. Computer Science Handbook, Second Edition (Section VII: Intelligent Systems). Boca Raton, FL: Chapman & Hall/CRC Press LLC. ISBN 1-58488-360-X.
In clustering, an algorithm groups objects into categories by analyzing similarities between input examples where similar objects or features are clustered and marked using circles around them.
Clustering uses include: Search results grouping such as grouping of customers, anomaly detection for suspicious pattern finding, text categorization for finding useful patterns in the tests, social network analysis for finding coherent groups, data center computing clusters for finding a way of putting related computers together to improve performance, astronomic data analysis for galaxy formation, and real estate data analysis for identifying neighborhoods based on similar features. Moreover, clustering uses unsupervised algorithms, which do not have the outputs in advance.
Clustering using the K-means algorithm begins by initializing all the coordinates to centroids. Note that Spark also supports other clustering algorithms such as Gaussian mixture, Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA), Bisecting k-means, and Streaming k-means. Whereas, the Gaussian mixture is mainly used for expectation minimization as an optimization algorithm, the LDA, on the other hand, is used for the document classification and clustering. PIC is used for the clustering vertices of a graph given pairwise similarities as edge properties. Bisecting K-means is faster than the regular K-means, but it will generally produce a different clustering. Therefore, to keep the discussion simpler we will use the K-means algorithm for our purposes.
Interested readers should refer the Spark ML and Spark MLlib based clustering techniques at https://spark.apache.org/docs/latest/ml-clustering.html and https://spark.apache.org/docs/latest/mllib-clustering.html web pages respectively to get more insights. With every pass of the algorithm, each point is assigned to its nearest centroid based on some distance metric, usually the Euclidean distance.
Note that there are other ways to calculate the distance, for example, the Chebyshev distance is used to measure the distance by considering only the most significant dimensions. The Hamming distance algorithm is used to identify the difference bit by bit of two strings. The Mahalanobis distance is used to normalize the covariance matrix to make the distance metric scale-invariant.
The Manhattan distance is used to measure the distance following only axis-aligned directions. The Minkowski distance algorithm is used to make the Euclidean distance, Manhattan distance, and Chebyshev distance generalize. The Haversine distance is used to measure the great-circle distances between two points on a sphere from their longitudes and latitudes. Considering these distance measuring algorithms, it is clear that the Euclidean distance algorithm would be the most appropriate to solve our problem.
The centroids are then updated to be the centers of all the points assigned to it in that pass. This repeats until there is a minimum change in the centers. The k-means algorithm is an iterative algorithm and works in two steps:
We will use the Saratoga NY Homes downloaded from the URL http://course1.winona.edu/bdeppa/Stat%20425/Datasets.html to demonstrate an example of clustering as an unsupervised learning technique using Spark in Java. The dataset contains several features as follows: Price, Lot Size, Waterfront, Age, Land Value, New Construct, Central Air, Fuel Type, Heat Type, Sewer Type, Living Area, Pct.College, Bedrooms, Fireplaces, Bathrooms, and the number of Rooms. However, among those columns, we have shown only some selected columns in Table 3. Note that the original dataset was downloaded and later on converted into a corresponding text file as a tab delimiter:
Price |
Lot Size |
Water Front |
Age |
Land Value |
Rooms |
132500 |
0.09 |
0 |
42 |
5000 |
5 |
181115 |
0.92 |
0 |
0 |
22300 |
6 |
109000 |
0.19 |
0 |
133 |
7300 |
8 |
155000 |
0.41 |
0 |
13 |
18700 |
5 |
86060 |
0.11 |
0 |
0 |
15000 |
3 |
120000 |
0.68 |
0 |
31 |
14000 |
8 |
153000 |
0.4 |
0 |
33 |
23300 |
8 |
170000 |
1.21 |
0 |
23 |
146000 |
9 |
90000 |
0.83 |
0 |
36 |
222000 |
8 |
122900 |
1.94 |
0 |
4 |
212000 |
6 |
325000 |
2.29 |
0 |
123 |
126000 |
12 |
Table 3: Sample data from the "Saratoga NY Homes" dataset
We further took only the first two features (that is, Price and Lot Size) using the Spark feature learning algorithm presented in the previous chapter for simplicity. Our target is to show an exploratory analysis based on these two features for possible neighborhoods of the house located in the same area. First, look at the basic scatter plot diagram based on the value in the dataset:
It's clearly seen that there are four cluster on the plot marked as circles in Figure 8. However, finding a number of clusters is a tricky task. Here, we have the advantage of visual inspection, which is not available for data on hyperplanes or multidimensional data. Now we need to find the same result using Spark. For simplicity, we will use the K-means clustering API of Spark. The use of raw data and finding feature vectors is shown in Figure 9:
Before performing the feature extraction, we need to load and parse the Saratoga NY Homes dataset. This step also includes: loading packages and related dependencies, reading the dataset as RDD, model training and prediction, collecting the local parsed data, and comparing clustering.
Step 1: Import statistics and related classes
Here is the code to import statistics and related classes:
import java.io.Serializable; import java.util.List; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; import org.apache.spark.mllib.clustering.KMeans; import org.apache.spark.mllib.clustering.KMeansModel; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.rdd.RDD; import org.apache.spark.sql.SparkSession;
Step 2: Create the Spark session
Here is the code to create a Spark session:
static SparkSession spark = SparkSession .builder().appName("JavaLDAExample") .master("local[*]") .config("spark.sql.warehouse.dir", "E:/Exp/") .getOrCreate();
Step 3: Load the Saratoga NY Homes.txt
Read, parse, and create RDDs from the dataset:
RDD<String> data = spark.sparkContext().textFile("input/Saratoga_ NY_Homes.txt", 2);
Step 4:Transform the data into an RDD of dense vectors
If you follow the preceding step carefully, actually we have created the normal RDD. Therefore, that RDD has to be converted into the corresponding JavaRDD
before mapping into a dense vector using Vector:
JavaRDD<Vector> parsedData = data.toJavaRDD().map(new Function<String, Vector>() { @Override public Vector call(String s) throws Exception { String[] sarray = s.split(","); double[] values = new double[sarray.length]; for (int i = 0; i < sarray.length; i++) values[i] = Double.parseDouble(sarray[i]); return Vectors.dense(values); } });
Step 5: Train the model
Train the model by specifying four clusters and five iterations. Just refer the following code for doing that:
int numClusters = 4; int numIterations = 10; int runs = 2; KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations, runs , KMeans.K_MEANS_PARALLEL()); Now estimate the cost to compute the clsuters as follows: double cost = clusters.computeCost(parsedData.rdd()); System.out.println("Cost: " + cost);
You should receive the results as follows:
Cost: 3.60148995801542E12
Step 6: Show the cluster centers
Vector[] centers = clusters.clusterCenters(); System.out.println("Cluster Centers: "); for (Vector center : centers) { System.out.println(center); }
The preceding code should produce the center of the clusters as follows:
[545360.4081632652,0.9008163265306122,0.1020408163265306,21.73469387755102,111630.61224489794,0.061224489795918366,0.7551020408163265,2.3061224489795915,2.1632653061224487,2.714285714285714,2860.755102040816,59.346938775510196,3.510204081632653,1.1020408163265305,2.714285714285714,10.061224489795917] [134073.06845637583,0.3820000000000002,0.0026845637583892616,33.72617449664429,19230.76510067114,0.012080536912751677,0.22818791946308722,2.621476510067114,2.7234899328859057,2.6630872483221477,1332.9234899328858,52.86040268456375,2.7395973154362414,0.38120805369127514,1.4946308724832214,5.806711409395973] [218726.0625,0.5419711538461538,0.0,25.495192307692307,32579.647435897434,0.041666666666666664,0.3830128205128205,2.3205128205128203,2.4615384615384617,2.692307692307692,1862.3076923076922,57.4599358974359,3.3894230769230766,0.7019230769230769,2.032852564102564,7.44551282051282] [332859.0580645161,0.6369354838709671,0.025806451612903226,19.803225806451614,63188.06451612903,0.13870967741935483,0.6096774193548387,2.2225806451612904,2.2483870967741937,2.774193548387097,2378.4290322580646,57.66774193548387,3.6225806451612903,0.8516129032258064,2.479032258064516,8.719354838709677]
Step 7: Evaluate the model error rate
double WSSSE = clusters.computeCost(parsedData.rdd()); System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
This should produce the result as follows:
Within Set Sum of Squared Errors = 3.60148995801542E12
Step 8: Predict the cluster for the second element
List<Vector> houses = parsedData.collect(); int prediction = clusters.predict(houses.get(18)); System.out.println("Prediction: "+prediction);
Output prediction: 0
Step 9: Stop the Spark session
Stop the Spark session using the stop()
method as follows:
spark.stop();
Step 10: Cluster comparing
Now let's compare the cluster assignments by k-means versus the ones we have done individually. The k-means algorithm gives the cluster IDs starting from 0. Once you inspect the data, you find out the following mapping between the A to D cluster IDs we gave versus K-means in Table 4:
Cluster name |
Cluster number |
Cluster assignment |
A |
3 |
A=>3 |
B |
1 |
B=>1 |
C |
0 |
C=>0 |
D |
2 |
D=>2 |
Table 4: Cluster assignment for the neighbourhood K-means clustering example
Now, let's pick some of the data from different parts of the chart and predict which cluster it belongs to. Let's look at the house (say 1 as an example) data, which has a lot size of 876 square feet and is priced at $665K:
int prediction = clusters.predict(houses.get(18)); System.out.println("Prediction: "+prediction);
[Output] Prediction: 2
That means the house with the preceding properties falls in the cluster 2. You can test the prediction capability with more data of course. Let's do some neighborhood analysis to see what meaning these clusters carry. We can assume that most of the houses in cluster 3 are near downtown. The cluster 2 houses are on hilly terrain for example.
In this example, we dealt with a very small set of features; common sense and visual inspection would also lead us to the same conclusions. However, if you want to acquire more accuracy, of course, you should construct more meaningful features by not only considering only the lot size and the house price but other features like the number of rooms, house age, land value, heating type, and so on.
However, it would not be wise to include the Waterfront as a meaningful feature since no house has a water garden in front of the house in this example. We will provide a detailed analysis towards the better accuracy of meaningful a prediction in next chapter, where we will demonstrate these considerations.
The beauty of the k-means algorithm is that it does the clustering on the data with an unlimited number of features. It is a great tool to use when you have a raw data and would like to know the patterns in that data. However, deciding pn the number of clusters prior to doing the experiment might not be successful but sometimes may lead to an overfitting problem or an underfitting problem.
To overcome the aformentioned limitation of the K-means, we have some more robust algorithms like Markov Chain Monte Carlo (MCMC , see also https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) presented in Tribble, Seth D., Markov chain Monte Carlo algorithms using completely uniformly distributed driving sequences, Diss. Stanford University, 2007. Moreover, a more technical discussion can be found at the URL http://www.autonlab.org/tutorials/kmeans11.pdf too.