Unsupervised learning

In unsupervised learning, data points have no labels associated with them; therefore, labels have to be assigned to them algorithmically. In other words, the correct classes of the training dataset are unknown.

Consequently, the classes have to be inferred from the unstructured dataset, which means that the goal of an unsupervised learning algorithm is to pre-process the data by describing its structure in some meaningful way. The main objective of unsupervised learning algorithms is to explore the unknown patterns in input data that is mostly unlabeled. In this way, unsupervised learning is closely related to the problem of density estimation used in theoretical and applied statistics.

Unsupervised learning, however, also comprises many other techniques for summarizing and explaining the key features of the data, including exploratory data analysis for finding hidden patterns, grouping of data points or features, and data-mining-based techniques for data pre-processing.

To cope with the absence of labels, clustering techniques are typically used to group the unlabeled samples based on certain similarity measures, mining hidden patterns towards feature learning.

Tip

For in-depth theoretical coverage of how unsupervised algorithms work, please refer to these three books: Bousquet, O.; von Luxburg, U.; Raetsch, G., eds. (2004). Advanced Lectures on Machine Learning. Springer-Verlag. ISBN 978-3540231226; Duda, Richard O.; Hart, Peter E.; Stork, David G. (2001). Unsupervised Learning and Clustering. Pattern Classification (2nd Ed.). Wiley. ISBN 0-471-05669-3; and Jordan, Michael I.; Bishop, Christopher M. (2004). Neural Networks. In Allen B. Tucker. Computer Science Handbook, Second Edition (Section VII: Intelligent Systems). Boca Raton, FL: Chapman & Hall/CRC Press LLC. ISBN 1-58488-360-X.

Unsupervised learning example

In clustering, an algorithm groups objects into categories by analyzing the similarities between input examples; similar objects or features end up in the same cluster, which is typically marked with a circle drawn around its members.

Typical uses of clustering include: grouping of search results or customers, anomaly detection for finding suspicious patterns, text categorization for finding useful patterns in texts, social network analysis for finding coherent groups, data center computing for placing related computers together to improve performance, astronomical data analysis for studying galaxy formation, and real estate data analysis for identifying neighborhoods based on similar features. Moreover, clustering uses unsupervised algorithms, which do not know the outputs in advance.

Clustering with the K-means algorithm begins by initializing a set of centroids. Note that Spark also supports other clustering algorithms such as Gaussian mixture, Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA), Bisecting k-means, and Streaming k-means. The Gaussian mixture model is fit using expectation maximization as the optimization algorithm, while LDA is used for document classification and clustering. PIC is used for clustering the vertices of a graph given pairwise similarities as edge properties. Bisecting K-means is faster than regular K-means, but it will generally produce a different clustering. Therefore, to keep the discussion simple, we will use the K-means algorithm for our purposes.

Interested readers should refer to the Spark ML and Spark MLlib clustering documentation at https://spark.apache.org/docs/latest/ml-clustering.html and https://spark.apache.org/docs/latest/mllib-clustering.html, respectively, for more insights. With every pass of the algorithm, each point is assigned to its nearest centroid based on some distance metric, usually the Euclidean distance.

Note that there are other ways to calculate the distance. For example, the Chebyshev distance measures the distance by considering only the most significant dimension. The Hamming distance identifies, bit by bit, where two strings differ. The Mahalanobis distance normalizes by the covariance matrix to make the distance metric scale-invariant.

The Manhattan distance measures the distance following only axis-aligned directions. The Minkowski distance generalizes the Euclidean, Manhattan, and Chebyshev distances. The Haversine distance measures the great-circle distance between two points on a sphere from their longitudes and latitudes. Considering these distance-measuring algorithms, the Euclidean distance is the most appropriate for our problem.
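
To make the differences between some of these measures concrete, here is a small hand-rolled Java sketch of the Euclidean, Manhattan, and Chebyshev distances between two feature vectors. The class and method names are ours for illustration only and are not part of Spark's API; Spark's K-means uses the Euclidean distance by default:

public class DistanceExamples { 
    // Euclidean distance: square root of the sum of squared differences 
    static double euclidean(double[] a, double[] b) { 
        double sum = 0.0; 
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]); 
        return Math.sqrt(sum); 
    } 
 
    // Manhattan distance: sum of absolute differences along each axis 
    static double manhattan(double[] a, double[] b) { 
        double sum = 0.0; 
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]); 
        return sum; 
    } 
 
    // Chebyshev distance: largest absolute difference over all dimensions 
    static double chebyshev(double[] a, double[] b) { 
        double max = 0.0; 
        for (int i = 0; i < a.length; i++) max = Math.max(max, Math.abs(a[i] - b[i])); 
        return max; 
    } 
 
    public static void main(String[] args) { 
        double[] p = {132500, 0.09}; // sample (Price, Lot Size) values for illustration 
        double[] q = {181115, 0.92}; 
        System.out.println("Euclidean: " + euclidean(p, q)); 
        System.out.println("Manhattan: " + manhattan(p, q)); 
        System.out.println("Chebyshev: " + chebyshev(p, q)); 
    } 
} 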

The centroids are then updated to be the centers of all the points assigned to them in that pass. This repeats until the change in the centers is minimal. The k-means algorithm is an iterative algorithm that works in two steps (a minimal sketch of both steps follows the list):

  • Cluster assignment step: The algorithm goes through each data point and assigns it to the nearest centroid and, in turn, to the cluster that centroid represents
  • Move centroid step: The algorithm takes each centroid and moves it to the mean of the data points assigned to its cluster
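
The following is a minimal, self-contained Java sketch of these two steps on a handful of 2-D points. The class name, the sample values, the initial centroids, and the fixed number of iterations are ours for illustration only; in the rest of this chapter we rely on Spark MLlib's KMeans instead:

public class KMeansSketch { 
    public static void main(String[] args) { 
        double[][] points = {{132500, 0.09}, {181115, 0.92}, {109000, 0.19}, {325000, 2.29}}; 
        double[][] centroids = {{120000, 0.1}, {300000, 2.0}}; // k = 2, initialized arbitrarily 
 
        for (int iter = 0; iter < 10; iter++) { 
            // Cluster assignment step: assign each point to its nearest centroid 
            int[] assignment = new int[points.length]; 
            for (int p = 0; p < points.length; p++) { 
                double best = Double.MAX_VALUE; 
                for (int c = 0; c < centroids.length; c++) { 
                    double dx = points[p][0] - centroids[c][0]; 
                    double dy = points[p][1] - centroids[c][1]; 
                    double dist = dx * dx + dy * dy; // squared Euclidean distance 
                    if (dist < best) { best = dist; assignment[p] = c; } 
                } 
            } 
            // Move centroid step: move each centroid to the mean of its assigned points 
            for (int c = 0; c < centroids.length; c++) { 
                double sumX = 0, sumY = 0; int count = 0; 
                for (int p = 0; p < points.length; p++) { 
                    if (assignment[p] == c) { sumX += points[p][0]; sumY += points[p][1]; count++; } 
                } 
                if (count > 0) { centroids[c][0] = sumX / count; centroids[c][1] = sumY / count; } 
            } 
        } 
        for (double[] c : centroids) System.out.println(c[0] + ", " + c[1]); 
    } 
} 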

Unsupervised learning with Spark - an example

We will use the Saratoga NY Homes dataset, downloaded from http://course1.winona.edu/bdeppa/Stat%20425/Datasets.html, to demonstrate clustering as an unsupervised learning technique using Spark in Java. The dataset contains the following features: Price, Lot Size, Waterfront, Age, Land Value, New Construct, Central Air, Fuel Type, Heat Type, Sewer Type, Living Area, Pct.College, Bedrooms, Fireplaces, Bathrooms, and the number of Rooms. However, among those columns, only some selected columns are shown in Table 3. Note that the original dataset was downloaded and later converted into a corresponding tab-delimited text file:

Price  | Lot Size | Water Front | Age | Land Value | Rooms
132500 | 0.09     | 0           | 42  | 5000       | 5
181115 | 0.92     | 0           | 0   | 22300      | 6
109000 | 0.19     | 0           | 133 | 7300       | 8
155000 | 0.41     | 0           | 13  | 18700      | 5
86060  | 0.11     | 0           | 0   | 15000      | 3
120000 | 0.68     | 0           | 31  | 14000      | 8
153000 | 0.4      | 0           | 33  | 23300      | 8
170000 | 1.21     | 0           | 23  | 146000     | 9
90000  | 0.83     | 0           | 36  | 222000     | 8
122900 | 1.94     | 0           | 4   | 212000     | 6
325000 | 2.29     | 0           | 123 | 126000     | 12

Table 3: Sample data from the "Saratoga NY Homes" dataset

For simplicity, we took only the first two features (that is, Price and Lot Size), prepared using the Spark feature learning techniques presented in the previous chapter. Our goal is an exploratory analysis based on these two features to find possible neighborhoods of houses located in the same area. First, look at a basic scatter plot of the values in the dataset:

Figure 8: Cluster of the neighborhoods

It's clearly seen that there are four clusters on the plot, marked with circles in Figure 8. However, finding the number of clusters is a tricky task. Here, we have the advantage of visual inspection, which is not available for higher-dimensional data or data on hyperplanes. Now we need to find the same result using Spark. For simplicity, we will use the K-means clustering API of Spark. The use of the raw data and the extraction of feature vectors is shown in Figure 9:

Figure 9: Unsupervised learning using Spark

K-means clustering of the neighborhood

Before performing the feature extraction, we need to load and parse the Saratoga NY Homes dataset. This involves the following steps: loading the packages and related dependencies, reading the dataset as an RDD, model training and prediction, collecting the locally parsed data, and comparing the clusterings.

Step 1: Import statistics and related classes

Here is the code to import statistics and related classes:

import java.io.Serializable; 
import java.util.List; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.mllib.clustering.KMeans; 
import org.apache.spark.mllib.clustering.KMeansModel; 
import org.apache.spark.mllib.linalg.Vector; 
import org.apache.spark.mllib.linalg.Vectors; 
import org.apache.spark.rdd.RDD; 
import org.apache.spark.sql.SparkSession;  

Step 2: Create the Spark session

Here is the code to create a Spark session:

  static SparkSession spark = SparkSession 
      .builder().appName("JavaLDAExample") 
      .master("local[*]") 
      .config("spark.sql.warehouse.dir", "E:/Exp/") 
      .getOrCreate(); 

Step 3: Load the Saratoga NY Homes.txt

Read, parse, and create RDDs from the dataset:

RDD<String> data = spark.sparkContext().textFile("input/Saratoga_NY_Homes.txt", 2); 

Step 4: Transform the data into an RDD of dense vectors

If you followed the preceding step carefully, we have only created a normal RDD so far. Therefore, that RDD has to be converted into the corresponding JavaRDD before each line can be mapped into a dense vector using Vectors:

JavaRDD<Vector> parsedData = data.toJavaRDD().map(new Function<String, Vector>() { 
      @Override 
      public Vector call(String s) throws Exception { 
        // Split each line into its individual fields 
        String[] sarray = s.split(","); 
        // Parse every field as a double and pack the values into a dense vector 
        double[] values = new double[sarray.length]; 
        for (int i = 0; i < sarray.length; i++) 
          values[i] = Double.parseDouble(sarray[i]); 
        return Vectors.dense(values); 
      } 
    });  

Step 5: Train the model

Train the model by specifying four clusters and ten iterations. Just refer to the following code for doing that:

int numClusters = 4; 
int numIterations = 10; 
int runs = 2; 
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations, runs, KMeans.K_MEANS_PARALLEL());  

Now estimate the cost of computing the clusters as follows: 
double cost = clusters.computeCost(parsedData.rdd()); 
System.out.println("Cost: " + cost);  

You should receive the results as follows:

Cost: 3.60148995801542E12   

Step 6: Show the cluster centers

Vector[] centers = clusters.clusterCenters(); 
System.out.println("Cluster Centers: "); 
for (Vector center : centers)  
{ 
  System.out.println(center); 
} 

The preceding code should produce the centers of the clusters as follows:

[545360.4081632652,0.9008163265306122,0.1020408163265306,21.73469387755102,111630.61224489794,0.061224489795918366,0.7551020408163265,2.3061224489795915,2.1632653061224487,2.714285714285714,2860.755102040816,59.346938775510196,3.510204081632653,1.1020408163265305,2.714285714285714,10.061224489795917] 
[134073.06845637583,0.3820000000000002,0.0026845637583892616,33.72617449664429,19230.76510067114,0.012080536912751677,0.22818791946308722,2.621476510067114,2.7234899328859057,2.6630872483221477,1332.9234899328858,52.86040268456375,2.7395973154362414,0.38120805369127514,1.4946308724832214,5.806711409395973] 
[218726.0625,0.5419711538461538,0.0,25.495192307692307,32579.647435897434,0.041666666666666664,0.3830128205128205,2.3205128205128203,2.4615384615384617,2.692307692307692,1862.3076923076922,57.4599358974359,3.3894230769230766,0.7019230769230769,2.032852564102564,7.44551282051282] 
[332859.0580645161,0.6369354838709671,0.025806451612903226,19.803225806451614,63188.06451612903,0.13870967741935483,0.6096774193548387,2.2225806451612904,2.2483870967741937,2.774193548387097,2378.4290322580646,57.66774193548387,3.6225806451612903,0.8516129032258064,2.479032258064516,8.719354838709677] 

Step 7: Evaluate the model error rate

double WSSSE = clusters.computeCost(parsedData.rdd()); 
System.out.println("Within Set Sum of Squared Errors = " + WSSSE); 

This should produce the result as follows:

Within Set Sum of Squared Errors = 3.60148995801542E12 

Step 8: Predict the cluster for a single data point

List<Vector> houses = parsedData.collect(); 
int prediction = clusters.predict(houses.get(18)); 
System.out.println("Prediction: " + prediction);  

[Output] Prediction: 0

Step 9: Stop the Spark session

Stop the Spark session using the stop() method as follows:

spark.stop(); 

Step 10: Comparing the clusters

Now let's compare the cluster assignments produced by k-means with the ones we made manually. The k-means algorithm gives cluster IDs starting from 0. Once you inspect the data, you will find the following mapping between the A to D cluster names we assigned and the K-means cluster IDs, shown in Table 4:

Cluster name | Cluster number | Cluster assignment
A            | 3              | A=>3
B            | 1              | B=>1
C            | 0              | C=>0
D            | 2              | D=>2

Table 4: Cluster assignment for the neighborhood K-means clustering example

Now, let's pick some data from different parts of the chart and predict which cluster it belongs to. Let's look at house 1, for example, which has a lot size of 876 square feet and is priced at $665K:

int prediction = clusters.predict(houses.get(18)); 
System.out.println("Prediction: " + prediction); 

[Output] Prediction: 2

This means that the house with the preceding properties falls in cluster 2. You can of course test the prediction capability with more data. Let's do some neighborhood analysis to see what meaning these clusters carry. We can assume that most of the houses in cluster 3 are near downtown, and that the cluster 2 houses are on hilly terrain, for example.
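
One simple way to carry this neighborhood analysis further is to predict the cluster for every house and count the cluster sizes. The following is a sketch that reuses the parsedData and clusters objects defined in the earlier steps; the variable names introduced here are only illustrative:

// Sketch: assign every parsed house to a cluster and count the cluster sizes. 
// Reuses the 'parsedData' JavaRDD<Vector> and the trained 'clusters' KMeansModel from the steps above. 
JavaRDD<Integer> memberships = clusters.predict(parsedData);   // one cluster ID per house 
java.util.Map<Integer, Long> clusterSizes = memberships.countByValue(); 
 
for (java.util.Map.Entry<Integer, Long> e : clusterSizes.entrySet()) { 
  System.out.println("Cluster " + e.getKey() + " contains " + e.getValue() + " houses"); 
} 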

In this example, we dealt with a very small set of features; common sense and visual inspection would lead us to the same conclusions. However, if you want more accuracy, you should construct more meaningful features by considering not only the lot size and the house price but also other features such as the number of rooms, house age, land value, heating type, and so on.

However, it would not be wise to include Waterfront as a meaningful feature, since no house in this example is located on the waterfront. We will provide a more detailed analysis aimed at better and more meaningful prediction accuracy in the next chapter, where we will demonstrate these considerations.

The beauty of the k-means algorithm is that it can cluster data with an unlimited number of features. It is a great tool to use when you have raw data and would like to discover the patterns in that data. However, deciding on the number of clusters before the experiment might not be successful and may sometimes lead to an overfitting or underfitting problem.
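
If you are unsure how many clusters to ask for, a common heuristic is to train K-means for several values of k and look for the "elbow" where the within set sum of squared errors stops dropping sharply. The following is a sketch using the same MLlib API as earlier; the range of k values and the iteration count are just illustrative assumptions:

// Sketch of the "elbow" heuristic for choosing k: train K-means for several k values 
// and print the within set sum of squared errors (WSSSE) for each. 
// Reuses the 'parsedData' JavaRDD<Vector> prepared in Step 4. 
for (int k = 2; k <= 8; k++) { 
  KMeansModel model = KMeans.train(parsedData.rdd(), k, 10);   // 10 iterations per run 
  double wssse = model.computeCost(parsedData.rdd()); 
  System.out.println("k = " + k + " => WSSSE = " + wssse); 
} 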

Tip

To overcome the aforementioned limitation of K-means, there are more robust approaches such as Markov Chain Monte Carlo (MCMC, see also https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo), presented in Tribble, Seth D., Markov chain Monte Carlo algorithms using completely uniformly distributed driving sequences, Diss. Stanford University, 2007. Moreover, a more technical discussion can be found at http://www.autonlab.org/tutorials/kmeans11.pdf.
