Summary

In this chapter, we learned the following:

  • Clustering is an unsupervised algorithm that groups similar data points together and separates dissimilar points from each other. It finds use in marketing, taxonomy, seismology, public policy, and data mining.
  • The distance between two observations is one of the key criteria for deciding whether they should be clustered together.
  • The distances between all pairs of points in a dataset are best represented by an n × n symmetric matrix called a distance matrix.
  • Hierarchical clustering is an agglomerative mode of clustering in which we start with n clusters (one per point in the dataset) and successively merge them into a smaller number of clusters based on the linkages computed over the distance matrix.
  • The k-means clustering algorithm is a widely used mode of clustering in which the number of clusters must be specified in advance. K-means outputs a label for each row of data indicating the cluster it belongs to, along with the cluster centers, which makes its results easy to analyze and interpret.
  • Deciding the number of clusters (k) for k-means clustering is an important task. The elbow method and the silhouette coefficient method can help us decide the optimal value of k (see the sketch after this list).
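
The following is a minimal sketch of these ideas, assuming scikit-learn and SciPy are installed. The toy dataset generated with make_blobs and the candidate range of k values are illustrative assumptions, not taken from the chapter.

```python
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Toy dataset: 300 points around 4 centers (placeholder for real data)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# n x n symmetric distance matrix between all pairs of observations
dist_matrix = squareform(pdist(X, metric="euclidean"))

# Agglomerative (hierarchical) clustering: start with n clusters and
# merge them according to the chosen linkage method
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=4, criterion="maxclust")

# Elbow method: inertia (within-cluster sum of squares) for each candidate k
# Silhouette coefficient: higher values indicate better-separated clusters
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")

# Fit the final model with the chosen k; labels_ gives each row's cluster,
# cluster_centers_ gives the coordinates of the cluster centers
final_km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(final_km.labels_[:10])
print(final_km.cluster_centers_)
```

In practice, the k at which the inertia curve bends (the "elbow") and the k with the highest silhouette score are compared before settling on a final number of clusters.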

In this chapter, we dealt with a very widely used unsupervised algorithm. Next, we will learn about a supervised classification algorithm called the decision tree, which is a powerful method for classifying and predicting data.
