In unsupervised learning, data points have no labels associated with them. Therefore, labels have to be assigned to them algorithmically, as shown in the following figure. In other words, the correct classes of the training dataset in unsupervised learning are unknown. Consequently, classes have to be inferred from the unstructured dataset, which implies that the goal of an unsupervised learning algorithm is to organize the data in some structured way by describing its underlying structure.
To overcome this obstacle in unsupervised learning, clustering techniques are commonly used to group unlabeled samples based on some similarity measure; this task therefore also involves mining hidden patterns for feature learning. Clustering is the process of intelligently categorizing the items in your dataset. The overall idea is that two items in the same cluster are “closer” to each other than items that belong to separate clusters. That is the general definition, leaving the interpretation of “closeness” open.
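The idea of grouping by “closeness” can be sketched with a minimal k-means implementation. This is an illustrative sketch, not a production algorithm: the toy 2-D points, the squared-Euclidean notion of distance, and the fixed number of iterations are all assumptions chosen for demonstration.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # "closeness" here is squared Euclidean distance
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # recompute centroids; keep the old one if a cluster went empty
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated groups of 2-D points (hypothetical toy data)
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, clusters = kmeans(data, k=2)
```

With this data, the two recovered centroids settle near (0.1, 0.1) and (5.1, 5.1), matching the two visually obvious groups.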
Techniques such as clustering, frequent pattern mining, and dimensionality reduction are used to solve unsupervised learning problems (they can be applied to supervised learning problems too). In this book, we will provide several examples of unsupervised learning algorithms, such as k-means, bisecting k-means, the Gaussian mixture model, Latent Dirichlet Allocation (LDA), and so on. We will also show how to use dimensionality reduction algorithms such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) in supervised learning through regression analysis. Dimensionality reduction offers several benefits:
- It reduces the time and storage space required in machine learning tasks
- It helps remove multicollinearity and improves the performance of the machine learning model
- Data visualization becomes easier when reduced to very low dimensions such as 2D or 3D
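As a brief illustration of the last point, PCA can be sketched via the SVD of the mean-centered data matrix. The toy dataset below is an assumption for demonstration: its third coordinate is a linear combination of the first two, so two principal components capture essentially all of the variance and the projection from 3-D to 2-D loses almost nothing.

```python
import numpy as np

# Hypothetical toy data: 100 samples in 3-D, where the third
# coordinate is a linear combination of the first two, so the
# data actually lies in a 2-D subspace.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
X = np.column_stack([X, X @ np.array([0.5, -0.3])])

# PCA via SVD of the mean-centered data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance explained by each component
explained = S**2 / np.sum(S**2)

# Project onto the top two principal components: 3-D -> 2-D
X2 = Xc @ Vt[:2].T
```

Here the first two entries of `explained` sum to (numerically) 1, confirming that the 2-D projection `X2` preserves the structure of the data and is easy to plot.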