Dimensionality reduction

Another form of unsupervised learning is dimensionality reduction. The number of features present in a dataset is the dataset's dimensionality. Often, many features are correlated, noisy, or simply not very informative. Moreover, the cost of storing and processing data is correlated with a dataset's dimensionality. Thus, by reducing the dataset's dimensions, we can help the algorithms to better model the data.

Another use of dimensionality reduction is for the visualization of high-dimensional datasets. For example, using the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, we can reduce the breast cancer dataset to two dimensions or components. Although it is not easy to visualize 30 dimensions, it is quite easy to visualize two.

Furthermore, we can visually test whether the information contained within the dataset can be used to separate its classes. The next figure depicts the two components on the x and y axes, while the color represents each instance's class. Although we cannot plot all of the dimensions, by plotting the two components, we can conclude that a degree of separability between the classes exists:

Using t-SNE to reduce the dimensionality of the breast cancer dataset
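The following is a minimal sketch of how such a plot can be produced, assuming scikit-learn and matplotlib are available; the specific parameter values, such as the random seed, are illustrative choices rather than prescribed settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load the 30-dimensional breast cancer dataset
bc = load_breast_cancer()

# Reduce the 30 features to two t-SNE components
# (random_state is an illustrative choice for reproducibility)
tsne = TSNE(n_components=2, random_state=123456)
components = tsne.fit_transform(bc.data)

# Plot the two components, coloring each instance by its class
plt.scatter(components[:, 0], components[:, 1], c=bc.target)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title('t-SNE components of the breast cancer dataset')
plt.show()
```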