t-SNE for dimensionality reduction

t-SNE stands for t-distributed Stochastic Neighbor Embedding. It's a nonlinear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton. t-SNE has been widely used for data visualization in various domains, including computer vision, NLP, bioinformatics, and computational genomics.

As its name implies, t-SNE embeds high-dimensional data into a low-dimensional (usually two-dimensional or three-dimensional) space where similarity among data samples (neighbor information) is preserved. It first models a probability distribution over neighbors around data points, assigning a high probability to similar data points and an extremely small probability to dissimilar ones. Note that similarity (and hence neighborhood) is measured by Euclidean distance or another metric. Then, it constructs a projection onto a low-dimensional space where the divergence between the input distribution and output distribution is minimized. Similarities in the original high-dimensional space are modeled with a Gaussian distribution, while those in the output low-dimensional space are modeled with a Student's t-distribution.
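To make this a bit more concrete, here is the standard formulation of that objective (shown for illustration; we won't compute it by hand). If $p_{ij}$ is the Gaussian-based similarity between points $i$ and $j$ in the original space, and $q_{ij}$ is the t-distribution-based similarity between their low-dimensional embeddings $y_i$ and $y_j$, t-SNE minimizes the Kullback-Leibler divergence between the two distributions:

$$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \quad \text{where } q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$$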

We will apply t-SNE using the TSNE class from scikit-learn:

>>> from sklearn.manifold import TSNE

Now let's use t-SNE to verify our count vector representation.

We pick three distinct topics, talk.religion.misc, comp.graphics, and sci.space, and visualize the document vectors from these three topics.

First, load the documents of these three labels, as follows:

>>> categories_3 = ['talk.religion.misc', 'comp.graphics', 'sci.space']
>>> groups_3 = fetch_20newsgroups(categories=categories_3)

We then go through the same clean-up and count vectorization process as before to generate a count matrix, data_cleaned_count_3, with 500 features from the input, groups_3. You can refer to the steps in the previous sections, as you just need to repeat the same code; a condensed sketch is shown below.
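In case you want that step in one place, here is a minimal sketch. The clean_text helper below is a simplified stand-in (lowercasing and keeping alphabetic tokens only) for the clean-up routine built in the previous sections, and the CountVectorizer settings mirror the earlier 500-feature setup:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> def clean_text(docs):
...     # Simplified clean-up: lowercase and keep alphabetic tokens only
...     return [' '.join(word.lower() for word in doc.split()
...                      if word.isalpha()) for doc in docs]
>>> data_cleaned_3 = clean_text(groups_3.data)
>>> count_vector = CountVectorizer(stop_words='english', max_features=500)
>>> data_cleaned_count_3 = count_vector.fit_transform(data_cleaned_3)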

Next, we apply t-SNE to reduce the 500-dimensional matrix to a two-dimensional one:

>>> tsne_model = TSNE(n_components=2, perplexity=40,
...                   random_state=42, learning_rate=500)

>>> data_tsne = tsne_model.fit_transform(data_cleaned_count_3.toarray())

The parameters we specify in the TSNE object are as follows:

  • n_components: The output dimension
  • perplexity: The number of nearest data points considered as neighbors in the algorithm, with a typical value between 5 and 50
  • random_state: The random seed for reproducibility
  • learning_rate: The factor affecting the process of finding the optimal mapping space, with a typical value between 10 and 1,000

Note that the TSNE object only takes in a dense matrix, hence we convert the sparse matrix, data_cleaned_count_3, into a dense one using toarray().

We have just reduced the input dimensionality from 500 to 2. Finally, we can easily visualize the result in a two-dimensional scatter plot, where the x axis is the first dimension, the y axis is the second dimension, and the color, c, is based on the topic label of each original document:

>>> import matplotlib.pyplot as plt
>>> plt.scatter(data_tsne[:, 0], data_tsne[:, 1], c=groups_3.target)
>>> plt.show()

Refer to the following screenshot for the end result:

Data points from the three topics appear in different colors, such as green, purple, and yellow, and form three clear clusters. Data points from the same topic are close to each other, while those from different topics are far apart. Clearly, count vectors are great representations of the original text data, as they preserve the distinction among the three topics.

You can also play around with the parameters and see whether you can obtain a nicer plot where the three clusters are better separated.
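For instance (the values below are illustrative, not prescribed settings), you could try a lower perplexity and let learning_rate take its default:

>>> tsne_model = TSNE(n_components=2, perplexity=10, random_state=42)
>>> data_tsne = tsne_model.fit_transform(data_cleaned_count_3.toarray())
>>> plt.scatter(data_tsne[:, 0], data_tsne[:, 1], c=groups_3.target)
>>> plt.show()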

Count vectorization does well in preserving document disparity. How about maintaining similarity? We can check that using documents from overlapping topics, such as these five: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x:

>>> categories_5 = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x']
>>> groups_5 = fetch_20newsgroups(categories=categories_5)

Similar processes (including text clean-up, count vectorization, and t-SNE) are repeated, as in the sketch below, and the resulting plot is displayed after it:
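Here is a minimal sketch of that repetition, reusing the simplified clean_text helper from above (your own clean-up code from the earlier sections would slot in here instead):

>>> data_cleaned_5 = clean_text(groups_5.data)
>>> count_vector_5 = CountVectorizer(stop_words='english', max_features=500)
>>> data_cleaned_count_5 = count_vector_5.fit_transform(data_cleaned_5)
>>> tsne_model = TSNE(n_components=2, perplexity=40,
...                   random_state=42, learning_rate=500)
>>> data_tsne_5 = tsne_model.fit_transform(data_cleaned_count_5.toarray())
>>> plt.scatter(data_tsne_5[:, 0], data_tsne_5[:, 1], c=groups_5.target)
>>> plt.show()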

Data points from those five computer-related topics are mixed together rather than forming separate clusters, which means they are contextually similar. To conclude, count vectors are great representations of the original text data, as they are also good at preserving similarity among related topics.
