In the previous chapter, we went through a text visualization using t-SNE. T-SNE, or any dimensionality reduction algorithm, is a type of unsupervised learning. Moving forward, in this chapter, we will be continuing our unsupervised learning journey, specifically focusing on clustering and topic modeling. We will start with how unsupervised learning learns without guidance and how it is good at discovering hidden information underneath data.
Next, we will talk about clustering as an important branch of unsupervised learning, which identifies different groups of observations from data. For instance, clustering is useful for market segmentation, where consumers of similar behaviors are grouped into one segment for marketing purposes. We will perform clustering on the 20 newsgroups text dataset and see what clusters will be produced.
Another unsupervised learning route we will take is topic modeling, which is the process of extracting themes hidden in the dataset. You will be amused by how many interesting themes we are able to mine from the 20 newsgroups dataset.
We will cover the following topics:
In the previous chapter, we applied t-SNE to visualize the newsgroup text data, reduced to 2 dimensions. T-SNE, or dimensionality reduction in general, is a type of unsupervised learning. Instead of having a teacher educating on what particular output to produce, such as a class or membership (classification), and a continuous value (regression), unsupervised learning identifies inherent structures or commonalities in the input data. Since there is no guidance from the "teacher" in unsupervised learning, there is no clear answer on what is a right or wrong result. Unsupervised learning has the freedom to discover hidden information underneath input data.
An easy way to understand unsupervised learning is to think of going through many practice questions for an exam. In supervised learning, you are given answers to those practice questions. You basically figure out the relationship between the questions and answers and learn how to map the questions to the answers. Hopefully, you will do well in the actual exam in the end by giving the correct answers. However, in unsupervised learning, you are not provided with the answers to those practice questions. What you might do in this instance could include the following:
You will notice that the outcomes of all these tasks are pretty open-ended. They are correct as long as they are able to describe the commonality and the structure underneath the data.
Practice questions are the features in machine learning, which are also often called attributes, observations, or predictive variables. Answers to questions are the labels in machine learning, which are also called targets or target variables. Practice questions with answers provided are labeled data, while practice questions without answers are unlabeled data. Unsupervised learning works with unlabeled data and acts on that information without guidance.
Unsupervised learning can include the following types:
Unsupervised learning is extensively employed in the area of NLP mainly because of the difficulty of obtaining labeled text data. Unlike numerical data (such as house prices, stock data, and online click streams), labeling text can sometimes be subjective, manual, and tedious. Unsupervised learning algorithms that do not require labels become effective when it comes to mining text data.
In Chapter 9, Mining the 20 Newsgroups Dataset with Text Analysis Techniques, you experienced using t-SNE to reduce the dimensionality of text data. Now, let's explore text mining with clustering algorithms and topic modeling techniques. We will start with clustering the newsgroups data.
The newsgroups data comes with labels, which are the categories of the newsgroups, and a number of categories that are closely related or even overlapping, for instance, the five computer groups: comp.graphics
, comp.os.ms-windows.misc
, comp.sys.ibm.pc.hardware
, comp.sys.mac.hardware
, and comp.windows.x
, and the two religion-related ones: alt.atheism
and talk.religion.misc
.
Let's now pretend we don't know those labels or they don't exist. Will samples from related topics be clustered together? We will now resort to the k-means clustering algorithm.
The goal of the k-means algorithm is to partition the data into k groups based on feature similarities. K is a predefined property of a k-means clustering model. Each of the k clusters is specified by a centroid (center of a cluster) and each data sample belongs to the cluster with the nearest centroid. During training, the algorithm iteratively updates the k centroids based on the data provided. Specifically, it involves the following steps:
Figure 10.1: Distance metrics
The outputs of a trained k-means clustering model include the following:
It is very easy to understand the k-means clustering algorithm and its implementation is also straightforward, as you will discover next.
We will use the iris
dataset from scikit-learn as an example. Let's first load the data and visualize it. We herein only use two features out of the original four for simplicity:
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data[:, 2:4]
>>> y = iris.target
Since the dataset contains three iris classes, we plot it in three different colors, as follows:
>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> plt.scatter(X[:,0], X[:,1], c=y)
>>> plt.show()
This will give us the following output for the original data plot:
Figure 10.2: Plot of the original iris dataset
Assuming we know nothing about the label y, we try to cluster the data into three groups, as there seem to be three clusters in the preceding plot (or you might say two, which we will come back to later). Let's perform step 1, specifying k, and step 2, initializing centroids, by randomly selecting three samples as initial centroids
:
>>> k = 3
>>> random_index = np.random.choice(range(len(X)), k)
>>> centroids = X[random_index]
We visualize the data (without labels any more) along with the initial random centroids:
>>> def visualize_centroids(X, centroids):
... plt.scatter(X[:, 0], X[:, 1])
... plt.scatter(centroids[:, 0], centroids[:, 1], marker='*',
s=200, c='#050505')
... plt.show()
>>> visualize_centroids(X, centroids)
Refer to the following screenshot for the data, along with the initial random centroids:
Figure 10.3: Data points with random centroids
Now we perform step 3, which entails assigning clusters based on the nearest centroids. First, we need to define a function calculating distance that is measured by the Euclidean distance, as demonstrated herein:
>>> def dist(a, b):
... return np.linalg.norm(a - b, axis=1)
Then, we develop a function that assigns a sample to the cluster of the nearest centroid:
>>> def assign_cluster(x, centroids):
... distances = dist(x, centroids)
... cluster = np.argmin(distances)
... return cluster
With the clusters assigned, we perform step 4, which involves updating the centroids to the mean of all samples in the individual clusters
:
>>> def update_centroids(X, centroids, clusters):
... for i in range(k):
... cluster_i = np.where(clusters == i)
... centroids[i] = np.mean(X[cluster_i], axis=0)
Finally, we have step 5, which involves repeating step 3 and step 4 until the model converges and whichever of the following occurs:
We set the tolerance of the first condition and the maximum number of iterations as follows:
>>> tol = 0.0001
>>> max_iter = 100
Initialize the clusters' starting values, along with the starting clusters for all samples as follows:
>>> iter = 0
>>> centroids_diff = 100000
>>> clusters = np.zeros(len(X))
With all the components ready, we can train the model iteration by iteration where it first checks convergence, before performing steps 3 and 4, and then visualizes the latest centroids:
>>> from copy import deepcopy
>>> while iter < max_iter and centroids_diff > tol:
... for i in range(len(X)):
... clusters[i] = assign_cluster(X[i], centroids)
... centroids_prev = deepcopy(centroids)
... update_centroids(X, centroids, clusters)
... iter += 1
... centroids_diff = np.linalg.norm(centroids -
centroids_prev)
... print('Iteration:', str(iter))
... print('Centroids:
', centroids)
... print('Centroids move: {:5.4f}'.format(centroids_diff))
... visualize_centroids(X, centroids)
Let's take a look at the following outputs generated from the preceding commands:
Iteration: 1
Centroids:
[[5.01827957 1.72258065]
[3.41428571 1.05714286]
[1.464 0.244 ]]
Centroids move: 0.8274
Figure 10.4: k-means clustering result after the first round
Iteration: 2
Centroids:
[[5.20897436 1.81923077]
[3.83181818 1.16818182]
[1.464 0.244 ]]
Centroids move: 0.4820
Figure 10.5: k-means clustering result after the second round
Iteration: 3
Centroids:
[[5.3796875 1.9125 ]
[4.06388889 1.25555556]
[1.464 0.244 ]]
Centroids move: 0.3152
The plot of centroids after iteration 3 is as follows:
Figure 10.6: k-means clustering result after the third round
Iteration: 4
Centroids:
[[5.51481481 1.99444444]
[4.19130435 1.30217391]
[1.464 0.244 ]]
Centroids move: 0.2083
The plot of centroids after iteration 4 is as follows:
Figure 10.7: k-means clustering result after the fourth round
Iteration: 5
Centroids:
[[5.53846154 2.01346154]
[4.22083333 1.31041667]
[1.464 0.244 ]]
Centroids move: 0.0431
The plot of centroids after iteration 5 is as follows:
Figure 10.8: k-means clustering result after the fifth round
Iteration: 6
Centroids:
[[5.58367347 2.02653061]
[4.25490196 1.33921569]
[1.464 0.244 ]]
Centroids move: 0.0648
The plot of centroids after iteration 6 is as follows:
Figure 10.9: k-means clustering result after the sixth round
Iteration: 7
Centroids:
[[5.59583333 2.0375 ]
[4.26923077 1.34230769]
[1.464 0.244 ]]
Centroids move: 0.0220
The plot of centroids after iteration 7 is as follows:
Figure 10.10: k-means clustering result after the seventh round
Iteration: 8
Centroids:
[[5.59583333 2.0375 ]
[4.26923077 1.34230769]
[1.464 0.244 ]]
Centroids move: 0.0000
The plot of centroids after iteration 8 is as follows:
Figure 10.11: k-means clustering result after the eighth round
The model converges after eight iterations. The resulting centroids look promising, and we can also plot the clusters:
>>> plt.scatter(X[:, 0], X[:, 1], c=clusters)
>>> plt.scatter(centroids[:, 0], centroids[:, 1], marker='*',
s=200, c='#050505')
>>> plt.show()
Refer to the following screenshot for the end result:
Figure 10.12: Data samples along with learned cluster centroids
As you can see, samples around the same centroid form a cluster. After eight iterations (you might see slightly more or less iterations in your case), the model converges and the centroids will no longer be updated.
Having developed our own k-means clustering model, we will now discuss how to use scikit-learn for a quicker solution by performing the following steps:
KMeans
class and initialize a model with three clusters, as follows:
>>> from sklearn.cluster import KMeans
>>> kmeans_sk = KMeans(n_clusters=3, random_state=42)
The KMeans
class takes in the following important parameters:
Constructor parameter |
Default value |
Example values |
Description |
|
|
|
K clusters |
|
|
|
Maximum number of iterations |
|
|
|
Tolerance to declare convergence |
|
|
|
Random seed for program reproducibility |
Table 10.1: Parameters of the KMeans class
>>> kmeans_sk.fit(X)
>>> clusters_sk = kmeans_sk.labels_
>>> centroids_sk = kmeans_sk.cluster_centers_
>>> plt.scatter(X[:, 0], X[:, 1], c=clusters_sk)
>>> plt.scatter(centroids_sk[:, 0], centroids_sk[:, 1],
marker='*', s=200, c='#050505')
>>> plt.show()
This will result in the following output:
Figure 10.13: Data samples along with learned cluster centroids using scikit-learn
We get similar results to the previous one using the model we implemented from scratch.
Let's return to our earlier discussion on what the right value for k
is. In the preceding example, it is more intuitive to set it to 3
since we know there are three classes in total. However, in most cases, we don't know how many groups are sufficient or efficient, and meanwhile, the algorithm needs a specific value of k
to start with. So, how can we choose the value for k
? There is a famous approach called the Elbow method.
In the Elbow method, different values of k are chosen and corresponding models are trained; for each trained model, the sum of squared errors, or SSE (also called the sum of within-cluster distances) of centroids is calculated and is plotted against k. Note that for one cluster, the squared error (or the within-cluster distance) is computed as the sum of the squared distances from individual samples in the cluster to the centroid. The optimal k is chosen where the marginal drop of SSE starts to decrease dramatically. This means that further clustering does not provide any substantial gain.
Let's apply the Elbow method to the example we covered in the previous section (learning by examples is what this book is all about). We perform k-means clustering under different values of k on the iris
data:
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target
>>> k_list = list(range(1, 7))
>>> sse_list = [0] * len(k_list)
We use the whole feature space and k
ranges from 1
to 6
. Then, we train individual models and record the resulting SSE, respectively:
>>> for k_ind, k in enumerate(k_list):
... kmeans = KMeans(n_clusters=k, random_state=42)
... kmeans.fit(X)
... clusters = kmeans.labels_
... centroids = kmeans.cluster_centers_
... sse = 0
... for i in range(k):
... cluster_i = np.where(clusters == i)
... sse += np.linalg.norm(X[cluster_i] - centroids[i])
... print('k={}, SSE={}'.format(k, sse))
... sse_list[k_ind] = sse
k=1, SSE=26.103076447039722
k=2, SSE=16.469773740281195
k=3, SSE=15.089477089696558
k=4, SSE=15.0307321707491
k=5, SSE=14.858930749063735
k=6, SSE=14.883090350867239
Finally, we plot the SSE versus the various k
ranges, as follows:
>>> plt.plot(k_list, sse_list)
>>> plt.show()
This will result in the following output:
Figure 10.14: k-means elbow: SSE versus k
Apparently, the Elbow point is k=3
, since the drop in SSE slows down dramatically right after 3
. Hence, k=3
is an optimal solution in this case, which is consistent with the fact that there are three classes of flowers.
You should now be very familiar with k-means clustering. Next, let's see what we are able to mine from the newsgroups dataset using this algorithm. We will use all data from four categories, 'alt.atheism'
, 'talk.religion.misc'
, 'comp.graphics'
, and 'sci.space'
, as an example.
We first load the data from those newsgroups and preprocess it as we did in Chapter 9, Mining the 20 Newsgroups Dataset with Text Analysis Techniques:
>>> from sklearn.datasets import fetch_20newsgroups
>>> categories = [
... 'alt.atheism',
... 'talk.religion.misc',
... 'comp.graphics',
... 'sci.space',
... ]
>>> groups = fetch_20newsgroups(subset='all',
categories=categories)
>>> labels = groups.target
>>> label_names = groups.target_names
>>> def is_letter_only(word):
... for char in word:
... if not char.isalpha():
... return False
... return True
>>> from nltk.corpus import names
>>> all_names = set(names.words())
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> data_cleaned = []
>>> for doc in groups.data:
... doc = doc.lower()
... doc_cleaned = ' '.join(lemmatizer.lemmatize(word) for
word in doc.split() if word.isalpha()
and word not in all_names)
... data_cleaned.append(doc_cleaned)
We then convert the cleaned text data into count vectors using CountVectorizer
of scikit-learn:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vector = CountVectorizer(stop_words="english",
max_features=None, max_df=0.5, min_df=2)
>>> data = count_vector.fit_transform(data_cleaned)
Note that the vectorizer we use here does not limit the number of features (word tokens), but the minimum and maximum document frequency, which are 2 and 50% of the dataset, respectively. Document frequency of a word is measured by the fraction of documents (samples) in the dataset that contain this word.
With the input data ready, we will now try to cluster them into four groups as follows:
>>> from sklearn.cluster import KMeans
>>> k = 4
>>> kmeans = KMeans(n_clusters=k, random_state=42)
>>> kmeans.fit(data)
Let's do a quick check on the sizes of the resulting clusters:
>>> clusters = kmeans.labels_
>>> from collections import Counter
>>> print(Counter(clusters))
Counter({3: 3360, 0: 17, 1: 7, 2: 3})
The clusters don't look absolutely correct, with most samples (3360
samples) congested in one big cluster (cluster 3). What could have gone wrong? It turns out that our count-based features are not sufficiently representative. A better numerical representation for text data is the term frequency-inverse document frequency (tf-idf). Instead of simply using the token count, or the so-called term frequency (tf), it assigns each term frequency a weighting factor that is inversely proportional to the document frequency. In practice, the idf factor of a term t in documents D is calculated as follows:
Here, nD is the total number of documents, nt is the number of documents containing the term t, and the 1 is added to avoid division by zero.
With the idf
factor incorporated, the tf-idf
representation diminishes the weight of common terms (such as get and make) and emphasizes terms that rarely occur, but that convey an important meaning.
To use the tf-idf representation, we just need to replace CountVectorizer
with TfidfVectorizer
from scikit-learn as follows:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf_vector = TfidfVectorizer(stop_words='english',
max_features=None, max_df=0.5, min_df=2)
Now, redo feature extraction using the tf-idf
vectorizer and the k-means clustering algorithm on the resulting feature space:
>>> data = tfidf_vector.fit_transform(data_cleaned)
>>> kmeans.fit(data)
>>> clusters = kmeans.labels_
print(Counter(clusters))
Counter({1: 1560, 2: 686, 3: 646, 0: 495})
The clustering result becomes more reasonable.
We also take a closer look at the clusters by examining what they contain and the top 10 terms (the terms with the 10 highest tf-idf) representing each cluster:
>>> cluster_label = {i: labels[np.where(clusters == i)] for i in
range(k)}
>>> terms = tfidf_vector.get_feature_names()
>>> centroids = kmeans.cluster_centers_
>>> for cluster, index_list in cluster_label.items():
... counter = Counter(cluster_label[cluster])
... print('cluster_{}: {} samples'.format(cluster, len(index_list)))
... for label_index, count in sorted(counter.items(),
key=lambda x: x[1], reverse=True):
... print('{}: {} samples'.format(label_names[label_index], count))
... print('Top 10 terms:')
... for ind in centroids[cluster].argsort()[-10:]:
... print(' %s' % terms[ind], end="")
... print()
cluster_0: 495 samples
sci.space: 494 samples
comp.graphics: 1 samples
Top 10 terms:
toronto moon zoology nasa hst mission wa launch shuttle space
cluster_1: 1560 samples
sci.space: 459 samples
alt.atheism: 430 samples
talk.religion.misc: 352 samples
comp.graphics: 319 samples
Top 10 terms:
people new think know like ha just university article wa
cluster_2: 686 samples
comp.graphics: 651 samples
sci.space: 32 samples
alt.atheism: 2 samples
talk.religion.misc: 1 samples
Top 10 terms:
know thanks need format looking university program file graphic image
cluster_3: 646 samples
alt.atheism: 367 samples
talk.religion.misc: 275 samples
sci.space: 2 samples
comp.graphics: 2 samples
Top 10 terms:
moral article morality think jesus people say christian wa god
From what we observe in the preceding results:
cluster_0
is obviously about space and includes almost all sci.space
samples and related terms such as moon
, nasa
, launch
, shuttle
, and space
cluster_1
is more of a generic topiccluster_2
is more about computer graphics and related terms, such as format
, program
, file
, graphic
, and image
cluster_3
is an interesting one, which successfully brings together two overlapping topics, atheism and religion, with key terms including moral
, morality
, jesus
, christian
, and god
Feel free to try different values of k
, or use the Elbow method to find the optimal one (this is actually an exercise for this chapter).
It is quite interesting to find key terms for each text group via clustering. Topic modeling is another approach for doing so, but in a much more direct way. It does not simply search for the key terms in individual clusters generated beforehand. What it does is directly extract collections of key terms from documents. You will see how this works in the next section.
A topic model is a type of statistical model for discovering the probability distributions of words linked to the topic. The topic in topic modeling does not exactly match the dictionary definition, but corresponds to a nebulous statistical concept, which is an abstraction that occurs in a collection of documents.
When we read a document, we expect certain words appearing in the title or the body of the text to capture the semantic context of the document. An article about Python programming will have words such as class and function, while a story about snakes will have words such as eggs and afraid. Documents usually have multiple topics; for instance, this recipe is about three things: topic modeling, non-negative matrix factorization, and latent Dirichlet allocation, which we will discuss shortly. We can therefore define an additive model for topics by assigning different weights to topics.
Topic modeling is widely used for mining hidden semantic structures in given text data. There are two popular topic modeling algorithms—non-negative matrix factorization and latent Dirichlet allocation. We will go through both of these in the next two sections.
Non-negative matrix factorization (NMF) relies heavily on linear algebra. It factorizes an input matrix, V, into a product of two smaller matrices, W and H, in such a way that these three matrices have no negative values. In the context of NLP, these three matrices have the following meanings:
How to derive the computation of W and H is beyond the scope of this book. However, you can refer to the following diagram to get a better sense of how NMF works:
Figure 10.15: Example of matrix W and matrix H derived from an input matrix V
If you are interested in reading more about NMF, feel free to check out the original paper Generalized Nonnegative Matrix Approximations with Bregman Divergences, by Inderjit S. Dhillon and Suvrit Sra, in NIPS 2005.
Let's now apply NMF to our newsgroups data. Scikit-learn has a nice module for decomposition that includes NMF:
>>> from sklearn.decomposition import NMF
>>> t = 20
>>> nmf = NMF(n_components=t, random_state=42)
We specify 20 topics (n_components
) as an example. Important parameters of the model are included in the following table:
Constructor parameter |
Default value |
Example values |
Description |
|
|
|
Number of components — in the context of topic modeling, this corresponds to the number of topics. If |
|
|
|
Maximum number of iterations |
|
|
|
Tolerance to declare convergence |
Table 10.2: Parameters of the NMF class
We used the term matrix as input to the NMF model, but you could also use the tf-idf one instead. Here, we will reuse count_vector
, as defined previously:
>>> data = count_vector.fit_transform(data_cleaned)
Now, fit the NMF model nmf
on the term matrix data
:
>>> nmf.fit(data)
We can obtain the resulting topic-feature rank W after the model is trained:
>>> nmf.components_
[[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 1.81952400e-04]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
7.35497518e-04 3.65665719e-03]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]
...
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 2.69725134e-02
0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 4.26844886e-05]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]]
For each topic, we display the top 10 terms based on their ranks:
>>> terms = count_vector.get_feature_names()
>>> for topic_idx, topic in enumerate(nmf.components_):
... print("Topic {}:" .format(topic_idx))
... print(" ".join([terms[i] for i in topic.argsort()[-10:]]))
Topic 0:
available quality program free color version gif file image jpeg
Topic 1:
ha article make know doe say like just people think
Topic 2:
include available analysis user software ha processing data tool image
Topic 3:
atmosphere kilometer surface ha earth wa planet moon spacecraft solar
Topic 4:
communication technology venture service market ha commercial space satellite launch
Topic 5:
verse wa jesus father mormon shall unto mcconkie lord god
Topic 6:
format message server object image mail file ray send graphic
Topic 7:
christian people doe atheism believe religion belief religious god atheist
Topic 8:
file graphic grass program ha package ftp available image data
Topic 9:
speed material unified star larson book universe theory physicist physical
Topic 10:
planetary station program group astronaut center mission shuttle nasa space
Topic 11:
infrared high astronomical center acronym observatory satellite national telescope space
Topic 12:
used occurs true form ha ad premise conclusion argument fallacy
Topic 13:
gospel people day psalm prophecy christian ha matthew wa jesus
Topic 14:
doe word hanging say greek matthew mr act wa juda
Topic 15:
siggraph graphic file information format isbn data image ftp available
Topic 16:
venera mar lunar surface space venus soviet mission wa probe
Topic 17:
april book like year time people new did article wa
Topic 18:
site retrieve ftp software data information client database gopher search
Topic 19:
use look xv color make program correction bit gamma image
There are a number of interesting topics, for instance, computer graphics-related topics, such as 0
, 2
, 6
, and 8
, space-related ones, such as 3
, 4
, and 9
, and religion-related ones, such as 5
, 7
, and 13
. There are also two topics, 1
and 12
, that are hard to interpret, which is totally fine since topic modeling is a kind of free-form learning.
Let's explore another popular topic modeling algorithm, latent Dirichlet allocation (LDA). LDA is a generative probabilistic graphical model that explains each input document by means of a mixture of topics with certain probabilities. Again, topic in topic modeling means a collection of words with a certain connection. In other words, LDA basically deals with two probability values, and . This can be difficult to understand at the beginning. So, let's start from the bottom, the end result of an LDA model.
Let's take a look at the following set of documents:
Document 1: This restaurant is famous for fish and chips.
Document 2: I had fish and rice for lunch.
Document 3: My sister bought me a cute kitten.
Document 4: Some research shows eating too much rice is bad.
Document 5: I always forget to feed fish to my cat.
Now, let's say we want two topics. The topics derived from these documents may appear as follows:
Topic 1: 30% fish, 20% chip, 30% rice, 10% lunch, 10% restaurant (which we can interpret Topic 1 to be food related)
Topic 2: 40% cute, 40% cat, 10% fish, 10% feed (which we can interpret Topic 1 to be about pet)
Therefore, we find how each document is represented by these two topics:
Documents 1: 85% Topic 1, 15% Topic 2
Documents 2: 88% Topic 1, 12% Topic 2
Documents 3: 100% Topic 2
Documents 4: 100% Topic 1
Documents 5: 33% Topic 1, 67% Topic 2
After seeing a dummy example, we come back to its learning procedure:
LDA is trained in a generative manner, where it tries to abstract from the documents a set of hidden topics that are likely to generate a certain collection of words.
With all this in mind, let's see LDA in action. The LDA model is also included in scikit-learn:
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> t = 20
>>> lda = LatentDirichletAllocation(n_components=t,
learning_method='batch',random_state=42)
Again, we specify 20 topics (n_components
). The key parameters of the model are included in the following table:
Constructor parameter |
Default value |
Example values |
Description |
|
|
|
Numberof components – in the context of topic modeling, this corresponds to the number of topics. |
|
|
|
In batch mode, all training data is used for each update. In online mode, a mini-batch of training data is used for each update. In general, if the data size is large, the online mode is faster. |
|
|
|
Maximum number of iterations. |
|
|
|
Seed used by the random number generator. |
Table 10.3: Parameters of the LatentDirichletAllocation class
For the input data to LDA, remember that LDA only takes in term counts as it is a probabilistic graphical model. This is unlike NMF, which can work with both the term count matrix and the tf-idf matrix as long as they are non-negative data. Again, we use the term matrix defined previously as input to the LDA model:
>>> data = count_vector.fit_transform(data_cleaned)
Now, fit the LDA model on the term matrix, data
:
>>> lda.fit(data)
We can obtain the resulting topic-term rank after the model is trained:
>>> lda.components_
[[0.05 2.05 2.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 4.0336285 0.05 0.05 ]
...
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 3.05 ]]
Similarly, for each topic, we display the top 10 terms based on their ranks as follows:
>>> terms = count_vector.get_feature_names()
>>> for topic_idx, topic in enumerate(lda.components_):
... print("Topic {}:" .format(topic_idx))
... print(" ".join([terms[i] for i in
topic.argsort()[-10:]]))
Topic 0:
atheist doe ha believe say jesus people christian wa god
Topic 1:
moment just adobe want know ha wa hacker article radius
Topic 2:
center point ha wa available research computer data graphic hst
Topic 3:
objective argument just thing doe people wa think say article
Topic 4:
time like brian ha good life want know just wa
Topic 5:
computer graphic think know need university just article wa like
Topic 6:
free program color doe use version gif jpeg file image
Topic 7:
gamma ray did know university ha just like article wa
Topic 8:
tool ha processing using data software color program bit image
Topic 9:
apr men know ha think woman just university article wa
Topic 10:
jpl propulsion mission april mar jet command data spacecraft wa
Topic 11:
russian like ha university redesign point option article space station
Topic 12:
ha van book star material physicist universe physical theory wa
Topic 13:
bank doe book law wa article rushdie muslim islam islamic
Topic 14:
think gopher routine point polygon book university article know wa
Topic 15:
ha rocket new lunar mission satellite shuttle nasa launch space
Topic 16:
want right article ha make like just think people wa
Topic 17:
just light space henry wa like zoology sky article toronto
Topic 18:
comet venus solar moon orbit planet earth probe ha wa
Topic 19:
site format image mail program available ftp send file graphic
There are a number of interesting topics that we just mined, for instance, computer graphics-related topics, such as 2
, 5
, 6
, 8
, and 19
, space-related ones, such as 10
, 11
, 12
, and 15
, and religion-related ones, such as 0
and 13
. There are also topics involving noise, for example, 9
and 16
, which may require some imagination to interpret. Again, this is not surprising at all, since LDA, or topic modeling in general, as mentioned earlier, is a kind of free-form learning.
The project in this chapter was about finding hidden similarity underneath newsgroups data, be it semantic groups, themes, or word clouds. We started with what unsupervised learning does and the typical types of unsupervised learning algorithms. We then introduced unsupervised learning clustering and studied a popular clustering algorithm, k-means, in detail.
We also talked about tf-idf as a more efficient feature extraction tool for text data. After that, we performed k-means clustering on the newsgroups data and obtained four meaningful clusters. After examining the key terms in each resulting cluster, we went straight to extracting representative terms among original documents using topic modeling techniques. Two powerful topic modeling approaches, NMF and LDA, were discussed and implemented. Finally, we had some fun interpreting the topics we obtained from both methods.
Hitherto, we have covered all the main categories of unsupervised learning, including dimensionality reduction, clustering, and topic modeling, which is also dimensionality reduction in a way.
In the next chapter, we will review what you have learned so far in this book and provide best practices of real-world machine learning. The chapter aims to foolproof your learning and get you ready for the entire machine learning workflow and productionization. This will be a wrap-up of general machine learning techniques before we move on to more complex topics in the final three chapters.