10

Discovering Underlying Topics in the Newsgroups Dataset with Clustering and Topic Modeling

In the previous chapter, we went through text visualization using t-SNE. t-SNE, like any dimensionality reduction algorithm, is a type of unsupervised learning. In this chapter, we will continue our unsupervised learning journey, focusing specifically on clustering and topic modeling. We will start with how unsupervised learning learns without guidance and how it is good at discovering hidden information underneath data.

Next, we will talk about clustering as an important branch of unsupervised learning, which identifies different groups of observations from data. For instance, clustering is useful for market segmentation, where consumers of similar behaviors are grouped into one segment for marketing purposes. We will perform clustering on the 20 newsgroups text dataset and see what clusters will be produced.

Another unsupervised learning route we will take is topic modeling, which is the process of extracting themes hidden in the dataset. You will be amused by how many interesting themes we are able to mine from the 20 newsgroups dataset.

We will cover the following topics:

  • What is unsupervised learning?
  • Types of unsupervised learning
  • What is k-means clustering and how does it work?
  • Implementing k-means clustering from scratch
  • Implementing k-means with scikit-learn
  • Optimizing k-means clustering models
  • Term frequency-inverse document frequency
  • Clustering newsgroups data using k-means
  • What is topic modeling?
  • Non-negative matrix factorization for topic modeling
  • Latent Dirichlet allocation for topic modeling
  • Topic modeling on the newsgroups data

Learning without guidance – unsupervised learning

In the previous chapter, we applied t-SNE to visualize the newsgroup text data, reduced to two dimensions. t-SNE, or dimensionality reduction in general, is a type of unsupervised learning. Instead of having a teacher educating on what particular output to produce, such as a class or membership (classification) or a continuous value (regression), unsupervised learning identifies inherent structures or commonalities in the input data. Since there is no guidance from the "teacher" in unsupervised learning, there is no clear answer as to what is a right or wrong result. Unsupervised learning has the freedom to discover hidden information underneath the input data.

An easy way to understand unsupervised learning is to think of going through many practice questions for an exam. In supervised learning, you are given answers to those practice questions. You basically figure out the relationship between the questions and answers and learn how to map the questions to the answers. Hopefully, you will do well in the actual exam in the end by giving the correct answers. However, in unsupervised learning, you are not provided with the answers to those practice questions. What you might do in this instance could include the following:

  • Grouping similar practice questions so that you can later study related questions together at one time
  • Finding questions that are highly repetitive so that you will not waste time on them
  • Spotting rare questions so that you can be better prepared for them
  • Extracting the key chunk of each question by removing boilerplate so you can cut to the point

You will notice that the outcomes of all these tasks are pretty open-ended. They are correct as long as they are able to describe the commonality and the structure underneath the data.

Practice questions are the features in machine learning, which are also often called attributes, observations, or predictive variables. Answers to questions are the labels in machine learning, which are also called targets or target variables. Practice questions with answers provided are labeled data, while practice questions without answers are unlabeled data. Unsupervised learning works with unlabeled data and acts on that information without guidance.

Unsupervised learning can include the following types:

  • Clustering: This means grouping data based on commonality, which is often used for exploratory data analysis. Grouping similar practice questions, as mentioned earlier, is an example of clustering. Clustering techniques are widely used in customer segmentation or for grouping similar online behaviors for a marketing campaign.
  • Association: This explores the co-occurrence of particular values of two or more features. Outlier detection (also called anomaly detection) is a typical case, where rare observations are identified. Spotting rare questions in the preceding example can be achieved using outlier detection techniques.
  • Projection: This maps the original feature space to a reduced-dimensional space while retaining or extracting a set of principal variables. Extracting the key chunk of practice questions is an example of projection, or more specifically, dimensionality reduction.

Unsupervised learning is extensively employed in NLP, mainly because of the difficulty of obtaining labeled text data. Unlike with numerical data (such as house prices, stock data, and online click streams), labeling text can be subjective, manual, and tedious. Unsupervised learning algorithms that do not require labels become effective when it comes to mining text data.

In Chapter 9, Mining the 20 Newsgroups Dataset with Text Analysis Techniques, you experienced using t-SNE to reduce the dimensionality of text data. Now, let's explore text mining with clustering algorithms and topic modeling techniques. We will start with clustering the newsgroups data.

Clustering newsgroups data using k-means

The newsgroups data comes with labels, which are the categories of the newsgroups, and a number of these categories are closely related or even overlapping. Take, for instance, the five computer groups (comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x) and the two religion-related ones (alt.atheism and talk.religion.misc).

Let's now pretend we don't know those labels or they don't exist. Will samples from related topics be clustered together? We will now resort to the k-means clustering algorithm.

How does k-means clustering work?

The goal of the k-means algorithm is to partition the data into k groups based on feature similarities. K is a predefined property of a k-means clustering model. Each of the k clusters is specified by a centroid (center of a cluster) and each data sample belongs to the cluster with the nearest centroid. During training, the algorithm iteratively updates the k centroids based on the data provided. Specifically, it involves the following steps:

  1. Specifying k: The algorithm needs to know how many clusters to generate as an end result.
  2. Initializing centroids: The algorithm starts with randomly selecting k samples from the dataset as centroids.
  3. Assigning clusters: Now that we have k centroids, samples that share the same closest centroid constitute one cluster. K clusters are created as a result. Note that closeness is usually measured by the Euclidean distance. Other metrics can also be used, such as the Manhattan distance and the Chebyshev distance, which are listed in the following table (a short code snippet illustrating these metrics appears after these steps):

    Figure 10.1: Distance metrics

  4. Updating centroids: For each cluster, we need to recalculate its center point, which is the mean of all the samples in the cluster. K centroids are updated to be the means of corresponding clusters. This is why the algorithm is called k-means.
  5. Repeating steps 3 and 4: We keep repeating assigning clusters and updating centroids until the model converges when no or a small enough update of centroids can be done, or enough iterations have been completed.
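
As mentioned in step 3, closeness can be measured with different metrics. The following is a minimal NumPy sketch of the three metrics listed in Figure 10.1, using two made-up points purely for illustration:

>>> import numpy as np
>>> a = np.array([1.0, 2.0, 3.0])
>>> b = np.array([4.0, 6.0, 3.0])
>>> # Euclidean distance: square root of the sum of squared differences
>>> print(np.sqrt(np.sum((a - b) ** 2)))
5.0
>>> # Manhattan distance: sum of absolute differences
>>> print(np.sum(np.abs(a - b)))
7.0
>>> # Chebyshev distance: largest absolute difference across dimensions
>>> print(np.max(np.abs(a - b)))
4.0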

The outputs of a trained k-means clustering model include the following:

  • The cluster ID of each training sample, ranging from 1 to k
  • K centroids, which can be used to cluster new samples—a new sample will belong to the cluster of the closest centroid

It is very easy to understand the k-means clustering algorithm and its implementation is also straightforward, as you will discover next.

Implementing k-means from scratch

We will use the iris dataset from scikit-learn as an example. Let's first load the data and visualize it. We herein only use two features out of the original four for simplicity:

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data[:, 2:4]
>>> y = iris.target

Since the dataset contains three iris classes, we plot it in three different colors, as follows:

>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> plt.scatter(X[:,0], X[:,1], c=y)
>>> plt.show()

This will give us the following output for the original data plot:

Figure 10.2: Plot of the original iris dataset

Assuming we know nothing about the label y, we try to cluster the data into three groups, as there seem to be three clusters in the preceding plot (or you might say two, which we will come back to later). Let's perform step 1, specifying k, and step 2, initializing centroids, by randomly selecting three samples as initial centroids:

>>> k = 3
>>> random_index = np.random.choice(range(len(X)), k, replace=False)
>>> centroids = X[random_index]

We visualize the data (without labels any more) along with the initial random centroids:

>>> def visualize_centroids(X, centroids):
...     plt.scatter(X[:, 0], X[:, 1])
...     plt.scatter(centroids[:, 0], centroids[:, 1], marker='*',
                                             s=200, c='#050505')
...     plt.show()
>>> visualize_centroids(X, centroids)

Refer to the following screenshot for the data, along with the initial random centroids:

Figure 10.3: Data points with random centroids

Now we perform step 3, which entails assigning clusters based on the nearest centroids. First, we need to define a function that calculates distance, measured here by the Euclidean distance, as demonstrated herein:

>>> def dist(a, b):
...     return np.linalg.norm(a - b, axis=1)

Then, we develop a function that assigns a sample to the cluster of the nearest centroid:

>>> def assign_cluster(x, centroids):
...     distances = dist(x, centroids)
...     cluster = np.argmin(distances)
...     return cluster

With the clusters assigned, we perform step 4, which involves updating the centroids to the mean of all samples in the individual clusters:

>>> def update_centroids(X, centroids, clusters):
...     for i in range(k):
...         cluster_i = np.where(clusters == i)
...         centroids[i] = np.mean(X[cluster_i], axis=0)

Finally, we have step 5, which involves repeating step 3 and step 4 until the model converges, that is, until either of the following occurs:

  • Centroids move less than the pre-specified threshold
  • Sufficient iterations have been taken

We set the tolerance of the first condition and the maximum number of iterations as follows:

>>> tol = 0.0001
>>> max_iter = 100

We initialize the iteration counter and the centroid movement, along with the starting cluster assignments for all samples, as follows:

>>> iter = 0
>>> centroids_diff = 100000
>>> clusters = np.zeros(len(X))

With all the components ready, we can train the model iteration by iteration: the loop first checks convergence, then performs steps 3 and 4, and finally visualizes the latest centroids:

>>> from copy import deepcopy
>>> while iter < max_iter and centroids_diff > tol:
...     for i in range(len(X)):
...         clusters[i] = assign_cluster(X[i], centroids)
...     centroids_prev = deepcopy(centroids)
...     update_centroids(X, centroids, clusters)
...     iter += 1
...     centroids_diff = np.linalg.norm(centroids -
                                       centroids_prev)
...     print('Iteration:', str(iter))
...     print('Centroids:\n', centroids)
...     print('Centroids move: {:5.4f}'.format(centroids_diff))
...     visualize_centroids(X, centroids)

Let's take a look at the following outputs generated from the preceding commands:

  • Iteration 1: Take a look at the following output of iteration 1:
    Iteration: 1
    Centroids:
    [[5.01827957 1.72258065]
    [3.41428571 1.05714286]
    [1.464      0.244 ]]
    Centroids move: 0.8274
    

    The plot of centroids after iteration 1 is as follows:

Figure 10.4: k-means clustering result after the first round

  • Iteration 2: Take a look at the following output of iteration 2:
    Iteration: 2
    Centroids:
    [[5.20897436 1.81923077]
    [3.83181818 1.16818182]
    [1.464      0.244 ]]
    Centroids move: 0.4820
    

    The plot of centroids after iteration 2 is as follows:

Figure 10.5: k-means clustering result after the second round

  • Iteration 3: Take a look at the following output of iteration 3:
    Iteration: 3
    Centroids:
    [[5.3796875  1.9125 ]
    [4.06388889 1.25555556]
    [1.464      0.244 ]]
    Centroids move: 0.3152
    

    The plot of centroids after iteration 3 is as follows:

Figure 10.6: k-means clustering result after the third round

  • Iteration 4: Take a look at the following output of iteration 4:
    Iteration: 4
    Centroids:
    [[5.51481481 1.99444444]
    [4.19130435 1.30217391]
    [1.464      0.244 ]]
    Centroids move: 0.2083
    

    The plot of centroids after iteration 4 is as follows:

Figure 10.7: k-means clustering result after the fourth round

  • Iteration 5: Take a look at the following output of iteration 5:
    Iteration: 5
    Centroids:
    [[5.53846154 2.01346154]
    [4.22083333 1.31041667]
    [1.464      0.244 ]]
    Centroids move: 0.0431
    

    The plot of centroids after iteration 5 is as follows:

Figure 10.8: k-means clustering result after the fifth round

  • Iteration 6: Take a look at the following output of iteration 6:
    Iteration: 6
    Centroids:
    [[5.58367347 2.02653061]
    [4.25490196 1.33921569]
    [1.464 0.244 ]]
    Centroids move: 0.0648
    

    The plot of centroids after iteration 6 is as follows:

Figure 10.9: k-means clustering result after the sixth round

  • Iteration 7: Take a look at the following output of iteration 7:
    Iteration: 7
    Centroids:
    [[5.59583333 2.0375 ]
    [4.26923077 1.34230769]
    [1.464 0.244 ]]
    Centroids move: 0.0220
    

    The plot of centroids after iteration 7 is as follows:

Figure 10.10: k-means clustering result after the seventh round

  • Iteration 8: Take a look at the following output of iteration 8:
    Iteration: 8
    Centroids:
    [[5.59583333 2.0375 ]
    [4.26923077 1.34230769]
    [1.464 0.244 ]]
    Centroids move: 0.0000
    

    The plot of centroids after iteration 8 is as follows:

Figure 10.11: k-means clustering result after the eighth round

The model converges after eight iterations. The resulting centroids look promising, and we can also plot the clusters:

>>> plt.scatter(X[:, 0], X[:, 1], c=clusters)
>>> plt.scatter(centroids[:, 0], centroids[:, 1], marker='*',
                                           s=200, c='#050505')
>>> plt.show()

Refer to the following screenshot for the end result:

Figure 10.12: Data samples along with learned cluster centroids

As you can see, samples around the same centroid form a cluster. After eight iterations (you might see slightly more or fewer iterations in your case), the model converges and the centroids are no longer updated.

Implementing k-means with scikit-learn

Having developed our own k-means clustering model, we will now discuss how to use scikit-learn for a quicker solution by performing the following steps:

  1. First, import the KMeans class and initialize a model with three clusters, as follows:
    >>> from sklearn.cluster import KMeans
    >>> kmeans_sk = KMeans(n_clusters=3, random_state=42)
    

    The KMeans class takes in the following important parameters:

    Constructor parameter | Default value | Example values | Description
    n_clusters            | 8             | 3, 5, 10       | K clusters
    max_iter              | 300           | 10, 100, 500   | Maximum number of iterations
    tol                   | 1e-4          | 1e-5, 1e-8     | Tolerance to declare convergence
    random_state          | None          | 0, 42          | Random seed for program reproducibility

    Table 10.1: Parameters of the KMeans class

  2. We then fit the model on the data:
    >>> kmeans_sk.fit(X)
    
  3. After that, we can obtain the clustering results, including the clusters for data samples and centroids of individual clusters:
    >>> clusters_sk = kmeans_sk.labels_
    >>> centroids_sk = kmeans_sk.cluster_centers_
    
  4. Similarly, we plot the clusters along with the centroids:
    >>> plt.scatter(X[:, 0], X[:, 1], c=clusters_sk)
    >>> plt.scatter(centroids_sk[:, 0], centroids_sk[:, 1],
                                   marker='*', s=200, c='#050505')
    >>> plt.show()
    

This will result in the following output:

Figure 10.13: Data samples along with learned cluster centroids using scikit-learn

We get results similar to those produced by the model we implemented from scratch.
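
As noted earlier, the learned centroids can be used to cluster new samples. Here is a minimal sketch using the fitted kmeans_sk model; the two new points are made-up petal measurements purely for illustration, and the exact cluster IDs you get depend on how the clusters were numbered in your run:

>>> import numpy as np
>>> new_samples = np.array([[1.5, 0.3],    # a short, narrow petal
...                         [5.2, 2.0]])   # a long, wide petal
>>> # predict() assigns each new sample to the cluster with the closest centroid
>>> kmeans_sk.predict(new_samples)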

Choosing the value of k

Let's return to our earlier discussion on what the right value for k is. In the preceding example, it is more intuitive to set it to 3 since we know there are three classes in total. However, in most cases, we don't know how many groups are sufficient or efficient, and meanwhile, the algorithm needs a specific value of k to start with. So, how can we choose the value for k? There is a famous approach called the Elbow method.

In the Elbow method, different values of k are chosen and the corresponding models are trained; for each trained model, the sum of squared errors, or SSE (also called the sum of within-cluster distances), with respect to the centroids is calculated and plotted against k. Note that for one cluster, the SSE (or the within-cluster distance) is computed as the sum of the squared distances from the individual samples in the cluster to the centroid. The optimal k is chosen at the point where the drop in SSE starts to level off dramatically, meaning that further clustering provides little additional gain.

Let's apply the Elbow method to the example we covered in the previous section (learning by examples is what this book is all about). We perform k-means clustering under different values of k on the iris data:

>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target
>>> k_list = list(range(1, 7))
>>> sse_list = [0] * len(k_list)

We use the whole feature space and k ranges from 1 to 6. Then, we train individual models and record the resulting SSE, respectively:

>>> for k_ind, k in enumerate(k_list):
...     kmeans = KMeans(n_clusters=k, random_state=42)
...     kmeans.fit(X)
...     clusters = kmeans.labels_
...     centroids = kmeans.cluster_centers_
...     sse = 0
...     for i in range(k):
...         cluster_i = np.where(clusters == i)
...         sse += np.linalg.norm(X[cluster_i] - centroids[i])
...     print('k={}, SSE={}'.format(k, sse))
...     sse_list[k_ind] = sse
k=1, SSE=26.103076447039722
k=2, SSE=16.469773740281195
k=3, SSE=15.089477089696558
k=4, SSE=15.0307321707491
k=5, SSE=14.858930749063735
k=6, SSE=14.883090350867239

Finally, we plot the SSE versus the various k ranges, as follows:

>>> plt.plot(k_list, sse_list)
>>> plt.show()

This will result in the following output:

Figure 10.14: k-means elbow: SSE versus k

Clearly, the elbow point is k=3, since the drop in SSE slows down dramatically right after 3. Hence, k=3 is the optimal solution in this case, which is consistent with the fact that there are three classes of flowers.
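
A side note: scikit-learn also exposes the within-cluster sum of squared distances of a fitted KMeans model as its inertia_ attribute, so the elbow curve can be built without the manual loop. Here is a minimal sketch reusing k_list from above; the absolute values will differ from the SSE figures printed earlier (our loop sums per-cluster norms rather than squared distances), but the curve is read in the same way:

>>> inertia_list = []
>>> for k in k_list:
...     kmeans = KMeans(n_clusters=k, random_state=42)
...     kmeans.fit(X)
...     # inertia_ is the sum of squared distances of samples to
...     # their closest cluster center
...     inertia_list.append(kmeans.inertia_)
>>> plt.plot(k_list, inertia_list)
>>> plt.show()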

Clustering newsgroups data using k-means

You should now be very familiar with k-means clustering. Next, let's see what we are able to mine from the newsgroups dataset using this algorithm. We will use all data from four categories, 'alt.atheism', 'talk.religion.misc', 'comp.graphics', and 'sci.space', as an example.

We first load the data from those newsgroups and preprocess it as we did in Chapter 9, Mining the 20 Newsgroups Dataset with Text Analysis Techniques:

>>> from sklearn.datasets import fetch_20newsgroups
>>> categories = [
...     'alt.atheism',
...     'talk.religion.misc',
...     'comp.graphics',
...     'sci.space',
... ]
>>> groups = fetch_20newsgroups(subset='all',
                                     categories=categories)
>>> labels = groups.target
>>> label_names = groups.target_names
>>> def is_letter_only(word):
...     for char in word:
...         if not char.isalpha():
...             return False
...     return True
>>> from nltk.corpus import names
>>> all_names = set(names.words())
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> data_cleaned = []
>>> for doc in groups.data:
...     doc = doc.lower()
...     doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
                    for word in doc.split() if is_letter_only(word)
                    and word not in all_names)
...     data_cleaned.append(doc_cleaned)

We then convert the cleaned text data into count vectors using CountVectorizer of scikit-learn:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vector = CountVectorizer(stop_words="english",
                        max_features=None, max_df=0.5, min_df=2)
>>> data = count_vector.fit_transform(data_cleaned)

Note that the vectorizer we use here does not limit the number of features (word tokens), but it does set the minimum and maximum document frequency, which are 2 documents and 50% of the dataset, respectively. The document frequency of a word is measured by the number (or fraction) of documents in the dataset that contain this word.
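
To make the min_df and max_df behavior concrete, here is a minimal sketch on a tiny made-up corpus of four documents; with min_df=2, a term must appear in at least two documents, and with max_df=0.5, it must appear in no more than half of them:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> toy_docs = ['graphics image file',
...             'space shuttle image launch',
...             'image of the space shuttle',
...             'the launch was delayed']
>>> toy_vectorizer = CountVectorizer(max_df=0.5, min_df=2)
>>> toy_vectorizer.fit(toy_docs)
>>> # 'image' appears in 3 of the 4 documents (75% > 50%), so max_df drops it;
>>> # terms appearing in only 1 document are dropped by min_df
>>> print(sorted(toy_vectorizer.vocabulary_))
['launch', 'shuttle', 'space', 'the']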

With the input data ready, we will now try to cluster them into four groups as follows:

>>> from sklearn.cluster import KMeans
>>> k = 4
>>> kmeans = KMeans(n_clusters=k, random_state=42)
>>> kmeans.fit(data)

Let's do a quick check on the sizes of the resulting clusters:

>>> clusters = kmeans.labels_
>>> from collections import Counter
>>> print(Counter(clusters))
Counter({3: 3360, 0: 17, 1: 7, 2: 3})

The clusters don't look absolutely correct, with most samples (3,360 of them) crammed into one big cluster (cluster 3). What could have gone wrong? It turns out that our count-based features are not sufficiently representative. A better numerical representation for text data is the term frequency-inverse document frequency (tf-idf). Instead of simply using the token count, or the so-called term frequency (tf), it assigns each term frequency a weighting factor that decreases with the document frequency of the term. In practice, the idf factor of a term t in a set of documents D is calculated as follows:

    idf(t, D) = log(n_D / (1 + n_t))

Here, n_D is the total number of documents, n_t is the number of documents containing the term t, and the 1 is added to avoid division by zero.

With the idf factor incorporated, the tf-idf representation diminishes the weight of common terms (such as get and make) and emphasizes terms that rarely occur, but that convey an important meaning.
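
Taking the formula above literally, here is a minimal sketch with made-up document counts that shows how strongly the idf factor separates common and rare terms. Note that scikit-learn's TfidfVectorizer uses a slightly different, smoothed variant of idf, so the exact numbers it produces will differ; this is purely an illustration of the principle:

>>> import numpy as np
>>> n_D = 1000        # total number of documents in the corpus
>>> n_t_common = 800  # a very common term appears in 800 documents
>>> n_t_rare = 10     # a rare but meaningful term appears in only 10 documents
>>> print(round(np.log(n_D / (1 + n_t_common)), 3))  # small idf weight
0.222
>>> print(round(np.log(n_D / (1 + n_t_rare)), 3))    # much larger idf weight
4.51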

To use the tf-idf representation, we just need to replace CountVectorizer with TfidfVectorizer from scikit-learn as follows:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf_vector = TfidfVectorizer(stop_words='english',
                        max_features=None, max_df=0.5, min_df=2)

Now, redo feature extraction using the tf-idf vectorizer and the k-means clustering algorithm on the resulting feature space:

>>> data = tfidf_vector.fit_transform(data_cleaned)
>>> kmeans.fit(data)
>>> clusters = kmeans.labels_
>>> print(Counter(clusters))
Counter({1: 1560, 2: 686, 3: 646, 0: 495})

The clustering result becomes more reasonable.

We also take a closer look at the clusters by examining what they contain and the top 10 terms (the terms with the 10 highest tf-idf) representing each cluster:

>>> cluster_label = {i: labels[np.where(clusters == i)] for i in
                                                        range(k)}
>>> terms = tfidf_vector.get_feature_names()
>>> centroids = kmeans.cluster_centers_
>>> for cluster, index_list in cluster_label.items():
...     counter = Counter(cluster_label[cluster])
...     print('cluster_{}: {} samples'.format(cluster, len(index_list)))
...     for label_index, count in sorted(counter.items(),
                               key=lambda x: x[1], reverse=True):
...         print('{}: {} samples'.format(label_names[label_index], count))
...     print('Top 10 terms:')
...     for ind in centroids[cluster].argsort()[-10:]:
...         print(' %s' % terms[ind], end="")
...     print()
cluster_0: 495 samples
sci.space: 494 samples
comp.graphics: 1 samples
Top 10 terms:
toronto moon zoology nasa hst mission wa launch shuttle space
cluster_1: 1560 samples
sci.space: 459 samples
alt.atheism: 430 samples
talk.religion.misc: 352 samples
comp.graphics: 319 samples
Top 10 terms:
people new think know like ha just university article wa
cluster_2: 686 samples
comp.graphics: 651 samples
sci.space: 32 samples
alt.atheism: 2 samples
talk.religion.misc: 1 samples
Top 10 terms:
know thanks need format looking university program file graphic image
cluster_3: 646 samples
alt.atheism: 367 samples
talk.religion.misc: 275 samples
sci.space: 2 samples
comp.graphics: 2 samples
Top 10 terms:
moral article morality think jesus people say christian wa god

From what we observe in the preceding results:

  • cluster_0 is obviously about space and includes almost all sci.space samples and related terms such as moon, nasa, launch, shuttle, and space
  • cluster_1 is more of a generic topic
  • cluster_2 is more about computer graphics and related terms, such as format, program, file, graphic, and image
  • cluster_3 is an interesting one, which successfully brings together two overlapping topics, atheism and religion, with key terms including moral, morality, jesus, christian, and god

Feel free to try different values of k, or use the Elbow method to find the optimal one (this is actually an exercise for this chapter).

It is quite interesting to find key terms for each text group via clustering. Topic modeling is another approach for doing so, but in a much more direct way. It does not simply search for the key terms in individual clusters generated beforehand. What it does is directly extract collections of key terms from documents. You will see how this works in the next section.

Discovering underlying topics in newsgroups

A topic model is a type of statistical model for discovering the probability distributions of words associated with topics. A topic in topic modeling does not exactly match the dictionary definition; it corresponds to a nebulous statistical concept, an abstraction that occurs across a collection of documents.

When we read a document, we expect certain words appearing in the title or the body of the text to capture the semantic context of the document. An article about Python programming will have words such as class and function, while a story about snakes will have words such as eggs and afraid. Documents usually cover multiple topics; for instance, this section is about three things: topic modeling, non-negative matrix factorization, and latent Dirichlet allocation, which we will discuss shortly. We can therefore define an additive model for topics by assigning different weights to them.

Topic modeling is widely used for mining hidden semantic structures in given text data. There are two popular topic modeling algorithms—non-negative matrix factorization and latent Dirichlet allocation. We will go through both of these in the next two sections.

Topic modeling using NMF

Non-negative matrix factorization (NMF) relies heavily on linear algebra. It factorizes an input matrix, V, into a product of two smaller matrices, W and H, in such a way that these three matrices have no negative values. In the context of NLP, these three matrices have the following meanings:

  • The input matrix V is the term count or tf-idf matrix of size n * m, where n is the number of documents or samples, and m is the number of terms.
  • The first decomposition output matrix W is the feature matrix of size t * m, where t is the number of topics specified. Each row of W represents a topic with each element in the row representing the rank of a term in the topic.
  • The second decomposition output matrix H is the coefficient matrix of size n * t. Each row of H represents a document, with each element in the row representing the weight of a topic within the document.

How to derive the computation of W and H is beyond the scope of this book. However, you can refer to the following diagram to get a better sense of how NMF works:

Figure 10.15: Example of matrix W and matrix H derived from an input matrix V

If you are interested in reading more about NMF, feel free to check out the original paper Generalized Nonnegative Matrix Approximations with Bregman Divergences, by Inderjit S. Dhillon and Suvrit Sra, in NIPS 2005.
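
Before applying NMF to the real dataset, a minimal sketch on a tiny made-up count matrix may help make the shapes concrete. One caveat on naming: scikit-learn's documentation labels the two factors the other way around from the description above (its fit_transform() output is the document-topic matrix and components_ is the topic-term matrix), so the variables below are simply named after their roles:

>>> import numpy as np
>>> from sklearn.decomposition import NMF
>>> # a made-up count matrix V with n=4 documents and m=6 terms
>>> V = np.array([[2, 1, 0, 0, 0, 1],
...               [3, 0, 1, 0, 0, 0],
...               [0, 0, 0, 2, 3, 1],
...               [0, 1, 0, 3, 2, 0]])
>>> toy_nmf = NMF(n_components=2, random_state=42)
>>> doc_topic = toy_nmf.fit_transform(V)   # document-topic weights
>>> topic_term = toy_nmf.components_       # topic-term ranks
>>> print(doc_topic.shape, topic_term.shape)
(4, 2) (2, 6)
>>> # the product of the two factors approximately reconstructs V
>>> print(np.round(doc_topic @ topic_term, 1))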

Let's now apply NMF to our newsgroups data. Scikit-learn has a nice module for decomposition that includes NMF:

>>> from sklearn.decomposition import NMF
>>> t = 20
>>> nmf = NMF(n_components=t, random_state=42)

We specify 20 topics (n_components) as an example. Important parameters of the model are included in the following table:

Constructor parameter | Default value | Example values | Description
n_components          | None          | 5, 10, 20      | Number of components; in the context of topic modeling, this corresponds to the number of topics. If None, it becomes the number of input features.
max_iter              | 200           | 100, 200       | Maximum number of iterations
tol                   | 1e-4          | 1e-5, 1e-8     | Tolerance to declare convergence

Table 10.2: Parameters of the NMF class

We use the term count matrix as input to the NMF model, but you could also use the tf-idf matrix instead. Here, we reuse count_vector, as defined previously:

>>> data = count_vector.fit_transform(data_cleaned)

Now, fit the NMF model nmf on the term matrix data:

>>> nmf.fit(data)

We can obtain the resulting topic-feature rank W after the model is trained:

>>> nmf.components_
[[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
 0.00000000e+00 1.81952400e-04]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
 7.35497518e-04 3.65665719e-03]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
 0.00000000e+00 0.00000000e+00]
...
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 2.69725134e-02
 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
 0.00000000e+00 4.26844886e-05]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
 0.00000000e+00 0.00000000e+00]]

For each topic, we display the top 10 terms based on their ranks:

>>> terms = count_vector.get_feature_names()
>>> for topic_idx, topic in enumerate(nmf.components_):
...         print("Topic {}:" .format(topic_idx))
...         print(" ".join([terms[i] for i in topic.argsort()[-10:]]))
Topic 0:
available quality program free color version gif file image jpeg
Topic 1:
ha article make know doe say like just people think
Topic 2:
include available analysis user software ha processing data tool image
Topic 3:
atmosphere kilometer surface ha earth wa planet moon spacecraft solar
Topic 4:
communication technology venture service market ha commercial space satellite launch
Topic 5:
verse wa jesus father mormon shall unto mcconkie lord god
Topic 6:
format message server object image mail file ray send graphic
Topic 7:
christian people doe atheism believe religion belief religious god atheist
Topic 8:
file graphic grass program ha package ftp available image data
Topic 9:
speed material unified star larson book universe theory physicist physical
Topic 10:
planetary station program group astronaut center mission shuttle nasa space
Topic 11:
infrared high astronomical center acronym observatory satellite national telescope space
Topic 12:
used occurs true form ha ad premise conclusion argument fallacy
Topic 13:
gospel people day psalm prophecy christian ha matthew wa jesus
Topic 14:
doe word hanging say greek matthew mr act wa juda
Topic 15:
siggraph graphic file information format isbn data image ftp available
Topic 16:
venera mar lunar surface space venus soviet mission wa probe
Topic 17:
april book like year time people new did article wa
Topic 18:
site retrieve ftp software data information client database gopher search
Topic 19:
use look xv color make program correction bit gamma image

There are a number of interesting topics, for instance, computer graphics-related topics, such as 0, 2, 6, and 8, space-related ones, such as 3, 4, and 9, and religion-related ones, such as 5, 7, and 13. There are also two topics, 1 and 12, that are hard to interpret, which is totally fine since topic modeling is a kind of free-form learning.

Topic modeling using LDA

Let's explore another popular topic modeling algorithm, latent Dirichlet allocation (LDA). LDA is a generative probabilistic graphical model that explains each input document by means of a mixture of topics with certain probabilities. Again, a topic in topic modeling means a collection of words with a certain connection. In other words, LDA basically deals with two probability values: the probability of a term given a topic, P(term | topic), and the probability of a topic given a document, P(topic | document). This can be difficult to understand at the beginning. So, let's start from the bottom, with the end result of an LDA model.

Let's take a look at the following set of documents:

Document 1: This restaurant is famous for fish and chips.
Document 2: I had fish and rice for lunch.
Document 3: My sister bought me a cute kitten.
Document 4: Some research shows eating too much rice is bad.
Document 5: I always forget to feed fish to my cat.

Now, let's say we want two topics. The topics derived from these documents may appear as follows:

Topic 1: 30% fish, 20% chip, 30% rice, 10% lunch, 10% restaurant (which we can interpret as Topic 1 being food related)
Topic 2: 40% cute, 40% cat, 10% fish, 10% feed (which we can interpret as Topic 2 being pet related)

Therefore, we find how each document is represented by these two topics:

Document 1: 85% Topic 1, 15% Topic 2
Document 2: 88% Topic 1, 12% Topic 2
Document 3: 100% Topic 2
Document 4: 100% Topic 1
Document 5: 33% Topic 1, 67% Topic 2

After seeing a dummy example, we come back to its learning procedure:

  1. Specify the number of topics, T. Now we have topic 1, 2, …, and T.
  2. For each document, randomly assign one of the topics to each term in the document.
  3. For each document, calculate P(topic=t | document), which is the proportion of terms in the document that are assigned to the topic t.
  4. For each topic, calculate P(term=w | topic=t), which is the proportion of the term w among all terms that are assigned to that topic.
  5. For each term w, reassign its topic based on the latest probabilities P(topic=t | document) and P(term=w | topic=t).
  6. Repeat steps 3 to 5 under the latest topic distributions for each iteration. The training stops if the model converges or reaches the maximum number of iterations.

LDA is trained in a generative manner, where it tries to abstract from the documents a set of hidden topics that are likely to generate a certain collection of words.

With all this in mind, let's see LDA in action. The LDA model is also included in scikit-learn:

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> t = 20
>>> lda = LatentDirichletAllocation(n_components=t,
                      learning_method='batch',random_state=42)

Again, we specify 20 topics (n_components). The key parameters of the model are included in the following table:

Constructor parameter | Default value | Example values    | Description
n_components          | 10            | 5, 10, 20         | Number of components; in the context of topic modeling, this corresponds to the number of topics.
learning_method       | "batch"       | "online", "batch" | In batch mode, all training data is used for each update. In online mode, a mini-batch of training data is used for each update. In general, if the data size is large, the online mode is faster.
max_iter              | 10            | 10, 20            | Maximum number of iterations.
random_state          | None          | 0, 42             | Seed used by the random number generator.

Table 10.3: Parameters of the LatentDirichletAllocation class

For the input data to LDA, remember that LDA only takes in term counts, as it is a probabilistic graphical model. This is unlike NMF, which can work with both the term count matrix and the tf-idf matrix, as long as they are non-negative. Again, we use the term matrix defined previously as input to the LDA model:

>>> data = count_vector.fit_transform(data_cleaned)

Now, fit the LDA model on the term matrix, data:

>>> lda.fit(data)

We can obtain the resulting topic-term rank after the model is trained:

>>> lda.components_
[[0.05     2.05    2.05    ...   0.05      0.05    0.05 ]
 [0.05     0.05    0.05    ...   0.05      0.05    0.05 ]
 [0.05     0.05    0.05    ...   4.0336285 0.05    0.05 ]
 ...
 [0.05     0.05    0.05    ...   0.05      0.05    0.05 ]
 [0.05     0.05    0.05    ...   0.05      0.05    0.05 ]
 [0.05     0.05    0.05    ...   0.05      0.05    3.05 ]]

Similarly, for each topic, we display the top 10 terms based on their ranks as follows:

>>> terms = count_vector.get_feature_names()
>>> for topic_idx, topic in enumerate(lda.components_):
...         print("Topic {}:" .format(topic_idx))
...         print(" ".join([terms[i] for i in
                                   topic.argsort()[-10:]]))
Topic 0:
atheist doe ha believe say jesus people christian wa god
Topic 1:
moment just adobe want know ha wa hacker article radius
Topic 2:
center point ha wa available research computer data graphic hst
Topic 3:
objective argument just thing doe people wa think say article
Topic 4:
time like brian ha good life want know just wa
Topic 5:
computer graphic think know need university just article wa like
Topic 6:
free program color doe use version gif jpeg file image
Topic 7:
gamma ray did know university ha just like article wa
Topic 8:
tool ha processing using data software color program bit image
Topic 9:
apr men know ha think woman just university article wa
Topic 10:
jpl propulsion mission april mar jet command data spacecraft wa
Topic 11:
russian like ha university redesign point option article space station
Topic 12:
ha van book star material physicist universe physical theory wa
Topic 13:
bank doe book law wa article rushdie muslim islam islamic
Topic 14:
think gopher routine point polygon book university article know wa
Topic 15:
ha rocket new lunar mission satellite shuttle nasa launch space
Topic 16:
want right article ha make like just think people wa
Topic 17:
just light space henry wa like zoology sky article toronto
Topic 18:
comet venus solar moon orbit planet earth probe ha wa
Topic 19:
site format image mail program available ftp send file graphic

There are a number of interesting topics that we just mined, for instance, computer graphics-related topics, such as 2, 5, 6, 8, and 19, space-related ones, such as 10, 11, 12, and 15, and religion-related ones, such as 0 and 13. There are also topics involving noise, for example, 9 and 16, which may require some imagination to interpret. Again, this is not surprising at all, since LDA, or topic modeling in general, as mentioned earlier, is a kind of free-form learning.
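
Finally, mirroring the earlier dummy example where each document was a mixture of topics, we can inspect the topic mixture the trained model assigns to each document. Here is a minimal sketch using the fitted lda model; transform() returns one row of topic proportions per document, and the exact proportions will depend on your preprocessing and scikit-learn version:

>>> doc_topic = lda.transform(data)
>>> print(doc_topic.shape)   # (number of documents, 20 topics)
>>> # topic proportions of the first document; each row sums to 1
>>> print(np.round(doc_topic[0], 3))
>>> # the dominant topic of the first document
>>> print(np.argmax(doc_topic[0]))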

Summary

The project in this chapter was about finding hidden similarity underneath newsgroups data, be it semantic groups, themes, or word clouds. We started with what unsupervised learning does and the typical types of unsupervised learning algorithms. We then introduced unsupervised learning clustering and studied a popular clustering algorithm, k-means, in detail.

We also talked about tf-idf as a more efficient feature extraction tool for text data. After that, we performed k-means clustering on the newsgroups data and obtained four meaningful clusters. After examining the key terms in each resulting cluster, we went straight to extracting representative terms among original documents using topic modeling techniques. Two powerful topic modeling approaches, NMF and LDA, were discussed and implemented. Finally, we had some fun interpreting the topics we obtained from both methods.

Hitherto, we have covered all the main categories of unsupervised learning, including dimensionality reduction, clustering, and topic modeling, which is also dimensionality reduction in a way.

In the next chapter, we will review what you have learned so far in this book and provide best practices for real-world machine learning. The chapter aims to make your learning foolproof and get you ready for the entire machine learning workflow and productionization. It will be a wrap-up of general machine learning techniques before we move on to more complex topics in the final three chapters.

Exercises

  1. Perform k-means clustering on newsgroups data using different values of k, or use the Elbow method to find the optimal one. See if you get better grouping results.
  2. Try different numbers of topics, in either NMF or LDA, and see which one produces more meaningful topics in the end. This should be a fun exercise.
  3. Can you experiment with NMF or LDA on the entire 20 groups of newsgroups data? Are the resulting topics full of noise or gems?