How it works...

First, we create a DataFrame with our articles. 
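The exact contents depend on your articles; here is a minimal sketch, where the sample rows and the Text column name are assumptions, while the Topic and Object columns match the ones we select later in this recipe:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row holds the true topic label, the article's subject, and its text.
# The sample rows below are illustrative only, not the recipe's actual data.
articles = spark.createDataFrame([
      ('animals',   'cheetah',       'The cheetah is the fastest land animal ...')
    , ('geography', 'Mount Everest', 'Mount Everest is the highest mountain on Earth ...')
    , ('galaxies',  'Andromeda',     'The Andromeda Galaxy is a spiral galaxy ...')
], ['Topic', 'Object', 'Text'])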

Next, we go through essentially the same steps as in the Extracting features from text recipe (a code sketch follows the list):

  1. We tokenize the text into individual words using .RegexTokenizer(...)
  2. We remove the stop words using .StopWordsRemover(...)
  3. We count each word's occurrences using .CountVectorizer(...)
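If you have not worked through that recipe, the three stages look roughly as follows; the input and output column names are assumptions, except for 'features', which the LDA stage expects by default:

from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer

# Split the raw text into words on any non-word character.
tokenizer = RegexTokenizer(inputCol='Text', outputCol='words', pattern='\\W+')

# Drop common English stop words such as 'the' or 'and'.
stop_words = StopWordsRemover(inputCol='words', outputCol='no_stops')

# Turn each list of words into a vector of term counts.
count_vec = CountVectorizer(inputCol='no_stops', outputCol='features')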

To find the clusters in our data, we will use the Latent Dirichlet Allocation (LDA) model. In our case, we know that we expect three clusters, but if you do not know how many clusters to expect, you can use one of the techniques we introduced in the Tuning hyperparameters recipe earlier in this chapter.
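A sketch of the LDA stage with three topics might look like this; 'features' and 'topicDistribution' are the default column names, so they line up with the rest of the code in this recipe:

from pyspark.ml.clustering import LDA

# k=3 because we expect three topics; the 'topicDistribution' output column
# holds the probability of each topic for every article.
lda = LDA(k=3, featuresCol='features', topicDistributionCol='topicDistribution')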

Finally, we put everything into a Pipeline for convenience.
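Assuming the stages sketched above, the pipeline is assembled like this; the name topic_pipeline matches the one used in the evaluation code that follows:

from pyspark.ml import Pipeline

# Chain the feature-extraction stages and the LDA model into one estimator.
topic_pipeline = Pipeline(stages=[tokenizer, stop_words, count_vec, lda])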

Once the model is estimated, let's see how it performs. Here's a piece of code that will help us do that; note NumPy's .argmax(...) method, which helps us find the index of the highest value:

import numpy as np

for topic in (
    topic_pipeline
    .fit(articles)
    .transform(articles)
    .select('Topic', 'Object', 'topicDistribution')
    .take(10)
):
    print(
        topic.Topic
        , topic.Object
        , np.argmax(topic.topicDistribution)
        , topic.topicDistribution
    )

As the output shows, with proper processing we can extract the topics from the articles: the articles about galaxies are grouped in cluster 2, those about geography in cluster 1, and those about animals in cluster 0.
