9

Mining the 20 Newsgroups Dataset with Text Analysis Techniques

In previous chapters, we went through a bunch of fundamental machine learning concepts and supervised learning algorithms. Starting from this chapter, as the second step of our learning journey, we will be covering in detail several important unsupervised learning algorithms and techniques. To make our journey more interesting, we will start with a natural language processing (NLP) problem: exploring newsgroups data. You will gain hands-on experience in working with text data, especially how to convert words and phrases into machine-readable values and how to clean up words with little meaning. We will also visualize text data by mapping it into a two-dimensional space in an unsupervised learning manner.

We will go into detail on each of the following topics:

  • NLP fundamentals and applications
  • Touring Python NLP libraries
  • Tokenization, stemming, and lemmatization
  • Getting and exploring the newsgroups data
  • Data visualization using seaborn and matplotlib
  • The Bag of words (BoW) model
  • Text preprocessing
  • Dimensionality reduction
  • t-SNE for text visualization

How computers understand language – NLP

In Chapter 1, Getting Started with Machine Learning and Python, I mentioned that machine learning-driven programs or computers are good at discovering event patterns by processing and working with data. When the data is well structured or well defined, such as in a Microsoft Excel spreadsheet table or a relational database table, it is intuitively obvious why machine learning is better at dealing with it than humans. Computers read such data the same way as humans do, interpreting, for example, revenue: 5,000,000 as the revenue being 5 million and age: 30 as the age being 30; computers then crunch assorted data and generate insights faster than humans can. However, when the data is unstructured, such as words with which humans communicate, news articles, or someone's speech in French, it seems that computers cannot understand words as well as humans do (yet).

What is NLP?

There is a lot of information in the world about words or raw text, or, broadly speaking, natural language. This refers to any language that humans use to communicate with each other. Natural language can take various forms, including, but not limited to, the following:

  • Text, such as a web page, SMS, email, and menus
  • Audio, such as speech and commands to Siri
  • Signs and gestures
  • Many other forms, such as songs, sheet music, and Morse code

The list is endless, and we are all surrounded by natural language all of the time (that's right, right now as you are reading this book). Given the importance of this type of unstructured data, natural language data, we must have methods to get computers to understand and reason with natural language and to extract data from it. Programs equipped with NLP techniques can already do a lot in certain areas, which already seems magical!

NLP is a significant subfield of machine learning that deals with the interactions between machines (computers) and human (natural) languages. The data for NLP tasks can be in different forms, for example, text from social media posts, web pages, even medical prescriptions, or audio from voice mails, commands to control systems, or even a favorite song or movie. Nowadays, NLP is broadly involved in our daily lives: we cannot live without machine translation; weather forecast scripts are automatically generated; we find voice search convenient; we get the answer to a question (such as what is the population of Canada) quickly thanks to intelligent question-answering systems; speech-to-text technology helps people with special needs.

The history of NLP

If machines are able to understand language like humans do, we consider them intelligent. In 1950, the famous mathematician Alan Turing proposed in an article, Computing Machinery and Intelligence, a test as a criterion of machine intelligence. It's now called the Turing test (https://plato.stanford.edu/entries/turing-test/), and its goal is to examine whether a computer is able to adequately understand languages so as to fool humans into thinking that this machine is another human. It is probably no surprise to you that no computer has passed the Turing test yet, but the 1950s is considered to be when the history of NLP started.

Understanding language might be difficult, but would it be easier to automatically translate texts from one language to another? In my first ever programming course, the lab booklet had the algorithm for coarse-grained machine translation. This type of translation involves looking up in dictionaries and generating text in a new language. A more practically feasible approach would be to gather texts that are already translated by humans and train a computer program on these texts. In 1954, in the Georgetown–IBM experiment (https://en.wikipedia.org/wiki/Georgetown%E2%80%93IBM_experiment), scientists claimed that machine translation would be solved in three to five years. Unfortunately, a machine translation system that can beat human expert translators does not exist yet. But machine translation has been greatly evolving since the introduction of deep learning and has seen incredible achievements in certain areas, for example, social media (Facebook open sourced a neural machine translation system, https://ai.facebook.com/tools/translate/), real-time conversation (Skype, SwiftKey Keyboard, and Google Pixel Buds), and image-based translation, such as Google Translate.

Conversational agents, or chatbots, are another hot topic in NLP. The fact that computers are able to have a conversation with us has reshaped the way businesses are run. In 2016, Microsoft's AI chatbot, Tay (https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/), was unleashed to mimic a teenage girl and converse with users on Twitter in real time. She learned how to speak from all the things users posted and commented on Twitter. However, she was overwhelmed by tweets from trolls, and automatically learned their bad behaviors and started to output inappropriate things on her feeds. She ended up being terminated within 24 hours.

NLP applications

There are also several tasks that attempt to organize knowledge and concepts in such a way that they become easier for computer programs to manipulate. The way we organize and represent concepts is called ontology. An ontology defines concepts and relationships between concepts. For instance, we can have a so-called triple, such as ("python", "language", "is-a") representing the relationship between two concepts, such as Python is a language.

An important use case for NLP at a much lower level, compared to the previous cases, is part-of-speech (PoS) tagging. A PoS is a grammatical word category such as a noun or verb. PoS tagging tries to determine the appropriate tag for each word in a sentence or a larger document. The following table gives examples of English PoS:

Part of speech | Examples
Noun           | David, machine
Pronoun        | They, her
Adjective      | Awesome, amazing
Verb           | Read, write
Adverb         | Very, quite
Preposition    | Out, at
Conjunction    | And, but
Interjection   | Phew, oops
Article        | A, the

Table 9.1: PoS examples

There are a variety of real-world NLP applications involving supervised learning, such as PoS tagging mentioned earlier. A typical example is identifying news sentiment, which could be positive or negative in the binary case, or positive, neutral, or negative in multiclass classification. News sentiment analysis provides a significant signal to trading in the stock market.

Another example we can easily think of is news topic classification, where classes may or may not be mutually exclusive. In the newsgroup example that we just discussed, classes are mutually exclusive (despite slight overlapping), such as technology, sports, and religion. It is, however, good to realize that a news article can be occasionally assigned multiple categories (multi-label classification). For example, an article about the Olympic Games may be labeled sports and politics if there is unexpected political involvement.

Finally, an interesting application that is perhaps unexpected is named entity recognition (NER). Named entities are phrases of definitive categories, such as names of persons, companies, geographic locations, dates and times, quantities, and monetary values. NER is an important subtask of information extraction that seeks and identifies such entities. For example, we can conduct NER on the following sentence: SpaceX[Organization], a California[Location]-based company founded by the famous tech entrepreneur Elon Musk[Person], announced that it would manufacture the next-generation, 9[Quantity]-meter-diameter launch vehicle and spaceship for the first orbital flight in 2020[Date].

In the next chapter, we will discuss how unsupervised learning, including clustering and topic modeling, is applied to text data. We will begin by covering NLP basics in the upcoming sections in this chapter.

Touring popular NLP libraries and picking up NLP basics

Now that we have covered a short list of real-world applications of NLP, we will be touring the essential stack of Python NLP libraries. These packages handle a wide range of NLP tasks as mentioned previously, including sentiment analysis, text classification, and NER.

Installing famous NLP libraries

The most famous NLP libraries in Python include the Natural Language Toolkit (NLTK), spaCy, Gensim, and TextBlob. The scikit-learn library also has impressive NLP-related features. Let's take a look at them in more detail:

  • nltk: This library (http://www.nltk.org/) was originally developed for educational purposes and is now being widely used in industry as well. It is said that you can't talk about NLP without mentioning NLTK. It is one of the most famous and leading platforms for building Python-based NLP applications. You can install it simply by running the following command line in the terminal:
    sudo pip install -U nltk
    

    If you're using conda, execute the following command line:

    conda install nltk
    
  • spaCy: This library (https://spacy.io/) is a more powerful toolkit in the industry than NLTK. This is mainly for two reasons: first, spaCy is written in Cython, which is much more memory-optimized (now you can see where the Cy in spaCy comes from) and excels in NLP tasks; second, spaCy uses state-of-the-art algorithms for core NLP problems, such as convolutional neural network (CNN) models for tagging and NER. However, it could seem advanced for beginners. In case you're interested, here are the installation instructions.

    Run the following command line in the terminal:

    pip install -U spacy
    

    For conda, execute the following command line:

    conda install -c conda-forge spacy
    
  • Gensim: This library (https://radimrehurek.com/gensim/), developed by Radim Rehurek, has been gaining popularity over recent years. It was initially designed in 2008 to generate a list of similar articles given an article, hence the name of this library (generate similar—> Gensim). It was later drastically improved by Radim Rehurek in terms of its efficiency and scalability. Again, you can easily install it via pip by running the following command line:
    pip install --upgrade gensim
    

    In the case of conda, you can perform the following command line in the terminal:

    conda install -c conda-forge gensim 
    

You should make sure that the dependencies, NumPy and SciPy, are already installed before installing gensim.

  • TextBlob: This library (https://textblob.readthedocs.io/en/dev/) is a relatively new one built on top of NLTK. It simplifies NLP and text analysis with easy-to-use built-in functions and methods, as well as wrappers around common tasks. We can install TextBlob by running the following command line in the terminal:
    pip install -U textblob
    

    TextBlob has some useful features that are not available in NLTK (currently), such as spellchecking and correction, language detection, and translation.

Corpora

As of 2018, NLTK comes with over 100 collections of large and well-structured text datasets, which are called corpora in NLP. Corpora can be used as dictionaries for checking word occurrences and as training pools for model learning and validating. Some useful and interesting corpora include Web Text corpus, Twitter samples, Shakespeare corpus, Sentiment Polarity, Names corpus (it contains lists of popular names, which we will be exploring very shortly), WordNet, and the Reuters benchmark corpus. The full list can be found at http://www.nltk.org/nltk_data. Before using any of these corpus resources, we need to first download them by running the following code in the Python interpreter:

>>> import nltk
>>> nltk.download()

A new window will pop up and ask you which collections (the Collections tab in the following screenshot) or corpus (the Corpora tab in the following screenshot) to download, and where to keep the data:

Figure 9.1: Collections tab in the NLTK installation

Installing the whole popular package is the quick solution, since it contains all important corpora needed for your current study and future research. Installing a particular corpus, as shown in the following screenshot, is also fine:

Figure 9.2: Corpora tab in the NLTK installation

Once the package or corpus you want to explore is installed, you can now take a look at the Names corpus (make sure the names corpus is installed).
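If you prefer to skip the GUI, you can also download an individual corpus programmatically; for example, the following fetches just the Names corpus by its identifier:

>>> nltk.download('names')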

First, import the names corpus:

>>> from nltk.corpus import names

We can check out the first 10 names in the list:

>>> print(names.words()[:10])
['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie',
'Abby', 'Abigael', 'Abigail', 'Abigale']

There are 7,944 names in total, as the following command shows:

>>> print(len(names.words()))
7944

Other corpora are also fun to explore.

Besides the easy-to-use and abundant corpora pool, more importantly, NLTK is also good at many NLP and text analysis tasks, including tokenization, PoS tagging, NER, word stemming, and lemmatization.

Tokenization

Given a text sequence, tokenization is the task of breaking it into fragments, which can be words, characters, or sentences. Certain characters are usually removed, such as punctuation marks, digits, and emoticons. The remaining fragments are the so-called tokens used for further processing. Moreover, tokens composed of one word are also called unigrams in computational linguistics; bigrams are composed of two consecutive words; trigrams of three consecutive words; and n-grams of n consecutive words. Here is an example of tokenization:

Figure 9.3: Tokenization example

We can implement word-based tokenization using the word_tokenize function in NLTK. We will use a multi-line input text, I am reading a book. It is Python Machine Learning By Example, 3rd edition., as an example, as shown in the following commands:

>>> from nltk.tokenize import word_tokenize
>>> sent = '''I am reading a book.
...           It is Python Machine Learning By Example,
...           3rd edition.'''
>>> print(word_tokenize(sent))
['I', 'am', 'reading', 'a', 'book', '.', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', ',', '3rd', 'edition', '.']

Word tokens are obtained.

The word_tokenize function keeps punctuation marks and digits, and only discards whitespaces and newlines.
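As a side note on the n-grams mentioned earlier, here is a minimal sketch (using NLTK's ngrams utility) of turning the unigram tokens we just obtained into bigrams:

>>> from nltk.util import ngrams
>>> print(list(ngrams(word_tokenize(sent), 2))[:3])
[('I', 'am'), ('am', 'reading'), ('reading', 'a')]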

You might think word tokenization is simply splitting a sentence by space and punctuation. Here's an interesting example showing that tokenization is more complex than you think:

>>> sent2 = 'I have been to U.K. and U.S.A.'
>>> print(word_tokenize(sent2))
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A', '.']

The tokenizer accurately recognizes the words 'U.K.' and 'U.S.A' as tokens instead of 'U' and '.' followed by 'K', for example.

spaCy also has an outstanding tokenization feature. It uses an accurately trained model that is constantly updated. To install it, we can run the following command:

python -m spacy download en_core_web_sm

Then, we'll load the en_core_web_sm model and parse the sentence using this model:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> tokens2 = nlp(sent2)
>>> print([token.text for token in tokens2])
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A.']

We can also segment text based on sentences. For example, on the same input text, using the sent_tokenize function from NLTK, we have the following commands:

>>> from nltk.tokenize import sent_tokenize
>>> print(sent_tokenize(sent))
['I am reading a book.', 
'It is Python Machine Learning By Example,
          3rd edition.']

Two sentence-based tokens are returned, as there are two sentences in the input text.
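spaCy offers the equivalent through the sents attribute of a parsed document. Here is a quick sketch reusing the nlp model loaded earlier; it should again yield two sentence spans:

>>> doc = nlp(sent)
>>> print([sentence.text for sentence in doc.sents])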

PoS tagging

We can apply an off-the-shelf tagger from NLTK or combine multiple taggers to customize the tagging process. It is easy to directly use the built-in tagging function, pos_tag, as in pos_tag(input_tokens), for instance. But behind the scenes, it is actually a prediction from a pre-built supervised learning model. The model is trained based on a large corpus composed of words that are correctly tagged.

Reusing an earlier example, we can perform PoS tagging as follows:

>>> import nltk
>>> tokens = word_tokenize(sent)
>>> print(nltk.pos_tag(tokens))
[('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('Python', 'NNP'), ('Machine', 'NNP'), ('Learning', 'NNP'), ('By', 'IN'), ('Example', 'NNP'), (',', ','), ('3rd', 'CD'), ('edition', 'NN'), ('.', '.')]

The PoS tag following each token is returned. We can check the meaning of a tag using the help function. Looking up PRP and VBP, for example, gives us the following output:

>>> nltk.help.upenn_tagset('PRP')
PRP: pronoun, personal
   hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us
>>> nltk.help.upenn_tagset('VBP')
VBP: verb, present tense, not 3rd person singular
   predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag ...

In spaCy, getting a PoS tag is also easy. The token object parsed from an input sentence has an attribute called pos_, which is the tag we are looking for. Let's print the pos_ for each token, as follows:

>>> print([(token.text, token.pos_) for token in tokens2])
[('I', 'PRON'), ('have', 'VERB'), ('been', 'VERB'), ('to', 'ADP'), ('U.K.', 'PROPN'), ('and', 'CCONJ'), ('U.S.A.', 'PROPN')]

We have just played around with PoS tagging with NLP packages. What about NER? Let's see in the next section.

NER

Given a text sequence, the NER task is to locate and identify words or phrases that are of definitive categories, such as names of persons, companies, locations, and dates. Let's take a peep at an example of using spaCy for NER.

First, tokenize an input sentence, The book written by Hayden Liu in 2020 was sold at $30 in America, as usual, as shown in the following command:

>>> tokens3 = nlp('The book written by Hayden Liu in 2020 was sold at $30 in America')

The resultant token object contains an attribute called ents, which are the named entities. We can extract the tagging for each recognized named entity as follows:

>>> print([(token_ent.text, token_ent.label_) for token_ent in tokens3.ents])
[('Hayden Liu', 'PERSON'), ('2020', 'DATE'), ('30', 'MONEY'), ('America', 'GPE')]

We can see from the results that Hayden Liu is PERSON, 2020 is DATE, 30 is MONEY, and America is GPE (country). Please refer to https://spacy.io/api/annotation#section-named-entities for a full list of named entity tags.

Stemming and lemmatization

Word stemming is a process of reverting an inflected or derived word to its root form. For instance, machine is the stem of machines, and learning and learned are generated from learn as their stem.

The word lemmatization is a cautious version of stemming. It considers the PoS of a word when conducting stemming. Also, it traces back to the lemma of the word. We will discuss these two text preprocessing techniques, stemming and lemmatization, in further detail shortly. For now, let's take a quick look at how they're implemented respectively in NLTK by performing the following steps:

  1. Import porter as one of the three built-in stemming algorithms (LancasterStemmer and SnowballStemmer are the other two) and initialize the stemmer as follows:
    >>> from nltk.stem.porter import PorterStemmer
    >>> porter_stemmer = PorterStemmer()
    
  2. Stem machines and learning, as shown in the following codes:
    >>> porter_stemmer.stem('machines')
    'machin'
    >>> porter_stemmer.stem('learning')
    'learn'
    

    Stemming sometimes involves the chopping of letters if necessary, as you can see in machin in the preceding command output.

  3. Now, import a lemmatization algorithm based on the built-in WordNet corpus and initialize a lemmatizer:
    >>> from nltk.stem import WordNetLemmatizer
    >>> lemmatizer = WordNetLemmatizer()
    

    Similar to stemming, we lemmatize machines, and learning:

    >>> lemmatizer.lemmatize('machines')
    'machine'
    >>> lemmatizer.lemmatize('learning')
    'learning'
    

Why is learning unchanged? It turns out that this algorithm lemmatizes words as nouns by default.
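We can verify this by passing the PoS explicitly (the pos parameter defaults to 'n' for noun); treating learning as a verb should give the expected lemma:

>>> lemmatizer.lemmatize('learning', pos='v')
'learn'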

Semantics and topic modeling

Gensim is famous for its powerful semantic and topic modeling algorithms. Topic modeling is a typical text mining task of discovering the hidden semantic structures in a document. A semantic structure in plain English is the distribution of word occurrences. It is obviously an unsupervised learning task. What we need to do is to feed in plain text and let the model figure out the abstract topics. We will study topic modeling in detail in Chapter 10, Discovering Underlying Topics in the Newsgroups Dataset with Clustering and Topic Modeling.

In addition to robust semantic modeling methods, gensim also provides the following functionalities:

  • Word embedding: Also known as word vectorization, this is an innovative way to represent words while preserving words' co-occurrence features. We will study word embedding in detail in Chapter 11, Machine Learning Best Practices.
  • Similarity querying: This functionality retrieves objects that are similar to the given query object. It's a feature built on top of word embedding.
  • Distributed computing: This functionality makes it possible to efficiently learn from millions of documents.

Last but not least, as mentioned in the first chapter, Getting Started with Machine Learning and Python, scikit-learn is the main package we have used throughout this entire book. Luckily, it provides all the text processing features we need, such as tokenization, along with comprehensive machine learning functionalities. Plus, it comes with a built-in loader for the 20 newsgroups dataset.

Now that the tools are available and properly installed, what about the data?

Getting the newsgroups data

The project in this chapter is about the 20 newsgroups dataset. It's composed of text taken from newsgroup articles, as its name implies. It was originally collected by Ken Lang and now has been widely used for experiments in text applications of machine learning techniques, specifically NLP techniques.

The data contains approximately 20,000 documents across 20 online newsgroups. A newsgroup is a place on the internet where people can ask and answer questions about a certain topic. The data is already cleaned to a certain degree and already split into training and testing sets. The cutoff point is at a certain date.

The original data comes from http://qwone.com/~jason/20Newsgroups/, with 20 different topics listed, as follows:

  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • misc.forsale
  • talk.politics.misc
  • talk.politics.guns
  • talk.politics.mideast
  • talk.religion.misc
  • alt.atheism
  • soc.religion.christian

All of the documents in the dataset are in English. And we can easily deduce the topics from the newsgroups' names.

The dataset is labeled and each document is composed of text data and a group label. This also makes it a perfect fit for supervised learning, such as text classification. At the end of the chapter, feel free to practice classification on this dataset using what you've learned so far in this book.

Some of the newsgroups are closely related or even overlapping, for instance, the five computer groups (comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x), while some are not closely related to each other, such as Christian (soc.religion.christian) and baseball (rec.sport.baseball).

Hence, it's a perfect use case for unsupervised learning such as clustering, with which we can see whether similar topics are grouped together and unrelated ones are far apart. Moreover, we can even discover abstract topics beyond the original 20 labels using topic modeling techniques.

For now, let's focus on exploring and analyzing the text data. We will get started with acquiring the data.

It is possible to download the dataset manually from the original website or many other online repositories. However, there are also many versions of the dataset—some are cleaned in a certain way and some are in raw form. To avoid confusion, it is best to use a consistent acquisition method. The scikit-learn library provides a utility function that loads the dataset. Once the dataset is downloaded, it's automatically cached. We don't need to download the same dataset twice.

In most cases, caching the dataset, especially for a relatively small one, is considered a good practice. Other Python libraries also provide data download utilities, but not all of them implement automatic caching. This is another reason why we love scikit-learn.

As always, we first import the loader function for the 20 newsgroups data, as follows:

>>> from sklearn.datasets import fetch_20newsgroups

Then, we download the dataset with all the default parameters, as follows:

>>> groups = fetch_20newsgroups()
Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

We can also specify one or more certain topic groups and particular sections (training, testing, or both) and just load such a subset of data in the program. The full list of parameters and options for the loader function is summarized in the following table:

Parameter           | Default value       | Example values                   | Description
subset              | 'train'             | 'train', 'test', 'all'           | The dataset to load: the training set, the testing set, or both
data_home           | ~/scikit_learn_data | ~/myfolder                       | Directory where the files are stored and cached
categories          | None                | ['sci.space', 'alt.atheism']     | List of newsgroups to load; if None, all newsgroups will be loaded
shuffle             | True                | True, False                      | Boolean indicating whether to shuffle the data
random_state        | 42                  | 7, 43                            | Random seed integer used to shuffle the data
remove              | ()                  | ('headers', 'footers', 'quotes') | Tuple indicating the part(s) among the header, footer, and quote of each newsgroup post to omit; nothing is removed by default
download_if_missing | True                | True, False                      | Boolean indicating whether to download the data if it is not found locally

Table 9.2: List of parameters of the fetch_20newsgroups() function

Remember that random_state is useful for the purpose of reproducibility. You are able to get the same dataset every time you run the script. Otherwise, working on datasets shuffled under different orders might bring in unnecessary variations.
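As an illustration of these parameters (a sketch only, with arbitrarily chosen values), the following loads just the test portion of two newsgroups and strips the headers, footers, and quotes; the printed number is the count of documents in this subset:

>>> groups_sub = fetch_20newsgroups(subset='test',
...                                 categories=['sci.space', 'alt.atheism'],
...                                 remove=('headers', 'footers', 'quotes'),
...                                 random_state=42)
>>> print(len(groups_sub.data))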

In this section, we loaded the newsgroups data. Let's explore it next.

Exploring the newsgroups data

After we download the 20 newsgroups dataset by whatever means we prefer, the data object of groups is now cached in memory. The data object is in the form of a key-value dictionary. Its keys are as follows:

>>> groups.keys()
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

The target_names key gives the newsgroups names:

>>> groups['target_names']
   ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

The target key corresponds to a newsgroup, but is encoded as an integer:

>>> groups.target
array([7, 4, 4, ..., 3, 1, 8])

Then, what are the distinct values for these integers? We can use the unique function from NumPy to figure it out:

>>> import numpy as np
>>> np.unique(groups.target)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

They range from 0 to 19, representing the 1st, 2nd, 3rd, …, 20th newsgroup topics in groups['target_names'].

In the context of multiple topics or categories, it is important to know what the distribution of topics is. A balanced class distribution is the easiest to deal with, because there are no under-represented or over-represented categories. However, frequently we have a skewed distribution with one or more categories dominating.

We will use the seaborn package (https://seaborn.pydata.org/) to compute the histogram of categories and plot it utilizing the matplotlib package (https://matplotlib.org/). We can install both packages via pip as follows:

python -m pip install -U matplotlib
pip install seaborn

In the case of conda, you can execute the following command line:

conda install -c conda-forge matplotlib
conda install seaborn

Remember to install matplotlib before seaborn as matplotlib is one of the dependencies of the seaborn package.

Now, let's display the distribution of the classes, as follows:

>>> import seaborn as sns
>>> import matplotlib.pyplot as plt
>>> sns.distplot(groups.target)
<matplotlib.axes._subplots.AxesSubplot object at 0x108ada6a0>
>>> plt.show()

Refer to the following screenshot for the end result:

Figure 9.4: Distribution of newsgroup classes

As you can see, the distribution is approximately uniform, so that's one less thing to worry about.
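Note that distplot has been deprecated in recent seaborn releases. If your version no longer provides it, a close substitute for inspecting the class distribution (a hedged alternative, not the exact plot shown above) is:

>>> sns.countplot(x=groups.target)
>>> plt.show()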

It's good to visualize data to get a general idea of how the data is structured, what possible issues may arise, and whether there are any irregularities that we have to take care of.

Other keys are quite self-explanatory: data contains all the newsgroup documents, and filenames stores the path where each document is located in your filesystem.

Now, let's have a look at the first document and its topic number and name by executing the following commands:

>>> groups.data[0]
"From: [email protected] (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
 ---- brought to you by your neighborhood Lerxst ----




"
>>> groups.target[0]
7
>>> groups.target_names[groups.target[0]]
'rec.autos'

If random_state isn't fixed (42 by default), you may get different results running the preceding scripts.

As you can see, the first document is from the rec.autos newsgroup, which was assigned the number 7. Reading this post, we can easily figure out that it's about cars. The word car actually occurs a number of times in the document. Words such as bumper also seem very car-oriented. However, words such as doors may not necessarily be car related, as they may also be associated with home improvement or another topic.

As a side note, it makes sense to not distinguish between doors and door, or the same word with different capitalization, such as Doors. There are some rare cases where capitalization does matter, for instance, if we're trying to find out whether a document is about the band called The Doors or the more common concept, the doors (in wood).

Thinking about features for text data

From the preceding analysis, we can safely conclude that, if we want to figure out whether a document was from the rec.autos newsgroup, the presence or absence of words such as car, doors, and bumper can be very useful features. The presence or absence of a word is a Boolean variable, and we can also look at the count of certain words. For instance, car occurs multiple times in the document. Maybe the more times such a word is found in a text, the more likely it is that the document has something to do with cars.

Counting the occurrence of each word token

It seems that we are only interested in the occurrence of certain words, their count, or a related measure, and not in the order of the words. We can therefore view a text as a collection of words. This is called the Bag of Words (BoW) model. This is a very basic model, but it works pretty well in practice. We can optionally define a more complex model that takes into account the order of words and PoS tags. However, such a model is going to be more computationally expensive and more difficult to program. In reality, the basic BoW model in most cases suffices. We can give it a shot and see whether the BoW model makes sense.
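To make the idea concrete before we bring in scikit-learn, here is a toy, hand-rolled sketch of BoW counting on two tiny documents using Python's built-in Counter (for illustration only):

>>> from collections import Counter
>>> toy_docs = ['the car has two doors', 'the doors of the car']
>>> for toy_doc in toy_docs:
...     print(Counter(toy_doc.split()))
Counter({'the': 1, 'car': 1, 'has': 1, 'two': 1, 'doors': 1})
Counter({'the': 2, 'doors': 1, 'of': 1, 'car': 1})

Each document becomes a mapping from word to count, and the word order is discarded, which is exactly what the BoW model does.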

We begin by converting documents into a matrix where each row represents each newsgroup document and each column represents a word token, or specifically, a unigram to begin with. And the value of each element in the matrix is the number of times the word (column) occurs in the document (row). We are utilizing the CountVectorizer class from scikit-learn to do the work:

>>> from sklearn.feature_extraction.text import CountVectorizer

The important parameters and options for the count conversion function are summarized in the following table:

Constructor parameter | Default value | Example values                                        | Description
ngram_range           | (1, 1)        | (1, 2), (2, 2)                                        | Lower and upper bound of the n-grams to be extracted from the input text; for example, (1, 1) means unigrams only and (1, 2) means unigrams and bigrams
stop_words             | None          | 'english', a list such as ['a', 'the', 'of'], or None | Which stop word list to use: 'english' refers to the built-in list, or a customized list can be provided; if None, no words are removed
lowercase               | True          | True, False                                           | Whether to convert all characters to lowercase
max_features            | None          | None, 200, 500                                        | The number of top (most frequent) tokens to consider, or all tokens if None
binary                  | False         | True, False                                           | If True, all non-zero counts become 1s

Table 9.3: List of parameters of the CountVectorizer() function

We first initialize the count vectorizer with 500 top features (500 most frequent tokens):

>>> count_vector = CountVectorizer(max_features=500)

Use it to fit on the raw text data as follows:

>>> data_count = count_vector.fit_transform(groups.data)

Now the count vectorizer captures the top 500 features and generates a token count matrix out of the original text input:

>>> data_count
<11314x500 sparse matrix of type '<class 'numpy.int64'>'
      with 798221 stored elements in Compressed Sparse Row format>
>>> data_count[0]
<1x500 sparse matrix of type '<class 'numpy.int64'>'
      with 53 stored elements in Compressed Sparse Row format>

The resulting count matrix is a sparse matrix where each row only stores non-zero elements (hence, only 798,221 elements instead of 11314 * 500 = 5,657,000). For example, the first document is converted into a sparse vector composed of 53 non-zero elements. If you are interested in seeing the whole matrix, feel free to run the following:

>>> data_count.toarray()

If you just want the first row, run the following:

>>> data_count.toarray()[0]

Let's take a look at the following output derived from the preceding command:

Figure 9.5: Output of count vectorization

So, what are those 500 top features? They can be found in the following output:

>>> print(count_vector.get_feature_names())
['00', '000', '0d', '0t', '10', '100', '11', '12', '13', '14', '145', '15', '16', '17', '18', '19', '1993', '1d9', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '34u', '35', '40', '45', '50', '55', '80', '92', '93', '__', '___', 'a86', 'able', 'ac', 'access', 'actually', 'address', 'ago', 'agree', 'al', 'american
……
……
……
 'user', 'using', 'usually', 'uucp', 've', 'version', 'video', 'view', 'virginia', 'vs', 'want', 'wanted', 'war', 'washington', 'way', 'went', 'white', 'win', 'window', 'windows', 'won', 'word', 'words', 'work', 'working', 'works', 'world', 'wouldn', 'write', 'writes', 'wrong', 'wrote', 'year', 'years', 'yes', 'york']
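As a quick sanity check (a sketch that reuses the objects created above), we can pair the non-zero counts in the first document's row with their corresponding tokens and print the most frequent ones:

>>> import numpy as np
>>> feature_names = np.array(count_vector.get_feature_names())
>>> first_row = data_count[0].toarray().ravel()
>>> non_zero = first_row.nonzero()[0]
>>> print(sorted(zip(feature_names[non_zero], first_row[non_zero]),
...              key=lambda pair: -pair[1])[:10])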

Our first trial doesn't look perfect. Obviously, the most popular tokens are numbers, or letters with numbers such as a86, which do not convey important information. Moreover, there are many words that have no actual meaning, such as you, the, them, and then. Also, some words contain identical information, for example, tell and told, use and used, and time and times. Let's tackle these issues.

Text preprocessing

We begin by retaining letter-only words so that numbers such as 00 and 000 and combinations of letters and numbers such as b8f will be removed. The filter function is defined as follows:

>>> data_cleaned = []
>>> for doc in groups.data:
...     doc_cleaned = ' '.join(word for word in doc.split()
...                            if word.isalpha())
...     data_cleaned.append(doc_cleaned)

This will generate a cleaned version of the newsgroups data.

Dropping stop words

We haven't yet discussed stop_words, an important parameter in CountVectorizer. Stop words are those common words that provide little value in helping to differentiate documents. In general, stop words add noise to the BoW model and can be removed.

There's no universal list of stop words. Hence, depending on the tools or packages you are using, you will remove different sets of stop words. Take scikit-learn as an example—you can check the list as follows:

>>> from sklearn.feature_extraction import stop_words
>>> print(stop_words.ENGLISH_STOP_WORDS)
frozenset({'most', 'three', 'between', 'anyway', 'made', 'mine', 'none', 'could', 'last', 'whenever', 'cant', 'more', 'where', 'becomes', 'its', 'this', 'front', 'interest', 'least', 're', 'it', 'every', 'four', 'else', 'over', 'any', 'very', 'well', 'never', 'keep', 'no', 'anything', 'itself', 'alone', 'anyhow', 'until', 'therefore', 'only', 'the', 'even', 'so', 'latterly', 'above', 'hereafter', 'hereby', 'may', 'myself', 'all', 'those', 'down',
……
……
'him', 'somehow', 'or', 'per', 'nowhere', 'fifteen', 'via', 'must', 'someone', 'from', 'full', 'that', 'beyond', 'still', 'to', 'get', 'himself', 'however', 'as', 'forty', 'whatever', 'his', 'nothing', 'though', 'almost', 'become', 'call', 'empty', 'herein', 'than', 'while', 'bill', 'thru', 'mostly', 'yourself', 'up', 'former', 'each', 'anyone', 'hundred', 'several', 'others', 'along', 'bottom', 'one', 'five', 'therein', 'was', 'ever', 'beside', 'everyone'})
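Note that the sklearn.feature_extraction.stop_words module shown above has been removed in newer versions of scikit-learn. If that import fails in your environment, the same frozen set is available in sklearn.feature_extraction.text:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
>>> print(ENGLISH_STOP_WORDS)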

To drop stop words from the newsgroups data, we simply just need to specify the stop_words parameter:

>>> count_vector_sw = CountVectorizer(stop_words="english", max_features=500)

Besides stop words, you may notice that names are included in the top features, such as andrew. We can filter names with the Names corpus from NLTK that we just worked with.

Reducing inflectional and derivational forms of words

As mentioned earlier, we have two basic strategies to deal with words from the same root—stemming and lemmatization. Stemming is a quicker approach that involves, if necessary, chopping off letters; for example, words becomes word after stemming. The result of stemming doesn't have to be a valid word. For instance, trying and try become tri. Lemmatizing, on the other hand, is slower but more accurate. It performs a dictionary lookup and guarantees to return a valid word. Recall that we implemented both stemming and lemmatization using NLTK in a previous section.

Putting all of these (preprocessing, dropping stop words, lemmatizing, and count vectorizing) together, we obtain the following:

>>> from nltk.corpus import names
>>> all_names = set(names.words())
>>> count_vector_sw = CountVectorizer(stop_words="english", max_features=500)
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> data_cleaned = []
>>> for doc in groups.data:
...     doc = doc.lower()
...     doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
...                            for word in doc.split()
...                            if word.isalpha() and
...                            word not in all_names)
...     data_cleaned.append(doc_cleaned)
>>> data_cleaned_count = count_vector_sw.fit_transform(data_cleaned)

Now the features are much more meaningful:

>>> print(count_vector_sw.get_feature_names())
['able', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'address', 'ago', 'agree', 'algorithm', 'allow', 'american', 'anonymous', 'answer', 'anybody', 'apple', 'application', 'apr', 'april', 'arab', 'area', 'argument', 'armenian', 'article', 'ask', 'asked', 'assume', 'atheist', 'attack', 'attempt', 'available', 'away', 'bad', 'based', 'belief', 'believe', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'black', 'board', 'body', 'book', 'box', 'build', 'bus', 'buy', 'ca', 'california', 'called', 'came', 'canada', 'car', 'card', 'care', 'carry', 'case', 'cause', 'center', 'certain', 'certainly', 'chance', 'change', 'check', 'child', 'chip', 'christian', 'church', 'city', 'claim', 'clear', 'clinton', 'clipper', 'code', 'college', 'color', 'come', 'coming', 'command', 'comment', 'common', 'communication', 'company', 'computer', 'consider', 'considered', 'contact', 'control', 'copy',
……
……
'short', 'shot', 'similar', 'simple', 'simply', 'single', 'site', 'situation', 'size', 'small', 'software', 'sort', 'sound', 'source', 'space', 'special', 'specific', 'speed', 'standard', 'start', 'started', 'state', 'statement', 'steve', 'stop', 'strong', 'study', 'stuff', 'subject', 'sun', 'support', 'sure', 'taken', 'taking', 
'talk', 'talking', 'tape', 'tax', 'team', 'technical', 'technology', 'tell', 'term', 'test', 'texas', 'text', 'thanks', 'thing', 'think', 'thinking', 'thought', 'time', 'tin', 'today', 'told', 'took', 'total', 'tried', 'true', 'truth', 'try', 'trying', 'turkish', 'turn', 'type', 'understand', 'united', 'university', 'unix', 'unless', 'usa', 'use', 'used', 'user', 'using', 'usually', 'value', 'various', 'version', 'video', 'view', 'wa', 'want', 'wanted', 'war', 'water', 'way', 'weapon', 'week', 'went', 'western', 'white', 'widget', 'win', 'window', 'woman', 'word', 'work', 'working', 'world', 'worth', 'write', 'written', 'wrong', 'year', 'york', 'young']

We have just converted text from each raw newsgroup document into a sparse vector of size 500. For a vector from a document, each element represents the number of times a word token occurs in this document. Also, these 500-word tokens are selected based on their overall occurrences after text preprocessing, the removal of stop words, and lemmatization. Now you may ask questions such as, is such an occurrence vector representative enough, or does such an occurrence vector convey enough information that can be used to differentiate the document from documents on other topics? You will see the answer in the next section.

Visualizing the newsgroups data with t-SNE

We can answer these questions easily by visualizing those representation vectors. If we can see the document vectors from the same topic form a cluster, we did a good job mapping the documents into vectors. But how? They are of 500 dimensions, while we can visualize data of at most three dimensions. We can resort to t-SNE for dimensionality reduction.

What is dimensionality reduction?

Dimensionality reduction is an important machine learning technique that reduces the number of features and, at the same time, retains as much information as possible. It is usually performed by obtaining a set of new principal features.

As mentioned before, it is difficult to visualize data of high dimension. Given a three-dimensional plot, we sometimes don't find it straightforward to observe any findings, not to mention 10, 100, or 1,000 dimensions. Moreover, some of the features in high dimensional data may be correlated and, as a result, bring in redundancy. This is why we need dimensionality reduction.

Dimensionality reduction is not simply taking out a pair of two features from the original feature space. It is transforming the original feature space to a new space of fewer dimensions. The data transformation can be linear, such as the famous one, principal component analysis (PCA), which maps the data in a higher dimensional space to a lower dimensional space where the variance of the data is maximized, as mentioned in Chapter 3, Recognizing Faces with Support Vector Machine, or nonlinear, such as neural networks and t-SNE, which is coming up shortly. Non-negative matrix factorization (NMF) is another powerful algorithm, which we will study in detail in Chapter 10, Discovering Underlying Topics in the Newsgroups Dataset with Clustering and Topic Modeling.

At the end of the day, most dimensionality reduction algorithms are in the family of unsupervised learning as the target or label information (if available) is not used in data transformation.

t-SNE for dimensionality reduction

t-SNE stands for t-distributed Stochastic Neighbor Embedding. It is a nonlinear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton. t-SNE has been widely used for data visualization in various domains, including computer vision, NLP, bioinformatics, and computational genomics.

As its name implies, t-SNE embeds high-dimensional data into a low-dimensional (usually two-dimensional or three-dimensional) space where similarity among data samples (neighbor information) is preserved. It first models a probability distribution over neighbors around data points by assigning a high probability to similar data points and an extremely small probability to dissimilar ones. Note that similarity and neighbor distances are measured by Euclidean distance or other metrics. Then, t-SNE constructs a projection onto a low-dimensional space where the divergence between the input distribution and output distribution is minimized. The original high-dimensional space is modeled as a Gaussian distribution, while the output low-dimensional space is modeled as a t-distribution.
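For readers curious about the underlying math, here is a sketch of the key quantities as defined in van der Maaten and Hinton's paper; this is background only, since scikit-learn handles it for us:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Here, the x values are the original high-dimensional points (Gaussian similarities) and the y values are their low-dimensional embeddings (Student's t similarities); t-SNE finds the y values that minimize the KL divergence C.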

We'll herein implement t-SNE using the TSNE class from scikit-learn:

>>> from sklearn.manifold import TSNE

Now, let's use t-SNE to verify our count vector representation.

We pick three distinct topics, talk.religion.misc, comp.graphics, and sci.space, and visualize document vectors from these three topics.

First, just load documents of these three labels, as follows:

>>> categories_3 = ['talk.religion.misc', 'comp.graphics', 'sci.space']
>>> groups_3 = fetch_20newsgroups(categories=categories_3)

We go through the same process and generate a count matrix, data_cleaned_count_3, with 500 features from the input, groups_3. You can refer to steps in previous sections as you just need to repeat the same code.
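For reference, a condensed sketch of that repeated pipeline might look as follows (it assumes the lemmatizer, all_names, and the CountVectorizer import from the previous sections are already in place):

>>> data_cleaned_3 = []
>>> for doc in groups_3.data:
...     doc = doc.lower()
...     doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
...                            for word in doc.split()
...                            if word.isalpha() and word not in all_names)
...     data_cleaned_3.append(doc_cleaned)
>>> count_vector_3 = CountVectorizer(stop_words='english', max_features=500)
>>> data_cleaned_count_3 = count_vector_3.fit_transform(data_cleaned_3)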

Next, we apply t-SNE to reduce the 500-dimensional matrix to a two-dimensional matrix:

>>> tsne_model = TSNE(n_components=2, perplexity=40,
                     random_state=42, learning_rate=500)
>>> data_tsne = tsne_model.fit_transform(data_cleaned_count_3.toarray())

The parameters we specify in the TSNE object are as follows:

  • n_components: The output dimension
  • perplexity: The number of nearest data points considered neighbors in the algorithm with a typical value of between 5 and 50
  • random_state: The random seed for program reproducibility
  • learning_rate: The factor affecting the process of finding the optimal mapping space with a typical value of between 10 and 1,000

Note that the TSNE object only takes in a dense matrix, hence we convert the sparse matrix, data_cleaned_count_3, into a dense one using toarray().

We just successfully reduced the input dimension from 500 to 2. Finally, we can easily visualize it in a two-dimensional scatter plot where the x axis is the first dimension, the y axis is the second dimension, and the color, c, is based on the topic label of each original document:

>>> import matplotlib.pyplot as plt
>>> plt.scatter(data_tsne[:, 0], data_tsne[:, 1], c=groups_3.target)
>>> plt.show()

Refer to the following screenshot for the end result:

Figure 9.6: Applying t-SNE to data from three different topics

Data points from the three topics are in different colors, such as green, purple, and yellow. We can observe three clear clusters. Data points from the same topic are close to each other, while those from different topics are far away. Obviously, count vectors are great representations for original text data as they preserve the distinction among three different topics.

You can also play around with the parameters and see whether you can obtain a nicer plot where the three clusters are better separated.

Count vectorization does well in keeping document disparity. How about maintaining similarity? We can also check that using documents from overlapping topics, such as these five topics: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x:

>>> categories_5 = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x']
>>> groups_5 = fetch_20newsgroups(categories=categories_5)

Similar processes (including text clean-up, count vectorization, and t-SNE) are repeated and the resulting plot is displayed as follows:

Figure 9.7: Applying t-SNE to data from five similar topics

Data points from those five computer-related topics are all over the place, which means they are contextually similar. To conclude, count vectors are great representations for original text data as they are also good at preserving similarity among related topics.

Summary

In this chapter, you learned the fundamental concepts of NLP as an important subfield in machine learning, including tokenization, stemming and lemmatization, and PoS tagging. We also explored three powerful NLP packages and worked on some common tasks using NLTK and spaCy. Then, we continued with the main project exploring newsgroups data. We began by extracting features with tokenization techniques and went through text preprocessing, stop word removal, and stemming and lemmatization. We then performed dimensionality reduction and visualization with t-SNE and proved that count vectorization is a good representation for text data.

We had some fun mining the newsgroups data using dimensionality reduction as an unsupervised approach. Moving forward, in the next chapter, we'll be continuing our unsupervised learning journey, specifically looking at topic modeling and clustering.

Exercises

  1. Do you think all of the top 500-word tokens contain valuable information? If not, can you impose another list of stop words?
  2. Can you use stemming instead of lemmatization to process the newsgroups data?
  3. Can you increase max_features in CountVectorizer from 500 to 5000 and see how the t-SNE visualization will be affected?
  4. Try visualizing documents from six topics (similar or dissimilar) and tweak parameters so that the formed clusters look reasonable.