Chapter 3. Math with words (TF-IDF vectors)

This chapter covers

  • Counting words and term frequencies to analyze meaning
  • Predicting word occurrence probabilities with Zipf’s Law
  • Vector representation of words and how to start using them
  • Finding relevant documents from a corpus using inverse document frequencies
  • Estimating the similarity of pairs of documents with cosine similarity and Okapi BM25

Having collected and counted words (tokens), and bucketed them into stems or lemmas, it’s time to do something interesting with them. Detecting words is useful for simple tasks, like getting statistics about word usage or doing keyword search. But you’d like to know which words are more important to a particular document and across the corpus as a whole. Then you can use that “importance” value to find relevant documents in a corpus based on keyword importance within each document.

That will make a spam detector a little less likely to get tripped up by a single curse word or a few slightly-spammy words within an email. And you’d like to measure how positive and prosocial a tweet is when you have a broad range of words with various degrees of “positivity” scores or labels. If you have an idea about the frequency with which those words appear in a document in relation to the rest of the documents, you can use that to further refine the “positivity” of the document. In this chapter, you’ll learn about a more nuanced, less binary measure of words and their usage within a document. This approach has been the mainstay for generating features from natural language for commercial search engines and spam filters for decades.

The next step in your adventure is to turn the words of chapter 2 into continuous numbers rather than just integers representing word counts or binary “bit vectors” that detect the presence or absence of particular words. With representations of words in a continuous space, you can operate on their representation with more exciting math. Your goal is to find numerical representations of words that somehow capture the importance or information content of the words they represent. You’ll have to wait until chapter 4 to see how to turn this information content into numbers that represent the meaning of words.

In this chapter, we look at three increasingly powerful ways to represent words and their importance in a document:

  • Bags of words—Vectors of word counts or frequencies
  • Bags of n-grams—Counts of word pairs (bigrams), triplets (trigrams), and so on
  • TF-IDF vectors—Word scores that better represent their importance
Important

TF-IDF stands for term frequency times inverse document frequency. Term frequencies are the counts of each word in a document, which you learned about in previous chapters. Inverse document frequency means that you’ll divide each of those word counts by the number of documents in which the word occurs.

Each of these techniques can be applied separately or as part of an NLP pipeline. These are all statistical models in that they are frequency based. Later in the book, you’ll see various ways to peer even deeper into word relationships and their patterns and non-linearities.

But these “shallow” NLP machines are powerful and useful for many practical applications such as spam filtering and sentiment analysis.

3.1. Bag of words

In the previous chapter, you created your first vector space model of a text. You used one-hot encoding of each word and then combined all those vectors with a binary OR (or clipped sum) to create a vector representation of a text. And this binary bag-of-words vector makes a great index for document retrieval when loaded into a data structure such as a Pandas DataFrame.

You then looked at an even more useful vector representation that counts the number of occurrences, or frequency, of each word in the given text. As a first approximation, you assume that the more times a word occurs, the more meaning it must contribute to that document. A document that refers to “wings” and “rudder” frequently may be more relevant to a problem involving jet airplanes or air travel than, say, a document that refers frequently to “cats” and “gravity.” Or if you have classified some words as expressing positive emotions—words like “good,” “best,” “joy,” and “fantastic”—then the more of those words a document contains, the more likely it is to have positive “sentiment.” You can imagine, though, how an algorithm that relied on these simple rules might be mistaken or led astray.

Let’s look at an example where counting occurrences of words is useful:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> sentence = """The faster Harry got to the store, the faster Harry,
...     the faster, would get home."""
>>> tokenizer = TreebankWordTokenizer()
>>> tokens = tokenizer.tokenize(sentence.lower())
>>> tokens
['the',
 'faster',
 'harry',
 'got',
 'to',
 'the',
 'store',
 ',',
 'the',
 'faster',
 'harry',
 ',',
 'the',
 'faster',
 ',',
 'would',
 'get',
 'home',
 '.']

With your simple list, you want to get unique words from the document and their counts. A Python dictionary serves this purpose nicely, and because you want to count the words as well, you can use Counter, as you did in previous chapters:

>>> from collections import Counter
>>> bag_of_words = Counter(tokens)
>>> bag_of_words
Counter({'the': 4,
         'faster': 3,
         'harry': 2,
         'got': 1,
         'to': 1,
         'store': 1,
         ',': 3,
         'would': 1,
         'get': 1,
         'home': 1,
         '.': 1})

Notice that the ordering of the keys in your Counter tells you nothing about the importance of the words. Depending on your Python version, the keys may be displayed in the order the tokens were first seen or in some other dictionary-internal order, but either way that ordering is about storage, update, and retrieval, not meaning. The information content contained in the order of words within the original statement has been discarded.

Note

A collections.Counter object is an unordered collection, also called a bag or multiset. Depending on your platform and Python version, you may find that a Counter is displayed in a seemingly reasonable order, like lexical order or the order that tokens appeared in your statement. But just as for a standard Python dict, you cannot rely on the order of your tokens (keys) in a Counter.

For short documents like this one, the unordered bag of words still contains a lot of information about the original intent of the sentence. And the information in a bag of words is sufficient to do some powerful things such as detect spam, compute sentiment (positivity, happiness, and so on), and even detect subtle intent, like sarcasm. It may be a bag, but it’s full of meaning and information. So let’s get these words ranked—sorted in some order that’s easier to think about. The Counter object has a handy method, most_common, for just this purpose:

>>> bag_of_words.most_common(4)                       1
[('the', 4), (',', 3), ('faster', 3), ('harry', 2)]

  • 1 By default, most_common() lists all tokens from most frequent to least, but you’ve limited the list to the top four here.

Specifically, the number of times a word occurs in a given document is called the term frequency, commonly abbreviated TF. In some examples you may see the count of word occurrences normalized (divided) by the number of terms in the document.[1]

1

However, normalized frequency is really a probability, so it should probably not be called frequency.

So your top four terms or tokens are “the,” “,”, “harry,” and “faster.” But the word “the” and the punctuation “,” aren’t very informative about the intent of this document. And these uninformative tokens are likely to appear a lot during your hurried adventure. So for this example, you’ll ignore them, along with a list of standard English stop words and punctuation. This won’t always be the case, but for now it helps simplify the example. That leaves you with “harry” and “faster” among the top tokens in your TF vector (bag of words).

Let’s calculate the term frequency of “harry” from the Counter object (bag_of_words) you defined above:

>>> times_harry_appears = bag_of_words['harry']
>>> num_unique_words = len(bag_of_words)            1
>>> tf = times_harry_appears / num_unique_words
>>> round(tf, 4)
0.1818

  • 1 The number of unique tokens from your original source

Let’s pause for a second and look a little deeper at normalized term frequency, a phrase (and calculation) we use often throughout this book. It’s the word count tempered by how long the document is. But why “temper” it at all? Let’s say you find the word “dog” 3 times in document A and 100 times in document B. Clearly “dog” is way more important to document B. But wait. Let’s say you find out document A is a 30-word email to a veterinarian and document B is War & Peace (approx 580,000 words!). Your first analysis was straight-up backwards. The following equations (and the short sketch after them) take the document length into account:

TF(“dog,” documentA) = 3/30 = .1

TF(“dog,” documentB) = 100/580000 = .00017
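In code, that tempering is nothing more than dividing a raw count by the document’s total token count. Here’s a minimal sketch using the two made-up “dog” counts above (the helper name is ours, not something from the nlpia package):

>>> def term_frequency(term_count, doc_length):
...     """Normalized term frequency: raw count tempered by document length."""
...     return term_count / doc_length
>>> term_frequency(3, 30)                       # "dog" in the 30-word email
0.1
>>> round(term_frequency(100, 580000), 5)       # "dog" in War & Peace
0.00017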

Now you have numbers that describe something about the two documents and their relationship to the word “dog” and to each other. So instead of raw word counts to describe your documents in a corpus, you can use normalized term frequencies. Similarly, you could calculate the normalized term frequency of every other word to gauge its relative importance to the document. Your protagonist, Harry, and his need for speed are clearly central to the story of this document. You’ve made some great progress in turning text into numbers, beyond just the presence or absence of a given word. Now this is a clearly contrived example, but you can quickly see how meaningful results could come from this approach. Let’s look at a bigger piece of text. Take these first few paragraphs from the Wikipedia article on kites:

A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. A kite’s wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. A kite may have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite.

The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (such as the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).

The same principles of fluid flow apply in liquids and kites are also used under water.

A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.

Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made.

Wikipedia

Then you’ll assign the text to a variable:

>>> from collections import Counter
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> from nlpia.data.loaders import kite_text           1
>>> tokens = tokenizer.tokenize(kite_text.lower())
>>> token_counts = Counter(tokens)
>>> token_counts
Counter({'the': 26, 'a': 20, 'kite': 16, ',': 15, ...})

  • 1 kite_text = “A kite is traditionally ...” as above
Note

The TreebankWordTokenizer returns 'kite.' (with a period) as a token. The Treebank tokenizer assumes that your document has already been segmented into separate sentences, so it only splits off punctuation at the very end of the string. Sentence segmentation is tricky and you won’t learn about it until chapter 11. Nonetheless, the spaCy parser is faster and more accurate because it does sentence segmentation and tokenization (along with a lot of other things)[2] in one pass. So use spaCy in your production app rather than the NLTK components we used for these simple examples.

2

See the web page titled “spaCy 101: Everything you need to know” (https://spacy.io/usage/spacy-101#annotations-token).

Okay, back to the example. So that is a lot of stop words. It’s not likely that this Wikipedia article is about the articles “the” and “a,” nor the conjunction “and” and the other stop words. So let’s ditch them for now:

>>> import nltk
>>> nltk.download('stopwords', quiet=True)
True
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> tokens = [x for x in tokens if x not in stopwords]
>>> kite_counts = Counter(tokens)
>>> kite_counts
Counter({'kite': 16,
         'traditionally': 1,
         'tethered': 2,
         'heavier-than-air': 1,
         'craft': 2,
         'wing': 5,
         'surfaces': 1,
         'react': 1,
         'air': 2,
         ...,
         'made': 1})

By looking purely at the number of times words occur in this document, you’re learning something about it. The terms kite(s), wing, and lift are all important. And if you didn’t know what this document was about and just happened across it in your vast database of Google-like knowledge, you might “programmatically” be able to infer that it has something to do with “flight” or “lift” or, in fact, “kites.”

Across multiple documents in a corpus, things get a little more interesting. A set of documents may all be about, say, kite flying. You would imagine all the documents may refer to string and wind quite often, and the term frequencies TF("string") and TF("wind") would therefore rank highly in all the documents. Now let’s look at a way to represent these numbers more gracefully for mathematical purposes.

3.2. Vectorizing

You’ve transformed your text into numbers on a basic level. But you’ve still just stored them in a dictionary, so you’ve taken one step out of the text-based world and into the realm of mathematics. Next you’ll go ahead and jump in all the way. Instead of describing a document in terms of a frequency dictionary, you’ll make a vector of those word counts. In Python, this will be a list, but in general it’s an ordered collection or array. You can do this quickly with

>>> document_vector = []
>>> doc_length = len(tokens)
>>> for key, value in kite_counts.most_common():
...     document_vector.append(value / doc_length)
>>> document_vector
[0.07207207207207207,
 0.06756756756756757,
 0.036036036036036036,
 ...,
 0.0045045045045045045]

This list, or vector, is something you can do math on directly.

Tip

You can speed up processing of these data structures many ways.[3] For now you’re just playing with the nuts and bolts, but soon you’ll want to speed things up.

3

See the web page titled “NumPy” (http://www.numpy.org/).

Math isn’t very interesting with just one element. Having one vector for one document isn’t enough. You can grab a couple more documents and make vectors for each of them as well. But the values within each vector need to be relative to something consistent across all the vectors. If you’re going to do math on them, they need to represent a position in a common space, relative to something consistent. Your vectors need to have the same origin and share the same scale, or “units,” on each of their dimensions. The first step in this process is to normalize the counts by calculating normalized term frequency instead of raw count in the document (as you did in the last section); the second step is to make all the vectors of standard length or dimension.

Also, you want the value for each element of the vector to represent the same word in each document’s vector. But you may notice that your email to the vet isn’t going to contain many of the words that are in War & Peace (or maybe it will, who knows?). It’s fine (and as it happens, necessary) if your vectors contain values of 0 in various positions. You’ll find every unique word in each document and then find every unique word in the union of those sets. This collection of words in your vocabulary is often called a lexicon, which is the same concept referenced in earlier chapters, just in terms of your special corpus. Let’s look at what that would look like with something shorter than War & Peace. Let’s check in on Harry. You had one “document” already—let’s round out the corpus with a couple more:

>>> docs = ["The faster Harry got to the store, the faster and faster Harry 
 would get home."]
>>> docs.append("Harry is hairy and faster than Jill.")
>>> docs.append("Jill is not as hairy as Harry.")
Tip

If you’re playing along with us rather than typing these out, you can import them from the nlpia package: from nlpia.data.loaders import harry_docs as docs.

First, let’s look at your lexicon for this corpus containing three documents:

>>> doc_tokens = []
>>> for doc in docs:
...     doc_tokens += [sorted(tokenizer.tokenize(doc.lower()))]
>>> len(doc_tokens[0])
17
>>> all_doc_tokens = sum(doc_tokens, [])
>>> len(all_doc_tokens)
33
>>> lexicon = sorted(set(all_doc_tokens))
>>> len(lexicon)
18
>>> lexicon
[',',
 '.',
 'and',
 'as',
 'faster',
 'get',
 'got',
 'hairy',
 'harry',
 'home',
 'is',
 'jill',
 'not',
 'store',
 'than',
 'the',
 'to',
 'would']

Each of your three document vectors will need to have 18 values, even if the document for that vector doesn’t contain all 18 words in your lexicon. Each token is assigned a “slot” in your vectors corresponding to its position in your lexicon. Some of those token counts in the vector will be zeros, which is what you want:

>>> from collections import OrderedDict
>>> zero_vector = OrderedDict((token, 0) for token in lexicon)
>>> zero_vector
OrderedDict([(',', 0),
             ('.', 0),
             ('and', 0),
             ('as', 0),
             ('faster', 0),
             ('get', 0),
             ('got', 0),
             ('hairy', 0),
             ('harry', 0),
             ('home', 0),
             ('is', 0),
             ('jill', 0),
             ('not', 0),
             ('store', 0),
             ('than', 0),
             ('the', 0),
             ('to', 0),
             ('would', 0)])

Now you’ll make copies of that base vector, update the values of the vector for each document, and store them in an array:

>>> import copy
>>> doc_vectors = []
>>> for doc in docs:
...     vec = copy.copy(zero_vector)                1
...     tokens = tokenizer.tokenize(doc.lower())
...     token_counts = Counter(tokens)
...     for key, value in token_counts.items():
...         vec[key] = value / len(lexicon)
...     doc_vectors.append(vec)

  • 1 copy.copy() creates an independent copy, a separate instance of your zero vector, rather than reusing a reference (pointer) to the original object’s memory location. Otherwise you’d just be overwriting the same zero_vector with new values in each loop, and you wouldn’t have a fresh zero on each pass of the loop.

You have three vectors, one for each document. So what? What can you do with them? Your document word-count vectors can do all the cool stuff any vector can do, so let’s learn a bit more about vectors and vector spaces first.[4]

4

If you’d like more details about linear algebra and vectors, take a look at appendix C.

3.2.1. Vector spaces

Vectors are the primary building blocks of linear algebra, or vector algebra. They’re an ordered list of numbers, or coordinates, in a vector space. They describe a location or position in that space. Or they can be used to identify a particular direction and magnitude or distance in that space. A space is the collection of all possible vectors that could appear in that space. So a vector with two values would lie in a 2D vector space, a vector with three values in 3D vector space, and so on.

A piece of graph paper, or a grid of pixels in an image, are both nice 2D vector spaces. You can see how the order of these coordinates matters. If you reversed the x and y coordinates for locations on your graph paper, without reversing all your vector calculations, all your answers for linear algebra problems would be flipped. Graph paper and images are examples of rectilinear, or Euclidean, spaces, because the x and y coordinates are perpendicular to each other. The vectors we talk about in this chapter all live in rectilinear, Euclidean spaces.

What about latitude and longitude on a map or globe? That map or globe is definitely a 2D vector space because it’s an ordered list of two numbers: latitude and longitude. But each of the latitude-longitude pairs describes a point on an approximately spherical, bumpy surface—the Earth’s surface. And latitude and longitude coordinates aren’t exactly perpendicular, so a latitude-longitude vector space isn’t rectilinear. That means you have to be careful when you calculate things like distance or closeness (similarity) between two points represented by a pair of 2D latitude-longitude vectors, or vectors in any non-Euclidean space. Think about how you would calculate the distance between the latitude and longitude coordinates of Portland, OR and New York, NY.[5]

5

You’d need to use a package like GeoPy (https://geopy.readthedocs.io) to get the math right.

Figure 3.1 is one way to draw the 2D vectors (5, 5), (3, 2), and (-1, 1). The head of a vector (represented by the pointy tip of an arrow) is used to identify a location in a vector space. So the vector heads in this diagram will be at those three pairs of coordinates. The tail of a position vector (represented by the “rear” of the arrow) is always at the origin, or (0, 0).

Figure 3.1. 2D vectors

What about 3D vector spaces? Positions and velocities in the 3D physical world you live in can be represented by x, y, and z coordinates in a 3D vector. Or the curvilinear space formed by all the latitude-longitude-altitude triplets describing locations near the surface of the Earth.

But you aren’t limited to normal 3D space. You can have 5 dimensions, 10 dimensions, 5,000, whatever. The linear algebra all works out the same. You might need more computing power as the dimensionality grows. And you’ll run into some “curse-of-dimensionality” issues, but you can wait to deal with that until the last chapter, chapter 13.[6]

6

The curse of dimensionality is that vectors will get exponentially farther and farther away from one another, in Euclidean distance, as the dimensionality increases. A lot of simple operations become impractical above 10 or 20 dimensions, like sorting a large list of vectors based on their distance from a “query” or “reference” vector (approximate nearest neighbor search). To dig deeper, check out Wikipedia’s “Curse of Dimensionality” article (https://en.wikipedia.org/wiki/Curse_of_dimensionality), explore hyperspace with one of this book’s authors at Exploring Hyperspace (https://docs.google.com/presentation/d/1SEU8VL0KWPDKKZnBSaMxUBDDwI8yqIxu9RQtq2bpnNg), play with the Python annoy package (https://github.com/spotify/annoy), or search Google Scholar for “high dimensional approximate nearest neighbors” (https://scholar.google.com/scholar?q=high+dimensional+approximate+nearest+neighbor).

For a natural language document vector space, the dimensionality of your vector space is the count of the number of distinct words that appear in the entire corpus. For TF (and TF-IDF to come), sometimes we call this dimensionality capital letter “K.” This number of distinct words is also the vocabulary size of your corpus, so in an academic paper it’ll usually be called “|V|.” You can then describe each document within this K-dimensional vector space by a K-dimensional vector. K = 18 in your three-document corpus about Harry and Jill. Because humans can’t easily visualize spaces of more than three dimensions, let’s set aside most of those dimensions and look at two for a moment, so you can have a visual representation of the vectors on this flat page you’re reading. So in figure 3.2, K is reduced to two for a two-dimensional view of the 18-dimensional Harry and Jill vector space.

Figure 3.2. 2D term frequency vectors

K-dimensional vectors work the same way, just in ways you can’t easily visualize. Now that you have a representation of each document and know they share a common space, you have a path to compare them. You could measure the Euclidean distance between the vectors by subtracting them and computing the length of that difference vector, which is called the 2-norm distance. It’s the distance a “crow” would have to fly (in a straight line) to get from the location identified by the tip (head) of one vector to the location of the tip of the other vector. Check out appendix C on linear algebra to see why this is a bad idea for word count (term frequency) vectors.

Two vectors are “similar” if they share similar direction. They might have similar magnitude (length), which would mean that the word count (term frequency) vectors are for documents of about the same length. But do you care about document length in your similarity estimate for vector representations of words in documents? Probably not. You’d like your estimate of document similarity to reflect whether the documents use the same words, about the same number of times, in similar proportions. Such an estimate would give you confidence that the documents they represent are probably talking about similar things.

Cosine similarity is merely the cosine of the angle between two vectors (theta), shown in figure 3.3, which can be calculated from the Euclidean dot product using

  • A ⋅ B = |A| |B| * cos Θ

Cosine similarity is efficient to calculate because the dot product doesn’t require evaluation of any trigonometric functions. In addition, cosine similarity has a convenient range for most machine learning problems: -1 to +1.

Figure 3.3. 2D thetas

In Python this would be

a.dot(b) == np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

Solving this relationship for cos(theta), you can derive the cosine similarity using

  • cos Θ = A ⋅ B / (|A| |B|)

Or you can do it in pure Python without numpy, as in the following listing.

Listing 3.1. Compute cosine similarity in python
>>> import math
>>> def cosine_sim(vec1, vec2):
...     """ Let's convert our dictionaries to lists for easier matching."""
...     vec1 = [val for val in vec1.values()]
...     vec2 = [val for val in vec2.values()]
...
...     dot_prod = 0
...     for i, v in enumerate(vec1):
...         dot_prod += v * vec2[i]
...
...     mag_1 = math.sqrt(sum([x**2 for x in vec1]))
...     mag_2 = math.sqrt(sum([x**2 for x in vec2]))
...
...     return dot_prod / (mag_1 * mag_2)

So you need to take the dot product of two of your vectors in question—multiply the elements of each vector pairwise—and then sum up those products. You then divide by the norm (magnitude or length) of each vector. The vector norm is the same as its Euclidean distance from the head to the tail of the vector—the square root of the sum of the squares of its elements. This normalized dot product, like the output of the cosine function, will be a value between -1 and 1. It’s the cosine of the angle between these two vectors. This value is the same as the length of one vector’s perpendicular projection onto the other, taken as a fraction of that vector’s own length. It gives you a value for how much the vectors point in the same direction.
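If you have numpy installed, the same computation collapses to a couple of lines. This is just a sketch equivalent to listing 3.1 (the function name is ours), assuming the same dictionary-style vectors you built earlier:

>>> import numpy as np
>>> def cosine_sim_np(vec1, vec2):
...     """Cosine similarity of two dict-based vectors, using numpy."""
...     a = np.array(list(vec1.values()))
...     b = np.array(list(vec2.values()))
...     return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))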

A cosine similarity of 1 represents identical normalized vectors that point in exactly the same direction along all dimensions. The vectors may have different lengths or magnitudes, but they point in the same direction. Remember you divided the dot product by the norm of each vector, and this can happen before or after the dot product. So the vectors are normalized so they both have a length of 1 as you do the dot product. So the closer a cosine similarity value is to 1, the closer the two vectors are in angle. For NLP document vectors that have a cosine similarity close to 1, you know that the documents are using similar words in similar proportion. So the documents whose document vectors are close to each other are likely talking about the same thing.

A cosine similarity of 0 represents two vectors that share no components. They are orthogonal, perpendicular in all dimensions. For NLP TF vectors, this situation occurs only if the two documents share no words in common. Because these documents use completely different words, your TF vectors treat them as being about completely different things. Of course, that doesn’t necessarily mean they have different meanings or topics, just that they use completely different words.

A cosine similarity of -1 represents two vectors that are anti-similar, completely opposite. They point in opposite directions. This can never happen for simple word count (term frequency) vectors or even normalized TF vectors (which we talk about later). Counts of words can never be negative. So word count (term frequency) vectors will always be in the same “quadrant” of the vector space. None of the term frequency vectors can sneak around into one of the quadrants behind the tail of the other vectors. None of your term frequency vectors can have components (word frequencies) that are the negative of another term frequency vector, because term frequencies just can’t be negative.

You won’t see any negative cosine similarity values for pairs of vectors for natural language documents in this chapter. But in the next chapter, we develop a concept of words and topics that are “opposite” to each other. And this will show up as documents, words, and topics that have cosine similarities of less than zero, or even -1.

Opposites attract

There’s an interesting consequence of the way you calculated cosine similarity. If two vectors or documents each have a cosine similarity of -1 (are opposites) to a third vector, they must be perfectly similar to each other: their cosine similarity with each other is exactly 1, because they point in exactly the same direction. But the documents those vectors represent may not be exactly the same. Not only might the word order be shuffled, but one may be much longer than the other, as long as it uses the same words in the same proportions.

Later, you’ll come up with vectors that more accurately model a document. But for now, you’ve gotten a good introduction to the tools you need.

3.3. Zipf’s Law

Now on to our main topic—sociology. Okay, not really, but you’ll make a quick detour into the world of counting people and words, and you’ll learn a seemingly universal rule that governs the counting of most things. It turns out that in language, like most things involving living organisms, patterns abound.

In the early twentieth century, the French stenographer Jean-Baptiste Estoup noticed a pattern in the frequencies of words that he painstakingly counted by hand across many documents (thank goodness for computers and Python). In the 1930s, the American linguist George Kingsley Zipf sought to formalize Estoup’s observation, and this relationship eventually came to bear Zipf’s name:

Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.

Wikipedia

Specifically, inverse proportionality refers to a situation where an item in a ranked list will appear with a frequency tied explicitly to its rank in the list. The first item in the ranked list will appear twice as often as the second, and three times as often as the third, for example. One of the quick things you can do with any corpus or document is plot the frequencies of word usages relative to their rank (in frequency). If you see any outliers that don’t fall along a straight line in a log-log plot, it may be worth investigating.

As an example of how far Zipf’s Law stretches beyond the world of words, figure 3.4 charts the relationship between the population of US cities and the rank of that population. It turns out that Zipf’s Law applies to counts of lots of things. Nature is full of systems that experience exponential growth and “network effects” like population dynamics, economic output, and resource distribution.[7] It’s interesting that something as simple as Zipf’s Law could hold true across a wide range of natural and manmade phenomena. Nobel Laureate Paul Krugman, speaking about economic models and Zipf’s Law, put it this way:

7

See the web page titled “There is More than a Power Law in Zipf” (https://www.nature.com/articles/srep00812).

The usual complaint about economic theory is that our models are oversimplified—that they offer excessively neat views of complex, messy reality. [With Zipf’s law] the reverse is true: You have complex, messy models, yet reality is startlingly neat and simple.

Figure 3.4. City population distribution

Here’s an updated version of Krugman’s city population plot.[8]

8

Population data downloaded from Wikipedia using Pandas. See the nlpia.book.examples code on GitHub (https://github.com/totalgood/nlpia/blob/master/src/nlpia/book/examples/ch03_zipf.py).

As with cities and social networks, so with words. Let’s first download the Brown Corpus from NLTK:

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.[9]

9

For a complete list, see http://icame.uib.no/brown/bcm-los.html.

NLTK Documentation

>>> nltk.download('brown')            1
>>> from nltk.corpus import brown
>>> brown.words()[:10]                2
 ['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']
>>> brown.tagged_words()[:5]          3
 [('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL')]
>>> len(brown.words())
1161192

  • 1 The Brown corpus is about 3MB.
  • 2 words() is a built-in method of the NLTK corpus object that returns the tokenized corpus as a sequence of strs.
  • 3 You’ll learn about part-of-speech tagging in chapter 2.

So with over 1 million tokens, you have something meaty to look at:

>>> from collections import Counter
>>> puncs = set((',', '.', '--', '-', '!', '?',
...     ':', ';', '``', "''", '(', ')', '[', ']'))
>>> word_list = (x.lower() for x in brown.words() if x not in puncs)
>>> token_counts = Counter(word_list)
>>> token_counts.most_common(20)
[('the', 69971),
 ('of', 36412),
 ('and', 28853),
 ('to', 26158),
 ('a', 23195),
 ('in', 21337),
 ('that', 10594),
 ('is', 10109),
 ('was', 9815),
 ('he', 9548),
 ('for', 9489),
 ('it', 8760),
 ('with', 7289),
 ('as', 7253),
 ('his', 6996),
 ('on', 6741),
 ('be', 6377),
 ('at', 5372),
 ('by', 5306),
 ('i', 5164)]

A quick glance shows that the word frequencies in the Brown corpus follow the power-law relationship Zipf predicted. “The” (rank 1 in term frequency) occurs roughly twice as often as “of” (rank 2 in term frequency), and roughly three times as often as “and” (rank 3 in term frequency). If you don’t believe us, use the example code (https://github.com/totalgood/nlpia/blob/master/src/nlpia/book/examples/ch03_zipf.py) in the nlpia package to see this yourself.
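One quick way to eyeball this yourself is a log-log plot of count against rank; if Zipf’s Law holds, the points fall roughly on a straight line. This is just a sketch, assuming you have matplotlib installed (it isn’t used elsewhere in this chapter):

>>> import matplotlib.pyplot as plt
>>> counts = [count for token, count in token_counts.most_common()]
>>> ranks = range(1, len(counts) + 1)
>>> plt.loglog(ranks, counts, marker='.', linestyle='none')
>>> plt.xlabel('word rank')
>>> plt.ylabel('word count')
>>> plt.show()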

In short, if you rank the words of a corpus by the number of occurrences and list them in descending order, you’ll find that, for a sufficiently large sample, the first word in that ranked list is twice as likely to occur in the corpus as the second word in the list. And it is four times as likely to appear as the fourth word in the list. So given a large corpus, you can use this breakdown to say statistically how likely a given word is to appear in any given document of that corpus.

3.4. Topic modeling

Now back to your document vectors. Word counts are useful, but pure word count, even when normalized by the length of the document, doesn’t tell you much about the importance of that word in that document relative to the rest of the documents in the corpus. If you could suss out that information, you could start to describe documents within the corpus. Say you have a corpus of every kite book ever written. “Kite” would almost surely occur many times in every book (document) you counted, but that doesn’t provide any new information; it doesn’t help distinguish between those documents. Something like “construction” or “aerodynamics,” on the other hand, might not be so prevalent across the entire corpus, but for the documents where it occurred frequently, you would know more about each document’s nature. For this you need another tool.

Inverse document frequency, or IDF, is your window through Zipf in topic analysis. Let’s take your term frequency counter from earlier and expand on it. You can count tokens and bin them up two ways: per document and across the entire corpus. You’re going to be counting just by document.

Let’s return to the Kite example from Wikipedia and grab another section (the History section); say it’s the second document in your Kite corpus:

Kites were invented in China, where materials ideal for kite building were readily available: silk fabric for sail material; fine, high-tensile-strength silk for flying line; and resilient bamboo for a strong, lightweight framework.

The kite has been claimed as the invention of the 5th-century BC Chinese philosophers Mozi (also Mo Di) and Lu Ban (also Gongshu Ban). By 549 AD paper kites were certainly being flown, as it was recorded that in that year a paper kite was used as a message for a rescue mission. Ancient and medieval Chinese sources describe kites being used for measuring distances, testing the wind, lifting men, signaling, and communication for military operations. The earliest known Chinese kites were flat (not bowed) and often rectangular. Later, tailless kites incorporated a stabilizing bowline. Kites were decorated with mythological motifs and legendary figures; some were fitted with strings and whistles to make musical sounds while flying. From China, kites were introduced to Cambodia, Thailand, India, Japan, Korea and the western world.

After its introduction into India, the kite further evolved into the fighter kite, known as the patang in India, where thousands are flown every year on festivals such as Makar Sankranti.

Kites were known throughout Polynesia, as far as New Zealand, with the assumption being that the knowledge diffused from China along with the people. Anthropomorphic kites made from cloth and wood were used in religious ceremonies to send prayers to the gods. Polynesian kite traditions are used by anthropologists to get an idea of early “primitive” Asian traditions that are believed to have at one time existed in Asia.

Wikipedia

First let’s get the total word count for each document in your corpus, the intro document and the history document:

>>> from nlpia.data.loaders import kite_text, kite_history
>>> kite_intro = kite_text.lower()                           1
>>> intro_tokens = tokenizer.tokenize(kite_intro)
>>> kite_history = kite_history.lower()
>>> history_tokens = tokenizer.tokenize(kite_history)
>>> intro_total = len(intro_tokens)
>>> intro_total
363
>>> history_total = len(history_tokens)
>>> history_total
297

  • 1 kite_text holds the original intro text, “A kite is traditionally ...”; kite_intro is the lowercased version, “a kite is traditionally ...”

Now with a couple tokenized kite documents in hand, let’s look at the term frequency of “kite” in each document. You’ll store the TFs you find in two dictionaries, one for each document:

>>> intro_tf = {}
>>> history_tf = {}
>>> intro_counts = Counter(intro_tokens)
>>> intro_tf['kite'] = intro_counts['kite'] / intro_total
>>> history_counts = Counter(history_tokens)
>>> history_tf['kite'] = history_counts['kite'] / history_total
>>> 'Term Frequency of "kite" in intro is: {:.4f}'.format(intro_tf['kite'])
'Term Frequency of "kite" in intro is: 0.0441'
>>> 'Term Frequency of "kite" in history is: {:.4f}'
...     .format(history_tf['kite'])
'Term Frequency of "kite" in history is: 0.0202'

Okay, you have a number twice as large as the other. Is the intro section twice as much about kites? No, not really. So let’s dig a little deeper. First, let’s see how those numbers relate to some other word, say “and”:

>>> intro_tf['and'] = intro_counts['and'] / intro_total
>>> history_tf['and'] = history_counts['and'] / history_total
>>> print('Term Frequency of "and" in intro is: {:.4f}'
...     .format(intro_tf['and']))
Term Frequency of "and" in intro is: 0.0275
>>> print('Term Frequency of "and" in history is: {:.4f}'
...     .format(history_tf['and']))
Term Frequency of "and" in history is: 0.0303

Great! You know both of these documents are about “and” just as much as they are about “kite”! Oh, wait. That’s not helpful, huh? Just as in your first example, where the system seemed to think “the” was the most important word in the document about your fast friend Harry, in this example “and” is considered highly relevant. Even at first glance, you can tell this isn’t revelatory.

A good way to think of a term’s inverse document frequency is this: How strange is it that this token is in this document? If a term appears in one document a lot of times, but occurs rarely in the rest of the corpus, one could assume it’s important to that document specifically. Your first step toward topic analysis!

A term’s IDF is merely the ratio of the total number of documents to the number of documents the term appears in. In the case of “and” and “kite” in your current example, the answer is the same for both:

  • 2 total documents / 2 documents contain “and” = 2/2 = 1
  • 2 total documents / 2 documents contain “kite” = 2/2 = 1

Not very interesting. So let’s look at another word, “China”:

  • 2 total documents / 1 document contains “China” = 2/1 = 2

Okay, that’s something different. Let’s use this “rarity” measure to weight the term frequencies:

>>> num_docs_containing_and = 0
>>> for doc in [intro_tokens, history_tokens]:
...     if 'and' in doc:
...         num_docs_containing_and += 1       1

  • 1 You’d count num_docs_containing_kite and num_docs_containing_china the same way; one way to do that is sketched below.
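The book’s listing elides those two counts, so here’s one minimal way to fill them in (the helper function is our own, not part of nlpia):

>>> def num_docs_containing(term, doc_token_lists):
...     """Count how many of the tokenized documents contain the term."""
...     return sum(1 for tokens in doc_token_lists if term in tokens)
>>> num_docs_containing_kite = num_docs_containing(
...     'kite', [intro_tokens, history_tokens])
>>> num_docs_containing_china = num_docs_containing(
...     'china', [intro_tokens, history_tokens])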

And let’s grab the TF of “China” in the two documents:

>>> intro_tf['china'] = intro_counts['china'] / intro_total
>>> history_tf['china'] = history_counts['china'] / history_total

And finally, the IDF for all three. You’ll store the IDFs in dictionaries per document like you did with TF:

>>> num_docs = 2
>>> intro_idf = {}
>>> history_idf = {}
>>> intro_idf['and'] = num_docs / num_docs_containing_and
>>> history_idf['and'] = num_docs / num_docs_containing_and
>>> intro_idf['kite'] = num_docs / num_docs_containing_kite
>>> history_idf['kite'] = num_docs / num_docs_containing_kite
>>> intro_idf['china'] = num_docs / num_docs_containing_china
>>> history_idf['china'] = num_docs / num_docs_containing_china

And then for the intro document you find:

>>> intro_tfidf = {}
>>> intro_tfidf['and'] = intro_tf['and'] * intro_idf['and']
>>> intro_tfidf['kite'] = intro_tf['kite'] * intro_idf['kite']
>>> intro_tfidf['china'] = intro_tf['china'] * intro_idf['china']

And then for the history document:

>>> history_tfidf = {}
>>> history_tfidf['and'] = history_tf['and'] * history_idf['and']
>>> history_tfidf['kite'] = history_tf['kite'] * history_idf['kite']
>>> history_tfidf['china'] = history_tf['china'] * history_idf['china']

3.4.1. Return of Zipf

You’re almost there. Let’s say, though, you have a corpus of 1 million documents (maybe you’re baby-Google), someone searches for the word “cat,” and in your 1 million documents you have exactly 1 document that contains the word “cat.” The raw IDF of this is

  • 1,000,000 / 1 = 1,000,000

Let’s imagine you have 10 documents with the word “dog” in them. Your IDF for “dog” is

  • 1,000,000 / 10 = 100,000

That’s a big difference. Your friend Zipf would say it’s too big, because differences like this happen all the time: Zipf’s Law tells you that word frequencies (and document frequencies) span many orders of magnitude, so raw IDF ratios for words of comparable usefulness, like “cat” and “dog,” can end up exponentially far apart. So Zipf’s Law suggests that you scale all your word frequencies (and document frequencies) with the log() function, the inverse of exp(). This ensures that words such as “cat” and “dog,” which have similar counts, aren’t exponentially different in weight. And this distribution of word frequencies will ensure that your TF-IDF scores are more uniformly distributed. So you should redefine IDF to be the log of the ratio you just computed: the total number of documents divided by the number of documents the word appears in. You’ll want to take the log of the term frequency as well.[10]

10

Gerard Salton and Chris Buckley first demonstrated the usefulness of log scaling for information retrieval in their paper Term Weighting Approaches in Automatic Text Retrieval (https://ecommons.cornell.edu/bitstream/handle/1813/6721/87-881.pdf).

The base of the log function isn’t important, because you only want to make the frequency distribution uniform, not to scale it within a particular numerical range.[11] If you use a base 10 log function, you’ll get the following (there’s a quick Python check right after this list):

11

Later we show you how to normalize the TF-IDF vectors after all the TF-IDF values have been calculated using this log scaling.

  • search: cat
  • idf = log(1,000,000/1) = 6
  • search: dog
  • idf = log(1,000,000/10) = 5
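If you want to verify those numbers, a couple of lines of Python will do it. This is just a sanity check using the base 10 log, matching the example above (the function name is ours):

>>> import math
>>> def idf(total_num_docs, num_docs_containing_term):
...     """Log-scaled inverse document frequency, base 10."""
...     return math.log10(total_num_docs / num_docs_containing_term)
>>> idf(1_000_000, 1)       # "cat"
6.0
>>> idf(1_000_000, 10)      # "dog"
5.0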

So now you’re weighting the TF results of each more appropriately to their occurrences in language, in general.

And then finally, for a given term, t, in a given document, d, in a corpus, D, you get:

  • TF(t, d) = (count of t in d) / (number of terms in d)
  • IDF(t, D) = log(number of documents in D / number of documents containing t)
  • TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)

So the more times a word appears in the document, the higher the TF (and hence the TF-IDF) will be. At the same time, as the number of documents that contain that word goes up, the IDF (and hence the TF-IDF) for that word will go down. So now, you have a number—something your computer can chew on. But what is it exactly? It relates a specific word or token to a specific document in a specific corpus, and then it assigns a numeric value to the importance of that word in the given document, given its usage across the entire corpus.
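Here’s that formula as a single function, a minimal sketch rather than the book’s listing (it uses the base 10 log from the cat/dog example, and the name and signature are our own):

>>> import math
>>> def tfidf(term, doc_tokens, corpus_token_lists):
...     """TF-IDF of a term for one tokenized document in a tokenized corpus."""
...     tf = doc_tokens.count(term) / len(doc_tokens)
...     df = sum(1 for tokens in corpus_token_lists if term in tokens)
...     if not df:
...         return 0.0
...     return tf * math.log10(len(corpus_token_lists) / df)

With only your two kite documents, tfidf('china', history_tokens, [intro_tokens, history_tokens]) comes out positive, while the same call for 'and' is zero, because “and” appears in both documents and log(2/2) = 0.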

In some cases, all the calculations will be done in log space so that multiplications become additions and division becomes subtraction:

>>> log_tf = (log(term_occurrences_in_doc) -
...     log(num_terms_in_doc))                      1
>>> log_log_idf = log(log(total_num_docs) -
...     log(num_docs_containing_term))              2
>>> log_tf_idf = log_tf + log_log_idf               3

  • 1 Log probability of a particular term in a particular document
  • 2 Log of the log probability of a particular term occurring at least once in a document—the first log is to linearize the IDF (compensate for Zipf’s Law)
  • 3 Log TF-IDF is the log of the product of TF and IDF or the sum of the logs of TF and IDF.

This single number, the TF-IDF, is the humble foundation of a simple search engine. As you’ve stepped from the realm of text firmly into the realm of numbers, it’s time for some math. You won’t likely ever have to implement the preceding formulas for computing TF-IDF. Linear algebra isn’t necessary for full understanding of the tools used in natural language processing, but a general familiarity with how the formulas work can make their use more intuitive.

3.4.2. Relevance ranking

As you saw earlier, you can easily compare two vectors and get their similarity, but you have since learned that merely counting words isn’t as descriptive as using their TF-IDF. Therefore, in each document vector let’s replace each word’s word_count with the word’s TF-IDF. Now your vectors will more thoroughly reflect the meaning, or topic, of the document, as shown in this Harry example:

>>> document_tfidf_vectors = []
>>> for doc in docs:
...     vec = copy.copy(zero_vector)                1
...     tokens = tokenizer.tokenize(doc.lower())
...     token_counts = Counter(tokens)
...
...     for key, value in token_counts.items():
...         docs_containing_key = 0
...         for _doc in docs:
...             if key in _doc:
...                 docs_containing_key += 1
...         tf = value / len(lexicon)
...         if docs_containing_key:
...             idf = len(docs) / docs_containing_key
...         else:
...             idf = 0
...         vec[key] = tf * idf
...     document_tfidf_vectors.append(vec)

  • 1 You need to copy the zero_vector to create a new, separate object. Otherwise you’d end up overwriting the same object/vector each time through the loop.

With this setup, you have K-dimensional vector representation of each document in the corpus. And now on to the hunt! Or search, in your case. Two vectors, in a given vector space, can be said to be similar if they have a similar angle. If you imagine each vector starting at the origin and reaching out its prescribed distance and direction, the ones that reach out at the same angle are similar, even if they don’t reach out to the same distance.

Two vectors are considered similar if their cosine similarity is high, so you can find two similar vectors near each other if they minimize the cosine distance between them:

  • cosine distance = 1 - cosine similarity

Now you have all you need to do a basic TF-IDF-based search. You can treat the search query itself as a document, and therefore get the TF-IDF-based vector representation of it. The last step is then to find the documents whose vectors have the highest cosine similarities to the query and return those as the search results.

If you take your three documents about Harry, and make the query “How long does it take to get to the store?” as shown here

>>> query = "How long does it take to get to the store?"
>>> query_vec = copy.copy(zero_vector)
>>> query_vec = copy.copy(zero_vector)           1

>>> tokens = tokenizer.tokenize(query.lower())
>>> token_counts = Counter(tokens)
 
>>> for key, value in token_counts.items():
...     docs_containing_key = 0
...     for _doc in documents:
...       if key in _doc.lower():
...         docs_containing_key += 1
...     if docs_containing_key == 0:             2
...         continue
...     tf = value / len(tokens)
...     idf = len(documents) / docs_containing_key
...    query_vec[key] = tf * idf
>>> cosine_sim(query_vec, document_tfidf_vectors[0])
0.5235048549676834
>>> cosine_sim(query_vec, document_tfidf_vectors[1])
0.0
>>> cosine_sim(query_vec, document_tfidf_vectors[2])
0.0

  • 1 copy.copy() ensures you’re dealing with separate objects, not multiple references to the same object.
  • 2 You didn’t find that token in the lexicon, so go to the next key.

you can safely say document 0 has the most relevance for your query! And with this you can find relevant documents in any corpus, be it articles in Wikipedia, books from Gutenberg, or tweets from the wild west that is Twitter. Google look out!

Actually, Google’s search engine is safe from competition from us. You have to do an “index scan” of your TF-IDF vectors with each query. That’s an O(N) algorithm. Most search engines can respond in constant time (O(1)) because they use an inverted index.[12] You aren’t going to implement an index that can find these matches in constant time here, but if you’re interested you might like exploring the state-of-the-art Python implementation in the Whoosh[13] package and its source code.[14] Instead of showing you how to build this conventional keyword-based search engine, in chapter 4 we show you the latest semantic indexing approaches that capture the meaning of text.

12

See the web page titled “Inverted index” (https://en.wikipedia.org/wiki/Inverted_index).

13

See the web page titled “Whoosh” (https://pypi.python.org/pypi/Whoosh).

14

See the web page titled “GitHub - Mplsbeb/whoosh: A fast pure-Python search engine” (https://github.com/Mplsbeb/whoosh).
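To make the inverted index idea concrete, here’s a toy sketch (nothing like a production index such as Whoosh’s, just the core mapping from each token to the documents that contain it):

>>> inverted_index = {}
>>> for doc_id, doc in enumerate(docs):
...     for token in set(tokenizer.tokenize(doc.lower())):
...         inverted_index.setdefault(token, set()).add(doc_id)
>>> sorted(inverted_index['harry'])
[0, 1, 2]
>>> sorted(inverted_index['store'])
[0]

A query then only has to score the documents listed under its tokens, rather than scanning every document vector in the corpus.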

Tip

In the preceding code, you dropped the keys that weren’t found in the lexicon to avoid a divide-by-zero error. But a better approach is to +1 the denominator of every IDF calculation, which ensures no denominators are zero. In fact this approach—called additive smoothing (Laplace smoothing)[15]—will usually improve the search results for TF-IDF keyword-based searches.

15

See the web page titled “Additive smoothing” (https://en.wikipedia.org/wiki/Additive_smoothing).
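In the earlier query loop, that tip amounts to a one-character change to the IDF line (a sketch of the tweak, not a full listing):

>>> idf = len(docs) / (1 + docs_containing_key)    # smoothed: the denominator can never be zero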

Keyword search is only one tool in your NLP pipeline. You want to build a chatbot. But most chatbots rely heavily on a search engine. And some chatbots rely exclusively on a search engine as their only algorithm for generating responses. You need to take one additional step to turn your simple search index (TF-IDF) into a chatbot. You need to store your training data in pairs of questions (or statements) and appropriate responses. Then you can use TF-IDF to search for a question (or statement) most like the user input text. Instead of returning the most similar statement in your database, you return the response associated with that statement. Like any tough computer science problem, ours can be solved with one more layer of indirection. And with that, you’re chatting!
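Here’s a minimal sketch of that idea built from the pieces already defined in this chapter. The response strings are made up for illustration (a real chatbot would store thousands of statement-response pairs), and for simplicity the user’s text is vectorized with plain normalized counts rather than TF-IDF:

>>> responses = ["About five minutes, if Harry hurries.",     # made-up responses,
...              "Yes, Harry is the faster of the two.",      # one per document in docs
...              "No, Jill is the less hairy one."]
>>> def reply(user_text):
...     """Return the response paired with the most similar stored statement."""
...     user_vec = copy.copy(zero_vector)
...     for token, count in Counter(tokenizer.tokenize(user_text.lower())).items():
...         if token in user_vec:                  # ignore tokens outside the lexicon
...             user_vec[token] = count / len(lexicon)
...     sims = [cosine_sim(user_vec, doc_vec)
...             for doc_vec in document_tfidf_vectors]
...     return responses[sims.index(max(sims))]
>>> reply("How long does it take to get to the store?")
'About five minutes, if Harry hurries.'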

3.4.3. Tools

Now that was a lot of code for things that have long since been automated. You can find a quick path to the same result using the scikit-learn package.[16] If you haven’t already set up your environment using appendix A so that it includes this package, here’s one way to install it:

16

See the web page titled “scikit-learn: machine learning in Python” (http://scikit-learn.org/).

pip install scipy
pip install scikit-learn

Here’s how you can use sklearn to build a TF-IDF matrix. The sklearn TF-IDF class is a model with .fit() and .transform() methods that comply with the sklearn API for all machine learning models:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = docs
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> model = vectorizer.fit_transform(corpus)                            1
>>> print(model.todense().round(2))                                     2
[[0.16 0.   0.48 0.21 0.21 0.   0.25 0.21 0.   0.   0.   0.21 0.   0.64
  0.21 0.21]
 [0.37 0.   0.37 0.   0.   0.37 0.29 0.   0.37 0.37 0.   0.   0.49 0.
  0.   0.  ]
 [0.   0.75 0.   0.   0.   0.29 0.22 0.   0.29 0.29 0.38 0.   0.   0.
  0.   0.  ]]

  • 1 The TfidfVectorizer model produces a sparse numpy matrix, because a TF-IDF matrix usually contains mostly zeros, since most documents use only a small portion of the total words in the vocabulary.
  • 2 The .todense() method converts a sparse matrix back into a regular numpy matrix (filling in the gaps with zeros) for your viewing pleasure.

With scikit-learn, in four lines you created a matrix of your three documents and the inverse document frequency for each term in the lexicon. You have a matrix (practically a list of lists in Python) that represents the three documents (the three rows of the matrix). The TF-IDF of each term, token, or word in your lexicon makes up the columns of the matrix (or again, the indices of each row). There are only 16 columns rather than your 18, because scikit-learn tokenizes differently and drops the punctuation (your comma and period). On large texts this or some other pre-optimized TF-IDF model will save you scads of work.
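To reproduce the earlier relevance ranking with scikit-learn, you can transform the query with the same fitted vectorizer and compare it to the document rows. This is a sketch of one way to do it, using sklearn’s cosine_similarity helper:

>>> from sklearn.metrics.pairwise import cosine_similarity
>>> query_vec = vectorizer.transform(
...     ["How long does it take to get to the store?"])
>>> similarities = cosine_similarity(query_vec, model)[0]
>>> int(similarities.argmax())       # document 0 is the most relevant, as before
0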

3.4.4. Alternatives

TF-IDF matrices (term-document matrices) have been the mainstay of information retrieval (search) for decades. As a result, researchers and corporations have spent a lot of time trying to optimize that IDF part to try to improve the relevance of search results. Table 3.1 lists some of the ways you can normalize and smooth your term frequency weights.[17]

17

Word Embeddings Past, Present and Future by Piero Molino at AI with the Best 2017.

Search engines (information retrieval systems) match keywords (terms) between queries and documents in a corpus. If you’re building a search engine and want to provide documents that are likely to match what your users are looking for, you should spend some time investigating the alternatives described by Piero Molino in table 3.1.

Table 3.1. Alternative TF-IDF normalization approaches (Molino 2017)

Scheme          Definition
None            wij = fij
TF-IDF
TF-ICF
Okapi BM25
ATC
LTU
MI
PosMI           wij = max(0, MI)
T-Test
χ2              See section 4.3.5 of From Distributional to Semantic Similarity (https://www.era.lib.ed.ac.uk/bitstream/handle/1842/563/IP030023.pdf#subsection.4.3.5) by James Richard Curran
Lin98a
Lin98b
Gref94

One such alternative to using straight TF-IDF cosine distance to rank query results is Okapi BM25, or its most recent variant, BM25F.

3.4.5. Okapi BM25

The smart people at London’s City University came up with a better way to rank search results. Rather than merely computing the TF-IDF cosine similarity, they normalize and smooth the similarity. They also ignore duplicate terms in the query document, effectively clipping the term frequencies for the query vector at 1. And the dot product for the cosine similarity isn’t normalized by the TF-IDF vector norms (number of terms in the document and the query), but rather by a nonlinear function of the document length itself:

q_idf * dot(q_tf, d_tf[i]) * 1.5 /
 (dot(q_tf, d_tf[i]) + .25 + .75 * d_num_words[i] / d_num_words.mean())
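If you want to experiment with BM25 yourself, here’s a minimal sketch of the standard scoring formula (the k1 and b parameters and the smoothed IDF are the conventional defaults, not values taken from this book):

>>> import math
>>> def bm25(query_tokens, doc_tokens, corpus_token_lists, k1=1.5, b=0.75):
...     """Score one tokenized document against a tokenized query with BM25."""
...     avg_len = sum(len(d) for d in corpus_token_lists) / len(corpus_token_lists)
...     score = 0.0
...     for term in set(query_tokens):                 # duplicate query terms are ignored
...         df = sum(1 for d in corpus_token_lists if term in d)
...         if not df:
...             continue
...         idf = math.log(1 + (len(corpus_token_lists) - df + 0.5) / (df + 0.5))
...         tf = doc_tokens.count(term)
...         score += idf * tf * (k1 + 1) / (
...             tf + k1 * (1 - b + b * len(doc_tokens) / avg_len))
...     return score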

You can optimize your pipeline by choosing the weighting scheme that gives your users the most relevant results. But if your corpus isn’t too large, you might consider forging ahead with us into even more useful and accurate representations of the meaning of words and documents. In subsequent chapters, we show you how to implement a semantic search engine that finds documents that “mean” something similar to the words in your query rather than just documents that use those exact words from your query. Semantic search is much better than anything TF-IDF weighting and stemming and lemmatization can ever hope to achieve. The only reason Google and Bing and other web search engines don’t use the semantic search approach is that their corpus is too large. Semantic word and topic vectors don’t scale to billions of documents, but millions of documents are no problem.

So you only need the most basic TF-IDF vectors to feed into your pipeline to get state-of-the-art performance for semantic search, document classification, dialog systems, and most of the other applications we mentioned in chapter 1. TF-IDFs are the first stage in your pipeline, the most basic set of features you’ll extract from text. In the next chapter, we compute topic vectors from your TF-IDF vectors. Topic vectors are an even better representation of the meaning of the content of a bag of words than any of these carefully normalized and smoothed TF-IDF vectors. And things only get better from there as we move on to Word2vec word vectors in chapter 6 and neural net embeddings of the meaning of words and documents in later chapters.

3.4.6. What’s next

Now that you can convert natural language text to numbers, you can begin to manipulate them and compute with them. Numbers firmly in hand, in the next chapter you’ll refine those numbers to try to represent the meaning or topic of natural language text instead of only its words.

Summary

  • Any web-scale search engine with millisecond response times has the power of a TF-IDF term document matrix hidden under the hood.
  • Term frequencies must be weighted by their inverse document frequency to ensure the most important, most meaningful words are given the heft they deserve.
  • Zipf’s law can help you predict the frequencies of all sorts of things, including words, characters, and people.
  • The rows of a TF-IDF term document matrix can be used as a vector representation of the meanings of those individual words to create a vector space model of word semantics.
  • Euclidean distance and similarity between pairs of high dimensional vectors doesn’t adequately represent their similarity for most NLP applications.
  • Cosine similarity, the amount of “overlap” between vectors, can be calculated efficiently by just multiplying the elements of normalized vectors together and summing up those products.
  • Cosine similarity is the go-to similarity score for most natural language vector representations.