Application of word embeddings - information retrieval

There are countless applications for word embeddings; one of them is in the field of information retrieval. When humans type keywords and key phrases into search engines, the engines recall and surface articles/stories that match those keywords exactly. For example, if we search for articles about dogs, we will get back articles that mention the word dog. But what if we search for the word canine? Because canines are dogs, we should still expect to see articles about dogs. Let's implement a simple information retrieval system to showcase the power of word embeddings.

Let's create a function that tries to grab embeddings of individual words from our gensim package and returns None if this lookup fails:

# helper function that tries to grab the embedding for a word
# and returns None if that word is not in the model's vocabulary
def get_embedding(string):
    try:
        return model.wv[string]
    except KeyError:
        return None
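
As a quick sanity check, here is a small sketch of our own (assuming model is the pretrained 300-dimensional word2vec model loaded earlier in the chapter; the made-up token below is purely hypothetical):

# an in-vocabulary word should come back as a 300-dimensional numpy vector
get_embedding('dog').shape    # (300,)

# a made-up token should fall outside the vocabulary and come back as None
get_embedding('dooooog_12345')    # None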

Now, let's create three article titles, one about a dog, one about a cat, and one about absolutely nothing at all for a distractor:

# very original article titles
sentences = [
    "this is about a dog",
    "this is about a cat",
    "this is about nothing"
]

The goal is to input a reference word that is similar to dog or cat and be able to grab the most relevant title. To do this, we will first create a 3 x 300 matrix containing a vectorization of each sentence. We will do this by taking the mean of the embeddings of every word in the sentence and using the resulting mean vector as an estimate of the vectorization of the entire sentence. Once we have a vectorization of every sentence, we can compare it against the embedding of the reference word by taking a dot product between the two. The closest vector is the one with the largest dot product:

import numpy as np
from functools import reduce

# zero matrix of shape (3, 300)
vectorized_sentences = np.zeros((len(sentences), 300))
# for every sentence
for i, sentence in enumerate(sentences):
    # tokenize sentence into words
    words = sentence.split(' ')
    # embed whichever words we can, skipping words without an embedding
    embedded_words = [get_embedding(w) for w in words]
    embedded_words = [w for w in embedded_words if w is not None]
    # take the mean of the word vectors as an estimate of the vectorization of the sentence
    vectorized_sentence = reduce(lambda x, y: x + y, embedded_words) / len(embedded_words)
    # set the ith row (in place) to be the ith sentence's vectorization
    vectorized_sentences[i] = vectorized_sentence

vectorized_sentences.shape
(3, 300)

One thing to notice here is that we are creating a vectorization of documents (collections of words) without considering the order of the words. How is this better than using a CountVectorizer or a TfidfVectorizer to grab a count-based vectorization of the text? The gensim approach attempts to project our text onto a latent structure learned from the contexts of individual words, while the scikit-learn vectorizers can only use the vocabulary at our disposal to create our vectorizations. In these three sentences, there are only seven unique words:

this, is, about, a, dog, cat, nothing
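
As a quick check, here is a minimal sketch of our own using scikit-learn's CountVectorizer (the custom token_pattern is our assumption; it keeps one-character tokens such as a, which the default pattern would drop):

from sklearn.feature_extraction.text import CountVectorizer

# keep one-character tokens such as "a" so that all seven words are counted
count_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
count_vectorizer.fit_transform(sentences).shape

(3, 7)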

So, the maximum shape our CountVectorizer or TfidfVectorizer can produce is (3, 7), while the embedding approach gives us 300 dimensions learned from a much larger corpus. Let's try to grab the most relevant sentence to the word dog:

# we want articles most similar to the reference word "dog"
reference_word = 'dog'

# take a dot product between the embedding of dog and our matrix of vectorized sentences
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-1]

# output the most relevant sentence
sentences[best_sentence_idx]

'this is about a dog'

That one was easy. Given the word dog, we should be able to retrieve the sentence about a dog. This should also hold true if we input the word cat:

reference_word = 'cat'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-1]

sentences[best_sentence_idx]

'this is about a cat'

Now, let's try something harder. Let's input the words canine and tiger and see if we get the dog and cat sentences respectively:

reference_word = 'canine'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-1]

sentences[best_sentence_idx]

'this is about a dog'

reference_word = 'tiger'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-1]

sentences[best_sentence_idx]

'this is about a cat'

Let's try a slightly more interesting example. The following are chapter titles from Sinan's first book, Principles of Data Science:

# Chapter titles from Sinan's first book, "Principles of Data Science"

sentences = """How to Sound Like a Data Scientist
Types of Data
The Five Steps of Data Science
Basic Mathematics
A Gentle Introduction to Probability
Advanced Probability
Basic Statistics
Advanced Statistics
Communicating Data
Machine Learning Essentials
Beyond the Essentials
Case Studies """.split(' ')

This will give us a list of 12 different chapter titles to retrieve from. The goal then will be to use a reference word to sort and serve up the top three most relevant chapter titles to read, given the topic. For example, if we asked our algorithm to give us chapters relating to math, we might expect to be recommended the chapters about basic mathematics, statistics, and probability.

Let's try to see which chapters are the best to read, given human input. Before we do so, let's calculate a matrix of vectorized documents like we did with our previous three sentences:

# zero matrix of shape (12, 300)
vectorized_sentences = np.zeros((len(sentences), 300))
# for every sentence
for i, sentence in enumerate(sentences):
    # tokenize sentence into words
    words = sentence.split(' ')
    # embed whichever words we can, skipping words without an embedding
    embedded_words = [get_embedding(w) for w in words]
    embedded_words = [w for w in embedded_words if w is not None]
    # take the mean of the word vectors as an estimate of the vectorization of the sentence
    vectorized_sentence = reduce(lambda x, y: x + y, embedded_words) / len(embedded_words)
    # set the ith row (in place) to be the ith sentence's vectorization
    vectorized_sentences[i] = vectorized_sentence

vectorized_sentences.shape
(12, 300)

Now, let's find the chapters that are most related to math:

# find chapters about math
reference_word = 'math'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-3:][::-1]

[sentences[b] for b in best_sentence_idx]

['Basic Mathematics', 'Basic Statistics', 'Advanced Probability ']

Now, let's say we are giving a talk about data and want to know which chapters are going to be the most helpful in that area:

# which chapters are about giving talks about data
reference_word = 'talk'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-3:][::-1]

[sentences[b] for b in best_sentence_idx]

['Communicating Data ', 'How to Sound Like a Data Scientist', 'Case Studies ']

And finally, which chapters are about AI:

# which chapters are about AI
reference_word = 'AI'
best_sentence_idx = np.dot(vectorized_sentences, get_embedding(reference_word)).argsort()[-3:][::-1]

[sentences[b] for b in best_sentence_idx]

['Advanced Probability ', 'Advanced Statistics', 'Machine Learning Essentials']

We can see how word embeddings let us retrieve relevant text for a query word by using context learned from a much larger universe of text.
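
To tie the steps together, here is a minimal sketch of a hypothetical helper (the name retrieve_titles and its parameters are our own, not from the chapter) that wraps the embed, average, and rank-by-dot-product logic used above:

# hypothetical helper wrapping the steps above
def retrieve_titles(reference_word, titles, top_n=3):
    # build the matrix of mean word vectors, one row per title
    matrix = np.zeros((len(titles), 300))
    for i, title in enumerate(titles):
        embedded = [get_embedding(w) for w in title.split(' ')]
        embedded = [w for w in embedded if w is not None]
        matrix[i] = reduce(lambda x, y: x + y, embedded) / len(embedded)
    # rank titles by their dot product with the reference word's embedding
    best = np.dot(matrix, get_embedding(reference_word)).argsort()[-top_n:][::-1]
    return [titles[b] for b in best]

# for example, this call should reproduce the math query from above
retrieve_titles('math', sentences)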
