Document similarity with word embeddings

A practical use case for word vectors is measuring the semantic similarity between documents. If you are a retail bank, insurance company, or any other company that sells to end users, you have to deal with support requests. Many customers submit similar requests, so by measuring how semantically similar two texts are, previous answers to similar requests can be reused and your organization's overall service can be improved.

spaCy has a built-in method for measuring the similarity between two texts, and its larger pretrained models ship with word vectors similar to the Word2Vec and GloVe embeddings. The method works by averaging the embedding vectors of all the words in a text and then measuring the cosine of the angle between the resulting average vectors. Two vectors pointing in roughly the same direction get a high similarity score, whereas vectors pointing in different directions get a low one. This is visualized in the following graph:

Figure: Similarity vectors

We can compare two support requests by running the following code (any spaCy model that ships with word vectors, such as en_core_web_lg, will do):

import spacy

nlp = spacy.load('en_core_web_lg')

sup1 = nlp('I would like to open a new checking account')
sup2 = nlp('How do I open a checking account?')

As the following output shows, these two requests are indeed judged to be quite similar, with a score of roughly 70%:

sup1.similarity(sup2)
0.7079433112862716
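
Under the hood, this corresponds to the averaging-and-cosine computation described earlier. The following sketch reproduces it with NumPy (assumed to be installed), reusing the sup1 and sup2 documents from the previous snippet; the result should come out close to the score returned by similarity:

import numpy as np

# Average the word vectors of all tokens in each document
vec1 = np.mean([token.vector for token in sup1], axis=0)
vec2 = np.mean([token.vector for token in sup2], axis=0)

# Cosine of the angle between the two average vectors
print(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))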

As you can see, their similarity score is quite high. This simple averaging method works reasonably well. It cannot, however, capture negation, and a single word whose vector deviates from the rest has little influence on the average.

For example, "I would like to close a checking account" means something quite different from "I would like to open a checking account," yet the model sees the two as being pretty similar, as the quick check below shows. Still, this approach is useful and a good illustration of the advantages of representing semantics as vectors.
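
You can verify this with a quick check along the following lines, reusing the nlp pipeline loaded above (the sup3 and sup4 names are just for illustration). The exact score depends on the model, but it typically comes out high despite the opposite meanings:

sup3 = nlp('I would like to close a checking account')
sup4 = nlp('I would like to open a checking account')

# Only a single word differs, so the averaged vectors stay close together
print(sup3.similarity(sup4))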
