Rule-based matching

Before deep learning and statistical modeling took over, NLP was all about rules. That's not to say that rule-based systems are dead! They are often easy to set up and perform very well on simple tasks.

Imagine you wanted to find all mentions of Google in a text. Would you really train a neural network-based named entity recognizer? If you did, you would have to run all of the text through the neural network and then look for Google in the entity texts. Alternatively, would you rather just search for text that exactly matches Google with a classic search algorithm? Well, we're in luck, as spaCy comes with an easy-to-use, rule-based matcher that allows us to do just that.

Before we start this section, we first must make sure that we reload the English language model and import the matcher. This is a very simple task that can be done by running the following code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')

The matcher searches for patterns, which we encode as a list of dictionaries. It operates token by token; a token is usually a word, but punctuation marks and numbers can be tokens of their own.
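To get a feel for how a text is split into tokens, we can print them out; a quick illustrative check (the example phrase is our own):

for token in nlp(u'Hello, world!'):
    print(token.text)

This prints Hello, the comma, world, and the exclamation mark as four separate tokens.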

As a starting example, let's search for the phrase "hello, world." To do this, we would define a pattern as follows:

pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

This pattern is fulfilled if the first token's lowercase form is hello. The LOWER attribute checks whether the token, converted to lowercase, matches the given value. That means if the actual token text is "Hello" or "HELLO," then it would also fulfill the requirement. The second token has to be punctuation to pick up the comma, so the phrases "hello. world" or "hello! world" would both work, but not "hello world."

The lower case of the third token has to be "world," so "WoRlD" would also be fine.

The possible attributes for a token can be the following (we will combine a couple of them in a short example after the list):

  • ORTH: The token text has to match exactly
  • LOWER: The lower case of the token has to match
  • LENGTH: The length of the token text has to match
  • IS_ALPHA, IS_ASCII, IS_DIGIT: The token text has to consist of alphabetic characters, ASCII characters, or digits
  • IS_LOWER, IS_UPPER, IS_TITLE: The token text has to be lower case, upper case, or title case
  • IS_PUNCT, IS_SPACE, IS_STOP: The token text has to be punctuation, white space, or a stop word
  • LIKE_NUM, LIKE_URL, LIKE_EMAIL: The token has to resemble a number, URL, or email
  • POS, TAG, DEP, LEMMA, SHAPE: The token's part-of-speech tag, fine-grained tag, dependency label, lemma, or shape has to match
  • ENT_TYPE: The token's entity type from NER has to match
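Several of these attributes can be combined. As a small sketch (the pattern and example are our own, not from the original text), a pattern for simple money amounts such as "$50" could look like this:

money_pattern = [{'ORTH': '$'}, {'LIKE_NUM': True}]

The first token has to be exactly a dollar sign, and the second has to resemble a number.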

spaCy's lemmatization is extremely useful. A lemma is the base version of a word. For example, "was" is a version of "be," so "be" is the lemma for "was" but also for "is." spaCy can lemmatize words in context, meaning it uses the surrounding words to determine what the actual base version of a word is.
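We can see the lemmatizer in action by printing each token next to its lemma; a quick illustrative check (the example sentence is our own):

for token in nlp(u'The products were good'):
    print(token.text, token.lemma_)

Here, "were" comes back as "be" and "products" as "product".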

To create a matcher, we have to pass in the vocabulary the matcher works on. In this case, we can just pass the vocabulary of our English language model by running the following:

matcher = Matcher(nlp.vocab)

In order to add our pattern to the matcher, we call the following:

matcher.add('HelloWorld', None, pattern)

The add function expects three arguments. The first is a name for the pattern, in this case, HelloWorld, so that we can keep track of the patterns we added. The second is a function that can process matches once they are found. Here we pass None, meaning no function will be applied, though we will use this feature later. Finally, we need to pass the list of token attributes we want to search for.

To use our matcher, we can simply call matcher(doc). This will give us back all the matches that the matcher found. We can call this by running the following:

doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)

If we print out the matches, we can see the structure:

matches
[(15578876784678163569, 0, 3)]

The first element of a match is the hash of the pattern name, HelloWorld in our case. It identifies internally which pattern was matched; we won't use it here. The next two numbers indicate the range in which the matcher found something, here from token 0 up to, but not including, token 3.

We can get the text back by indexing the original document:

doc[0:3]
Hello, world
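If we had added several patterns, we could recover the pattern name from the hash through the vocabulary's string store. A minimal sketch:

for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end].text)

This prints HelloWorld next to the matched text.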

In the next section we will look at how we can add custom functions to matchers.

Adding custom functions to matchers

Let's move on to a more complex case. We know that the iPhone is a product. However, the neural network-based named entity recognizer often classifies it as an organization. This happens because the word "iPhone" gets used a lot in a similar context as organizations, as in "The iPhone offers..." or "The iPhone sold...."

Let's build a rule-based matcher that always classifies the word "iPhone" as a product entity.

First, we have to get the hash of the string PRODUCT. Strings in spaCy, including entity type names, are uniquely identified by their hash. To set an entity of the product type, we have to be able to provide the hash for the entity name.

We can get the name from the language model's vocabulary by running the following:

PRODUCT = nlp.vocab.strings['PRODUCT']
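The string store works in both directions, so we can convert the hash back into text; a quick check:

print(PRODUCT)                     # a large integer hash
print(nlp.vocab.strings[PRODUCT])  # 'PRODUCT'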

Next, we need to define an on_match rule. This function will be called every time the matcher finds a match. on_match rules have four arguments:

  • matcher: The matcher that made the match.
  • doc: The document the match was made in.
  • i: The index of a match. The first match in a document would have index zero, the second would have index one, and so on.
  • matches: A list of all matches made.

There are two things happening in our on_match rule:

def add_product_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]            #1
    doc.ents += ((PRODUCT, start, end),)         #2

Let's break down what they are:

  1. We index all matches to find our match at index i. One match is a tuple of a match_id, the start of the match, and the end of the match.
  2. We add a new entity to the document's named entities. An entity is a tuple of the hash of the type of entity (the hash of the word PRODUCT here), the start of the entity, and the end of the entity. To append an entity, we have to nest it in another tuple. Tuples that only contain one value need to include a comma at the end. It is important not to overwrite doc.ents, as we otherwise would remove all the entities that we have already found.

Now that we have an on_match rule, we can define our matcher.

We should note that matchers allow us to add multiple patterns, so we can add one pattern for just the word "iPhone" and another pattern for the word "iPhone" together with a version number, as in "iPhone 5":

pattern1 = [{'LOWER': 'iphone'}]                            #1
pattern2 = [{'ORTH': 'iPhone'}, {'IS_DIGIT': True}]         #2

matcher = Matcher(nlp.vocab)                                #3
matcher.add('iPhone', add_product_ent, pattern1, pattern2)  #4

So, what makes these commands work?

  1. We define the first pattern. Note that the value for LOWER has to be lowercase itself, which is why we write 'iphone'; it is compared against the lowercased token text.
  2. We define the second pattern.
  3. We create a new empty matcher.
  4. We add the patterns to the matcher. Both will fall under the rule called iPhone, and both will call our on_match rule called add_product_ent.

We will now pass one of the news articles through the matcher:

doc = nlp(df.content.iloc[14])         #1
matches = matcher(doc)                 #2

This code is relatively simple, with only two steps:

  1. We run the text through the pipeline to create an annotated document.
  2. We run the document through the matcher. This modifies the document created in the step before. We care less about the returned matches and more about how the on_match rule adds the matches as entities to our document, which we can verify as shown below.
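To check that the rule worked, we can list the product entities now attached to the document; a quick sketch (this assumes the article at index 14 actually mentions an iPhone):

for ent in doc.ents:
    if ent.label_ == 'PRODUCT':
        print(ent.text, ent.start, ent.end)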

Now that the matcher is set up, we need to add it to the pipeline so that spaCy can use it automatically. This will be the focus in the next section.

Adding the matcher to the pipeline

Calling the matcher separately is somewhat cumbersome. To add it to the pipeline, we have to wrap it into a function, which we can achieve by running the following:

def matcher_component(doc):
    # Running the matcher triggers our on_match rule,
    # which adds the entities directly to the document
    matcher(doc)
    return doc

The spaCy pipeline calls the components of the pipeline as functions and always expects the annotated document to be returned. Returning anything else could break the pipeline.

We can then add the matcher to the main pipeline, as can be seen in the following code:

nlp.add_pipe(matcher_component, last=True)

The matcher is now the last piece of the pipeline. From this point onward iPhones will now get tagged based on the matcher's rules.
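To see the new component in action, we can run a made-up test sentence through the updated pipeline (the exact set of entities found will depend on the model and spaCy version):

doc = nlp(u'Apple announced a new iPhone today.')
print([(ent.text, ent.label_) for ent in doc.ents])

Among the entities, iPhone should now appear with the label PRODUCT.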

And boom! All mentions of the word "iPhone" (case independent) are now tagged as named entities of the product type. You can validate this by displaying the entities with displacy as we have done in the following code:

from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

The results of that code can be seen in the following screenshot:

spaCy now finds the iPhone as a product

Combining rule-based and learning-based systems

One especially interesting aspect of spaCy's pipeline system is that it is relatively easy to combine different aspects of it. We can, for example, combine neural network-based named entity recognition with a rule-based matcher in order to find something such as executive compensation information.

Executive compensation is often reported in the press but hard to find in aggregate. One possible rule-based matching pattern for executive compensation could look like this:

pattern = [{'ENT_TYPE': 'PERSON'}, {'LEMMA': 'receive'}, {'ENT_TYPE': 'MONEY'}]

A matcher looking for this pattern would pick up a token belonging to a person's name, for example, John Appleseed or Daniel; followed by any form of the word receive, for example, received or receives; followed by a token belonging to an expression of money, for example, $4 million. Since ENT_TYPE is checked token by token, the match itself only covers one token of the name and one token of the money expression, but it reliably locates the relevant snippet.

This matcher could be run over a large text corpus, with the on_match rule saving the found snippets into a database. Here, the machine learning approach to named entity recognition and the rule-based approach work hand in hand.
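A minimal sketch of such a combined setup could look as follows; the list stands in for a real database, the helper name save_snippet and the example sentence are our own, and we assume the pretrained NER tags the name and the amount correctly:

compensation_matcher = Matcher(nlp.vocab)
found_snippets = []  # stands in for a database

def save_snippet(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    # Widen the span a little so the snippet keeps some context
    snippet = doc[max(start - 5, 0):min(end + 5, len(doc))]
    found_snippets.append(snippet.text)

pattern = [{'ENT_TYPE': 'PERSON'}, {'LEMMA': 'receive'}, {'ENT_TYPE': 'MONEY'}]
compensation_matcher.add('COMPENSATION', save_snippet, pattern)

compensation_matcher(nlp(u'John Appleseed received $4 million last year.'))
print(found_snippets)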

Since there is much more training data available with annotations for names and money than for statements about executive compensation, it is much easier to combine the NER with a rule-based method than to train a new NER.
