Named entity recognition

A common task in NLP is named entity recognition (NER). NER is the task of finding and classifying the entities that a text explicitly refers to, such as people, organizations, and places. Before discussing what is going on in more detail, let's jump right in and do some hands-on NER on the first article in our dataset.

The first thing we need to do is load spaCy, in addition to the model for English language processing:

import spacy
nlp = spacy.load('en')

Next, we must select the text of the article from our data:

text = df.loc[0,'content']

Finally, we'll run this piece of text through the English language model pipeline. This will create a Doc instance, something we explained earlier on in this chapter. The Doc object will hold a lot of information, including the named entities:

doc = nlp(text)

One of the best features of spaCy is that it comes with a handy visualizer called displacy, which we can use to show the named entities in text. To get the visualizer to generate the display, based on the text from our article, we must run this code:

from spacy import displacy
displacy.render(doc,              #1
                style='ent',      #2
                jupyter=True)     #3

With that command now executed, we've done three important things:

  1. We've passed the Doc object we want to visualize
  2. We have specified that we would like to render entities
  3. We let displacy know that we are running this in a Jupyter notebook so that rendering works correctly (a sketch for rendering outside a notebook follows this list)
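
If you are working outside of a Jupyter notebook, displacy.render can instead return the markup as a string (by setting jupyter=False and passing page=True for a complete HTML page), which you can then write to a file and open in a browser. The following is a minimal sketch; the output file name is just an illustration:

html = displacy.render(doc, style='ent', jupyter=False, page=True)   # returns the HTML markup as a string
with open('entities.html', 'w', encoding='utf-8') as f:              # hypothetical output file
    f.write(html)
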
The output of the previous NER using spaCy tags

And voilà! As you can see, there are a few mishaps, such as blank spaces being classified as organizations, and "Obama" being classified as a place.

So, why has this happened? It's because the tagging has been done by a neural network, and neural networks are strongly dependent on the data they were trained on. Because of these imperfections, we might find that we need to fine-tune the tagging model for our own purposes, and shortly, we will see how that works.

You can also see in our output that NER offers a wide range of tags, some of which come with strange abbreviations. Don't worry about them for now; we will examine a full list of the tags shortly.

Right now, let's answer a different question: what organizations does the news in our dataset write about? To make this exercise run faster, we will create a new pipeline in which we will disable everything but NER.

To find out the answer to this question, we must first run the following code:

nlp = spacy.load('en', disable=['parser', 'tagger', 'textcat'])
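
To verify that the reduced pipeline really only contains the components we want, we can inspect nlp.pipe_names, which lists the active pipeline components; with the components disabled above, this should typically show just the NER:

print(nlp.pipe_names)   # with parser, tagger, and textcat disabled, this should show only ['ner']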

In the next step, we'll loop over the first 1,000 articles from our dataset, which can be done with the following code:

from tqdm import tqdm_notebook

frames = []
for i in tqdm_notebook(range(1000)):
    doc = df.loc[i,'content']                              #1
    text_id = df.loc[i,'id']                               #2
    doc = nlp(doc)                                         #3
    ents = [(e.text, e.start_char, e.end_char, e.label_)   #4
            for e in doc.ents 
            if len(e.text.strip(' -—')) > 0]
    frame = pd.DataFrame(ents)                             #5
    frame['id'] = text_id                                  #6
    frames.append(frame)                                   #7
    
npf = pd.concat(frames)                                    #8

npf.columns = ['Text','Start','Stop','Type','id']          #9

The code we've just written has nine key points. Let's take a minute to break it down so that we are confident we understand it. Note that the numbered comments (#1 to #9) in the preceding code correspond to the points in the following list:

  1. We get the content of the article at row i.
  2. We get the id of the article.
  3. We run the article through the pipeline.
  4. For all the entities found, we save the text, the index of the first and last character, as well as the label. This only happens if the entity text consists of more than just whitespace and dashes, which removes some of the mishaps we encountered earlier, where empty segments or delimiters were tagged.
  5. We create a pandas DataFrame out of the array of tuples created.
  6. We add the id of the article to all records of our named entities.
  7. We add the DataFrame containing all the tagged entities of one document to a list. This way, we can build a collection of tagged entities over a larger number of articles.
  8. We concatenate all DataFrames in the list, meaning that we create one big table with all tags.
  9. For easier use, we give the columns meaningful names.
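
As an aside, spaCy can also process texts as a stream with nlp.pipe, which batches documents internally and is usually noticeably faster than calling nlp() once per article. The following is a minimal sketch of the same loop in that style, assuming the same df with content and id columns and the default integer index; the batch size is illustrative:

texts = df.loc[:999, 'content']                              # first 1,000 articles
ids = df.loc[:999, 'id']

frames = []
for text_id, doc in zip(ids, nlp.pipe(texts, batch_size=50)):
    ents = [(e.text, e.start_char, e.end_char, e.label_)
            for e in doc.ents
            if len(e.text.strip(' -—')) > 0]
    frame = pd.DataFrame(ents, columns=['Text', 'Start', 'Stop', 'Type'])
    frame['id'] = text_id
    frames.append(frame)

npf = pd.concat(frames)

Because the columns are named when each frame is created, the separate renaming step at the end is no longer needed.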

Now that we've done that, the next step is to plot the distribution of the entity types that we found, which we can do with the following code:

npf.Type.value_counts().plot(kind='bar')

The output of the code is the following graph:

spaCy tag distribution

After seeing the preceding graph, it is a fair question to ask which categories spaCy can identify and where they come from. The English language NER that comes with spaCy is a neural network trained on the OntoNotes 5.0 corpus, meaning it can recognize the following categories:

  • PERSON: People, including fictional characters
  • ORG: Companies, agencies, institutions
  • GPE: Places including countries, cities, and states
  • DATE: Absolute (for example, January 2017) or relative dates (for example, two weeks)
  • CARDINAL: Numerals that are not covered by other types
  • NORP: Nationalities or religious or political groups
  • ORDINAL: "first," "second," and so on
  • TIME: Times shorter than a day (for example, two hours)
  • WORK_OF_ART: Titles of books, songs, and so on
  • LOC: Locations that are not GPEs, for example, mountain ranges or streams
  • MONEY: Monetary values
  • FAC: Facilities such as airports, highways or bridges
  • PERCENT: Percentages
  • EVENT: Named hurricanes, battles, sporting events, and so on
  • QUANTITY: Measurements such as weights or distance
  • LAW: Named documents that are laws
  • PRODUCT: Objects, vehicles, food, and so on
  • LANGUAGE: Any named language
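
If you run into one of these abbreviations later, spaCy can also describe a label for you at runtime via spacy.explain, which returns a short human-readable description:

print(spacy.explain('GPE'))    # 'Countries, cities, states'
print(spacy.explain('NORP'))   # 'Nationalities or religious or political groups'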

Using this list, we will now look at the 15 most frequently named organizations, which are tagged as ORG, and produce a similar graph showing that information.

To get the graph, we must run the following:

orgs = npf[npf.Type == 'ORG']
orgs.Text.value_counts()[:15].plot(kind='bar')

Running this code gives us the following graph:

spaCy organization distribution

As you can see, political institutions such as the Senate are most frequently named in our news dataset. Likewise, some companies, such as Volkswagen, that were at the center of media attention can also be found in the chart. Take a minute to also notice how "the White House" and "White House" are listed as two separate organizations, even though we know they refer to the same entity.

Depending on your needs, you might want to do some post-processing, such as removing a leading "the" from organization names. pandas comes with a built-in string replacement method, Series.str.replace, that you can use for this. However, this is not something we will cover in depth here.

Should you want to look at it in more detail, you can get the documentation and example from the following link: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html
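
As a minimal sketch of such post-processing, the following strips a leading "the" from the organization names before counting them, so that variants such as "the White House" and "White House" collapse into a single entry; the regular expression is an assumption about how the names appear in this dataset:

cleaned = orgs.Text.str.replace(r'^[Tt]he\s+', '', regex=True)   # drop a leading "the"/"The"
cleaned.value_counts()[:15].plot(kind='bar')                     # re-plot the top 15 organizations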

Also, note how Trump is shown here as an organization. However, if you look at the tagged text, you will see that "Trump" is also tagged several times as NORP, the tag for nationalities and religious or political groups. This happens because the NER infers the type of a tag from its context. Since Trump is the US president, his name often gets used in the same context as (political) organizations.
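
We can check this directly in the entity table we built earlier by counting how often a given surface form receives each tag:

npf[npf.Text == 'Trump'].Type.value_counts()   # how often "Trump" receives each entity type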

This pretrained NER gives you a powerful tool that can solve many common NLP tasks. So, in reality, from here you could conduct all kinds of other investigations. For example, we could fork the notebook to see whether The New York Times is mentioned as different entity types more often than the Washington Post or Breitbart.
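
A rough sketch of such an investigation, again using the npf table we built earlier: filter for a few publication names and cross-tabulate the entity types they receive. The exact surface forms used here are assumptions and would need to be checked against the data:

papers = npf[npf.Text.isin(['The New York Times', 'The Washington Post', 'Breitbart'])]
pd.crosstab(papers.Text, papers.Type)   # entity types per publication name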

Fine-tuning the NER

A common issue you may find is that the pretrained NER does not perform well enough on the specific types of text that you want it to work with. To solve this problem, you will need to fine-tune the NER model by training it with custom data. Achieving this will be the focus of this section.

The training data you're using should be in a form like this:

TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]

As you can see, you provide a list of tuples, each containing the text string together with the character start and end positions and the types of the entities you want to tag. Data such as this is usually collected through manual tagging, often on platforms such as Amazon's Mechanical Turk (MTurk).
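
If you are assembling such data yourself, the character offsets can be computed from the raw text with Python's str.find. The following is a hypothetical helper, assuming each entity string occurs exactly once in its text:

def make_example(text, entities):
    # entities is a list of (entity_text, label) pairs; returns a (text, annotation) tuple
    spans = []
    for ent_text, label in entities:
        start = text.find(ent_text)                    # assumes the entity occurs exactly once
        spans.append((start, start + len(ent_text), label))
    return (text, {'entities': spans})

make_example('Who is Shaka Khan?', [('Shaka Khan', 'PERSON')])
# ('Who is Shaka Khan?', {'entities': [(7, 17, 'PERSON')]})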

The company behind spaCy, Explosion AI, also makes a (paid) data tagging system called Prodigy, which enables efficient data collection. Once you have collected enough data, you can either fine-tune a pretrained model or initialize a completely new model.

To load and fine-tune a model, we need to use the load() function:

nlp = spacy.load('en')

Alternatively, to create a new and empty model from scratch that is ready for the English language, use the blank function:

nlp = spacy.blank('en')

Either way, we need to get access to the NER component. If you have created a blank model, you'll need to create an NER pipeline component and add it to the model.

If you have loaded an existing model, you can just access its existing NER component. The following code handles both cases:

if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe('ner')

The next step is to ensure that our NER can recognize the labels we have. Imagine our data contained a new type of named entity such as ANIMAL. With the add_label function, we can add a label type to an NER.

The code to achieve this can be seen below. Don't worry if it doesn't make sense right now; we'll break it down right afterward:

for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])
import random

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']   #1

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp._optimizer                     #2
    if not nlp._optimizer:
        optimizer = nlp.begin_training()
    for itn in range(5):                           #3
        random.shuffle(TRAIN_DATA)                 #4
        losses = {}                                #5
        for text, annotations in TRAIN_DATA:       #6
            nlp.update(                            #7
                [text],  
                [annotations],  
                drop=0.5,                          #8
                sgd=optimizer,                     #9
                losses=losses)                     #10
        print(losses)

What we've just written is made up of 10 key elements:

  1. We disable all pipeline components that are not the NER by first getting a list of all the components that are not the NER and then disabling them for training.
  2. Pretrained models come with an optimizer. If you have a blank model, you will need to create a new optimizer. Note that this also resets the model weights.
  3. We now train for a number of epochs, in this case, 5.
  4. At the beginning of each epoch, we shuffle the training data using Python's built-in random module.
  5. We create an empty dictionary to keep track of the losses.
  6. We then loop over the text and annotations in the training data.
  7. nlp.update performs one forward and backward pass, and updates the neural network weights. We need to supply the text and annotations so that the function can figure out how to train the network from them (a batched variant of this loop is sketched after this list).
  8. We can manually specify the dropout rate we want to use while training.
  9. We pass a stochastic gradient descent optimizer that performs the model updates. Note that you cannot just pass a Keras or TensorFlow optimizer here, as spaCy has its own optimizers.
  10. We can also pass a dictionary to write losses that we can later print to monitor progress.
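
As mentioned in point 7, the loop above updates on one example at a time. spaCy also ships minibatch and compounding utilities in spacy.util, so a batched variant of the same loop (reusing other_pipes and TRAIN_DATA from above) could look roughly like this; the batch sizes are illustrative, and either version prints one losses dictionary per epoch:

from spacy.util import minibatch, compounding

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp._optimizer or nlp.begin_training()   # reuse the optimizer if one exists, as above
    for itn in range(5):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch sizes grow from 4 to 32 over the course of training
        for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
            texts, annotations = zip(*batch)
            nlp.update(list(texts), list(annotations),
                       drop=0.5, sgd=optimizer, losses=losses)
        print(losses)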

Once you've run the training loop, the output should look something like this:

{'ner': 5.0091189558407585}
{'ner': 3.9693684224622108}
{'ner': 3.984836024903589}
{'ner': 3.457960373417813}
{'ner': 2.570318400714134}

What you are seeing is the loss value of a part of the spaCy pipeline, in this case, the named entity recognition (NER) engine. Similar to the cross-entropy loss we discussed in previous chapters, the actual value is hard to interpret and does not tell you very much. What matters here is that the loss is decreasing over time and that it reaches a value much lower than the initial loss.
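
Finally, it is worth sanity-checking the fine-tuned model on text it has not seen and, if you are happy with it, saving it to disk so it can be reloaded later. The test sentence and the output directory below are just illustrations:

test_doc = nlp('Shaka Khan visited London last week.')   # hypothetical held-out sentence
print([(ent.text, ent.label_) for ent in test_doc.ents])

nlp.to_disk('fine_tuned_ner')   # reload later with spacy.load('fine_tuned_ner')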
