A common task in NLP is named entity recognition (NER). NER is about finding the entities, such as people, organizations, or places, that a text explicitly refers to. Before discussing what is going on under the hood, let's jump right in and do some hands-on NER on the first article in our dataset.
The first thing we need to do is load spaCy, in addition to the model for English language processing:
import spacy

nlp = spacy.load('en')
Next, we must select the text of the article from our data:
text = df.loc[0,'content']
Finally, we'll run this piece of text through the English language model pipeline. This will create a Doc instance, something we explained earlier on in this chapter. The Doc instance will hold a lot of information, including the named entities:
doc = nlp(text)
One of the best features of spaCy is that it comes with a handy visualizer called displacy, which we can use to show the named entities in text. To get the visualizer to generate the display, based on the text from our article, we must run this code:
from spacy import displacy

displacy.render(doc,           #1
                style='ent',   #2
                jupyter=True)  #3
With that command now executed, we've done three important things, which are:

1. We've passed the Doc instance containing our article to displacy.
2. We've specified that we want to render the named entities by setting the style to 'ent'.
3. We've let displacy know that we are running this in a Jupyter notebook so that rendering works correctly.

And voilà! As you can see, there are a few mishaps, such as blank spaces being classified as organizations, and "Obama" being classified as a place.
So, why has this happened? It's because the tagging is done by a neural network, and neural networks are strongly dependent on the data that they were trained on. Because of these imperfections, we might find that we need to fine-tune the tagging model for our own purposes, and in a minute, we will see how that works.
You can also see in our output that NER offers a wide range of tags, some of which come with strange abbreviations. For now, don't worry as we will examine a full list of tags later on in this chapter.
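In the meantime, if you run into an abbreviation you don't recognize, spaCy ships a small glossary lookup, spacy.explain, that maps a tag name to a short description. This is a minimal sketch; it assumes spaCy is installed, but no language model needs to be loaded for the lookup:

```python
import spacy

# Look up human-readable descriptions for a few tag abbreviations
for tag in ['ORG', 'GPE', 'NORP']:
    print(tag, '->', spacy.explain(tag))
```

This is handy whenever a tag in the rendered output looks cryptic.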
Right now, let's answer a different question: which organizations do the news articles in our dataset write about? To make this exercise run faster, we will create a new pipeline in which we will disable everything but the NER.
To find out the answer to this question, we must first run the following code:
nlp = spacy.load('en', disable=['parser', 'tagger', 'textcat'])
In the next step, we'll loop over the first 1,000 articles from our dataset, which can be done with the following code:
from tqdm import tqdm_notebook

frames = []
for i in tqdm_notebook(range(1000)):
    doc = df.loc[i,'content']                              #1
    text_id = df.loc[i,'id']                               #2
    doc = nlp(doc)                                         #3
    ents = [(e.text, e.start_char, e.end_char, e.label_)   #4
            for e in doc.ents
            if len(e.text.strip(' -—')) > 0]
    frame = pd.DataFrame(ents)                             #5
    frame['id'] = text_id                                  #6
    frames.append(frame)                                   #7

npf = pd.concat(frames)                                    #8
npf.columns = ['Text','Start','Stop','Type','id']          #9
The code we've just created has nine key points. Let's take a minute to break it down, so we are confident in understanding what we've just written. Note that in the preceding code, the hashtag, #, refers to the number it relates to in this following list:

1. We select the content of the article at row i.
2. We select the article's id.
3. We run the article content through the NER pipeline.
4. We collect all named entities as tuples of the entity text, start character, end character, and label, skipping entities that consist only of whitespace or dashes.
5. We turn the list of entity tuples into a pandas DataFrame.
6. We add the article's id to the frame, so we can trace each entity back to its article.
7. We append the frame to the list of frames.
8. Once the loop is done, we concatenate all of the frames into one DataFrame.
9. We give the columns meaningful names.

Now that we've done that, the next step is to plot the distribution of the types of entities that we found. The chart can be created with the following code:
npf.Type.value_counts().plot(kind='bar')
The output of the code is the following graph:
After seeing the preceding graph, it is a fair question to ask which categories spaCy can identify and where they come from. The English language NER that comes with spaCy is a neural network trained on the OntoNotes 5.0 corpus, meaning it can recognize the following categories:
PERSON: People, including fictional characters
NORP: Nationalities and religious or political groups
FAC: Buildings, airports, highways, bridges, and so on
ORG: Companies, agencies, institutions, and so on
GPE: Countries, cities, and states
LOC: Non-GPE locations, for example, mountain ranges or streams
PRODUCT: Objects, vehicles, foods, and so on (not services)
EVENT: Named hurricanes, battles, wars, sports events, and so on
WORK_OF_ART: Titles of books, songs, and so on
LAW: Named documents that have been made into laws
LANGUAGE: Any named language
DATE: Absolute or relative dates or periods
TIME: Times smaller than a day
PERCENT: Percentages, including the "%" sign
MONEY: Monetary values, including the unit
QUANTITY: Measurements, such as weight or distance
ORDINAL: "First," "second," and so on
CARDINAL: Numerals that do not fall under another type

Using this list, we will now look at the 15 most frequently named organizations, categorized as ORG. As part of this, we will produce a similar graph showing us that information.
To get the graph, we must run the following:
orgs = npf[npf.Type == 'ORG']
orgs.Text.value_counts()[:15].plot(kind='bar')
Running this code will give us the following graph:
As you can see, political institutions such as the Senate are most frequently named in our news dataset. Likewise, some companies, such as Volkswagen, that were at the center of media attention can also be found in the chart. Take a minute to also notice how "the White House" and "White House" are listed as two separate organizations, despite us knowing that they refer to the same entity.
Depending on your needs, you might want to do some post-processing, such as removing "the" from organization names. pandas comes with a built-in string replacement method, Series.str.replace, that is well suited to this. However, this is not something we will cover in depth here.
Should you want to look at it in more detail, you can get the documentation and example from the following link: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html
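As a quick illustration of what that method does, here is a minimal sketch on a small stand-in Series. The entity strings mirror the kind of duplicates we saw above; the full npf frame is not needed:

```python
import pandas as pd

# Stand-in for the 'Text' column of our entity frame
orgs = pd.Series(['the White House', 'White House', 'the Senate', 'Volkswagen'])

# Strip a leading "the " (any case) so duplicate spellings collapse
cleaned = orgs.str.replace(r'^[Tt]he\s+', '', regex=True)

print(cleaned.value_counts())
```

After this cleanup, the two spellings of "White House" would be counted as one organization.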
Also, note how Trump is shown here as an organization. However, if you look at the tagged text, you will also see that "Trump" is tagged several times as NORP, the label for nationalities and religious or political groups. This happens because the NER infers the type of tag from the context. Since Trump is the U.S. president, his name often gets used in the same context as (political) organizations.
This pretrained NER gives you a powerful tool that can solve many common NLP tasks. So, in reality, from here you could conduct all kinds of other investigations. For example, we could fork the notebook to see whether The New York Times is mentioned as different entities more often than the Washington Post or Breitbart.
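One way such an investigation could be sketched: if we joined a publication column onto the entity frame via the article id, we could compare how often each outlet mentions a given entity. The frame below is a hypothetical stand-in; the Text and publication columns mirror our data, but the rows are invented for illustration:

```python
import pandas as pd

# Hypothetical join of extracted entities with the publication of their article
mentions = pd.DataFrame({
    'Text': ['The New York Times', 'The New York Times', 'Breitbart',
             'The New York Times', 'Washington Post'],
    'publication': ['Breitbart', 'CNN', 'CNN', 'Breitbart', 'Breitbart'],
})

# How often does each outlet mention The New York Times?
nyt = mentions[mentions.Text == 'The New York Times']
print(nyt.publication.value_counts())
```

The same filter-and-count pattern works for any entity or entity type you want to compare across outlets.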
A common issue you may find is that the pretrained NER does not perform well enough on the specific types of text that you want it to work with. To solve this problem, you will need to fine-tune the NER model by training it with custom data. Achieving this will be the focus of this section.
The training data you're using should be in a form like this:
TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]
As you can see, you provide a list of tuples of the string, together with the start and end points, as well as the types of entities you want to tag. Data such as this is usually collected through manual tagging, often on platforms such as Amazon's Mechanical Turk (MTurk).
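The start and end points are plain character offsets into the string, which makes it easy to sanity-check hand-labeled data before training. A minimal check on the examples above:

```python
TRAIN_DATA = [
    ('Who is Shaka Khan?', {'entities': [(7, 17, 'PERSON')]}),
    ('I like London and Berlin.', {'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}),
]

# Slicing the text with each (start, end) pair should recover the entity string
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations['entities']:
        print(label, '->', repr(text[start:end]))
```

A check like this catches off-by-one errors in manually tagged data early, before they silently degrade training.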
The company behind spaCy, Explosion AI, also makes a (paid) data tagging system called Prodigy, which enables efficient data collection. Once you have collected enough data, you can either fine-tune a pretrained model or initialize a completely new model.
To load and fine-tune a model, we need to use the load() function:
nlp = spacy.load('en')
Alternatively, to create a new and empty model from scratch that is ready for the English language, use the blank function:
nlp = spacy.blank('en')
Either way, we need to get access to the NER component. If you have created a blank model, you'll need to create an NER pipeline component and add it to the model.
If you have loaded an existing model, you can just access its existing NER by running the following code:
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe('ner')
The next step is to ensure that our NER can recognize the labels we have. Imagine our data contained a new type of named entity, such as ANIMAL. With the add_label function, we can add a label type to an NER.
The code to achieve this can be seen below, but don't worry if it doesn't make sense right now, as we'll break it down immediately afterward:
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

import random                                                  #1

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp._optimizer                                 #2
    if not nlp._optimizer:
        optimizer = nlp.begin_training()
    for itn in range(5):                                       #3
        random.shuffle(TRAIN_DATA)                             #4
        losses = {}                                            #5
        for text, annotations in TRAIN_DATA:                   #6
            nlp.update(                                        #7
                [text],
                [annotations],
                drop=0.5,                                      #8
                sgd=optimizer,                                 #9
                losses=losses)                                 #10
        print(losses)
What we've just written is made up of 10 key elements:

1. We import the random module, which we'll use to shuffle the training data.
2. We access the model's existing optimizer; if the model does not have one yet, we create one by calling begin_training().
3. We train for five epochs.
4. At the start of each epoch, we shuffle the training data.
5. We create an empty dictionary to keep track of the losses.
6. We then loop over the texts and annotations in the training data.
7. nlp.update performs one forward and backward pass, and updates the neural network weights. We need to supply text and annotations, so that the function can figure out how to train a network from it.
8. We apply a dropout rate of 50%, so that the network does not simply memorize our small training set.
9. We pass the optimizer that should update the weights.
10. We record the losses in our dictionary.

Once you've run the code, the output should look something like this:
{'ner': 5.0091189558407585}
{'ner': 3.9693684224622108}
{'ner': 3.984836024903589}
{'ner': 3.457960373417813}
{'ner': 2.570318400714134}
What you are seeing is the loss value of a part of the spaCy pipeline, in this case, the named entity recognition (NER) engine. Similar to the cross-entropy loss we discussed in previous chapters, the actual value is hard to interpret and does not tell you very much. What matters here is that the loss is decreasing over time and that it reaches a value much lower than the initial loss.
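Since only the trend matters, it can be worth collecting the per-epoch values and checking them programmatically rather than eyeballing the printout. A small sketch, using rounded loss values from the sample output above (note that, as in that output, the loss does not have to fall at every single epoch):

```python
# Per-epoch NER losses, rounded from the sample output above
ner_losses = [5.0091, 3.9694, 3.9848, 3.4580, 2.5703]

# The final loss should be well below the initial one
improvement = 1 - ner_losses[-1] / ner_losses[0]
print('loss reduced by {:.0%}'.format(improvement))
```

If the final loss were not clearly below the initial one, that would be a signal to train for more epochs or revisit the training data.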