Chapter 11. Information extraction (named entity extraction and question answering)

This chapter covers

  • Sentence segmentation
  • Named entity recognition (NER)
  • Numerical information extraction
  • Part-of-speech (POS) tagging and dependency tree parsing
  • Logical relation extraction and knowledge bases

One last skill you need before you can build a full-featured chatbot is extracting information or knowledge from natural language text.

11.1. Named entities and relations

You’d like your machine to extract pieces of information and facts from text so it can know a little bit about what a user is saying. For example, imagine a user says “Remind me to read aiindex.org on Monday.” You’d like that statement to trigger a calendar entry or alarm for the next Monday after the current date.

To trigger those actions, you’d need to know that “me” represents a particular kind of named entity: a person. And the chatbot should know that it should “expand” or normalize that word by replacing it with that person’s username. You’d also need your chatbot to recognize that “aiindex.org” is an abbreviated URL, a named entity of the name of a specific instance of something. And you need to know that a normalized spelling of this particular kind of named entity might be “http://aiindex.org,” “https://aiindex.org,” or maybe even “https://www.aiindex.org.” Likewise you need your chatbot to recognize that Monday is one of the days of the week (another kind of named entity called an “event”) and be able to find it on the calendar.

For the chatbot to respond properly to that simple request, you also need it to extract the relation between the named entity “me” and the command “remind.” You’d even need to recognize the implied subject of the sentence, “you,” referring to the chatbot, another person named entity. And you need to “teach” the chatbot that reminders happen in the future, so it should find the soonest upcoming Monday to create the reminder.

A typical sentence may contain several named entities of various types, such as geographic entities, organizations, people, political entities, times (including dates), artifacts, events, and natural phenomena. And a sentence can contain several relations, too—facts about the relationships between the named entities in the sentence.

11.1.1. A knowledge base

Besides just extracting information from the text of a user statement, you can also use information extraction to help your chatbot train itself! If you have your chatbot run information extraction on a large corpus, such as Wikipedia, that corpus will produce facts about the world that can inform future chatbot behaviors and replies. Some chatbots record all the information they extract (from offline reading-assignment “homework”) in a knowledge base. Such a knowledge base can later be queried to help your chatbot make informed decisions or inferences about the world.

Chatbots can also store knowledge about the current user “session” or conversation. Knowledge that is relevant only to the current conversation is called “context.” This contextual knowledge can be stored in the same global knowledge base that supports the chatbot, or it can be stored in a separate knowledge base. Commercial chatbot APIs, such as IBM’s Watson or Amazon’s Lex, typically store context separate from the global knowledge base of facts that they use to support conversations with all the other users.

Context can include facts about the user, the chatroom or channel, or the weather and news for that moment in time. Context can even include the changing state of the chatbot itself, based on the conversation. An example of “self-knowledge” a smart chatbot should keep track of is the history of all the things it has already told someone or the questions it has already asked of the user, so it doesn’t repeat itself.

So that’s the goal for this chapter, teaching your bot to understand what it reads. And you’ll put that understanding into a flexible data structure designed to store knowledge. Then your bot can use that knowledge to make decisions and say smart stuff about the world.

In addition to the simple task of recognizing numbers and dates in text, you’d like your bot to be able to extract more general information about the world. And you’d like it to do this on its own, rather than having you “program” everything you know about the world into it. For example, you’d like it to be able to learn from natural language documents such as this sentence from Wikipedia:

In 1983, Stanislav Petrov, a lieutenant colonel of the Soviet Air Defense Forces, saved the world from nuclear war.

If you were to take notes in a history class after reading or hearing something like that, you’d probably paraphrase things and create connections in your brain between concepts or words. You might reduce it to a piece of knowledge, that thing that you “got out of it.” You’d like your bot to do the same thing. You’d like it to “take note” of whatever it learns, such as the fact or knowledge that Stanislav Petrov was a lieutenant colonel. This could be stored in a data structure something like this:

('Stanislav Petrov', 'is-a', 'lieutenant colonel')

This is an example of two named entity nodes ('Stanislav Petrov' and 'lieutenant colonel') and a relation or connection ('is a') between them in a knowledge graph or knowledge base. When a relationship like this is stored in a form that complies with the RDF standard (Resource Description Framework) for knowledge graphs, it’s referred to as an RDF triplet. Historically these RDF triplets were stored in XML files, but they can be stored in any file format or database that can hold a graph of triplets in the form of (subject, relation, object).
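
Before you have a real graph database, a plain Python container is enough to get a feel for how triplets are stored and queried. Here’s a minimal sketch; the facts list and the subjects_of() helper are made-up names for illustration, not part of any knowledge base library:

>>> facts = [
...     ('Stanislav Petrov', 'is-a', 'lieutenant colonel'),
...     ('lieutenant colonel', 'is-a', 'military rank'),
... ]
>>> def subjects_of(relation, obj):
...     """Return all subjects connected to obj by the given relation."""
...     return [s for s, r, o in facts if r == relation and o == obj]
>>> subjects_of('is-a', 'lieutenant colonel')
['Stanislav Petrov']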

A collection of these triplets is a knowledge graph. This is also sometimes called an ontology by linguists, because it’s storing structured information about words. But when the graph is intended to represent facts about the world rather than merely words, it’s referred to as a knowledge graph or knowledge base. Figure 11.1 is a graphic representation of the knowledge graph you’d like to extract from a sentence like that.

The “is-a” relationship at the top of figure 11.1 represents a fact that couldn’t be directly extracted from the statement about Stanislav. But this fact that “lieutenant colonel” is a military rank could be inferred from the fact that the title of a person who’s a member of a military organization is a military rank. This logical operation of deriving facts from a knowledge graph is called knowledge graph inference. It can also be called querying a knowledge base, analogous to querying a relational database.

Figure 11.1. Stanislav knowledge graph

For this particular inference or query about Stanislav’s military rank, your knowledge graph would have to already contain facts about militaries and military ranks. It might even help if the knowledge base had facts about the titles of people and how people relate to occupations (jobs). Perhaps you can see now how a base of knowledge helps a machine understand more about a statement than it could without that knowledge. Without this base of knowledge, many of the facts in a simple statement like this will be “over the head” of your chatbot. You might even say that questions about occupational rank would be “above the pay grade” of a bot that only knew how to classify documents according to randomly allocated topics.[1]

1

See chapter 4 if you’ve forgotten about how random topic allocation can be.

It may not be obvious how big a deal this is, but it is a BIG deal. If you’ve ever interacted with a chatbot that doesn’t understand “which way is up,” literally, you’d understand. One of the most daunting challenges in AI research is the challenge of compiling and efficiently querying a knowledge graph of common sense knowledge. We take common sense knowledge for granted in our everyday conversations.

Humans start acquiring much of their common sense knowledge even before they acquire language skill. We don’t spend our childhood writing about how a day begins with light and sleep usually follows sunset. And we don’t edit Wikipedia articles about how an empty belly should only be filled with food rather than dirt or rocks. This makes it hard for machines to find a corpus of common sense knowledge to read and learn from. No common-sense knowledge Wikipedia articles exist for your bot to do information extraction on. And some of that knowledge is instinct, hard-coded into our DNA.[2]

2

There are hard-coded, common-sense knowledge bases out there for you to build on. Google Scholar is your friend in this knowledge graph search.

All kinds of factual relationships exist between things and people, such as “kind-of,” “is-used-for,” “has-a,” “is-famous-for,” “was-born,” and “has-profession.” NELL, the Carnegie Mellon Never Ending Language Learning bot, is focused almost entirely on the task of extracting information about the “kind-of” relationship.

Most knowledge bases normalize the strings that define these relationships, so that “kind of” and “type of” would be assigned a normalized string or ID to represent that particular relation. And some knowledge bases also resolve the nouns representing the objects in a knowledge base. So the bigram “Stanislav Petrov” might be assigned a particular ID. Synonyms for “Stanislav Petrov,” like “S. Petrov” and “Lt Col Petrov,” would also be assigned to that same ID, if the NLP pipeline suspected they referred to the same person.
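
A minimal sketch of that kind of normalization might look like the following. The dictionaries and the 'person_001' ID are invented for illustration; a real knowledge base would use its own ID scheme and a much smarter matching algorithm:

>>> entity_ids = {
...     'Stanislav Petrov': 'person_001',
...     'S. Petrov': 'person_001',
...     'Lt Col Petrov': 'person_001',
... }
>>> relation_ids = {'kind of': 'is-a', 'type of': 'is-a', 'is a': 'is-a'}
>>> entity_ids['Lt Col Petrov'], relation_ids['type of']
('person_001', 'is-a')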

A knowledge base can be used to build a practical type of chatbot called a question answering system (QA system). Customer service chatbots, including university TA bots, rely almost exclusively on knowledge bases to generate their replies.[3] Question answering systems are great for helping humans find factual information, which frees up human brains to do the things they’re better at, such as attempting to generalize from those facts. Humans are bad at remembering facts accurately but good at finding connections and patterns between those facts, something machines have yet to master. We talk more about question answering chatbots in the next chapter.

3

11.1.2. Information extraction

So you’ve learned that “information extraction” is converting unstructured text into structured information stored in a knowledge base or knowledge graph. Information extraction is part of an area of research called natural language understanding (NLU), though that term is often used synonymously with natural language processing.

Information extraction and NLU is a different kind of learning than you may think of when researching data science. It isn’t only unsupervised learning; even the very “model” itself, the logic about how the world works, can be composed without human intervention. Instead of giving your machine fish (facts), you’re teaching it how to fish (extract information). Nonetheless, machine learning techniques are often used to train the information extractor.

11.2. Regular patterns

You need a pattern-matching algorithm that can identify sequences of characters or words that match the pattern so you can “extract” them from a longer string of text. The naive way to build such a pattern-matching algorithm is in Python, with a sequence of if/then statements that look for that symbol (a word or character) at each position of a string. Say you wanted to find some common greeting words, such as “Hi,” “Hello,” and “Yo,” at the beginning of a statement. You might do it as shown in the following listing.

Listing 11.1. Pattern hardcoded in Python
>>> def find_greeting(s):
...     """ Return greeting str (Hi, etc) if greeting pattern matches """
...     if s[0] == 'H':
...         if s[:3] in ['Hi', 'Hi ', 'Hi,', 'Hi!']:
...             return s[:2]
...         elif s[:6] in ['Hello', 'Hello ', 'Hello,', 'Hello!']:
...             return s[:5]
...     elif s[0] == 'Y':
...         if s[1] == 'o' and s[:3] in ['Yo', 'Yo,', 'Yo ', 'Yo!']:
...             return s[:2]
...     return None

And the following listing shows how it would work.

Listing 11.2. Brittle pattern-matching example
>>> find_greeting('Hi Mr. Turing!')
'Hi'
>>> find_greeting('Hello, Rosa.')
'Hello'
>>> find_greeting("Yo, what's up?")
'Yo'
>>> find_greeting("Hello")
'Hello'
>>> print(find_greeting("hello"))
None
>>> print(find_greeting("HelloWorld"))
None

You can probably see how tedious programming a pattern matching algorithm this way would be. And it’s not even that good. It’s quite brittle, relying on precise spellings and capitalization and character positions in a string. And it’s tricky to specify all the “delimiters,” such as punctuation, white space, or the beginnings and ends of strings (NULL characters), that are on either side of the words you’re looking for.

You could probably come up with a way to allow you to specify different words or strings you want to look for without hard-coding them into Python expressions like this. And you could even specify the delimiters in a separate function. That would let you do some tokenization and iteration to find the occurrence of the words you’re looking for anywhere in a string. But that’s a lot of work.

Fortunately that work has already been done! A pattern-matching engine is integrated into most modern computer languages, including Python. It’s called regular expressions. Regular expressions and string interpolation formatting expressions (for example, "{:05d}".format(42)) are mini programming languages unto themselves. This language for pattern matching is called the regular expression language. And Python has a regular expression interpreter (compiler and runner) in the standard library package re. So let’s use them to define your patterns instead of deeply nested Python if statements.
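
To see where this is headed, here’s a rough sketch of the greeting matcher from listing 11.1 rewritten as a single regular expression. The pattern shown is just one way to write it, not the only or best one:

>>> import re
>>> greeting_re = re.compile(r'^(Hi|Hello|Yo)(?=[\s,!]|$)')
>>> def find_greeting_re(s):
...     """Return the greeting at the start of s, or None."""
...     match = greeting_re.match(s)
...     return match.group(1) if match else None
>>> find_greeting_re('Hello, Rosa.')
'Hello'
>>> print(find_greeting_re('HelloWorld'))
None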

11.2.1. Regular expressions

Regular expressions are strings written in a special computer language that you can use to specify algorithms. Regular expressions are a lot more powerful, flexible, and concise than the equivalent Python you’d need to write to match patterns like this. So regular expressions are the pattern definition language of choice for many NLP problems involving pattern matching. This NLP application is an extension of its original use for compiling and interpreting formal languages (computer languages).

Regular expressions define a finite state machine or FSM—a tree of “if-then” decisions about a sequence of symbols, such as the find_greeting() function in listing 11.1. The symbols in the sequence are passed into the decision tree of the FSM one symbol at a time. A finite state machine that operates on a sequence of symbols such as ASCII character strings, or a sequence of English words, is called a grammar. They can also be called formal grammars to distinguish them from natural language grammar rules you learned in grammar school.

In computer science and mathematics, the word “grammar” refers to the set of rules that determine whether or not a sequence of symbols is a valid member of a language, often called a computer language or formal language. And a computer language, or formal language, is the set of all possible statements that would match the formal grammar that defines that language. That’s kind of a circular definition, but that’s the way mathematics works sometimes. You probably want to review appendix B if you aren’t familiar with basic regular expression syntax and symbols such as r'.*' and r'[a-z]'.

11.2.2. Information extraction as ML feature extraction

So you’re back where you started in chapter 1, where we first mentioned regular expressions. But didn’t you switch from “grammar-based” NLP approaches at the end of chapter 1 in favor of machine learning and data-driven approaches? Why return to hard-coded (manually composed) regular expressions and patterns? Because your statistical or data-driven approach to NLP has limits.

You want your machine learning pipeline to be able to do some basic things, such as answer logical questions, or perform actions such as scheduling meetings based on NLP instructions. And machine learning falls flat here. You rarely have a labeled training set that covers the answers to all the questions people might ask in natural language. Plus, as you’ll see here, you can define a compact set of condition checks (a regular expression) to extract key bits of information from a natural language string. And it can work for a broad range of problems.

Pattern matching (and regular expressions) continue to be the state-of-the-art approach for information extraction. Even with machine learning approaches to natural language processing, you need to do feature engineering. You need to create bags of words or embeddings of words to try to reduce the nearly infinite possibilities of meaning in natural language text into a vector that a machine can process easily. Information extraction is just another form of machine learning feature extraction from unstructured natural language data, such as creating a bag of words, or doing PCA on that bag of words. And these patterns and features are still employed in even the most advanced natural language machine learning pipelines, such as Google’s Assistant, Siri, Amazon Alexa, and other state-of-the-art bots.

Information extraction is used to find statements and information that you might want your chatbot to have “on the tip of its tongue.” Information extraction can be accomplished beforehand to populate a knowledge base of facts. Alternatively, the required statements and information can be found on-demand, when the chatbot is asked a question or a search engine is queried. When a knowledge base is built ahead of time, the data structure can be optimized to facilitate faster queries within larger domains of knowledge. A prebuilt knowledge base enables the chatbot to respond quickly to questions about a wider range of information. If information is retrieved in real-time, as the chatbot is being queried, this is often called “search.” Google and other search engines combine these two techniques, querying a knowledge graph (knowledge base) and falling back to text search if the necessary facts aren’t found.

Many of the natural language grammar rules you learned in school can be encoded in a formal grammar designed to operate on words or symbols representing parts of speech. And the English language can be thought of as the words and grammar rules that make up the language. Or you can think of it as the set of all the possible things you could say that would be recognized as valid statements by an English language speaker.

And that brings us to another feature of formal grammars and finite state machines that will come in handy for NLP. Any formal grammar can be used by a machine in two ways:

  • To recognize matches to that grammar
  • To generate a new sequence of symbols

Not only can you use patterns (regular expressions) for extracting information from natural language, but you can also use them in a chatbot that wants to “say” things that match that pattern! We show you how to do this with a package called rstr[4] for some of your information extraction patterns here.

4

See the web page titled “leapfrogdevelopment / rstr — Bitbucket” (https://bitbucket.org/leapfrogdevelopment/rstr/).

This formal grammar and finite state machine approach to pattern matching has some other awesome features. A true finite state machine can be guaranteed to always run in finite time (to halt). It will always tell you whether you’ve found a match in your string or not. It will never get caught in a perpetual loop... as long as you don’t use some of the advanced features of regular expression engines that allow you to “cheat” and incorporate loops into your FSM.

So you’ll stick to regular expressions that don’t require these “look-back” or “look-ahead” cheats. You’ll make sure your regular expression matcher processes each character and moves ahead to the next character only if it matches—sort of like a strict train conductor walking through the seats checking tickets. If you don’t have one, the conductor stops and declares that there’s a problem, a mismatch, and he refuses to go on, or look ahead or behind you until he resolves the problem. There are no “go backs” or “do overs” for train passengers, or for strict regular expressions.

11.3. Information worth extracting

Some keystone bits of quantitative information are worth the effort of “hand-crafted” regular expressions:

  • GPS locations
  • Dates
  • Prices
  • Numbers

Other important pieces of natural language information require more complex patterns than are easily captured with regular expressions:

  • Question trigger words
  • Question target words
  • Named entities

11.3.1. Extracting GPS locations

GPS locations are typical of the kinds of numerical data you’ll want to extract from text using regular expressions. GPS locations come in pairs of numerical values for latitude and longitude. They sometimes also include a third number for altitude, or height above sea level, but you’ll ignore that for now. Let’s just extract decimal latitude/longitude pairs, expressed in degrees. This will work for many Google Maps URLs. Though URLs aren’t technically natural language, they are often part of unstructured text data, and you’d like to extract this bit of information, so your chatbot can know about places as well as things.

Let’s use your decimal number pattern from previous examples, but let’s be more restrictive and make sure the value is within the valid range for latitude (+/- 90 deg) and longitude (+/- 180 deg). You can’t go any farther north than the North Pole (+90 deg) or farther south than the South Pole (-90 deg). And if you sail from Greenwich England 180 deg east (+180 deg longitude), you’ll reach the date line, where you’re also 180 deg west (-180 deg) from Greenwich. See the following listing.

Listing 11.3. Regular expression for GPS coordinates
>>> import re
>>> lat = r'([-]?[0-9]?[0-9][.][0-9]{2,10})'
>>> lon = r'([-]?1?[0-9]?[0-9][.][0-9]{2,10})'
>>> sep = r'[,/ ]{1,3}'
>>> re_gps = re.compile(lat + sep + lon)
 
>>> re_gps.findall('http://...maps/@34.0551066,-118.2496763...')
[('34.0551066', '-118.2496763')]
 
>>> re_gps.findall("https://www.openstreetmap.org/#map=10/5.9666/116.0566")
[('5.9666', '116.0566')]
 
>>> re_gps.findall("Zig Zag Cafe is at 45.344, -121.9431 on my GPS.")
[('45.344', '-121.9431')]

Numerical data is pretty easy to extract, especially if the numbers are part of a machine-readable string. URLs and other machine-readable strings put numbers such as latitude and longitude in a predictable order, format, and units to make things easy for us. This pattern will still accept some out-of-this-world latitude and longitude values, but it gets the job done for most of the URLs you’ll copy from mapping web apps such as OpenStreetMap.
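
If you want to tighten up that range checking, it’s often easier to let the regex do the rough extraction and then validate the numbers in plain Python. Here’s one sketch of that idea, reusing the re_gps pattern from listing 11.3; the valid_gps() helper is a made-up name for illustration:

>>> def valid_gps(lat_lon):
...     """Convert a (lat, lon) string pair to floats, or None if out of range."""
...     lat, lon = float(lat_lon[0]), float(lat_lon[1])
...     return (lat, lon) if -90 <= lat <= 90 and -180 <= lon <= 180 else None
>>> [valid_gps(pair) for pair in re_gps.findall('@34.0551066,-118.2496763')]
[(34.0551066, -118.2496763)]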

But what about dates? Will regular expressions work for dates? What if you want your date extractor to work in Europe and the US, where the order of day/month is often reversed?

11.3.2. Extracting dates

Dates are a lot harder to extract than GPS coordinates. Dates are more like natural language, with different dialects for expressing similar things. In the US, Christmas 2017 is “12/25/17.” In Europe, Christmas 2017 is “25/12/17.” You could check the locale of your user and assume that they write dates the same way as others in their region. But this assumption can be wrong.

So most date and time extractors try to work with both kinds of day/month orderings and check to make sure it’s a valid date. This is how the human brain works when we read a date like that. Even if you were a US-English speaker and you were in Brussels around Christmas, you’d probably recognize “25/12/17” as a holiday, because there are only 12 months in the year.

This “duck-typing” approach that works in computer programming can work for natural language, too. If it looks like a duck and acts like a duck, it’s probably a duck. If it looks like a date and acts like a date, it’s probably a date. You’ll use this “try it and ask forgiveness later” approach for other natural language processing tasks as well. You’ll try a bunch of options and accept the one that works. You’ll try your extractor or your generator, and then you’ll run a validator on it to see if it makes sense.
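
Here’s a tiny sketch of that “try it and ask forgiveness later” idea applied to day/month ambiguity. The guess_date() helper is hypothetical; it just tries month/day first and falls back to day/month if the calendar rejects it:

>>> from datetime import date
>>> def guess_date(a, b, year):
...     """Try month/day, then day/month; return the first valid calendar date."""
...     for month, day in ((a, b), (b, a)):
...         try:
...             return date(year, month, day)
...         except ValueError:
...             pass  # that ordering isn't a valid date, try the other one
>>> guess_date(25, 12, 2017)
datetime.date(2017, 12, 25)

When both orderings are valid (as in “12/11”), you still need context or a locale guess to pick one.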

For chatbots this is a particularly powerful approach, allowing you to combine the best of multiple natural language generators. In chapter 10, you generated some chatbot replies using LSTMs. To improve the user experience, you could generate a lot of replies and choose the one with the best spelling, grammar, and sentiment. We’ll talk more about this in chapter 12. See the following listing.

Listing 11.4. Regular expression for US dates
>>> us = r'((([01]?\d)[-/]([0123]?\d))([-/]([0123]\d)\d\d)?)'
>>> mdy = re.findall(us, 'Santa came 12/25/2017. An elf appeared 12/12.')
>>> mdy
[('12/25/2017', '12/25', '12', '25', '/2017', '20'),
 ('12/12', '12/12', '12', '12', '', '')]

A list comprehension can be used to provide a little structure to that extracted data, by converting the month, day, and year into integers and labeling that numerical information with a meaningful name, as shown in the following listing.

Listing 11.5. Structuring extracted dates
>>> dates = [{'mdy': x[0], 'my': x[1], 'm': int(x[2]), 'd': int(x[3]),
...     'y': int(x[4].lstrip('/') or 0), 'c': int(x[5] or 0)} for x in mdy]
>>> dates
[{'mdy': '12/25/2017', 'my': '12/25', 'm': 12, 'd': 25, 'y': 2017, 'c': 20},
 {'mdy': '12/12', 'my': '12/12', 'm': 12, 'd': 12, 'y': 0, 'c': 0}]

Even for these simple dates, it’s not possible to design a regex that can resolve all the ambiguities in the second date, “12/12.” There are ambiguities in the language of dates that only humans can guess at resolving, using knowledge about things like Christmas and the intent of the writer of a text. For example, “12/12” could mean December 12th of an unspecified year (month/day) or the month of December in the year 2012 (month/year).

Because month/day come before the year in US dates and in our regex, “12/12” is presumed to be December 12th of an unknown year. You can fill in any missing numerical fields with the most recently read year using the context from the structured data in memory, as shown in the following listing.

Listing 11.6. Basic context maintenance
>>> for i, d in enumerate(dates):
...     for k, v in d.items():
...         if not v:
...             d[k] = dates[max(i - 1, 0)][k]             1
>>> dates
[{'mdy': '12/25/2017', 'my': '12/25', 'm': 12, 'd': 25, 'y': 2017, 'c': 20},
 {'mdy': '12/12', 'my': '12/12', 'm': 12, 'd': 12, 'y': 2017, 'c': 20}]
>>> from datetime import date
>>> datetimes = [date(d['y'], d['m'], d['d']) for d in dates]
>>> datetimes
[datetime.date(2017, 12, 25), datetime.date(2017, 12, 12)]

  • 1 This works because both the dict and the list are mutable data types.

This is a basic but reasonably robust way to extract date information from natural language text. The main remaining tasks to turn this into a production date extractor would be to add some exception catching and context maintenance that’s appropriate for your application. If you added that to the nlpia package (http://github.com/totalgood/nlpia) with a pull request, I’m sure your fellow readers would appreciate it. And if you added some extractors for times, well, then you’d be quite the hero.

There are opportunities for some hand-crafted logic to deal with edge cases and natural language names for months and even days. But no amount of sophistication could resolve the ambiguity in the date “12/11.” That could be

  • December 11th in whatever year you read or heard it
  • November 12th if you heard it in London or Launceston, Tasmania (a commonwealth territory)
  • December 2011 if you read it in a US newspaper
  • November 2012 if you read it in an EU newspaper

Some natural language ambiguities can’t be resolved, even by a human brain. But let’s make sure your date extractor can handle European day/month order by reversing month and day in your regex. See the following listing.

Listing 11.7. Regular expression for European dates
>>> eu = r'((([0123]?\d)[-/]([01]?\d))([-/]([0123]\d)?\d\d)?)'
>>> dmy = re.findall(eu, 'Alan Mathison Turing OBE FRS (23/6/1912-7/6/1954) '
...     'was an English computer scientist.')
>>> dmy
[('23/6/1912', '23/6', '23', '6', '/1912', '19'),
 ('7/6/1954', '7/6', '7', '6', '/1954', '19')]
>>> dmy = re.findall(eu, 'Alan Mathison Turing OBE FRS (23/6/12-7/6/54) '
...     'was an English computer scientist.')
>>> dmy
[('23/6/12', '23/6', '23', '6', '/12', ''),
 ('7/6/54', '7/6', '7', '6', '/54', '')]

That regular expression correctly extracts Turing’s birth and death dates from a Wikipedia excerpt. But I cheated: I converted the month “June” into the number 6 before testing the regular expression on that Wikipedia sentence. So this isn’t a realistic example. And you’d still have some ambiguity to resolve for the year if the century isn’t specified. Does the year 54 mean 1954 or does it mean 2054? You’d like your chatbot to be able to extract dates from unaltered Wikipedia articles so it can read up on famous people and learn important dates. For your regex to work on more natural language dates, such as those found in Wikipedia articles, you need to add words such as “June” (and all its abbreviations) to your date-extracting regular expression.

You don’t need any special symbols to indicate words (characters that go together in sequence). You can type them in the regex exactly as you’d like them to be spelled in the input, including capitalization. All you have to do is put an OR symbol (|) between them in the regular expression. And you need to make sure it can handle US month/day order as well as European order. You’ll add these two alternative date “spellings” to your regular expression with a “big” OR (|) between them as a fork in your tree of decisions in the regular expression.

Let’s use some named groups to help you recognize years such as “’84” as 1984 and “08” as 2008. And let’s try to be a little more precise about the 4-digit years you want to match, only matching years in the future up to 2399 and in the past back to year 0.[6] See the following listing.

6

See the web page titled “Year zero” (https://en.wikipedia.org/wiki/Year_zero).

Listing 11.8. Recognizing years
>>> yr_19xx = (
...     r'(?P<yr_19xx>' +
...     '|'.join('{}'.format(i) for i in range(30, 100)) +
...     r')'
...     )                                                             1
>>> yr_20xx = (
...     r'(?P<yr_20xx>' +
...     '|'.join('{:02d}'.format(i) for i in range(10)) + '|' +
...     '|'.join('{}'.format(i) for i in range(10, 30)) +
...     r')'
...     )                                                             2
>>> yr_cent = r'(?P<yr_cent>' + '|'.join(
...     '{}'.format(i) for i in range(1, 40)) + r')'                  3
>>> yr_ccxx = r'(?P<yr_ccxx>' + '|'.join(
...     '{:02d}'.format(i) for i in range(0, 100)) + r')'           4
>>> yr_xxxx = r'(?P<yr_xxxx>(' + yr_cent + ')(' + yr_ccxx + r'))'
>>> yr = (
...     r'(?P<yr>' +
...     yr_19xx + '|' + yr_20xx + '|' + yr_xxxx +
...     r')'
...     )
>>> groups = list(re.finditer(
...     yr, "0, 2000, 01, '08, 99, 1984, 2030/1970 85 47 `66"))
>>> full_years = [g['yr'] for g in groups]
>>> full_years
['2000', '01', '08', '99', '1984', '2030', '1970', '85', '47', '66']

  • 1 2-digit years 30-99 = 1930-1999
  • 2 1- or 2-digit years 01-30 = 2001-2030
  • 3 First digits of a 3- or 4-digit year such as the “1” in “123 A.D.” or “20” in “2018”
  • 4 Last 2 digits of a 3- or 4-digit year such as the “23” in “123 A.D.” or “18” in “2018”

Wow! That’s a lot of work, just to handle some simple year rules in regex rather than in Python. Don’t worry, packages are available for recognizing common date formats. They are much more precise (fewer false matches) and more general (fewer misses). So you don’t need to be able to compose complex regular expressions such as this yourself. This example just gives you a pattern in case you need to extract a particular kind of number using a regular expression in the future. Monetary values and IP addresses are examples where a more complex regular expression, with named groups, might come in handy.
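
For example, a sketch of a named-group pattern for US-style dollar amounts might look like the following. It’s deliberately simplistic (no thousands separators, no currency words), just enough to show named groups at work:

>>> money = re.compile(r'[$](?P<dollars>[0-9]{1,12})([.](?P<cents>[0-9]{2}))?')
>>> m = money.search('The Babel fish costs $42.97 plus tax.')
>>> m['dollars'], m['cents']
('42', '97')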

Let’s finish up your regular expression for extracting dates by adding patterns for month names such as “June” or “Jun,” as they appear in Turing’s birth date on Wikipedia, as shown in the following listing.

Listing 11.9. Recognizing month words with regular expressions
>>> mon_words = ('January February March April May June July '
...     'August September October November December')
>>> mon = (r'(' + '|'.join('{}|{}|{}|{}|{:02d}'.format(
...     m, m[:4], m[:3], i + 1, i + 1)
...     for i, m in enumerate(mon_words.split())) +
...     r')')
>>> re.findall(mon, 'January has 31 days, February the 2nd month '
...     'of 12, has 28, except in a Leap Year.')
['January', 'February', '12']

Can you see how you might combine these regular expressions into a larger one that can handle both EU and US date formats? One complication is that you can’t reuse the same name for a group (parenthesized part of the regular expression). So you can’t put an OR between the US and EU ordering of the named regular expressions for month and year. And you need to include patterns for some optional separators between the day, month, and year.

Here’s one way to do all that.

Listing 11.10. Combining information extraction regular expressions
>>> day = r'|'.join('{:02d}|{}'.format(i, i) for i in range(1, 32))
>>> eu = (r'(?P<eu_day>' + day + r')[-,/ ]{0,2}(?P<eu_mon>' +
...     mon + r')[-,/ ]{0,2}(' + yr.replace('<yr', '<eu_yr') + r')')
>>> us = (r'(?P<us_mon>' + mon + r')[-,/ ]{0,2}(?P<us_day>' +
...     day + r')[-,/ ]{0,2}(' + yr.replace('<yr', '<us_yr') + r')')
>>> date_pattern = r'(' + eu + '|' + us + r')'
>>> groups = list(re.finditer(date_pattern, '31 Oct, 1970 25/12/2017'))
>>> groups
[<_sre.SRE_Match object; span=(0, 12), match='31 Oct, 1970'>,
 <_sre.SRE_Match object; span=(13, 23), match='25/12/2017'>]

Finally, you need to validate these dates by seeing if they can be turned into valid Python datetime objects, as shown in the following listing.

Listing 11.11. Validating dates
>>> import datetime
>>> dates = []
>>> for g in groups:
...     month_num = (g['us_mon'] or g['eu_mon']).strip()
...     try:
...         month_num = int(month_num)
...     except ValueError:
...         month_num = [w[:len(month_num)]
...             for w in mon_words.split()].index(month_num) + 1
...     date = datetime.date(
...         int(g['us_yr'] or g['eu_yr']),
...         month_num,
...         int(g['us_day'] or g['eu_day']))
...     dates.append(date)
>>> dates
[datetime.date(1970, 10, 31), datetime.date(2017, 12, 25)]

Your date extractor appears to work OK, at least for a few simple, unambiguous dates. Think about how packages such as Python-dateutil and datefinder are able to resolve ambiguities and deal with more “natural” language dates such as “today” and “next Monday.” And if you think you can do it better than these packages, send them a pull request!

If you just want a state-of-the-art date extractor, statistical (machine learning) approaches will get you there faster. The Stanford Core NLP SUTime library (https://nlp.stanford.edu/software/sutime.html) and dateutil.parser.parse (from the python-dateutil package) are state-of-the-art.
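
If you haven’t used python-dateutil before, a typical call looks roughly like this; the dayfirst flag is how you ask for the European day/month ordering:

>>> from dateutil.parser import parse
>>> parse('25/12/2017')
datetime.datetime(2017, 12, 25, 0, 0)
>>> parse('12/11/17', dayfirst=True)
datetime.datetime(2017, 11, 12, 0, 0)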

11.4. Extracting relationships (relations)

So far you’ve looked only at extracting tricky noun instances such as dates and GPS latitude and longitude values. And you’ve worked mainly with numerical patterns. It’s time to tackle the harder problem of extracting knowledge from natural language. You’d like your bot to learn facts about the world from reading an encyclopedia of knowledge such as Wikipedia. You’d like it to be able to relate those dates and GPS coordinates to the entities it reads about.

What knowledge could your brain extract from this sentence from Wikipedia?

On March 15, 1554, Desoto wrote in his journal that the Pascagoula people ranged as far north as the confluence of the Leaf and Chickasawhay rivers at 30.4, 88.5.

Extracting the dates and the GPS coordinates might enable you to associate that date and location with Desoto, the Pascagoula people, and two rivers whose names you can’t pronounce. You’d like your bot (and your mind) to be able to connect those facts to larger facts—for example, that Desoto was a Spanish conquistador and that the Pascagoula people were a peaceful Native American tribe. And you’d like the dates and locations to be associated with the right “things”: Desoto, and the intersection of two rivers, respectively.

This is what most people think of when they hear the term natural language understanding. To understand a statement you need to be able to extract key bits of information and correlate it with related knowledge. For machines, you store that knowledge in a graph, also called a knowledge base. The edges of your knowledge graph are the relationships between things. And the nodes of your knowledge graph are the nouns or objects found in your corpus.

The pattern you’re going to use to extract these relationships (or relations) is a pattern such as SUBJECT - VERB - OBJECT. To recognize these patterns, you’ll need your NLP pipeline to know the parts of speech for each word in a sentence.

11.4.1. Part-of-speech (POS) tagging

POS tagging can be accomplished with language models that contain dictionaries of words with all their possible parts of speech. They can then be trained on properly tagged sentences to recognize the parts of speech in new sentences with other words from that dictionary. NLTK and spaCy both implement POS tagging functions. You’ll use spaCy here because it’s faster and more accurate. See the following listing.

Listing 11.12. POS tagging with spaCy
>>> import spacy
>>> en_model = spacy.load('en_core_web_md')
>>> sentence = ("In 1541 Desoto wrote in his journal that the Pascagoula peop
     le " +
...     "ranged as far north as the confluence of the Leaf and Chickasawhay r
     ivers at 30.4, -88.5.")
>>> parsed_sent = en_model(sentence)
>>> parsed_sent.ents
(1541, Desoto, Pascagoula, Leaf, Chickasawhay, 30.4)                  1
 
>>> ' '.join(['{}_{}'.format(tok, tok.tag_) for tok in parsed_sent])
'In_IN 1541_CD Desoto_NNP wrote_VBD in_IN his_PRP$ journal_NN that_IN the_DT Pascagoula_NNP people_NNS
 ranged_VBD as_RB far_RB north_RB as_IN the_DT confluence_NN of_IN the_DT Leaf_NNP and_CC Chickasawhay_NNP
 rivers_VBZ at_IN 30.4_CD ,_, -88.5_NFP ._.'

So to build your knowledge graph, you need to figure out which objects (noun phrases) should be paired up. You’d like to pair up the date “March 15, 1554” with the named entity Desoto. You could then resolve those two strings (noun phrases) to point to objects you have in your knowledge base. March 15, 1554 can be converted to a datetime.date object with a normalized representation.

spaCy-parsed sentences also contain the dependency tree in a nested dictionary. And spacy.displacy can generate a scalable vector graphics (SVG) string (or a complete HTML page), which can be viewed as an image in a browser. This visualization can help you find ways to use the tree to create tag patterns for relation extraction. See the following listing.

Listing 11.13. Visualize a dependency tree
>>> from spacy.displacy import render
>>> sentence = "In 1541 Desoto wrote in his journal about the Pascagoula."
>>> parsed_sent = en_model(sentence)
>>> with open('pascagoula.html', 'w') as f:
...     f.write(render(docs=parsed_sent, page=True,
...         options=dict(compact=True)))

The dependency tree for this short sentence shows that the noun phrase “the Pascagoula” is the object of the relationship “met” for the subject “Desoto” (see figure 11.2). And both nouns are tagged as proper nouns.

Figure 11.2. The Pascagoula people

To create POS and word property patterns for a spacy.matcher.Matcher, listing all the token tags in a table is helpful. The following listing shows some helper functions that make that easier.

Listing 11.14. Helper functions for spaCy tagged strings
>>> import pandas as pd
>>> from collections import OrderedDict
>>> def token_dict(token):
...     return OrderedDict(ORTH=token.orth_, LEMMA=token.lemma_,
...         POS=token.pos_, TAG=token.tag_, DEP=token.dep_)
 
>>> def doc_dataframe(doc):
...     return pd.DataFrame([token_dict(tok) for tok in doc])
 
>>> doc_dataframe(en_model("In 1541 Desoto met the Pascagoula."))
         ORTH       LEMMA    POS  TAG    DEP
0          In          in    ADP   IN   prep
1        1541        1541    NUM   CD   pobj
2      Desoto      desoto  PROPN  NNP  nsubj
3         met        meet   VERB  VBD   ROOT
4         the         the    DET   DT    det
5  Pascagoula  pascagoula  PROPN  NNP   dobj
6           .           .  PUNCT    .  punct

Now you can see the sequence of POS or TAG features that will make a good pattern. If you’re looking for “has-met” relationships between people and organizations, you’d probably like to allow patterns such as “PROPN met PROPN,” “PROPN met the PROPN,” “PROPN met with the PROPN,” and “PROPN often meets with PROPN.” You could specify each of those patterns individually, or try to capture them all with some * or ? operators on “any word” patterns between your proper nouns:

'PROPN ANYWORD? met ANYWORD? ANYWORD? PROPN'

Patterns in spaCy are much more powerful and flexible than the preceding pseudocode, so you have to be more verbose to explain exactly the word features you’d like to match. In a spaCy pattern specification, you use a dictionary to capture all the tags that you want to match for each token or word, as shown in the following listing.

Listing 11.15. Example spaCy POS pattern
>>> pattern = [{'TAG': 'NNP', 'OP': '+'}, {'IS_ALPHA': True, 'OP': '*'},
...            {'LEMMA': 'meet'},
...            {'IS_ALPHA': True, 'OP': '*'}, {'TAG': 'NNP', 'OP': '+'}]

You can then extract the tagged tokens you need from your parsed sentence, as shown in the following listing.

Listing 11.16. Creating a POS pattern matcher with spaCy
>>> from spacy.matcher import Matcher
>>> doc = en_model("In 1541 Desoto met the Pascagoula.")
>>> matcher = Matcher(en_model.vocab)
>>> matcher.add('met', None, pattern)
>>> m = matcher(doc)
>>> m
[(12280034159272152371, 2, 6)]
 
>>> doc[m[0][1]:m[0][2]]
Desoto met the Pascagoula

So you extracted a match from the original sentence from which you created the pattern, but what about similar sentences from Wikipedia? See the following listing.

Listing 11.17. Using a POS pattern matcher
>>> doc = en_model("October 24: Lewis and Clark met their first Mandan Chief,
      Big White.")
>>> m = matcher(doc)[0]
>>> m
(12280034159272152371, 3, 11)
 
>>> doc[m[1]:m[2]]
Lewis and Clark met their first Mandan Chief
 
>>> doc = en_model("On 11 October 1986, Gorbachev and Reagan met at a house")
>>> matcher(doc)
[]                 1

  • 1 The pattern doesn’t match any substrings of the sentence from Wikipedia.

You need to add a second pattern to allow for the verb to occur after the subject and object nouns, as shown in the following listing.

Listing 11.18. Combining multiple patterns for a more robust pattern matcher
>>> doc = en_model("On 11 October 1986, Gorbachev and Reagan met at a house")
>>> pattern = [{'TAG': 'NNP', 'OP': '+'}, {'LEMMA': 'and'},
...            {'TAG': 'NNP', 'OP': '+'},
...            {'IS_ALPHA': True, 'OP': '*'}, {'LEMMA': 'meet'}]
>>> matcher.add('met', None, pattern)                            1
>>> m = matcher(doc)
>>> m
[(14332210279624491740, 5, 9),
 (14332210279624491740, 5, 11),
 (14332210279624491740, 7, 11),
 (14332210279624491740, 5, 12)]                                  2
 
>>> doc[m[-1][1]:m[-1][2]]                                       3
Gorbachev and Reagan met at a house

  • 1 Adds an additional pattern without removing the previous pattern. Here 'met' is an arbitrary key. Name your pattern whatever you like.
  • 2 The '+' operators increase the number of overlapping alternative matches.
  • 3 The longest match is the last one in the list of matches.

So now you have your entities and a relationship. You can even build a pattern that is less restrictive about the verb in the middle (“met”) and more restrictive about the names of the people and groups on either side. Doing so might allow you to identify additional verbs that imply that one person or group has met another, such as the verb “knows,” or even passive phrases, such as “had a conversation” or “became acquainted with.” Then you could use these new verbs to add relationships for new proper nouns on either side.

But you can see how you’re drifting away from the original meaning of your seed relationship patterns. This is called semantic drift. Fortunately, spaCy tags words in a parsed document not only with their POS and dependency tree information, but also with their Word2vec word vectors. You can use these vectors to prevent the connector verb and the proper nouns on either side from drifting too far away from the original meaning of your seed pattern.[7]

7

This is the subject of active research: https://nlp.stanford.edu/pubs/structuredVS.pdf.
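
Here’s a rough sketch of how you might use those word vectors to keep a candidate verb close to the meaning of your seed verb “met.” The similar_to_seed() helper and the 0.4 threshold are arbitrary choices for illustration, and the similarity scores you get will depend on the spaCy model you loaded:

>>> seed_verb = en_model('met')[0]
>>> def similar_to_seed(word, threshold=0.4):
...     """True if the word's vector is close enough to the seed verb 'met'."""
...     return seed_verb.similarity(en_model(word)[0]) > threshold
>>> verbs = ['greeted', 'knows', 'photosynthesizes']
>>> close_verbs = [v for v in verbs if similar_to_seed(v)]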

11.4.2. Entity name normalization

The normalized representation of an entity is usually a string, even for numerical information such as dates. The normalized ISO format for this date would be “1541-01-01.” A normalized representation for entities enables your knowledge base to connect all the different things that happened in the world on that same date to that same node (entity) in your graph.
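
In Python, producing that normalized string is a one-liner once you have a date object, because isoformat() emits the ISO 8601 form (which also sorts correctly as a plain string, a handy property for knowledge base keys):

>>> from datetime import date
>>> date(1541, 1, 1).isoformat()
'1541-01-01'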

You’d do the same for other named entities. You’d correct the spelling of words and attempt to resolve ambiguities for names of objects, animals, people, places, and so on. Normalizing named entities and resolving ambiguities is often called coreference resolution or anaphora resolution, especially for pronouns or other “names” relying on context. This is similar to lemmatization, which we discussed in chapter 2. Normalization of named entities ensures that spelling and naming variations don’t pollute your vocabulary of entity names with confounding, redundant names.

For example, “Desoto” might be expressed in a particular document in several different ways, such as “Desoto,” “de Soto,” or “Hernando de Soto.”

Your normalization algorithm can choose any one of these forms. A knowledge graph should normalize each kind of entity the same way, to prevent multiple distinct entities of the same type from sharing the same name. You don’t want multiple person names referring to the same physical person. Even more importantly, the normalization should be applied consistently, both when you write new facts to the knowledge base and when you read or query the knowledge base.

If you decide to change the normalization approach after the database has been populated, the data for existing entities in the knowledge should be “migrated,” or altered, to adhere to the new normalization scheme. Schemaless databases (key-value stores), like the ones used to store knowledge graphs or knowledge bases, aren’t free from the migration responsibilities of relational databases. After all, schemaless databases are interface wrappers for relational databases under the hood.

Your normalized entities also need “is-a” relationships to connect them to entity categories that define types or categories of entities. These “is-a” relationships can be thought of as tags, because each entity can have multiple “is-a” relationships. Like names of people or POS tags, dates and other discrete numerical objects need to be normalized if you want to incorporate them into your knowledge base.

What about relations between entities—do they need to be stored in some normalized way?

11.4.3. Relation normalization and extraction

Now you need a way to normalize the relationships, to identify the kind of relationship between entities. Doing so will allow you to find all birthday relationships between dates and people, or dates of occurrences of historical events, such as the encounter between “Hernando de Soto” and the “Pascagoula people.” And you need to write an algorithm to choose the right label for your relationship.

And these relationships can have a hierarchical name, such as “occurred-on/approximately” and “occurred-on/exactly,” to allow you to find specific relationships or categories of relationships. You can also label these relationships with a numerical property for the “confidence,” probability, weight, or normalized frequency (analogous to TF-IDF for terms/words) of that relationship. You can adjust these confidence values each time a fact extracted from a new text corroborates or contradicts an existing fact in the database.
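
A minimal sketch of that bookkeeping is a dictionary keyed by the triplet, with a confidence score you nudge up or down as new evidence arrives. The update_fact() helper and the 0.1 step size are invented for illustration:

>>> from collections import defaultdict
>>> confidence = defaultdict(float)
>>> def update_fact(triplet, corroborated=True, step=0.1):
...     """Nudge a fact's confidence up or down, clipped to the range [0, 1]."""
...     confidence[triplet] += step if corroborated else -step
...     confidence[triplet] = min(1.0, max(0.0, confidence[triplet]))
>>> fact = ('Hernando de Soto', 'met', 'Pascagoula')
>>> update_fact(fact)
>>> update_fact(fact)
>>> confidence[fact]
0.2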

Now you need a way to match patterns that can find these relationships.

11.4.4. Word patterns

Word patterns are just like regular expressions, but for words instead of characters. Instead of character classes, you have word classes. For example, instead of matching a lowercase character you might have a word pattern decision to match all the singular nouns (“NN” POS tag).[8] This is usually accomplished with machine learning. Some seed sentences are tagged with some correct relationships (facts) extracted from those sentences. A POS pattern can be used to find similar sentences where the subject and object words, or even the relationships, might change.

8

spaCy uses the “OntoNotes 5” POS tags: https://spacy.io/api/annotation#pos-tagging.

You can use the spaCy package in two different ways to match these patterns in O(1) (constant time), no matter how many patterns you want to match:

  • PhraseMatcher, for matching exact sequences of tokens (phrases)
  • Matcher, for matching patterns of token attributes, such as POS tags

To ensure that the new relations found in new sentences are truly analogous to the original seed (example) relationships, you often need to constrain the subject, relation, and object word meanings to be similar to those in the seed sentences. The best way to do this is with some vector representation of the meaning of words. Does this ring a bell? Word vectors, discussed in chapter 4, are one of the most widely used word meaning representations for this purpose. They help minimize semantic drift.

Using semantic vector representations for words and phrases has made automatic information extraction accurate enough to build large knowledge bases automatically. But human supervision and curation is required to resolve much of the ambiguity in natural language text. CMU’s NELL (Never-Ending Language Learner)[11] enables users to vote on changes to the knowledge base using Twitter and a web application.

11

See the web page titled “NELL: The Computer that Learns - Carnegie Mellon University” (https://www.cmu.edu/homepage/computing/2010/fall/nell-computer-that-learns.shtml).

11.4.5. Segmentation

We’ve skipped one form of information extraction that is also a tool used by the rest of the information extraction pipeline. Most of the documents you’ve used in this chapter have been bite-sized chunks containing just a few facts and named entities. But in the real world you may need to create these chunks yourself.

Document “chunking” is useful for creating semi-structured data about documents that can make it easier to search, filter, and sort documents for information retrieval. And for information extraction, if you’re extracting relations to build a knowledge base such as NELL or Freebase, you need to break your documents into parts that are likely to contain a fact or two. When you divide natural language text into meaningful pieces, it’s called segmentation. The resulting segments can be phrases, sentences, quotes, paragraphs, or even entire sections of a long document.

Sentences are the most common chunk for most information extraction problems. Sentences are usually punctuated with one of a few symbols (., ?, !, or a new line). And grammatically correct English language sentences must contain a subject (noun) and a verb, which means they’ll usually have at least one relation or fact worth extracting. And sentences are often self-contained packets of meaning that don’t rely too much on preceding text to convey most of their information.

Fortunately most languages, including English, have the concept of a sentence, a single statement with a subject and verb that says something about the world. Sentences are just the right bite-sized chunk of text for your NLP knowledge extraction pipeline. For the chatbot pipeline, your goal is to segment documents into sentences, or statements.

In addition to facilitating information extraction, you can flag some of those statements and sentences as being part of a dialog or being suitable for replies in a dialog. Using a sentence segmenter allows you to train your chatbot on longer texts, such as books. Choosing those books appropriately gives your chatbot a more literary, intelligent style than if you trained it purely on Twitter streams or IRC chats. And these books give your chatbot access to a much broader set of training documents to build its common sense knowledge about the world.

Sentence segmentation

Sentence segmentation is usually the first step in an information extraction pipeline. It helps isolate facts from each other so that you can associate the right price with the right thing in a string such as “The Babel fish costs $42. 42 cents for the stamp.” And that string is a good example of why sentence segmentation is tough—the dot in the middle could be interpreted as a decimal or a “full stop” period.

The simplest pieces of “information” you can extract from a document are sequences of words that contain a logically cohesive statement. The most important segments in a natural language document, after words, are sentences. Sentences contain a logically cohesive statement about the world. These statements contain the information you want to extract from text. Sentences often tell you the relationship between things and how the world works when they make statements of fact, so you can use sentences for knowledge extraction. And sentences often explain when, where, and how things happened in the past, tend to happen in general, or will happen in the future. So we should also be able to extract facts about dates, times, locations, people, and even sequences of events or tasks using sentences as our guide. And, most importantly, all natural languages have sentences or logically cohesive sections of text of some sort. And all languages have a widely shared process for generating them (a set of grammar rules or habits).

But segmenting text, identifying sentence boundaries, is a bit trickier than you might think. In English, for example, no single punctuation mark or sequence of characters always marks the end of a sentence.

11.4.6. Why won’t split('.!?') work?

Even a human reader might have trouble finding an appropriate sentence boundary within each of the following quotes. And if they did find multiple sentences from each, they would be wrong for four out of five of these difficult examples:

I live in the U.S. but I commute to work in Mexico on S.V. Australis for a woman from St. Bernard St. on the Gulf of Mexico.

I went to G.T.You?

She yelled “It’s right here!” but I kept looking for a sentence boundary anyway.

I stared dumbfounded on as things like “How did I get here?,” “Where am I?,” “Am I alive?” flittered across the screen.

The author wrote “'I don’t think it’s conscious.' Turing said.”

More sentence segmentation “edge cases” such as these are available at tm-town.com[12] and within the nlpia.data module.

12

See the web page titled “Natural Language Processing: TM-Town” (https://www.tm-town.com/natural-language-processing#golden_rules).

Technical text is particularly difficult to segment into sentences, because engineers, scientists, and mathematicians tend to use periods and exclamation points to signify a lot of things besides the end of a sentence. When we tried to find the sentence boundaries in this book, we had to manually correct several of the extracted sentences.

If only we wrote English like telegrams, with a “STOP” or unique punctuation mark at the end of each sentence. Because we don’t, you’ll need some more sophisticated NLP than just split('.!?'). Hopefully you’re already imagining a solution in your head. If so, it’s probably based on one of the two approaches to NLP you’ve used throughout this book:

  • Manually programmed algorithms (regular expressions and pattern-matching)
  • Statistical models (data-based models or machine learning)

We use the sentence segmentation problem to revisit these two approaches by showing you how to use regular expressions as well as perceptrons to find sentence boundaries. And you’ll use the text of this book as a training and test set to show you some of the challenges. Fortunately we haven’t inserted any newlines within sentences to manually wrap text, like in newspaper column layouts. Otherwise, the problem would be even more difficult. In fact, much of the source text for this book, in ASCIIdoc format, has been written with “old-school” sentence separators (two spaces after the end of every sentence), or with each sentence on a separate line. This was so we could use this book as a training and test set for segmenters.

11.4.7. Sentence segmentation with regular expressions

Regular expressions are just a shorthand way of expressing the tree of “if...then” rules (regular grammar rules) for finding character patterns in strings of characters. As we mentioned in chapters 1 and 2, regular expressions (regular grammars) are a particularly succinct way to specify the rules of a finite state machine. Our regex or FSM has only one purpose: identify sentence boundaries.

If you do a web search for sentence segmenters,[13] you’re likely to be pointed to various regular expressions intended to capture the most common sentence boundaries. Here are some of them, combined and enhanced to give you a fast, general-purpose sentence segmenter. The following regex would work with a few “normal” sentences:

13

See the web page titled “Python sentence segment at DuckDuckGo” (https://duckduckgo.com/?q=Python+sentence+segment&t=canonical&ia=qa).

>>> re.split(r'[!.?]+[ $]', "Hello World.... Are you there?!?! "
...     "I'm going to Mars!")
['Hello World', 'Are you there', "I'm going to Mars!"]

Unfortunately, this re.split approach gobbles up the sentence-terminating token, and only retains it if it’s the last character in a document or string. But it does do a good job of ignoring the trickery of periods within doubly nested quotes:

>>> re.split(r'[!.?] ', "The author wrote "'I don't think it's conscious.'
     Turing said."")
['The author wrote "'I don't think it's conscious.' Turing said."']

It also ignores periods in quotes that terminate an actual sentence. This can be a good thing or a bad thing, depending on your information extraction steps that follow your sentence segmenter:

>>> re.split(r'[!.?] ', "The author wrote "'I don't think it's conscious.'
 Turing said." But I stopped reading.")
['The author wrote "'I don't think it's conscious.' Turing said." But I
 stopped reading."']

What about abbreviated text, such as SMS messages and tweets? Sometimes hurried humans squish sentences together, leaving no space surrounding periods. Alone, the following regex could only deal with periods in SMS messages that have letters on either side, and it would safely skip over numerical values:

>>> re.split(r'(?<!\d)\.|\.(?!\d)', "I went to GT.You?")
['I went to GT', 'You?']

Even combining these two regexes isn’t enough to get more than a few right in the difficult test cases from nlpia.data:

>>> from nlpia.data.loaders import get_data
>>> regex = re.compile(r'((?<!\d)\.|\.(?!\d))|([!.?]+)[ $]+')
>>> examples = get_data('sentences-tm-town')
>>> wrong = []
>>> for i, (challenge, text, sents) in enumerate(examples):
...     if tuple(regex.split(text)) != tuple(sents):
...         print('wrong {}: {}{}'.format(
...             i, text[:50], '...' if len(text) > 50 else ''))
...         wrong += [i]
>>> len(wrong), len(examples)
(61, 61)

You’d have to add a lot more “look-ahead” and “look-back” to improve the accuracy of a regex sentence segmenter. A better approach for sentence segmentation is to use a machine learning algorithm (often a single-layer neural net or logistic regression) trained on a labeled set of sentences. Several packages contain such a model that you can use to improve your sentence segmenter, including spaCy and DetectorMorse.

We use the spaCy sentence segmenter (built into its parser) for most of our mission-critical applications. spaCy has few dependencies and compares well with the others on accuracy and speed. DetectorMorse, by Kyle Gorman, is another good choice if you want state-of-the-art performance in a pure Python implementation that you can refine with your own training set.
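
Using the spaCy segmenter is just a matter of parsing the text and iterating over doc.sents. Here’s a quick sketch reusing the en_model you loaded earlier; exactly where it splits the trickier examples from this section depends on the model version, so no output is shown:

>>> doc = en_model("I live in the U.S. but I commute to work in Mexico. "
...     "Sentence segmenters earn their keep on text like this.")
>>> sentences = [sent.text for sent in doc.sents]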

11.5. In the real world

Information extraction and question answering systems are used for

  • TA assistants for university courses
  • Customer service
  • Tech support
  • Sales
  • Software documentation and FAQs

Information extraction can be used to extract things such as

  • Dates
  • Times
  • Prices
  • Quantities
  • Addresses
  • Names

    • People
    • Places
    • Apps
    • Companies
    • Bots
  • Relationships

    • “is-a” (kinds of things)
    • “has” (attributes of things)
    • “related-to”

Whether information is being parsed from a large corpus or from user input on the fly, being able to extract specific details and store them for later use is critical to the performance of a chatbot. First you identify and isolate those pieces of information, then you tag the relationships between them, and finally you “normalize” the result programmatically. With that knowledge safely shelved in a searchable structure, your chatbot will be equipped with the tools to hold its own in a conversation within a given domain.

Summary

  • A knowledge graph can be built to store relationships between entities.
  • Regular expressions are a mini-programming language that can isolate and extract information.
  • Part-of-speech tagging allows you to extract relationships between entities mentioned in a sentence.
  • Segmenting sentences requires more than just splitting on periods and exclamation marks.