Regular expressions

Regular expressions, or regexes, are a powerful form of rule-based matching. Invented back in the 1950s, they were for a very long time the most useful way to find things in text, and proponents argue that they still are.

No chapter on NLP would be complete without mentioning regexes. With that being said, this section is by no means a complete regex tutorial. It's intended to introduce the general idea and show how regexes can be used in Python, pandas, and spaCy.

A very simple regex pattern could be "a\.", which finds only instances of the lower-case letter a followed by a dot. The backslash matters: an unescaped "." matches any single character, so a literal dot has to be written as "\.". Regexes also let you specify sets and ranges of characters in square brackets; for example, "[a-z]\." finds any lower-case letter followed by a dot, and "[xy]\." finds only the letters "x" or "y" followed by a dot.
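As a minimal sketch of these character classes, the snippet below runs a few patterns against a made-up sample string (the string and variable names are illustrative assumptions, not part of the original text). In Python, a literal dot is written as \. because an unescaped dot matches any character.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns.
text = "a. b. x. y. A. ab"

# \. matches a literal period.
literal_a = re.findall(r"a\.", text)
# [a-z] matches any single lower-case letter.
any_lower = re.findall(r"[a-z]\.", text)
# [xy] matches either "x" or "y".
x_or_y = re.findall(r"[xy]\.", text)
# An unescaped dot matches any character, so "ab" is found as well.
a_then_anything = re.findall(r"a.", text)

print(literal_a)        # ['a.']
print(any_lower)        # ['a.', 'b.', 'x.', 'y.']
print(x_or_y)           # ['x.', 'y.']
print(a_then_anything)  # ['a.', 'ab']
```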

Regex patterns are case sensitive by default, so "[A-Z]" captures only upper-case letters. Character sets are also useful when searching for expressions whose spelling varies; for example, the pattern "seriali[sz]e" catches the British as well as the American English version of the word.
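The spelling example can be tried out directly. This is a small sketch with made-up sentences; it also shows that the capitalised form is missed unless case is handled explicitly.

```python
import re

# Hypothetical sentence containing both spellings.
text = "We serialise data in London and serialize it in New York."
spellings = re.findall(r"seriali[sz]e", text)
print(spellings)  # ['serialise', 'serialize']

# Matching is case sensitive by default, so "Serialize" is not found.
capitalised = re.findall(r"seriali[sz]e", "Serialize the object first.")
print(capitalised)  # []
```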

The same goes for digits: "[0-9]" captures any single digit from 0 to 9. To find repetitions, you can use "*", which captures zero or more occurrences, or "+", which captures one or more occurrences. For example, "[0-9]+" captures any run of digits, which might be useful when looking for years. Similarly, "[A-Z][a-z]+ [0-9]+" finds all words starting with a capital letter followed by a space and one or more digits, such as "March 2018" but also "Jaws 2".
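A short sketch of the repetition operators, using an invented sentence (the sample strings are assumptions made for illustration):

```python
import re

# Hypothetical sentence containing a date-like phrase and a film title.
text = "In March 2018 we watched Jaws 2 twice."

# + means "one or more": every run of digits is returned.
digit_runs = re.findall(r"[0-9]+", text)
# A capitalised word, a space, then one or more digits.
word_then_number = re.findall(r"[A-Z][a-z]+ [0-9]+", text)
# * means "zero or more": "a" followed by any number of "b"s.
a_then_bs = re.findall(r"ab*", "a ab abb")

print(digit_runs)        # ['2018', '2']
print(word_then_number)  # ['March 2018', 'Jaws 2']
print(a_then_bs)         # ['a', 'ab', 'abb']
```

Note that "In" is skipped because it is followed by another word rather than by digits.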

Curly brackets can be used to define the number of repetitions. For instance, "[0-9]{4}" matches exactly four consecutive digits. As you can see, a regex does not make any attempt to understand what is in the text, but rather offers a clever method of finding text that matches patterns.
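One subtlety is worth seeing in code: on its own, "[0-9]{4}" also matches the first four digits of a longer number. The sketch below uses a made-up string and adds the word-boundary token \b, which the text above does not cover, as one possible way to restrict matches to standalone four-digit numbers.

```python
import re

# Hypothetical text with a year and a longer order number.
text = "Founded in 1999, order 1234567"

# Exactly four digits, wherever they occur (including inside longer numbers).
four_digits = re.findall(r"[0-9]{4}", text)
# \b (word boundary) is an addition here: it rejects digits on either side.
standalone = re.findall(r"\b[0-9]{4}\b", text)

print(four_digits)  # ['1999', '1234']
print(standalone)   # ['1999']
```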

A practical use case in the financial industry is finding the VAT number of companies in invoices. These follow a pretty strict pattern in most countries that can easily be encoded. VAT numbers in the Netherlands, for example, follow this regex pattern: "NL[0-9]{9}B[0-9]{2}".

Using Python's regex module

Python has a built-in module for regexes called re. Because it is part of the standard library, it does not need to be installed; we can simply import it:

import re

Imagine we are working on an automatic invoice processor, and we want to find the VAT number of the company that sent us the invoice. For simplicity's sake, we're going to only deal with Dutch VAT numbers (the Dutch for "VAT" is "BTW"). As mentioned before, we know the pattern for a Dutch VAT number is as follows:

pattern = 'NL[0-9]{9}B[0-9]{2}'

A string containing a BTW number might look like this:

my_string = 'ING Bank N.V. BTW:NL003028112B01'

So, to find all the occurrences of a BTW number in the string, we can call re.findall, which returns a list of all non-overlapping substrings that match the pattern:

re.findall(pattern, my_string)
['NL003028112B01']

re also allows the passing of flags to make the development of regex patterns a bit easier. For example, to ignore the case of letters when matching a regular expression, we can add a re.IGNORECASE flag, like we've done here:

re.findall(pattern, my_string, flags=re.IGNORECASE)
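To see what the flag changes, here is a small sketch that applies the same pattern to a lower-cased copy of the example string (the lower-cased string is an assumption introduced for illustration):

```python
import re

pattern = r"NL[0-9]{9}B[0-9]{2}"
# Hypothetical input in which the VAT number was typed in lower case.
lower_string = "ing bank n.v. btw:nl003028112b01"

# Without the flag, "nl" and "b" do not match "NL" and "B".
strict = re.findall(pattern, lower_string)
# With re.IGNORECASE, letter case is ignored when matching.
relaxed = re.findall(pattern, lower_string, flags=re.IGNORECASE)

print(strict)   # []
print(relaxed)  # ['nl003028112b01']
```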

Often, we are interested in a bit more information about our matches. To this end, there is a match object. re.search returns a match object for the first match found, or None if nothing matches:

match = re.search(pattern, my_string)

We can get more information out of this object, such as the location of our match, simply by running:

match.span()
(18, 32)

The span gives the start and end of our match: it begins at index 18 and ends just before index 32, so my_string[18:32] is the matched text.
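A minimal sketch of working with the match object follows. It reuses the pattern and string from above; the second, non-matching string is made up to show that re.search returns None when nothing is found, which is worth checking before calling methods on the result.

```python
import re

pattern = r"NL[0-9]{9}B[0-9]{2}"
my_string = "ING Bank N.V. BTW:NL003028112B01"

match = re.search(pattern, my_string)
if match is not None:
    start, end = match.span()         # end index is exclusive
    vat_number = match.group()        # the matched text itself
    sliced = my_string[start:end]     # same text, via slicing
    print(start, end, vat_number)     # 18 32 NL003028112B01

# Hypothetical string without a VAT number: re.search returns None.
no_match = re.search(pattern, "no VAT number here")
print(no_match)  # None
```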

Regex in pandas

The data for NLP problems often comes in pandas DataFrames. Luckily for us, the pandas string methods support regexes. If, for example, we want to find out whether any of the articles in our news dataset contain a Dutch BTW number, then we can run the following code:

df[df.content.str.contains(pattern)]

This would yield all the articles that include a Dutch BTW number, but unsurprisingly no articles in our dataset do.
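Since the news dataset itself is not reproduced here, the following sketch builds a tiny stand-in DataFrame (the rows are invented for illustration) to show the same filter in action. Passing na=False and using str.extract to pull out the number are additions beyond the text above: the former treats missing articles as non-matches, the latter returns the matched text rather than a boolean.

```python
import pandas as pd

pattern = r"NL[0-9]{9}B[0-9]{2}"

# Hypothetical stand-in for the news dataset, including one missing article.
df = pd.DataFrame({
    "content": [
        "Invoice from ING Bank N.V. BTW:NL003028112B01",
        "Markets rallied on Tuesday.",
        None,
    ]
})

# Boolean mask: True where the article contains a match; missing values count as False.
mask = df["content"].str.contains(pattern, regex=True, na=False)
matching_articles = df[mask]

# str.extract needs a capture group; wrap the pattern in parentheses.
vat_numbers = df["content"].str.extract(f"({pattern})", expand=False)

print(mask.tolist())        # [True, False, False]
print(vat_numbers.iloc[0])  # NL003028112B01
```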

When to use regexes and when not to

A regex is a powerful tool, and this very short introduction does not do it justice. In fact, there are several books longer than this one written purely on the topic of regexes. However, for the purpose of this book, we're only going to briefly introduce you to the topic.

A regex, as a tool, works well on simple and clear-to-define patterns. VAT/BTW numbers are a perfect example, as are email addresses and phone numbers, both of which are very popular use cases for regexes. However, a regex fails when the pattern is hard to define or if it can only be inferred from context. It is very hard to build a purely regex-based named entity recognizer that can tell whether a word refers to the name of a person, because names follow no clear distinguishing pattern.
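To make the limitation concrete, here is a naive attempt at spotting person names with a regex. The sentence and the "two capitalised words" rule are assumptions chosen for illustration; the rule finds the name, but it cannot tell it apart from a city or a company.

```python
import re

# Hypothetical sentence mixing a person, a place, and an organisation.
text = "John Smith flew from New York to meet Deutsche Bank."

# Naive rule: two capitalised words in a row might be a person's name.
candidates = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text)
print(candidates)  # ['John Smith', 'New York', 'Deutsche Bank']
```

Only one of the three matches is a person, and nothing in the surface pattern separates it from the other two.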

So, the next time you are looking to find something that is easy to spot for a human but hard to describe in rules, use a machine learning-based solution. Likewise, the next time you are looking for something clearly encoded, such as a VAT number, use regexes.
