B.3. Regular expressions

Regular expressions are little computer programs with their own programming language. Each regular expression string like r'[a-z]+' can be compiled into a small program designed to be run on other strings to find matches. We provide a quick reference and some examples here, but you’ll probably want to dig deeper in some online tutorials, if you’re serious about NLP. As usual, the best way to learn is to play around at the command line. The nlpia package has a lot of natural language text documents and some useful regular expression examples for you to play with.
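
As a minimal sketch of the "compile, then run on other strings" idea, the snippet below compiles the r'[a-z]+' pattern from the paragraph above with the standard re module. The sample strings and variable names are made up for illustration.

```python
import re

# Compile the pattern once into a reusable pattern object.
lowercase_word = re.compile(r'[a-z]+')

# Run the compiled pattern on a few strings.
first = lowercase_word.search('NLP is fun')      # first run of lowercase letters
all_runs = lowercase_word.findall('NLP is fun')  # every non-overlapping run
no_match = lowercase_word.search('NLP 101')      # no lowercase letters at all

print(first.group())  # is
print(all_runs)       # ['is', 'fun']
print(no_match)       # None
```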

A regular expression defines a sequence of conditional expressions (if in Python) that each work on a single character. The sequence of conditionals forms a tree that eventually concludes in an answer to the question “is the input string a match or not.” Because each regular expression has a finite number of conditional branches, it defines a finite state machine (FSM).[2]

2

This is only true for strict regular expression syntaxes that don’t look-ahead and look-behind.
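
To make the finite state machine idea concrete, here is a rough hand-written equivalent of r'[a-z]+' applied to a whole string. It is only a sketch: the function name and state labels are hypothetical, and the re module does not build its matcher this way internally.

```python
import re

def matches_lowercase_word(text):
    """Hand-written state machine for [a-z]+ matched against the whole string."""
    state = 'start'                # nothing accepted yet
    for char in text:
        if 'a' <= char <= 'z':
            state = 'in_word'      # accepting state: one or more letters seen
        else:
            return False           # no transition exists for this character
    return state == 'in_word'

# The hand-written machine agrees with re.fullmatch on these sample inputs.
for candidate in ['hello', '', 'Hello', 'nlp4u']:
    expected = bool(re.fullmatch(r'[a-z]+', candidate))
    assert matches_lowercase_word(candidate) == expected
```

Each if branch plays the role of one conditional in the regular expression, and the two state labels are the only memory the machine needs.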

The re package is the default regex compiler/interpreter in Python. The third-party regex package is a more powerful alternative that you can install with pip install regex. It has better support for Unicode characters and fuzzy matching (pretty awesome for NLP). You don’t need those extra features for the examples here, so you can use either one. You only need to learn a few regular expression symbols to solve the problems in this book:

  • |—The OR symbol.
  • ()—Grouping with parentheses, just like in Python expressions.
  • []—Character classes.
  • \s, \b, \d, \w—Shortcuts to common character classes.
  • *, ?, +—Some common shortcuts to character class occurrence count limits.
  • {7,10}—When *, ?, and + aren’t enough, you can specify exact count ranges with curly braces.
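
As a quick illustration of the occurrence-count symbols in the list above, here is a minimal sketch using the standard re module. The sample strings are made up for this example.

```python
import re

text = 'color colour colouur'

# ? allows zero or one of the preceding item.
zero_or_one = re.findall(r'colou?r', text)
# * allows zero or more of the preceding item.
zero_or_more = re.findall(r'colou*r', text)
# + requires one or more of the preceding item.
one_or_more = re.findall(r'colou+r', text)
# {7,10} requires between 7 and 10 repetitions, here of a digit.
seven_to_ten = re.findall(r'[0-9]{7,10}', 'call 5551234 or 555')

print(zero_or_one)   # ['color', 'colour']
print(zero_or_more)  # ['color', 'colour', 'colouur']
print(one_or_more)   # ['colour', 'colouur']
print(seven_to_ten)  # ['5551234']
```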

B.3.1. |—OR

The | symbol is used to separate strings that can alternatively match the input string to produce an overall match for the regular expression. So the regular expression 'Hobson|Cole|Hannes' would match any of the given names (first names) of this book’s authors. Patterns are processed left to right, and “short circuit” when a match is made, like most other programming languages. So the order of the patterns between the OR symbols (|) doesn’t affect the match, in this case, since all the patterns (author names) have unique character sequences in the first two characters. The following listing shows a shuffling of the authors’ names so you can see for yourself.

Listing B.1. Regex OR symbol
>>> import re
>>> re.findall(r'Hannes|Hobson|Cole',
...     'Hobson Lane, Cole Howard, and Hannes Max Hapke')
['Hobson', 'Cole', 'Hannes']         1

  • 1 .findall() searches for all the non-overlapping regex matches within the input string, so it returns them in a list.

To exercise your Python playfulness, see if you can cause the regular expression to short circuit on the first pattern, when a human looking at all three patterns might choose a better match:

>>> re.findall(r'H|Hobson|Cole',
...     'Hobson Lane, Cole Howard, and Hannes Max Hapke')
['H', 'Cole', 'H', 'H', 'H']
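
One way to avoid that short circuit, sketched below, is to list the longer alternatives before the shorter one that is a prefix of them. This reordering is just an illustration, not the only possible fix.

```python
import re

text = 'Hobson Lane, Cole Howard, and Hannes Max Hapke'

# 'H' comes first, so it wins before 'Hobson' is ever tried.
short_first = re.findall(r'H|Hobson|Cole', text)
# With 'Hobson' ahead of 'H', the full name is matched where it occurs.
long_first = re.findall(r'Hobson|Cole|H', text)

print(short_first)  # ['H', 'Cole', 'H', 'H', 'H']
print(long_first)   # ['Hobson', 'Cole', 'H', 'H', 'H']
```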

B.3.2. ()—Groups

You can use parentheses to group several symbol patterns into a single expression. Each grouped expression is evaluated as a whole. So r'(kitt|dogg)y' matches either “kitty” or “doggy.” Without the parentheses, r'kitt|doggy' would match “kitt” or “doggy” (notice no “kitty”).

Groups have another purpose. They can be used to capture (extract) part of the input text. Each group is assigned a location in the list of groups() that you can retrieve according to their index, left to right. The .group() method returns the default overall group for the entire expression. You can use the previous groups to capture a “stem” (the part without the y) of the kitty/doggy regex, as shown in the following listing.

Listing B.2. Regex grouping parentheses
>>> import re
>>> match = re.match(r'(kitt|dogg)y', "doggy")
>>> match.group()
'doggy'
>>> match.group(1)
'dogg'
>>> match.groups()
('dogg',)
>>> match = re.match(r'((kitt|dogg)(y))', "doggy")       1
>>> match.groups()
('doggy', 'dogg', 'y')
>>> match.group(3)
'y'

  • 1 Use nested parentheses if you want to capture each part in its own group.

If you need to give names to your groups for information extraction into a structured datatype (dict), put ?P<name> at the start of the group, like (?P<animal_stem>dogg|kitt)y.[3]

    3

    Named regular expression group: What does "P" stand for? (https://stackoverflow.com/questions/10059673).
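
Here is a minimal sketch of named groups feeding a dict, using the standard re module's (?P<name>...) syntax. The group name suffix is a hypothetical addition for this example.

```python
import re

# Two named groups: the stem and the trailing "y".
match = re.match(r'(?P<animal_stem>kitt|dogg)(?P<suffix>y)', 'kitty')

named = match.groupdict()          # all named groups as a dict
stem = match.group('animal_stem')  # look up a single group by name

print(named)  # {'animal_stem': 'kitt', 'suffix': 'y'}
print(stem)   # kitt
```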

B.3.3. []—Character classes

Character classes are equivalent to an OR symbol (|) between a set of characters. So [abcd] is equivalent to (a|b|c|d), and [abc123] is equivalent to (a|b|c|1|2|3).

And if some of the characters in a character class are consecutive characters in the alphabet of characters (ASCII or Unicode), they can be abbreviated using a hyphen between them. So [a-d] is equivalent to [abcd] or (a|b|c|d), and [a-c1-3] is an abbreviation for [abc123] and (a|b|c|1|2|3).
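
A short check of these equivalences, sketched with the standard re module on a made-up sample string:

```python
import re

text = 'a1 b2 c3 d4'

ranges = re.findall(r'[a-c1-3]', text)          # hyphenated ranges
explicit = re.findall(r'[abc123]', text)        # every character spelled out
alternation = re.findall(r'a|b|c|1|2|3', text)  # the same set written with OR

print(ranges)  # ['a', '1', 'b', '2', 'c', '3']
assert ranges == explicit == alternation
```

Note that 'd' and '4' fall outside both ranges, so none of the three patterns picks them up.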

Character class shortcuts

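  • \s [ \t\n\r\f\v]—Whitespace characters
  • \b—A word boundary: the zero-width position between a letter, digit, or underscore and a character that is none of those (or the start or end of the string)
  • \d [0-9]—A digit
  • \w [a-zA-Z0-9_]—A word or variable name character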
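
The sketch below tries each shortcut on made-up sample strings with the standard re module. Keep in mind that for Python 3 str patterns, \d, \w, and \s also match non-ASCII Unicode digits, letters, and whitespace unless you pass the re.ASCII flag.

```python
import re

text = 'NLP_101 is fun!'

words = re.findall(r'\w+', text)   # runs of letters, digits, and underscores
digits = re.findall(r'\d', text)   # individual digits
tokens = re.split(r'\s+', text)    # split on runs of whitespace
# \b keeps "fun" from matching inside "funny".
whole_words = re.findall(r'\bfun\b', 'fun funny fun!')

print(words)        # ['NLP_101', 'is', 'fun']
print(digits)       # ['1', '0', '1']
print(tokens)       # ['NLP_101', 'is', 'fun!']
print(whole_words)  # ['fun', 'fun']
```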