Chapter 4. Look Around

Up to this point, we have learned different mechanisms for matching characters while consuming them. A character that has already been matched cannot be compared again, and the only way to match an upcoming character is to consume it.

The exceptions to this are a number of metacharacters we have studied, the so-called zero-width assertions. These metacharacters indicate positions rather than actual content. For instance, the caret symbol (^) represents the beginning of a line, and the dollar sign ($) represents the end of a line. They just ensure that the position in the input is correct without actually consuming or matching any character.
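As a quick reminder of what zero-width means in practice, the following minimal sketch searches for the two anchors on their own. The sample text and variable names are only illustrative assumptions:

```python
import re

text = "The quick brown fox"

# Each anchor matches a position, so start and end of the match coincide.
start_span = re.search(r'^', text).span()
end_span = re.search(r'$', text).span()

print("%d %d" % start_span)  # 0 0
print("%d %d" % end_span)    # 19 19
```

In both cases the match is empty: the engine confirmed a position and consumed nothing.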

A more powerful kind of zero-width assertion is look around, a mechanism with which it is possible to match a certain previous (look behind) or subsequent (look ahead) value relative to the current position. It performs an assertion without consuming characters; it just returns a positive or negative result of the match.

The look around mechanism is probably the least known and at the same time the most powerful technique in regular expressions. This mechanism allows us to create powerful regular expressions that cannot be written otherwise, either because of the complexity they would involve or because of the technical limitations of regular expressions without look around.

In this chapter, we are going to learn how to leverage the look around mechanism using Python regular expressions. We will understand how to apply it, how it works behind the scenes, and the few limitations that the Python regular expression module imposes on us.

Both look ahead and look behind can each be subdivided into two types, positive and negative:

  • Positive look ahead: This mechanism is represented as an expression preceded by a question mark and an equals sign, ?=, inside a parenthesis block. For example, (?=regex) will match if the passed regex matches the forthcoming input.
  • Negative look ahead: This mechanism is represented as an expression preceded by a question mark and an exclamation mark, ?!, inside a parenthesis block. For example, (?!regex) will match if the passed regex does not match the forthcoming input.
  • Positive look behind: This mechanism is represented as an expression preceded by a question mark, a less-than sign, and an equals sign, ?<=, inside a parenthesis block. For example, (?<=regex) will match if the passed regex matches the previous input.
  • Negative look behind: This mechanism is represented as an expression preceded by a question mark, a less-than sign, and an exclamation mark, ?<!, inside a parenthesis block. For example, (?<!regex) will match if the passed regex does not match the previous input.
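To see the four forms side by side, here is a small sketch that applies each one to the same made-up input. The text foobar foobaz bazbar and the helper spans() are hypothetical names chosen only for this illustration:

```python
import re

text = "foobar foobaz bazbar"

def spans(pattern):
    """Return the (start, end) pairs of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

positive_ahead = spans(r'foo(?=bar)')    # foo followed by bar
negative_ahead = spans(r'foo(?!bar)')    # foo not followed by bar
positive_behind = spans(r'(?<=foo)bar')  # bar preceded by foo
negative_behind = spans(r'(?<!foo)bar')  # bar not preceded by foo

print(positive_ahead)   # [(0, 3)]
print(negative_ahead)   # [(7, 10)]
print(positive_behind)  # [(3, 6)]
print(negative_behind)  # [(17, 20)]
```

Note that every reported span covers only the three characters outside the look around block; the asserted text is never part of the result.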

Let's start by looking ahead in the next section.

Look ahead

The first type of look around mechanism that we are going to study is the look ahead mechanism. It tries to match the subexpression passed as an argument against the input that lies ahead of the current position. The zero-width nature of the two look around operations renders them complex and difficult to understand.

As we know from the previous section, it is represented as an expression preceded by a question mark and an equals sign, ?=, inside a parenthesis block: (?=regex).

Let's start tackling this by comparing the results of two similar regular expressions. We can recall that in Chapter 1, Introducing Regular Expressions, we matched the expression /fox/ against the phrase The quick brown fox jumps over the lazy dog. Let's also apply the expression /(?=fox)/ to the same input:

>>> import re
>>> pattern = re.compile(r'fox')
>>> result = pattern.search("The quick brown fox jumps over the lazy dog")
>>> print(result.start(), result.end())
16 19

We just searched for the literal fox in the input string, and just as expected, we found it between indexes 16 and 19. Let's see the following example of the look ahead mechanism:

>>> pattern = re.compile(r'(?=fox)')
>>> result = pattern.search("The quick brown fox jumps over the lazy dog")
>>> print(result.start(), result.end())
16 16

This time we have applied the expression /(?=fox)/ instead. The result is just a position at index 16 (both the start and the end point to the same index). This is because look around does not consume characters, and therefore, it can be used to filter where the expression should match. However, it will not define the contents of the result. We can visually compare these two expressions in the following figure:

[Figure: Comparison of normal and look ahead matches]
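One consequence of not consuming characters is that whatever the look ahead has inspected is still available to the rest of the expression. As a minimal sketch (the combined pattern is our own illustration, not one used elsewhere in this chapter), we can place a normal subexpression right after the look ahead:

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# The look ahead succeeds at index 16 without advancing, so \w+ starts
# matching at that same index and consumes the word itself.
match = re.search(r'(?=fox)\w+', text)

print("%d %d %s" % (match.start(), match.end(), match.group()))  # 16 19 fox
```

The look ahead only decided where the match may start; the characters in the result come entirely from \w+.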

Let's use this feature again to try to match any word that is followed by a comma character (,) using the regular expression /\w+(?=,)/ and the text They were three: Felix, Victor, and Carlos:

>>> pattern = re.compile(r'\w+(?=,)')
>>> pattern.findall("They were three: Felix, Victor, and Carlos.")
['Felix', 'Victor']

We created a regular expression that accepts any repetition of alphanumeric characters followed by a comma character that is not going to be used as a part of the result. Therefore, only Felix and Victor were part of the result as Carlos didn't have a comma after the name.

How different is this from the regular expressions we have used up to this chapter? Let's compare the results by applying /\w+,/ to the same text:

>>> pattern = re.compile(r'\w+,')
>>> pattern.findall("They were three: Felix, Victor, and Carlos.")
['Felix,', 'Victor,']

With the preceding regular expression, we asked the regular expression engine to accept any repetition of alphanumeric characters followed by a comma character. Therefore, both the alphanumeric characters and the comma character are returned, as we can see in the listing.
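The difference between the two expressions is also visible in the positions of the matches. The following sketch, which uses finditer only as one possible way of inspecting the spans, shows that the comma falls outside the match when it is asserted and inside the match when it is consumed:

```python
import re

text = "They were three: Felix, Victor, and Carlos."

# Comma asserted by a look ahead: it is not part of the span.
asserted = [m.span() for m in re.finditer(r'\w+(?=,)', text)]

# Comma consumed by the expression: the span is one character longer.
consumed = [m.span() for m in re.finditer(r'\w+,', text)]

print(asserted)  # [(17, 22), (24, 30)]
print(consumed)  # [(17, 23), (24, 31)]
```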

It's noteworthy that the look ahead mechanism is another subexpression that can be leveraged with all the power of regular expressions (this is not the case for the look behind mechanism, as we will discover later). Therefore, we can use all the constructions we have learned so far, such as alternation:

>>> pattern = re.compile(r'\w+(?=,|\.)')
>>> pattern.findall("They were three: Felix, Victor, and Carlos.")
['Felix', 'Victor', 'Carlos']

In the preceding example, we used alternation (even though we could have used simpler techniques, such as a character set) to accept any repetition of alphanumeric characters followed by a comma or dot character that is not going to be used as a part of the result.
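For reference, the character set alternative mentioned above could look like the following sketch. The variable names are arbitrary, and inside a character set the dot does not need to be escaped:

```python
import re

text = "They were three: Felix, Victor, and Carlos."

with_alternation = re.findall(r'\w+(?=,|\.)', text)
with_char_set = re.findall(r'\w+(?=[,.])', text)

print(with_alternation)  # ['Felix', 'Victor', 'Carlos']
print(with_char_set)     # ['Felix', 'Victor', 'Carlos']
```

Both expressions return the same list for this input.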

Negative look ahead

The negative look ahead mechanism has the same nature as the look ahead, but with a notable distinction: the result will be valid only if the subexpression doesn't match.

It is represented as an expression preceded by a question mark and an exclamation mark, ?!, inside a parenthesis block: (?!regex).

This is useful when we want to express what should not happen. For instance, to find any name John that is not John Smith, we could do the following:

>>> pattern = re.compile(r'John(?!\sSmith)')
>>> result = pattern.finditer("I would rather go out with John McLane than with John Smith or John Bon Jovi")
>>> for i in result:
...     print(i.start(), i.end())
...
27 31
63 67

In the preceding example, we looked for John by consuming these four characters and then looked ahead for a whitespace character followed by the word Smith. If the look ahead subexpression does not match, the overall match succeeds, and the result contains only the start and end positions of John. In this case, the positions are 27-31 for John McLane and 63-67 for John Bon Jovi; the John in John Smith is rejected.
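A simple way to convince ourselves of this behavior is to run the positive and the negative versions against the same input. This is only a sketch for comparison; the variable names are our own:

```python
import re

text = ("I would rather go out with John McLane "
        "than with John Smith or John Bon Jovi")

# Accept John only when " Smith" does NOT follow.
negative = [m.span() for m in re.finditer(r'John(?!\sSmith)', text)]

# Accept John only when " Smith" DOES follow.
positive = [m.span() for m in re.finditer(r'John(?=\sSmith)', text)]

print(negative)  # [(27, 31), (63, 67)]
print(positive)  # [(49, 53)]
```

The positive look ahead finds exactly the occurrence that the negative look ahead rejects, and in both cases the span covers only the four characters of John.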

Now we are able to leverage the most basic forms of look around: the positive and negative look ahead. Let's learn how to get the most out of them in substitutions and groups.
