Overlapping groups

Throughout Chapter 2, Regular Expressions with Python, several operations came with a warning about overlapping groups, the findall operation among them. This seems to confuse a lot of people, so let's try to bring some clarity with a simple example:

>>> import re
>>> re.findall(r'(a|b)+', 'abaca')
['a', 'a']

What's happening here? Why does the preceding expression give us 'a' and 'a' instead of 'aba' and 'a'?

Let's look at it step by step to understand the solution:

Figure: Overlapping groups matching process

As we can see in the preceding figure, the characters aba are matched, but the captured group contains only a. This is because a quantified group keeps only the text from its last repetition: the group matches a, then b, then a again, and each repetition overwrites the previous capture, so it stays with the last a. Keep this in mind because it's the key to understanding how it works. Stop for a moment and think about it: we're asking the regex engine to match every run of a or b, but to capture just one character at a time. So, how can you capture the runs made of several a or b characters in any order? The following expression does the trick:

>>> re.findall(r'((?:a|b)+)', 'abbaca')
['abba', 'a']

Here the quantifier sits inside the capturing group: the non-capturing group (?:a|b) matches a single character, the + repeats it, and the outer parentheses capture the whole run rather than just one character.
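To see the difference between what is matched and what is captured, it can help to inspect the match objects directly. The following is only a small sketch using re.finditer; the variable names are illustrative and do not come from the chapter. It pairs the whole match, group(0), with the captured group, group(1), for both patterns:

```python
import re

text = 'abaca'

# For each match, pair the whole matched text with what group 1 kept.
# With (a|b)+ the group is repeated, so it only keeps its last repetition.
repeated_group = [(m.group(0), m.group(1)) for m in re.finditer(r'(a|b)+', text)]
print(repeated_group)   # [('aba', 'a'), ('a', 'a')]

# With ((?:a|b)+) the repetition happens inside the capturing group,
# so the group keeps the whole run.
wrapped_group = [(m.group(0), m.group(1)) for m in re.finditer(r'((?:a|b)+)', text)]
print(wrapped_group)    # [('aba', 'aba'), ('a', 'a')]
```

In both cases the engine matches aba and then a; only the content of the captured group changes, and that captured content is what findall returns when the pattern contains a group.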

One last thing on this: if we wanted to obtain every single a or b with findall, we could write this simple expression:

>>> re.findall(r'(a|b)', 'abaca')
['a', 'b', 'a', 'a']

In this case, we're asking the regex engine to capture a group made of a single a or b. As we're using findall, we get the group from every match, so we get four groups.
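As a side note, findall only returns group contents when the pattern contains a capturing group; without one, it returns the whole text of each match. The sketch below is just an illustration of that behavior, with variable names chosen for this example, and shows two group-free ways to get comparable results:

```python
import re

text = 'abaca'

# No capturing group: findall returns the whole text of each match.
runs = re.findall(r'(?:a|b)+', text)
print(runs)      # ['aba', 'a']

# A character class matches one a or b at a time, with no group needed.
singles = re.findall(r'[ab]', text)
print(singles)   # ['a', 'b', 'a', 'a']
```

Whether to use a capturing group or a group-free pattern is a matter of taste here; the group-free forms simply avoid the question of which repetition a group ends up holding.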

Tip

Rule of Thumb

It's better to keep regular expressions as simple as you can. So, you should begin with the simplest expression and then build more complex expressions step by step and not the other way around.
