Throughout Chapter 2, Regular Expressions with Python, we've seen several operations that carry a warning about overlapping groups, for example the findall operation. This seems to confuse a lot of people, so let's try to bring some clarity with a simple example:
>>> re.findall(r'(a|b)+', 'abaca')
['a', 'a']
What's happening here? Why does the preceding expression give us 'a' and 'a' instead of 'aba' and 'a'?
Let's look at it step by step to understand the solution:
As we can see in the preceding figure, the characters aba are matched, but the captured group consists only of a. This is because, even though the group matches on every repetition, it keeps only the value of its last repetition, here the final a. Keep this in mind because it's the key to understanding how it works. Stop for a moment and think about it: we're asking the regex engine to capture a group made up of a or b, but that group covers just one character at a time, and the + quantifier repeats the group instead of extending it. So, how can you capture the runs made of several 'a' or 'b' characters in any order? The following expression does the trick:
>>> re.findall(r'((?:a|b)+)', 'abbaca')
['abba', 'a']
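We're asking the regex engine to capture everything matched by the repeated subexpression (?:a|b)+, and not to capture just one character. The inner (?:...) is a non-capturing group, so the only captured group is the outer one, which wraps the whole repetition.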
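The difference between the two patterns is easier to see when the whole match and the captured group are shown side by side. The following is a minimal sketch using re.finditer; the helper name describe_matches is hypothetical and introduced here only for illustration:

```python
import re

def describe_matches(pattern, text):
    """Return a (whole_match, group_1) pair for every match of pattern in text."""
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]

# The group sits inside the repetition, so it keeps only its last value.
repeated_group = describe_matches(r'(a|b)+', 'abaca')
print(repeated_group)   # [('aba', 'a'), ('a', 'a')]

# The group wraps the repetition, so it captures the whole run.
wrapped_group = describe_matches(r'((?:a|b)+)', 'abbaca')
print(wrapped_group)    # [('abba', 'abba'), ('a', 'a')]
```

In both cases the whole match (group 0) covers the full run of characters. Only the content of group 1 changes, and group 1 is what findall returns when the pattern contains a single group.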
One last thing on this: if we wanted to obtain every single a or b with findall, we could write this simple expression:
>>> re.findall(r'(a|b)', 'abaca')
['a', 'b', 'a', 'a']
In this case, we're asking the regex engine to capture a group made of a single a or b. As we're using findall, we get the group from every match, so we get four results, one per matching character.
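As a complementary sketch, the shape of the list returned by findall depends on how many capturing groups the pattern contains. The variable names below are illustrative only:

```python
import re

text = 'abaca'

# No capturing group: findall returns the whole matches.
no_group = re.findall(r'(?:a|b)+', text)
print(no_group)     # ['aba', 'a']

# One capturing group: findall returns that group for each match.
one_group = re.findall(r'(a|b)+', text)
print(one_group)    # ['a', 'a']

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'((a|b)+)', text)
print(two_groups)   # [('aba', 'a'), ('a', 'a')]
```

When the whole match is what we want, either leave the pattern without capturing groups or wrap the entire repetition in a single group.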