As we've mentioned before, capturing content is not the only use of groups. There are cases when we want to use groups, but we're not interested in extracting the information; alternation would be a good example. That's why we have a way to create groups without capturing. Throughout this book, we've been using groups to create subexpressions, as can be seen in the following example:
>>> re.search("Españ(a|ol)", "Español")
<_sre.SRE_Match at 0x10e90b828>
>>> re.search("Españ(a|ol)", "Español").groups()
('ol',)
You can see that we've captured a group even though we're not interested in the content of the group. So, let's try it without capturing, but first we have to know the syntax, which is almost the same as in normal groups: (?:pattern). As you can see, we've only added ?: after the opening parenthesis. Let's see the following example:
>>> re.search("Españ(?:a|ol)", "Español")
<_sre.SRE_Match at 0x10e912648>
>>> re.search("Españ(?:a|ol)", "Español").groups()
()
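After using the new syntax, we have the same functionality as before, but now we're saving resources, because the engine no longer stores a group we don't need, and the regex is easier to maintain. Note that a non-capturing group cannot be referenced, either with a backreference or through the match object.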
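The difference can be checked with a short script. This is a minimal sketch using only the standard re module; the variable names are illustrative and the third pattern is an extra example that is not in the text above. It shows that a non-capturing group still takes part in the match but does not appear in groups() and does not take a group number.

```python
import re

text = "Español"

# Capturing group: the alternation is stored as group 1.
capturing = re.search(r"Españ(a|ol)", text)
print(capturing.group(0), capturing.groups())          # Español ('ol',)

# Non-capturing group: same overall match, nothing stored.
non_capturing = re.search(r"Españ(?:a|ol)", text)
print(non_capturing.group(0), non_capturing.groups())  # Español ()

# Non-capturing groups are skipped when groups are numbered,
# so the only capturing group here is still group 1.
mixed = re.search(r"(?:Españ)(a|ol)", text)
print(mixed.group(1))                                  # ol
```

Because the non-capturing group is not numbered, adding or removing one later does not shift the numbers of the capturing groups around it.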
Atomic groups are a special case of non-capturing groups; they're usually used to improve performance. An atomic group disables backtracking into the group once it has matched, so with it you can avoid cases where trying every possibility or path in the pattern doesn't make sense. This concept is difficult to understand, so stay with me up to the end of the section.
Before Python 3.11, the re module doesn't support atomic groups. So, in order to see an example, we're going to use the regex module: https://pypi.python.org/pypi/regex.
Imagine we have to look for an ID made up of one or more alphanumeric characters followed by a dash and by a digit:
>>> import regex
>>> data = "aaaaabbbbbaaaaccccccdddddaaa"
>>> regex.match(r"(\w+)-\d", data)
Let's see step by step what's happening here:
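1. The regex engine matches the first a.
2. Because \w+ is greedy, it goes on to match every character up to the end of the string.
3. It fails there because it doesn't find the dash.
4. The engine backtracks, giving up the last character matched by \w+, and looks for the dash again.
5. It fails again and repeats the process.
It tries this with every character. If you think about what we're doing, it doesn't make any sense to keep trying once you have failed the first time: none of the characters that \w+ gives back is a dash. And that's exactly what an atomic group is useful for. For example: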
>>> regex.match(r"(?>\w+)-\d", data)
Here we've added ?>, which indicates an atomic group, so once the regex engine fails to match after the group, it doesn't go back into the group and keep trying with every character in the data.
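The effect of an atomic group is easier to see on a pattern where backtracking changes the result. The sketch below does not use the (?>...) syntax; it assumes only the standard re module and swaps in a common emulation, a lookahead that captures followed by a backreference, (?=(\w+))\1, which relies on the engine not backtracking into a lookahead once it has succeeded. The pattern names and the "abc1" sample are illustrative and not taken from the text.

```python
import re

# Ordinary group: \w+ may give characters back to let \d match.
BACKTRACKING = re.compile(r"(\w+)\d")

# Emulated atomic group: the lookahead captures the longest run of
# word characters, and \1 then consumes exactly that run. The engine
# does not go back into the lookahead to try a shorter run.
ATOMIC_LIKE = re.compile(r"(?=(\w+))\1\d")

sample = "abc1"

backtracking_match = BACKTRACKING.match(sample)
print(backtracking_match.group(1))   # abc  (\w+ gave back the "1")

atomic_like_match = ATOMIC_LIKE.match(sample)
print(atomic_like_match)             # None (nothing is given back)

# The ID pattern from the text, with the same emulation.
ID_ATOMIC_LIKE = re.compile(r"(?=(\w+))\1-\d")
data = "aaaaabbbbbaaaaccccccdddddaaa"

print(ID_ATOMIC_LIKE.match("abc-1").group(0))  # abc-1
print(ID_ATOMIC_LIKE.match(data))              # None, with no retries inside \w+
```

For the ID pattern, both the ordinary and the atomic version reject data; the atomic version simply stops after the first failure instead of retrying with a shorter run of characters.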