Building blocks for Python regex

In Python, there are two different objects dealing with Regex:

  • RegexObject: It is also known as Pattern Object. It represents a compiled regular expression
  • MatchObject: It represents the matched pattern

RegexObject

In order to start matching patterns, we'll have to compile the regex. Python gives us an interface to do that as we've seen previously. The result will be a pattern object or RegexObject. This object has several methods for typical operations on regular expressions. As we will see later, the re module provides a shorthand for every operation so that we can avoid compiling it first.

>>> pattern = re.compile(r'fo+')

The compilation of a regular expression produces a reusable pattern object that provides all the operations that can be done, such as matching a pattern and finding all substrings that match a particular regex. So, for example, if we want to know if a string starts with <HTML>, we can use the following code:

>>> pattern = re.compile(r'<HTML>')
>>> pattern.match("<HTML>")
   <_sre.SRE_Match at 0x108076578>

There are two ways of matching patterns and executing the operations related to the regular expressions. We can compile a pattern, which gives us a RegexObject, or we can use the module operations. Let's compare the two different mechanisms in the following examples.

If we want to re-use the regular expression, we can use the following code:

>>> pattern = re.compile(r'<HTML>')
>>> pattern.match("<HTML>")

On the other hand, we can directly perform the operation on the module using the following line of code:

>>> re.match(r'<HTML>', "<HTML>")

The re module provides a wrapper for every operation in the RegexObject. You can see them as shortcuts.

Internally, these wrappers create the RegexObject and then call the corresponding method. You might be wondering whether every time you call one of these wrappers it compiles the regular expression first. The answer is no. The re module caches the compiled pattern so that in future calls it doesn't have to compile it again.
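
As a minimal sketch of the two approaches (the variable names are illustrative and not part of the re module), the following script runs the same match through a compiled pattern and through the module-level wrapper, then compares what each one found:

```python
import re

# Compile once and keep the pattern object for reuse.
pattern = re.compile(r'<HTML>')
compiled_result = pattern.match("<HTML><head>")

# Module-level shortcut: the pattern string is passed on every call.
module_result = re.match(r'<HTML>', "<HTML><head>")

# Both approaches report the same matched text and the same position.
print(compiled_result.group(), compiled_result.span())
print(module_result.group(), module_result.span())
```

Both calls should behave identically here; the choice between them is mostly about reuse and readability.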

Beware of the memory needs of your program. When you're using module operations, you don't control the cache, and so you can end up with a lot of memory usage. You can always use re.purge to clear the cache but this is a tradeoff with performance. Using compiled patterns allows you to have a fine-grained control of the memory consumption because you can decide when to purge them.

There are some differences between both ways though. With the RegexObject, it is possible to limit the region in which the pattern will be searched, for example limit the search of a pattern between the characters at index 2 and 20. In addition to that, you can set flags in every call by using the operations in the module. However, be careful; every time you change the flag, a new pattern will be compiled and cached.
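
The following sketch illustrates both differences under simple assumptions: the sample text and the variable names are made up for the example. The compiled pattern restricts the search region with pos and endpos, while the module function receives a flag in the call itself:

```python
import re

text = "id: ABC, id: abc"
pattern = re.compile(r'abc')

# Pattern methods accept pos and endpos: only text[0:8] ("id: ABC,") is searched.
limited = pattern.search(text, 0, 8)

# Without limits, the lowercase occurrence near the end is found.
unlimited = pattern.search(text)

# Module functions take flags per call instead of a search region.
insensitive = re.search(r'abc', text, re.IGNORECASE)
```

Here limited is None because the only lowercase abc lies outside the region, while the case-insensitive search already matches ABC.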

Let's dive into the most important operations that can be done with a pattern object.

Searching

Let's see the operations we have to look for patterns in strings. Note that Python has two operations, match and search, where many other languages have only one, match.

match(string[, pos[, endpos]])

This method tries to match the compiled pattern only at the beginning of the string. If there is a match, then it returns a MatchObject. So, for example, let's try to match whether a string starts with <HTML> or not:

>>> pattern = re.compile(r'<HTML>')
>>> pattern.match("<HTML><head>")
<_sre.SRE_Match at 0x108076578>

In the preceding example, first we've compiled the pattern and then we've found a match in the <HTML><head> string.

Let's see what happens when the string doesn't start with <HTML>, as shown in the following lines of code:

>>> pattern.match("⇢<HTML>")
    None

As you can see, there is no match. Remember what we said before, match tries to match at the beginning of the string. The string starts with a whitespace unlike the pattern. Note the difference with search in the following example:

>>> pattern.search("⇢<HTML>")
<_sre.SRE_Match at 0x108076578>

As expected, we have a match.
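
A small side-by-side sketch of the difference, using a plain space for the leading whitespace (the variable names are only illustrative):

```python
import re

pattern = re.compile(r'<HTML>')
text = " <HTML>"  # one leading space

# match only tries at the beginning of the string, where the space is.
at_start = pattern.match(text)

# search scans forward and finds the tag at index 1.
anywhere = pattern.search(text)
```

With this input, at_start is None and anywhere is a match object covering indexes 1 to 7.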

The optional pos parameter specifies where to start searching, as shown in the following code:

>>> pattern = re.compile(r'<HTML>')
>>> pattern.match("⇢⇢<HTML>")
    None
>>> pattern.match("⇢⇢<HTML>", 2)
   <_sre.SRE_Match at 0x1043bc850>

In the last call, we can see how the pattern has a match even though there are two whitespaces at the start of the string. This is possible because we've set pos to 2, so the match operation starts searching at that position.

Note that a pos greater than 0 doesn't mean that the string starts at that index; for example:

>>> pattern = re.compile(r'^<HTML>')
>>> pattern.match("<HTML>")
   <_sre.SRE_Match at 0x1043bc8b8>
>>> pattern.match("⇢⇢<HTML>", 2)
    None

In the preceding code, we've created a pattern that matches <HTML> only when it appears right at the start of the string. After that, we've tried to match the string ⇢⇢<HTML> starting at position 2, the < character. There is no match because ^ still refers to the real start of the string, not to position 2.

Tip

Anchor characters tip

The metacharacters ^ and $ anchor the match to the start and the end of the string, respectively. You can neither see them in the string nor write them there, because they stand for positions rather than characters, but the regex engine treats those positions as valid places to match.

Note the different result if we slice the string 2 positions, as in the following code:

>>> pattern.match("⇢⇢<HTML>"[2:])
   <_sre.SRE_Match at 0x1043bca58>

The slice gives us a new string; therefore, the ^ metacharacter matches at its start. On the contrary, pos just moves the index to the starting point for the search in the string.
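
The contrast between pos and slicing can be checked with a short script. This is only a sketch with illustrative variable names, using two plain spaces as the leading whitespace:

```python
import re

pattern = re.compile(r'^<HTML>')
text = "  <HTML>"  # two leading spaces

# pos moves the starting index, but ^ still means index 0 of text.
with_pos = pattern.match(text, 2)

# The slice is a brand-new string that begins with '<', so ^ matches there.
with_slice = pattern.match(text[2:])
```

Here with_pos is None, whereas with_slice is a match object.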

The endpos parameter sets how far the pattern will try to match in the string. In the following case, it's equivalent to slicing:

>>> pattern = re.compile(r'<HTML>')
>>> pattern.match("<HTML>"[:2]) 
    None
>>> pattern.match("<HTML>", 0, 2) 
    None

So, in the following case, we don't have the problem mentioned with pos. There is a match even when the $ metacharacter is used:

>>> pattern = re.compile(r'<HTML>$')
>>> pattern.match("<HTML>⇢", 0, 6)
<_sre.SRE_Match object at 0x1007033d8>
>>> pattern.match("<HTML>⇢"[:6])
<_sre.SRE_Match object at 0x100703370>

As you can see, there is no difference between slicing and endpos.
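
The same comparison as a runnable sketch (variable names are illustrative), with an extra call that shows what happens when the region is not limited:

```python
import re

pattern = re.compile(r'<HTML>$')
text = "<HTML> "  # trailing space

# endpos=6 makes the engine treat the string as if it ended after '>'.
by_endpos = pattern.match(text, 0, 6)

# Slicing produces the same effective input.
by_slice = pattern.match(text[:6])

# Without a limit, the trailing space keeps $ from matching.
without_limit = pattern.match(text)
```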

search(string[, pos[, endpos]])

This operation would be like the match of many languages, Perl for example. It tries to match the pattern at any location of the string and not just at the beginning. If there is a match, it returns a MatchObject.

>>> pattern = re.compile(r"world")
>>> pattern.search("hello⇢world")
   <_sre.SRE_Match at 0x1080901d0>
>>> pattern.search("hola⇢mundo ")
    None

The pos and endpos parameters have the same meaning as that in the match operation.

Note that with the MULTILINE flag, the ^ symbol matches at the beginning of the string and at the beginning of each line (we'll see more on this flag later). So, it changes the behavior of search.

In the following example, the first search matches <HTML> because it's at the beginning of the string, but the second search doesn't match because the string starts with a whitespace. And finally, in the third search, we have a match as we find <HTML> right after new line, thanks to re.MULTILINE.

>>> pattern = re.compile(r'^<HTML>', re.MULTILINE)
>>> pattern.search("<HTML>")
   <_sre.SRE_Match at 0x1043d3100>
>>> pattern.search("⇢<HTML>")
   None
>>> pattern.search("⇢⇢\n<HTML>")
   <_sre.SRE_Match at 0x1043bce68>

So, as long as the pos parameter is less than or equal to the index of the character that follows the newline, there will be a match.

>>> pattern.search("⇢⇢\n<HTML>", 3)
  <_sre.SRE_Match at 0x1043bced0>
>>> pattern.search('</div></body>\n<HTML>', 4)
  <_sre.SRE_Match at 0x1036d77e8>
>>> pattern.search("⇢⇢\n<HTML>", 4)
   None
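
To see where the cutoff falls, the sketch below tries every possible pos against one sample string and records whether search still succeeds. The dictionary name found_from is just an illustrative choice:

```python
import re

pattern = re.compile(r'^<HTML>', re.MULTILINE)
text = "  \n<HTML>"  # two spaces, a newline, then the tag at index 3

# For each starting position, record whether search still finds the tag.
found_from = {pos: pattern.search(text, pos) is not None
              for pos in range(len(text))}

print(found_from)
```

Positions 0 through 3 should report a match, because index 3 is the character right after the newline; from position 4 onward the opening < has already been skipped.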

findall(string[, pos[, endpos]])

The previous operations worked with one match at a time. In contrast, findall returns a list with all the non-overlapping occurrences of a pattern, as strings rather than as the MatchObject that search and match return.

In the following example, we're looking for every word in a string. So, we obtain a list in which every item is the pattern found, in this case a word.

>>> pattern = re.compile(r"\w+")
>>> pattern.findall("hello⇢world")
    ['hello', 'world']

Keep in mind that empty matches are a part of the result:

>>> pattern = re.compile(r'a*')
>>> pattern.findall("aba")
    ['a', '', 'a', '']

I bet you're wondering what's happening here. The trick comes from the * quantifier, which allows 0 or more repetitions of the preceding regex; the same happens with the ? quantifier.

>>> pattern = re.compile(r'a?')
>>> pattern.findall("aba")
    ['a', '', 'a', '']

Basically, both of them match the expression even though the preceding regex is not found:

[Figure: findall matching process]

First, the regex matches the character a. Next, it reaches b, where the * quantifier allows a match of zero characters: the empty string. After that, it matches the second a, and finally it reaches the end of the string ($). As happened with b, the * quantifier allows an empty match there too.
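
One way to make the empty matches visible is to ask finditer for the position of each match. This is a sketch for illustration; the list name steps is not part of the re module:

```python
import re

pattern = re.compile(r'a*')

# Pair each matched substring with its (start, end) position in "aba".
steps = [(m.group(), m.span()) for m in pattern.finditer("aba")]

for matched, span in steps:
    print(repr(matched), span)
```

The empty matches show up with equal start and end indexes: one at the b (1, 1) and one at the end of the string (3, 3).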

We've seen quantifiers in depth in Chapter 1, Introducing Regular Expressions.

In case there are groups in the pattern, they are returned as tuples. The string is scanned from left to right, so the groups are returned in the same order they are found.

The following example tries to match a pattern made of two words and creates a group for every word. That's why we have a list of tuples in which every tuple has two groups.

>>> pattern = re.compile(r"(\w+) (\w+)")
>>> pattern.findall("Hello⇢world⇢hola⇢mundo")
    [('Hello', 'world'), ('hola', 'mundo')]

The findall operation along with groups is another thing that seems to confuse a lot of people. In Chapter 3, Groups, we've dedicated a complete section to explain this complex subject.
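
As a brief preview, and as a sketch rather than a full treatment, the shape of the list returned by findall depends on how many groups the pattern contains. The variable names below are illustrative:

```python
import re

text = "Hello world hola mundo"

# No groups: each item is the whole match.
no_groups = re.findall(r"\w+ \w+", text)

# One group: each item is the text captured by that group.
one_group = re.findall(r"(\w+) \w+", text)

# Two or more groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w+) (\w+)", text)
```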

finditer(string[, pos[, endpos]])

It works essentially like findall, but it returns an iterator in which each element is a MatchObject, so we can use the operations provided by this object. It's quite useful when you need information about every match, for example the position at which the substring was matched. Several times, I've found myself using it to understand what's happening in findall.

Let's go back to one of our initial examples. Match every two words and capture them:

>>> pattern = re.compile(r"(\w+) (\w+)")
>>> it = pattern.finditer("Hello⇢world⇢hola⇢mundo")
>>> match = next(it)
>>> match.groups()
    ('Hello', 'world')
>>> match.span()
    (0, 11)

In the preceding example, we can see how we get an iterator with all the matches. For every element in the iterator, we get a MatchObject, so we can see the captured groups in the pattern, two in this case. We will also get the position of the match.

>>> match = next(it)
>>> match.groups()
    ('hola', 'mundo')
>>> match.span()
    (12, 22)

Now, we consume another element from the iterator and perform the same operations as before, getting the next match, its groups, and its position. Let's try to consume one more element:

>>> match = next(it)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
StopIteration

Finally, we try to consume another match, but in this case a StopIteration exception is thrown. This is normal behavior to indicate that there are no more elements.
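
In everyday code the iterator is usually consumed with a for loop or a comprehension, which handles StopIteration for us. A minimal sketch, with found as an illustrative name:

```python
import re

pattern = re.compile(r"(\w+) (\w+)")

# Collect the groups and the position of every match in one pass.
found = [(m.groups(), m.span())
         for m in pattern.finditer("Hello world hola mundo")]

for groups, span in found:
    print(groups, span)
```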

Modifying a string

In this section, we're going to see the operations to modify strings, such as an operation to divide the string and another to replace some parts of it.

split(string, maxsplit=0)

In almost every language, you can find the split operation in strings. The big difference is that the split in the re module is more powerful because you can use a regex as the separator. So, in this case, the string is split based on the matches of the pattern. As always, the best way to understand it is with an example, so let's split a string into lines:

>>> re.split(r"\n", "Beautiful⇢is⇢better⇢than⇢ugly.\nExplicit⇢is⇢better⇢than⇢implicit.")
['Beautiful⇢is⇢better⇢than⇢ugly.', 'Explicit⇢is⇢better⇢than⇢implicit.']

In the preceding example, the match is \n; so, the string is split using it as the separator. Let's see a more complex example of how to get the words in a string:

>>> pattern = re.compile(r"\W")
>>> pattern.split("hello⇢world")
['hello', 'world']

In the preceding example, we've defined a pattern to match any non-alphanumeric character. So, in this case the match happens in the whitespace. That's why the string is split into words. Let's see another example to understand it better:

>>> pattern = re.compile(r"\W")
>>> pattern.findall("hello⇢world")
['⇢']

Note that the match is the whitespace.

The maxsplit parameter specifies how many splits can be done at maximum and returns the remaining part in the result:

>>> pattern = re.compile(r"\W")
>>> pattern.split("Beautiful is better than ugly", 2)
['Beautiful', 'is', 'better than ugly']

As you can see, only two words are split and the other words are a part of the result.

Have you realized that the pattern matched is not included? Take a look at every example in this section. What can we do if we want to capture the pattern too?

The answer is to use groups:

>>> pattern = re.compile(r"(-)")
>>> pattern.split("hello-word")
['hello', '-', 'word']

This happens because the split operation always returns the captured groups.

Note that when a group matches the start of the string, the result will contain the empty string as a first result:

>>> pattern = re.compile(r"(\W)")
>>> pattern.split("⇢hello⇢word")
['', '⇢', 'hello', '⇢', 'word']
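
The sketch below gathers the split variations in one place and adds one case the examples above don't cover: two separators in a row. The sample strings and variable names are only illustrative:

```python
import re

text = "Beautiful is better than ugly"

all_words = re.split(r"\W", text)               # split on every separator
first_two = re.split(r"\W", text, maxsplit=2)   # stop after two splits
with_separators = re.split(r"(\W)", "hello world")  # group keeps the separator

# Two separators in a row (", ") leave an empty string between them...
one_by_one = re.split(r"\W", "hello, world")
# ...unless the pattern consumes the whole run at once.
together = re.split(r"\W+", "hello, world")
```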

sub(repl, string, count=0)

This operation returns the resulting string after replacing the matched pattern in the original string with the replacement. If the pattern is not found, the original string is returned. For example, we're going to replace the digits in the string with - (dash):

>>> pattern = re.compile(r"[0-9]+")
>>> pattern.sub("-", "order0⇢order1⇢order13")
  'order-⇢order-⇢order-'

Basically, the regex matches one or more digits and replaces each match, 0, 1, and 13 here, with - (dash).

Note that it replaces the leftmost non-overlapping occurrences of the pattern. Let's see another example:

 >>> re.sub('00', '-', 'order00000')
   'order--0'

In the preceding example, we're replacing zeroes two by two. So, the first two are matched and then replaced, then the following two zeroes are matched and replaced too, and finally the last zero is left intact.

The repl argument can also be a function, in which case it receives a MatchObject as an argument and the string returned is the replacement. For example, imagine you have a legacy system in which there are two kinds of orders. Some start with a dash and the others start with a letter:

  • -1234
  • A193, B123, C124

You must change it to the following:

  • A1234
  • B193, B123, B124

In short, the ones starting with a dash should start with an A and the rest should start with a B.

>>> def normalize_orders(matchobj):
...     if matchobj.group(1) == '-':
...         return "A"
...     else:
...         return "B"
...
>>> re.sub('([-A-Z])', normalize_orders, '-1234⇢A193⇢B123')
'A1234⇢B193⇢B123'

As mentioned previously, for each matched pattern the normalize_orders function is called. So, if the first matched group is a - (dash), then we return an A; in any other case, we return B.

Note that in the code we get the first group with the index 1; take a look at the group operation to understand why.
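
For reference, here is the same idea as a standalone script covering all four sample orders from the list above. It is a sketch: the character class [-A-Z] is an assumption about what counts as a prefix, and plain spaces separate the orders:

```python
import re

def normalize_orders(matchobj):
    # A leading dash becomes "A"; any leading capital letter becomes "B".
    return "A" if matchobj.group(1) == '-' else "B"

orders = '-1234 A193 B123 C124'
normalized = re.sub(r'([-A-Z])', normalize_orders, orders)
print(normalized)
```

The function is called once per match, so the digits are never touched; only the prefix characters are rewritten.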

Backreferences, a powerful feature, are also supported by sub. We'll see them in depth in the next chapter. Basically, sub replaces the backreferences in the replacement string with the corresponding groups. For example, let's say you want to transform markdown to HTML; for the sake of keeping the example short, we'll just bold the text:

>>> text = "imagine⇢a⇢new⇢*world*,⇢a⇢magic⇢*world*"
>>> pattern = re.compile(r'\*(.*?)\*')
>>> pattern.sub(r"<b>\g<1><\\b>", text)
'imagine⇢a⇢new⇢<b>world<\\b>,⇢a⇢magic⇢<b>world<\\b>'

As always, the previous example first compiles the pattern, which matches every word between the two *, and in addition to that it captures the word. Note that thanks to the ? metacharacter the pattern is non-greedy.

Note that \g<number> is there to avoid ambiguity with literal numbers; for example, imagine you need to add "1" right after a group:

>>> pattern = re.compile(r'\*(.*?)\*')
>>> pattern.sub(r"<b>\g<1>1<\\b>", text)
   'imagine⇢a⇢new⇢<b>world1<\\b>,⇢a⇢magic⇢<b>world1<\\b>'

As you can see, the behavior is as expected. Let's see what happens on using the notation without < and >:

>>> text = "imagine⇢a⇢new⇢*world*,⇢a⇢magic⇢*world*"
>>> pattern = re.compile(r'\*(.*?)\*')
>>> pattern.sub(r"<b>\g11<\\b>", text)
 error: bad group name

In the preceding example, the group reference is immediately followed by the digit we wanted to add, and that's precisely the problem the regex engine is facing: without the angle brackets, it can't tell where the reference ends. Written in the plain \number form, \11 would be read as group number 11, which doesn't exist. For this reason, there is the \g<group> notation.

Another thing to keep in mind with sub is that every backslash escape in the replacement string will be processed. As you can see in <\\b>, you need to escape the backslash itself if you want to avoid it.
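
The following sketch puts the backreference notations side by side. Note one assumption that departs from the examples above: it closes the tag with the usual HTML </b> instead of <\\b>, so that the replacement string has no backslash of its own to escape. The variable names are illustrative:

```python
import re

text = "imagine a new *world*, a magic *world*"
pattern = re.compile(r'\*(.*?)\*')

# \1 and \g<1> both refer to the first group.
short_form = pattern.sub(r"<b>\1</b>", text)
long_form = pattern.sub(r"<b>\g<1></b>", text)

# \g<1>1 means "group 1, followed by a literal 1".
with_digit = pattern.sub(r"<b>\g<1>1</b>", text)

# \11 is read as group 11, which this pattern does not define.
try:
    pattern.sub(r"<b>\11</b>", text)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
```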

You can limit the number of replacements with the optional count argument.
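
A short sketch of count, reusing the zeroes example from earlier in this section (the variable names are illustrative):

```python
import re

# With the default count=0, every non-overlapping match is replaced.
replace_all = re.sub('00', '-', 'order00000')

# count=1 stops after the first replacement, leaving the other zeroes intact.
replace_once = re.sub('00', '-', 'order00000', count=1)
```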

subn(repl, string, count=0)

It is basically the same operation as sub, you can think of it as a utility above sub. It returns a tuple with the new string and the number of substitutions made. Let us see the working by using the same example as before:

>>> text = "imagine⇢a⇢new⇢*world*,⇢a⇢magic⇢*world*"
>>> pattern = re.compile(r'\*(.*?)\*')
>>> pattern.subn(r"<b>\g<1><\\b>", text)
('imagine⇢a⇢new⇢<b>world<\\b>,⇢a⇢magic⇢<b>world<\\b>', 2)
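
One common use of the returned count is to find out whether anything was replaced at all. A minimal sketch, reusing the digits pattern from the sub section with made-up input strings:

```python
import re

pattern = re.compile(r"[0-9]+")

# Three runs of digits are replaced, so the count is 3.
changed_text, changed_count = pattern.subn("-", "order0 order1 order13")

# No digits: the original string comes back with a count of 0.
same_text, same_count = pattern.subn("-", "no digits here")
```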

It's been a long section. We explored the main operations we can do with re module and the RegexObject class along with examples. Let's continue with the object we get after a match.

MatchObject

This object represents the matched pattern; you will get one every time you execute one of these operations:

  • match
  • search
  • finditer

This object provides us with a set of operations for working with the captured groups, getting information about the position of the match, and so on. Let's see the most important operations.

group([group1, …])

The group operation gives you the subgroups of the match. If it's invoked with no arguments or zero, it will return the entire match; while if one or more group identifiers are passed, the corresponding groups' matches will be returned.

Let's see them with an example:

>>> pattern = re.compile(r"(\w+) (\w+)")
>>> match = pattern.search("Hello⇢world")

The pattern matches the whole string and captures two groups, Hello and world. Once we have the match, we can see the following concrete cases:

  • With no arguments or zero, it returns the entire match.
    >>> match.group()
    'Hello⇢world'
    
    >>> match.group(0)
    'Hello⇢world'
  • With group1 bigger than 0, it returns the corresponding group.
    >>> match.group(1)
    'Hello'
    
    >>> match.group(2)
    'world'
  • If the group doesn't exist, an IndexError will be thrown.
    >>> match.group(3)
    …
    IndexError: no such group
  • With multiple arguments, it returns the corresponding groups.
    >>> match.group(0, 2)
       ('Hello⇢world', 'world')

    In this case, we want the whole pattern and the second group, that's why we pass 0 and 2.

Groups can be named, we'll see it in depth in the next chapter; there is a special notation for it. If the pattern has named groups, they can be accessed using the names or the index:

>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")

In the preceding example, we've compiled a pattern to capture two groups: the first one is named first and the second one is named second.

>>> match = pattern.search("Hello⇢world")
>>> match.group('first')
'Hello'

In this way, we can get a group by its name. Note that using named groups we can still get the groups by their index, as shown in the following code:

>>> match.group(1)
'Hello'

We can even use both types:

>>> match.group(0, 'first', 2)
('Hello⇢world', 'Hello', 'world')

groups([default])

The groups operation is similar to the previous operation. However, in this case it returns a tuple with all the subgroups in the match instead of giving you one or some of the groups. Let's see it with the example we've used in the previous section:

>>> pattern = re.compile(r"(\w+) (\w+)")
>>> match = pattern.search("Hello⇢World")
>>> match.groups()
   ('Hello', 'World')

As we had in the previous section, we have two groups Hello and World and that's exactly what groups gives us. In this case, you can see groups as group(1, lastGroup).

In case there are groups that don't match, the default argument is returned. If the default argument is not specified then None is used, for example:

>>> pattern = re.compile(r"(\w+) (\w+)?")
>>> match = pattern.search("Hello⇢")
>>> match.groups("mundo")
   ('Hello', 'mundo')
>>> match.groups()
   ('Hello', None)

The pattern in the preceding example is trying to match two groups made of one or more alphanumeric characters. The second one is optional; so we get only one group with the string Hello. After getting the match, we call groups with default set to mundo so that it returns mundo as the second group. Note that in the following call we don't set default, so None is returned.

groupdict([default])

The groupdict method is used in the cases where named groups have been used. It will return a dictionary with all the groups that were found:

>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
>>> pattern.search("Hello⇢world").groupdict()
{'first': 'Hello', 'second': 'world'}

In the preceding example, we use a pattern similar to what we've seen in the previous sections. It captures two groups with the names first and second. So, groupdict returns them in a dictionary. Note that if there aren't named groups, then it returns an empty dictionary.
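
Like groups, groupdict accepts a default for groups that did not take part in the match. The sketch below shows that case together with the empty-dictionary case; the variable names are illustrative:

```python
import re

# The second named group is optional and will not match "Hello ".
pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)?")
match = pattern.search("Hello ")

without_default = match.groupdict()       # unmatched group maps to None
with_default = match.groupdict("mundo")   # unmatched group maps to the default

# A pattern with only unnamed groups yields an empty dictionary.
unnamed = re.search(r"(\w+)", "Hello").groupdict()
```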

Don't worry if you don't understand quite well what is happening here. As we've mentioned before, we'll see everything related to groups in Chapter 3, Groups.

start([group])

Sometimes, it is useful to know the index where the pattern matched. As with all the operations related to groups, if the argument group is zero, then the operation works with the whole string matched:

>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)?")
>>> match = pattern.search("Hello⇢")
>>> match.start(1)
0

If there are groups that don't match, then -1 is returned:

>>> match = pattern.search("Hello⇢")
>>> match.start(2)
-1

end([group])

The end operation behaves exactly the same as start, except that it returns the end of the substring matched by the group:

>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)?")
>>> match = pattern.search("Hello⇢")
>>> match.end(1)
5

span([group])

It's an operation that gives you a tuple with the values from start and end. This operation is often used in text editors to locate and highlight a search. The following code is an example of this operation:

>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)?")
>>> match = pattern.search("Hello⇢")
>>> match.span(1)
(0, 5)
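
The values returned by span can be used directly as slice indexes, which is roughly how a highlighting feature would locate the text. A small sketch with illustrative variable names:

```python
import re

pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)?")
text = "Hello "
match = pattern.search(text)

# span(1) gives the slice boundaries of the first group.
start, end = match.span(1)
first_word = text[start:end]

# With no argument, span covers the whole match, including the space.
whole_span = match.span()

# A group that did not participate in the match reports (-1, -1).
missing_span = match.span(2)
```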

expand(template)

This operation returns the template string after replacing its backreferences with the corresponding groups of the match. It's similar to sub.

Continuing with the example in the previous section:

>>> text = "imagine⇢a⇢new⇢*world*,⇢a⇢magic⇢*world*"
>>> match = re.search(r'\*(.*?)\*', text)
>>> match.expand(r"<b>\g<1><\\b>")
  '<b>world<\\b>'
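
The template can also refer to named groups. In the sketch below, the group name word and the closing tag </b> are assumptions made for the illustration; they do not come from the example above:

```python
import re

text = "imagine a new *world*, a magic *world*"
match = re.search(r'\*(?P<word>.*?)\*', text)

# The same group can be referenced by number or by name in the template.
by_number = match.expand(r"<b>\g<1></b>")
by_name = match.expand(r"<b>\g<word></b>")
```

Unlike sub, expand works on a single match, so only the first *world* is involved here.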

Module operations

Let's see two useful operations from the module.

escape()

It escapes the characters that the regex engine would otherwise treat as metacharacters, so that the text is matched literally.

>>> re.findall(re.escape("^"), "^like^")
['^', '^']
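
escape is most useful when the text to search for comes from outside the program. The following sketch uses a made-up user_input value to show how an unescaped + silently changes the meaning of the search:

```python
import re

user_input = "1+1=2"
text = "1+1=2 and 11=2"

# Unescaped, the + acts as a quantifier: "one or more 1s, then 1=2".
unescaped = re.findall(user_input, text)

# Escaped, every character of the input is matched literally.
escaped = re.findall(re.escape(user_input), text)
```

The unescaped search skips the text the user typed and matches 11=2 instead; the escaped search finds exactly 1+1=2.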

purge()

It purges the regular expressions cache. We've already talked about this; you need to use this in order to release memory when you're using the operations through the module. Keep in mind that there is a tradeoff with the performance; once you release the cache, every pattern has to be compiled and cached again.

Well done, you already know the main operations that you can do with the re module. After this, you can start using regex in your projects without many problems.

Now, we're going to see how to change the default behavior of the patterns.
