^.*\berror\b.*$
Regex options: Case insensitive, ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
It’s often useful to match complete lines in order to collect or
remove them. To match any line that contains the word error, we start
with the regular expression ‹\berror\b›. The word boundary tokens on both ends
make sure that we match “error” only when it appears as a complete word,
as explained in Recipe 2.6.
To expand the regex to match a complete line, add ‹.*› at both ends. The dot-asterisk
sequences match zero or more characters within the current line. The
asterisk quantifiers are greedy, so they will match as much text as
possible. The first dot-asterisk matches until the last occurrence of
“error” on the line, and the second dot-asterisk matches any
non-line-break characters that occur after it.
Finally, place caret and dollar sign anchors at the beginning and end of the regular expression, respectively, to ensure that matches contain a complete line. Strictly speaking, the dollar sign anchor at the end is redundant since the dot and greedy asterisk will always match until the end of the line. However, it doesn’t hurt to add it, and makes the regular expression a little more self-explanatory. Adding line or string anchors to your regexes, when appropriate, can sometimes help you avoid unexpected issues, so it’s a good habit to form. Note that unlike the dollar sign, the caret at the beginning of the regular expression is not necessarily redundant, since it ensures that the regex only matches complete lines, even if the search starts in the middle of a line for some reason.
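As a rough illustration of the points above, the following Python sketch applies the regex to a small made-up text. The sample text and variable names are assumptions for the example, not part of the recipe. It also shows why the caret matters: when a search starts in the middle of a line (here through the pos argument of a compiled pattern's search method), the anchored regex does not return a partial line, while a version without the caret returns only the tail of that line.

```python
import re

# Hypothetical sample text; any multi-line string works.
text = "first line\nan error occurred here\nlast line"

flags = re.IGNORECASE | re.MULTILINE
anchored = re.compile(r"^.*\berror\b.*$", flags)
unanchored = re.compile(r".*\berror\b.*$", flags)

# Searching from the start of the text finds the complete line.
full_line = anchored.search(text).group()

# Start the search in the middle of the second line, at the word "error".
pos = text.index("error")
partial = unanchored.search(text, pos).group()  # only the tail of the line
anchored_mid = anchored.search(text, pos)       # None: no later line contains the word

print(full_line)
print(partial)
print(anchored_mid)
```

In Python, the caret in multiline mode matches only at the real start of the string or just after a newline, not at the position where the search happens to begin, which is what makes the anchored version reject the partial line.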
Remember that the three key metacharacters used to restrict matches to a
single line (the ‹^› and ‹$› anchors, and the dot) do not have fixed
meanings. To make them all line-oriented,
you have to enable the option to let ^ and $ match at line breaks, and
make sure that the option to let the dot match line breaks is not
enabled. Recipe 3.4 shows how to apply these
options in code. If you’re using Ruby, there is one less option to worry
about, because Ruby’s caret and dollar sign anchors always match at line
breaks. JavaScript also had one less option for a long time: it offered no
way to let the dot match line breaks until ECMAScript 2018 added the s flag.
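To make the effect of these options concrete, here is a minimal Python sketch that runs the same regex under three flag combinations. The sample text is invented for the example. In Python, re.MULTILINE is the “^ and $ match at line breaks” option and re.DOTALL is the “dot matches line breaks” option.

```python
import re

# Hypothetical three-line sample; only the middle line contains the word.
text = "ok\nError: disk full\nok again"
regex = r"^.*\berror\b.*$"

# Intended settings: case insensitive, anchors match at line breaks.
line_mode = re.findall(regex, text, re.IGNORECASE | re.MULTILINE)

# Without re.MULTILINE the caret matches only at the very start of the
# text, and the dot cannot cross the first line break to reach the word.
no_multiline = re.findall(regex, text, re.IGNORECASE)

# With re.DOTALL added the dots run across line breaks, so the match
# swallows the whole text instead of a single line.
dot_all = re.findall(regex, text, re.IGNORECASE | re.MULTILINE | re.DOTALL)

print(line_mode)
print(no_multiline)
print(dot_all)
```

Only the first combination returns the single line of interest, which is why the recipe asks for one option to be set and the other to be left off.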
To search for lines that contain any one of multiple words, use alternation:
^.*\b(one|two|three)\b.*$
Regex options: Case insensitive, ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
This regular expression matches any line that contains at least
one of the words “one,” “two,” or “three.” The parentheses around the
words serve two purposes. First, they limit the reach of the
alternation, and second, they capture the specific word that was found
on the line to backreference 1. If the line contains more than one of
the words, the backreference will hold the one that occurs farthest to
the right. This is because the asterisk quantifier that appears before
the parentheses is greedy, and will expand the dot to match as much text
as possible. If you make the asterisk lazy, as with
‹^.*?\b(one|two|three)\b.*$›, backreference 1 will contain the word from
your list that appears farthest to the left.
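The difference between the greedy and lazy variants can be checked with a short Python sketch. The sample line is made up, and this sketch wraps the alternation in word boundaries so that only whole words count. Group 1 of the match object corresponds to backreference 1.

```python
import re

# Hypothetical line containing all three words.
line = "one two three"
flags = re.IGNORECASE | re.MULTILINE

greedy = re.search(r"^.*\b(one|two|three)\b.*$", line, flags)
lazy = re.search(r"^.*?\b(one|two|three)\b.*$", line, flags)

# Both variants match the same complete line...
same_line = greedy.group() == lazy.group()

# ...but they capture different words in group 1.
rightmost = greedy.group(1)  # greedy dot-asterisk backs off from the end of the line
leftmost = lazy.group(1)     # lazy dot-asterisk expands from the start of the line

print(same_line, rightmost, leftmost)
```

The overall match is the same either way. Only the captured word changes, so the choice matters only if you use the backreference.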
To find lines that must contain multiple words, use lookahead:
^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).+$
Regex options: Case insensitive, ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
This regular expression uses positive lookahead to match lines
that contain three required words anywhere within them. The ‹.+› at the end is used to actually
match the line, after the lookaheads have determined that the line meets
the requirements.
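A small Python sketch of the lookahead approach follows. The four sample lines are invented, and this sketch puts word boundaries around each required word so that “none” or “twofold” does not count as “one” or “two”. Each lookahead starts from the beginning of the line, so the words may appear in any order.

```python
import re

# Hypothetical sample: only the first and third lines contain all three words.
text = (
    "three two one\n"
    "one and two only\n"
    "One, Two, Three\n"
    "none of twofold threes"
)

all_three = re.compile(
    r"^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).+$",
    re.IGNORECASE | re.MULTILINE,
)

# The lookaheads consume no text; the final .+ is what matches the line.
matching_lines = all_three.findall(text)

print(matching_lines)
```

Because the dot does not match line breaks here, each lookahead can only inspect the current line, which keeps words on neighboring lines from satisfying the requirement.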
Recipe 5.11 shows how to match complete lines that do not contain a particular word.
If you’re not concerned with matching complete lines, Recipe 5.1 describes how to match a specific word, and Recipe 5.2 shows how to match any of multiple words.
Recipe 3.21 includes code listings for searching through text line by line, which can simplify the process of searching within and identifying lines of interest.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.