5.10. Match Complete Lines That Contain a Word

Problem

You want to match all lines that contain the word error anywhere within them.

Solution

^.*\berror\b.*$
Regex options: Case insensitive, ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

It’s often useful to match complete lines in order to collect or remove them. To match any line that contains the word error, we start with the regular expression \berror\b. The word boundary tokens on both ends make sure that we match “error” only when it appears as a complete word, as explained in Recipe 2.6.

To expand the regex to match a complete line, add .* at both ends. The dot-asterisk sequences match zero or more characters within the current line. The asterisk quantifiers are greedy, so they will match as much text as possible. The first dot-asterisk matches until the last occurrence of “error” on the line, and the second dot-asterisk matches any non-line-break characters that occur after it.

Finally, place caret and dollar sign anchors at the beginning and end of the regular expression, respectively, to ensure that matches contain a complete line. Strictly speaking, the dollar sign anchor at the end is redundant since the dot and greedy asterisk will always match until the end of the line. However, it doesn’t hurt to add it, and makes the regular expression a little more self-explanatory. Adding line or string anchors to your regexes, when appropriate, can sometimes help you avoid unexpected issues, so it’s a good habit to form. Note that unlike the dollar sign, the caret at the beginning of the regular expression is not necessarily redundant, since it ensures that the regex only matches complete lines, even if the search starts in the middle of a line for some reason.
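The point about the caret can be illustrated with a short sketch. It assumes Python's re module, whose compiled patterns accept a start position for search(); the sample text and variable names are made up for illustration.

```python
import re

text = "first error here\nsecond line\nthird error line"

flags = re.IGNORECASE | re.MULTILINE
anchored = re.compile(r"^.*\berror\b.*$", flags)
unanchored = re.compile(r".*\berror\b.*$", flags)

# Begin the search at index 3, in the middle of the first line.
start = 3

# Without the caret, the match begins wherever the search begins,
# so only part of the first line is returned.
partial = unanchored.search(text, start).group()

# With the caret, the regex can only match from the start of a line,
# so the partial first line is skipped and the next complete line
# that contains "error" is returned instead.
complete = anchored.search(text, start).group()

print(partial)   # st error here
print(complete)  # third error line
```

Whether skipping the partially consumed line is the behavior you want depends on the application; the sketch only shows that the caret is what enforces it.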

Remember that the three key metacharacters used to restrict matches to a single line (the ^ and $ anchors, and the dot) do not have fixed meanings. To make them all line-oriented, you have to enable the option to let ^ and $ match at line breaks, and make sure that the option to let the dot match line breaks is not enabled. Recipe 3.4 shows how to apply these options in code. If you’re using JavaScript or Ruby, there is one less option to worry about: in JavaScript the dot never matches line breaks unless the s (dotAll) flag introduced in ECMAScript 2018 is set, and Ruby’s caret and dollar sign anchors always match at line breaks.
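As a minimal sketch of these options, assuming Python's re module, re.IGNORECASE provides case insensitivity and re.MULTILINE lets ^ and $ match at line breaks, while re.DOTALL is deliberately left off. The sample log text and variable names are invented for illustration.

```python
import re

log = (
    "Starting up\n"
    "Error: disk full\n"
    "All good\n"
    "fatal ERROR in module\n"
    "errors were ignored\n"
)

# Case insensitive; ^ and $ match at line breaks; the dot does NOT
# match line breaks because re.DOTALL is not set.
line_with_error = re.compile(r"^.*\berror\b.*$", re.IGNORECASE | re.MULTILINE)
matching_lines = line_with_error.findall(log)

print(matching_lines)
# ['Error: disk full', 'fatal ERROR in module']

# For contrast: with re.DOTALL also set, the dot crosses line breaks,
# so a single match swallows the whole text instead of single lines.
wrong_options = re.compile(
    r"^.*\berror\b.*$", re.IGNORECASE | re.MULTILINE | re.DOTALL
)
dotall_matches = wrong_options.findall(log)
print(len(dotall_matches))  # 1
```

The last line of the sample is not matched because "errors" is not the complete word "error".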

Variations

To search for lines that contain any one of multiple words, use alternation:

^.*\b(one|two|three)\b.*$
Regex options: Case insensitive, ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

This regular expression matches any line that contains at least one of the words “one,” “two,” or “three.” The parentheses around the words serve two purposes. First, they limit the reach of the alternation, and second, they capture the specific word that was found on the line to backreference 1. If the line contains more than one of the words, the backreference will hold the one that occurs farthest to the right. This is because the asterisk quantifier that appears before the parentheses is greedy, and will expand the dot to match as much text as possible. If you make the asterisk lazy, as with ^.*?\b(one|two|three)\b.*$, backreference 1 will contain the word from your list that appears farthest to the left.
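The effect of greedy versus lazy quantifiers on the captured word can be checked with a short sketch. It assumes Python's re module and word boundaries around the group so that only whole words count; the sample line is made up.

```python
import re

line = "one fish, two fish, three fish"
flags = re.IGNORECASE | re.MULTILINE

# Greedy leading .* grabs the whole line first and backtracks from the
# right, so the group captures the rightmost listed word.
greedy = re.search(r"^.*\b(one|two|three)\b.*$", line, flags)

# Lazy leading .*? expands from the left one character at a time,
# so the group captures the leftmost listed word.
lazy = re.search(r"^.*?\b(one|two|three)\b.*$", line, flags)

rightmost = greedy.group(1)
leftmost = lazy.group(1)

print(rightmost)  # three
print(leftmost)   # one
```

Both patterns match the same complete line; only the content of group 1 differs.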

To find lines that must contain multiple words, use lookahead:

^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).+$
Regex options: Case insensitive, ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

This regular expression uses positive lookahead to match lines that contain three required words anywhere within them. The .+ at the end is used to actually match the line, after the lookaheads have determined that the line meets the requirements.
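A minimal sketch of the lookahead approach, assuming Python's re module and word boundaries around each required word so that only whole words satisfy the test. The sample text is invented for illustration.

```python
import re

text = (
    "three, two, one, go\n"
    "one and two only\n"
    "Two three ONE\n"
    "anyone twofold threes"
)

# Each lookahead scans the current line (the dot cannot cross a line
# break) for one required word without consuming any text; .+ then
# matches the line itself.
all_three = re.compile(
    r"^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).+$",
    re.IGNORECASE | re.MULTILINE,
)
lines = all_three.findall(text)

print(lines)
# ['three, two, one, go', 'Two three ONE']
```

The order of the words on the line does not matter, because every lookahead starts scanning from the beginning of the line.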

See Also

Recipe 5.11 shows how to match complete lines that do not contain a particular word.

If you’re not concerned with matching complete lines, Recipe 5.1 describes how to match a specific word, and Recipe 5.2 shows how to match any of multiple words.

Recipe 3.21 includes code listings for searching through text line by line, which can simplify the process of searching within and identifying lines of interest.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
