5.1. Find a Specific Word

Problem

You’re given the simple task of finding all occurrences of the word cat, case insensitively. The catch is that it must appear as a complete word. You don’t want to find pieces of longer words, such as hellcat, application, or Catwoman.

Solution

Word boundary tokens make this a very easy problem to solve:

\bcat\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

The word boundaries at both ends of the regular expression ensure that cat is matched only when it appears as a complete word. More precisely, the word boundaries require that cat is set apart from other text by the beginning or end of the string, whitespace, punctuation, or other nonword characters.

Regular expression engines consider letters, numbers, and underscores to all be word characters. Recipe 2.6 is where we first talked about word boundaries, and covers them in greater detail.
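As a minimal sketch of the solution in JavaScript, the following applies the regex to a sample string. The sample text and the variable names are made up for illustration.

```javascript
// Hypothetical sample text; only "cat" as a complete word should match.
const sampleText = "My cat met Catwoman's hellcat. CAT! An application for a cat_sitter.";

// \b on both sides requires a word boundary; g finds all matches, i ignores case.
const wholeWordCat = /\bcat\b/gi;

// Collect the text of every match.
const foundWords = Array.from(sampleText.matchAll(wholeWordCat), m => m[0]);

console.log(foundWords); // → ["cat", "CAT"]
```

The cat in cat_sitter is skipped because the underscore counts as a word character, so no word boundary follows the t.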

A problem can occur when working with international text in JavaScript, PCRE, and Ruby, since those regular expression flavors consider only ASCII word characters when determining word boundaries. In other words, word boundaries are found only at the positions between a match of [^A-Za-z0-9_]|^ and [A-Za-z0-9_], or between [A-Za-z0-9_] and [^A-Za-z0-9_]|$. The same is true in Python when the UNICODE or U flag is not set. This prevents \b from being useful for a “whole word only” search within text that contains accented letters or words that use non-Latin scripts. For example, in JavaScript, PCRE, and Ruby, \büber\b will find a match within darüber, but not within dar über. In most cases, this is the exact opposite of what you would want. The problem occurs because ü is considered a nonword character, and a word boundary is therefore found between the two characters rü. No word boundary is found between a space character and ü, because they create a contiguous sequence of nonword characters.
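The über example can be checked directly in JavaScript. This is a small sketch with illustrative variable names; ü is written as the escape \u00FC so that the precomposed character is used regardless of file encoding.

```javascript
// \b in JavaScript treats only [A-Za-z0-9_] as word characters.
const asciiBoundary = /\b\u00FCber\b/; // i.e., \büber\b

// "darüber": a boundary is found between r (word) and ü (nonword).
const matchesInsideWord = asciiBoundary.test("dar\u00FCber");

// "dar über": space and ü are both nonword, so no boundary precedes ü.
const matchesStandalone = asciiBoundary.test("dar \u00FCber");

console.log(matchesInsideWord);  // → true
console.log(matchesStandalone);  // → false
```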

You can deal with this problem by using lookahead and lookbehind (collectively, lookaround; see Recipe 2.16) instead of word boundaries. Like word boundaries, lookarounds match zero-width positions. In PCRE (when compiled with UTF-8 support) and Ruby 1.9, you can emulate Unicode-based word boundaries using, for example, (?<=[^\p{L}\p{M}]|^)cat(?=[^\p{L}\p{M}]|$). This regular expression also uses Unicode Letter and Mark category tokens (\p{L} and \p{M}), which are discussed in Recipe 2.7. If you want the lookarounds to also treat any Unicode decimal numbers and connector punctuation (underscore and similar) as word characters, like \b does in regex flavors that correctly support Unicode, replace the two instances of [^\p{L}\p{M}] with [^\p{L}\p{M}\p{Nd}\p{Pc}].
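The same lookaround pattern can be tried in JavaScript, under the assumption that the engine implements ES2018 or later, which added lookbehind and \p{...} category tokens (the latter only with the u flag). The helper wholeWord and its parameters are hypothetical names, and the sketch assumes the word contains no regex metacharacters.

```javascript
// Build a whole-word regex from a word and a set of "word character" tokens.
// Assumption: `word` needs no escaping; requires an ES2018+ engine.
function wholeWord(word, wordChars) {
  return new RegExp(
    "(?<=[^" + wordChars + "]|^)" + word + "(?=[^" + wordChars + "]|$)",
    "iu"
  );
}

// Letters and marks only, as in the first regex in the text.
const lettersOnly = wholeWord("cat", "\\p{L}\\p{M}");
// Also treat decimal numbers and connector punctuation as word characters.
const lettersDigitsConnectors = wholeWord("cat", "\\p{L}\\p{M}\\p{Nd}\\p{Pc}");

console.log(lettersOnly.test("le cat!"));                // → true
console.log(lettersOnly.test("\u00E9cat"));              // → false (é is a letter)
console.log(lettersOnly.test("e\u0301cat"));             // → false (U+0301 is a mark)
console.log(lettersOnly.test("cat_sitter"));             // → true
console.log(lettersDigitsConnectors.test("cat_sitter")); // → false
```

The last two lines show the effect of adding \p{Nd} and \p{Pc}: the underscore stops counting as a separator, which mirrors how \b treats it.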

JavaScript (before ES2018) and Ruby 1.8 support neither lookbehind nor Unicode categories. You can work around the lack of lookbehind support by matching the nonword character preceding each match, and then either removing it from each match using procedural code or putting it back into the string when replacing matches (see the examples of using parts of a match in a replacement string in Recipe 3.15). The additional lack of support for matching Unicode categories (coupled with the fact that both programming languages’ \w and \W tokens consider only ASCII word characters) means you might need to make do with a more restrictive solution. Code points in the Letter and Mark categories are scattered throughout Unicode’s character set, so it would take thousands of characters to emulate [^\p{L}\p{M}] using Unicode escape sequences and character class ranges. A good compromise might be [^A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\xFF], which matches all except Unicode letter characters in eight-bit address space (i.e., the first 256 Unicode code points, from positions 0x00 to 0xFF). There are no code points in the Mark category within this range. See Figure 5-1 for the list of nonmatched characters. This negated character class lets you exclude (or in nonnegated form, match) some of the most commonly used accented characters.
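The "remove it using procedural code" option might look like the following sketch, which avoids lookbehind entirely. The names latin1Letters, catFinder, and findWholeWordCats are hypothetical. The check after the word uses lookahead so that it consumes nothing.

```javascript
// Letters in the first 256 code points, passed to the regex as \xHH escapes.
var latin1Letters = "A-Za-z\\xAA\\xB5\\xBA\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\xFF";

// Group 1 consumes the preceding nonletter (or nothing at the start of the
// string); group 2 is the word itself.
var catFinder = new RegExp(
  "([^" + latin1Letters + "]|^)(cat)(?=[^" + latin1Letters + "]|$)", "gi");

function findWholeWordCats(text) {
  var results = [], m;
  catFinder.lastIndex = 0; // reset state left over from a previous search
  while ((m = catFinder.exec(text)) !== null) {
    // Skip the consumed prefix to report where the word really starts.
    results.push({ index: m.index + m[1].length, word: m[2] });
  }
  return results;
}

var catMatches = findWholeWordCats("cat, \u00E9cat, Cat cat");
console.log(catMatches);
// → [ { index: 0, word: "cat" }, { index: 11, word: "Cat" }, { index: 15, word: "cat" } ]
```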

Figure 5-1. Unicode letter characters in eight-bit address space
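The characters covered by the nonnegated class can also be listed programmatically. The sketch below enumerates the first 256 code points and, assuming an engine with ES2018 \p{...} support, cross-checks the hand-written class against the Unicode Letter and Mark categories. All variable names are illustrative.

```javascript
// The nonnegated form of the compromise character class from the text.
const latin1LetterClass = /[A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\xFF]/;

const latin1LetterChars = [];
let agreesWithUnicodeLetters = true;
let marksInRange = 0;

for (let codePoint = 0x00; codePoint <= 0xFF; codePoint++) {
  const ch = String.fromCharCode(codePoint);
  const inClass = latin1LetterClass.test(ch);
  if (inClass) latin1LetterChars.push(ch);
  // Cross-check against the Unicode categories (needs the u flag).
  if (inClass !== /\p{L}/u.test(ch)) agreesWithUnicodeLetters = false;
  if (/\p{M}/u.test(ch)) marksInRange++;
}

console.log(latin1LetterChars.length);   // → 117
console.log(agreesWithUnicodeLetters);   // → true
console.log(marksInRange);               // → 0
console.log(latin1LetterChars.join(" "));
```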

Following is an example of how to replace all instances of the word “cat” with “dog” in JavaScript. It correctly accounts for common accented characters, so écat is not altered. To do this, you’ll need to construct your own character class instead of relying on the built-in \b or \w:

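// 8-bit-wide letter characters
var pL = "A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\xFF",
    // The check after the word is a lookahead, so it consumes nothing;
    // otherwise the second word in "cat cat" would be skipped.
    pattern = "([^{L}]|^)cat(?=[^{L}]|$)".replace(/{L}/g, pL),
    regex = new RegExp(pattern, "gi");

// replace cat with dog, and put back the
// preceding character matched by group 1
subject = subject.replace(regex, "$1dog");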

Note that JavaScript string literals use \xHH (where HH is a two-digit hexadecimal number) to insert special characters. Hence, the pL variable that is passed to the regular expression actually ends up containing the literal versions of the characters. If you wanted the \xHH metasequences to be passed through to the regex itself, you would have to escape the backslashes in the string literal (i.e., "\\xHH"). However, in this case it doesn’t matter and will not change what the regular expression matches.
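The difference between the two spellings shows up in the regex source but not in what it matches. This is a small sketch using the single character é (0xE9) with illustrative variable names.

```javascript
// "\xE9" is resolved by the string literal: the regex receives é itself.
const fromLiteralChar = new RegExp("[\xE9]");

// "\\xE9" keeps the backslash: the regex receives the four characters \xE9
// and interprets the escape on its own.
const fromEscapeSequence = new RegExp("[\\xE9]");

console.log(fromLiteralChar.source);            // → [é]
console.log(fromEscapeSequence.source);         // → [\xE9]
console.log(fromLiteralChar.test("\u00E9"));    // → true
console.log(fromEscapeSequence.test("\u00E9")); // → true
```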

See Also

This chapter has a variety of recipes that deal with matching words. Recipe 5.2 explains how to find any of multiple words. Recipe 5.3 explains how to find similar words. Recipe 5.4 explains how to find all except a specific word. Recipe 5.10 explains how to match complete lines that contain a word.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.7 explains how to match Unicode characters. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.16 explains lookaround.
