You want to find any one out of a list of words, without having to search through the subject string multiple times.
The simple solution is to alternate between the words you want to match:
\b(?:one|two|three)\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
More complex examples of matching similar words are shown in Recipe 5.3.
var subject = "One times two plus one equals three.";

// Solution 1:
var regex = /\b(?:one|two|three)\b/gi;
subject.match(regex);
// Returns an array with four matches: ["One","two","one","three"]

// Solution 2 (reusable):
// This function does the same thing but accepts an array of words to
// match. Any regex metacharacters within the accepted words are escaped
// with a backslash before searching.
function matchWords(subject, words) {
    // The doubled backslash puts a literal backslash into the character class
    var regexMetachars = /[(){[*+?.\\^$|]/g;
    for (var i = 0; i < words.length; i++) {
        // "\\$&" inserts a backslash followed by the matched metacharacter
        words[i] = words[i].replace(regexMetachars, "\\$&");
    }
    // "\\b" in a string literal becomes the \b word boundary in the regex
    var regex = new RegExp("\\b(?:" + words.join("|") + ")\\b", "gi");
    return subject.match(regex) || [];
}

matchWords(subject, ["one","two","three"]);
// Returns an array with four matches: ["One","two","one","three"]
There are three parts to this regular expression: the
word boundaries on both ends, the noncapturing group, and the list of
words (each separated by the ‹|› alternation operator). The word
boundaries ensure that the regex does not match part of a longer word.
The noncapturing group limits the reach of the alternation operators;
otherwise, you’d need to write ‹\bone\b|\btwo\b|\bthree\b› to achieve
the same effect. Each of the words simply matches itself.
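To illustrate why the group matters, the following sketch compares the grouped regex with an ungrouped version in which the word boundaries attach only to the first and last alternatives. The sample sentence and variable names are made up for this illustration.

```javascript
var text = "The network is twofold, not one-sided.";

// Ungrouped: the leading \b belongs only to "one", the trailing \b only
// to "three", and "two" has no boundary check at all.
var ungrouped = /\bone|two|three\b/gi;

// Grouped: both word boundaries apply to every alternative.
var grouped = /\b(?:one|two|three)\b/gi;

var ungroupedMatches = text.match(ungrouped) || [];
// ["two","two","one"] (the "two" inside "network" and "twofold" slips through)

var groupedMatches = text.match(grouped) || [];
// ["one"]
```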
Since the words are surrounded on both sides by word boundaries,
they can appear in any order. Without the word boundaries, however, it
might be important to put longer words first; otherwise, you’d never
find “awesome” when searching for ‹awe|awesome›. The regex would
always just match the “awe” at the beginning of the word.
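The ordering effect can be checked with a small sketch like the one below. The sample phrase and variable names are illustrative only.

```javascript
var phrase = "That was awesome.";

var shortFirst = /awe|awesome/;      // "awe" is tried first and succeeds
var longFirst = /awesome|awe/;       // the longer word is tried first
var bounded = /\b(?:awe|awesome)\b/; // boundaries make the order irrelevant

var shortFirstMatch = phrase.match(shortFirst)[0]; // "awe"
var longFirstMatch = phrase.match(longFirst)[0];   // "awesome"
var boundedMatch = phrase.match(bounded)[0];       // "awesome"
```

In the bounded version, the engine first tries ‹awe›, fails at the trailing word boundary because an “s” follows, and then backtracks to the ‹awesome› alternative.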
Because the regex engine attempts to match each word in the list from left to right, you might see a very slight performance gain by placing the words that are most likely to be found in the subject text near the beginning of the list.
Note that this regular expression is meant to generically
demonstrate matching one out of a list of words. Because both the
‹two› and ‹three› in this example start with the same letter, you can
more efficiently guide the regular expression engine by rewriting the
regex as ‹\b(?:one|t(?:wo|hree))\b›. Don’t go crazy with such
hand-tuning, though. Most regex engines try to perform this
optimization for you automatically, at least in simple cases. See
Recipe 5.3 for more examples of how to efficiently match one out of a
list of similar words.
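As a quick sanity check, the sketch below confirms that the hand-tuned regex finds the same matches as the plain one on a sample sentence. It does not measure speed, and the sample text and variable names are assumptions made for this illustration.

```javascript
var sample = "One times two plus one equals three, not thirteen.";

var plain = /\b(?:one|two|three)\b/gi;
var tuned = /\b(?:one|t(?:wo|hree))\b/gi; // shared "t" factored out

var plainResult = sample.match(plain) || [];
var tunedResult = sample.match(tuned) || [];
// Both: ["One","two","one","three"]
```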
The JavaScript example matches the same list of words in two
different ways. The first approach is to simply create the regex and
search the subject string using the match() method that is available
for JavaScript strings. When the match() method is passed a regular
expression that uses the /g (global) flag, it returns an array of all
matches found in the string, or null if no match is found.
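A short sketch of both outcomes follows. The test strings and variable names are illustrative only.

```javascript
var wordList = /\b(?:one|two|three)\b/gi;

var found = "One plus two".match(wordList);     // ["One","two"]
var missing = "four plus five".match(wordList); // null

// Falling back to an empty array lets callers read .length safely.
var safe = "four plus five".match(wordList) || [];
```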
The second approach creates a function called matchWords() that
accepts a string to search within and an array of words to search for.
The function first escapes any regex metacharacters that might exist in
the provided words (see Recipe 2.1), and then splices the word list
into a new regular expression. That regex is then used to search the
string for all of the target words at once, rather than searching for
words one at a time in a loop. The function returns an array of any
matches that are found, or an empty array if the generated regex
doesn’t match the string at all. The desired words can be matched in
any combination of upper- and lowercase, thanks to the use of the
case-insensitive (/i) flag.
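One thing to be aware of is that matchWords() overwrites the entries of the array it is given with their escaped forms. If that side effect is unwanted, a variation along the following lines builds the escaped list separately. The name matchWordsCopy() and the sample data are hypothetical, and this is only one possible way to write it.

```javascript
// Hypothetical variant of matchWords() that leaves its input array untouched.
function matchWordsCopy(subject, words) {
    var regexMetachars = /[(){[*+?.\\^$|]/g;
    var escaped = [];
    for (var i = 0; i < words.length; i++) {
        // Escape into a new array instead of assigning back to words[i]
        escaped.push(words[i].replace(regexMetachars, "\\$&"));
    }
    var regex = new RegExp("\\b(?:" + escaped.join("|") + ")\\b", "gi");
    return subject.match(regex) || [];
}

var list = ["a.b", "two"];
var hits = matchWordsCopy("a.b and axb make TWO", list);
// hits is ["a.b","TWO"]; the escaped dot does not match the "x" in "axb",
// and list still holds ["a.b","two"].
```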
This chapter has a variety of recipes that deal with matching words. Recipe 5.1 explains how to find a specific word. Recipe 5.3 explains how to find similar words. Recipe 5.4 explains how to find all except a specific word.
Recipe 4.11 shows how to validate affirmative responses, and similarly matches any of several words.
Some programming languages have a built-in function for escaping regular expression metacharacters, as explained in Recipe 5.14.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping.