5.2. Find Any of Multiple Words

Problem

You want to find any one out of a list of words, without having to search through the subject string multiple times.

Solution

Using alternation

The simple solution is to alternate between the words you want to match:

\b(?:one|two|three)\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

More complex examples of matching similar words are shown in Recipe 5.3.

Example JavaScript solution

var subject = "One times two plus one equals three.";

// Solution 1:

var regex = /\b(?:one|two|three)\b/gi;

subject.match(regex);
// Returns an array with four matches: ["One","two","one","three"]

// Solution 2 (reusable):

// This function does the same thing but accepts an array of words to
// match. Any regex metacharacters within the accepted words are escaped
// with a backslash before searching.

function matchWords(subject, words) {
    // The doubled backslash makes the class match a literal backslash too.
    var regexMetachars = /[(){[*+?.\\^$|]/g;

    for (var i = 0; i < words.length; i++) {
        // "\\$&" inserts a literal backslash before the matched metacharacter.
        words[i] = words[i].replace(regexMetachars, "\\$&");
    }

    // "\\b" yields the regex token \b; a bare "\b" in a string is a backspace.
    var regex = new RegExp("\\b(?:" + words.join("|") + ")\\b", "gi");

    return subject.match(regex) || [];
}

matchWords(subject, ["one","two","three"]);
// Returns an array with four matches: ["One","two","one","three"]

Discussion

Using alternation

There are three parts to this regular expression: the word boundaries on both ends, the noncapturing group, and the list of words (each separated by the | alternation operator). The word boundaries ensure that the regex does not match part of a longer word. The noncapturing group limits the reach of the alternation operators; otherwise, you’d need to write \bone\b|\btwo\b|\bthree\b to achieve the same effect. Each of the words simply matches itself.

Since the words are surrounded on both sides by word boundaries, they can appear in any order. Without the word boundaries, however, it might be important to put longer words first; otherwise, you’d never find “awesome” when searching for awe|awesome. The regex would always just match the “awe” at the beginning of the word.
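
The effect of ordering can be checked with a short sketch. The subject string and variable names here are made up for illustration only:

```javascript
var text = "That was awesome.";

// Without word boundaries, the first alternative that matches wins.
var shortFirst = text.match(/awe|awesome/g);   // ["awe"]
var longFirst = text.match(/awesome|awe/g);    // ["awesome"]

// With word boundaries, order no longer matters. After "awe" matches,
// the trailing \b fails (the next character is "s"), so the engine
// backtracks and tries the "awesome" alternative instead.
var bounded = text.match(/\b(?:awe|awesome)\b/g);  // ["awesome"]
```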

Tip

Because the regex engine attempts to match each word in the list from left to right, you might see a very slight performance gain by placing the words that are most likely to be found in the subject text near the beginning of the list.

Note that this regular expression is meant to generically demonstrate matching one out of a list of words. Because two and three in this example start with the same letter, you can guide the regular expression engine more efficiently by rewriting the regex as \b(?:one|t(?:wo|hree))\b. Don’t go crazy with such hand-tuning, though. Most regex engines try to perform this optimization for you automatically, at least in simple cases. See Recipe 5.3 for more examples of how to efficiently match one out of a list of similar words.
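
As a quick sanity check, the following sketch runs the plain and hand-tuned versions against the sample sentence from the solution. It only confirms that both find the same matches; it says nothing about speed, which depends on the engine. The variable names are illustrative.

```javascript
var subject = "One times two plus one equals three.";

var plain = /\b(?:one|two|three)\b/gi;
var tuned = /\b(?:one|t(?:wo|hree))\b/gi;

var plainMatches = subject.match(plain);
var tunedMatches = subject.match(tuned);

// Both arrays hold the same four matches: "One", "two", "one", "three".
// "times" starts with "t" but is rejected because neither "wo" nor
// "hree" follows.
var sameResult = plainMatches.join(",") === tunedMatches.join(",");
```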

Example JavaScript solution

The JavaScript example matches the same list of words in two different ways. The first approach is to simply create the regex and search the subject string using the match() method that is available for JavaScript strings. When the match() method is passed a regular expression that uses the /g (global) flag, it returns an array of all matches found in the string, or null if no match is found.
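
The difference between the two return values is easy to overlook, so here is a minimal sketch of both cases. The subject strings are invented for illustration:

```javascript
var regex = /\b(?:one|two|three)\b/gi;

// At least one match: match() returns an array of the matched text.
var found = "One plus two".match(regex);     // ["One", "two"]

// No match at all: match() returns null, not an empty array.
var none = "four plus five".match(regex);    // null

// Falling back to [] keeps later code such as .length safe.
var count = ("four plus five".match(regex) || []).length;  // 0
```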

The second approach creates a function called matchWords() that accepts a string to search within and an array of words to search for. The function first escapes any regex metacharacters that might exist in the provided words (see Recipe 2.1), and then splices the word list into a new regular expression. That regex is then used to search the string for all of the target words at once, rather than searching for words one at a time in a loop. The function returns an array of any matches that are found, or an empty array if the generated regex doesn’t match the string at all. The desired words can be matched in any combination of upper- and lowercase, thanks to the use of the case-insensitive (/i) flag.
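
To see why the escaping step matters, consider a word that contains a dot. The sketch below is a variation on matchWords(), not a replacement for it: the names escapeWord and matchWordsCopy are hypothetical, and the variant uses map() to build a new array, whereas the original function overwrites the entries of the array passed to it.

```javascript
// Hypothetical helper: escapes regex metacharacters in a single word.
function escapeWord(word) {
    return word.replace(/[(){[*+?.\\^$|]/g, "\\$&");
}

// Variant of matchWords() that leaves the caller's word list unchanged.
function matchWordsCopy(subject, words) {
    var escaped = words.map(escapeWord);
    var regex = new RegExp("\\b(?:" + escaped.join("|") + ")\\b", "gi");
    return subject.match(regex) || [];
}

var words = ["3.14", "two"];
var result = matchWordsCopy("Two is less than 3.14 but 3x14 is not a number.", words);
// result is ["Two", "3.14"]; words still holds the unescaped strings.
```

Without the escaping step, the dot in 3.14 would match any character, and 3x14 would be reported as a match as well.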

See Also

This chapter has a variety of recipes that deal with matching words. Recipe 5.1 explains how to find a specific word. Recipe 5.3 explains how to find similar words. Recipe 5.4 explains how to find all except a specific word.

Recipe 4.11 shows how to validate affirmative responses, and similarly matches any of several words.

Some programming languages have a built-in function for escaping regular expression metacharacters, as explained in Recipe 5.14.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping.
