5.8. Find Repeated Words

Problem

You’re editing a document and would like to check it for any incorrectly repeated words. You want to find these doubled words despite capitalization differences, such as with “The the.” You also want to allow differing amounts of whitespace between words, even if this causes the words to extend across more than one line. Any separating punctuation, however, should cause the words to no longer be treated as if they are repeating.

Solution

A backreference matches something that has been matched before, and therefore provides the key ingredient for this recipe:

\b([A-Z]+)\s+\1\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If you want to use this regular expression to keep the first word but remove subsequent duplicate words, replace all matches with backreference 1. Another approach is to highlight matches by surrounding them with other characters (such as an HTML tag), so you can more easily identify them during later inspection. Recipe 3.15 shows how you can use backreferences in your replacement text, which you’ll need to do to implement either of these approaches.
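Both replacement approaches can be sketched in a few lines of Python. This is only an illustration, not the code from Recipe 3.15: the function names and the choice of a <b> tag are assumptions made for the example, and re.IGNORECASE stands in for the "case insensitive" option.

```python
import re

# The recipe's regex; re.IGNORECASE supplies the "case insensitive" option.
REPEATED_WORD = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

def remove_repeats(text):
    # Replace each match with backreference 1, keeping only the first word.
    return REPEATED_WORD.sub(r"\1", text)

def highlight_repeats(text):
    # Wrap the whole match (\g<0>) in an HTML tag for later inspection.
    return REPEATED_WORD.sub(r"<b>\g<0></b>", text)

sample = "The the quick brown fox fox."
print(remove_repeats(sample))
print(highlight_repeats(sample))
```

Highlighting is the safer of the two when you are not sure every repetition is a mistake, since nothing is deleted.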

If you just want to find repeated words so you can manually examine whether they need to be corrected, Recipe 3.7 shows the code you need. A text editor or grep-like tool, such as those mentioned in Tools for Working with Regular Expressions in Chapter 1, can help you find repeated words while providing the context needed to determine whether the words in question are in fact used correctly.
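If you would rather script the search than use an editor, something like the following Python sketch lists each repeated word along with the line on which it starts. The find_repeats helper and the sample text are hypothetical; they are not taken from Recipe 3.7.

```python
import re

REPEATED_WORD = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

def find_repeats(text):
    """Return (line_number, matched_text) for each repeated word."""
    results = []
    for match in REPEATED_WORD.finditer(text):
        # Count the line breaks before the match to get a 1-based line number.
        line_number = text.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group(0)))
    return results

text = "Paris in the\nthe spring.\nIt was was warm."
for line_number, found in find_repeats(text):
    print(line_number, repr(found))
```

Because the separator may include a line break, a match can span two lines; the reported number is the line on which the first word appears.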

Discussion

There are two things needed to match something that was previously matched: a capturing group and a backreference. Place the thing you want to match more than once inside a capturing group, and then match it again using a backreference. This works differently from simply repeating a token or group using a quantifier. Consider the difference between the simplified regular expressions (\w)\1 and \w{2}. The first regex uses a capturing group and backreference to match the same word character twice, whereas the latter uses a quantifier to match any two word characters. Recipe 2.10 discusses the magic of backreferences in greater depth.
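The difference between a backreference and a quantifier is easy to see by running both simplified regexes over the same text. The sample string and variable names below are made up for this Python sketch.

```python
import re

text = "book keeper"

# Capturing group plus backreference: the same word character twice in a row.
doubled = [m.group(0) for m in re.finditer(r"(\w)\1", text)]

# Quantifier: any two consecutive word characters.
pairs = [m.group(0) for m in re.finditer(r"\w{2}", text)]

print(doubled)
print(pairs)
```

The first list holds only genuinely doubled characters, while the second simply chops each word into two-character pieces.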

Back to the problem at hand. This recipe only finds repeated words that are composed of letters from A to Z and a to z (since the case insensitive option is enabled). To also allow accented letters and letters from other scripts, you can use the Unicode Letter category \p{L} if your regex flavor supports it (see Unicode category).

Between the capturing group and backreference, \s+ matches any whitespace characters, such as spaces, tabs, or line breaks. If you want to restrict the characters that can separate repeated words to horizontal whitespace (i.e., no line breaks), replace the \s with [ \t\xA0]. This prevents matching repeated words that appear across multiple lines. The \xA0 in the character class matches a no-break space, which is sometimes found in text copied and pasted from the Web (most web developers are familiar with using &nbsp; to insert a no-break space in their content). PCRE 7.2 and Perl 5.10 include the shorthand character class \h that you might prefer to use here since it is specifically designed to match horizontal whitespace, and matches some additional esoteric horizontal whitespace characters.
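To see the effect of restricting the separator, compare the two versions side by side. In this Python sketch the horizontal-only variant is assumed to use a character class holding a space, a tab, and a no-break space; the pattern names are illustrative.

```python
import re

# Any whitespace, including line breaks, may separate the words.
ANY_WHITESPACE = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

# Only space, tab, or no-break space (\xA0) may separate the words.
HORIZONTAL_ONLY = re.compile(r"\b([A-Z]+)[ \t\xA0]+\1\b", re.IGNORECASE)

across_lines = "the\nthe"
same_line = "the\u00a0\tthe"

print(bool(ANY_WHITESPACE.search(across_lines)))   # line break is allowed
print(bool(HORIZONTAL_ONLY.search(across_lines)))  # line break is rejected
print(bool(HORIZONTAL_ONLY.search(same_line)))     # no-break space and tab pass
```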

Finally, the word boundaries at the beginning and end of the regular expression ensure that it doesn’t match within other words (e.g., with “this thistle”).
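A quick Python check shows what the word boundaries buy you. The version without boundaries is included only for contrast and is not part of the recipe.

```python
import re

WITH_BOUNDARIES = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)
WITHOUT_BOUNDARIES = re.compile(r"([A-Z]+)\s+\1", re.IGNORECASE)

text = "this thistle"

# Without \b, "this" is found again at the start of "thistle".
loose = WITHOUT_BOUNDARIES.search(text)
print(loose.group(0) if loose else None)

# With \b, the second "this" must end at a word boundary, so nothing matches.
strict = WITH_BOUNDARIES.search(text)
print(strict.group(0) if strict else None)
```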

Note that the use of repeated words is not always incorrect, so simply removing them without examination is potentially dangerous. For example, the constructions “that that” and “had had” are generally accepted in colloquial English. Homonyms, names, onomatopoeic words (such as “oink oink” or “ha ha”), and some other constructions also occasionally result in intentionally repeated words. In most cases you should visually examine each match.

Variations

The solution shown earlier was intentionally kept simple. That simplicity came at the cost of not accounting for a variety of special cases:

  • Repeated words that use letters with accents or other diacritical marks, such as “café café” or “naïve naïve.”

  • Repeated words that include hyphens, single quotes, or right single quotes, such as “co-chair co-chair,” “don’t don’t,” or “rollin’ rollin.’”

  • Repeated words written in a non-English alphabet, such as the Russian words “друзья друзья.”

Dealing with these issues prevents us from relying on the \b word boundary token, which we previously used to ensure that complete words only are matched. There are two reasons \b won’t work when accounting for the special cases just mentioned. First, hyphens and apostrophes are not word characters, so there is no word boundary to match between the whitespace or punctuation that separates words, and a hyphen or apostrophe that appears at the beginning or end of a word. Second, \b is not Unicode aware in some regex flavors (see Word Characters in Recipe 2.6), so it won’t always work correctly if your data uses letters other than A to Z without diacritics.

Instead of \b, we’ll therefore need to use lookahead and lookbehind (see Recipe 2.16) to make sure that we still match complete words only. We’ll also use Unicode categories (see Recipe 2.7) to match letters (\p{L}) and diacritical marks (\p{M}) in any alphabet or script:

(?<![\p{L}\p{M}\-'\u2019])([\-'\u2019]?(?:[\p{L}\p{M}][\-'\u2019]?)+)↵
\s+\1(?![\p{L}\p{M}\-'\u2019])
Regex options: Case insensitive
Regex flavors: .NET, Java, Ruby 1.9

Even though \p{L} matches letters in any casing, you still need to enable the “case insensitive” option, because the backreference matched by \1 might use different casing than the initially matched word.

The \u2019 tokens in the regular expression match a right single quote mark (’). Perl and PCRE use a different syntax for matching individual Unicode code points, so we need to change the regex slightly for them:

(?<![\p{L}\p{M}\-'\x{2019}])([\-'\x{2019}]?(?:[\p{L}\p{M}]↵
[\-'\x{2019}]?)+)\s+\1(?![\p{L}\p{M}\-'\x{2019}])
Regex options: Case insensitive
Regex flavors: Java 7, PCRE, Perl

Neither of these regexes works in JavaScript, Python, or Ruby 1.8, because those flavors lack support for Unicode categories like \p{L}. JavaScript and Ruby 1.8 additionally lack support for lookbehind.

Following are several examples of repeated words that these regexes will match:

  • The the

  • café café

  • друзья друзья

  • don't don't

  • rollin’ rollin’

  • O’Keeffe’s O’Keeffe’s

  • co-chair co-chair

  • devil-may-care devil-may-care

Here are some examples of strings that are not matched:

  • hello, hello

  • 1000 1000

  • - -

  • test’’ing test’’ing

  • one--two one--two
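Python's built-in re module is one of the flavors without \p{L}, but the two lists above can still be checked with a rough stand-in. The following sketch is an approximation, not one of the recipe's regexes. It assumes that [^\W\d_] (a word character that is neither a digit nor an underscore) is an acceptable substitute for \p{L}, and it has no equivalent of \p{M}, so words written with separate combining marks are not handled. Each lookaround is split in two because that negated class cannot be merged with the punctuation class.

```python
import re

LETTER = r"[^\W\d_]"       # assumption: approximate stand-in for \p{L}
JOINER = r"[\-'\u2019]"    # hyphen, apostrophe, right single quote

REPEATED_WORD = re.compile(
    rf"(?<!{LETTER})(?<!{JOINER})"          # not preceded by part of a word
    rf"({JOINER}?(?:{LETTER}{JOINER}?)+)"   # the word, captured as group 1
    rf"\s+\1"                               # whitespace, then the same word
    rf"(?!{LETTER})(?!{JOINER})",           # not followed by part of a word
    re.IGNORECASE,
)

def is_repeated(text):
    return REPEATED_WORD.search(text) is not None

matched = ["The the", "café café", "друзья друзья", "don't don't",
           "rollin\u2019 rollin\u2019", "O\u2019Keeffe\u2019s O\u2019Keeffe\u2019s",
           "co-chair co-chair", "devil-may-care devil-may-care"]
not_matched = ["hello, hello", "1000 1000", "- -",
               "test\u2019\u2019ing test\u2019\u2019ing", "one--two one--two"]

for sample in matched + not_matched:
    print(repr(sample), is_repeated(sample))
```

If the third-party regex package is an option, it may be simpler to use the Unicode categories directly rather than this approximation.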

See Also

Recipe 5.9 shows how to match repeated lines of text.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.7 explains how to match Unicode characters. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
