You’re editing a document and would like to check it for any incorrectly repeated words. You want to find these doubled words despite capitalization differences, such as with “The the.” You also want to allow differing amounts of whitespace between words, even if this causes the words to extend across more than one line. Any separating punctuation, however, should cause the words to no longer be treated as if they are repeating.
A backreference matches something that has been matched before, and therefore provides the key ingredient for this recipe:
\b([A-Z]+)\s+\1\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
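As a minimal sketch, this is one way the solution could be applied with Python's `re` module. The sample text and the variable names are invented for illustration.

```python
import re

# The solution regex; re.IGNORECASE supplies the "case insensitive" option.
repeated_word = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

# Hypothetical sample: a doubled word with different casing, a doubled word
# split across a line break, and a pair separated by punctuation.
sample = "The the cat sat on\non the mat. Hello, hello."

matches = [m.group(0) for m in repeated_word.finditer(sample)]
print(matches)  # ['The the', 'on\non']
```

The pair "Hello, hello" is not reported because the comma sits between the two words.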
If you want to use this regular expression to keep the first word but remove subsequent duplicate words, replace all matches with backreference 1. Another approach is to highlight matches by surrounding them with other characters (such as an HTML tag), so you can more easily identify them during later inspection. Recipe 3.15 shows how you can use backreferences in your replacement text, which you’ll need to do to implement either of these approaches.
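Both replacement approaches can be sketched in Python as follows. The sample sentence and the `<b>` tag are arbitrary choices for illustration, not something the recipe prescribes.

```python
import re

repeated_word = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)
text = "Paris in the the spring"  # hypothetical input

# Keep the first word only: replace each match with backreference 1.
deduplicated = repeated_word.sub(r"\1", text)

# Or wrap each match in a tag for later review; \g<0> is the whole match.
highlighted = repeated_word.sub(r"<b>\g<0></b>", text)

print(deduplicated)  # Paris in the spring
print(highlighted)   # Paris in <b>the the</b> spring
```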
If you just want to find repeated words so you can manually examine whether they need to be corrected, Recipe 3.7 shows the code you need. A text editor or grep-like tool, such as those mentioned in Tools for Working with Regular Expressions in Chapter 1, can help you find repeated words while providing the context needed to determine whether the words in question are in fact used correctly.
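If no such tool is at hand, a short script can list each match with its line number for manual review. This is only a sketch; the sample document and the report format are assumptions made for the example.

```python
import re

repeated_word = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

# Hypothetical three-line document.
document = "First line is fine.\nThis is is wrong.\nShe had had enough."

report = []
for m in repeated_word.finditer(document):
    # Count the line breaks before the match to get a 1-based line number.
    line_no = document.count("\n", 0, m.start()) + 1
    report.append((line_no, m.group(0)))

for line_no, words in report:
    print(f"line {line_no}: {words!r}")
```

Here both "is is" and "had had" are reported, and only a human reader can tell that the second one may be intentional.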
Two things are needed to match something that was previously matched: a capturing group and a backreference. Place the thing you want to match more than once inside a capturing group, and then match it again using a backreference. This works differently from simply repeating a token or group using a quantifier. Consider the difference between the simplified regular expressions ‹(\w)\1› and ‹\w{2}›. The first regex uses a capturing group and backreference to match the same word character twice, whereas the latter uses a quantifier to match any two word characters. Recipe 2.10 discusses the magic of backreferences in greater depth.
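The difference between the two simplified regexes can be checked with a small Python sketch. The test strings are arbitrary.

```python
import re

same_twice = re.compile(r"(\w)\1")  # one word character, then that same character again
any_two = re.compile(r"\w{2}")      # any two word characters

results = {
    s: (bool(same_twice.fullmatch(s)), bool(any_two.fullmatch(s)))
    for s in ("aa", "ab")
}
print(results)  # {'aa': (True, True), 'ab': (False, True)}
```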
Back to the problem at hand. This recipe only finds repeated words that are composed of letters from A to Z and a to z (since the case insensitive option is enabled). To also allow accented letters and letters from other scripts, you can use the Unicode Letter category ‹\p{L}› if your regex flavor supports it (see Unicode category).
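Python's built-in `re` module is one flavor that does not support ‹\p{L}›. As a rough stand-in (an assumption of this sketch, not part of the recipe), the class `[^\W\d_]` matches word characters that are neither decimal digits nor the underscore, which approximates "any letter" for Python 3 strings. It is not an exact equivalent of the Letter category.

```python
import re

ascii_only = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)

# Approximation of \p{L}: a word character that is not a digit or underscore.
any_letter = re.compile(r"\b([^\W\d_]+)\s+\1\b", re.IGNORECASE)

samples = ["café café", "друзья друзья"]
ascii_hits = [bool(ascii_only.search(s)) for s in samples]
letter_hits = [bool(any_letter.search(s)) for s in samples]

print(ascii_hits)   # [False, False]
print(letter_hits)  # [True, True]
```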
Between the capturing group and backreference, ‹\s+› matches any whitespace characters, such as spaces, tabs, or line breaks. If you want to restrict the characters that can separate repeated words to horizontal whitespace (i.e., no line breaks), replace the ‹\s› with ‹[●\t\xA0]›, where ● stands for a literal space character. This prevents matching repeated words that appear across multiple lines. The ‹\xA0› in the character class matches a no-break space, which is sometimes found in text copied and pasted from the Web (most web developers are familiar with using &nbsp; to insert a no-break space in their content). PCRE 7.2 and Perl 5.10 include the shorthand character class ‹\h› that you might prefer to use here since it is specifically designed to match horizontal whitespace, and matches some additional esoteric horizontal whitespace characters.
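A brief Python sketch of the horizontal-whitespace variant, using a class made of a space, a tab, and a no-break space. The test strings are made up for illustration.

```python
import re

# Only space, tab, or no-break space (U+00A0) may separate the two words.
same_line_only = re.compile(r"\b([A-Z]+)[ \t\xA0]+\1\b", re.IGNORECASE)

on_one_line = bool(same_line_only.search("the \t the"))
with_nbsp = bool(same_line_only.search("the\u00A0the"))
across_lines = bool(same_line_only.search("the\nthe"))

print(on_one_line, with_nbsp, across_lines)  # True True False
```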
Finally, the word boundaries at the beginning and end of the regular expression ensure that it doesn’t match within other words (e.g., with “this thistle”).
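To see what the word boundaries prevent, the following Python sketch compares the regex with and without them on the phrase from the example above.

```python
import re

with_boundaries = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)
without_boundaries = re.compile(r"([A-Z]+)\s+\1", re.IGNORECASE)

phrase = "this thistle"

loose = without_boundaries.search(phrase)
strict = with_boundaries.search(phrase)

print(loose.group(0))  # this this
print(strict)          # None
```

Without the boundaries, the regex reports "this this" by matching only the first four letters of "thistle".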
Note that the use of repeated words is not always incorrect, so simply removing them without examination is potentially dangerous. For example, the constructions “that that” and “had had” are generally accepted in colloquial English. Homonyms, names, onomatopoeic words (such as “oink oink” or “ha ha”), and some other constructions also occasionally result in intentionally repeated words. In most cases you should visually examine each match.
The solution shown earlier was intentionally kept simple. That simplicity came at the cost of not accounting for a variety of special cases:
Repeated words that use letters with accents or other diacritical marks, such as “café café” or “naïve naïve.”
Repeated words that include hyphens, single quotes, or right single quotes, such as “co-chair co-chair,” “don’t don’t,” or “rollin’ rollin.’”
Repeated words written in a non-English alphabet, such as the Russian words “друзья друзья.”
Dealing with these issues prevents us from relying on the ‹\b› word boundary token, which we previously used to ensure that only complete words are matched. There are two reasons ‹\b› won’t work when accounting for the special cases just mentioned. First, hyphens and apostrophes are not word characters, so there is no word boundary to match between the whitespace or punctuation that separates words, and a hyphen or apostrophe that appears at the beginning or end of a word. Second, ‹\b› is not Unicode aware in some regex flavors (see Word Characters in Recipe 2.6), so it won’t always work correctly if your data uses letters other than A to Z without diacritics.
Instead of ‹\b›, we’ll therefore need to use lookahead and lookbehind (see Recipe 2.16) to make sure that we still match complete words only. We’ll also use Unicode categories (see Recipe 2.7) to match letters (‹\p{L}›) and diacritical marks (‹\p{M}›) in any alphabet or script:
(?<![\p{L}\p{M}\-'\u2019])([\-'\u2019]?(?:[\p{L}\p{M}][\-'\u2019]?)+)\s+\1(?![\p{L}\p{M}\-'\u2019])
Regex options: Case insensitive
Regex flavors: .NET, Java, Ruby 1.9
Even though ‹\p{L}› matches letters in any casing, you still need to enable the “case insensitive” option, because the text matched by the backreference ‹\1› might use different casing than the initially matched word.
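The same effect can be observed in Python with a simpler regex. In this sketch the class `[A-Za-z]` stands in for ‹\p{L}› (an assumption made for the example): it already accepts both cases, yet the backreference still requires an exact repeat unless the case insensitive option is set.

```python
import re

pattern = r"\b([A-Za-z]+)\s+\1\b"

case_sensitive = re.compile(pattern)
case_insensitive = re.compile(pattern, re.IGNORECASE)

text = "The the"
strict_hit = case_sensitive.search(text)
relaxed_hit = case_insensitive.search(text)

print(strict_hit)            # None
print(relaxed_hit.group(0))  # The the
```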
The ‹\u2019› tokens in the regular expression match a right single quote mark (’). Perl and PCRE use a different syntax for matching individual Unicode code points, so we need to change the regex slightly for them:
(?<![\p{L}\p{M}\-'\x{2019}])([\-'\x{2019}]?(?:[\p{L}\p{M}][\-'\x{2019}]?)+)\s+\1(?![\p{L}\p{M}\-'\x{2019}])
Regex options: Case insensitive
Regex flavors: Java 7, PCRE, Perl
Neither of these regexes works in JavaScript, Python, or Ruby 1.8, because those flavors lack support for Unicode categories like ‹\p{L}›. JavaScript and Ruby 1.8 additionally lack support for lookbehind.
Following are several examples of repeated words that these regexes will match:
The the
café café
друзья друзья
don't don't
rollin’ rollin’
O’Keeffe’s O’Keeffe’s
co-chair co-chair
devil-may-care devil-may-care
Here are some examples of strings that are not matched:
hello, hello
1000 1000
- -
test’’ing test’’ing
one--two one--two
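The lookaround structure of the variation can be tried out in Python with an approximation. Since the built-in `re` module has no Unicode categories, this sketch substitutes `[^\W\d_]` for the letter class and does not cover combining marks at all, so it is not equivalent to the regexes above. The names `LETTER`, `JOINER`, and `WORD_CHAR` are hypothetical helpers introduced for readability.

```python
import re

# Assumption: [^\W\d_] stands in for \p{L}; \p{M} has no counterpart here,
# so letters written with separate combining marks are not handled.
LETTER = r"[^\W\d_]"
JOINER = r"[\-'\u2019]"  # hyphen, apostrophe, right single quote
WORD_CHAR = rf"(?:{LETTER}|{JOINER})"

repeated_word = re.compile(
    rf"(?<!{WORD_CHAR})({JOINER}?(?:{LETTER}{JOINER}?)+)\s+\1(?!{WORD_CHAR})",
    re.IGNORECASE,
)

should_match = [
    "The the", "café café", "друзья друзья", "don't don't",
    "rollin\u2019 rollin\u2019", "O\u2019Keeffe\u2019s O\u2019Keeffe\u2019s",
    "co-chair co-chair", "devil-may-care devil-may-care",
]
should_not_match = [
    "hello, hello", "1000 1000", "- -",
    "test\u2019\u2019ing test\u2019\u2019ing", "one--two one--two",
]

matched = [bool(repeated_word.fullmatch(s)) for s in should_match]
rejected = [repeated_word.search(s) is None for s in should_not_match]
print(all(matched), all(rejected))
```

The accented examples work here only because "é" is stored as a single precomposed character in these strings.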
Recipe 5.9 shows how to match repeated lines of text.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.7 explains how to match Unicode characters. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.