You have a log file, database query output, or some other type of file or string with duplicate lines. You need to remove all but one of each duplicate line using a text editor or other similar tool.
There is a variety of software (including the Unix command-line utility uniq and the Windows PowerShell cmdlet Get-Unique) that can help you remove duplicate lines in a file or string. The following sections contain three regex-based approaches that can be especially helpful when trying to accomplish this task in a nonscriptable text editor with regular expression search-and-replace support.
When you’re programming, options two and three should be avoided since they are inefficient compared to other available approaches, such as using a hash object to keep track of unique lines. However, the first option (which requires that you sort the lines in advance, unless you only want to remove adjacent duplicates) may be an acceptable approach since it’s quick and easy.
If you’re able to sort lines in the file or string you’re working with so that any duplicate lines appear next to each other, you should do so, unless the order of the lines must be preserved. This option will allow using a simpler and more efficient search-and-replace operation to remove the duplicates than would otherwise be possible.
After sorting the lines, use the following regex and replacement string to get rid of the duplicates:
^(.*)(?:(?:\r?\n|\r)\1)+$
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Replace with:
$1
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\1
Replacement text flavors: Python, Ruby
This regular expression uses a capturing group and a backreference (among other ingredients) to match two or more sequential, duplicate lines. A backreference is used in the replacement string to put back the first line. Recipe 3.15 shows example code that can be repurposed to implement this.
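As a rough illustration of Option 1 in code, the following Python sketch applies the regex and replacement shown above. The function names and sample data are hypothetical, and the sample assumes `\n` line breaks, since Python's `^` and `$` treat only `\n` as a line break in multiline mode.

```python
import re

# Option 1 regex: a line followed by one or more identical lines.
# re.MULTILINE lets ^ and $ match at line breaks; re.DOTALL is deliberately not set.
ADJACENT_DUPLICATES = re.compile(r'^(.*)(?:(?:\r?\n|\r)\1)+$', re.MULTILINE)

def remove_adjacent_duplicates(text):
    """Collapse each run of identical adjacent lines into a single line."""
    return ADJACENT_DUPLICATES.sub(r'\1', text)

def remove_duplicates_sorted(text):
    """Sort the lines first (assumes "\n" line breaks), then remove adjacent duplicates."""
    sorted_text = '\n'.join(sorted(text.split('\n')))
    return remove_adjacent_duplicates(sorted_text)

log = "apple\napple\nbanana\nbanana\nbanana\ncherry\napple"

unsorted_result = remove_adjacent_duplicates(log)
sorted_result = remove_duplicates_sorted(log)

print(unsorted_result)  # the trailing "apple" survives because it is not adjacent
print(sorted_result)    # sorting first removes every duplicate
```

Without sorting, only adjacent duplicates are removed, so the final `apple` in the sample remains; sorting first brings all duplicates together at the cost of the original line order.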
If you are using a text editor that does not have the built-in ability to sort lines, or if it is important to preserve the original line order, the following solution lets you remove duplicates even when they are separated by other lines:
^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Here’s the same thing as a regex compatible with standard JavaScript, without the requirement for the “dot matches line breaks” option:
^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Replace with: (the empty string)
Replacement text flavors: N/A
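A minimal Python sketch of this option follows. The function name and sample data are hypothetical, and the sample again assumes `\n` line breaks. Each line that appears again later in the string is replaced with nothing, so the last occurrence of each line is the one that survives.

```python
import re

# Option 2 regex: a line (plus its line break) that appears again later in the string.
# re.DOTALL lets the dot in the lookahead cross line breaks;
# re.MULTILINE lets ^ and $ match at line breaks.
LATER_DUPLICATE = re.compile(
    r'^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)',
    re.DOTALL | re.MULTILINE,
)

def remove_earlier_duplicates(text):
    """Delete every line that is repeated further along in the text."""
    return LATER_DUPLICATE.sub('', text)

colors = "red\ngreen\nred\nblue\ngreen"
deduplicated = remove_earlier_duplicates(colors)
print(deduplicated)
```

In this sample the first `red` and the first `green` are removed, which changes the relative order of the surviving lines compared with keeping first occurrences.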
If you want to preserve the first occurrence of each duplicate line, you’ll need to use a somewhat different approach. First, here is the regular expression and replacement string we will use:
^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Once again, we need to make a couple of changes to make this compatible with JavaScript-flavor regexes, since standard JavaScript doesn’t have a “dot matches line breaks” option:
^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Replace with:
$1$2
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\1\2
Replacement text flavors: Python, Ruby
Unlike the Option 1 and 2 regexes, this version cannot remove all duplicate lines with one search-and-replace operation. You’ll need to continually apply “replace all” until the regex no longer matches your string, meaning that there are no more duplicates to remove. See the discussion of Option 3 later in this recipe for further details.
This regex removes all but the first of duplicate lines that appear next to each other. It does not remove duplicates that are separated by other lines. Let’s step through the process.
First, the ‹^› at the front of the regular expression matches the start of a line. Normally it would only match at the beginning of the subject string, so you need to make sure that the option to let ^ and $ match at line breaks is enabled (Recipe 3.4 shows you how to set regex options in code). Next, the ‹.*› within the capturing parentheses matches the entire contents of a line (even if it’s blank), and the value is stored as backreference 1. For this to work correctly, the “dot matches line breaks” option must not be set; otherwise, the dot-asterisk combination would match until the end of the string.
Within an outer, noncapturing group, we’ve used ‹(?:\r?\n|\r)› to match a line separator used in Windows/MS-DOS (‹\r\n›), Unix/Linux/BSD/OS X (‹\n›), or legacy Mac OS (‹\r›) text files. The backreference ‹\1› then tries to match the line we just finished matching. If the same line isn’t found at that position, the match attempt fails and the regex engine moves on. If it matches, we repeat the group (composed of a line break sequence and backreference 1) using the ‹+› quantifier to match any immediately following duplicate lines.
Finally, we use the dollar sign at the end of the regex to assert position at the end of the line. This ensures that we only match identical lines, and not lines that merely start with the same characters as a previous line.
Because we’re doing a search-and-replace, each entire match (including the original line and line breaks) is removed from the string. We replace this with backreference 1 to put the original line back in.
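To see why the trailing dollar sign matters, the following Python sketch compares the Option 1 regex with a variant that omits the final ‹$›. The variable names and the two-line sample are hypothetical and chosen only for illustration.

```python
import re

# The Option 1 regex as given, and a deliberately broken variant without the final $.
with_anchor = re.compile(r'^(.*)(?:(?:\r?\n|\r)\1)+$', re.MULTILINE)
without_anchor = re.compile(r'^(.*)(?:(?:\r?\n|\r)\1)+', re.MULTILINE)

# The second line merely starts with the same characters as the first.
text = "abc\nabcd"

anchored_result = with_anchor.sub(r'\1', text)
unanchored_result = without_anchor.sub(r'\1', text)

print(repr(anchored_result))    # both lines are left alone
print(repr(unanchored_result))  # the first line is wrongly merged into the second
```

Without the anchor, the backreference matches the first three characters of `abcd`, so the lines are treated as duplicates and the first one is lost.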
There are several changes here compared to the Option 1 regex that finds duplicate lines only when they appear next to each other. First, in the non-JavaScript version of the Option 2 regex, the dot within the capturing group has been replaced with ‹[^\r\n]› (any character except a line break), and the “dot matches line breaks” option has been enabled. That’s because a dot is used later in the regex to match any character, including line breaks. Second, a lookahead has been added to scan for duplicate lines at any position further along in the string. Since the lookahead does not consume any characters, the text matched by the regex is always a single line (along with its following line break) that is known to appear again later in the string. Replacing all matches with the empty string removes the duplicate lines, leaving behind only the last occurrence of each.
Lookbehind is not as widely supported as lookahead, and where it is supported, you still may not be able to look as far backward as you need to. Thus, the Option 3 regex is conceptually different from Option 2. Instead of matching lines that are known to be repeated earlier in the string (which would be comparable to Option 2’s tactic), this regex matches a line, the first duplicate of that line that occurs later in the string, and all the lines in between. The original line is stored as backreference 1, and the lines in between (if any) as backreference 2. By replacing each match with both backreference 1 and 2, you put back the parts you want to keep, leaving out the trailing, duplicate line and its preceding line break.
This alternative approach presents a couple of issues. First, because each match of a set of duplicate lines may include other lines in between, it’s possible that there are duplicates of a different value within your matched text, and those will be skipped over during a “replace all” operation. Second, if a line is repeated more than twice, the regex will first match duplicates one and two, but after that, it will take another set of duplicates to get the regex to match again as it advances through the string. Thus, a single “replace all” action will at best remove only every other duplicate of any specific line. To solve both of these problems and make sure that all duplicates are removed, you’ll need to continually apply the search-and-replace operation to your entire subject string until the regex no longer matches within it. Consider how this regex will work when applied to the following text:
value1
value2
value2
value3
value3
value1
value2
Removing all duplicate lines from this string will take three passes. Table 5-1 shows the result of each pass.
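The repeated “replace all” process can be sketched in Python as a loop that stops once the regex no longer matches. The function name is hypothetical, and the sample assumes the seven values above are separated by `\n` line breaks.

```python
import re

# Option 3 regex: a line, anything up to its next duplicate, and the duplicate(s).
OPTION_3 = re.compile(
    r'^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+',
    re.DOTALL | re.MULTILINE,
)

def remove_duplicates_keep_first(text):
    """Apply "replace all" repeatedly; return the result and the number of
    passes that actually changed the text."""
    passes = 0
    while True:
        # subn returns the new string and how many replacements were made.
        text, replacements = OPTION_3.subn(r'\1\2', text)
        if replacements == 0:
            return text, passes
        passes += 1

sample = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"
result, passes = remove_duplicates_keep_first(sample)
print(result)
print(passes)
```

The loop always terminates because every replacement removes at least one line break from the string.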
Recipe 5.8 shows how to match repeated words.
Recipe 3.19 has code listings for splitting a string using a regular expression, which provides an alternative, (mostly) non-regex-based means to remove duplicate lines when programming. If you use a regex that matches line breaks (such as ‹\r?\n|\r›) as the separator for your split operation, you’ll be left with a list of all lines in the string. You can then loop over this list and keep track of unique lines using a hash object, discarding any lines you’ve previously encountered.
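The split-and-track approach might look like the following in Python, using a set as the hash object. This is only a sketch: the function name and sample input are hypothetical and are not taken from Recipe 3.19.

```python
import re

def unique_lines(text):
    """Split on any common line break and keep the first occurrence of each line."""
    seen = set()   # hash-based record of lines already encountered
    kept = []      # unique lines in their original order
    for line in re.split(r'\r?\n|\r', text):
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept

# Mixed line breaks: Unix (\n), Windows (\r\n), and legacy Mac OS (\r).
mixed = "value1\nvalue2\r\nvalue2\rvalue3\nvalue1"
lines = unique_lines(mixed)
print(lines)
```

This makes a single pass over the lines and preserves the first occurrence of each, which is the behavior Option 3 needs several passes to achieve.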
Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.2 explains how to match nonprinting characters. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.