5.9. Remove Duplicate Lines

Problem

You have a log file, database query output, or some other type of file or string with duplicate lines. You need to remove all but one of each duplicate line using a text editor or other similar tool.

Solution

There is a variety of software (including the Unix command-line utility uniq and Windows PowerShell cmdlet Get-Unique) that can help you remove duplicate lines in a file or string. The following sections contain three regex-based approaches that can be especially helpful when trying to accomplish this task in a nonscriptable text editor with regular expression search-and-replace support.

When you’re programming, options two and three should be avoided since they are inefficient compared to other available approaches, such as using a hash object to keep track of unique lines. However, the first option (which requires that you sort the lines in advance, unless you only want to remove adjacent duplicates) may be an acceptable approach since it’s quick and easy.

Option 1: Sort lines and remove adjacent duplicates

If you’re able to sort lines in the file or string you’re working with so that any duplicate lines appear next to each other, you should do so, unless the order of the lines must be preserved. This option will allow using a simpler and more efficient search-and-replace operation to remove the duplicates than would otherwise be possible.

After sorting the lines, use the following regex and replacement string to get rid of the duplicates:

^(.*)(?:(?:\r?\n|\r)\1)+$
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Replace with:

$1
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\1
Replacement text flavors: Python, Ruby

This regular expression uses a capturing group and a backreference (among other ingredients) to match two or more sequential, duplicate lines. A backreference is used in the replacement string to put back the first line. Recipe 3.15 shows example code that can be repurposed to implement this.
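
As a minimal sketch of how this might look in code, the following Python snippet sorts the lines and then applies the regex. The function names and sample data are illustrative assumptions, not part of the recipe. The sketch also assumes `\n` line endings, because Python's ^ and $ treat only `\n` as a line break in multiline mode.

```python
import re

# Option 1 regex: a line followed by one or more identical lines.
# re.MULTILINE lets ^ and $ match at line breaks; re.DOTALL is deliberately not set.
ADJACENT_DUPLICATES = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+$", re.MULTILINE)


def remove_adjacent_duplicates(text):
    """Collapse each run of identical adjacent lines into a single line."""
    return ADJACENT_DUPLICATES.sub(r"\1", text)


def sort_and_dedupe(text):
    """Sort the lines so duplicates become adjacent, then remove them."""
    sorted_text = "\n".join(sorted(text.split("\n")))
    return remove_adjacent_duplicates(sorted_text)


sample = "pear\napple\npear\nfig\napple\npear"
print(sort_and_dedupe(sample))
```

Without the sorting step, `remove_adjacent_duplicates` only collapses duplicates that already sit next to each other, which preserves the original line order.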

Option 2: Keep the last occurrence of each duplicate line in an unsorted file

If you are using a text editor that does not have the built-in ability to sort lines, or if it is important to preserve the original line order, the following solution lets you remove duplicates even when they are separated by other lines:

^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Here’s the same thing as a regex compatible with standard JavaScript, without the requirement for the “dot matches line breaks” option:

^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Replace with:

(The empty string—that is, nothing.)

Replacement text flavors: N/A
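
The following Python snippet is a small sketch of the first Option 2 regex in use. The function name and sample text are assumptions made for illustration, and the sample uses `\n` line endings with no blank lines and no trailing line break, to keep the behavior easy to follow.

```python
import re

# Option 2 regex: a line plus its line break, matched only if the same line
# appears again later in the text (checked by the lookahead).
# re.DOTALL lets the dot inside the lookahead cross line breaks.
EARLIER_DUPLICATES = re.compile(
    r"^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)",
    re.DOTALL | re.MULTILINE,
)


def keep_last_occurrences(text):
    """Delete every line that is repeated later, keeping the last copy of each."""
    return EARLIER_DUPLICATES.sub("", text)


sample = "a\nb\na\nc\nb"
print(keep_last_occurrences(sample))
```

For this sample, the first `a` and the first `b` are removed, so the remaining lines are `a`, `c`, and `b`, in the order of their last occurrences.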

Option 3: Keep the first occurrence of each duplicate line in an unsorted file

If you want to preserve the first occurrence of each duplicate line, you’ll need to use a somewhat different approach. First, here is the regular expression and replacement string we will use:

^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Once again, we need to make a couple of changes to make this compatible with JavaScript-flavor regexes, since standard JavaScript doesn’t have a “dot matches line breaks” option.

^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Replace with:

$1$2
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\1\2
Replacement text flavors: Python, Ruby

Unlike the Option 1 and 2 regexes, this version cannot remove all duplicate lines with one search-and-replace operation. You’ll need to continually apply “replace all” until the regex no longer matches your string, meaning that there are no more duplicates to remove. See the Discussion section of this recipe for further details.
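
The repeated "replace all" can be sketched as a loop. The Python snippet below is one possible way to do it, not a definitive implementation: the function name and the returned pass count are illustrative assumptions, and the sample text assumes `\n` line endings.

```python
import re

# Option 3 regex: a line, everything up to its next duplicate, and that duplicate.
FIRST_DUPLICATE = re.compile(
    r"^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+",
    re.DOTALL | re.MULTILINE,
)


def keep_first_occurrences(text):
    """Repeat "replace all" until nothing matches.

    Returns the cleaned text and the number of passes that made replacements.
    """
    passes = 0
    while True:
        # subn returns the new string and the number of replacements made.
        text, replacements = FIRST_DUPLICATE.subn(r"\1\2", text)
        if replacements == 0:
            return text, passes
        passes += 1


sample = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"
result, passes = keep_first_occurrences(sample)
print(result)
print(passes)
```

Every replacement removes at least one line break, so the string shrinks on each productive pass and the loop always terminates.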

Discussion

Option 1: Sort lines and remove adjacent duplicates

This regex removes all but the first of duplicate lines that appear next to each other. It does not remove duplicates that are separated by other lines. Let’s step through the process.

First, the ^ at the front of the regular expression matches the start of a line. Normally it would only match at the beginning of the subject string, so you need to make sure that the option to let ^ and $ match at line breaks is enabled (Recipe 3.4 shows you how to set regex options in code). Next, the .* within the capturing parentheses matches the entire contents of a line (even if it’s blank), and the value is stored as backreference 1. For this to work correctly, the “dot matches line breaks” option must not be set; otherwise, the dot-asterisk combination would match until the end of the string.

Within an outer, noncapturing group, we’ve used (?:\r?\n|\r) to match a line separator used in Windows/MS-DOS (\r\n), Unix/Linux/BSD/OS X (\n), or legacy Mac OS (\r) text files. The backreference \1 then tries to match the line we just finished matching. If the same line isn’t found at that position, the match attempt fails and the regex engine moves on. If it matches, we repeat the group (composed of a line break sequence and backreference \1) using the + quantifier to match any immediately following duplicate lines.

Finally, we use the dollar sign at the end of the regex to assert position at the end of the line. This ensures that we only match identical lines, and not lines that merely start with the same characters as a previous line.

Because we’re doing a search-and-replace, each entire match (including the original line and line breaks) is removed from the string. We replace this with backreference 1 to put the original line back in.
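
To see why the trailing dollar sign matters, here is a small Python sketch that compares the regex with and without it. The sample text and variable names are assumptions chosen for illustration.

```python
import re

# The Option 1 regex, and a variant with the trailing $ removed.
WITH_ANCHOR = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+$", re.MULTILINE)
WITHOUT_ANCHOR = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+", re.MULTILINE)

# The second line merely starts with the text of the first line.
text = "foo\nfoobar"

with_anchor = WITH_ANCHOR.sub(r"\1", text)
without_anchor = WITHOUT_ANCHOR.sub(r"\1", text)

print(repr(with_anchor))     # text is left unchanged
print(repr(without_anchor))  # the two lines are wrongly merged
```

With the anchor, nothing matches and the text is left alone. Without it, the backreference matches the first three characters of the second line, and the replacement merges the two lines into `foobar`.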

Option 2: Keep the last occurrence of each duplicate line in an unsorted file

There are several changes here compared to the Option 1 regex that finds duplicate lines only when they appear next to each other. First, in the non-JavaScript version of the Option 2 regex, the dot within the capturing group has been replaced with [^\r\n] (any character except a line break), and the “dot matches line breaks” option has been enabled. That’s because a dot is used later in the regex to match any character, including line breaks. Second, a lookahead has been added to scan for duplicate lines at any position further along in the string. Since the lookahead does not consume any characters, the text matched by the regex is always a single line (along with its following line break) that is known to appear again later in the string. Replacing all matches with the empty string removes the duplicate lines, leaving behind only the last occurrence of each.

Option 3: Keep the first occurrence of each duplicate line in an unsorted file

Lookbehind is not as widely supported as lookahead, and where it is supported, you still may not be able to look as far backward as you need to. Thus, the Option 3 regex is conceptually different from Option 2. Instead of matching lines that are known to be repeated earlier in the string (which would be comparable to Option 2’s tactic), this regex matches a line, the first duplicate of that line that occurs later in the string, and all the lines in between. The original line is stored as backreference 1, and the lines in between (if any) as backreference 2. By replacing each match with both backreference 1 and 2, you put back the parts you want to keep, leaving out the trailing, duplicate line and its preceding line break.

This alternative approach presents a couple of issues. First, because each match of a set of duplicate lines may include other lines in between, it’s possible that there are duplicates of a different value within your matched text, and those will be skipped over during a “replace all” operation. Second, if a line is repeated more than twice, the regex will first match duplicates one and two, but after that, it will take another set of duplicates to get the regex to match again as it advances through the string. Thus, a single “replace all” action will at best remove only every other duplicate of any specific line. To solve both of these problems and make sure that all duplicates are removed, you’ll need to continually apply the search-and-replace operation to your entire subject string until the regex no longer matches within it. Consider how this regex will work when applied to the following text:

value1
value2
value2
value3
value3
value1
value2

Removing all duplicate lines from this string will take three passes. Table 5-1 shows the result of each pass.

Table 5-1. Replacement passes

Pass one               Pass two                  Pass three             Final string
One match/replacement  Two matches/replacements  One match/replacement  No duplicates remain
value1                 value1                    value1                 value1
value2                 value2                    value2                 value2
value2                 value2                    value3                 value3
value3                 value3                    value2
value3                 value3
value1                 value2
value2

See Also

Recipe 5.8 shows how to match repeated words.

Recipe 3.19 has code listings for splitting a string using a regular expression, which provides an alternative, (mostly) non-regex-based means to remove duplicate lines when programming. If you use a regex that matches line breaks (such as \r?\n|\r) as the separator for your split operation, you’ll be left with a list of all lines in the string. You can then loop over this list and keep track of unique lines using a hash object, discarding any lines you’ve previously encountered.
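
As a rough sketch of that split-and-track approach, the Python snippet below uses a set as the hash object. The function name and sample text are illustrative assumptions, and this variant keeps the first occurrence of each line.

```python
import re

# Separator regex that matches Windows, Unix, and legacy Mac OS line breaks.
LINE_BREAK = re.compile(r"\r?\n|\r")


def unique_lines(text):
    """Split on line breaks and keep only the first occurrence of each line."""
    seen = set()   # lines encountered so far
    result = []    # unique lines, in their original order
    for line in LINE_BREAK.split(text):
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


sample = "value1\r\nvalue2\nvalue2\rvalue3\nvalue1"
print(unique_lines(sample))
```

Each line is looked up in the set once, so this avoids the repeated scanning that the Option 2 and Option 3 regexes perform.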

Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.2 explains how to match nonprinting characters. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.
