Strings

Problem

You need a regex that matches a string, which is a sequence of zero or more characters enclosed by double quotes. A string with nothing between the quotes is an empty string. Two sequential double quotes in a character string denote a single character, a double quote. Strings cannot include line breaks. Backslashes or other characters have no special meaning in strings.

Your regular expression should match any string, including empty strings, and it should return a single match for strings that contain double quotes. For example, it should return "before quote""after quote" as a single match, rather than matching "before quote" and "after quote" separately.

Solution

"[^"\r\n]*(?:""[^"\r\n]*)*"
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Matching a string that cannot contain quotes or line breaks would be easy with "[^\r\n"]*". Double quotes are literal characters in regular expressions, and we can easily match a sequence of characters that are not quotes or line breaks with a negated character class.

But our strings can contain quotes if they are specified as two consecutive quotes. Matching these is not much more difficult if we handle the quotes separately. After the opening quote, we use [^\r\n"]* to match anything but quotes and line breaks. This may be followed by zero or more pairs of double quotes. We could match those with (?:"")*, but after each pair of double quotes, the string can have more characters that are not quotes or line breaks. So we match one pair of double quotes and following nonquote, nonbreak characters with ""[^\r\n"]*, or all the pairs with (?:""[^\r\n"]*)*. We end the regex with the double quote that closes the string.
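As a minimal sketch of the behavior described above, the following Python snippet applies the regex to one line of text. The names STRING_RE, sample, and matches are illustrative and are not part of the recipe.

```python
import re

# The recipe's regex as a raw string; \r\n keeps line breaks out of the string.
STRING_RE = re.compile(r'"[^"\r\n]*(?:""[^"\r\n]*)*"')

# Hypothetical input: one string with a doubled quote, one empty, one plain.
sample = 'x = "before quote""after quote"; y = ""; z = "plain"'

# The group is noncapturing, so findall() returns the full text of each match.
matches = STRING_RE.findall(sample)
print(matches)
```

The string containing a doubled quote comes back as one match, and the empty string is matched as well.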

The match returned by this regex will be the whole string, including enclosing quotes, and pairs of quotes inside the string. To get only the contents of the string, the code that processes the regex match needs to do some extra work. First, it should strip off the quotes at the start and the end of the match. Then it should search for all pairs of double quotes and replace them with individual double quotes.
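The two post-processing steps could look like this in Python. This is only a sketch: string_contents is a hypothetical helper name, and it assumes it is handed a complete match produced by the regex.

```python
import re

STRING_RE = re.compile(r'"[^"\r\n]*(?:""[^"\r\n]*)*"')

def string_contents(match_text: str) -> str:
    """Turn a matched string literal into the text it represents."""
    inner = match_text[1:-1]          # step 1: strip the enclosing quotes
    return inner.replace('""', '"')   # step 2: collapse doubled quotes

# Hypothetical input containing one string with two doubled quotes.
text = 'say "He said ""hi"" to me" now'
m = STRING_RE.search(text)
print(string_contents(m.group()))  # He said "hi" to me
```

The order matters: stripping the outer quotes first ensures that an empty string, matched as "", yields an empty result rather than a single quote.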

You may wonder why we don’t simply use "(?:[^"\r\n]|"")*" to match our strings. This regex matches a pair of quotes containing (?:[^"\r\n]|"")*, which matches zero or more occurrences of any combination of two alternatives. [^"\r\n] matches a character that isn’t a double quote or a line break. "" matches a pair of double quotes. Put together, the overall regex matches a pair of double quotes containing zero or more characters that aren’t quotes or line breaks or that are a pair of double quotes. This is the definition of a string in the stated problem. This regex indeed correctly matches the strings we want, but it is not very efficient. The regular expression engine has to enter a group with two alternatives for each character in the string. With the regex from the Solution section, the regex engine only enters a group for each pair of double quotes in the string, which is a rare occurrence.

You could try to optimize the inefficient regex as "(?:[^"\r\n]+|"")*". The idea is that this regex only enters the group for each pair of double quotes and for each sequence of characters without quotes or line breaks. That is true, as long as the regex encounters only valid strings. But if this regex is ever used on a file that contains a string without the closing quote, this will lead to catastrophic backtracking. When the closing quote fails to match, the regex engine will try each and every permutation of the plus and the asterisk in the regex to match all the characters between the string’s opening quote and the end of the line.

Table 7-1 shows how this regex attempts all different ways of matching "abcd. The cells in the table show the text matched by [^"\r\n]+. At first, it matches abcd, but when the closing quote fails to match, the + will backtrack, giving up part of its match. When it does, the * will repeat the group, causing the next iteration of [^"\r\n]+ to match the remaining characters. Now we have two iterations that will backtrack. This continues until each iteration of [^"\r\n]+ matches a single character, and * has repeated the group as many times as there are characters on the line.

Table 7-1. Line separators

Permutation   1st [^"\r\n]+   2nd [^"\r\n]+   3rd [^"\r\n]+   4th [^"\r\n]+
1             abcd            n/a             n/a             n/a
2             abc             d               n/a             n/a
3             ab              cd              n/a             n/a
4             ab              c               d               n/a
5             a               bcd             n/a             n/a
6             a               bc              d               n/a
7             a               b               cd              n/a
8             a               b               c               d

As you can see, the number of permutations grows exponentially[10] with the number of characters after the opening double quote. For a file with short lines, this will result in your application running slowly. For a file with very long lines, your application may lock up or crash. If you use the variant "(?:[^"]+|"")*" to match multiline strings, the permutations may run all the way to the end of the file if there are no further double quotes in the file.
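The growth can be made concrete without running a regex engine at all. The sketch below only enumerates the ways a run of characters can be divided among successive iterations of the character class, longest first piece first, which is an assumption about how a greedy backtracking engine orders its attempts. The function name splits is illustrative.

```python
def splits(text):
    """Yield every way to cut text into consecutive nonempty pieces,
    trying the longest first piece first, as a greedy + would."""
    if not text:
        yield []
        return
    for i in range(len(text), 0, -1):      # give up one character at a time
        for rest in splits(text[i:]):      # later iterations take the remainder
            yield [text[:i]] + rest

rows = list(splits("abcd"))
for number, row in enumerate(rows, start=1):
    print(number, row)

print(len(rows))                           # 8
print(len(list(splits("a" * 20))))         # 524288
```

For abcd this produces the same eight rows, in the same order, as Table 7-1. Each additional character doubles the count, so a line of only 20 characters after the opening quote already allows more than half a million divisions.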

You could prevent that backtracking with an atomic group, as in "(?>[^"\r\n]+|"")*", or with possessive quantifiers, as in "(?:[^"\r\n]++|"")*+", if your regex flavor supports either of these features. But having to resort to special features defeats the purpose of trying to come up with something simpler than the regex presented in the Solution section.

Variations

Strings delimited with single quotes can be matched just as easily:

'[^'\r\n]*(?:''[^'\r\n]*)*'
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If your language supports both single-quoted and double-quoted strings, you’ll need to handle those as separate alternatives:

"[^"\r\n]*(?:""[^"\r\n]*)*"|'[^'\r\n]*(?:''[^'\r\n]*)*'
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
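A short Python sketch shows why the two quote styles must be separate alternatives: a quote character of the other kind inside a string is then treated as ordinary text. The names EITHER_RE, sample, and found are illustrative.

```python
import re

# Adjacent raw-string literals are concatenated into one pattern.
EITHER_RE = re.compile(
    r'"[^"\r\n]*(?:""[^"\r\n]*)*"'     # double-quoted alternative
    r"|'[^'\r\n]*(?:''[^'\r\n]*)*'"    # single-quoted alternative
)

# Hypothetical source line mixing both styles.
sample = """a = "it's" + 'say "hi"' + 'don''t'"""

found = EITHER_RE.findall(sample)
print(found)
```

The apostrophe inside "it's" does not start a single-quoted string, because the double-quoted alternative has already consumed it as part of its match.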

If strings can include line breaks, simply remove them from the negated character classes:

"[^"]*(?:""[^"]*)*"
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If the regex will be used in a system that needs to deal with source code files while they’re being edited, you may want to make the closing quote optional. Then everything until the end of the line will be matched as a string while it is being typed in, until the closing quote has been typed in. Syntax coloring in text editors, for example, usually works this way. Making the closing quote optional does not change how this regex works on files that only have properly closed strings. The quantifier for the closing quote is greedy, so the quote will be matched if present. The negated character classes make sure that the regex does not incorrectly match closing quotes as part of the string.

"[^"\r\n]*(?:""[^"\r\n]*)*"?
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
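The effect of the optional closing quote can be sketched in Python as follows. The sample lines are hypothetical, standing in for what an editor might see while the user is typing; OPEN_RE, lines, and spans are illustrative names.

```python
import re

# Same regex as the Solution, but the closing quote is optional.
OPEN_RE = re.compile(r'"[^"\r\n]*(?:""[^"\r\n]*)*"?')

lines = [
    'print("done")',          # properly closed string
    'print("still typing',    # closing quote not typed yet
    'x = "a ""b',             # doubled quote, still unclosed
]

# Text that would be colored as a string on each line.
spans = [OPEN_RE.search(line).group() for line in lines]
print(spans)
```

On the closed string the match stops at the closing quote, so the trailing parenthesis is excluded. On the unclosed ones it runs to the end of the line.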

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes, Recipe 2.9 explains grouping, and Recipe 2.12 explains repetition.

Recipes 2.15 and 2.14 explain catastrophic backtracking and how to avoid it with atomic grouping and possessive quantifiers.



[10] If there are n characters between the double quote and the end of the string, the regex engine will try 2^(n-1) permutations of (?:[^"\r\n]+|"")*.
