Strings with Escapes

Problem

You need a regex that matches a string, which is a sequence of zero or more characters enclosed by double quotes. A string with nothing between the quotes is an empty string. A double quote can be included in the string by escaping it with a backslash, and backslashes can also be used to escape other characters in the string. Strings cannot include line breaks, and line breaks cannot be escaped with backslashes.

Solution

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

This regular expression has the same structure as the one in the preceding recipe. The difference is that we now have two characters with a special meaning: the double quote and the backslash. We exclude both from the characters matched by the two negated character classes. We use \\. to separately match any escaped character. \\ matches a single backslash, and . matches any character that is not a line break. Make sure the option “dot matches line breaks” is turned off.
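As a rough illustration, the regex can be tried out in Python, one of the listed flavors. This is a minimal sketch, not part of the recipe: the sample text and the names STRING_RE, sample, unterminated, and escaped_break are made up for the example.

```python
import re

# The recipe's regex, written as a raw string so Python passes every
# backslash through to the regex engine unchanged.
STRING_RE = re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"')

# Hypothetical sample text: a plain string, a string with escaped quotes,
# an empty string, and a string ending in an escaped backslash.
sample = r'say "hello", then "a \"quoted\" word", then "" and "C:\\dir\\"'

matches = STRING_RE.findall(sample)
for m in matches:
    print(m)

# A string interrupted by a line break is not matched, whether or not
# the line break is preceded by a backslash.
unterminated = STRING_RE.search('"abc\ndef"')
escaped_break = STRING_RE.search('"abc\\\ndef"')
print(unterminated, escaped_break)
```

The pattern has no capturing groups, so findall() returns the whole text of each match.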

Variations

Strings delimited with single quotes can be matched just as easily:

'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If your language supports both single-quoted and double-quoted strings, you’ll need to handle those as separate alternatives:

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"|'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
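The following Python sketch suggests why the two quote styles are kept as separate alternatives: inside a double-quoted string a single quote is an ordinary character, and vice versa. The sample text and the names DOUBLE, SINGLE, EITHER_RE, sample, and found are assumptions made for the example.

```python
import re

# Each alternative excludes only its own delimiter (plus the backslash
# and line breaks) from the ordinary characters.
DOUBLE = r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"'
SINGLE = r"'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'"
EITHER_RE = re.compile(DOUBLE + "|" + SINGLE)

# Hypothetical input: a double-quoted string containing an apostrophe, and
# a single-quoted string containing escaped single quotes and double quotes.
sample = r'''x = "it's"; y = 'say \'hi\' to "them"' '''

found = EITHER_RE.findall(sample)
for s in found:
    print(s)
```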

If strings can include line breaks escaped with a backslash, we can modify our original regular expression to allow a line break to be matched after the backslash. We use (?:.|\r?\n) rather than just the dot with the “dot matches line breaks” option to make sure that Windows-style line breaks are matched correctly. The dot would match only the CR in a CR LF line break, and the regex would then fail to match the LF. \r?\n handles both Windows-style and Unix-style line breaks.

"[^"\\\r\n]*(?:\\(?:.|\r?\n)[^"\\\r\n]*)*"
Regex options: None (make sure “dot matches line breaks” is off)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

If strings can include line breaks even when they are not escaped, remove the line break characters from the negated character classes. Also make sure to allow the dot to match line breaks.

"[^"\\]*(?:\\.[^"\\]*)*"
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
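In Python, the “dot matches line breaks” option corresponds to the re.DOTALL flag. The sketch below assumes that flavor. The sample text and the names ANY_RE, text, and m are invented for the example.

```python
import re

# re.DOTALL lets the dot after the backslash match a line break too.
ANY_RE = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)

# Hypothetical input: the string spans three lines, contains one
# backslash-escaped line break and two escaped double quotes.
text = 'before "line one\nline two \\\n still \\"inside\\"" after'

m = ANY_RE.search(text)
print(repr(m.group()))
```

The match runs from the first double quote to the last one. The escaped quotes in between do not end the string.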

We need a separate solution for JavaScript without XRegExp, because it does not have an option to make the dot match line breaks.

"[^"\\]*(?:\\[\s\S][^"\\]*)*"
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
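The [\s\S] class is meant as the JavaScript workaround, but it behaves the same way in the other listed flavors. The sketch below uses Python to compare it with the dot-based regex under re.DOTALL. The sample inputs and the names CLASS_RE, DOTALL_RE, samples, class_results, and dotall_results are made up for the example.

```python
import re

# [\s\S] matches any character, line breaks included, with no option set.
CLASS_RE = re.compile(r'"[^"\\]*(?:\\[\s\S][^"\\]*)*"')
# The dot-based form needs the "dot matches line breaks" option.
DOTALL_RE = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)

# Hypothetical inputs: an unescaped line break, an escaped CR LF,
# an escaped quote, and a string that never closes.
samples = [
    '"one\ntwo"',
    '"one \\\r\n two"',
    'x "a \\" b" y',
    '"unterminated \\"',
]

def first_match(regex, s):
    """Return the text of the first match in s, or None if there is none."""
    m = regex.search(s)
    return m.group() if m else None

class_results = [first_match(CLASS_RE, s) for s in samples]
dotall_results = [first_match(DOTALL_RE, s) for s in samples]
print(class_results == dotall_results)
```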

See Also

Strings explains the basic structure of the regular expression in this recipe’s solution. Recipe 2.4 explains the dot, including the option to make it match line breaks, and the workaround for JavaScript.
