This recipe explains the ins and outs of the dot metacharacter.
Match a quoted character. Provide one solution that allows any single character, except a line break, between the quotes. Provide another that truly allows any character, including line breaks.
The dot is one of the oldest and simplest regular expression features. Its meaning has always been to match any single character.
There is, however, some confusion as to what any character truly means. The oldest tools for working with regular expressions processed files line by line, so there was never an opportunity for the subject text to include a line break. The programming languages discussed in this book process the subject text as a whole, no matter how many line breaks you put into it. If you want true line-by-line processing, you have to write a bit of code that splits the subject into an array of lines and applies the regex to each line in the array. Recipe 3.21 in the next chapter shows how to do this.
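As a rough illustration of the line-by-line approach, the following Python sketch splits a subject into lines and applies a regex to each one. The function name `match_lines` and the sample subject are assumptions made for this example only; Recipe 3.21 covers the technique properly.

```python
import re

def match_lines(pattern, subject):
    """Return (line_number, matched_text) pairs, searching each line separately."""
    regex = re.compile(pattern)
    results = []
    # splitlines() drops the line break characters, so no line can contain one
    for number, line in enumerate(subject.splitlines(), start=1):
        match = regex.search(line)
        if match:
            results.append((number, match.group()))
    return results

subject = "first 'a' line\nno quotes here\nthird 'b' line"
print(match_lines(r"'.'", subject))
```

Because each line is searched on its own, the dot can never encounter a line break here, whatever matching mode is in effect.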
Larry Wall, the developer of Perl, wanted Perl to retain the traditional behavior of line-based tools, in which the dot never matched a line break. All the other flavors discussed in this book followed suit. ‹.› thus matches any single character except line break characters.
If you do want to allow your regular expression to span multiple lines, turn on the “dot matches line breaks” option. This option masquerades under different names. Perl and many others confusingly call it “single line” mode, whereas Java calls it “dot all” mode. Recipe 3.4 in the next chapter has all the details. Whatever the name of this option in your favorite programming language is, think of it as “dot matches line breaks” mode. That’s all the option does.
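To make the difference concrete, here is a small Python sketch that tries the same regex with and without the option. Python's name for it is `re.DOTALL` (alias `re.S`); the subject string is an assumption chosen for the example.

```python
import re

subject = "'\n'"  # a quote, a line break, a quote

# Default behavior: the dot does not match the line break
default_match = re.search(r"'.'", subject)

# With re.DOTALL, the dot matches the line break as well
dotall_match = re.search(r"'.'", subject, re.DOTALL)

print(default_match is None)     # no match without the option
print(dotall_match is not None)  # a match with the option
```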
An alternative solution is needed for JavaScript, which doesn’t have a “dot matches line breaks” option. As Recipe 2.3 explains, ‹\s› matches any whitespace character, whereas ‹\S› matches any character that is not matched by ‹\s›. Combining these into ‹[\s\S]› results in a character class that includes all characters, including line breaks. ‹[\d\D]› and ‹[\w\W]› have the same effect.
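The following sketch checks that these character classes match a line break without any option being set. It is written in Python purely for illustration; the classes themselves use the same syntax in JavaScript. The variable names and subject string are assumptions for the example.

```python
import re

subject = "'\n'"  # a quote, a line break, a quote

# Each class pairs a shorthand with its negation, so together they cover every character
patterns = [r"'[\s\S]'", r"'[\d\D]'", r"'[\w\W]'"]

# No flags are passed: the classes match the line break by themselves
results = [re.search(p, subject) is not None for p in patterns]
print(results)
```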
The dot is the most abused regular expression feature. ‹\d\d.\d\d.\d\d› is not a good way to match a date. It does match 05/16/08 just fine, but it also matches 99/99/99. Worse, it matches 12345678.
A proper regex for matching only valid dates is a subject for a later chapter (see Recipe 4.5). But replacing the dot with a more appropriate character class is very easy. ‹\d\d[/.-]\d\d[/.-]\d\d› allows a forward slash, dot, or hyphen to be used as the date separator. This regex still matches 99/99/99, but not 12345678.
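The two date regexes can be compared side by side with a short Python sketch. The names `loose` and `strict` and the list of sample strings are assumptions for this example; `fullmatch` is used so that each sample must match in its entirety.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")            # dot as separator: any character
strict = re.compile(r"\d\d[/.-]\d\d[/.-]\d\d")   # separator limited to / . or -

samples = ["05/16/08", "99/99/99", "12345678", "16.05.08"]

# Map each sample to (matched by loose, matched by strict)
report = {
    text: (bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
    for text in samples
}
for text, (is_loose, is_strict) in report.items():
    print(text, is_loose, is_strict)
```

Only 12345678 is treated differently by the two regexes: the version with the dot accepts it, while the version with the character class rejects it.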
It’s just a coincidence that the previous example includes a dot inside the character classes. Inside a character class, the dot is just a literal character. It’s worth including in this particular regular expression because in some countries, such as Germany, the dot is used as a date separator.
Use the dot only when you really want to allow any character. Use a character class or negated character class in any other situation.
Here’s how to match any quoted character, including line breaks, with the help of an inline mode modifier:
(?s)'.'
Regex options: None
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python
(?m)'.'
Regex options: None
Regex flavors: Ruby
If you cannot turn on “dot matches line breaks” mode outside the regular expression, you can place a mode modifier at the start of the regular expression. We explain the concept of mode modifiers, and JavaScript’s lack of support for them, in the subsection Case-insensitive matching in Recipe 2.1.
‹(?s)› is the mode modifier for “dot matches line breaks” mode in .NET, Java, XRegExp, PCRE, Perl, and Python. The s stands for “single line” mode, which is Perl’s confusing name for “dot matches line breaks.”
The terminology is so confusing that the developer of Ruby’s regex engine copied it wrongly. Ruby uses ‹(?m)› to turn on “dot matches line breaks” mode. Other than the different letter, the functionality is exactly the same. The new engine in Ruby 1.9 continues to use ‹(?m)› for “dot matches line breaks.” Perl’s very different meaning for ‹(?m)› is explained in Recipe 2.5.
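A brief Python sketch shows the inline modifier at work. Python follows Perl’s naming, so ‹(?s)› makes the dot match line breaks, whereas ‹(?m)› changes only the behavior of the anchors and leaves the dot alone. Ruby’s behavior is not exercised here; the subject string and variable names are assumptions for the example.

```python
import re

subject = "'\n'"  # a quote, a line break, a quote

# (?s) at the start of the regex turns on "dot matches line breaks"
with_s = re.search(r"(?s)'.'", subject)

# In Python, as in Perl, (?m) affects ^ and $ only, so the dot still stops at line breaks
with_m = re.search(r"(?m)'.'", subject)

print(with_s is not None)
print(with_m is None)
```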
In many cases, you don’t want to match truly any character, but rather any character except a select few. Recipe 2.3 explains how to do that.
Recipe 3.4 explains how to set options such as “dot matches line breaks” in your source code.
When working with Unicode text, you may prefer to use ‹\X› to match a Unicode grapheme instead of the dot, which matches a Unicode code point. Recipe 2.7 explains this in detail.