2.4. Match Any Character

This recipe explains the ins and outs of the dot metacharacter.

Problem

Match a quoted character. Provide one solution that allows any single character, except a line break, between the quotes. Provide another that truly allows any character, including line breaks.

Solution

Any character except line breaks

'.'
Regex options: None (the “dot matches line breaks” option must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Any character including line breaks

'.'
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

'[\s\S]'
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Any character except line breaks

The dot is one of the oldest and simplest regular expression features. Its meaning has always been to match any single character.

There is, however, some confusion as to what “any character” truly means. The oldest tools for working with regular expressions processed files line by line, so there was never an opportunity for the subject text to include a line break. The programming languages discussed in this book process the subject text as a whole, no matter how many line breaks you put into it. If you want true line-by-line processing, you have to write a bit of code that splits the subject into an array of lines and applies the regex to each line in the array. Recipe 3.21 in the next chapter shows how to do this.
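As a rough illustration of line-by-line processing, here is a minimal sketch in Python. The choice of language and the sample subject text are our assumptions for this example; Recipe 3.21 remains the place to look for the full treatment.

```python
import re

# Hypothetical subject text spanning three lines.
subject = "first 'a' line\nsecond 'b' line\nno quotes here"

quoted = re.compile(r"'.'")

# splitlines() breaks the subject at line boundaries and drops the
# line terminators, so each regex search sees exactly one line.
matches_per_line = [quoted.findall(line) for line in subject.splitlines()]
```

Each entry in matches_per_line holds the quoted characters found on the corresponding line, with an empty list for lines that have none.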

Larry Wall, the developer of Perl, wanted Perl to retain the traditional behavior of line-based tools, in which the dot never matched a line break. All the other flavors discussed in this book followed suit. The dot thus matches any single character except line break characters.
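The default behavior can be checked with a short sketch. We use Python's re module here as an assumption, one of the flavors this recipe covers, and the sample strings are made up for the example.

```python
import re

pattern = re.compile(r"'.'")  # no options set

# An ordinary character between the quotes is matched.
same_line = pattern.search("x = 'a'")

# A line feed between the quotes is not matched by the dot.
across_break = pattern.search("x = '\n'")
```

Here same_line is a match object, while across_break is None because the only character between the quotes is a line break.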

Any character including line breaks

If you do want to allow your regular expression to span multiple lines, turn on the “dot matches line breaks” option. This option masquerades under different names. Perl and many others confusingly call it “single line” mode, whereas Java calls it “dot all” mode. Recipe 3.4 in the next chapter has all the details. Whatever the name of this option in your favorite programming language is, think of it as “dot matches line breaks” mode. That’s all the option does.

An alternative solution is needed for JavaScript, which doesn’t have a “dot matches line breaks” option. As Recipe 2.3 explains, \s matches any whitespace character, whereas \S matches any character that is not matched by \s. Combining these into [\s\S] results in a character class that includes all characters, including line breaks. [\d\D] and [\w\W] have the same effect.
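The two approaches can be compared side by side. The sketch below assumes Python, where the “dot matches line breaks” option is the re.DOTALL flag; the subject string is a made-up example.

```python
import re

subject = "'\n'"  # a line feed between two single quotes

# Solution 1: turn on "dot matches line breaks" (re.DOTALL in Python).
with_option = re.search(r"'.'", subject, re.DOTALL)

# Solution 2: a character class that covers every character, no option needed.
with_class = re.search(r"'[\s\S]'", subject)

# The other complementary pairs behave the same way.
with_digits = re.search(r"'[\d\D]'", subject)
with_words = re.search(r"'[\w\W]'", subject)
```

All four searches should find the quoted line break. The character class version has the advantage of working the same way regardless of which options are set.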

Dot abuse

The dot is the most abused regular expression feature. \d\d.\d\d.\d\d is not a good way to match a date. It does match 05/16/08 just fine, but it also matches 99/99/99. Worse, it matches 12345678.

A proper regex for matching only valid dates is a subject for a later chapter (see Recipe 4.5). But replacing the dot with a more appropriate character class is very easy. \d\d[/.-]\d\d[/.-]\d\d allows a forward slash, dot, or hyphen to be used as the date separator. This regex still matches 99/99/99, but not 12345678.
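To see the difference between the dot and the character class, the following sketch runs both date regexes against the three sample strings from the text. Python and the variable names are our assumptions; fullmatch() is used so that each sample must match in its entirety.

```python
import re

# The dot accepts anything as a separator, including another digit.
loose = re.compile(r"\d\d.\d\d.\d\d")

# The character class accepts only a slash, dot, or hyphen.
strict = re.compile(r"\d\d[/.-]\d\d[/.-]\d\d")

samples = ["05/16/08", "99/99/99", "12345678"]

loose_results = [bool(loose.fullmatch(s)) for s in samples]
strict_results = [bool(strict.fullmatch(s)) for s in samples]
```

The loose regex accepts all three samples. The stricter one still accepts 99/99/99, since it validates only the separators, but it rejects 12345678.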

Tip

It’s just a coincidence that the previous example includes a dot inside the character classes. Inside a character class, the dot is just a literal character. It’s worth including in this particular regular expression because in some countries, such as Germany, the dot is used as a date separator.

Use the dot only when you really want to allow any character. Use a character class or negated character class in any other situation.

Variations

Here’s how to match any quoted character, including line breaks, with the help of an inline mode modifier:

(?s)'.'
Regex options: None
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python

(?m)'.'
Regex options: None
Regex flavors: Ruby

If you cannot turn on “dot matches line breaks” mode outside the regular expression, you can place a mode modifier at the start of the regular expression. We explain the concept of mode modifiers, and JavaScript’s lack of support for them, in the subsection Case-insensitive matching in Recipe 2.1.

(?s) is the mode modifier for “dot matches line breaks” mode in .NET, Java, XRegExp, PCRE, Perl, and Python. The s stands for “single line” mode, which is Perl’s confusing name for “dot matches line breaks.”

The terminology is so confusing that the developer of Ruby’s regex engine copied it wrongly. Ruby uses (?m) to turn on “dot matches line breaks” mode. Other than the different letter, the functionality is exactly the same. The new engine in Ruby 1.9 continues to use (?m) for “dot matches line breaks.” Perl’s very different meaning for (?m) is explained in Recipe 2.5.
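A short sketch shows the inline modifier at work. It assumes Python, which follows Perl's letters: (?s) makes the dot match line breaks, while (?m) affects only the anchors and leaves the dot alone. The subject string is a made-up example.

```python
import re

subject = "'\n'"  # a line feed between two single quotes

# (?s) at the start of the regex turns on "dot matches line breaks".
with_s = re.search(r"(?s)'.'", subject)

# In Python, as in Perl, (?m) does not change what the dot matches,
# so the quoted line break is still not found.
with_m = re.search(r"(?m)'.'", subject)
```

In Ruby the letters are reversed in effect: (?m) is the modifier that makes the dot match line breaks.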

See Also

In many cases, you don’t want to match truly any character, but rather any character except a select few. Recipe 2.3 explains how to do that.

Recipe 3.4 explains how to set options such as “dot matches line breaks” in your source code.

When working with Unicode text, you may prefer to use \X to match a Unicode grapheme instead of the dot, which matches a Unicode code point. Recipe 2.7 explains this in detail.
