Match a string of the following ASCII control characters: bell, escape, form feed, line feed, carriage return, horizontal tab, vertical tab. These characters have the hexadecimal ASCII codes 07, 1B, 0C, 0A, 0D, 09, 0B.
This demonstrates the use of escape sequences and how to reference characters by their hexadecimal codes.
aef v
Regex options: None |
Regex flavors: .NET, Java, PCRE, Python, Ruby |
x07x1Bf v
Regex options: None |
Regex flavors: .NET, Java, JavaScript, Python, Ruby |
aef x0B
Regex options: None |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
Seven of the most commonly used ASCII control characters have dedicated escape sequences. These all consist of a backslash followed by a letter. This is the same syntax that is used by string literals in many programming languages. Table 2-1 shows the common nonprinting characters and how they are represented.
Table 2-1. Nonprinting characters
Representation | Meaning | Hexadecimal representation | Regex flavors |
---|---|---|---|
0x07 | .NET, Java, PCRE, Perl, Python, Ruby | ||
0x1B | .NET, Java, PCRE, Perl, Ruby | ||
form feed | 0x0C | .NET, Java, JavaScript, PCRE, Perl, Python, Ruby | |
0x0A | .NET, Java, JavaScript, PCRE, Perl, Python, Ruby | ||
0x0D | .NET, Java, JavaScript, PCRE, Perl, Python, Ruby | ||
0x09 | .NET, Java, JavaScript, PCRE, Perl, Python, Ruby | ||
vertical tab | 0x0B | .NET, Java, JavaScript, Python, Ruby |
In Perl 5.10 and later, and PCRE 7.2 and later, ‹v
› does
match the vertical tab. In these flavors ‹v
› matches
all vertical whitespace. That includes the vertical tab, line breaks,
and the Unicode line and paragraph separators. So for Perl and PCRE we
have to use a different syntax for the vertical tab.
JavaScript does not support ‹
› and
‹a
›. So for JavaScript too we need a
separate solution.e
These control characters, as well as the alternative syntax shown in the following section, can be used equally inside and outside character classes in your regular expression.
Here’s another way to match the same seven ASCII control characters matched by the regexes earlier in this recipe:
cGx1BcLcJcMcIcK
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Ruby 1.9 |
Using ‹cA
›
through ‹cZ
›, you can
match one of the 26 control characters that occupy positions 1 through
26 in the ASCII table. The c
must
be lowercase. The letter that follows the c
is case insensitive in most flavors. We
recommend that you always use an uppercase letter. Java requires
this.
This syntax can be handy if you’re used to entering control
characters on console systems by pressing the Control key along with a
letter. On a terminal, Ctrl-H sends a backspace. In a regex,
‹cH
›
matches a backspace.
Python and the classic Ruby engine in Ruby 1.8 do not support this syntax. The Oniguruma engine in Ruby 1.9 does.
The escape control character, at position 27 in the ASCII table,
is beyond the reach of the English alphabet, so we leave it as
‹x1B
› in our regular
expression.
Following is yet another way to match our list of seven commonly used control characters:
x07x1Bx0Cx0Ax0Dx09x0B
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
A lowercase x
followed by two
uppercase hexadecimal digits matches a single character in the ASCII
set. Figure 2-1 shows which hexadecimal
combinations from ‹x00
›
through ‹x7F
› match
each character in the entire ASCII character set. The table is
arranged with the first hexadecimal digit going down the left side and
the second digit going across the top.
The characters that ‹x80
› through ‹xFF
› match depends on how your regex engine
interprets them, and which code page your subject text is encoded in.
We recommend that you not use ‹x80
› through ‹xFF
›. Instead, use the Unicode code point token
described in Recipe 2.7.
If you’re using Ruby 1.8 or you compiled PCRE without UTF-8
support, you cannot use Unicode code points. Ruby 1.8 and PCRE
without UTF-8 are 8-bit regex engines. They are completely ignorant
about text encodings and multibyte characters. ‹xAA
› in these engines simply
matches the byte 0xAA, regardless of which character 0xAA happens to
represent or whether 0xAA is part of a multibyte
character.
Recipe 2.7 explains how to make a regex match particular Unicode characters. If your regex engine supports Unicode, you can match nonprinting characters that way too.