2.2. Match Nonprintable Characters

Problem

Match a string of the following ASCII control characters: bell, escape, form feed, line feed, carriage return, horizontal tab, vertical tab. These characters have the hexadecimal ASCII codes 07, 1B, 0C, 0A, 0D, 09, 0B.

This demonstrates the use of escape sequences and how to reference characters by their hexadecimal codes.

Solution

\a\e\f\n\r\t\v
Regex options: None
Regex flavors: .NET, Java, PCRE, Python, Ruby

\x07\x1B\f\n\r\t\v
Regex options: None
Regex flavors: .NET, Java, JavaScript, Python, Ruby

\a\e\f\n\r\t\x0B
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Discussion

Seven of the most commonly used ASCII control characters have dedicated escape sequences. These all consist of a backslash followed by a letter. This is the same syntax that is used by string literals in many programming languages. Table 2-1 shows the common nonprinting characters and how they are represented.

Table 2-1. Nonprinting characters

Representation   Meaning               Hexadecimal representation   Regex flavors
\a               bell                  0x07                         .NET, Java, PCRE, Perl, Python, Ruby
\e               escape                0x1B                         .NET, Java, PCRE, Perl, Ruby
\f               form feed             0x0C                         .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
\n               line feed (newline)   0x0A                         .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
\r               carriage return       0x0D                         .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
\t               horizontal tab        0x09                         .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
\v               vertical tab          0x0B                         .NET, Java, JavaScript, Python, Ruby

In Perl 5.10 and later, and PCRE 7.2 and later, \v does match the vertical tab, but not only the vertical tab: in these flavors \v matches all vertical whitespace. That includes the vertical tab, line breaks, and the Unicode line and paragraph separators. So for Perl and PCRE we have to use a different syntax to match the vertical tab alone.

JavaScript does not support \a and \e, so for JavaScript too we need a separate solution.

These control characters, as well as the alternative syntax shown in the following section, can be used equally inside and outside character classes in your regular expression.
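As a rough illustration in Python (one of the flavors listed above), the sketch below compiles the escape-sequence form once as a plain sequence and once inside a character class. Table 2-1 does not list Python for the dedicated escape-character sequence, so \x1B stands in for it here. The variable names are illustrative only.

```python
import re

# The seven control characters from the Problem statement, built from their hex codes.
subject = "\x07\x1B\x0C\x0A\x0D\x09\x0B"

# Escape-sequence form outside a character class; \x1B stands in for escape.
sequence = re.compile(r"\a\x1B\f\n\r\t\v")

# The same tokens inside a character class match any single one of the seven.
any_control = re.compile(r"[\a\x1B\f\n\r\t\v]")

print(sequence.fullmatch(subject) is not None)
print(any_control.findall("a\tb\x1Bc"))
```

The raw string prefix (r"...") hands the backslashes to the regex engine untouched, so the escapes are interpreted by the regex flavor rather than by Python's string literal syntax.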

Variations on Representations of Nonprinting Characters

The 26 control characters

Here’s another way to match the same seven ASCII control characters matched by the regexes earlier in this recipe:

\cG\x1B\cL\cJ\cM\cI\cK
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Ruby 1.9

Using \cA through \cZ, you can match one of the 26 control characters that occupy positions 1 through 26 in the ASCII table. The c must be lowercase. The letter that follows the c is case insensitive in most flavors. We recommend that you always use an uppercase letter. Java requires this.

This syntax can be handy if you’re used to entering control characters on console systems by pressing the Control key along with a letter. On a terminal, Ctrl-H sends a backspace. In a regex, \cH matches a backspace.

Python and the classic Ruby engine in Ruby 1.8 do not support this syntax. The Oniguruma engine in Ruby 1.9 does.

The escape control character, at position 27 in the ASCII table, is beyond the reach of the English alphabet, so we leave it as \x1B in our regular expression.
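Since Python does not support the control-letter syntax, one possible workaround is to translate each letter into its position in the ASCII table and emit a hexadecimal token instead. The sketch below assumes a hypothetical helper named control_token, which is not part of any library; it only illustrates the arithmetic behind the A-to-1 through Z-to-26 mapping.

```python
import re

def control_token(letter):
    # Hypothetical helper: turn the letter of \cA..\cZ into an \xHH token.
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single uppercase letter A-Z")
    position = ord(letter) - ord("A") + 1  # A..Z occupy positions 1..26
    return "\\x{:02X}".format(position)

# Rebuild the control-letter regex shown above; escape (position 27) stays as \x1B.
pattern_text = (
    control_token("G") + r"\x1B" + "".join(control_token(c) for c in "LJMIK")
)
pattern = re.compile(pattern_text)

print(pattern_text)
print(control_token("H"))  # the backspace sent by Ctrl-H
```

The generated pattern text is a plain sequence of two-digit hexadecimal tokens, so it should work in any flavor that accepts that notation.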

The 7-bit character set

Following is yet another way to match our list of seven commonly used control characters:

\x07\x1B\x0C\x0A\x0D\x09\x0B
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

A lowercase \x followed by two uppercase hexadecimal digits matches a single character in the ASCII set. Figure 2-1 shows which hexadecimal combinations from \x00 through \x7F match each character in the entire ASCII character set. The table is arranged with the first hexadecimal digit going down the left side and the second digit going across the top.

Figure 2-1. ASCII table
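The correspondence between hexadecimal tokens and ASCII characters can be checked mechanically. The following Python sketch, written only as an illustration, builds every token from \x00 through \x7F and confirms that each one matches the character with the same code; the list name mismatches is an arbitrary choice.

```python
import re

# Collect any token from \x00 through \x7F that fails to match its own character.
mismatches = []
for code in range(0x80):
    token = "\\x{:02X}".format(code)  # e.g. code 27 becomes the text \x1B
    if re.fullmatch(token, chr(code)) is None:
        mismatches.append(token)

print(mismatches)
```

An empty list means every two-digit token in the 7-bit range matched exactly the character at that position in the ASCII table.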

The characters that \x80 through \xFF match depend on how your regex engine interprets them, and on which code page your subject text is encoded in. We recommend that you not use \x80 through \xFF. Instead, use the Unicode code point token described in Recipe 2.7.

Caution

If you’re using Ruby 1.8 or you compiled PCRE without UTF-8 support, you cannot use Unicode code points. Ruby 1.8 and PCRE without UTF-8 are 8-bit regex engines. They are completely ignorant about text encodings and multibyte characters. \xAA in these engines simply matches the byte 0xAA, regardless of which character 0xAA happens to represent or whether 0xAA is part of a multibyte character.
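Python is not one of the 8-bit engines named in this caution, but its bytes patterns behave in a comparable byte-oriented way, so they can serve as a loose analogy. The sketch below is only that, an analogy under the assumption that the subject is UTF-8 encoded. It shows \xAA matching a byte in the middle of a multibyte character when applied to bytes, while the same token applied to text matches only the code point U+00AA.

```python
import re

byte_pattern = re.compile(rb"\xAA")  # bytes pattern: matches the single byte 0xAA
text_pattern = re.compile(r"\xAA")   # str pattern: matches the code point U+00AA

subject = "\u012A"                   # LATIN CAPITAL LETTER I WITH MACRON
encoded = subject.encode("utf-8")    # two bytes: 0xC4 0xAA

print(byte_pattern.search(encoded) is not None)  # byte 0xAA found inside the character
print(text_pattern.search(subject) is not None)  # no U+00AA in the text
```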

See Also

Recipe 2.7 explains how to make a regex match particular Unicode characters. If your regex engine supports Unicode, you can match nonprinting characters that way too.
