2.2. Match Nonprintable Characters

Problem

Match a string of the following ASCII control characters: bell, escape, form feed, line feed, carriage return, horizontal tab, vertical tab. These characters have the hexadecimal ASCII codes 07, 1B, 0C, 0A, 0D, 09, 0B.

This demonstrates the use of escape sequences and how to reference characters by their hexadecimal codes.

Solution

\a\e\f\n\r\t\v

Regex options: None

Regex flavors: .NET, Java, PCRE, Python, Ruby

\x07\x1B\f\n\r\t\v

Regex options: None

Regex flavors: .NET, Java, JavaScript, Python, Ruby

\a\e\f\n\r\t\x0B

Regex options: None

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Discussion

Seven of the most commonly used ASCII control characters have dedicated escape sequences. These all consist of a backslash followed by a letter. This is the same syntax that is used by string literals in many programming languages. Table 2-1 shows the common nonprinting characters and how they are represented.

Table 2-1. Nonprinting characters

Representation	Meaning	Hexadecimal representation	Regex flavors
‹\a›	bell	0x07	.NET, Java, PCRE, Perl, Python, Ruby
‹\e›	escape	0x1B	.NET, Java, PCRE, Perl, Ruby
‹\f›	form feed	0x0C	.NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\n›	line feed (newline)	0x0A	.NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\r›	carriage return	0x0D	.NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\t›	horizontal tab	0x09	.NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\v›	vertical tab	0x0B	.NET, Java, JavaScript, Python, Ruby

In Perl 5.10 and later, and PCRE 7.2 and later, ‹\v› does match the vertical tab. In these flavors ‹\v› matches all vertical whitespace. That includes the vertical tab, line breaks, and the Unicode line and paragraph separators. So for Perl and PCRE we have to use a different syntax for the vertical tab.

JavaScript does not support ‹\a› and ‹\e›. So for JavaScript too we need a separate solution.

These control characters, as well as the alternative syntax shown in the following section, can be used equally inside and outside character classes in your regular expression.
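
As a quick check of the escapes above, here is a minimal sketch using Python's re module. Because Table 2-1 does not list Python for ‹\e›, the sketch writes the escape character as ‹\x1B›; the variable names are illustrative only.

```python
import re

# The seven control characters from the Problem statement, by code point.
subject = "\x07\x1B\x0C\x0A\x0D\x09\x0B"

# Escape is spelled \x1B because Table 2-1 does not list Python for \e.
sequence = re.compile(r"\a\x1B\f\n\r\t\v")

# The same escapes inside a character class match any one of the characters.
any_control = re.compile(r"[\a\x1B\f\n\r\t\v]+")

matched_sequence = sequence.fullmatch(subject) is not None
# Reversing the subject shows that order does not matter for the class.
matched_class = any_control.fullmatch(subject[::-1]) is not None
print(matched_sequence, matched_class)  # prints: True True
```

The raw string prefix keeps the backslashes intact so that the regex engine, not the Python string literal, interprets each escape.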

Variations on Representations of Nonprinting Characters

The 26 control characters

Here’s another way to match the same seven ASCII control characters matched by the regexes earlier in this recipe:

\cG\x1B\cL\cJ\cM\cI\cK

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Ruby 1.9

Using ‹\cA› through ‹\cZ›, you can match one of the 26 control characters that occupy positions 1 through 26 in the ASCII table. The c must be lowercase. The letter that follows the c is case insensitive in most flavors. We recommend that you always use an uppercase letter. Java requires this.

This syntax can be handy if you’re used to entering control characters on console systems by pressing the Control key along with a letter. On a terminal, Ctrl-H sends a backspace. In a regex, ‹\cH› matches a backspace.

Python and the classic Ruby engine in Ruby 1.8 do not support this syntax. The Oniguruma engine in Ruby 1.9 does.

The escape control character, at position 27 in the ASCII table, is beyond the reach of the English alphabet, so we leave it as ‹\x1B› in our regular expression.
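
Since Python does not support the control-letter syntax, one possible workaround is to translate each letter into its hexadecimal token. The sketch below assumes the mapping described above (A through Z onto positions 1 through 26); the helper names control_char and control_regex are hypothetical, not part of any library.

```python
import re
import string

def control_char(letter: str) -> str:
    """Return the character that Ctrl-<letter> sends (A -> 0x01 ... Z -> 0x1A)."""
    upper = letter.upper()
    if len(upper) != 1 or upper not in string.ascii_uppercase:
        raise ValueError("expected a single letter A-Z")
    return chr(ord(upper) - 64)  # 'A' is 0x41, so subtracting 0x40 gives 0x01

def control_regex(letters: str) -> str:
    """Spell the control characters for `letters` as \\xHH regex tokens."""
    return "".join(f"\\x{ord(control_char(ch)):02X}" for ch in letters)

# G, L, J, M, I, K: bell, form feed, line feed, carriage return, tab, vertical tab
pattern = control_regex("GLJMIK")
is_match = re.fullmatch(pattern, "\a\f\n\r\t\v") is not None
print(pattern, is_match)  # prints: \x07\x0C\x0A\x0D\x09\x0B True
```

The escape character is left out here for the same reason as in the regex: position 27 has no letter to map from.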

The 7-bit character set

Following is yet another way to match our list of seven commonly used control characters:

\x07\x1B\x0C\x0A\x0D\x09\x0B

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

A backslash and a lowercase x followed by two uppercase hexadecimal digits matches a single character in the ASCII set. Figure 2-1 shows which hexadecimal combinations from ‹\x00› through ‹\x7F› match each character in the entire ASCII character set. The table is arranged with the first hexadecimal digit going down the left side and the second digit going across the top.

Figure 2-1. ASCII table
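
The row-and-column layout of the table can be sketched in Python: the first hexadecimal digit selects the row and the second selects the column. The helper name hex_token is hypothetical and is only meant to illustrate the lookup.

```python
import re

def hex_token(row: int, col: int) -> str:
    """Regex token for the table cell at first hex digit `row` and second digit `col`."""
    if not (0 <= row <= 7 and 0 <= col <= 15):
        raise ValueError("row must be 0-7 and col 0-15 for 7-bit ASCII")
    return f"\\x{row:X}{col:X}"

token = hex_token(4, 1)  # row 4, column 1 of the table
matches_letter = re.fullmatch(token, "A") is not None
escape_token = hex_token(1, 11)  # row 1, column B: the escape character
matches_escape = re.fullmatch(escape_token, "\x1b") is not None
print(token, matches_letter, escape_token, matches_escape)
# prints: \x41 True \x1B True
```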

The characters that ‹\x80› through ‹\xFF› match depend on how your regex engine interprets them, and on which code page your subject text is encoded in. We recommend that you not use ‹\x80› through ‹\xFF›. Instead, use the Unicode code point token described in Recipe 2.7.

Caution

If you’re using Ruby 1.8 or you compiled PCRE without UTF-8 support, you cannot use Unicode code points. Ruby 1.8 and PCRE without UTF-8 are 8-bit regex engines. They are completely ignorant about text encodings and multibyte characters. ‹\xAA› in these engines simply matches the byte 0xAA, regardless of which character 0xAA happens to represent or whether 0xAA is part of a multibyte character.
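
The byte-versus-character distinction can be illustrated by analogy in Python, where a bytes pattern works on raw bytes and a str pattern works on code points. This is only a sketch of the general effect; it is not a reproduction of Ruby 1.8 or of PCRE built without UTF-8, and the choice of U+012A as the subject is an assumption made for the example.

```python
import re

byte_pattern = re.compile(rb"\xAA")  # bytes pattern: matches the single byte 0xAA
text_pattern = re.compile(r"\xAA")   # str pattern: matches the code point U+00AA

subject = "\u012A"                   # LATIN CAPITAL LETTER I WITH MACRON
encoded = subject.encode("utf-8")    # b'\xc4\xaa': 0xAA is only the second byte

# The byte-oriented search hits 0xAA in the middle of a multibyte character.
byte_hit = byte_pattern.search(encoded) is not None
# The character-oriented search finds no U+00AA in the decoded text.
text_hit = text_pattern.search(subject) is not None
print(byte_hit, text_hit)  # prints: True False
```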
