5.13. Replace Repeated Whitespace with a Single Space

Problem

As part of a cleanup routine for user input or other data, you want to replace repeated whitespace characters with a single space. Any tabs, line breaks, or other whitespace should also be replaced with a space.

Solution

To implement either of the following regular expressions, simply replace all matches with a single space character. Recipe 3.14 shows the code to do this.

Clean any whitespace characters

\s+

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
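
As a rough illustration of the replacement step, here is a minimal Python sketch. The helper name `collapse_whitespace` and the sample string are made up for this example; the only thing taken from the recipe is the regex itself and the single-space replacement.

```python
import re

# The "clean any whitespace characters" regex from this recipe.
WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space (hypothetical helper)."""
    return WHITESPACE_RUN.sub(" ", text)


# Tabs, spaces, and a CRLF line break all collapse to single spaces.
print(collapse_whitespace("one\t\ttwo \r\n three"))  # one two three
```

Compiling the pattern once is a convenience for repeated calls, not something the recipe requires.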

Clean horizontal whitespace characters

[●\t\xA0]+

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby 1.8

[●\t\u00A0]+

Regex options: None

Regex flavors: .NET, Java, JavaScript, Python, Ruby 1.9

\h+

Regex options: None

Regex flavors: PCRE 7.2, Perl 5.10

Discussion

A common text cleanup routine is to replace repeated whitespace characters with a single space. In HTML, for example, repeated whitespace is simply ignored when rendering a page (with a few exceptions). Removing repeated whitespace can therefore help to reduce the file size of some pages (or at least page sections) without any negative effects.

Clean any whitespace characters

In this solution, any sequence of whitespace characters (line breaks, tabs, spaces, etc.) is replaced with a single space. Since the ‹+› quantifier repeats the ‹\s› whitespace class one or more times, even a single tab character, for example, will be replaced with a space. If you replaced the ‹+› with ‹{2,}›, only sequences of two or more whitespace characters would be replaced. This could result in fewer replacements and thus improved performance, but it would also leave behind any lone tab characters or line breaks that would otherwise be replaced with space characters. The better approach, therefore, depends on what you’re trying to accomplish.
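
The difference between the two quantifiers can be seen in a short Python sketch. The sample string and variable names are invented for illustration only.

```python
import re

# Invented sample: a lone tab, a double space, a lone line break, a triple space.
sample = "name:\tAda  Lovelace\nrole:   analyst"

# ‹\s+›: every whitespace run, even a single character, becomes one space.
one_or_more = re.sub(r"\s+", " ", sample)

# ‹\s{2,}›: only runs of two or more characters are replaced,
# so the lone tab and the lone line break survive.
two_or_more = re.sub(r"\s{2,}", " ", sample)

print(repr(one_or_more))  # 'name: Ada Lovelace role: analyst'
print(repr(two_or_more))  # 'name:\tAda Lovelace\nrole: analyst'
```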

Clean horizontal whitespace characters

This works exactly like the previous solution, except that it leaves line breaks alone. Only spaces, tabs, and no-break spaces are replaced. HTML no-break space entities (&nbsp;) are unaffected.
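
A small Python sketch of the horizontal-only variant follows. It assumes a literal space in the character class where the recipe prints ●; the helper name `collapse_horizontal` and the sample text are hypothetical.

```python
import re

# Space, tab, and no-break space (U+00A0); line breaks are not in the class.
HORIZONTAL_RUN = re.compile(r"[ \t\xA0]+")


def collapse_horizontal(text: str) -> str:
    """Collapse runs of horizontal whitespace, keeping line breaks (hypothetical helper)."""
    return HORIZONTAL_RUN.sub(" ", text)


sample = "first \t line\xA0\xA0here\nsecond&nbsp;&nbsp;line"

# The line break is preserved, and the literal "&nbsp;" entities are untouched
# because they are ordinary text, not no-break space characters.
print(repr(collapse_horizontal(sample)))  # 'first line here\nsecond&nbsp;&nbsp;line'
```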

PCRE 7.2 and Perl 5.10 include the shorthand character class ‹\h› that you might prefer to use here since it is specifically designed to match horizontal whitespace. It also matches some additional esoteric horizontal whitespace characters.

Using ‹\xA0› to match no-break spaces in Ruby 1.9 may lead to an “invalid multibyte escape” or other encoding-related errors, since it references a character beyond the ASCII range ‹\x00› to ‹\x7F›. Use ‹\u00A0› instead.
