As part of a cleanup routine for user input or other data, you want to replace repeated whitespace characters with a single space. Any tabs, line breaks, or other whitespace should also be replaced with a space.
To implement either of the following regular expressions, simply replace all matches with a single space character. Recipe 3.14 shows the code to do this.
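As a minimal sketch of that replacement step, here is how it might look in Python. It assumes the first regex of this recipe is ‹\s+›, which is the pattern the discussion describes; the name `collapse_whitespace` is hypothetical and used only for illustration.

```python
import re

# Assumed pattern: one or more whitespace characters of any kind
# (spaces, tabs, line breaks, and so on).
WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space."""
    return WHITESPACE_RUN.sub(" ", text)


print(collapse_whitespace("a \t b\n\nc"))  # a b c
```

Compiling the pattern once is optional; it mainly helps when the function is called many times.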
A common text cleanup routine is to replace repeated whitespace characters with a single space. In HTML, for example, repeated whitespace is simply ignored when rendering a page (with a few exceptions). Removing repeated whitespace can therefore help to reduce the file size of some pages (or at least page sections) without any negative effects.
In this solution, any sequence of whitespace characters (line breaks, tabs, spaces, etc.) is replaced with a single space. Since the ‹+› quantifier repeats the ‹\s› whitespace class one or more times, even a single tab character, for example, will be replaced with a space. If you replaced the ‹+› with ‹{2,}›, only sequences of two or more whitespace characters would be replaced. This could result in fewer replacements and thus improved performance, but it could also leave behind tab characters or line breaks that would otherwise be replaced with space characters. The better approach, therefore, depends on what you’re trying to accomplish.
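The difference between the two quantifiers can be seen in a short Python comparison. This is only an illustrative sketch; the sample string and variable names are made up for the example.

```python
import re

# A lone tab, a run of two spaces, and a lone line break.
text = "one\ttwo  three\nfour"

# ‹\s+› replaces every whitespace run, including single characters.
one_or_more = re.sub(r"\s+", " ", text)

# ‹\s{2,}› only touches runs of two or more whitespace characters,
# so the lone tab and the lone line break survive.
two_or_more = re.sub(r"\s{2,}", " ", text)

print(repr(one_or_more))  # 'one two three four'
print(repr(two_or_more))  # 'one\ttwo three\nfour'
```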
This works exactly like the previous solution, except that it leaves line breaks alone. Only spaces, tabs, and no-break spaces are replaced. HTML no-break space entities (&nbsp;) are unaffected.
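A hedged Python sketch of this variant follows. It assumes the horizontal whitespace regex is a character class containing a space, a tab, and ‹\xA0›, repeated with ‹+›; the function name `collapse_horizontal` is hypothetical.

```python
import re

# Assumed pattern: a space, a tab, or a no-break space (U+00A0),
# one or more times. Line breaks are not in the class.
HORIZONTAL_RUN = re.compile(r"[ \t\xA0]+")


def collapse_horizontal(text: str) -> str:
    """Collapse runs of spaces, tabs, and no-break spaces; keep line breaks."""
    return HORIZONTAL_RUN.sub(" ", text)


sample = "a \t\xa0b\n\nc&nbsp;&nbsp;d"
print(repr(collapse_horizontal(sample)))  # 'a b\n\nc&nbsp;&nbsp;d'
```

The literal text `&nbsp;` passes through untouched because the regex matches the no-break space character itself, not its HTML entity.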
PCRE 7.2 and Perl 5.10 include the shorthand character class ‹\h› that you might prefer to use here since it is specifically designed to match horizontal whitespace. It also matches some additional esoteric horizontal whitespace characters.
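Python's built-in `re` module does not support ‹\h›, so one way to approximate it there is an explicit character class. The sketch below is an assumption based on the horizontal whitespace characters listed in the PCRE documentation; verify the list against the engine you actually target. The names `H_CLASS` and `collapse_h` are hypothetical.

```python
import re

# Approximation of the ‹\h› set: tab, space, no-break space, and the
# Unicode horizontal spaces (U+1680, U+180E, U+2000-U+200A, U+202F,
# U+205F, U+3000). Treat this list as an assumption, not a definition.
H_CLASS = "[\t \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]"
H_RUN = re.compile(H_CLASS + "+")


def collapse_h(text: str) -> str:
    """Collapse runs of horizontal whitespace; leave line breaks alone."""
    return H_RUN.sub(" ", text)


# An em space (U+2003) followed by an ideographic space (U+3000).
print(repr(collapse_h("a\u2003\u3000b\nc")))  # 'a b\nc'
```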
Using ‹\xA0› to match no-break spaces in Ruby 1.9 may lead to an “invalid multibyte escape” or other encoding-related errors, since it references a character beyond the ASCII range ‹\x00› to ‹\x7F›. Use ‹\u00A0› instead.
Recipe 5.12 explains how to trim leading and trailing whitespace.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.2 explains how to match nonprinting characters. Recipe 2.3 explains character classes. Recipe 2.12 explains repetition.