You want to remove leading and trailing whitespace from a string. For instance, you might need to do this to clean up data submitted by users in a web form before passing their input to one of the validation regexes in Chapter 4.
To keep things simple and fast, the best all-around solution is to use two substitutions—one to remove leading whitespace, and another to remove trailing whitespace.
Leading whitespace:
\A\s+
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^\s+
Regex options: None (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
Trailing whitespace:
\s+\z
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
\s+$
Regex options: None (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
Simply replace matches found using one of the “leading whitespace” regexes and one of the “trailing whitespace” regexes with the empty string. Follow the code in Recipe 3.14 to perform replacements. With both the leading and trailing whitespace regular expressions, you only need to replace the first match found since the regexes match all leading or trailing whitespace in one go.
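As a minimal sketch of the two-substitution approach in JavaScript: the helper name `trimTwoPass` is hypothetical, and the ‹^› and ‹$› variants are used because JavaScript does not support ‹\A› and ‹\z›.

```javascript
// Hypothetical helper: trims with two separate substitutions.
// No /g flag is needed, because each regex matches the whole run of
// leading or trailing whitespace in a single match.
function trimTwoPass(subject) {
  return subject
    .replace(/^\s+/, "")  // leading whitespace
    .replace(/\s+$/, ""); // trailing whitespace
}

console.log(JSON.stringify(trimTwoPass("  \t user input \n"))); // prints "user input"
```

Whitespace inside the string is left alone; only the runs touching the start and end of the string are removed.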
Removing leading and trailing whitespace is a simple but common task. The regular expressions just shown contain three parts each: the shorthand character class to match any whitespace character (‹\s›), a quantifier to repeat the class one or more times (‹+›), and an anchor to assert position at the beginning or end of the string. ‹\A› and ‹^› match at the beginning; ‹\z› and ‹$› at the end.
We’ve included two options each for matching leading and trailing whitespace because of incompatibilities between Ruby and JavaScript. With the other regex flavors, you can choose either option. The versions with ‹^› and ‹$› don’t work correctly in Ruby, because Ruby always lets these anchors match at the beginning and end of any line. JavaScript doesn’t support the ‹\A› and ‹\z› anchors.
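The effect of the “^ and $ match at line breaks” option can be seen in JavaScript by turning it on with the /m flag, which gives ‹^› roughly the behavior the paragraph above describes for Ruby. The sample string below is an assumption chosen for illustration.

```javascript
const subject = "first line\n   second line";

// Without /m, ^ matches only at the very start of the string, so there is
// no leading whitespace to remove and the string comes back unchanged.
const withoutM = subject.replace(/^\s+/, "");

// With /m, ^ also matches right after the line break, so the indentation
// of the second line is wrongly treated as leading whitespace.
const withM = subject.replace(/^\s+/m, "");

console.log(withoutM === subject);  // true
console.log(JSON.stringify(withM)); // "first line\nsecond line"
```

This is why the ‹^› and ‹$› variants require that the option stays off.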
Many programming languages provide a function, usually called trim or strip, that can remove leading and trailing whitespace for you. Table 5-2 shows how to use this built-in function or method in a variety of programming languages.
Table 5-2. Standard functions to remove leading and trailing whitespace
Language | Function |
---|---|
C#, VB.NET | String.Trim() |
Java, JavaScript | string.trim() |
PHP | trim($string) |
Python, Ruby | string.strip() |
Perl does not have an equivalent function in its standard library, but you can create your own by using the regular expressions shown earlier in this recipe:
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}
JavaScript’s string.trim() method is a recent addition to the language. For older browsers (prior to Internet Explorer 9 and Firefox 3.5), you can add it like this:
// Add the trim method for browsers that don't already include it
if (!String.prototype.trim) {
    String.prototype.trim = function() {
        return this.replace(/^\s+/, "").replace(/\s+$/, "");
    };
}
There are in fact many different ways you can write a regular
expression to help you trim a string. However, the alternatives are
usually slower than using two simple substitutions when working with
long strings (when performance matters most). Following are some of the
more common alternative solutions you might encounter. They are all
written in JavaScript, and since standard JavaScript doesn’t have a “dot
matches line breaks” option, the regular expressions use ‹[\s\S]› to
match any single character, including line breaks. In other programming
languages, use a dot instead, and enable the “dot matches line breaks”
option.
string.replace(/^\s+|\s+$/g, "");
This is probably the most common solution. It combines the
two simple regexes via alternation (see Recipe 2.8), and uses the /g
(global) flag to replace all matches
rather than just the first (it will match twice when its target
contains both leading and trailing whitespace). This isn’t a
terrible approach, but it’s slower than using two simple
substitutions when working with long strings since the two
alternation options need to be tested at every character
position.
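A short sketch of why the /g flag matters here; the sample string is an assumption chosen for illustration.

```javascript
const padded = "  two matches  ";

// The alternation matches once at the start and once more at the end.
const matches = padded.match(/^\s+|\s+$/g);
console.log(matches.length); // 2

// Without /g, only the first (leading) match is replaced.
console.log(JSON.stringify(padded.replace(/^\s+|\s+$/, "")));  // "two matches  "
console.log(JSON.stringify(padded.replace(/^\s+|\s+$/g, ""))); // "two matches"
```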
string.replace(/^\s*([\s\S]*?)\s*$/, "$1")
This regex works by matching the entire string and capturing the sequence from the first to the last nonwhitespace characters (if any) to backreference 1. By replacing the entire string with backreference 1, you’re left with a trimmed version of the string.
This approach is conceptually simple, but the lazy quantifier inside the capturing group makes the regex do a lot of extra work (i.e., backtracking), and therefore tends to make this option slow with long target strings.
Let’s step back to look at how this actually works. After the regex enters the capturing group, the ‹[\s\S]› class’s lazy ‹*?› quantifier requires that it be repeated as few times as possible. Thus, the regex matches one character at a time, stopping after each character to try to match the remaining ‹\s*$› pattern. If that fails because nonwhitespace characters remain somewhere after the current position in the string, the regex matches one more character, updates the backreference, and then tries the remainder of the pattern again.
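The stepwise process just described can be imitated by hand. The function below is a hypothetical illustration, not how any regex engine is actually implemented: it grows the captured text one character at a time and retries the remaining ‹\s*$› pattern after each step, counting the attempts.

```javascript
// Hypothetical illustration of the lazy quantifier's behavior.
function lazyTrimSteps(subject) {
  // Characters consumed by the leading \s* before the group is entered.
  const start = subject.match(/^\s*/)[0].length;
  let attempts = 0;
  for (let end = start; end <= subject.length; end++) {
    attempts++;
    // Does the rest of the string satisfy the remaining \s*$ pattern?
    if (/^\s*$/.test(subject.slice(end))) {
      return { captured: subject.slice(start, end), attempts };
    }
  }
}

const steps = lazyTrimSteps("  ab c  ");
console.log(JSON.stringify(steps.captured)); // "ab c"
console.log(steps.attempts);                 // 5
```

The number of attempts grows with the length of the text being kept, which is where the extra work on long strings comes from.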
string.replace(/^\s*([\s\S]*\S)?\s*$/, "$1")
This is similar to the last regex, but it replaces the lazy quantifier with a greedy one for performance reasons. To make sure that the capturing group still only matches up to the last nonwhitespace character, a trailing ‹\S› is required. However, since the regex must be able to match whitespace-only strings, the entire capturing group is made optional by adding a trailing question mark quantifier.
Here, the greedy asterisk in ‹[\s\S]*› repeats its any-character pattern to the end of the string. The regex then backtracks one character at a time until it’s able to match the following ‹\S›, or until it backtracks to the first character matched within the group (after which it skips the group).
Unless there’s more trailing whitespace than other text, this generally ends up being faster than the previous solution that used a lazy quantifier. Still, it doesn’t hold up to the consistent performance of using two simple substitutions.
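A brief sketch of the greedy variant, including the whitespace-only case that makes the optional group necessary. The constant name `greedyTrim` and the sample strings are assumptions for illustration.

```javascript
const greedyTrim = /^\s*([\s\S]*\S)?\s*$/;

console.log(JSON.stringify("  ab c  ".replace(greedyTrim, "$1"))); // "ab c"

// For a whitespace-only string the optional group does not participate,
// and JavaScript substitutes an empty string for $1.
console.log(JSON.stringify(" \t\n ".replace(greedyTrim, "$1")));   // ""

// The skipped group shows up as undefined in the match result.
console.log(" \t\n ".match(greedyTrim)[1]); // undefined
```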
string.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, "$1")
This is a relatively common approach, but there’s no good reason to use it since it’s consistently one of the slowest of the options shown here. It’s similar to the last two regexes in that it matches the entire string and replaces it with the part you want to keep, but because the inner, noncapturing group matches only one word at a time, there are a lot of discrete steps the regex must take. The performance hit may be unnoticeable when trimming short strings, but with long strings that contain many words, this regex can become a performance problem.
Some regular expression implementations contain clever optimizations that alter the internal matching processes described here, and therefore make some of these options perform a bit better or worse than we’ve suggested. Nevertheless, the simplicity of using two substitutions provides consistently respectable performance with different string lengths and varying string contents, and it’s therefore the best all-around solution.
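Because results vary between implementations, it can be worth measuring on your own engine and data. The following is a rough timing sketch, not a rigorous benchmark: the names in `trimmers`, the test input, and the iteration count are all assumptions, and `Date.now()` gives only millisecond resolution.

```javascript
// Hypothetical names; each entry is one of the approaches from this recipe.
const trimmers = {
  twoSubstitutions: (s) => s.replace(/^\s+/, "").replace(/\s+$/, ""),
  alternation: (s) => s.replace(/^\s+|\s+$/g, ""),
  lazyCapture: (s) => s.replace(/^\s*([\s\S]*?)\s*$/, "$1"),
  greedyCapture: (s) => s.replace(/^\s*([\s\S]*\S)?\s*$/, "$1"),
  wordAtATime: (s) => s.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, "$1"),
};

// Assumed test input: 2,000 words padded with whitespace on both sides.
const longInput = "   " + "word ".repeat(2000) + "  \n";

for (const [name, trim] of Object.entries(trimmers)) {
  const started = Date.now();
  let result;
  for (let i = 0; i < 50; i++) {
    result = trim(longInput);
  }
  const elapsedMs = Date.now() - started;
  console.log(`${name}: ${elapsedMs} ms, trimmed length ${result.length}`);
}
```

All five approaches should report the same trimmed length; only the timings are expected to differ, and they will depend on the engine and the input.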
Recipe 5.13 explains how to replace repeated whitespace with a single space.
Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.