5.12. Trim Leading and Trailing Whitespace

Problem

You want to remove leading and trailing whitespace from a string. For instance, you might need to do this to clean up data submitted by users in a web form before passing their input to one of the validation regexes in Chapter 4.

Solution

To keep things simple and fast, the best all-around solution is to use two substitutions—one to remove leading whitespace, and another to remove trailing whitespace.

Leading whitespace:

\A\s+
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^\s+
Regex options: None (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Trailing whitespace:

\s+\Z
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
\s+$
Regex options: None (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Simply replace matches found using one of the “leading whitespace” regexes and one of the “trailing whitespace” regexes with the empty string. Follow the code in Recipe 3.14 to perform replacements. With both the leading and trailing whitespace regular expressions, you only need to replace the first match found since the regexes match all leading or trailing whitespace in one go.

Discussion

Removing leading and trailing whitespace is a simple but common task. The regular expressions just shown contain three parts each: the shorthand character class to match any whitespace character (\s), a quantifier to repeat the class one or more times (+), and an anchor to assert position at the beginning or end of the string. \A and ^ match at the beginning; \Z and $ at the end.

We’ve included two options for matching both leading and trailing whitespace because of incompatibilities between Ruby and JavaScript. With the other regex flavors, you can choose either option. The versions with ^ and $ don’t work correctly in Ruby, because Ruby always lets these anchors match at the beginning and end of any line. JavaScript doesn’t support the \A and \Z anchors.
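
The requirement that “^ and $ match at line breaks” must not be set can be seen in a short JavaScript sketch. The sample string and variable names here are made up for illustration. With the /m flag turned on, the trailing-whitespace regex finds its first match at the end of the first line, so the whitespace at the real end of the string survives:

```javascript
// Illustration only: the sample string and variable names are assumptions.
const text = "  first line  \n  second line  ";

// Without the m flag, ^ and $ match only at the start and end of the string.
const trimmed = text.replace(/^\s+/, "").replace(/\s+$/, "");

// With the m flag, $ also matches before the line break, so the first
// (and only replaced) match is the whitespace ending the first line.
const wrong = text.replace(/^\s+/m, "").replace(/\s+$/m, "");

console.log(JSON.stringify(trimmed)); // "first line  \n  second line"
console.log(JSON.stringify(wrong));   // "first line\n  second line  "
```

This is the same behavior that makes the ^ and $ versions unsuitable for Ruby, where the anchors always work this way.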

Many programming languages provide a function, usually called trim or strip, that can remove leading and trailing whitespace for you. Table 5-2 shows how to use this built-in function or method in a variety of programming languages.

Table 5-2. Standard functions to remove leading and trailing whitespace

Language            Function
C#, VB.NET          String.Trim([Chars])
Java, JavaScript    string.trim()
PHP                 trim($string)
Python, Ruby        string.strip()

Perl does not have an equivalent function in its standard library, but you can create your own by using the regular expressions shown earlier in this recipe:

sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

JavaScript’s string.trim() method is a recent addition to the language. For older browsers (prior to Internet Explorer 9 and Firefox 3.5), you can add it like this:

// Add the trim method for browsers that don't already include it
if (!String.prototype.trim) {
    String.prototype.trim = function() {
        return this.replace(/^\s+/, "").replace(/\s+$/, "");
    };
}

Tip

In both Perl and JavaScript, \s matches any character defined as whitespace by the Unicode standard, in addition to the space, tab, line feed, and carriage return characters that are most commonly considered whitespace.
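
As a quick check of this in JavaScript, the following sketch pads a word with a no-break space (U+00A0) and an em space (U+2003) alongside an ordinary tab and space. The input is invented for the example:

```javascript
// Made-up input: U+00A0 (no-break space) and U+2003 (em space) surround
// the word, together with an ordinary tab and space.
const padded = "\u00A0\t value \u2003";

// The same two substitutions used throughout this recipe.
const cleaned = padded.replace(/^\s+/, "").replace(/\s+$/, "");

console.log(cleaned === "value"); // true
```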

Variations

There are in fact many different ways you can write a regular expression to help you trim a string. However, the alternatives are usually slower than using two simple substitutions when working with long strings (when performance matters most). Following are some of the more common alternative solutions you might encounter. They are all written in JavaScript, and since standard JavaScript doesn’t have a “dot matches line breaks” option, the regular expressions use [\s\S] to match any single character, including line breaks. In other programming languages, use a dot instead, and enable the “dot matches line breaks” option.

string.replace(/^\s+|\s+$/g, "");

This is probably the most common solution. It combines the two simple regexes via alternation (see Recipe 2.8), and uses the /g (global) flag to replace all matches rather than just the first (it will match twice when its target contains both leading and trailing whitespace). This isn’t a terrible approach, but it’s slower than using two simple substitutions when working with long strings since the two alternation options need to be tested at every character position.
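
To see the two matches mentioned above, this small sketch (with an invented sample string) lists what the combined regex finds, and shows what happens if the /g flag is left off:

```javascript
// Invented sample with both leading and trailing whitespace.
const sample = "  \tpadded text \n";

// With /g, the regex matches twice: once at each end of the string.
const found = sample.match(/^\s+|\s+$/g);
console.log(found.length); // 2

// Replacing all matches trims both ends.
const bothEnds = sample.replace(/^\s+|\s+$/g, "");
console.log(JSON.stringify(bothEnds)); // "padded text"

// Without /g, only the first match (the leading whitespace) is replaced.
const leadingOnly = sample.replace(/^\s+|\s+$/, "");
console.log(JSON.stringify(leadingOnly)); // "padded text \n"
```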

string.replace(/^\s*([\s\S]*?)\s*$/, "$1")

This regex works by matching the entire string and capturing the sequence from the first to the last nonwhitespace characters (if any) to backreference 1. By replacing the entire string with backreference 1, you’re left with a trimmed version of the string.

This approach is conceptually simple, but the lazy quantifier inside the capturing group makes the regex do a lot of extra work (i.e., backtracking), and therefore tends to make this option slow with long target strings.

Let’s step back to look at how this actually works. After the regex enters the capturing group, the [\s\S] class’s lazy *? quantifier requires that it be repeated as few times as possible. Thus, the regex matches one character at a time, stopping after each character to try to match the remaining \s*$ pattern. If that fails because nonwhitespace characters remain somewhere after the current position in the string, the regex matches one more character, updates the backreference, and then tries the remainder of the pattern again.
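
The stepping just described can be imitated in plain code. The following is only a rough sketch: lazyTrimTrace is a hypothetical helper written for this illustration, and it is not how a regex engine is actually implemented. It counts how many times the remaining \s*$ pattern has to be attempted before it succeeds:

```javascript
// Hypothetical helper that imitates the steps taken by
// /^\s*([\s\S]*?)\s*$/. Not an actual regex engine.
function lazyTrimTrace(input) {
    // ^\s* consumes the leading whitespace first.
    const start = input.match(/^\s*/)[0].length;
    let attempts = 0;
    // The lazy group grows one character at a time, starting from zero.
    for (let end = start; end <= input.length; end++) {
        attempts++;
        // Try the remaining \s*$ pattern from the current position.
        if (/^\s*$/.test(input.slice(end))) {
            return { captured: input.slice(start, end), attempts: attempts };
        }
    }
    // Not reached: the test always succeeds once end equals input.length.
    return { captured: "", attempts: attempts };
}

const trace = lazyTrimTrace("  ab cd  ");
console.log(trace.captured); // ab cd
console.log(trace.attempts); // 6
```

The number of attempts grows with the length of the text between the leading and trailing whitespace, which is why this option tends to slow down on long strings.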

string.replace(/^\s*([\s\S]*\S)?\s*$/, "$1")

This is similar to the last regex, but it replaces the lazy quantifier with a greedy one for performance reasons. To make sure that the capturing group still only matches up to the last nonwhitespace character, a trailing \S is required. However, since the regex must be able to match whitespace-only strings, the entire capturing group is made optional by adding a trailing question mark quantifier.

Here, the greedy asterisk in [\s\S]* repeats its any-character pattern to the end of the string. The regex then backtracks one character at a time until it’s able to match the following \S, or until it backtracks to the first character matched within the group (after which it skips the group).

Unless there’s more trailing whitespace than other text, this generally ends up being faster than the previous solution that used a lazy quantifier. Still, it doesn’t hold up to the consistent performance of using two simple substitutions.
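
The role of the optional group is easiest to see with a whitespace-only string. In this sketch, greedyTrim is a name chosen just for the example:

```javascript
// greedyTrim is a name chosen for this example only.
const greedyPattern = /^\s*([\s\S]*\S)?\s*$/;
const greedyTrim = (s) => s.replace(greedyPattern, "$1");

console.log(JSON.stringify(greedyTrim("  ab cd  "))); // "ab cd"
console.log(JSON.stringify(greedyTrim("   \t\n")));   // ""

// For a whitespace-only string the group is skipped entirely, so the
// capture is undefined and JavaScript substitutes an empty string for $1.
const skipped = "   ".match(greedyPattern);
console.log(skipped[1]); // undefined
```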

string.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, "$1")

This is a relatively common approach, but there’s no good reason to use it since it’s consistently one of the slowest of the options shown here. It’s similar to the last two regexes in that it matches the entire string and replaces it with the part you want to keep, but because the inner, noncapturing group matches only one word at a time, there are a lot of discrete steps the regex must take. The performance hit may be unnoticeable when trimming short strings, but with long strings that contain many words, this regex can become a performance problem.

Some regular expression implementations contain clever optimizations that alter the internal matching processes described here, and therefore make some of these options perform a bit better or worse than we’ve suggested. Nevertheless, the simplicity of using two substitutions provides consistently respectable performance with different string lengths and varying string contents, and it’s therefore the best all-around solution.
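
Whichever option you choose, the results should be identical. The following sketch checks that on a handful of invented sample strings. The variants object and its property names are assumptions made for this example, and the check says nothing about speed:

```javascript
// The five approaches from this recipe, under names invented for the example.
const variants = {
    twoPass: (s) => s.replace(/^\s+/, "").replace(/\s+$/, ""),
    alternation: (s) => s.replace(/^\s+|\s+$/g, ""),
    lazy: (s) => s.replace(/^\s*([\s\S]*?)\s*$/, "$1"),
    greedy: (s) => s.replace(/^\s*([\s\S]*\S)?\s*$/, "$1"),
    wordAtATime: (s) => s.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, "$1")
};

// Made-up inputs covering empty, whitespace-only, and multiline strings.
const samples = ["", "   ", "word", "  two words \n", "\tline one\nline two  "];

// Collect every case where a variant disagrees with the two-pass result.
const disagreements = [];
for (const sample of samples) {
    const expected = variants.twoPass(sample);
    for (const [name, trim] of Object.entries(variants)) {
        if (trim(sample) !== expected) {
            disagreements.push({ name: name, sample: sample });
        }
    }
}

console.log(disagreements.length); // 0
```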

See Also

Recipe 5.13 explains how to replace repeated whitespace with a single space.

Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.
