6.12. Add Thousand Separators to Numbers

Problem

You want to add commas as the thousand separator to numbers with four or more digits. You want to do this both for individual numbers and for any numbers in a string or file.

For example, you’d like to convert this:

There are more than 7000000000 people in the world today.

To this:

There are more than 7,000,000,000 people in the world today.

Tip

Not all countries and written languages use the same character as the thousand separator. The solutions here use a comma, but some people use dots, underscores, apostrophes, or spaces for the same purpose. If you want, you can replace the commas in this recipe’s replacement strings with one of these other characters.

Solution

The following solutions work both for individual numbers and for all numbers in a given string. They’re designed to be used in a search-and-replace for all matches.

Basic solution

Regular expression:

[0-9](?=(?:[0-9]{3})+(?![0-9]))
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Although this regular expression works equally well with all of the flavors covered by this book, the accompanying replacement text is decidedly less portable.

Replacement:

$&,
Replacement text flavors: .NET, JavaScript, Perl
$0,
Replacement text flavors: .NET, Java, XRegExp, PHP
\0,
Replacement text flavors: PHP, Ruby
\&,
Replacement text flavor: Ruby
\g<0>,
Replacement text flavor: Python

These replacement strings all reinsert the matched text using backreference zero (the entire match, which in this case is a single digit), and follow it with a comma. When programming, you can implement this regular expression search-and-replace as explained in Recipe 3.15.
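As a minimal sketch of how the Basic solution might look in Python, the regex and the Python replacement text can be passed to re.sub. The names SEPARATOR_DIGIT and add_commas are our own, chosen for illustration.

```python
import re

# Basic solution: match each digit that is followed by an exact multiple
# of three digits, then put the digit back followed by a comma.
SEPARATOR_DIGIT = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

def add_commas(text):
    """Insert commas into every run of four or more digits in text."""
    return SEPARATOR_DIGIT.sub(r"\g<0>,", text)

print(add_commas("There are more than 7000000000 people in the world today."))
# There are more than 7,000,000,000 people in the world today.
```

Because re.sub replaces all matches by default, the same call handles a single number or every number in a longer string.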

Match separator positions only, using lookbehind

Regular expression:

(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

Replacement:

,
Replacement text flavors: .NET, Java, Perl, PHP, Python, Ruby

Recipe 3.14 explains how you can implement this basic regular expression search-and-replace when programming.

This version doesn’t work with JavaScript or Ruby 1.8, because they don’t support any type of lookbehind. This time around, however, we need only one version of the replacement text because we’re simply using a comma without any backreference as the replacement.
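The lookbehind version could be sketched in Python as follows. Python's re module accepts this fixed-length lookbehind, and the function name add_commas_at_positions is hypothetical.

```python
import re

# Zero-width match: one digit on the left and an exact multiple of three
# digits on the right. Nothing is consumed, so the replacement is a bare comma.
SEPARATOR_POSITION = re.compile(r"(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))")

def add_commas_at_positions(text):
    """Insert a comma at every thousand-separator position in text."""
    return SEPARATOR_POSITION.sub(",", text)

print(add_commas_at_positions("12345678 and 1234"))
# 12,345,678 and 1,234
```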

Discussion

Introduction

Adding thousand separators to numbers in your documents, data, and program output is a simple but effective way to improve their readability and appearance.

Some of the programming languages covered by this book provide built-in methods to add locale-aware thousand separators to numbers. For instance, in Python you can use locale.format_string('%d', 1000000, grouping=True) to convert the number 1000000 to the string '1,000,000', assuming you’ve previously set your program to use a locale that uses commas as the thousand separator. For other locales, the number might be separated using dots, underscores, apostrophes, or spaces.

However, locale-aware processing is not always available, reliable, or appropriate. In the finance world, for example, using commas as thousand separators is the norm, regardless of location. Internationalization might not be a relevant issue to begin with when working in a text editor rather than programming. For these reasons, and for simplicity, in this recipe we’ve assumed you always want to use commas as the thousand separator. In the upcoming section, we’ve also assumed you want to use dots as decimal points. If you need to use other characters, feel free to swap them in.

Caution

Although adding thousand separators to all numbers in a file or string can improve the presentation of your data, it’s important to understand what kind of content you’re dealing with before doing so. For instance, you probably don’t want to add commas to IDs, four-digit years, and ZIP codes. Documents and data that include these kinds of numbers might not be good candidates for automated comma insertion.

Basic solution

This regular expression matches any single digit that has digits on the right in exact sets of three. It therefore matches twice in the string 12345678, finding the digits 2 and 5. All the other digits are not followed by an exact multiple of three digits.

The accompanying replacement text puts back the matched digit using backreference zero (the entire match), and follows it with a comma. That leaves us with 12,345,678. Voilà!

To explain how the regex determines which digits to match, we’ll split it into two parts. The first part is the leading character class [0-9] that matches any single digit. The second part is the positive lookahead (?=(?:[0-9]{3})+(?![0-9])) that causes the match attempt to fail unless it’s at a position followed by digits in exact sets of three. In other words, the lookahead ensures that the regex matches only the digits that should be followed by a comma. Recipe 2.16 explains how lookahead works.

The (?:[0-9]{3})+ within the lookahead matches digits in sets of three. The negative lookahead (?![0-9]) that follows is there to ensure that no digits come immediately after the digits we matched in sets of three. Otherwise, the outer positive lookahead would be satisfied by any number of following digits, so long as there were at least three.
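To see the effect of the inner negative lookahead, the following Python sketch lists the digits matched in 12345678 with and without it. The variable names exact and loose are ours.

```python
import re

text = "12345678"

# Full regex: only digits followed by an exact multiple of three digits.
exact = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")
print([m.group() for m in exact.finditer(text)])
# ['2', '5']

# Same regex with the inner negative lookahead removed: any digit with
# three or more digits after it now matches.
loose = re.compile(r"[0-9](?=(?:[0-9]{3})+)")
print([m.group() for m in loose.finditer(text)])
# ['1', '2', '3', '4', '5']
```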

Match separator positions only, using lookbehind

This adaptation of the previous regex doesn’t match any digits at all. Instead, it matches only the positions where we want to insert commas within numbers. These positions are wherever there are digits on the right in exact sets of three, and at least one digit on the left.

The lookahead used to search for sets of exactly three digits on the right is the same as in the last regex. The difference here is that, instead of starting the regex with [0-9] to match a digit, we assert that there is at least one digit to the left by using the positive lookbehind (?<=[0-9]). Without the lookbehind, the regex would also match the position to the left of 123 in the string 123, and the search-and-replace would therefore convert it to ,123. Lookbehind is explained together with lookahead in Recipe 2.16.
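A short Python comparison, offered as an illustration only, shows what goes wrong when the lookbehind is dropped. The names lookahead_only and with_lookbehind are ours.

```python
import re

# Lookahead alone also matches at the very start of a number whose
# length is a multiple of three.
lookahead_only = re.compile(r"(?=(?:[0-9]{3})+(?![0-9]))")
# Adding the lookbehind requires at least one digit on the left.
with_lookbehind = re.compile(r"(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))")

print(lookahead_only.sub(",", "123"))      # ,123
print(with_lookbehind.sub(",", "123"))     # 123
print(lookahead_only.sub(",", "123456"))   # ,123,456
print(with_lookbehind.sub(",", "123456"))  # 123,456
```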

JavaScript and Ruby 1.8 don’t support lookbehind, so they cannot use this version of the regular expression.

Variations

Don’t add commas after a decimal point

The preceding regexes add commas to any sequence of four or more digits. A rather glaring issue with this basic approach is that it can add commas to digits that come after a dot as the decimal separator, so long as there are at least four digits after the dot. Following are two ways to fix this.
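The issue is easy to reproduce. This Python sketch applies the Basic solution to a number with four digits after the decimal point; the variable name basic is ours.

```python
import re

basic = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

# The integer part is handled correctly, but the fractional part
# also receives an unwanted comma.
print(basic.sub(r"\g<0>,", "12345.6789"))
# 12,345.6,789

# Three or fewer digits after the dot are left alone.
print(basic.sub(r"\g<0>,", "0.123"))
# 0.123
```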

Use infinite lookbehind

The problem is easy to solve if you’re able to use an infinite-length quantifier like + or at least a long finite-length quantifier like {1,100} within lookbehind.

Regular expression:

[0-9](?=(?:[0-9]{3})+(?![0-9]))(?<!\.[0-9]+)
Regex options: None
Regex flavors: .NET
[0-9](?=(?:[0-9]{3})+(?![0-9]))(?<!\.[0-9]{1,100})
Regex options: None
Regex flavors: .NET, Java

Replacement:

$0,
Replacement text flavors: .NET, Java

The first regex here works in .NET only because of the + in the lookbehind. The second regex works in both .NET and Java, because Java supports any finite-length quantifier inside lookbehind—even arbitrarily long interval quantifiers like {1,100}. The .NET-only version therefore works correctly with any number, whereas the Java version avoids adding commas to numbers after a decimal place only when there are 100 or fewer digits after the dot. You can bump up the second number in the {1,100} quantifier if you want to support even longer numbers to the right of a decimal separator.

With both regexes, we’ve put the new lookbehind at the end of the pattern. The regexes could be restructured to add the lookbehind at the front, as you might intuitively expect, but we’ve done it this way to optimize efficiency. Since the lookbehind is the slowest part of the regex, putting it at the end lets the regex fail more quickly at positions within the subject string where the lookbehind doesn’t need to be evaluated in order to rule out a match.

Search-and-replace within matched numbers

If you’re not working with .NET or Java and therefore can’t look as far back into the subject string as you want, you can still use fixed-length lookbehind to help match entire numbers that aren’t preceded by a dot. That lets you identify the numbers that qualify for having commas added (and correctly exclude any digits that come after a decimal point), but because it matches entire numbers, you can’t simply include a comma in the replacement string and be done with it.

Completing the solution requires using two regexes: an outer regex to match the numbers that should have commas added to them, and an inner regex that searches within the qualifying numbers as part of a search-and-replace that inserts the commas.

Outer regex:

\b(?<!\.)[0-9]{4,}
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

This matches any entire number with four or more digits that is not preceded by a dot. The word boundary at the beginning of the regex ensures that any matched numbers start at the beginning of the string or are separate from other numbers and words. Otherwise, the regex could match the 2345 from 0.12345. In other words, without the word boundary, matches could start from the second digit after a decimal point, since a dot is no longer the preceding character at that point.

The inner regex and replacement text to go with this are the same as the Basic solution.

In order to apply the inner regex’s generated replacement values to each match of the outer regex, we need to replace matches of the outer regex with values generated in code, rather than using a simple string replacement. That way we can run the inner regex within the code that generates the outer regex’s replacement value. This may sound complicated, but the programming languages covered by this book all make it fairly straightforward.

Here’s the complete solution for Ruby 1.9:

subject.gsub(/\b(?<!\.)[0-9]{4,}/) {|match|
    match.gsub(/[0-9](?=(?:[0-9]{3})+(?![0-9]))/, '\0,')
}

The subject variable in this code holds the string to commafy. Ruby’s gsub string method performs a global search-and-replace. For other programming languages, follow Recipe 3.16, which explains how to replace matches with replacements generated in code. It includes examples that show this technique in action for each language.
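For comparison, the same two-regex approach might look like this in Python, where re.sub accepts a function that generates each replacement. This is a sketch rather than the book's own listing, and the names OUTER, INNER, and commafy are ours.

```python
import re

# Outer regex: whole numbers of four or more digits that start at a word
# boundary and are not preceded by a dot.
OUTER = re.compile(r"\b(?<!\.)[0-9]{4,}")
# Inner regex: the Basic solution, applied only inside each outer match.
INNER = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

def commafy(subject):
    """Add commas to qualifying numbers, skipping digits after a dot."""
    return OUTER.sub(lambda m: INNER.sub(r"\g<0>,", m.group()), subject)

print(commafy("1234567.1234567 and 0.12345"))
# 1,234,567.1234567 and 0.12345
```

The lambda receives each match of the outer regex and returns the result of running the inner search-and-replace on just that matched number.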

The lack of lookbehind support in JavaScript and Ruby 1.8 prevents this solution from being fully portable, since we used lookbehind in the outer regex. We can work around this in JavaScript and Ruby 1.8 by including the character, if any, that precedes a number as part of the match, and requiring that it be something other than a digit or dot. We can then put the nondigit/nondot character back using a backreference in the generated replacement text.

Here’s the JavaScript code to pull this off:

subject.replace(/(^|[^0-9.])([0-9]{4,})/g, function($0, $1, $2) {
    return $1 + $2.replace(/[0-9](?=(?:[0-9]{3})+(?![0-9]))/g, "$&,");
});

See Also

Recipe 6.11 explains how to match numbers that already include commas within them.

All the other recipes in this chapter show more ways of matching different kinds of numbers with a regular expression.
