EXPLORATION 54

Locales and Facets

As you saw in Exploration 18, C++ offers a complicated system to support internationalization and localization of your code. Even if you don’t intend to ship translations of your program in a multitude of languages, you must understand the locale mechanism that C++ uses. Indeed, you have been using it all along, because C++ always sends formatted I/O through the locale system. This Exploration will help you understand locales better and make more effective use of them in your programs.

The Problem

The story of the Tower of Babel resonates with programmers. Imagine a world that speaks a single language and uses a single alphabet. How much simpler programming would be if we didn’t have to deal with character-set issues, language rules, or locales.

The real world has many languages, numerous alphabets and syllabaries, and multitudinous character sets, all making life far richer and more interesting and making a programmer’s job more difficult. Somehow, we programmers must cope. It isn’t easy, and this Exploration cannot give you all the answers, but it’s a start.

Different cultures, languages, and character sets give rise to different methods to present and interpret information, different interpretations of character codes (as you learned in Exploration 17), and different ways of organizing (especially sorting) information. Even with numeric data, you may find that you have to write the same number in several ways, depending on the local environment, culture, and language. Table 54-1 presents just a few examples of the ways to write a number according to various cultures, conventions, and locales.

Table 54-1. Various Ways to Write a Number

Number               Culture
123456.7890          Default C++
123,456.7890         United States
123 456.7890         International scientific
Rs. 1,23,456.7890    Indian currency*
123.456,7890         Germany

*Yes, the commas are correct.

Other cultural differences can include:

  • 12-hour vs. 24-hour clock
  • How accented characters are sorted relative to non-accented characters (does 'a' come before or after 'á'?)
  • Date formats: month/day/year, day/month/year, or year-month-day
  • Formatting of currency (¥123,456 or 99¢)

Somehow, the poor application programmer must figure out exactly what is culturally dependent, collect the information for all the possible cultures where the application might run, and use that information appropriately in the application. Fortunately, the hard work has already been done for you and is part of the C++ standard library.

Locales to the Rescue

C++ uses a system called locales to manage this disparity of styles. Exploration 18 introduced locales as a means to organize character sets and their properties. Locales also organize formatting of numbers, currency, dates, and times (plus some more stuff that I won’t get into).

C++ defines a basic locale, known as the classic locale, which provides minimal formatting. Each C++ implementation is then free to provide additional locales. Each locale typically has a name, but the C++ standard does not mandate any particular naming convention, which makes it difficult to write portable code. You can rely on only two standard names:

  • The classic locale is named "C". The classic locale specifies the same basic formatting information for all implementations. When a program starts, the classic locale is the initial locale.
  • An empty string ("") means the default, or native, locale. The default locale obtains formatting and other information from the host operating system in a manner that depends on what the OS can offer. With traditional desktop operating systems, you can assume that the default locale specifies the user’s preferred formatting rules and character-set information. With other environments, such as embedded systems, the default locale may be identical to the classic locale.

A number of C++ implementations use ISO and POSIX standards for naming locales: an ISO 639 code for the language (e.g., fr for French, en for English, ko for Korean), optionally followed by an underscore and an ISO 3166 code for the region (e.g., CH for Switzerland, GB for Great Britain, HK for Hong Kong). The name is optionally followed by a dot and the name of the character set (e.g., utf8 for Unicode UTF-8, Big5 for Chinese Big 5 encoding). Thus, I use en_US.utf8 for my default locale. A native of Taiwan might use zh_TW.Big5; developers in French-speaking Switzerland might use fr_CH.latin1. Read your library documentation to learn how it specifies locale names. What is your default locale? ________________ What are its main characteristics?

_____________________________________________________________

_____________________________________________________________

_____________________________________________________________
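One way to explore which names your library accepts is to try constructing each locale and see whether the constructor throws. The following is a minimal sketch; locale_available and report_locales are hypothetical helper names, and the candidate names are only examples that may or may not exist on your system.

```cpp
#include <initializer_list>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

/// Return true if the library can construct a locale with the given name.
/// (locale_available is a hypothetical helper, not a standard function.)
bool locale_available(std::string const& name)
{
  try {
    std::locale loc{name};
    return true;
  }
  catch (std::runtime_error const&) {
    // The std::locale constructor throws std::runtime_error for unknown names.
    return false;
  }
}

/// Print which of a few candidate names this implementation supports.
void report_locales()
{
  for (std::string name : {"C", "", "en_US.utf8", "fr_CH.latin1", "zh_TW.Big5"})
    std::cout << '"' << name << "\": "
              << (locale_available(name) ? "available" : "not available") << '\n';
}
```

Call report_locales() from main to see the results for your environment. To see the name of your default locale, you can also print std::locale{""}.name().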

Every C++ application has a global locale object. Unless you explicitly change a stream’s locale, it starts off with the global locale. (If you later change the global locale, that does not affect streams that already exist, such as the standard I/O streams.) Initially, the global locale is the classic locale. The classic locale is the same everywhere (except for the parts that depend on the character set), so a program has maximum portability with the classic locale. On the other hand, it has minimum local flavor. The next section explores how you can change a stream’s locale.
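The following sketch illustrates the point about existing streams. To stay portable, it does not rely on any named locale; instead, it builds a locale from the classic locale plus a small custom facet. The names comma_decimal and global_locale_demo are hypothetical, invented for this illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <utility>

/// A hypothetical facet that only changes the decimal point to a comma.
class comma_decimal : public std::numpunct<char>
{
protected:
  char do_decimal_point() const override { return ','; }
};

/// Format 1.5 with a stream created before, and one created after,
/// the global locale changes.
std::pair<std::string, std::string> global_locale_demo()
{
  // Start from a known state and remember the caller's global locale.
  std::locale const saved{std::locale::global(std::locale::classic())};

  std::ostringstream before{};   // picks up the classic locale
  // The locale takes ownership of the facet, so no delete is needed.
  std::locale::global(std::locale{std::locale::classic(), new comma_decimal});
  std::ostringstream after{};    // picks up the new global locale

  before << 1.5;
  after << 1.5;

  std::locale::global(saved);    // restore the caller's global locale
  return {before.str(), after.str()};
}
```

The first string of the pair is "1.5" and the second is "1,5": the stream that already existed keeps the locale it started with.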

Locales and I/O

Recall from Exploration 18 that you imbue a stream with a locale in order to format I/O according to the locale’s rules. Thus, to ensure that you read input in the classic locale, and that you print results in the user’s native locale, you need the following:

std::cin.imbue(std::locale::classic()); // standard input uses the classic locale
std::cout.imbue(std::locale{""});       // imbue with the user's default locale

The standard I/O streams initially use the classic locale. You can imbue a stream with a new locale at any time, but it makes the most sense to do so before performing any I/O.

Typically, you would use the classic locale when reading from, or writing to, files. You usually want the contents of files to be portable and not dependent on a user’s OS preferences. For ephemeral output to a console or GUI window, you may want to use the default locale, so the user can be most comfortable reading and understanding it. On the other hand, if there is any chance that another program might try to read your program’s output (as happens with UNIX pipes and filters), you should stick with the classic locale, in order to ensure portability and a common format. If you are preparing output to be displayed in a GUI, by all means, use the default locale.
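As a sketch of the advice about portable files, the following helpers always format and parse numbers in the classic locale, no matter what the global locale happens to be. The function names to_portable_text and from_portable_text are hypothetical.

```cpp
#include <locale>
#include <sstream>
#include <string>

/// Format a number for a file or pipe: always the classic locale.
std::string to_portable_text(double value)
{
  std::ostringstream out{};
  out.imbue(std::locale::classic());
  out << value;
  return out.str();
}

/// Parse text written by to_portable_text. Returns false on bad input.
bool from_portable_text(std::string const& text, double& value)
{
  std::istringstream in{text};
  in.imbue(std::locale::classic());
  return static_cast<bool>(in >> value);
}
```

For example, to_portable_text(1234.5) yields "1234.5", and from_portable_text reads it back unchanged, regardless of the user's preferences.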

Facets

The way a stream interprets numeric input and formats numeric output is by making requests of the imbued locale. A locale object is a collection of pieces, each of which manages a small aspect of internationalization. For example, one piece, called numpunct, provides the punctuation symbols for numeric formatting, such as the decimal point character (which is '.' in the United States, but ',' in France). Another piece, num_get, reads from a stream and parses the text to form a number, using information it obtains from numpunct. The pieces such as num_get and numpunct are called facets.
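To see numpunct at work without depending on the locale names your system offers, you can derive your own facet and install it in a locale. This is only a sketch: demo_punct and format_number are hypothetical names, and a real program would normally obtain these rules from a named or native locale instead.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>
#include <utility>

/// A hypothetical numpunct facet with caller-supplied punctuation.
class demo_punct : public std::numpunct<char>
{
public:
  demo_punct(char decimal, char separator, std::string grouping)
  : decimal_{decimal}, separator_{separator}, grouping_{std::move(grouping)}
  {}
protected:
  char do_decimal_point() const override { return decimal_; }
  char do_thousands_sep() const override { return separator_; }
  std::string do_grouping() const override { return grouping_; }
private:
  char decimal_;
  char separator_;
  std::string grouping_;
};

/// Format value with four decimal places under the given punctuation rules.
std::string format_number(double value, char decimal, char separator,
                          std::string const& grouping)
{
  // The locale takes ownership of the facet.
  std::locale loc{std::locale::classic(),
                  new demo_punct{decimal, separator, grouping}};
  std::ostringstream out{};
  out.imbue(loc);
  out << std::fixed << std::setprecision(4) << value;
  return out.str();
}
```

The grouping string lists group sizes from right to left, and the last size repeats. Thus format_number(123456.789, '.', ',', "\3") produces 123,456.7890, swapping the two punctuation characters produces 123.456,7890, and the grouping "\3\2" produces 1,23,456.7890, which matches the styles in Table 54-1.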

For ordinary numeric I/O, you never have to deal with facets. The I/O streams automatically manage these details for you: the operator<< function uses the num_put facet to format numbers for output, and operator>> uses num_get to interpret text as numeric input. For currency, dates, and times, I/O manipulators use facets to format values. But sometimes you need to use facets yourself. The isalpha, toupper, and other character-related functions about which you learned in Exploration 18 use the ctype facet. Any program that has to do a lot of character testing and converting can benefit by managing its facets directly.

Like strings and I/O streams, facets are class templates, parameterized on the character type. So far, the only character type you have used is char; you will learn about other character types in Exploration 55. The principles are the same, regardless of character type (which is why facets use templates).

To obtain a facet from a locale, call the use_facet function template. The template argument is the facet you seek, and the function argument is the locale object. The returned facet is const and is not copyable, so the best way to use the result is to initialize a const reference, as demonstrated in the following:

std::locale native{""};  // keep the locale alive as long as you use the facet reference
std::money_get<char> const&
    mgetter{ std::use_facet<std::money_get<char>>(native) };

Reading from the inside outward, the object named mgetter is initialized to the result of calling the use_facet function, which is requesting a reference to the money_get<char> facet. The default locale is passed as the sole argument to the use_facet function. The type of mgetter is a reference to a const money_get<char> facet. It’s a little daunting to read at first, but you’ll get used to it—eventually.
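The same pattern works for any facet. As a small sketch, the following function asks a locale's numpunct facet for its punctuation; describe_punctuation is a hypothetical name.

```cpp
#include <locale>
#include <string>

/// Summarize a locale's numeric punctuation as decimal point, thousands
/// separator, and the number of entries in its grouping rule.
std::string describe_punctuation(std::locale const& loc)
{
  // Bind the facet to a const reference; facets cannot be copied.
  std::numpunct<char> const& punct{std::use_facet<std::numpunct<char>>(loc)};

  std::string result{};
  result += punct.decimal_point();
  result += punct.thousands_sep();
  result += std::to_string(punct.grouping().size());
  return result;
}
```

For the classic locale, the result is ".,0": the decimal point is '.', the thousands separator is ',', and the grouping rule is empty, so the separator is never actually used. If you are unsure whether a locale contains a particular facet, std::has_facet takes the same template argument and function argument and returns a bool.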

Once you have a facet, call its member functions to use it. This section introduces the currency facets as an example. A complete library reference tells you about all the facets and their member functions.

The money_get facet has two overloaded functions named get. The get function reads a currency value from a sequence of characters specified by an iterator range. It checks the currency symbol, thousands separator, thousands grouping, and decimal point. It extracts the numeric value and stores the value in a long double (overloaded form 1) or as a string of digit characters (overloaded form 2). If you choose to use long double, take care that you do not run into rounding errors. The get function assumes that the input originates in an input stream, and you must pass a stream object as one of its arguments. It checks the stream’s flags to see whether a currency symbol is required (showbase flag is set). And, finally, it sets error flags as needed: failbit for input formatting errors and eofbit for end of file, as follows:

std::string digits{};
std::ios_base::iostate error{};
bool international{true};
mgetter.get(std::istreambuf_iterator<char>(stream),  std::istreambuf_iterator<char>(),
   international, stream, error, digits);

Similarly, the money_put facet provides the overloaded put function, which formats a long double or digit string, according to the locale’s currency formatting rules, and writes the formatted value to an output iterator. If the stream’s showbase flag is set, money_put prints the currency symbol. The locale’s rules specify the position and formatting of the symbol (in the moneypunct facet). The money facets can use local currency rules or international standards. A bool argument to the get and put functions specifies the choice: true for international and false for local. The putter also requires a fill character, as shown in the following example:

// The stream holds a copy of its locale, which keeps the facet alive.
std::money_put<char> const&
    mputter{ std::use_facet<std::money_put<char>>(stream.getloc()) };
mputter.put(std::ostreambuf_iterator<char>(stream), international, stream, '*', digits);
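The fragments above leave out the stream and the locale. Here is a self-contained sketch that puts the pieces together. Because locale names are not portable, it installs a custom moneypunct facet with fixed, US-style rules; demo_moneypunct, demo_locale, parse_money, and format_money are all hypothetical names. The sketch uses the local (not international) rules.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

/// A hypothetical moneypunct facet with fixed, US-style rules.
class demo_moneypunct : public std::moneypunct<char, false>
{
protected:
  char do_decimal_point() const override { return '.'; }
  char do_thousands_sep() const override { return ','; }
  std::string do_grouping() const override { return "\3"; }
  std::string do_curr_symbol() const override { return "$"; }
  std::string do_positive_sign() const override { return ""; }
  std::string do_negative_sign() const override { return "-"; }
  int do_frac_digits() const override { return 2; }
  pattern do_pos_format() const override { return layout(); }
  pattern do_neg_format() const override { return layout(); }
private:
  // Symbol, then sign, then value; "none" fills the unused fourth slot.
  static pattern layout() { return pattern{{symbol, sign, value, none}}; }
};

/// The classic locale plus the custom currency rules.
std::locale demo_locale()
{
  return std::locale{std::locale::classic(), new demo_moneypunct};
}

/// Parse formatted money text into a digit string, or "" on failure.
std::string parse_money(std::string const& text)
{
  std::istringstream in{text};
  in.imbue(demo_locale());  // get() finds moneypunct in the stream's locale
  std::money_get<char> const& getter{
    std::use_facet<std::money_get<char>>(in.getloc())};

  std::string digits{};
  std::ios_base::iostate error{std::ios_base::goodbit};
  getter.get(std::istreambuf_iterator<char>{in},
             std::istreambuf_iterator<char>{},
             false, in, error, digits);
  return (error & std::ios_base::failbit) != 0 ? std::string{} : digits;
}

/// Format a digit string as money, always showing the currency symbol.
std::string format_money(std::string const& digits)
{
  std::ostringstream out{};
  out.imbue(demo_locale());
  out.setf(std::ios_base::showbase);
  std::money_put<char> const& putter{
    std::use_facet<std::money_put<char>>(out.getloc())};
  putter.put(std::ostreambuf_iterator<char>{out}, false, out, ' ', digits);
  return out.str();
}
```

With two fractional digits, the digit string "123456" stands for 1234.56, so parse_money("$1,234.56") yields "123456", and format_money("123456") yields "$1,234.56".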

As you can see, using facets directly can be a little complicated. Fortunately, the standard library offers a few I/O manipulators (declared in <iomanip>) to simplify the use of the time and currency facets. Listing 54-1 shows a simple program that imbues the standard I/O streams with the default locale and then reads and writes currency values.

Listing 54-1.  Reading and Writing Currency Using the Money I/O Manipulators

#include <iomanip>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
 
int main()
{
  std::locale native{""};
  std::cin.imbue(native);
  std::cout.imbue(native);
 
  std::cin >> std::noshowbase;  // currency symbol is optional for input
  std::cout << std::showbase;   // always write the currency symbol for output
 
  std::string digits;
  while (std::cin >> std::get_money(digits))
  {
    std::cout << std::put_money(digits) << '\n';
  }
  if (not std::cin.eof())
    std::cout << "Invalid input.\n";
}

The locale manipulators work like other manipulators, but they invoke the associated facets. The manipulators use the stream to take care of the error flags, iterators, fill character, etc. The get_time and put_time manipulators read and write dates and times; consult a library reference for details.
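As a brief sketch of the time manipulators, the following function reads a date in year-month-day form and writes it in day/month/year form. It uses the classic locale and purely numeric conversion specifiers, so the result does not depend on the environment. The function name iso_to_day_month_year is hypothetical.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

/// Reformat an ISO date (year-month-day) as day/month/year.
/// Returns an empty string if the input does not parse.
std::string iso_to_day_month_year(std::string const& iso_date)
{
  std::tm when{};
  std::istringstream in{iso_date};
  in.imbue(std::locale::classic());
  // get_time fills in the std::tm fields named by the format string.
  if (not (in >> std::get_time(&when, "%Y-%m-%d")))
    return std::string{};

  std::ostringstream out{};
  out.imbue(std::locale::classic());
  out << std::put_time(&when, "%d/%m/%Y");
  return out.str();
}
```

For example, iso_to_day_month_year("2024-03-15") returns "15/03/2024". Conversion specifiers that print names, such as %A for the weekday or %B for the month, are the ones that vary with the imbued locale.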

Character Categories

This section continues the examination of character sets and locales that you began in Exploration 18. In addition to testing for alphanumeric characters or lowercase characters, you can test for several different categories. Table 54-2 lists all the classification functions and their behavior in the classic locale. They all take a character as the first argument and a locale as the second; they all return a bool result.

Table 54-2. Character Classification Functions

Function   Description         Classic Locale
isalnum    Alphanumeric        'a'–'z', 'A'–'Z', '0'–'9'
isalpha    Alphabetic          'a'–'z', 'A'–'Z'
iscntrl    Control             Any non-printable character*
isdigit    Digit               '0'–'9' (in all locales)
isgraph    Graphical           Printable character other than ' '*
islower    Lowercase           'a'–'z'
isprint    Printable           Any printable character in the character set*
ispunct    Punctuation         Printable character other than alphanumeric or white space*
isspace    White space         ' ', '\f', '\n', '\r', '\t', '\v'
isupper    Uppercase           'A'–'Z'
isxdigit   Hexadecimal digit   'a'–'f', 'A'–'F', '0'–'9' (in all locales)

*Behavior depends on the character set, even in the classic locale.

The classic locale has fixed definitions for some categories (such as isupper). Other locales, however, can expand these definitions to include other characters, which may (and probably will) depend on the character set too. Only isdigit and isxdigit have fixed definitions for all locales and all character sets.

However, even in the classic locale, the precise implementation of some functions, such as isprint, depends on the character set. For example, in the popular ISO 8859-1 (Latin-1) character set, '\x80' is a control character, but in the equally popular Windows-1252 character set, it is printable. In UTF-8, '\x80' is invalid, so all the categorization functions would return false.
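The following sketch shows the two-argument classification functions in use. It sticks to ASCII text and the classic locale, where the results are the same for every implementation; count_alnum and all_digits are hypothetical helper names.

```cpp
#include <cstddef>
#include <locale>
#include <string>

/// Count the alphanumeric characters in text, according to loc.
std::size_t count_alnum(std::string const& text, std::locale const& loc)
{
  std::size_t count{0};
  for (char c : text)
    if (std::isalnum(c, loc))
      ++count;
  return count;
}

/// Return true if text is nonempty and every character is a decimal digit.
bool all_digits(std::string const& text, std::locale const& loc)
{
  if (text.empty())
    return false;
  for (char c : text)
    if (not std::isdigit(c, loc))
      return false;
  return true;
}
```

In the classic locale, count_alnum("Hello, World 42!", std::locale::classic()) returns 12, because the comma, spaces, and exclamation point are not alphanumeric.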

The interaction between the locale and the character set is one of the areas where C++ underperforms. The locale can change at any time, which potentially sets a new character set, which in turn can give new meaning to certain character values. But the compiler’s view of the runtime character set is fixed. For instance, the compiler treats 'A' as the uppercase Roman letter A and compiles the numeric code according to its idea of the runtime character set. That numeric value is then fixed forever. If the categorization functions use the same character set, everything is fine. The isalpha and isupper functions return true; isdigit returns false; and all is right with the world. If the user changes the locale and by so doing changes the character set, those functions may not work with that character variable any more.

Let’s consider a concrete example as shown in Listing 54-2. This program encodes locale names, which may not work for your environment. Read the comments and see if your environment can support the same kind of locales, albeit with different names. You will need the ioflags class from Listing 39-4. Copy the class to its own header called ioflags.hpp or download the file from the book’s web site. After reading Listing 54-2, what do you expect as the result?

_____________________________________________________________

_____________________________________________________________

Listing 54-2.  Exploring Character Sets and Locales

#include <iomanip>
#include <iostream>
#include <locale>
#include <ostream>
#include <string>
 
#include "ioflags.hpp"  // from Listing 39-4
 
/// Print a character's categorization in a locale.
void print(int c, std::string const& name, std::locale loc)
{
  // Don't concern yourself with the & operator. I'll cover that later
  // in the book, in Exploration 63. Its purpose is just to ensure
  // the character's escape code is printed correctly.
  std::cout << "\\x" << std::setw(2) << (c & 0xff) <<
               " is " << name << " in " << loc.name() << '\n';
}
 
/// Test a character's categorization in the locale, @p loc.
void test(char c, std::locale loc)
{
  ioflags save{std::cout};
  std::cout << std::hex << std::setfill('0');
  if (std::isalnum(c, loc))
    print(c, "alphanumeric", loc);
  else if (std::iscntrl(c, loc))
    print(c, "control", loc);
  else if (std::ispunct(c, loc))
    print(c, "punctuation", loc);
  else
    print(c, "none of the above", loc);
}
 
int main()
{
  // Test the same code point in different locales and character sets.
  char c{'\xd7'};
 
  // ISO 8859-1 is also called Latin-1 and is widely used in Western Europe
  // and the Americas. It is often the default character set in these regions.
  // The country and language are unimportant for this test.
  // Choose any that support the ISO 8859-1 character set.
  test(c, std::locale{"en_US.iso88591"});
 
  // ISO 8859-5 is Cyrillic. It is often the default character set in Russia
  // and some Eastern European countries. Choose any language and region that
  // support the ISO 8859-5 character set.
  test(c, std::locale{"ru_RU.iso88595"});
 
  // ISO 8859-7 is Greek. Choose any language and region that
  // support the ISO 8859-7 character set.
  test(c, std::locale{"el_GR.iso88597"});
 
  // ISO 8859-8 contains some Hebrew. The character set is no longer widely used.
  // Choose any language and region that support the ISO 8859-8 character set.
  test(c, std::locale{"he_IL.iso88598"});
}

What do you get as the actual response?

_____________________________________________________________

_____________________________________________________________

_____________________________________________________________

_____________________________________________________________

In case you had trouble identifying locale names or other problems running the program, Listing 54-3 shows the result when I run it on my system.

Listing 54-3.  Result of Running the Program in Listing 54-2

\xd7 is punctuation in en_US.iso88591
\xd7 is alphanumeric in ru_RU.iso88595
\xd7 is alphanumeric in el_GR.iso88597
\xd7 is none of the above in he_IL.iso88598

As you can see, the same character has different categories, depending on the locale’s character set. Now imagine that the user has entered a string, and your program has stored the string. If your program changes the global locale or the locale used to process that string, you may end up misinterpreting the string.

In Listing 54-2, the categorization functions reload their facets every time they are called, but you can rewrite the program so it loads its facet only once. The character type facet is called ctype. It has a function named is that takes a category mask and a character as arguments and returns a bool: true if the character has a type in the mask. The mask values are specified in std::ctype_base.

Note: Notice the convention that the standard library uses throughout. When a class template needs helper types and constants, they are declared in a non-template base class. The class template derives from the base class and so gains easy access to the types and constants. Callers gain access to the types and constants by qualifying with the base class name. By avoiding the template in the base class, the standard library avoids unnecessary instantiations just to use a type or constant that is unrelated to the template argument.

The mask names are the same as the categorization functions, but without the leading is. Listing 54-4 shows how to rewrite the simple character-set demonstration to use a single cached ctype facet.

Listing 54-4.  Caching the ctype Facet

#include <iomanip>
#include <iostream>
#include <locale>
#include <string>
 
#include "ioflags.hpp"  // from Listing 39-4
 
void print(int c, std::string const& name, std::locale loc)
{
  // Don't concern yourself with the & operator. I'll cover that later
  // in the book. Its purpose is just to ensure the character's escape
  // code is printed correctly.
  std::cout << "\\x" << std::setw(2) << (c & 0xff) <<
               " is " << name << " in " << loc.name() << '\n';
}
 
/// Test a character's categorization in the locale, @p loc.
void test(char c, std::locale loc)
{
  ioflags save{std::cout};
 
  std::ctype<char> const& ctype{std::use_facet<std::ctype<char>>(loc)};
 
  std::cout << std::hex << std::setfill('0');
  if (ctype.is(std::ctype_base::alnum, c))
    print(c, "alphanumeric", loc);
  else if (ctype.is(std::ctype_base::cntrl, c))
    print(c, "control", loc);
  else if (ctype.is(std::ctype_base::punct, c))
    print(c, "punctuation", loc);
  else
    print(c, "none of the above", loc);
}
 
int main()
{
  // Test the same code point in different locales and character sets.
  char c{'\xd7'};
 
  // ISO 8859-1 is also called Latin-1 and is widely used in Western Europe
  // and the Americas. It is often the default character set in these regions.
  // The country and language are unimportant for this test.
  // Choose any that support the ISO 8859-1 character set.
  test(c, std::locale{"en_US.iso88591"});
 
  // ISO 8859-5 is Cyrillic. It is often the default character set in Russia
  // and some Eastern European countries. Choose any language and region that
  // support the ISO 8859-5 character set.
  test(c, std::locale{"ru_RU.iso88595"});
 
  // ISO 8859-7 is Greek. Choose any language and region that
  // support the ISO 8859-7 character set.
  test(c, std::locale{"el_GR.iso88597"});
 
  // ISO 8859-8 contains some Hebrew. It is no longer widely used.
  // Choose any language and region that support the ISO 8859-8 character set.
  test(c, std::locale{"he_IL.iso88598"});
}
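Because the masks are bitmask values, you can also combine categories and test for several at once with a single call to is. The following is a minimal sketch; count_separators is a hypothetical name.

```cpp
#include <cstddef>
#include <locale>
#include <string>

/// Count characters that are white space or punctuation in loc.
std::size_t count_separators(std::string const& text, std::locale const& loc)
{
  std::ctype<char> const& ctype{std::use_facet<std::ctype<char>>(loc)};

  // Combine two categories; is() returns true if the character is in either.
  std::ctype_base::mask const separators{
    static_cast<std::ctype_base::mask>(
      std::ctype_base::space | std::ctype_base::punct)};

  std::size_t count{0};
  for (char c : text)
    if (ctype.is(separators, c))
      ++count;
  return count;
}
```

In the classic locale, count_separators("a, b; c", std::locale::classic()) returns 4: one comma, one semicolon, and two spaces.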

The ctype facet also performs case conversions with the toupper and tolower member functions, which take a single character argument and return a character result. Recall the word-counting problem from Exploration 22. Rewrite your solution (see Listings 22-2 and 22-3) and change the code to use cached facets. Compare your program with Listing 54-5.

Listing 54-5.  Counting Words Again, This Time with Cached Facets

// Copy the initial portion of Listing 22-2 here, including print_counts,
// but stopping just before sanitize.
 
/** Base class to hold a ctype facet. */
class function
{
public:
  function(std::locale loc) : ctype_{std::use_facet<std::ctype<char>>(loc)} {}
  bool isalnum(char ch) const { return ctype_.is(std::ctype_base::alnum, ch); }
  char tolower(char ch) const { return ctype_.tolower(ch); }
private:
  std::ctype<char> const& ctype_;
};
 
/** Sanitize a string by keeping only alphanumeric characters
 * and converting them to lowercase.
 * @param str the original string
 * @return a sanitized copy of the string
 */
class sanitizer : public function
{
public:
  typedef std::string argument_type;
  typedef std::string result_type;
  sanitizer(std::locale loc) : function{loc} {}
  std::string operator()(std::string const& str)
  {
    std::string result{};
    for (char c : str)
      if (isalnum(c))
        result.push_back(tolower(c));
    return result;
  }
};
 
/** Main program to count unique words in the standard input. */
int main()
{
  // Set the global locale to the native locale.
  std::locale::global(std::locale{""});
  initialize_streams();
 
  count_map counts{};
 
  // Read words from the standard input and count the number of times
  // each word occurs.
  std::string word{};
  sanitizer sanitize{std::locale{}};
  while (std::cin >> word)
  {
    std::string copy{sanitize(word)};
 
    // The "word" might be all punctuation, so the copy would be empty.
    // Don't count empty strings.
    if (not copy.empty())
      ++counts[copy];
  }
 
  print_counts(counts);
}

Notice how most of the program is unchanged. The simple act of caching the ctype facet reduces this program’s runtime by about 15 percent on my system.
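The ctype facet also has overloads of toupper and tolower that convert an entire range of characters in place, which avoids a function call per character. The following sketch uses the range form of tolower; to_lower_copy is a hypothetical name and is not part of Listing 54-5.

```cpp
#include <locale>
#include <string>

/// Return a lowercase copy of text, converting the whole buffer in one call.
std::string to_lower_copy(std::string text, std::locale const& loc)
{
  std::ctype<char> const& ctype{std::use_facet<std::ctype<char>>(loc)};
  // The range overload takes pointers to the first and one-past-the-last
  // characters and modifies the characters in place.
  if (not text.empty())
    ctype.tolower(&text[0], &text[0] + text.size());
  return text;
}
```

For example, to_lower_copy("Hello, World 42!", std::locale::classic()) returns "hello, world 42!".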

Collation Order

You can use the relational operators (such as <) with characters and strings, but they don’t actually compare characters or code points; they compare storage units. Most users don’t care whether a list of names is sorted in ascending numerical order by storage unit. They want a list of names sorted in ascending alphabetical order, according to their native collation rules.

For example, which comes first: ångstrom or angle? The answer depends on where you live and what language you speak. In Scandinavia, angle comes first, and ångstrom follows zebra. The collate facet compares strings according to the locale’s rules. Its compare function is somewhat clumsy to use, so the locale class provides a simple interface for determining whether one string is less than another in a locale: use the locale’s function call operator. In other words, you can use a locale object itself as the comparison functor for standard algorithms, such as sort. Listing 54-6 shows a program that demonstrates how collation order depends on locale. In order to get the program to run in your environment, you may have to change the locale names.

Listing 54-6.  Demonstrating How Collation Order Depends on Locale

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
 
void sort_words(std::vector<std::string> words, std::locale loc)
{
  std::sort(words.begin(), words.end(), loc);
  std::cout << loc.name() << ":\n";
  std::copy(words.begin(), words.end(),
            std::ostream_iterator<std::string>(std::cout, "\n"));
}
 
int main()
{
  using namespace std;
  vector<string> words{
    "circus",
    "\u00e5ngstrom",     // ångstrom
    "\u00e7irc\u00ea",   // çircê
    "angle",
    "essen",
    "ether",
    "\u00e6ther",        // æther
    "aether",
    "e\u00dfen"          // eßen
  };
  sort_words(words, locale::classic());
  sort_words(words, locale{"en_GB.utf8"});  // Great Britain
  sort_words(words, locale{"no_NO.utf8"});  // Norway
}

The \uNNNN characters are a portable way to express Unicode characters. The NNNN must be four hexadecimal digits, specifying a Unicode code point. You will learn more in the next Exploration.

The call to std::sort in sort_words shows how the locale object is used as a comparison functor to sort the words. Table 54-3 lists the results I get for each locale. Depending on your native character set, you may get different results.

Table 54-3. Collation Order for Each Locale

Classic     Great Britain   Norway
ångstrom    aether          aether
æther       æther           angle
çircê       angle           çircê
aether      ångstrom        circus
angle       çircê           essen
circus      circus          eßen
eßen        essen           ether
essen       eßen            æther
ether       ether           ångstrom
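For comparison, here is a sketch that calls the collate facet's compare function directly and also sorts with the locale as the functor. It uses only ASCII words and the classic locale, so it should behave the same everywhere; collate_compare and sorted_copy are hypothetical names.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

/// Compare two strings with the collate facet: negative, zero, or positive.
int collate_compare(std::string const& a, std::string const& b,
                    std::locale const& loc)
{
  std::collate<char> const& coll{std::use_facet<std::collate<char>>(loc)};
  // compare takes two pointer ranges, which is why it is clumsy to call.
  return coll.compare(a.data(), a.data() + a.size(),
                      b.data(), b.data() + b.size());
}

/// Return a copy of words sorted by the locale's collation rules.
std::vector<std::string> sorted_copy(std::vector<std::string> words,
                                     std::locale const& loc)
{
  std::sort(words.begin(), words.end(), loc);  // calls loc(a, b)
  return words;
}
```

In the classic locale, uppercase letters sort before lowercase letters, so sorting pear, Zebra, and apple yields Zebra, apple, pear. A locale with dictionary-style collation would most likely put Zebra last.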

The next Exploration takes a closer look at international character sets and related difficulties.
