Exploration 18: Character Categories

Exploration 17 introduced and discussed characters. This Exploration continues that discussion with character classification (e.g., upper or lowercase, digit or letter), which, as you will see, turns out to be more complicated than you might have expected.

Character Sets

As you learned in Exploration 17, the numeric value of a character, such as 'A', depends on the character set. The compiler must decide which character set to use at compile time and at runtime. This is typically based on preferences that the end user selects in the host operating system.

Character-set issues rarely arise for the basic subset of characters—such as letters, digits, and punctuation symbols—that are used to write C++ source code. Although it is conceivable that you could compile a program using ISO 8859-1 and run that program using EBCDIC, you would have to work pretty hard to arrange such a feat. Most likely, you will find yourself using one or more character sets that share some common characteristics. For example, all ISO 8859 character sets use the same numeric values for the letters of the Roman alphabet, digits, and basic punctuation. Even most Asian character sets preserve the values of these basic characters.

Thus, most programmers blithely ignore the character-set issue. We use character literals, such as '%', and assume the program will function the way we expect it to, on any system, anywhere in the world. We are usually right, but not always.

Assuming the basic characters are always available in a portable manner, we can modify the word-counting program to treat only letters as characters that make up a word. The program would no longer count right and right? as two distinct words. The string type offers several member functions that can help us search in strings, extract substrings, and so on.

For example, you can build a string that contains only the letters and any other characters that you want to consider to be part of a word (such as '-'). After reading each word from the input stream, make a copy of the word but keep only the characters that are in the string of acceptable characters. Use the find member function to try to find each character; find returns the zero-based index of the character, if found, or std::string::npos, if not found.

Using the find function, rewrite Listing 15-3 to clean up the word string prior to inserting it in the map. Test the program with a variety of input samples. How well does it work? Compare your program with Listing 18-1.

Listing 18-1. Counting Words: Restricting Words to Letters and Letter-Like Characters

#include <iomanip>
#include <iostream>
#include <map>
#include <string>
 
int main()
{
  typedef std::map<std::string, int>    count_map;
  typedef std::string::size_type        str_size;
 
  count_map counts{};
 
  // Read words from the standard input and count the number of times
  // each word occurs.
  std::string okay{"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   "abcdefghijklmnopqrstuvwxyz"
                   "0123456789-_"};
  std::string word{};
  while (std::cin >> word)
  {
    // Make a copy of word, keeping only the characters that appear in okay.
    std::string copy{};
    for (char ch : word)
      if (okay.find(ch) != std::string::npos)
        copy.push_back(ch);
    // The "word" might be all punctuation, so the copy would be empty.
    // Don't count empty strings.
    if (not copy.empty())
      ++counts[copy];
  }
 
  // Determine the longest word.
  str_size longest{0};
  for (auto pair : counts)
    if (pair.first.size() > longest)
      longest = pair.first.size();
   
  // For each word/count pair...
  const int count_size{10}; // Number of places for printing the count
  for (auto pair : counts)
    // Print the word, count, newline. Keep the columns neatly aligned.
    std::cout << std::setw(longest)    << std::left << pair.first <<
                 std::setw(count_size) << std::right << pair.second << '\n';
}

Some of you may have written a program very similar to mine. Others among you—particularly those living outside the United States—may have written a slightly different program. Perhaps you included other characters in your string of acceptable characters.

For example, if you are French and using Microsoft Windows (and the Windows 1252 character set), you may have defined the okay object as follows:

std::string okay{"ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÄÇÈÉÊËÎÏÔÙÛÜŒŸ"
                 "abcdefghijklmnopqrstuvwxyzàáäçèéêëîïôùûüœÿ"
                 "0123456789-_"};

But what if you then try to compile and run this program in a different environment, particularly one that uses the ISO 8859-1 character set (popular on UNIX systems)? ISO 8859-1 and Windows 1252 share many character codes but differ in a few significant ways. In particular, the characters 'Œ', 'œ', and 'Ÿ' are missing from ISO 8859-1. As a result, the program may not compile successfully in an environment that uses ISO 8859-1 for the compile-time character set.

What if you want to share the program with a German user? Surely that user would want to include characters such as 'Ö', 'ö', and 'ß' as letters. What about Greek, Russian, and Japanese users?

We need a better solution. Wouldn’t it be nice if C++ provided a simple function that would notify us if a character is a letter, without forcing us to hard-code exactly which characters are letters? Fortunately, it does.

Character Categories

An easier way to write the program in Listing 18-1 is to call the isalnum function (declared in <locale>). This function indicates whether a character is alphanumeric in the runtime character set. The advantage of using isalnum is that you don’t have to enumerate all the possible alphanumeric characters; you don’t have to worry about differing character sets; and you don’t have to worry about accidentally omitting a character from the approved string.

Rewrite Listing 18-1 to call isalnum instead of find. The first argument to std::isalnum is the character to test, and the second is std::locale{""}. (Don’t worry yet about what that means. Have patience: I’ll get to that soon.)

Try running the program with a variety of alphabetic input, including accented characters. Compare the results with the results from your original program. The files that accompany this book include some samples that use a variety of character sets. Choose the sample that matches your everyday character set and run the program again, redirecting the input to that file.

If you need help with the program, see my version of the program in Listing 18-2. For the sake of brevity, I eliminated the neat-output part of the code, reverting to simple strings and tabs. Feel free to restore the pretty output, if you desire.

Listing 18-2. Testing a Character by Calling std::isalnum

#include <iostream>
#include <locale>
#include <map>
#include <string>
 
int main()
{
  typedef std::map<std::string, int>    count_map;
 
  count_map counts{};
 
  // Read words from the standard input and count the number of times
  // each word occurs.
  std::string word{};
  while (std::cin >> word)
  {
    // Make a copy of word, keeping only alphanumeric characters.
    std::string copy{};
    for (char ch : word)
      if (std::isalnum(ch, std::locale{""}))
        copy.push_back(ch);
    // The "word" might be all punctuation, so the copy would be empty.
    // Don't count empty strings.
    if (not copy.empty())
      ++counts[copy];
  }
 
  // For each word/count pair, print the word & count on one line.
  for (auto pair : counts)
    std::cout << pair.first << '\t' << pair.second << '\n';
}

Now turn your attention to the std::locale{""} argument. The locale directs std::isalnum to the character set it should use to test the character. As you saw in Exploration 17, the character set determines the identity of a character, based on its numeric value. A user can change character sets while a program is running, so the program must keep track of the user’s actual character set and not depend on the character set that was active when you compiled the program.

Download the files that accompany this book and find the text files whose names begin with sample. Find the one that best matches the character set you use every day, and select that file as the redirected input to the program. Look for the appearance of the special characters in the output.

Change locale{""} to locale{} in the call to std::isalnum in Listing 18-2. Now compile and run the program with the same input. Do you see a difference? ________________ If so, what is the difference?

_____________________________________________________________

Without knowing more about your environment, I can’t tell you what you should expect. If you are using a Unicode character set, you won’t see any difference. The program would not treat any of the special characters as letters, even when you can plainly see they are letters. This is due to the way Unicode is implemented, and Exploration 55 will discuss this topic in depth.

Other users will notice that only one or two strings make it to the output. Western Europeans who use ISO 8859-1 may notice that ÁÇÐÈ is considered a word. Greek users of ISO 8859-7 will see ΑΒΓΔΕ as a word.

Power users who know how to change their character sets on the fly can try several different options. You must change the character set that programs use at runtime and the character set that your console uses to display text.

What is most noticeable is that the characters the program considers to be letters vary from one character set to another. But after all, that’s the idea of different character sets. The knowledge of which characters are letters in which character sets is embodied in the locale.

Locales

In C++, a locale is a collection of information pertaining to a culture, region, and language. The locale includes information about

formatting numbers, currency, dates, and time
classifying characters (letter, digit, punctuation, etc.)
converting characters from uppercase to lowercase and vice versa
sorting text (e.g., is 'A' less than, equal to, or greater than 'Å'?)
message catalogs (for translations of strings that your program uses)

Every C++ program begins with a minimal, standard locale, which is known as the classic or "C" locale. The std::locale::classic() function returns the classic locale. The unnamed locale, std::locale{""}, represents the user’s preferences, which C++ obtains from the host operating system. The locale with the empty-string argument is often known as the native locale.

The advantage of the classic locale is that its behavior is known and fixed. If your program must read data in a fixed format, you don’t want the user’s preferences getting in the way. By contrast, the advantage of the native locale is that the user chose those preferences for a reason and wants to see program output follow them. A user who always specifies a date as day/month/year doesn’t want a program printing month/day/year simply because that’s the convention in the programmer’s home country.

Thus, the classic locale is often used for reading and writing data files, and the native locale is best used to interpret input from the user and to present output directly to the user.

Every I/O stream has its own locale object. To affect the stream’s locale, call its imbue function, passing the locale object as the sole argument.

Note You read that correctly: imbue, not setlocale or setloc—given that the getloc function returns a stream’s current locale—or anything else that might be easy to remember. On the other hand, imbue is such an unusual name for a member function, you may remember it for that reason alone.

In other words, when C++ starts up, it initializes each stream with the classic locale, as follows:

std::cin.imbue(std::locale::classic());
std::cout.imbue(std::locale::classic());

Suppose you want to change the output stream to adopt the user’s native locale. Do this using the following statement at the start of your program:

std::cout.imbue(std::locale{""});

For example, suppose you have to write a program that reads a list of numbers from the standard input and computes the sum. The numbers are raw data from a scientific instrument, so they are written as digit strings. Therefore, you should continue to use the classic locale to read the input stream. The output is for the user’s benefit, so the output should use the native locale.

Write the program and try it with very large numbers, so the output will be greater than 1000. What does the program print as its output? ________________

See Listing 18-3 for my approach to solving this problem.

Listing 18-3. Using the Native Locale for Output

#include <iostream>
#include <locale>
 
int main()
{
  std::cout.imbue(std::locale{""});
 
  int sum{0};
  int x{};
  while (std::cin >> x)
    sum = sum + x;
  std::cout << "sum = " << sum << '\n';
}

When I run the program in Listing 18-3 in my default locale (United States), I get the following result:

sum = 1,234,567

Notice the commas that separate thousands. In some European countries, you might see the following instead:

sum = 1.234.567

You should obtain a result that conforms to native customs, or at least follows the preferences that you set in your host operating system.

When you use the native locale, I recommend defining a variable of type std::locale in which to store it. You can pass this variable to isalnum, imbue, or other functions. By creating this variable and distributing copies of it, your program has to query the operating system for your preferences only once, not every time you need the locale. Thus, the main loop ends up looking something like Listing 18-4.

Listing 18-4. Creating and Sharing a Single Locale Object

#include <iostream>
#include <locale>
#include <map>
#include <string>
 
int main()
{
  typedef std::map<std::string, int>    count_map;
 
  std::locale native{""};             // get the native locale
  std::cin.imbue(native);             // interpret the input and output according to
  std::cout.imbue(native);            // the native locale
 
  count_map counts{};
 
  // Read words from the standard input and count the number of times
  // each word occurs.
  std::string word{};
  while (std::cin >> word)
  {
    // Make a copy of word, keeping only alphanumeric characters.
    std::string copy{};
    for (char ch : word)
      if (std::isalnum(ch, native))
        copy.push_back(ch);
    // The "word" might be all punctuation, so the copy would be empty.
    // Don't count empty strings.
    if (not copy.empty())
      ++counts[copy];
  }
 
  // For each word/count pair, print the word & count on one line.
  for (auto pair : counts)
    std::cout << pair.first << '\t' << pair.second << '\n';
}

The next step toward improving the word-counting program is to ignore case differences, so the program does not count the word The as different from the. It turns out this problem is trickier than it first appears, so it deserves an entire Exploration of its own.
