Chapter 14

Using Strings and Regular Expressions

WHAT’S IN THIS CHAPTER?

  • The differences between C-style strings and C++ strings
  • How you can localize your applications to reach a worldwide audience
  • How to use regular expressions to do powerful pattern matching

Every program that you write will use strings of some kind. In the C language, there is little choice but to represent a string as a plain null-terminated character array. Unfortunately, doing so can cause a lot of problems, such as buffer overflows, which can result in security vulnerabilities. The C++ Standard Library includes a safe and easy-to-use string class that does not have these disadvantages.

The first section of this chapter discusses strings in more detail. It starts with a discussion of the old C-style strings, explains their disadvantages, and ends with the C++ string class. It also mentions raw string literals, which are new in C++11.

The second section discusses localization, which is becoming more and more important these days to allow you to write software that can be localized to different regions around the world.

The last section introduces the new C++11 regular expressions library, which makes it easy to perform pattern matching on strings. Regular expressions allow you to search for sub-strings matching a given pattern, and also to validate, parse, and transform strings. They are powerful, and it is recommended that you start using them instead of manually writing your own string processing code.

DYNAMIC STRINGS

Strings in languages that have supported them as first-class objects tend to have a number of attractive features, such as being able to expand to any size, or have sub-strings extracted or replaced. In other languages, such as C, strings were almost an afterthought; there was no really good “string” data type, just fixed arrays of bytes. The “string library” was nothing more than a collection of rather primitive functions without even bounds checking. C++ provides a string type as a first-class data type, and the strings are implemented using templates and operator overloading.

C-Style Strings

In the C language, strings are represented as an array of characters. The last character of a string is a null character ('\0') so that code operating on the string can determine where it ends. This null character is officially known as NUL, spelled with one L, not two. NUL is not the same as the NULL pointer. Even though C++ provides a better string abstraction, it is important to understand the C technique for strings because they still arise in C++ programming. One of the most common situations is where a C++ program has to call a C-based interface in some third-party library or as part of interfacing to the operating system.

By far, the most common mistake that programmers make with C strings is that they forget to allocate space for the '\0' character. For example, the string "hello" appears to be five characters long, but six characters worth of space are needed in memory to store the value, as shown in Figure 14-1.

C++ contains several functions from the C language that operate on strings. As a general rule of thumb, these functions do not handle memory allocation. For example, the strcpy() function takes two strings as parameters. It copies the second string onto the first, whether it fits or not. The following code attempts to build a wrapper around strcpy() that allocates the correct amount of memory and returns the result, instead of taking in an already allocated string. It uses the strlen() function to obtain the length of the string.

char* copyString(const char* inString)
{
    char* result = new char[strlen(inString)];  // BUG! Off by one!
    strcpy(result, inString);
    return result;
}

Code snippet from CStrings\strcpy.cpp

The copyString() function as written is incorrect. The strlen() function returns the length of the string, not the amount of memory needed to hold it. For the string "hello", strlen() will return 5, not 6. The proper way to allocate memory for a string is to add one to the amount of space needed for the actual characters. It seems a little weird at first to have +1 all over the place, but it quickly becomes natural, and you (hopefully) miss it when it’s not there.

char* copyString(const char* inString) 
{
    char* result = new char[strlen(inString) + 1];
    strcpy(result, inString);
    return result;
}

Code snippet from CStrings\strcpy.cpp

One way to remember that strlen() returns only the number of actual characters in the string is to consider what would happen if you were allocating space for a string made up of several others. For example, if your function took in three strings and returned a string that was the concatenation of all three, how big would it be? To hold exactly enough space, it would be the length of all three strings, added together, plus one for the trailing '\0' character. If strlen() included the '\0' in the length of the string, the allocated memory would be too big. The following code uses the strcpy() and strcat() functions to perform this operation.

char* appendStrings(const char* inStr1, const char* inStr2, const char* inStr3) 
{
    char* result = new char[strlen(inStr1) + strlen(inStr2) + strlen(inStr3) + 1];
    strcpy(result, inStr1);
    strcat(result, inStr2);
    strcat(result, inStr3);
    return result;
}

Code snippet from CStrings\strcpy.cpp
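Because appendStrings() allocates its result with new[], the caller becomes responsible for releasing it with delete[]. The following is a minimal sketch of that ownership rule; joinedLength() is a hypothetical caller introduced only for illustration, and appendStrings() is repeated so that the example stands on its own.

```cpp
#include <cstddef>
#include <cstring>

// Same helper as in the text: allocates exactly enough room for all three
// strings plus one extra char for the terminating '\0'.
char* appendStrings(const char* inStr1, const char* inStr2, const char* inStr3)
{
    char* result = new char[std::strlen(inStr1) + std::strlen(inStr2) +
                            std::strlen(inStr3) + 1];
    std::strcpy(result, inStr1);
    std::strcat(result, inStr2);
    std::strcat(result, inStr3);
    return result;
}

// Hypothetical caller: uses the buffer, then releases it.
std::size_t joinedLength(const char* a, const char* b, const char* c)
{
    char* joined = appendStrings(a, b, c);
    std::size_t length = std::strlen(joined);  // counts characters, not the '\0'
    delete[] joined;                           // allocated with new[], so use delete[]
    return length;
}
```

Forgetting the delete[] in a caller like this leaks the buffer, which is one of the reasons the C++ string class discussed later in this chapter is preferable.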

Note that sizeof() is not the same as strlen(). You should never use sizeof() to try to get the size of a string. For example:

char text1[] = "abcdef";
size_t s1 = sizeof(text1);  // is 7
size_t s2 = strlen(text1);  // is 6
const char* text2 = "abcdef";
size_t s3 = sizeof(text2);  // is 4 or 8: the size of a pointer
size_t s4 = strlen(text2);  // is 6

Code snippet from CStrings\strlen.cpp

s3 will be 4 when compiled in 32-bit mode and 8 when compiled in 64-bit mode, because sizeof() returns the size of the pointer, not the size of the string it points to.

A complete list of C functions to operate on strings can be found in the <cstring> header file.


When you use the C-style string functions with Microsoft Visual Studio, the compiler is likely to give security-related warnings about these functions being deprecated. You can eliminate these warnings by using the new C standard library functions, such as strcpy_s() or strcat_s(), which are part of the new “secure C library” standard (ISO/IEC TR 24731). However, the best solution is to switch to the C++ string class, which is discussed later in this chapter.

String Literals

You’ve probably seen strings written in a C++ program with quotes around them. For example, the following code outputs the string hello by including the string itself, not a variable that contains it:

cout << "hello" << endl;

In the preceding line, "hello" is a string literal because it is written as a value, not a variable. String literals can be assigned to variables, but doing so can be risky. The actual memory associated with a string literal is in a read-only part of memory. This allows the compiler to optimize memory usage by reusing references to equivalent string literals. That is, even if your program uses the string literal "hello" 500 times, the compiler can create just one instance of hello in memory. This is called literal pooling.

The C++ standard officially says that string literals are of type “array of n const char.” However, for backward compatibility with older non-const-aware code, most compilers do not force you to assign a string literal only to a variable of type const char* or const char[]. They let you assign a string literal to a char* without const, and the program will work fine unless you attempt to change the string. Attempting to change the string is undefined behavior and will typically crash your program, as demonstrated in the following code:

char* ptr = "hello";       // Assign the string literal to a variable.
ptr[1] = 'a';              // CRASH! Attempts to write to read-only memory

A much safer way to code is to use a pointer to const characters when referring to string literals. The following code contains the same bug, but because it assigns the literal to a pointer to const char, the compiler will catch the attempt to modify the string.

const char* ptr = "hello"; // Assign the string literal to a variable.
ptr[1] = 'a';              // BUG! Compiler rejects the write to a const char

You can also use a string literal as an initial value for a character array (char[]). In this case, the compiler creates an array that is big enough to hold the string and copies the string to this array. So, the compiler will not put the literal in read-only memory and will not do any literal pooling.

char arr[] = "hello"; // Compiler takes care of creating appropriate sized
                      // character array arr.
arr[1] = 'a';         // The contents can be modified. 

The C++ string Class

As mentioned earlier, C++ provides a much-improved implementation of the concept of a string as part of the Standard Library. In C++, string is a class (actually an instantiation of the basic_string template class) that supports many of the same functionalities as the <cstring> functions but takes care of memory allocation for you if you use it properly. The string class has already been used on a number of occasions earlier in this book. Now it’s time to take a deeper look at it.

What Was Wrong with C-Style Strings?

To understand the necessity of the C++ string class, consider the advantages and disadvantages of C-style strings.

Advantages:

  • They are simple, making use of the underlying basic character type and array structure.
  • They are lightweight, taking up only the memory that they need if used properly.
  • They are low level, so you can easily manipulate and copy them as raw memory.
  • They are well understood by C programmers — why learn something new?

Disadvantages:

  • They require incredible efforts to simulate a first-class string data type.
  • They are unforgiving and susceptible to difficult-to-find memory bugs.
  • They don’t leverage the object-oriented nature of C++.
  • They require knowledge of their underlying representation on the part of the programmer.

The preceding lists were carefully constructed to make you think that perhaps there is a better way. As you’ll learn, C++ strings solve all the problems of C strings and render most of the arguments about the advantages of C strings over a first-class data type irrelevant.

Using the string Class

Even though string is a class, you can almost always treat it as if it were a built-in type. In fact, the more you think of it as a simple type, the better off you are. Programmers generally encounter the least trouble with string when they forget that strings are objects.

Through the magic of operator overloading, C++ strings are much easier to use than C strings. For example, two strings can be concatenated by using the + operator:

string A("abc");
string B("def");
string C;
C = A + B;    // C will become "abcdef"

The + operator does not try to “add” the values; the + is redefined as meaning “string concatenation.” For example, the following produces 1234, not 46:

string A("12");
string B("34");
string C;
C = A + B;

The += operator is also overloaded to allow you to easily append a string:

string A("12");
string B("34");
A += B;    // A will become "1234"

Another problem with C strings was that you could not use == to compare them. Suppose you have the following two strings:

const char* a = "12";
char b[] = "12";

Writing a comparison as follows always returned false, because it compared the pointer values, not the contents of the strings:

if (a == b)

You had to write something as follows:

if (strcmp(a, b) == 0)

Furthermore, there was no way to use <, <=, >=, or > to compare C strings, so you had to check whether strcmp() returned a negative value, 0, or a positive value, depending on the lexicographic relationship of the strings. This resulted in clumsy, error-prone code.

With C++ strings, operator==, operator!=, operator<, and so on are all overloaded to work on the actual string characters. Individual characters can still be accessed with operator[].
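The difference between the two styles of comparison can be sketched as follows. The function names sortsBeforeC() and sortsBefore() are hypothetical and exist only for this illustration.

```cpp
#include <cstring>
#include <string>

// C style: strcmp() returns a negative value when a sorts before b.
bool sortsBeforeC(const char* a, const char* b)
{
    return std::strcmp(a, b) < 0;
}

// C++ style: operator< compares the characters of the two strings.
bool sortsBefore(const std::string& a, const std::string& b)
{
    return a < b;
}
```

Both functions give the same answer for the same input, but the C++ version reads like a comparison of built-in values.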

As the following code shows, when string operations require extending the string, the memory requirements are automatically handled by the string class, so memory overruns are a thing of the past:

string myString = "hello";
myString += ", there";
string myOtherString = myString;
if (myString == myOtherString) {
    myOtherString[0] = 'H';
}
cout << myString << endl;
cout << myOtherString << endl;

Code snippet from CppStrings\CppStrings.cpp

The output of this code is:

hello, there
Hello, there

There are several things to note in this example. One point to note is that there are no memory leaks even though strings are allocated and resized left and right. All of these string objects were created as stack variables. While the string class certainly had a bunch of allocating and resizing to do, the string destructors cleaned up this memory when string objects went out of scope.

Another point to note is that the operators work the way you would want them to. For example, the = operator copies the strings, which is most likely what you wanted. If you are used to working with array-based strings, this will either be refreshingly liberating for you or somewhat confusing. Don’t worry — once you learn to trust the string class to do the right thing, life gets so much easier.

For compatibility, you can use the c_str() method on a string to get a const character pointer, representing a C-style string. However, the returned const pointer becomes invalid whenever the string has to perform any memory reallocation, or when the string object is destroyed. You should call the method just before using the result so that it accurately reflects the current contents of the string.
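As a small sketch of this advice, the following code passes a string to a C-style interface by calling c_str() at the point of use. legacyLength() is a hypothetical stand-in for a C API that accepts only const char*; it is not part of any real library.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a C API that only accepts const char*.
std::size_t legacyLength(const char* text)
{
    return std::strlen(text);
}

std::size_t callLegacy(const std::string& s)
{
    // Call c_str() right where the pointer is needed; do not keep the
    // pointer around across operations that may modify the string.
    return legacyLength(s.c_str());
}
```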

The Standard Library Reference resource on the website lists all the operations you can perform on string objects.

Numeric Conversions

C++11 includes a number of new helper functions making it easier to convert numerical values into strings or strings into numerical values. The following functions are available to convert numerical values into strings:

  • string to_string(int val);
  • string to_string(unsigned val);
  • string to_string(long val);
  • string to_string(unsigned long val);
  • string to_string(long long val);
  • string to_string(unsigned long long val);
  • string to_string(float val);
  • string to_string(double val);
  • string to_string(long double val);

They are pretty straightforward to use. For example, the following code converts a long double value into a string:

long double d = 3.14;
string s = to_string(d);

There are also wide string versions available, which are called to_wstring and return a wstring. Wide strings are discussed later in this chapter.

Converting in the other direction is done by the following set of functions. In these prototypes, str is the string that you want to convert, idx is a pointer that will receive the index of the first non-converted character, and base is the mathematical base that should be used during conversion. The idx pointer can be a null pointer, in which case it is ignored. These functions throw invalid_argument if no conversion could be performed and out_of_range if the converted value is outside the range of the return type.

  • int stoi(const string& str, size_t *idx=0, int base=10);
  • long stol(const string& str, size_t *idx=0, int base=10);
  • unsigned long stoul(const string& str, size_t *idx=0, int base=10);
  • long long stoll(const string& str, size_t *idx=0, int base=10);
  • unsigned long long stoull(const string& str, size_t *idx=0, int base=10);
  • float stof(const string& str, size_t *idx=0);
  • double stod(const string& str, size_t *idx=0);
  • long double stold(const string& str, size_t *idx=0);

For example:

const string s = "1234";
int i = stoi(s);    // i will be 1234

A similar set of functions is available accepting a wstring instead of a string.
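The idx and base parameters and the two exceptions can be combined into a stricter conversion. The following is a sketch only; parseWholeInt() is a hypothetical helper, and rejecting trailing characters is a design choice of this example, not something stoi() does by itself.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Parses text as an int in the given base. Returns true and stores the
// result in value only if the whole string was consumed by the conversion.
bool parseWholeInt(const std::string& text, int base, int& value)
{
    try {
        std::size_t idx = 0;
        int parsed = std::stoi(text, &idx, base);
        if (idx != text.size()) {
            return false;          // trailing characters were not converted
        }
        value = parsed;
        return true;
    } catch (const std::invalid_argument&) {
        return false;              // no conversion could be performed
    } catch (const std::out_of_range&) {
        return false;              // converted value does not fit in an int
    }
}
```

With this helper, "1234" in base 10 and "ff" in base 16 are accepted, while "12abc" is rejected because idx stops at the first non-converted character.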

Raw String Literals

C++11 adds the concept of raw string literals, which are string literals in which escape sequences like \t and \n are not processed as escape sequences but as normal text. These escape sequences are discussed in Chapter 1. If you write the following with a normal string literal, you will get a compiler error, because a normal string literal cannot span multiple lines and the example contains non-escaped quotes in the middle of the string, which are also not allowed:

string str = "Line 1
line "2" \t (and)
end";

To make the preceding work you need to use a raw string literal, which has the following general format:

R"d-char-sequence(r-char-sequence)d-char-sequence"

The d-char-sequence is an optional delimiter sequence, which should be the same at the beginning and at the end of the raw string. This delimiter sequence can have at most 16 characters. The r-char-sequence is the actual raw string. The preceding example can be modified to use a raw string literal as follows:

string str = R"~(Line 1
line "2" \t (and)
end)~";

Code snippet from RawStringLiteral\RawStringLiteral.cpp

In this example, the delimiter sequence is a single ~ character, which means the raw string literal has to start with R"~( and end with )~". As mentioned before, the delimiter is optional, so the following code is equivalent:

string str = R"(Line 1
line "2" \t (and)
end)";

Code snippet from RawStringLiteral\RawStringLiteral.cpp

If you write str to the console the output will be:

Line 1
line "2" \t (and)
end

With the raw string literal you do not need to escape quotes in the middle of the string, and the \t escape sequence is not replaced with an actual tab character but is taken literally.

You might wonder what the point is of the optional delimiter sequence. It is required for strings that have a character sequence in the middle of the string that could be interpreted as the end of the raw string. For example, the following string is not valid because it contains the )" in the middle of the string, which is interpreted by the compiler as the end of the string:

string str = R"(The characters )" are embedded in this string)";

If you want the preceding string, you need to use a delimiter sequence that does not occur in the string itself, for example:

string str = R"-(The characters )" are embedded in this string)-";

Raw string literals will make life much easier for working with database querying strings, regular expressions, and so on. Regular expressions are discussed later in this chapter.
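A Windows-style file path is another case where raw string literals help, because every backslash in a normal literal has to be escaped. The following sketch writes the same path both ways; the path and the function names are made up for the illustration.

```cpp
#include <string>

// Normal literal: every backslash has to be written as an escape sequence.
std::string escapedPath()
{
    return "C:\\temp\\new\\file.txt";
}

// Raw literal: backslashes are taken literally, so \t and \n in the path
// are not turned into a tab or a newline.
std::string rawPath()
{
    return R"(C:\temp\new\file.txt)";
}
```

Both functions return the same 20-character string.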

LOCALIZATION

When you’re learning how to program in C or C++, it’s useful to think of a character as equivalent to a byte and to treat all characters as members of the ASCII character set (American Standard Code for Information Interchange). ASCII is a 7-bit set usually stored in an 8-bit char type. In reality, experienced C++ programmers recognize that successful programs are used throughout the world. Even if you don’t initially write your program with international audiences in mind, you shouldn’t prevent yourself from localizing, or making the software locale aware, at a later date.

Localizing String Literals

A critical aspect of localization is that you should never put any native-language literal strings in your source code, except maybe for debug strings targeted at the developer. In Microsoft Windows applications, this is accomplished by putting the strings in STRINGTABLE resources. Most other platforms offer similar capabilities. If you need to translate your application to another language, translating those resources should be all that needs to be done, without requiring any source changes. There are tools available that help you with this translation process.

To make your source code localizable, you should not use cout to compose sentences out of string literals, even if the individual literals can be localized. For example:

cout << "Read " << n << " bytes" << endl;

This cout statement cannot be localized to Dutch because it requires a reordering of the words. The Dutch translation is as follows:

cout << n << " bytes gelezen" << endl;

To make sure you can properly localize this cout statement, you could implement something as follows:

cout << Format(IDS_TRANSFERRED, n) << endl;

IDS_TRANSFERRED is the name of an entry in a string resource table. For the English version, IDS_TRANSFERRED could be defined as "Read $1 bytes", while the Dutch version of the resource could be defined as "$1 bytes gelezen". The Format() function loads the string resource and substitutes $1 with the value of n.
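The text does not show an implementation of Format(), so the following is only a minimal sketch of the idea. It assumes an in-memory table instead of a real platform resource, and it adds a language parameter that the call shown earlier does not have; loadString(), the StringId enumeration, and the "en" and "nl" keys are all hypothetical.

```cpp
#include <map>
#include <string>

// Hypothetical resource identifiers; a real application would load the text
// from a platform resource such as a STRINGTABLE.
enum StringId { IDS_TRANSFERRED };

// Hypothetical lookup: returns the localized text for id in the given language.
std::string loadString(StringId id, const std::string& language)
{
    static const std::map<std::string, std::map<StringId, std::string>> table = {
        { "en", { { IDS_TRANSFERRED, "Read $1 bytes" } } },
        { "nl", { { IDS_TRANSFERRED, "$1 bytes gelezen" } } }
    };
    return table.at(language).at(id);
}

// Replaces the first "$1" placeholder in the localized text with n.
std::string Format(StringId id, int n, const std::string& language)
{
    std::string text = loadString(id, language);
    const std::string placeholder = "$1";
    std::string::size_type pos = text.find(placeholder);
    if (pos != std::string::npos) {
        text.replace(pos, placeholder.size(), std::to_string(n));
    }
    return text;
}
```

Because the placeholder position is part of the translated text, the translator can reorder the sentence without any change to the source code.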

Wide Characters

The problem with viewing a character as a byte is that not all languages, or character sets, can be fully represented in 8 bits, or 1 byte. C++ has a built-in type called wchar_t that holds a wide character. Languages with non-ASCII (U.S.) characters, such as Japanese and Arabic, can be represented in C++ with wchar_t. However, the C++ standard does not define a size for wchar_t. Some compilers use 16 bits while others use 32 bits. To write portable software, it is not safe to assume that sizeof(wchar_t) is any particular numerical value.

If there is any chance that your program will be used in a non-Western character set context (hint: there is!), you should use wide characters from the beginning. When working with wchar_t, string and character literals are prefixed with the letter L to indicate that a wide-character encoding should be used. For example, to initialize a wchar_t character to be the letter m, you would write it like this:

wchar_t myWideCharacter = L'm';

There are wide-character versions of most of your favorite types and classes. The wide string class is wstring. The “prefix letter w” pattern applies to streams as well. Wide-character file output streams are handled with the wofstream, and input is handled with the wifstream. The joy of pronouncing these class names (woof-stream? whiff-stream?) is reason enough to make your programs locale aware! Streams are discussed in detail in Chapter 15.

In addition to cout, cin, cerr, and clog there are wide versions of the built-in console and error streams called wcout, wcin, wcerr, and wclog. Using them is no different than using the non-wide versions:

wcout << L"I am wide-character aware." << endl;

Code snippet from WideStrings\wcout.cpp

Non-Western Character Sets

Wide characters are a great step forward because they increase the amount of space available to define a single character. The next step is to figure out how that space is used. In wide character sets, just like in ASCII, a number corresponds to a particular glyph. The only difference is that each number does not fit in 8 bits. The map of characters to numbers (now called code points) is quite a bit larger because it handles many different character sets in addition to the characters that English-speaking programmers are familiar with.

The Universal Character Set (UCS), defined by the International Standard ISO 10646, and Unicode are both standardized sets of characters. They contain around one hundred thousand abstract characters, each identified by an unambiguous name and an integer number called its code point. The same characters with the same numbers exist in both standards. Both have specific encodings that you can use. For example, UTF-8 is an example of a Unicode encoding where Unicode characters are encoded using one to four 8-bit bytes. UTF-16 encodes Unicode characters as one or two 16-bit values and UTF-32 encodes Unicode characters as exactly 32 bits.

Different applications can use different encodings. Unfortunately, the C++ standard does not specify a size for wide characters (wchar_t). On Windows it is 16 bits, while on other platforms it could be 32 bits. You need to be aware of this when using wide characters for character encoding in cross platform code. To help solve this issue, C++11 introduces two new character types: char16_t and char32_t. The following list gives an overview of all character types supported by C++11:

  • char: Stores 8 bits. Can be used to store ASCII characters, or as a basic building block for storing UTF-8 encoded Unicode characters, where one Unicode character is encoded as one to four chars.
  • char16_t: Stores 16 bits. Can be used as the basic building block for UTF-16 encoded Unicode characters where one Unicode character is encoded as one or two char16_ts.
  • char32_t: Stores 32 bits. Can be used for storing UTF-32 encoded Unicode characters as one char32_t.
  • wchar_t: Stores a wide character of a compiler-specific size and encoding.

The benefit of using char16_t and char32_t instead of wchar_t is that the sizes of char16_t and char32_t are compiler-independent, whereas the size of wchar_t depends on your compiler.

The standard also defines the following two macros:

  • __STDC_UTF_32__: If this is defined, the type char32_t represents a UTF-32 encoding. If it is not defined, the type char32_t has a compiler dependent encoding.
  • __STDC_UTF_16__: If this is defined, the type char16_t represents a UTF-16 encoding. If it is not defined, the type char16_t has a compiler dependent encoding.

C++11 defines three new string prefixes in addition to the existing L prefix. The complete set of supported string prefixes is as follows:

  • u8: A char string literal with UTF-8 encoding.
  • u: A char16_t string literal, which can be UTF-16 if __STDC_UTF_16__ is defined.
  • U: A char32_t string literal, which can be UTF-32 if __STDC_UTF_32__ is defined.
  • L: A wchar_t string literal with a compiler-dependent encoding.

All of these string literals can also be combined with the raw string literal seen earlier in this chapter. For example:

const char* s1 = u8R"(Raw UTF-8 encoded string literal)";
const wchar_t* s2 = LR"(Raw wide string literal)";
const char16_t* s3 = uR"(Raw char16_t string literal)";
const char32_t* s4 = UR"(Raw char32_t string literal)";

Code snippet from CharTypes\CharTypes.cpp

If you are using a Unicode encoding, for example by using u8 UTF-8 string literals or when __STDC_UTF_16__ or __STDC_UTF_32__ is defined, you can insert a specific Unicode code point in your non-raw string literal by using the \uABCD notation. For example, \u03C0 represents the π character, and \u00B2 represents the ² character. The following code prints "π r²":

const char* formula = u8"\u03C0 r\u00B2";
cout << formula << endl;

Code snippet from CharTypes\CharTypes.cpp

The C++ string library has also been extended to include two new typedefs to work with the new character types:

  • typedef basic_string<char16_t> u16string;
  • typedef basic_string<char32_t> u32string;
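To see how these two string types count their contents, consider the following sketch. It assumes that the compiler encodes u literals as UTF-16 and U literals as UTF-32, which is the case when __STDC_UTF_16__ and __STDC_UTF_32__ are defined; the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Counts the 16-bit units in a literal holding U+03C0 followed by U+1F600.
// U+1F600 does not fit in 16 bits, so UTF-16 stores it as two units.
std::size_t utf16Units()
{
    std::u16string s = u"\u03C0\U0001F600";
    return s.size();
}

// Counts the 32-bit units in the same text: one unit per code point.
std::size_t utf32Units()
{
    std::u32string s = U"\u03C0\U0001F600";
    return s.size();
}
```

The size() method reports code units, not characters, so the same two code points give a different size in each string type.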

Additionally, the following four new conversion functions related to char16_t and char32_t are included: mbrtoc16, c16rtomb, mbrtoc32 and c32rtomb.

Unfortunately, the support for char16_t and char32_t stops there. For example, the I/O stream classes in the C++11 standard library do not include support for these new character types. This means that there is nothing like a version of cout or cin that supports char16_t and char32_t, making it difficult to print such strings to a console or to read them from user input. If you want to do more with char16_t and char32_t strings, you will have to resort to third-party libraries.

Locales and Facets

Character sets are only one of the differences in data representation between countries. Even countries that use similar character sets, such as Great Britain and the United States, still differ in how they represent data such as dates and money.

The standard C++ mechanism that groups specific data about a particular set of cultural parameters is called a locale. An individual component of a locale, such as date format, time format, number format, etc. is called a facet. An example of a locale is U.S. English. An example of a facet is the format used to display a date. There are several built-in facets that are common to all locales. The language also provides a way to customize or add facets.

Using Locales

When using I/O streams, data is formatted according to a particular locale. Locales are objects that can be attached to a stream. They are defined in the <locale> header file. Locale names can be implementation-specific. One standard is to separate the language and the area in two-letter sections with an optional encoding. For example, the locale for the English language as spoken in the U.S. is en_US, while the locale for the English language as spoken in Great Britain is en_GB. The locale for Japanese spoken in Japan with Japanese Industrial Standard encoding is ja_JP.jis.

Locale names on Windows follow a different standard, which has the following general format:

lang[_country_region[.code_page]]

Everything between the square brackets is optional. The following table lists some examples:

                        LINUX GCC    WINDOWS
U.S. English            en_US        English_United States
Great Britain English   en_GB        English_Great Britain

Most operating systems have a mechanism to determine the locale as defined by the user. In C++, you can pass an empty string to the locale object constructor to create a locale from the user’s environment. Once this object is created, you can use it to query the locale, possibly making programmatic decisions based on it. The following code demonstrates how to use the user’s locale by calling the imbue() method on a stream. The result is that everything that is sent to wcout will be formatted according to the formatting rules for your system:

wcout.imbue(locale(""));
wcout << 32767 << endl;

This means that if your system locale is English United States and you output the number 32767, the number will be displayed as 32,767, but if your system locale is Dutch Belgium, the same number will be displayed as 32.767.

The user’s locale is usually not the default locale. The default locale is generally the classic locale, which uses ANSI C conventions. The classic C locale is similar to U.S. English, but there are slight differences. For example, if you do not set a locale at all, or set the default locale, and you output a number, it will be presented without any punctuation:

wcout.imbue(locale("C"));
wcout << 32767 << endl;

The output of this code will be as follows:

32767

The following code manually sets the U.S. English locale, so the number 32767 will be formatted with U.S. English punctuation, independent of your system locale:

wcout.imbue(locale("en_US"));
wcout << 32767 << endl;

The output of this code will be as follows:

32,767
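A named locale such as en_US is not guaranteed to be installed on every system, and the locale constructor throws a runtime_error when the name is not recognized. The following sketch, which uses a hypothetical formatNumber() helper and a string stream instead of wcout, falls back to the classic locale in that case.

```cpp
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Formats value according to the named locale. If that locale is not
// available, the constructor throws std::runtime_error and the classic
// "C" locale is used instead.
std::string formatNumber(const std::string& localeName, int value)
{
    std::ostringstream out;
    try {
        out.imbue(std::locale(localeName));
    } catch (const std::runtime_error&) {
        out.imbue(std::locale::classic());
    }
    out << value;
    return out.str();
}
```

Whether formatNumber("en_US", 32767) returns the number with or without a thousands separator therefore depends on the locales installed on the machine that runs the code.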

A locale object allows you to query information about the locale. For example, the following program creates a locale matching the user’s environment. The name() method is used to get a C++ string that describes the locale. Then, the find() method is used on the string object to find a given sub-string, which returns string::npos when the given sub-string was not found. The code checks for the Windows name and the Linux GCC name. One of two messages is output, depending on whether the locale appears to be U.S. English or not:

locale loc("");
if (loc.name().find("en_US") == string::npos &&
    loc.name().find("United States") == string::npos) {
    wcout << L"Welcome non-U.S. English speaker!" << endl;
} else {
    wcout << L"Welcome U.S. English speaker!" << endl;
}

Code snippet from Locales\Locales.cpp

Using Facets

You can use the std::use_facet() function to obtain a particular facet in a particular locale. The argument to use_facet() is a locale. For example, the following expression retrieves the standard monetary punctuation facet of the British English locale using the Linux GCC locale name:

use_facet<moneypunct<wchar_t>>(locale("en_GB"));

Note that the innermost template type determines the character type to use. This is usually wchar_t or char. The use of nested template classes is unfortunate, but once you get past the syntax, the result is an object that contains all the information you want to know about British money punctuation. The data available in the standard facets are defined in the <locale> header and its associated files.

The following program brings together locales and facets by printing out the currency symbol in both U.S. English and British English. Note that, depending on your environment, the British currency symbol may appear as a question mark, a box, or not at all. If your environment is equipped to handle it, you may actually get the British pound symbol:

locale locUSEng("en_US");
locale locBritEng("en_GB");
wstring dollars = use_facet<moneypunct<wchar_t>>(locUSEng).curr_symbol();
wstring pounds = use_facet<moneypunct<wchar_t>>(locBritEng).curr_symbol();
wcout << L"In the US, the currency symbol is " << dollars << endl;
wcout << L"In Great Britain, the currency symbol is " << pounds << endl;

Code snippet from Facets\use_facet.cpp
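
Constructing a locale from a name throws std::runtime_error when that name is not available on the system, so code that queries facets of named locales may want a fallback. The following sketch assumes narrow characters instead of wchar_t to keep it short; currencySymbolOr() is a hypothetical helper name.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns the currency symbol of the named locale, or fallback when the
// locale name is not available on this system.
std::string currencySymbolOr(const std::string& localeName,
                             const std::string& fallback) {
    try {
        std::locale loc(localeName.c_str());
        return std::use_facet<std::moneypunct<char>>(loc).curr_symbol();
    } catch (const std::runtime_error&) {
        return fallback;  // The platform does not recognize this locale name.
    }
}
```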

REGULAR EXPRESSIONS

Regular expressions are a new and powerful addition to the C++11 Standard Library. They are a special mini-language for string processing. They might seem complicated at first, but once you get to know them, they make working with strings easier. Regular expressions can be used for several string-related operations:

  • Validation: Check if an input string is well-formed.

    For example: Is the input string a well-formed phone number?

  • Decision: Check what kind of string an input represents.

    For example: Is the input string the name of a JPEG or a PNG file?

  • Parsing: Extract information from an input string.

    For example: From a full filename, extract the filename part without the full path and without its extension.

  • Transformation: Search sub-strings and replace them with a new formatted sub-string.

    For example: Search all occurrences of “C++11” and replace them with “C++.”

  • Iteration: Search all occurrences of a sub-string.

    For example: Extract all phone numbers from an input string.

  • Tokenization: Split a string into sub-strings based on a set of delimiters.

    For example: Split a string on whitespace, commas, periods, and so on to extract its individual words.

Of course you could write your own code to perform any of the preceding operations on your strings, but using the regular expressions feature is highly recommended, because writing correct and safe code to process strings can be tricky.

Before we can go into more details on the regular expressions, there is some important terminology to know. The following terms are used throughout the discussion:

  • Pattern: The actual regular expression is a pattern represented by a string.
  • Match: Determines whether there is a match between a given regular expression and all of the characters in a given sequence [first,last).
  • Search: Determines whether there is some sub-string within a given sequence [first,last) that matches a given regular expression.
  • Replace: Identifies sub-strings in a given sequence, and replaces them with a corresponding new sub-string computed from another pattern, called a substitution pattern.

If you look around on the internet you will find out that there are several different grammars for regular expressions. For this reason, C++11 includes support for several of these grammars: ECMAScript, basic, extended, awk, grep, and egrep. If you already know any of these regular expression grammars, you can use it straight away in C++11 by telling the regular expression library to use that specific syntax (syntax_option_type). The default grammar in C++11 is ECMAScript whose syntax is explained in detail in the following section. It is also the most powerful grammar, so it’s highly recommended to use ECMAScript instead of one of the other more limited grammars. Explaining the other regular expression grammars falls outside the scope of this book.

If this is your first encounter with regular expressions, simply stick with the powerful default ECMAScript syntax.

ECMAScript Syntax

A regular expression pattern is a sequence of characters representing what you want to match. Any character in the regular expression matches itself except for the following special characters:

^ $ \ . * + ? ( ) [ ] { } |

These special characters are explained throughout the following discussion. If you need to match one of these special characters, you need to escape it using the \ character. For example:

\[ or \. or \* or \\

Don’t forget that you need to escape the back slash in your C++ string literals. For example, if your regular expression needs to match the single * character, you need to escape it for the regular expression engine and for C++, so your C++ string literal should be "\\*".
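
The double escaping can be tried out with a small sketch. The two function names are hypothetical; each one wraps a pattern whose special character is escaped once for the regular expression engine and once more for the C++ string literal.

```cpp
#include <regex>
#include <string>

// Matches exactly one literal asterisk. The regex engine must see \* so the
// C++ string literal doubles the backslash.
bool isSingleAsterisk(const std::string& text) {
    static const std::regex re("\\*");
    return std::regex_match(text, re);
}

// Matches the literal text a.c only; the escaped dot is no longer a wildcard.
bool isLiteralADotC(const std::string& text) {
    static const std::regex re("a\\.c");
    return std::regex_match(text, re);
}
```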

Anchors

The special characters ^ and $ are called anchors. The ^ character will match the beginning of the string and $ will match the end of the string. For example, ^test$ will match only the string test, and not strings which contain test in the line with anything else like 1test, test2, test abc, and so on.

Wildcards

The wildcard character . can be used to match any character except a newline character. For example, the regular expression a.c will match abc, and a5c, but will not match ab5c, ac and so on.
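
Anchors and wildcards are easiest to see with a search, because a search would otherwise accept a match anywhere in the string. The helper below is a hypothetical sketch built on regex_search(), which is discussed later in this chapter.

```cpp
#include <regex>
#include <string>

// Returns true if pattern matches somewhere in text. Without anchors the
// match may start and end anywhere in the string.
bool found(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}
```

For example, found("test", "1test") is true, but found("^test$", "1test") is false because the anchors force the match to cover the whole string. Similarly, found("^a.c$", "a5c") is true while found("^a.c$", "ac") is false.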

Repetition

Parts of a regular expression can be repeated by using one of four repeats:

  • * matches the preceding part zero or more times. For example: a*b will match b, ab, aab, aaaab, and so on.
  • + matches the preceding part one or more times. For example: a+b will match ab, aab, aaaab, and so on, but not b.
  • ? matches the preceding part zero or one time. For example: a?b will match b and ab, but nothing else.
  • {...} represents a bounded repeat. a{n} will match a repeated exactly n times; a{n,} will match a repeated n times or more, and a{n,m} will match a repeated between n and m times inclusive. For example, ^a{3,4}$ will match aaa and aaaa but not a, aa, aaaaa, and so on.

The repeats described in the previous list are called greedy because they will find the longest match. To make them non-greedy, a ? can be added behind the repeat as in *?, +?, ?? and {...}?. The following table gives an example. The first column is the string on which the regular expression will be applied. The second column represents the matches found by the regular expression a+ and the third column shows the matches found by the non-greedy a+?.

SOURCE STRING   a+              a+?
""              no match        no match
a               matches a       matches a
aa              matches aa      matches a
aaa             matches aaa     matches a
aaaa            matches aaaa    matches a
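
The difference between greedy and non-greedy repeats can be checked with a short sketch. firstMatch() is a hypothetical helper that returns the first sub-string found by regex_search(), which is discussed later in this chapter.

```cpp
#include <regex>
#include <string>

// Returns the first sub-string of text matched by pattern, or an empty
// string when nothing matches.
std::string firstMatch(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re))
        return m[0].str();
    return "";
}
```

Here firstMatch("a+", "aaaa") returns aaaa, while firstMatch("a+?", "aaaa") returns only a. The same effect shows up with markup-like input: <.+> applied to <b>bold</b> matches the whole string, whereas <.+?> stops at <b>.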

Alternation

The | character can be used to specify the “or” relationship. For example, a|b will match a or b.

Grouping

Parentheses () are used to mark sub-expressions, also called capture groups. Capture groups can be used for several purposes:

  • Capture groups can be used to identify individual sub-sequences of the original string; each marked sub-expression (capture group) will be returned in the result. For example, take the following regular expression: (.*)(ab|cd)(.*). It has three marked sub-expressions. Running a regex_search() with this regular expression on 123cd4 will result in a match with four entries. The first entry is the entire match 123cd4 followed by three entries for the three marked sub-expressions. These three entries are 123, cd and 4. The details on how to use the regex_search() algorithm are shown in a later section.
  • Capture groups can be used during matching for a purpose called back references (explained later).
  • Capture groups can be used to identify components during replace operations (explained later).
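
The four entries from the first bullet can be collected with a few lines of code. This is a sketch only; matchEntries() is a hypothetical helper, and regex_search() and smatch are covered in detail later in this chapter.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every entry of the match: index 0 is the whole match, followed by
// one entry per capture group. Returns an empty vector when nothing matches.
std::vector<std::string> matchEntries(const std::string& pattern,
                                      const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    std::vector<std::string> entries;
    if (std::regex_search(text, m, re)) {
        for (const auto& sub : m)
            entries.push_back(sub.str());
    }
    return entries;
}
```

Calling matchEntries("(.*)(ab|cd)(.*)", "123cd4") returns the four strings 123cd4, 123, cd, and 4.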

Precedence

Just as with mathematical formulas it’s important to know the precedence of the regular expression elements. Precedence is as follows:

  • Elements: like a are the basic building blocks of a regular expression.
  • Quantifiers: like +, *, ? and {...} bind tightly to the element on the left, for example b+.
  • Concatenation: like ab+c binds after quantifiers.
  • Alternation: like | binds last.

For example, take the regular expression ab+c|d. This will match abc, abbc, abbbc, and so on and also d. Parentheses can be used to change these precedence rules. For example ab+(c|d) will match abc, abbc, abbbc, ..., abd, abbd, abbbd, and so on. However, by using parentheses you also mark it as a sub-expression or capture group. It is possible to change the precedence rules without creating a new capture group by using (?:...). For example ab+(?:c|d) matches the same as the preceding ab+(c|d) but does not create an additional capture group.
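
The effect of parentheses on both precedence and the number of capture groups can be verified with a sketch like the following. The helper names are hypothetical; mark_count() is the basic_regex member that reports how many marked sub-expressions a pattern contains.

```cpp
#include <regex>

// Returns true if the whole text matches the pattern.
bool wholeMatch(const char* pattern, const char* text) {
    return std::regex_match(text, std::regex(pattern));
}

// Returns the number of capture groups in the pattern.
unsigned captureGroupCount(const char* pattern) {
    return std::regex(pattern).mark_count();
}
```

With these helpers, wholeMatch("ab+c|d", "abd") is false but wholeMatch("ab+(c|d)", "abd") is true, and captureGroupCount() returns 1 for ab+(c|d) and 0 for ab+(?:c|d).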

Character Set Matches

Instead of having to write (a|b|c|...|z) which is clumsy and introduces a capture group, a special syntax for specifying sets of characters or ranges of characters is available. In addition, a “not” form of the match is also available. A character set is specified between square brackets, and allows you to write [c1c2c3] which will match any of the characters c1, c2 or c3. For example, [abc] will match any character a, b or c. If the first character is ^, it means “any but”:

  • ab[cde] matches abc, abd, and abe.
  • ab[^cde] matches abf, abp, and so on but not abc, abd, and abe.

If you need to match the ^, [ or ] characters themselves, you need to escape them, for example: [\[\^\]] matches the characters [, ^ or ].

If you want to specify all letters, you could use a character set like [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]. However, this is clumsy, and doing it several times is awkward, especially if you make a typo and accidentally omit one of the letters. There are two solutions to this.

The range specification in square brackets allows you to write [a-zA-Z] which recognizes all the letters in the range a to z and A to Z. If you need to match a hyphen, you need to escape it, for example [a-zA-Z\-]* matches any word including a hyphenated word.

Another capability is to use one of the character classes. These are used to denote specific types of characters and are represented as [:name:] where name is one of the classes in the following table:

CHARACTER CLASS NAME   DESCRIPTION
alnum                  lowercase letters, uppercase letters, and digits
alpha                  lowercase letters and uppercase letters
blank                  space or tab characters
cntrl                  file format escape characters like newlines, form feeds, and so on (\f, \n, \r, and \v)
digit                  digits
graph                  lowercase letters, uppercase letters, digits, and punctuation characters
lower                  lowercase letters
print                  lowercase letters, uppercase letters, digits, punctuation characters, and space characters
punct                  punctuation characters
space                  space characters
upper                  uppercase letters
xdigit                 digits and ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’
d                      same as digit
s                      same as space
w                      same as alnum, plus the underscore character

Character classes are used within character sets, for example [[:alpha:]]* in English means the same as [a-zA-Z]*.

Because certain concepts like matching digits are so common, there are shorthand patterns for them. For example, [:digit:] and [:d:] mean the same thing as [0-9]. Some classes have an even shorter pattern using the escape notation \. For example \d means [:digit:]. Therefore, to recognize a sequence of one or more numbers, you can write any of the following patterns:

  • [0-9]+
  • [[:digit:]]+
  • [[:d:]]+
  • \d+
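
That the four notations for a sequence of digits ([0-9]+, [[:digit:]]+, [[:d:]]+, and \d+) behave the same way can be checked with a small sketch. Both function names are hypothetical, and the check assumes the default ECMAScript grammar with the default locale.

```cpp
#include <regex>
#include <string>

// Returns true if text consists of one or more digits.
bool isAllDigits(const std::string& text) {
    static const std::regex re("\\d+");
    return std::regex_match(text, re);
}

// Returns true if all four digit notations give the same answer for text.
bool allDigitPatternsAgree(const std::string& text) {
    const char* patterns[] = { "[0-9]+", "[[:digit:]]+", "[[:d:]]+", "\\d+" };
    const bool expected = std::regex_match(text, std::regex(patterns[0]));
    for (const char* p : patterns) {
        if (std::regex_match(text, std::regex(p)) != expected)
            return false;
    }
    return true;
}
```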

The following table lists the available escape notations for character classes:

ESCAPE NOTATION   EQUIVALENT TO
\d                [[:d:]]
\D                [^[:d:]]
\s                [[:s:]]
\S                [^[:s:]]
\w                [[:w:]]
\W                [^[:w:]]

Some examples:

  • Test[5-8] will match Test5, Test6, Test7, and Test8.
  • [[:lower:]] will match a, b, and so on but not A, B, and so on.
  • [^[:lower:]] will match any character except lowercase letters like a, b, and so on.
  • [[:lower:]5-7] will match any lower case letter like a, b, and so on and will also match the numbers 5, 6, and 7.
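
One way to experiment with character sets is to filter a string one character at a time. The following sketch uses a hypothetical keepMatching() helper that keeps only the characters accepted by a single-character set pattern.

```cpp
#include <regex>
#include <string>

// Returns the characters of text that match the given character set, which
// is expected to be a pattern matching exactly one character.
std::string keepMatching(const std::string& charSet, const std::string& text) {
    const std::regex re(charSet);
    std::string kept;
    for (char c : text) {
        if (std::regex_match(std::string(1, c), re))
            kept += c;
    }
    return kept;
}
```

For example, keepMatching("[[:lower:]5-7]", "aB5c9D7") returns a5c7, and keepMatching("[^[:lower:]]", "aBc5") returns B5.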

Word Boundaries

A word boundary can mean the following:

  • The beginning of the source string if the first character of the source string is one of the word characters [A-Za-z0-9_].
  • The end of the source string if the last character of the source string is one of the word characters.
  • The first character of a word, which is one of the word characters, while the preceding character is not a word character.
  • The end of a word, which is a non-word character after a word, while the preceding character is a word character.

You can use \b to match a word boundary, and \B to match anything except a word boundary.
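
A typical use of the word boundary escape is searching for a whole word only. The sketch below uses a hypothetical containsWord() helper and assumes that the word passed in consists solely of word characters, so it needs no further escaping.

```cpp
#include <regex>
#include <string>

// Returns true if word occurs in text as a whole word. Assumes that word
// contains only word characters (letters, digits, and underscores).
bool containsWord(const std::string& text, const std::string& word) {
    const std::regex re("\\b" + word + "\\b");
    return std::regex_search(text, re);
}
```

With this helper, containsWord("a cat sat", "cat") is true, but containsWord("concatenate", "cat") is false because cat is surrounded by other word characters there.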

Back References

Back references allow you to reference a captured group inside the regular expression itself: \n refers to the n-th captured group, with n > 0. The 0-th capture group is equal to the complete match. For example the regular expression ^(\d+)-.*-\1$ matches a string that has the following format:

  • The beginning of the string ^
  • followed by one or more digits captured in a capture group (\d+)
  • followed by a dash -
  • followed by zero or more characters .*
  • followed by another dash -
  • followed by exactly the same digits captured by the first capture group \1
  • followed by the end of the string $

This regular expression will match 123-abc-123, 1234-a-1234, and so on but will not match 123-abc-1234, 123-abc-321, and so on.
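
The back reference example can be wrapped in a small function to try it out. This is a sketch with a hypothetical function name; note that both \d and \1 need a doubled backslash in the C++ string literal.

```cpp
#include <regex>
#include <string>

// Returns true if text starts and ends with the same run of digits, with
// dash-separated content in between.
bool hasMatchingNumericEnds(const std::string& text) {
    static const std::regex re("^(\\d+)-.*-\\1$");
    return std::regex_match(text, re);
}
```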

Regular Expressions and Raw String Literals

As seen in the preceding sections, regular expressions often use special characters that should be escaped in normal C++ string literals. For example, if you write \d in a regular expression it will match any digit. However, since \ is a special character in C++, you need to escape it in your regular expression string literal as \\d, otherwise your C++ compiler will try to interpret the \d. It can get more complicated if you want your regular expression to match a single back-slash character \. Because \ is a special character in the regular expression syntax itself, you need to escape it as \\. The \ character is also a special character in C++ string literals, so you need to escape it in your C++ string literal, resulting in \\\\.

You can use the new C++11 raw string literals to make complicated regular expressions easier to read in your C++ source code. Raw string literals are explained earlier in this chapter. For example take the following regular expression:

"( |\\n|\\r|\\\\)"

This regular expression searches for spaces, newlines, carriage returns, and back slashes. As you can see, you need a lot of escape characters. Using raw string literals, this can be replaced with the following more readable regular expression:

R"(( |\n|\r|\\))"

The raw string literal starts with R"( and ends with )". Everything in between is the regular expression. Of course we still need a double back slash at the end because the back slash needs to be escaped in the regular expression itself.
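
A quick way to convince yourself that a raw string literal and its escaped counterpart describe the same pattern is to compare the two strings. The following sketch does that for a pattern matching a space, a newline, a carriage return, or a back slash; all three function names are hypothetical.

```cpp
#include <regex>
#include <string>

// The pattern written as an ordinary string literal: every back slash that
// the regex engine should see is doubled.
std::string escapedPattern() { return "( |\\n|\\r|\\\\)"; }

// The same pattern written as a raw string literal.
std::string rawPattern() { return R"(( |\n|\r|\\))"; }

// Returns true if text contains a space, newline, carriage return, or
// back slash.
bool containsSpecial(const std::string& text) {
    static const std::regex re(rawPattern());
    return std::regex_search(text, re);
}
```

Both functions return the identical character sequence, so escapedPattern() == rawPattern() holds.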

This concludes a brief description of the ECMAScript grammar. The following section starts with actually using regular expressions in your C++11 code.

The regex Library

Everything for the C++11 regular expression library is in the <regex> header file and in the std namespace. The basic templated types defined by the regular expression library are:

  • basic_regex: An object representing a specific regular expression.
  • match_results: A sub-string that matched a regular expression, including all the captured groups. It is a collection of sub_matches.
  • sub_match: An iterator pair representing a specific matched capture group.

The library provides three key algorithms: regex_match(), regex_search() and regex_replace(). These are explained in later sections. All of these algorithms have different versions that allow you to specify the source string as an STL string, a character array, or as a begin and end iterator pair. The iterators can be any of the following:

  • const char*
  • const wchar_t*
  • string::const_iterator
  • wstring::const_iterator

In fact, any iterator that behaves as a bidirectional iterator can be used. Iterators are discussed in detail in Chapter 12.

The library also defines regular expression iterators, which are very important if you want to find all occurrences of a pattern in a source string as you will see in a later section. There are two templated regular expression iterators defined:

  • regex_iterator: iterates over all the occurrences of a pattern in a source string
  • regex_token_iterator: iterates over all the capture groups of all occurrences of a pattern in a source string

To make the library easier to use, the standard defines a number of typedefs for the preceding templates:

typedef basic_regex<char>    regex;
typedef basic_regex<wchar_t> wregex;
 
typedef sub_match<const char*>             csub_match;
typedef sub_match<const wchar_t*>          wcsub_match;
typedef sub_match<string::const_iterator>  ssub_match;
typedef sub_match<wstring::const_iterator> wssub_match;
 
typedef match_results<const char*>             cmatch;
typedef match_results<const wchar_t*>          wcmatch;
typedef match_results<string::const_iterator>  smatch;
typedef match_results<wstring::const_iterator> wsmatch;
 
typedef regex_iterator<const char*>             cregex_iterator;
typedef regex_iterator<const wchar_t*>          wcregex_iterator;
typedef regex_iterator<string::const_iterator>  sregex_iterator;
typedef regex_iterator<wstring::const_iterator> wsregex_iterator;
 
typedef regex_token_iterator<const char*>             cregex_token_iterator;
typedef regex_token_iterator<const wchar_t*>          wcregex_token_iterator;
typedef regex_token_iterator<string::const_iterator>  sregex_token_iterator;
typedef regex_token_iterator<wstring::const_iterator> wsregex_token_iterator;

The following sections explain the regex_match(), regex_search() and regex_replace() algorithms and the regex_iterator and regex_token_iterator.

regex_match()

The regex_match() algorithm can be used to compare a given source string with a regular expression pattern and will return true if the pattern matches the entire source string, false otherwise. It is very easy to use. There are six versions of the regex_match() algorithm accepting different kinds of arguments:

template <class BidirectionalIterator, class Allocator, class charT, class traits>
  bool regex_match(BidirectionalIterator first,
                   BidirectionalIterator last,
                   match_results<BidirectionalIterator, Allocator>& m,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags =
                       regex_constants::match_default);
 
template <class BidirectionalIterator, class charT, class traits>
  bool regex_match(BidirectionalIterator first,
                   BidirectionalIterator last,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags =
                       regex_constants::match_default);
 
template <class charT, class Allocator, class traits>
  bool regex_match(const charT* str,
                   match_results<const charT*, Allocator>& m,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags =
                       regex_constants::match_default);
 
template <class ST, class SA, class Allocator, class charT, class traits>
  bool regex_match(const basic_string<charT, ST, SA>& s,
                   match_results<
                       typename basic_string<charT, ST, SA>::const_iterator,
                       Allocator>& m,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags =
                       regex_constants::match_default);
 
template <class charT, class traits>
  bool regex_match(const charT* str,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags =
                       regex_constants::match_default);
 
template <class ST, class SA, class charT, class traits>
  bool regex_match(const basic_string<charT, ST, SA>& s,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags =
                       regex_constants::match_default);

Some of them accept a start and end iterator into a source string where pattern matching should start and end; others accept a simple string or a character array as source. All of them require a basic_regex as one of their arguments, which represents the regular expression. All variations return true when the entire source string matches the pattern, false otherwise; and all accept a combination of flags to specify options for the matching algorithm. In most cases this can be left as match_default. Consult the Standard Library Reference resource on the website for more details.

Both regex_match() and regex_search() described in a later section can use an optional match_results object. If the algorithms return false, you are only allowed to call match_results::empty() or match_results::size(); anything else is undefined. When the algorithms return true, a match is found and you can inspect the match_results object for what exactly got matched. How to do this is explained with examples in the following sections.

regex_match() Example

The function prototypes in the previous section might look complicated, but actually using regex_match() is not complicated at all.

Suppose you want to write a program that asks the user to enter a date in the following format year/month/day where year is four digits, month is a number between 1 and 12, and day is a number between 1 and 31. You can use a regular expression together with the regex_match() algorithm to validate the user input as follows. The details of the regular expression are explained after the code:

regex r("^\\d{4}/(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[1-2][0-9]|3[0-1])$");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    if (regex_match(str, r))
        cout << "  Valid date." << endl;
    else
        cout << "  Invalid date!" << endl;
}

Code snippet from RegularExpressions\regex_match_dates_1.cpp

The first line creates the regular expression. The expression consists of three parts separated by a forward slash / character, one part for year, one for month, and one for day. The following list explains these parts:

  • ^\d{4}: This will match any combination of four digits, for example 1234, 2010, and so on, at the beginning of the string.
  • (?:0?[1-9]|1[0-2]): This sub part of the regular expression is wrapped inside parentheses to make sure the precedence is correct. We don’t need any capture group so (?:...) is used. The inner expression consists of an alternation of two parts separated by the | character.
    • 0?[1-9]: This will match any number from 1 to 9 with an optional 0 in front of it. For example it will match 1, 2, 9, 03, 04, and so on. It will not match 0, 10, 11, and so on.
    • 1[0-2]: This will match 10, 11, or 12, nothing else.
  • (?:0?[1-9]|[1-2][0-9]|3[0-1])$: This sub part is also wrapped inside a non-capture group and consists of an alternation of three parts followed by the end of the string:
    • 0?[1-9]: This is the same as the first part of the month matcher explained above.
    • [1-2][0-9]: This will match any number between 10 and 29 and nothing else.
    • 3[0-1]: This will match 30 or 31 and nothing else.

The example then enters an infinite loop to ask the user to enter a date. Each date entered is then given to the regex_match() algorithm. When regex_match() returns true the user has entered a date that matches the date regular expression pattern.
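
For experimenting without typing dates at a prompt, the same check can be put into a function. This is a sketch with a hypothetical function name; the pattern is the date pattern described above, written with the doubled backslash that a C++ string literal requires.

```cpp
#include <regex>
#include <string>

// Returns true if text has the form year/month/day with a four-digit year,
// a month from 1 to 12, and a day from 1 to 31.
bool isDateFormat(const std::string& text) {
    static const std::regex re(
        "^\\d{4}/(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[1-2][0-9]|3[0-1])$");
    return std::regex_match(text, re);
}
```

For example, isDateFormat("2011/12/01") and isDateFormat("2011/1/9") are true, while isDateFormat("11/12/01") and isDateFormat("2011/13/01") are false.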

This example can be expanded a bit by asking the regex_match() algorithm to return captured sub-expressions in a results object. The following code extracts the year, month, and day digits into three separate integer variables.

To understand this code, you have to understand what a capture group does. By specifying a match_results object like smatch in the call to regex_match(), the elements of the match_results object are filled in when the regular expression matches the string. To be able to extract these sub-strings, you must create capture groups, so although parentheses are not required for grouping in this example, they are used to define new capture groups.

The first element, [0], in a match_results object contains the string that matched the entire pattern. When using regex_match() and a match is found, this is the entire source sequence. When using regex_search(), discussed in the next section, this is a sub-string in the source sequence that matches the regular expression. Element [1] is the sub-string matched by the first capture group, [2] by the second capture group, and so on.

The regular expression in the revised example has a few small changes. The first part matching the year is wrapped in a capture group, while the month and day parts are now also capture groups instead of non-capture groups. The call to regex_match() includes a smatch parameter, which will contain the matched capture groups. Here is the adapted example:

regex r("^(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])$");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    smatch m;
    if (regex_match(str, m, r)) {
        int year = atoi(m[1].str().c_str());
        int month = atoi(m[2].str().c_str());
        int day = atoi(m[3].str().c_str());
        cout << "  Valid date: Year=" << year
             << ", month=" << month
             << ", day=" << day << endl;
    } else {
        cout << "  Invalid date!" << endl;
    }
}

Code snippet from RegularExpressions\regex_match_dates_2.cpp

In this example there are four elements in the smatch results objects, the full match, and three captured groups:

  • [0]: the string matching the full regular expression, which is the full date in this example
  • [1]: the year
  • [2]: the month
  • [3]: the day

When you execute this example you can get the following output:

Enter a date (year/month/day) (q=quit): 2011/12/01
  Valid date: Year=2011, month=12, day=1
Enter a date (year/month/day) (q=quit): 11/12/01
  Invalid date!

These date matching examples only check if the date consists of a year (four digits), a month (1-12) and a day (1-31). They do not perform any validation for leap years and so on. If you need that, you have to write code to validate the year, month and day values that are extracted by regex_match(). This validation is not a job for regular expressions, so this is not shown.
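
As an illustration of what such additional validation could look like, the following sketch first lets regex_match() extract the three numbers and then checks the day against the length of the month. It is not part of the book's sample code: the function names are hypothetical, and Gregorian leap year rules are assumed for all years.

```cpp
#include <regex>
#include <string>

// Gregorian leap year rule (assumed to apply to every year).
bool isLeapYear(int year) {
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

// Returns true if text has the year/month/day format and the day exists in
// the given month of the given year.
bool isRealDate(const std::string& text) {
    static const std::regex re(
        "^(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])$");
    std::smatch m;
    if (!std::regex_match(text, m, re))
        return false;
    const int year = std::stoi(m[1].str());
    const int month = std::stoi(m[2].str());
    const int day = std::stoi(m[3].str());
    static const int daysInMonth[] = { 31, 28, 31, 30, 31, 30,
                                       31, 31, 30, 31, 30, 31 };
    int limit = daysInMonth[month - 1];  // The regex guarantees 1 to 12.
    if (month == 2 && isLeapYear(year))
        limit = 29;
    return day <= limit;
}
```

The regular expression handles the format, and ordinary code handles the calendar rules: isRealDate("2012/02/29") is true, while isRealDate("2011/02/29") and isRealDate("2011/04/31") are false.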

regex_search()

The regex_match() algorithm discussed in the previous section returns true if the entire source string matches the regular expression, false otherwise. It cannot be used to find a matching sub-string in the source string. The regex_search() algorithm allows you to search for a sub-string that matches a certain pattern in a source string. There are six versions of the regex_search() algorithm. The difference between them is in the type of arguments, similar to the six versions of regex_match(). See the Standard Library Reference resource on the website for more details.

One of the versions of the regex_search() algorithm accepts a begin and end iterator into a string that you want to process. You might be tempted to use this version of regex_search() in a loop to find all occurrences of a pattern in a source string by manipulating these begin and end iterators for each regex_search() call. Never do this! It can cause problems when your regular expression uses anchors (^ or $), word boundaries, and so on. It can also cause an infinite loop due to empty matches. Use the regex_iterator or regex_token_iterator as explained later in this chapter to extract all occurrences of a pattern from a source string.


Never use regex_search() in a loop to find all occurrences of a pattern in a source string. Instead, use a regex_iterator or regex_token_iterator.

regex_search() Example

The regex_search() algorithm can be used to extract matching sub-strings from an input string. The following example extracts code comments from input lines. The regular expression searches for a sub-string that starts with // followed by some optional whitespace \s* followed by one or more characters captured in a capture group (.+). This capture group will capture only the comment sub-string. The smatch object m will contain the search results. To get a string representation of the first capture group, you can write m[1] as in the following code or write m[1].str(). You can check the m[1].first and m[1].second iterators to see where exactly the sub-string was found in the source string.

regex r("//\\s*(.+)");
while (true) {
    cout << "Enter a string (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    smatch m;
    if (regex_search(str, m, r))
        cout << "  Found comment '" << m[1] << "'" << endl;
    else
        cout << "  No comment found!" << endl;
}

Code snippet from RegularExpressions\regex_search_comments.cpp

The output of this program can look as follows:

Enter a string (q=quit): std::string str;   // Our source string
  Found comment 'Our source string'
Enter a string (q=quit): int a; // A comment with // in the middle
  Found comment 'A comment with // in the middle'
Enter a string (q=quit): float f; // A comment with a       (tab) character
  Found comment 'A comment with a       (tab) character'

The match_results object also has prefix() and suffix() methods, which return the sub-strings preceding and following the match, respectively.
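
A short sketch shows what prefix() and suffix() return after a successful search. The SplitResult structure and the function name are hypothetical; the pattern simply looks for the first run of digits.

```cpp
#include <regex>
#include <string>

// Holds the three pieces of a string around the first match.
struct SplitResult {
    bool found;
    std::string before, match, after;
};

// Splits text around its first run of digits.
SplitResult splitAroundFirstNumber(const std::string& text) {
    static const std::regex re("\\d+");
    std::smatch m;
    SplitResult r{false, "", "", ""};
    if (std::regex_search(text, m, re)) {
        r.found = true;
        r.before = m.prefix().str();  // Everything in front of the match.
        r.match = m[0].str();         // The match itself.
        r.after = m.suffix().str();   // Everything after the match.
    }
    return r;
}
```

For the input abc123def45 the result contains abc, 123, and def45.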

regex_iterator

As explained in the previous section, you should never use regex_search() in a loop to extract all occurrences of a pattern from a source string. Instead, you should use a regex_iterator or regex_token_iterator. They work similarly to iterators for STL containers, which are discussed in Chapter 12.

Internally, both a regex_iterator and a regex_token_iterator contain a pointer to the regular expression. Because of this, you should not create them with a temporary regex object.


Never try to create a regex_iterator or regex_token_iterator with a temporary regex object.

regex_iterator Example

The following example asks the user to enter a source string, extracts every word from that string, and prints it between quotes. The regular expression in this case is [\w]+, which searches for one or more word-letters. This example uses std::string as source, so it uses sregex_iterator for the iterators. A standard iterator loop is used, but in this case, the end iterator is done slightly differently from the end iterators of ordinary STL containers. Normally, you specify an end iterator for a particular container, but for regex_iterator, there is only one “end” value. You can get this end iterator by simply declaring a regex_iterator type using the default constructor; it will implicitly be initialized to the end value.

The for loop creates a start iterator called it, which accepts a begin and end iterator into the source string together with the regular expression. The loop body will be called for every match found, which is every word in this example. The sregex_iterator iterates over all the matches. By dereferencing a sregex_iterator, you get a smatch object. Accessing the first element of this smatch object, [0], gives you the matched sub-string:

regex reg("[\\w]+");
while (true) {
    cout << "Enter a string to split (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_iterator end;
    for (sregex_iterator it(str.begin(), str.end(), reg); it != end; ++it) {
        cout << "\"" << (*it)[0] << "\"" << endl;
    }
}

Code snippet from RegularExpressions\regex_iterator.cpp

The output of this program can look as follows:

Enter a string to split (q=quit): This, is    a test.
"This"
"is"
"a"
"test"

As this example demonstrates, even simple regular expressions can do some powerful string manipulation.
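
Because dereferencing the iterator yields a complete smatch object, more than the matched text is available for each word. The following sketch, with a hypothetical function name, also records where each word starts by calling position() on the match.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns every word in text together with its starting offset.
std::vector<std::pair<std::size_t, std::string>>
wordsWithOffsets(const std::string& text) {
    // A named regex object: the iterator keeps a pointer to it, so it must
    // not be a temporary.
    static const std::regex wordRegex("[\\w]+");
    std::vector<std::pair<std::size_t, std::string>> words;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), wordRegex);
         it != end; ++it) {
        words.emplace_back(static_cast<std::size_t>(it->position(0)),
                           (*it)[0].str());
    }
    return words;
}
```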

regex_token_iterator

The previous section described regex_iterator which iterates through every matched pattern. In each iteration of the loop you get a match_results object, which you can use to extract sub-expressions for that match captured by capture groups.

A regex_token_iterator can be used to automatically iterate over all or selected capture groups across all matched patterns. It has four constructors. The first creates an iterator that only iterates over capture groups with given index submatch. The second iterates over all capture groups with an index that appears in the submatches vector. The third iterates over all capture groups with an index that appears in the submatches initializer list and the fourth iterates over all capture groups with an index that appears in the submatches array.

regex_token_iterator(BidirectionalIterator a,
                     BidirectionalIterator b,
                     const regex_type& re,
                     int submatch = 0,
                     regex_constants::match_flag_type m =
                         regex_constants::match_default);
 
regex_token_iterator(BidirectionalIterator a,
                     BidirectionalIterator b,
                     const regex_type& re,
                     const std::vector<int>& submatches,
                     regex_constants::match_flag_type m =
                         regex_constants::match_default);
 
regex_token_iterator(BidirectionalIterator a,
                     BidirectionalIterator b,
                     const regex_type& re,
                     initializer_list<int> submatches,
                     regex_constants::match_flag_type m =
                         regex_constants::match_default);
 
template <std::size_t N>
regex_token_iterator(BidirectionalIterator a,
                     BidirectionalIterator b,
                     const regex_type& re,
                     const int (&submatches)[N],
                     regex_constants::match_flag_type m =
                         regex_constants::match_default);

When you use the first constructor and use the default value of 0 for submatch, you get an iterator that iterates over all capture groups with index 0, which are the sub-strings matching the full regular expression.
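To make the role of the submatch argument of the first constructor concrete, here is a minimal sketch that collects a single capture group across all matches. The helper name extractGroup and the key=value input used below are assumptions chosen only for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text captured by capture group `group`
// for every match of `re` in `text`. Group 0 means the full match.
std::vector<std::string> extractGroup(const std::string& text,
                                      const std::regex& re,
                                      int group)
{
    std::vector<std::string> result;
    const std::sregex_token_iterator end;  // default-constructed == end
    for (std::sregex_token_iterator it(text.begin(), text.end(), re, group);
         it != end; ++it) {
        // Dereferencing a token iterator gives a sub_match, not an smatch.
        result.push_back(it->str());
    }
    return result;
}
```

With the regular expression (\w+)=(\w+) and the input name=Alice; age=30, passing 1 should return name and age, passing 2 should return Alice and 30, and passing 0 should return the full matches name=Alice and age=30.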

regex_token_iterator Examples

The previous regex_iterator example can be rewritten by using a regex_token_iterator as follows. Since the token iterator will automatically iterate over all capture groups with index 0, you use *iter in the loop body instead of (*iter)[0]. The output of this code is exactly the same as the output generated by the regex_iterator example:

regex reg("[\\w]+");
while (true) {
    cout << "Enter a string to split (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(str.begin(), str.end(), reg);
        iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

Code snippet from RegularExpressions\regex_token_iterator_1.cpp

The following example asks the user to enter a date and then uses a regex_token_iterator to iterate over the second and third capture groups (month and day), which are specified by using a vector<int>. The regular expression used for dates is explained in an earlier section in this chapter:

regex reg("^(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])$");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    vector<int> vec = {2, 3};
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(str.begin(), str.end(), reg, vec);
        iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

Code snippet from RegularExpressions\regex_token_iterator_2.cpp

This code prints only the month and day of valid dates. Output generated by this example can look as follows:

Enter a date (year/month/day) (q=quit): 2011/1/13
"1"
"13"
Enter a date (year/month/day) (q=quit): 2011/1/32
Enter a date (year/month/day) (q=quit): 2011/12/5
"12"
"5"
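The date example produces only one match per input line. When the pattern matches several times, the selected capture groups are visited match by match, so their values are interleaved in the result. The following minimal sketch uses the initializer list constructor to show this. The function name hoursAndMinutes and the hh:mm input format are assumptions made for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for an input such as "10:30 12:45", returns the
// hour and minute parts of every time, in order of appearance.
std::vector<std::string> hoursAndMinutes(const std::string& text)
{
    static const std::regex timeRegex("(\\d+):(\\d+)");
    std::vector<std::string> parts;

    const std::sregex_token_iterator end;
    // {1, 2} selects capture groups 1 and 2 via the initializer_list
    // constructor. For each match, group 1 is visited first, then group 2.
    for (std::sregex_token_iterator it(text.begin(), text.end(),
                                       timeRegex, {1, 2});
         it != end; ++it) {
        parts.push_back(it->str());
    }
    return parts;
}
```

For the input 10:30 12:45 the result should be 10, 30, 12, 45, in that order.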

The regex_token_iterator can also be used to perform so-called field splitting or tokenization. It is a much safer and more flexible alternative to the old strtok() function. Tokenization is triggered in the regex_token_iterator constructor by specifying -1 as the capture group index to iterate over. When in tokenization mode, the iterator iterates over all sub-strings of the source string that do not match the regular expression. The following code demonstrates this by tokenizing a string on the delimiters , and ; with any number of whitespace characters before or after the delimiters:

regex reg("\\s*[,;]+\\s*");
while (true) {
    cout << "Enter a string to split on ',' and ';' (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
        iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

Code snippet from RegularExpressions\regex_token_iterator_field_splitting.cpp

The regular expression in this example searches for patterns that match the following:

  • Zero or more whitespace characters,
  • followed by 1 or more , or ; characters,
  • followed by zero or more whitespace characters.

The output can be as follows:

Enter a string to split on ',' and ';' (q=quit): This is,   a; test string.
"This is"
"a"
"test string."

As you can see from this output, the string is split on , and ; and all whitespace characters around the , or ; are removed, because the tokenization iterator iterates over all sub-strings that do not match the regular expression, and because the regular expression matches , and ; with whitespace around them.
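The tokenization loop can be packaged as a reusable function. The following is a minimal sketch under the assumption that the caller wants the fields returned in a vector; the name splitFields is hypothetical and not part of the chapter's sample code.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on ',' and ';', discarding any
// whitespace directly around the delimiters.
std::vector<std::string> splitFields(const std::string& text)
{
    static const std::regex delimiter("\\s*[,;]+\\s*");
    std::vector<std::string> fields;

    const std::sregex_token_iterator end;
    // -1 switches the iterator to tokenization mode: it visits the parts
    // of the string that lie between matches of the delimiter pattern.
    for (std::sregex_token_iterator it(text.begin(), text.end(),
                                       delimiter, -1);
         it != end; ++it) {
        fields.push_back(it->str());
    }
    return fields;
}
```

One detail to be aware of when using this sketch: if the input starts with a delimiter, the first field is an empty string, because the text before the first match is empty but is still reported as a token.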

regex_replace()

The regex_replace() algorithm requires a regular expression, and a formatting string that will be used to replace matching sub-strings. This formatting string can reference part of the matched sub-strings by using the following escape sequences:

ESCAPE SEQUENCE REPLACED WITH
$n the string matching the n-th capture group, for example $1 for the first capture group, $2 for the second, and so on
$& the string matching the whole regular expression, which is the same as $0
$` the part of the source string that appears to the left of the sub-string matching the regular expression
$' the part of the source string that appears to the right of the sub-string matching the regular expression
$$ a dollar sign
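The effect of each escape sequence is easiest to see on a small input. The following minimal sketch applies a caller-supplied formatting string to a fixed pattern that matches two consecutive digits. The function name formatDigitPair and the sample input are assumptions made for this illustration. Note that the sequence for the part to the left of the match uses a backtick, while the sequence for the part to the right uses a single quote.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every pair of consecutive digits in
// `source` according to the formatting string `fmt`. Each digit of the
// pair is captured in its own group, so $1 and $2 are available.
std::string formatDigitPair(const std::string& source,
                            const std::string& fmt)
{
    static const std::regex digitPair("(\\d)(\\d)");
    return std::regex_replace(source, digitPair, fmt);
}
```

For the source string ab12cd, the formatting string $2$1 should give ab21cd, [$&] should give ab[12]cd, $` should give ababcd, $' should give abcdcd, and $$ should give ab$cd. In the last three cases the text outside the match (ab and cd) is copied as usual, and only the matched 12 is replaced.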

There are six versions of the regex_replace() algorithm. The difference between them is in the type of arguments:

template <class OutputIterator, class BidirectionalIterator,
          class traits, class charT, class ST, class SA>
  OutputIterator
    regex_replace(OutputIterator out,
                  BidirectionalIterator first,
                  BidirectionalIterator last,
                  const basic_regex<charT, traits>& e,
                  const basic_string<charT, ST, SA>& fmt,
                  regex_constants::match_flag_type flags =
                      regex_constants::match_default);
 
template <class OutputIterator, class BidirectionalIterator,
          class traits, class charT>
  OutputIterator
    regex_replace(OutputIterator out,
                  BidirectionalIterator first,
                  BidirectionalIterator last,
                  const basic_regex<charT, traits>& e,
                  const charT* fmt,
                  regex_constants::match_flag_type flags =
                      regex_constants::match_default);
 
template <class traits, class charT, class ST, class SA, class FST, class FSA>
  basic_string<charT, ST, SA>
    regex_replace(const basic_string<charT, ST, SA>& s,
                  const basic_regex<charT, traits>& e,
                  const basic_string<charT, FST, FSA>& fmt,
                  regex_constants::match_flag_type flags =
                      regex_constants::match_default);
 
template <class traits, class charT, class ST, class SA>
  basic_string<charT, ST, SA>
    regex_replace(const basic_string<charT, ST, SA>& s,
                  const basic_regex<charT, traits>& e,
                  const charT* fmt,
                  regex_constants::match_flag_type flags =
                      regex_constants::match_default);
 
template <class traits, class charT, class ST, class SA>
  basic_string<charT>
    regex_replace(const charT* s,
                  const basic_regex<charT, traits>& e,
                  const basic_string<charT, ST, SA>& fmt,
                  regex_constants::match_flag_type flags =
                      regex_constants::match_default);
 
template <class traits, class charT>
  basic_string<charT>
    regex_replace(const charT* s,
                  const basic_regex<charT, traits>& e,
                  const charT* fmt,
                  regex_constants::match_flag_type flags =
                      regex_constants::match_default);
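The examples in this chapter use the versions that return a new string. As a complement, here is a minimal sketch of the output iterator version, which writes the result through an iterator instead of returning it. The function name maskDigits and the use of # as the replacement are assumptions made for this illustration.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: returns a copy of `text` in which every digit is
// replaced by '#'.
std::string maskDigits(const std::string& text)
{
    static const std::regex digit("\\d");
    std::string result;
    // Output iterator overload: the result is appended to `result`
    // through a back_inserter; the formatting string is a const char*.
    std::regex_replace(std::back_inserter(result),
                       text.begin(), text.end(), digit, "#");
    return result;
}
```

For the input PIN 1234 ok this should produce PIN #### ok. The output iterator version can be convenient when the result has to be appended to an existing string or written directly to a stream.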

regex_replace() Examples

As a first example, take the source HTML string <body><h1>Header</h1><p>Some text</p></body> and the regular expression <h1>(.*)</h1><p>(.*)</p>. The following table shows the different escape sequences and what they will be replaced with:

ESCAPE SEQUENCE REPLACED WITH
$1 Header
$2 Some text
$& <h1>Header</h1><p>Some text</p>
$` <body>
$' </body>

The following code demonstrates the use of regex_replace():

const string str("<body><h1>Header</h1><p>Some text</p></body>");
regex r("<h1>(.*)</h1><p>(.*)</p>");
const string format("H1=$1 and P=$2");
string result = regex_replace(str, r, format);
cout << "Original string: '" << str << "'" << endl;
cout << "New string     : '" << result << "'" << endl;

Code snippet from RegularExpressions\regex_replace_1.cpp

The output of this program is as follows:

Original string: '<body><h1>Header</h1><p>Some text</p></body>'
New string     : '<body>H1=Header and P=Some text</body>'

The regex_replace() algorithm accepts a number of flags that control how it works. The most important flags are given in the following table:

FLAG DESCRIPTION
format_default The default is to replace all occurrences of the pattern, and to also copy everything that does not match the pattern to the result string.
format_no_copy Replace all occurrences of the pattern, but do not copy anything that does not match the pattern to the result string.
format_first_only Replace only the first occurrence of the pattern.

The following example modifies the previous code to use the format_no_copy flag:

const string str("<body><h1>Header</h1><p>Some text</p></body>");
regex r("<h1>(.*)</h1><p>(.*)</p>");
const string format("H1=$1 and P=$2");
string result = regex_replace(str, r, format,
    regex_constants::format_no_copy);
cout << "Original string: '" << str << "'" << endl;
cout << "New string     : '" << result << "'" << endl;

Code snippet from RegularExpressions\regex_replace_2.cpp

The output is as follows. Compare this with the output of the previous version.

Original string: '<body><h1>Header</h1><p>Some text</p></body>'
New string     : 'H1=Header and P=Some text'
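The format_first_only flag is not demonstrated above, so here is a minimal sketch of it. The function name replaceFirstWord is an assumption made for this illustration. The sketch also shows that flags can be combined with the | operator.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces only the first word in `text`.
// If `keepRest` is false, format_no_copy is added so that everything
// outside the first match is dropped from the result.
std::string replaceFirstWord(const std::string& text,
                             const std::string& replacement,
                             bool keepRest = true)
{
    static const std::regex word("[\\w]+");
    auto flags = std::regex_constants::format_first_only;
    if (!keepRest) {
        flags = flags | std::regex_constants::format_no_copy;
    }
    // `replacement` is a formatting string, so sequences such as $1 or $&
    // inside it are interpreted rather than copied literally.
    return std::regex_replace(text, word, replacement, flags);
}
```

For the input one two three and the replacement X, the result should be X two three. With keepRest set to false, the result should be just X.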

Another example is to get an input string and replace each word boundary with a newline so that the target string contains only one word per line. The following example demonstrates this without using any loops to process a given string. The code first creates a regular expression that matches individual words. When a match is found, it is replaced by $1 followed by a newline, where $1 is replaced with the matched word. Note also the use of the format_no_copy flag to prevent copying whitespace and other non-word characters from the source string to the result string:

regex reg("([\\w]+)");
const string format("$1\n");
while (true) {
    cout << "Enter a string to split over multiple lines (q=quit): ";
    string str;
 
    if (!getline(cin, str) || str == "q")
        break;
    cout << regex_replace(str, reg, format,
        regex_constants::format_no_copy) << endl;
}

Code snippet from RegularExpressions\regex_replace_3.cpp

The output of this program can be as follows:

Enter a string to split over multiple lines (q=quit):   This is   a test.
This
is
a
test

SUMMARY

This chapter started with a discussion on the C++ string class and why you should use it instead of the old plain C-style character arrays. It also explained a number of new helper functions added to C++11 to make it easier to convert numerical values into strings and vice versa, and introduced the concept of raw string literals.

The second part of this chapter gave you an appreciation for coding with localization in mind. As anyone who has been through a localization effort will tell you, adding support for a new language or locale is much easier if you have planned ahead, for example by using Unicode characters and being mindful of locales.

The last part of this chapter explained the new C++11 regular expressions library. Once you know the syntax of regular expressions, it becomes much easier to work with strings. Regular expressions allow you to easily validate strings, search for sub-strings inside a source string, perform find-and-replace operations on strings, and so on. It is highly recommended to get to know them and to start using them instead of writing your own string manipulation routines. They will make your life easier.
