If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart.
—Nelson Mandela
The STL provides a special string container for human-language data, such as words, sentences, and markup languages. Available in the <string> header, the std::basic_string is a class template that you can specialize on a string’s underlying character type. As a sequential container, basic_string is essentially similar to a vector but with some special facilities for manipulating language.
STL basic_string provides major safety and feature improvements over C-style or null-terminated strings, and because human-language data inundates most modern programs, you’ll probably find basic_string indispensable.
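The STL provides four basic_string specializations in the <string> header. Each specialization implements a string using one of the fundamental character types that you learned about in Chapter 2:
std::string for char
std::wstring for wchar_t
std::u16string for char16_t
std::u32string for char32_t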
You’ll use the specialization with the appropriate underlying type. Because these specializations have the same interface, all the examples in this chapter will use std::string.
The basic_string container takes three template parameters:
The underlying character type, T
The underlying type’s traits, Traits
An allocator, Alloc
Of these, only T is required. The STL’s std::char_traits template class in the <string> header abstracts character and string operations from the underlying character type. Also, unless you plan on supporting a custom character type, you won’t need to implement your own type traits, because char_traits has specializations available for char, wchar_t, char16_t, and char32_t. When the stdlib provides specializations for a type, you won’t need to provide it yourself unless you require some kind of exotic behavior.
Together, a basic_string specialization looks like this, where T is a character type:
std::basic_string<T, Traits=std::char_traits<T>, Alloc=std::allocator<T>>
NOTE
In most cases, you’ll be dealing with one of the predefined specializations, especially string or wstring. However, if you need a custom allocator, you’ll need to specialize basic_string appropriately.
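As a small sketch of how the predefined specializations relate to the full template, the following compile-time check spells out the default Traits and Alloc arguments explicitly. The helper name aliases_match is made up for illustration, and the code assumes a C++17 compiler (for std::is_same_v).

```cpp
#include <memory>       // std::allocator
#include <string>
#include <type_traits>  // std::is_same_v (C++17)

// Hypothetical helper: true when the aliases name the expected specializations.
constexpr bool aliases_match() {
  using explicit_string =
      std::basic_string<char, std::char_traits<char>, std::allocator<char>>;
  using explicit_wstring =
      std::basic_string<wchar_t, std::char_traits<wchar_t>,
                        std::allocator<wchar_t>>;
  return std::is_same_v<std::string, explicit_string> &&
         std::is_same_v<std::wstring, explicit_wstring>;
}

// Fails to compile if either alias used different Traits or Alloc arguments.
static_assert(aliases_match(),
              "string and wstring use the default Traits and Alloc");
```

Supplying a custom allocator would mean writing out a specialization in the same way, with your allocator type in the third position.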
The basic_string<T> container supports the same constructors as vector<T>, plus additional convenience constructors for converting a C-style string. In other words, a string supports the constructors of vector<char>, a wstring supports the constructors of vector<wchar_t>, and so on. As with vector, use parentheses for all basic_string constructors except when you actually want an initializer list.
You can default construct an empty string, or if you want to fill a string with a repeating character, you can use the fill constructor by passing a size_t and a char, as Listing 15-1 illustrates.
#include <string>

TEST_CASE("std::string supports constructing") {
  SECTION("empty strings") {
    std::string cheese; ➊
    REQUIRE(cheese.empty()); ➋
  }
  SECTION("repeated characters") {
    std::string roadside_assistance(3, 'A'); ➌
    REQUIRE(roadside_assistance == "AAA"); ➍
  }
}
Listing 15-1: The default and fill constructors of string
After you default construct a string ➊, it contains no elements ➋. If you want to fill the string with repeating characters, you can use the fill constructor by passing in the number of elements you want to fill and their value ➌. The example fills a string with three A characters ➍.
NOTE
You’ll learn about std::string comparisons with operator== later in the chapter. Because you generally handle C-style strings with raw pointers or raw arrays, operator== returns true only when given the same object. However, for std::string, operator== returns true if the contents are equivalent. As you can see in Listing 15-1, the comparison works even when one of the operands is a C-style string literal.
The string constructor also offers two const char*-based constructors. If the argument points to a null-terminated string, the string constructor can determine the input’s length on its own. If the pointer does not point to a null-terminated string or if you only want to use the first part of a string, you can pass a length argument that informs the string constructor of how many elements to copy, as Listing 15-2 illustrates.
TEST_CASE("std::string supports constructing substrings") {
  auto word = "gobbledygook"; ➊
  REQUIRE(std::string(word) == "gobbledygook"); ➋
  REQUIRE(std::string(word, 6) == "gobble"); ➌
}
Listing 15-2: Constructing a string from C-style strings
You create a const char* called word pointing to the C-style string literal gobbledygook ➊. Next, you construct a string by passing word. As expected, the resulting string contains gobbledygook ➋. In the next test, you pass the number 6 as a second argument. This causes string to only take the first six characters of word, resulting in the string containing gobble ➌.
Additionally, you can construct strings from other strings. As an STL container, string fully supports copy and move semantics. You can also construct a string from a substring—a contiguous subset of another string. Listing 15-3 illustrates these three constructors.
TEST_CASE("std::string supports") {
  std::string word("catawampus"); ➊
  SECTION("copy constructing") {
    REQUIRE(std::string(word) == "catawampus"); ➋
  }
  SECTION("move constructing") {
    REQUIRE(std::string(move(word)) == "catawampus"); ➌
  }
  SECTION("constructing from substrings") {
    REQUIRE(std::string(word, 0, 3) == "cat"); ➍
    REQUIRE(std::string(word, 4) == "wampus"); ➎
  }
}
Listing 15-3: Copy, move, and substring construction of string objects
NOTE
In Listing 15-3, word is in a moved-from state, which, you’ll recall from “Move Semantics” on page 122, means it can only be reassigned or destructed.
Here, you construct a string called word containing the characters catawampus ➊. Copy construction yields another string containing a copy of the characters of word ➋. Move construction steals the characters of word, resulting in a new string containing catawampus ➌. Finally, you can construct a new string based on substrings. By passing word, a starting position of 0, and a length of 3, you construct a new string containing the characters cat ➍. If you instead pass word and a starting position of 4 (without a length), you get all the characters from index 4 (the fifth character) to the end of the original string, resulting in wampus ➎.
The string class also supports literal construction with std::string_literals::operator""s. The major benefit is notational convenience, but you can also use operator""s to embed null characters within a string easily, as Listing 15-4 illustrates.
TEST_CASE("constructing a string with") {
  SECTION("std::string(char*) stops at embedded nulls") {
    std::string str("idioglossia\0ellohay!"); ➊
    REQUIRE(str.length() == 11); ➋
  }
  SECTION("operator\"\"s incorporates embedded nulls") {
    using namespace std::string_literals; ➌
    auto str_lit = "idioglossia\0ellohay!"s; ➍
    REQUIRE(str_lit.length() == 20); ➎
  }
}
Listing 15-4: Constructing a string
In the first test, you construct a string using the literal idioglossia\0ellohay! ➊, which results in a string containing idioglossia ➋. The remainder of the literal didn’t get copied into the string because of the embedded null. In the second test, you bring in the std::string_literals namespace ➌ so you can use operator""s to construct a string from a literal directly ➍. Unlike the std::string constructor ➊, operator""s yields a string containing the entire literal, embedded null byte and all ➎.
Table 15-1 summarizes the options for constructing a string. In this table, c is a char, n and pos are size_t, str is a string or a C-style string, c_str is a C-style string, and beg and end are input iterators.
Table 15-1: Supported std::string Constructors
| Constructor | Produces a string containing |
| --- | --- |
| string() | No characters. |
| string(n, c) | c repeated n times. |
| string(str, pos, [n]) | The half-open range pos to pos+n of str. Substring extends from pos to str’s end if n is omitted. |
| string(c_str, [n]) | A copy of c_str, which has length n. If c_str is null terminated, n defaults to the null-terminated string’s length. |
| string(beg, end) | A copy of the elements in the half-open range from beg to end. |
| string(str) | A copy of str. |
| string(move(str)) | The contents of str, which is in a moved-from state after construction. |
| string{ c1, c2, c3 } | The characters c1, c2, and c3. |
| "my string literal"s | A string containing the characters my string literal. |
Exactly like vector, string uses dynamic storage to store its constituent elements contiguously. Accordingly, vector and string have very similar copy/move-construction/assignment semantics. For example, copy operations are potentially more expensive than move operations because the contained elements reside in dynamic memory.
The most popular STL implementations have small string optimizations (SSO). The SSO places the contents of a string within the object’s storage (rather than dynamic storage) if the contents are small enough. As a general rule, a string with fewer than 24 bytes is an SSO candidate, although the exact threshold depends on the implementation. Implementers make this optimization because in many modern programs, most strings are short. (A vector doesn’t have any small optimizations.)
NOTE
Practically, SSO affects moves in two ways. First, any references to the elements of a string will invalidate if the string moves. Second, moves are potentially slower for strings than vectors because strings need to check for SSO.
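One way to observe SSO is to ask whether a string’s character buffer lies inside the string object itself. The following is a heuristic sketch, not something the standard guarantees: the function names stored_in_object and sso_example are hypothetical, and the result for short strings varies between implementations.

```cpp
#include <cassert>
#include <functional>  // std::less
#include <string>

// Heuristic (not guaranteed by the standard): reports whether the character
// buffer of s lies within the bytes of the string object itself.
bool stored_in_object(const std::string& s) {
  const char* object_begin = reinterpret_cast<const char*>(&s);
  const char* object_end = object_begin + sizeof(std::string);
  std::less<const char*> before;  // total order, even across unrelated objects
  return !before(s.data(), object_begin) && before(s.data(), object_end);
}

void sso_example() {
  std::string long_text(100, 'x');
  // On mainstream implementations the string object is far smaller than
  // 100 bytes, so this buffer must live in dynamic storage.
  assert(!stored_in_object(long_text));
  // For a short string such as "hi", the answer depends on the implementation.
}
```

Calling stored_in_object on a short string with your own toolchain shows whether it applies the optimization at that length.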
A string has a size (or length) and a capacity. The size is the number of characters contained in the string, and the capacity is the number of characters that the string can hold before needing to resize.
Table 15-2 contains methods for reading and manipulating the size and capacity of a string. In this table, n is a size_t. An asterisk (*) indicates that this operation invalidates raw pointers and iterators to the elements of s in at least some circumstances.
Table 15-2: Supported std::string Storage and Length Methods
| Method | Returns |
| --- | --- |
| s.empty() | true if s contains no characters; otherwise false. |
| s.size() | The number of characters in s. |
| s.length() | Identical to s.size() |
| s.max_size() | The maximum possible size of s (due to system/runtime limitations). |
| s.capacity() | The number of characters s can hold before needing to resize. |
| s.shrink_to_fit() | void; issues a non-binding request to reduce s.capacity() to s.size().* |
| s.reserve([n]) | void; if n > s.capacity(), resizes so s can hold at least n elements; otherwise, issues a non-binding request* to reduce s.capacity() to n or s.size(), whichever is greater. |
NOTE
At press time, the draft C++20 standard changes the behavior of the reserve method when its argument is less than the size of the string. This will match the behavior of vector, where there is no effect rather than being equivalent to invoking shrink_to_fit.
Note that the size and capacity methods of string match those of vector very closely. This is a direct result of the closeness of their storage models.
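To make the distinction between size and capacity concrete, here is a minimal sketch that reserves storage before a series of appends. The helper names repeat and capacity_example are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds `unit` repeated `times` times, reserving the
// final length up front so the appends need no further reallocation.
std::string repeat(const std::string& unit, std::size_t times) {
  std::string result;
  result.reserve(unit.size() * times);
  for (std::size_t i = 0; i < times; ++i) result += unit;
  return result;
}

void capacity_example() {
  const auto s = repeat("ab", 3);
  assert(s == "ababab");
  assert(s.size() == 6);
  assert(s.length() == s.size());    // length and size are synonyms
  assert(s.capacity() >= s.size());  // capacity never drops below size
}
```

The exact capacity after reserve is left unasserted because implementations may round the request up.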
Because string offers random-access iterators to contiguous elements, it accordingly exposes similar element- and iterator-access methods to vector.
For interoperation with C-style APIs, string also exposes a c_str method, which returns a non-modifiable, null-terminated version of the string as a const char*, as Listing 15-5 illustrates.
TEST_CASE("string's c_str method makes null-terminated strings") {
  std::string word("horripilation"); ➊
  auto as_cstr = word.c_str(); ➋
  REQUIRE(as_cstr[0] == 'h'); ➌
  REQUIRE(as_cstr[1] == 'o');
  REQUIRE(as_cstr[11] == 'o');
  REQUIRE(as_cstr[12] == 'n');
  REQUIRE(as_cstr[13] == '\0'); ➍
}
Listing 15-5: Extracting a null-terminated string from a string
You construct a string called word containing the characters horripilation ➊ and use its c_str method to extract a null-terminated string called as_cstr ➋. Because as_cstr is a const char*, you can use operator[] to illustrate that it contains the same characters as word ➌ and that it is null terminated ➍.
NOTE
The std::string class also supports operator[], which has the same behavior as with a C-style string.
Generally, c_str and data produce identical results, except that the pointer returned by data can be non-const. Whenever you manipulate a string, implementations usually ensure that the contiguous memory backing the string ends with a null terminator. The program in Listing 15-6 illustrates this behavior by printing the results of calling data and c_str alongside their addresses.
#include <string>
#include <cstdio>

int main() {
  std::string word("pulchritudinous");
  printf("c_str: %s at 0x%p\n", word.c_str(), word.c_str()); ➊
  printf("data: %s at 0x%p\n", word.data(), word.data()); ➋
}
--------------------------------------------------------------------------
c_str: pulchritudinous at 0x0000002FAE6FF8D0 ➊
data: pulchritudinous at 0x0000002FAE6FF8D0 ➋
Listing 15-6: Illustrating that c_str and data return equivalent addresses
Both c_str and data produce identical results because they point to the same addresses ➊ ➋. Because the address is the beginning of a null-terminated string, printf yields identical output for both invocations.
Table 15-3 lists the access methods of string. Note that n is a size_t in the table.
Table 15-3: Supported std::string Element and Iterator Access Methods
| Method | Returns |
| --- | --- |
| s.begin() | An iterator pointing to the first element. |
| s.cbegin() | A const iterator pointing to the first element. |
| s.end() | An iterator pointing to one past the last element. |
| s.cend() | A const iterator pointing to one past the last element. |
| s.at(n) | A reference to element n of s. Throws std::out_of_range if out of bounds. |
| s[n] | A reference to element n of s. Undefined behavior if n > s.size(). Also s[s.size()] must be 0, so writing a non-zero value into this character is undefined behavior. |
| s.front() | A reference to first element. |
| s.back() | A reference to last element. |
| s.data() | A raw pointer to the first element if string is non-empty. For an empty string, returns a pointer to a null character. |
| s.c_str() | Returns a non-modifiable, null-terminated version of the contents of s. |
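The difference between at and operator[] in the table above can be sketched as follows. The helper char_at_or and the function access_example are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns s.at(n) when n is in bounds, else fallback.
char char_at_or(const std::string& s, std::size_t n, char fallback) {
  try {
    return s.at(n);  // bounds-checked access
  } catch (const std::out_of_range&) {
    return fallback;
  }
}

void access_example() {
  const std::string word("horripilation");
  assert(word.front() == 'h');
  assert(word.back() == 'n');
  assert(char_at_or(word, 1, '?') == 'o');
  assert(char_at_or(word, 13, '?') == '?');  // at(size()) throws
  assert(word[word.size()] == '\0');  // operator[] at size() is the terminator
}
```

Prefer at when an index comes from untrusted input, and operator[] when you have already validated the index.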
Note that string supports comparisons with other strings and with raw C-style strings using the usual comparison operators. For example, the equality operator== returns true if the size and contents of the left and right sides are equal, whereas the inequality operator!= returns the opposite. The remaining comparison operators perform lexicographical comparison, meaning they sort alphabetically where A < Z < a < z and where, if all else is equal, shorter words are less than longer words (for example, pal < palindrome). Listing 15-7 illustrates comparisons.
NOTE
Technically, lexicographical comparison depends on the encoding of the string. It’s theoretically possible that a system could use a default encoding where the alphabet is in some completely jumbled order (such as the nearly obsolete EBCDIC encoding, which put lowercase letters before uppercase letters), which would affect string comparison. For ASCII-compatible encodings, you don’t need to worry since they imply the expected lexicographical behavior.
TEST_CASE("std::string supports comparison with") {
  using namespace std::literals::string_literals; ➊
  std::string word("allusion"); ➋
  SECTION("operator== and !=") {
    REQUIRE(word == "allusion"); ➌
    REQUIRE(word == "allusion"s); ➍
    REQUIRE(word != "Allusion"s); ➎
    REQUIRE(word != "illusion"s); ➏
    REQUIRE_FALSE(word == "illusion"s); ➐
  }
  SECTION("operator<") {
    REQUIRE(word < "illusion"); ➑
    REQUIRE(word < "illusion"s); ➒
    REQUIRE(word > "Illusion"s); ➓
  }
}
Listing 15-7: The string class supports comparison
Here, you bring in the std::literals::string_literals namespace so you can easily construct a string with operator""s ➊. You also construct a string called word containing the characters allusion ➋. In the first set of tests, you examine operator== and operator!=.
You can see that word equals (==) allusion as both a C-style string ➌ and a string ➍, but it doesn’t equal (!=) strings containing Allusion ➎ or illusion ➏. As usual, operator== and operator!= always return opposite results ➐.
The next set of tests uses operator< to show that allusion is less than illusion ➑, because a is lexicographically less than i. Comparisons work with C-style strings and strings ➒. Listing 15-7 also shows that Illusion is less than allusion ➓ because uppercase I is lexicographically less than lowercase a.
Table 15-4 lists the comparison methods of string. Note that other is a string or char* C-style string in the table.
Table 15-4: Supported std::string Comparison Operators
| Method | Returns |
| --- | --- |
| s == other | true if s and other have identical characters and lengths; otherwise false |
| s != other | The opposite of operator== |
| s.compare(other) | Returns 0 if s == other, a negative number if s < other, and a positive number if s > other |
| s < other, s > other, s <= other, s >= other | The result of the corresponding comparison operation, according to lexicographical sort |
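The compare method in the table above returns only the sign of the comparison, so a small sketch can normalize it for inspection. The names compare_sign and compare_example are hypothetical, and the last check assumes an ASCII-compatible encoding.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: collapses the result of compare to -1, 0, or 1.
int compare_sign(const std::string& lhs, const std::string& rhs) {
  const int result = lhs.compare(rhs);
  return (result > 0) - (result < 0);
}

void compare_example() {
  assert(compare_sign("allusion", "allusion") == 0);
  assert(compare_sign("allusion", "illusion") == -1);  // 'a' < 'i'
  assert(compare_sign("palindrome", "pal") == 1);      // longer word sorts later
  assert(compare_sign("allusion", "Illusion") == 1);   // 'I' < 'a' in ASCII
}
```

Only the sign of compare is specified, so avoid testing its result against particular values such as 1 or -1 directly.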
For manipulating elements, string has a lot of methods. It supports all the methods of vector<char> plus many others useful to manipulating human-language data.
To add elements to a string, you can use push_back, which inserts a single character at the end. When you want to append more than one character, you can use operator+=, which accepts a character, a null-terminated char* string, or a string. You can also use the append method, which has three overloads. First, you can pass a string or a null-terminated char* string, an optional offset into that string, and an optional number of characters to append. Second, you can pass a length and a char, which will append that number of chars to the string. Third, you can append a half-open range. Listing 15-8 illustrates all of these operations.
TEST_CASE("std::string supports appending with") {
  std::string word("butt"); ➊
  SECTION("push_back") {
    word.push_back('e'); ➋
    REQUIRE(word == "butte");
  }
  SECTION("operator+=") {
    word += "erfinger"; ➌
    REQUIRE(word == "butterfinger");
  }
  SECTION("append char") {
    word.append(1, 's'); ➍
    REQUIRE(word == "butts");
  }
  SECTION("append char*") {
    word.append("stockings", 5); ➎
    REQUIRE(word == "buttstock");
  }
  SECTION("append (half-open range)") {
    std::string other("onomatopoeia"); ➏
    word.append(other.begin(), other.begin()+2); ➐
    REQUIRE(word == "button");
  }
}
Listing 15-8: Appending to a string
To begin, you initialize a string called word containing the characters butt ➊. In the first test, you invoke push_back with the letter e ➋, which yields butte. Next, you add erfinger to word using operator+= ➌, yielding butterfinger. In the first invocation of append, you append a single s ➍ to yield butts. (This setup works just like push_back.) A second overload of append allows you to provide a char* and a length. By providing stockings and length 5, you add stock to word to yield buttstock ➎. Because append works with half-open ranges, you can also construct a string called other containing the characters onomatopoeia ➏ and append the first two characters via a half-open range to yield button ➐.
NOTE
Recall from “Test Cases and Sections” on page 308 that each SECTION of a Catch unit test runs independently, so modifications to word are independent of each other: the setup code resets word for each test.
To remove elements from a string, you have several options. The simplest method is to use pop_back, which follows vector in removing the last character from a string. If you want to instead remove all the characters (to yield an empty string), use the clear method. When you need more precision in removing elements, use the erase method, which provides several overloads. You can provide an index and a length, which removes the corresponding characters. You can also provide an iterator to remove a single element or a half-open range to remove many. Listing 15-9 illustrates removing elements from a string.
TEST_CASE("std::string supports removal with") {
  std::string word("therein"); ➊
  SECTION("pop_back") {
    word.pop_back();
    word.pop_back(); ➋
    REQUIRE(word == "there");
  }
  SECTION("clear") {
    word.clear(); ➌
    REQUIRE(word.empty());
  }
  SECTION("erase using half-open range") {
    word.erase(word.begin(), word.begin()+3); ➍
    REQUIRE(word == "rein");
  }
  SECTION("erase using an index and length") {
    word.erase(5, 2);
    REQUIRE(word == "there"); ➎
  }
}
Listing 15-9: Removing elements from a string
You construct a string called word containing the characters therein ➊. In the first test, you call pop_back twice to first remove the letter n followed by the letter i so word contains the characters there ➋. Next, you invoke clear, which removes all the characters from word so it’s empty ➌. The last two tests use erase to remove some subset of the characters in word. In the first usage, you remove the first three characters with a half-open range so word contains rein ➍. In the second, you remove the characters starting at index 5 (i in therein) and extending two characters ➎. Like the first test, this yields the characters there.
To insert and remove elements simultaneously, string exposes the replace method, which has many overloads.
First, you can provide a half-open range and a null-terminated char* or a string, and replace will perform a simultaneous erase of all the elements within the half-open range and an insert of the provided string where the range used to be. Second, you can provide two half-open ranges, and replace will insert the second range instead of a string.
Instead of replacing a range, you can use either an index or a single iterator and a length. You can supply a new half-open range, a character and a size, or a string, and replace will substitute new elements over the implied range. Listing 15-10 demonstrates some of these possibilities.
TEST_CASE("std::string replace works with") {
  std::string word("substitution"); ➊
  SECTION("a range and a char*") {
    word.replace(word.begin()+9, word.end(), "e"); ➋
    REQUIRE(word == "substitute");
  }
  SECTION("two ranges") {
    std::string other("innuendo");
    word.replace(word.begin(), word.begin()+3,
                 other.begin(), other.begin()+2); ➌
    REQUIRE(word == "institution");
  }
  SECTION("an index/length and a string") {
    std::string other("vers");
    word.replace(3, 6, other); ➍
    REQUIRE(word == "subversion");
  }
}
Listing 15-10: Replacing elements of a string
Here, you construct a string called word containing substitution ➊. In the first test, you replace all the characters from index 9 to the end with the letter e, resulting in the word substitute ➋. Next, you replace the first three letters of word with the first two letters of a string containing innuendo ➌, resulting in institution. Finally, you use an alternate way of specifying the target sequence with an index and a length to replace the characters stitut with the characters vers, yielding subversion ➍.
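The overload of replace that takes a character and a size is not exercised in the listing, so here is a minimal sketch of it. It uses the standard four-argument form replace(index, length, count, character); the helper names mask and replace_example are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: masks all but the last `visible` characters of text.
std::string mask(std::string text, std::size_t visible) {
  if (text.size() <= visible) return text;  // nothing to hide
  const std::size_t hidden = text.size() - visible;
  // Replace `hidden` characters starting at index 0 with `hidden` asterisks.
  text.replace(0, hidden, hidden, '*');
  return text;
}

void replace_example() {
  assert(mask("substitution", 4) == "********tion");
  assert(mask("abc", 5) == "abc");
}
```

Because the count of new characters is independent of the length replaced, the same overload can also grow or shrink the string.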
The string class offers a resize method to manually set the length of string. The resize method takes two arguments: a new length and an optional char. If the new length of string is smaller, resize ignores the char. If the new length of string is larger, resize appends the char the implied number of times to achieve the desired length. Listing 15-11 illustrates the resize method.
TEST_CASE("std::string resize") {
  std::string word("shamp"); ➊
  SECTION("can remove elements") {
    word.resize(4); ➋
    REQUIRE(word == "sham");
  }
  SECTION("can add elements") {
    word.resize(7, 'o'); ➌
    REQUIRE(word == "shampoo");
  }
}
Listing 15-11: Resizing a string
You construct a string called word containing the characters shamp ➊. In the first test, you resize word to length 4 so it contains sham ➋. In the second, you resize to a length of 7 and provide the optional character o as the value to extend word with ➌. This results in word containing shampoo.
The “Constructing” section on page 482 explained a substring constructor that can extract contiguous sequences of characters to create a new string. You can also generate substrings using the substr method, which takes two optional arguments: a position argument and a length. The position defaults to 0 (the beginning of the string), and the length defaults to the remainder of the string. Listing 15-12 illustrates how to use substr.
TEST_CASE("std::string substr with") {
  std::string word("hobbits"); ➊
  SECTION("no arguments copies the string") {
    REQUIRE(word.substr() == "hobbits"); ➋
  }
  SECTION("position takes the remainder") {
    REQUIRE(word.substr(3) == "bits"); ➌
  }
  SECTION("position/index takes a substring") {
    REQUIRE(word.substr(3, 3) == "bit"); ➍
  }
}
Listing 15-12: Extracting substrings from a string
You declare a string called word containing hobbits ➊. If you invoke substr with no arguments, you simply copy the string ➋. When you provide the position argument 3, substr extracts the substring beginning at element 3 and extending to the end of the string, yielding bits ➌. Finally, when you provide a position (3) and a length (3), you instead get bit ➍.
Table 15-5 lists many of the insertion and deletion methods of string. In this table, str is a string or a C-style char* string; p, n, and i are size_t; ind is a size_t index or an iterator into s; c is a char; and beg and end are iterators. An asterisk (*) indicates that this operation invalidates raw pointers and iterators to s’s elements in at least some circumstances.
Table 15-5: Supported std::string Element Manipulation Methods
| Method | Description |
| --- | --- |
| s.insert(ind, str, [p], [n]) | Inserts the n elements of str, starting at p, into s just before ind. If no n supplied, inserts the entire string or up to the first null of a char*; p defaults to 0.* |
| s.insert(ind, n, c) | Inserts n copies of c just before ind.* |
| s.insert(ind, beg, end) | Inserts the half-open range from beg to end just before ind.* |
| s.append(str, [p], [n]) | Equivalent to s.insert(s.end(), str, [p], [n]).* |
| s.append(n, c) | Equivalent to s.insert(s.end(), n, c).* |
| s.append(beg, end) | Appends the half-open range from beg to end to the end of s.* |
| s += c, s += str | Appends c or str to the end of s.* |
| s.push_back(c) | Appends c to the end of s.* |
| s.clear() | Removes all characters from s.* |
| s.erase([i], [n]) | Removes n characters starting at position i; i defaults to 0, and n defaults to the remainder of s.* |
| s.erase(itr) | Erases the element pointed to by itr.* |
| s.erase(beg, end) | Erases the elements on the half-open range from beg to end.* |
| s.pop_back() | Removes the last element of s.* |
| s.resize(n,[c]) | Resizes the string so it contains n characters. If this operation increases the string’s length, it adds copies of c, which defaults to 0.* |
| s.replace(i, n1, str, [p], [n2]) | Replaces the n1 characters starting at index i with the n2 elements in str starting at p. By default, p is 0 and n2 is str.length().* |
| s.replace(beg, end, str) | Replaces the half-open range beg to end with str.* |
| s.replace(p, n, str) | Replaces from index p to p+n with str.* |
| s.replace(beg1, end1, beg2, end2) | Replaces the half-open range beg1 to end1 with the half-open range beg2 to end2.* |
| s.replace(ind, c, [n]) | Replaces n elements starting at ind with cs.* |
| s.replace(ind, beg, end) | Replaces elements starting at ind with the half-open range beg to end.* |
| s.substr([p], [n]) | Returns the substring starting at p with length n. By default, p is 0 and n is the remainder of the string. |
| s1.swap(s2) | Exchanges the contents of s1 and s2.* |
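None of the listings so far exercises insert, so the following sketch shows the index, count, and character form from the table above. The helper names group_digits and insert_example are hypothetical, and the function assumes its input contains only digits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: inserts a comma before every group of three digits,
// counting from the right. Assumes `digits` holds only digit characters.
std::string group_digits(std::string digits) {
  // Walk leftward from the end; stop once three or fewer digits remain.
  for (std::size_t pos = digits.size(); pos > 3; pos -= 3) {
    digits.insert(pos - 3, 1, ',');  // index, count, character
  }
  return digits;
}

void insert_example() {
  assert(group_digits("1234567") == "1,234,567");
  assert(group_digits("123456") == "123,456");
  assert(group_digits("123") == "123");
}
```

Working from the right means that earlier insertions never shift the positions still to be visited.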
In addition to the preceding methods, string offers several search methods, which enable you to locate substrings and characters that you’re interested in. Each method performs a particular kind of search, so which you choose depends on the particulars of the application.
The first method string offers is find, which accepts a string, a C-style string, or a char as its first argument. This argument is an element that you want to locate within this. Optionally, you can provide a second size_t position argument that tells find where to start looking. If find fails to locate the substring, it returns the special size_t-valued, constant, static member std::string::npos. Listing 15-13 illustrates the find method.
TEST_CASE("std::string find") {
  using namespace std::literals::string_literals;
  std::string word("pizzazz"); ➊
  SECTION("locates substrings from strings") {
    REQUIRE(word.find("zz"s) == 2); // pi(z)zazz ➋
  }
  SECTION("accepts a position argument") {
    REQUIRE(word.find("zz"s, 3) == 5); // pizza(z)z ➌
  }
  SECTION("locates substrings from char*") {
    REQUIRE(word.find("zaz") == 3); // piz(z)azz ➍
  }
  SECTION("returns npos when not found") {
    REQUIRE(word.find('x') == std::string::npos); ➎
  }
}
Listing 15-13: Finding substrings within a string
Here, you construct the string called word containing pizzazz ➊. In the first test, you invoke find with a string containing zz, which returns 2 ➋, the index of the first z in pizzazz. When you provide a position argument of 3 corresponding to the second z in pizzazz, find locates the second zz beginning at 5 ➌. In the third test, you use the C-style string zaz, and find returns 3, again corresponding to the second z in pizzazz ➍. Finally, you attempt to find the character x, which doesn’t appear in pizzazz, so find returns std::string::npos ➎.
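A common use of the position argument is to resume a search just past the previous match. The sketch below counts every occurrence of a substring, including overlapping ones; count_occurrences and find_example are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts occurrences of needle in text, overlaps included.
std::size_t count_occurrences(const std::string& text,
                              const std::string& needle) {
  if (needle.empty()) return 0;  // an empty needle would match everywhere
  std::size_t count = 0;
  for (auto pos = text.find(needle); pos != std::string::npos;
       pos = text.find(needle, pos + 1)) {  // resume one past the last match
    ++count;
  }
  return count;
}

void find_example() {
  assert(count_occurrences("pizzazz", "zz") == 2);  // at indices 2 and 5
  assert(count_occurrences("pizzazz", "x") == 0);
  assert(count_occurrences("aaaa", "aa") == 3);     // overlapping matches
}
```

To count only non-overlapping matches, you would advance by needle.size() instead of 1.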
The rfind method is an alternative to find that takes the same arguments but searches in reverse. You might want to use this functionality if, for example, you were looking for particular punctuation at the end of a string, as Listing 15-14 illustrates.
TEST_CASE("std::string rfind") {
  using namespace std::literals::string_literals;
  std::string word("pizzazz"); ➊
  SECTION("locates substrings from strings") {
    REQUIRE(word.rfind("zz"s) == 5); // pizza(z)z ➋
  }
  SECTION("accepts a position argument") {
    REQUIRE(word.rfind("zz"s, 3) == 2); // pi(z)zazz ➌
  }
  SECTION("locates substrings from char*") {
    REQUIRE(word.rfind("zaz") == 3); // piz(z)azz ➍
  }
  SECTION("returns npos when not found") {
    REQUIRE(word.rfind('x') == std::string::npos); ➎
  }
}
Listing 15-14: Finding substrings in reverse within a string
Using the same word ➊, you use the same arguments as in Listing 15-13 to test rfind. Given zz, rfind returns 5, the second to last z in pizzazz ➋. When you provide the positional argument 3, rfind instead returns the first z in pizzazz ➌. Because there’s only one occurrence of the substring zaz, rfind returns the same position as find ➍. Also like find, rfind returns std::string::npos when given x ➎.
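As one illustration of searching from the end, the following sketch pulls the extension off a filename by locating the last dot. The names extension_of and rfind_example are hypothetical, and the helper deliberately ignores path separators and other real-world complications.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the text after the last '.', or an empty
// string when the name contains no dot.
std::string extension_of(const std::string& filename) {
  const auto dot = filename.rfind('.');
  if (dot == std::string::npos) return {};
  return filename.substr(dot + 1);  // dot + 1 == size() yields ""
}

void rfind_example() {
  assert(extension_of("archive.tar.gz") == "gz");
  assert(extension_of("README").empty());
  assert(extension_of("trailing.").empty());
}
```

Using find instead of rfind here would return tar.gz for the first input, which shows why the search direction matters.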
Whereas find and rfind locate exact subsequences in a string, a family of related functions finds the first character contained in a given argument.
The find_first_of function accepts a string and locates the first character in this contained in the argument. Optionally, you can provide a size_t position argument to indicate to find_first_of where to start in the string. If find_first_of cannot find a matching character, it will return std::string::npos. Listing 15-15 illustrates the find_first_of function.
TEST_CASE("std::string find_first_of") {
  using namespace std::literals::string_literals;
  std::string sentence("I am a Zizzer-Zazzer-Zuzz as you can plainly see."); ➊
  SECTION("locates characters within another string") {
    REQUIRE(sentence.find_first_of("Zz"s) == 7); // (Z)izzer ➋
  }
  SECTION("accepts a position argument") {
    REQUIRE(sentence.find_first_of("Zz"s, 11) == 14); // (Z)azzer ➌
  }
  SECTION("returns npos when not found") {
    REQUIRE(sentence.find_first_of("Xx"s) == std::string::npos); ➍
  }
}
Listing 15-15: Finding the first element from a set within a string
The string called sentence contains I am a Zizzer-Zazzer-Zuzz as you can plainly see. ➊. Here, you invoke find_first_of with the string Zz, which matches both lowercase and uppercase z. This returns 7, which corresponds to the first Z in sentence, Zizzer ➋. In the second test, you again provide the string Zz but also pass the position argument 11, which corresponds to the e in Zizzer. This results in 14, which corresponds to the Z in Zazzer ➌. Finally, you invoke find_first_of with Xx, which results in std::string::npos because sentence doesn’t contain an x (or an X) ➍.
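A string offers three find_first_of variations:
find_first_not_of locates the first character not contained in the argument.
find_last_of searches in reverse and locates the last character contained in the argument.
find_last_not_of searches in reverse and locates the last character not contained in the argument.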
Your choice of find function boils down to what your algorithmic requirements are. Do you need to search from the back of a string, say for a punctuation mark? If so, use find_last_of. Are you looking for the first space in a string? If so, use find_first_of. Do you want to invert your search and look for the first element that is not a member of some set? Then use the alternatives find_first_not_of and find_last_not_of, depending on whether you want to start from the beginning or end of the string.
Listing 15-16 illustrates these three find_first_of variations.
TEST_CASE("std::string") {
  using namespace std::literals::string_literals;
  std::string sentence("I am a Zizzer-Zazzer-Zuzz as you can plainly see."); ➊
  SECTION("find_last_of finds last element within another string") {
    REQUIRE(sentence.find_last_of("Zz"s) == 24); // Zuz(z) ➋
  }
  SECTION("find_first_not_of finds first element not within another string") {
    REQUIRE(sentence.find_first_not_of(" -IZaeimrz"s) == 22); // Z(u)zz ➌
  }
  SECTION("find_last_not_of finds last element not within another string") {
    REQUIRE(sentence.find_last_not_of(" .es"s) == 43); // plainl(y) ➍
  }
}
Listing 15-16: Alternatives to the find_first_of method of string
Here, you initialize the same sentence as in Listing 15-15 ➊. In the first test, you use find_last_of on Zz, which searches in reverse for any z or Z and returns 24, the position of the last z in Zuzz ➋. Next, you use find_first_not_of and pass a farrago of characters (not including the letter u), which results in 22, the position of the u in Zuzz ➌. Finally, you use find_last_not_of to find the last character not equal to space, period, e, or s. This results in 43, the position of y in plainly ➍.
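A common use of the inverted searches is trimming. The following is a minimal sketch rather than a definitive implementation: trim is a hypothetical helper (it is not part of the STL), and the example uses plain assert instead of the Catch tests in the listings so that it stands alone.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns a copy of `input` without leading or
// trailing characters drawn from `padding`.
std::string trim(const std::string& input,
                 const std::string& padding = " \t\r\n") {
  const auto first = input.find_first_not_of(padding);
  if (first == std::string::npos) return {};  // nothing but padding
  const auto last = input.find_last_not_of(padding);
  return input.substr(first, last - first + 1);
}

void trim_demo() {
  assert(trim("  Zizzer-Zazzer-Zuzz \n") == "Zizzer-Zazzer-Zuzz");
  assert(trim("   ").empty());
  assert(trim("--dash--", "-") == "dash");
}
```

Checking the result of find_first_not_of against npos first matters: if the input is all padding, there is no valid position from which to take a substring.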
Table 15-6 lists many of the search methods for string. Note that s2 is a string; cstr is a C-style char* string; c is a char; and p and l are size_t in the table.
Table 15-6: Supported std::string Search Algorithms
Method | Searches s starting at p and returns the position of the . . .
s.find(s2, [p]) | First substring equal to s2; p defaults to 0.
s.find(cstr, [p], [l]) | First substring equal to the first l characters of cstr; p defaults to 0; l defaults to cstr’s length per null termination.
s.find(c, [p]) | First character equal to c; p defaults to 0.
s.rfind(s2, [p]) | Last substring equal to s2; p defaults to npos.
s.rfind(cstr, [p], [l]) | Last substring equal to the first l characters of cstr; p defaults to npos; l defaults to cstr’s length per null termination.
s.rfind(c, [p]) | Last character equal to c; p defaults to npos.
s.find_first_of(s2, [p]) | First character contained in s2; p defaults to 0.
s.find_first_of(cstr, [p], [l]) | First character contained in the first l characters of cstr; p defaults to 0; l defaults to cstr’s length per null termination.
s.find_first_of(c, [p]) | First character equal to c; p defaults to 0.
s.find_last_of(s2, [p]) | Last character contained in s2; p defaults to npos.
s.find_last_of(cstr, [p], [l]) | Last character contained in the first l characters of cstr; p defaults to npos; l defaults to cstr’s length per null termination.
s.find_last_of(c, [p]) | Last character equal to c; p defaults to npos.
s.find_first_not_of(s2, [p]) | First character not contained in s2; p defaults to 0.
s.find_first_not_of(cstr, [p], [l]) | First character not contained in the first l characters of cstr; p defaults to 0; l defaults to cstr’s length per null termination.
s.find_first_not_of(c, [p]) | First character not equal to c; p defaults to 0.
s.find_last_not_of(s2, [p]) | Last character not contained in s2; p defaults to npos.
s.find_last_not_of(cstr, [p], [l]) | Last character not contained in the first l characters of cstr; p defaults to npos; l defaults to cstr’s length per null termination.
s.find_last_not_of(c, [p]) | Last character not equal to c; p defaults to npos.
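To make the table’s overloads more concrete, here is a short sketch that exercises the length-limited C-string form and the backward searches. The extension_of helper is hypothetical and only illustrates one way to combine find_last_of and rfind; the example uses plain assert so that it stands alone.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the extension of `path` (without the dot),
// or an empty string if the final path component has no dot.
std::string extension_of(const std::string& path) {
  const auto slash = path.find_last_of("/\\");  // last path separator, if any
  const auto dot = path.rfind('.');             // last dot; search starts at the end
  if (dot == std::string::npos) return {};
  if (slash != std::string::npos && dot < slash) return {};  // dot is in a directory name
  return path.substr(dot + 1);
}

void search_table_demo() {
  const std::string word{"pizzazz"};
  // find(cstr, p, l): only the first l characters of cstr take part,
  // so this searches for "zz" and ignores the trailing "top".
  assert(word.find("zztop", 0, 2) == 2);
  // rfind and the find_last_* methods start from the end by default.
  assert(word.rfind('z') == 6);
  assert(word.find_last_of("ai") == 4);
  assert(word.find_last_not_of("z") == 4);

  assert(extension_of("/usr/include/c++/string.h") == "h");
  assert(extension_of("archive.tar.gz") == "gz");
  assert(extension_of("/etc.d/passwd").empty());
}
```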
The STL provides functions for converting between string or wstring and the fundamental numeric types. Given a numeric type, you can use the std::to_string and std::to_wstring functions to generate its string or wstring representation. Both functions have overloads for all the numeric types. Listing 15-17 illustrates to_string and to_wstring.
TEST_CASE("STL string conversion function") {
  using namespace std::literals::string_literals;
  SECTION("to_string") {
    REQUIRE("8675309"s == std::to_string(8675309)); ➊
  }
  SECTION("to_wstring") {
    REQUIRE(L"109951.1627776"s == std::to_wstring(109951.1627776)); ➋
  }
}
Listing 15-17: Numeric conversion functions of string
NOTE
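The second unit test ➋ is likely to fail on your system. The to_string and to_wstring functions have traditionally formatted floating-point values as if by printf’s %f specifier, which emits six digits after the decimal point, so the result is typically L"109951.162778" rather than L"109951.1627776".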
The first example uses to_string to convert the int 8675309 into a string ➊; the second example uses to_wstring to convert the double 109951.1627776 into a wstring ➋.
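Neither to_string nor to_wstring lets you choose how many digits a floating-point value gets. When that matters, one option is to format through a stream instead. The following is a minimal sketch; to_fixed_string is a hypothetical helper, not part of the STL, and the example uses plain assert so that it stands alone.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats `value` in fixed notation with exactly
// `digits` digits after the decimal point.
std::string to_fixed_string(double value, int digits) {
  std::ostringstream stream;
  stream << std::fixed << std::setprecision(digits) << value;
  return stream.str();
}

void to_string_demo() {
  // Integral conversions are exact.
  assert(std::to_string(8675309) == "8675309");
  assert(std::to_wstring(-42) == L"-42");
  // Asking for seven fractional digits explicitly recovers the literal.
  assert(to_fixed_string(109951.1627776, 7) == "109951.1627776");
  assert(to_fixed_string(2.5, 1) == "2.5");
}
```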
You can also convert the other way, going from a string or wstring to a numeric type. Each numeric conversion function accepts a string or wstring containing a string-encoded number as its first argument. Next, you can provide an optional pointer to a size_t. If provided, the conversion function will write the index of the first character it was unable to convert (or the length of the input string if it converted all characters). By default, this index argument is nullptr, in which case the conversion function doesn’t write the index. When the target type is integral, you can provide a third argument: an int corresponding to the base of the encoded string. This base argument is optional and defaults to 10.
Each conversion function throws std::invalid_argument if no conversion could be performed and throws std::out_of_range if the converted value is out of range for the corresponding type.
Table 15-7 lists each of these conversion functions along with its target type. In this table, s is a string or wstring, and n is a value of a numeric type. If p is not nullptr, the conversion function writes the position of the first unconverted character in s to the memory pointed to by p; if every character is converted, it writes the length of s. Here, b is the base of the number’s representation in s. Note that p defaults to nullptr, and b defaults to 10.
Table 15-7: Supported Numeric Conversion Functions for std::string and std::wstring
Function | Converts s to
stoi(s, [p], [b]) | An int
stol(s, [p], [b]) | A long
stoll(s, [p], [b]) | A long long
stoul(s, [p], [b]) | An unsigned long
stoull(s, [p], [b]) | An unsigned long long
stof(s, [p]) | A float
stod(s, [p]) | A double
stold(s, [p]) | A long double
to_string(n) | A string
to_wstring(n) | A wstring
Listing 15-18 illustrates several numeric conversion functions.
TEST_CASE("STL string conversion function") {
  using namespace std::literals::string_literals;
  SECTION("stoi") {
    REQUIRE(std::stoi("8675309"s) == 8675309); ➊
  }
  SECTION("stoi throws when the value is out of range") {
    REQUIRE_THROWS_AS(std::stoi("1099511627776"s), std::out_of_range); ➋
  }
  SECTION("stoul with all valid characters") {
    size_t last_character{};
    const auto result = std::stoul("0xD3C34C3D"s, &last_character, 16); ➌
    REQUIRE(result == 0xD3C34C3D);
    REQUIRE(last_character == 10);
  }
  SECTION("stoul with trailing unparsable characters") {
    size_t last_character{};
    const auto result = std::stoul("42six"s, &last_character); ➍
    REQUIRE(result == 42);
    REQUIRE(last_character == 2);
  }
  SECTION("stod") {
    REQUIRE(std::stod("2.7182818"s) == Approx(2.7182818)); ➎
  }
}
Listing 15-18: String conversion functions of string
First, you use stoi to convert 8675309 to an integer ➊. In the second test, you attempt to use stoi to convert the string 1099511627776 into an integer. Because this value is too large for an int, stoi throws std::out_of_range ➋. Next, you convert 0xD3C34C3D with stoul, but you provide the two optional arguments: a pointer to a size_t called last_character and a hexadecimal base ➌. The last_character object is 10, the length of 0xD3C34C3D, because stoul can parse every character. The string in the next test, 42six, contains the unparsable characters six. When you invoke stoul this time, the result is 42 and last_character equals 2, the position of s in six ➍. Finally, you use stod to convert the string 2.7182818 to a double ➎.
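In practice you often want to treat a partially converted string such as 42six as an error, and to handle both exception types in one place. The following sketch shows one way to do that; parse_int is a hypothetical helper, and wrapping the result in std::optional is a design choice of this example, not something the conversion functions do themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `text` to an int only if the whole string
// is a valid number in the given base; otherwise returns std::nullopt.
std::optional<int> parse_int(const std::string& text, int base = 10) {
  try {
    std::size_t first_unconverted{};
    const int value = std::stoi(text, &first_unconverted, base);
    if (first_unconverted != text.size()) return std::nullopt;  // trailing junk
    return value;
  } catch (const std::invalid_argument&) {
    return std::nullopt;  // no conversion could be performed
  } catch (const std::out_of_range&) {
    return std::nullopt;  // value does not fit in an int
  }
}

void parse_int_demo() {
  assert(parse_int("8675309") == 8675309);
  assert(parse_int("ff", 16) == 255);
  assert(!parse_int("42six"));
  assert(!parse_int("six"));
  assert(!parse_int("1099511627776"));
}
```

Comparing the written index against the string’s size is what distinguishes a full conversion from a partial one.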
NOTE
Boost’s Lexical Cast provides an alternative, template-based approach to numeric conversions. Refer to the documentation for boost::lexical_cast available in the <boost/lexical_cast.hpp> header.
A string view is an object that represents a constant, contiguous sequence of characters. It’s very similar to a const string reference. In fact, string view classes are often implemented as a pointer to a character sequence and a length.
The STL offers the class template std::basic_string_view in the <string_view> header, which is analogous to std::basic_string. The template std::basic_string_view has a specialization for each of the four commonly used character types: string_view for char, wstring_view for wchar_t, u16string_view for char16_t, and u32string_view for char32_t.
This section discusses the string_view specialization for demonstration purposes, but the discussion generalizes to the other three specializations.
The string_view class supports most of the same methods as string; in fact, it’s designed to be a drop-in replacement for a const string&.
The string_view class supports default construction; a default-constructed string_view has zero length and points to nullptr. Importantly, string_view supports implicit construction from a const string& or a C-style string. You can also construct a string_view from a char* and a size_t, so you can manually specify the desired length in case you want a substring or you have embedded nulls. Listing 15-19 illustrates the use of string_view.
TEST_CASE("std::string_view supports") {
  SECTION("default construction") {
    std::string_view view; ➊
    REQUIRE(view.data() == nullptr);
    REQUIRE(view.size() == 0);
    REQUIRE(view.empty());
  }
  SECTION("construction from string") {
    std::string word("sacrosanct");
    std::string_view view(word); ➋
    REQUIRE(view == "sacrosanct");
  }
  SECTION("construction from C-string") {
    auto word = "viewership";
    std::string_view view(word); ➌
    REQUIRE(view == "viewership");
  }
  SECTION("construction from C-string and length") {
    auto word = "viewership";
    std::string_view view(word, 4); ➍
    REQUIRE(view == "view");
  }
}
Listing 15-19: The constructors of string_view
The default-constructed string_view points to nullptr and is empty ➊. When you construct a string_view from a string ➋ or a C-style string ➌, it points to the original’s contents. The final test provides the optional length argument 4, which means the string_view refers to only the first four characters instead ➍.
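The explicit length argument also matters when the character sequence contains embedded nulls. The following short sketch (using plain assert so that it stands alone; the variable names are illustrative) contrasts the two constructors on the same buffer.

```cpp
#include <cassert>
#include <string_view>

void string_view_length_demo() {
  const char raw[] = "ab\0cd";  // six chars: a, b, \0, c, d, and the terminator
  std::string_view up_to_null(raw);    // length found by scanning for '\0'
  std::string_view with_null(raw, 5);  // length given explicitly
  assert(up_to_null.size() == 2);
  assert(with_null.size() == 5);
  assert(with_null[3] == 'c');
  assert(with_null.data() == raw);     // the view points into raw; nothing is copied
}
```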
A string_view supports copy construction and copy assignment, but it has no distinct move operations: moving a string_view simply copies its pointer and length. This design makes sense when you consider that string_view doesn’t own the sequence to which it points.
The string_view class supports many of the same operations as a const string& with identical semantics. The following list shows the methods shared between string and string_view:
Iterators: begin, end, rbegin, rend, cbegin, cend, crbegin, crend
Element Access: operator[], at, front, back, data
Capacity: size, length, max_size, empty
Search: find, rfind, find_first_of, find_last_of, find_first_not_of, find_last_not_of
Extraction: copy, substr
Comparison: compare, operator==, operator!=, operator<, operator>, operator<=, operator>=
In addition to these shared methods, string_view supports the remove_prefix method, which removes the given number of characters from the beginning of the string_view, and the remove_suffix method, which instead removes characters from the end. Listing 15-20 illustrates both methods.
TEST_CASE("std::string_view is modifiable with") {
  std::string_view view("previewing"); ➊
  SECTION("remove_prefix") {
    view.remove_prefix(3); ➋
    REQUIRE(view == "viewing");
  }
  SECTION("remove_suffix") {
    view.remove_suffix(3); ➌
    REQUIRE(view == "preview");
  }
}
Listing 15-20: Modifying a string_view with remove_prefix and remove_suffix
Here, you declare a string_view referring to the string literal previewing ➊. The first test invokes remove_prefix with 3 ➋, which removes three characters from the front of string_view so it now refers to viewing. The second test instead invokes remove_suffix with 3 ➌, which removes three characters from the back of the string_view and results in preview.
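Because remove_prefix and remove_suffix only adjust the view’s pointer and length, they combine naturally with the search methods to trim text without copying it. The following is a minimal sketch; trim_spaces is a hypothetical helper, and the example uses plain assert so that it stands alone.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: narrows `view` so that it excludes leading and
// trailing spaces. No characters are copied or modified.
std::string_view trim_spaces(std::string_view view) {
  const auto first = view.find_first_not_of(' ');
  if (first == std::string_view::npos) return {};  // all spaces
  view.remove_prefix(first);
  const auto last = view.find_last_not_of(' ');
  view.remove_suffix(view.size() - last - 1);
  return view;
}

void trim_spaces_demo() {
  assert(trim_spaces("  previewing   ") == "previewing");
  assert(trim_spaces("    ").empty());
  assert(trim_spaces("view") == "view");
}
```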
Because string_view doesn’t own the sequence to which it refers, it’s up to you to ensure that the lifetime of the string_view is a subset of the referred-to sequence’s lifetime.
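The lifetime rule is easiest to see with a function that returns a string_view. In the following sketch, first_word is a hypothetical helper; the unsafe call is left commented out because using its result would be undefined behavior.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: returns a view of the first space-delimited word.
// The result points into whatever `text` points into.
std::string_view first_word(std::string_view text) {
  return text.substr(0, text.find(' '));
}

void lifetime_demo() {
  // Safe: `owner` outlives `word`, so the view always refers to live characters.
  const std::string owner{"sacrosanct viewership"};
  const std::string_view word = first_word(owner);
  assert(word == "sacrosanct");
  assert(word.data() == owner.data());  // the view aliases owner's buffer

  // Unsafe (left commented out): the temporary std::string is destroyed at
  // the end of the full expression, leaving the view dangling.
  // std::string_view dangling = first_word(std::string{"gone too soon"});
}
```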
Perhaps the most common usage of string_view is as a function parameter. When you need to interact with an immutable sequence of characters, it’s the first port of call. Consider the count_vees function in Listing 15-21, which counts the frequency of the letter v in a sequence of characters.
#include <string_view>
size_t count_vees(std::string_view my_view➊) {
  size_t result{};
  for(auto letter : my_view) ➋
    if (letter == 'v') result++; ➌
  return result; ➍
}
Listing 15-21: The count_vees function
The count_vees function takes a string_view called my_view ➊, which you iterate over using a range-based for loop ➋. Each time a character in my_view equals v, you increment a result variable ➌, which you return after exhausting the sequence ➍.
You could reimplement Listing 15-21 by simply replacing string_view with const string&, as demonstrated in Listing 15-22.
#include <string>
size_t count_vees(const std::string& my_view) {
--snip--
}
Listing 15-22: The count_vees function reimplemented to use a const string& instead of a string_view
If string_view is just a drop-in replacement for a const string&, why bother having it? Well, if you invoke count_vees with a std::string, there’s no difference: modern compilers will emit the same code.
If you instead invoke count_vees with a string literal, there’s a big difference: when you pass a string literal for a const string&, you construct a string. When you pass a string literal for a string_view, you construct a string_view. Constructing a string is probably more expensive, because it might have to allocate dynamic memory and it definitely has to copy characters. A string_view is just a pointer and a length (no copying or allocating is required).
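The following sketch repeats count_vees from Listing 15-21 and calls it with several kinds of argument. The call sites and variable names are illustrative, and the example uses plain assert so that it stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// count_vees as in Listing 15-21.
std::size_t count_vees(std::string_view my_view) {
  std::size_t result{};
  for (auto letter : my_view)
    if (letter == 'v') result++;
  return result;
}

void count_vees_demo() {
  const std::string owned{"vivacious viewership"};
  assert(count_vees(owned) == 3);            // std::string converts to a view
  assert(count_vees("vivid vervet") == 4);   // literal: no std::string is constructed
  const char* raw = "vvvv";
  assert(count_vees(std::string_view{raw, 2}) == 2);  // pointer and length: a prefix
  assert(count_vees(std::string_view{owned}.substr(10)) == 1);  // view of "viewership"
}
```

With a const string& parameter, the last two calls would each have to build a temporary string first; with string_view, every call passes only a pointer and a length.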
A regular expression, also called a regex, is a string that defines a search pattern. Regexes have a long history in computer science and form a sort of mini-language for searching, replacing, and extracting language data. The STL offers regular expression support in the <regex> header.
When used judiciously, regular expressions can be tremendously powerful, declarative, and concise; however, it’s also easy to write regexes that are totally inscrutable. Use regexes deliberately.
You build regular expressions using strings called patterns. Patterns represent a desired set of strings using a particular regular expression grammar that sets the syntax for building patterns. In other words, a pattern defines the subset of all possible strings that you’re interested in. The STL supports a handful of grammars, but the focus here will be on the very basics of the default grammar, the modified ECMAScript regular expression grammar (see [re.grammar] for details).
In the ECMAScript grammar, you intermix literal characters with special markup to describe your desired strings. Perhaps the most common markup is a character class, which stands in for a set of possible characters: \d matches any digit, \s matches any whitespace, and \w matches any alphanumeric (“word”) character or underscore.
Table 15-8 lists a few example regular expressions and possible interpretations.
Table 15-8: Regular Expression Patterns Using Only Character Classes and Literals
Regex pattern | Possibly describes
\d\d\d-\d\d\d-\d\d\d\d | An American phone number, such as 202-456-1414
\d\d:\d\d \wM | A time in HH:MM AM/PM format, such as 08:49 PM
\w\w\d\d\d\d\d | An American ZIP code including a prepended state code, such as NJ07932
\w\d-\w\d | An astromech droid identifier, such as R2-D2
c\wt | A three-letter word starting with c and ending with t, such as cat or cot
You can also invert a character class by capitalizing the d, s, or w to give the opposite: \D matches any non-digit, \S matches any non-whitespace, and \W matches any non-word character.
In addition, you can build your own character classes by explicitly enumerating them between square brackets []. For example, the character class [02468] includes even digits. You can also use hyphens as shortcuts to include implied ranges, so the character class [0-9a-fA-F] includes any hexadecimal digit whether the letter is capitalized or not. Finally, you can invert a custom character class by prepending the list with a caret ^. For example, the character class [^aeiou] includes all non-vowel characters.
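To try these character classes out, you can already lean on two facilities from the <regex> header: the std::regex class and the std::regex_match function, which checks whether an entire string matches a pattern. The following is a minimal sketch; the matches helper is hypothetical, and the example uses plain assert so that it stands alone.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`
// under the default (modified ECMAScript) grammar.
bool matches(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex{pattern});
}

void character_class_demo() {
  // Raw string literals keep the backslashes from being read as C++ escapes.
  assert(matches("202-456-1414", R"(\d\d\d-\d\d\d-\d\d\d\d)"));
  assert(matches("R2-D2", R"(\w\d-\w\d)"));
  assert(!matches("R2-DD", R"(\w\d-\w\d)"));
  assert(matches("a1", R"(\D\d)"));  // \D: anything but a digit
  assert(matches("8", "[02468]"));   // custom class: even digits
  assert(matches("fE", "[0-9a-fA-F][0-9a-fA-F]"));
  assert(matches("xyz", "[^aeiou][^aeiou][^aeiou]"));
  assert(!matches("xaz", "[^aeiou][^aeiou][^aeiou]"));
}
```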
You can save some typing by using quantifiers, which specify that the character directly to the left should be repeated some number of times. Table 15-9 lists the regex quantifiers.
Table 15-9: Regular Expression Quantifiers
Regex quantifier | Specifies a quantity of
* | 0 or more
+ | 1 or more
? | 0 or 1
{n} | Exactly n
{n,m} | Between n and m, inclusive
{n,} | At least n
Using quantifiers, you can specify all words beginning with c and ending with t using the pattern c\w*t, because \w* matches any number of word characters.
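The following sketch checks each quantifier against a few inputs. It assumes std::regex and std::regex_match from the <regex> header and uses plain assert so that it stands alone; the sample patterns other than c\w*t are illustrative.

```cpp
#include <cassert>
#include <regex>

void quantifier_demo() {
  const std::regex c_to_t{R"(c\w*t)"};
  assert(std::regex_match("ct", c_to_t));  // * allows zero word characters
  assert(std::regex_match("cat", c_to_t));
  assert(std::regex_match("consonant", c_to_t));
  assert(!std::regex_match("cats", c_to_t));  // must end with t

  assert(std::regex_match("", std::regex{R"(\d*)"}));   // 0 or more
  assert(!std::regex_match("", std::regex{R"(\d+)"}));  // 1 or more
  assert(std::regex_match("color", std::regex{"colou?r"}));   // 0 or 1
  assert(std::regex_match("colour", std::regex{"colou?r"}));
  assert(std::regex_match("202-456-1414",
                          std::regex{R"(\d{3}-\d{3}-\d{4})"}));  // exactly n
  assert(std::regex_match("12", std::regex{R"(\d{2,4})"}));      // between n and m
  assert(!std::regex_match("12345", std::regex{R"(\d{2,4})"}));
  assert(std::regex_match("12345", std::regex{R"(\d{2,})"}));    // at least n
  assert(!std::regex_match("1", std::regex{R"(\d{2,})"}));
}
```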
A group is a collection of characters. You can specify a group by placing it within parentheses. Groups are useful in several ways, including specifying a particular collection for eventual extraction and for quantification.
For example, you could improve the ZIP pattern in Table 15-8 to use quantifiers and groups, like this:
(\w{2})?➊(\d{5})➋(-\d{4})?➌
Now you have three groups: the optional state ➊, the ZIP code ➋, and an optional four-digit suffix ➌. As you’ll see later on, these groups make parsing from regexes much easier.
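As a preview of how groups help with extraction, the following sketch matches the ZIP pattern and reads each group back. It assumes std::smatch and std::regex_match from the <regex> header, which this section has not yet introduced, and it uses plain assert so that it stands alone.

```cpp
#include <cassert>
#include <regex>
#include <string>

void zip_group_demo() {
  const std::regex zip{R"((\w{2})?(\d{5})(-\d{4})?)"};
  std::smatch results;

  const std::string full{"NJ07932-1234"};
  const bool full_ok = std::regex_match(full, results, zip);
  assert(full_ok);
  assert(results[1] == "NJ");     // optional state
  assert(results[2] == "07932");  // ZIP code
  assert(results[3] == "-1234");  // optional suffix, hyphen included

  const std::string bare{"07932"};
  const bool bare_ok = std::regex_match(bare, results, zip);
  assert(bare_ok);
  assert(!results[1].matched);    // the optional state group did not participate
  assert(results[2] == "07932");
  assert(!results[3].matched);    // neither did the optional suffix
}
```

Index 0 of the results holds the entire match, so the groups are numbered from 1 in the order of their opening parentheses.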
Table 15-10 lists several other special characters available for use in regex patterns.
Table 15-10: Example Special Characters
Character | Specifies
X|Y | Character X or Y
\Y | The special character Y as a literal (in other words, escape it)
\n | Newline
\r | Carriage return
\t | Tab
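The following sketch exercises these special characters. It assumes std::regex and std::regex_match from the <regex> header and uses plain assert so that it stands alone; the sample patterns and strings are illustrative.

```cpp
#include <cassert>
#include <regex>

void special_character_demo() {
  // Alternation: either a or e between gr and y.
  const std::regex gray{"gr(a|e)y"};
  assert(std::regex_match("gray", gray));
  assert(std::regex_match("grey", gray));
  assert(!std::regex_match("groy", gray));

  // Escaping: \+ is a literal plus sign, whereas a bare + is a quantifier.
  assert(std::regex_match("1+1", std::regex{R"(\d\+\d)"}));
  assert(!std::regex_match("1+1", std::regex{R"(\d+\d)"}));
  assert(std::regex_match("111", std::regex{R"(\d+\d)"}));

  // Control characters: tab, carriage return, and newline.
  assert(std::regex_match("key\tvalue", std::regex{R"(\w+\t\w+)"}));
  assert(std::regex_match("line\r\n", std::regex{R"(\w+\r\n)"}));
}
```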