In this chapter you’ll learn:
• To create and manipulate immutable character-string objects of class string and mutable character-string objects of class StringBuilder.
• To manipulate character objects of struct Char.
• To use regular-expression classes Regex and Match.
• To iterate through matches to a regular expression.
• To use character classes to match any character from a set of characters.
• To use quantifiers to match a pattern multiple times.
• To search for patterns in text using regular expressions.
• To validate data using regular expressions and LINQ.
• To modify strings using regular expressions and class Regex.
The chief defect of Henry King Was chewing little bits of string.
—Hilaire Belloc
The difference between the almost-right word and the right word is really a large matter—it’s the difference between the lightning bug and the lightning.
—Mark Twain
16.1 Introduction
16.2 Fundamentals of Characters and Strings
16.3 string Constructors
16.4 string Indexer, Length Property and CopyTo Method
16.5 Comparing strings
16.6 Locating Characters and Substrings in strings
16.7 Extracting Substrings from strings
16.8 Concatenating strings
16.9 Miscellaneous string Methods
16.10 Class StringBuilder
16.11 Length and Capacity Properties, EnsureCapacity Method and Indexer of Class StringBuilder
16.12 Append and AppendFormat Methods of Class StringBuilder
16.13 Insert, Remove and Replace Methods of Class StringBuilder
16.14 Char Methods
16.15 Regular Expressions
16.15.1 Simple Regular Expressions and Class Regex
16.15.2 Complex Regular Expressions
16.15.3 Validating User Input with Regular Expressions and LINQ
16.15.4 Regex Methods Replace and Split
16.16 Wrap-Up
16.1 Introduction
This chapter introduces the .NET Framework Class Library’s string- and character-processing capabilities and demonstrates how to use regular expressions to search for patterns in text. The techniques it presents can be employed in text editors, word processors, page-layout software, computerized typesetting systems and other kinds of text-processing software. Previous chapters presented some basic string-processing capabilities. Now we discuss in detail the text-processing capabilities of class string and type char from the System namespace and class StringBuilder from the System.Text namespace.
We begin with an overview of the fundamentals of characters and strings in which we discuss character constants and string literals. We then provide examples of class string’s many constructors and methods. The examples demonstrate how to determine the length of strings, copy strings, access individual characters in strings, search strings, obtain substrings from larger strings, compare strings, concatenate strings, replace characters in strings and convert strings to uppercase or lowercase letters.
Next, we introduce class StringBuilder, which is used to build strings dynamically. We demonstrate StringBuilder capabilities for determining and specifying the size of a StringBuilder, as well as appending, inserting, removing and replacing characters in a StringBuilder object. We then introduce the character-testing methods of struct Char that enable a program to determine whether a character is a digit, a letter, a lowercase letter, an uppercase letter, a punctuation mark or a symbol other than a punctuation mark. Such methods are useful for validating individual characters in user input. In addition, type Char provides methods for converting a character to uppercase or lowercase.
Finally, we discuss regular expressions. We present classes Regex and Match from the System.Text.RegularExpressions namespace as well as the symbols that are used to form regular expressions. We then demonstrate how to find patterns in a string, match entire strings to patterns, replace characters in a string that match a pattern and split strings at delimiters specified as a pattern in a regular expression.
16.2 Fundamentals of Characters and Strings
Characters are the fundamental building blocks of C# source code. Every program is composed of characters that, when grouped together meaningfully, create a sequence that the compiler interprets as instructions describing how to accomplish a task. In addition to normal characters, a program also can contain character constants. A character constant is a character that’s represented as an integer value, called a character code. For example, the integer value 122 corresponds to the character constant 'z'. The integer value 10 corresponds to the newline character '\n'. Character constants are established according to the Unicode character set, an international character set that contains many more symbols and letters than does the ASCII character set (listed in Appendix C). To learn more about Unicode, see Appendix F.
A string is a series of characters treated as a unit. These characters can be uppercase letters, lowercase letters, digits and various special characters: +, -, *, /, $ and others. A string is an object of class string in the System namespace. We write string literals, also called string constants, as sequences of characters in double quotation marks, as follows:
"John Q. Doe"
"9999 Main Street"
"Waltham, Massachusetts"
"(201) 555-1212"
A declaration can assign a string literal to a string reference. The declaration
string color = "blue";
initializes string reference color to refer to the string literal object "blue".
If there are multiple occurrences of the same string literal object in an application, a single copy of it will be referenced from each location in the program that uses that string literal. It’s possible to share the object in this manner, because string literal objects are implicitly constant. Such sharing conserves memory.
On occasion, a string will contain multiple backslash characters (this often occurs in the name of a file). In a regular string literal the backslash introduces an escape sequence, so each literal backslash must be written as \\. To avoid the doubled backslashes, it’s possible to exclude escape sequences and interpret all the characters in a string literally, using the @ character. Backslashes within the double quotation marks following the @ character are not considered escape sequences, but rather regular backslash characters. Often this simplifies programming and makes the code easier to read. For example, consider the string C:\MyFolder\MySubFolder\MyFile.txt with the following assignment:
string file = "C:\\MyFolder\\MySubFolder\\MyFile.txt";
Using the verbatim string syntax, the assignment can be altered to
string file = @"C:\MyFolder\MySubFolder\MyFile.txt";
This approach also has the advantage of allowing string literals to span multiple lines by preserving all newlines, spaces and tabs.
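As a minimal sketch of the two literal forms, the following program builds the same path with escaped backslashes and with a verbatim string literal, then confirms that the two strings have identical contents. The class name VerbatimStringSketch and the variable names are our own illustrative choices.

```csharp
using System;

public static class VerbatimStringSketch
{
    public static void Main()
    {
        // regular literal: every backslash is written as the escape sequence \\
        string escaped = "C:\\MyFolder\\MySubFolder\\MyFile.txt";

        // verbatim literal: backslashes are ordinary characters
        string verbatim = @"C:\MyFolder\MySubFolder\MyFile.txt";

        Console.WriteLine("Escaped:  " + escaped);
        Console.WriteLine("Verbatim: " + verbatim);
        Console.WriteLine("Same contents: " + (escaped == verbatim));
    }
}
```

Both lines display the path with single backslashes; only the source-code spelling differs.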
16.3 string Constructors
Class string provides eight constructors for initializing strings in various ways. Figure 16.1 demonstrates three of the constructors.
Fig. 16.1. string constructors.
Lines 10–11 allocate the char array characterArray, which contains nine characters. Lines 12–16 declare the strings originalString, string1, string2, string3 and string4. Line 12 assigns string literal "Welcome to C# programming!" to string reference originalString. Line 13 sets string1 to reference the same string literal.
Line 14 assigns to string2 a new string, using the string constructor with a character array argument. The new string contains a copy of the array’s characters.
Line 15 assigns to string3 a new string, using the string constructor that takes a char array and two int arguments. The second argument specifies the starting index position (the offset) from which characters in the array are to be copied. The third argument specifies the number of characters (the count) to be copied from the specified starting position in the array. The new string contains a copy of the specified characters in the array. If the specified offset or count indicates that the program should access an element outside the bounds of the character array, an ArgumentOutOfRangeException is thrown.
Line 16 assigns to string4 a new string, using the string constructor that takes as arguments a character and an int specifying the number of times to repeat that character in the string.
In most cases, it’s not necessary to make a copy of an existing string. All strings are immutable: their character contents cannot be changed after they’re created. Also, if there are one or more references to a string (or any object for that matter), the object cannot be reclaimed by the garbage collector.
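The three constructors discussed above can be sketched as follows. This is not the code of Fig. 16.1; the array contents, the class name StringConstructorSketch and the displayed labels are assumptions chosen for illustration.

```csharp
using System;

public static class StringConstructorSketch
{
    public static void Main()
    {
        // a nine-character array (contents assumed for this sketch)
        char[] characterArray = { 'b', 'i', 'r', 't', 'h', ' ', 'd', 'a', 'y' };

        string string2 = new string(characterArray);       // copy of the whole array
        string string3 = new string(characterArray, 6, 3); // offset 6, count 3
        string string4 = new string('C', 5);               // 'C' repeated five times

        Console.WriteLine("string2 = \"" + string2 + "\"");
        Console.WriteLine("string3 = \"" + string3 + "\"");
        Console.WriteLine("string4 = \"" + string4 + "\"");

        try
        {
            // offset 6 plus count 4 reaches past the end of the nine-element array
            string tooLong = new string(characterArray, 6, 4);
            Console.WriteLine(tooLong);
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("Offset 6 with count 4 is out of range");
        }
    }
}
```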
16.4 string Indexer, Length Property and CopyTo Method
The application in Fig. 16.2 presents the string indexer, which facilitates the retrieval of any character in the string, and the string property Length, which returns the length of the string. The string method CopyTo copies a specified number of characters from a string into a char array.
Fig. 16.2. string indexer, Length property and CopyTo method.
This application determines the length of a string, displays its characters in reverse order and copies a series of characters from the string to a character array. Line 17 uses string property Length to determine the number of characters in string1. Like arrays, strings always know their own size.
Lines 22–23 write the characters of string1 in reverse order using the string indexer. The string indexer treats a string as an array of chars and returns each character at a specific position in the string. The indexer receives an integer argument as the position number and returns the character at that position. As with arrays, the first element of a string is considered to be at position 0.
Attempting to access a character that’s outside a string’s bounds results in an IndexOutOfRangeException.
Line 26 uses string method CopyTo to copy the characters of string1 into a character array (characterArray). The first argument given to method CopyTo is the index from which the method begins copying characters in the string. The second argument is the character array into which the characters are copied. The third argument is the index specifying the starting location at which the method begins placing the copied characters into the character array. The last argument is the number of characters that the method will copy from the string. Lines 29–30 output the char array contents one character at a time.
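The following sketch exercises the same three members on a sample string. It is an illustration rather than the listing of Fig. 16.2; the text "hello there", the class name StringMembersSketch and the output labels are assumed.

```csharp
using System;

public static class StringMembersSketch
{
    public static void Main()
    {
        string string1 = "hello there"; // sample text (assumed)
        char[] characterArray = new char[5];

        Console.WriteLine("Length: " + string1.Length);

        // walk the indexer from the last position down to position 0
        Console.Write("Reversed: ");
        for (int i = string1.Length - 1; i >= 0; --i)
            Console.Write(string1[i]);
        Console.WriteLine();

        // copy five characters, starting at index 0 of the string,
        // into characterArray beginning at its index 0
        string1.CopyTo(0, characterArray, 0, characterArray.Length);
        Console.WriteLine("Copied: " + new string(characterArray));

        try
        {
            char beyond = string1[string1.Length]; // one past the last valid index
            Console.WriteLine(beyond);
        }
        catch (IndexOutOfRangeException)
        {
            Console.WriteLine("Index " + string1.Length + " is out of range");
        }
    }
}
```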
16.5 Comparing strings
The next two examples demonstrate various methods for comparing strings. To understand how one string can be “greater than” or “less than” another, consider the process of alphabetizing a series of last names. The reader would, no doubt, place "Jones" before "Smith", because the first letter of "Jones" comes before the first letter of "Smith" in the alphabet. The alphabet is more than just a set of 26 letters; it’s an ordered list of characters in which each letter occurs in a specific position. For example, Z is more than just a letter of the alphabet; it’s specifically the twenty-sixth letter of the alphabet. Computers can order characters alphabetically because they’re represented internally as Unicode numeric codes.
Equals, CompareTo and the Equality Operator (==)
Class string provides several ways to compare strings. The application in Fig. 16.3 demonstrates the use of method Equals, method CompareTo and the equality operator (==).
Fig. 16.3. string test to determine equality.
The condition in line 21 uses string method Equals to compare string1 and literal string "hello" to determine whether they’re equal. Method Equals (inherited from object and overridden in string) tests any two objects for equality (i.e., checks whether the objects have identical contents). The method returns true if the objects are equal and false otherwise. In this case, the condition returns true, because string1 references string literal object "hello". Method Equals performs an ordinal comparison: it compares the numeric Unicode values of the corresponding characters in the two strings, independent of your system’s currently selected culture. Comparing "hello" with "HELLO" would return false, because the lowercase letters are different from the corresponding uppercase letters.
The condition in line 27 uses the overloaded equality operator (==) to compare string string1 with the literal string "hello" for equality. In C#, the equality operator also compares the contents of two strings. Thus, the condition in the if statement evaluates to true, because the values of string1 and "hello" are equal.
Line 33 tests whether string3 and string4 are equal to illustrate that comparisons are indeed case sensitive. Here, static method Equals is used to compare the values of two strings. "Happy Birthday" does not equal "happy birthday", so the condition of the if statement fails, and the message "string3 does not equal string4" is output (line 36).
Lines 40–48 use string method CompareTo to compare strings. Method CompareTo returns 0 if the strings are equal, a negative value if the string that invokes CompareTo is less than the string that’s passed as an argument and a positive value if the string that invokes CompareTo is greater than the string that’s passed as an argument. Unlike Equals, CompareTo uses word sorting rules that depend on your system’s currently selected culture.
Notice that CompareTo considers string3 to be greater than string4. The only difference between these two strings is that string3 contains two uppercase letters in positions where string4 contains lowercase letters.
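A brief sketch of these comparisons follows. It is not the listing of Fig. 16.3; the class name StringCompareSketch and the labels are assumed. Because CompareTo depends on the current culture, the sketch displays its result for the two birthday strings without stating an expected value, and adds static method CompareOrdinal (a different method from those discussed above) to show a comparison based purely on Unicode values.

```csharp
using System;

public static class StringCompareSketch
{
    public static void Main()
    {
        string string1 = "hello";
        string string3 = "Happy Birthday";
        string string4 = "happy birthday";

        // instance Equals and == both compare contents
        Console.WriteLine("Equals: " + string1.Equals("hello"));
        Console.WriteLine("Operator ==: " + (string1 == "hello"));

        // static Equals is case sensitive
        Console.WriteLine("Static Equals: " + string.Equals(string3, string4));

        // CompareTo returns 0 for strings with the same contents
        Console.WriteLine("CompareTo same text: " + string1.CompareTo("hello"));

        // culture-sensitive result; the value depends on the current culture
        Console.WriteLine("CompareTo (culture): " + string3.CompareTo(string4));

        // ordinal comparison: 'H' (72) is less than 'h' (104)
        Console.WriteLine("CompareOrdinal sign: " +
            Math.Sign(string.CompareOrdinal(string3, string4)));
    }
}
```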
Figure 16.4 shows how to test whether a string instance begins or ends with a given string. Method StartsWith determines whether a string instance starts with the string text passed to it as an argument. Method EndsWith determines whether a string instance ends with the string text passed to it as an argument. Class stringStartEnd’s Main method defines an array of strings (called strings), which contains "started", "starting", "ended" and "ending". The remainder of method Main tests the elements of the array to determine whether they start or end with a particular set of characters.
Fig. 16.4. StartsWith and EndsWith methods.
Line 13 uses method StartsWith, which takes a string argument. The condition in the if statement determines whether the string at index i of the array starts with the characters "st". If so, the method returns true, and strings[i] is output along with a message.
Line 21 uses method EndsWith to determine whether the string at index i of the array ends with the characters "ed". If so, the method returns true, and strings[i] is displayed along with a message.
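The two tests can be sketched as follows, using the four strings named above. The class name StartEndSketch and the message wording are assumptions; this is not the listing of Fig. 16.4.

```csharp
using System;

public static class StartEndSketch
{
    public static void Main()
    {
        string[] strings = { "started", "starting", "ended", "ending" };

        // report the elements that begin with "st"
        for (int i = 0; i < strings.Length; i++)
        {
            if (strings[i].StartsWith("st"))
                Console.WriteLine("\"" + strings[i] + "\" starts with \"st\"");
        }

        // report the elements that end with "ed"
        for (int i = 0; i < strings.Length; i++)
        {
            if (strings[i].EndsWith("ed"))
                Console.WriteLine("\"" + strings[i] + "\" ends with \"ed\"");
        }
    }
}
```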
16.6 Locating Characters and Substrings in strings
In many applications, it’s necessary to search for a character or set of characters in a string. For example, a programmer creating a word processor would want to provide capabilities for searching through documents. The application in Fig. 16.5 demonstrates some of the many versions of string methods IndexOf, IndexOfAny, LastIndexOf and LastIndexOfAny, which search for a specified character or substring in a string. We perform all searches in this example on the string letters (initialized with "abcdefghijklmabcdefghijklm") located in method Main of class StringIndexMethods.
Fig. 16.5. Searching for characters and substrings in strings.
Lines 14, 16 and 18 use method IndexOf to locate the first occurrence of a character or substring in a string. If it finds a character, IndexOf returns the index of the specified character in the string; otherwise, IndexOf returns -1. The expression in line 16 uses a version of method IndexOf that takes two arguments: the character to search for and the starting index at which the search of the string should begin. The method does not examine any characters that occur prior to the starting index (in this case, 1). The expression in line 18 uses another version of method IndexOf that takes three arguments: the character to search for, the index at which to start searching and the number of characters to search.
Lines 22, 24 and 26 use method LastIndexOf to locate the last occurrence of a character in a string. Method LastIndexOf performs the search from the end of the string to the beginning of the string. If it finds the character, LastIndexOf returns the index of the specified character in the string; otherwise, LastIndexOf returns -1. There are three versions of method LastIndexOf that search for a character. The expression in line 22 uses the version that takes as an argument the character for which to search. The expression in line 24 uses the version that takes two arguments: the character for which to search and the highest index from which to begin searching backward for the character. The expression in line 26 uses a third version of method LastIndexOf that takes three arguments: the character for which to search, the starting index from which to start searching backward and the number of characters (the portion of the string) to search.
Lines 29–44 use versions of IndexOf and LastIndexOf that take a string instead of a character as the first argument. These versions of the methods perform identically to those described above except that they search for sequences of characters (or substrings) that are specified by their string arguments.
Lines 47–64 use methods IndexOfAny and LastIndexOfAny, which take an array of characters as the first argument. These versions of the methods also perform identically to those described above, except that IndexOfAny returns the index of the first occurrence of any of the characters in the character-array argument, and LastIndexOfAny returns the index of the last such occurrence.
In the overloaded methods LastIndexOf and LastIndexOfAny that take three parameters, the third argument (the number of characters to search) must not exceed the second argument (the starting index) plus one. This might seem counterintuitive, but remember that the search moves from the end of the string toward the start of the string, so the portion searched cannot extend past index 0.
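The searches described above can be sketched on the same letters string. The particular characters, substrings and indices used here are our own choices rather than those of Fig. 16.5, and the class name IndexSearchSketch is assumed. The substring searches pass StringComparison.Ordinal so that the results do not depend on the current culture.

```csharp
using System;

public static class IndexSearchSketch
{
    public static void Main()
    {
        string letters = "abcdefghijklmabcdefghijklm";
        char[] searchLetters = { 'c', 'a', '$' };

        // forward searches for a character
        Console.WriteLine("IndexOf c: " + letters.IndexOf('c'));
        Console.WriteLine("IndexOf a from 1: " + letters.IndexOf('a', 1));
        Console.WriteLine("IndexOf $ from 3 count 5: " + letters.IndexOf('$', 3, 5));

        // backward searches for a character
        Console.WriteLine("LastIndexOf c: " + letters.LastIndexOf('c'));
        Console.WriteLine("LastIndexOf a from 25: " + letters.LastIndexOf('a', 25));
        Console.WriteLine("LastIndexOf $ from 15 count 5: " +
            letters.LastIndexOf('$', 15, 5));

        // substring searches
        Console.WriteLine("IndexOf def: " +
            letters.IndexOf("def", StringComparison.Ordinal));
        Console.WriteLine("LastIndexOf def: " +
            letters.LastIndexOf("def", StringComparison.Ordinal));

        // searches for any character in an array
        Console.WriteLine("IndexOfAny: " + letters.IndexOfAny(searchLetters));
        Console.WriteLine("LastIndexOfAny: " + letters.LastIndexOfAny(searchLetters));
    }
}
```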
16.7 Extracting Substrings from strings
Class string provides two Substring methods, which create a new string by copying part of an existing string. Each method returns a new string. The application in Fig. 16.6 demonstrates the use of both methods.
Fig. 16.6. Substrings generated from strings.
The statement in line 13 uses the Substring method that takes one int argument. The argument specifies the starting index from which the method copies characters in the original string. The substring returned contains a copy of the characters from the starting index to the end of the string. If the index specified in the argument is less than zero or greater than the length of the string, the program throws an ArgumentOutOfRangeException.
The second version of method Substring (line 17) takes two int arguments. The first argument specifies the starting index from which the method copies characters from the original string. The second argument specifies the length of the substring to copy. The substring returned contains a copy of the specified characters from the original string. If the supplied length of the substring is too large (i.e., the substring tries to retrieve characters past the end of the original string), an ArgumentOutOfRangeException is thrown.
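Both Substring versions, and the exception raised when the requested length runs past the end of the string, can be sketched as follows. The sample string, the indices and the class name SubstringSketch are assumptions for illustration.

```csharp
using System;

public static class SubstringSketch
{
    public static void Main()
    {
        string letters = "abcdefghijklmabcdefghijklm"; // 26 characters

        // one argument: from index 20 to the end of the string
        Console.WriteLine("From 20: " + letters.Substring(20));

        // two arguments: six characters starting at index 0
        Console.WriteLine("From 0, length 6: " + letters.Substring(0, 6));

        try
        {
            // index 20 plus length 10 reaches past the last character
            Console.WriteLine(letters.Substring(20, 10));
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("Length 10 from index 20 is out of range");
        }
    }
}
```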
16.8 Concatenating strings
The + operator is not the only way to perform string concatenation. The static method Concat of class string (Fig. 16.7) concatenates two strings and returns a new string containing the combined characters from both original strings. Line 16 appends the characters from string2 to the end of a copy of string1, using method Concat. The statement in line 16 does not modify the original strings.
Fig. 16.7. Concat static method.
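A short sketch of method Concat follows; the two sample strings and the class name ConcatSketch are assumed. It also displays the originals afterward to show that they are unchanged.

```csharp
using System;

public static class ConcatSketch
{
    public static void Main()
    {
        string string1 = "Happy ";
        string string2 = "Birthday";

        // Concat returns a new string; neither argument is modified
        string combined = string.Concat(string1, string2);

        Console.WriteLine("Combined: " + combined);
        Console.WriteLine("string1 is still \"" + string1 + "\"");
        Console.WriteLine("string2 is still \"" + string2 + "\"");
    }
}
```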
16.9 Miscellaneous string Methods
Class string provides several methods that return modified copies of strings. The application in Fig. 16.8 demonstrates the use of these methods, which include string methods Replace, ToLower, ToUpper and Trim.
Fig. 16.8. string methods Replace, ToLower, ToUpper and Trim.
Line 21 uses string method Replace to return a new string, replacing every occurrence in string1 of character 'e' with 'E'. Method Replace takes two arguments: a char for which to search and another char with which to replace all matching occurrences of the first argument. The original string remains unchanged. If there are no occurrences of the first argument in the string, the method returns the original string. An overloaded version of this method allows you to provide two strings as arguments.
The string method ToUpper generates a new string (line 25) that replaces any lowercase letters in string1 with their uppercase equivalents. The method returns a new string containing the converted string; the original string remains unchanged. If there are no characters to convert, the original string is returned. Line 26 uses string method ToLower to return a new string in which any uppercase letters in string2 are replaced by their lowercase equivalents. The original string is unchanged. As with ToUpper, if there are no characters to convert to lowercase, method ToLower returns the original string.
Line 30 uses string method Trim to remove all whitespace characters that appear at the beginning and end of a string. Without otherwise altering the original string, the method returns a new string that contains the string, but omits leading and trailing whitespace characters. This method is particularly useful for retrieving user input (e.g., via a TextBox). Another version of method Trim takes a character array and returns a copy of the string that does not begin or end with any of the characters in the array argument.
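The four methods can be sketched as follows. The sample strings, the class name StringMethodsSketch and the labels are assumed, and this is not the listing of Fig. 16.8. Brackets around the trimmed results make the removed characters visible.

```csharp
using System;

public static class StringMethodsSketch
{
    public static void Main()
    {
        string string1 = "cheers!";
        string string2 = "GOOD BYE ";
        string string3 = "   spaces   ";

        // each call returns a new string; the originals are unchanged
        Console.WriteLine("Replace: " + string1.Replace('e', 'E'));
        Console.WriteLine("ToUpper: " + string1.ToUpper());
        Console.WriteLine("ToLower: [" + string2.ToLower() + "]");
        Console.WriteLine("Trim: [" + string3.Trim() + "]");

        // Trim with explicit characters to strip from both ends
        Console.WriteLine("Trim x: [" + "xxhixx".Trim('x') + "]");

        Console.WriteLine("string1 is still " + string1);
    }
}
```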
16.10 Class StringBuilder
The string class provides many capabilities for processing strings. However, a string’s contents can never change. Operations that seem to concatenate strings are in fact assigning string references to newly created strings (e.g., the += operator creates a new string and assigns the initial string reference to the newly created string).
The next several sections discuss the features of class StringBuilder (namespace System.Text), used to create and manipulate dynamic string information (i.e., mutable strings). Every StringBuilder can store a certain number of characters that’s specified by its capacity. Exceeding the capacity of a StringBuilder causes the capacity to expand to accommodate the additional characters. As we’ll see, members of class StringBuilder, such as methods Append and AppendFormat, can be used for concatenation like the operators + and += for class string. StringBuilder is particularly useful for manipulating in place a large number of strings, as it’s much more efficient than creating individual immutable strings.
Objects of class string are immutable (i.e., constant strings), whereas objects of class StringBuilder are mutable. C# can perform certain optimizations involving strings (such as the sharing of one string among multiple references), because it knows these objects will not change.
Class StringBuilder provides six overloaded constructors. Class StringBuilderConstructor (Fig. 16.9) demonstrates three of these overloaded constructors.
Fig. 16.9. StringBuilder class constructors.
Line 10 employs the no-parameter StringBuilder constructor to create a StringBuilder that contains no characters and has an implementation-specific default initial capacity. Line 11 uses the StringBuilder constructor that takes an int argument to create a StringBuilder that contains no characters and has the initial capacity specified in the int argument (i.e., 10). Line 12 uses the StringBuilder constructor that takes a string argument to create a StringBuilder containing the characters of the string argument. Lines 14–16 implicitly use StringBuilder method ToString to obtain string representations of the StringBuilders’ contents.
16.11 Length and Capacity Properties, EnsureCapacity Method and Indexer of Class StringBuilder
Class StringBuilder provides the Length and Capacity properties to return the number of characters currently in a StringBuilder and the number of characters that a StringBuilder can store without allocating more memory, respectively. These properties also can increase or decrease the length or the capacity of the StringBuilder. Method EnsureCapacity allows you to reduce the number of times that a StringBuilder’s capacity must be increased. The method ensures that the StringBuilder’s capacity is at least the specified value. The program in Fig. 16.10 demonstrates these methods and properties.
Fig. 16.10. StringBuilder size manipulation.
The program contains one StringBuilder, called buffer. Lines 10–11 of the program use the StringBuilder constructor that takes a string argument to instantiate the StringBuilder and initialize its value to "Hello, how are you?". Lines 14–16 output the content, length and capacity of the StringBuilder.
Line 18 expands the capacity of the StringBuilder to a minimum of 75 characters. If new characters are added to a StringBuilder so that its length exceeds its capacity, the capacity grows to accommodate the additional characters in the same manner as if method EnsureCapacity had been called.
Line 23 uses property Length to set the length of the StringBuilder to 10. If the specified length is less than the current number of characters in the StringBuilder, the contents of the StringBuilder are truncated to the specified length. If the specified length is greater than the number of characters currently in the StringBuilder, null characters are appended to the StringBuilder until the total number of characters in the StringBuilder is equal to the specified length.
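The size manipulations described above can be sketched as follows. This is not the listing of Fig. 16.10; the class name StringBuilderSizeSketch and the labels are assumed. The initial capacity is implementation specific, so the sketch displays it without relying on a particular value, and it adds one step beyond the discussion (setting Length to 12) to show the padding with null characters.

```csharp
using System;
using System.Text;

public static class StringBuilderSizeSketch
{
    public static void Main()
    {
        StringBuilder buffer = new StringBuilder("Hello, how are you?");

        Console.WriteLine("Length: " + buffer.Length);
        Console.WriteLine("Capacity: " + buffer.Capacity); // implementation specific

        // request room for at least 75 characters
        buffer.EnsureCapacity(75);
        Console.WriteLine("Capacity at least 75: " + (buffer.Capacity >= 75));

        // a smaller Length truncates the contents
        buffer.Length = 10;
        Console.WriteLine("Truncated: " + buffer);

        // a larger Length pads the contents with null characters
        buffer.Length = 12;
        Console.WriteLine("Padded with null: " + (buffer[11] == '\0'));
    }
}
```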
16.12 Append and AppendFormat Methods of Class StringBuilder
Class StringBuilder provides 19 overloaded Append methods that allow various types of values to be added to the end of a StringBuilder. The Framework Class Library provides versions for each of the simple types and for character arrays, strings and objects. (Remember that method ToString produces a string representation of any object.) Each method takes an argument, converts it to a string and appends it to the StringBuilder. Figure 16.11 demonstrates the use of several Append methods.
Fig. 16.11. Append methods of StringBuilder.
Lines 22–40 use 10 different overloaded Append methods to attach the string representations of objects created in lines 10–18 to the end of the StringBuilder.
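As a sketch of several Append overloads, the following program appends values of different types to one StringBuilder. The variable names, values and class name AppendSketch are assumed; floating-point values are omitted because their text form depends on the current culture. Each Append call returns the same StringBuilder, which allows the calls to be chained.

```csharp
using System;
using System.Text;

public static class AppendSketch
{
    public static void Main()
    {
        object objectValue = "hello";
        string stringValue = "good bye";
        char[] characterArray = { 'a', 'b', 'c', 'd', 'e', 'f' };
        bool booleanValue = true;
        char characterValue = 'Z';
        int integerValue = 7;
        long longValue = 1000000;

        StringBuilder buffer = new StringBuilder();

        // a different overload is selected for each argument type
        buffer.Append(objectValue).Append(' ');
        buffer.Append(stringValue).Append(' ');
        buffer.Append(characterArray).Append(' ');
        buffer.Append(characterArray, 0, 3).Append(' '); // first three characters
        buffer.Append(booleanValue).Append(' ');
        buffer.Append(characterValue).Append(' ');
        buffer.Append(integerValue).Append(' ');
        buffer.Append(longValue);

        Console.WriteLine("buffer = " + buffer);
    }
}
```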
Class StringBuilder also provides method AppendFormat, which converts a string to a specified format, then appends it to the StringBuilder. The example in Fig. 16.12 demonstrates the use of this method.
Fig. 16.12. StringBuilder’s AppendFormat method.
Line 13 creates a string that contains formatting information. The information enclosed in braces specifies how to format a specific piece of data. Formats have the form {X[,Y][:FormatString]}, where X is the number of the argument to be formatted, counting from zero. Y is an optional argument, which can be positive or negative, indicating how many characters should be in the result. If the resulting string has fewer characters than the absolute value of Y, it will be padded with spaces to make up for the difference. A positive integer aligns the string to the right; a negative integer aligns it to the left. The optional FormatString applies a particular format to the argument, such as currency, decimal or scientific, among others. In this case, “{0}” means the first argument will be printed out. “{1:C}” specifies that the second argument will be formatted as a currency value.
Line 22 shows a version of AppendFormat that takes two parameters: a string specifying the format and an array of objects to serve as the arguments to the format string. The argument referred to by “{0}” is in the object array at index 0.
Lines 25–27 define another string used for formatting. The first format, “{0:d3}”, specifies that the first argument will be formatted as a three-digit decimal, meaning that any number having fewer than three digits will have leading zeros placed in front to make up the difference. The next format, “{0, 4}”, specifies that the formatted string should have four characters and be right aligned. The third format, “{0, -4}”, specifies that the string should have four characters and be left aligned.
Line 30 uses a version of AppendFormat that takes two parameters: a string containing a format and an object to which the format is applied. In this case, the object is the number 5. The output of Fig. 16.12 displays the result of applying these two versions of AppendFormat with their respective arguments.
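The two AppendFormat versions can be sketched as follows. The format strings and arguments here are our own, not those of Fig. 16.12, and the class name AppendFormatSketch is assumed. Brackets make the padding produced by the alignment component visible. The currency format is displayed last; its symbol and separators depend on the current culture.

```csharp
using System;
using System.Text;

public static class AppendFormatSketch
{
    public static void Main()
    {
        StringBuilder buffer = new StringBuilder();

        // format string plus an array of objects; {0} is the element at index 0
        object[] objectArray = { "car", 4 };
        buffer.AppendFormat("This {0} has {1} doors.", objectArray);
        Console.WriteLine(buffer);

        // format string plus a single object, formatted three ways
        buffer.Length = 0; // clear the builder for reuse
        buffer.AppendFormat("[{0:d3}] [{0,4}] [{0,-4}]", 5);
        Console.WriteLine(buffer);

        // currency output varies with the current culture
        buffer.Length = 0;
        buffer.AppendFormat("Currency: {0:C}", 1234.56);
        Console.WriteLine(buffer);
    }
}
```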
16.13 Insert, Remove and Replace Methods of Class StringBuilder
Class StringBuilder provides 18 overloaded Insert methods to allow various types of data to be inserted at any position in a StringBuilder. The class provides versions for each of the simple types and for character arrays, strings and objects. Each method takes its second argument, converts it to a string and inserts the string into the StringBuilder in front of the character in the position specified by the first argument. The index specified by the first argument must be greater than or equal to 0 and less than or equal to the length of the StringBuilder (an index equal to the length inserts at the end); otherwise, the program throws an ArgumentOutOfRangeException.
Class StringBuilder also provides method Remove for deleting any portion of a StringBuilder. Method Remove takes two arguments: the index at which to begin deletion and the number of characters to delete. The sum of the starting index and the number of characters to be deleted must not exceed the length of the StringBuilder; otherwise, the program throws an ArgumentOutOfRangeException. The Insert and Remove methods are demonstrated in Fig. 16.13.
Fig. 16.13. StringBuilder text insertion and removal.
Another useful method included with StringBuilder is Replace. Replace searches for a specified string or character and substitutes another string or character in its place. Figure 16.14 demonstrates this method.
Fig. 16.14. StringBuilder text replacement.
Line 18 uses method Replace to replace all instances of "Jane" with "Greg" in builder1. Another overload of this method takes two characters as parameters and replaces each occurrence of the first character with the second. Line 19 uses an overload of Replace that takes four parameters, of which the first two are characters and the second two are ints. The method replaces all instances of the first character with the second character, beginning at the index specified by the first int and continuing for a count specified by the second int. Thus, in this case, Replace looks through only five characters, starting with the character at index 0. As the output illustrates, this version of Replace replaces g with G in the word "good", but not in "greg". This is because the gs in "greg" are not in the range indicated by the int arguments (i.e., between indexes 0 and 4).
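The Insert, Remove and Replace calls discussed in this section can be sketched as follows. The initial contents of the builders, the class name BuilderEditSketch and the labels are assumptions; this is not the code of Fig. 16.13 or Fig. 16.14.

```csharp
using System;
using System.Text;

public static class BuilderEditSketch
{
    public static void Main()
    {
        StringBuilder buffer = new StringBuilder("abcdef");

        buffer.Insert(0, "123");           // insert in front of index 0
        Console.WriteLine("Insert at 0: " + buffer);

        buffer.Insert(buffer.Length, "!"); // an index equal to Length inserts at the end
        Console.WriteLine("Insert at end: " + buffer);

        buffer.Remove(0, 3);               // delete three characters from index 0
        Console.WriteLine("Remove 0,3: " + buffer);

        buffer.Remove(3, buffer.Length - 3); // delete from index 3 through the end
        Console.WriteLine("Remove tail: " + buffer);

        StringBuilder builder1 = new StringBuilder("Happy Birthday Jane");
        StringBuilder builder2 = new StringBuilder("goodbye greg");

        builder1.Replace("Jane", "Greg");  // every occurrence of the substring
        builder2.Replace('g', 'G', 0, 5);  // only within indexes 0 through 4

        Console.WriteLine("builder1: " + builder1);
        Console.WriteLine("builder2: " + builder2);
    }
}
```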
16.14 Char Methods
C# provides a concept called a struct (short for “structure”) that’s similar to a class. Although structs and classes are comparable, structs represent value types. Like classes, structs can have methods and properties, and can use the access modifiers public and private. Also, struct members are accessed via the member access operator (.).
The simple types are actually aliases for struct types. For instance, an int is defined by struct System.Int32, a long by System.Int64 and so on. All struct types derive from class ValueType, which derives from object. Also, all struct types are implicitly sealed, so they do not support virtual or abstract methods, and their members cannot be declared protected or protected internal.
In the struct Char, which is the struct for characters, most methods are static, take at least one character argument and perform either a test or a manipulation on the character. We present several of these methods in the next example. Figure 16.15 demonstrates static methods that test characters to determine whether they’re of a specific character type and static methods that perform case conversions on characters.
Fig. 16.15. Char’s static character-testing and case-conversion methods.
After the user enters a character, lines 13–27 analyze it. Line 13 uses Char
method IsDigit
to determine whether character
is defined as a digit. If so, the method returns true
; otherwise, it returns false
(note again that bool
values are output capitalized). Line 14 uses Char
method IsLetter
to determine whether character character
is a letter. Line 16 uses Char
method IsLetterOrDigit
to determine whether character character
is a letter or a digit.
Line 18 uses Char
method IsLower
to determine whether character character
is a lowercase letter. Line 20 uses Char
method IsUpper
to determine whether character character
is an uppercase letter. Line 22 uses Char
method ToUpper
to convert character character
to its uppercase equivalent. The method returns the converted character if the character has an uppercase equivalent; otherwise, the method returns its original argument. Line 24 uses Char
method ToLower
to convert character character
to its lowercase equivalent. The method returns the converted character if the character has a lowercase equivalent; otherwise, the method returns its original argument.
Line 26 uses Char
method IsPunctuation
to determine whether character
is a punctuation mark, such as "!"
, ":"
or ")"
. Line 27 uses Char
method IsSymbol
to determine whether character character
is a symbol, such as "+"
, "="
or "^"
.
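The character tests and case conversions discussed above can be tried in a small console sketch. This is not the figure's listing; the helper Describe and its output format are assumptions made for illustration.

```csharp
using System;

public static class CharTests
{
    // Builds a one-line summary of several static Char tests for a character.
    public static string Describe(char character)
    {
        return string.Format(
            "digit: {0}, letter: {1}, lower: {2}, upper: {3}, punctuation: {4}, symbol: {5}",
            Char.IsDigit(character), Char.IsLetter(character),
            Char.IsLower(character), Char.IsUpper(character),
            Char.IsPunctuation(character), Char.IsSymbol(character));
    }

    public static void Main()
    {
        Console.WriteLine(Describe('A'));
        Console.WriteLine(Describe('+'));
        Console.WriteLine(Char.ToUpper('a')); // A
        Console.WriteLine(Char.ToLower('8')); // 8 (no lowercase equivalent, so unchanged)
    }
}
```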
Structure type Char
also contains other methods not shown in this example. Many of the static
methods are similar—for instance, IsWhiteSpace
is used to determine whether a certain character is a whitespace character (e.g., newline, tab or space). The struct also contains several public
instance methods; many of these, such as methods ToString
and Equals
, are methods that we have seen before in other classes. This group includes method CompareTo
, which is used to compare two character values with one another.
Regular Expressions
We now introduce regular expressions, specially formatted strings used to find patterns in text. They can be used to ensure that data is in a particular format. For example, a U.S. zip code must consist of five digits, or five digits followed by a dash followed by four more digits. Compilers also use regular expressions, typically during lexical analysis, to recognize tokens such as identifiers and numeric literals; text that matches none of the token patterns is reported as an error. We discuss classes Regex
and Match
from the System.Text.RegularExpressions
namespace as well as the symbols used to form regular expressions. We then demonstrate how to find patterns in a string, match entire strings to patterns, replace characters in a string that match a pattern and split strings at delimiters specified as a pattern in a regular expression.
Simple Regular Expressions and Class Regex
The .NET Framework provides several classes to help developers manipulate regular expressions. Figure 16.16 demonstrates the basic regular-expression classes. To use these classes, add a using directive for the namespace System.Text.RegularExpressions
(line 4). Class Regex
represents a regular expression. We create a Regex
object named expression
(line 16) to represent the regular expression "e"
. This regular expression matches the literal character "e"
anywhere in an arbitrary string
. Regex
method Match
returns an object of class Match
that represents a single regular-expression match. Class Match
’s ToString
method returns the substring that matched the regular expression. The call to method Match
(line 17) matches the leftmost occurrence of the character "e"
in testString
. Class Regex
also provides method Matches
(line 21), which finds all matches of the regular expression in an arbitrary string
and returns a MatchCollection
object containing all the Match
es. A MatchCollection
is a collection, similar to an array
, and can be used with a foreach
statement to iterate through the collection’s elements. We introduced collections in Chapter 9 and discuss them in more detail in Chapter 23, Collections. We use a foreach
statement (lines 21–22) to display all the matches to expression
in testString
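. The elements in the MatchCollection are Match objects. If the iteration variable myMatch is declared with var, newer versions of .NET infer it to be of type Match; in the .NET Framework, where MatchCollection implements only the nongeneric IEnumerable interface, var infers type object, so declare myMatch as Match explicitly when you need Match's members. For each Match, line 22 outputs the text that matched the regular expression.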
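As a rough sketch of the two calls just described, the following console program finds the leftmost match and then all matches of the pattern "e". The sample text and the helper names FirstMatch and AllMatches are assumptions for illustration, not the contents of Fig. 16.16.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BasicRegexDemo
{
    // Returns the text of the leftmost match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        Regex expression = new Regex(pattern);
        return expression.Match(input).ToString();
    }

    // Returns the text of every match, in the order found.
    public static List<string> AllMatches(string input, string pattern)
    {
        Regex expression = new Regex(pattern);
        List<string> results = new List<string>();

        // Declaring the iteration variable as Match works on every .NET version.
        foreach (Match myMatch in expression.Matches(input))
            results.Add(myMatch.ToString());

        return results;
    }

    public static void Main()
    {
        string testString = "see the trees"; // hypothetical sample text
        Console.WriteLine(FirstMatch(testString, "e"));       // e
        Console.WriteLine(AllMatches(testString, "e").Count); // 5
    }
}
```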
Fig. 16.16. Demonstrating basic regular expressions.
Regular expressions can also be used to match a sequence of literal characters anywhere in a string
. Lines 27–28 display all the occurrences of the character sequence "regex"
in testString
. Here we use the Regex static
method Matches
. Class Regex
provides static
versions of both methods Match
and Matches
. The static
versions take a regular expression as an argument in addition to the string
to be searched. This is useful when you want to use a regular expression only once. The call to method Matches
(line 27) returns two matches to the regular expression "regex"
. Notice that "regexp"
in the testString
matches the regular expression "regex"
, but the "p"
is excluded. We use the regular expression "regexp?"
(line 34) to match occurrences of both "regex"
and "regexp"
. The question mark (?
) is a metacharacter—a character with special meaning in a regular expression. More specifically, the question mark is a quantifier—a metacharacter that describes how many times a part of the pattern may occur in a match. The ?
quantifier matches zero or one occurrence of the pattern to its left. In line 34, we apply the ?
quantifier to the character "p"
. This means that a match to the regular expression contains the sequence of characters "regex"
and may be followed by a "p"
. Notice that the foreach
statement (lines 34–35) displays both "regex"
and "regexp"
.
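The effect of the ? quantifier can be seen by running both patterns over the same text with the static Matches method. The sample sentence and the helper Find are assumptions made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OptionalCharDemo
{
    // Collects the text of every match using the static Regex.Matches method.
    public static List<string> Find(string input, string pattern)
    {
        List<string> results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        string testString = "some say regex, others say regexp"; // hypothetical sample text

        // Without the quantifier, the trailing "p" is never part of a match.
        Console.WriteLine(string.Join(" ", Find(testString, "regex")));   // regex regex

        // With "p?", the "p" is included whenever it is present.
        Console.WriteLine(string.Join(" ", Find(testString, "regexp?"))); // regex regexp
    }
}
```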
Metacharacters allow you to create more complex patterns. The "|"
(alternation) metacharacter matches the expression to its left or to its right. We use alternation in the regular expression "(c|h)at"
(line 38) to match either "cat"
or "hat"
. Parentheses are used to group parts of a regular expression, much as you group parts of a mathematical expression. The "|"
causes the pattern to match a sequence of characters starting with either "c"
or "h"
, followed by "at"
. The "|"
character attempts to match the entire expression to its left or to its right. If we didn’t use the parentheses around "c|h"
, the regular expression would match either the single character "c"
or the sequence of characters "hat"
. Line 41 uses the regular expression (line 38) to search the string
s "hat cat"
and "cat hat"
. Notice in the output that the first match in "hat cat"
is "hat"
, while the first match in "cat hat"
is "cat"
. Alternation chooses the leftmost match in the string
for either of the alternating expressions—the order of the expressions doesn’t matter.
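The following sketch illustrates both points about alternation: the leftmost match wins regardless of the order of the alternatives, and removing the parentheses changes what the "|" applies to. The helper First is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Returns the text of the leftmost match of pattern in input.
    public static string First(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // With grouping, the alternatives are "c" and "h", each followed by "at".
        Console.WriteLine(First("hat cat", "(c|h)at")); // hat
        Console.WriteLine(First("cat hat", "(c|h)at")); // cat

        // Without grouping, the alternatives are "c" alone and "hat".
        Console.WriteLine(First("cat hat", "c|hat"));   // c
    }
}
```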
The table in Fig. 16.17 lists some character classes that can be used with regular expressions. A character class represents a group of characters that might appear in a string
. For example, a word character (\w) is any alphanumeric character (a-z, A-Z and 0-9) or underscore. A whitespace character (\s) is a space, a tab, a carriage return, a newline or a form feed. A digit (\d) is any numeric character.
Fig. 16.17. Character classes.
Figure 16.18 uses character classes in regular expressions. For this example, we use method DisplayMatches
(lines 53–59) to display all matches to a regular expression. Method DisplayMatches
takes two string
s representing the string
to search and the regular expression to match. The method uses a foreach
statement to display each Match
in the MatchCollection
object returned by the static
method Matches
of class Regex
.
Fig. 16.18. Demonstrating using character classes and quantifiers.
The first regular expression (line 15) matches digits in the testString
. We use the digit character class (\d) to match any digit (0–9). We precede the regular expression string with @. Recall that backslashes within the double quotation marks following the @ character are regular backslash characters, not the beginning of escape sequences. To define the regular expression without prefixing @ to the string, you would need to escape every backslash character, as in
"\\d"
which makes the regular expression more difficult to read.
The output shows that the regular expression matches 1
, 2
, and 3
in the testString
. You can also match anything that isn’t a member of a particular character class using an uppercase instead of a lowercase letter. For example, the regular expression "\D"
(line 19) matches any character that isn’t a digit. Notice in the output that this includes punctuation and whitespace. Negating a character class matches everything that isn’t a member of the character class.
The next regular expression (line 23) uses the character class \w
to match any word character in the testString
. Notice that each match consists of a single character. It would be useful to match a sequence of word characters rather than a single character. The regular expression in line 28 uses the +
quantifier to match a sequence of word characters. The +
quantifier matches one or more occurrences of the pattern to its left. There are three matches for this expression, each three characters long. Quantifiers are greedy—they match the longest possible occurrence of the pattern. You can follow a quantifier with a question mark (?
) to make it lazy, so that it matches the shortest possible occurrence of the pattern. The regular expression "\w+?"
(line 33) uses a lazy +
quantifier to match the shortest sequence of word characters possible. This produces nine matches of length one instead of three matches of length three. Figure 16.19 lists other quantifiers that you can place after a pattern in a regular expression, and the purpose of each.
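A compact sketch of the greedy and lazy behavior follows. The sample text "abc, DEF, 123" is an assumption chosen to agree with the match counts quoted above, and the helpers only approximate the DisplayMatches method described earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Counts the matches of pattern in input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    // Writes every match on one line, each wrapped in square brackets.
    public static void DisplayMatches(string input, string pattern)
    {
        foreach (Match m in Regex.Matches(input, pattern))
            Console.Write("[{0}] ", m.Value);
        Console.WriteLine();
    }

    public static void Main()
    {
        string testString = "abc, DEF, 123"; // assumed sample text

        DisplayMatches(testString, @"\d");   // three single digits
        DisplayMatches(testString, @"\w+");  // greedy: abc, DEF and 123
        DisplayMatches(testString, @"\w+?"); // lazy: nine one-character matches
        Console.WriteLine(CountMatches(testString, @"\D")); // 10
    }
}
```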
Fig. 16.19. Quantifiers used in regular expressions.
Regular expressions are not limited to the character classes in Fig. 16.17. You can create your own character class by listing the members of the character class between square brackets, [
and ]
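. [Note: Most metacharacters in square brackets are treated as literal characters; the exceptions are the backslash, the closing bracket and, depending on position, the "-" and "^" characters described next.] You can include a range of characters using the "-"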
character. The regular expression in line 37 of Fig. 16.18 creates a character class to match any lowercase letter from a
to f
. These custom character classes match a single character that’s a member of the class. The output shows three matches, a
, b
and c
. Notice that D
, E
and F
don’t match the character class [a-f]
because they’re uppercase. You can negate a custom character class by placing a "^"
character after the opening square bracket. The regular expression in line 41 matches any character that isn’t in the range a-f
. As with the predefined character classes, negating a custom character class matches everything that isn’t a member, including punctuation and whitespace. You can also use quantifiers with custom character classes. The regular expression in line 45 uses a character class with two ranges of characters, a-z
and A-Z
, and the +
quantifier to match a sequence of lowercase or uppercase letters. You can also use the "."
(dot) character to match any character other than a newline. The regular expression ".*"
(line 49) matches any sequence of characters. The *
quantifier matches zero or more occurrences of the pattern to its left. Unlike the +
quantifier, the *
quantifier can be used to match an empty string.
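The custom character classes discussed above can be exercised with a short sketch. As before, the sample text "abc, DEF, 123" and the helper JoinMatches are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CustomClassDemo
{
    // Joins the text of all matches with '|' so the result is easy to read.
    public static string JoinMatches(string input, string pattern)
    {
        List<string> parts = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            parts.Add(m.Value);
        return string.Join("|", parts);
    }

    public static void Main()
    {
        string testString = "abc, DEF, 123"; // assumed sample text

        Console.WriteLine(JoinMatches(testString, "[a-f]"));           // a|b|c
        Console.WriteLine(JoinMatches(testString, "[a-zA-Z]+"));       // abc|DEF
        Console.WriteLine(Regex.Matches(testString, "[^a-f]").Count);  // 10

        // The * quantifier can match an empty string; the + quantifier cannot.
        Console.WriteLine(Regex.Match("", ".*").Success); // True
        Console.WriteLine(Regex.Match("", ".+").Success); // False
    }
}
```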
Complex Regular Expressions
The program of Fig. 16.20 tries to match birthdays to a regular expression. For demonstration purposes, the expression matches only birthdays that do not occur in April and that belong to people whose names begin with "J"
. We can do this by combining the basic regular-expression techniques we’ve already discussed.
Fig. 16.20. A more complex regular expression.
Line 11 creates a Regex
object and passes a regular-expression pattern string
to its constructor. The first character in the regular expression, "J"
, is a literal character. Any string
matching this regular expression must start with "J"
. The next part of the regular expression (".*"
) matches any number of unspecified characters except newlines. The pattern "J.*"
matches a person’s name that starts with J and any characters that may come after that.
Next we match the person’s birthday. We use the \d character class to match the first digit of the month. Since the birthday must not occur in April, the second digit in the month can’t be 4. We could use the character class "[0-35-9]" to match any digit other than 4. However, .NET regular expressions also let you remove members from a character class, a feature called character-class subtraction. In line 11, we use the pattern "[\d-[4]]" to match any digit other than 4. When the "-" character in a character class is followed by a character class instead of a literal character, the "-" is interpreted as subtraction instead of a range of characters. The members of the character class following the "-" are removed from the character class preceding the "-". When using character-class subtraction, the class being subtracted ([4]) must be the last item in the enclosing brackets ([\d-[4]]). This notation allows you to write shorter, easier-to-read regular expressions.
Although the "-"
character indicates a range or character-class subtraction when it’s enclosed in square brackets, instances of the "-"
character outside a character class are treated as literal characters. Thus, the regular expression in line 11 searches for a string
that starts with the letter "J"
, followed by any number of characters, followed by a two-digit number (of which the second digit cannot be 4
), followed by a dash, another two-digit number, a dash and another two-digit number.
Lines 20–21 use a foreach
statement to iterate through the MatchCollection
object returned by method Matches
, which received testString
as an argument. For each Match
, line 21 outputs the text that matched the regular expression. The output in Fig. 16.20 displays the two matches that were found in testString
. Notice that both matches conform to the pattern specified by the regular expression.
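The pattern described above can be reconstructed roughly as follows. Both the exact pattern string and the sample text are assumptions derived from the prose description, since the listing in Fig. 16.20 is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BirthdayDemo
{
    // Finds entries for names starting with "J" whose month's second digit is not 4.
    public static List<string> FindBirthdays(string text)
    {
        // [\d-[4]] uses character-class subtraction: any digit except 4.
        Regex expression = new Regex(@"J.*\d[\d-[4]]-\d\d-\d\d");

        List<string> results = new List<string>();
        foreach (Match m in expression.Matches(text))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample text; '.' does not match '\n', so each line is matched separately.
        string testString =
            "Jane's birthday is 05-12-75\n" +
            "Dave's birthday is 11-04-68\n" +
            "John's birthday is 04-28-73\n" +
            "Joe's birthday is 12-17-77";

        foreach (string birthday in FindBirthdays(testString))
            Console.WriteLine(birthday); // prints the Jane and Joe lines only
    }
}
```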
Validating User Input with Regular Expressions and LINQ
The application in Fig. 16.21 presents a more involved example that uses regular expressions to validate name, address and telephone-number information input by a user.
Fig. 16.21. Validating user information using regular expressions.
When a user clicks OK
, the program uses a LINQ query to select any empty TextBox
es (lines 22–27) from the Controls
collection. Notice that we explicitly declare the type of the range variable in the from
clause (line 22). When working with nongeneric collections, such as Controls
, you must explicitly type the range variable. The first where
clause (line 23) determines whether the currentControl
is a TextBox
. The let
clause (line 24) creates and initializes a variable in a LINQ query for use later in the query. Here, we use the let
clause to define variable box
as a TextBox
, which contains the Control
object cast to a TextBox
. This allows us to use the control in the LINQ query as a TextBox
, enabling access to its properties (such as Text
). You may include a second where
clause after the let
clause. The second where
clause determines whether the TextBox
’s Text
property is empty. If one or more TextBox
es are empty (line 30), the program displays a message to the user (lines 33–35) that all fields must be filled in before the program can validate the information. Line 37 calls the Select
method of the first TextBox
in the query result so that the user can begin typing in that TextBox
. The query sorted the TextBox
es by TabIndex
(line 26) so the first TextBox
in the query result is the first empty TextBox
on the Form
. If there are no empty fields, lines 39–71 validate the user input.
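The shape of the query described above can be sketched without a Windows Forms project. In this illustration, an ArrayList stands in for the nongeneric Controls collection and the hypothetical class Field stands in for TextBox; neither comes from Fig. 16.21.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class EmptyFieldQuery
{
    // Hypothetical stand-in for a TextBox: a tab position plus the text typed so far.
    public class Field
    {
        public int TabIndex { get; set; }
        public string Text { get; set; }
    }

    // Returns the TabIndex of each empty Field, lowest first.
    public static List<int> EmptyFieldIndexes(ArrayList controls)
    {
        var emptyFields =
            from object currentControl in controls // nongeneric source, so the type is explicit
            where currentControl is Field          // first where: keep only Field objects
            let box = currentControl as Field      // let: typed variable for later clauses
            where string.IsNullOrEmpty(box.Text)   // second where: keep only empty fields
            orderby box.TabIndex
            select box.TabIndex;

        return emptyFields.ToList();
    }

    public static void Main()
    {
        ArrayList controls = new ArrayList();
        controls.Add("a label"); // not a Field, so the first where clause filters it out
        controls.Add(new Field { TabIndex = 2, Text = "" });
        controls.Add(new Field { TabIndex = 1, Text = "Smith" });
        controls.Add(new Field { TabIndex = 0, Text = "" });

        Console.WriteLine(string.Join(", ", EmptyFieldIndexes(controls))); // 0, 2
    }
}
```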
We call method ValidateInput
to determine whether the user input matches the specified regular expressions. ValidateInput
(lines 83–98) takes as arguments the text input by the user (input
), the regular expression the input must match (expression
) and a message to display if the input is invalid (message
). Line 87 calls Regex static
method Match
, passing both the string
to validate and the regular expression as arguments. The Success
property of class Match
indicates whether method Match
’s first argument matched the pattern specified by the regular expression in the second argument. If the value of Success
is false
(i.e., there was no match), lines 93–94 display the error message passed as an argument to method ValidateInput
. Line 97 then returns the value of the Success
property. If ValidateInput
returns false
, the TextBox
containing invalid data is selected so the user can correct the input. If all input is valid, the else branch (lines 72–78) displays a message dialog stating that all input is valid, and the program terminates when the user dismisses the dialog.
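A console-based sketch of a method like ValidateInput follows. It assumes the three parameters named in the description and writes the error message to the console in place of the message dialog used by the application.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputValidator
{
    // Returns true if input matches expression; otherwise reports message and returns false.
    public static bool ValidateInput(string input, string expression, string message)
    {
        // Success tells us whether the static Match method found a match.
        bool valid = Regex.Match(input, expression).Success;

        if (!valid)
            Console.WriteLine(message); // stand-in for the application's message dialog

        return valid;
    }

    public static void Main()
    {
        Console.WriteLine(ValidateInput("12345", @"^\d{5}$", "Invalid zip code")); // True
        Console.WriteLine(ValidateInput("1234", @"^\d{5}$", "Invalid zip code"));  // message, then False
    }
}
```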
In the previous example, we searched a string
for substrings that matched a regular expression. In this example, we want to ensure that the entire string
for each input conforms to a particular regular expression. For example, we want to accept "Smith"
as a last name, but not "9@Smith#"
. In a regular expression that begins with a "^"
character and ends with a "$"
character (e.g., line 43), the characters "^"
and "$"
represent the beginning and end of a string
, respectively. These characters force a regular expression to return a match only if the entire string
being processed matches the regular expression.
The regular expressions in lines 43 and 47 use a character class to match an uppercase first letter followed by letters of any case—a-z
matches any lowercase letter, and A-Z
matches any uppercase letter. The *
quantifier signifies that the second range of characters may occur zero or more times in the string
. Thus, this expression matches any string
consisting of one uppercase letter, followed by zero or more additional letters.
The \s character class matches a single whitespace character (lines 51, 56 and 60). In the expression "\d{5}", used for the zipCode string (line 64), {5} is a quantifier (see Fig. 16.19). The pattern to the left of {n} must occur exactly n times. Thus "\d{5}" matches any five digits. Recall that the character "|"
(lines 51, 56 and 60) matches the expression to its left or the expression to its right. In line 51, we use the character "|"
to indicate that the address can contain a word of one or more characters or a word of one or more characters followed by a space and another word of one or more characters. Note the use of parentheses to group parts of the regular expression. This ensures that "|"
is applied to the correct parts of the pattern.
The Last Name: and First Name: TextBox
es each accept string
s that begin with an uppercase letter (lines 43 and 47). The regular expression for the Address: TextBox
(line 51) matches a number of at least one digit, followed by a space and then either one or more letters or else one or more letters followed by a space and another series of one or more letters. Therefore, "10 Broadway"
and "10 Main Street"
are both valid addresses. As currently formed, the regular expression in line 51 doesn’t match an address that does not start with a number, or that has more than two words. The regular expressions for the City: (line 56) and State: (line 60) TextBox
es match any word of at least one character or, alternatively, any two words of at least one character if the words are separated by a single space. This means both Waltham
and West Newton
would match. Again, these regular expressions would not accept names that have more than two words. The regular expression for the Zip code: TextBox
(line 64) ensures that the zip code is a five-digit number. The regular expression for the Phone: TextBox
(line 68) indicates that the phone number must be of the form xxx-yyy-yyyy
, where the x
s represent the area code and the y
s the number. The first x
and the first y
cannot be zero, as specified by the range [1-9]
in each case.
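The validation rules described above can be approximated with the patterns below. These pattern strings are reconstructions from the prose, not copies of the expressions in Fig. 16.21, and the helper IsValid is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldPatterns
{
    // Reconstructed patterns; ^ and $ force the whole string to match.
    public const string Name = @"^[A-Z][a-zA-Z]*$";
    public const string Address = @"^[0-9]+\s([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$";
    public const string City = @"^([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$";
    public const string ZipCode = @"^\d{5}$";
    public const string Phone = @"^[1-9]\d{2}-[1-9]\d{2}-\d{4}$";

    // Returns true if the entire input conforms to the pattern.
    public static bool IsValid(string input, string pattern)
    {
        return Regex.Match(input, pattern).Success;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("Smith", Name));              // True
        Console.WriteLine(IsValid("9@Smith#", Name));           // False
        Console.WriteLine(IsValid("10 Main Street", Address));  // True
        Console.WriteLine(IsValid("Main Street", Address));     // False: no leading number
        Console.WriteLine(IsValid("West Newton", City));        // True
        Console.WriteLine(IsValid("055-123-4567", Phone));      // False: area code starts with 0
    }
}
```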
Regex Methods Replace and Split
Sometimes it’s useful to replace parts of one string
with another or to split a string
according to a regular expression. For this purpose, class Regex
provides static
and instance versions of methods Replace
and Split
, which are demonstrated in Fig. 16.22.
Fig. 16.22. Using Regex
methods Replace
and Split
.
Regex
method Replace
replaces text in a string
with new text wherever the original string
matches a regular expression. We use two versions of this method in Fig. 16.22. The first version (line 18) is a static
method and takes three parameters—the string
to modify, the string
containing the regular expression to match and the replacement string
. Here, Replace
replaces every instance of "*"
in testString1
with "^"
. Notice that the regular expression ("\*") precedes character * with a backslash (\). Normally, * is a quantifier indicating that a regular expression should match any number of occurrences of a preceding pattern. However, in line 18, we want to find all occurrences of the literal character *; to do this, we must escape character * with character \. By escaping a special regular-expression character, we tell the regular-expression matching engine to find the actual character * rather than use it as a quantifier.
The second version of method Replace
(line 34) is an instance method that uses the regular expression passed to the constructor for testRegex1
(line 12) to perform the replacement operation. Line 12 instantiates testRegex1
with argument @"\d". The call to instance method Replace in line 34 takes three arguments: a string to modify, a string containing the replacement text and an integer specifying the number of replacements to make. In this case, line 34 replaces the first three instances of a digit ("\d"
) in testString2
with the text "digit"
.
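Both versions of Replace can be sketched in a few lines. The sample strings and helper names here are assumptions for illustration and may differ from those in Fig. 16.22.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexReplaceDemo
{
    // Static version: replaces every literal asterisk with a caret.
    public static string StarsToCarets(string input)
    {
        return Regex.Replace(input, @"\*", "^");
    }

    // Instance version: replaces only the first count digits with the word "digit".
    public static string ReplaceFirstDigits(string input, int count)
    {
        Regex digitRegex = new Regex(@"\d");
        return digitRegex.Replace(input, "digit", count);
    }

    public static void Main()
    {
        Console.WriteLine(StarsToCarets("This sentence ends in 5 stars *****"));
        // This sentence ends in 5 stars ^^^^^

        Console.WriteLine(ReplaceFirstDigits("1, 2, 3, 4, 5, 6, 7, 8", 3));
        // digit, digit, digit, 4, 5, 6, 7, 8
    }
}
```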
Method Split
divides a string
into several substrings. The original string
is broken at delimiters that match a specified regular expression. Method Split
returns an array
containing the substrings. In line 39, we use static
method Split
to separate a string
of comma-separated integers. The first argument is the string
to split; the second argument is the regular expression that represents the delimiter. The regular expression ",\s" separates the substrings at each comma. By also matching the whitespace character that follows each comma (\s
in the regular expression), we eliminate the extra spaces from the resulting substrings.
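A minimal sketch of the static Split call follows, assuming a sample string of comma-separated integers like the one described. The helper name SplitNumbers is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // Splits at each comma followed by one whitespace character.
    public static string[] SplitNumbers(string input)
    {
        return Regex.Split(input, @",\s");
    }

    public static void Main()
    {
        string[] result = SplitNumbers("1, 2, 3, 4, 5, 6, 7, 8"); // assumed sample text

        // Because the delimiter includes the space, no substring has leading whitespace.
        foreach (string item in result)
            Console.Write("\"{0}\" ", item);
        Console.WriteLine();
    }
}
```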
Wrap-Up
In this chapter, you learned about the Framework Class Library’s string- and character-processing capabilities. We reviewed the fundamentals of characters and strings. You saw how to determine the length of strings, copy strings, access the individual characters in strings, search strings, obtain substrings from larger strings, compare strings, concatenate strings, replace characters in strings and convert strings to uppercase or lowercase letters.
We showed how to use class StringBuilder
to build strings dynamically. You learned how to determine and specify the size of a StringBuilder
object, and how to append, insert, remove and replace characters in a StringBuilder
object. We then introduced the character-testing methods of type Char
that enable a program to determine whether a character is a digit, a letter, a lowercase letter, an uppercase letter, a punctuation mark or a symbol other than a punctuation mark, and the methods for converting a character to uppercase or lowercase.
Finally, we discussed classes Regex
, Match
and MatchCollection
from namespace System.Text.RegularExpressions
and the symbols that are used to form regular expressions. You learned how to find patterns in a string
and match entire string
s to patterns with Regex
methods Match
and Matches
, how to replace characters in a string
with Regex
method Replace
and how to split string
s at delimiters with Regex
method Split
. In the next chapter, you’ll learn how to read data from and write data to files.