16. Strings and Characters

Objectives

In this chapter you’ll learn:

• To create and manipulate immutable character-string objects of class string and mutable character-string objects of class StringBuilder.

• To manipulate character objects of struct Char.

• To use regular-expression classes Regex and Match.

• To iterate through matches to a regular expression.

• To use character classes to match any character from a set of characters.

• To use quantifiers to match a pattern multiple times.

• To search for patterns in text using regular expressions.

• To validate data using regular expressions and LINQ.

• To modify strings using regular expressions and class Regex.

The chief defect of Henry King Was chewing little bits of string.

Hilaire Belloc

The difference between the almost-right word and the right word is really a large matter—it’s the difference between the lightning bug and the lightning.

Mark Twain

Outline

16.1 Introduction

16.2 Fundamentals of Characters and Strings

16.3 string Constructors

16.4 string Indexer, Length Property and CopyTo Method

16.5 Comparing strings

16.6 Locating Characters and Substrings in strings

16.7 Extracting Substrings from strings

16.8 Concatenating strings

16.9 Miscellaneous string Methods

16.10 Class StringBuilder

16.11 Length and Capacity Properties, EnsureCapacity Method and Indexer of Class StringBuilder

16.12 Append and AppendFormat Methods of Class StringBuilder

16.13 Insert, Remove and Replace Methods of Class StringBuilder

16.14 Char Methods

16.15 Regular Expressions

16.15.1 Simple Regular Expressions and Class Regex

16.15.2 Complex Regular Expressions

16.15.3 Validating User Input with Regular Expressions and LINQ

16.15.4 Regex Methods Replace and Split

16.16 Wrap-Up

16.1 Introduction

This chapter introduces the .NET Framework Class Library’s string- and character-processing capabilities and demonstrates how to use regular expressions to search for patterns in text. The techniques it presents can be employed in text editors, word processors, page-layout software, computerized typesetting systems and other kinds of text-processing software. Previous chapters presented some basic string-processing capabilities. Now we discuss in detail the text-processing capabilities of class string and type char from the System namespace and class StringBuilder from the System.Text namespace.

We begin with an overview of the fundamentals of characters and strings in which we discuss character constants and string literals. We then provide examples of class string’s many constructors and methods. The examples demonstrate how to determine the length of strings, copy strings, access individual characters in strings, search strings, obtain substrings from larger strings, compare strings, concatenate strings, replace characters in strings and convert strings to uppercase or lowercase letters.

Next, we introduce class StringBuilder, which is used to build strings dynamically. We demonstrate StringBuilder capabilities for determining and specifying the size of a StringBuilder, as well as appending, inserting, removing and replacing characters in a StringBuilder object. We then introduce the character-testing methods of struct Char that enable a program to determine whether a character is a digit, a letter, a lowercase letter, an uppercase letter, a punctuation mark or a symbol other than a punctuation mark. Such methods are useful for validating individual characters in user input. In addition, type Char provides methods for converting a character to uppercase or lowercase.

Finally, we discuss regular expressions. We present classes Regex and Match from the System.Text.RegularExpressions namespace as well as the symbols that are used to form regular expressions. We then demonstrate how to find patterns in a string, match entire strings to patterns, replace characters in a string that match a pattern and split strings at delimiters specified as a pattern in a regular expression.

16.2 Fundamentals of Characters and Strings

Characters are the fundamental building blocks of C# source code. Every program is composed of characters that, when grouped together meaningfully, create a sequence that the compiler interprets as instructions describing how to accomplish a task. In addition to normal characters, a program also can contain character constants. A character constant is a character that’s represented as an integer value, called a character code. For example, the integer value 122 corresponds to the character constant 'z'. The integer value 10 corresponds to the newline character '\n'. Character constants are established according to the Unicode character set, an international character set that contains many more symbols and letters than does the ASCII character set (listed in Appendix C). To learn more about Unicode, see Appendix F.

A string is a series of characters treated as a unit. These characters can be uppercase letters, lowercase letters, digits and various special characters: +, -, *, /, $ and others. A string is an object of class string in the System namespace. We write string literals, also called string constants, as sequences of characters in double quotation marks, as follows:

"John Q. Doe"
"9999 Main Street"
"Waltham, Massachusetts"
"(201) 555-1212"

A declaration can assign a string literal to a string reference. The declaration

string color = "blue";

initializes string reference color to refer to the string literal object "blue".

Performance Tip 16.1

If there are multiple occurrences of the same string literal object in an application, a single copy of it will be referenced from each location in the program that uses that string literal. It’s possible to share the object in this manner, because string literal objects are implicitly constant. Such sharing conserves memory.

On occasion, a string will contain multiple backslash characters (this often occurs in the name of a file). To avoid excessive backslash characters, it’s possible to exclude escape sequences and interpret all the characters in a string literally, using the @ character. Backslashes within the double quotation marks following the @ character are not considered escape sequences, but rather regular backslash characters. Often this simplifies programming and makes the code easier to read. For example, consider the file name C:\MyFolder\MySubFolder\MyFile.txt. In a regular string literal, each backslash must be written as the escape sequence \\, as in the following assignment:

string file = "C:\\MyFolder\\MySubFolder\\MyFile.txt";

Using the verbatim string syntax, the assignment can be altered to

string file = @"C:\MyFolder\MySubFolder\MyFile.txt";

This approach also has the advantage of allowing string literals to span multiple lines by preserving all newlines, spaces and tabs.
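The following is a minimal sketch, not a listing from the chapter, that checks that the escaped and verbatim forms above denote the same string and that a verbatim literal can span lines. The class and method names (VerbatimStringSketch, EscapedPath, VerbatimPath) are hypothetical names chosen for illustration.

```csharp
using System;

public static class VerbatimStringSketch
{
    // Regular literal: every backslash is written as the escape sequence \\.
    public static string EscapedPath()
    {
        return "C:\\MyFolder\\MySubFolder\\MyFile.txt";
    }

    // Verbatim literal: backslashes are ordinary characters.
    public static string VerbatimPath()
    {
        return @"C:\MyFolder\MySubFolder\MyFile.txt";
    }

    public static void Main()
    {
        // Both literals produce the same sequence of characters.
        Console.WriteLine(EscapedPath() == VerbatimPath()); // True

        // A verbatim literal may span lines; the line break is part of the string.
        string twoLines = @"first line
second line";
        Console.WriteLine(twoLines.Contains("\n")); // True
    }
}
```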

16.3 string Constructors

Class string provides eight constructors for initializing strings in various ways. Figure 16.1 demonstrates three of the constructors.

Fig. 16.1. string constructors.

Lines 10–11 allocate the char array characterArray, which contains nine characters. Lines 12–16 declare the strings originalString, string1, string2, string3 and string4. Line 12 assigns string literal "Welcome to C# programming!" to string reference originalString. Line 13 sets string1 to reference the same string literal.

Line 14 assigns to string2 a new string, using the string constructor with a character array argument. The new string contains a copy of the array’s characters.

Line 15 assigns to string3 a new string, using the string constructor that takes a char array and two int arguments. The second argument specifies the starting index position (the offset) from which characters in the array are to be copied. The third argument specifies the number of characters (the count) to be copied from the specified starting position in the array. The new string contains a copy of the specified characters in the array. If the specified offset or count indicates that the program should access an element outside the bounds of the character array, an ArgumentOutOfRangeException is thrown.

Line 16 assigns to string4 a new string, using the string constructor that takes as arguments a character and an int specifying the number of times to repeat that character in the string.
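As a rough illustration of the three constructors just described (this is a sketch, not the Fig. 16.1 listing), the code below builds strings from a nine-character array, from a range of that array and from a repeated character. The array contents, the repeat count and the helper name BuildAll are assumptions made for the example.

```csharp
using System;

public static class StringConstructorSketch
{
    // Hypothetical helper: applies the three constructors discussed in the text.
    public static string[] BuildAll(char[] characterArray)
    {
        string fromArray = new string(characterArray);       // copy of the whole array
        string fromRange = new string(characterArray, 6, 3); // offset 6, count 3
        string repeated = new string('C', 5);                // 'C' repeated five times
        return new string[] { fromArray, fromRange, repeated };
    }

    public static void Main()
    {
        // Assumed sample data: nine characters.
        char[] characterArray = { 'b', 'i', 'r', 't', 'h', ' ', 'd', 'a', 'y' };

        foreach (string value in BuildAll(characterArray))
            Console.WriteLine(value);
        // birth day
        // day
        // CCCCC

        try
        {
            // offset + count runs past the end of the array
            string invalid = new string(characterArray, 6, 10);
            Console.WriteLine(invalid);
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("offset/count outside the array bounds");
        }
    }
}
```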

Software Engineering Observation 16.1

In most cases, it’s not necessary to make a copy of an existing string. All strings are immutable—their character contents cannot be changed after they’re created. Also, if there are one or more references to a string (or any object for that matter), the object cannot be reclaimed by the garbage collector.

16.4 string Indexer, Length Property and CopyTo Method

The application in Fig. 16.2 presents the string indexer, which facilitates the retrieval of any character in the string, and the string property Length, which returns the length of the string. The string method CopyTo copies a specified number of characters from a string into a char array.

Fig. 16.2. string indexer, Length property and CopyTo method.

This application determines the length of a string, displays its characters in reverse order and copies a series of characters from the string to a character array. Line 17 uses string property Length to determine the number of characters in string1. Like arrays, strings always know their own size.

Lines 22–23 write the characters of string1 in reverse order using the string indexer. The string indexer treats a string as an array of chars and returns each character at a specific position in the string. The indexer receives an integer argument as the position number and returns the character at that position. As with arrays, the first element of a string is considered to be at position 0.

Common Programming Error 16.1

Attempting to access a character that’s outside a string’s bounds results in an IndexOutOfRangeException.

Line 26 uses string method CopyTo to copy the characters of string1 into a character array (characterArray). The first argument given to method CopyTo is the index from which the method begins copying characters in the string. The second argument is the character array into which the characters are copied. The third argument is the index specifying the starting location at which the method begins placing the copied characters into the character array. The last argument is the number of characters that the method will copy from the string. Lines 29–30 output the char array contents one character at a time.
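The sketch below, which is not the Fig. 16.2 listing, exercises the same three members: property Length, the indexer and method CopyTo. The sample text "hello there" and the helper names Reverse and CopyPrefix are assumptions introduced for illustration.

```csharp
using System;

public static class StringIndexerSketch
{
    // Hypothetical helper: walks the string backward with the indexer.
    public static string Reverse(string text)
    {
        char[] reversed = new char[text.Length];
        for (int i = text.Length - 1; i >= 0; --i)
            reversed[text.Length - 1 - i] = text[i];
        return new string(reversed);
    }

    // Hypothetical helper: copies the first count characters into a new array.
    public static char[] CopyPrefix(string text, int count)
    {
        char[] destination = new char[count];
        // source index 0, destination array, destination index 0, number of characters
        text.CopyTo(0, destination, 0, count);
        return destination;
    }

    public static void Main()
    {
        string string1 = "hello there";
        Console.WriteLine(string1.Length);                     // 11
        Console.WriteLine(Reverse(string1));                   // ereht olleh
        Console.WriteLine(new string(CopyPrefix(string1, 5))); // hello
    }
}
```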

16.5 Comparing strings

The next two examples demonstrate various methods for comparing strings. To understand how one string can be “greater than” or “less than” another, consider the process of alphabetizing a series of last names. The reader would, no doubt, place "Jones" before "Smith", because the first letter of "Jones" comes before the first letter of "Smith" in the alphabet. The alphabet is more than just a set of 26 letters—it’s an ordered list of characters in which each letter occurs in a specific position. For example, Z is more than just a letter of the alphabet; it’s specifically the twenty-sixth letter of the alphabet. Computers can order characters alphabetically because they’re represented internally as Unicode numeric codes.

Comparing Strings with Equals, CompareTo and the Equality Operator (==)

Class string provides several ways to compare strings. The application in Fig. 16.3 demonstrates the use of method Equals, method CompareTo and the equality operator (==).

Fig. 16.3. string test to determine equality.

The condition in line 21 uses string method Equals to compare string1 and literal string "hello" to determine whether they’re equal. Method Equals (inherited from object and overridden in string) tests any two objects for equality (i.e., checks whether the objects have identical contents). The method returns true if the objects are equal and false otherwise. In this case, the condition returns true, because string1 references string literal object "hello". Method Equals performs an ordinal comparison: it compares the numeric Unicode values of the corresponding characters in each string, so the result is case sensitive and does not depend on your system’s currently selected culture. Comparing "hello" with "HELLO" would return false, because the lowercase letters have different Unicode values from those of the corresponding uppercase letters.

The condition in line 27 uses the overloaded equality operator (==) to compare string string1 with the literal string "hello" for equality. In C#, the equality operator also compares the contents of two strings. Thus, the condition in the if statement evaluates to true, because the values of string1 and "hello" are equal.

Line 33 tests whether string3 and string4 are equal to illustrate that comparisons are indeed case sensitive. Here, static method Equals is used to compare the values of two strings. "Happy Birthday" does not equal "happy birthday", so the condition of the if statement fails, and the message "string3 does not equal string4" is output (line 36).

Lines 40–48 use string method CompareTo to compare strings. Method CompareTo returns 0 if the strings are equal, a negative value if the string that invokes CompareTo is less than the string that’s passed as an argument and a positive value if the string that invokes CompareTo is greater than the string that’s passed as an argument.

Notice that CompareTo considers string3 to be greater than string4. The only difference between these two strings is that string3 contains two uppercase letters in positions where string4 contains lowercase letters.
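The comparisons discussed above can be sketched as follows. This is an illustrative program rather than the Fig. 16.3 listing, and the helper names SameContents and Sign are hypothetical. Because CompareTo depends on the current culture, no expected value is shown for the mixed-case comparison.

```csharp
using System;

public static class StringComparisonSketch
{
    // Hypothetical helper: instance method Equals compares contents, case sensitively.
    public static bool SameContents(string first, string second)
    {
        return first.Equals(second);
    }

    // Hypothetical helper: reduces CompareTo's result to -1, 0 or 1.
    public static int Sign(string first, string second)
    {
        return Math.Sign(first.CompareTo(second));
    }

    public static void Main()
    {
        string string1 = "hello";
        string string3 = "Happy Birthday";
        string string4 = "happy birthday";

        Console.WriteLine(SameContents(string1, "hello"));  // True
        Console.WriteLine(string1 == "hello");              // True
        Console.WriteLine(string.Equals(string3, string4)); // False: case differs
        Console.WriteLine(Sign(string1, "hello"));          // 0
        Console.WriteLine(Sign(string3, string4));          // sign depends on the current culture
    }
}
```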

Determining Whether a String Begins or Ends with a Specified String

Figure 16.4 shows how to test whether a string instance begins or ends with a given string. Method StartsWith determines whether a string instance starts with the string text passed to it as an argument. Method EndsWith determines whether a string instance ends with the string text passed to it as an argument. Class StringStartEnd’s Main method defines an array of strings (called strings), which contains "started", "starting", "ended" and "ending". The remainder of method Main tests the elements of the array to determine whether they start or end with a particular set of characters.

Fig. 16.4. StartsWith and EndsWith methods.

Line 13 uses method StartsWith, which takes a string argument. The condition in the if statement determines whether the string at index i of the array starts with the characters "st". If so, the method returns true, and strings[i] is output along with a message.

Line 21 uses method EndsWith to determine whether the string at index i of the array ends with the characters "ed". If so, the method returns true, and strings[i] is displayed along with a message.
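A minimal sketch of these two tests, using the four strings named above, might look like the following. It is not the Fig. 16.4 listing; the helper names StartingWith and EndingWith are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class StartEndSketch
{
    // Hypothetical helper: collects the elements that begin with prefix.
    public static List<string> StartingWith(string[] values, string prefix)
    {
        List<string> matches = new List<string>();
        foreach (string value in values)
        {
            if (value.StartsWith(prefix))
                matches.Add(value);
        }
        return matches;
    }

    // Hypothetical helper: collects the elements that end with suffix.
    public static List<string> EndingWith(string[] values, string suffix)
    {
        List<string> matches = new List<string>();
        foreach (string value in values)
        {
            if (value.EndsWith(suffix))
                matches.Add(value);
        }
        return matches;
    }

    public static void Main()
    {
        string[] strings = { "started", "starting", "ended", "ending" };

        Console.WriteLine(string.Join(", ", StartingWith(strings, "st").ToArray())); // started, starting
        Console.WriteLine(string.Join(", ", EndingWith(strings, "ed").ToArray()));   // started, ended
    }
}
```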

16.6 Locating Characters and Substrings in strings

In many applications, it’s necessary to search for a character or set of characters in a string. For example, a programmer creating a word processor would want to provide capabilities for searching through documents. The application in Fig. 16.5 demonstrates some of the many versions of string methods IndexOf, IndexOfAny, LastIndexOf and LastIndexOfAny, which search for a specified character or substring in a string. We perform all searches in this example on the string letters (initialized with "abcdefghijklmabcdefghijklm") located in method Main of class StringIndexMethods.

Fig. 16.5. Searching for characters and substrings in strings.

Lines 14, 16 and 18 use method IndexOf to locate the first occurrence of a character or substring in a string. If it finds a character, IndexOf returns the index of the specified character in the string; otherwise, IndexOf returns –1. The expression in line 16 uses a version of method IndexOf that takes two arguments—the character to search for and the starting index at which the search of the string should begin. The method does not examine any characters that occur prior to the starting index (in this case, 1). The expression in line 18 uses another version of method IndexOf that takes three arguments—the character to search for, the index at which to start searching and the number of characters to search.

Lines 22, 24 and 26 use method LastIndexOf to locate the last occurrence of a character in a string. Method LastIndexOf performs the search from the end of the string to the beginning of the string. If it finds the character, LastIndexOf returns the index of the specified character in the string; otherwise, LastIndexOf returns –1. There are three versions of method LastIndexOf. The expression in line 22 uses the version that takes as an argument the character for which to search. The expression in line 24 uses the version that takes two arguments—the character for which to search and the highest index from which to begin searching backward for the character. The expression in line 26 uses a third version of method LastIndexOf that takes three arguments—the character for which to search, the starting index from which to start searching backward and the number of characters (the portion of the string) to search.

Lines 29–44 use versions of IndexOf and LastIndexOf that take a string instead of a character as the first argument. These versions of the methods perform identically to those described above except that they search for sequences of characters (or substrings) that are specified by their string arguments.

Lines 47–64 use methods IndexOfAny and LastIndexOfAny, which take an array of characters as the first argument. These versions of the methods also perform identically to those described above, except that IndexOfAny returns the index of the first occurrence of any of the characters in the character-array argument, and LastIndexOfAny returns the index of the last such occurrence.

Common Programming Error 16.2

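In the overloaded methods LastIndexOf and LastIndexOfAny that take three parameters, the third argument (the number of characters to search) cannot exceed the second argument (the starting index) plus one. This might seem counterintuitive, but remember that the search moves from the starting index toward the start of the string, so only the characters at indexes 0 through the starting index are available to search.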
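The search methods described in this section can be tried with a short sketch such as the one below, which searches the same letters string used in the text. It is not the Fig. 16.5 listing; the particular characters, substrings and index arguments are assumptions chosen for illustration.

```csharp
using System;

public static class IndexSearchSketch
{
    // The string searched throughout this section.
    public const string Letters = "abcdefghijklmabcdefghijklm";

    public static void Main()
    {
        char[] searchLetters = { 'c', 'a', '$' };

        // Forward searches for a character.
        Console.WriteLine(Letters.IndexOf('c'));            // 2
        Console.WriteLine(Letters.IndexOf('a', 1));         // 13 (index 0 is skipped)
        Console.WriteLine(Letters.IndexOf('$', 3, 5));      // -1

        // Backward searches for a character.
        Console.WriteLine(Letters.LastIndexOf('c'));        // 15
        Console.WriteLine(Letters.LastIndexOf('a', 5));     // 0 (search starts at index 5)
        Console.WriteLine(Letters.LastIndexOf('$', 15, 5)); // -1

        // Searches for a substring.
        Console.WriteLine(Letters.IndexOf("def"));          // 3
        Console.WriteLine(Letters.LastIndexOf("def"));      // 16

        // Searches for any character in an array.
        Console.WriteLine(Letters.IndexOfAny(searchLetters));     // 0 ('a')
        Console.WriteLine(Letters.LastIndexOfAny(searchLetters)); // 15 ('c')
    }
}
```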

16.7 Extracting Substrings from strings

Class string provides two Substring methods, which create a new string by copying part of an existing string. Each method returns a new string. The application in Fig. 16.6 demonstrates the use of both methods.

Fig. 16.6. Substrings generated from strings.

The statement in line 13 uses the Substring method that takes one int argument. The argument specifies the starting index from which the method copies characters in the original string. The substring returned contains a copy of the characters from the starting index to the end of the string. If the index specified in the argument is outside the bounds of the string, the program throws an ArgumentOutOfRangeException.

The second version of method Substring (line 17) takes two int arguments. The first argument specifies the starting index from which the method copies characters from the original string. The second argument specifies the length of the substring to copy. The substring returned contains a copy of the specified characters from the original string. If the supplied length of the substring is too large (i.e., the substring tries to retrieve characters past the end of the original string), an ArgumentOutOfRangeException is thrown.
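As a brief sketch of both Substring versions (not the Fig. 16.6 listing), the code below extracts from an assumed sample string and shows the exception raised when the requested length runs past the end. The sample string and index values are illustrative assumptions.

```csharp
using System;

public static class SubstringSketch
{
    // Assumed sample string for the example.
    public const string Letters = "abcdefghijklmabcdefghijklm";

    public static void Main()
    {
        // One argument: from index 20 to the end of the string.
        Console.WriteLine(Letters.Substring(20));   // hijklm

        // Two arguments: starting index 0, length 6.
        Console.WriteLine(Letters.Substring(0, 6)); // abcdef

        try
        {
            // Only 6 characters remain from index 20, so a length of 10 is too large.
            Console.WriteLine(Letters.Substring(20, 10));
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("requested substring extends past the end of the string");
        }
    }
}
```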

16.8 Concatenating strings

The + operator is not the only way to perform string concatenation. The static method Concat of class string (Fig. 16.7) concatenates two strings and returns a new string containing the combined characters from both original strings. Line 16 appends the characters from string2 to the end of a copy of string1, using method Concat. The statement in line 16 does not modify the original strings.
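A small sketch of static method Concat follows. The two sample strings and the helper name Join are assumptions made for the example; the point it illustrates is that the original strings are left unchanged.

```csharp
using System;

public static class ConcatSketch
{
    // Hypothetical helper wrapping the static Concat method.
    public static string Join(string first, string second)
    {
        return string.Concat(first, second);
    }

    public static void Main()
    {
        string string1 = "Happy ";
        string string2 = "Birthday";

        Console.WriteLine(Join(string1, string2)); // Happy Birthday
        Console.WriteLine(string1);                // Happy  (unchanged)
        Console.WriteLine(string2);                // Birthday (unchanged)
    }
}
```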

Fig. 16.7. Concat static method.

16.9 Miscellaneous string Methods

Class string provides several methods that return modified copies of strings. The application in Fig. 16.8 demonstrates the use of these methods, which include string methods Replace, ToLower, ToUpper and Trim.

Fig. 16.8. string methods Replace, ToLower, ToUpper and Trim.

Line 21 uses string method Replace to return a new string, replacing every occurrence in string1 of character 'e' with 'E'. Method Replace takes two arguments—a char for which to search and another char with which to replace all matching occurrences of the first argument. The original string remains unchanged. If there are no occurrences of the first argument in the string, the method returns the original string. An overloaded version of this method allows you to provide two strings as arguments.

The string method ToUpper generates a new string (line 25) that replaces any lowercase letters in string1 with their uppercase equivalents. The method returns a new string containing the converted string; the original string remains unchanged. If there are no characters to convert, the original string is returned. Line 26 uses string method ToLower to return a new string in which any uppercase letters in string2 are replaced by their lowercase equivalents. The original string is unchanged. As with ToUpper, if there are no characters to convert to lowercase, method ToLower returns the original string.

Line 30 uses string method Trim to remove all whitespace characters that appear at the beginning and end of a string. Without otherwise altering the original string, the method returns a new string that contains the string, but omits leading and trailing whitespace characters. This method is particularly useful for retrieving user input (e.g., via a TextBox). Another version of method Trim takes a character array and returns a copy of the string that does not begin or end with any of the characters in the array argument.
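The four methods in this section can be sketched as follows. This is not the Fig. 16.8 listing; the sample strings and the helper Normalize (which trims and lowercases its argument, as one might do with user input) are assumptions for illustration.

```csharp
using System;

public static class MiscellaneousMethodsSketch
{
    // Hypothetical helper: a typical cleanup of user input.
    public static string Normalize(string input)
    {
        return input.Trim().ToLower();
    }

    public static void Main()
    {
        string string1 = "cheers!";
        string string2 = "GOOD BYE ";
        string string3 = "   spaces   ";

        Console.WriteLine(string1.Replace('e', 'E'));         // chEErs!
        Console.WriteLine(string1.ToUpper());                 // CHEERS!
        Console.WriteLine("\"" + string2.ToLower() + "\"");   // "good bye "
        Console.WriteLine("\"" + string3.Trim() + "\"");      // "spaces"
        Console.WriteLine(string1);                           // cheers! (original unchanged)
        Console.WriteLine("xxhixx".Trim(new char[] { 'x' })); // hi
    }
}
```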

16.10 Class StringBuilder

The string class provides many capabilities for processing strings. However, a string’s contents can never change. Operations that seem to concatenate strings are in fact assigning string references to newly created strings (e.g., the += operator creates a new string and assigns its reference to the variable that previously referred to the initial string).

The next several sections discuss the features of class StringBuilder (namespace System.Text), used to create and manipulate dynamic string information—i.e., mutable strings. Every StringBuilder can store a certain number of characters that’s specified by its capacity. Exceeding the capacity of a StringBuilder causes the capacity to expand to accommodate the additional characters. As we’ll see, members of class StringBuilder, such as methods Append and AppendFormat, can be used for concatenation like the operators + and += for class string. StringBuilder is particularly useful for manipulating in place a large number of strings, as it’s much more efficient than creating individual immutable strings.

Performance Tip 16.2

Objects of class string are immutable (i.e., constant strings), whereas objects of class StringBuilder are mutable. C# can perform certain optimizations involving strings (such as the sharing of one string among multiple references), because it knows these objects will not change.

Class StringBuilder provides six overloaded constructors. Class StringBuilderConstructor (Fig. 16.9) demonstrates three of these overloaded constructors.

Fig. 16.9. StringBuilder class constructors.

Line 10 employs the no-parameter StringBuilder constructor to create a StringBuilder that contains no characters and has an implementation-specific default initial capacity. Line 11 uses the StringBuilder constructor that takes an int argument to create a StringBuilder that contains no characters and has the initial capacity specified in the int argument (i.e., 10). Line 12 uses the StringBuilder constructor that takes a string argument to create a StringBuilder containing the characters of the string argument. Lines 14–16 implicitly use StringBuilder method ToString to obtain string representations of the StringBuilders’ contents.
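The three constructors described above can be sketched as shown below (this is not the Fig. 16.9 listing). The helper Describe and the sample text are assumptions for illustration. The default capacity is implementation specific, so the sketch prints it without claiming a value.

```csharp
using System;
using System.Text;

public static class StringBuilderConstructorSketch
{
    // Hypothetical helper: shows the contents and length of a StringBuilder.
    public static string Describe(StringBuilder buffer)
    {
        return "\"" + buffer + "\" (length " + buffer.Length + ")";
    }

    public static void Main()
    {
        StringBuilder buffer1 = new StringBuilder();        // empty, default capacity
        StringBuilder buffer2 = new StringBuilder(10);      // empty, initial capacity of 10
        StringBuilder buffer3 = new StringBuilder("hello"); // initialized from a string

        Console.WriteLine(Describe(buffer1)); // "" (length 0)
        Console.WriteLine(Describe(buffer2)); // "" (length 0)
        Console.WriteLine(Describe(buffer3)); // "hello" (length 5)

        Console.WriteLine(buffer1.Capacity);       // implementation-specific default
        Console.WriteLine(buffer2.Capacity >= 10); // True
    }
}
```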

16.11 Length and Capacity Properties, EnsureCapacity Method and Indexer of Class StringBuilder

Class StringBuilder provides the Length and Capacity properties to return the number of characters currently in a StringBuilder and the number of characters that a StringBuilder can store without allocating more memory, respectively. These properties also can increase or decrease the length or the capacity of the StringBuilder. Method EnsureCapacity allows you to reduce the number of times that a StringBuilder’s capacity must be increased. The method ensures that the StringBuilder’s capacity is at least the specified value. The program in Fig. 16.10 demonstrates these methods and properties.

Fig. 16.10. StringBuilder size manipulation.

The program contains one StringBuilder, called buffer. Lines 10–11 of the program use the StringBuilder constructor that takes a string argument to instantiate the StringBuilder and initialize its value to "Hello, how are you?". Lines 14–16 output the content, length and capacity of the StringBuilder.

Line 18 expands the capacity of the StringBuilder to a minimum of 75 characters. If new characters are added to a StringBuilder so that its length exceeds its capacity, the capacity grows to accommodate the additional characters in the same manner as if method EnsureCapacity had been called.

Line 23 uses property Length to set the length of the StringBuilder to 10. If the specified length is less than the current number of characters in the StringBuilder, the contents of the StringBuilder are truncated to the specified length. If the specified length is greater than the number of characters currently in the StringBuilder, null characters are appended to the StringBuilder until the total number of characters in the StringBuilder is equal to the specified length.
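The behavior described in this section can be sketched as follows. It is an illustration rather than the Fig. 16.10 listing; the helper name Build is hypothetical. Exact capacity values are implementation specific, so the sketch checks only that the capacity is at least the requested minimum.

```csharp
using System;
using System.Text;

public static class StringBuilderSizeSketch
{
    // Hypothetical helper: ensures a minimum capacity, then truncates to 10 characters.
    public static StringBuilder Build()
    {
        StringBuilder buffer = new StringBuilder("Hello, how are you?");
        buffer.EnsureCapacity(75); // capacity is now at least 75
        buffer.Length = 10;        // truncates the contents
        return buffer;
    }

    public static void Main()
    {
        StringBuilder buffer = Build();
        Console.WriteLine(buffer);                // Hello, how
        Console.WriteLine(buffer.Length);         // 10
        Console.WriteLine(buffer.Capacity >= 75); // True

        // Increasing Length pads the contents with null characters.
        buffer.Length = 12;
        Console.WriteLine(buffer[11] == '\0');    // True

        // The indexer can also modify a character in place.
        buffer[0] = 'J';
        Console.WriteLine(buffer[0]);             // J
    }
}
```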

16.12 Append and AppendFormat Methods of Class StringBuilder

Class StringBuilder provides 19 overloaded Append methods that allow various types of values to be added to the end of a StringBuilder. The Framework Class Library provides versions for each of the simple types and for character arrays, strings and objects. (Remember that method ToString produces a string representation of any object.) Each method takes an argument, converts it to a string and appends it to the StringBuilder. Figure 16.11 demonstrates the use of several Append methods.

Fig. 16.11. Append methods of StringBuilder.

Lines 22–40 use 10 different overloaded Append methods to attach the string representations of objects created in lines 10–18 to the end of the StringBuilder.
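A shortened sketch using a handful of the Append overloads follows. It is not the Fig. 16.11 listing; the values appended and the helper name BuildSample are assumptions, and only culture-independent types are used so that the result is predictable.

```csharp
using System;
using System.Text;

public static class AppendSketch
{
    // Hypothetical helper: appends values of several types to one StringBuilder.
    public static string BuildSample()
    {
        object objectValue = "hello";
        char[] characterArray = { 'a', 'b', 'c' };

        StringBuilder buffer = new StringBuilder();
        buffer.Append(objectValue);    // Append(object)
        buffer.Append(' ');            // Append(char)
        buffer.Append(characterArray); // Append(char[])
        buffer.Append(' ');
        buffer.Append(true);           // Append(bool)
        buffer.Append(' ');
        buffer.Append(7);              // Append(int)
        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BuildSample()); // hello abc True 7
    }
}
```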

Class StringBuilder also provides method AppendFormat, which converts a string to a specified format, then appends it to the StringBuilder. The example in Fig. 16.12 demonstrates the use of this method.

Fig. 16.12. StringBuilder’s AppendFormat method.

Line 13 creates a string that contains formatting information. The information enclosed in braces specifies how to format a specific piece of data. Formats have the form {X[,Y][:FormatString]}, where X is the number of the argument to be formatted, counting from zero. Y is an optional argument, which can be positive or negative, indicating the minimum number of characters in the result. If the resulting string is shorter than Y characters, it’s padded with spaces to make up the difference. A positive integer aligns the string to the right; a negative integer aligns it to the left. The optional FormatString applies a particular format to the argument (currency, decimal or scientific, among others). In this case, “{0}” means the first argument will be printed out. “{1:C}” specifies that the second argument will be formatted as a currency value.

Line 22 shows a version of AppendFormat that takes two parameters—a string specifying the format and an array of objects to serve as the arguments to the format string. The argument referred to by “{0}” is in the object array at index 0.

Lines 25–27 define another string used for formatting. The first format, “{0:d3}”, specifies that the first argument will be formatted as a three-digit decimal, meaning that any number having fewer than three digits will have leading zeros placed in front to make up the difference. The next format, “{0, 4}”, specifies that the formatted string should have four characters and be right aligned. The third format, “{0, -4}”, specifies that the string should be aligned to the left.

Line 30 uses a version of AppendFormat that takes two parameters—a string containing a format and an object to which the format is applied. In this case, the object is the number 5. The output of Fig. 16.12 displays the result of applying these two versions of AppendFormat with their respective arguments.
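The format items discussed above can be tried with a sketch like the following, which is not the Fig. 16.12 listing. The helper name Format and the sample arguments are assumptions. Square brackets are added around the aligned items only to make the padding visible, and no expected output is given for the currency item because it depends on the current culture.

```csharp
using System;
using System.Text;

public static class AppendFormatSketch
{
    // Hypothetical helper: applies a format string to its arguments via AppendFormat.
    public static string Format(string format, params object[] arguments)
    {
        StringBuilder buffer = new StringBuilder();
        buffer.AppendFormat(format, arguments);
        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Format("{0:d3}", 5));     // 005
        Console.WriteLine(Format("[{0,4}]", 5));    // [   5]
        Console.WriteLine(Format("[{0,-4}]", 5));   // [5   ]

        // Currency formatting varies with the current culture.
        Console.WriteLine(Format("This {0} costs: {1:C}.", "car", 1234.56));
    }
}
```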

16.13 Insert, Remove and Replace Methods of Class StringBuilder

Class StringBuilder provides 18 overloaded Insert methods to allow various types of data to be inserted at any position in a StringBuilder. The class provides versions for each of the simple types and for character arrays, strings and objects. Each method takes its second argument, converts it to a string and inserts the string into the StringBuilder in front of the character in the position specified by the first argument. The index specified by the first argument must be greater than or equal to 0 and less than or equal to the length of the StringBuilder (an index equal to the length inserts at the end); otherwise, the program throws an ArgumentOutOfRangeException.

Class StringBuilder also provides method Remove for deleting any portion of a StringBuilder. Method Remove takes two arguments: the index at which to begin deletion and the number of characters to delete. The sum of the starting index and the number of characters to be deleted must not exceed the length of the StringBuilder; otherwise, the program throws an ArgumentOutOfRangeException. The Insert and Remove methods are demonstrated in Fig. 16.13.
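A compact sketch of Insert and Remove follows. It is not the Fig. 16.13 listing; the sample contents, the index values and the helper name InsertThenRemove are assumptions chosen so that each intermediate result is easy to follow.

```csharp
using System;
using System.Text;

public static class InsertRemoveSketch
{
    // Hypothetical helper: performs two insertions and one removal.
    public static string InsertThenRemove()
    {
        StringBuilder buffer = new StringBuilder("abcdef");
        buffer.Insert(3, "XYZ"); // abcXYZdef   (string inserted before index 3)
        buffer.Insert(0, 42);    // 42abcXYZdef (int converted to a string)
        buffer.Remove(2, 3);     // 42XYZdef    (three characters removed from index 2)
        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(InsertThenRemove()); // 42XYZdef
    }
}
```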

Fig. 16.13. StringBuilder text insertion and removal.

Another useful method included with StringBuilder is Replace. Replace searches for a specified string or character and substitutes another string or character in its place. Figure 16.14 demonstrates this method.

Fig. 16.14. StringBuilder text replacement.

Line 18 uses method Replace to replace all instances of "Jane" with "Greg" in builder1. Another overload of this method takes two characters as parameters and replaces each occurrence of the first character with the second. Line 19 uses an overload of Replace that takes four parameters, of which the first two are characters and the second two are ints. The method replaces all instances of the first character with the second character, beginning at the index specified by the first int and continuing for a count specified by the second int. Thus, in this case, Replace looks through only five characters, starting with the character at index 0. As the output illustrates, this version of Replace replaces g with G in the word "good", but not in "greg". This is because the gs in "greg" are not in the range indicated by the int arguments (i.e., between indexes 0 and 4).
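The two Replace overloads just described can be sketched as follows. This is not the Fig. 16.14 listing; the initial contents of the two StringBuilders are assumptions chosen to be consistent with the words mentioned above, and the helper names are hypothetical.

```csharp
using System;
using System.Text;

public static class StringBuilderReplaceSketch
{
    // Hypothetical helper: replaces every occurrence of one string with another.
    public static string ReplaceName(string text)
    {
        StringBuilder builder = new StringBuilder(text);
        builder.Replace("Jane", "Greg");
        return builder.ToString();
    }

    // Hypothetical helper: replaces 'g' with 'G' only within indexes 0 through 4.
    public static string CapitalizeEarlyG(string text)
    {
        StringBuilder builder = new StringBuilder(text);
        builder.Replace('g', 'G', 0, 5);
        return builder.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceName("Happy Birthday Jane")); // Happy Birthday Greg
        Console.WriteLine(CapitalizeEarlyG("good bye greg"));  // Good bye greg
    }
}
```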

16.14 Char Methods

C# provides a concept called a struct (short for “structure”) that’s similar to a class. Although structs and classes are comparable, structs represent value types. Like classes, structs can have methods and properties, and can use the access modifiers public and private. Also, struct members are accessed via the member access operator (.).

The simple types are actually aliases for struct types. For instance, an int is defined by struct System.Int32, a long by System.Int64 and so on. All struct types derive from class ValueType, which derives from object. Also, all struct types are implicitly sealed, so they do not support virtual or abstract methods, and their members cannot be declared protected or protected internal.

In the struct Char, which is the struct for characters, most methods are static, take at least one character argument and perform either a test or a manipulation on the character. We present several of these methods in the next example. Figure 16.15 demonstrates static methods that test characters to determine whether they’re of a specific character type and static methods that perform case conversions on characters.

Fig. 16.15. Char’s static character-testing and case-conversion methods.

After the user enters a character, lines 13–27 analyze it. Line 13 uses Char method IsDigit to determine whether variable character holds a digit. If so, the method returns true; otherwise, it returns false (note again that bool values are output capitalized). Line 14 uses Char method IsLetter to determine whether character is a letter. Line 16 uses Char method IsLetterOrDigit to determine whether character is a letter or a digit.

Line 18 uses Char method IsLower to determine whether character is a lowercase letter. Line 20 uses Char method IsUpper to determine whether character is an uppercase letter. Line 22 uses Char method ToUpper to convert character to its uppercase equivalent. The method returns the converted character if the character has an uppercase equivalent; otherwise, the method returns its original argument. Line 24 uses Char method ToLower to convert character to its lowercase equivalent. The method returns the converted character if the character has a lowercase equivalent; otherwise, the method returns its original argument.

Line 26 uses Char method IsPunctuation to determine whether character is a punctuation mark, such as "!", ":" or ")". Line 27 uses Char method IsSymbol to determine whether character is a symbol, such as "+", "=" or "^".

Structure type Char also contains other methods not shown in this example. Many of the static methods are similar—for instance, IsWhiteSpace is used to determine whether a certain character is a whitespace character (e.g., newline, tab or space). The struct also contains several public instance methods; many of these, such as methods ToString and Equals, are methods that we have seen before in other classes. This group includes method CompareTo, which is used to compare two character values with one another.
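As a rough sketch of the Char methods discussed in this section, the following console program applies them to a few sample characters. The class name, the Describe helper and the sample characters are assumptions made for illustration; they are not taken from Fig. 16.15.

```csharp
using System;

class CharMethodsSketch
{
    // Summarizes the results of several static character-testing methods.
    public static string Describe(char c)
    {
        return string.Format(
            "digit={0} letter={1} lower={2} upper={3} punctuation={4} symbol={5}",
            Char.IsDigit(c), Char.IsLetter(c), Char.IsLower(c),
            Char.IsUpper(c), Char.IsPunctuation(c), Char.IsSymbol(c));
    }

    static void Main()
    {
        Console.WriteLine(Describe('a'));
        Console.WriteLine(Describe('7'));
        Console.WriteLine(Describe('!'));
        Console.WriteLine(Describe('+'));

        // Case conversions return the argument unchanged when no
        // uppercase or lowercase equivalent exists.
        Console.WriteLine(Char.ToUpper('a'));  // A
        Console.WriteLine(Char.ToLower('Q'));  // q
        Console.WriteLine(Char.ToUpper('7'));  // 7

        // IsWhiteSpace recognizes tabs, newlines and spaces.
        Console.WriteLine(Char.IsWhiteSpace('\t'));  // True

        // Instance method CompareTo returns a negative value when the
        // character precedes its argument.
        Console.WriteLine('a'.CompareTo('c') < 0);   // True
    }
}
```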

16.15 Regular Expressions

We now introduce regular expressions, specially formatted strings used to find patterns in text. They can be used to ensure that data is in a particular format. For example, a U.S. zip code must consist of five digits, or five digits followed by a dash followed by four more digits. Compilers typically use regular expressions during lexical analysis to recognize the tokens of a program (keywords, identifiers, literals and so on); program text that matches none of the token patterns is reported as an error. We discuss classes Regex and Match from the System.Text.RegularExpressions namespace as well as the symbols used to form regular expressions. We then demonstrate how to find patterns in a string, match entire strings to patterns, replace characters in a string that match a pattern and split strings at delimiters specified as a pattern in a regular expression.

16.15.1 Simple Regular Expressions and Class Regex

The .NET Framework provides several classes to help developers manipulate regular expressions. Figure 16.16 demonstrates the basic regular-expression classes. To use these classes, add a using statement for the namespace System.Text.RegularExpressions (line 4). Class Regex represents a regular expression. We create a Regex object named expression (line 16) to represent the regular expression "e". This regular expression matches the literal character "e" anywhere in an arbitrary string. Regex method Match returns an object of class Match that represents a single regular-expression match. Class Match’s ToString method returns the substring that matched the regular expression. The call to method Match (line 17) matches the leftmost occurrence of the character "e" in testString. Class Regex also provides method Matches (line 21), which finds all matches of the regular expression in an arbitrary string and returns a MatchCollection object containing all the matches. A MatchCollection is a collection, similar to an array, and can be used with a foreach statement to iterate through the collection’s elements. We introduced collections in Chapter 9 and discuss them in more detail in Chapter 23, Collections. We use a foreach statement (lines 21–22) to display all the matches to expression in testString. The elements in the MatchCollection are Match objects. Be aware, however, that MatchCollection’s public GetEnumerator method returns a nongeneric IEnumerator, so a foreach variable declared with var is typed as object; declare the variable explicitly as Match when you need Match’s members. For each match, line 22 outputs the text that matched the regular expression.

Fig. 16.16. Demonstrating basic regular expressions.

Regular expressions can also be used to match a sequence of literal characters anywhere in a string. Lines 27–28 display all the occurrences of the character sequence "regex" in testString. Here we use the Regex static method Matches. Class Regex provides static versions of both methods Match and Matches. The static versions take a regular expression as an argument in addition to the string to be searched. This is useful when you want to use a regular expression only once. The call to method Matches (line 27) returns two matches to the regular expression "regex". Notice that "regexp" in the testString matches the regular expression "regex", but the "p" is excluded. We use the regular expression "regexp?" (line 34) to match occurrences of both "regex" and "regexp". The question mark (?) is a metacharacter—a character with special meaning in a regular expression. More specifically, the question mark is a quantifier—a metacharacter that describes how many times a part of the pattern may occur in a match. The ? quantifier matches zero or one occurrence of the pattern to its left. In line 34, we apply the ? quantifier to the character "p". This means that a match to the regular expression contains the sequence of characters "regex" and may be followed by a "p". Notice that the foreach statement (lines 34–35) displays both "regex" and "regexp".
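The following sketch exercises instance method Match, static method Matches and the ? quantifier. The class name, the AllMatches helper and the test string are assumptions for illustration; the test string was simply chosen to contain both "regex" and "regexp".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class BasicRegexSketch
{
    // Collects the text of every match of pattern in input,
    // using the static version of Matches.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();

        // The loop variable is declared as Match so that Value is accessible.
        foreach (Match match in Regex.Matches(input, pattern))
        {
            results.Add(match.Value);
        }

        return results;
    }

    static void Main()
    {
        string testString =
            "regular expressions are sometimes called regex or regexp";

        // Instance method Match returns only the leftmost match.
        var expression = new Regex("e");
        Console.WriteLine(expression.Match(testString).Value);  // e

        // "regex" also matches the first five characters of "regexp".
        Console.WriteLine(string.Join(" ", AllMatches(testString, "regex")));

        // The ? quantifier makes the trailing "p" optional.
        Console.WriteLine(string.Join(" ", AllMatches(testString, "regexp?")));
    }
}
```

With this test string, the second line of output is "regex regex" and the third is "regex regexp".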

Metacharacters allow you to create more complex patterns. The "|" (alternation) metacharacter matches the expression to its left or to its right. We use alternation in the regular expression "(c|h)at" (line 38) to match either "cat" or "hat". Parentheses are used to group parts of a regular expression, much as you group parts of a mathematical expression. The "|" causes the pattern to match a sequence of characters starting with either "c" or "h", followed by "at". The "|" character attempts to match the entire expression to its left or to its right. If we didn’t use the parentheses around "c|h", the regular expression would match either the single character "c" or the sequence of characters "hat". Line 41 uses the regular expression (line 38) to search the strings "hat cat" and "cat hat". Notice in the output that the first match in "hat cat" is "hat", while the first match in "cat hat" is "cat". Alternation chooses the leftmost match in the string for either of the alternating expressions—the order of the expressions doesn’t matter.
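A short sketch of alternation, with and without grouping parentheses, might look like this. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class AlternationSketch
{
    // Returns the text of the leftmost match, or an empty string if none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    // Returns the text of every match.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            results.Add(match.Value);
        }
        return results;
    }

    static void Main()
    {
        // The leftmost match in the input wins, regardless of the order
        // in which the alternatives are written.
        Console.WriteLine(FirstMatch("hat cat", "(c|h)at"));  // hat
        Console.WriteLine(FirstMatch("cat hat", "(c|h)at"));  // cat

        // Without parentheses, the alternatives are "c" and "hat".
        Console.WriteLine(string.Join(" ", AllMatches("cat hat", "c|hat")));  // c hat
    }
}
```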

Regular-Expression Character Classes and Quantifiers

The table in Fig. 16.17 lists some character classes that can be used with regular expressions. A character class represents a group of characters that might appear in a string. For example, a word character (\w) is any alphanumeric character (a-z, A-Z and 0-9) or underscore. A whitespace character (\s) is a space, a tab, a carriage return, a newline or a form feed. A digit (\d) is any numeric character.

Fig. 16.17. Character classes.

Figure 16.18 uses character classes in regular expressions. For this example, we use method DisplayMatches (lines 53–59) to display all matches to a regular expression. Method DisplayMatches takes two strings representing the string to search and the regular expression to match. The method uses a foreach statement to display each Match in the MatchCollection object returned by the static method Matches of class Regex.

Fig. 16.18. Demonstrating using character classes and quantifiers.

The first regular expression (line 15) matches digits in the testString. We use the digit character class (\d) to match any digit (0–9). We precede the regular expression string with @. Recall that backslashes within the double quotation marks following the @ character are regular backslash characters, not the beginning of escape sequences. To define the regular expression without prefixing @ to the string, you would need to escape every backslash character, as in

"\\d"

which makes the regular expression more difficult to read.

The output shows that the regular expression matches 1, 2, and 3 in the testString. You can also match anything that isn’t a member of a particular character class using an uppercase instead of a lowercase letter. For example, the regular expression "\D" (line 19) matches any character that isn’t a digit. Notice in the output that this includes punctuation and whitespace. Negating a character class matches everything that isn’t a member of the character class.

The next regular expression (line 23) uses the character class \w to match any word character in the testString. Notice that each match consists of a single character. It would be useful to match a sequence of word characters rather than a single character. The regular expression in line 28 uses the + quantifier to match a sequence of word characters. The + quantifier matches one or more occurrences of the pattern to its left. There are three matches for this expression, each three characters long. Quantifiers are greedy: they match the longest possible occurrence of the pattern. You can follow a quantifier with a question mark (?) to make it lazy, so that it matches the shortest possible occurrence of the pattern. The regular expression "\w+?" (line 33) uses a lazy + quantifier to match the shortest sequence of word characters possible. This produces nine matches of length one instead of three matches of length three. Figure 16.19 lists other quantifiers that you can place after a pattern in a regular expression, and the purpose of each.
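The predefined character classes and the greedy and lazy forms of the + quantifier can be tried out with a sketch like the one below. The test string "abc, DEF, 123" is an assumption consistent with the match counts described above, and the class and helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class CharacterClassSketch
{
    // Returns the text of every match of pattern in input.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            results.Add(match.Value);
        }
        return results;
    }

    static void Main()
    {
        string testString = "abc, DEF, 123";  // assumed sample input

        // Verbatim strings (@"...") keep the backslashes literal.
        Console.WriteLine(AllMatches(testString, @"\d").Count);    // 3 digits
        Console.WriteLine(AllMatches(testString, @"\D").Count);    // 10 non-digits
        Console.WriteLine(AllMatches(testString, @"\w").Count);    // 9 word characters

        // Greedy +: three matches, each three characters long.
        Console.WriteLine(string.Join("|", AllMatches(testString, @"\w+")));

        // Lazy +?: nine matches, each one character long.
        Console.WriteLine(AllMatches(testString, @"\w+?").Count);
    }
}
```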

Fig. 16.19. Quantifiers used in regular expressions.

Regular expressions are not limited to the character classes in Fig. 16.17. You can create your own character class by listing the members of the character class between square brackets, [ and ]. [Note: Metacharacters in square brackets are treated as literal characters.] You can include a range of characters using the "-" character. The regular expression in line 37 of Fig. 16.18 creates a character class to match any lowercase letter from a to f. These custom character classes match a single character that’s a member of the class. The output shows three matches, a, b and c. Notice that D, E and F don’t match the character class [a-f] because they’re uppercase. You can negate a custom character class by placing a "^" character after the opening square bracket. The regular expression in line 41 matches any character that isn’t in the range a-f. As with the predefined character classes, negating a custom character class matches everything that isn’t a member, including punctuation and whitespace. You can also use quantifiers with custom character classes. The regular expression in line 45 uses a character class with two ranges of characters, a-z and A-Z, and the + quantifier to match a sequence of lowercase or uppercase letters. You can also use the "." (dot) character to match any character other than a newline. The regular expression ".*" (line 49) matches any sequence of characters. The * quantifier matches zero or more occurrences of the pattern to its left. Unlike the + quantifier, the * quantifier can be used to match an empty string.
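Custom character classes, negation and the difference between * and + can be sketched as follows. As before, the test string and all class and method names are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class CustomClassSketch
{
    // Returns the text of every match of pattern in input.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            results.Add(match.Value);
        }
        return results;
    }

    static void Main()
    {
        string testString = "abc, DEF, 123";  // assumed sample input

        // [a-f] matches one lowercase letter from a to f.
        Console.WriteLine(string.Join(" ", AllMatches(testString, "[a-f]")));  // a b c

        // [^a-f] matches any single character outside that range.
        Console.WriteLine(AllMatches(testString, "[^a-f]").Count);  // 10

        // Two ranges plus the + quantifier match runs of letters.
        Console.WriteLine(string.Join(" ", AllMatches(testString, "[a-zA-Z]+")));  // abc DEF

        // .* can match the empty string; .+ cannot.
        Console.WriteLine(Regex.IsMatch("", ".*"));  // True
        Console.WriteLine(Regex.IsMatch("", ".+"));  // False
    }
}
```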

16.15.2 Complex Regular Expressions

The program of Fig. 16.20 tries to match birthdays to a regular expression. For demonstration purposes, the expression matches only birthdays that do not occur in April and that belong to people whose names begin with "J". We can do this by combining the basic regular-expression techniques we’ve already discussed.

Fig. 16.20. A more complex regular expression.

Line 11 creates a Regex object and passes a regular-expression pattern string to its constructor. The first character in the regular expression, "J", is a literal character. Any string matching this regular expression must start with "J". The next part of the regular expression (".*") matches any number of unspecified characters except newlines. The pattern "J.*" matches a person’s name that starts with J and any characters that may come after that.

Next we match the person’s birthday. We use the \d character class to match the first digit of the month. Since the birthday must not occur in April, the second digit in the month can’t be 4. We could use the character class "[0-35-9]" to match any digit other than 4. However, .NET regular expressions allow you to subtract members from a character class, called character-class subtraction. In line 11, we use the pattern "[\d-[4]]" to match any digit other than 4. When the "-" character in a character class is followed by a character class instead of a literal character, the "-" is interpreted as subtraction instead of a range of characters. The members of the character class following the "-" are removed from the character class preceding the "-". When using character-class subtraction, the class being subtracted ([4]) must be the last item in the enclosing brackets ([\d-[4]]). This notation allows you to write shorter, easier-to-read regular expressions.

Although the "-" character indicates a range or character-class subtraction when it’s enclosed in square brackets, instances of the "-" character outside a character class are treated as literal characters. Thus, the regular expression in line 11 searches for a string that starts with the letter "J", followed by any number of characters, followed by a two-digit number (of which the second digit cannot be 4), followed by a dash, another two-digit number, a dash and another two-digit number.

Lines 20–21 use a foreach statement to iterate through the MatchCollection object returned by method Matches, which received testString as an argument. For each Match, line 21 outputs the text that matched the regular expression. The output in Fig. 16.20 displays the two matches that were found in testString. Notice that both matches conform to the pattern specified by the regular expression.
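The pattern described in this section can be assembled as in the sketch below. The pattern string is a reconstruction from the prose, and the sample data, class name and method name are assumptions; none of them is copied from Fig. 16.20.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class BirthdayRegexSketch
{
    // "J", any characters except newline, then a date whose month
    // has a second digit other than 4 (character-class subtraction).
    const string Pattern = @"J.*\d[\d-[4]]-\d\d-\d\d";

    public static List<string> FindBirthdays(string text)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(text, Pattern))
        {
            results.Add(match.Value);
        }
        return results;
    }

    static void Main()
    {
        // Assumed sample data, one birthday per line.
        string testString =
            "Jane's Birthday is 05-12-75\n" +
            "Dave's Birthday is 11-04-68\n" +
            "John's Birthday is 04-28-73\n" +
            "Joe's Birthday is 12-17-77";

        // Because "." does not match a newline, each match stays on one line.
        foreach (string birthday in FindBirthdays(testString))
        {
            Console.WriteLine(birthday);
        }
    }
}
```

With this data, only the lines for Jane and Joe match: Dave's name does not start with "J", and John's month is 04.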

16.15.3 Validating User Input with Regular Expressions and LINQ

The application in Fig. 16.21 presents a more involved example that uses regular expressions to validate name, address and telephone-number information input by a user.

Fig. 16.21. Validating user information using regular expressions.

When a user clicks OK, the program uses a LINQ query to select any empty TextBoxes (lines 22–27) from the Controls collection. Notice that we explicitly declare the type of the range variable in the from clause (line 22). When working with nongeneric collections, such as Controls, you must explicitly type the range variable. The first where clause (line 23) determines whether the currentControl is a TextBox. The let clause (line 24) creates and initializes a variable in a LINQ query for use later in the query. Here, we use the let clause to define variable box as a TextBox, which contains the Control object cast to a TextBox. This allows us to use the control in the LINQ query as a TextBox, enabling access to its properties (such as Text). You may include a second where clause after the let clause. The second where clause determines whether the TextBox’s Text property is empty. If one or more TextBoxes are empty (line 30), the program displays a message to the user (lines 33–35) that all fields must be filled in before the program can validate the information. Line 37 calls the Select method of the first TextBox in the query result so that the user can begin typing in that TextBox. The query sorted the TextBoxes by TabIndex (line 26) so the first TextBox in the query result is the first empty TextBox on the Form. If there are no empty fields, lines 39–71 validate the user input.
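The shape of this query (an explicitly typed range variable over a nongeneric collection, a let clause and two where clauses) can be sketched without Windows Forms. In the console program below, an ArrayList stands in for the Controls collection, and the Widget and InputField classes are hypothetical substitutes for Control and TextBox.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for Control and TextBox.
class Widget
{
    public int TabIndex { get; set; }
}

class InputField : Widget
{
    public string Name { get; set; }
    public string Text { get; set; }
}

class EmptyFieldQuerySketch
{
    // Returns the names of the empty input fields, ordered by TabIndex.
    public static List<string> EmptyFieldNames(ArrayList widgets)
    {
        var emptyFields =
            from Widget current in widgets       // nongeneric source: type is explicit
            where current is InputField          // keep only input fields
            let field = (InputField)current      // cast once, reuse below
            where string.IsNullOrEmpty(field.Text)
            orderby field.TabIndex
            select field.Name;

        return emptyFields.ToList();
    }

    static void Main()
    {
        var widgets = new ArrayList
        {
            new InputField { Name = "phone", Text = "", TabIndex = 3 },
            new Widget { TabIndex = 0 },
            new InputField { Name = "lastName", Text = "Smith", TabIndex = 1 },
            new InputField { Name = "city", Text = "", TabIndex = 2 }
        };

        Console.WriteLine(string.Join(", ", EmptyFieldNames(widgets)));  // city, phone
    }
}
```

The explicit type in the from clause causes each element to be cast to Widget, so every element of the collection must be of that type or a type derived from it.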

We call method ValidateInput to determine whether the user input matches the specified regular expressions. ValidateInput (lines 83–98) takes as arguments the text input by the user (input), the regular expression the input must match (expression) and a message to display if the input is invalid (message). Line 87 calls Regex static method Match, passing both the string to validate and the regular expression as arguments. The Success property of class Match indicates whether method Match’s first argument matched the pattern specified by the regular expression in the second argument. If the value of Success is false (i.e., there was no match), lines 93–94 display the error message passed as an argument to method ValidateInput. Line 97 then returns the value of the Success property. If ValidateInput returns false, the TextBox containing invalid data is selected so the user can correct the input. If all input is valid, the else statement (lines 72–78) displays a message dialog stating that all input is valid, and the program terminates when the user dismisses the dialog.

In the previous example, we searched a string for substrings that matched a regular expression. In this example, we want to ensure that the entire string for each input conforms to a particular regular expression. For example, we want to accept "Smith" as a last name, but not "9@Smith#". In a regular expression that begins with a "^" character and ends with a "$" character (e.g., line 43), the characters "^" and "$" represent the beginning and end of a string, respectively. These characters force a regular expression to return a match only if the entire string being processed matches the regular expression.

The regular expressions in lines 43 and 47 use a character class to match an uppercase first letter followed by letters of any case—a-z matches any lowercase letter, and A-Z matches any uppercase letter. The * quantifier signifies that the second range of characters may occur zero or more times in the string. Thus, this expression matches any string consisting of one uppercase letter, followed by zero or more additional letters.

The \s character class matches a single whitespace character (lines 51, 56 and 60). In the expression "\d{5}", used for the zipCode string (line 64), {5} is a quantifier (see Fig. 16.19). The pattern to the left of {n} must occur exactly n times. Thus "\d{5}" matches any five digits. Recall that the character "|" (lines 51, 56 and 60) matches the expression to its left or the expression to its right. In line 51, we use the character "|" to indicate that the address can contain a word of one or more characters or a word of one or more characters followed by a space and another word of one or more characters. Note the use of parentheses to group parts of the regular expression. This ensures that "|" is applied to the correct parts of the pattern.

The Last Name: and First Name: TextBoxes each accept strings that begin with an uppercase letter (lines 43 and 47). The regular expression for the Address: TextBox (line 51) matches a number of at least one digit, followed by a space and then either one or more letters or else one or more letters followed by a space and another series of one or more letters. Therefore, "10 Broadway" and "10 Main Street" are both valid addresses. As currently formed, the regular expression in line 51 doesn’t match an address that does not start with a number, or that has more than two words. The regular expressions for the City: (line 56) and State: (line 60) TextBoxes match any word of at least one character or, alternatively, any two words of at least one character if the words are separated by a single space. This means both Waltham and West Newton would match. Again, these regular expressions would not accept names that have more than two words. The regular expression for the Zip code: TextBox (line 64) ensures that the zip code is a five-digit number. The regular expression for the Phone: TextBox (line 68) indicates that the phone number must be of the form xxx-yyy-yyyy, where the xs represent the area code and the ys the number. The first x and the first y cannot be zero, as specified by the range [1-9] in each case.
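The validation rules described above can be approximated in a console sketch. The pattern strings below are reconstructions from the prose, not copies of the expressions in Fig. 16.21, and the class, constant and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class ValidationSketch
{
    // Patterns reconstructed from the descriptions in the text. The ^ and $
    // anchors require the entire input to match.
    public const string Name = @"^[A-Z][a-zA-Z]*$";
    public const string Address = @"^\d+\s([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$";
    public const string CityOrState = @"^([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$";
    public const string ZipCode = @"^\d{5}$";
    public const string Phone = @"^[1-9]\d{2}-[1-9]\d{2}-\d{4}$";

    // Returns true if input matches expression, using static method Match
    // and the Success property of the resulting Match object.
    public static bool IsValid(string input, string expression)
    {
        return Regex.Match(input, expression).Success;
    }

    static void Main()
    {
        Console.WriteLine(IsValid("Smith", Name));              // True
        Console.WriteLine(IsValid("9@Smith#", Name));           // False
        Console.WriteLine(IsValid("10 Main Street", Address));  // True
        Console.WriteLine(IsValid("West Newton", CityOrState)); // True
        Console.WriteLine(IsValid("1234", ZipCode));            // False
        Console.WriteLine(IsValid("555-555-1234", Phone));      // True
    }
}
```

Without the anchors, "9@Smith#" would pass the name check, because the substring "Smith" matches the rest of the pattern.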

16.15.4 Regex Methods Replace and Split

Sometimes it’s useful to replace parts of one string with another or to split a string according to a regular expression. For this purpose, class Regex provides static and instance versions of methods Replace and Split, which are demonstrated in Fig. 16.22.

Fig. 16.22. Using Regex methods Replace and Split.

Regex method Replace replaces text in a string with new text wherever the original string matches a regular expression. We use two versions of this method in Fig. 16.22. The first version (line 18) is a static method and takes three parameters: the string to modify, the string containing the regular expression to match and the replacement string. Here, Replace replaces every instance of "*" in testString1 with "^". Notice that the regular expression ("\*") precedes character * with a backslash (\). Normally, * is a quantifier indicating that a regular expression should match any number of occurrences of a preceding pattern. However, in line 18, we want to find all occurrences of the literal character *; to do this, we must escape character * with character \. By escaping a special regular-expression character, we tell the regular-expression matching engine to find the actual character * rather than use it as a quantifier.

The second version of method Replace (line 34) is an instance method that uses the regular expression passed to the constructor for testRegex1 (line 12) to perform the replacement operation. Line 12 instantiates testRegex1 with argument @"\d". The call to instance method Replace in line 34 takes three arguments: a string to modify, a string containing the replacement text and an integer specifying the number of replacements to make. In this case, line 34 replaces the first three instances of a digit ("\d") in testString2 with the text "digit".

Method Split divides a string into several substrings. The original string is broken at delimiters that match a specified regular expression. Method Split returns an array containing the substrings. In line 39, we use static method Split to separate a string of comma-separated integers. The first argument is the string to split; the second argument is the regular expression that represents the delimiter. The regular expression ",\s" separates the substrings at each comma. By matching a whitespace character (\s in the regular expression), we eliminate the extra spaces from the resulting substrings.
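The Replace and Split calls discussed above can be sketched as follows. The sample strings, class name and method names are assumptions made for illustration and may differ from those in Fig. 16.22.

```csharp
using System;
using System.Text.RegularExpressions;

class ReplaceAndSplitSketch
{
    // Static Replace: the escaped \* matches a literal asterisk.
    public static string StarsToCarets(string text)
    {
        return Regex.Replace(text, @"\*", "^");
    }

    // Instance Replace with a count: only the first three digits are replaced.
    public static string FirstThreeDigits(string text)
    {
        var digitRegex = new Regex(@"\d");
        return digitRegex.Replace(text, "digit", 3);
    }

    // Static Split: the delimiter is a comma followed by one whitespace character.
    public static string[] SplitList(string text)
    {
        return Regex.Split(text, @",\s");
    }

    static void Main()
    {
        Console.WriteLine(StarsToCarets("This sentence ends in 5 stars *****"));
        Console.WriteLine(FirstThreeDigits("1, 2, 3, 4, 5, 6, 7, 8"));
        Console.WriteLine(string.Join("|", SplitList("1, 2, 3, 4, 5, 6, 7, 8")));
    }
}
```

Because the delimiter pattern consumes the space after each comma, the substrings returned by Split contain no leading whitespace.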

16.16 Wrap-Up

In this chapter, you learned about the Framework Class Library’s string- and character-processing capabilities. We overviewed the fundamentals of characters and strings. You saw how to determine the length of strings, copy strings, access the individual characters in strings, search strings, obtain substrings from larger strings, compare strings, concatenate strings, replace characters in strings and convert strings to uppercase or lowercase letters.

We showed how to use class StringBuilder to build strings dynamically. You learned how to determine and specify the size of a StringBuilder object, and how to append, insert, remove and replace characters in a StringBuilder object. We then introduced the character-testing methods of type Char that enable a program to determine whether a character is a digit, a letter, a lowercase letter, an uppercase letter, a punctuation mark or a symbol other than a punctuation mark, and the methods for converting a character to uppercase or lowercase.

Finally, we discussed classes Regex, Match and MatchCollection from namespace System.Text.RegularExpressions and the symbols that are used to form regular expressions. You learned how to find patterns in a string and match entire strings to patterns with Regex methods Match and Matches, how to replace characters in a string with Regex method Replace and how to split strings at delimiters with Regex method Split. In the next chapter, you’ll learn how to read data from and write data to files.
