25. Strings, Characters and Regular Expressions

Objectives

In this chapter you’ll learn:

• To create and manipulate immutable character string objects of class String.

• To create and manipulate mutable character string objects of class StringBuilder.

• To create and manipulate objects of class Character.

• To use a StringTokenizer object to break a String object into tokens.

• To use regular expressions to validate String data entered into an application.

The chief defect of Henry King / Was chewing little bits of string.

—Hilaire Belloc

Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences.

—William Strunk, Jr.

I have made this letter longer than usual, because I lack the time to make it short.

—Blaise Pascal

Outline

25.1 Introduction

25.2 Fundamentals of Characters and Strings

25.3 Class String

25.3.1 String Constructors

25.3.2 String Methods length, charAt and getChars

25.3.3 Comparing Strings

25.3.4 Locating Characters and Substrings in Strings

25.3.5 Extracting Substrings from Strings

25.3.6 Concatenating Strings

25.3.7 Miscellaneous String Methods

25.3.8 String Method valueOf

25.4 Class StringBuilder

25.4.1 StringBuilder Constructors

25.4.2 StringBuilder Methods length, capacity, setLength and ensureCapacity

25.4.3 StringBuilder Methods charAt, setCharAt, getChars and reverse

25.4.4 StringBuilder append Methods

25.4.5 StringBuilder Insertion and Deletion Methods

25.5 Class Character

25.6 Class StringTokenizer

25.7 Regular Expressions, Class Pattern and Class Matcher

25.8 Wrap-Up

25.1 Introduction

This chapter introduces Java’s string- and character-processing capabilities. The techniques discussed here are appropriate for validating program input, displaying information to users and other text-based manipulations. They are also appropriate for developing text editors, word processors, page-layout software, computerized typesetting systems and other kinds of text-processing software. We have already presented several string-processing capabilities in earlier chapters. This chapter discusses in detail the capabilities of class String, class StringBuilder and class Character from the java.lang package and class StringTokenizer from the java.util package. These classes provide the foundation for string and character manipulation in Java.

The chapter also discusses regular expressions, which give applications the capability to validate input. This functionality is provided by class String together with classes Matcher and Pattern of the java.util.regex package.

25.2 Fundamentals of Characters and Strings

Characters are the fundamental building blocks of Java source programs. Every program is composed of a sequence of characters that, when grouped together meaningfully, are interpreted by the computer as a series of instructions used to accomplish a task. A program may contain character literals. A character literal is an integer value represented as a character in single quotes. For example, 'z' represents the integer value of z, and '\n' represents the integer value of newline. The value of a character literal is the integer value of the character in the Unicode character set. Appendix B presents the integer equivalents of the characters in the ASCII character set, which is a subset of Unicode. For detailed information on Unicode, visit www.unicode.org.
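The following minimal sketch illustrates that a character literal is an integer value. The class name CharLiteralSketch and the helper method codeOf are our own illustrative choices, not part of the Java API.

```java
// Illustrative sketch: character literals are integer values.
class CharLiteralSketch {
    // Assigning a char to an int yields the character's Unicode value.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('z'));   // 122
        System.out.println(codeOf('\n'));  // 10 (newline)
        System.out.println('a' < 'b');     // true, because 97 is less than 98
    }
}
```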

Recall from Section 2.2 that a string is a sequence of characters treated as a single unit. A string may include letters, digits and various special characters, such as +, -, *, / and $. A string is an object of class String. String literals (stored in memory as String objects) are written as a sequence of characters in double quotation marks, as in "blue".

A string may be assigned to a String reference. The declaration

String color = "blue";

initializes String variable color to refer to a String object that contains the string "blue".

Performance Tip 25.1

Java treats all string literals with the same contents as a single String object that has many references to it. This conserves memory.

25.3 Class `String`

Class String is used to represent strings in Java. The next several subsections cover many of class String’s capabilities.

25.3.1 `String` Constructors

Class String provides constructors for initializing String objects in a variety of ways. Four of the constructors are demonstrated in the main method of Fig. 25.1.

Fig. 25.1. String class constructors.

Line 12 instantiates a new String object using class String’s no-argument constructor and assigns its reference to s1. The new String object contains no characters (the empty string) and has a length of 0.

Line 13 instantiates a new String object using class String’s constructor that takes a String as an argument and assigns its reference to s2. The new String contains the same sequence of characters as the one that is passed as an argument to the constructor.

Software Engineering Observation 25.1

It is not necessary to copy an existing String object. String objects are immutable—their character contents cannot be changed after they are created, because class String does not provide any methods that allow the contents of a String object to be modified.

Line 14 instantiates a new String object and assigns its reference to s3 using class String’s constructor that takes a char array as an argument. The new String object contains a copy of the characters in the array.

Line 15 instantiates a new String object and assigns its reference to s4 using class String’s constructor that takes a char array and two integers as arguments. The second argument specifies the starting position (the offset) from which characters in the array are accessed. Remember that the first character is at position 0. The third argument specifies the number of characters (the count) to access in the array. The new String object contains a string formed from the accessed characters. If the offset or the count specified as an argument results in accessing an element outside the bounds of the character array, a StringIndexOutOfBoundsException is thrown.
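The four constructors can be sketched as follows. This is an illustrative example rather than the listing of Fig. 25.1, so the line numbers cited above do not apply to it; the class name, the helper method build and the sample data are assumptions.

```java
// Illustrative sketch of four String constructors (not the book's figure).
class StringConstructorsSketch {
    // Builds one String with each of the four constructors described above.
    static String[] build() {
        char[] charArray = { 'b', 'i', 'r', 't', 'h', ' ', 'd', 'a', 'y' };
        String s = new String("hello");

        String s1 = new String();                 // empty string, length 0
        String s2 = new String(s);                // same characters as s
        String s3 = new String(charArray);        // copy of the whole array
        String s4 = new String(charArray, 6, 3);  // offset 6, count 3

        return new String[] { s1, s2, s3, s4 };
    }

    public static void main(String[] args) {
        String[] results = build();
        for (int i = 0; i < results.length; i++) {
            System.out.printf("s%d = \"%s\"%n", i + 1, results[i]);
        }
    }
}
```

With this sample data, s3 contains "birth day" and s4 contains "day".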

Common Programming Error 25.1

Attempting to access a character that is outside the bounds of a string results in a StringIndexOutOfBoundsException.

25.3.2 `String` Methods `length`, `charAt` and `getChars`

String methods length, charAt and getChars return the length of a string, obtain the character at a specific location in a string and retrieve a set of characters from a string as a char array, respectively. The application in Fig. 25.2 demonstrates each of these methods.

Fig. 25.2. String class character-manipulation methods.

Line 15 uses String method length to determine the number of characters in string s1. Like arrays, strings know their own length. However, unlike arrays, you cannot access a String’s length via a length field—instead you must call the String’s length method.

Lines 20–21 print the characters of the string s1 in reverse order (and separated by spaces). String method charAt (line 21) returns the character at a specific position in the string. Method charAt receives an integer argument that is used as the index and returns the character at that position. Like arrays, the first element of a string is at position 0.

Line 24 uses String method getChars to copy the characters of a string into a character array. The first argument is the starting index in the string from which characters are to be copied. The second argument is the index that is one past the last character to be copied from the string. The third argument is the character array into which the characters are to be copied. The last argument is the starting index where the copied characters are placed in the target character array. Next, line 28 prints the char array contents one character at a time.
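A minimal sketch of these three methods follows. It is not the listing of Fig. 25.2; the class name, the helper methods reversedWithSpaces and copyRange and the sample string are assumptions chosen for illustration.

```java
// Illustrative sketch of String methods length, charAt and getChars.
class StringCharAccessSketch {
    // Returns the characters of s in reverse order, separated by spaces.
    static String reversedWithSpaces(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = s.length() - 1; i >= 0; i--) {
            out.append(s.charAt(i));
            if (i > 0) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    // Copies the characters from index begin up to (not including) end.
    static char[] copyRange(String s, int begin, int end) {
        char[] target = new char[end - begin];
        s.getChars(begin, end, target, 0); // place them starting at target[0]
        return target;
    }

    public static void main(String[] args) {
        String s1 = "hello there";
        System.out.println(s1.length());                      // 11
        System.out.println(reversedWithSpaces("abc"));        // c b a
        System.out.println(new String(copyRange(s1, 0, 5)));  // hello
    }
}
```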

25.3.3 Comparing Strings

Chapter 7 discussed sorting and searching arrays. Frequently, the information being sorted or searched consists of strings that must be compared to place them into the proper order or to determine whether a string appears in an array (or other collection). Class String provides several methods for comparing strings.

To understand what it means for one string to be greater than or less than another, consider the process of alphabetizing a series of last names. You would, no doubt, place “Jones” before “Smith” because the first letter of “Jones” comes before the first letter of “Smith” in the alphabet. But the alphabet is more than just a list of 26 letters—it is an ordered set of characters. Each letter occurs in a specific position within the set. Z is more than just a letter of the alphabet—it is specifically the twenty-sixth letter of the alphabet.

How does the computer know that one letter comes before another? All characters are represented in the computer as numeric codes (see Appendix B). When the computer compares strings, it actually compares the numeric codes of the characters in the strings.

Figure 25.3 demonstrates String methods equals, equalsIgnoreCase, compareTo and regionMatches and using the equality operator == to compare String objects.

Fig. 25.3. String comparisons.

The condition at line 17 uses method equals to compare string s1 and the string literal "hello" for equality. Method equals (a method of class Object overridden in String) tests whether two objects are equal; for Strings, this means that the two objects contain identical sequences of characters. The method returns true if the contents of the objects are equal, and false otherwise. The preceding condition is true because string s1 was initialized with the string literal "hello". Method equals uses a lexicographical comparison: it compares the integer Unicode values (see www.unicode.org for more information) that represent each character in each string. Thus, if the string "hello" is compared with the string "HELLO", the result is false, because the integer representation of a lowercase letter is different from that of the corresponding uppercase letter.

The condition at line 23 uses the equality operator == to compare string s1 for equality with the string literal "hello". Operator == has different functionality when it is used to compare references than when it is used to compare values of primitive types. When primitive-type values are compared with ==, the result is true if both values are identical. When references are compared with ==, the result is true if both references refer to the same object in memory. To compare the actual contents (or state information) of objects for equality, a method must be invoked. In the case of Strings, that method is equals. The preceding condition evaluates to false at line 23 because the reference s1 was initialized with the statement

s1 = new String( "hello" );

which creates a new String object with a copy of string literal "hello" and assigns the new object to variable s1. If s1 had been initialized with the statement

s1 = "hello";

which directly assigns the string literal "hello" to variable s1, the condition would be true. Remember that Java treats all string literal objects with the same contents as one String object to which there can be many references. Thus, lines 8, 17 and 23 all refer to the same String object "hello" in memory.

Common Programming Error 25.2

Comparing references with == can lead to logic errors, because == compares the references to determine whether they refer to the same object, not whether two objects have the same contents. When two identical (but separate) objects are compared with ==, the result will be false. When comparing objects to determine whether they have the same contents, use method equals.
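The difference between == and equals can be sketched as follows. The class name and the helper methods sameReference and sameContents are hypothetical names introduced only for this illustration.

```java
// Illustrative sketch: reference comparison versus content comparison.
class StringEqualitySketch {
    // true only when both references refer to the same object
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // true when both strings contain the same characters
    static boolean sameContents(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String created = new String("hello"); // a new, separate object
        String literal = "hello";             // refers to the shared literal

        System.out.println(sameContents(created, "hello"));   // true
        System.out.println(sameReference(created, "hello"));  // false
        System.out.println(sameReference(literal, "hello"));  // true
        System.out.println(sameContents("hello", "HELLO"));   // false
    }
}
```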

If you are sorting Strings, you may compare them for equality with method equalsIgnoreCase, which ignores whether the letters in each string are uppercase or lowercase when performing the comparison. Thus, the string "hello" and the string "HELLO" compare as equal. Line 29 uses String method equalsIgnoreCase to compare string s3 ("Happy Birthday") for equality with string s4 ("happy birthday"). The result of this comparison is true because the comparison ignores case.

Lines 35–44 use method compareTo to compare strings. Method compareTo is declared in the Comparable interface and implemented in the String class. Line 36 compares string s1 to string s2. Method compareTo returns 0 if the strings are equal, a negative number if the string that invokes compareTo is less than the string that is passed as an argument and a positive number if the string that invokes compareTo is greater than the string that is passed as an argument. Method compareTo uses a lexicographical comparison—it compares the numeric values of corresponding characters in each string. (For more information on the exact value returned by the compareTo method, see java.sun.com/javase/6/docs/api/java/lang/String.html.)

The condition at line 47 uses String method regionMatches to compare portions of two strings for equality. The first argument is the starting index in the string that invokes the method. The second argument is a comparison string. The third argument is the starting index in the comparison string. The last argument is the number of characters to compare between the two strings. The method returns true only if the specified number of characters are lexicographically equal.

Finally, the condition at line 54 uses a five-argument version of String method regionMatches to compare portions of two strings for equality. When the first argument is true, the method ignores the case of the characters being compared. The remaining arguments are identical to those described for the four-argument regionMatches method.
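The following sketch exercises equalsIgnoreCase, compareTo and both versions of regionMatches. It is not the listing of Fig. 25.3; the class name, the helper method sign and the sample strings are assumptions. Because only the sign of compareTo's result is specified in the discussion above, the helper reduces it to -1, 0 or 1.

```java
// Illustrative sketch of equalsIgnoreCase, compareTo and regionMatches.
class StringCompareSketch {
    // Reduces compareTo's result to -1, 0 or 1.
    static int sign(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    public static void main(String[] args) {
        String s3 = "Happy Birthday";
        String s4 = "happy birthday";

        System.out.println(s3.equalsIgnoreCase(s4));   // true
        System.out.println(sign("hello", "goodbye"));  // 1
        System.out.println(sign("goodbye", "hello"));  // -1
        System.out.println(sign("hello", "hello"));    // 0

        // Case-sensitive version: 'H' and 'h' differ, so there is no match.
        System.out.println(s3.regionMatches(0, s4, 0, 5));        // false
        // Passing true as the first argument ignores case.
        System.out.println(s3.regionMatches(true, 0, s4, 0, 5));  // true
    }
}
```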

Figure 25.4 demonstrates String methods startsWith and endsWith. Method main creates array strings containing the strings "started", "starting", "ended" and "ending". The rest of main consists of three for statements that test the elements of the array to determine whether they start with or end with a particular set of characters.

Fig. 25.4. String class startsWith and endsWith methods.

Lines 11–15 use the version of method startsWith that takes a String argument. The condition in the if statement (line 13) determines whether each String in the array starts with the characters "st". If so, the method returns true and the application prints that String. Otherwise, the method returns false and nothing happens.

Lines 20–25 use the startsWith method that takes a String and an integer as arguments. The integer specifies the index at which the comparison should begin in the string. The condition in the if statement (line 22) determines whether each String in the array has the characters "art" beginning with the third character in each string. If so, the method returns true and the application prints the String.

The third for statement (lines 30–34) uses method endsWith, which takes a String argument. The condition at line 32 determines whether each String in the array ends with the characters "ed". If so, the method returns true and the application prints the String.
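A sketch of the three tests follows, using the array contents named above. It is not the listing of Fig. 25.4; the class name and the helper methods startingWith and endingWith are assumptions. Calling startsWith with index 0 behaves like the one-argument version.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of String methods startsWith and endsWith.
class StringStartEndSketch {
    static final String[] STRINGS = { "started", "starting", "ended", "ending" };

    // Collects the elements that contain prefix beginning at the given index.
    static List<String> startingWith(String prefix, int index) {
        List<String> matches = new ArrayList<>();
        for (String s : STRINGS) {
            if (s.startsWith(prefix, index)) {
                matches.add(s);
            }
        }
        return matches;
    }

    // Collects the elements that end with suffix.
    static List<String> endingWith(String suffix) {
        List<String> matches = new ArrayList<>();
        for (String s : STRINGS) {
            if (s.endsWith(suffix)) {
                matches.add(s);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(startingWith("st", 0));   // [started, starting]
        System.out.println(startingWith("art", 2));  // [started, starting]
        System.out.println(endingWith("ed"));        // [started, ended]
    }
}
```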

25.3.4 Locating Characters and Substrings in Strings

Often it is useful to search for a character or set of characters in a string. For example, if you are creating your own word processor, you might want to provide a capability for searching through documents. Figure 25.5 demonstrates the many versions of String methods indexOf and lastIndexOf that search for a specified character or substring in a string. All the searches in this example are performed on the string letters (initialized with "abcdefghijklmabcdefghijklm") in method main. Lines 11–16 use method indexOf to locate the first occurrence of a character in a string. If method indexOf finds the character, it returns the character’s index in the string—otherwise, indexOf returns –1. There are two versions of indexOf that search for characters in a string. The expression in line 12 uses the version of method indexOf that takes an integer representation of the character to find. The expression at line 14 uses another version of method indexOf, which takes two integer arguments—the character and the starting index at which the search of the string should begin.

Fig. 25.5. String class searching methods.

The statements at lines 19–24 use method lastIndexOf to locate the last occurrence of a character in a string. Method lastIndexOf performs the search from the end of the string toward the beginning. If method lastIndexOf finds the character, it returns the index of the character in the string—otherwise, lastIndexOf returns –1. There are two versions of lastIndexOf that search for characters in a string. The expression at line 20 uses the version that takes the integer representation of the character. The expression at line 22 uses the version that takes two integer arguments—the integer representation of the character and the index from which to begin searching backward.

Lines 27–40 demonstrate versions of methods indexOf and lastIndexOf that each take a String as the first argument. These versions perform identically to those described earlier except that they search for sequences of characters (or substrings) that are specified by their String arguments. If the substring is found, these methods return the index in the string of the first character in the substring.
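The searches can be sketched as follows, using the same letters string. This is not the listing of Fig. 25.5; the class name, the constant LETTERS and the particular characters and indices searched for are assumptions.

```java
// Illustrative sketch of String methods indexOf and lastIndexOf.
class StringSearchSketch {
    static final String LETTERS = "abcdefghijklmabcdefghijklm";

    public static void main(String[] args) {
        // Searching for a single character
        System.out.println(LETTERS.indexOf('c'));           // 2
        System.out.println(LETTERS.indexOf('a', 1));        // 13
        System.out.println(LETTERS.indexOf('$'));           // -1 (not found)
        System.out.println(LETTERS.lastIndexOf('c'));       // 15
        System.out.println(LETTERS.lastIndexOf('a', 12));   // 0

        // Searching for a substring
        System.out.println(LETTERS.indexOf("def"));         // 3
        System.out.println(LETTERS.indexOf("def", 7));      // 16
        System.out.println(LETTERS.lastIndexOf("def"));     // 16
        System.out.println(LETTERS.lastIndexOf("def", 15)); // 3
    }
}
```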

25.3.5 Extracting Substrings from Strings

Class String provides two substring methods to enable a new String object to be created by copying part of an existing String object. Each method returns a new String object. Both methods are demonstrated in Fig. 25.6.

Fig. 25.6. String class substring methods.

The expression letters.substring( 20 ) at line 12 uses the substring method that takes one integer argument. The argument specifies the starting index in the original string letters from which characters are to be copied. The substring returned contains a copy of the characters from the starting index to the end of the string. Specifying an index outside the bounds of the string causes a StringIndexOutOfBoundsException.

The expression letters.substring( 3, 6 ) at line 15 uses the substring method that takes two integer arguments. The first argument specifies the starting index from which characters are copied in the original string. The second argument specifies the index one beyond the last character to be copied (i.e., copy up to, but not including, that index in the string). The substring returned contains a copy of the specified characters from the original string. Specifying an index outside the bounds of the string causes a StringIndexOutOfBoundsException.
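Both substring versions can be sketched as follows. We assume here that letters holds the same value as in the preceding section, "abcdefghijklmabcdefghijklm"; the class name and the constant LETTERS are illustrative.

```java
// Illustrative sketch of the two String substring methods.
class StringSubstringSketch {
    // Assumed value, carried over from the searching example.
    static final String LETTERS = "abcdefghijklmabcdefghijklm";

    public static void main(String[] args) {
        // From index 20 to the end of the string
        System.out.println(LETTERS.substring(20));    // hijklm
        // From index 3 up to, but not including, index 6
        System.out.println(LETTERS.substring(3, 6));  // def
    }
}
```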

25.3.6 Concatenating Strings

String method concat (Fig. 25.7) concatenates two String objects and returns a new String object containing the characters from both original strings. The expression s1.concat( s2 ) at line 13 forms a string by appending the characters in string s2 to the characters in string s1. The original Strings to which s1 and s2 refer are not modified.

Fig. 25.7. String method concat.
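A minimal sketch of concat follows. It is not the book's listing; the class name, the helper method join and the sample strings are assumptions.

```java
// Illustrative sketch of String method concat.
class StringConcatSketch {
    // Returns a new String; neither argument is modified.
    static String join(String s1, String s2) {
        return s1.concat(s2);
    }

    public static void main(String[] args) {
        String s1 = "Happy ";
        String s2 = "Birthday";

        System.out.println(join(s1, s2));  // Happy Birthday
        System.out.println(s1);            // Happy  (unchanged)
    }
}
```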

25.3.7 Miscellaneous `String` Methods

Class String provides several methods that return modified copies of strings or that return character arrays. These methods are demonstrated in the application in Fig. 25.8.

Fig. 25.8. String methods replace, toLowerCase, toUpperCase, trim and toCharArray.

Line 16 uses String method replace to return a new String object in which every occurrence in string s1 of character 'l' (lowercase el) is replaced with character 'L'. Method replace leaves the original string unchanged. If there are no occurrences of the first argument in the string, method replace returns the original string.

Line 19 uses String method toUpperCase to generate a new String with uppercase letters where corresponding lowercase letters exist in s1. The method returns a new String object containing the converted string and leaves the original string unchanged. If there are no characters to convert, method toUpperCase returns the original string.

Line 20 uses String method toLowerCase to return a new String object with lowercase letters where corresponding uppercase letters exist in s2. The original string remains unchanged. If there are no characters in the original string to convert, toLowerCase returns the original string.

Line 23 uses String method trim to generate a new String object that removes all white-space characters that appear at the beginning or end of the string on which trim operates. The method returns a new String object containing the string without leading or trailing white space. The original string remains unchanged.

Line 26 uses String method toCharArray to create a new character array containing a copy of the characters in string s1. Lines 29–30 output each char in the array.
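These methods can be sketched as follows. This is not the listing of Fig. 25.8; the class name, the helper method spaced and the sample strings are assumptions.

```java
// Illustrative sketch of replace, toUpperCase, toLowerCase, trim and toCharArray.
class StringMiscSketch {
    // Joins the characters returned by toCharArray with single spaces.
    static String spaced(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = "GOODBYE";
        String s3 = "   spaces   ";

        System.out.println(s1.replace('l', 'L'));  // heLLo
        System.out.println(s1.toUpperCase());      // HELLO
        System.out.println(s2.toLowerCase());      // goodbye
        System.out.println(s3.trim());             // spaces
        System.out.println(spaced(s1));            // h e l l o
        System.out.println(s1);                    // hello (never modified)
    }
}
```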

25.3.8 `String` Method `valueOf`

As we’ve seen, every object in Java has a toString method that enables a program to obtain the object’s string representation. Unfortunately, this technique cannot be used with primitive types because they do not have methods. Class String provides static methods that take an argument of any type and convert the argument to a String object. Figure 25.9 demonstrates the String class valueOf methods.

Fig. 25.9. String class valueOf methods.

The expression String.valueOf(charArray) at line 18 uses the character array charArray to create a new String object. The expression String.valueOf(charArray, 3, 3) at line 20 uses a portion of the character array charArray to create a new String object. The second argument specifies the starting index from which the characters are used. The third argument specifies the number of characters to be used.

There are seven other versions of method valueOf, which take arguments of type boolean, char, int, long, float, double and Object, respectively. These are demonstrated in lines 21–25. Note that the version of valueOf that takes an Object as an argument can do so because all Objects can be converted to Strings with method toString.

[Note: Lines 12–13 use literal values 10000000000L and 2.5f as the initial values of long variable longValue and float variable floatValue, respectively. By default, Java treats integer literals as type int and floating-point literals as type double. Appending the letter L to the literal 10000000000 and appending the letter f to the literal 2.5 indicate to the compiler that 10000000000 should be treated as a long and that 2.5 should be treated as a float. An uppercase L or lowercase l can be used to denote a literal of type long, and an uppercase F or lowercase f can be used to denote a literal of type float.]
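The valueOf versions can be sketched as follows. This is not the listing of Fig. 25.9; the class name, the helper method fromRange and all sample values other than 10000000000L and 2.5f are assumptions.

```java
// Illustrative sketch of the static String.valueOf methods.
class StringValueOfSketch {
    // Creates a String from count characters of the array, starting at offset.
    static String fromRange(char[] chars, int offset, int count) {
        return String.valueOf(chars, offset, count);
    }

    public static void main(String[] args) {
        char[] charArray = { 'a', 'b', 'c', 'd', 'e', 'f' };
        long longValue = 10000000000L; // L suffix: a long literal
        float floatValue = 2.5f;       // f suffix: a float literal
        Object objectRef = "hello";    // toString is used for Objects

        System.out.println(String.valueOf(charArray));   // abcdef
        System.out.println(fromRange(charArray, 3, 3));  // def
        System.out.println(String.valueOf(true));        // true
        System.out.println(String.valueOf('Z'));         // Z
        System.out.println(String.valueOf(7));           // 7
        System.out.println(String.valueOf(longValue));   // 10000000000
        System.out.println(String.valueOf(floatValue));  // 2.5
        System.out.println(String.valueOf(33.333));      // 33.333
        System.out.println(String.valueOf(objectRef));   // hello
    }
}
```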

25.4 Class `StringBuilder`

Once a String object is created, its contents can never change. We now discuss the features of class StringBuilder for creating and manipulating dynamic string information—that is, modifiable strings. Every StringBuilder is capable of storing a number of characters specified by its capacity. If the capacity of a StringBuilder is exceeded, the capacity is automatically expanded to accommodate the additional characters. Class StringBuilder is also used to implement operators + and += for String concatenation.

Performance Tip 25.2

Java can perform certain optimizations involving String objects (such as sharing one String object among multiple references) because it knows these objects will not change. Strings (not StringBuilders) should be used if the data will not change.

Performance Tip 25.3

In programs that frequently perform string concatenation, or other string modifications, it is often more efficient to implement the modifications with class StringBuilder.

Software Engineering Observation 25.2

StringBuilders are not thread safe. If multiple threads require access to the same dynamic string information, use class StringBuffer in your code. Classes StringBuilder and StringBuffer provide the same capabilities, but class StringBuffer is thread safe.

25.4.1 `StringBuilder` Constructors

Class StringBuilder provides four constructors. We demonstrate three of these in Fig. 25.10. Line 8 uses the no-argument StringBuilder constructor to create a StringBuilder with no characters in it and an initial capacity of 16 characters (the default for a StringBuilder). Line 9 uses the StringBuilder constructor that takes an integer argument to create a StringBuilder with no characters in it and the initial capacity specified by the integer argument (i.e., 10). Line 10 uses the StringBuilder constructor that takes a String argument (in this case, a string literal) to create a StringBuilder containing the characters in the String argument. The initial capacity is the number of characters in the String argument plus 16.

Fig. 25.10. StringBuilder class constructors.

Lines 12–14 use the method toString of class StringBuilder to output the StringBuilders with the printf method. In Section 25.4.4, we discuss how Java uses StringBuilder objects to implement the + and += operators for string concatenation.
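The three constructors and their initial capacities can be sketched as follows. This is not the listing of Fig. 25.10; the class name, the helper method capacities and the literal "hello" are assumptions.

```java
// Illustrative sketch of three StringBuilder constructors.
class StringBuilderConstructorsSketch {
    // Returns the initial capacity produced by each constructor.
    static int[] capacities() {
        StringBuilder buffer1 = new StringBuilder();         // default capacity
        StringBuilder buffer2 = new StringBuilder(10);       // requested capacity
        StringBuilder buffer3 = new StringBuilder("hello");  // length plus 16
        return new int[] {
            buffer1.capacity(), buffer2.capacity(), buffer3.capacity()
        };
    }

    public static void main(String[] args) {
        int[] c = capacities();
        System.out.printf("%d %d %d%n", c[0], c[1], c[2]);  // 16 10 21
        System.out.println(new StringBuilder("hello"));     // hello
    }
}
```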

25.4.2 `StringBuilder` Methods `length`, `capacity`, `setLength` and `ensureCapacity`

StringBuilder methods length and capacity return the number of characters currently in a StringBuilder and the number of characters that can be stored in a StringBuilder without allocating more memory, respectively. Method ensureCapacity guarantees that a StringBuilder has at least the specified capacity. Method setLength increases or decreases the length of a StringBuilder. Figure 25.11 demonstrates these methods.

Fig. 25.11. StringBuilder methods length and capacity.

The application contains one StringBuilder called buffer. Line 8 uses the StringBuilder constructor that takes a String argument to initialize the StringBuilder with "Hello, how are you?". Lines 10–11 print the contents, length and capacity of the StringBuilder. Note in the output window that the capacity of the StringBuilder is initially 35. Recall that the StringBuilder constructor that takes a String argument initializes the capacity to the length of the string passed as an argument plus 16.

Line 13 uses method ensureCapacity to expand the capacity of the StringBuilder to a minimum of 75 characters. Actually, if the original capacity is less than the argument, the method ensures a capacity that is the greater of the number specified as an argument and twice the original capacity plus 2. The StringBuilder’s current capacity remains unchanged if it is more than the specified capacity.

Performance Tip 25.4

Dynamically increasing the capacity of a StringBuilder can take a relatively long time. Executing a large number of these operations can degrade the performance of an application. If a StringBuilder is going to increase greatly in size, possibly multiple times, setting its capacity high at the beginning will increase performance.

Line 16 uses method setLength to set the StringBuilder’s length to 10. If the specified length is less than the StringBuilder’s current number of characters, the buffer is truncated to the specified length (i.e., the remaining characters in the StringBuilder are discarded). If the specified length is greater than the StringBuilder’s current number of characters, null characters (characters with the numeric representation 0) are appended until the total number of characters in the StringBuilder is equal to the specified length.
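The capacity and length operations can be sketched as follows, using the same initial string. This is not the listing of Fig. 25.11; the class name and the helper methods sample and resized are assumptions.

```java
// Illustrative sketch of length, capacity, ensureCapacity and setLength.
class StringBuilderCapacitySketch {
    static StringBuilder sample() {
        return new StringBuilder("Hello, how are you?");
    }

    // Ensures a capacity of at least 75, then truncates to 10 characters.
    static StringBuilder resized() {
        StringBuilder buffer = sample();
        buffer.ensureCapacity(75); // twice 35 plus 2 is 72, so 75 is used
        buffer.setLength(10);      // characters beyond index 9 are discarded
        return buffer;
    }

    public static void main(String[] args) {
        StringBuilder buffer = sample();
        System.out.println(buffer.length());    // 19
        System.out.println(buffer.capacity());  // 35 (19 plus 16)

        buffer = resized();
        System.out.println(buffer);             // Hello, how
        System.out.println(buffer.length());    // 10

        buffer.setLength(12); // pads with null characters
        System.out.println((int) buffer.charAt(11));  // 0
    }
}
```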

25.4.3 `StringBuilder` Methods `charAt`, `setCharAt`, `getChars` and `reverse`

StringBuilder methods charAt, setCharAt, getChars and reverse manipulate the characters in a StringBuilder. Each of these methods is demonstrated in Fig. 25.12.

Fig. 25.12. StringBuilder class character-manipulation methods.

Method charAt (line 12) takes an integer argument and returns the character in the StringBuilder at that index. Method getChars (line 15) copies characters from a StringBuilder into the character array passed as an argument. This method takes four arguments—the starting index from which characters should be copied in the StringBuilder, the index one past the last character to be copied from the StringBuilder, the character array into which the characters are to be copied and the starting location in the character array where the first character should be placed. Method setCharAt (lines 21 and 22) takes an integer and a character argument and sets the character at the specified position in the StringBuilder to the character argument. Method reverse (line 25) reverses the contents of the StringBuilder.

Common Programming Error 25.3

Attempting to access a character that is outside the bounds of a StringBuilder (i.e., with an index less than 0 or greater than or equal to the StringBuilder’s length) results in a StringIndexOutOfBoundsException.
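These character-manipulation methods can be sketched as follows. This is not the listing of Fig. 25.12; the class name, the helper methods and the sample string are assumptions.

```java
// Illustrative sketch of charAt, setCharAt, getChars and reverse.
class StringBuilderCharsSketch {
    // Copies every character of buffer into a new char array.
    static char[] copyAll(StringBuilder buffer) {
        char[] charArray = new char[buffer.length()];
        buffer.getChars(0, buffer.length(), charArray, 0);
        return charArray;
    }

    // Capitalizes two characters in place, then reverses the contents.
    static String capitalizedAndReversed() {
        StringBuilder buffer = new StringBuilder("hello there");
        buffer.setCharAt(0, 'H');
        buffer.setCharAt(6, 'T'); // buffer now contains "Hello There"
        buffer.reverse();
        return buffer.toString();
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder("hello there");
        System.out.println(buffer.charAt(0));             // h
        System.out.println(buffer.charAt(4));             // o
        System.out.println(new String(copyAll(buffer)));  // hello there
        System.out.println(capitalizedAndReversed());     // erehT olleH
    }
}
```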

25.4.4 `StringBuilder append` Methods

Class StringBuilder provides overloaded append methods (demonstrated in Fig. 25.13) to allow values of various types to be appended to the end of a StringBuilder. Versions are provided for each of the primitive types and for character arrays, Strings, Objects, StringBuilders and CharSequences. (Remember that method toString produces a string representation of any Object.) Each of the methods takes its argument, converts it to a string and appends it to the StringBuilder.

Fig. 25.13. StringBuilder class append methods.

Actually, a compiler can use StringBuilder and the append methods to implement the + and += operators for String concatenation (the exact strategy depends on the compiler and the Java version). For example, assuming the declarations

String string1 = "hello";
String string2 = "BC";
int value = 22;

the statement

String s = string1 + string2 + value;

concatenates "hello", "BC" and 22. The concatenation is performed as follows:

new StringBuilder().append( "hello" ).append( "BC" ).append( 22 ).toString();

First, Java creates an empty StringBuilder, then appends to it the string "hello", the string "BC" and the integer 22. Next, StringBuilder’s method toString converts the StringBuilder to a String object to be assigned to String s. The statement

s += "!";

is performed as follows:

s = new StringBuilder().append( s ).append( "!" ).toString();

First, Java creates an empty StringBuilder, then it appends to the StringBuilder the current contents of s followed by "!". Next, StringBuilder’s method toString converts the StringBuilder to a string representation, and the result is assigned to s.
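The following sketch compares the operator form with the explicit StringBuilder form, using the declarations above. The class name and the helper methods viaOperators and viaBuilder are assumptions; the sketch shows only that both forms produce the same String, not how any particular compiler translates the operators.

```java
// Illustrative sketch: concatenation operators and explicit append calls.
class StringBuilderAppendSketch {
    // Concatenation written with the + and += operators.
    static String viaOperators() {
        String string1 = "hello";
        String string2 = "BC";
        int value = 22;

        String s = string1 + string2 + value;
        s += "!";
        return s;
    }

    // The same result written with explicit StringBuilder calls.
    static String viaBuilder() {
        String s = new StringBuilder()
            .append("hello").append("BC").append(22).toString();
        s = new StringBuilder().append(s).append("!").toString();
        return s;
    }

    public static void main(String[] args) {
        System.out.println(viaOperators());  // helloBC22!
        System.out.println(viaBuilder());    // helloBC22!
    }
}
```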

25.4.5 `StringBuilder` Insertion and Deletion Methods

Class StringBuilder provides overloaded insert methods to allow values of various types to be inserted at any position in a StringBuilder. Versions are provided for each of the primitive types and for character arrays, Strings, Objects and CharSequences. Each method takes its second argument, converts it to a string and inserts it immediately preceding the index specified by the first argument. The first argument must be greater than or equal to 0 and less than or equal to the length of the StringBuilder (an index equal to the length places the value at the end); otherwise, a StringIndexOutOfBoundsException occurs.

Class StringBuilder also provides methods delete and deleteCharAt for deleting characters at any position in a StringBuilder. Method delete takes two arguments: the starting index and the index one past the end of the characters to delete. All characters beginning at the starting index up to but not including the ending index are deleted. Method deleteCharAt takes one argument: the index of the character to delete. Invalid indices cause both methods to throw a StringIndexOutOfBoundsException. Methods insert, delete and deleteCharAt are demonstrated in Fig. 25.14.

Fig. 25.14. StringBuilder methods insert and delete.
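A minimal sketch of insert, delete and deleteCharAt follows. It is not the book's listing; the class name, the helper method steps and the sample values are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of StringBuilder methods insert, delete and deleteCharAt.
class StringBuilderInsertDeleteSketch {
    // Applies a series of edits and records the contents after each one.
    static List<String> steps() {
        List<String> snapshots = new ArrayList<>();
        StringBuilder buffer = new StringBuilder("hello");

        buffer.insert(0, "say ");  // insert a String before index 0
        snapshots.add(buffer.toString());
        buffer.insert(4, 42);      // insert an int before index 4
        snapshots.add(buffer.toString());
        buffer.insert(6, ' ');     // insert a char before index 6
        snapshots.add(buffer.toString());
        buffer.deleteCharAt(6);    // delete the character at index 6
        snapshots.add(buffer.toString());
        buffer.delete(4, 6);       // delete indices 4 and 5
        snapshots.add(buffer.toString());

        return snapshots;
    }

    public static void main(String[] args) {
        for (String snapshot : steps()) {
            System.out.println(snapshot);
        }
    }
}
```

The recorded contents are "say hello", "say 42hello", "say 42 hello", "say 42hello" and "say hello", in that order.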

25.5 Class `Character`

Java provides eight type-wrapper classes—Boolean, Character, Double, Float, Byte, Short, Integer and Long—that enable primitive-type values to be treated as objects. In this section, we present class Character—the type-wrapper class for primitive type char.

Most Character methods are static methods designed for convenience in processing individual char values. These methods take at least a character argument and perform either a test or a manipulation of the character. Class Character also contains a constructor that receives a char argument to initialize a Character object. Most of the methods of class Character are presented in the next three examples. For more information on class Character (and all the type-wrapper classes), see the java.lang package in the Java API documentation.

Figure 25.15 demonstrates some static methods that test characters to determine whether they are a specific character type and the static methods that perform case conversions on characters. You can enter any character and apply the methods to the character.

Fig. 25.15. Character class static methods for testing characters and converting character case.

Line 15 uses Character method isDefined to determine whether character c is defined in the Unicode character set. If so, the method returns true, and otherwise, it returns false. Line 16 uses Character method isDigit to determine whether character c is a defined Unicode digit. If so, the method returns true, and otherwise, it returns false.

Line 18 uses Character method isJavaIdentifierStart to determine whether c is a character that can be the first character of an identifier in Java—that is, a letter, an underscore (_) or a dollar sign ($). If so, the method returns true, and otherwise, it returns false. Line 20 uses Character method isJavaIdentifierPart to determine whether character c is a character that can be used in an identifier in Java—that is, a digit, a letter, an underscore (_) or a dollar sign ($). If so, the method returns true, and otherwise, false.

Line 21 uses Character method isLetter to determine whether character c is a letter. If so, the method returns true, and otherwise, false. Line 23 uses Character method isLetterOrDigit to determine whether character c is a letter or a digit. If so, the method returns true, and otherwise, false.

Line 25 uses Character method isLowerCase to determine whether character c is a lowercase letter. If so, the method returns true, and otherwise, false. Line 27 uses Character method isUpperCase to determine whether character c is an uppercase letter. If so, the method returns true, and otherwise, false.

Line 29 uses Character method toUpperCase to convert the character c to its uppercase equivalent. The method returns the converted character if the character has an uppercase equivalent, and otherwise, the method returns its original argument. Line 31 uses Character method toLowerCase to convert the character c to its lowercase equivalent. The method returns the converted character if the character has a lowercase equivalent, and otherwise, the method returns its original argument.
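The methods described above can be tried together in a few lines. This is a minimal sketch rather than the program of Fig. 25.15: the class name, the describe helper and the sample characters are assumptions made for illustration.

```java
// Sketch: applies the static Character test and conversion methods to one char.
// CharacterTestsSketch and describe are hypothetical names.
class CharacterTestsSketch {
    // Summarizes the results of the test and case-conversion methods for c.
    static String describe(char c) {
        return String.format(
            "defined=%b digit=%b idStart=%b idPart=%b letter=%b "
                + "letterOrDigit=%b lower=%b upper=%b toUpper=%c toLower=%c",
            Character.isDefined(c), Character.isDigit(c),
            Character.isJavaIdentifierStart(c), Character.isJavaIdentifierPart(c),
            Character.isLetter(c), Character.isLetterOrDigit(c),
            Character.isLowerCase(c), Character.isUpperCase(c),
            Character.toUpperCase(c), Character.toLowerCase(c));
    }

    public static void main(String[] args) {
        for (char c : new char[] {'a', 'Z', '8', '$'}) {
            System.out.println("'" + c + "': " + describe(c));
        }
    }
}
```

Note how '8' can be part of an identifier but cannot start one, and how toUpperCase and toLowerCase return a character with no case equivalent unchanged.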

Figure 25.16 demonstrates static Character methods digit and forDigit, which convert characters to digits and digits to characters, respectively, in different number systems. Common number systems include decimal (base 10), octal (base 8), hexadecimal (base 16) and binary (base 2). The base of a number is also known as its radix.

Fig. 25.16. Character class static conversion methods.

Line 28 uses method forDigit to convert the integer digit into a character in the number system specified by the integer radix (the base of the number). For example, the decimal integer 13 in base 16 (the radix) has the character value 'd'. Lowercase and uppercase letters represent the same value in number systems. Line 35 uses method digit to convert the character c into an integer in the number system specified by the integer radix (the base of the number). For example, the character 'A' is the base 16 (the radix) representation of the base 10 value 10. The radix must be between 2 and 36, inclusive.
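The two conversions can be checked directly. The sketch below is illustrative only; the class and helper names are assumptions, and the values mirror the examples in the paragraph above.

```java
// Sketch of Character.forDigit and Character.digit.
// RadixSketch, toChar and toValue are hypothetical names.
class RadixSketch {
    // Converts a numeric digit value to its character in the given radix.
    static char toChar(int digit, int radix) {
        return Character.forDigit(digit, radix);
    }

    // Converts a character to its numeric value in the given radix,
    // or -1 if the character is not a valid digit in that radix.
    static int toValue(char c, int radix) {
        return Character.digit(c, radix);
    }

    public static void main(String[] args) {
        System.out.println(toChar(13, 16));   // 13 in base 16
        System.out.println(toValue('A', 16)); // uppercase hexadecimal digit
        System.out.println(toValue('a', 16)); // lowercase gives the same value
        System.out.println(toValue('9', 8));  // 9 is not an octal digit
    }
}
```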

Figure 25.17 demonstrates the constructor and several non-static methods of class Character: charValue, toString and equals. Lines 8–9 instantiate two Character objects by autoboxing the character constants 'A' and 'a', respectively. Line 12 uses Character method charValue to return the char value stored in Character object c1 and method toString to return a string representation of Character object c2. The condition in the if...else statement at lines 14–17 uses method equals to determine whether the object c1 has the same contents as the object c2 (i.e., the characters inside each object are equal).

Fig. 25.17. Character class non-static methods.
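A compact version of the same idea is sketched below. It is not the figure's code; the class name, the compare helper and the output wording are assumptions.

```java
// Sketch of the non-static Character methods charValue, toString and equals.
// CharacterObjectSketch and compare are hypothetical names.
class CharacterObjectSketch {
    static String compare(char first, char second) {
        Character c1 = first;   // autoboxing creates the Character objects
        Character c2 = second;
        String header = "c1 = " + c1.charValue() + ", c2 = " + c2.toString();
        // equals compares the char values stored inside the two objects
        return header + (c1.equals(c2) ? ": equal" : ": not equal");
    }

    public static void main(String[] args) {
        System.out.println(compare('A', 'a'));
        System.out.println(compare('x', 'x'));
    }
}
```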

25.6 Class `StringTokenizer`

When you read a sentence, your mind breaks the sentence into tokens—individual words and punctuation marks, each of which conveys meaning to you. Compilers also perform tokenization. They break up statements into individual pieces like keywords, identifiers, operators and other programming-language elements. We now study Java’s StringTokenizer class (from package java.util), which breaks a string into its component tokens. Tokens are separated from one another by delimiters, typically white-space characters such as space, tab, newline and carriage return. Other characters can also be used as delimiters to separate tokens. The application in Fig. 25.18 demonstrates class StringTokenizer.

Fig. 25.18. StringTokenizer object used to tokenize strings.

When the user presses the Enter key, the input sentence is stored in variable sentence. Line 17 creates a StringTokenizer for sentence. This StringTokenizer constructor takes a string argument and creates a StringTokenizer for it, and will use the default delimiter string " \t\n\r\f", consisting of a space, a tab, a newline, a carriage return and a form feed, for tokenization. There are two other constructors for class StringTokenizer. In the version that takes two String arguments, the second String is the delimiter string. In the version that takes three arguments, the second String is the delimiter string and the third argument (a boolean) determines whether the delimiters are also returned as tokens (only if the argument is true). This is useful if you need to know what the delimiters are.

Line 19 uses StringTokenizer method countTokens to determine the number of tokens in the string to be tokenized. The condition at line 21 uses StringTokenizer method hasMoreTokens to determine whether there are more tokens in the string being tokenized. If so, line 22 prints the next token in the String. The next token is obtained with a call to StringTokenizer method nextToken, which returns a String. The token is output using println, so subsequent tokens appear on separate lines.

If you would like to change the delimiter string while tokenizing a string, you may do so by specifying a new delimiter string in a nextToken call as follows:

tokens.nextToken( newDelimiterString );

This feature is not demonstrated in Fig. 25.18.
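The sketch below shows the three usages discussed in this section: default delimiters, delimiters returned as tokens, and a delimiter change part-way through. It is an illustration under assumptions (class name, method names and sample strings are hypothetical). When switching delimiters, note that the tokenizer's position rests on the delimiter that ended the previous token, so the new delimiter string here also contains a space; otherwise that space would become part of the next token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Sketch of StringTokenizer usage. TokenizerSketch and its methods are hypothetical.
class TokenizerSketch {
    // Tokenizes with the default white-space delimiters.
    static List<String> tokenize(String sentence) {
        StringTokenizer tokens = new StringTokenizer(sentence);
        List<String> result = new ArrayList<>(tokens.countTokens());
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result;
    }

    // Tokenizes with custom delimiters and returns the delimiters as tokens too.
    static List<String> withDelimiters(String text, String delimiters) {
        StringTokenizer tokens = new StringTokenizer(text, delimiters, true);
        List<String> result = new ArrayList<>();
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result;
    }

    // Reads one token with the default delimiters, then switches to " ,".
    // The new delimiter set stays in effect for the remaining calls.
    static List<String> switchDelimiters(String text) {
        StringTokenizer tokens = new StringTokenizer(text);
        List<String> result = new ArrayList<>();
        result.add(tokens.nextToken());
        result.add(tokens.nextToken(" ,"));
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a sentence with seven tokens"));
        System.out.println(withDelimiters("a,b", ","));
        System.out.println(switchDelimiters("alpha beta gamma,delta"));
    }
}
```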

25.7 Regular Expressions, Class `Pattern` and Class `Matcher`

Regular expressions are sequences of characters and symbols that define a set of strings. They are useful for validating input and ensuring that data is in a particular format. For example, a ZIP code must consist of five digits, and a last name must contain only letters, spaces, apostrophes and hyphens. One application of regular expressions is to facilitate the construction of a compiler. Often, a large and complex regular expression is used to validate the syntax of a program. If the program code does not match the regular expression, the compiler knows that there is a syntax error within the code.

Class String provides several methods for performing regular-expression operations, the simplest of which is the matching operation. String method matches receives a string that specifies the regular expression and matches the contents of the String object on which it is called to the regular expression. The method returns a boolean indicating whether the match succeeded.

A regular expression consists of literal characters and special symbols. Figure 25.19 specifies some predefined character classes that can be used with regular expressions. A character class is an escape sequence that represents a group of characters. A digit is any numeric character. A word character is any letter (uppercase or lowercase), any digit or the underscore character. A whitespace character is a space, a tab, a carriage return, a newline or a form feed. Each character class matches a single character in the string we are attempting to match with the regular expression.

Fig. 25.19. Predefined character classes.
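To see that each predefined character class matches exactly one character, consider the following sketch. The class and method names are assumptions; the patterns are the digit, word and whitespace classes described above, written as Java string literals with doubled backslashes.

```java
// Sketch: predefined character classes with String method matches.
// CharClassSketch and its methods are hypothetical names.
class CharClassSketch {
    static boolean isDigit(String s) {
        return s.matches("\\d");   // exactly one digit
    }

    static boolean isWordChar(String s) {
        return s.matches("\\w");   // exactly one letter, digit or underscore
    }

    static boolean isWhitespace(String s) {
        return s.matches("\\s");   // exactly one whitespace character
    }

    public static void main(String[] args) {
        System.out.println(isDigit("7"));
        System.out.println(isDigit("77"));      // two characters, so no match
        System.out.println(isWordChar("_"));
        System.out.println(isWhitespace("\t"));
    }
}
```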

Regular expressions are not limited to these predefined character classes. The expressions employ various operators and other forms of notation to match complex patterns. We examine several of these techniques in the application in Figs. 25.20 and 25.21, which validates user input via regular expressions. [Note: This application is not designed to match all possible valid user input.]

Fig. 25.20. Validating user information using regular expressions.

Fig. 25.21. Inputs and validates data from user using the ValidateInput class.

Figure 25.20 validates user input. Line 9 validates the first name. To match a set of characters that does not have a predefined character class, use square brackets, []. For example, the pattern "[aeiou]" matches a single character that is a vowel. Character ranges are represented by placing a dash (-) between two characters. In the example, "[A-Z]" matches a single uppercase letter. If the first character in the brackets is "^", the expression accepts any character other than those indicated. However, it is important to note that "[^Z]" is not the same as "[A-Y]", which matches uppercase letters A–Y; "[^Z]" matches any character other than capital Z, including lowercase letters and non-letters such as the newline character. Ranges in character classes are determined by the letters’ integer values. In this example, "[A-Za-z]" matches all uppercase and lowercase letters. The range "[A-z]" matches all letters and also matches those characters (such as [ and \) with an integer value between uppercase Z and lowercase a (for more information on integer values of characters see Appendix B, ASCII Character Set). Like predefined character classes, character classes delimited by square brackets match a single character in the search object.

In line 9, the asterisk after the second character class indicates that any number of letters can be matched. In general, when the regular-expression operator "*" appears in a regular expression, the application attempts to match zero or more occurrences of the subexpression immediately preceding the "*". Operator "+" attempts to match one or more occurrences of the subexpression immediately preceding "+". So both "A*" and "A+" will match "AAA", but only "A*" will match an empty string.
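The bracket notation and the * and + operators can be exercised as follows. This is a sketch under assumptions: the class and method names are hypothetical, and the first-name pattern is one plausible reading of the description (an uppercase letter followed by any number of letters), not necessarily the exact expression in Fig. 25.20.

```java
// Sketch of bracketed character classes and the * and + operators.
// BracketClassSketch and its methods are hypothetical names.
class BracketClassSketch {
    // Assumed pattern: one uppercase letter, then zero or more letters.
    static boolean validateFirstName(String name) {
        return name.matches("[A-Z][a-zA-Z]*");
    }

    // Matches exactly one uppercase or lowercase letter.
    static boolean inStrictLetterRange(String ch) {
        return ch.matches("[A-Za-z]");
    }

    // Also matches the characters whose values lie between 'Z' and 'a'.
    static boolean inLooseLetterRange(String ch) {
        return ch.matches("[A-z]");
    }

    // Matches any single character other than capital Z.
    static boolean notCapitalZ(String ch) {
        return ch.matches("[^Z]");
    }

    public static void main(String[] args) {
        System.out.println(validateFirstName("Jane"));
        System.out.println(validateFirstName("jane"));
        System.out.println(inLooseLetterRange("["));
        System.out.println(inStrictLetterRange("["));
        System.out.println(notCapitalZ("\n"));
        System.out.println("".matches("A*"));   // zero occurrences are allowed
        System.out.println("".matches("A+"));   // at least one is required
    }
}
```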

If method validateFirstName returns true (line 29), the application attempts to validate the last name (line 31) by calling validateLastName (lines 13–16 of Fig. 25.20). The regular expression to validate the last name matches any number of letters split by spaces, apostrophes or hyphens.

Line 33 validates the address by calling method validateAddress (lines 19–23 of Fig. 25.20). The first character class matches any digit one or more times (\\d+). Note that two \ characters are used, because \ normally starts an escape sequence in a string. So \\d in a Java string represents the regular expression pattern \d. Then we match one or more white-space characters (\\s+). The character "|" allows a match of the expression to its left or to its right. For example, "Hi (John|Jane)" matches both "Hi John" and "Hi Jane". The parentheses are used to group parts of the regular expression. In this example, the left side of | matches a single word, and the right side matches two words separated by any amount of white space. So the address must contain a number followed by one or two words. Therefore, "10 Broadway" and "10 Main Street" are both valid addresses in this example. The city (lines 26–29 of Fig. 25.20) and state (lines 32–35 of Fig. 25.20) methods also match any word of at least one character or, alternatively, any two words of at least one character if the words are separated by a single space. This means both Waltham and West Newton would match.
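Alternation and grouping can be sketched as shown below. The address pattern is an assumption written to fit the description (a number, white space, then one or two words); it is not copied from Fig. 25.20, and the class and method names are hypothetical.

```java
// Sketch of the | operator and parenthesized groups.
// AddressSketch and its methods are hypothetical names.
class AddressSketch {
    // Assumed pattern: digits, white space, then one word or two words.
    static boolean validateAddress(String address) {
        return address.matches("\\d+\\s+([a-zA-Z]+|[a-zA-Z]+\\s+[a-zA-Z]+)");
    }

    // The group limits the alternation to the two names.
    static boolean isGreeting(String text) {
        return text.matches("Hi (John|Jane)");
    }

    public static void main(String[] args) {
        System.out.println(validateAddress("10 Broadway"));
        System.out.println(validateAddress("10 Main Street"));
        System.out.println(validateAddress("Main Street"));   // no leading number
        System.out.println(isGreeting("Hi Jane"));
    }
}
```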

Quantifiers

The asterisk (*) and plus (+) are formally called quantifiers. Figure 25.22 lists all the quantifiers. We have already discussed how the asterisk (*) and plus (+) quantifiers work. All quantifiers affect only the subexpression immediately preceding the quantifier. Quantifier question mark (?) matches zero or one occurrences of the expression that it quantifies. A set of braces containing one number ({n}) matches exactly n occurrences of the expression it quantifies. We demonstrate this quantifier to validate the zip code in Fig. 25.20 at line 40. Including a comma after the number enclosed in braces ({n,}) matches at least n occurrences of the quantified expression. The set of braces containing two numbers ({n,m}) matches between n and m occurrences of the expression that it quantifies. Quantifiers may be applied to patterns enclosed in parentheses to create more complex regular expressions.

Fig. 25.22. Quantifiers used in regular expressions.

All of the quantifiers are greedy. This means that they will match as many occurrences as they can as long as the match is still successful. However, if any of these quantifiers is followed by a question mark (?), the quantifier becomes reluctant (sometimes called lazy). It then will match as few occurrences as possible as long as the match is still successful.
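The difference between greedy and reluctant matching shows up when a pattern can succeed with more than one length. The sketch below uses classes Pattern and Matcher from package java.util.regex to report the first match; the class name, the firstMatch helper and the sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: greedy versus reluctant quantifiers.
// GreedySketch and firstMatch are hypothetical names.
class GreedySketch {
    // Returns the first substring of input that matches regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("<.+>", "<a><b>"));   // greedy: as much as possible
        System.out.println(firstMatch("<.+?>", "<a><b>"));  // reluctant: as little as possible
        System.out.println(firstMatch("\\d+", "12345"));
        System.out.println(firstMatch("\\d+?", "12345"));
    }
}
```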

The zip code (line 40 in Fig. 25.20) matches a digit five times. This regular expression uses the digit character class and a quantifier with the digit 5 between braces. The phone number (line 46 in Fig. 25.20) matches three digits (the first one cannot be zero) followed by a dash followed by three more digits (again the first one cannot be zero) followed by four more digits.

String method matches checks whether an entire string conforms to a regular expression. For example, we want to accept "Smith" as a last name, but not "9@Smith#". If only a substring matches the regular expression, method matches returns false.
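A short sketch of the brace quantifiers and of whole-string matching follows. The zip-code pattern is the digit class with {5}, as described above; the last-name pattern is a simplified assumption (letters only), and all class and method names are hypothetical.

```java
// Sketch of the {n}, {n,}, {n,m} and ? quantifiers and of whole-string matching.
// QuantifierSketch and its methods are hypothetical names.
class QuantifierSketch {
    // Exactly five digits.
    static boolean validateZip(String zip) {
        return zip.matches("\\d{5}");
    }

    // Simplified assumption: one or more letters and nothing else.
    static boolean validateLastName(String name) {
        return name.matches("[a-zA-Z]+");
    }

    public static void main(String[] args) {
        System.out.println(validateZip("02134"));
        System.out.println(validateZip("2134"));          // too few digits
        System.out.println(validateLastName("Smith"));
        System.out.println(validateLastName("9@Smith#")); // only a substring matches
        System.out.println("colour".matches("colou?r"));  // ? allows zero or one 'u'
        System.out.println("1234".matches("\\d{2,3}"));   // more than three digits
    }
}
```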

Replacing Substrings and Splitting Strings

Sometimes it is useful to replace parts of a string or to split a string into pieces. For this purpose, class String provides methods replaceAll, replaceFirst and split. These methods are demonstrated in Fig. 25.23.

Fig. 25.23. Methods replaceFirst, replaceAll and split.

Method replaceAll replaces text in a string with new text (the second argument) wherever the original string matches a regular expression (the first argument). Line 14 replaces every instance of "*" in firstString with "^". Note that the regular expression ("\\*") precedes character * with two backslashes. Normally, * is a quantifier indicating that a regular expression should match any number of occurrences of a preceding pattern. However, in line 14, we want to find all occurrences of the literal character *. To do this, we must escape character * with character \. Escaping a special regular-expression character with \ instructs the matching engine to find the actual character. Since the expression is stored in a Java string and \ is a special character in Java strings, we must include an additional \. So the Java string "\\*" represents the regular-expression pattern \*, which matches a single * character in the search string. In line 19, every match for the regular expression "stars" in firstString is replaced with "carets".

Method replaceFirst (line 32) replaces the first occurrence of a pattern match. Java Strings are immutable; therefore, method replaceFirst returns a new string in which the appropriate characters have been replaced. This line takes the original string and replaces it with the string returned by replaceFirst. By iterating three times we replace the first three instances of a digit (\d) in secondString with the text "digit".

Method split divides a string into several substrings. The original string is broken in any location that matches a specified regular expression. Method split returns an array of strings containing the substrings between matches for the regular expression. In line 38, we use method split to tokenize a string of comma-separated integers. The argument is the regular expression that locates the delimiter. In this case, we use the regular expression ",\\s*" to separate the substrings wherever a comma occurs. By matching any whitespace characters, we eliminate extra spaces from the resulting substrings. Note that the commas and white-space characters are not returned as part of the substrings. Again, note that the Java string ",\\s*" represents the regular expression ,\s*.
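The three methods can be tried with a few lines of code. The sketch below is not the program of Fig. 25.23; the class name, method names and sample strings are assumptions chosen to follow the description above.

```java
import java.util.Arrays;

// Sketch of String methods replaceAll, replaceFirst and split.
// ReplaceSplitSketch and its methods are hypothetical names.
class ReplaceSplitSketch {
    // Replaces each literal * with ^, then the word "stars" with "carets".
    static String starsToCarets(String text) {
        String replaced = text.replaceAll("\\*", "^");  // \\* is the regex \*
        return replaced.replaceAll("stars", "carets");
    }

    // Replaces the first 'count' digits, one replaceFirst call at a time.
    static String replaceFirstDigits(String text, int count) {
        for (int i = 0; i < count; i++) {
            text = text.replaceFirst("\\d", "digit");   // Strings are immutable: reassign
        }
        return text;
    }

    // Splits on a comma followed by any amount of white space.
    static String[] splitCsv(String text) {
        return text.split(",\\s*");
    }

    public static void main(String[] args) {
        System.out.println(starsToCarets("This sentence ends in 5 stars *****"));
        System.out.println(replaceFirstDigits("1, 2, 3, 4, 5, 6, 7, 8", 3));
        System.out.println(Arrays.toString(splitCsv("1, 2,3,   4")));
    }
}
```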

Classes `Pattern` and `Matcher`

In addition to the regular-expression capabilities of class String, Java provides other classes in package java.util.regex that help developers manipulate regular expressions. Class Pattern represents a regular expression. Class Matcher contains both a regular-expression pattern and a CharSequence in which to search for the pattern.

CharSequence is an interface that allows read access to a sequence of characters. The interface requires that the methods charAt, length, subSequence and toString be declared. Both String and StringBuilder implement interface CharSequence, so an instance of either of these classes can be used with class Matcher.

Common Programming Error 25.4

A regular expression can be tested against an object of any class that implements interface CharSequence, but the regular expression must be a String. Attempting to create a regular expression as a StringBuilder is an error.

If a regular expression will be used only once, static Pattern method matches can be used. This method takes a string that specifies the regular expression and a CharSequence on which to perform the match. This method returns a boolean indicating whether the search object (the second argument) matches the regular expression.

If a regular expression will be used more than once, it is more efficient to use static Pattern method compile to create a specific Pattern object for that regular expression. This method receives a string representing the pattern and returns a new Pattern object, which can then be used to call method matcher. This method receives a CharSequence to search and returns a Matcher object.

Matcher provides method matches, which performs the same task as Pattern method matches, but receives no arguments—the search pattern and search object are encapsulated in the Matcher object. Class Matcher provides other methods, including find, lookingAt, replaceFirst and replaceAll.
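The one-off and reusable approaches can be compared in a small sketch. The class name, method names and the five-digit pattern are assumptions for illustration. The second method accepts any CharSequence, so a StringBuilder can serve as the search object.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: static Pattern.matches versus a compiled, reusable Pattern.
// PatternSketch and its members are hypothetical names.
class PatternSketch {
    // Compiled once and reused for every call to matchesCompiled.
    static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");

    // One-off check: the regular expression is compiled on each call.
    static boolean matchesOnce(CharSequence input) {
        return Pattern.matches("\\d{5}", input);
    }

    // Reusable check: the Matcher holds both the pattern and the search object.
    static boolean matchesCompiled(CharSequence input) {
        Matcher matcher = FIVE_DIGITS.matcher(input);
        return matcher.matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesOnce("02134"));
        System.out.println(matchesCompiled(new StringBuilder("02134")));
        System.out.println(matchesCompiled("0213"));
    }
}
```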

Figure 25.24 presents a simple example that employs regular expressions. This program matches birthdays against a regular expression. The expression matches only birthdays that do not occur in April and that belong to people whose names begin with "J".

Fig. 25.24. Regular expressions checking birthdays.

Lines 11–12 create a Pattern by invoking static Pattern method compile. The dot character "." in the regular expression (line 12) matches any single character except a newline character.

Line 20 creates the Matcher object for the compiled regular expression and the matching sequence (string1). Lines 22–23 use a while loop to iterate through the string. Line 22 uses Matcher method find to attempt to match a piece of the search object to the search pattern. Each call to this method starts at the point where the last call ended, so multiple matches can be found. Matcher method lookingAt also accepts a partial match, but it always starts from the beginning of the search object and succeeds only if the pattern matches there; unlike matches, it does not require the entire search object to match.

Common Programming Error 25.5

Method matches (from class String, Pattern or Matcher) will return true only if the entire search object matches the regular expression. Method find (from class Matcher) will return true if any portion of the search object matches the regular expression, and method lookingAt will return true if a portion starting at the beginning of the search object matches.

Line 23 uses Matcher method group, which returns the string from the search object that matches the search pattern. The string that is returned is the one that was last matched by a call to find or lookingAt. The output in Fig. 25.24 shows the two matches that were found in string1.
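The find and group loop can be sketched as follows. The regular expression and the sample data are assumptions written to fit the description (names beginning with "J", birthdays not in April); they are not taken from Fig. 25.24, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of Matcher methods find, group and lookingAt.
// BirthdaySketch and its members are hypothetical names.
class BirthdaySketch {
    // Assumed pattern: 'J', any characters on the same line, then a date
    // MM-DD-YY whose month does not end in 4 (so 04, April, is excluded).
    static final Pattern NOT_APRIL_J =
        Pattern.compile("J.*\\d[0-35-9]-\\d\\d-\\d\\d");

    // Collects every match; each find call resumes where the previous one ended.
    static List<String> findAll(CharSequence text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = NOT_APRIL_J.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // True only if a match begins at the start of the text.
    static boolean startsWithMatch(CharSequence text) {
        return NOT_APRIL_J.matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        String data = "Jane's Birthday is 05-12-75\n"
            + "Dave's Birthday is 11-04-68\n"
            + "John's Birthday is 04-28-73\n"
            + "Joe's Birthday is 12-17-77";
        for (String match : findAll(data)) {
            System.out.println(match);
        }
        System.out.println(startsWithMatch(data));
    }
}
```

Because the dot does not match a newline, each match stays within a single line of the sample data.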

25.8 Wrap-Up

In this chapter, you learned about more String methods for selecting portions of Strings and manipulating Strings. You also learned about the Character class and some of the methods it declares to handle chars. The chapter also discussed the capabilities of the StringBuilder class for creating Strings. The end of the chapter discussed regular expressions, which provide a powerful capability to search and match portions of Strings that fit a particular pattern.
