Objectives
In this chapter you’ll learn:
• To create and manipulate immutable character string objects of class String.
• To create and manipulate mutable character string objects of class StringBuilder.
• To create and manipulate objects of class Character.
• To use a StringTokenizer object to break a String object into tokens.
• To use regular expressions to validate String data entered into an application.
The chief defect of Henry King Was chewing little bits of string.
—Hilaire Belloc
Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences.
—William Strunk, Jr.
I have made this letter longer than usual, because I lack the time to make it short.
—Blaise Pascal
Outline
25.1 Introduction
25.2 Fundamentals of Characters and Strings
25.3 Class String
25.3.1 String Constructors
25.3.2 String Methods length, charAt and getChars
25.3.3 Comparing Strings
25.3.4 Locating Characters and Substrings in Strings
25.3.5 Extracting Substrings from Strings
25.3.6 Concatenating Strings
25.3.7 Miscellaneous String Methods
25.3.8 String Method valueOf
25.4 Class StringBuilder
25.4.1 StringBuilder Constructors
25.4.2 StringBuilder Methods length, capacity, setLength and ensureCapacity
25.4.3 StringBuilder Methods charAt, setCharAt, getChars and reverse
25.4.4 StringBuilder append Methods
25.4.5 StringBuilder Insertion and Deletion Methods
25.5 Class Character
25.6 Class StringTokenizer
25.7 Regular Expressions, Class Pattern and Class Matcher
25.8 Wrap-Up
This chapter introduces Java’s string- and character-processing capabilities. The techniques discussed here are appropriate for validating program input, displaying information to users and other text-based manipulations. They are also appropriate for developing text editors, word processors, page-layout software, computerized typesetting systems and other kinds of text-processing software. We have already presented several string-processing capabilities in earlier chapters. This chapter discusses in detail the capabilities of class String, class StringBuilder and class Character from the java.lang package and class StringTokenizer from the java.util package. These classes provide the foundation for string and character manipulation in Java.
The chapter also discusses regular expressions that provide applications with the capability to validate input. The functionality is located in the String class along with classes Matcher and Pattern located in the java.util.regex package.
Characters are the fundamental building blocks of Java source programs. Every program is composed of a sequence of characters that—when grouped together meaningfully—are interpreted by the computer as a series of instructions used to accomplish a task. A program may contain character literals. A character literal is an integer value represented as a character in single quotes. For example, 'z'
represents the integer value of z
, and '
'
represents the integer value of newline. The value of a character literal is the integer value of the character in the Unicode character set. Appendix B presents the integer equivalents of the characters in the ASCII character set, which is a subset of Unicode. For detailed information on Unicode, visit www.unicode.org.
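As a minimal sketch of the point that a character literal is an integer value, the following example converts characters to their numeric codes and back. The class name CharLiteralSketch and the helper codeOf are hypothetical names chosen for illustration.

```java
class CharLiteralSketch {
    // Returns the numeric (Unicode) code of a char.
    static int codeOf(char c) {
        return c; // a char widens implicitly to an int
    }

    public static void main(String[] args) {
        System.out.println(codeOf('z'));  // 122
        System.out.println(codeOf('\n')); // 10
        System.out.println((char) 122);   // z
    }
}
```

The cast in the last statement goes the other way, from an integer code back to the character it represents.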
Recall from Section 2.2 that a string is a sequence of characters treated as a single unit. A string may include letters, digits and various special characters, such as +, -, *, / and $. A string is an object of class String. String literals (stored in memory as String objects) are written as a sequence of characters in double quotation marks.
A string may be assigned to a String
reference. The declaration
String color = "blue";
initializes String
variable color
to refer to a String
object that contains the string "blue"
.
Performance Tip 25.1
Java treats all string literals with the same contents as a single String
object that has many references to it. This conserves memory.
String
Class String
is used to represent strings in Java. The next several subsections cover many of class String
’s capabilities.
String Constructors
Class String
provides constructors for initializing String
objects in a variety of ways. Four of the constructors are demonstrated in the main
method of Fig. 25.1.
Fig. 25.1. String
class constructors.
Line 12 instantiates a new String
object using class String
’s no-argument constructor and assigns its reference to s1
. The new String
object contains no characters (the empty string) and has a length of 0.
Line 13 instantiates a new String
object using class String
’s constructor that takes a String
as an argument and assigns its reference to s2
. The new String
contains the same sequence of characters as the one that is passed as an argument to the constructor.
Software Engineering Observation 25.1
It is not necessary to copy an existing String
object. String
objects are immutable—their character contents cannot be changed after they are created, because class String
does not provide any methods that allow the contents of a String
object to be modified.
Line 14 instantiates a new String
object and assigns its reference to s3
using class String
’s constructor that takes a char
array as an argument. The new String
object contains a copy of the characters in the array.
Line 15 instantiates a new String
object and assigns its reference to s4
using class String
’s constructor that takes a char
array and two integers as arguments. The second argument specifies the starting position (the offset) from which characters in the array are accessed. Remember that the first character is at position 0
. The third argument specifies the number of characters (the count) to access in the array. The new String
object contains a string formed from the accessed characters. If the offset or the count specified as an argument results in accessing an element outside the bounds of the character array, a StringIndexOutOfBoundsException
is thrown.
Common Programming Error 25.1
Attempting to access a character that is outside the bounds of a string results in a StringIndexOutOfBoundsException
.
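The four constructors described above might be exercised as in the following sketch. It is not the listing from Fig. 25.1; the class name StringConstructorsSketch, the helper build and the sample data are assumptions made for illustration.

```java
class StringConstructorsSketch {
    // Builds one String with each of the four constructors discussed in the text.
    static String[] build() {
        char[] charArray = { 'b', 'i', 'r', 't', 'h', ' ', 'd', 'a', 'y' };
        String s = new String("hello");

        String s1 = new String();                // empty string, length 0
        String s2 = new String(s);               // copy of another String
        String s3 = new String(charArray);       // copy of the whole char array
        String s4 = new String(charArray, 6, 3); // offset 6, count 3

        return new String[] { s1, s2, s3, s4 };
    }

    public static void main(String[] args) {
        String[] results = build();
        System.out.println(results[0].length()); // 0
        System.out.println(results[1]);          // hello
        System.out.println(results[2]);          // birth day
        System.out.println(results[3]);          // day
    }
}
```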
String
Methods length
, charAt
and getChars
String
methods length
, charAt
and getChars
return the length of a string, obtain the character at a specific location in a string and retrieve a set of characters from a string as a char
array, respectively. The application in Fig. 25.2 demonstrates each of these methods.
Fig. 25.2. String
class character-manipulation methods.
Line 15 uses String
method length
to determine the number of characters in string s1
. Like arrays, strings know their own length. However, unlike arrays, you cannot access a String
’s length via a length
field—instead you must call the String
’s length
method.
Lines 20–21 print the characters of the string s1
in reverse order (and separated by spaces). String
method charAt
(line 21) returns the character at a specific position in the string. Method charAt
receives an integer argument that is used as the index and returns the character at that position. As with arrays, the first element of a string is at position 0.
Line 24 uses String
method getChars
to copy the characters of a string into a character array. The first argument is the starting index in the string from which characters are to be copied. The second argument is the index that is one past the last character to be copied from the string. The third argument is the character array into which the characters are to be copied. The last argument is the starting index where the copied characters are placed in the target character array. Next, line 28 prints the char
array contents one character at a time.
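The following sketch illustrates the three methods just described. It is not the listing from Fig. 25.2; the class name StringBasicsSketch and the helpers reversedWithSpaces and firstChars are hypothetical.

```java
class StringBasicsSketch {
    // Walks the string from the last index to index 0 using length and charAt.
    static String reversedWithSpaces(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = s.length() - 1; i >= 0; i--) {
            out.append(s.charAt(i));
            if (i > 0) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    // Copies characters 0 (inclusive) through n (exclusive) into a new array.
    static char[] firstChars(String s, int n) {
        char[] target = new char[n];
        s.getChars(0, n, target, 0); // source begin, source end, target, target begin
        return target;
    }

    public static void main(String[] args) {
        String s1 = "hello there";
        System.out.println(s1.length());                    // 11
        System.out.println(reversedWithSpaces("abc"));      // c b a
        System.out.println(new String(firstChars(s1, 5)));  // hello
    }
}
```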
Chapter 7 discussed sorting and searching arrays. Frequently, the information being sorted or searched consists of strings that must be compared to place them into the proper order or to determine whether a string appears in an array (or other collection). Class String
provides several methods for comparing strings.
To understand what it means for one string to be greater than or less than another, consider the process of alphabetizing a series of last names. You would, no doubt, place “Jones” before “Smith” because the first letter of “Jones” comes before the first letter of “Smith” in the alphabet. But the alphabet is more than just a list of 26 letters—it is an ordered set of characters. Each letter occurs in a specific position within the set. Z is more than just a letter of the alphabet—it is specifically the twenty-sixth letter of the alphabet.
How does the computer know that one letter comes before another? All characters are represented in the computer as numeric codes (see Appendix B). When the computer compares strings, it actually compares the numeric codes of the characters in the strings.
Figure 25.3 demonstrates String
methods equals
, equalsIgnoreCase
, compareTo
and regionMatches
and using the equality operator ==
to compare String
objects.
Fig. 25.3. String
comparisons.
The condition at line 17 uses method equals
to compare string s1
and the string literal "hello"
for equality. Method equals
(a method of class Object
overridden in String
) tests two objects for equality; for Strings, it determines whether the character sequences contained in the two objects are identical. The method returns true
if the contents of the objects are equal, and false
otherwise. The preceding condition is true
because string s1
was initialized with the string literal "hello"
. Method equals
uses a lexicographical comparison: it compares the integer Unicode values (see www.unicode.org for more information) that represent each character in each string. Thus, if the string "hello"
is compared with the string "HELLO"
, the result is false
, because the integer representation of a lowercase letter is different from that of the corresponding uppercase letter.
The condition at line 23 uses the equality operator ==
to compare string s1
for equality with the string literal "hello"
. Operator ==
has different functionality when it is used to compare references than when it is used to compare values of primitive types. When primitive-type values are compared with ==
, the result is true
if both values are identical. When references are compared with ==
, the result is true
if both references refer to the same object in memory. To compare the actual contents (or state information) of objects for equality, a method must be invoked. In the case of String
s, that method is equals
. The preceding condition evaluates to false
at line 23 because the reference s1
was initialized with the statement
s1 = new String( "hello" );
which creates a new String
object with a copy of string literal "hello"
and assigns the new object to variable s1
. If s1
had been initialized with the statement
s1 = "hello";
which directly assigns the string literal "hello"
to variable s1
, the condition would be true
. Remember that Java treats all string literal objects with the same contents as one String
object to which there can be many references. Thus, lines 8, 17 and 23 all refer to the same String
object "hello"
in memory.
Common Programming Error 25.2
Comparing references with ==
can lead to logic errors, because ==
compares the references to determine whether they refer to the same object, not whether two objects have the same contents. When two identical (but separate) objects are compared with ==
, the result will be false
. When comparing objects to determine whether they have the same contents, use method equals
.
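The difference between comparing references and comparing contents can be seen in a short sketch. The class name StringEqualitySketch and the helper sameObject are hypothetical; the strings mirror those used in the discussion above.

```java
class StringEqualitySketch {
    // True only if both references refer to the same object in memory.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String s1 = new String("hello"); // a new object holding a copy of the literal
        String literal = "hello";
        String anotherLiteral = "hello";

        System.out.println(s1.equals(literal));                  // true: same contents
        System.out.println(sameObject(s1, literal));             // false: separate objects
        System.out.println(sameObject(literal, anotherLiteral)); // true: literals are shared
    }
}
```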
If you are sorting String
s, you may compare them for equality with method equalsIgnoreCase
, which ignores whether the letters in each string are uppercase or lowercase when performing the comparison. Thus, the string "hello"
and the string "HELLO"
compare as equal. Line 29 uses String
method equalsIgnoreCase
to compare string s3
—Happy Birthday
—for equality with string s4
—happy birthday
. The result of this comparison is true
because the comparison ignores case.
Lines 35–44 use method compareTo
to compare strings. Method compareTo
is declared in the Comparable
interface and implemented in the String
class. Line 36 compares string s1
to string s2
. Method compareTo
returns 0 if the strings are equal, a negative number if the string that invokes compareTo
is less than the string that is passed as an argument and a positive number if the string that invokes compareTo
is greater than the string that is passed as an argument. Method compareTo
uses a lexicographical comparison—it compares the numeric values of corresponding characters in each string. (For more information on the exact value returned by the compareTo
method, see java.sun.com/javase/6/docs/api/java/lang/String.html.)
The condition at line 47 uses String
method regionMatches
to compare portions of two strings for equality. The first argument is the starting index in the string that invokes the method. The second argument is a comparison string. The third argument is the starting index in the comparison string. The last argument is the number of characters to compare between the two strings. The method returns true
only if the specified number of characters are lexicographically equal.
Finally, the condition at line 54 uses a five-argument version of String
method regionMatches
to compare portions of two strings for equality. When the first argument is true
, the method ignores the case of the characters being compared. The remaining arguments are identical to those described for the four-argument regionMatches
method.
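The comparison methods discussed above can be tried in a small sketch such as the following. It is not the listing from Fig. 25.3; the class name StringCompareSketch and the helper order are hypothetical, and the helper reduces the result of compareTo to -1, 0 or 1 because only the sign is meaningful for ordering.

```java
class StringCompareSketch {
    // Reduces compareTo's result to its sign: -1, 0 or 1.
    static int order(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    public static void main(String[] args) {
        String s3 = "Happy Birthday";
        String s4 = "happy birthday";

        System.out.println(order("hello", "goodbye")); // 1
        System.out.println(order("goodbye", "hello")); // -1
        System.out.println(order("hello", "hello"));   // 0

        System.out.println(s3.equals(s4));           // false
        System.out.println(s3.equalsIgnoreCase(s4)); // true

        // Compare the first five characters, first case-sensitively,
        // then with the ignore-case flag set to true.
        System.out.println(s3.regionMatches(0, s4, 0, 5));       // false
        System.out.println(s3.regionMatches(true, 0, s4, 0, 5)); // true
    }
}
```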
Figure 25.4 demonstrates String
methods startsWith
and endsWith
. Method main
creates array strings
containing the strings "started"
, "starting"
, "ended"
and "ending"
. The rest of main
consists of three for
statements that test the elements of the array to determine whether they start with or end with a particular set of characters.
Fig. 25.4. String
class startsWith
and endsWith
methods.
Lines 11–15 use the version of method startsWith
that takes a String
argument. The condition in the if
statement (line 13) determines whether each String
in the array starts with the characters "st"
. If so, the method returns true
and the application prints that String
. Otherwise, the method returns false
and nothing happens.
Lines 20–25 use the startsWith
method that takes a String
and an integer as arguments. The integer specifies the index at which the comparison should begin in the string. The condition in the if
statement (line 22) determines whether each String
in the array has the characters "art"
beginning with the third character in each string. If so, the method returns true
and the application prints the String
.
The third for
statement (lines 30–34) uses method endsWith
, which takes a String
argument. The condition at line 32 determines whether each String
in the array ends with the characters "ed"
. If so, the method returns true
and the application prints the String
.
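A compact sketch of the three tests described above follows. It is not the listing from Fig. 25.4; the class name StartsEndsSketch and the helpers withPrefixAt and withSuffix are hypothetical. The array contents are the ones named in the text.

```java
import java.util.ArrayList;
import java.util.List;

class StartsEndsSketch {
    // Collects the elements that contain the given prefix starting at the given index.
    static List<String> withPrefixAt(String[] words, String prefix, int index) {
        List<String> matches = new ArrayList<>();
        for (String word : words) {
            if (word.startsWith(prefix, index)) {
                matches.add(word);
            }
        }
        return matches;
    }

    // Collects the elements that end with the given suffix.
    static List<String> withSuffix(String[] words, String suffix) {
        List<String> matches = new ArrayList<>();
        for (String word : words) {
            if (word.endsWith(suffix)) {
                matches.add(word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] strings = { "started", "starting", "ended", "ending" };
        System.out.println(withPrefixAt(strings, "st", 0));  // [started, starting]
        System.out.println(withPrefixAt(strings, "art", 2)); // [started, starting]
        System.out.println(withSuffix(strings, "ed"));       // [started, ended]
    }
}
```

Index 2 is the third character, so the second call checks for "art" beginning at the third character of each string.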
Often it is useful to search for a character or set of characters in a string. For example, if you are creating your own word processor, you might want to provide a capability for searching through documents. Figure 25.5 demonstrates the many versions of String
methods indexOf
and lastIndexOf
that search for a specified character or substring in a string. All the searches in this example are performed on the string letters
(initialized with "abcdefghijklmabcdefghijklm"
) in method main
. Lines 11–16 use method indexOf
to locate the first occurrence of a character in a string. If method indexOf
finds the character, it returns the character’s index in the string—otherwise, indexOf
returns –1
. There are two versions of indexOf
that search for characters in a string. The expression in line 12 uses the version of method indexOf
that takes an integer representation of the character to find. The expression at line 14 uses another version of method indexOf
, which takes two integer arguments—the character and the starting index at which the search of the string should begin.
Fig. 25.5. String
class searching methods.
The statements at lines 19–24 use method lastIndexOf
to locate the last occurrence of a character in a string. Method lastIndexOf
performs the search from the end of the string toward the beginning. If method lastIndexOf
finds the character, it returns the index of the character in the string—otherwise, lastIndexOf
returns –1. There are two versions of lastIndexOf
that search for characters in a string. The expression at line 20 uses the version that takes the integer representation of the character. The expression at line 22 uses the version that takes two integer arguments—the integer representation of the character and the index from which to begin searching backward.
Lines 27–40 demonstrate versions of methods indexOf
and lastIndexOf
that each take a String
as the first argument. These versions perform identically to those described earlier except that they search for sequences of characters (or substrings) that are specified by their String
arguments. If the substring is found, these methods return the index in the string of the first character in the substring.
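The searching methods can be illustrated on the same 26-character string used in the text. This sketch is not the listing from Fig. 25.5; the class name IndexOfSketch and the particular characters searched for are assumptions chosen for illustration.

```java
class IndexOfSketch {
    static final String LETTERS = "abcdefghijklmabcdefghijklm";

    public static void main(String[] args) {
        // Searching for a single character
        System.out.println(LETTERS.indexOf('c'));         // 2
        System.out.println(LETTERS.indexOf('a', 1));      // 13
        System.out.println(LETTERS.indexOf('$'));         // -1
        System.out.println(LETTERS.lastIndexOf('c'));     // 15
        System.out.println(LETTERS.lastIndexOf('a', 25)); // 13

        // Searching for a substring
        System.out.println(LETTERS.indexOf("def"));       // 3
        System.out.println(LETTERS.indexOf("def", 7));    // 16
        System.out.println(LETTERS.lastIndexOf("def"));   // 16
        System.out.println(LETTERS.indexOf("hello"));     // -1
    }
}
```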
Class String
provides two substring
methods to enable a new String
object to be created by copying part of an existing String
object. Each method returns a new String
object. Both methods are demonstrated in Fig. 25.6.
Fig. 25.6. String
class substring
methods.
The expression letters.substring( 20 )
at line 12 uses the substring
method that takes one integer argument. The argument specifies the starting index in the original string letters
from which characters are to be copied. The substring returned contains a copy of the characters from the starting index to the end of the string. Specifying an index outside the bounds of the string causes a StringIndexOutOfBoundsException
.
The expression letters.substring( 3, 6 )
at line 15 uses the substring
method that takes two integer arguments. The first argument specifies the starting index from which characters are copied in the original string. The second argument specifies the index one beyond the last character to be copied (i.e., copy up to, but not including, that index in the string). The substring returned contains a copy of the specified characters from the original string. Specifying an index outside the bounds of the string causes a StringIndexOutOfBoundsException
.
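The two expressions discussed above can be checked with a few lines of code. This sketch assumes that letters holds "abcdefghijklmabcdefghijklm", the string used in the preceding searching example; the class name SubstringSketch is hypothetical.

```java
class SubstringSketch {
    public static void main(String[] args) {
        String letters = "abcdefghijklmabcdefghijklm";

        // From index 20 to the end of the string
        System.out.println(letters.substring(20));   // hijklm

        // From index 3 up to, but not including, index 6
        System.out.println(letters.substring(3, 6)); // def

        // An index past the end of the string is rejected
        try {
            letters.substring(30);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("index out of bounds");
        }
    }
}
```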
String
method concat
(Fig. 25.7) concatenates two String
objects and returns a new String
object containing the characters from both original strings. The expression s1.concat( s2 )
at line 13 forms a string by appending the characters in string s2
to the characters in string s1
. The original String
s to which s1
and s2
refer are not modified.
Fig. 25.7. String
method concat
.
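A minimal sketch of concat follows. It is not the listing from Fig. 25.7; the class name ConcatMethodSketch and the sample strings are assumptions.

```java
class ConcatMethodSketch {
    public static void main(String[] args) {
        String s1 = "Happy ";
        String s2 = "Birthday";

        String joined = s1.concat(s2);
        System.out.println(joined); // Happy Birthday

        // The original strings are unchanged
        System.out.println(s1);     // Happy (followed by a space)
        System.out.println(s2);     // Birthday
    }
}
```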
String Methods
Class String
provides several methods that return modified copies of strings or that return character arrays. These methods are demonstrated in the application in Fig. 25.8.
Fig. 25.8. String
methods replace
, toLowerCase
, toUpperCase
, trim
and toCharArray
.
Line 16 uses String
method replace
to return a new String
object in which every occurrence in string s1
of character 'l'
(lowercase el) is replaced with character 'L'
. Method replace
leaves the original string unchanged. If there are no occurrences of the first argument in the string, method replace
returns the original string.
Line 19 uses String
method toUpperCase
to generate a new String
with uppercase letters where corresponding lowercase letters exist in s1
. The method returns a new String
object containing the converted string and leaves the original string unchanged. If there are no characters to convert, method toUpperCase
returns the original string.
Line 20 uses String
method toLowerCase
to return a new String
object with lowercase letters where corresponding uppercase letters exist in s2
. The original string remains unchanged. If there are no characters in the original string to convert, toLowerCase
returns the original string.
Line 23 uses String
method trim
to generate a new String object from which all white-space characters at the beginning and end of the string on which trim operates have been removed. The original string remains unchanged.
Line 26 uses String
method toCharArray
to create a new character array containing a copy of the characters in string s1
. Lines 29–30 output each char
in the array.
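The following sketch exercises the methods described in this section. It is not the listing from Fig. 25.8; the class name MiscStringSketch and the sample strings are assumptions. Every call returns a new value and leaves the receiver untouched.

```java
class MiscStringSketch {
    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = "GOODBYE";
        String s3 = "   spaces   ";

        System.out.println(s1.replace('l', 'L')); // heLLo
        System.out.println(s1.toUpperCase());     // HELLO
        System.out.println(s2.toLowerCase());     // goodbye
        System.out.println(s3.trim());            // spaces

        // toCharArray returns a copy; changing it does not affect s1
        char[] charArray = s1.toCharArray();
        charArray[0] = 'J';
        System.out.println(new String(charArray)); // Jello
        System.out.println(s1);                    // hello
    }
}
```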
String
Method valueOf
As we’ve seen, every object in Java has a toString
method that enables a program to obtain the object’s string representation. Unfortunately, this technique cannot be used with primitive types because they do not have methods. Class String
provides static
methods that take an argument of any type and convert the argument to a String
object. Figure 25.9 demonstrates the String
class valueOf
methods.
Fig. 25.9. String
class valueOf
methods.
The expression String.valueOf(charArray)
at line 18 uses the character array charArray
to create a new String
object. The expression String.valueOf(charArray, 3, 3)
at line 20 uses a portion of the character array charArray
to create a new String
object. The second argument specifies the starting index from which the characters are used. The third argument specifies the number of characters to be used.
There are seven other versions of method valueOf
, which take arguments of type boolean
, char
, int
, long
, float
, double
and Object
, respectively. These are demonstrated in lines 21–25. Note that the version of valueOf
that takes an Object
as an argument can do so because all Object
s can be converted to String
s with method toString
.
[Note: Lines 12–13 use literal values 10000000000L
and 2.5f
as the initial values of long
variable longValue
and float
variable floatValue
, respectively. By default, Java treats integer literals as type int
and floating-point literals as type double
. Appending the letter L
to the literal 10000000000
and appending letter f
to the literal 2.5
indicates to the compiler that 10000000000
should be treated as a long
and that 2.5
should be treated as a float
. An uppercase L or lowercase l can be used to denote a literal of type long, and an uppercase F or lowercase f can be used to denote a literal of type float.]
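The valueOf overloads can be sketched as follows. This is not the listing from Fig. 25.9; the class name ValueOfSketch and the sample values (other than the 10000000000L and 2.5f literals mentioned in the note) are assumptions.

```java
class ValueOfSketch {
    public static void main(String[] args) {
        char[] charArray = { 'a', 'b', 'c', 'd', 'e', 'f' };
        long longValue = 10000000000L; // L suffix: a long literal
        float floatValue = 2.5f;       // f suffix: a float literal
        Object objectRef = "hello";    // any object works; toString is used

        System.out.println(String.valueOf(charArray));       // abcdef
        System.out.println(String.valueOf(charArray, 3, 3)); // def
        System.out.println(String.valueOf(true));            // true
        System.out.println(String.valueOf('Z'));             // Z
        System.out.println(String.valueOf(7));               // 7
        System.out.println(String.valueOf(longValue));       // 10000000000
        System.out.println(String.valueOf(floatValue));      // 2.5
        System.out.println(String.valueOf(33.333));          // 33.333
        System.out.println(String.valueOf(objectRef));       // hello
    }
}
```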
StringBuilder
Once a String
object is created, its contents can never change. We now discuss the features of class StringBuilder
for creating and manipulating dynamic string information—that is, modifiable strings. Every StringBuilder
is capable of storing a number of characters specified by its capacity. If the capacity of a StringBuilder
is exceeded, the capacity is automatically expanded to accommodate the additional characters. Class StringBuilder
is also used to implement operators +
and +=
for String
concatenation.
Performance Tip 25.2
Java can perform certain optimizations involving String
objects (such as sharing one String
object among multiple references) because it knows these objects will not change. String
s (not StringBuilder
s) should be used if the data will not change.
Performance Tip 25.3
In programs that frequently perform string concatenation, or other string modifications, it is often more efficient to implement the modifications with class StringBuilder
.
Software Engineering Observation 25.2
StringBuilder
s are not thread safe. If multiple threads require access to the same dynamic string information, use class StringBuffer
in your code. Classes StringBuilder and StringBuffer provide identical capabilities, but class StringBuffer is thread safe.
StringBuilder Constructors
Class StringBuilder
provides four constructors. We demonstrate three of these in Fig. 25.10. Line 8 uses the no-argument StringBuilder
constructor to create a StringBuilder
with no characters in it and an initial capacity of 16 characters (the default for a StringBuilder
). Line 9 uses the StringBuilder
constructor that takes an integer argument to create a StringBuilder
with no characters in it and the initial capacity specified by the integer argument (i.e., 10
). Line 10 uses the StringBuilder
constructor that takes a String
argument (in this case, a string literal) to create a StringBuilder
containing the characters in the String
argument. The initial capacity is the number of characters in the String
argument plus 16.
Fig. 25.10. StringBuilder
class constructors.
Lines 12–14 use the method toString
of class StringBuilder
to output the StringBuilder
s with the printf
method. In Section 25.4.4, we discuss how Java uses StringBuilder
objects to implement the +
and +=
operators for string concatenation.
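The initial capacities produced by the three constructors can be checked with a short sketch. It is not the listing from Fig. 25.10; the class name BuilderConstructorsSketch, the helper capacities and the sample string "hello" are assumptions.

```java
class BuilderConstructorsSketch {
    // Returns the initial capacity produced by each of the three constructors.
    static int[] capacities() {
        StringBuilder buffer1 = new StringBuilder();        // default capacity
        StringBuilder buffer2 = new StringBuilder(10);      // explicit capacity
        StringBuilder buffer3 = new StringBuilder("hello"); // length of argument + 16
        return new int[] { buffer1.capacity(), buffer2.capacity(), buffer3.capacity() };
    }

    public static void main(String[] args) {
        int[] c = capacities();
        System.out.println(c[0]); // 16
        System.out.println(c[1]); // 10
        System.out.println(c[2]); // 21
    }
}
```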
StringBuilder
Methods length
, capacity
, setLength
and ensureCapacity
StringBuilder
methods length
and capacity
return the number of characters currently in a StringBuilder
and the number of characters that can be stored in a StringBuilder
without allocating more memory, respectively. Method ensureCapacity
guarantees that a StringBuilder
has at least the specified capacity. Method setLength
increases or decreases the length of a StringBuilder
. Figure 25.11 demonstrates these methods.
Fig. 25.11. StringBuilder
methods length
and capacity
.
The application contains one StringBuilder
called buffer
. Line 8 uses the StringBuilder
constructor that takes a String
argument to initialize the StringBuilder
with "Hello, how are you?"
. Lines 10–11 print the contents, length and capacity of the StringBuilder
. Note in the output window that the capacity of the StringBuilder
is initially 35. Recall that the StringBuilder
constructor that takes a String
argument initializes the capacity to the length of the string passed as an argument plus 16.
Line 13 uses method ensureCapacity
to expand the capacity of the StringBuilder
to a minimum of 75 characters. Actually, if the original capacity is less than the argument, the method ensures a capacity that is the greater of the number specified as an argument and twice the original capacity plus 2. The StringBuilder
’s current capacity remains unchanged if it is more than the specified capacity.
Performance Tip 25.4
Dynamically increasing the capacity of a StringBuilder
can take a relatively long time. Executing a large number of these operations can degrade the performance of an application. If a StringBuilder
is going to increase greatly in size, possibly multiple times, setting its capacity high at the beginning will increase performance.
Line 16 uses method setLength
to set the StringBuilder
’s length to 10. If the specified length is less than the StringBuilder
’s current number of characters, the buffer is truncated to the specified length (i.e., the remaining characters in the StringBuilder
are discarded). If the specified length is greater than the StringBuilder
’s current number of characters, null
characters (characters with the numeric representation 0) are appended until the total number of characters in the StringBuilder
is equal to the specified length.
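The sequence of calls described above might look like the following sketch. It is not the listing from Fig. 25.11; the class name BuilderCapacitySketch and the helpers initialCapacity and truncated are hypothetical. After ensureCapacity, the sketch checks only that the capacity is at least the requested value.

```java
class BuilderCapacitySketch {
    // Capacity of a StringBuilder freshly created from the given text.
    static int initialCapacity(String text) {
        return new StringBuilder(text).capacity();
    }

    // Contents after setLength shortens (or pads) the buffer.
    static String truncated(String text, int newLength) {
        StringBuilder buffer = new StringBuilder(text);
        buffer.setLength(newLength);
        return buffer.toString();
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder("Hello, how are you?");
        System.out.println(buffer.length());   // 19
        System.out.println(buffer.capacity()); // 35, that is, 19 + 16

        buffer.ensureCapacity(75);
        System.out.println(buffer.capacity() >= 75); // true

        buffer.setLength(10);
        System.out.println(buffer); // Hello, how
    }
}
```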
StringBuilder
Methods charAt
, setCharAt
, getChars
and reverse
StringBuilder
methods charAt
, setCharAt
, getChars
and reverse
manipulate the characters in a StringBuilder
. Each of these methods is demonstrated in Fig. 25.12.
Fig. 25.12. StringBuilder
class character-manipulation methods.
Method charAt
(line 12) takes an integer argument and returns the character in the StringBuilder
at that index. Method getChars
(line 15) copies characters from a StringBuilder
into the character array passed as an argument. This method takes four arguments—the starting index from which characters should be copied in the StringBuilder
, the index one past the last character to be copied from the StringBuilder
, the character array into which the characters are to be copied and the starting location in the character array where the first character should be placed. Method setCharAt
(lines 21 and 22) takes an integer and a character argument and sets the character at the specified position in the StringBuilder
to the character argument. Method reverse
(line 25) reverses the contents of the StringBuilder
.
Common Programming Error 25.3
Attempting to access a character that is outside the bounds of a StringBuilder
(i.e., with an index less than 0 or greater than or equal to the StringBuilder
’s length) results in a StringIndexOutOfBoundsException
.
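A brief sketch of these four methods follows. It is not the listing from Fig. 25.12; the class name BuilderCharsSketch, the helper capitalizeAndReverse and the sample text are assumptions.

```java
class BuilderCharsSketch {
    // Capitalizes two characters in place with setCharAt, then reverses the buffer.
    static String capitalizeAndReverse(String text, int first, int second) {
        StringBuilder buffer = new StringBuilder(text);
        buffer.setCharAt(first, Character.toUpperCase(buffer.charAt(first)));
        buffer.setCharAt(second, Character.toUpperCase(buffer.charAt(second)));
        buffer.reverse();
        return buffer.toString();
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder("hello there");
        System.out.println(buffer.charAt(0)); // h
        System.out.println(buffer.charAt(4)); // o

        // Copy the whole buffer into a char array
        char[] charArray = new char[buffer.length()];
        buffer.getChars(0, buffer.length(), charArray, 0);
        System.out.println(new String(charArray)); // hello there

        System.out.println(capitalizeAndReverse("hello there", 0, 6)); // erehT olleH
    }
}
```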
StringBuilder append Methods
Class StringBuilder
provides overloaded append
methods (demonstrated in Fig. 25.13) to allow values of various types to be appended to the end of a StringBuilder
. Versions are provided for each of the primitive types and for character arrays, String
s, Object
s, StringBuilder
s and CharSequence
s. (Remember that method toString
produces a string representation of any Object
.) Each of the methods takes its argument, converts it to a string and appends it to the StringBuilder
.
Fig. 25.13. StringBuilder
class append
methods.
Actually, the compiler uses StringBuilder
and the append
methods to implement the +
and +=
operators for String
concatenation. For example, assuming the declarations
String string1 = "hello";
String string2 = "BC";
int value = 22;
the statement
String s = string1 + string2 + value;
concatenates "hello"
, "BC"
and 22
. The concatenation is performed as follows:
new StringBuilder().append( "hello" ).append( "BC" ).append(
22 ).toString();
First, Java creates an empty StringBuilder
, then appends to it the string "hello"
, the string "BC"
and the integer 22
. Next, StringBuilder
’s method toString
converts the StringBuilder
to a String
object to be assigned to String s
. The statement
s += "!";
is performed as follows:
s = new StringBuilder().append( s ).append( "!" ).toString();
First, Java creates an empty StringBuilder
, then it appends to the StringBuilder
the current contents of s
followed by "!"
. Next, StringBuilder
’s method toString
converts the StringBuilder
to a string representation, and the result is assigned to s
.
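The translation described above can be written out by hand and compared with the operator form. This is a sketch only: the class name ConcatSketch and the helper concatWithBuilder are hypothetical, and how a given compiler actually implements + and += is an implementation detail that may differ between compiler versions.

```java
class ConcatSketch {
    // Hand-written equivalent of: string1 + string2 + value
    static String concatWithBuilder(String string1, String string2, int value) {
        return new StringBuilder().append(string1).append(string2).append(value).toString();
    }

    public static void main(String[] args) {
        String string1 = "hello";
        String string2 = "BC";
        int value = 22;

        String s = concatWithBuilder(string1, string2, value);
        System.out.println(s);                                    // helloBC22
        System.out.println(s.equals(string1 + string2 + value)); // true

        // Hand-written equivalent of: s += "!";
        s = new StringBuilder().append(s).append("!").toString();
        System.out.println(s);                                    // helloBC22!
    }
}
```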
StringBuilder Insertion and Deletion Methods
Class StringBuilder
provides overloaded insert
methods to allow values of various types to be inserted at any position in a StringBuilder
. Versions are provided for each of the primitive types and for character arrays, String
s, Object
s and CharSequence
s. Each method takes its second argument, converts it to a string and inserts it immediately preceding the index specified by the first argument. The first argument must be greater than or equal to 0
and less than or equal to the length of the StringBuilder
—otherwise, a StringIndexOutOfBoundsException
occurs. Class StringBuilder
also provides methods delete
and deleteCharAt
for deleting characters at any position in a StringBuilder
. Method delete
takes two arguments—the starting index and the index one past the end of the characters to delete. All characters beginning at the starting index up to but not including the ending index are deleted. Method deleteCharAt
takes one argument—the index of the character to delete. Invalid indices cause both methods to throw a StringIndexOutOfBoundsException
. Methods insert
, delete
and deleteCharAt
are demonstrated in Fig. 25.14.
Fig. 25.14. StringBuilder
methods insert
and delete.
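A small sketch of insert, delete and deleteCharAt follows. It is not the listing from Fig. 25.14; the class name InsertDeleteSketch and the sample text are assumptions.

```java
class InsertDeleteSketch {
    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder("hello world");

        buffer.insert(5, ',');      // insert a char before index 5
        System.out.println(buffer); // hello, world

        buffer.insert(0, "Oh, ");   // insert a String at the front
        System.out.println(buffer); // Oh, hello, world

        buffer.delete(0, 4);        // remove indices 0 through 3
        System.out.println(buffer); // hello, world

        buffer.deleteCharAt(5);     // remove the comma
        System.out.println(buffer); // hello world
    }
}
```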
Character
Java provides eight type-wrapper classes—Boolean
, Character
, Double
, Float
, Byte
, Short
, Integer
and Long
—that enable primitive-type values to be treated as objects. In this section, we present class Character
—the type-wrapper class for primitive type char
.
Most Character
methods are static
methods designed for convenience in processing individual char
values. These methods take at least a character argument and perform either a test or a manipulation of the character. Class Character
also contains a constructor that receives a char
argument to initialize a Character
object. Most of the methods of class Character
are presented in the next three examples. For more information on class Character
(and all the type-wrapper classes), see the java.lang
package in the Java API documentation.
Figure 25.15 demonstrates some static
methods that test characters to determine whether they are a specific character type and the static
methods that perform case conversions on characters. You can enter any character and apply the methods to the character.
Fig. 25.15. Character
class static
methods for testing characters and converting character case.
Line 15 uses Character
method isDefined
to determine whether character c
is defined in the Unicode character set. If so, the method returns true
, and otherwise, it returns false
. Line 16 uses Character
method isDigit
to determine whether character c
is a defined Unicode digit. If so, the method returns true
, and otherwise, it returns false
.
Line 18 uses Character
method isJavaIdentifierStart
to determine whether c
is a character that can be the first character of an identifier in Java—that is, a letter, an underscore (_
) or a dollar sign ($
). If so, the method returns true
, and otherwise, it returns false
. Line 20 uses Character
method isJavaIdentifierPart
to determine whether character c
is a character that can be used in an identifier in Java—that is, a digit, a letter, an underscore (_
) or a dollar sign ($
). If so, the method returns true
, and otherwise, false
.
Line 21 uses Character
method isLetter
to determine whether character c
is a letter. If so, the method returns true
, and otherwise, false
. Line 23 uses Character
method isLetterOrDigit
to determine whether character c
is a letter or a digit. If so, the method returns true
, and otherwise, false
.
Line 25 uses Character
method isLowerCase
to determine whether character c
is a lowercase letter. If so, the method returns true
, and otherwise, false
. Line 27 uses Character
method isUpperCase
to determine whether character c
is an uppercase letter. If so, the method returns true
, and otherwise, false
.
Line 29 uses Character
method toUpperCase
to convert the character c
to its uppercase equivalent. The method returns the converted character if the character has an uppercase equivalent, and otherwise, the method returns its original argument. Line 31 uses Character
method toLowerCase
to convert the character c
to its lowercase equivalent. The method returns the converted character if the character has a lowercase equivalent, and otherwise, the method returns its original argument.
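As a rough illustration of the methods walked through above, the sketch below applies each test and conversion to a single character and collects the results in one string. It is not the Fig. 25.15 listing; the class name CharacterTestsSketch and the method describe are hypothetical.

```java
class CharacterTestsSketch {
    // Runs the static Character tests and case conversions on one character.
    static String describe(char c) {
        return "isDefined=" + Character.isDefined(c)
            + " isDigit=" + Character.isDigit(c)
            + " isJavaIdentifierStart=" + Character.isJavaIdentifierStart(c)
            + " isJavaIdentifierPart=" + Character.isJavaIdentifierPart(c)
            + " isLetter=" + Character.isLetter(c)
            + " isLetterOrDigit=" + Character.isLetterOrDigit(c)
            + " isLowerCase=" + Character.isLowerCase(c)
            + " isUpperCase=" + Character.isUpperCase(c)
            + " toUpperCase=" + Character.toUpperCase(c)
            + " toLowerCase=" + Character.toLowerCase(c);
    }

    public static void main(String[] args) {
        System.out.println(describe('a'));
        System.out.println(describe('7'));
        System.out.println(describe('$'));
    }
}
```

Note that a digit such as '7' can appear inside an identifier but cannot start one, and that toUpperCase returns '7' unchanged because a digit has no uppercase equivalent.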
Figure 25.16 demonstrates static Character
methods digit
and forDigit
, which convert characters to digits and digits to characters, respectively, in different number systems. Common number systems include decimal (base 10), octal (base 8), hexadecimal (base 16) and binary (base 2). The base of a number is also known as its radix.
Fig. 25.16. Character
class static
conversion methods.
Line 28 uses method forDigit
to convert the integer digit
into a character in the number system specified by the integer radix
(the base of the number). For example, the decimal integer 13
in base 16 (the radix
) has the character value 'd'
. Lowercase and uppercase letters represent the same value in number systems. Line 35 uses method digit
to convert the character c
into an integer in the number system specified by the integer radix
(the base of the number). For example, the character 'A'
is the base 16 (the radix
) representation of the base 10 value 10. The radix must be between 2 and 36, inclusive.
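The two conversions can be tried with a small sketch like the one below. It is not the Fig. 25.16 listing; the class name RadixSketch and its wrapper methods are hypothetical. The code also shows what the standard library returns for a value that is not a valid digit in the given radix, which the text above does not cover.

```java
class RadixSketch {
    // Converts a numeric digit value to its character in the given radix.
    static char toChar(int digit, int radix) {
        return Character.forDigit(digit, radix);
    }

    // Converts a character to its numeric value in the given radix.
    static int toValue(char c, int radix) {
        return Character.digit(c, radix);
    }

    public static void main(String[] args) {
        System.out.println(toChar(13, 16));        // d
        System.out.println(toValue('A', 16));      // 10
        System.out.println(toValue('a', 16));      // 10, case does not matter
        System.out.println(toValue('9', 8));       // -1, since 9 is not an octal digit
        System.out.println((int) toChar(16, 16));  // 0, forDigit returns the null character
    }
}
```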
Figure 25.17 demonstrates the constructor and several non-static
methods of class Character
—charValue
, toString
and equals
. Lines 8–9 instantiate two Character
objects by autoboxing the character constants 'A'
and 'a'
, respectively. Line 12 uses Character
method charValue
to return the char
value stored in Character
object c1
. Line 12 also returns a string representation of Character
object c2
using method toString
. The condition in the if
...else
statement at lines 14–17 uses method equals
to determine whether the object c1
has the same contents as the object c2
(i.e., the characters inside each object are equal).
Fig. 25.17. Character
class non-static
methods.
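A short sketch of the same three instance methods follows. It is not the Fig. 25.17 listing; the class name CharacterObjectSketch and the method compare are hypothetical, and the Character objects are created by autoboxing.

```java
class CharacterObjectSketch {
    // Boxes two chars and reports their values and whether the objects are equal.
    static String compare(char first, char second) {
        Character c1 = first;   // autoboxing creates the Character objects
        Character c2 = second;

        String summary = "c1 = " + c1.charValue() + ", c2 = " + c2.toString();

        if (c1.equals(c2)) {
            return summary + ", equal";
        } else {
            return summary + ", not equal";
        }
    }

    public static void main(String[] args) {
        System.out.println(compare('A', 'a')); // c1 = A, c2 = a, not equal
        System.out.println(compare('x', 'x')); // c1 = x, c2 = x, equal
    }
}
```

Method equals compares the wrapped char values, so 'A' and 'a' are reported as different even though they are the same letter.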
25.6 Class StringTokenizer
When you read a sentence, your mind breaks the sentence into tokens—individual words and punctuation marks, each of which conveys meaning to you. Compilers also perform tokenization. They break up statements into individual pieces like keywords, identifiers, operators and other programming-language elements. We now study Java’s StringTokenizer
class (from package java.util
), which breaks a string into its component tokens. Tokens are separated from one another by delimiters, typically white-space characters such as space, tab, newline and carriage return. Other characters can also be used as delimiters to separate tokens. The application in Fig. 25.18 demonstrates class StringTokenizer
.
Fig. 25.18. StringTokenizer
object used to tokenize strings.
When the user presses the Enter key, the input sentence is stored in variable sentence
. Line 17 creates a StringTokenizer
for sentence
. This StringTokenizer
constructor takes a string argument and creates a StringTokenizer
for it, and will use the default delimiter string " \t\n\r\f"
consisting of a space, a tab, a newline, a carriage return and a form feed for tokenization. There are two other constructors for class StringTokenizer
. In the version that takes two String
arguments, the second String
is the delimiter string. In the version that takes three arguments, the second String
is the delimiter string and the third argument (a boolean
) determines whether the delimiters are also returned as tokens (only if the argument is true
). This is useful if you need to know what the delimiters are.
Line 19 uses StringTokenizer
method countTokens
to determine the number of tokens in the string to be tokenized. The condition at line 21 uses StringTokenizer
method hasMoreTokens
to determine whether there are more tokens in the string being tokenized. If so, line 22 prints the next token in the String
. The next token is obtained with a call to StringTokenizer
method nextToken
, which returns a String
. The token is output using println
, so subsequent tokens appear on separate lines.
If you would like to change the delimiter string while tokenizing a string, you may do so by specifying a new delimiter string in a nextToken
call as follows:
tokens.nextToken( newDelimiterString );
This feature is not demonstrated in Fig. 25.18.
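The sketch below exercises the three constructors, countTokens, hasMoreTokens and both forms of nextToken. It is not the Fig. 25.18 listing; the class name TokenizerSketch, its methods and the sample strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerSketch {
    // Drains a tokenizer into a list using hasMoreTokens and nextToken.
    static List<String> tokenize(StringTokenizer tokens) {
        List<String> result = new ArrayList<>();
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result;
    }

    // Reads one token with the default delimiters, then switches to " ,".
    static List<String> switchDelimiters(String text) {
        StringTokenizer tokens = new StringTokenizer(text);
        List<String> result = new ArrayList<>();
        result.add(tokens.nextToken());      // default white-space delimiters
        result.add(tokens.nextToken(" ,"));  // the new set stays in effect afterward
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new StringTokenizer("one two three").countTokens()); // 3
        System.out.println(tokenize(new StringTokenizer("This is  a\ttest")));
        System.out.println(tokenize(new StringTokenizer("a,b", ",")));
        System.out.println(tokenize(new StringTokenizer("a,b", ",", true))); // delimiter kept
        System.out.println(switchDelimiters("alpha beta,gamma,delta"));
    }
}
```

The new delimiter set in switchDelimiters deliberately still contains the space. A character that was a delimiter under the old set but not under the new one is no longer skipped, so it would become part of the next token.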
25.7 Regular Expressions, Class Pattern and Class Matcher
Regular expressions are sequences of characters and symbols that define a set of strings. They are useful for validating input and ensuring that data is in a particular format. For example, a ZIP code must consist of five digits, and a last name must contain only letters, spaces, apostrophes and hyphens. One application of regular expressions is to facilitate the construction of a compiler. Often, a large and complex regular expression is used to validate the syntax of a program. If the program code does not match the regular expression, the compiler knows that there is a syntax error within the code.
Class String
provides several methods for performing regular-expression operations, the simplest of which is the matching operation. String
method matches
receives a string that specifies the regular expression and matches the contents of the String
object on which it is called to the regular expression. The method returns a boolean
indicating whether the match succeeded.
A regular expression consists of literal characters and special symbols. Figure 25.19 specifies some predefined character classes that can be used with regular expressions. A character class is an escape sequence that represents a group of characters. A digit is any numeric character. A word character is any letter (uppercase or lowercase), any digit or the underscore character. A whitespace character is a space, a tab, a carriage return, a newline or a form feed. Each character class matches a single character in the string we are attempting to match with the regular expression.
Fig. 25.19. Predefined character classes.
Regular expressions are not limited to these predefined character classes. The expressions employ various operators and other forms of notation to match complex patterns. We examine several of these techniques in the application in Figs. 25.20 and 25.21, which validates user input via regular expressions. [Note: This application is not designed to match all possible valid user input.]
Fig. 25.20. Validating user information using regular expressions.
Fig. 25.21. Inputs and validates data from user using the ValidateInput
class.
Figure 25.20 validates user input. Line 9 validates the first name. To match a set of characters that does not have a predefined character class, use square brackets, []
. For example, the pattern "[aeiou]"
matches a single character that is a vowel. Character ranges are represented by placing a dash (-
) between two characters. In the example, "[A-Z]"
matches a single uppercase letter. If the first character in the brackets is "^"
, the expression accepts any character other than those indicated. However, it is important to note that "[^Z]"
is not the same as "[A-Y]"
, which matches uppercase letters A–Y—"[^Z]"
matches any character other than capital Z
, including lowercase letters and non-letters such as the newline character. Ranges in character classes are determined by the letters’ integer values. In this example, "[A-Za-z]"
matches all uppercase and lowercase letters. The range "[A-z]"
matches all letters and also matches those characters (such as [ and \) with an integer value between uppercase Z and lowercase a (for more information on integer values of characters see Appendix B, ASCII Character Set). Like predefined character classes, character classes delimited by square brackets match a single character in the search object.
In line 9, the asterisk after the second character class indicates that any number of letters can be matched. In general, when the regular-expression operator "*"
appears in a regular expression, the application attempts to match zero or more occurrences of the subexpression immediately preceding the "*"
. Operator "+"
attempts to match one or more occurrences of the subexpression immediately preceding "+"
. So both "A*"
and "A+"
will match "AAA"
, but only "A*"
will match an empty string.
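The behavior of bracketed character classes and of the * and + operators can be checked with a sketch like the following. The class name CharacterClassSketch and the helper fullMatch are hypothetical, and the first-name pattern shown is an assumption about what such a check might look like, since the Fig. 25.20 listing is not reproduced here.

```java
class CharacterClassSketch {
    // Thin wrapper around String method matches, which tests the whole string.
    static boolean fullMatch(String input, String regex) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Assumed first-name pattern: one uppercase letter, then any number of letters.
        System.out.println(fullMatch("Jane", "[A-Z][a-zA-Z]*")); // true
        System.out.println(fullMatch("jane", "[A-Z][a-zA-Z]*")); // false

        // * allows zero occurrences, + requires at least one.
        System.out.println(fullMatch("AAA", "A*")); // true
        System.out.println(fullMatch("AAA", "A+")); // true
        System.out.println(fullMatch("", "A*"));    // true
        System.out.println(fullMatch("", "A+"));    // false

        // [A-z] also covers the punctuation between Z and a; [A-Za-z] does not.
        System.out.println(fullMatch("[", "[A-z]"));    // true
        System.out.println(fullMatch("[", "[A-Za-z]")); // false

        // A negated class matches any other character, including a newline.
        System.out.println(fullMatch("\n", "[^Z]"));  // true
        System.out.println(fullMatch("\n", "[A-Y]")); // false
    }
}
```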
If method validateFirstName
returns true
(line 29), the application attempts to validate the last name (line 31) by calling validateLastName
(lines 13–16 of Fig. 25.20). The regular expression to validate the last name matches any number of letters split by spaces, apostrophes or hyphens.
Line 33 validates the address by calling method validateAddress
(lines 19–23 of Fig. 25.20). The first character class matches any digit one or more times (\\d+
). Two \ characters are used because \ normally starts an escape sequence in a Java string literal. So
\\d
in a Java string represents the regular-expression pattern \d
. Then we match one or more white-space characters (\\s+
). The character "|"
allows a match of the expression to its left or to its right. For example, "Hi (John|Jane)"
matches both "Hi John"
and "Hi Jane"
. The parentheses are used to group parts of the regular expression. In this example, the left side of |
matches a single word, and the right side matches two words separated by any amount of white space. So the address must contain a number followed by one or two words. Therefore, "10 Broadway"
and "10 Main Street"
are both valid addresses in this example. The city (lines 26–29 of Fig. 25.20) and state (lines 32–35 of Fig. 25.20) methods also match any word of at least one character or, alternatively, any two words of at least one character if the words are separated by a single space. This means both Waltham
and West Newton
would match.
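Grouping and alternation of this kind might be written as in the sketch below. The pattern in isPlausibleAddress is an assumption reconstructed from the prose description (a number, white space, then one or two words); it is not copied from Fig. 25.20, and the class and method names are hypothetical.

```java
class AlternationSketch {
    // Parentheses group the alternatives; | selects the left or the right one.
    static boolean isGreeting(String text) {
        return text.matches("Hi (John|Jane)");
    }

    // Assumed pattern: digits, white space, then one word or two words.
    static boolean isPlausibleAddress(String address) {
        return address.matches("\\d+\\s+([a-zA-Z]+|[a-zA-Z]+\\s+[a-zA-Z]+)");
    }

    public static void main(String[] args) {
        System.out.println(isGreeting("Hi John"));                        // true
        System.out.println(isGreeting("Hi Jim"));                         // false
        System.out.println(isPlausibleAddress("10 Broadway"));            // true
        System.out.println(isPlausibleAddress("10 Main Street"));         // true
        System.out.println(isPlausibleAddress("Main Street"));            // false, no number
        System.out.println(isPlausibleAddress("10 Main Street North"));   // false, three words
    }
}
```

Without the parentheses, the | operator would split the entire expression in two, so the leading number would be required only for the one-word alternative.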
The asterisk (*
) and plus (+
) are formally called quantifiers. Figure 25.22 lists all the quantifiers. We have already discussed how the asterisk (*
) and plus (+
) quantifiers work. All quantifiers affect only the subexpression immediately preceding the quantifier. Quantifier question mark (?
) matches zero or one occurrence of the expression that it quantifies. A set of braces containing one number ({n}
) matches exactly n occurrences of the expression it quantifies. We demonstrate this quantifier to validate the zip code in Fig. 25.20 at line 40. Including a comma after the number enclosed in braces ({n,}
) matches at least n occurrences of the quantified expression. A set of braces containing two numbers ({n,m}
) matches between n and m occurrences, inclusive, of the expression that it quantifies. Quantifiers may be applied to patterns enclosed in parentheses to create more complex regular expressions.
Fig. 25.22. Quantifiers used in regular expressions.
All of the quantifiers are greedy. This means that they will match as many occurrences as they can as long as the match is still successful. However, if any of these quantifiers is followed by a question mark (?
), the quantifier becomes reluctant (sometimes called lazy). It then will match as few occurrences as possible as long as the match is still successful.
The zip code (line 40 in Fig. 25.20) matches a digit five times. This regular expression uses the digit character class and a quantifier with the digit 5 between braces. The phone number (line 46 in Fig. 25.20) matches three digits (the first one cannot be zero) followed by a dash followed by three more digits (again the first one cannot be zero) followed by four more digits.
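A brief sketch of the brace quantifiers and of the difference between greedy and reluctant matching follows. The class and method names are hypothetical, and String method replaceFirst is used here only as a convenient way to make visible how much text a quantifier consumed.

```java
class QuantifierSketch {
    // Exactly five digits, as described for the zip code.
    static boolean isZip(String text) {
        return text.matches("\\d{5}");
    }

    // Greedy: .+ takes as much as it can while still allowing a match.
    static String replaceGreedy(String text) {
        return text.replaceFirst("<.+>", "#");
    }

    // Reluctant: .+? takes as little as it can while still allowing a match.
    static String replaceReluctant(String text) {
        return text.replaceFirst("<.+?>", "#");
    }

    public static void main(String[] args) {
        System.out.println(isZip("12345"));              // true
        System.out.println(isZip("1234"));               // false
        System.out.println("colour".matches("colou?r")); // true
        System.out.println("color".matches("colou?r"));  // true
        System.out.println("123".matches("\\d{2,}"));    // true
        System.out.println("12345".matches("\\d{2,4}")); // false
        System.out.println(replaceGreedy("<a><b>"));     // #
        System.out.println(replaceReluctant("<a><b>"));  // #<b>
    }
}
```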
String
method matches
checks whether an entire string conforms to a regular expression. For example, we want to accept "Smith"
as a last name, but not "9@Smith#"
. If only a substring matches the regular expression, method matches
returns false
.
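To make the whole-string requirement concrete, the sketch below contrasts String method matches with a substring search. The letters-only pattern is a simplified assumption rather than the last-name expression of Fig. 25.20, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class WholeMatchSketch {
    // Simplified, assumed pattern: one or more letters.
    static final String LETTERS = "[a-zA-Z]+";

    // True only if the entire string consists of letters.
    static boolean wholeMatch(String text) {
        return text.matches(LETTERS);
    }

    // True if any substring consists of letters.
    static boolean containsMatch(String text) {
        return Pattern.compile(LETTERS).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("Smith"));       // true
        System.out.println(wholeMatch("9@Smith#"));    // false
        System.out.println(containsMatch("9@Smith#")); // true
    }
}
```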
Sometimes it is useful to replace parts of a string or to split a string into pieces. For this purpose, class String
provides methods replaceAll
, replaceFirst
and split
. These methods are demonstrated in Fig. 25.23.
Fig. 25.23. Methods replaceFirst
, replaceAll
and split
.
Method replaceAll
replaces text in a string with new text (the second argument) wherever the original string matches a regular expression (the first argument). Line 14 replaces every instance of "*"
in firstString
with "^"
. Note that the regular expression ("\\*"
) precedes character *
with two backslashes. Normally, *
is a quantifier indicating that a regular expression should match any number of occurrences of a preceding pattern. However, in line 14, we want to find all occurrences of the literal character *
. To do this, we must escape character *
with character \. Escaping a special regular-expression character with \
instructs the matching engine to find the actual character. Since the expression is stored in a Java string and \
is a special character in Java strings, we must include an additional \
. So the Java string
"\\*"
represents the regular-expression pattern \*
which matches a single *
character in the search string. In line 19, every match for the regular expression "stars"
in firstString
is replaced with "carets"
.
Method replaceFirst
(line 32) replaces the first occurrence of a pattern match. Java String
s are immutable, therefore method replaceFirst
returns a new string in which the appropriate characters have been replaced. This line takes the original string and replaces it with the string returned by replaceFirst
. By iterating three times we replace the first three instances of a digit (\d
) in secondString
with the text "digit"
.
Method split
divides a string into several substrings. The original string is broken in any location that matches a specified regular expression. Method split
returns an array of strings containing the substrings between matches for the regular expression. In line 38, we use method split
to tokenize a string of comma-separated integers. The argument is the regular expression that locates the delimiter. In this case, we use the regular expression ",\\s*"
to separate the substrings wherever a comma occurs. By matching any whitespace characters, we eliminate extra spaces from the resulting substrings. Note that the commas and white-space characters are not returned as part of the substrings. Again, note that the Java string ",\\s*"
represents the regular expression ,\s*
.
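The three methods can be tried with a sketch such as the one below. It is not the Fig. 25.23 listing; the class name, method names and sample strings are hypothetical.

```java
import java.util.Arrays;

class ReplaceSplitSketch {
    // Replaces every literal * with ^; the regex \* is written "\\*" in Java source.
    static String starsToCarets(String text) {
        return text.replaceAll("\\*", "^");
    }

    // Replaces the first `count` digits, one replaceFirst call per digit.
    static String replaceFirstDigits(String text, int count) {
        String result = text;
        for (int i = 0; i < count; i++) {
            result = result.replaceFirst("\\d", "digit"); // Strings are immutable, so reassign
        }
        return result;
    }

    // Splits on a comma followed by any amount of white space.
    static String[] splitOnCommas(String text) {
        return text.split(",\\s*");
    }

    public static void main(String[] args) {
        System.out.println(starsToCarets("5 stars *****"));                // 5 stars ^^^^^
        System.out.println(replaceFirstDigits("1, 2, 3, 4", 3));           // digit, digit, digit, 4
        System.out.println(Arrays.toString(splitOnCommas("1, 2,3,   4"))); // [1, 2, 3, 4]
    }
}
```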
Classes Pattern and Matcher
In addition to the regular-expression capabilities of class String
, Java provides other classes in package java.util.regex
that help developers manipulate regular expressions. Class Pattern
represents a regular expression. Class Matcher
contains both a regular-expression pattern and a CharSequence
in which to search for the pattern.
CharSequence
is an interface that allows read access to a sequence of characters. The interface requires that the methods charAt
, length
, subSequence
and toString
be declared. Both String
and StringBuilder
implement interface CharSequence
, so an instance of either of these classes can be used with class Matcher
.
Common Programming Error 25.4
A regular expression can be tested against an object of any class that implements interface CharSequence
, but the regular expression must be a String
. Attempting to create a regular expression as a StringBuilder
is an error.
If a regular expression will be used only once, static Pattern
method matches
can be used. This method takes a string that specifies the regular expression and a CharSequence
on which to perform the match. This method returns a boolean
indicating whether the search object (the second argument) matches the regular expression.
If a regular expression will be used more than once, it is more efficient to use static Pattern
method compile
to create a specific Pattern
object for that regular expression. This method receives a string representing the pattern and returns a new Pattern
object, which can then be used to call method matcher
. This method receives a CharSequence
to search and returns a Matcher
object.
Matcher
provides method matches
, which performs the same task as Pattern
method matches
, but receives no arguments—the search pattern and search object are encapsulated in the Matcher
object. Class Matcher
provides other methods, including find
, lookingAt
, replaceFirst
and replaceAll
.
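The one-off and reusable approaches might look like the following sketch. The class name PatternReuseSketch, its methods and the five-digit pattern are hypothetical choices for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternReuseSketch {
    // Compiled once and reused for every input checked by countValid.
    private static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");

    // One-off check: static Pattern method matches compiles the expression on each call.
    static boolean oneOff(CharSequence input) {
        return Pattern.matches("\\d{5}", input);
    }

    // Repeated checks: one Pattern, one Matcher per input.
    static int countValid(CharSequence... inputs) {
        int count = 0;
        for (CharSequence input : inputs) {
            Matcher matcher = FIVE_DIGITS.matcher(input);
            if (matcher.matches()) { // no arguments: pattern and input are in the Matcher
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(oneOff("02139")); // true
        // A StringBuilder is accepted because it implements CharSequence.
        System.out.println(countValid("02139", new StringBuilder("90210"), "abcde")); // 2
    }
}
```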
Figure 25.24 presents a simple example that employs regular expressions. This program matches birthdays against a regular expression. The expression matches only birthdays that do not occur in April and that belong to people whose names begin with "J"
.
Fig. 25.24. Regular expressions checking birthdays.
Lines 11–12 create a Pattern
by invoking static Pattern
method compile
. The dot character "."
in the regular expression (line 12) matches any single character except a newline character.
Line 20 creates the Matcher
object for the compiled regular expression and the matching sequence (string1
). Lines 22–23 use a while
loop to iterate through the string. Line 22 uses Matcher
method find
to attempt to match a piece of the search object to the search pattern. Each call to this method starts at the point where the last call ended, so multiple matches can be found. Matcher
method lookingAt
is similar, except that it always starts from the beginning of the search object and succeeds only if a match begins there; the match need not extend to the end of the search object.
Method matches
(from class String
, Pattern
or Matcher
) will return true
only if the entire search object matches the regular expression. Methods find
and lookingAt
(from class Matcher
) will return true
if a portion of the search object matches the regular expression; for lookingAt, that portion must begin at the start of the search object.
Line 23 uses Matcher
method group
, which returns the string from the search object that matches the search pattern. The string that is returned is the one that was last matched by a call to find
or lookingAt
. The output in Fig. 25.24 shows the two matches that were found in string1
.
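The sketch below shows find, group, lookingAt and matches side by side. It is not the Fig. 25.24 listing: the pattern is an assumption built from the description above (a name beginning with J, followed by a date whose two-digit month does not end in 4), and the class name, method names and sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BirthdayMatchSketch {
    // Assumed pattern: "J", any characters on the same line, then a non-April date.
    static final Pattern NOT_APRIL_J = Pattern.compile("J.*\\d[0-35-9]-\\d\\d-\\d\\d");

    static final String SAMPLE =
        "Jane's Birthday is 05-12-75\n" +
        "Dave's Birthday is 11-04-68\n" +
        "John's Birthday is 04-28-73\n" +
        "Joe's Birthday is 12-17-77";

    // Collects every match; each find call resumes where the previous one ended.
    static List<String> findAll(CharSequence text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = NOT_APRIL_J.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group()); // the text of the most recent match
        }
        return matches;
    }

    // True only if a match begins at the first character of the text.
    static boolean matchAtStart(CharSequence text) {
        return NOT_APRIL_J.matcher(text).lookingAt();
    }

    // True only if the entire text is a single match.
    static boolean matchWhole(CharSequence text) {
        return NOT_APRIL_J.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String match : findAll(SAMPLE)) {
            System.out.println(match);
        }
        System.out.println(matchAtStart(SAMPLE)); // true, the text begins with Jane's line
        System.out.println(matchWhole(SAMPLE));   // false, the dot does not cross newlines
    }
}
```

With this sample text, findAll returns the lines for Jane and Joe. Dave's line contains no J, and John's date begins with 04, which the character class [0-35-9] rejects.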
25.8 Wrap-Up
In this chapter, you learned about more String
methods for selecting portions of String
s and manipulating String
s. You also learned about the Character
class and some of the methods it declares to handle char
s. The chapter also discussed the capabilities of the StringBuilder
class for creating String
s. The end of the chapter discussed regular expressions, which provide a powerful capability to search and match portions of String
s that fit a particular pattern.