What's the use of a good quotation if you can't change it?
--Dr. Who, The Two Doctors
Strings are standard objects with built-in language support. You have already seen many examples of using string literals to create string objects. You've also seen the +
and +=
operators that concatenate strings to create new strings. The String
class, however, has much more functionality to offer. String
objects are immutable (read-only), so you also have a StringBuilder
class for mutable strings. This chapter describes String
and StringBuilder
and some related classes, including utilities for regular expression matching.
As described in “Character Set” on page 161, the Java programming language represents text consisting of Unicode characters as sequences of char
values using the UTF-16 encoding format. The String
class defines objects that represent such character sequences. More generally, the java.lang.CharSequence
interface is implemented by any class that represents such a character sequence—this includes the String
, StringBuilder
, and StringBuffer
classes described in this chapter, together with the java.nio.CharBuffer
class that is used for performing I/O.
The CharSequence
interface is simple, defining only four methods:
public char
charAt(int index)
Returns the char
in this sequence at the given index
. Sequences are indexed from zero to length()-1
(just as arrays are indexed). As this is a UTF-16 sequence of characters, the returned value may be an actual character or a value that is part of a surrogate pair. If the index is negative or not less than the length of the sequence, then an IndexOutOfBoundsException
is thrown.
public int
length()
Returns the length of this character sequence.
public CharSequence
subSequence(int start, int end)
Returns a new CharSequence
that contains the char
values in this sequence consisting of charAt(start)
through to charAt(end-1)
. If start is greater than end, or if use of either value would try to index outside this sequence, then an IndexOutOfBoundsException is thrown. Be careful to ensure that the specified range doesn't split any surrogate pairs.
public String
toString()
Overrides the contract of Object.toString
to specify that it returns the character sequence represented by this CharSequence
.
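Because all four methods are declared on the interface, a method written against CharSequence works unchanged for a String or a StringBuilder. The following is a minimal sketch; the class CharSequenceDemo and its helper methods are hypothetical names chosen for illustration.

```java
class CharSequenceDemo {
    // Returns the first half of any character sequence as a String.
    static String firstHalf(CharSequence seq) {
        return seq.subSequence(0, seq.length() / 2).toString();
    }

    // Returns the last char of a non-empty sequence.
    static char lastChar(CharSequence seq) {
        return seq.charAt(seq.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(firstHalf("quotation"));                  // a String
        System.out.println(firstHalf(new StringBuilder("change")));  // a StringBuilder
        System.out.println(lastChar("quotation"));
    }
}
```

Neither helper checks for surrogate pairs, so the caution about splitting them applies to firstHalf as well.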
Strings are immutable (read-only) character sequences: Their contents can never be changed after the string is constructed. The String
class provides numerous methods for working with strings—searching, comparing, interacting with other character sequences—an overview of which is given in the following sections.
You can create strings implicitly either by using a string literal (such as "Gröçe"
) or by using +
or +=
on two String
objects to create a new one.
You can also construct String
objects explicitly using new
. The String
class supports the following simple constructors (other constructors are shown in later sections):
public
String()
Constructs a new String
with the value ""
—an empty string.
public
String(String value)
Constructs a new String
that is a copy of the specified String
object value
—this is a copy constructor. Because String
objects are immutable, this is rarely used.
public
String(StringBuilder value)
Constructs a new String
with the same contents as the given StringBuilder
.
public
String(StringBuffer value)
Constructs a new String
with the same contents as the given StringBuffer
.
The most basic methods of String
objects are length
and charAt
, as defined by the CharSequence
interface. This loop counts the number of each kind of character in a string:
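A minimal sketch of such a loop, assuming a hypothetical CharCounts class and an int array indexed by char value (neither name comes from the text):

```java
class CharCounts {
    // Counts how often each char value occurs in str.
    // The array is indexed by char value, so it has one slot per possible char.
    static int[] countChars(String str) {
        int[] counts = new int[Character.MAX_VALUE + 1];
        for (int i = 0; i < str.length(); i++)
            counts[str.charAt(i)]++;
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countChars("banana");
        System.out.println("a: " + counts['a'] + ", n: " + counts['n']);
    }
}
```

The loop counts char values, so a character outside the basic plane is counted as its two surrogate halves.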
Note that length
is a method for String
, whereas for an array it is a field; beginners commonly confuse the two.
In most String
methods, a string index position less than zero or greater than length()-1
throws an IndexOutOfBoundsException
. Some implementations throw the more specific StringIndexOutOfBoundsException
, which can take the illegal index as a constructor argument and then include it in a detailed message. Methods or constructors that copy values to or from an array will also throw IndexOutOfBoundsException
if any attempt is made to access outside the bounds of that array.
There are also simple methods to find the first or last occurrence of a particular character or substring in a string. The following method returns the number of characters between the first and last occurrences of a given character in a string:
    static int countBetween(String str, char ch) {
        int begPos = str.indexOf(ch);
        if (begPos < 0)         // not there
            return -1;
        int endPos = str.lastIndexOf(ch);
        return endPos - begPos - 1;
    }
The countBetween
method finds the first and last positions of the character ch
in the string str
. If the character does not occur twice in the string, the method returns -1
. The difference between the two character positions is one more than the number of characters in between (if the two positions were 2 and 3, the number of characters in between is zero).
Several overloads of the method indexOf
search forward in a string, and several overloads of lastIndexOf
search backward. Each method returns the index of what it found, or –1 if the search was unsuccessful:
Method | Returns Index Of... |
---|---|
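indexOf(int ch) | first position of ch |
indexOf(int ch, int start) | first position of ch at or after start |
indexOf(String str) | first position of str |
indexOf(String str, int start) | first position of str at or after start |
lastIndexOf(int ch) | last position of ch |
lastIndexOf(int ch, int start) | last position of ch at or before start |
lastIndexOf(String str) | last position of str |
lastIndexOf(String str, int start) | last position of str at or before start |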
The indexing methods that take an int
parameter for the character to look for will search for the given char if the value is no greater than 0xFFFF, or else for the code point with the given value (see “Working with UTF-16” on page 336).
If you don't care about the actual index of the substring, you can use the contains
method, which returns true
if the current string contains a given CharSequence
as a subsequence. If you want to find the index of an arbitrary CharSequence
you must invoke toString
on the CharSequence
and pass that to indexOf
instead.
Exercise 13.1: Write a method that counts the number of occurrences of a given character in a string.
Exercise 13.2: Write a method that counts the number of occurrences of a particular string in another string.
The String
class supports several methods to compare strings and parts of strings. Before we describe the methods, though, you should be aware that internationalization and localization issues of full Unicode strings are not addressed with these methods. For example, when you're comparing two strings to determine which is “greater,” characters in strings are compared numerically by their Unicode values, not by their localized notion of order. To a French speaker, c
and ç
are the same letter, differing only by a small diacritical mark. Sorting a set of strings in French should ignore the difference between them, placing "açb"
before "acz"
because b
comes before z.
But the Unicode characters are different: c (\u0063) comes before ç (\u00e7) in the Unicode character set, so these strings will actually sort the other way around. Internationalization and localization are discussed in Chapter 24.
The first compare operation is equals
, which returns true
if it is passed a reference to a String
object having the same contents—that is, the two strings have the same length and exactly the same Unicode characters. If the other object isn't a String
or if the contents are different, String.equals
returns false
. As you learned on page 100, this overrides Object.equals
to define equivalence instead of identity.
To compare strings while ignoring case, use the equalsIgnoreCase
method. By “ignore case,” we mean that Ë
and ë
are considered the same but are different from E
and e
. Characters with no case distinctions, such as punctuation, compare equal only to themselves. Unicode has many interesting case issues, including a notion of “titlecase.” Case issues in String
are handled in terms of the case-related methods of the Character
class, as described in “Character” on page 192.
A String
can be compared with an arbitrary CharSequence
by using the contentEquals
method, which returns true
if both objects represent exactly the same sequence of characters.
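The difference between these comparisons can be sketched as follows. The class CompareDemo and its helper are hypothetical, and the accented letters are written as Unicode escapes so that the source does not depend on file encoding.

```java
class CompareDemo {
    // Compares a String with any character sequence by content.
    static boolean sameText(String s, CharSequence other) {
        return s.contentEquals(other);
    }

    public static void main(String[] args) {
        String upper = "\u00CB";   // E with diaeresis, uppercase
        String lower = "\u00EB";   // e with diaeresis, lowercase

        System.out.println(upper.equalsIgnoreCase(lower)); // same letter, different case
        System.out.println(upper.equalsIgnoreCase("E"));   // different letter

        StringBuilder sb = new StringBuilder("hello");
        System.out.println("hello".equals(sb));            // false: sb is not a String
        System.out.println(sameText("hello", sb));         // true: same characters
    }
}
```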
To sort strings, you need a way to order them, so String
implements the interface Comparable<String>
—the Comparable
interface was described on page 118. The compareTo
method returns an int
that is less than, equal to, or greater than zero when the string on which it is invoked is less than, equal to, or greater than the other string. The ordering used is Unicode character ordering. The String
class also defines a compareToIgnoreCase
method.
The compareTo
method is useful for creating an internal canonical ordering of strings. A binary search, for example, requires a sorted list of elements, but it is unimportant that the sorted order be local language order. Here is a binary search lookup method for a class that has a sorted array of strings:
    private String[] table;

    public int position(String key) {
        int lo = 0;
        int hi = table.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = key.compareTo(table[mid]);
            if (cmp == 0)       // found it!
                return mid;
            else if (cmp < 0)   // search the lower part
                hi = mid - 1;
            else                // search the upper part
                lo = mid + 1;
        }
        return -1;              // not found
    }
This is the basic binary search algorithm. It first checks the midpoint of the search range to determine whether the key is greater than, equal to, or less than the element at that position. If they are the same, the element has been found and the search is over. If the key is less than the element at the position, the lower half of the range is searched; otherwise, the upper half is searched. Eventually, either the element is found or the lower end of the range becomes greater than the higher end, in which case the key is not in the list.
In addition to entire strings, regions of strings can also be compared for equality. The method for this is regionMatches
, and it has two forms:
public boolean
regionMatches(int start, String other, int ostart, int count)
Returns true
if the given region of this String
has the same Unicode characters as the given region of the string other
. Checking starts in this string at the position start
, and in the other
string at position ostart
. Only the first count
characters are compared.
public boolean
regionMatches(boolean ignoreCase, int start, String other, int ostart, int count)
This version of regionMatches
behaves exactly like the previous one, but the boolean ignoreCase
controls whether case is significant.
For example:
    class RegionMatch {
        public static void main(String[] args) {
            String str = "Look, look!";
            boolean b1, b2, b3;

            b1 = str.regionMatches(6, "Look", 0, 4);
            b2 = str.regionMatches(true, 6, "Look", 0, 4);
            b3 = str.regionMatches(true, 6, "Look", 0, 5);

            System.out.println("b1 = " + b1);
            System.out.println("b2 = " + b2);
            System.out.println("b3 = " + b3);
        }
    }
Here is its output:
    b1 = false
    b2 = true
    b3 = false
The first comparison yields false
because the character at position 6 of the main string is 'l'
and the character at position 0 of the other string is 'L'
. The second comparison yields true
because case is not significant. The third comparison yields false
because the comparison length is now 5 and the two strings are not the same over five characters, even ignoring case.
In querying methods, such as regionMatches
and those we mention next, any invalid indexes simply cause false
to be returned rather than throwing exceptions. Passing a null
argument when an object is expected generates a NullPointerException
.
You can do simple tests for the beginnings and ends of strings by using startsWith
and endsWith
:
public boolean
startsWith(String prefix, int start)
Returns true
if this String
starts (at start
) with the given prefix
.
public boolean
startsWith(String prefix)
Equivalent to startsWith(prefix,0)
.
public boolean
endsWith(String suffix)
Returns true
if this String
ends with the given suffix
.
In general, using ==
to compare strings will give you the wrong results. Consider an expression such as str == "¿Peña?".
This does not compare the contents of the two strings. It compares one object reference (str
) to another (the string object representing the literal "¿Peña?"
). Even if str
contains the string "¿Peña?"
this ==
expression will almost always yield false
because the two strings will be held in different objects. Using ==
on objects only tests whether the two references refer to the same object, not whether they are equivalent objects.
However, any two string literals with the same contents will refer to the same String
object. For example, ==
works correctly when str is first assigned the string literal "¿Peña?" and is later tested with the expression str == "¿Peña?".
Because str
is initially set to a string literal, comparing with another string literal is equivalent to comparing the strings for equal contents. But be careful—this works only if you are sure that all string references involved are references to string literals. If str
is changed to refer to a manufactured String
object, such as the result of a user typing some input, the ==
operator will return false
even if the user types ¿Peña?
as the string.
To overcome this problem you can intern the strings that you don't know for certain refer to string literals. The intern
method returns a String
that has the same contents as the one it is invoked on. However, any two strings with the same contents return the same String
object from intern
, which enables you to compare string references to test equality, instead of the slower test of string contents. For example:
    int putIn(String key) {
        String unique = key.intern();
        int i;
        // see if it's in the table already
        for (i = 0; i < tableSize; i++)
            if (table[i] == unique)
                return i;
        // it's not there--add it in
        table[i] = unique;
        tableSize++;
        return i;
    }
All the strings stored in the table
array are the result of an intern
invocation. The table is searched for a string that was the result of an intern
invocation on another string that had the same contents as the key
. If this string is found, the search is finished. If not, we add the unique representative of the key
at the end. Dealing with the results of intern
makes comparing object references equivalent to comparing string contents, but much faster.
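The behavior of == on literals, on manufactured strings, and on interned strings can be seen in a small experiment. This is only a sketch: the class InternDemo and the method manufacture are hypothetical names, and the sample string is written with Unicode escapes.

```java
class InternDemo {
    // Builds a String at run time, so the result is a new object
    // rather than the object that represents a literal.
    static String manufacture(String head, String tail) {
        return new StringBuilder(head).append(tail).toString();
    }

    public static void main(String[] args) {
        String literal = "\u00BFPe\u00F1a?";                 // the sample literal
        String built = manufacture("\u00BFPe", "\u00F1a?");  // same contents, new object

        System.out.println(literal == "\u00BFPe\u00F1a?");   // two equal literals
        System.out.println(built == literal);                // reference comparison
        System.out.println(built.equals(literal));           // content comparison
        System.out.println(built.intern() == literal);       // interned reference
    }
}
```

The program prints true, false, true, true: only the second line compares two distinct objects.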
Any two strings with the same contents are guaranteed to have the same hash code—the String
class overrides Object.hashCode
—although two different strings might also have the same hash code. Hash codes are useful for hashtables, such as the HashMap
class in java.util
—see “HashMap” on page 590.
Several String
methods return new strings that are like the old one but with a specified modification. New strings are returned because String
objects are immutable. You could extract delimited substrings from another string by using a method like this one:
    public static String delimitedString(
        String from, char start, char end)
    {
        int startPos = from.indexOf(start);
        int endPos = from.lastIndexOf(end);
        if (startPos == -1)             // no start found
            return null;
        else if (endPos == -1)          // no end found
            return from.substring(startPos);
        else if (startPos > endPos)     // start after end
            return null;
        else                            // both start and end found
            return from.substring(startPos, endPos + 1);
    }
The method delimitedString
returns a new String
object containing the string inside from
that is delimited by start
and end
—that is, it starts with the character start
and ends with the character end
. If start
is found but not end
, the method returns a new String
object containing everything from the start position to the end of the string. The method delimitedString
works by using the two overloaded forms of substring
. The first form takes only an initial start position and returns a new string containing everything in the original string from that point on. The second form takes both a start and an end position and returns a new string that contains all the characters in the original string from the start to the endpoint, including the character at the start but not the one at the end. This “up to but not including the end” behavior is the reason that the method adds one to endPos
to include the delimiter characters in the returned string. For example, the string returned by
    delimitedString("Il a dit «Bonjour!»", '«', '»')
is «Bonjour!».
Here are the rest of the “related string” methods:
public String
replace(char oldChar, char newChar)
Returns a String
with all instances of oldChar
replaced with the character newChar
.
public String
replace(CharSequence oldSeq, CharSequence newSeq)
Returns a String
with each occurrence of the subsequence oldSeq
replaced by the subsequence newSeq
.
public String
trim()
Returns a String
with leading and trailing whitespace stripped. For this method, whitespace means any character whose value is less than or equal to \u0020 (the space character), which includes space, tab, and newline.
A number of methods return related strings based on a match with a given regular expression—see “Regular Expression Matching” on page 321:
public String
replaceFirst(String regex, String repStr)
Returns a String
with the first substring that matches the regular expression regex
replaced by repStr
. Invoked on str
, this is equivalent to Pattern.compile(regex).matcher(str).replaceFirst(repStr)
.
public String
replaceAll(String regex, String repStr)
Returns a String
with all substrings that match the regular expression regex
replaced by repStr
. Invoked on str
, this is equivalent to Pattern.compile(regex).matcher(str).replaceAll(repStr)
.
public String[]
split(String regex)
Equivalent to split(regex,0)
(see below).
public String[]
split(String regex, int limit)
Returns an array of strings resulting from splitting up this string according to the regular expression. Each match of the regular expression will cause a split in the string, with the matched part of the string removed. The limit
affects the number of times the regular expression will be applied to the string to create the array. Any positive number n limits the number of applications to n–1, with the remainder of the string returned as the last element of the array (so the array will be no larger than n). Any negative limit means that there is no limit to the number of applications and the array can have any length. A limit of zero behaves like a negative limit, but trailing empty strings will be discarded. Invoked on str
, this is equivalent to Pattern.compile(regex).split(str, limit)
.
This is easier to understand with an example. The following table shows the array elements returned from split("--",n)
invoked on the string "w--x--y--"
for n
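equal to –1, 0, 1, 2, 3, and 4:
Limit | Resulting array |
---|---|
–1 | "w", "x", "y", "" |
0 | "w", "x", "y" |
1 | "w--x--y--" |
2 | "w", "x--y--" |
3 | "w", "x", "y--" |
4 | "w", "x", "y", "" |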
With a negative or zero limit we remove all occurrences of "--"
, with the difference between the two being the trailing empty string in the negative case. With a limit of one we don't actually apply the pattern and so the whole string is returned as the zeroth element. A limit of two applies the pattern once, breaking the string into two substrings. A limit of three gives us three substrings. A limit of four gives us four substrings, with the fourth being the empty string due to the original string ending with the pattern we were splitting on. Any limit greater than four will return the same results as a limit of four.
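The results described above can be reproduced with a short program. This is a sketch only; the class SplitDemo and the method splitDashes are hypothetical names.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits the sample string on "--" using the given limit.
    static String[] splitDashes(int limit) {
        return "w--x--y--".split("--", limit);
    }

    public static void main(String[] args) {
        // Print the resulting array for each limit discussed in the text.
        for (int limit : new int[] {-1, 0, 1, 2, 3, 4})
            System.out.println(limit + ": " + Arrays.toString(splitDashes(limit)));
    }
}
```

Arrays.toString shows the trailing empty string as a final empty element, which makes the difference between a limit of –1 and a limit of 0 easy to see.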
In all the above, if the regular expression syntax is incorrect a PatternSyntaxException
is thrown.
These are all convenience methods that avoid the need to work with Pattern
and Matcher
objects directly, but they require that the regular expression be compiled each time. If you just want to know if a given string matches a given regular expression, the matches
method returns a boolean
to tell you.
Case issues are locale sensitive—that is, they vary from place to place and from culture to culture. The platform allows users to specify a locale, which includes language and character case issues. Locales are represented by Locale
objects, which you'll learn about in more detail in Chapter 24. The methods toLowerCase
and toUpperCase
use the current default locale, or you can pass a specific locale as an argument:
public String
toLowerCase()
Returns a String
with each character converted to its lowercase equivalent if it has one according to the default locale.
public String
toUpperCase()
Returns a String
with each character converted to its uppercase equivalent if it has one according to the default locale.
public String
toLowerCase(Locale loc)
Returns a String
with each character converted to its lowercase equivalent if it has one according to the specified locale.
public String
toUpperCase(Locale loc)
Returns a String
with each character converted to its uppercase equivalent if it has one according to the specified locale.
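Turkish is a well-known case where the locale changes the result, because it distinguishes dotted and dotless forms of the letter i. The sketch below assumes a hypothetical LocaleCaseDemo class and uses Locale.forLanguageTag to obtain the locales.

```java
import java.util.Locale;

class LocaleCaseDemo {
    // Uppercases s using the rules of the locale named by a language tag.
    static String upperIn(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        System.out.println(upperIn("title", "en"));  // English rules
        System.out.println(upperIn("title", "tr"));  // Turkish rules: dotted capital I
    }
}
```

Under Turkish rules the lowercase i maps to \u0130 (capital I with a dot above), so the two results differ. This is one reason to pass an explicit locale when the result must not depend on the user's settings.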
The concat
method returns a new string that is equivalent to the string returned when you use +
on two strings. The following two statements are equivalent:
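A minimal sketch of that equivalence, using a hypothetical ConcatDemo class and illustrative values:

```java
class ConcatDemo {
    static String viaConcat(String a, String b) {
        return a.concat(b);
    }

    static String viaPlus(String a, String b) {
        return a + b;
    }

    public static void main(String[] args) {
        String oldStr = "do";
        // Both statements produce a new string with the same contents.
        System.out.println(viaConcat(oldStr, " not"));
        System.out.println(viaPlus(oldStr, " not"));
    }
}
```

The equivalence holds for non-null strings; with a null argument, concat throws a NullPointerException whereas + appends the text "null".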
Exercise 13.3: As shown, the delimitedString
method assumes only one such string per input string. Write a version that will pull out all the delimited strings and return an array.
Exercise 13.4: Write a program to read an input string with lines of the form “type value
”, where type
is one of the wrapper class names (Boolean
, Character
, and so on) and value
is a string that the type's constructor can decode. For each such entry, create an object of that type with that value and add it to an ArrayList
(see “ArrayList” on page 582). Display the final result when all the lines have been read. Assume a line is ended simply by the newline character '\n'.
You often need to convert strings to and from something else, such as integers or booleans. The convention is that the type being converted to has the method that does the conversion. For example, converting from a String
to an int
requires a static method in class Integer
. This table shows all the types that you can convert, and how to convert each to and from a String
:
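Type | To String | From String |
---|---|---|
boolean | String.valueOf(boolean) | Boolean.parseBoolean(String) |
byte | String.valueOf(int) | Byte.parseByte(String, int base) |
char | String.valueOf(char) | str.charAt(0) |
short | String.valueOf(int) | Short.parseShort(String, int base) |
int | String.valueOf(int) | Integer.parseInt(String, int base) |
long | String.valueOf(long) | Long.parseLong(String, int base) |
float | String.valueOf(float) | Float.parseFloat(String) |
double | String.valueOf(double) | Double.parseDouble(String) |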
To convert a primitive type to a String
you invoke one of the static valueOf
methods of String
, which for numeric types produces a base 10 representation.
The Integer
and Long
wrapper classes—as described in Chapter 8—also provide methods toBinaryString
, toOctalString
, and toHexString
for other representations.
To convert, or more accurately to parse, a string into a primitive type you invoke the static parseType method of the primitive type's corresponding wrapper class. Each parsing method has its own rules about the allowed format of the string. For example, Float.parseFloat will accept a floating-point literal of the form "3.14f", whereas Long.parseLong will not accept the string "25L". The integer parsing methods have two overloaded forms: one that takes a numeric base between 2 and 36 (Character.MIN_RADIX and Character.MAX_RADIX) in addition to the string to parse, and one that takes only the string and assumes base 10. These parsing methods do not interpret characters that indicate the base of the number: "0x12FE" is rejected rather than read as a hexadecimal value, and a leading zero does not make a value octal. However, the Integer
and Long
wrapper classes also provide a static decode
method that will parse a string that does include this base information. For the numeric types, if the string does not represent a valid value of that type, a NumberFormatException
is thrown.
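The difference between the parse methods and decode can be sketched as follows. The class ParseDemo and its two helper methods are hypothetical names used only for this illustration.

```java
class ParseDemo {
    // Returns true if Integer.parseInt rejects the text in base 10.
    static boolean rejectedByParseInt(String text) {
        try {
            Integer.parseInt(text);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    // Returns true if Long.parseLong rejects the text in base 10.
    static boolean rejectedByParseLong(String text) {
        try {
            Long.parseLong(text);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.parseInt("12FE", 16));   // base passed separately
        System.out.println(Integer.decode("0x12FE"));       // base taken from the prefix
        System.out.println(Integer.decode("033"));          // leading zero: octal
        System.out.println(Float.parseFloat("3.14f"));      // type suffix accepted
        System.out.println(rejectedByParseInt("0x12FE"));   // prefix not understood
        System.out.println(rejectedByParseLong("25L"));     // suffix not understood
    }
}
```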
To convert a String
to a char
you simply extract the first char
from the String
.
Your classes can support string encoding and decoding by having an appropriate toString
method and a constructor that creates a new object given the string description. The method String.valueOf(Object obj)
is defined to return either "null"
(if obj
is null
) or the result of obj.toString
. The String
class provides enough overloads of valueOf
that you can convert any value of any type to a String
by invoking valueOf
.
A String
maps to an array of char
and vice versa. You often want to build a string in a char
array and then create a String
object from the contents. Assuming that the writable StringBuilder
class (described later) isn't adequate, several String
methods and constructors help you convert a String
to an array of char
, or convert an array of char
to a String
.
There are two constructors for creating a String
from a char
array:
public
String(char[] chars, int start, int count)
Constructs a new String
whose contents are the same as the chars
array, from index start
up to a maximum of count
characters.
public
String(char[] chars)
Equivalent to String(chars,0,
chars.length)
.
Both of these constructors make copies of the array, so you can change the array contents after you have created a String
from it without affecting the contents of the String
.
For example, the following simple algorithm squeezes out all occurrences of a character from a string:
    public static String squeezeOut(String from, char toss) {
        char[] chars = from.toCharArray();
        int len = chars.length;
        int put = 0;
        for (int i = 0; i < len; i++)
            if (chars[i] != toss)
                chars[put++] = chars[i];
        return new String(chars, 0, put);
    }
The method squeezeOut
first converts its input string from
into a character array using the method toCharArray
. It then sets up put
, which will be the next position into which to put a character. After that it loops, copying into the array any character that isn't a toss
character. When the method is finished looping over the array, it returns a new String
object that contains the squeezed string.
You can use the two static String.copyValueOf
methods instead of the constructors if you prefer. For instance, squeezeOut
could have been ended with
    return String.copyValueOf(chars, 0, put);
There is also a single-argument form of copyValueOf
that copies the entire array. For completeness, two static valueOf
methods are also equivalent to the two String
constructors.
The toCharArray
method is simple and sufficient for most needs. When you need more control over copying pieces of a string into a character array, you can use the getChars
method:
public void
getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)
Copies characters from this String
into the specified array. The characters of the specified substring are copied into the character array, starting at dst[dstBegin]
. The specified substring is the part of the string starting at srcBegin
, up to but not including srcEnd
.
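As a sketch of how the four arguments fit together, the following copies part of one string into the middle of an existing character array. The class GetCharsDemo and the method overlay are hypothetical names.

```java
class GetCharsDemo {
    // Copies src[srcBegin..srcEnd) over the characters of background,
    // starting at index dstBegin, and returns the result as a String.
    static String overlay(String src, int srcBegin, int srcEnd,
                          String background, int dstBegin) {
        char[] dst = background.toCharArray();
        src.getChars(srcBegin, srcEnd, dst, dstBegin);
        return new String(dst);
    }

    public static void main(String[] args) {
        // Characters 7 through 11 of the source are "world".
        System.out.println(overlay("Hello, world", 7, 12, "----------", 2));
    }
}
```

The destination array must already be large enough; getChars throws an IndexOutOfBoundsException rather than growing it.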
Strings represent characters encoded as char
values with the UTF-16 encoding format. To convert those char
values into raw byte values requires that another encoding format be used. Similarly, to convert individual “characters” or arrays of raw 8-bit “characters” into char
values requires that the encoding format of the raw bytes is known. For example, you would convert an array of ASCII
or Latin-1 bytes to Unicode characters simply by setting the high bits to zero, but that would not work for other 8-bit character set encodings such as those for Hebrew. Different character sets are discussed shortly. In the following constructors and methods, you can name a character set encoding or use the user's or platform's default encoding:
public
String(byte[] bytes, int start, int count)
Constructs a new String
by converting the bytes, from index start
up to a maximum of count
bytes, into characters using the default encoding for the default locale.
public
String(byte[] bytes)
Equivalent to String(bytes, 0, bytes.length)
.
public
String(byte[] bytes, int start, int count, String enc)
throws UnsupportedEncodingException
Constructs a new String
by converting the bytes, from index start
up to a maximum of count
bytes, into characters using the encoding named by enc
.
public
String(byte[] bytes, String enc)
throws UnsupportedEncodingException
Equivalent to String(bytes,0,
bytes.length,enc)
.
public byte[]
getBytes()
Returns a byte array that encodes the contents of the string using the default encoding for the default locale.
public byte[]
getBytes(String enc)
throws UnsupportedEncodingException
Returns a byte array that encodes the contents of the string using the encoding named by enc
.
The String
constructors for building from byte
arrays make copies of the data, so further modifications to the arrays will not affect the contents of the String
.
A character set encoding specifies how to convert between raw 8-bit “characters” and their 16-bit Unicode equivalents. Character sets are named using their standard and common names. The local platform defines which character set encodings are understood, but every implementation is required to support the following:
US-ASCII | 7-bit ASCII |
ISO-8859-1 | ISO Latin Alphabet No. 1 (ISO-LATIN-1) |
UTF-8 | 8-bit Unicode Transformation Format |
UTF-16BE | 16-bit Unicode Transformation Format, big-endian byte order |
UTF-16LE | 16-bit Unicode Transformation Format, little-endian byte order |
UTF-16 | 16-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark (either order accepted on input, big-endian used on output) |
Consult the release documentation for your implementation to see if any other character set encodings are supported.
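The effect of the encoding on the byte representation can be sketched with the String constructors and getBytes methods described earlier. The class EncodingDemo and its helpers are hypothetical; the sample string is "Gr\u00F6\u00E7e", and the encodings named are ones every implementation supports.

```java
import java.io.UnsupportedEncodingException;

class EncodingDemo {
    // Number of bytes needed to encode s in the named encoding.
    static int encodedLength(String s, String enc) {
        try {
            return s.getBytes(enc).length;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(enc, e);
        }
    }

    // Encodes s to bytes and decodes the bytes again with the same encoding.
    static String roundTrip(String s, String enc) {
        try {
            return new String(s.getBytes(enc), enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(enc, e);
        }
    }

    public static void main(String[] args) {
        String s = "Gr\u00F6\u00E7e";   // five chars, two of them non-ASCII
        System.out.println(encodedLength(s, "ISO-8859-1"));  // one byte per char
        System.out.println(encodedLength(s, "UTF-8"));       // two bytes for each non-ASCII char
        System.out.println(encodedLength(s, "UTF-16BE"));    // two bytes per char
        System.out.println(roundTrip(s, "UTF-8").equals(s));
    }
}
```

Decoding with a different encoding from the one used to encode generally does not give back the original string, which is why the encoding of raw bytes must be known.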
Character sets and their encoding mechanisms are represented by specific classes within the java.nio.charset
package:
Charset
A named mapping (such as US-ASCII
or UTF-8) between sequences of 16-bit Unicode code units and sequences of bytes. This contains general information on the sequence encoding, simple mechanisms for encoding and decoding, and methods to create CharsetEncoder
and CharsetDecoder
objects for richer abilities.
CharsetEncoder
An object that can transform a sequence of 16-bit Unicode code units into a sequence of bytes in a specific character set. The encoder object also has methods to describe the encoding.
CharsetDecoder
An object that can transform a sequence of bytes in a specific character set into a sequence of 16-bit Unicode code units. The decoder object also has methods to describe the decoding.
You can obtain a Charset
via its own static forName
method, though usually you will just specify the character set name to some other method (such as the String
constructor or an I/O operation) rather than working with the Charset
object directly. To test whether a given character set is supported use the forName
method, and if you get an UnsupportedCharsetException
then it is not.
You can find a list of available character sets from the static availableCharsets
method, which returns a SortedMap
of names and Charset
instances, of all known character sets. For example, to print out the names of all the known character sets you can use:
    for (String name : Charset.availableCharsets().keySet())
        System.out.println(name);
Every instance of the Java virtual machine has a default character set that is determined during virtual-machine startup and typically depends on the locale and encoding being used by the underlying operating system. You can obtain the default Charset
using the static defaultCharset
method.
The package java.util.regex
gives you a way to test whether a string matches a general description of a category of strings called a regular expression. A regular expression describes a class of strings by using wildcards that match or exclude groups of characters, markers to require matches in particular places, and so on. The package uses a common kind of regular expression, quite similar to those used in the popular Perl programming language, which itself evolved from those used in several Unix utilities.
You can use regular expressions to ask if strings match a pattern and pick out parts of strings using a rich expression language. First you will learn what regular expressions are. Then you will learn how to compile and use them.
A full description of regular expressions is complex and many other works describe them. So we will not attempt a complete tutorial, but instead will simply give some examples of the most commonly used features. (A full reference alone would take several pages.) A list of resources for understanding regular expressions is in “Further Reading” on page 758.
Regular expressions search in character sequences, as defined by java.lang.CharSequence
, implemented by String
and StringBuilder
. You can implement it yourself if you want to provide new sources.
A regular expression defines a pattern that can be applied to a character sequence to search for matches. The simplest form is something that is matched exactly; the pattern xyz
matches the string xyzzy
but not the string plugh
. Wildcards make the pattern more general. For example, .
(dot) matches any single character, so the pattern .op
matches both hop
and pop
, and *
matches zero or more of the thing before it, so xyz*
matches xy
, xyz
, and xyzzy
.
Other useful wildcards include simple sets (p[aeiou]p
matches pop
and pup
but not pgp
, while [a-z]
matches any single lowercase letter); negations ([^aeiou]
matches any single character that is not a lowercase vowel); predefined sets (\d
matches any digit; \s
any whitespace character); and boundaries (^twisty
matches the word “twisty” only at the beginning of a line; \balike
matches “alike” only after a word boundary, that is, at the beginning of a word).
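The wildcards described so far can be tried out with a short sketch. The helper found and the sample inputs are illustrative assumptions; the helper searches for the pattern anywhere in the input rather than requiring the whole input to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the sample strings are illustrative.
final class WildcardDemo {
    // True if the pattern matches somewhere in the input (a search, not a full match).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(".op", "hop"));               // true: dot matches 'h'
        System.out.println(found("xyz*", "xy"));               // true: zero 'z' characters
        System.out.println(found("p[aeiou]p", "pgp"));         // false: 'g' is not in the set
        System.out.println(found("[^aeiou]", "aeiou"));        // false: only vowels present
        System.out.println(found("\\d", "room 42"));           // true: \d matches a digit
        System.out.println(found("\\balike", "look-alike"));   // true: boundary after '-'
        System.out.println(found("\\balike", "unalike"));      // false: no boundary before 'a'
        System.out.println(found("^twisty", "a twisty maze")); // false: not at the start
    }
}
```

Note that the backslashes are doubled in the Java source because the patterns appear inside string literals.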
Special symbols for particular characters include \t
for tab; \n
for newline; \a
for the alert (bell) character; \e
for escape; and \\
for backslash itself. Any character that would otherwise have a special meaning can be preceded by a \ to remove that meaning; in other words, for such a character c,
\c
always represents the character c
. This is how, for example, you would match a *
in an expression: by using \*
.
Special symbols start with the \ character, which is also the character used to introduce an escape sequence in a string literal. This means, for example, that in the string expression
"\balike"
, the actual pattern will consist of a backspace character followed by the word "alike"
, while "\s"
would not be a pattern for whitespace: before Java 15 it causes a compile-time error because \s
is not a valid escape sequence, and from Java 15 on it denotes a single space character. To use the special symbols within a string expression the leading \ must itself be escaped using
\\
, so the example strings become "\\balike"
and "\\s"
, respectively. To include an actual backslash in a pattern it has to be escaped twice, using four backslash characters: "\\\\"
. Each backslash pair becomes a single backslash within the string, resulting in a single backslash pair being included in the pattern, which is then interpreted as a single backslash character.
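The double escaping can be checked with a small sketch. The class name, helper, and inputs are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing string-literal escapes versus pattern escapes.
final class EscapeDemo {
    // True if the pattern matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\s" is the two-character pattern \s, which matches whitespace.
        System.out.println(found("\\s", "a b"));
        // "\\*" is the pattern \*, which matches a literal asterisk.
        System.out.println(found("\\*", "2*3"));
        // Four backslashes in source form a two-character string, the pattern \\,
        // which matches one backslash in the input.
        System.out.println("\\\\".length());
        System.out.println(found("\\\\", "C:\\temp"));
        // "\balike" begins with a backspace character, so plain "alike" does not match.
        System.out.println(found("\balike", "alike"));
        // "\\balike" is the pattern \balike, a word boundary followed by "alike".
        System.out.println(found("\\balike", "alike"));
    }
}
```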
Regular expressions can also capture parts of the string for later use, either inside the regular expression itself or as a means of picking out parts of the string. You capture parts of the expression inside parentheses. For example, the regular expression (.)-(.*)-\2-\1
matches x-yup-yup-x
or ñ-å-å-ñ
or any other similar string because \1
matches whatever was captured by the group (.)
and \2
matches whatever was captured by the group (.*)
.[1] Groups are numbered from one, in order of the appearance of their opening parenthesis.
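A minimal sketch of the back-reference pattern, assuming a hypothetical helper isMirrored that requires the whole input to match:

```java
import java.util.regex.Pattern;

// Hypothetical demo class for back-references.
final class BackrefDemo {
    // \2 and \1 must repeat whatever groups 2 and 1 captured.
    static final Pattern MIRRORED = Pattern.compile("(.)-(.*)-\\2-\\1");

    // True if the entire input has the form c-text-text-c.
    static boolean isMirrored(String input) {
        return MIRRORED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMirrored("x-yup-yup-x")); // true
        System.out.println(isMirrored("x-yup-yap-x")); // false: group 2 is not repeated
        System.out.println(isMirrored("x-yup-yup-z")); // false: group 1 is not repeated
    }
}
```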
Evaluating a regular expression can be compute intensive, and in many instances a single regular expression will be used repeatedly. This can be addressed by compiling the regular expression once and using the result. In addition, a single character sequence might be checked repeatedly against the same pattern to find multiple matches, which can be done fastest by remembering some information about previous matches. To address both these opportunities for optimization, the full model of using a regular expression is this:
First you turn your regular expression string into a Pattern
object that is the compiled version of the pattern.
Next you ask the Pattern
object for a Matcher
object that applies that pattern to a particular CharSequence
(such as a String
or StringBuilder
).
Finally you ask the Matcher
to perform operations on the sequence using the compiled pattern.
Or, expressed in code:
Pattern pat = Pattern.compile(regularExpression);
Matcher matcher = pat.matcher(sequence);
boolean foundMatch = matcher.find();
If you are only using a pattern once, or are only matching each string against that pattern once, you need not actually deal with the intermediate objects. As you will see, there are convenience methods on Pattern
for matching without a Matcher
, and methods that create their own Pattern
and Matcher
. These are easy to use, but inefficient if you are using the same pattern multiple times, or matching against the same string with the same pattern repeatedly.
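To make the trade-off concrete, here is a sketch that compiles a pattern once and reuses it for many inputs, next to a one-shot convenience call. The class name, the pattern, and the inputs are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compile once, match many times.
final class CompileOnceDemo {
    // Compiled a single time and shared by every call to countWords.
    private static final Pattern WORD = Pattern.compile("[a-z]+");

    // Counts the runs of lowercase letters in any CharSequence.
    static int countWords(CharSequence input) {
        Matcher m = WORD.matcher(input);
        int count = 0;
        while (m.find())
            count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("sun and moon"));               // 3
        System.out.println(countWords(new StringBuilder("a1b2c3")));  // 3
        // One-shot form: compiles a fresh Pattern on every invocation.
        System.out.println(Pattern.matches("[a-z]+", "moon"));        // true
    }
}
```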
The Pattern
class has the following methods:
public static Pattern
compile(String regex)
throws PatternSyntaxException
Compiles the given regular expression into a pattern.
public static Pattern
compile(String regex, int flags)
throws PatternSyntaxException
Compiles the given regular expression into a pattern with the given flags. The flags control how certain interesting cases are handled, as you will soon learn.
public String
pattern()
Returns the regular expression from which this pattern was compiled.
public int
flags()
Returns this pattern's match flags.
public Matcher
matcher(CharSequence input)
Creates a matcher that will match the given input against this pattern.
public String[]
split(CharSequence input, int limit)
A convenience method that splits the given input sequence around matches of this pattern. Useful when you do not need to reuse the matcher.
public String[]
split(CharSequence input)
A convenience method that splits the given input sequence around matches of this pattern. Equivalent to split(input,0)
.
public static boolean
matches(String regex, CharSequence input)
A convenience method that compiles the given regular expression and attempts to match the entire given input against it. Useful when you do not need to reuse either the pattern or the matcher. Returns true
if the whole input matches.
public static String
quote(String str)
Returns a string that can be used to create a pattern that matches str
literally, with any special characters in it treated as ordinary characters.
The toString
method of a Pattern
also returns the regular expression from which the pattern was compiled.
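The split, matches, and quote methods can be sketched as follows. The class name, the helper matchesLiterally, and the inputs are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class for Pattern's convenience methods.
final class PatternUtilDemo {
    static final Pattern COMMA = Pattern.compile(",");

    // True if input is exactly the text of literal, with no character treated specially.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        String input = "a,b,,c,,";
        // A limit of zero (the default) drops trailing empty strings.
        System.out.println(Arrays.toString(COMMA.split(input)));
        // A positive limit caps the number of pieces; the last piece keeps the rest.
        System.out.println(Arrays.toString(COMMA.split(input, 2)));
        // A negative limit keeps trailing empty strings.
        System.out.println(COMMA.split(input, -1).length);
        // Unquoted, "1+1=2" means one or more '1', then "1=2".
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));
    }
}
```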
The flags you can specify when creating the pattern object affect how the matching will be done. Some of these affect the performance of the matching, occasionally severely, but they may be functionality you need.
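Flag | Meaning |
---|---|
CASE_INSENSITIVE | Case-insensitive matching. By default, only handle case for the ASCII characters. |
UNICODE_CASE | Unicode-aware case folding when combined with CASE_INSENSITIVE. |
CANON_EQ | Canonical equivalence. If a character has multiple expressions, treat them as equivalent. For example, å written as the single character \u00E5 matches å written as a followed by the combining ring \u030A. |
DOTALL | Dot-all mode, where . matches any character, including a line terminator. |
MULTILINE | Multiline mode, where ^ and $ match at the beginning and end of each line, not only at the beginning and end of the input. |
UNIX_LINES | Unix lines mode, where only \n is recognized as a line terminator. |
COMMENTS | Comments and whitespace in pattern. Whitespace will be ignored, and comments starting with # are ignored up to the end of the line. |
LITERAL | Enable literal parsing of the pattern. |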
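The effect of several flags can be seen in a short sketch. The flag constants are those defined by Pattern; the class name, helpers, and inputs are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for Pattern flags.
final class FlagsDemo {
    // True if the whole input matches the pattern compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // True if the pattern compiled with the given flags matches somewhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("moon", Pattern.CASE_INSENSITIVE, "MOON"));
        // Without DOTALL, dot does not match a line terminator.
        System.out.println(matches("a.b", 0, "a\nb"));
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));
        // With MULTILINE, ^ and $ also match around the second line.
        System.out.println(found("^b$", Pattern.MULTILINE, "a\nb"));
        // With COMMENTS, the spaces and the trailing comment are ignored.
        System.out.println(matches("\\d+  # digits", Pattern.COMMENTS, "42"));
        // Flags are combined with the bitwise OR operator.
        int both = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        System.out.println(matches("\u00e9", both, "\u00c9"));
    }
}
```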
The Matcher
class has methods to match against the sequence. Each of these returns a boolean indicating success or failure. If successful, the position and other state associated with the match can then be retrieved from the Matcher
object via the start
, end
, and group
methods. The matching queries are
public boolean
matches()
Attempts to match the entire input sequence against the pattern.
public boolean
lookingAt()
Attempts to match the input sequence, starting at the beginning, against the pattern. Like the matches
method, this method always starts at the beginning of the input sequence; unlike that method, it does not require that the entire input sequence be matched.
public boolean
find()
Attempts to find the next subsequence of the input sequence that matches the pattern. This method starts at the beginning of the input sequence or, if a previous invocation of find
was successful and the matcher has not since been reset, at the first character not matched by the previous match.
public boolean
find(int start)
Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index. If a match is found, subsequent invocations of the find
method will start at the first character not matched by this match.
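The difference between the four queries can be sketched on a single input. The class name, the pattern, and the input are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing matches, lookingAt, and find.
final class MatchQueriesDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every match by invoking find until it fails.
    static List<String> findAll(CharSequence input) {
        List<String> out = new ArrayList<>();
        Matcher m = DIGITS.matcher(input);
        while (m.find())
            out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        String input = "123abc456";
        System.out.println(DIGITS.matcher(input).matches());   // false: "abc" is not digits
        System.out.println(DIGITS.matcher(input).lookingAt()); // true: "123" is a prefix
        System.out.println(findAll(input));                    // [123, 456]
        Matcher m = DIGITS.matcher(input);
        if (m.find(4))                                         // search from index 4 onward
            System.out.println(m.group());                     // 456
    }
}
```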
Once matching has commenced, the following methods allow the state of the matcher to be modified:
public Matcher
reset()
Resets this matcher. This discards all state and resets the append position (see below) to zero. The returned Matcher
is the one on which the method was invoked.
public Matcher
reset(CharSequence input)
Resets this matcher to use a new input sequence. The returned Matcher
is the one on which the method was invoked.
public Matcher
usePattern(Pattern pattern)
Changes the pattern used by this matcher to be pattern
. Any group information is discarded, but the input and append positions remain the same.
Once a match has been found, the following methods return more information about the match:
public int
start()
Returns the start index of the previous match.
public int
end()
Returns the index of the last character matched, plus one.
public String
group()
Returns the input subsequence matched by the previous match; in other words, the substring defined by start
and end
.
public int
groupCount()
Returns the number of capturing groups in this matcher's pattern. Group zero, which denotes the entire match, is not included in this count, so valid group numbers range from zero to this count, inclusive.
public String
group(int group)
Returns the input subsequence matched by the given group in the previous match. Group zero is the entire matched pattern, so group(0)
is equivalent to group()
.
public int
start(int group)
Returns the start index of the given group from the previous match.
public int
end(int group)
Returns the index of the last character matched of the given group, plus one.
Together these methods form the MatchResult
interface, which allows a match result to be queried but not modified. You can convert the current matcher state to a MatchResult
instance by invoking its toMatchResult
method. Any subsequent changes to the matcher state do not affect the existing MatchResult
objects.
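The query methods and toMatchResult can be sketched as follows. The pattern for name=number pairs, the class name, and the helper firstPair are illustrative assumptions.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for group queries and MatchResult snapshots.
final class GroupsDemo {
    // Group 1 is the name, group 2 is the number.
    static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Returns a snapshot of the first match, or null if there is none.
    static MatchResult firstPair(CharSequence input) {
        Matcher m = PAIR.matcher(input);
        return m.find() ? m.toMatchResult() : null;
    }

    public static void main(String[] args) {
        Matcher m = PAIR.matcher("x=10, y=20");
        m.find();
        MatchResult first = m.toMatchResult();  // snapshot of the match "x=10"
        m.find();                               // the matcher moves on to "y=20"
        System.out.println(first.group() + " at " + first.start() + ".." + first.end());
        System.out.println(m.group(1) + " -> " + m.group(2));
        System.out.println(m.groupCount());     // 2: group zero is not counted
    }
}
```

The snapshot in first still describes the first match after the matcher has advanced.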
You will often want to pair finding matches with replacing the matched characters with new ones. For example, if you want to replace all instances of sun
with moon
, your code might look like this:[2]
Pattern pat = Pattern.compile("sun");
Matcher matcher = pat.matcher(input);
StringBuffer result = new StringBuffer();
boolean found;
while ((found = matcher.find()))
    matcher.appendReplacement(result, "moon");
matcher.appendTail(result);
The loop continues as long as there are matches to sun
. On each iteration through the loop, all the characters from the append position (the position after the last match; initially zero) to the start of the current match are copied into the string buffer. Then the replacement string moon
is copied. When there are no more matches, appendTail
copies any remaining characters into the buffer.
The replacement methods of Matcher
are
public String
replaceFirst(String replacement)
Replaces the first occurrence of this matcher's pattern with the replacement string, returning the result. The matcher is first reset and is not reset after the operation.
public String
replaceAll(String replacement)
Replaces all occurrences of this matcher's pattern with the replacement string, returning the result. The matcher is first reset and is not reset after the operation.
public Matcher
appendReplacement(StringBuffer buf, String replacement)
Adds to the string buffer the characters between the current append and match positions, followed by the replacement string, and then moves the append position to be after the match. As shown above, this can be used as part of a replacement loop. Returns this matcher.
public StringBuffer
appendTail(StringBuffer buf)
Adds to the string buffer all characters from the current append position until the end of the sequence. Returns the buffer.
So the previous example can be written more simply with replaceAll
:
Pattern pat = Pattern.compile("sun");
Matcher matcher = pat.matcher(input);
String result = matcher.replaceAll("moon");
As an example of a more complex usage of regular expressions, here is code that will replace every number with the next largest number:
Pattern pat = Pattern.compile("[-+]?[0-9]+");
Matcher matcher = pat.matcher(input);
StringBuffer result = new StringBuffer();
boolean found;
while ((found = matcher.find())) {
    String numStr = matcher.group();
    int num = Integer.parseInt(numStr);
    String plusOne = Integer.toString(num + 1);
    matcher.appendReplacement(result, plusOne);
}
matcher.appendTail(result);
Here we decode the number found by the match, add one to it, and replace the old value with the new one.
The replacement string can contain a $
g
, which will be replaced with the value from the g
th capturing group in the expression. The following method uses this feature to swap all instances of two adjacent words:
public static String swapWords(String w1, String w2, String input) {
    String regex = "\\b(" + w1 + ")(\\W+)(" + w2 + ")\\b";
    Pattern pat = Pattern.compile(regex);
    Matcher matcher = pat.matcher(input);
    return matcher.replaceAll("$3$2$1");
}
First we build a pattern from the two words, using parentheses to capture groups of characters. A \b in a pattern matches a word boundary (otherwise the word “crow” would match part of “crown”), and
\W
matches any character that would not be part of a word. The original pattern matches groups one (the first word), two (the separator characters), and three (the second word), which the "$3$2$1"
replacement string inverts.
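For example, an invocation such as swapWords("up", "down", "The yo-yo goes up, down, up, down, ...")
would return the string "The yo-yo goes down, up, down, up, ...".
If we only wanted to swap the first time the words were encountered, we could have the method return matcher.replaceFirst("$3$2$1")
instead.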
A Matcher
looks for matches in the character sequence that it is given as input. By default, the entire character sequence is considered when looking for a match. You can control the region of the character sequence to be used, through the method region
which takes a starting index and an ending index to define the subsequence in the input character sequence. The methods regionStart
and regionEnd
return, respectively, the current start index and the current end index.
You can control whether a region is considered to be the true start and end of the input, so that matching with the beginning or end of a line will work, by invoking useAnchoringBounds
with an argument of true
(the default). If you don't want the region to match with the line anchors then use false
. The method hasAnchoringBounds
will return the current setting.
Similarly, you can control whether the bounds of the region are transparent to matching methods that want to look-ahead, look-behind, or detect a boundary. By default bounds are opaque—that is, they will appear to be hard bounds on the input sequence—but you can change that with useTransparentBounds
. The hasTransparentBounds
method returns the current setting.
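The two bounds settings can be sketched as follows. The class name, helper names, and inputs are assumptions; each helper restricts the matcher to a region and then searches it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for regions, anchoring bounds, and transparent bounds.
final class RegionDemo {
    // Searches input[start, end) for "^moon" with the given anchoring setting.
    static boolean regionStartsWithMoon(String input, int start, int end, boolean anchoring) {
        Matcher m = Pattern.compile("^moon").matcher(input);
        m.region(start, end).useAnchoringBounds(anchoring);
        return m.find();
    }

    // Searches input[start, end) for "\bmoon" with the given transparency setting.
    static boolean wordStartsInRegion(String input, int start, int end, boolean transparent) {
        Matcher m = Pattern.compile("\\bmoon").matcher(input);
        m.region(start, end).useTransparentBounds(transparent);
        return m.find();
    }

    public static void main(String[] args) {
        // With anchoring bounds, ^ matches at the start of the region.
        System.out.println(regionStartsWithMoon("sun moon", 4, 8, true));   // true
        System.out.println(regionStartsWithMoon("sun moon", 4, 8, false));  // false
        // With opaque bounds, the 'n' before the region is invisible to \b.
        System.out.println(wordStartsInRegion("sunmoon", 3, 7, false));     // true
        System.out.println(wordStartsInRegion("sunmoon", 3, 7, true));      // false
    }
}
```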
Suppose you want to parse a string into two parts that are separated by a comma. The pattern (.*),(.*)
is clear and straightforward, but it is not necessarily the most efficient way to do this. The first .*
will attempt to consume the entire input. The matcher will have to then back up to the last comma and then expand the rest into the second .*
. You could help this along by being clear that a comma is not part of the group: ([^,]*),([^,]*)
. Now it is clear that the matcher should only go so far as the first comma and stop, which needs no backing up. On the other hand, the second expression is somewhat less clear to the casual user of regular expressions.
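A sketch comparing the two patterns follows; the class name, the helper parts, and the inputs are assumptions. It also shows that the patterns are not interchangeable when the input holds more than one comma, which is worth checking before swapping one for the other.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two comma-splitting patterns.
final class CommaSplitDemo {
    static final Pattern GREEDY = Pattern.compile("(.*),(.*)");
    static final Pattern NO_COMMA = Pattern.compile("([^,]*),([^,]*)");

    // Returns the two captured parts, or null if the whole input does not match.
    static String[] parts(Pattern pat, String input) {
        Matcher m = pat.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // With exactly one comma, both patterns capture the same two parts.
        System.out.println(Arrays.toString(parts(GREEDY, "left,right")));
        System.out.println(Arrays.toString(parts(NO_COMMA, "left,right")));
        // With two commas, the greedy pattern splits at the last comma,
        // while the second pattern cannot match the whole input at all.
        System.out.println(Arrays.toString(parts(GREEDY, "a,b,c")));
        System.out.println(parts(NO_COMMA, "a,b,c") == null);
    }
}
```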
You should avoid trading clarity for efficiency unless you are writing a performance critical part of the code. Regular expressions are by nature already cryptic. Sophisticated techniques make them even more difficult to understand, and so should be used only when needed. And when you do need to be more efficient be sure that you are doing things that are more efficient—as with all optimizations, you should test carefully what is actually faster. In the example we give, a sufficiently smart pattern compiler and matcher might make both patterns comparably quick. Then you would have traded clarity for nothing. And even if today one is more efficient than the other, a better implementation tomorrow may make that vanish. With regular expressions, as with any other part of programming, choosing optimization over clarity is a choice to be made sparingly.
If immutable strings were the only kind available, you would have to create a new String
object for each intermediate result in a sequence of String
manipulations. Consider, for example, how the compiler would evaluate an expression such as the following, where quote is a String:
'«' + quote + '»'
If the compiler were restricted to String
expressions, it would have to do something like the following:
String.valueOf('«').concat(quote).concat(String.valueOf('»'))
Each valueOf
and concat
invocation creates another String
object, so this operation would construct four String
objects, of which only one would be used afterward. The other strings would have incurred overhead to create, to set to proper values, and to garbage collect.
The compiler is more efficient than this. It uses a StringBuilder
object to build strings from expressions, creating the final String
only when necessary. StringBuilder
objects can be modified, so new objects are not needed to hold intermediate results. With StringBuilder
, the previous string expression would be represented as
new StringBuilder().append('«').append(quote).append('»').toString()
This code creates just one StringBuilder
object to hold the construction, appends stuff to it, and then uses toString
to create a String
from the result.
To build and modify a string, you probably want to use the StringBuilder
class. StringBuilder
provides the following constructors:
public
StringBuilder()
Constructs a StringBuilder
with an initial value of ""
(an empty string) and a capacity of 16.
public
StringBuilder(int capacity)
Constructs a StringBuilder
with an initial value of ""
and the given capacity.
public
StringBuilder(String str)
Constructs a StringBuilder
with an initial value copied from str
.
public
StringBuilder(CharSequence seq)
Constructs a StringBuilder
with an initial value copied from seq
.
StringBuilder
is similar to String
, and it supports many methods that have the same names and contracts as some String
methods—indexOf
, lastIndexOf
, replace
, substring
. However, StringBuilder
does not extend String
nor vice versa. They are independent implementations of CharSequence
.
There are several ways to modify the buffer of a StringBuilder
object, including appending to the end and inserting in the middle. The simplest method is setCharAt
, which changes the character at a specific position. The following replace
method does what String.replace
does, except that it uses a StringBuilder
object. The replace
method doesn't need to create a new object to hold the results, so successive replace
calls can operate on one buffer:
public static void replace(StringBuilder str, char oldChar, char newChar) {
    for (int i = 0; i < str.length(); i++)
        if (str.charAt(i) == oldChar)
            str.setCharAt(i, newChar);
}
The setLength
method truncates or extends the string in the buffer. If you invoke setLength
with a length smaller than the length of the current string, the string is truncated to the specified length. If the length is longer than the current string, the string is extended with null characters ('\u0000'
).
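Both effects of setLength can be sketched briefly. The class name and the helper truncated are illustrative assumptions.

```java
// Hypothetical demo class for StringBuilder.setLength.
final class SetLengthDemo {
    // Returns s cut down to at most max characters.
    static String truncated(String s, int max) {
        StringBuilder buf = new StringBuilder(s);
        if (buf.length() > max)
            buf.setLength(max);     // drops everything from index max onward
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(truncated("moonlight", 4));  // moon
        StringBuilder buf = new StringBuilder("moon");
        buf.setLength(6);           // extends with two null characters
        buf.setCharAt(4, '!');      // overwrite the padding as needed
        buf.setCharAt(5, '!');
        System.out.println(buf);    // moon!!
    }
}
```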
There are also append
and insert
methods to convert any data type to a String
and then append the result to the end or insert the result at a specified position. The insert
methods shift characters over to make room for inserted characters as needed. The following types are converted by these append
and insert
methods:
Object | String | CharSequence | char[] |
boolean | char | int | long |
float | double |
There are also append
and insert
methods that take part of a CharSequence
or char
array as an argument. Here is some code that uses various append
invocations to create a StringBuilder
that describes the square root of an integer:
String sqrtInt(int i) {
    StringBuilder buf = new StringBuilder();
    buf.append("sqrt(").append(i).append(')');
    buf.append(" = ").append(Math.sqrt(i));
    return buf.toString();
}
The append
and insert
methods return the StringBuilder
object itself, enabling you to append to the result of a previous append.
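As a further sketch, the following mixes several argument types in one chain and contrasts the two partial forms of append. The class and method names are assumptions for illustration.

```java
// Hypothetical demo class for the append and insert overloads.
final class AppendInsertDemo {
    // Chains appends of several types, then wraps the result in brackets.
    static String describe(String name, int count, double ratio, boolean ok) {
        StringBuilder buf = new StringBuilder();
        buf.append(name).append('=').append(count)   // String, char, int
           .append(" ratio=").append(ratio)          // double
           .append(" ok=").append(ok);               // boolean
        buf.insert(0, '[').append(']');              // insert also returns the builder
        return buf.toString();
    }

    // The partial forms take different kinds of ranges.
    static String slices() {
        StringBuilder buf = new StringBuilder();
        char[] letters = { 'a', 'b', 'c', 'd' };
        buf.append(letters, 1, 2);   // char[] form: offset 1, length 2, so "bc"
        buf.append("hello", 1, 3);   // CharSequence form: start 1, end 3, so "el"
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("n", 42, 1.5, true));  // [n=42 ratio=1.5 ok=true]
        System.out.println(slices());                      // bcel
    }
}
```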
A few append
methods together form the java.lang.Appendable
interface. These methods are
public Appendable append(char c)
public Appendable append(CharSequence seq)
public Appendable append(CharSequence seq, int start, int end)
The Appendable
interface is used to mark classes that can receive formatted output from a java.util.Formatter
object—see “Formatter” on page 624.
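Writing a method against Appendable rather than StringBuilder lets it work with any such target. A minimal sketch, in which the class and method names are assumptions; note that the Appendable methods are declared to throw IOException, which a StringBuilder never actually does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class that writes to any Appendable.
final class AppendableDemo {
    // Appends "key=value" to the given target.
    static void appendPair(Appendable out, String key, int value) throws IOException {
        out.append(key).append('=').append(Integer.toString(value));
    }

    // Convenience wrapper that uses a StringBuilder as the target.
    static String render(String key, int value) {
        StringBuilder buf = new StringBuilder();
        try {
            appendPair(buf, key, value);
        } catch (IOException e) {
            // Not expected for a StringBuilder; rethrown unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("x", 1));  // x=1
    }
}
```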
The insert
methods take two parameters. The first is the index at which to insert characters into the StringBuilder
. The second is the value to insert, after conversion to a String
if necessary. Here is a method to put the current date at the beginning of a buffer:
public static StringBuilder addDate(StringBuilder buf) {
    String now = new java.util.Date().toString();
    buf.insert(0, now).insert(now.length(), ": ");
    return buf;
}
The addDate
method first creates a string with the current time using java.util.Date
, whose default constructor creates an object that represents the time it was created. Then addDate
inserts the string that represents the current date, followed by a simple separator string. Finally, it returns the buffer it was passed so that invoking code can use the same kind of method concatenation that proved useful in StringBuilder
's own methods.
The reverse
method reverses the order of characters in the StringBuilder
. For example, if the contents of the buffer are "good"
, the contents after reverse
are "doog"
.
You can remove part of the buffer with delete
, which takes a starting and ending index. The segment of the string from the starting index up to but not including the ending index is removed from the buffer, and the buffer is shortened. You can remove a single character by using deleteCharAt
.
You can also replace characters in the buffer:
public StringBuilder
replace(int start, int end, String str)
Replaces the characters starting at start
up to but not including end
with the contents of str
. The buffer is grown or shrunk as the length of str
is greater than or less than the range of characters replaced.
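The editing methods described above can be sketched in sequence on one buffer. The class name and the sample text are assumptions.

```java
// Hypothetical demo class for replace, delete, reverse, and deleteCharAt.
final class EditDemo {
    // Applies four edits to one buffer and returns the final contents.
    static String demo() {
        StringBuilder buf = new StringBuilder("sunrise");
        buf.replace(0, 3, "moon");   // "moonrise": the buffer grows by one character
        buf.delete(4, 8);            // "moon": removes indexes 4 through 7
        buf.reverse();               // "noom"
        buf.deleteCharAt(3);         // "noo"
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());  // noo
    }
}
```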
To get a String
object from a StringBuilder
object, you simply invoke the toString
method. If you need a substring of the buffer, the substring
methods work analogously to those of String
. If you want some or all of the contents as a character array, you can use getChars
, which is analogous to String.getChars
.
public void
getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)
Copies characters from this StringBuilder
into the specified array. The characters of the specified substring are copied into the character array, starting at dst[dstBegin]
. The specified substring is the part of the string buffer from srcBegin
up to but not including srcEnd
.
Here is a method that uses getChars
to remove part of a buffer:
public static StringBuilder remove(StringBuilder buf, int pos, int cnt) {
    if (pos < 0 || cnt < 0 || pos + cnt > buf.length())
        throw new IndexOutOfBoundsException();
    int leftover = buf.length() - (pos + cnt);
    if (leftover == 0) {    // a simple truncation
        buf.setLength(pos);
        return buf;
    }
    char[] chrs = new char[leftover];
    buf.getChars(pos + cnt, buf.length(), chrs, 0);
    buf.setLength(pos);
    buf.append(chrs);
    return buf;
}
First remove
ensures that the array references will stay in bounds. You could handle the actual exception later, but checking now gives you more control. Then remove
calculates how many characters follow the removed portion. If there are none, it truncates and returns. Otherwise, remove
retrieves them using getChars
and then truncates the buffer and appends the leftover characters before returning.
The buffer of a StringBuilder
object has a capacity, which is the length of the string it can store before it must allocate more space. The buffer grows automatically as characters are added, but it is more efficient to specify the size of the buffer only once.
You set the initial size of a StringBuilder
object by using the constructor that takes a single int
:
public
StringBuilder(int capacity)
Constructs a StringBuilder
with the given initial capacity
and an initial value of ""
.
public void
ensureCapacity(int minimum)
Ensures that the capacity of the buffer is at least the specified minimum
.
public int
capacity()
Returns the current capacity of the buffer.
public void
trimToSize()
Attempts to reduce the capacity of the buffer to accommodate the current sequence of characters. There is no guarantee that this will actually reduce the capacity of the buffer, but this gives a hint to the system that it may be a good time to try and reclaim some storage space.
You can use these methods to avoid repeatedly growing the buffer. Here, for example, is a rewrite of the sqrtInt
method from page 332 that ensures that you allocate new space for the buffer at most once:
String sqrtIntFaster(int i) {
    StringBuilder buf = new StringBuilder(50);
    buf.append("sqrt(").append(i).append(')');
    buf.append(" = ").append(Math.sqrt(i));
    return buf.toString();
}
The only change is to use a constructor that creates a StringBuilder
object large enough to contain the result string. The value 50 is somewhat larger than required; therefore, the buffer will never have to grow.
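The capacity methods can be observed directly. In this sketch the class and helper names are assumptions; exact capacities after growth are an implementation detail, so the code only relies on the lower bounds the methods promise.

```java
// Hypothetical demo class for the StringBuilder capacity methods.
final class CapacityDemo {
    // Returns the capacity of a default builder after asking for at least min.
    static int capacityAfterEnsure(int min) {
        StringBuilder buf = new StringBuilder();
        buf.ensureCapacity(min);
        return buf.capacity();
    }

    public static void main(String[] args) {
        System.out.println(new StringBuilder().capacity());    // 16
        System.out.println(new StringBuilder(50).capacity());  // 50
        System.out.println(capacityAfterEnsure(100) >= 100);   // true
        StringBuilder buf = new StringBuilder(1000).append("moon");
        buf.trimToSize();            // a hint; the capacity may or may not shrink
        System.out.println(buf.capacity() >= buf.length());    // true
    }
}
```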
The StringBuffer
class is essentially identical to the StringBuilder
class except for one thing: It provides a thread-safe implementation of an appendable character sequence—see Chapter 14 for more on thread safety. This difference would normally relegate discussion of StringBuffer
to a discussion on thread-safe data structures, were it not for one mitigating factor: The StringBuffer
class is older, and previously filled the role that StringBuilder
does now as the standard class for mutable character sequences. For this reason, you will often find methods that take or return StringBuffer
rather than StringBuilder
, CharSequence
, or Appendable
. These historical uses of StringBuffer
are likely to be enshrined in the existing API
s for many years to come.
Exercise 13.5: Write a method to convert strings containing decimal numbers into comma-punctuated numbers, with a comma every third digit from the right. For example, given the string "1543729"
, the method should return the string "1,543,729"
.
Exercise 13.6: Modify the method to accept parameters specifying the separator character to use and the number of digits between separator characters.
In “Working with UTF-16” on page 196, we described a number of utility methods provided by the Character
class to ease working with the supplementary Unicode characters (those greater in value than 0xFFFF
that require encoding as a pair of char
values in a CharSequence
). Each of the String
, StringBuilder
, and StringBuffer
classes provides these methods:
public int
codePointAt(int index)
Returns the code point defined at the given index in this
, taking into account that it may be a supplementary character represented by the pair this.charAt(index)
and this.charAt(index+1)
.
public int
codePointBefore(int index)
Returns the code point that precedes the given index in this
, taking into account that it may be a supplementary character represented by the pair this.charAt(index-2)
and this.charAt(index-1)
.
public int
codePointCount(int start, int end)
Returns the number of code points defined in this.charAt(start)
to this.charAt(end-1)
, taking into account surrogate pairs. Any unpaired surrogate values count as one code point each.
public int
offsetByCodePoints(int index, int numberOfCodePoints)
Returns the index into this
that is numberOfCodePoints
away from index
, taking into account surrogate pairs.
In addition, the StringBuilder
and StringBuffer
classes define the appendCodePoint
method that takes an int
representing an arbitrary Unicode character, encodes it as a surrogate pair if needed, and appends it to the end of the buffer. Curiously, there is no corresponding insertCodePoint
method.
Finally, the String
class also provides the following constructor:
public
String(int[] codePoints, int start, int count)
Constructs a new String
with the contents from codePoints[start]
up to a maximum of count
code points, with supplementary characters encoded as surrogate pairs as needed. If any value in the array is not a valid Unicode code point, then IllegalArgumentException
is thrown.
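The code point methods can be sketched on a string that holds one supplementary character. The class name and the choice of U+1D11E (the musical G clef symbol) are illustrative assumptions.

```java
// Hypothetical demo class for the code point methods.
final class CodePointDemo {
    static final int G_CLEF = 0x1D11E;  // a supplementary character

    // Builds 'a', the clef (stored as two char values), then 'b'.
    static String sample() {
        return new StringBuilder()
            .append('a').appendCodePoint(G_CLEF).append('b')
            .toString();
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                          // 4 char values
        System.out.println(s.codePointCount(0, s.length()));     // 3 code points
        System.out.println(s.codePointAt(1) == G_CLEF);          // true
        System.out.println(s.codePointBefore(3) == G_CLEF);      // true
        System.out.println(s.offsetByCodePoints(0, 2));          // 3
        // The int[] constructor encodes the surrogate pair in the same way.
        String same = new String(new int[] { 'a', G_CLEF, 'b' }, 0, 3);
        System.out.println(same.equals(s));                      // true
    }
}
```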
When ideas fail, words come in very handy.
--Johann Wolfgang von Goethe
[1] The .*
means “zero or more characters,” because .
means “any character” and *
means “zero or more of the thing I follow,” so together they mean “zero or more of any character.”