For most of Java’s existence, search-and-replace operations that could be accomplished in a single line of Perl code were an arduous undertaking in Java because it lacked support for regular expressions, a sophisticated form of text processing.
The Java class library now includes robust support for regular expressions with the java.util.regex
package and additional methods in several classes that handle characters.
In this bonus chapter, you’ll learn how to use these features as the following topics are covered:
How to create and use regular expressions
How to find a pattern in a line of text
How to split text into smaller strings delimited by regular expressions
The most popular way to process text with software is through the use of regular expressions, a standard way to find patterns in one or more lines of text and replace them with new text.
Regular expressions, which also are called regex, are implemented in a wide variety of computer programming languages but are most strongly associated with the Perl scripting language. Most popular languages support them as a core part of the language or as an optional module or library, enabling programmers with regular expression skills to apply them across several languages.
This can be compared to the use of Structured Query Language (SQL), a method of querying and updating a database that has been implemented by numerous language and database vendors. SQL can be employed in Java, Visual C++, and many other development environments.
A regular expression is a series of characters and punctuation that describe a pattern that may be found in text. You can use this pattern to find something you’re looking for, extract some of the text, replace the text with something new, and similar tasks.
Sun’s support for regular expressions turns up in three places:
The java.util.regex
package, which is composed of the Pattern
and Matcher
classes and the exception class PatternSyntaxException
The three classes in Java that represent a sequence of characters: String
and StringBuffer
in the java.lang
package and CharBuffer
in the java.nio
package
The CharSequence
interface in java.lang
, an interface shared by String
, StringBuffer
, and CharBuffer
This is a comparatively small number of classes, interfaces, and methods, so putting regular expressions to work in Java should be pretty easy for anyone experienced with this kind of text handling, especially if you have used them with the Perl language. Java’s implementation of regular expressions is close to the implementation offered in Perl 5.
On the other hand, regular expressions have a complex and sophisticated syntax that can be challenging to learn. Several computer programming books cover nothing but this topic.
Sun’s official Java documentation recommends one book in particular: Mastering Regular Expressions, 2nd Edition, by Jeffrey E. F. Friedl (O’Reilly, ISBN 0-596-00289-0).
Though it’s beyond the scope of this chapter to offer a complete introductory tutorial on regular expressions, all the expressions used in upcoming code examples will be described in full. Even if you are completely new to the subject, you’ll learn some useful simple expressions that can accomplish common string-handling tasks.
Regular expressions also are supported by two open source class libraries from the Apache Project: Jakarta-ORO and Jakarta-Regexp.
Unless there’s a reason you should only rely on the Java class library, the choice of a regular expressions package depends on the depth of support for expressions and the functionality of the classes in the package.
Sun’s java.util.regex
package is considerably smaller than Jakarta-ORO, so it might be worth a look at that library to determine whether it addresses some of the particular problems you’re trying to solve with regular expressions.
For more information on these libraries, visit the Apache Project at http://jakarta.apache.org/oro and jakarta.apache.org/regexp.
As part of the support for regular expressions, the CharSequence
interface is implemented by objects that represent a series of characters in a defined sequence. This interface, part of the java.lang
package, is composed of four methods:
length()
—Returns an int
that equals the number of characters in the sequence
charAt(
int)
—Returns the char
at the int position in the sequence, which could be anything from 0 (the first position) to one less than the length()
(the last position)
subSequence(
int,
int)
—Returns a CharSequence
that holds a portion of the sequence, beginning at the first int position and ending 1 below the second int position
toString()
—Returns a String
containing the sequence
This interface is implemented by three classes: String
, StringBuffer
, and CharBuffer
, one of the buffer classes in the java.nio
package, which supports buffer-based input and output.
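As a minimal sketch of how the four methods behave across the three implementing classes, the following example passes a String, a StringBuffer, and a CharBuffer to one method that accepts any CharSequence. The class name CharSequenceDemo and the describe() helper are hypothetical names chosen for illustration.

```java
import java.nio.CharBuffer;

class CharSequenceDemo {
    // Uses only the four CharSequence methods, so any implementing
    // class can be passed in. Assumes at least two characters.
    static String describe(CharSequence seq) {
        int last = seq.length() - 1;
        return seq.length() + " chars, first '" + seq.charAt(0)
            + "', last '" + seq.charAt(last)
            + "', middle \"" + seq.subSequence(1, last).toString() + "\"";
    }

    public static void main(String[] arguments) {
        // All three calls print: 6 chars, first 'P', last 'o', middle "rest"
        System.out.println(describe("Presto"));
        System.out.println(describe(new StringBuffer("Presto")));
        System.out.println(describe(CharBuffer.wrap("Presto")));
    }
}
```

Because subSequence(1, last) stops one position before its second argument, the final character is left out of the middle portion.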
Regular expressions in java.util.regex
require the interaction of only two classes: Pattern
and Matcher
.
Here’s how they work: A Pattern
object is created that holds a compiled version of a regular expression. A Matcher
object is then created that can find text that matches the expression and take action as a result, such as retrieving, deleting, or replacing the text.
To compare an entire character sequence to a pattern, call the Pattern
class method matches(
String,
CharSequence)
. The first argument is the pattern to look for, and the second is the source text. The method returns true
if the entire text matches the pattern and false
otherwise.
The following statement looks for the pattern “[Aa]mazon.com”
in a string called store
:
boolean pm = Pattern.matches("[Aa]mazon.com", store);
The pattern in this example matches the text “Amazon.com” or “amazon.com” (the unescaped dot also accepts any other single character in that position), setting the pm
variable to true
for a match and false
otherwise.
The String
class has a matches(
String)
method that takes a pattern as its only argument, returning true
if the entire string calling the method matches the pattern and false
otherwise:
boolean pm = store.matches("[Aa]mazon.com");
These methods are suitable for times when you only are testing a pattern once and want to look for it in an entire character sequence. Both matches()
methods use the default behavior for pattern matching in Java. You’ll see later how to efficiently conduct repeated checks for the same pattern and customize the behavior.
Both matches()
methods throw a PatternSyntaxException
if the pattern is not a valid regular expression. They also throw a NullPointerException
if the pattern is null.
Though you aren’t required to catch these exceptions, it’s a good idea to at least deal with PatternSyntaxException
problems because regular expressions are complex and easy to write incorrectly.
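The following sketch shows one way to wrap a one-off match in a try-catch block. The MatchChecker class and its check() method are hypothetical names used only for this illustration; Pattern.matches() and PatternSyntaxException.getDescription() are part of java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class MatchChecker {
    // Returns "match", "no match", or a short error description.
    static String check(String regex, String text) {
        try {
            return Pattern.matches(regex, text) ? "match" : "no match";
        } catch (PatternSyntaxException pse) {
            return "bad pattern: " + pse.getDescription();
        }
    }

    public static void main(String[] arguments) {
        // The whole text fits the pattern.
        System.out.println(check("[Aa]mazon.com", "amazon.com"));       // match
        // Extra text in front means the entire sequence does not match.
        System.out.println(check("[Aa]mazon.com", "Visit amazon.com")); // no match
        // The character class is never closed, so compilation fails.
        System.out.println(check("[Aa", "amazon.com"));
    }
}
```

The second call illustrates that matches() compares the pattern to the entire character sequence, not to a portion of it.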
The first project you will create is PatternTester
, a Swing application that takes text and a pattern as input and displays whether the pattern was found in the text. Enter the text of Listing 1 and save the file as PatternTester.java
.
Example 1. The Full Text of PatternTester.java
 1: import java.awt.*;
 2: import java.awt.event.*;
 3: import java.util.regex.*;
 4: import javax.swing.*;
 5:
 6: public class PatternTester extends JFrame implements ActionListener {
 7:     JTextArea text = new JTextArea(5, 29);
 8:     JTextField pattern = new JTextField(35);
 9:     JButton search = new JButton("Search");
10:     JButton newSearch = new JButton("New Search");
11:     JTextArea result = new JTextArea(5, 29);
12:
13:     public PatternTester() {
14:         super("Test Patterns");
15:         setSize(430, 320);
16:         setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
17:         Container pane = getContentPane();
18:         GridLayout grid = new GridLayout(3, 1);
19:         pane.setLayout(grid);
20:         // set up the top row
21:         JLabel textLabel = new JLabel("Text: ");
22:         JPanel row1 = new JPanel();
23:         row1.add(textLabel);
24:         text.setLineWrap(true);
25:         text.setWrapStyleWord(true);
26:         JScrollPane scroll = new JScrollPane(text,
27:             ScrollPaneConstants.VERTICAL_SCROLLBAR_ALWAYS,
28:             ScrollPaneConstants.HORIZONTAL_SCROLLBAR_NEVER);
29:         row1.add(scroll);
30:         // set up the middle row
31:         JPanel row2 = new JPanel();
32:         JLabel patternLabel = new JLabel("Pattern: ");
33:         row2.add(patternLabel);
34:         row2.add(pattern);
35:         search.addActionListener(this);
36:         newSearch.addActionListener(this);
37:         row2.add(search);
38:         row2.add(newSearch);
39:         // set up the bottom row
40:         JPanel row3 = new JPanel();
41:         JLabel resultLabel = new JLabel("Result: ");
42:         row3.add(resultLabel);
43:         result.setEditable(false);
44:         JScrollPane scroll2 = new JScrollPane(result,
45:             ScrollPaneConstants.VERTICAL_SCROLLBAR_ALWAYS,
46:             ScrollPaneConstants.HORIZONTAL_SCROLLBAR_NEVER);
47:         row3.add(scroll2);
48:         // set up the content pane
49:         pane.add(row1);
50:         pane.add(row2);
51:         pane.add(row3);
52:         setContentPane(pane);
53:         setVisible(true);
54:     }
55:
56:     public void actionPerformed(ActionEvent evt) {
57:         Object source = evt.getSource();
58:         if (source == search) {
59:             checkPattern();
60:         } else {
61:             pattern.setText("");
62:             result.setText("");
63:         }
64:     }
65:
66:     private void checkPattern() {
67:         try {
68:             if (Pattern.matches(pattern.getText(), text.getText()))
69:                 result.setText("That pattern was found");
70:             else
71:                 result.setText("That pattern was not found");
72:         } catch (PatternSyntaxException pse) {
73:             result.setText("Regex error: " + pse.getMessage());
74:         }
75:     }
76:
77:     public static void main(String[] arguments) {
78:         PatternTester app = new PatternTester();
79:     }
80: }
Compile and run the PatternTester
application to see the graphical user interface shown in Figure 1.
This application can be used to try out Java’s support for regular expressions. Enter a string in the Text field, a regular expression in the Pattern field, and click the Search button to see whether the pattern was found in the text. Click the New Search button to try a different pattern.
The only new material in the application is the checkPattern()
method (lines 66–75). The rest creates the graphical user interface and responds to user events.
The checkPattern()
method calls the Pattern
class method matches()
with two arguments, the contents of the pattern
and text
components. The results of the attempted pattern match are displayed in the result
text area.
Because the application takes a pattern as user input, the call to matches()
is placed in a try
-catch
block that looks for PatternSyntaxException
exceptions.
The StringTokenizer
class in the java.util
package divides a string into smaller strings using a delimiter character such as a comma, slash (“/”), or backslash (“\”).
Regular expressions enable much more versatile string-division techniques using two split()
methods in the String
class. These methods use a regular expression as a delimiter instead of a character:
split(
String)
—Returns a String[]
array containing substrings separated by the specified pattern
split(
String,
int)
—Returns a String[]
array containing substrings separated by the specified pattern with a maximum number of array elements (the second argument)
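To make the difference between the two split() methods concrete, here is a small sketch that splits the same line with and without a limit. The SplitLimitDemo class and its helper methods are hypothetical names for this example.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // A character class that treats "-", "/", and "%" as delimiters.
    static final String DELIMITERS = "[-/%]";

    static String[] splitAll(String line) {
        return line.split(DELIMITERS);
    }

    // The limit caps the number of array elements; the last element
    // keeps the rest of the line, delimiters included.
    static String[] splitAtMost(String line, int limit) {
        return line.split(DELIMITERS, limit);
    }

    public static void main(String[] arguments) {
        String line = "110-4.25-Dec 09 2006-39.95";
        System.out.println(Arrays.toString(splitAll(line)));
        // prints [110, 4.25, Dec 09 2006, 39.95]
        System.out.println(Arrays.toString(splitAtMost(line, 2)));
        // prints [110, 4.25-Dec 09 2006-39.95]
    }
}
```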
The DataSplitter
application demonstrates the use of a split()
method. The program looks at three lines of stock price data that use three different delimiters: the “/”, “-”, and “%” characters.
Enter the text of Listing 2 in your editor and save the file as DataSplitter.java
.
Example 2. The Full Text of DataSplitter.java
 1: import java.util.regex.*;
 2:
 3: public class DataSplitter {
 4:     String[] input = { "320/10.50/Dec 09 2006/39.95",
 5:         "110-4.25-Dec 09 2006-39.95",
 6:         "8%54.00%Dec 8 2006%0" };
 7:
 8:     public DataSplitter() {
 9:         for (int i = 0; i < input.length; i++) {
10:             String[] piece = input[i].split("[-/%]");
11:             for (int j = 0; j < piece.length; j++)
12:                 System.out.print(piece[j] + " ");
13:             System.out.print("\n");
14:         }
15:     }
16:
17:     public static void main(String[] arguments) {
18:         DataSplitter app = new DataSplitter();
19:     }
20: }
The DataSplitter application uses pattern matching in line 10 to subdivide each string in the input
array. The pattern will be a match for any of the three characters “-”, “/”, or “%”.
When you compile and run the application, the output should be the following:
320 10.50 Dec 09 2006 39.95
110 4.25 Dec 09 2006 39.95
8 54.00 Dec 8 2006 0
The Pattern
class in the java.util.regex
package represents regular expressions in Java, which are comparable but not identical to the implementation of regular expressions for other languages.
A Pattern
object is a compiled regular expression that can be used repeatedly in much less time than repeated calls to a character sequence’s matches()
method. It also can be set up to perform special searches that ignore comments, treat upper and lowercase characters the same, and other options.
There is no constructor method you can use in the Pattern
class. To create and compile a Pattern
object, call one of the following class methods:
compile(
String)
—Returns a Pattern
object representing a compiled regular expression of the specified pattern text
compile(
String,
int)
—Returns a Pattern
object like the preceding method, but set up to search in a special way using one or more integer flags combined into a single value (the second argument)
The integers that can be specified with the second method are class variables of the Pattern
class. They are used by combining them with the OR
operator “|”, as in the following statement:
Pattern pt = Pattern.compile("[y]es", Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
This pattern treats uppercase and lowercase the same and ignores whitespace and comments written inside the pattern text.
The following class variables can be used:
CANON_EQ
—Enables canonical equivalence, a feature of Unicode character encoding that treats visually indistinguishable character sequences like they were identical, even if they were created using different characters
CASE_INSENSITIVE
—Treats uppercase and lowercase characters the same for the purposes of determining a match. This variable should only be used for pattern matching with the ASCII character set
COMMENTS
—Ignores whitespace and any text on a line following the #
character
DOTALL
—Treats a line separator the same as other characters when the dot character (“.”) is used in a pattern
MULTILINE
—Lets the beginning of line expression (“^”) and end of line expression (“$”) be triggered by the beginning and ending of individual lines, not just the beginning and end of the character sequence
UNICODE_CASE
—Treats uppercase and lowercase characters in the Unicode character set the same for the purpose of determining matches
UNIX_LINES
—Treats ‘
\n’
as the only line terminator
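The effect of a few of these class variables can be seen in a short sketch. The FlagDemo class and its matchesWith() helper are hypothetical names; the flags themselves are the Pattern class variables listed above.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Compiles the regex with the given flags and tests the whole text.
    static boolean matchesWith(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] arguments) {
        // Without flags, case matters.
        System.out.println(matchesWith("yes", 0, "YES"));                        // false
        System.out.println(matchesWith("yes", Pattern.CASE_INSENSITIVE, "YES")); // true
        // The dot skips line separators unless DOTALL is set.
        System.out.println(matchesWith("a.b", 0, "a\nb"));                       // false
        System.out.println(matchesWith("a.b", Pattern.DOTALL, "a\nb"));          // true
        // COMMENTS drops the spaces and the trailing # comment from the pattern.
        int both = Pattern.CASE_INSENSITIVE | Pattern.COMMENTS;
        System.out.println(matchesWith("y e s # the word yes", both, "Yes"));    // true
    }
}
```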
The following statement creates a Pattern
object holding a compiled regular expression:
Pattern name = Pattern.compile("[A-Z][a-z]*");
This regular expression gets a match on any single word that begins with an initial capital letter and is followed by nothing but lowercase letters. The words “Abracadabra” and “Presto” would match, but “alakazam” would not.
If the regular expression is not syntactically correct, a PatternSyntaxException
is thrown by the compile()
method.
The Pattern
object is a compiled regular expression kept for subsequent use. To conduct a search with it, you use another class, Matcher
.
The Matcher
class in the java.util.regex
package looks for a regular expression in text—any of the classes that implement the CharSequence
interface.
To create a Matcher
object associated with a pattern, call the pattern’s matcher(
CharSequence)
method with the text as the argument, as in the following example:
Pattern pattern = Pattern.compile("[a-e]");
Matcher looksee = pattern.matcher(userInput);
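One common arrangement, sketched here under the assumption that the same expression is checked against many strings, is to compile the pattern once and create a new Matcher for each piece of text. The NameFilter class and countNames() method are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameFilter {
    // Compiled a single time and shared by every call below.
    private static final Pattern NAME = Pattern.compile("[A-Z][a-z]*");

    // Counts the words made of one capital letter followed only by
    // lowercase letters.
    static int countNames(String[] words) {
        int count = 0;
        for (String word : words) {
            Matcher matcher = NAME.matcher(word);
            if (matcher.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] arguments) {
        String[] words = { "Abracadabra", "Presto", "alakazam" };
        System.out.println(countNames(words)); // prints 2
    }
}
```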
After you have a Matcher
object, call one of three of its methods to look for a match:
Call matches()
with no arguments to compare the pattern to the entire text. This returns true
if the pattern matches the entire text or false
otherwise.
Call lookingAt()
with no arguments to compare the pattern to the start of the text. This returns true
if the pattern is a match from the first character of the text until the last character described by the pattern, or false
otherwise.
Call find()
with no arguments to look for the first sequence in the text that matches the pattern; call it again to look for the next sequence that matches. This method returns true
as long as the sequence continues to be found.
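The difference among the three methods is easiest to see side by side. In this sketch, the MatchModes class and modes() method are hypothetical names; the method reports the matches() result, the lookingAt() result, and the number of times find() succeeds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModes {
    static String modes(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);
        // Entire text must fit the pattern.
        boolean whole = pattern.matcher(text).matches();
        // Only the start of the text must fit the pattern.
        boolean prefix = pattern.matcher(text).lookingAt();
        // Each successful find() moves on to the next matching sequence.
        int count = 0;
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            count++;
        }
        return whole + " " + prefix + " " + count;
    }

    public static void main(String[] arguments) {
        System.out.println(modes("[a-z]+", "abc"));         // true true 1
        System.out.println(modes("[a-z]+", "abc 123 def")); // false true 2
        System.out.println(modes("[a-z]+", "123abc"));      // false false 1
    }
}
```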
A Matcher
object can be reused by calling its reset()
or reset(
CharSequence)
methods. Call the reset()
method with no arguments to move back to the start of the character sequence associated with the matcher before the next call to find()
. Call reset(
CharSequence)
with a character sequence argument to associate the Matcher
object with different text.
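A brief sketch of both forms of reset() follows. The ResetDemo class and its methods are hypothetical names; the example counts matches, rewinds the same Matcher, and then points it at different text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // Counts how many more times find() succeeds from the current position.
    static int countFinds(Matcher matcher) {
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    static int[] run() {
        Matcher matcher = Pattern.compile("[a-e]").matcher("cab");
        int first = countFinds(matcher);   // c, a, b
        matcher.reset();                   // back to the start of "cab"
        int again = countFinds(matcher);
        matcher.reset("xyz");              // same pattern, different text
        int other = countFinds(matcher);
        return new int[] { first, again, other };
    }

    public static void main(String[] arguments) {
        int[] counts = run();
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // prints 3 3 0
    }
}
```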
After using one of these methods to find a pattern match, the Matcher
object’s start()
and end()
methods return integers that indicate the position of the match. These values can be used with a string’s substring()
method to retrieve the matched text.
This task is so common that the Matcher
object includes a shortcut. After a match is found, call the object’s group()
method to return a String
containing the text that successfully matched the pattern.
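The following sketch retrieves the same matched text both ways, with substring() and with group(). The MatchPosition class and firstMatch() method are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPosition {
    // Reports the position and text of the first match, or "none".
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        if (!matcher.find()) {
            return "none";
        }
        // The long way: cut the match out of the original text.
        String viaSubstring = text.substring(matcher.start(), matcher.end());
        // The shortcut: ask the matcher for the matched text.
        String viaGroup = matcher.group();
        if (!viaSubstring.equals(viaGroup)) {
            throw new IllegalStateException("substring and group disagree");
        }
        return matcher.start() + "-" + matcher.end() + ":" + viaGroup;
    }

    public static void main(String[] arguments) {
        // A capital letter followed by lowercase letters.
        System.out.println(firstMatch("[A-Z][a-z]+", "rain in Spain")); // prints 8-13:Spain
    }
}
```

As with subSequence(), the value returned by end() is one position past the last matched character.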
The WordSplitter
application in Listing 3 demonstrates the use of the find()
and group()
methods. Enter the text of the listing with your text editor and save the result as WordSplitter.java
.
Example 3. The Full Text of WordSplitter.java
 1: import java.util.regex.*;
 2:
 3: public class WordSplitter {
 4:
 5:     public static void main(String[] arguments) {
 6:         Pattern pattern = Pattern.compile("\\S+");
 7:         Matcher matcher = pattern.matcher(arguments[0]);
 8:         while (matcher.find())
 9:             System.out.println("[" + matcher.group() + "]");
10:     }
11: }
This application uses pattern matching to find and display each word in user-submitted text. The pattern in line 6, “\S+”
, looks for one or more characters in a row that are not whitespace. In Java source code, the backslash in a string literal must be doubled, as in "\\S+".
To run it, specify some text as a command-line argument, using quotation marks around the argument. Here’s an example for users working with the JDK:
java WordSplitter "The rain in Spain falls mainly on the plain"
This would produce the following output:
[The]
[rain]
[in]
[Spain]
[falls]
[mainly]
[on]
[the]
[plain]
One of the most powerful features of regular expressions is a capturing group, a subexpression enclosed in parentheses within the larger pattern.
The start(
int)
, end(
int)
, and group(
int)
methods are used to work with these groups. The integer argument to each method is the position of the subexpression relative to other subexpressions in the pattern. Groups are numbered from 1 in the order their opening parentheses appear, left to right, and group 0 stands for the entire match. If a regular expression contains only one capturing group, call start(1)
to find its starting position in the text.
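As an illustration, this sketch pulls a date out of one of the stock price lines used earlier with three capturing groups. The DateGroups class, its DATE pattern, and the describe() method are hypothetical names, and the expression is only one of several ways such a date could be described.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateGroups {
    // Group 1 is the month, group 2 the day, and group 3 the year.
    private static final Pattern DATE =
        Pattern.compile("([A-Z][a-z]+) (\\d+) (\\d+)");

    static String describe(String text) {
        Matcher matcher = DATE.matcher(text);
        if (!matcher.find()) {
            return "no date";
        }
        return "month=" + matcher.group(1)
            + " day=" + matcher.group(2)
            + " year=" + matcher.group(3)
            + " monthAt=" + matcher.start(1) + "-" + matcher.end(1);
    }

    public static void main(String[] arguments) {
        System.out.println(describe("320/10.50/Dec 09 2006/39.95"));
        // prints month=Dec day=09 year=2006 monthAt=10-13
    }
}
```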
The final project uses regular expressions to find and display hyperlinks contained within a web page. Enter the text of Listing 4 in your Java editor and save the file as LinkExtractor.java
.
Example 4. The Full Text of LinkExtractor.java
 1: import java.io.*;
 2: import java.util.regex.*;
 3:
 4: public class LinkExtractor {
 5:     public static void main(String[] arguments) {
 6:         if (arguments.length < 1) {
 7:             System.out.println("Usage: java LinkExtractor [page]");
 8:             System.exit(0);
 9:         }
10:         String page = loadPage(arguments[0]);
11:         Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");
12:         Matcher matcher = pattern.matcher(page);
13:         while (matcher.find()) {
14:             System.out.println(matcher.group(1));
15:         }
16:     }
17:
18:     private static String loadPage(String name) {
19:         StringBuffer output = new StringBuffer();
20:         try {
21:             FileReader file = new FileReader(name);
22:             BufferedReader buff = new BufferedReader(file);
23:             boolean eof = false;
24:             while (!eof) {
25:                 String line = buff.readLine();
26:                 if (line == null)
27:                     eof = true;
28:                 else
29:                     output.append(line + "\n");
30:             }
31:             buff.close();
32:         } catch (IOException e) {
33:             System.out.println("Error -- " + e.toString());
34:         }
35:         return output.toString();
36:     }
37: }
After compiling the file, you should save a web page to the same folder that contains LinkExtractor.class
, so you have something to search. To save a page from within a web browser, choose Page, Save As in Internet Explorer 7 or File, Save Page As in Mozilla Firefox.
Run the application with the filename of the page as an argument, as in this JDK example:
java LinkExtractor java-home-page.html
Here’s some example output from the Java home page at http://www.java.com:
/
/en/
http://www.sun.com/
/en/download/index.jsp
/en/selectlanguage.jsp
/en/dukeszone/
/en/games/
/en/mobile/
/en/desktop/
/en/desktop/meez.jsp
/en/games/desktop/crystalsolitaire.jsp
/en/levelup/
http://javawear.brandvia.com
https://subscriptions.sun.com/javacom/alert.html
http://www.sun.com/share/text/termsofuse.html
http://www.sun.com/suntrademarks/
/en/download/license.jsp
/en/about/disclaimer.jsp
The regular expression finds any text located between the quotation mark that follows “href=” inside an “<a” tag and the next quotation mark.
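The same expression can be tried on a string held in memory, which makes its behavior easier to inspect. In this sketch, the LinkFinder class, the extract() method, and the two-line sample page are hypothetical; each link sits on its own line because the dot does not match a line separator by default, so the greedy “.+” cannot run past the end of a line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFinder {
    // The "?" makes the group reluctant, so it stops at the first
    // closing quotation mark instead of the last one on the line.
    private static final Pattern LINK = Pattern.compile("<a.+href=\"(.+?)\"");

    static List<String> extract(String page) {
        List<String> links = new ArrayList<String>();
        Matcher matcher = LINK.matcher(page);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] arguments) {
        // A made-up page fragment with one link per line.
        String page = "<a href=\"/en/\">Home</a>\n"
            + "<a class=\"x\" href=\"http://www.sun.com/\">Sun</a>\n";
        System.out.println(extract(page)); // prints [/en/, http://www.sun.com/]
    }
}
```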
The java.util.regex
package offers comprehensive support for regular expressions.
As you have seen, regular expressions are a much more capable technique for text processing than anything else in the Java class library.
Regular expressions, a means of finding complex patterns in text so that they can be deleted, replaced, or retrieved, are available for many programming languages.
One thing to keep in mind about them is that there are differences between implementations. A Perl programmer who is an expert at writing expressions might find that some things work differently in java.util.regex
than expected, though most of the core functionality is comparable to Perl 5.
As you learn more about writing regular expressions, you’ll be able to put them to use with the same pattern creation and matching techniques covered in this chapter.