Appendix . Bonus Chapter: Regular Expressions

For most of Java’s existence, search-and-replace operations that could be accomplished in a single line of Perl code were an arduous undertaking in Java because it lacked support for regular expressions, a sophisticated form of text processing.

The Java class library includes robust support for regular expressions with the java.util.regex package and additional methods in several classes that handle characters.

In this bonus chapter, you’ll learn how to use these features as the following topics are covered:

  • How to create and use regular expressions

  • How to find a pattern in a line of text

  • How to split text into smaller strings delimited by regular expressions

Introduction to Pattern Matching

The most popular way to process text with software is through the use of regular expressions, a standard way to find patterns in one or more lines of text and replace them with new text.

Regular expressions, which also are called regex, are implemented in a wide variety of computer programming languages but are most strongly associated with the Perl scripting language. Most popular languages support them as a core part of the language or as an optional module or library, enabling programmers with regular expression skills to apply them across several languages.

This can be compared to the use of Structured Query Language (SQL), a method of querying and updating a database that has been implemented by numerous language and database vendors. SQL can be employed in Java, Visual C++, and many other development environments.

A regular expression is a series of characters and punctuation that describe a pattern that may be found in text. You can use this pattern to find something you’re looking for, extract some of the text, replace the text with something new, and similar tasks.

Sun’s support for regular expressions turns up in three places:

  • The java.util.regex package, which is composed of the Pattern and Matcher classes and the exception class PatternSyntaxException

  • The three classes in Java that represent a sequence of characters: String and StringBuffer in the java.lang package and CharBuffer in the java.nio package

  • The CharSequence interface in java.lang, an interface shared by String, StringBuffer, and CharBuffer

This is a comparatively small number of classes, interfaces, and methods, so putting regular expressions to work in Java should be pretty easy for anyone experienced with this kind of text handling, especially if you have used them with the Perl language. Java’s implementation of regular expressions is close to the implementation offered in Perl 5.

On the other hand, regular expressions have a complex and sophisticated syntax that can be challenging to learn; several computer programming books cover nothing but this topic.

Note

Sun’s official Java documentation recommends one book in particular: Mastering Regular Expressions, 2nd Edition, by Jeffrey E. F. Friedl (O’Reilly, ISBN 0-596-00289-0).

Though it’s beyond the scope of this chapter to offer a complete introductory tutorial on regular expressions, all the expressions used in upcoming code examples will be described in full. Even if you are completely new to the subject, you’ll learn some useful simple expressions that can accomplish common string-handling tasks.

Note

Regular expressions also are supported by two open source class libraries from the Apache Project: Jakarta-ORO and Jakarta-Regexp.

Unless there’s a reason you should only rely on the Java class library, the choice of a regular expressions package depends on the depth of support for expressions and the functionality of the classes in the package.

Sun’s java.util.regex package is considerably smaller than Jakarta-ORO, so it might be worth a look at that library to determine whether it addresses some of the particular problems you’re trying to solve with regular expressions.

For more information on these libraries, visit the Apache Project at http://jakarta.apache.org/oro and jakarta.apache.org/regexp.

The CharSequence Interface

As part of the support for regular expressions, the CharSequence interface is implemented by objects that represent a series of characters in a defined sequence. This interface, part of the java.lang package, is composed of four methods:

  • length()—Returns an int that equals the number of characters in the sequence

  • charAt(int)—Returns the char at the int position in the sequence, which could be anything from 0 (the first position) to one less than the length() (the last position)

  • subSequence(int, int)—Returns a CharSequence that holds a portion of the sequence, beginning at the first int position and ending 1 below the second int position

  • toString()—Returns a String containing the sequence

This interface is implemented by three classes: String, StringBuffer, and CharBuffer, one of the buffer classes in the java.nio package.
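
Because all three classes share the interface, a method that takes a CharSequence argument can work with any of them. The following is a minimal sketch of that idea; the CharSequenceDemo class and its countVowels() method are hypothetical names made up for this illustration.

```java
import java.nio.CharBuffer;

public class CharSequenceDemo {
    // Counts vowels using only length() and charAt(int), so the method
    // works with any class that implements CharSequence
    public static int countVowels(CharSequence seq) {
        int count = 0;
        for (int i = 0; i < seq.length(); i++) {
            if ("aeiouAEIOU".indexOf(seq.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] arguments) {
        CharSequence fromString = "regular";
        CharSequence fromBuffer = new StringBuffer("expression");
        CharSequence fromNio = CharBuffer.wrap("pattern");

        System.out.println(countVowels(fromString)); // 3
        System.out.println(countVowels(fromBuffer)); // 4
        System.out.println(countVowels(fromNio));    // 2
        // subSequence(2, 5) covers positions 2, 3, and 4
        System.out.println(fromString.subSequence(2, 5).toString()); // gul
    }
}
```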

Using Regular Expressions

Regular expressions in java.util.regex require the interaction of only two classes: Pattern and Matcher.

Here’s how they work: A Pattern object is created that holds a compiled version of a regular expression. A Matcher object is then created that can find text that matches the expression and take action as a result, such as retrieving, deleting, or replacing the text.
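
As a rough sketch of that two-step workflow, the following class compiles one pattern and uses Matcher objects to retrieve and replace text. The class name, method names, and sample text are invented for this example, and the find(), group(), and replaceAll() methods of Matcher are used here ahead of their description later in the chapter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WorkflowDemo {
    // Step 1: compile the regular expression once (one or more digits)
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // Step 2a: retrieve the first run of digits, or null if there is none
    public static String firstNumber(String text) {
        Matcher matcher = DIGITS.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    // Step 2b: replace every run of digits with a # character
    public static String maskNumbers(String text) {
        return DIGITS.matcher(text).replaceAll("#");
    }

    public static void main(String[] arguments) {
        String order = "Order 1138 ships in 3 days";
        System.out.println(firstNumber(order)); // 1138
        System.out.println(maskNumbers(order)); // Order # ships in # days
    }
}
```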

Looking for a Match

To compare an entire character sequence to a pattern, call the Pattern class method matches(String, CharSequence). The first argument is the pattern to look for, and the second is the source text. The method returns true only if the pattern matches the source text in its entirety and false otherwise.

The following statement looks for the pattern “[Aa]mazon.com” in a string called store:

boolean pm = Pattern.matches("[Aa]mazon.com", store);

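The pattern in this example matches the text “Amazon.com” or “amazon.com”, setting the pm variable to true for a match and false otherwise. Because an unescaped period in a regular expression stands for any single character, the pattern also matches text such as “Amazon-com”. To match only a literal period, write it as \. in the expression, which is \\. in a Java string literal.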

The String class has a matches(String) method that takes a pattern as its only argument, returning true if the entire string calling the method matches the pattern and false otherwise:

boolean pm = store.matches("[Aa]mazon.com");

These methods are suitable for times when you only are testing a pattern once and want to look for it in an entire character sequence. Both matches() methods use the default behavior for pattern matching in Java. You’ll see later how to efficiently conduct repeated checks for the same pattern and customize the behavior.
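
The whole-sequence behavior of both methods is easy to overlook, so here is a short sketch that tries the pattern against several strings. The class and method names are hypothetical, and the escaped-period variation in isStoreStrict() is an addition for illustration rather than part of the original example.

```java
import java.util.regex.Pattern;

public class WholeMatchDemo {
    // Uses the Pattern class method with the pattern from the text
    public static boolean isStore(String text) {
        return Pattern.matches("[Aa]mazon.com", text);
    }

    // Variation: \\. in the string literal matches only a literal period
    public static boolean isStoreStrict(String text) {
        return text.matches("[Aa]mazon\\.com");
    }

    public static void main(String[] arguments) {
        System.out.println(isStore("Amazon.com"));       // true
        System.out.println(isStore("amazon.com"));       // true
        // false: the text contains more than the pattern describes
        System.out.println(isStore("Visit Amazon.com"));
        // true: the unescaped period matches any character
        System.out.println(isStore("Amazon-com"));
        System.out.println(isStoreStrict("Amazon-com")); // false
    }
}
```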

Caution

Both matches() methods throw a PatternSyntaxException exception if the pattern’s not a valid regular expression. They also throw a NullPointerException if the pattern is null.

Though you aren’t required to catch these exceptions, it’s a good idea to at least deal with PatternSyntaxException problems because regular expressions are complex and easy to write incorrectly.

The first project you will create is PatternTester, a Swing application that takes text and a pattern as input and displays whether the pattern was found in the text. Enter the text of Listing 1 and save the file as PatternTester.java.

Example 1. The Full Text of PatternTester.java

1: import java.awt.*;
2: import java.awt.event.*;
3: import java.util.regex.*;
4: import javax.swing.*;
5:
6: public class PatternTester extends JFrame implements ActionListener {
7:     JTextArea text = new JTextArea(5, 29);
8:     JTextField pattern = new JTextField(35);
9:     JButton search = new JButton("Search");
10:    JButton newSearch = new JButton("New Search");
11:    JTextArea result = new JTextArea(5, 29);
12:
13:    public PatternTester() {
14:        super("Test Patterns");
15:        setSize(430, 320);
16:        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
17:        Container pane = getContentPane();
18:        GridLayout grid = new GridLayout(3, 1);
19:        pane.setLayout(grid);
20:        // set up the top row
21:        JLabel textLabel = new JLabel("Text: ");
22:        JPanel row1 = new JPanel();
23:        row1.add(textLabel);
24:        text.setLineWrap(true);
25:        text.setWrapStyleWord(true);
26:        JScrollPane scroll = new JScrollPane(text,
27:            ScrollPaneConstants.VERTICAL_SCROLLBAR_ALWAYS,
28:            ScrollPaneConstants.HORIZONTAL_SCROLLBAR_NEVER);
29:        row1.add(scroll);
30:        // set up the middle row
31:        JPanel row2 = new JPanel();
32:        JLabel patternLabel = new JLabel("Pattern: ");
33:        row2.add(patternLabel);
34:        row2.add(pattern);
35:        search.addActionListener(this);
36:        newSearch.addActionListener(this);
37:        row2.add(search);
38:        row2.add(newSearch);
39:        // set up the bottom row
40:        JPanel row3 = new JPanel();
41:        JLabel resultLabel = new JLabel("Result: ");
42:        row3.add(resultLabel);
43:        result.setEditable(false);
44:        JScrollPane scroll2 = new JScrollPane(result,
45:            ScrollPaneConstants.VERTICAL_SCROLLBAR_ALWAYS,
46:            ScrollPaneConstants.HORIZONTAL_SCROLLBAR_NEVER);
47:        row3.add(scroll2);
48:        // set up the content pane
49:        pane.add(row1);
50:        pane.add(row2);
51:        pane.add(row3);
52:        setContentPane(pane);
53:        setVisible(true);
54:    }
55:
56:    public void actionPerformed(ActionEvent evt) {
57:        Object source = evt.getSource();
58:        if (source == search) {
59:            checkPattern();
60:        } else {
61:            pattern.setText("");
62:            result.setText("");
63:        }
64:    }
65:
66:    private void checkPattern() {
67:        try {
68:            if (Pattern.matches(pattern.getText(), text.getText()))
69:                result.setText("That pattern was found");
70:            else
71:                result.setText("That pattern was not found");
72:        } catch (PatternSyntaxException pse) {
73:            result.setText("Regex error: " + pse.getMessage());
74:        }
75:    }
76:
77:    public static void main(String[] arguments) {
78:        PatternTester app = new PatternTester();
79:    }
80: }

Compile and run the PatternTester application to see the graphical user interface shown in Figure 1.

Figure 1. Testing Java’s pattern-matching features.

This application can be used to try out Java’s support for regular expressions. Enter a string in the Text field, a regular expression in the Pattern field, and click the Search button to see whether the pattern matches the text. Because the application relies on the matches() method, the pattern must match the entire contents of the Text field to be reported as found. Click the New Search button to try a different pattern.

The only new material in the application is the checkPattern() method (lines 66–75). The rest creates the graphical user interface and responds to user events.

The checkPattern() method calls the Pattern class method matches() with two arguments, the contents of the pattern and text components. The results of the attempted pattern match are displayed in the result text area.

Because the application takes a pattern as user input, the call to matches() is placed in a try-catch block that looks for PatternSyntaxException exceptions.

Splitting Strings with Patterns

The StringTokenizer class in the java.util package divides a string into smaller strings using a delimiter character such as a comma, slash (“/”), or backslash (“\”).

Regular expressions enable much more versatile string-division techniques using two split() methods in the String class. These methods use a regular expression as a delimiter instead of a character:

  • split(String)—Returns a String[] array containing substrings separated by the specified pattern

  • split(String, int)—Returns a String[] array containing substrings separated by the specified pattern with a maximum number of array elements (the second argument)
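
The difference between the two methods can be sketched as follows. The SplitLimitDemo class and its method names are hypothetical; the sample record and the delimiter pattern are borrowed from the DataSplitter project in this section.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Splits on any of three delimiters, with no limit on the pieces
    public static String[] allFields(String record) {
        return record.split("[-/%]");
    }

    // A limit of 2 splits off the first field and leaves the rest intact
    public static String[] headAndRest(String record) {
        return record.split("[-/%]", 2);
    }

    public static void main(String[] arguments) {
        String record = "110-4.25-Dec 09 2006-39.95";
        System.out.println(Arrays.toString(allFields(record)));
        // [110, 4.25, Dec 09 2006, 39.95]
        System.out.println(Arrays.toString(headAndRest(record)));
        // [110, 4.25-Dec 09 2006-39.95]
    }
}
```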

The DataSplitter application demonstrates the use of a split() method. The program looks at three lines of stock price data that use three different delimiters: the “/”, “-”, and “%” characters.

Enter the text of Listing 2 in your editor and save the file as DataSplitter.java.

Example 2. The Full Text of DataSplitter.java

1: import java.util.regex.*;
2:
3: public class DataSplitter {
4:     String[] input = { "320/10.50/Dec 09 2006/39.95",
5:         "110-4.25-Dec 09 2006-39.95",
6:         "8%54.00%Dec 8 2006%0" };
7:
8:     public DataSplitter() {
9:         for (int i = 0; i < input.length; i++) {
10:            String[] piece = input[i].split("[-/%]");
11:            for (int j = 0; j < piece.length; j++)
12:                System.out.print(piece[j] + "\t");
13:            System.out.print("\n");
14:         }
15:     }
16:
17:     public static void main(String[] arguments) {
18:         DataSplitter app = new DataSplitter();
19:     }
20: }

The DataSplitter application uses pattern matching in line 10 to subdivide each string in the input array. The pattern will be a match for any of the three characters “-”, “/”, or “%”.

When you compile and run the application, the output should be the following:

320   10.50   Dec 09 2006   39.95
110   4.25   Dec 09 2006   39.95
8   54.00   Dec 8 2006   0

Patterns

The Pattern class in the java.util.regex package represents regular expressions in Java, which are comparable but not identical to the implementation of regular expressions for other languages.

A Pattern object is a compiled regular expression that can be used repeatedly in much less time than repeated calls to a character sequence’s matches() method, which must compile the expression again on every call. It also can be set up to perform special searches, such as ignoring comments within the pattern or treating uppercase and lowercase characters the same.

There is no constructor method you can use in the Pattern class. To create and compile a Pattern object, call one of the following class methods:

  • compile(String)—Returns a Pattern object representing a compiled regular expression of the specified pattern text

  • compile(String, int)—Returns a Pattern object like the preceding method, but set up to search in a special way using one or more integer flags combined into a single value (the second argument)

The integers that can be specified with the second method are class variables of the Pattern class. They are used by combining them with the OR operator “|”, as in the following statement:

Pattern pt = Pattern.compile("[y]es",
    Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

This pattern treats uppercase and lowercase the same and ignores any whitespace and comments that appear within the pattern itself.

The following class variables can be used:

  • CANON_EQ—Enables canonical equivalence, a feature of Unicode character encoding that treats visually indistinguishable character sequences like they were identical, even if they were created using different characters

  • CASE_INSENSITIVE—Treats uppercase and lowercase characters the same for the purposes of determining a match. By default, this variable applies only to characters in the ASCII character set; combine it with UNICODE_CASE to cover other characters

  • COMMENTS—Ignores whitespace and any text on a line following the # character

  • DOTALL—Treats a line separator the same as other characters when the dot character (“.”) is used in a pattern

  • MULTILINE—Lets the beginning of line expression (“^”) and end of line expression (“$”) be triggered by the beginning and ending of individual lines, not just the beginning and end of the character sequence

  • UNICODE_CASE—When combined with CASE_INSENSITIVE, treats uppercase and lowercase characters in the Unicode character set the same for the purpose of determining matches

  • UNIX_LINES—Treats ‘\n’ as the only line terminator
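
To see what a few of these class variables do in practice, consider the following sketch. The FlagDemo class, its count() method, and the sample text are assumptions made for this illustration, and the find() method it relies on is described later in this chapter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagDemo {
    // Counts how many times the pattern is found in the text
    public static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int found = 0;
        while (matcher.find()) {
            found++;
        }
        return found;
    }

    public static void main(String[] arguments) {
        String text = "Yes\nyes\nYES";

        // No flags: ^ and $ apply to the whole text, and case matters
        Pattern plain = Pattern.compile("^yes$");
        System.out.println(count(plain, text)); // 0

        // Two flags combined with the | operator
        Pattern flagged = Pattern.compile("^yes$",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        System.out.println(count(flagged, text)); // 3

        // COMMENTS lets the pattern carry its own documentation
        Pattern commented = Pattern.compile(
            "[A-Z]   # an initial capital letter\n"
            + "[a-z]*  # any number of lowercase letters",
            Pattern.COMMENTS);
        System.out.println(commented.matcher("Presto").matches()); // true
    }
}
```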

The following statement creates a Pattern object holding a compiled regular expression:

Pattern name = Pattern.compile("[A-Z][a-z]*");

This regular expression gets a match on any single word that begins with an initial capital letter and is followed by nothing but lowercase letters. The words “Abracadabra” and “Presto” would match, but “alakazam” would not.

If the regular expression is not syntactically correct, a PatternSyntaxException is thrown by the compile() method.

Apart from the class method matches() described earlier, the Pattern class does not contain any behavior to compare a pattern to a string. A Pattern object simply holds the compiled regular expression for subsequent use. To conduct a search with it, you must use another class, Matcher.

Matches

The Matcher class in the java.util.regex package looks for a regular expression in text—any of the classes that implement the CharSequence interface.

To create a Matcher object associated with a pattern, call the pattern’s matcher(CharSequence) method with the text as the argument, as in the following example:

Pattern pattern = Pattern.compile("[a-e]");
Matcher looksee = pattern.matcher(userInput);

After you have a Matcher object, call one of three of its methods to look for a match:

  • Call matches() with no arguments to compare the pattern to the entire text. This returns true if the pattern matches the entire text or false otherwise.

  • Call lookingAt() with no arguments to compare the pattern to the start of the text. This returns true if the pattern is a match from the first character of the text until the last character described by the pattern, or false otherwise.

  • Call find() with no arguments to look for the first sequence in the text that matches the pattern; call it again to look for the next sequence that matches. This method returns true as long as the sequence continues to be found.
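
The three methods can give different answers for the same pattern and text. The following sketch makes the comparison side by side; the MatchMethodsDemo class, its tryAll() method, and the sample words are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MatchMethodsDemo {
    // Returns the results of matches(), lookingAt(), and find(), in that order
    public static boolean[] tryAll(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);
        // A fresh Matcher for each call keeps the three tests independent
        boolean whole = pattern.matcher(text).matches();
        boolean atStart = pattern.matcher(text).lookingAt();
        boolean anywhere = pattern.matcher(text).find();
        return new boolean[] { whole, atStart, anywhere };
    }

    public static void main(String[] arguments) {
        String regex = "[a-e]+"; // one or more letters from a through e
        System.out.println(Arrays.toString(tryAll(regex, "bead")));
        // [true, true, true]
        System.out.println(Arrays.toString(tryAll(regex, "beadwork")));
        // [false, true, true]
        System.out.println(Arrays.toString(tryAll(regex, "network")));
        // [false, false, true]
        System.out.println(Arrays.toString(tryAll(regex, "sky")));
        // [false, false, false]
    }
}
```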

A Matcher object can be reused by calling its reset() or reset(CharSequence) methods. Call the reset() method with no arguments to move back to the start of the character sequence associated with the pattern before the next call to find(). Call reset(CharSequence) with a character sequence argument to associate the Matcher object with different text.

After using one of these methods to find a pattern match, the Matcher object’s start() and end() methods return integers that indicate the position of the match: start() is the index of the first matched character, and end() is the index one past the last matched character. These values can be used with a string’s substring() method to retrieve the matched text.

Note

This task is so common that the Matcher object includes a shortcut. After a match is found, call the object’s group() method to return a String containing the text that successfully matched the pattern.
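
The following sketch puts start(), end(), group(), and both reset() methods together. The PositionDemo class, its describe() method, and the sample text are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositionDemo {
    // Lists each match as start-end:text, separated by spaces
    public static String describe(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringBuilder report = new StringBuilder();
        while (matcher.find()) {
            int start = matcher.start(); // index of the first matched character
            int end = matcher.end();     // index one past the last matched character
            if (report.length() > 0) {
                report.append(' ');
            }
            // matcher.group() would return the same text as this substring
            report.append(start).append('-').append(end)
                .append(':').append(text.substring(start, end));
        }
        return report.toString();
    }

    public static void main(String[] arguments) {
        System.out.println(describe("[0-9]+", "Dec 09 2006"));
        // 4-6:09 7-11:2006

        Matcher matcher = Pattern.compile("[0-9]+").matcher("Dec 09 2006");
        matcher.find();
        System.out.println(matcher.group()); // 09
        matcher.reset();                     // back to the start of the same text
        matcher.find();
        System.out.println(matcher.group()); // 09
        matcher.reset("Jan 15");             // same pattern, different text
        matcher.find();
        System.out.println(matcher.group()); // 15
    }
}
```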

The WordSplitter application in Listing 3 demonstrates the use of the find() and group() methods. Enter the text of the listing with your text editor and save the result as WordSplitter.java.

Example 3. The Full Text of WordSplitter.java

1: import java.util.regex.*;
2:
3: public class WordSplitter {
4:
5:     public static void main(String[] arguments) {
6:         Pattern pattern = Pattern.compile("\\S+");
7:         Matcher matcher = pattern.matcher(arguments[0]);
8:         while (matcher.find())
9:             System.out.println("[" + matcher.group() + "]");
10:    }
11: }

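This application uses pattern matching to find and display each word in user-submitted text. The pattern in line 6 is the regular expression \S+, which looks for one or more characters in a row that are not whitespace. Because the backslash is also the escape character in Java string literals, it must be doubled when the expression is written in Java source code.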

To run it, specify some text as a command-line argument, using quotation marks around the argument. Here’s an example for users working with the JDK:

java WordSplitter "The rain in Spain falls mainly on the plain"

This would produce the following output:

[The]
[rain]
[in]
[Spain]
[falls]
[mainly]
[on]
[the]
[plain]

One of the most powerful features of regular expressions is a capturing group, a subexpression found within the larger pattern.

The start(int), end(int), and group(int) methods are used to work with these groups, each of which is enclosed in parentheses within the pattern. The integer argument to each method is the position of the subexpression relative to other subexpressions in the pattern. Groups are numbered from left to right by their opening parentheses, beginning with 1, so if a regular expression contains only one capturing group, call start(1) to find its starting position in the text. Group 0 always refers to the entire match.
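
As a small sketch of group numbering, the following class pulls a date out of one of the stock records used earlier in this chapter. The GroupDemo class, its methods, and the date pattern are hypothetical and were written for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Three capturing groups: (1) month name, (2) day, (3) four-digit year
    private static final Pattern DATE =
        Pattern.compile("([A-Z][a-z]+) ([0-9]+) ([0-9]{4})");

    // Returns the text of group 3, or null if no date is found
    public static String year(String text) {
        Matcher matcher = DATE.matcher(text);
        return matcher.find() ? matcher.group(3) : null;
    }

    // Returns group 1 with its position, in the form text@start-end
    public static String monthPosition(String text) {
        Matcher matcher = DATE.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group(1) + "@" + matcher.start(1) + "-" + matcher.end(1);
    }

    public static void main(String[] arguments) {
        String record = "320/10.50/Dec 09 2006/39.95";
        System.out.println(year(record));          // 2006
        System.out.println(monthPosition(record)); // Dec@10-13
    }
}
```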

The final project uses regular expressions to find and display hyperlinks contained within a web page. Enter the text of Listing 4 in your Java editor and save the file as LinkExtractor.java.

Example 4. The Full Text of LinkExtractor.java

import java.io.*;
import java.util.regex.*;

public class LinkExtractor {
    public static void main(String[] arguments) {
        if (arguments.length < 1) {
            System.out.println("Usage: java LinkExtractor [page]");
            System.exit(0);
        }
        String page = loadPage(arguments[0]);
        // the capturing group holds the text between the quotation marks
        // of an href attribute; each \" is a quotation mark in the pattern
        Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");
        Matcher matcher = pattern.matcher(page);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }

    // read the named file into a single string, one line at a time
    private static String loadPage(String name) {
        StringBuffer output = new StringBuffer();
        try {
            FileReader file = new FileReader(name);
            BufferedReader buff = new BufferedReader(file);
            boolean eof = false;
            while (!eof) {
                String line = buff.readLine();
                if (line == null)
                    eof = true;
                else
                    output.append(line + "\n");
            }
            buff.close();
        } catch (IOException e) {
            System.out.println("Error -- " + e.toString());
        }
        return output.toString();
    }
}

After compiling the file, you should save a web page to the same folder that contains LinkExtractor.class, so you have something to search. To save a page from within a web browser, choose Page, Save As in Internet Explorer 7 or File, Save Page As in Mozilla Firefox.

Run the application with the filename of the page as an argument, as in this JDK example:

java LinkExtractor java-home-page.html

Here’s some example output from the Java home page at http://www.java.com:

/
/en/
http://www.sun.com/
/en/download/index.jsp
/en/selectlanguage.jsp
/en/dukeszone/
/en/games/
/en/mobile/
/en/desktop/
/en/desktop/meez.jsp
/en/games/desktop/crystalsolitaire.jsp
/en/levelup/
http://javawear.brandvia.com
https://subscriptions.sun.com/javacom/alert.html
http://www.sun.com/share/text/termsofuse.html
http://www.sun.com/suntrademarks/
/en/download/license.jsp
/en/about/disclaimer.jsp

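The regular expression looks for the text <a, followed by one or more characters, followed by href= and a quotation mark. The capturing group (.+?) then collects as few characters as possible before the next quotation mark, which leaves the address of the link in group 1.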
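
To experiment with the expression without saving a web page, you can run it against strings in memory. The following sketch does that with made-up HTML fragments; the LinkPatternDemo class and its links() method are hypothetical. It also shows one consequence of the greedy .+ after <a: when two links share a line, only the last address on that line is reported. The NARROW pattern is an assumed variation added here for comparison, not part of LinkExtractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkPatternDemo {
    // The pattern used by LinkExtractor
    public static final Pattern LINK =
        Pattern.compile("<a.+href=\"(.+?)\"");
    // Assumed variation: [^>]+ cannot run past the end of the opening tag
    public static final Pattern NARROW =
        Pattern.compile("<a[^>]+href=\"(.+?)\"");

    // Collects the text of group 1 for every match in the HTML
    public static List<String> links(Pattern pattern, String html) {
        List<String> found = new ArrayList<String>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] arguments) {
        String twoLines = "<a class=\"nav\" href=\"/en/\">Home</a>\n"
            + "<a href=\"/en/games/\">Games</a>";
        String oneLine = "<a href=\"/one\">1</a> <a href=\"/two\">2</a>";
        System.out.println(links(LINK, twoLines));  // [/en/, /en/games/]
        System.out.println(links(LINK, oneLine));   // [/two]
        System.out.println(links(NARROW, oneLine)); // [/one, /two]
    }
}
```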

Summary

The java.util.regex package offers comprehensive support for regular expressions.

As you have seen, regular expressions are a much more capable technique for text processing than anything else in the Java class library.

Regular expressions, a means of finding complex patterns in text so that they can be deleted, replaced, or retrieved, are available for many programming languages.

One thing to keep in mind about them is that there are differences between implementations. A Perl programmer who is an expert at writing expressions might find that some things work differently in java.util.regex than expected, though most of the core functionality is comparable to Perl 5.

As you learn more about writing regular expressions, you’ll be able to put them to use with the same pattern creation and matching techniques covered in this chapter.
