Regular Expressions

A regular expression (or regex for short) is a special type of pattern-matching string that can be very useful for programs that do string manipulation. Regular expression strings contain special pattern-matching characters that can be matched against another string to see whether the other string fits the pattern. Regular expressions are very handy for doing complex data validation, such as making sure that users enter properly formatted phone numbers, e-mail addresses, or Social Security numbers.

Regular expressions are also useful for many other purposes, including searching text files to see whether they contain certain patterns, filtering e-mail based on its contents, or performing complicated search-and-replace functions.

This section presents important reference information for forming regular expressions. For information about using regular expressions in a Java program, see Pattern Class and Matcher Class.

Remember: Many regex patterns use the backslash as an escape symbol. Unfortunately, the backslash is also an escape symbol in Java. Thus, to create a Java string that has a regex pattern that contains a backslash, you must code two consecutive backslashes in your Java program. For example, suppose you want to create a string variable named pattern and assign it the regex pattern \w. To do that, you would write code similar to the following:

String pattern = "\\w";

Here, the two backslashes in the string are converted to a single backslash by the Java compiler.
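The double-backslash rule can be checked with a minimal sketch like the one below. The class name EscapeDemo and the helper method are made up for this illustration; only String.length and String.matches come from the standard library.

```java
public class EscapeDemo {
    // The source text "\\w" produces a two-character string:
    // one backslash followed by the letter w.
    static final String PATTERN = "\\w";

    // Returns true if s is exactly one word character.
    static boolean isSingleWordChar(String s) {
        return s.matches(PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(PATTERN.length());       // 2
        System.out.println(isSingleWordChar("a"));  // true
        System.out.println(isSingleWordChar("-"));  // false
    }
}
```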

Matching single characters

The simplest regex patterns match a string literal exactly. For example, the regex string abc matches the test string abc but not the string abcd.
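As a small illustration, the following sketch tests the literal pattern with String.matches, which succeeds only when the entire string fits the pattern. The class and method names are hypothetical.

```java
public class LiteralMatchDemo {
    // A pattern with no special characters matches only that exact text.
    static boolean isAbc(String s) {
        return s.matches("abc");
    }

    public static void main(String[] args) {
        System.out.println(isAbc("abc"));   // true
        System.out.println(isAbc("abcd"));  // false: one character too many
    }
}
```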

Using predefined character classes

A character class represents a particular type of character rather than a specific character. A regex pattern lets you use two types of character classes: predefined classes and custom classes.

Regex Character Class    Matches
.                        Any character (single-character wildcard)
\d                       Any digit (0–9)
\D                       Any nondigit (anything other than 0–9)
\s                       Any white-space character (space, Tab, new line, Return, form feed, or vertical tab)
\S                       Any character other than a white-space character
\w                       Any word character (a–z, A–Z, 0–9, or an underscore)
\W                       Any character other than a word character

The period is like a wildcard that matches any single character. For example, the regex c.t matches the strings cat and cot but not the string cart, because cart is four characters long and the pattern allows only three.

The \d class represents a digit and is often used in regex patterns to validate input data. Here’s a simple regex pattern that validates a U.S. Social Security number, which must be entered in the form xxx-xx-xxxx:

\d\d\d-\d\d-\d\d\d\d

This regex matches the string 779-54-3994 but not 550-403-004, because the middle group has three digits instead of two and the last group has three digits instead of four.
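The Social Security number check might look like the following in Java. This is only a sketch: the class name SsnDemo and the method isSsn are assumptions, and each regex backslash is doubled because the pattern is written as a Java string literal.

```java
public class SsnDemo {
    // Regex \d\d\d-\d\d-\d\d\d\d, written as a Java string literal.
    static final String SSN = "\\d\\d\\d-\\d\\d-\\d\\d\\d\\d";

    static boolean isSsn(String s) {
        return s.matches(SSN);
    }

    public static void main(String[] args) {
        System.out.println(isSsn("779-54-3994"));  // true
        System.out.println(isSsn("550-403-004"));  // false
    }
}
```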

The \d class has a counterpart: \D. The \D class matches any character that is not a digit. Here’s a regex for validating the names of droids found in a familiar science fiction universe:

\D\d-\D\d

Here, the pattern matches strings that begin with a character that isn’t a digit, followed by a character that is a digit, followed by a hyphen, followed by another nondigit character, and ending with a digit. Thus, R2-D2 and C3-P0 match.
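A quick way to try the droid-name pattern is sketched below, assuming a hypothetical class DroidNameDemo. The third test string is an extra example that fails because its first character is a digit.

```java
public class DroidNameDemo {
    // Regex \D\d-\D\d: nondigit, digit, hyphen, nondigit, digit.
    static boolean isDroidName(String s) {
        return s.matches("\\D\\d-\\D\\d");
    }

    public static void main(String[] args) {
        System.out.println(isDroidName("R2-D2"));  // true
        System.out.println(isDroidName("C3-P0"));  // true
        System.out.println(isDroidName("22-D2"));  // false: starts with a digit
    }
}
```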

The \s class matches white-space characters, including spaces, Tabs, new lines, Returns, and form feeds. This class is useful when you want to allow the user to separate parts of a string in various ways.

For example, the pattern ...\s... specifies that the string must be two groups of any three characters separated by one white-space character. The groups can be separated by a space, as in abc def, or by a Tab. The \s class also has a counterpart: \S. It matches any character that isn’t a white-space character.
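The sketch below assumes the pattern ...\s..., which fits the description of two groups of any three characters separated by one white-space character. The class name WhitespaceDemo and the test strings are illustrative choices.

```java
public class WhitespaceDemo {
    // Regex ...\s... : three characters, one white-space character, three characters.
    static boolean isTwoGroups(String s) {
        return s.matches("...\\s...");
    }

    public static void main(String[] args) {
        System.out.println(isTwoGroups("abc def"));   // true: separated by a space
        System.out.println(isTwoGroups("abc\tdef"));  // true: separated by a Tab
        System.out.println(isTwoGroups("abcdef"));    // false: no white space
    }
}
```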

The last set of predefined classes is \w and \W. The \w class identifies any character that’s typically used in words, including uppercase and lowercase letters, digits, and underscores. The \W class matches any character that isn’t a word character. For example:

\w\w\w\W\w\w\w

Here, the pattern calls for two groups of three word characters separated by a nonword character. This regex matches the strings abc def and 123 456 but not the string abcd123, because the letter d is a word character where the regex is looking for a nonword character.
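The word and nonword classes can be exercised with a short sketch such as this one. The class name WordClassDemo is hypothetical, and the test strings are the ones discussed above.

```java
public class WordClassDemo {
    // Regex \w\w\w\W\w\w\w : three word characters, one nonword character,
    // then three more word characters.
    static boolean isTwoWords(String s) {
        return s.matches("\\w\\w\\w\\W\\w\\w\\w");
    }

    public static void main(String[] args) {
        System.out.println(isTwoWords("abc def"));  // true
        System.out.println(isTwoWords("123 456"));  // true
        System.out.println(isTwoWords("abcd123"));  // false: d is a word character
    }
}
```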

Using custom character classes

To create a custom character class, you simply list all the characters that you want to include in the class within a set of brackets. Here’s an example:

b[aeiou]t

Here, the pattern specifies that the string must start with the letter b, followed by a class that can include a, e, i, o, or u, followed by t. In other words, it accepts three-letter words that begin with b, end with t, and have a vowel in the middle (bat, bet, bit, bot, and but).

If you want to let the pattern include uppercase letters as well as lowercase letters, you have to list them both:

b[aAeEiIoOuU]t

This example allows the vowel in the middle to be uppercase or lowercase.

You can use as many custom classes in a pattern as you want. Here’s an example that defines classes for the first and last characters so that they, too, can be uppercase or lowercase:

[bB][aAeEiIoOuU][tT]

This pattern specifies three character classes. The first can be b or B, the second can be any uppercase or lowercase vowel, and the third can be t or T.
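A sketch of the three-class pattern in Java follows. The class name CustomClassDemo and the sample words are assumptions chosen for illustration. Brackets need no backslashes, so the Java string is identical to the regex.

```java
public class CustomClassDemo {
    // Each bracketed class matches exactly one character from its list.
    static boolean isBVowelT(String s) {
        return s.matches("[bB][aAeEiIoOuU][tT]");
    }

    public static void main(String[] args) {
        System.out.println(isBVowelT("bat"));  // true
        System.out.println(isBVowelT("BiT"));  // true
        System.out.println(isBVowelT("bxt"));  // false: x is not in the vowel class
    }
}
```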

Using ranges

Custom character classes can also specify ranges of letters and numbers, as in this regex:

[a-z][0-5]

Here, the string must be exactly two characters long. The first must be a lowercase letter from a to z, and the second must be a digit from 0 to 5. Thus, the strings a5 and m3 match, but z9 and 6a do not.
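The range pattern can be tried with a sketch like this one (the class name RangeDemo is hypothetical):

```java
public class RangeDemo {
    // [a-z] matches one lowercase letter; [0-5] matches one digit from 0 to 5.
    static boolean isLetterThenLowDigit(String s) {
        return s.matches("[a-z][0-5]");
    }

    public static void main(String[] args) {
        System.out.println(isLetterThenLowDigit("a5"));  // true
        System.out.println(isLetterThenLowDigit("m3"));  // true
        System.out.println(isLetterThenLowDigit("z9"));  // false: 9 is outside 0-5
        System.out.println(isLetterThenLowDigit("6a"));  // false: wrong order
    }
}
```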

Using negation

Regular expressions can include classes that match any character except the ones listed for the class. To do that, you start the class with a caret, as in this pattern:

[^cf]at

Here, the string must be three characters long and end in at, and its first character can be anything except c or f. Thus, bat and hat match, but cat and fat do not.
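A minimal sketch of the negated class follows, with a hypothetical class name NegationDemo. Note that the negated class accepts any character other than c or f, not just letters, which the last call shows.

```java
public class NegationDemo {
    // [^cf] matches any single character except c or f.
    static boolean matchesNotCatOrFat(String s) {
        return s.matches("[^cf]at");
    }

    public static void main(String[] args) {
        System.out.println(matchesNotCatOrFat("bat"));  // true
        System.out.println(matchesNotCatOrFat("cat"));  // false
        System.out.println(matchesNotCatOrFat("fat"));  // false
        System.out.println(matchesNotCatOrFat("9at"));  // true: 9 is neither c nor f
    }
}
```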

Using quantifiers

Quantifiers let you create patterns that match a variable number of characters at a certain position in the string.

Regex Quantifier    Matches the Preceding Element
?                   Zero times or one time
*                   Zero or more times
+                   One or more times
{n}                 Exactly n times
{n,}                At least n times
{n,m}               At least n times but no more than m times

To use a quantifier, you code it immediately after the element you want it to apply to. Here’s a version of the Social Security number pattern that uses quantifiers:

\d{3}-\d{2}-\d{4}

The preceding pattern matches three digits, followed by a hyphen, followed by two digits, followed by another hyphen, followed by four digits.

Tip: Simply duplicating elements rather than using a quantifier works as well. For example, \d\d is equivalent to \d{2}.

The ? quantifier lets you create an optional element that may or may not be present in the string. Suppose that you want to allow the user to enter Social Security numbers without the hyphens. You could use this pattern:

\d{3}-?\d{2}-?\d{4}

The preceding pattern matches 779-48-9955, 779489955, and 779-489955.
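The optional-hyphen pattern might be tested as follows. This is a sketch with a hypothetical class name, QuantifierDemo, and the regex \d{3}-?\d{2}-?\d{4} written as a Java string literal.

```java
public class QuantifierDemo {
    // {n} repeats the preceding \d exactly n times; -? makes each hyphen optional.
    static final String SSN = "\\d{3}-?\\d{2}-?\\d{4}";

    static boolean isSsn(String s) {
        return s.matches(SSN);
    }

    public static void main(String[] args) {
        System.out.println(isSsn("779-48-9955"));  // true
        System.out.println(isSsn("779489955"));    // true
        System.out.println(isSsn("779-489955"));   // true
        System.out.println(isSsn("77-948-9955"));  // false: first group has two digits
    }
}
```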

Using escapes

In regular expressions, certain characters have special meaning. What if you want to search for one of those special characters? In that case, you escape the character by preceding it with a backslash. For example:

\(\d{3}\) \d{3}-\d{4}

The preceding pattern matches (559) 555-1234 but not 559 555-1234 because the escape sequences \( and \) require the opening and closing parentheses around the area code of the phone number.

Here are a few additional points to ponder about escapes:

• Tip: Strictly speaking, you need to use the backslash escape only for characters that have special meanings in regular expressions. I recommend, however, that you escape any punctuation character or symbol, just to be sure.

• You can’t escape alphabetic characters (letters) because a backslash followed by certain alphabetic characters represents a character, a class, or some other regex element.

• To escape a backslash, code two backslashes in a row. The regex \d\d\\\d\d, for example, matches strings made up of two digits followed by a backslash and two more digits, such as 23\88 and 95\55.
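The escapes above become harder to read in Java source because every regex backslash must itself be doubled. The following sketch, using a hypothetical class EscapeCharsDemo, shows both the phone-number pattern and the escaped-backslash pattern as Java string literals.

```java
public class EscapeCharsDemo {
    // Regex \(\d{3}\) \d{3}-\d{4} : literal parentheses around the area code.
    static final String PHONE = "\\(\\d{3}\\) \\d{3}-\\d{4}";

    // Regex \d\d\\\d\d : the regex \\ (one literal backslash) takes four
    // backslashes in Java source.
    static final String DIGITS_BACKSLASH_DIGITS = "\\d\\d\\\\\\d\\d";

    public static void main(String[] args) {
        System.out.println("(559) 555-1234".matches(PHONE));           // true
        System.out.println("559 555-1234".matches(PHONE));             // false
        // The Java literal "23\\88" is the five-character string 23\88.
        System.out.println("23\\88".matches(DIGITS_BACKSLASH_DIGITS)); // true
        System.out.println("2388".matches(DIGITS_BACKSLASH_DIGITS));   // false
    }
}
```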

Using parentheses

You can use parentheses to create groups of characters to apply other regex elements to. For example, the regex pattern (bla)+ matches any string that consists of one or more sequences of the characters bla. Thus, the strings bla, blabla, and blablabla all match the pattern.

Here, the parentheses treat bla as a group, so the + quantifier applies to the entire sequence. Thus, this pattern looks for one or more occurrences of the sequence bla.

Here’s an example that finds U.S. phone numbers that can have an optional area code:

(\(\d{3}\)\s?)?\d{3}-\d{4}

This pattern matches the strings 555-1234 and (559)555-1239.
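Both grouping examples can be checked with a sketch like the one below. The class name GroupingDemo is hypothetical. The phone pattern is assumed to be (\(\d{3}\)\s?)?\d{3}-\d{4}, where the outer, unescaped parentheses form the optional group and the escaped ones match literal parentheses.

```java
public class GroupingDemo {
    // The + applies to the whole group, so the string is bla repeated.
    static final String BLA = "(bla)+";

    // Optional group: a parenthesized area code plus optional white space.
    static final String PHONE = "(\\(\\d{3}\\)\\s?)?\\d{3}-\\d{4}";

    public static void main(String[] args) {
        System.out.println("blabla".matches(BLA));            // true
        System.out.println("blab".matches(BLA));              // false
        System.out.println("555-1234".matches(PHONE));        // true
        System.out.println("(559)555-1239".matches(PHONE));   // true
        System.out.println("(559) 555-1239".matches(PHONE));  // true
        System.out.println("559-555-1239".matches(PHONE));    // false
    }
}
```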

Using capture groups

When you mark a group of characters with parentheses, the text that matches that group is captured so that you can use it later in the pattern. The groups that are captured are “capture groups” and are numbered, beginning with 1. Then you can use a backslash followed by the capture-group number to indicate that the text must match the text that was captured for the specified capture group.

Suppose that droids named following the pattern \w\d-\w\d must have the same digit in the second and fifth characters. In other words, r2-d2 and b9-k9 are valid droid names, but r2-d4 and d3-r4 are not.

Here’s an example of a regex pattern that can validate that type of name:

\w(\d)-\w\1

Here, \1 refers to the first capture group. Thus, the last character in the string must be the same as the second character, which must be a digit.
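A sketch of the backreference in Java follows, assuming the pattern \w(\d)-\w\1 and a hypothetical class name BackreferenceDemo. In the Java string, the backreference is written as \\1.

```java
public class BackreferenceDemo {
    // (\d) captures the first digit; \1 requires the same digit again at the end.
    static boolean isMatchedDroidName(String s) {
        return s.matches("\\w(\\d)-\\w\\1");
    }

    public static void main(String[] args) {
        System.out.println(isMatchedDroidName("r2-d2"));  // true
        System.out.println(isMatchedDroidName("b9-k9"));  // true
        System.out.println(isMatchedDroidName("r2-d4"));  // false: 2 and 4 differ
        System.out.println(isMatchedDroidName("d3-r4"));  // false: 3 and 4 differ
    }
}
```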

Using the vertical bar symbol

The vertical bar (|) symbol defines an or operation, which lets you create patterns that accept any of two or more variations. Here’s another version of a pattern for validating droid names:

(\w\d-\w\d)|(\w-\d\w\d)

This pattern matches the strings r2-d2 and c-3p0. In other words, it allows the hyphen to be before or after the first digit.
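The alternation can be tried with the following sketch. The class name AlternationDemo is hypothetical, and the pattern is assumed to be (\w\d-\w\d)|(\w-\d\w\d).

```java
public class AlternationDemo {
    // Either alternative may match: hyphen after the first digit, or before it.
    static boolean isDroidName(String s) {
        return s.matches("(\\w\\d-\\w\\d)|(\\w-\\d\\w\\d)");
    }

    public static void main(String[] args) {
        System.out.println(isDroidName("r2-d2"));  // true: first alternative
        System.out.println(isDroidName("c-3p0"));  // true: second alternative
        System.out.println(isDroidName("r2d2"));   // false: no hyphen at all
    }
}
```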
