Matching Groups of Characters

So far, so good? The regular expressions we've built up to this point shouldn't strike you as all that complex, particularly if you read each pattern the way Perl does: character by character and alternate by alternate, taking grouping into account. Now we're going to look at some of the shortcuts that regular expressions provide for describing and grouping various kinds of characters.

Character Classes

Say you had a string, and you wanted to match one of five words in that string: pet, get, met, set, and bet. You could do this:

/pet|get|met|set|bet/

That would work. At each position in the string, Perl would try pet, then get, then met, and so on until one of them matched. A shorter way, both in the number of characters you type and in the work Perl does, is to group the characters so that the et part isn't duplicated each time:

/(p|g|m|s|b)et/

In this case, Perl searches through the entire string for p, g, m, s, or b, and if it finds one of those, it'll try to match et just after it. Much more efficient!

This sort of pattern, where you have lots of alternates of single characters, is such a common case that there's regular expression syntax for it. The set of alternating characters is called a character class, and you enclose it inside brackets. So, for example, that same pet/get/met pattern would look like this using a character class:

/[pgmsb]et/

That saves at least a couple of characters, and it's even slightly easier to read. Perl treats the class much as it treats the alternation: it looks for any one of the characters inside the character class before testing the characters that follow it.
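As a quick check that the three forms really are interchangeable, here is a minimal sketch that runs each of them over a handful of words. The word list and the variable names are invented for this illustration.

```perl
use strict;
use warnings;

# Sample words are invented for this illustration.
my @words = qw(pet get met set bet jet et);

# The three ways of writing the pattern shown in the text.
my %patterns = (
    alternation => qr/pet|get|met|set|bet/,
    grouping    => qr/(p|g|m|s|b)et/,
    class       => qr/[pgmsb]et/,
);

my %matched;
for my $name (sort keys %patterns) {
    # Collect every word that the pattern matches.
    $matched{$name} = [ grep { $_ =~ $patterns{$name} } @words ];
    print "$name: @{ $matched{$name} }\n";
}
```

All three patterns should pick out the same five words and skip jet and et.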

The rules for the characters that can appear inside a character class are different from the rules outside one: most of the metacharacters become plain ordinary characters inside a character class. The exceptions are the right bracket, which needs to be escaped for obvious reasons; the backslash; the caret (^), which has a special meaning when it appears first; and the hyphen, which has a special meaning inside a character class. So, for example, a pattern to match punctuation at the end of a sentence (punctuation followed by two spaces) might look like this:

/[.!?]  /

Although . and ? have special meanings outside the character class, here they're plain old characters.
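To see the difference, the following sketch tries the sentence-ending pattern on a few strings and then compares a dot outside a character class with a dot inside one. The sample strings and variable names are made up for this example.

```perl
use strict;
use warnings;

# Sample sentences are invented for this illustration.
my @samples = (
    "Stop!  Go back.",        # "!" followed by two spaces
    "Really?  Yes.",          # "?" followed by two spaces
    "One space. Only.",       # "." followed by a single space
    "No punctuation  here",   # two spaces, but no . ! or ? before them
);

# Keep only the strings where . ! or ? is followed by two spaces.
my @hits = grep { /[.!?]  / } @samples;
print "$_\n" for @hits;

# Outside a class the dot matches any character; inside, only a literal dot.
my $dot_outside = "axb" =~ /a.b/   ? 1 : 0;
my $dot_inside  = "axb" =~ /a[.]b/ ? 1 : 0;
my $dot_literal = "a.b" =~ /a[.]b/ ? 1 : 0;
print "outside: $dot_outside, inside: $dot_inside, literal: $dot_literal\n";
```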

Ranges

What if you wanted to match, say, all the lowercase characters a through f (as you might in a hexadecimal number, for example)? You could do this:

/[abcdef]/

Looks like a job for a range, doesn't it? You can do ranges inside character classes, but you don't use the range operator .. that you learned about on Day 4. Regular expressions use a hyphen for ranges instead (which is why you have to backslash it if you actually want to match a hyphen). So, for example, lowercase a through f looks like this:

/[a-f]/

You can use any range of numbers or characters, as in /[0-9]/, /[a-z]/, or /[A-Z]/. You can even combine them: /[0-9a-z]/ will match the same thing as /[0123456789abcdefghijklmnopqrstuvwxyz]/.
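Here is a small sketch of ranges at work. It assumes a hypothetical helper, is_lower_hex, that accepts exactly two lowercase hexadecimal digits, and it also shows a backslashed hyphen being treated as a plain character.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text is exactly two lowercase hex digits.
sub is_lower_hex {
    my ($text) = @_;
    return $text =~ /^[0-9a-f][0-9a-f]$/ ? 1 : 0;
}

for my $text (qw(3f a0 g1 3F)) {
    print "$text: ", is_lower_hex($text), "\n";
}

# A backslashed hyphen is a literal hyphen, not a range.
my $range_matches_c   = "c" =~ /[a-f]/  ? 1 : 0;   # c lies between a and f
my $literal_matches_c = "c" =~ /[a\-f]/ ? 1 : 0;   # class is only a, -, f
my $literal_matches_h = "-" =~ /[a\-f]/ ? 1 : 0;   # the hyphen itself
```

Note that the range is case sensitive: 3F is rejected because F is not between a and f.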

Negated Character Classes

Brackets define a class of characters to match in a pattern. You can also define a set of characters not to match using negated character classes—just make sure the first character in your character class is a caret (^). So, for example, to match anything that isn't an A or a B, use

/[^AB]/

Note that the caret inside a character class is not the same as the caret outside one. The former is used to create a negated character class, and the latter is used to mean the beginning of a line.

If you want to actually search for the caret character inside a character class, you're welcome to—just make sure it's not the first character or escape it (it might be best just to escape it either way to cut down on the rules you have to keep track of):

/[\^?.%]/   # search for ^, ?, ., %

You will most likely end up using a lot of negated character classes in your regular expressions, so keep this syntax in mind. Note one subtlety: Negated character classes don't negate the entire value of the pattern. If /[12]/ means “return true if the data contains 1 or 2,” /[^12]/ does not mean “return true if the data doesn't contain 1 or 2.” If that were the case, you'd get a match even if the string in question was empty. What negated character classes really mean is “match any character that's not these characters.” There must be at least one actual character to match for a negated character class to work.
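The subtlety is easier to see with a few test strings. This sketch uses a hypothetical helper, has_other_char, which simply reports whether /[^12]/ matches; the sample strings are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains a character other than 1 or 2.
sub has_other_char {
    my ($text) = @_;
    return $text =~ /[^12]/ ? 1 : 0;
}

# The empty string and strings made only of 1s and 2s give 0;
# any string with at least one other character gives 1.
for my $text ("", "12", "1221", "123", "abc") {
    printf "%-8s => %d\n", "'$text'", has_other_char($text);
}
```

The empty string fails to match even though it contains neither 1 nor 2, because there is no character for the negated class to match.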

Special Classes

If character class ranges are still too much for you to type, you can use the character classes that were introduced in Chapter 5. You'll see these a lot in regular expressions, particularly those that match numbers in specific formats. Note that these special codes don't need to be enclosed between brackets; you can use them all by themselves to refer to that class of characters.

Table 9.2 shows the list of special character class codes:

Table 9.2. Character Class Codes
Code   Equivalent Character Class   What It Means
\d     [0-9]                        Any digit
\D     [^0-9]                       Any character not a digit
\w     [0-9a-zA-Z_]                 Any “word character”
\W     [^0-9a-zA-Z_]                Any character not a word character
\s     [ \t\n\r\f]                  Whitespace (space, tab, newline, carriage return, form feed)
\S     [^ \t\n\r\f]                 Any nonwhitespace character

Word characters (\w and \W) are a bit mystifying: why is an underscore considered a word character, but punctuation isn't? In reality, word characters have little to do with words; they are the valid characters you can use in variable names: digits, letters, and underscores. Any other characters are not considered word characters.

You can use these character codes anywhere you need a specific type of character. For example, the \d code refers to any digit. With \d, you could create patterns that match any three digits, /\d\d\d/, or, perhaps, any three digits, a dash, and any four digits, to represent a phone number such as 555-1212: /\d\d\d-\d\d\d\d/. All this repetition isn't necessarily the best way to go, however, as you'll learn in a bit when we cover quantifiers.
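As a rough sketch of the phone number pattern in use, the following assumes a hypothetical helper named looks_like_phone and a few invented sample strings. The helper only checks that the pattern appears somewhere in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains three digits, a dash, and four digits.
sub looks_like_phone {
    my ($text) = @_;
    return $text =~ /\d\d\d-\d\d\d\d/ ? 1 : 0;
}

for my $text ("Call 555-1212 now", "555-121", "5551212", "abc-defg") {
    print "$text: ", looks_like_phone($text), "\n";
}

# For plain ASCII digits like these, [0-9] gives the same result as \d.
my $with_brackets = "555-1212" =~ /[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/ ? 1 : 0;
```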

Matching Any Character with . (Dot)

The broadest possible character class you can get is to match based on any character whatsoever. For that, you'd use the dot character (.). So, for example, the following pattern will match lines that contain one character and one character only:

/^.$/

You'll use the dot more often in patterns with quantifiers (which you'll learn about next), but it can also be used on its own to indicate fields of a certain width. For example:

/^..:/

This pattern will match only if the line starts with two characters and a colon.
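Both dot patterns can be tried out with a short sketch like this one. The sample lines and array names are made up for the illustration.

```perl
use strict;
use warnings;

# Sample lines are invented for this illustration.
my @lines = ("ab:cd", "a:bcd", "abc:d", "x", "xy");

# Lines that start with exactly two characters followed by a colon.
my @two_then_colon = grep { /^..:/ } @lines;

# Lines that consist of one character and nothing else.
my @single_char = grep { /^.$/ } @lines;

print "two then colon: @two_then_colon\n";
print "single char:    @single_char\n";
```

Only ab:cd has its colon in the third position, so it is the only line the field-width pattern accepts.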

More about the dot operator after we pause for an example.
