Regular Expressions and Pattern Matching

Regular expressions provide advanced string matching and manipulation. A pattern in a regular expression is simply a sequence of characters that describes the form of a string. For instance, you could find a literal string or validate that the input the user entered looks like an e-mail address. Regular expressions have a mini-language of their own. Once you learn it, you can apply it to many other languages in addition to PHP. You will use regular expressions sparingly in your games. Most of the time you will use them simply to validate that input is in the form you expect, which is very important.

Examine the following regular expression pattern:

^create

The caret (^) in this pattern tells the regular expression engine to match the pattern only at the beginning of a string. So the string “create a new character” would meet the criteria for the match, but “To create a new character” would not. Matching is case sensitive, so “Create a new character” would not match either. Regular expressions also offer you a dollar sign ($) character to match strings that end with the pattern.

If you take the previous example of ^create and turn it into the pattern created$, then it would no longer find a match on “create a new character” but it would find a match on “Your character has been created”. And if you combine the caret (^) and the dollar sign ($) together like this:

^create$

you can now match on an exact word. The pattern matches only the string “create”, so “To create a new character” would no longer meet the criteria of the pattern. You are not limited to matching on literal characters either. You can also match on special characters such as a new line or a tab. To do this, you would simply use the appropriate escape sequence in your pattern. For instance, to match a tab at the beginning of a line you would do this:

^\t

To match on a new line, a carriage return, or a form feed you would use the respective escape sequences \n, \r, or \f. For punctuation marks you would also escape the character; for example, a literal period would be \., and a literal backslash would be \\.
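The anchors and escapes described so far can be tried out with a short script. This is only a sketch: it assumes PHP 7 or later and uses preg_match() from PHP's PCRE functions, which requires delimiters (slashes here) around the pattern. The helper name matchesPattern() is made up for illustration.

```php
<?php
// Hypothetical helper, not from the text: wraps a bare pattern in the
// slash delimiters that preg_match() requires and returns true on a match.
function matchesPattern(string $pattern, string $subject): bool
{
    return preg_match('/' . $pattern . '/', $subject) === 1;
}

// ^ anchors the match to the start of the string.
var_dump(matchesPattern('^create', 'create a new character'));           // bool(true)
var_dump(matchesPattern('^create', 'To create a new character'));        // bool(false)

// $ anchors the match to the end of the string.
var_dump(matchesPattern('created$', 'Your character has been created')); // bool(true)

// Both anchors together: only the exact string "create" matches.
var_dump(matchesPattern('^create$', 'create'));                          // bool(true)
var_dump(matchesPattern('^create$', 'create a new character'));          // bool(false)

// Escapes: \t is a tab, \. is a literal period.
var_dump(matchesPattern('^\t', "\tindented line"));                      // bool(true)
var_dump(matchesPattern('^3\.5$', '3.5'));                               // bool(true)
var_dump(matchesPattern('^3\.5$', '345'));                               // bool(false)
```

The patterns are written in single-quoted strings so that PHP passes the backslashes through to the regular expression engine unchanged.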

So far you know how to match literals only, but you’ll need a way to describe the pattern more loosely. You can describe this pattern with character classes. Creating a character class is very simple: you simply place the content within brackets. For example, if you wanted to match all vowels you would create a character class that looks like this:

[AaEeIiOoUu]

You can also specify ranges in your character classes by using a hyphen, like this:

[a-z]       // match any lowercase letter
[A-Z]       // match any uppercase letter
[0-9]       // match any digit
[ \f\n\r\t] // match any white space
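As a rough illustration of these character classes, the sketch below counts vowels and tests a few strings. It assumes PHP 7 or later and the PCRE functions preg_match() and preg_match_all(); the function names are invented for this example.

```php
<?php
// Counts the vowels in a string using the vowel character class.
// preg_match_all() returns the number of matches it found.
function countVowels(string $text): int
{
    return preg_match_all('/[AaEeIiOoUu]/', $text);
}

// True when the string starts with a lowercase letter followed by a digit.
function startsWithLetterThenDigit(string $text): bool
{
    return preg_match('/^[a-z][0-9]/', $text) === 1;
}

// True when the string contains a space, form feed, new line,
// carriage return, or tab anywhere.
function containsWhitespace(string $text): bool
{
    return preg_match('/[ \f\n\r\t]/', $text) === 1;
}

echo countVowels('Dungeon Master'), "\n";          // 5
var_dump(startsWithLetterThenDigit('e4 opens'));   // bool(true)
var_dump(startsWithLetterThenDigit('E4 opens'));   // bool(false)
var_dump(containsWhitespace('two words'));         // bool(true)
var_dump(containsWhitespace('oneword'));           // bool(false)
```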

Let’s say you want to create a character class to forbid a digit from being the first character in a string. To do this you would again use the caret (^). Inside a character class the caret (^) means “not” instead of the beginning of the string.

^[^0-9]

This will match strings such as “all23” or “u47”. But it will not match strings such as “7all” or “8teen”. PHP has several of these character classes already built in. Take a look at Table 4.4.

Table 4.4. PHP Character Classes

Character Class   Description
[[:alpha:]]       Matches any letter.
[[:digit:]]       Matches any digit.
[[:xdigit:]]      Matches any hexadecimal digit.
[[:alnum:]]       Matches any letter or any digit.
[[:space:]]       Matches any white space.
[[:upper:]]       Matches any uppercase letter.
[[:lower:]]       Matches any lowercase letter.
[[:punct:]]       Matches any punctuation mark.

* These are POSIX character classes. Not every regular expression engine supports them, so they may not work correctly if you try to use them in other languages.
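A negated class and the built-in classes from Table 4.4 can be checked with a few lines of PHP. The sketch assumes PHP 7 or later and preg_match(), whose PCRE engine also accepts the [[:name:]] classes; the function names are hypothetical.

```php
<?php
// True when the first character is anything except a digit.
function startsWithNonDigit(string $text): bool
{
    return preg_match('/^[^0-9]/', $text) === 1;
}

// True when the first character is an uppercase letter.
function startsWithUppercase(string $text): bool
{
    return preg_match('/^[[:upper:]]/', $text) === 1;
}

// True when the string contains a punctuation mark anywhere.
function containsPunctuation(string $text): bool
{
    return preg_match('/[[:punct:]]/', $text) === 1;
}

var_dump(startsWithNonDigit('all23'));     // bool(true)
var_dump(startsWithNonDigit('7all'));      // bool(false)
var_dump(startsWithUppercase('Knight'));   // bool(true)
var_dump(startsWithUppercase('knight'));   // bool(false)
var_dump(containsPunctuation('Ready?'));   // bool(true)
var_dump(containsPunctuation('Ready'));    // bool(false)
```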

To allow even more flexibility in your patterns, you can use curly braces ({}) to match repeated occurrences of a character or character class. Writing {x} after a character or character class matches exactly x occurrences of it, and {x,y} matches anywhere from x to y occurrences. For example:

^a{1,5}$ // matches: a, aa, aaa, aaaa, or aaaaa

You can also find a string with x or more occurrences of a character or character class. For instance:

^b{2,}

This will match a string that begins with two or more b’s. So it would find a match on bbooo, bbblood, and so on.
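The three brace forms can be compared side by side. This is a minimal sketch, assuming PHP 7 or later and preg_match(); the function names and the three-digit example are not from the text.

```php
<?php
// {1,5}: between one and five a's, and nothing else.
function isOneToFiveAs(string $text): bool
{
    return preg_match('/^a{1,5}$/', $text) === 1;
}

// {2,}: the string must begin with two or more b's.
function startsWithTwoOrMoreBs(string $text): bool
{
    return preg_match('/^b{2,}/', $text) === 1;
}

// {3}: exactly three digits.
function isThreeDigits(string $text): bool
{
    return preg_match('/^[0-9]{3}$/', $text) === 1;
}

var_dump(isOneToFiveAs('aaaaa'));            // bool(true)
var_dump(isOneToFiveAs('aaaaaa'));           // bool(false)
var_dump(startsWithTwoOrMoreBs('bbblood'));  // bool(true)
var_dump(startsWithTwoOrMoreBs('blood'));    // bool(false)
var_dump(isThreeDigits('042'));              // bool(true)
var_dump(isThreeDigits('42'));               // bool(false)
```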

There are five more special characters in regular expressions you should know about. They are:

period .
question mark ?
star *
plus +
pipe |

The period is used in regular expressions to represent any non-new-line character. So the pattern ^.s$ will match any two-character string that begins with any non-new-line character and ends in “s”.

The question mark means that the previous character is optional. So if you were matching doubles (floating-point numbers) in a string your pattern would look like this:

^-?[0-9]{0,}\.?[0-9]{0,}$

This may look a bit confusing, but it is actually quite simple to explain. The “^-?” means look for a string that begins with an optional minus sign, followed by zero or more digits, “[0-9]{0,}”. Now look for an optional literal period, “\.?”, followed by zero or more digits, “[0-9]{0,}”. The period is escaped so that it matches only a decimal point rather than any character.

The star works a little like a wild card. It means match zero or more of the previous character. So you could further simplify the matching doubles pattern to:

^-?[0-9]*\.?[0-9]*$

Make sense? The star is the exact equivalent of {0,}, which means zero or more.
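To see what the doubles pattern accepts, and why the period needs its backslash, it can be run against a few inputs. The sketch assumes PHP 7 or later and preg_match(); both function names are made up, and the second function deliberately leaves the period unescaped for comparison.

```php
<?php
// The doubles pattern with an escaped (literal) period.
function looksLikeDouble(string $text): bool
{
    return preg_match('/^-?[0-9]*\.?[0-9]*$/', $text) === 1;
}

// Same pattern with an unescaped period, which matches any character.
function looksLikeDoubleUnescaped(string $text): bool
{
    return preg_match('/^-?[0-9]*.?[0-9]*$/', $text) === 1;
}

var_dump(looksLikeDouble('3.14'));           // bool(true)
var_dump(looksLikeDouble('-42'));            // bool(true)
var_dump(looksLikeDouble('.5'));             // bool(true)
var_dump(looksLikeDouble('3.1.4'));          // bool(false)
var_dump(looksLikeDouble('3x14'));           // bool(false)
var_dump(looksLikeDoubleUnescaped('3x14'));  // bool(true)

// Every part of the pattern is optional, so these also match.
var_dump(looksLikeDouble(''));               // bool(true)
var_dump(looksLikeDouble('-'));              // bool(true)
```

Because the empty string and a lone minus sign also match, a real validation routine would probably want at least one required digit.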

The plus symbol means match one or more of the previous character, which is the same thing as saying {1,}. A simple example of this would be matching any integer number:

^-?[0-9]+$

This is looking for a string that begins with an optional minus sign, followed by one or more digits. Another very handy function of the plus sign would be to validate e-mail addresses, which is something you might want to do quite often in your Web-based games.

^.+@.+\..+$

This pattern might look complex, but if you break it down you will find that it is actually quite simple. First, it is looking for a string that begins with one or more characters (any characters other than a new line). Then it is looking for the literal character “@”, followed by one or more characters. Then it looks for the literal character “.”, written as \. in the pattern, followed by one or more characters.

Now this pattern isn’t perfect, but it is close enough. Matching on any character is a broad scope (the pattern even accepts spaces), but for most purposes you just want to make sure that the user entered a valid-looking e-mail address.
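Both plus-sign patterns can be exercised with a short script. It is a sketch only, assuming PHP 7 or later and preg_match(); the function names and the sample addresses are invented for illustration.

```php
<?php
// Optional minus sign followed by one or more digits.
function isInteger(string $text): bool
{
    return preg_match('/^-?[0-9]+$/', $text) === 1;
}

// Something, an @, something, a literal period, something.
function looksLikeEmail(string $text): bool
{
    return preg_match('/^.+@.+\..+$/', $text) === 1;
}

var_dump(isInteger('-17'));                      // bool(true)
var_dump(isInteger('-'));                        // bool(false)
var_dump(isInteger('1.5'));                      // bool(false)

var_dump(looksLikeEmail('player@example.com'));  // bool(true)
var_dump(looksLikeEmail('player@example'));      // bool(false)
var_dump(looksLikeEmail('@example.com'));        // bool(false)

// The pattern is loose: spaces are accepted as well.
var_dump(looksLikeEmail('a b@c d.e f'));         // bool(true)
```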

The final character to take note of in regular expressions is the pipe (|). The pipe behaves exactly like a logical OR operator. This is extremely useful because you can check a string for certain words or characters. For instance:

(G|St)un$

This will match any string that ends with “Gun” or “Stun”.
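A quick check of the pipe pattern shows that it tests the end of the string rather than whole words. The sketch assumes PHP 7 or later and preg_match(); the function name and sample strings are hypothetical.

```php
<?php
// True when the string ends with "Gun" or "Stun" (case sensitive).
function endsWithGunOrStun(string $text): bool
{
    return preg_match('/(G|St)un$/', $text) === 1;
}

var_dump(endsWithGunOrStun('Gun'));         // bool(true)
var_dump(endsWithGunOrStun('Laser Stun'));  // bool(true)
var_dump(endsWithGunOrStun('ShotGun'));     // bool(true)
var_dump(endsWithGunOrStun('Stunned'));     // bool(false)
var_dump(endsWithGunOrStun('gun'));         // bool(false)
```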

Using the Regular Expression Functions

PHP has five functions for handling regular expressions. Two of them are used for searching and matching, two are used for searching and replacing, and one is used for splitting.

The most basic of the five functions is ereg(). It takes two parameters with an optional third parameter, and returns true if the pattern is found and false if the pattern is not found.

bool ereg(string pattern, string source, array [regs]);

Let’s go back to the e-mail validation pattern that was presented earlier to see how to use the ereg() function.

$email = "ruts@datausa.com";
$nResult = ereg("^.+@.+\\..+", $email);
if($nResult)
{
    echo("This is a valid email address");
}
else
{
    echo("This is an invalid email address");
}

The optional third argument, array [regs], can store the matching substrings of the pattern for later use. This means, for example, that when you pass in an e-mail address it can take the username, domain name, and top-level domain name and store them in the array. Take a look at the following example:

$email = "ruts@datausa.com";
$nResult = ereg("^(.+)@(.+)\\.(.+)", $email, $arrEmail);
if($nResult)
{
    echo("$arrEmail[0] is a valid email address" .
    "<br>Username: $arrEmail[1]<br>Domain: $arrEmail[2]<br>Top Level: $arrEmail[3]");
}
else
{
    echo("This is an invalid email address");
}

The results of the preceding example are:

ruts@datausa.com is a valid email address
Username: ruts
Domain: datausa
Top Level: com

So what did this do exactly? After ereg() verified the pattern, the entire matched string was stored in the first index of the array. Then the first parenthesized substring from the pattern, the username ruts, was stored in the second index of the array. Next it matched the “@” symbol and proceeded to the second parenthesized substring in the pattern, storing the domain name, datausa, in the third index. Finally it matched the literal period and the third parenthesized substring, the top-level domain name, com, and stored that in the final index of the array.
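The ereg() family was removed in PHP 7.0, so on a current installation the same capture technique would most likely be written with preg_match(), which fills its third argument in the same way. The sketch below assumes PHP 7.1 or later; splitEmail() and its array keys are made-up names.

```php
<?php
// Hypothetical helper: returns the pieces of an address, or null when
// the pattern does not match.
function splitEmail(string $email): ?array
{
    if (preg_match('/^(.+)@(.+)\.(.+)$/', $email, $parts) !== 1) {
        return null;
    }
    // $parts[0] holds the whole match; $parts[1] to $parts[3] hold the
    // three parenthesized substrings, in order.
    return ['user' => $parts[1], 'domain' => $parts[2], 'tld' => $parts[3]];
}

print_r(splitEmail('ruts@datausa.com'));   // user: ruts, domain: datausa, tld: com
var_dump(splitEmail('not an address'));    // NULL
```

Because .+ is greedy, an address with several periods after the “@” is split at the last one, so the final group holds only the top-level domain.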

ereg() also has a sister function called eregi(). eregi() functions exactly the same as ereg() but ignores case when looking for matching patterns.
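With the PCRE functions there is no separate case-insensitive function; the usual equivalent is the i modifier placed after the closing delimiter. A small sketch, assuming PHP 7 or later, with a hypothetical function name:

```php
<?php
// Matches "create" at the start of the string, optionally ignoring case.
function startsWithCreate(string $text, bool $ignoreCase): bool
{
    // The trailing "i" modifier makes the whole pattern case insensitive.
    $pattern = $ignoreCase ? '/^create/i' : '/^create/';
    return preg_match($pattern, $text) === 1;
}

var_dump(startsWithCreate('Create a new character', false));  // bool(false)
var_dump(startsWithCreate('Create a new character', true));   // bool(true)
```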

Now take a look at the two functions for searching and replacing strings: ereg_replace() and eregi_replace(). Both of these functions search for a given pattern and replace all occurrences of that pattern with the new string that you specify. eregi_replace() does exactly the same thing as ereg_replace(), but it isn’t case sensitive. Each of these functions takes three arguments:

string ereg_replace(string pattern, string replacement, string source);

You specify a pattern that you would like to look for, a replacement for any occurrences of that pattern, and the source that you are searching. These functions do not have an optional array argument, but they do have something similar. Any parenthetical substring in the pattern will be stored in a buffer that you can access in the replacement string by referencing it as \\1. There are nine slots that you can access (\\1 through \\9). Take a look at the following example:

$strSource = "Games Are Great";
$strModified = ereg_replace("G(ame)s", "g\\1S", $strSource);
echo($strModified);

This example takes the string “Games Are Great” and searches for the pattern “G(ame)s”. It finds a match at the beginning of the string, replaces the match with “gameS”, and returns the modified string. The results look like this:

gameS Are Great

You can reference the “ame” characters by using \\1 because they form a parenthetical substring. If the pattern were broken up like “G(am)(es)”, then “am” would be referenced as \\1, and “es” would be referenced as \\2.
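On PHP 7 and later, where ereg_replace() is no longer available, the same replacement can be sketched with preg_replace(). The ${1} form used here is PCRE's way of writing a back-reference in the replacement string; the variable names other than $strSource and $strModified are invented for this example.

```php
<?php
$strSource = 'Games Are Great';

// One group: ${1} refers to "ame".
$strModified = preg_replace('/G(ame)s/', 'g${1}S', $strSource);
echo $strModified, "\n";   // gameS Are Great

// Two groups: ${1} is "am" and ${2} is "es", so they can be swapped.
$strSwapped = preg_replace('/G(am)(es)/', '${2}${1}', $strSource);
echo $strSwapped, "\n";    // esam Are Great

// The "i" modifier stands in for eregi_replace().
$strNoCase = preg_replace('/great/i', 'Fun', $strSource);
echo $strNoCase, "\n";     // Games Are Fun
```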

The fifth and final function that uses regular expressions is the split() function. The split() function searches a source string for a pattern and breaks up the source string into an array based on matches of the pattern. A great use of the split function is if you are reading a comma-separated list in and need to break apart the strings so you can work with them.

$strSource = "e1, d4, e5, e6";
$arr = split(",", $strSource);
echo("$arr[0]<br>$arr[1]<br>$arr[2]<br>$arr[3]");

The above example splits the source string on the literal character “,” and creates an array with each string in its own index. The results look like this:

e1
d4
e5
e6

Unlike ereg_replace() and eregi_replace(), split() has an optional third argument. You can use it to limit the maximum number of elements that the source string is split into. Here is the full split function:

array split(string pattern, string source, int [limit]);
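The effect of the limit argument is easiest to see by running the split twice. Since split() was removed in PHP 7.0, this sketch assumes PHP 7 or later and uses preg_split(), which takes the limit in the same position; the pattern also swallows the spaces after each comma, and the variable names are made up.

```php
<?php
$strSource = 'e1, d4, e5, e6';

// No limit: split on every comma (plus any spaces that follow it).
$allMoves = preg_split('/, */', $strSource);
print_r($allMoves);       // four elements: e1, d4, e5, e6

// Limit of 2: the first element is split off, the rest stays together.
$firstAndRest = preg_split('/, */', $strSource, 2);
print_r($firstAndRest);   // two elements: "e1" and "d4, e5, e6"
```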
