4 String Manipulation and Regular Expressions

In this chapter, we discuss how you can use PHP’s string functions to format and manipulate text. We also discuss using string functions or regular expression functions to search (and replace) words, phrases, or other patterns within a string.

These functions are useful in many contexts. You often may want to clean up or reformat user input that is going to be stored in a database. Search functions are great when building search engine applications (among other things).

Key topics covered in this chapter include

•  Formatting strings

•  Joining and splitting strings

•  Comparing strings

•  Matching and replacing substrings with string functions

•  Using regular expressions

Creating a Sample Application: Smart Form Mail

In this chapter, you use string and regular expression functions in the context of a Smart Form Mail application. You then add these scripts to the Bob’s Auto Parts site you’ve been building in preceding chapters.

This time, you build a straightforward and commonly used customer feedback form for Bob’s customers to enter their complaints and compliments, as shown in Figure 4.1. However, this application has one improvement over many you will find on the Web. Instead of emailing the form to a generic email address like [email protected], you’ll attempt to put some intelligence into the process by searching the input for key words and phrases and then sending the email to the appropriate employee at Bob’s company. For example, if the email contains the word advertising, you might send the feedback to the Marketing department. If the email is from Bob’s biggest client, it can go straight to Bob.

Figure 4.1  Bob’s feedback form asks customers for their name, email address, and comments.


Start with the simple script shown in Listing 4.1 and add to it as you read along.

Listing 4.1  processfeedback.php: Basic Script to Email Form Contents


<?php
//create short variable names
$name=$_POST['name'];
$email=$_POST['email'];
$feedback=$_POST['feedback'];
//set up some static information
$toaddress = "[email protected]";
$subject = "Feedback from web site";
$mailcontent = "Customer name: ".$name."\n".
               "Customer email: ".$email."\n".
               "Customer comments: ".$feedback."\n";
$fromaddress = "From: [email protected]";
//invoke mail() function to send mail
mail($toaddress, $subject, $mailcontent, $fromaddress);
?>
<html>
<head>
<title>Bob’s Auto Parts - Feedback Submitted</title>
</head>
<body>
<h1>Feedback submitted</h1>
<p>Your feedback has been sent.</p>
</body>
</html>


Generally, you should check that users have filled out all the required form fields using, for example, isset(). We have omitted this function call from the script and other examples for the sake of brevity.
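A minimal sketch of such a check follows. The helper name missing_fields() is hypothetical and not part of the chapter's script, and a literal array stands in for $_POST so that the example runs on its own.

```php
<?php
// Hypothetical helper (not part of the chapter's script): returns the names
// of required fields that are absent or blank in the submitted data.
function missing_fields(array $input, array $required)
{
    $missing = array();
    foreach ($required as $field) {
        if (!isset($input[$field]) || trim($input[$field]) === '') {
            $missing[] = $field;
        }
    }
    return $missing;
}

// In processfeedback.php you would pass $_POST; a literal array is used here.
$sample  = array('name' => 'Fred', 'email' => '', 'feedback' => 'Great shop');
$missing = missing_fields($sample, array('name', 'email', 'feedback'));

echo count($missing) ? 'Missing: '.implode(', ', $missing) : 'All fields present';
```

Treating a blank value as missing is a design choice here, because isset() alone returns true for a field that was submitted empty.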

In this script, you can see that we have concatenated the form fields together and used PHP’s mail() function to email them to [email protected]. This is a sample email address. If you want to test the code in this chapter, substitute your own email address here. Because we haven’t yet used mail(), we need to discuss how it works.

Unsurprisingly, this function sends email. The prototype for mail() looks like this:

bool mail(string to, string subject, string message,
          string [additional_headers [, string additional_parameters]]);

The first three parameters are compulsory and represent the address to send email to, the subject line, and the message contents, respectively. The fourth parameter can be used to send any additional valid email headers. Valid email headers are described in the document RFC 822, which is available online if you want more details. (RFCs, or Requests for Comment, are the source of many Internet standards; we discuss them in Chapter 20, “Using Network and Protocol Functions.”) Here, the fourth parameter adds a From: address for the mail. You can also use it to add Reply-To: and Cc: fields, among others. If you want more than one additional header, just separate them by using a carriage return and newline (\r\n) within the string, as follows:

$additional_headers="From: [email protected]\r\n"
                     .'Reply-To: [email protected]';

The optional fifth parameter can be used to pass a parameter to whatever program you have configured to send mail.

To use the mail() function, set up your PHP installation to point at your mail-sending program. If the script doesn’t work for you in its current form, an installation issue might be at fault; check Appendix A, “Installing PHP and MySQL.”

Throughout this chapter, you enhance this basic script by making use of PHP’s string handling and regular expression functions.

Formatting Strings

You often need to tidy up user strings (typically from an HTML form interface) before you can use them. The following sections describe some of the functions you can use.

Trimming Strings: chop(), ltrim(), and trim()

The first step in tidying up is to trim any excess whitespace from the string. Although this step is never compulsory, it can be useful if you are going to store the string in a file or database, or if you’re going to compare it to other strings.

PHP provides three useful functions for this purpose. In the beginning of the script when you give short names to the form input variables, you can use the trim() function to tidy up your input data as follows:

$name = trim($_POST['name']);
$email = trim($_POST['email']);
$feedback = trim($_POST['feedback']);

The trim() function strips whitespace from the start and end of a string and returns the resulting string. The characters it strips by default are newlines and carriage returns (\n and \r), horizontal and vertical tabs (\t and \x0B), end-of-string characters (\0), and spaces. You can also pass it a second parameter containing a list of characters to strip instead of this default list. Depending on your particular purpose, you might like to use the ltrim() or rtrim() functions instead. They are both similar to trim(), taking the string in question as a parameter and returning the formatted string. The difference between these three is that trim() removes whitespace from the start and end of a string, ltrim() removes whitespace from the start (or left) only, and rtrim() removes whitespace from the end (or right) only. The chop() function is an alias for rtrim().
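The differences among the three functions can be sketched as follows. The sample strings are made up for illustration; the comments show the value each call returns.

```php
<?php
// Two spaces and a tab in front, a space and a newline behind.
$raw = "  \tBob Smith \n";

$both  = trim($raw);   // "Bob Smith"
$left  = ltrim($raw);  // "Bob Smith \n"  (trailing whitespace kept)
$right = rtrim($raw);  // "  \tBob Smith" (leading whitespace kept)

// The optional second parameter replaces the default list of characters.
$custom = trim('--Bob Smith--', '-');  // "Bob Smith"

var_dump($both, $left, $right, $custom);
```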

Formatting Strings for Presentation

PHP includes a set of functions that you can use to reformat a string in different ways.

Using HTML Formatting: The nl2br() Function

The nl2br() function takes a string as a parameter and inserts the XHTML <br /> tag before each newline in it. This capability is useful for echoing a long string to the browser. For example, you can use this function to format the customer’s feedback to echo it back:

<p>Your feedback (shown below) has been sent.</p>
<p><?php echo nl2br($mailcontent); ?> </p>

Remember that HTML disregards plain whitespace, so if you don’t filter this output through nl2br(), it will appear on a single line (except for newlines forced by the browser window). The result is illustrated in Figure 4.2.
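As a small illustration, assuming a made-up two-line message, the following sketch shows what nl2br() returns. Note that the newline itself stays in the string; the tag is added in front of it.

```php
<?php
// Hypothetical message content containing one newline.
$mailcontent = "Customer name: Fred\nCustomer comments: Great shop";

// nl2br() puts <br /> in front of the newline and keeps the newline.
$html = nl2br($mailcontent);

echo $html;
```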

Formatting a String for Printing

So far, you have used the echo language construct to print strings to the browser. PHP also supports a print() construct, which does the same thing as echo but returns a value (always 1).

Figure 4.2  Using PHP’s nl2br() function improves the display of long strings within HTML.


Both of these techniques print a string “as is.” You can apply some more sophisticated formatting using the functions printf() and sprintf(). They work basically the same way, except that printf() prints a formatted string to the browser and sprintf() returns a formatted string.

If you have previously programmed in C, you will find that these functions are conceptually similar to the C versions. Be careful, though, because the syntax is not exactly the same. If you haven’t, they take getting used to but are useful and powerful.

The prototypes for these functions are

string sprintf (string format [, mixed args…])
int printf (string format [, mixed args…])

The first parameter passed to both of these functions is a format string that describes the basic shape of the output with format codes instead of variables. The other parameters are variables that will be substituted into the format string.

For example, using echo, you can use the variables you want to print inline, like this:

echo "Total amount of order is $total.";

To get the same effect with printf(), you would use

printf ("Total amount of order is %s.", $total);

The %s in the format string is called a conversion specification. This one means “replace with a string.” In this case, it is replaced with $total interpreted as a string. If the value stored in $total was 12.4, both of these approaches would print it as 12.4.

The advantage of printf() is that you can use a more useful conversion specification to specify that $total is actually a floating-point number and that it should have two decimal places after the decimal point, as follows:

printf ("Total amount of order is %.2f", $total);

Given this formatting, and 12.4 stored in $total, this statement will print as 12.40.

You can have multiple conversion specifications in the format string. If you have n conversion specifications, you will usually have n arguments after the format string. Each conversion specification will be replaced by a reformatted argument in the order they are listed. For example,

printf ("Total amount of order is %.2f (with shipping %.2f) ",
           $total, $total_shipping);

Here, the first conversion specification uses the variable $total, and the second uses the variable $total_shipping.

Each conversion specification follows the same format, which is

%['padding_character][-][width][.precision]type

All conversion specifications start with a % symbol. If you actually want to print a % symbol, you need to use %%.

The padding_character is optional. It is used to pad your variable to the width you have specified. An example would be to add leading zeros to a number like a counter. The default padding character is a space. If you are specifying a space or zero, you do not need to prefix it with the apostrophe ('). For any other padding character, you need to prefix it with an apostrophe.

The - symbol is optional. It specifies that the data in the field will be left-justified rather than right-justified, which is the default.

The width specifier tells printf() how much room (in characters) to leave for the variable to be substituted in here.

The precision specifier should begin with a decimal point. It should contain the number of places after the decimal point you would like displayed.
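The following sketch puts the padding character, justification, width, and precision parts together. It uses sprintf() so that the results can be stored and inspected; the values are arbitrary examples.

```php
<?php
$padded  = sprintf("%05d", 42);      // "00042"    zero padding, width 5
$starred = sprintf("%'*8s", 'PHP');  // "*****PHP" custom padding character *
$left    = sprintf("[%-6s]", 'ab');  // "[ab    ]" left-justified, width 6
$money   = sprintf("%8.2f", 12.4);   // "   12.40" width 8, two decimals

echo $padded, "\n", $starred, "\n", $left, "\n", $money, "\n";
```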

The final part of the specification is a type code. A summary of these codes is shown in Table 4.1.

Table 4.1  Conversion Specification Type Codes


When using the printf() function with conversion type codes, you can use argument numbering. That means that the arguments don’t need to be in the same order as the conversion specifications. For example,

printf ("Total amount of order is %2\$.2f (with shipping %1\$.2f) ",
           $total_shipping, $total);

Just add the argument position in the list directly after the % sign, followed by an escaped $ symbol; in this example, 2$ means “replace with the second argument in the list.” This method can also be used to repeat arguments.

Two alternative versions of these functions are called vprintf() and vsprintf(). These variants accept two parameters: the format string and an array of the arguments rather than a variable number of parameters.
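Here is a short sketch of argument numbering, a repeated argument, and the array-based variant side by side. The amounts are invented, and the format string is written in single quotes so that the $ signs need no escaping.

```php
<?php
$total          = 12.4;   // sample values for illustration
$total_shipping = 15.9;

// %2$ refers to the second argument, %1$ to the first; %2$ is used twice.
$format = 'Total is %2$.2f (with shipping %1$.2f); total again: %2$.2f';

$line1 = sprintf($format, $total_shipping, $total);

// vsprintf() takes the same arguments packed into an array.
$line2 = vsprintf($format, array($total_shipping, $total));

echo $line1;
```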

Changing the Case of a String

You can also reformat the case of a string. This capability is not particularly useful for the sample application, but we’ll look at some brief examples.

If you start with the subject string, $subject, which you are using for email, you can change its case by using several functions. The effect of these functions is summarized in Table 4.2. The first column shows the function name, the second describes its effect, the third shows how it would be applied to the string $subject, and the last column shows what value would be returned from the function.

Table 4.2  String Case Functions and Their Effects

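As a brief sketch, PHP's four standard case functions can be applied to $subject as shown here. That these are the functions of interest is an assumption; the comments show the value each call returns.

```php
<?php
$subject = "Feedback from web site";

$upper = strtoupper($subject);  // "FEEDBACK FROM WEB SITE"
$lower = strtolower($subject);  // "feedback from web site"
$first = ucfirst($subject);     // "Feedback from web site" (first letter only)
$words = ucwords($subject);     // "Feedback From Web Site" (each word)

echo $upper, "\n", $lower, "\n", $first, "\n", $words, "\n";
```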

Formatting Strings for Storage: addslashes() and stripslashes()

In addition to using string functions to reformat a string visually, you can use some of these functions to reformat strings for storage in a database. Although we don’t cover actually writing to the database until Part II, “Using MySQL,” we cover formatting strings for database storage now.

Certain characters are perfectly valid as part of a string but can cause problems, particularly when you are inserting data into a database because the database could interpret these characters as control characters. The problematic ones are quotation marks (single and double), backslashes (\), and the NULL character.

You need to find a way of marking or escaping these characters so that databases such as MySQL can understand that you meant a literal special character rather than a control sequence. To escape these characters, add a backslash in front of them. For example, " (double quotation mark) becomes \" (backslash double quotation mark), and \ (backslash) becomes \\ (backslash backslash). (This rule applies universally to special characters, so if you have \\ in your string, you need to replace it with \\\\.)

PHP provides two functions specifically designed for escaping characters. If your PHP configuration does not already have this functionality turned on by default, you should reformat any strings with addslashes() before you write them into a database, as follows:

$feedback = addslashes(trim($_POST['feedback']));

Like many of the other string functions, addslashes() takes a string as a parameter and returns the reformatted string.
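A small sketch of the round trip, using an invented string that contains both kinds of quotation mark:

```php
<?php
// The string as typed by the user: O'Reilly said "hi"
$raw = "O'Reilly said \"hi\"";

// addslashes() puts a backslash before each quotation mark:
// O\'Reilly said \"hi\"
$escaped = addslashes($raw);

// stripslashes() removes those backslashes again.
$restored = stripslashes($escaped);

echo $escaped, "\n", $restored, "\n";
```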

Figure 4.3 shows the actual effects of using these functions on the string.

You may try these functions on your server and get a result that looks more like Figure 4.4.

Figure 4.3  After the addslashes() function is called, all the quotation marks have been slashed out. stripslashes() removes the slashes.


Figure 4.4  All problematic characters have been escaped twice; this means the magic quotes feature is switched on.


If you see this result, it means that your configuration of PHP is set up to add and strip slashes automatically. This capability is controlled by the magic_quotes_gpc configuration directive, which is turned on by default in new installations of PHP. The letters gpc in its name stand for GET, POST, and cookie. This means that variables coming from these sources are automatically quoted. You can check whether this directive is switched on in your system by using the get_magic_quotes_gpc() function, which returns true if strings from these sources are being automatically quoted for you. If this directive is on in your system, you need to call stripslashes() before displaying user data; otherwise, the slashes will be displayed.

Using magic quotes allows you to write more portable code. You can read more about this feature in Chapter 24, “Other Useful Features.”

Joining and Splitting Strings with String Functions

Often, you may want to look at parts of a string individually. For example, you might want to look at words in a sentence (say, for spellchecking) or split a domain name or email address into its component parts. PHP provides several string functions (and one regular expression function) that allow you to do this.

In the example, Bob wants any customer feedback from bigcustomer.com to go directly to him, so you can split the email address the customer typed into parts to find out whether he or she works for Bob’s big customer.

Using explode(), implode(), and join()

The first function you could use for this purpose, explode(), has the following prototype:

array explode(string separator, string input [, int limit]);

This function takes a string input and splits it into pieces on a specified separator string. The pieces are returned in an array. You can limit the number of pieces with the optional limit parameter.

To get the domain name from the customer’s email address in the script, you can use the following code:

$email_array = explode('@', $email);

This call to explode() splits the customer’s email address into two parts: the username, which is stored in $email_array[0], and the domain name, which is stored in $email_array[1]. Now you can test the domain name to determine the customer’s origin and then send the feedback to the appropriate person:

if ($email_array[1] == "bigcustomer.com") {
  $toaddress = "[email protected]";
} else {
  $toaddress = "[email protected]";
}

If the domain is capitalized or mixed case, however, this approach will not work. You could avoid this problem by first converting the domain to all uppercase or all lowercase and then checking for a match, as follows:

if (strtolower($email_array[1]) == "bigcustomer.com") {
  $toaddress = "[email protected]";
} else {
  $toaddress = "[email protected]";
}

You can reverse the effects of explode() by using either implode() or join(), which are identical. For example,

$new_email = implode('@', $email_array);

This statement takes the array elements from $email_array and joins them with the string passed in the first parameter. The function call is similar to explode(), but the effect is the opposite.
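The following sketch shows the round trip and the optional limit parameter. The email address and host name are made-up examples.

```php
<?php
$email = 'fred@bigcustomer.com';       // hypothetical address

$email_array = explode('@', $email);   // array('fred', 'bigcustomer.com')
$new_email   = implode('@', $email_array);

// With a limit of 2, the last piece holds the rest of the string unsplit.
$parts = explode('.', 'www.example.com', 2);  // array('www', 'example.com')

print_r($email_array);
echo $new_email, "\n";
print_r($parts);
```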

Using strtok()

Unlike explode(), which breaks a string into all its pieces at one time, strtok() gets pieces (called tokens) from a string one at a time. strtok() is a useful alternative to using explode() for processing words from a string one at a time.

The prototype for strtok() is

string strtok(string input, string separator);

The separator can be either a character or a string of characters, but the input string is split on each of the characters in the separator string rather than on the whole separator string (as explode() does).

Calling strtok() is not quite as simple as it seems in the prototype. To get the first token from a string, you call strtok() with the string you want tokenized and a separator. To get the subsequent tokens from the string, you just pass a single parameter—the separator. The function keeps its own internal pointer to its place in the string. If you want to reset the pointer, you can pass the string into it again.

strtok() is typically used as follows:

$token = strtok($feedback, " ");
// strtok() returns false when no tokens remain
while ($token !== false) {
  echo $token."<br />";
  $token = strtok(" ");
}

As usual, it’s a good idea to check that the customer actually typed some feedback in the form, using, for example, the empty() function. We have omitted these checks for brevity.

The preceding code prints each token from the customer’s feedback on a separate line and loops until there are no more tokens. Empty strings are automatically skipped in the process.
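Because strtok() splits on each character in the separator string, you can strip punctuation while tokenizing. This sketch assumes a made-up feedback string and collects the tokens in an array rather than echoing them.

```php
<?php
$feedback = "Great shop, fast delivery. Thanks!";  // sample input
$tokens   = array();

// Space, comma, period, and exclamation mark each act as a separator.
$token = strtok($feedback, " ,.!");
while ($token !== false) {
    $tokens[] = $token;
    $token = strtok(" ,.!");
}

print_r($tokens);
```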

Using substr()

The substr() function enables you to access a substring between given start and end points of a string. It’s not appropriate for the example used here but can be useful when you need to get at parts of fixed format strings.

The substr() function has the following prototype:

string substr(string string, int start[, int length]);

This function returns a substring copied from within string.

The following examples use this test string:

$test ='Your customer service is excellent';

If you call it with a positive number for start (only), you will get the string from the start position to the end of the string. For example,

substr($test, 1);

returns our customer service is excellent. Note that the string position starts from 0, as with arrays.

If you call substr() with a negative start (only), you will get the string from the end of the string minus start characters to the end of the string. For example,

substr($test,-9);

returns excellent.

The length parameter can be used to specify either a number of characters to return (if it is positive) or the end character of the return sequence (if it is negative). For example,

substr($test, 0, 4);

returns the first four characters of the string—namely, Your. The code

echo substr($test, 5, -13);

returns the characters from position 5 up to, but not including, the thirteenth-to-last character; that is, customer service. The first character is location 0, so location 5 is the sixth character.

Comparing Strings

So far, we’ve just shown you how to use == to compare two strings for equality. You can do some slightly more sophisticated comparisons using PHP. We’ve divided these comparisons into two categories for you: partial matches and others. We deal with the others first and then get into partial matching, which we need to further develop the Smart Form example.

Performing String Ordering: strcmp(), strcasecmp(), and strnatcmp()

The strcmp(), strcasecmp(), and strnatcmp() functions can be used to order strings. This capability is useful when you are sorting data.

The prototype for strcmp() is

int strcmp(string str1, string str2);

The function expects to receive two strings, which it compares. If they are equal, it will return 0. If str1 comes after (or is greater than) str2 in lexicographic order, strcmp() will return a number greater than zero. If str1 is less than str2, strcmp() will return a number less than zero. This function is case sensitive.

The function strcasecmp() is identical except that it is not case sensitive.

The function strnatcmp() and its non–case sensitive twin, strnatcasecmp(), compare strings according to a “natural ordering,” which is more the way a human would do it. For example, strcmp() would order the string "2" as greater than the string "12" because it is lexicographically greater. strnatcmp() would order them the other way around. You can read more about natural ordering at http://www.naturalordersort.org/.
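The difference between the two orderings can be sketched as follows. Only the sign of each result matters, and the filenames are invented for the example.

```php
<?php
$lex  = strcmp("2", "12");            // greater than zero: "2" sorts after "12"
$nat  = strnatcmp("2", "12");         // less than zero: 2 comes before 12
$same = strcasecmp("Hello", "hello"); // 0: equal when case is ignored

$files = array("img12.png", "img10.png", "img2.png", "img1.png");

$lexical = $files;
sort($lexical, SORT_STRING);  // img1, img10, img12, img2

$natural = $files;
usort($natural, 'strnatcmp'); // img1, img2, img10, img12

print_r($lexical);
print_r($natural);
```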

Testing String Length with strlen()

You can check the length of a string by using the strlen() function. If you pass it a string, this function will return its length. For example, the following code prints 5: echo strlen("hello");

You can use this function for validating input data. Consider the email address on the sample form, stored in $email. One basic way of validating an email address stored in $email is to check its length. By our reasoning, the minimum length of an email address is six characters—for example, [email protected] if you have a country code with no second-level domains, a one-letter server name, and a one-letter email address. Therefore, an error could be produced if the address is not at least this length:

if (strlen($email) < 6){
  echo 'That email address is not valid';
  exit; // stop execution of the PHP script
}

Clearly, this approach is a very simplistic way of validating this information. We look at better ways in the next section.

Matching and Replacing Substrings with String Functions

Checking whether a particular substring is present in a larger string is a common operation. This partial matching is usually more useful than testing for complete equality in strings.

In the Smart Form example, you want to look for certain key phrases in the customer feedback and send the mail to the appropriate department. If you want to send emails discussing Bob’s shops to the retail manager, for example, you want to know whether the word shop or derivatives thereof appear in the message.

Given the functions you have already looked at, you could use explode() or strtok() to retrieve the individual words in the message and then compare them using the == operator or strcmp().

You could also do the same thing, however, with a single function call to one of the string-matching or regular expression-matching functions. They search for a pattern inside a string. Next, we look at each set of functions one by one.

Finding Strings in Strings: strstr(), strchr(), strrchr(), and stristr()

To find a string within another string, you can use any of the functions strstr(), strchr(), strrchr(), or stristr().

The function strstr(), which is the most generic, can be used to find a string or character match within a longer string. In PHP, the strchr() function is exactly the same as strstr(), although its name implies that it is used to find a character in a string, similar to the C version of this function. In PHP, either of these functions can be used to find a string inside a string, including finding a string containing only a single character.

The prototype for strstr() is as follows:

string strstr(string haystack, string needle);

You pass the function a haystack to be searched and a needle to be found. If an exact match of the needle is found, the function returns the haystack from the needle onward; otherwise, it returns false. If the needle occurs more than once, the returned string will start from the first occurrence of needle.

For example, in the Smart Form application, you can decide where to send the email as follows:

$toaddress = '[email protected]';  // the default value
// Change the $toaddress if the criteria are met
if (strstr($feedback,'shop'))
  $toaddress ='[email protected]';
else if (strstr($feedback,'delivery'))
  $toaddress ='[email protected]';
else if (strstr($feedback,'bill'))
  $toaddress ='[email protected]';

This code checks for certain keywords in the feedback and sends the mail to the appropriate person. If, for example, the customer feedback reads “I still haven’t received delivery of my last order,” the string “delivery” will be detected and the feedback will be sent to [email protected].

There are two variants on strstr(). The first variant is stristr(), which is nearly identical but is not case sensitive. This variation is useful for this application because the customer might type "delivery", "Delivery", "DELIVERY", or some other mixed-case variation.

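The second variant is strrchr(), which is again nearly identical, but returns the haystack from the last occurrence of the needle onward. Note that strrchr() searches for a single character: if the needle is longer than one character, only its first character is used.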
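The three functions can be compared in a short sketch. The feedback text and the file path are made up for illustration, and the comments show the returned values.

```php
<?php
$feedback = "My Delivery was late. Please check the delivery address.";

// Case sensitive: skips "Delivery" and matches the later "delivery".
$exact = strstr($feedback, 'delivery');    // "delivery address."

// Not case sensitive: matches the first "Delivery".
$any = stristr($feedback, 'delivery');     // "Delivery was late. Please ..."

// No match at all returns false.
$none = strstr($feedback, 'refund');       // false

// strrchr() looks for the last occurrence of a single character.
$file = strrchr('/images/parts/wheel.png', '/');  // "/wheel.png"

var_dump($exact, $any, $none, $file);
```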

Finding the Position of a Substring: strpos() and strrpos()

The functions strpos() and strrpos() operate in a similar fashion to strstr(), except, instead of returning a substring, they return the numerical position of a needle within a haystack. Interestingly enough, the PHP manual recommends using strpos() instead of strstr() to check for the presence of a string within a string because it runs faster.

The strpos() function has the following prototype:

int strpos(string haystack, string needle, int [offset] );

The integer returned represents the position of the first occurrence of the needle within the haystack. The first character is in position 0 as usual.

For example, the following code echoes the value 4 to the browser:

$test = "Hello world";
echo strpos($test, "o");

This code passes in only a single character as the needle, but it can be a string of any length.

The optional offset parameter specifies a point within the haystack to start searching. For example,

echo strpos($test,'o', 5);

This code echoes the value 7 to the browser because PHP has started looking for the character o at position 5 and therefore does not see the one at position 4.

The strrpos() function is almost identical but returns the position of the last occurrence of the needle in the haystack.

In any of these cases, if the needle is not in the string, strpos() or strrpos() will return false. This result can be problematic because false in a weakly typed language such as PHP is equivalent to 0—that is, the first character in a string.

You can avoid this problem by using the === operator to test return values:

$result = strpos($test, "H");
if ($result === false) {
  echo "Not found";
} else {
  echo "Found at position ".$result;
}

Replacing Substrings: str_replace() and substr_replace()

Find-and-replace functionality can be extremely useful with strings. You can use find and replace for personalizing documents generated by PHP—for example, by replacing <name> with a person’s name and <address> with her address. You can also use it for censoring particular terms, such as in a discussion forum application, or even in the Smart Form application. Again, you can use string functions or regular expression functions for this purpose.

The most commonly used string function for replacement is str_replace(). It has the following prototype:

mixed str_replace(mixed needle, mixed new_needle, mixed haystack[, int &count]);

This function replaces all the instances of needle in haystack with new_needle and returns the new version of the haystack. The optional fourth parameter, count, contains the number of replacements made.

For example, because people can use the Smart Form to complain, they might use some colorful words. As a programmer, you can easily prevent Bob’s various departments from being abused in that way if you have an array $offcolor that contains a number of offensive words. Here is an example using str_replace() with an array:

$feedback = str_replace($offcolor,'%!@*', $feedback);
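To see the count parameter at work, consider this sketch. The word list and feedback text are invented stand-ins. Because str_replace() is case sensitive, the sketch also calls str_ireplace(), its case-insensitive counterpart.

```php
<?php
$offcolor = array('darn', 'heck');  // hypothetical word list
$feedback = "That darn order was heck to track. Darn!";

// Case sensitive: the capitalized "Darn" is left alone.
$clean = str_replace($offcolor, '%!@*', $feedback, $count);

// Case insensitive: all three words are replaced.
$cleaner = str_ireplace($offcolor, '%!@*', $feedback, $icount);

echo $clean, " ($count)\n", $cleaner, " ($icount)\n";
```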

The function substr_replace() finds and replaces a particular substring of a string based on its position. It has the following prototype:

string substr_replace(string string, string replacement,
                      int start, int [length] );

This function replaces part of the string string with the string replacement. Which part is replaced depends on the values of the start and optional length parameters.

The start value represents an offset into the string where replacement should begin. If it is zero or positive, it is an offset from the beginning of the string; if it is negative, it is an offset from the end of the string. For example, this line of code replaces the last character in $test with "X":

$test = substr_replace($test, 'X', -1);

The length value is optional and represents the point at which PHP will stop replacing. If you don’t supply this value, the string will be replaced from start to the end of the string.

If length is zero, the replacement string will actually be inserted into the string without overwriting the existing string. A positive length represents the number of characters that you want replaced with the new string; a negative length represents the point at which you would like to stop replacing characters, counted from the end of the string.
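The effect of each kind of length value can be sketched with the test string used earlier for substr(). The replacement words are arbitrary examples.

```php
<?php
$test = 'Your customer service is excellent';

// No length: replace from the start offset to the end of the string.
$a = substr_replace($test, 'X', -1);        // "Your customer service is excellenX"

// Positive length: replace that many characters.
$b = substr_replace($test, 'Our', 0, 4);    // "Our customer service is excellent"

// Zero length: insert without overwriting anything.
$c = substr_replace($test, 'new ', 5, 0);   // "Your new customer service is excellent"

// Negative length: stop replacing 13 characters from the end.
$d = substr_replace($test, 'team', 5, -13); // "Your team is excellent"

echo $a, "\n", $b, "\n", $c, "\n", $d, "\n";
```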

Introducing Regular Expressions

PHP supports two styles of regular expression syntax: POSIX and Perl. Both types are compiled into PHP by default, and as of PHP version 5.3, the Perl (PCRE) type cannot be disabled. However, we cover the simpler POSIX style here; if you’re already a Perl programmer or want to learn more about PCRE, read the online manual at http://www.php.net/pcre.

So far, all the pattern matching you’ve done has used the string functions. You have been limited to exact matches or to exact substring matches. If you want to do more complex pattern matching, you should use regular expressions. Regular expressions are difficult to grasp at first but can be extremely useful.

The Basics

A regular expression is a way of describing a pattern in a piece of text. The exact (or literal) matches you’ve seen so far are a form of regular expression. For example, earlier you searched for regular expression terms such as "shop" and "delivery".

Matching regular expressions in PHP is more like a strstr() match than an equal comparison because you are matching a string somewhere within another string. (It can be anywhere within that string unless you specify otherwise.) For example, the string "shop" matches the regular expression "shop". It also matches the regular expressions "h", "ho", and so on.

You can use special characters to indicate a meta-meaning in addition to matching characters exactly. For example, with special characters you can indicate that a pattern must occur at the start or end of a string, that part of a pattern can be repeated, or that characters in a pattern must be of a particular type. You can also match on literal occurrences of special characters. We look at each of these variations next.

Character Sets and Classes

Using character sets immediately gives regular expressions more power than exact matching expressions. Character sets can be used to match any character of a particular type; they’re really a kind of wildcard.

First, you can use the . character as a wildcard for any other single character except a newline (\n). For example, the regular expression

.at

matches the strings "cat", "sat", and "mat", among others. This kind of wildcard matching is often used for filename matching in operating systems.

With regular expressions, however, you can be more specific about the type of character you would like to match and can actually specify a set that a character must belong to. In the preceding example, the regular expression matches "cat" and "mat" but also matches "#at". If you want to limit this to a character between a and z, you can specify it as follows:

[a-z]at

Anything enclosed in the square brackets ([ and ]) is a character class—a set of characters to which a matched character must belong. Note that the expression in the square brackets matches only a single character.

You can list a set; for example,

[aeiou]

means any vowel.

You can also describe a range, as you just did using the special hyphen character, or a set of ranges, as follows:

[a-zA-Z]

This set of ranges stands for any alphabetic character in upper- or lowercase.

You can also use sets to specify that a character cannot be a member of a set. For example,

[^a-z]

matches any character that is not between a and z. The caret symbol (^) means not when it is placed inside the square brackets. It has another meaning when used outside square brackets, which we look at shortly.
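To see the difference between the wildcard and a character class, you might run something like the following. It is a minimal sketch that assumes PCRE's preg_match() (with / delimiters) rather than the POSIX functions; the sample words and variable names are made up for illustration.

```php
<?php
// Assumption: PCRE patterns with '/' delimiters; sample data is illustrative.
$words = ['cat', 'mat', '#at', 'Cat'];

$wildcard = [];
$lowercase = [];
foreach ($words as $word) {
    // .at accepts any single character before "at"
    $wildcard[$word] = preg_match('/.at/', $word) === 1;
    // [a-z]at accepts only a lowercase letter before "at"
    $lowercase[$word] = preg_match('/[a-z]at/', $word) === 1;
}

// [^a-z] looks for any character that is NOT a lowercase letter
$hasOther = preg_match('/[^a-z]/', 'abc1') === 1;   // true, because of the "1"
$allLower = preg_match('/[^a-z]/', 'abc') === 0;    // true, nothing outside a-z
```

Here "#at" and "Cat" pass the wildcard test but fail the [a-z] test, which is exactly the tightening the character class is meant to provide.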

In addition to listing out sets and ranges, you can use a number of predefined character classes in a regular expression. These classes are shown in Table 4.3.

Table 4.3  Character Classes for Use in POSIX-Style Regular Expressions

Image

Repetition

Often, you may want to specify that there might be multiple occurrences of a particular string or class of character. You can represent this using two special characters in your regular expression. The * symbol means that the pattern can be repeated zero or more times, and the + symbol means that the pattern can be repeated one or more times. The symbol should appear directly after the part of the expression that it applies to. For example,

[[:alnum:]]+

means “at least one alphanumeric character.”
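The difference between * and + shows up most clearly with an empty string. The following sketch assumes PCRE's preg_match(), which also understands the [[:alnum:]] class, and it uses the ^ and $ anchors (covered later in this chapter) so that the whole string has to fit the pattern.

```php
<?php
// Assumption: PCRE with '/' delimiters; ^ and $ force the whole string to match.
$plusOnText  = preg_match('/^[[:alnum:]]+$/', 'abc123'); // one or more: matches
$plusOnEmpty = preg_match('/^[[:alnum:]]+$/', '');       // needs at least one: no match
$starOnEmpty = preg_match('/^[[:alnum:]]*$/', '');       // zero or more: matches

echo "plus on text: $plusOnText, plus on empty: $plusOnEmpty, star on empty: $starOnEmpty\n";
```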

Subexpressions

Being able to split an expression into subexpressions is often useful so that you can, for example, represent “at least one of these strings followed by exactly one of those.” You can split expressions using parentheses, exactly the same way as you would in an arithmetic expression. For example,

(very )*large

matches "large", "very large", "very very large", and so on.

Counted Subexpressions

You can specify how many times something can be repeated by using a numerical expression in curly braces ({}). You can show an exact number of repetitions ({3} means exactly three repetitions), a range of repetitions ({2,4} means from two to four repetitions), or an open-ended range of repetitions ({2,} means at least two repetitions). Do not put spaces inside the braces.

For example,

(very ){1,3}

matches "very ", "very very ", and "very very very ".
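Subexpressions and counted repetition can be checked side by side. This sketch assumes PCRE's preg_match() and adds ^ and $ anchors (explained in the next section) so that each test string must fit the pattern exactly; the variable names are illustrative.

```php
<?php
// Assumption: PCRE with '/' delimiters, anchored so the whole string must match.
$optional = '/^(very )*large$/';   // "very " repeated zero or more times
$counted  = '/^(very ){1,3}$/';    // "very " repeated one to three times

$optionalResults = [
    'large'           => preg_match($optional, 'large'),
    'very very large' => preg_match($optional, 'very very large'),
    'verylarge'       => preg_match($optional, 'verylarge'),
];

$countedResults = [
    'one'  => preg_match($counted, 'very '),
    'three' => preg_match($counted, 'very very very '),
    'four' => preg_match($counted, 'very very very very '),
    'none' => preg_match($counted, ''),
];

print_r($optionalResults);
print_r($countedResults);
```

"verylarge" fails because the group includes the trailing space, and four repetitions fail because the count is capped at three.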

Anchoring to the Beginning or End of a String

The pattern [a-z] will match any string containing a lowercase alphabetic character. It does not matter whether the string is one character long or contains a single matching character in a longer string.

You also can specify whether a particular subexpression should appear at the start, the end, or both. This capability is useful when you want to make sure that only your search term and nothing else appears in the string.

The caret symbol (^) is used at the start of a regular expression to show that the pattern must match at the beginning of the searched string, and $ is used at the end of a regular expression to show that the pattern must match at the end.

For example, the following matches bob at the start of a string:

^bob

This pattern matches com at the end of a string:

com$

Finally, this pattern matches a string containing only a single character from a to z:

^[a-z]$
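The three anchored patterns above can be exercised as follows. This is a sketch that assumes PCRE's preg_match() with / delimiters; the test strings are invented for the example.

```php
<?php
// Assumption: PCRE with '/' delimiters; test strings are illustrative.
$startsWithBob = preg_match('/^bob/', 'bob@example.com');   // "bob" at the start
$bobNotAtStart = preg_match('/^bob/', 'jimbob');            // "bob" only at the end

$endsWithCom   = preg_match('/com$/', 'example.com');       // "com" at the end
$comNotAtEnd   = preg_match('/com$/', 'example.com.au');    // "com" in the middle

$singleLetter  = preg_match('/^[a-z]$/', 'q');              // exactly one letter
$twoLetters    = preg_match('/^[a-z]$/', 'qq');             // too long

echo "$startsWithBob $bobNotAtStart $endsWithCom $comNotAtEnd $singleLetter $twoLetters\n";
```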

Branching

You can represent a choice in a regular expression with a vertical pipe. For example, if you want to match com, edu, or net, you can use the following expression:

com|edu|net
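On its own, this expression matches com, edu, or net anywhere in a string, which may be looser than you want. The sketch below assumes PCRE's preg_match() and contrasts the plain branch with a stricter variant that groups the alternatives in parentheses and anchors them after a literal dot. The stricter pattern is one possible refinement, not something the application requires.

```php
<?php
// Assumption: PCRE with '/' delimiters; the "strict" pattern is an illustrative refinement.
$loose  = '/com|edu|net/';          // any of the three, anywhere in the string
$strict = '/\.(com|edu|net)$/';     // a dot, then one of the three, at the very end

$looseOnOrg  = preg_match($loose, 'community.org');   // matches the "com" in "community"
$strictOnOrg = preg_match($strict, 'community.org');  // no match: ends in .org
$strictOnEdu = preg_match($strict, 'mit.edu');        // matches

echo "$looseOnOrg $strictOnOrg $strictOnEdu\n";
```

The parentheses matter because the pipe has low precedence: without them, an anchor would apply only to the first or last alternative rather than to the whole choice.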

Matching Literal Special Characters

If you want to match one of the special characters mentioned in the preceding sections, such as ., {, or $, you must put a backslash (\) in front of it. If you want to represent a backslash, you must replace it with two backslashes (\\).

Be careful to put your regular expression patterns in single-quoted strings in PHP. Using regular expressions in double-quoted PHP strings adds unnecessary complications. PHP also uses the backslash to escape special characters—such as a backslash. If you want to match a backslash in your pattern, you need to use two to indicate that it is a literal backslash, not an escape code.

Similarly, if you want a literal backslash in a double-quoted PHP string, you need to use two for the same reason. The somewhat confusing, cumulative result of these rules is that a PHP string that represents a regular expression containing a literal backslash needs four backslashes. The PHP interpreter will parse the four backslashes as two. Then the regular expression interpreter will parse the two as one.

The dollar sign is also a special character in double-quoted PHP strings and regular expressions. To get a literal $ matched in a pattern, you would need "\\\$". Because this string is in double quotation marks, PHP will parse it as \$, which the regular expression interpreter can then match against a dollar sign.
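The two layers of escaping are easier to follow when you can inspect the strings PHP actually produces. The following sketch assumes PCRE's preg_match() with / delimiters; the variable names and sample strings are made up for the example.

```php
<?php
// Assumption: PCRE with '/' delimiters; names and sample strings are illustrative.

// Four backslashes in the PHP source become two in the string,
// and the regular expression engine reads those two as one literal backslash.
$backslashPattern = '/\\\\/';
$path = 'C:\\temp';                       // the string C:\temp
$foundBackslash = preg_match($backslashPattern, $path);

// A literal dollar sign written in a double-quoted string...
$dollarDouble = "/\\\$/";                 // PHP parses this to /\$/
// ...and the same pattern in single quotes, with no extra escaping needed.
$dollarSingle = '/\$/';
$foundDollar = preg_match($dollarDouble, 'Total: $5');

echo strlen($backslashPattern) . " characters in the backslash pattern\n";
```

Comparing $dollarDouble with $dollarSingle shows why single-quoted patterns are usually the less confusing choice.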

Reviewing the Special Characters

A summary of all the special characters is shown in Tables 4.4 and 4.5. Table 4.4 shows the meaning of special characters outside square brackets, and Table 4.5 shows their meaning when used inside square brackets.

Table 4.4  Summary of Special Characters Used in POSIX Regular Expressions Outside Square Brackets

Image

Table 4.5  Summary of Special Characters Used in POSIX Regular Expressions Inside Square Brackets

Image

Putting It All Together for the Smart Form

There are at least two possible uses of regular expressions in the Smart Form application. The first use is to detect particular terms in the customer feedback. You can be slightly smarter about this by using regular expressions. Using a string function, you would have to perform three different searches if you wanted to match on "shop", "customer service", or "retail". With a regular expression, you can match all three:

shop|customer service|retail

The second use is to validate customer email addresses in the application by encoding the standardized format of an email address in a regular expression. The format includes some alphanumeric or punctuation characters, followed by an @ symbol, followed by a string of alphanumeric and hyphen characters, followed by a dot, followed by more alphanumeric and hyphen characters and possibly more dots, up until the end of the string, which encodes as follows:

^[a-zA-Z0-9_\-.]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-.]+$

The subexpression ^[a-zA-Z0-9_\-.]+ means “start the string with at least one letter, number, underscore, hyphen, or dot, or some combination of those.” Note that when a dot is used inside a character class, it loses its special wildcard meaning and becomes just a literal dot.

The @ symbol matches a literal @.

The subexpression [a-zA-Z0-9\-]+ matches the first part of the hostname, including alphanumeric characters and hyphens. Note that you slash out the hyphen because it’s a special character inside square brackets.

The \. combination matches a literal dot (.). Because this dot is outside a character class, you need to escape it so that it matches only a literal dot.

The subexpression [a-zA-Z0-9\-.]+$ matches the rest of a domain name, including letters, numbers, hyphens, and more dots if required, up until the end of the string.

A bit of analysis shows that you can produce invalid email addresses that will still match this regular expression. It is almost impossible to catch them all, but this will improve the situation a little. You can refine this expression in many ways. You can, for example, list valid top-level domains (TLDs). Be careful when making things more restrictive, though, because a validation function that rejects 1% of valid data is far more annoying than one that allows through 10% of invalid data.
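You can test the email pattern against a few sample addresses to see both what it catches and what slips through. This sketch assumes PCRE's preg_match(), so the pattern (with its hyphens and separating dot escaped by backslashes) is wrapped in / delimiters. The function name looks_like_email() and the sample addresses are hypothetical.

```php
<?php
// Hypothetical helper: wraps the chapter's email pattern in PCRE '/' delimiters.
function looks_like_email($email)
{
    $pattern = '/^[a-zA-Z0-9_\-.]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-.]+$/';
    return preg_match($pattern, $email) === 1;
}

// Illustrative sample addresses, not real ones.
$samples = [
    'bob@example.com',
    'bob.smith@mail.example.co.uk',
    'bob@example',              // no dot in the host part
    'bob smith@example.com',    // space is not in the allowed set
    'a@b..c',                   // clearly invalid, yet the pattern allows it
];

foreach ($samples as $sample) {
    echo $sample . ': ' . (looks_like_email($sample) ? 'accepted' : 'rejected') . "\n";
}
```

The last sample is accepted even though it is not a usable address, which illustrates the point that a simple pattern reduces bad input without eliminating it.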

Now that you have read about regular expressions, you’re ready to look at the PHP functions that use them.

Finding Substrings with Regular Expressions

Finding substrings is the main application of the regular expressions you just developed. The two functions available in PHP for matching POSIX-style regular expressions are ereg() and eregi(). The ereg() function has the following prototype:

int ereg(string pattern, string search[, array matches]);

This function searches the search string, looking for matches to the regular expression in pattern. If matches are found for subexpressions of pattern, they will be stored in the array matches, one subexpression per array element.

The eregi() function is identical except that it is not case sensitive.

You can adapt the Smart Form example to use regular expressions as follows:

if (!eregi('^[a-zA-Z0-9_\-.]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-.]+$', $email)) {
    echo "<p>That is not a valid email address.</p>".
         "<p>Please return to the previous page and try again.</p>";
    exit;
}
$toaddress = "[email protected]";  // the default value
if (eregi("shop|customer service|retail", $feedback)) {
    $toaddress = "[email protected]";
} else if (eregi("deliver|fulfill", $feedback)) {
    $toaddress = "[email protected]";
} else if (eregi("bill|account", $feedback)) {
    $toaddress = "[email protected]";
}
// feedback from the biggest client goes straight to Bob
if (eregi("bigcustomer\.com", $email)) {
    $toaddress = "[email protected]";
}
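Because eregi() is unavailable in current PHP releases, the same routing logic can be sketched with PCRE's preg_match() and the i (case-insensitive) modifier. The function name choose_department() is hypothetical, and it returns invented department labels rather than real addresses; treat it as an illustration of the approach rather than part of the application.

```php
<?php
// Hypothetical helper: returns a department label (made up for this sketch)
// instead of an email address.
function choose_department($feedback, $email)
{
    $department = 'general';  // the default value

    // The /i modifier makes each match case insensitive, like eregi().
    if (preg_match('/shop|customer service|retail/i', $feedback)) {
        $department = 'retail';
    } elseif (preg_match('/deliver|fulfill/i', $feedback)) {
        $department = 'fulfillment';
    } elseif (preg_match('/bill|account/i', $feedback)) {
        $department = 'accounts';
    }

    // Feedback from the biggest client overrides the keyword routing.
    if (preg_match('/bigcustomer\.com/i', $email)) {
        $department = 'bob';
    }

    return $department;
}

echo choose_department('My DELIVERY was late', 'jo@example.com') . "\n";
```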

Replacing Substrings with Regular Expressions

You can also use regular expressions to find and replace substrings in the same way as you used str_replace(). The two functions available for this task are ereg_replace() and eregi_replace(). The function ereg_replace() has the following prototype:

string ereg_replace(string pattern, string replacement, string search);

This function searches for the regular expression pattern in the search string and replaces it with the string replacement.

The function eregi_replace() is identical but, again, is not case sensitive.
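As a brief illustration of pattern-based replacement, the sketch below assumes PCRE's preg_replace(), which takes its arguments in the same order (pattern, replacement, search string) but needs delimiters around the pattern. The sample strings are invented.

```php
<?php
// Assumption: PCRE's preg_replace() stands in for ereg_replace()/eregi_replace().

// Collapse every run of whitespace (spaces, tabs) into a single space.
$messy = "too   many\t spaces";
$tidy = preg_replace('/[[:space:]]+/', ' ', $messy);

// The /i modifier gives a case-insensitive replacement, like eregi_replace().
$fixed = preg_replace('/bob/i', 'Bob', 'BOB and bob');

echo $tidy . "\n" . $fixed . "\n";
```

A plain str_replace() could not do the first replacement in one call, because the length and makeup of each whitespace run varies.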

Splitting Strings with Regular Expressions

Another useful regular expression function is split(), which has the following prototype:

array split(string pattern, string search[, int max]);

This function splits the string search into substrings on the regular expression pattern and returns the substrings in an array. The max integer limits the number of items that can go into the array.

This function can be useful for splitting up email addresses, domain names, or dates. For example,

$address = "username@example.com";
$arr = split("\.|@", $address);
foreach ($arr as $value) {
  echo "<br />".$value;
}

This example splits the address into its three components and prints each on a separate line. The separators themselves (the @ symbol and the dot) are not included in the resulting array.

username
example
com
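A pattern-based split normally discards the separators, so an address such as username@example.com yields three pieces. If you also want the @ and the dot as separate elements, one option is PCRE's preg_split() with the PREG_SPLIT_DELIM_CAPTURE flag, sketched below under the assumption that PCRE is used in place of split(). The variable names are illustrative.

```php
<?php
// Assumption: PCRE's preg_split() stands in for split().
$address = 'username@example.com';

// Plain split: separators are dropped.
$parts = preg_split('/\.|@/', $address);

// Wrapping the pattern in parentheses and passing PREG_SPLIT_DELIM_CAPTURE
// keeps each separator as its own array element.
$partsWithSeparators = preg_split('/(\.|@)/', $address, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($partsWithSeparators as $value) {
    echo $value . "\n";
}
```

The third argument, -1, means "no limit", which is needed here only so that the flag can be supplied as the fourth argument.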

Further Reading

PHP has many string functions. We covered the more useful ones in this chapter, but if you have a particular need (such as translating characters into Cyrillic), check the PHP manual online to see whether PHP has the function for you.

The amount of material available on regular expressions is enormous. You can start with the man page for regexp if you are using Unix, and you can also find some terrific articles at devshed.com and phpbuilder.com.

At Zend’s website, you can look at a more complex and powerful email validation function than the one we developed here. It is called MailVal() and is available at http://www.zend.com/code/codex.php?ozid=88&single=1.

Regular expressions take a while to sink in; the more examples you look at and run, the more confident you will be using them.

Next

In the next chapter, we discuss several ways you can use PHP to save programming time and effort and prevent redundancy by reusing pre-existing code.
