An Example: A Guessing Program

Back in Lesson 6, there was an example program that implemented a number guessing game. In this example, we're going to turn things around and let the program try to guess what you entered. This program illustrates how you can use regular expressions to break down strings and figure out what they are.

The structure of the program is simple. It consists of an infinite while loop that continues to iterate until the last function is called. If the user enters 'q' on a line by itself, then the program will exit. Otherwise, it applies a number of regular expressions to the user's input to try to determine what sort of data was entered. If the input does not match any of the expressions, then a message indicating that no match was found is printed, and the program solicits more input.

The most interesting part of this program is the regular expressions themselves—the logic you've seen in several other examples in this book. Let's break each of them down so you can get an idea of how these regular expressions are used.

The first expression is simple: it checks to see whether the input is just male or female:

if ($value =~ /^(male|female)$/i) {
    print "$value looks like a gender.\n";
    next;
}

This expression follows a convention shared by every expression in this example: it begins with ^ and ends with $, so the pattern must match the entire string. We're not trying to extract data from the middle of strings, but rather to test entire strings to see whether they conform to a standard. The expression itself is simple: it checks whether the string is “male” or “female,” and uses the i switch to make the match case insensitive.
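To see what the anchors buy you, here is a minimal sketch that compares the anchored expression with an unanchored version of the same alternation. The two helper subroutines are hypothetical names introduced only for this illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.
# Anchored: the whole string must be "male" or "female".
sub matches_anchored {
    my ($value) = @_;
    return $value =~ /^(male|female)$/i ? 1 : 0;
}

# Unanchored: "male" or "female" may appear anywhere in the string.
sub matches_unanchored {
    my ($value) = @_;
    return $value =~ /(male|female)/i ? 1 : 0;
}

for my $value ('Female', 'malefactor', 'the female lead') {
    printf "%-16s anchored: %d  unanchored: %d\n",
        $value, matches_anchored($value), matches_unanchored($value);
}
```

Only the first value passes the anchored test, while all three pass the unanchored one, which is why the program anchors every expression.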

The next search is a bit more complex: it looks for a street address. It searches for a number, followed by a space, followed by a capital letter, followed by any number of word characters or white space characters. Here's the expression:

if ($value =~ /^\d+\s[A-Z][\w\s]*$/) {
    print "$value looks like a street address.\n";
    next;
}

It will match things like 105 Locust, as well as more complex strings like 10 South Street Plaza. However, it will not match 10B West Main (because a letter follows the digits where a white space character is required), or even 2105 westheimer (because the first letter is not capitalized). Breaking this down, we see several individual expressions. First, \d+ is used to match at least one digit. Then, \s is used to match exactly one white space character. Next is [A-Z], a character range that matches one capital letter. Finally, any combination of white space or word characters is matched using [\w\s]*, which accounts for most endings in the address (except those that contain punctuation).
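The following sketch runs the street address expression, /^\d+\s[A-Z][\w\s]*$/, against the examples mentioned above plus one value with punctuation. The subroutine name looks_like_street_address and the value 12 Elm St. are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the street address expression from the text.
sub looks_like_street_address {
    my ($value) = @_;
    return $value =~ /^\d+\s[A-Z][\w\s]*$/ ? 1 : 0;
}

my @samples = (
    '105 Locust',             # digits, space, capital letter, word characters
    '10 South Street Plaza',  # white space is allowed after the first letter
    '10B West Main',          # fails: a letter follows the digits
    '2105 westheimer',        # fails: the first letter is lowercase
    '12 Elm St.',             # fails: the period is not in [\w\s]
);

for my $value (@samples) {
    printf "%-24s %s\n", $value,
        looks_like_street_address($value) ? 'match' : 'no match';
}
```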

The next snippet of code checks to see whether the value being evaluated is an e-mail address. Here it is:

if ($value =~ /^[\w.]+@\w[\w.-]+\w\.[A-Za-z]{2,4}$/) {
    print "$value looks like an email address.\n";
    next;
}

Let's look at how this regular expression breaks down. First, I check for one or more word characters or periods, followed by an @ sign. It then searches for a word character, followed by one or more word characters, periods, or hyphens, followed by a single word character, followed by a period, followed by two to four letters. This expression isn't perfect, but more often than not, it can tell the difference between a valid email address and something that isn't one.

Let's look at why I used each expression after the @ sign. First, the expression starts by searching for a single word character, which represents the beginning of the domain name. It ends with two to four letters, which represent the top level domain name. Before that is the period which separates the top level domain name from the rest of the domain name, and before that is the word character which is required at the end of the domain name (prior to the top level domain name). In the middle, any sequence of at least one word character, hyphen, or period is allowed.
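A short sketch can show where this expression succeeds and where it falls short. The subroutine name looks_like_email and the sample addresses are assumptions chosen for this illustration. Note that the pattern requires at least three characters in the domain name before the final period, and a top level domain of two to four letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the email expression from the text.
sub looks_like_email {
    my ($value) = @_;
    return $value =~ /^[\w.]+@\w[\w.-]+\w\.[A-Za-z]{2,4}$/ ? 1 : 0;
}

my @samples = (
    'user@example.com',             # matches
    'first.last@mail.example.org',  # matches: periods are allowed on both sides
    'me@ab.com',                    # fails: domain name shorter than three characters
    'user@example.museum',          # fails: top level domain longer than four letters
    'not an address',               # fails: no @ sign
);

for my $value (@samples) {
    printf "%-30s %s\n", $value, looks_like_email($value) ? 'match' : 'no match';
}
```

The third and fourth samples are plausible addresses that the expression rejects, which illustrates why the text calls it imperfect.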

The next expression matches a typical city, state, and zip code:

if ($value =~ /^[A-Z][a-z]+,\s[A-Z]{2}\s{1,2}\d{5}$/) {
    print "$value looks like a city, state, and zip code.\n";
    next;
}

This expression is probably a bit more comprehensible than the e-mail address expression. It searches for a capital letter followed by some sequence of lowercase letters, which represent the city name. It then looks for a comma, followed by a white space character, which separate the city from the state. It then checks for two uppercase letters, which represent a state abbreviation. For greater accuracy, I could have included all 50 state abbreviations in an expression like:

(AK|AR|LA|TX|...)
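As a rough sketch of that idea, the alternation can be built from a list with join and interpolated into the pattern. The list below holds only the four abbreviations shown above and is deliberately incomplete; the names @states, $state_re, and looks_like_city_state_zip are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative subset only: a real program would list every abbreviation.
my @states   = qw(AK AR LA TX);
my $state_re = join '|', @states;    # "AK|AR|LA|TX"

# Hypothetical variant of the city, state, and zip expression that
# accepts only the abbreviations in @states.
sub looks_like_city_state_zip {
    my ($value) = @_;
    return $value =~ /^[A-Z][a-z]+,\s(?:$state_re)\s{1,2}\d{5}$/ ? 1 : 0;
}

for my $value ('Houston, TX 77002', 'Houston, ZZ 77002', 'New Orleans, LA 70112') {
    printf "%-24s %s\n", $value,
        looks_like_city_state_zip($value) ? 'match' : 'no match';
}
```

The last sample fails for a reason unrelated to the state list: [A-Z][a-z]+ accepts only a single-word city name, so a city containing a space never matches.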

One or two spaces can follow the state abbreviation, and then the program checks for a five-digit number representing a zip code.

The next expression is the most complex; in fact, it spans two lines:

if ($value =~ /^\W{0,1} \d{3} \W{0,1} \s{0,1}
    \W{0,1} \s{0,1} \d{3} \s{0,1} \W{0,1} \s{0,1} \d{4} $/x) {
    print "$value looks like a telephone number.\n";
    next;
}

This expression is used to match phone numbers. It tries to account for a number of ways of formatting phone numbers—everything from 8005551212 to (800) 555-1212. It will even match something like 800/555/1212. The key here is flexibility. The main thing is that it searches for 10 digits formatted in any way that could be taken to represent a telephone number. The complexity in this expression revolves around all of the optional characters.

First, I search for 0 or 1 nonword characters, so I can catch something like an opening parenthesis. Then I search for 3 digits, representing the area code. I follow that by searching for another optional nonword character, possibly representing a closing parenthesis. Then there's an optional white space character, which could separate the area code from the exchange. That's followed by an optional nonword character, followed by another optional white space character. The purpose here is to allow things like 800 - 555 - 1212. Then, I match 3 digits representing the exchange. After that, I again match an optional space, an optional nonword character, and another optional space, finishing with the four-digit number that ends the phone number. This expression uses the x switch, which means that white space inside the pattern is ignored, so the pattern can span multiple lines.
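Here is a minimal sketch that exercises the telephone expression against the formats mentioned above. It writes each optional piece with the ? quantifier, which is equivalent to {0,1}. The subroutine name looks_like_phone_number and the sample 555-1212 are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the telephone expression from the text.
# The ? quantifier means "zero or one", the same as {0,1}.
sub looks_like_phone_number {
    my ($value) = @_;
    return $value =~ /^ \W? \d{3} \W? \s?
                        \W? \s? \d{3} \s? \W? \s? \d{4} $/x ? 1 : 0;
}

my @samples = (
    '8005551212',
    '(800) 555-1212',
    '800/555/1212',
    '800 - 555 - 1212',
    '555-1212',          # fails: only seven digits
);

for my $value (@samples) {
    printf "%-18s %s\n", $value,
        looks_like_phone_number($value) ? 'match' : 'no match';
}
```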

The rest of the script is just a wrapper that loops and allows the user to enter values and have them tested. The full source listing is in Listing 9.1.

Listing 9.1. The Source Code for the values.pl Script
#!/usr/local/bin/perl

print "Enter a value and this program will try to guess what type of\n";
print "value it is. Enter 'q' by itself on a line to quit.\n";

while () {
    print "Enter a value: ";
    chomp($value = <STDIN>);

    if ($value =~ /^$/) {
        print "Please enter a value.\n";
        next;
    }

    last if ($value =~ /^q$/i);

    if ($value =~ /^(male|female)$/i) {
        print "$value looks like a gender.\n";
        next;
    }

    if ($value =~ /^\d+\s[A-Z][\w\s]*$/) {
        print "$value looks like a street address.\n";
        next;
    }

    if ($value =~ /^[\w.]+@\w[\w.-]+\w\.[A-Za-z]{2,4}$/) {
        print "$value looks like an email address.\n";
        next;
    }

    if ($value =~ /^[A-Z][a-z]+,\s[A-Z]{2}\s{1,2}\d{5}$/) {
        print "$value looks like a city, state, and zip code.\n";
        next;
    }

    if ($value =~ /^\W{0,1} \d{3} \W{0,1} \s{0,1}
                    \W{0,1} \s{0,1} \d{3} \s{0,1} \W{0,1} \s{0,1} \d{4} $/x) {
        print "$value looks like a telephone number.\n";
        next;
    }

    if ($value =~ /^perl$/i) {
        print "$value looks like the name of a programming language.\n";
        next;
    }

    print "I couldn't figure out what that value is.\n";
}
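The chain of if statements in the listing can also be expressed as a table of labels and patterns that is checked in order. The sketch below is one possible restructuring, not the way the original script is written; the names @checks and guess_type are assumptions introduced here. It returns the label of the first matching expression, or nothing if no expression matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each entry pairs a label with a precompiled pattern.
# Order matters because the first match wins.
my @checks = (
    [ 'a gender'                   => qr/^(male|female)$/i ],
    [ 'a street address'           => qr/^\d+\s[A-Z][\w\s]*$/ ],
    [ 'an email address'           => qr/^[\w.]+@\w[\w.-]+\w\.[A-Za-z]{2,4}$/ ],
    [ 'a city, state, and zip code' => qr/^[A-Z][a-z]+,\s[A-Z]{2}\s{1,2}\d{5}$/ ],
    [ 'a telephone number'         => qr/^ \W? \d{3} \W? \s?
                                            \W? \s? \d{3} \s? \W? \s? \d{4} $/x ],
    [ 'the name of a programming language' => qr/^perl$/i ],
);

# Hypothetical helper: returns the label of the first pattern that
# matches, or an empty return (undef in scalar context) if none does.
sub guess_type {
    my ($value) = @_;
    for my $check (@checks) {
        my ($label, $pattern) = @$check;
        return $label if $value =~ $pattern;
    }
    return;
}

for my $value ('male', '105 Locust', 'Perl', '???') {
    my $label = guess_type($value);
    print defined $label
        ? "$value looks like $label.\n"
        : "I couldn't figure out what $value is.\n";
}
```

With this layout, adding a new kind of value means adding one line to the table rather than another if block.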
					
