An Example: Extracting Attributes from HTML Tags

Here's a quick example that illustrates how to use regular expressions to extract data from a document and how greediness works. This program extracts all the opening HTML tags from a document, prints out their names, and then prints out all their attributes and the values assigned to those attributes. It assumes that the attributes all have values, and that those values are enclosed in quotation marks (single or double).

First, let's go straight to the source code for the script. It's in Listing 10.1.

Listing 10.1. The extractattrs.pl Script
#!/usr/bin/perl

while (<>)
{
    while (/<(\w+)(.*?)>/g)
    {
        print "Tag: $1\n";

        my $attrs = $2;

        while ($attrs =~ /(\w+)=('|")(.+?)\2/g)
        {
            print "Attribute: $1\n";
            print "Value: $3\n";
        }
    }
}
					

This script is really simple. It iterates over the files that are passed in on the command line, searching for HTML tags. Let's look at the first regular expression in the file:

/<(\w+)(.*?)>/g

This expression is applied to each line in the file. On the outside of the expression I have < and > to match tags. Inside the angle brackets, I start with an expression, (\w+), that matches one or more word characters and extracts the tag name. The second expression, (.*?), matches everything else between the angle brackets. It is marked as nongreedy so that if there are multiple tags on the line, it stops at the closing angle bracket of the tag I'm currently processing, not the closing angle bracket of the last tag on the line. Everything between the tag name and the closing angle bracket is grouped so that I can process it separately. This expression won't match closing tags because the tag must start with word characters, not /.
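To see what the nongreedy marker buys you, here is a minimal sketch that runs the tag expression against one line twice, once as written and once with a greedy .* in its place. The sample line and the tags_matched helper are hypothetical and exist only for this illustration; they are not part of the extractattrs.pl script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line containing two opening tags.
my $line = '<b class="x">bold</b> and <i id="y">italic</i>';

# Hypothetical helper: returns one "name|rest" string for every
# tag that the given expression matches in the text.
sub tags_matched {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, "$1|$2";
    }
    return @found;
}

# The expression from the script (nongreedy) and a greedy variant.
my @nongreedy = tags_matched(qr/<(\w+)(.*?)>/, $line);
my @greedy    = tags_matched(qr/<(\w+)(.*)>/,  $line);

print "Nongreedy: $_\n" for @nongreedy;
print "Greedy:    $_\n" for @greedy;
```

The nongreedy version reports two tags, b and i, each with only its own attributes. The greedy version reports a single tag named b whose "attributes" run all the way to the last > on the line, swallowing the second tag along the way.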

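After the tag information has been extracted, the tag name (stored in $1) is printed, and the remainder of the tag (stored in $2) is copied into $attrs, because the next successful match will overwrite $1 and $2. Then, I have another inner while loop that iterates over the information stored in the remainder of the tag. Let's look at that expression: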

/(\w+)=('|")(.+?)\2/g

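This expression extracts attributes from the rest of the data in the tag. More specifically, it extracts attributes with values that are enclosed in single or double quotation marks. (If the HTML document is XHTML-compliant, this will catch nearly all the attributes, because XHTML requires every attribute to have a quoted value. An empty value such as alt="" is an exception, because .+? needs at least one character to match.)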

Let's look at the expression in detail. First, it matches one or more word characters and stores them in the first backreference. Then it matches an equals sign, followed by either a single or double quotation mark, which is stored in backreference two. Next, a nongreedy expression matches any string of one or more characters (and stores it in backreference three) until it hits \2, the value of backreference two, which contains the quotation mark used to open the attribute value. This expression matches attributes such as color="blue" or size='3', but not height=15, or nowrap, or colspan='2". The script then prints out the name and value of each attribute in the tag.
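The cases listed above can be checked with a short sketch that applies the attribute expression to a single string containing all five examples. The string is a hypothetical stand-in for what $attrs might hold; it is not taken from a real document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical attribute string holding the five examples from the text.
my $attrs = q{ color="blue" size='3' height=15 nowrap colspan='2"};

# Collect "name=value" for every attribute the expression accepts.
my @pairs;
while ($attrs =~ /(\w+)=('|")(.+?)\2/g) {
    push @pairs, "$1=$3";
}

print "$_\n" for @pairs;
# Prints:
# color=blue
# size=3
```

Only color and size are reported. height=15 has no opening quotation mark, nowrap has no equals sign, and colspan='2" never finds a closing single quotation mark to satisfy \2.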
