9.6. Decode XML Entities

Problem

You want to convert all character entities defined by the XML standard to their corresponding literal characters. The conversion should handle named character references (such as &amp;, &lt;, and &quot;) as well as numeric character references (be they in decimal notation as &#0931; or &#931;, or in hexadecimal notation as &#x3A3;, &#x3a3;, or &#x03A3;).

Solution

Regular expression

&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

This regular expression includes three capturing groups. Only one of the groups participates in any particular match and captures a value. Using three groups like this allows you to easily check which type of entity was matched.
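As a minimal sketch of how the three groups can be told apart, the following JavaScript checks which group participated in a match. The function name entityKind is hypothetical and used only for illustration. It relies on JavaScript reporting nonparticipating groups as undefined; other flavors may signal this differently.

```javascript
// Hypothetical helper: reports which kind of entity a string contains
function entityKind(text) {
    var regex = /&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/;
    var match = regex.exec(text);

    if (!match) {
        return "none";
    }
    // Groups that did not participate are undefined in JavaScript
    if (match[1] !== undefined) {
        return "decimal";
    }
    if (match[2] !== undefined) {
        return "hexadecimal";
    }
    return "named";
}
```

For example, entityKind("&#931;") reports a decimal reference and entityKind("&amp;") reports a named one.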

Replace matches with their corresponding literal characters

Use the regular expression just shown, together with the code in Recipe 3.16. The code examples listed there show how to perform a search-and-replace with replacement text generated in code.

When writing your replacement callback function, use backreferences to determine the appropriate replacement character. If group 1 captured a value, backreference 1 holds a numeric character reference in decimal notation, possibly with leading zeros. If group 2 captured a value, backreference 2 holds a numeric character reference in hexadecimal notation, possibly with leading zeros. If group 3 captured a value, backreference 3 holds an entity name. Use a lookup object, dictionary, hash, or whatever data structure is most convenient to map entity names to their corresponding characters by value or character code. You can then quickly identify which character to use as your replacement text.

The next section uses JavaScript to demonstrate how this all ties together.

Example JavaScript solution

// Accepts the match ($0) and backreferences; returns replacement text
function callback($0, $1, $2, $3) {
    var charCode;

    // Name lookup object that maps to decimal character codes
    // Equivalent hexadecimal numbers are listed in comments
    var names = {
        quot: 34, // 0x22
        amp: 38, // 0x26
        apos: 39, // 0x27
        lt: 60, // 0x3C
        gt: 62 // 0x3E
    };

    // Decimal character reference
    if ($1) {
        charCode = parseInt($1, 10);
    // Hexadecimal character reference
    } else if ($2) {
        charCode = parseInt($2, 16);
    // Named entity with a lookup mapping
    } else if ($3 && names.hasOwnProperty($3)) { // Ignore inherited names such as "constructor"
        charCode = names[$3];
    // Invalid or unknown entity name
    } else {
        return $0; // Return the match unaltered
    }

    // Return a literal character
    return String.fromCharCode(charCode);
}

// Replace all entities with literal text
subject = subject.replace(
        /&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/g,
        callback);

Discussion

The regular expression and example code we’ve shown in this recipe are intended for decoding snippets of XML-style text, rather than entire XML documents. The regex here can be useful when converting XML or (X)HTML content to plain text, but keep in mind that no restrictions are placed on where named or numbered entities can occur within the subject text. For instance, there is no special handling for skipping entities in XML CDATA blocks or HTML script blocks.
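If CDATA sections do need to be skipped, one possible approach is to add an alternative to the front of the regex that consumes an entire CDATA section, and have the callback return such matches unaltered. The sketch below is an assumption about how this could be done, not part of the recipe's solution; the function name decodeOutsideCdata is hypothetical, and HTML script blocks are still not handled.

```javascript
// Hypothetical helper; uses the five XML entity names from this recipe
function decodeOutsideCdata(subject) {
    var names = { quot: 34, amp: 38, apos: 39, lt: 60, gt: 62 };

    // The first alternative consumes a whole CDATA section.
    // None of the three capturing groups participate in that case.
    var regex = /<!\[CDATA\[[\s\S]*?\]\]>|&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/g;

    return subject.replace(regex, function (match, dec, hex, name) {
        if (dec) {
            return String.fromCharCode(parseInt(dec, 10));
        }
        if (hex) {
            return String.fromCharCode(parseInt(hex, 16));
        }
        if (name && names.hasOwnProperty(name)) {
            return String.fromCharCode(names[name]);
        }
        // CDATA section or unknown entity name: return the match unaltered
        return match;
    });
}
```

Because the regex engine finds matches from left to right, any entity inside a CDATA section is swallowed by the first alternative before the entity pattern gets a chance to match it.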

The JavaScript example code converts both decimal and hexadecimal numeric references to their corresponding literal characters, and additionally converts the five named entities that are defined in the XML standard: &quot; ("), &amp; (&), &apos; ('), &lt; (<), and &gt; (>). HTML includes many more named entities that aren’t covered here.[22] If you follow the approach used in the example code, however, it should be straightforward to add as many more entity names as you need.
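As an illustration of adding entity names, the sketch below extends the lookup object with a handful of HTML names. The names htmlNames and decodeNamed are hypothetical, and the four HTML entries are only a small sample chosen for the example. Numeric references are left out to keep the sketch short.

```javascript
// Hypothetical extended lookup: the five XML names plus a few HTML names
var htmlNames = {
    quot: 34, amp: 38, apos: 39, lt: 60, gt: 62,
    nbsp: 160,   // 0xA0, no-break space
    copy: 169,   // 0xA9, copyright sign
    eacute: 233, // 0xE9, e with acute accent
    euro: 8364   // 0x20AC, euro sign
};

// Replaces known named entities; unknown names are returned unaltered
function decodeNamed(subject) {
    return subject.replace(/&([0-9a-zA-Z]+);/g, function (match, name) {
        return htmlNames.hasOwnProperty(name)
            ? String.fromCharCode(htmlNames[name])
            : match;
    });
}
```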

The JavaScript example code converts the following subject string:

"&lt; &bogus; dec &#65;&#0065; &amp;lt; hex &#x41;&#x041; &gt;"

To this:

"< &bogus; dec AA &lt; hex AA >"

JavaScript’s String.fromCharCode() method works with 16-bit character codes, so the provided code works correctly only with numeric character references up to &#xFFFF; hexadecimal and &#65535; decimal. Larger numbers are silently truncated to 16 bits and produce the wrong character. This shouldn’t be a problem in most cases, since characters beyond this range are rare. Numeric character references with numbers above this range are invalid in the first edition of the XML 1.0 standard.
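Where an ECMAScript 2015 or later engine is available, String.fromCodePoint() accepts code points up to U+10FFFF. The following sketch assumes such an engine; the helper name codePointToString is hypothetical. It rejects numbers beyond the Unicode range and lone surrogate values instead of passing them through.

```javascript
// Hypothetical helper: converts a code point number to a string, or
// returns null when the number is not a usable Unicode character
function codePointToString(codePoint) {
    // Beyond the last Unicode code point
    if (codePoint > 0x10FFFF) {
        return null;
    }
    // Surrogate values are not characters on their own
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
        return null;
    }
    return String.fromCodePoint(codePoint);
}
```

In the callback function, the final String.fromCharCode(charCode) call could be swapped for this helper, returning the match unaltered whenever the helper returns null.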

Tip

Some programming languages and XML APIs have built-in functions to perform XML or HTML entity decoding. For instance, in PHP 4.3 and later you can use the function html_entity_decode(). It might still be helpful to implement your own method since such functions vary in which entity names they recognize. In some cases, such as with Ruby’s CGI::unescapeHTML(), even fewer than the standard five XML named entities are recognized.

See Also

Recipe 9.5 explains how to convert plain text to HTML by adding <p> and <br> tags. The first step in the process is HTML-encoding &, <, and > characters using named entities.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.



[22] HTML 4.01 defines 252 named entities. HTML5 has more than 2,000.
