You want to convert all character entities defined by the XML standard to their corresponding literal characters. The conversion should handle named character references (such as &amp;, &lt;, and &quot;) as well as numeric character references (be they in decimal notation as &#931; or &#0931;, or in hexadecimal notation as &#x3A3;, &#x03A3;, or &#x3a3;).
&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
This regular expression includes three capturing groups. Only one of the groups participates in any particular match and captures a value. Using three groups like this lets you easily check which type of entity was matched.
Use the regular expression just shown, together with the code in Recipe 3.16. The code examples listed there show how to perform a search-and-replace with replacement text generated in code.
When writing your replacement callback function, use backreferences to determine the appropriate replacement character. If group 1 captured a value, backreference 1 holds a numeric character reference in decimal notation, possibly with leading zeros. If group 2 captured a value, backreference 2 holds a numeric character reference in hexadecimal notation, possibly with leading zeros. If group 3 captured a value, backreference 3 holds an entity name. Use a lookup object, dictionary, hash, or whatever data structure is most convenient to map entity names to their corresponding characters by value or character code. You can then quickly identify which character to use as your replacement text.
The next section uses JavaScript to demonstrate how this all ties together.
// Accepts the match ($0) and backreferences; returns replacement text
function callback($0, $1, $2, $3) {
    var charCode;

    // Name lookup object that maps to decimal character codes
    // Equivalent hexadecimal numbers are listed in comments
    var names = {
        quot: 34, // 0x22
        amp:  38, // 0x26
        apos: 39, // 0x27
        lt:   60, // 0x3C
        gt:   62  // 0x3E
    };

    // Decimal character reference
    if ($1) {
        charCode = parseInt($1, 10);
    // Hexadecimal character reference
    } else if ($2) {
        charCode = parseInt($2, 16);
    // Named entity with a lookup mapping (own properties only, so that
    // inherited names such as "toString" are not treated as entities)
    } else if ($3 && names.hasOwnProperty($3)) {
        charCode = names[$3];
    // Invalid or unknown entity name
    } else {
        return $0; // Return the match unaltered
    }

    // Return a literal character
    return String.fromCharCode(charCode);
}

// Replace all entities with literal text
subject = subject.replace(
    /&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/g,
    callback);
The regular expression and example code we’ve shown in this recipe are intended for decoding snippets of XML-style text, rather than entire XML documents. The regex here can be useful when converting XML or (X)HTML content to plain text, but keep in mind that no restrictions are placed on where named or numbered entities can occur within the subject text. For instance, there is no special handling for skipping entities in XML CDATA blocks or HTML script blocks.
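One common way to work around the lack of CDATA handling is to add the construct you want to skip as an extra alternative in front of the entity pattern, and return it unchanged from the callback. The following is a minimal sketch of that idea, not part of the recipe's solution. The function name decodeOutsideCdata and the lookup object xmlNames are hypothetical, and the sketch assumes CDATA sections are well formed and not nested.

```javascript
// Hypothetical lookup object for the five XML named entities
const xmlNames = { quot: 34, amp: 38, apos: 39, lt: 60, gt: 62 };

// Decodes entities, but leaves <![CDATA[...]]> sections untouched
function decodeOutsideCdata(subject) {
  // First alternative: a whole CDATA section (no capturing groups)
  // Second alternative: the entity regex from this recipe
  const pattern =
    /<!\[CDATA\[[\s\S]*?\]\]>|&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/g;

  return subject.replace(pattern, function (match, dec, hex, name) {
    if (dec) return String.fromCharCode(parseInt(dec, 10));
    if (hex) return String.fromCharCode(parseInt(hex, 16));
    if (name && Object.prototype.hasOwnProperty.call(xmlNames, name)) {
      return String.fromCharCode(xmlNames[name]);
    }
    // A CDATA section or an unknown entity name: return it unaltered
    return match;
  });
}
```

Because the CDATA alternative consumes the entire section in one match, any entity-like text inside it is never seen by the entity alternative. The same trick could be applied to HTML script blocks with a suitable extra alternative.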
The JavaScript example code converts both decimal and hexadecimal numeric references to their corresponding literal characters, and additionally converts the five named entities that are defined in the XML standard: &quot; ("), &amp; (&), &apos; ('), &lt; (<), and &gt; (>). HTML includes many more named entities that aren’t covered here.[22] If you follow the approach used in the example code, however, it should be straightforward to add as many more entity names as you need.
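As a rough illustration of adding names, the sketch below extends the lookup object with a handful of HTML entity names. The selection of names (nbsp, copy, reg, euro) is arbitrary, and the names entityNames and decodeEntities are hypothetical; a real application would add whichever entities it expects to encounter.

```javascript
// Lookup object mapping entity names to decimal character codes
const entityNames = {
  // The five XML named entities
  quot: 34, amp: 38, apos: 39, lt: 60, gt: 62,
  // A few HTML named entities, chosen only as examples
  nbsp: 160,  // 0xA0, no-break space
  copy: 169,  // 0xA9, copyright sign
  reg:  174,  // 0xAE, registered sign
  euro: 8364  // 0x20AC, euro sign
};

function decodeEntities(subject) {
  return subject.replace(
    /&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));/g,
    function (match, dec, hex, name) {
      let charCode;
      if (dec) {
        charCode = parseInt(dec, 10);
      } else if (hex) {
        charCode = parseInt(hex, 16);
      } else if (Object.prototype.hasOwnProperty.call(entityNames, name)) {
        charCode = entityNames[name];
      } else {
        return match; // Unknown entity name: return the match unaltered
      }
      return String.fromCharCode(charCode);
    });
}
```

Only the lookup object changes; the regular expression and the branching logic stay the same as in the XML-only version.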
The JavaScript example code converts the following subject string:
"&lt; &bogus; dec &#65;&#0065; &amp;lt; hex &#x41;&#x041; &gt;"
To this:
"< &bogus; dec AA &lt; hex AA >"
JavaScript’s String.fromCharCode() method works with 16-bit code units, so it doesn’t handle Unicode code points beyond U+FFFF. The provided code therefore works correctly only with numeric character references up to &#xFFFF; hexadecimal and &#65535; decimal. This shouldn’t be a problem in most cases, since characters beyond this range are rare.
Numeric character references with numbers above this range are invalid
in the first edition of the XML 1.0 standard.
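If you do need characters beyond U+FFFF, environments that implement ECMAScript 2015 or later provide String.fromCodePoint(), which accepts any code point up to U+10FFFF. The following sketch handles only numeric references and assumes such an environment. The function name decodeNumericReferences is hypothetical, and the choice to leave out-of-range references unaltered is an assumption rather than something the recipe prescribes.

```javascript
// Decodes decimal and hexadecimal numeric character references,
// including those for code points above U+FFFF
function decodeNumericReferences(subject) {
  return subject.replace(
    /&#(?:([0-9]+)|x([0-9a-fA-F]+));/g,
    function (match, dec, hex) {
      const codePoint = dec ? parseInt(dec, 10) : parseInt(hex, 16);

      // String.fromCodePoint() throws a RangeError above U+10FFFF,
      // so return such references unaltered instead
      if (codePoint > 0x10FFFF) {
        return match;
      }
      return String.fromCodePoint(codePoint);
    });
}
```

For code points above U+FFFF, the returned string consists of two UTF-16 code units (a surrogate pair), so its length property is 2 even though it represents a single character.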
Some programming languages and XML APIs have built-in functions to perform XML or HTML entity decoding. For instance, in PHP 4.3 and later you can use the function html_entity_decode(). It might still be helpful to implement your own method, since such functions vary in which entity names they recognize. In some cases, such as with Ruby’s CGI::unescapeHTML(), even fewer than the standard five XML named entities are recognized.
Recipe 9.5 explains how to convert plain text to HTML by adding <p> and <br> tags. The first step in the process is HTML-encoding &, <, and > characters using named entities.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.