Given a plain text string, such as a multiline value
submitted via a form, you want to convert it to an HTML fragment to
display within a web page. Paragraphs, separated by two line breaks in a
row, should be surrounded with <p>⋯</p>. Additional
line breaks should be replaced with <br> tags.
This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.
As we’re converting plain text to HTML, the first step
is to convert the three special HTML characters &, <, and >
to named character references (see Table 9-3).
Otherwise, the resulting markup could lead to unintended results when
displayed in a web browser.
Table 9-3. HTML special character substitutions
Search for | Replace with |
---|---|
‹&› | «&amp;» |
‹<› | «&lt;» |
‹>› | «&gt;» |
Ampersands (&) must be
replaced first, since you’ll be adding more ampersands to the subject
string as part of the named character references.
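The ordering requirement can be checked with a small sketch. The function names `escapeHtml` and `escapeHtmlWrongOrder` are hypothetical, chosen only for this illustration; the replacements themselves are the ones listed in Table 9-3.

```javascript
// Hypothetical helper names; only the order of the replacements matters here.
function escapeHtml(text) {
    // Ampersands first, so the "&" in "&lt;" and "&gt;" is not escaped again.
    return text.replace(/&/g, "&amp;")
               .replace(/</g, "&lt;")
               .replace(/>/g, "&gt;");
}

function escapeHtmlWrongOrder(text) {
    // Angle brackets first: the ampersands they introduce get escaped a second time.
    return text.replace(/</g, "&lt;")
               .replace(/>/g, "&gt;")
               .replace(/&/g, "&amp;");
}

console.log(escapeHtml("< AT&T >"));           // &lt; AT&amp;T &gt;
console.log(escapeHtmlWrongOrder("< AT&T >")); // &amp;lt; AT&amp;T &amp;gt;
```

With the wrong order, a browser would display the literal text `&lt;` instead of a `<` character.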
\r\n?|\n
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
\R
Regex options: None |
Regex flavors: PCRE 7, Perl 5.10 |
<br>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby |
<br>\s*<br>
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
</p><p>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby |
Tying all four steps together, we’ll create a JavaScript
function called htmlFromPlainText(). This function accepts a
string, processes it using the steps we’ve just described, then
returns the new HTML string:
    function htmlFromPlainText(subject) {
        // Step 1 (plain text searches)
        subject = subject.replace(/&/g, "&amp;").
                          replace(/</g, "&lt;").
                          replace(/>/g, "&gt;");

        // Step 2
        subject = subject.replace(/\r\n?|\n/g, "<br>");

        // Step 3
        subject = subject.replace(/<br>\s*<br>/g, "</p><p>");

        // Step 4
        subject = "<p>" + subject + "</p>";

        return subject;
    }

    // Run some tests...
    htmlFromPlainText("Test.");            // -> "<p>Test.</p>"
    htmlFromPlainText("Test.\n");          // -> "<p>Test.<br></p>"
    htmlFromPlainText("Test.\n\n");        // -> "<p>Test.</p><p></p>"
    htmlFromPlainText("Test1.\nTest2.");   // -> "<p>Test1.<br>Test2.</p>"
    htmlFromPlainText("Test1.\n\nTest2."); // -> "<p>Test1.</p><p>Test2.</p>"
    htmlFromPlainText("< AT&T >");         // -> "<p>&lt; AT&amp;T &gt;</p>"
Several examples are included at the end of the code snippet
that show the output when this function is applied to various subject
strings. If JavaScript is foreign to you, note that the /g
modifier appended to each of the regex literals causes the replace()
method to replace all occurrences of the pattern, rather than just
the first. The \n metasequence in the example subject strings inserts
a line feed character (ASCII position 0x0A) in a JavaScript string literal.
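To see the effect of the /g modifier in isolation, the following sketch applies the same replacement with and without it. The helper names are hypothetical, and `JSON.stringify()` is used only so that the remaining line feed prints visibly as `\n`.

```javascript
// Hypothetical helpers contrasting a single replacement with a global one.
function replaceFirstBreak(text) {
    return text.replace(/\n/, "<br>");  // no /g: only the first match is replaced
}

function replaceEveryBreak(text) {
    return text.replace(/\n/g, "<br>"); // /g: every match is replaced
}

const subject = "a\nb\nc";
console.log(JSON.stringify(replaceFirstBreak(subject))); // "a<br>b\nc"
console.log(JSON.stringify(replaceEveryBreak(subject))); // "a<br>b<br>c"
```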
The easiest way to complete Step 1 is to use three discrete search-and-replace operations (see Table 9-3, shown earlier, for the list of replacements). JavaScript has traditionally relied on regular expressions for global search-and-replace operations, but in other programming languages you will typically get better performance from simple plain-text substitutions.
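For comparison, here is a minimal sketch of Step 1 done with plain-text substitutions in JavaScript, using the `split()`/`join()` idiom. The helper names `replaceAllLiteral` and `escapeHtmlPlain` are hypothetical. Engines that implement ES2021 also provide `String.prototype.replaceAll()`, which accepts a plain string as the search value. No performance claim is made for either variant.

```javascript
// Hypothetical helper: replaces every literal occurrence of `search` without a regex.
function replaceAllLiteral(text, search, replacement) {
    return text.split(search).join(replacement);
}

function escapeHtmlPlain(text) {
    text = replaceAllLiteral(text, "&", "&amp;"); // ampersands first
    text = replaceAllLiteral(text, "<", "&lt;");
    text = replaceAllLiteral(text, ">", "&gt;");
    return text;
}

console.log(escapeHtmlPlain("a < b && b > c")); // a &lt; b &amp;&amp; b &gt; c
```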
In Step 2, we use the regular expression ‹\r\n?|\n› to find line breaks
that follow the Windows/MS-DOS (CRLF), Unix/Linux/BSD/OS X (LF), and
legacy Mac OS (CR) conventions. Perl 5.10 and PCRE 7 users can use the
dedicated ‹\R› token (note the uppercase R) instead for matching those
and other line break sequences.
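As a quick check of the three conventions, the sketch below runs the same regex over one sample string per convention. The function name `breaksToBr` is hypothetical. Note that a CRLF pair is consumed as a single line break, because ‹\r\n?› matches the carriage return together with an optional line feed.

```javascript
// Hypothetical helper wrapping the Step 2 replacement.
function breaksToBr(text) {
    // \r\n? matches CRLF or a lone CR; \n matches a lone LF.
    return text.replace(/\r\n?|\n/g, "<br>");
}

console.log(breaksToBr("one\r\ntwo")); // one<br>two  (Windows/MS-DOS)
console.log(breaksToBr("one\ntwo"));   // one<br>two  (Unix/Linux/BSD/OS X)
console.log(breaksToBr("one\rtwo"));   // one<br>two  (legacy Mac OS)
```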
Replacing all line breaks with <br> before adding paragraph tags in
the next step keeps things simpler overall. It also makes it easy to
add whitespace between your </p><p> tags in later
substitutions, if you want to keep your HTML code readable.
If you prefer to use XHTML-style singleton tags, use «<br●/>»
(where ● stands for a space character) instead of «<br>» as your
replacement string. You’ll also need to alter the regular expression
in Step 3 to match this change.
Two line breaks in a row indicate the end of one
paragraph and the start of another, so our replacement text for
Step 3 is a closing </p> tag followed by an opening <p>.
If the subject text contains only one paragraph (i.e., two line breaks
never appear in a row), no substitutions will be made. Step 2 already
replaced any of several line break types (leaving behind only <br>
tags), so this step could be
handled with a plain text substitution. However, using a regex here
makes it easy to take things one step further and ignore whitespace
that appears between line breaks. Any extra space characters won’t be
rendered in an HTML document anyway.
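The difference between the two approaches can be sketched as follows. Both function names are hypothetical. The plain-text variant uses `split()`/`join()` and only recognizes adjacent tags, while the regex variant also tolerates whitespace between them.

```javascript
// Hypothetical helper: Step 3 with the regex from this recipe.
function markParagraphs(text) {
    // \s* allows spaces or tabs between the two <br> tags.
    return text.replace(/<br>\s*<br>/g, "</p><p>");
}

// Hypothetical helper: the same step as a plain-text substitution.
function markParagraphsPlain(text) {
    return text.split("<br><br>").join("</p><p>");
}

console.log(markParagraphs("One.<br><br>Two."));          // One.</p><p>Two.
console.log(markParagraphs("One.<br>  \t<br>Two."));      // One.</p><p>Two.
console.log(markParagraphs("One.<br>Two."));              // One.<br>Two.
console.log(markParagraphsPlain("One.<br>  \t<br>Two.")); // input unchanged: the tags are not adjacent
```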
If you’re generating XHTML and therefore replaced line breaks
with «<br●/>» instead of «<br>»,
you’ll need to adjust the regex for this step to ‹<br●/>\s*<br●/>›.
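Putting the XHTML adjustments together, a variant of the function might look like the sketch below. The name `xhtmlFromPlainText` is hypothetical and not part of the original solution. In a JavaScript regex literal, the forward slash in the tag has to be escaped as `\/`.

```javascript
// Hypothetical XHTML-flavored variant of htmlFromPlainText().
function xhtmlFromPlainText(subject) {
    // Step 1: escape the special HTML characters, ampersands first.
    subject = subject.replace(/&/g, "&amp;")
                     .replace(/</g, "&lt;")
                     .replace(/>/g, "&gt;");

    // Step 2: use the singleton tag form for line breaks.
    subject = subject.replace(/\r\n?|\n/g, "<br />");

    // Step 3: the regex must now look for the "<br />" form.
    subject = subject.replace(/<br \/>\s*<br \/>/g, "</p><p>");

    // Step 4: wrap the whole fragment in paragraph tags.
    return "<p>" + subject + "</p>";
}

console.log(xhtmlFromPlainText("Test1.\nTest2."));   // <p>Test1.<br />Test2.</p>
console.log(xhtmlFromPlainText("Test1.\n\nTest2.")); // <p>Test1.</p><p>Test2.</p>
```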
Recipe 4.10 includes more information
about Perl and PCRE’s ‹\R› token, and shows how to manually match
the additional, esoteric line separators that are supported by ‹\R›.
Recipe 9.6 demonstrates how to decode XML-style named and numbered character references.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.2 explains how to match nonprinting characters. Recipe 2.3 explains character classes. Recipe 2.8 explains alternation. Recipe 2.12 explains repetition.