Chapter 14. The role of XML

XML is fast becoming the new Internet standard for information exchange. For complex information reuse, XML is the technology of choice. In this chapter, we briefly describe XML and its origins and explain why XML is so valuable in a unified content strategy.

A brief history of XML

XML, eXtensible Markup Language, is advertised as many things, including the successor to HTML as the language of the Web. But unlike HTML, XML is not a specific markup language; it is a standard for defining your own markup language.

XML is not new; it is a recasting of an existing standard: Standard Generalized Markup Language (SGML). In its current iteration, the XML specification is at version 1.0. People sometimes interpret that to mean that the specification is immature and to expect big changes. They are waiting for a “stable” version of the standard. But this is misleading; XML is based on SGML, so it originates from an already stable background.

First SGML

SGML is an International Organization for Standardization (ISO) standard for markup of text. It began life as a solution to a very common problem in large organizations: How do you take information created by different groups, using different software, and on different operating systems, and share it for use and reuse among those groups? In the 1970s, the answer to that question was, “You don’t.” Files created on one operating system (OS) could not be read on another OS. Applications could not read the formats of other applications. For that matter, files created on one version of an application could not always be read by later versions of the same application. Information was created and re-created as needed, but never reused.

This was the problem identified by Charles Goldfarb in 1969 when he was leading a project to integrate law office systems. He and his colleagues developed Generalized Markup Language (GML) as a mechanism for creating and sharing text. GML focused on the structure of a document, not the formatting, and was the first text processing language to view and formalize the hierarchical nature of information. It was also the first language to explicitly define the hierarchy of documents in a separate form. GML was implemented in mainframe-based text processing systems.

In 1978, the American National Standards Institute (ANSI) invited Goldfarb to participate on their Computer Languages for the Processing of Text committee. As a member of this committee, Goldfarb led an effort to develop a text processing language based on GML. Over time, this was developed into ISO standard 8879, Standard Generalized Markup Language.

SGML, like GML, describes the logical structure of a document, its components, and their relationship to each other, not how the document should be formatted. A paragraph, for example, might be marked with a “begin paragraph tag,” <para>, and an “end paragraph tag,” </para>, rather than with formatting instructions to leave a blank line and indent.

SGML gained wide use in industries such as aerospace and defense. It was also adopted in several branches of the government. SGML markup languages and applications provide authors with a tremendous amount of power and control over their documents. So why not stick with SGML?

Well, there are several reasons:

  • SGML is very complex and is difficult to learn and apply.

  • SGML tools have traditionally been very expensive.

  • SGML has been primarily a technology for print publishing.

Then HTML

The best known application of SGML is the language of the Web: HTML. It was created in the early 1990s by Tim Berners-Lee at CERN (the European Laboratory for Particle Physics) as a way for scientists to share information. With the popularity of the Web, HTML has become ubiquitous. So why not stick with HTML? HTML is limited in a number of ways:

  • HTML is a fixed tag set. You cannot add your own tags to HTML. It is a fixed language.

  • HTML is designed for display. It is perfectly suited for rendering documents in your browser, but that’s all. It is not effective for print or other formats.

  • HTML is static. Its display is fixed, so providing information in different ways based on user requests is difficult. Dynamic HTML does allow some manipulation of display, but it requires some pretty complex scripting to do so. Scripting requires developers to create and maintain the scripts, which is expensive.

  • HTML is not structural. It is primarily a linear presentation markup. With the exception of lists and tables, it doesn’t contain any structural markup. Its focus on presentation makes it difficult, if not impossible, to process or manipulate chunks of HTML.

  • HTML is not really a standard. The rush to capture market share has meant that tools for HTML, including browsers, frequently appeared on the market before HTML was complete and stable. As a result, the tools incorporate their designer’s best guesses of what HTML would become. Browser vendors have created their own proprietary codes (flavors of HTML), which impedes standardization.

So although HTML has been a useful tool for displaying content on the Web, it has severe limitations. These limitations, at least in part, have spurred the development of XML.

What is XML?

First, let’s discuss what XML is not. XML is not a set of tags that you can apply to documents. XML is a specification that sets rules for the creation of tag sets that you can apply to documents. That’s the eXtensible part. That’s also the most confusing part to many people starting to learn XML. People coming from an HTML background automatically expect to see a list of tags they can apply to documents. XML does not define the tag names—you do.

Design goals of XML

In 1998, the first version of the XML specification was released as a recommendation. For this first version, a design committee of the World Wide Web Consortium (W3C)—the people who develop and maintain XML—had very specific goals in mind, including the following:

  • XML will be designed for use on the Internet, but shall support a wide variety of applications.

    First and foremost, XML was designed to be used on the Web. But the designers foresaw many uses for XML, including Web-based publishing and e-commerce.

  • XML will be based on and compatible with SGML.

    SGML had over a dozen years to “iron out the bugs.” It’s a very solid, stable standard, just too complex for Web-based use. XML keeps the best of SGML and reduces the complexity.

  • It will be easy to write programs that process XML documents.

    “Process” in this context can mean render, sort, parse, transform, assemble, and so on. Having easy-to-write programs means that there are far more tools available for processing XML than there ever were for SGML.

  • XML documents will be easy to create, readable without specialized tools, and reasonably clear.

    As with SGML, XML documents are saved as plain text (Unicode, typically encoded as UTF-8). You can open and edit them with any text editor or XML editing tool.

  • The design of XML will be formal and concise.

    Unlike HTML, XML is a precise standard that sets out clear rules for the creation and application of tag sets. This has made it easier for tool vendors to create new tools for XML. Plus, the standard makes it easy to predict how XML documents will look and be structured.

Compared to HTML and SGML, XML is the best of both worlds. The best functionality of SGML has been extracted and the ease-of-use of HTML has been preserved.

A look at XML

You can best understand XML if you work through a quick example. First look at a small procedure marked up in HTML (see Listing 14.1).

Example 14.1. Sample HTML document

<html>
   <body>
      <h2>Logging On to AccSoft </h2>
      <p>The first time you click on a component in AccSoft you are
         required to log on to the system before you can complete any tasks.
      <p>To log on to AccSoft:
      <ol>
         <li>Double-click the AccSoft application.
         <li>Select Accounts Payable from the Explorer.
         <li>Type your USERID into the Name field.
         <li>Type your password into the Password field.
         <li>Select the customer to update.
         <li>Click the OK button to log on to AccSoft.
      </ol>
      <p>If you do not know your USERID or
         password, consult your System Administrator.
   </body>
</html>

This is simple HTML code and though plain, it works. Figure 14.1 shows what it looks like when displayed in a browser.

Figure 14.1. Simple procedure displayed in a browser.

For pure display, HTML is very effective. The code is simple and the output is very workable. This HTML could be displayed in any browser.

HTML is a good example of a simple markup language:

  • There are start tags and end tags. For example, the procedure title is surrounded by an <h2> start tag and end tag:

    <h2>Logging On to AccSoft </h2>

  • The end tag is differentiated from the start tag by the forward slash (<h2> versus </h2>) so there’s clear delineation of where the title begins and ends.

  • The procedure file also shows nesting of the elements of the document. The <html> element contains a <body> element. The <body> element contains an <h2> element, followed by two <p> elements, followed by an <ol> element, followed by a <p> element. The <ol> element contains six <li> elements.

As the example illustrates, HTML can be pretty effective. The code is simple, the output is workable, and the tags delineate content elements and their nesting very clearly. The issue is that HTML is concerned solely with presentation. It does nothing to help you understand what the information is. To most observers, the content in the example is obviously a procedure. But we reach that conclusion through interpretation of the content, not through the HTML tags. What does an <h2> represent? It’s interpreted as a title. What is an <ol>? You only know that it’s an ordered list if you know HTML. Furthermore, there are three different occurrences of the <p> tags, representing different paragraphs. Are they different semantically? Not knowing this can be a problem when it comes time to process the document. There must be a way to distinguish among ambiguous tags.

Now let’s look at an XML version of the same procedure. Listing 14.2 shows the same procedure marked up in XML.

Example 14.2. Sample XML file

<?xml version="1.0"?> 
<procedure> 
 <title>Logging On to AccSoft </title> 
 <paragraph>The first time you click on a component in AccSoft you are 
 required to log on to the system before you can complete any tasks. 
 </paragraph> 
 <intro>To log on to AccSoft:</intro> 
 <procedure_steps> 
    <step>Double-click the AccSoft application.</step> 
    <step>Select Accounts Payable from the Explorer.</step> 
    <step>Type your USERID into the Name field.</step> 
    <step>Type your password into the Password field.</step> 
    <step product="extended">Select the customer to update.</step> 
    <step>Click the OK button to log on to AccSoft.</step> 
 </procedure_steps> 
 <note>If you do not know your USERID or Password, consult your System 
Administrator. 
 </note> 
</procedure> 

The content of the two procedures is the same, but there are significant differences in the tagging. Tagging for XML is similar to HTML, with start and end tags enclosing content, but it has some very specific rules you need to follow:

  • You must have closing tags for all elements.

  • Tags must be nested, never overlapping.

  • Start and end tag names must match exactly, including case (<step> cannot be closed with </Step>).
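
These three rules are what a parser checks when it decides whether a file is well-formed XML. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser; the sample fragments and the <b> and <i> tags are made up for illustration and are not part of the AccSoft example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments, one per rule in the list above.
samples = {
    "valid": "<step>Click the <b>OK</b> button.</step>",
    "missing end tag": "<steps><step>Click OK.</steps>",
    "overlapping tags": "<step><b>Click <i>OK</b></i></step>",
    "case mismatch": "<Step>Click OK.</step>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first fragment is accepted; the parser rejects the other three as soon as it meets the offending tag.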

In the XML file the tag names identify what they contain. There can be absolutely no doubt that this is a procedure because the first tag (<procedure>) says so. The title is obviously a title. The steps of the procedure are clearly identified. Also, there is only one generic paragraph (<paragraph>) in the XML file. In the XML file, paragraphs that were identified generically in the HTML sample are tagged as <intro> and <note>.

So where did the tags in the XML document come from? We made them up for this example. And that is one of the great advantages of XML; you define the tag names to suit the information that you are marking up. Instead of generic tags, you get semantic tags. Ideally, the names should come from the semantic model for your information, as described in Chapter 8, “Information modeling.” Semantic names for model elements describe what goes in each element. The semantic names can therefore be used as the XML tag names and eliminate the need for users to interpret or guess at each tag’s purpose.

There are benefits to defining your own tag names:

  • Tag names have meaning for you and your authors.

    You don’t have to guess about the meaning or purpose of a tag name. Tag names can be made specific (as specific or as precise as needed) so authors (and other content handlers) don’t have to interpret how a tag should be or has been used.

  • Names can reflect the content.

    The tag names can clearly identify what they contain. This is a procedure. That is a note.

  • Tag names have nothing to do with formatting.

    Formatting can be defined later, when you know the exact purpose or purposes of the document. Nothing in the markup will limit the formats to which you can output.

  • You can have as many or as few tags as you need.

    HTML has a fixed number of tags. You can’t add any more and it’s very difficult to prevent authors from using those you don’t want them to use.

Importance of XML to a unified content strategy

You can implement a unified content strategy without XML, using traditional authoring tools, but XML provides you with the ability to do a whole lot more. There are disadvantages to XML, notably that it is a new technology that brings issues in dealing with the learning curve, the complexity, and the implementation. However, the disadvantages are outweighed by the advantages. The characteristics of XML that best support reuse are:

  • Structured content

  • Separation of content and format

  • Built-in metadata

  • Database orientation

  • XSL style sheets

  • Personalization

These are described in more detail in the following sections.

XML and structured content

Authors typically have a high-level understanding of the concept of structured content. For example, they understand that books have front matter, body chapters, and back matter. Authors may also recognize repeatable structures at a lower level. Chapters have titles, overviews, sections containing the “meat” of the chapter, and a summary. Some authors can even describe the structure in individual sections, for example, a procedure. (For more on defining structures, see Chapter 8, “Information modeling.”) However, when you examine similar information products, you find that structures are not consistent from product to product. Structures will vary from author to author, from department to department, from division to division. Even information written by a single author will vary over time. This is a big problem for reuse.

In XML, structure can be defined in a Document Type Definition (DTD) or Schema [1]. A DTD is quite specific; it defines all the elements (XML tags) that can be used in a document. It also defines the relationship of those elements to other elements. You can specify the hierarchy of elements (“a chapter contains…”), the order of elements, or even the number of elements.

A DTD can be incredibly valuable for the writing process. Many authors take as much time figuring out the structure they need to write to as they do actually crafting the information. Does my presentation need an overview? Does my procedure have an introduction? Do I need to include a title for a graphic? With a DTD, you can mandate the structure that is required. This consistency is also very valuable for the information’s users. Consistency leads to predictability. Users learn where information is to be found and can automatically navigate to it, finding what they need quickly and efficiently. In addition, a DTD provides a powerful map for systematic reuse and personalization. When content is systematically reused, the content management system must identify what content can be reused where. The DTD specifies this information. Personalization also requires a map and set of rules to define what information should be provided and in what order. The DTD provides this information.

For structural consistency, having a defined structure in a DTD is half of the solution. The other half is provided by specialized editing tools (called validating editors) that can read a DTD and enforce the structural rules defined in it. By providing authors with a validating editor and a DTD, you can ensure that all your information products are structurally consistent.
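
To make the idea of enforced structure concrete, here is a rough sketch in Python. It is not a DTD and not a validating parser: the CONTENT_MODEL table is a hand-written, hypothetical stand-in for the kind of rules a DTD would express (which children an element may contain, in what order, and how many), and check_structure is an illustrative helper, not part of any XML tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: for each parent element, the children it may
# contain, in order, as (name, minimum, maximum). None means "no upper limit".
CONTENT_MODEL = {
    "procedure": [
        ("title", 1, 1),
        ("intro", 0, 1),
        ("procedure_steps", 1, 1),
        ("note", 0, None),
    ],
    "procedure_steps": [("step", 1, None)],
}


def check_structure(element):
    """Return a list of structural problems found at or below this element."""
    problems = []
    rules = CONTENT_MODEL.get(element.tag)
    if rules is not None:
        children = [child.tag for child in element]
        position = 0
        for name, minimum, maximum in rules:
            count = 0
            # Consume consecutive children that match the expected name.
            while position < len(children) and children[position] == name:
                count += 1
                position += 1
            if count < minimum:
                problems.append(f"<{element.tag}> needs at least {minimum} <{name}>")
            if maximum is not None and count > maximum:
                problems.append(f"<{element.tag}> allows at most {maximum} <{name}>")
        if position < len(children):
            problems.append(f"<{element.tag}> has unexpected <{children[position]}>")
    for child in element:
        problems.extend(check_structure(child))
    return problems


good = ET.fromstring(
    "<procedure><title>Logging On</title>"
    "<procedure_steps><step>Type your USERID.</step></procedure_steps>"
    "</procedure>"
)
bad = ET.fromstring(
    "<procedure><procedure_steps></procedure_steps>"
    "<title>Logging On</title></procedure>"
)

print(check_structure(good))
for problem in check_structure(bad):
    print(problem)
```

A validating editor does this kind of checking continuously while the author writes, using the DTD itself rather than a table in code.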

Separation of content and format

If there’s a single characteristic that impedes the effectiveness of traditional authoring tools—such as word processors—it is their focus on formatting. Traditional tools have been designed to make it easy for authors to make documents look good. In doing so, they have turned authors into desktop publishers. But from the perspective of reuse, this is not a good thing.

First, word processors began life as an alternative to typewriters. They allowed authors to make documents attractive and, potentially, more usable. But they were still very typewriter-like because their focus was the current document. Authors entered the characters that formed the content, then selected the characters to apply the formatting. This wasn’t very effective for repeating formatting in a document; it relied on authors remembering that a section title was 18 pt. Helvetica Bold centered on a 36 pica line. The result was inconsistent formatting for all but the most dedicated authors. Later software versions allowed authors to create “styles”: formatting that was defined and given a name to apply as required. But even now, none of the word processors provide any functionality to ensure that the formatting remains constant. Authors can define new styles, redefine existing styles, or ignore them altogether. Consistency (or more accurately, predictability) in the application of style names is vital for reuse.

Second, all that formatting power comes with a price. Simply put, you end up with big, bloated data files that contain not only the content, but also all the details of the formatting. Further, that formatting is specific to the output that the tool is designed to support. Most word processing applications, not surprisingly, have a bias toward paper. What makes this a complication for reuse is that you need a way to remove this formatting to make the content independent of output. To reuse the content, authors must apply formatting that is appropriate for each output. Stripping and reapplying formatting is tricky and usually not 100% effective. Format conversions always require correction by hand or complicated scripting.

For reuse, XML has a significant advantage over traditional word processors. XML stems from the originating goal of making documents transportable across systems and applications. The proponents of markup languages knew that the embedded formatting commands and binary file formats were the main impediment to cross-platform transportability. The solution was to separate the format from the content. XML focuses on the structure of a document, not the presentation. The presentation information (styles) is maintained in separate files that are associated with the document when it is published or used.

The separation of content and format offers immense flexibility. For example, the example XML procedure includes a <note> element. Traditionally, this is formatted for output something like what is shown in Figure 14.2.

Figure 14.2. Simple formatted note.

The signal word “NOTE” is not part of the actual content; it has been provided through the style sheet. The keyword could easily be replaced by a signal icon, again through the style sheet (see Figure 14.3).

Figure 14.3. Note with signal icon.
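
The same idea can be sketched in a few lines of Python. This is only an illustration of generated text, assuming two hypothetical "style sheets" represented as plain dictionaries (TEXT_STYLE and ICON_STYLE); a real project would express this in a style sheet language rather than in code.

```python
import xml.etree.ElementTree as ET

# The content: no signal word appears anywhere in the element.
note = ET.fromstring(
    "<note>If you do not know your USERID or Password, "
    "consult your System Administrator.</note>"
)

# Two hypothetical "style sheets": each maps an element name to the text
# that should be generated in front of that element's content.
TEXT_STYLE = {"note": "NOTE: "}
ICON_STYLE = {"note": "[info icon] "}


def render(element, style):
    """Return the element's text preceded by the prefix the style generates."""
    prefix = style.get(element.tag, "")
    content = " ".join(element.text.split())  # normalize whitespace
    return prefix + content


print(render(note, TEXT_STYLE))
print(render(note, ICON_STYLE))
```

Switching from the signal word to the icon changes only the style definition; the <note> content is untouched.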

Built-in metadata

As you’ve seen, HTML is a defined set of markup tags, whereas XML is a set of rules for creating markup tags. So what does that mean? Compare the two variations of markup presented earlier. The files contain the same information. However, to authors the tag names themselves offer additional detail about the information. The tag names become metadata.

For occasions where additional information is required to describe content, attributes can be used to further define metadata. An attribute is a name and a value that can be associated with a tag. For example, a common use for metadata in a reuse environment is to indicate who the intended audience is for a specific piece of information. Consider the procedural example. Using attributes, you can identify the audience for each specific step, option, or even word. Listing 14.3 shows the same procedure but with a step added. The step is tagged like all the other steps, with the addition of an attribute (product="extended"). This attribute functions as metadata in that it indicates that this step is applicable only in the extended version of the product.

Example 14.3. XML with attribute metadata

<procedure> 
<title>Logging On to AccSoft </title> 
<paragraph>The first time you click on a component in AccSoft you are 
required to log on to the system before you can complete any tasks. 
</paragraph> 
<intro>To log on to AccSoft:</intro> 
<procedure_steps> 
<step>Double-click the AccSoft application.</step> 
<step>Select Accounts Payable from the Explorer.</step> 
<step>Type your USERID into the Name field.</step> 
<step>Type your password into the Password field.</step> 
<step product="extended">Select the customer to update.</step> 
<step>Click the OK button to log on to AccSoft.</step> 
</procedure_steps> 
<exercise>Log onto the training database using the USERID and password 
provided by your course facilitator. 
</exercise> 
<note>If you do not know your USERID or Password, consult your System 
Administrator. 
</note> 
<warning>This database contains personal information about our clients. 
Do not let anyone else use your password at any time. 
</warning> 
</procedure> 

Using traditional document and content management systems (CMS), authors add metadata through a selection window when they check in a file. The best CMS products include search tools that enable users to search by metadata. But the metadata is only associated with the file; it is not part of the file. You can’t email the file to someone and have the metadata go along because the metadata is part of the CMS data. In XML, the metadata travels with the XML file. It is entered as the file is authored or updated. It remains part of the content and can be easily searched.
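
Because the attribute is part of the file, any program that reads the file can act on it. The sketch below, in Python with the standard library's xml.etree.ElementTree, selects the steps that apply to a given product version. The value "standard" and the rule that a step without a product attribute applies to every version are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

SOURCE = """<procedure_steps>
  <step>Type your password into the Password field.</step>
  <step product="extended">Select the customer to update.</step>
  <step>Click the OK button to log on to AccSoft.</step>
</procedure_steps>"""


def steps_for(product, xml_text=SOURCE):
    """Return the text of the steps that apply to the given product version.

    Assumption: a step with no product attribute applies to every version.
    """
    root = ET.fromstring(xml_text)
    return [
        step.text
        for step in root.findall("step")
        if step.get("product") in (None, product)
    ]


print(steps_for("standard"))  # the two unmarked steps
print(steps_for("extended"))  # all three steps
```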

Database orientation

If there’s a common theme running through the XML references available on the market, it’s probably that XML makes you look at information in a different way: as data. The process for determining the structure of your information and the resulting DTD is very similar to the analysis that a developer goes through to design a database. Database designers are not concerned with the actual data values in design; they are interested in the type of information, the hierarchy of that information, and the relationship of the pieces.

A similar approach is taken when designing the structure of XML. The result is a structural format that can be stored very easily in databases. It can be stored as a series of elements rather than as a whole document, and those elements can be extracted and assembled in any order, based on your needs.
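
As a rough sketch of element-level storage, the Python example below splits a small document into its child elements, stores each one as a row in an in-memory SQLite database, and assembles a new document from selected rows. The table layout, the assemble helper, and the <quick_reference> wrapper are all hypothetical; a content management system would use its own schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

SOURCE = """<procedure>
  <title>Logging On to AccSoft</title>
  <intro>To log on to AccSoft:</intro>
  <note>If you do not know your USERID, consult your System Administrator.</note>
</procedure>"""

db = sqlite3.connect(":memory:")
# Hypothetical schema: one row per element, identified by its tag name.
db.execute("CREATE TABLE element (id INTEGER PRIMARY KEY, tag TEXT, content TEXT)")

# Store each child element separately instead of storing the whole document.
for child in ET.fromstring(SOURCE):
    db.execute(
        "INSERT INTO element (tag, content) VALUES (?, ?)",
        (child.tag, ET.tostring(child, encoding="unicode").strip()),
    )


def assemble(root_tag, tags):
    """Build a new document from stored elements, in the order requested."""
    root = ET.Element(root_tag)
    for tag in tags:
        rows = db.execute(
            "SELECT content FROM element WHERE tag = ? ORDER BY id", (tag,)
        ).fetchall()
        for (content,) in rows:
            root.append(ET.fromstring(content))
    return root


quick_reference = assemble("quick_reference", ["note", "title"])
print([child.tag for child in quick_reference])
```

The assembled document contains the note first and the title second, an order that never existed in the source file.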

Use of XSL

Separating format from content is all well and good, but sooner or later you need to format information for presentation. XML by itself is not acceptable for display to the average user. The technology for formatting XML presentation is XSL (eXtensible Stylesheet Language). Unlike traditional style sheets, which provide only formatting commands, XSL is a powerful mechanism for both transforming and formatting XML documents.

XSL is an XML markup language itself and as such, can

  • Format content for online display or for paper-based delivery

  • Add constant text or graphics (such as the icons in the “warning” example)

  • Filter content

  • Sort or reorder text

There are actually three parts to XSL:

  • XPath

  • XSL Transformations

  • XSL-FO (formatting objects)

Traditional style definitions are very restricted in the way they identify elements for formatting. XPath is a mechanism for identifying specific elements in an XML document so that formatting or transformations can be applied to them, which lets you apply logic to your formatting. In the XSL style sheet, you can identify and apply specific formatting or transformation to elements such as a title following a chapter, the first paragraph in a section, or every other bullet in a list.
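
To give a feel for this kind of selection, here is a short Python sketch. It assumes the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath (positions and attribute tests, but not arithmetic such as "every other" item), so it illustrates the idea rather than the full language.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<procedure>
  <title>Logging On to AccSoft</title>
  <procedure_steps>
    <step>Double-click the AccSoft application.</step>
    <step product="extended">Select the customer to update.</step>
    <step>Click the OK button to log on to AccSoft.</step>
  </procedure_steps>
</procedure>""")

# Select by position: the first and the last <step> in <procedure_steps>.
first_step = doc.find("procedure_steps/step[1]")
last_step = doc.find("procedure_steps/step[last()]")

# Select by attribute value, anywhere below the root element.
extended_only = doc.findall(".//step[@product='extended']")

print(first_step.text)
print(last_step.text)
print([step.text for step in extended_only])
```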

Other style sheets enable you to describe all your formatting needs, including fonts, colors, sizes, margins, bullets, list numbers, and so on, in a WYSIWYG editor.

But rather than simply formatting the information in a document, XSL gives you the capability to transform it into something else. That is, you can manipulate the information to reorder, repeat, filter out information, or even add information based on details in the file. This is where XSL transformations, also known as XSLT, fit in. XSLT enables you to transform an XML document into another markup language. The most common use of XSLT is to transform information to HTML for display on the Web. But XSLT can also be used to convert information from XML into markup for wireless display, for transmission to PDAs and Web-enabled cell phones.

The flexibility of XSL and its pieces is extremely valuable for information publication and presentation. Unlike traditional tools, which associate one style sheet with one document, you can create any number of style sheets for a single XML document or information type. If you want to post the document on the Web, create an XSL style sheet to HTML. For wireless, create an XSL style sheet to WML.
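
The one-source, many-outputs idea can be sketched without any XSL tooling. The Python functions below are hypothetical stand-ins for two style sheets: to_html maps the semantic elements onto HTML presentation elements, and to_plain_text renders the same source as numbered lines. This is not XSLT; it only illustrates the kind of mapping a transformation performs.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<procedure>
  <title>Logging On to AccSoft</title>
  <procedure_steps>
    <step>Double-click the AccSoft application.</step>
    <step>Click the OK button to log on to AccSoft.</step>
  </procedure_steps>
</procedure>""")


def to_html(procedure):
    """Map semantic elements onto HTML presentation elements."""
    body = ET.Element("body")
    ET.SubElement(body, "h2").text = procedure.findtext("title")
    ordered_list = ET.SubElement(body, "ol")
    for step in procedure.iter("step"):
        ET.SubElement(ordered_list, "li").text = step.text
    return ET.tostring(body, encoding="unicode")


def to_plain_text(procedure):
    """Render the same source as a title line plus numbered steps."""
    lines = [procedure.findtext("title").upper()]
    for number, step in enumerate(procedure.iter("step"), start=1):
        lines.append(f"{number}. {step.text}")
    return "\n".join(lines)


print(to_html(doc))
print(to_plain_text(doc))
```

Both outputs come from the same unchanged source; adding a third output means adding a third mapping, not editing the content.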

Despite the unstoppable growth of the Internet and display technologies, paper will continue to be a required output for information. XSL-FO has been designed for that purpose. If you want paper, create an XSL-FO style sheet. XSL-FO (XSL Formatting Objects) provides style sheet capabilities for converting XML to paper-based formats such as PDF. It provides for all the required formatting, including page layouts, headers, footers, recto/verso (odd/even) pages, portrait and landscape pages, and so on.

When the information is ready to publish, you can process the file against all style sheets simultaneously and get all required outputs at the same time.

Personalization

Personalization is very popular for web delivery of content. Personalization, simply defined, is information that can be manipulated to serve the needs of a specific user. It can be user defined, or it can be managed by software, based on a user’s login information. Personalization that is “managed by software” may be controlled by observing user behavior, and/or combined with preferences to create a personalized experience.

With XML, documents can be broken down, stored as separate physical pieces in a database, and then assembled in any order to meet user demands.

Summary

XML is not the only technology solution for reuse, but it is the most powerful by far. XML combines the best functionality of SGML with the ease-of-use of HTML, which is the best of both worlds.

XML provides powerful support for a unified content strategy through

  • Structured content

    A Document Type Definition (DTD) or Schema defines all the elements and their associated tags for a document. The DTD/Schema provides a roadmap to help authors create consistently structured content.

  • Separation of content and format

    XML tags focus on a document’s structure, not its format. An XSL style sheet can interpret the XML tags to produce any desired format.

  • Built-in metadata

    Semantic XML tags (tags that have meaning) can automatically be used to provide metadata (information) about the content. Additional metadata (in attributes in the XML) can also be included in the XML file, which means that when the file is transferred to someone else or a different system, the metadata goes with it.

  • Database orientation

    XML describes the content’s structure. The structure can easily be stored in a database as a series of elements rather than as a whole document, and then extracted and assembled in any order to meet users’ needs.

  • XSL style sheets

    XSL is a powerful mechanism for both transforming and formatting XML documents. XSL style sheets let you manipulate the information to reorder it, repeat content, filter out information, or even add information. You can use XSL style sheets to convert XML to any desired format (such as HTML, PDF, or wireless markup).

  • Personalization

    With XML, documents can be broken down, stored as separate pieces in a database or CMS, then assembled in any order to meet users’ needs.



[1] DTDs and Schemas are, from the conceptual perspective, the same thing: an expression of the acceptable and required structure of a document. The difference is that they are written in different languages, and Schemas include additional capability aimed primarily at the e-business uses of XML.
