Extensible Markup Language (XML)

From reading the last chapter, you already know that XML is the best thing since sliced bread, but to set the scene more accurately, let's go back to the origins of XML and find out what it's really all about.

The design goals for XML, lifted from the text of the XML specification current at this writing (XML 1.0 Second Edition[1]), are as follows:

[1] XML 1.0 Second Edition is available at http://www.w3.org/TR/2000/REC-xml-20001006.

  1. XML shall be straightforwardly usable over the Internet.

  2. XML shall support a wide variety of applications.

  3. XML shall be compatible with SGML.

  4. It shall be easy to write programs that process XML documents.

  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

  6. XML documents should be human legible and reasonably clear.

  7. The XML design should be prepared quickly.

  8. The design of XML shall be formal and concise.

  9. XML documents shall be easy to create.

  10. Terseness in XML markup is of minimal importance.

The XML specification was written by the World Wide Web Consortium (W3C), the body that develops and recommends Web specifications and standards. Tim Berners-Lee founded the W3C in 1994 because he thought there might be something to the little information retrieval system he built while working at Switzerland's CERN laboratory. The W3C's membership has since climbed to over 500 member organizations. In 1994 the Hypertext Markup Language (HTML) was in its infancy, having been built hastily on top of the Standard Generalized Markup Language (SGML). SGML, in turn, had become an international standard in 1986, when it was adopted by the International Organization for Standardization (ISO). SGML was itself based on the Generalized Markup Language (GML), developed in 1969 at IBM.

In 1996, the members of the W3C undertook what would become their most influential project: the creation of a new language for the Web, called eXtensible Markup Language (or XML for short). XML is related to SGML, but instead of defining a specific tag set as HTML does, XML enables the designer of a system to create tag sets to support specific domains of knowledge: academic disciplines such as physics, mathematics, and chemistry, and business domains such as finance, commerce, and journalism. XML is a subset of SGML. Like SGML, it is a set of rules for building markup languages. Each of XML's rules is also a rule of SGML.

XML and languages like it use tags to indicate structure within a piece of text. Here's a simple bit of XML-compliant HTML as an example:

<p>Homer's <u>Odyssey</u> is a really nice book.</p>

The portion of the text between the matching <u> begin tag and the </u> end tag is marked up, or slated, for whatever treatment we deem appropriate for the <u> tag (in the case of HTML, an underline). The result is

Homer's Odyssey is a really nice book.

This tag structure is hierarchical; that is, you can place tags inside of tags as shown in Figure 2-1.

Figure 2-1. XML's hierarchical tag structure

<p>Homer's <u>Odyssey</u> is <em>a <strong>really nice</strong></em> book.</p>

Figure 2-1 renders like this:

Homer's Odyssey is a really nice book.[2]

[2] In this example, I'm assuming you have some knowledge of HTML tags. Just in case you're not familiar with them, or you need a quick refresher: the <p> (paragraph) tag encloses the entire sentence. Inside the <p> begin and end tags are two other tags, <u> (underline) and <em> (emphasis), and inside <em> is a <strong> (strong emphasis) tag.

The horizontal rules in Figure 2-1 indicate the basic tag structure. The <p> tag encloses the whole sentence. Inside the two <p> tags are two other tags, <u> and <em>, and inside of <em> is a <strong> tag, three levels down. Tags can go inside of tags, but tags must match up at the same “level.” Hence, the following is not well-formed XML:

<p>Homer's <u>Odyssey</u> is <em>a <strong>really nice</em></strong> book.</p>

The previous example has an <em> tag and within it a <strong> tag, but the <em> tag is ended before the <strong> tag it contains. It isn't well formed because an XML document is not really a flat document at all, but a tree, and in a tree each element must sit entirely inside its parent. Let's take the well-formed example in Figure 2-1 and visualize it as a tree (see Figure 2-2).
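If you'd like to check well-formedness for yourself, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree parser as a stand-in for any XML parser, and the helper name is_well_formed is made up for illustration. The first string is the properly nested sentence described for Figure 2-1; the second is the overlapping version shown above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the parser accepts the markup as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for problems such as mismatched or overlapping tags.
        return False


# Properly nested: <strong> is closed before the <em> that contains it.
nested = "<p>Homer's <u>Odyssey</u> is <em>a <strong>really nice</strong></em> book.</p>"

# Overlapping: </em> arrives while <strong> is still open.
overlapping = "<p>Homer's <u>Odyssey</u> is <em>a <strong>really nice</em></strong> book.</p>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

The parser rejects the second string outright rather than trying to guess what was meant, which is typical of how XML parsers treat documents that aren't well formed.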

Figure 2-2. The hierarchical layout of our XML example


As you can see in Figure 2-2, each part of the example sentence is represented in a leaf or node of this tree. The tree structure is a basic data type that a computer can easily deal with. A computer program can “traverse” a tree by starting at the top and making its way down the left branch first. Then, when it gets to a dead end, it goes up a level and looks for a right branch, and so on, until it gets to the very last, rightmost leaf. This kind of traversal is the most elementary kind of computer science, which is why XML is such a wonderful way to represent data in the machine world.
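The traversal just described can be sketched in a few lines of Python. This is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree module, the function name walk is hypothetical, and the markup is the well-formed sentence described for Figure 2-1. The function visits each element top-down and left to right, and records how deep each piece of the sentence sits in the tree.

```python
import xml.etree.ElementTree as ET

markup = "<p>Homer's <u>Odyssey</u> is <em>a <strong>really nice</strong></em> book.</p>"


def walk(element, depth=0, out=None):
    """Depth-first, left-to-right traversal; collects (depth, label) pairs."""
    if out is None:
        out = []
    out.append((depth, f"<{element.tag}>"))
    # Text that appears before the first child element.
    if element.text:
        out.append((depth + 1, f'"{element.text}"'))
    for child in element:
        walk(child, depth + 1, out)
        # Text that follows a child's end tag belongs to the parent.
        if child.tail:
            out.append((depth + 1, f'"{child.tail}"'))
    return out


# Print the tree with two spaces of indentation per level.
for depth, label in walk(ET.fromstring(markup)):
    print("  " * depth + label)
```

Run against the example sentence, the indentation of the output mirrors the nesting of the tags, with the text inside <strong> sitting three levels below <p>.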

Evaluating XML's Design Goals

How did XML's authors do on their list of initial design goals? Let's take a look.

  1. XML shall be straightforwardly usable over the Internet. The meaning of this goal is a bit fuzzy, but the W3C was essentially going for coherence between XML and other already-existing forms of information retrieval that occur on the Internet, specifically, HTML. The goal may also mean that XML wouldn't be proprietary and that proprietary software would not be required to use it. In other words, XML documents should be usable by a broad audience; they should be completely open, not proprietary and closed. No real worries there. Another important factor in being straightforwardly usable over the Internet is that documents should be self-contained. In particular, XML documents can be processed without the presence of a DTD (see Chapter 5), in contrast to SGML, where a DTD is always necessary to make sense of documents. Self-contained documents are important in an environment based on request/response protocols (such as HTTP, the protocol underlying the World Wide Web) where communications failures are common.

  2. XML shall support a wide variety of applications. As already discussed, XML can support any number of applications, ranging from different human disciplines (chemistry, news, math, finance, law, and so on) to machine-to-machine transactions, such as online payment and content syndication. Put a big check mark in the box on this one.

  3. XML shall be compatible with SGML. XML is based on the SGML specification (the XML 1.0 W3C Recommendation document describes it as a “dialect of SGML”), so the W3C has also met this design goal.

  4. It shall be easy to write programs that process XML documents. Because XML is a simplified form of SGML, it's even easier to write programs that process XML documents than it is to write programs that process SGML.

  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero. By “optional” features, the W3C refers to some variations of SGML that include so-called optional features used only in certain SGML applications. These variations complicate SGML parsers and processors and ultimately mean that some SGML parsers aren't compatible with some SGML documents. In other words, all SGML is compatible, but some SGML applications are more compatible than others. The original XML working group members recognized that XML couldn't suffer from this kind of fragmentation, or it would go the way of SGML and become an obscure and abstruse language used only by information professionals.

    XML actually does have some optional features, which means, in theory, that you can get different results depending on what parser you use to read a document. However, in my experience you won't have to worry about XML's optional features, and they're certainly not within the scope of this book, so we won't go into them here.

  6. XML documents should be human legible and reasonably clear. The best the W3C has been able to do is to make it easy for XML documents to be human legible. Because XML by its very nature enables anyone to design and implement an XML-based vocabulary, the W3C can't guarantee that all XML documents will be human readable. At least you have a fighting chance, however, because XML is a text-based format rather than a binary format like GIF or PDF.

    Later efforts by the W3C have diverged from this goal. Flip forward to Chapter 6, where I discuss XML Schema, and you'll see what I mean—it doesn't mean that XML Schema is a bad thing; it's just not immediately human readable. As with a programming language, you have to understand what you're looking at. So it's 50/50 on readability, but maybe this goal wasn't realistic in the first place.

  7. The XML design should be prepared quickly. Compared with other international standards efforts, such as ISO standards that often take years and sometimes decades to complete and publish, the W3C certainly did a bang-up job on this goal. It took a year to produce the first draft. Put a check mark next to this goal.

  8. The design of XML shall be formal and concise. The XML specification is definitely formal; it is derived from SGML in a formal, declarative sense. The Cambridge International Dictionary of English defines concise as “expressing what needs to be said without unnecessary words.” According to this definition, I'd say the specification is concise in that it includes everything that needs to be there without any extraneous material. Of course, conciseness is in the eye of the beholder. If you read through the specification, “concise” may not be the first word that comes to mind.

  9. XML documents shall be easy to create. You can author an XML document in any text editor, so put a check mark next to this goal.

  10. Terseness in XML markup is of minimal importance. This tenth requirement speaks volumes and represents the fundamental shift that the information science and computer industries went through during the 1980s and 1990s. At the dawn of the computer age, terseness was of primary importance. Those familiar with the Y2K uproar will understand the consequences of this propensity for terseness. In the name of terseness, software engineers used abbreviated dates (01/01/50 rather than 01/01/1950) in many of the systems they wrote. This caused problems later, when the year 2000 rolled around, because computer systems couldn't be certain whether a date meant 1950 or 2050. Amazingly, this practice lasted until the late 1990s, when some embedded systems that had the so-called “two-digit date” problem were still being produced.

    We can laugh now, but quite a lot of otherwise smart people found themselves holed up in bunkers clutching AK-47s and cans of baked beans and feeling a little silly about it all at five minutes past midnight on January 1, 2000.

    To be fair, the reason systems were designed with the two-digit date wasn't because the software engineers were dumb; it was because memory and storage were expensive in the 1970s and 1980s. It's easy now to say that they should have known better, now that storage and bandwidth are comparatively cheap and easy to obtain and people routinely download and store hours of digitized music on their home computers.

    This “tenth commandment” of XML essentially says “out with the old” thinking, in which protocols and data formats had to be designed around the available storage and bandwidth. Now that storage and bandwidth are plentiful and becoming ubiquitous in our lives, the W3C wanted to keep them from being factors in the design of XML.

    The implications of storage and bandwidth are easy to overlook, but they're quite important in the way information systems are designed and implemented, and they will have repercussions for years to come.
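To see why the two-digit date discussed under goal 10 is ambiguous, consider this small Python sketch. Both function names and the pivot value of 70 are assumptions chosen for illustration; real systems picked their own windowing rules, and none of this comes from the XML specification.

```python
def candidate_years(two_digit_year, centuries=(1900, 2000)):
    """List every full year that a two-digit year could stand for."""
    return [century + two_digit_year for century in centuries]


def expand_with_pivot(two_digit_year, pivot=70):
    """Guess the century using a hypothetical pivot.

    Values below the pivot are read as 20xx, everything else as 19xx.
    """
    century = 2000 if two_digit_year < pivot else 1900
    return century + two_digit_year


print(candidate_years(50))    # [1950, 2050]
print(expand_with_pivot(50))  # 2050
print(expand_with_pivot(85))  # 1985
```

The terse form saves two characters per date, but the reader is left to guess the century, and with this particular pivot a birth date of 01/01/50 would be read as 2050. Spelling the year out in full costs a little storage and removes the guesswork, which is the trade-off goal 10 endorses.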

Bandwidth Strikes Back: Mobile Devices

One way in which bandwidth is rearing its ugly head once again is through the proliferation of mobile-connected devices (such as WAP phones and e-mail devices like Research In Motion's Blackberry two-way pager). The wireless connections these devices use are generally pretty low bandwidth; current mobile phones and infrastructure based on the Global System for Mobile Communications (GSM, the mobile/cellular phone standard in use in most of the world) are mostly limited to 9600 baud. E-mail devices like the Blackberry use paging networks that aren't “always on” and allow only small packets of data to be received discontinuously.

Yet industry pundits like Steve Ballmer of Microsoft are predicting that XML will be the lingua franca of all connected mobile devices. Indeed, WML, the language WAP phones speak, is based on XML, and the languages that are lined up to replace it are also based on XML.

This bandwidth issue will go away for mobile devices, eventually. We're already seeing more high-bandwidth networks and devices being deployed, especially in high-tech strongholds. People who are currently trying to solve this bandwidth issue treat it as if it's the major limiting factor for mobile device proliferation. These efforts are misguided. The real killer apps for mobile devices will be multimedia broadband applications, and these applications will drive an explosion in bandwidth, just as they have for wired networks.
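To put the 9600 baud figure in perspective, here is a rough back-of-the-envelope calculation in Python. It is a sketch under loud assumptions: it treats 9600 baud as 9600 bits per second, ignores protocol overhead, latency, and compression, and uses a made-up temperature reading as the payload. The function name transfer_seconds is hypothetical.

```python
def transfer_seconds(size_bytes, bits_per_second=9600):
    """Idealized time to send a payload over a link of the given speed."""
    return size_bytes * 8 / bits_per_second


# The same (made-up) reading, marked up verbosely and written tersely.
verbose = "<temperature><value>21.5</value><unit>Celsius</unit></temperature>"
terse = "21.5C"

for label, payload in (("verbose XML", verbose), ("terse", terse)):
    size = len(payload.encode("utf-8"))
    print(f"{label}: {size} bytes, {transfer_seconds(size):.4f} s")

# A 10 KB XML document over the same idealized link.
print(f"10 KB document: {transfer_seconds(10 * 1024):.1f} s")
```

For a single reading the difference is a fraction of a second either way, but the cost scales linearly with document size, so a document of a few kilobytes already takes several seconds on such a link.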

