Building a DTD

The next step in XML design is to build a DTD based on the rough instance we worked out in the previous example. A DTD, in its simplest form, contains a declaration of every element you want to use in your XML instances. Examining the XML instance we defined in the previous section, the elements are:

  • E-MAIL

  • FROM

  • TO

  • SUBJECT

  • BODY

Each of these elements must be defined within the DTD using an “ELEMENT” keyword, such as:

<!ELEMENT E-MAIL (FROM, TO, SUBJECT, BODY)>
<!ELEMENT FROM (#PCDATA)>
<!ELEMENT TO (#PCDATA)>
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>

Each of the preceding lines declares an element that legally can be part of our XML instances. Each declaration has two parts: the element name, such as FROM, and its content model, which defines what the element can contain. Notice what's going on in the declaration for the E-MAIL element: We've defined its content model very rigidly. It must contain a FROM element, followed by a TO element, followed by a SUBJECT element, followed by a BODY element. Any other order is illegal under this content model. The content models of the FROM, TO, SUBJECT, and BODY elements, however, are defined with the cryptic #PCDATA, which stands for “parsed character data” and really just means “character text” such as the text of this sentence. The text is “parsed” because the parser still scans it for markup, so entity references such as &amp; are recognized and expanded. With (#PCDATA) alone, these elements can contain text and entity references, but no child elements. If you want to allow child elements mixed in with the text, you use a mixed content model of the form (#PCDATA | i | b)*, which names the elements that may appear. (CDATA, without the P, is not a content model keyword. It shows up in DTDs as an attribute type and in documents as CDATA sections.)
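
To make the rigid E-MAIL content model concrete, here is a minimal sketch in Python that checks the child order by hand. It is not a DTD validator (Python's standard library does not include one); the helper name children_in_order and the two sample messages are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# The order required by the content model (FROM, TO, SUBJECT, BODY).
EXPECTED_ORDER = ["FROM", "TO", "SUBJECT", "BODY"]


def children_in_order(xml_text):
    """Return True if the root is E-MAIL and its children match the model.

    Hypothetical helper: a hand-rolled stand-in for what a validating
    parser would enforce from the E-MAIL element declaration.
    """
    root = ET.fromstring(xml_text)
    child_tags = [child.tag for child in root]
    return root.tag == "E-MAIL" and child_tags == EXPECTED_ORDER


# Sample instances (invented for this sketch).
good = ("<E-MAIL><FROM>a</FROM><TO>b</TO>"
        "<SUBJECT>hi</SUBJECT><BODY>text</BODY></E-MAIL>")
bad = ("<E-MAIL><TO>b</TO><FROM>a</FROM>"
       "<SUBJECT>hi</SUBJECT><BODY>text</BODY></E-MAIL>")

print(children_in_order(good))  # True
print(children_in_order(bad))   # False: TO appears before FROM
```

Both samples are well-formed XML; only the first one follows the order the content model demands.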

XML Character Entities

What happens if the text within your XML instance needs to include a character like “<” (the “less than” symbol)? Because “<” has a special meaning in XML, if you just stick it in the middle of a sentence, you're going to get a big fat parsing error when you try to use this XML instance. The answer is to use XML entity references in place of these characters. An entity reference starts with an ampersand (&), followed by the entity's name and then a semicolon. For instance, the XML standard entity reference for < is &lt;. Of course, this makes & into a reserved symbol as well, requiring its own entity reference (&amp;). XML predefines five entity references (&lt;, &gt;, &amp;, &apos;, and &quot;), and you can also define your own.
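
The parsing error and its fix can be sketched in a few lines of Python using only the standard library. The sample sentence and the helper name parses are assumptions for illustration; escape is the standard xml.sax.saxutils function that replaces &, <, and > with their entity references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample text containing reserved characters (invented for this sketch).
raw_text = "Is 3 < 5 & 5 > 3?"


def parses(xml_text):
    """Return True if xml_text is well-formed, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


unescaped = "<BODY>" + raw_text + "</BODY>"
escaped = "<BODY>" + escape(raw_text) + "</BODY>"

print(parses(unescaped))  # False: the bare "<" and "&" break the parser
print(escaped)            # <BODY>Is 3 &lt; 5 &amp; 5 &gt; 3?</BODY>
print(parses(escaped))    # True

# The parser turns the entity references back into the original characters.
print(ET.fromstring(escaped).text == raw_text)  # True
```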

For example, if your XML instances were music reviews about Prince albums released after 1993, you might want to define a new entity for that little squiggly ankh-thingy (&theartist;).

Your application can then substitute a suitable graphic image when representing the article to a reader. I've used this approach with Greek letters (for example, &alpha; for α) and other mathematical symbols with great success.
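
As a rough sketch of defining your own entity, the Python snippet below declares theartist in a document's internal DTD subset and lets the parser expand it. The REVIEW element and the replacement text are assumptions for illustration; an application that wants to substitute a graphic would handle the entity itself rather than expand it to plain text.

```python
import xml.etree.ElementTree as ET

# A document that declares its own general entity in the internal subset.
# The REVIEW element and the replacement text are hypothetical.
doc = """<!DOCTYPE REVIEW [
  <!ENTITY theartist "The Artist Formerly Known as Prince">
]>
<REVIEW>A new album from &theartist;.</REVIEW>"""

root = ET.fromstring(doc)

# The parser replaces &theartist; with the declared replacement text.
print(root.text)
```

Because entity expansion happens inside the parser, only parse documents from sources you trust this way.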

You define which character entities you want to be legal in your XML at the top of your DTD with an external reference like:

<!ENTITY % HTMLlat1 PUBLIC
   "-//W3C//ENTITIES Latin1//EN//HTML"
   "HTMLlat1.ent">
%HTMLlat1;

This makes all of the character entities in the “Latin 1” set part of your DTD. These include &pound; (£), &frac12; (½), and &Ouml; (Ö, for documents about heavy metal bands).

Full definitions of these predefined entity sets are available from the W3C site as part of the definition of HTML (www.w3.org/MarkUp/).

Entities can also be used to build a shorthand for a complex content model for use in a DTD. For instance, if you wanted your e-mail subject and body to be able to contain characters, entities, and the elements <i> and <b> (for italic and bold), you might define an entity in your DTD like this:

<!ENTITY % text "(#PCDATA | i | b)*">

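Then you would define the content models for your BODY and SUBJECT elements by referencing the entity. Because text is a parameter entity (declared with %), it is referenced inside the DTD as %text; rather than &text;, like this:

<!ELEMENT SUBJECT %text;>
<!ELEMENT BODY %text;>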
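
To see what this shorthand amounts to, here is a deliberately simplified Python sketch of the substitution a parser performs, with the references written in the parameter entity form %text;. The function name expand_parameter_entities and the regular expressions are assumptions for illustration; a real XML parser follows more detailed rules (for example, it pads the replacement text with spaces and resolves external entities).

```python
import re

# Matches declarations like: <!ENTITY % name "replacement text">
PE_DECLARATION = re.compile(r'<!ENTITY\s+%\s+(\S+)\s+"([^"]*)"\s*>')
# Matches references like: %name;
PE_REFERENCE = re.compile(r"%([A-Za-z_][\w.-]*);")


def expand_parameter_entities(dtd_text):
    """Replace each %name; with its declared text (simplified sketch)."""
    entities = dict(PE_DECLARATION.findall(dtd_text))
    body = PE_DECLARATION.sub("", dtd_text)
    return PE_REFERENCE.sub(lambda m: entities[m.group(1)], body).strip()


dtd = """<!ENTITY % text "(#PCDATA | i | b)*">
<!ELEMENT SUBJECT %text;>
<!ELEMENT BODY %text;>"""

print(expand_parameter_entities(dtd))
# <!ELEMENT SUBJECT (#PCDATA | i | b)*>
# <!ELEMENT BODY (#PCDATA | i | b)*>
```

The point of the entity is maintenance: change the content model in one place and every element that references it picks up the change.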

In the full DTD for CyberCinema in the Appendix there are more examples of this use of entities.


Now that we've defined our elements and how they fit together (their content models), we need to define each element's attributes.

Reexamining our sample instance, we find that our only attributes are ID numbers, inside the E-MAIL, TO, and FROM elements. Each element must have its own attribute list declaration (using the ATTLIST keyword), like so:

<!ATTLIST E-MAIL     ID    NMTOKEN    #REQUIRED>
<!ATTLIST FROM       ID    NMTOKEN    #REQUIRED>
<!ATTLIST TO         ID    NMTOKEN    #REQUIRED>

Each attribute declaration names the element it belongs to (in this case, E-MAIL, FROM, and TO), followed by the attribute's name (ID), its type (NMTOKEN, a reserved type for “name tokens”: strings made up of letters, digits, periods, hyphens, underscores, and colons, with no spaces), and a default declaration. Using the #REQUIRED reserved word as the default declaration means that every element of this type must specify an ID. Note that NMTOKEN does not restrict the value to digits, so the DTD alone cannot guarantee that an ID is numeric. An instance of an e-mail where the FROM field, for example, doesn't have an ID is not valid according to our DTD and would fail a validation test against the DTD.
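
The effect of #REQUIRED and NMTOKEN can be approximated by hand in Python. This is only a sketch, not a validating parser: the function name missing_or_bad_ids, the ASCII-only pattern for name tokens, and the sample messages are all assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# ASCII-only approximation of an XML name token:
# letters, digits, periods, hyphens, underscores, and colons.
NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")

# Elements whose ATTLIST declares a #REQUIRED ID attribute.
REQUIRES_ID = ("E-MAIL", "FROM", "TO")


def missing_or_bad_ids(xml_text):
    """Return the tags whose ID attribute is absent or not a name token."""
    root = ET.fromstring(xml_text)
    problems = []
    for element in root.iter():  # the root first, then its descendants
        if element.tag in REQUIRES_ID:
            value = element.get("ID")
            if value is None or not NMTOKEN.fullmatch(value):
                problems.append(element.tag)
    return problems


# Sample instances (invented for this sketch).
valid = ('<E-MAIL ID="1001"><FROM ID="7">a</FROM><TO ID="9">b</TO>'
         '<SUBJECT>hi</SUBJECT><BODY>text</BODY></E-MAIL>')
invalid = ('<E-MAIL ID="1001"><FROM>a</FROM><TO ID="not a token">b</TO>'
           '<SUBJECT>hi</SUBJECT><BODY>text</BODY></E-MAIL>')

print(missing_or_bad_ids(valid))    # []
print(missing_or_bad_ids(invalid))  # ['FROM', 'TO']
```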

The ordering of these declarations isn't important, but an attribute list declaration for an element should occur after the declaration of the element itself in the DTD. This keeps the DTD readable and keeps you sane.

Commenting Code

Comments (that is, nonfunctional, nonparseable bits of text included to provide information to the reader) within a DTD (and within XML documents in general) are denoted using the following syntax:

<!-- This is a comment -->

Commenting your DTD is just as crucial as commenting a piece of application code, especially if you expect someone else to be able to decipher it. Remember that future archeologist?


The following is the entire DTD for our simple e-mail system:

<!--  This is the DTD for our simple e-mail system  -->
<!--  An e-mail message must contain from, to, subject, and body fields,
     in that order  -->
<!ELEMENT E-MAIL (FROM, TO, SUBJECT, BODY)>
<!--  The from, to, subject, and body fields contain only character
     data  -->
<!ELEMENT FROM (#PCDATA)>
<!ELEMENT TO (#PCDATA)>
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>
<!--  The e-mail message itself is identified by a numerical ID  -->
<!ATTLIST E-MAIL   ID    NMTOKEN    #REQUIRED>
<!--  The from and to fields are identified by numerical IDs which
     reference database id numbers for these users. -->
<!ATTLIST FROM     ID    NMTOKEN    #REQUIRED>
<!ATTLIST TO       ID    NMTOKEN    #REQUIRED>
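
As a quick sanity check, the sketch below attaches the DTD to a sample message through a DOCTYPE declaration and parses it with Python's standard library. The character-data content models are written as (#PCDATA), the form XML parsers accept. Two caveats: the sample message and its ID values are invented for illustration, and Python's built-in parser is non-validating, so this only confirms that the declarations and the instance are syntactically acceptable. Enforcing the DTD would require a validating parser.

```python
import xml.etree.ElementTree as ET

# The e-mail DTD, placed in the document's internal subset.
DTD = """
<!ELEMENT E-MAIL (FROM, TO, SUBJECT, BODY)>
<!ELEMENT FROM (#PCDATA)>
<!ELEMENT TO (#PCDATA)>
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>
<!ATTLIST E-MAIL   ID    NMTOKEN    #REQUIRED>
<!ATTLIST FROM     ID    NMTOKEN    #REQUIRED>
<!ATTLIST TO       ID    NMTOKEN    #REQUIRED>
"""

# A sample instance (names and ID values are hypothetical).
instance = "<!DOCTYPE E-MAIL [" + DTD + "]>" + """
<E-MAIL ID="1001">
  <FROM ID="7">Alice</FROM>
  <TO ID="9">Bob</TO>
  <SUBJECT>Lunch</SUBJECT>
  <BODY>Noon &amp; not a minute later.</BODY>
</E-MAIL>"""

root = ET.fromstring(instance)

print(root.tag, root.get("ID"))
for child in root:
    print(child.tag, child.attrib, child.text)
```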
