Chapter 5
IN THIS CHAPTER
Understanding XML
Defining structure with DTD
Looking at DOM and SAX
Reading a document into memory
Navigating a document
Getting attribute and element values
In this chapter, you find out how to work with Extensible Markup Language (XML) — the best thing to happen to computing since the invention of the vacuum tube (at least, according to some overenthusiastic prognosticators).
This chapter focuses on the basics of reading an XML document into memory and extracting data from it. With the background in this chapter, you shouldn’t have much trouble studying the API documentation on your own to find out more about XML programming.
Most computer-industry pundits agree that XML will completely change the way you work with computers. Here are just some of the ways XML will revolutionize the world of computers:
Yawn.
So what is XML, really? Simply put, XML is a way to store and exchange information in a standardized way that’s easy to create, retrieve, and transfer between different types of computer systems or programs.
Like Hypertext Markup Language (HTML), XML uses tags to mark the data. Here’s a bit of XML that describes a book:
<Book>
    <Title>Java All-In-One For Dummies</Title>
    <Author>Lowe</Author>
</Book>
This chunk of XML defines an element called Book, which contains information for a single book. The Book element in turn contains two subordinate elements: Title and Author.
Notice that each element begins with a tag that lists the element’s name. This tag is called the start tag. The element ends with a tag that repeats the element name, preceded by a slash; this tag is called the end tag.
Everything that appears between the start tag and the end tag is the element’s content, which can consist of text data or of one or more additional elements. In the latter case, the additional elements nested within an element are called child elements, and the element that contains them is called the parent element.
The highest-level element in an XML document is called the root element. A properly formed XML document consists of a single root element, which can contain elements nested within it. Suppose that you want to create an XML document with information about two movies. The XML document might look something like this:
<Movies>
    <Movie>
        <Title>It's a Wonderful Life</Title>
        <Year>1946</Year>
        <Price>14.95</Price>
    </Movie>
    <Movie>
        <Title>Young Frankenstein</Title>
        <Year>1974</Year>
        <Price>16.95</Price>
    </Movie>
</Movies>
Here the root element named Movies contains two Movie elements, each of which contains Title, Year, and Price elements.
HTML has tags such as <B> and <I> that indicate whether data is bold or italic, for example. By contrast, an XML document that holds information about books may have tags such as <Title> and <Author> that provide the title and author of the book.
If you’re creating an XML document about cars, you may use tags such as <Make>, <Model>, and <Year>. But if you’re creating an XML document about classes taught at a university, you may use tags such as <Course>, <Title>, <Instructor>, <Room>, and <Schedule>.
Instead of using child elements, you can use attributes to provide data for an element. An attribute is a name-and-value pair that’s written inside the start tag for an element. Here’s a Movie element that uses an attribute instead of a child element to record the year:
<Movie year="1946">
    <Title>It's a Wonderful Life</Title>
    <Price>14.95</Price>
</Movie>
Whether you use attributes or child elements is largely a matter of personal preference. Many XML purists say that you should avoid using attributes or use them only to identify data such as identification numbers or codes; others suggest using attributes freely. In my experience, a few attributes here and there don’t hurt, but for the most part I avoid using them.
Every XML file should begin with an XML declaration that identifies the version of XML being used. For most XML documents, the declaration should look like this:
<?xml version='1.0'?>
Note that the XML declaration does not require an end tag. The XML declaration should be the first line in the file.
For your reference, Listing 5-1 shows the movies.xml file, which the programs that appear later in this chapter use.
LISTING 5-1 The movies.xml File
<?xml version='1.0'?>
<Movies>
    <Movie year="1946">
        <Title>It's a Wonderful Life</Title>
        <Price>14.95</Price>
    </Movie>
    <Movie year="1974">
        <Title>Young Frankenstein</Title>
        <Price>16.95</Price>
    </Movie>
    <Movie year="1977">
        <Title>Star Wars</Title>
        <Price>17.95</Price>
    </Movie>
    <Movie year="1987">
        <Title>The Princess Bride</Title>
        <Price>16.95</Price>
    </Movie>
    <Movie year="1989">
        <Title>Glory</Title>
        <Price>14.95</Price>
    </Movie>
    <Movie year="1997">
        <Title>The Game</Title>
        <Price>14.95</Price>
    </Movie>
    <Movie year="1998">
        <Title>Shakespeare in Love</Title>
        <Price>19.95</Price>
    </Movie>
    <Movie year="2009">
        <Title>Zombieland</Title>
        <Price>18.95</Price>
    </Movie>
    <Movie year="2010">
        <Title>The King's Speech</Title>
        <Price>17.95</Price>
    </Movie>
    <Movie year="2013">
        <Title>Star Trek Into Darkness</Title>
        <Price>19.95</Price>
    </Movie>
</Movies>
An XML document can have a DTD, which spells out exactly what elements can appear in an XML document and in what order the elements can appear. DTD stands for Document Type Definition, but that won’t be on the test.
A DTD for an XML document about movies, for example, may specify that each Movie element must have Title and Price subelements and an attribute named year. It can also specify that the root element must be named Movies and consist of any number of Movie elements.
You can store the DTD for an XML document in the same file as the XML data, but more often, you store the DTD in a separate file. That way, you can use a DTD to govern the format of several XML documents of the same type. To indicate the name of the file that contains the DTD, you add a <!DOCTYPE> declaration to the XML document. Here’s an example:
<!DOCTYPE Movies SYSTEM "movies.dtd">
Here the XML file is identified as a Movies document, whose DTD you can find in the file movies.dtd. Add this declaration near the beginning of the movies.xml file, right after the <?xml> declaration.
Listing 5-2 shows a DTD file for the movies.xml file that was shown in Listing 5-1.
LISTING 5-2 A DTD File for the movies.xml File
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Movies (Movie*)>
<!ELEMENT Movie (Title, Price)>
<!ATTLIST Movie year CDATA #REQUIRED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Price (#PCDATA)>
Each of the ELEMENT tags in a DTD defines a type of element that can appear in the document and indicates what can appear as the content for that element type. The general form of the ELEMENT tag is this:
<!ELEMENT element (content)>
Use the rules listed in Table 5-1 to express the content.
TABLE 5-1 Specifying Element Content
Content | Description
element* | The specified element can occur 0 or more times.
element+ | The specified element can occur 1 or more times.
element? | The specified element can occur 0 or 1 time.
(element1|element2) | Either element1 or element2 can appear.
element1, element2 | element1 appears, followed by element2.
#PCDATA | Text data is allowed.
ANY | Any child elements are allowed.
EMPTY | No child elements of any type are allowed.
The first ELEMENT tag in the DTD I show in Listing 5-2, for example, says that a Movies element consists of zero or more Movie elements. The second ELEMENT tag says that a Movie element consists of a Title element followed by a Price element. The third and fourth ELEMENT tags say that the Title and Price elements consist of text data.
The ATTLIST tag provides the name of each attribute. Its general form is this:
<!ATTLIST element attribute type default-value>
Here’s a breakdown of this tag:
element names the element whose tag the attribute can appear in.
attribute provides the name of the attribute.
type specifies what can appear as the attribute’s value. The type can be any of the items listed in Table 5-2.
default-value provides a default value and indicates whether the attribute is required or optional. It can be any of the items listed in Table 5-3.
TABLE 5-2 Attribute Types
Type | The Attribute Value …
CDATA | Can be any character string.
(string1|string2…) | Can be one of the listed strings.
NMTOKEN | Must be a name token, which is a string made up of letters, digits, and a few punctuation characters (periods, hyphens, underscores, and colons).
NMTOKENS | Must be one or more name tokens separated by white space.
ID | Is a name that must be unique. In other words, no other element in the document can have the same value for this attribute.
IDREF | Must be the same as an ID value used elsewhere in the document.
IDREFS | Is a list of IDREF values separated by white space.
TABLE 5-3 Attribute Defaults
Default | Optional or Required?
#REQUIRED | Required.
#IMPLIED | Optional.
"value" | Optional. This value is used if the attribute is omitted.
#FIXED "value" | Optional. If included, however, it must be this value, and if omitted, this value is used by default.
Here’s the ATTLIST tag declaration from movies.dtd:
<!ATTLIST Movie year CDATA #REQUIRED>
This declaration indicates that the attribute goes with the Movie element, is named year, can be any kind of data, and is required.
Here’s an ATTLIST tag that specifies a list of possible values along with a default (note that the default value must be enclosed in quotation marks):
<!ATTLIST Movie genre (SciFi|Action|Comedy|Drama) "Comedy">
This form of the ATTLIST tag lets you create an attribute that’s similar to an enumeration, with a list of acceptable values.
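To see an enumerated attribute’s default at work, here’s a minimal sketch that parses a tiny document whose DTD is embedded right in the XML text. The class name GenreDefaultDemo, the genreOf method, and the sample data are all made up for this illustration; the sketch assumes the parser supplies the declared default when the attribute is left out, which is how the parser bundled with the JDK behaves.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class GenreDefaultDemo
{
    // Hypothetical sample DTD, embedded so the example needs no files.
    public static final String DTD =
        "<!DOCTYPE Movies [\n"
        + "<!ELEMENT Movies (Movie*)>\n"
        + "<!ELEMENT Movie (Title)>\n"
        + "<!ELEMENT Title (#PCDATA)>\n"
        + "<!ATTLIST Movie genre (SciFi|Action|Comedy|Drama) \"Comedy\">\n"
        + "]>\n";

    // Returns the genre attribute of the first Movie element.
    public static String genreOf(String moviesXml)
    {
        try
        {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new InputSource(new StringReader(DTD + moviesXml)));
            Element movie =
                (Element) doc.getElementsByTagName("Movie").item(0);
            return movie.getAttribute("genre");
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args)
    {
        // No genre attribute here, so the DTD default should be reported.
        System.out.println(genreOf(
            "<Movies><Movie><Title>Zombieland</Title></Movie></Movies>"));
        // An explicit genre overrides the default.
        System.out.println(genreOf(
            "<Movies><Movie genre=\"SciFi\">"
            + "<Title>Star Wars</Title></Movie></Movies>"));
    }
}
```

The first call leaves out the genre attribute, so the value comes from the DTD; the second supplies its own value.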
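In general, you can use either of two approaches to process XML documents in a Java program:
DOM: Stands for Document Object Model. A DOM parser reads the entire XML document into memory and stores it as a tree of objects that your program can navigate.
SAX: Stands for Simple API for XML. A SAX parser reads the document from start to finish and calls methods in your program as it encounters each element, so the whole document never has to be held in memory.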
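The rest of this chapter sticks with DOM, so for contrast, here’s a minimal sketch of the event-driven SAX style. It counts Movie elements as the parser streams through the text. The class name SaxCountDemo, the countMovies method, and the sample XML are assumptions made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCountDemo
{
    // Counts Movie start tags; no tree is ever built in memory.
    public static int countMovies(String xml)
    {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler()
        {
            @Override
            public void startElement(String uri, String localName,
                String qName, Attributes attributes)
            {
                // The parser calls this once for every start tag it reads.
                if (qName.equals("Movie"))
                    count[0]++;
            }
        };
        try
        {
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args)
    {
        String xml = "<Movies>"
            + "<Movie year=\"1946\"><Title>It's a Wonderful Life</Title></Movie>"
            + "<Movie year=\"1974\"><Title>Young Frankenstein</Title></Movie>"
            + "</Movies>";
        System.out.println("Movies seen: " + countMovies(xml));
    }
}
```

SAX suits very large files because nothing is retained, but your program has to keep track of where it is in the document on its own.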
In this section, I cover the basics of using DOM to retrieve information from an XML document. DOM represents an XML document in memory as a tree of Node objects. Figure 5-1 shows a simplified DOM tree for an XML document that has two Movie elements. Notice that the root element (Movies) is a node, each Movie element is a node, and each Title and Price element is a node. In addition, text values are stored as child nodes of the elements they belong to. Thus, the Title and Price elements each have a child node that contains the text for these elements.
Before you can process a DOM document, you have to read the document into memory from an XML file. You’d think that this would be a fairly straightforward proposition, but unfortunately, it involves some pretty strange incantations. Rather than go through all the classes and methods you have to use, I just look at the finished code for a complete method that accepts a String containing a filename as a parameter and returns a document object as its return value. Along the way, you find out what each class and method does.
Here’s a method that reads an XML file into a DOM document:
private static Document getDocument(String name)
{
    try
    {
        // Create and configure the factory.
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        factory.setIgnoringComments(true);
        factory.setIgnoringElementContentWhitespace(true);
        factory.setValidating(true);
        // Create the builder and parse the file.
        DocumentBuilder builder =
            factory.newDocumentBuilder();
        return builder.parse(new InputSource(name));
    }
    catch (Exception e)
    {
        System.out.println(e.getMessage());
    }
    return null;
}
The first statement of the preceding example calls the newInstance method of the DocumentBuilderFactory class to create a new DocumentBuilderFactory object. The job of the DocumentBuilderFactory is to create DocumentBuilder objects that can read XML input and create DOM documents in memory.
Why not just call the DocumentBuilderFactory constructor? It turns out that DocumentBuilderFactory is an abstract class, so you can’t create an instance of it directly. newInstance is a static method that determines which class to create an instance of based on the way your system is configured.
After you get a DocumentBuilderFactory, you can configure it to read the document the way you want. The next three statements configure three options that are applied to document builders created by this factory object:
factory.setIgnoringComments(true);
factory.setIgnoringElementContentWhitespace(true);
factory.setValidating(true);
Here’s a closer look at these statements:
The setIgnoringComments method tells the document builder not to create nodes for comments in the XML file. Most XML files don’t contain comments, but if they do, they’re not part of the data represented by the document, so they can be safely ignored. Setting this option causes them to be ignored automatically. (If you don’t set this option, a node is created for each comment in the document, and because you can’t predict when or where comments appear, your program has to check every node it processes to make sure that the node isn’t a comment.)
The setIgnoringElementContentWhitespace method causes the document builder to ignore any white space that isn’t part of a text value. If you don’t include this option, the DOM document includes nodes that represent white space. The only thing that these white space nodes are good for is making the DOM document harder to process, so you should always set this option. (The parser relies on the DTD to tell which white space is ignorable, so this option has an effect only when the document is validated against a DTD.)
The setValidating method tells the document builder to validate the input XML if it specifies a DTD. Validating the input can also dramatically simplify your program, because you know that the DOM document conforms to the requirements of the DTD. If you’re processing the movies.xml file shown in Listing 5-1 earlier in this chapter, you know for certain that the first child of a Movie element is a Title element and that the second child is a Price element. Without the validation, all you know is that the first child of a Movie element should be a Title element, but you have to check it to make sure. Keep in mind that the parser reports validity errors to its error handler rather than stopping, so this guarantee holds only if an error handler that rejects invalid documents is in place.
After you set the options, you can call the newDocumentBuilder method to create a document builder, as follows:
DocumentBuilder builder =
    factory.newDocumentBuilder();
Here the document builder is referenced by a variable named builder.
Finally, you can create the DOM document by calling the parse method of the document builder. This method accepts an InputSource object as a parameter. Fortunately, the InputSource class has a constructor that takes a system identifier (a URI, for which a plain filename such as movies.xml works in practice) and returns an input source linked to the file. So you can create the input source, parse the XML file, create a DOM document, and return the DOM document to the caller, all in one statement:
return builder.parse(new InputSource(name));
Note that several of these methods throw exceptions. In particular, newDocumentBuilder throws ParserConfigurationException, and parse throws IOException and SAXException. To keep this example simple, I caught all exceptions in one catch clause and printed the exception message to the console.
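One wrinkle is worth a sketch: turning on validation doesn’t by itself make parse throw an exception when a document breaks the rules of its DTD. Validity problems go to the builder’s ErrorHandler, so if you want invalid documents rejected, you can register a handler that rethrows them. The following is a minimal illustration under that assumption; the class name StrictParseDemo, the isValid method, and the embedded sample DTD are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class StrictParseDemo
{
    // Hypothetical sample DTD, embedded so the example needs no files.
    public static final String DTD =
        "<!DOCTYPE Movies [\n"
        + "<!ELEMENT Movies (Movie*)>\n"
        + "<!ELEMENT Movie (Title, Price)>\n"
        + "<!ELEMENT Title (#PCDATA)>\n"
        + "<!ELEMENT Price (#PCDATA)>\n"
        + "]>\n";

    // Returns true only if the XML parses and satisfies its DTD.
    public static boolean isValid(String xml)
    {
        try
        {
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler()
            {
                public void warning(SAXParseException e)
                {
                    // Warnings are ignored in this sketch.
                }
                public void error(SAXParseException e)
                    throws SAXParseException
                {
                    throw e;   // validity errors arrive here
                }
                public void fatalError(SAXParseException e)
                    throws SAXParseException
                {
                    throw e;   // malformed XML arrives here
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        }
        catch (ParserConfigurationException | SAXException | IOException e)
        {
            // For simplicity, any failure counts as "not valid" here.
            return false;
        }
    }

    public static void main(String[] args)
    {
        String good = DTD + "<Movies><Movie><Title>Glory</Title>"
            + "<Price>14.95</Price></Movie></Movies>";
        String bad = DTD + "<Movies><Movie><Price>14.95</Price>"
            + "<Title>Glory</Title></Movie></Movies>";
        System.out.println("Title then Price: " + isValid(good));
        System.out.println("Price then Title: " + isValid(bad));
    }
}
```

With a handler like this in place, code that runs after parse can rely on the structure that the DTD promises.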
By adding the getDocument method, you can create a DOM document from a file with a single statement, like this:
Document doc = getDocument("movies.xml");
Here the movies.xml file is read, and a DOM document is created and assigned to the doc variable.
Also note that you must provide three import statements to use the getDocument method, as follows:
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
DocumentBuilder and DocumentBuilderFactory are in the javax.xml.parsers package, Document is in org.w3c.dom, and InputSource is in org.xml.sax.
After you have a DOM document in memory, you can easily retrieve data from the document’s nodes. The DOM API is based on interfaces rather than classes, so each node of the DOM document is represented by an object that implements one or more DOM interfaces. The following paragraphs give you an overview of the interfaces you need to work with:
Document: The entire document is represented by an object that implements the Document interface. The method you use most from this interface is getDocumentElement, which returns an Element object that represents the document’s root node. After you have the root node, you can navigate to other nodes in the document to get the information you’re looking for.
Node: The Node interface represents a node in the DOM document. This interface provides methods that are common to all nodes. Table 5-4 lists the most useful of these methods. This table also lists some of the field values that the getNodeType method can return.
Element: The Element interface represents nodes that correspond to elements in the XML document. Element extends Node, so any object that implements Element is also a Node. Table 5-5 lists some of the most useful methods of this interface.
Text: The text content of an element isn’t contained in the element itself, but in a Text node that’s stored as a child of the element. The Text interface has a few interesting methods you may want to look up, but for most applications, you just use the getNodeValue method inherited from the Node interface to retrieve the text stored by a text node.
NodeList: A NodeList is a collection of nodes that’s returned by methods such as the getChildNodes method of the Node interface or the getElementsByTagName method of the Element interface. NodeList has just two methods: item(int i), which returns the node at the specified index, and getLength(), which returns the number of items in the list. (As with almost every other index in Java, the first node is index 0, not 1.)
TABLE 5-4 The Node Interface
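Method | Description
NodeList getChildNodes() | Gets a NodeList object that contains all of this node’s child nodes.
Node getFirstChild() | Gets the first child of this node.
Node getLastChild() | Gets the last child of this node.
short getNodeType() | Gets an integer code that indicates the type of the node. The value can be one of the fields listed later in this table.
String getNodeValue() | Gets the value of this node, if the node has a value.
Node getNextSibling() | Gets the next sibling node.
Node getPreviousSibling() | Gets the preceding sibling node.
boolean hasChildNodes() | Determines whether the node has any child nodes.
Field | Description
ATTRIBUTE_NODE | The node is an attribute node.
CDATA_SECTION_NODE | The node is a CDATA section (character data that isn’t parsed as markup).
COMMENT_NODE | The node is a comment.
DOCUMENT_NODE | The node is a document node.
ELEMENT_NODE | The node is an element node.
TEXT_NODE | The node is a text node.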
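TABLE 5-5 The Element Interface
Method | Description
String getAttribute(String name) | Gets the value of the specified attribute.
NodeList getElementsByTagName(String name) | Gets a NodeList object that contains all the descendant elements that have the specified name.
boolean hasAttribute(String name) | Determines whether the element has the specified attribute.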
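Here’s a minimal sketch that puts the three Element methods from Table 5-5 to work together. It lists each movie that actually has a year attribute. The class name ElementMethodsDemo, the titlesWithYear method, and the sample XML are made up for this illustration, and the sketch borrows the getTextContent method of the Node interface as a shortcut for reading an element’s text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementMethodsDemo
{
    // Returns "year: title" for every Movie that has a year attribute.
    public static List<String> titlesWithYear(String xml)
    {
        List<String> result = new ArrayList<>();
        try
        {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            // Finds Movie elements at any depth beneath the root.
            NodeList movies = root.getElementsByTagName("Movie");
            for (int i = 0; i < movies.getLength(); i++)
            {
                Element movie = (Element) movies.item(i);
                if (!movie.hasAttribute("year"))
                    continue;   // skip movies without a year
                Node title = movie.getElementsByTagName("Title").item(0);
                if (title != null)
                    result.add(movie.getAttribute("year") + ": "
                        + title.getTextContent());
            }
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args)
    {
        String xml = "<Movies>"
            + "<Movie year=\"1946\"><Title>It's a Wonderful Life</Title></Movie>"
            + "<Movie><Title>Glory</Title></Movie>"
            + "</Movies>";
        for (String line : titlesWithYear(xml))
            System.out.println(line);
    }
}
```

Because getElementsByTagName searches all descendants rather than just direct children, it’s handy when you don’t want to depend on the exact nesting of the document.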
Assuming that you use a DTD to validate the XML file when you build the document, you can usually navigate the document to pick up the information you need without resorting to NodeList objects. Here’s a routine that simply counts all the Movie elements in the movies.xml file (shown in Listing 5-1 earlier in this chapter) after it’s been parsed into a Document object named doc:
int count = 0;
Element root = doc.getDocumentElement();
Node movie = root.getFirstChild();
while (movie != null)
{
    count++;
    movie = movie.getNextSibling();
}
System.out.println("There are " + count + " movies.");
This routine first calls the getFirstChild method to get the first child of the root element. Then it uses each child element’s getNextSibling method to get the next element that’s also a child of the root element.
If you run a program that contains these lines, the following line appears on the console:
There are 10 movies.
This program doesn’t do anything with the Movie elements other than count them, but in the next section (“Getting attribute values”), you see how to extract data from the Movie elements.
An alternative way to process all the elements in the movies.xml file is to use the getChildNodes method to return a NodeList object that contains all the elements. Then you can use a for loop to access each element individually. Here’s a snippet of code that lists the name of each element:
Element root = doc.getDocumentElement();
NodeList movies = root.getChildNodes();
for (int i = 0; i < movies.getLength(); i++)
{
    Node movie = movies.item(i);
    System.out.println(movie.getNodeName());
}
Here the item method is used in the for loop to retrieve each Movie element. If you run a program that contains these lines, ten lines with the word Movie are displayed on the console.
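Both loops count on the builder having dropped white space and comments. If a document arrives without a DTD, the line breaks and indentation between elements show up as Text nodes among the children. Here’s a minimal sketch of a defensive alternative that keeps only the element children; the class name ElementChildrenDemo and its childElements and countChildElements methods are hypothetical helpers made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementChildrenDemo
{
    // Returns only the children that are elements, skipping text
    // nodes (such as white space) and comments.
    public static List<Element> childElements(Node parent)
    {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++)
        {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE)
                result.add((Element) child);
        }
        return result;
    }

    // Parses an XML string (no DTD needed) and counts the root's
    // element children.
    public static int countChildElements(String xml)
    {
        try
        {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new InputSource(new StringReader(xml)));
            return childElements(doc.getDocumentElement()).size();
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args)
    {
        String xml = "<Movies>\n"
            + "  <Movie year=\"1946\"/>\n"
            + "  <!-- a comment -->\n"
            + "  <Movie year=\"1974\"/>\n"
            + "</Movies>";
        System.out.println("Element children: " + countChildElements(xml));
    }
}
```

The check against Node.ELEMENT_NODE is what makes the cast to Element safe.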
To get the value of an element’s attribute, call the getAttribute method and pass the name of the attribute as the parameter. This method returns the string value of the attribute. Then you can convert this value to another type if necessary. Note that the value may include some white space, so you should run the value through the trim method to get rid of any superfluous white space.
Here’s an example that gets the year attribute from each movie in the movies.xml file and determines the year of the oldest movie in the collection:
Element root = doc.getDocumentElement();
Element movie = (Element)root.getFirstChild();
int oldest = 9999;
while (movie != null)
{
    String s = movie.getAttribute("year");
    int year = Integer.parseInt(s);
    if (year < oldest)
        oldest = year;
    movie = (Element)movie.getNextSibling();
}
System.out.println("The oldest movie in the file "
    + "is from " + oldest + ".");
The year attribute is extracted with these two lines of code:
String s = movie.getAttribute("year");
int year = Integer.parseInt(s);
The first line gets the string value of the year attribute, and the second line converts it to an int.
Notice the extra casting that’s done in this method. It’s necessary because the movie variable has to be an Element type so that you can call the getAttribute method. The getNextSibling method returns a Node, however, not an Element. As a result, the compiler doesn’t let you assign the node to the movie variable unless you first cast it to an Element.
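The oldest-movie loop trusts the DTD: every child is a Movie element, and every Movie has a numeric year. When you can’t count on that, getAttribute returns an empty string for a missing attribute, and Integer.parseInt then throws a NumberFormatException. Here’s a minimal sketch of a more cautious version; the class name YearAttributeDemo, its yearOf and oldestYear methods, and the sample XML are made up for this illustration.

```java
import java.io.StringReader;
import java.util.OptionalInt;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class YearAttributeDemo
{
    // Returns the year attribute as a number, or empty if the
    // attribute is missing or isn't a valid integer.
    public static OptionalInt yearOf(Element movie)
    {
        if (!movie.hasAttribute("year"))
            return OptionalInt.empty();
        try
        {
            return OptionalInt.of(
                Integer.parseInt(movie.getAttribute("year").trim()));
        }
        catch (NumberFormatException e)
        {
            return OptionalInt.empty();
        }
    }

    // Finds the smallest usable year among the root's child elements.
    public static OptionalInt oldestYear(String xml)
    {
        try
        {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new InputSource(new StringReader(xml)));
            OptionalInt oldest = OptionalInt.empty();
            for (Node n = doc.getDocumentElement().getFirstChild();
                 n != null; n = n.getNextSibling())
            {
                if (!(n instanceof Element))
                    continue;   // skip white space and comments
                OptionalInt year = yearOf((Element) n);
                if (year.isPresent() && (!oldest.isPresent()
                    || year.getAsInt() < oldest.getAsInt()))
                    oldest = year;
            }
            return oldest;
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args)
    {
        String xml = "<Movies>\n"
            + "  <Movie year=\"1974\"/>\n"
            + "  <Movie year=\" 1946 \"/>\n"
            + "  <Movie/>\n"
            + "</Movies>";
        OptionalInt oldest = oldestYear(xml);
        System.out.println(oldest.isPresent()
            ? "The oldest movie is from " + oldest.getAsInt() + "."
            : "No usable year attributes were found.");
    }
}
```

The instanceof test guards the cast, and hasAttribute plus trim guard the conversion to int.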
You may be surprised to find that the text content of an element isn’t stored with the element. Instead, it’s stored in a child node of type Text. Consider the following XML:
<Title>The Princess Bride</Title>
This element results in two nodes in the DOM document: an Element node named Title and a Text node that contains the text The Princess Bride.
Thus, if you have a Title element in hand, you must get the Text node before you can get the text content, as in this example:
Node textElement = titleElement.getFirstChild();
String title = textElement.getNodeValue();
If you prefer to write your code a little more tersely, you can use a single statement like this:
String title =
    titleElement.getFirstChild().getNodeValue();
If you find this incantation to be a little tedious, and you’re doing a lot of it in your program, write yourself a little helper method, like this one:
private static String getTextValue(Node n)
{
    return n.getFirstChild().getNodeValue();
}
Then you can get the text content for an element by calling the getTextValue method, like this:
String title = getTextValue(titleElement);
After you get the text content, you can parse it to a numeric type if you need to.
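The getTextValue helper assumes that the element has a Text child. An empty element such as <Price></Price> has no children, so getFirstChild returns null and the helper throws a NullPointerException. Here’s a minimal sketch of a more forgiving helper built on the getTextContent method of the Node interface; the class name TextValueDemo and its textOf and firstText methods are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TextValueDemo
{
    // Returns the trimmed text inside a node, or an empty string if
    // the node is missing or has no text.
    public static String textOf(Node n)
    {
        if (n == null)
            return "";
        String text = n.getTextContent();
        return text == null ? "" : text.trim();
    }

    // Parses an XML string and returns the text of the first element
    // that has the given tag name.
    public static String firstText(String xml, String tagName)
    {
        try
        {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new InputSource(new StringReader(xml)));
            return textOf(doc.getElementsByTagName(tagName).item(0));
        }
        catch (Exception e)
        {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args)
    {
        String xml = "<Movie><Title> The Princess Bride </Title>"
            + "<Price></Price></Movie>";
        System.out.println("Title: [" + firstText(xml, "Title") + "]");
        System.out.println("Price: [" + firstText(xml, "Price") + "]");
    }
}
```

Whether an empty string is the right answer for a missing value depends on your program; for a price, you may prefer to report an error instead.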
Now that you’ve seen the various interfaces and classes you use to get data from an XML file, Listing 5-3 shows a complete program that reads the movies.xml file (shown in Listing 5-1 earlier in this chapter) and lists the title, year, and price of each movie on the console. When you run this program, the following appears on the console:
1946: It's a Wonderful Life ($14.95)
1974: Young Frankenstein ($16.95)
1977: Star Wars ($17.95)
1987: The Princess Bride ($16.95)
1989: Glory ($14.95)
1997: The Game ($14.95)
1998: Shakespeare in Love ($19.95)
2009: Zombieland ($18.95)
2010: The King's Speech ($17.95)
2013: Star Trek Into Darkness ($19.95)
LISTING 5-3 Reading an XML Document
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.w3c.dom.*;
import java.text.*;

public class ListMoviesXML
{
    private static NumberFormat cf =
        NumberFormat.getCurrencyInstance();

    public static void main(String[] args)
    {
        Document doc = getDocument("movies.xml");
        if (doc == null)
            return;    // getDocument has already printed the error
        Element root = doc.getDocumentElement();
        Element movieElement = (Element)root.getFirstChild();
        Movie m;
        while (movieElement != null)
        {
            m = getMovie(movieElement);
            String msg = Integer.toString(m.year);
            msg += ": " + m.title;
            msg += " (" + cf.format(m.price) + ")";
            System.out.println(msg);
            movieElement =
                (Element)movieElement.getNextSibling();
        }
    }

    private static Document getDocument(String name)
    {
        try
        {
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(true);
            factory.setIgnoringElementContentWhitespace(true);
            factory.setValidating(true);
            DocumentBuilder builder =
                factory.newDocumentBuilder();
            return builder.parse(new InputSource(name));
        }
        catch (Exception e)
        {
            System.out.println(e.getMessage());
        }
        return null;
    }

    private static Movie getMovie(Element e)
    {
        // get the year attribute
        String yearString = e.getAttribute("year");
        int year = Integer.parseInt(yearString);
        // get the Title element
        Element tElement = (Element)e.getFirstChild();
        String title = getTextValue(tElement).trim();
        // get the Price element
        Element pElement =
            (Element)tElement.getNextSibling();
        String pString = getTextValue(pElement).trim();
        double price = Double.parseDouble(pString);
        return new Movie(title, year, price);
    }

    private static String getTextValue(Node n)
    {
        return n.getFirstChild().getNodeValue();
    }

    private static class Movie
    {
        public String title;
        public int year;
        public double price;

        public Movie(String title, int year, double price)
        {
            this.title = title;
            this.year = year;
            this.price = price;
        }
    }
}
Because all the code in this program appears elsewhere in this chapter, the following paragraphs just provide a simple description of what each method in this program does:
The main method starts by calling the getDocument method to get a Document object from the file movies.xml. Then it gets the root element and uses a while loop to spin through all the child elements, which you know to be Movie elements because the document was validated when it was parsed. As each Movie element is processed, it’s passed to the getMovie method, which extracts the year attribute and the Title and Price elements, and returns a Movie object. Then the movie is printed on the console.
The getDocument method accepts a filename as a parameter and returns a Document object. Before it creates the DocumentBuilder object, it sets the configuration options so that comments and white space are ignored and the XML file is validated. Because the XML file is validated, you must create a DTD file (like the file in Listing 5-2 earlier in this chapter). Also, you must begin the XML file with a DOCTYPE declaration (such as <!DOCTYPE Movies SYSTEM "movies.dtd">).
The getMovie method is passed an Element object that represents a Movie element. It extracts the year attribute, gets the text value of the Title element, and parses the text value of the Price element to a double. Then it uses these values to create a new Movie object, which is returned to the caller.
The getTextValue method is simply a little helper method that gets the text content from a node. This method assumes that the node has a child node containing the text value, so you shouldn’t call this method unless you know that to be the case. (Because the XML document was validated, you do.)
The Movie class is a private static nested class that represents a single movie. It uses public fields to hold the title, year, and price data, and it provides a simple constructor that initializes these fields.