Chapter 18. XML
XML (Extensible Markup Language) is a cousin of HTML that has become important as a data management tool. XML documents are text files marked with tags that indicate the structure of the content of the files. XML is software and platform independent, which means that it provides an excellent environment for transferring data between database systems.
XML has been accepted by the Word Wide Web Consortium (W3C) and has become a de facto standard for cross-platform data transfers. Because it is an open source specification, a number of application programming languages have been built on top of it, such as XHTML, RSS, MathML, GraphML, Scalable Vector Graphics, and MusicXML.
Major DBMSs handle XML in two ways. Some store XML documents as binary objects. The documents are not searchable and therefore may be accompanied by keywords assigned to the documents for search purposes. The documents are stored and retrieved as complete units. Other DBMSs store XML data in standard character columns and then generate complete documents for output. Although incorporating XML still isn't a “typical” relational DBMS activity, its use is expanding, and you may well encounter an organization that uses it.

XML Syntax

Like HTML, XML documents contain tags that identify elements in the document. However, each markup language has a different purpose. HTML tags communicate with a Web browser, telling the browser how to display the page. XML tags, however, identify data elements and how they are related to one another. HTML tags are defined as part of the language specification. In contrast, almost all XML tags are defined by the person or software writing the XML.
To see how it works, consider the XML in Figure 18-1, which contains data for an instance of SmartMart's Product entity. Notice first that tags are surrounded by < and >, just as they are in HTML. Most tags are paired, with the closing tag beginning with /. The data are text. (Because there is no provision for storing binary images, the product image is supplied as the URL of a server where all the images are stored. An application program will be needed to resolve the URL, retrieve the image file, and store the image in the database.)
B9780123747303000206/gr1.jpg is missing
Figure 18-1
An XML instance of the SmartMart product entity.
The tags in Figure 18-1 are also nested. The <entities> tag is known as the root tag because it represents the top of the tag hierarchy. The <product> tag is nested underneath it, with all of the product's attributes nested inside the product. Nested paired tags work just like parentheses in programming languages: The last opened tag must also be the first tag closed. To transfer all of SmartMart's product data from one database to another, an XML document would contain multiple <product> tags, each with different data. Elements that have no data can be handled as solo tags that contain only an opening tag that ends with /.
Elements can also have attributes. In this case, the term is used somewhat differently from how it is used in a database environment. Here it is used for data values that apply to a specific element as “children” of that element. In Figure 18-2, for example, you can see a modified version of the product entity instance. The product entity and shipping weight elements now have attributes.
B9780123747303000206/gr2.jpg is missing
Figure 18-2
The XML instance from Figure 18-1 with element attributes.
XML also supports something known as an entity that has nothing to do with the entities represented in a database. Instead, XML entities are placeholders for special characters that have meaning to an XML parser, making it possible to include those characters in data. The five predefined entities can be found in Table 18-1. Additional entities can be created as needed.
Table 18-1 XML Predefined Entities
CharacterEntity
&&amp;
<&lt;
>&gt;
'&apos;
&quot;

XML Document Correctness

XML documents may be examined by XML conforming parsers to determine if they meet basic standards for correctness. At the lowest level, an XML document needs to be well formed, in which case it adheres to the basic syntax rules, such as having a </> tag for each < > tag. XML documents that are valid also conform to rules governing their content. For example, a valid document contains only those elements that appear in a related XML schema. (See the next section for more information.)
Well-formed documents must meet a number of rules, including the following:
▪ Elements that have content must have a start tag and an end tag (<tag_name>data</tag_name>).
▪ Elements that may be empty can use a single tag as a shorthand for the paired tags required by elements that have content (<tag_name/>).
▪ Attribute values must be surrounded by quotes (either single or double).
▪ Characters used to delimit items in an XML document must be represented within data by entities.
▪ Like parentheses and braces in computer programs, tags can be nested but must not overlap. In other words, the last opened tag must be the next tag closed.
Note: Conforming parsers for checking the correctness of XML documents are available as stand-alone tools. However, XML-enabled Web browsers such as Firefox can also do the job. Simply attempt to open the XML document in the browser. If the browser can't interpret the document, it will let you know.

XML Schemas

The syntax in Figure 18-1 and Figure 18-2 is designed to transfer data, but it isn't suited for defining the structure of an XML document. XML has therefore been extended to support several schema languages. XML schemas are intended to specify the structure of XML documents that contain data. Documents can then be validated against the schema to ensure that they adhere to the format that the DBMS expects. XML schemas can also be used to specify the format for an XML document generated as output by a DBMS.
Note: XML schemas are the more modern equivalent of XML Data Type Definitions (DTDs).
The first schema language to be recommended by the W3C was XML Schema, an example of which appears in Figure 18-3. As you can see, the product relation is defined as a complex data type.
B9780123747303000206/gr3.jpg is missing
Figure 18-3
An XML schema definition for the product entity.
Keep in mind that an XML schema is not precisely the same as a relational schema, although you could include the same structural elements (and the same relationships) in an XML schema as you do in a relational schema. An XML schema specifies the layout of data in a text document. In contrast, a relational schema specifies the structure of relations and constraints on those relations. XML schemas, although able to include elements that represent primary and foreign keys, have no provision for constraints on data. If constraints are to be applied, they must be enforced once the data have been loaded into a relational database as normal relational data rather than XML documents.

XML Support in Relational DBMSs

Although XML itself is defined by a standard, the way in which a specific DBMS supports XML is not. To give you an overview of the nature and extent of such support, this section looks at DB2 and Oracle, two of the most widely used DBMSs.

DB2

DB2 supports the storage of XML with an XML column data type. A data type of XML can be assigned to a column in any relation and can then store one XML document in that column for each row of the table. The syntax for including an XML column in a table is the same as creating a table using any other SQL data type (for example, Figure 18-4).
B9780123747303000206/gr4.jpg is missing
Figure 18-4
The definition of a table for DB2, including an XML column.
To store a document, an application program sends the text (for example, Figure 18-5) to the DB2 server. The server then checks the document to see that it is correct (well formed and valid) and then converts it into a hierarchical representation such as that in Figure 18-6. This hierarchy is stored separately from the rest of the table, although an internal data structure links the relational and XML storage structures.
B9780123747303000206/gr5.jpg is missing
Figure 18-5
Sample XML for loading into a DB2 XML column.
B9780123747303000206/gr6.jpg is missing
Figure 18-6
The hierarchical representation of and instance of the XML column specified in Figure 18-5.
DB2's XML column data type does not need to be linked to an XML schema. However, schemas can be applied to documents to check their correctness.
As you know, indexing can significantly speed up database query performance. XML documents, however, bury their data within the document (and thus within a single DB2 column). The hierarchical storage structure, such as that in Figure 18-6, however, makes indexing possible. To create an index on a specific element, the user must supply the path through the hierarchy to reach the desired data. For example, to create an index on the product name from Figure 18-6, someone might type:
CREATE INDEX idx_product_name ON products (product_details)GENERATE KEY USING XML PATTERN ‘/product_details/product_name’AS SQL VARCHAR (50);
Notice that the key for the index is indicated by the path to the data element and that it must be given an SQL data type for use in the index. If a value in an XML document can't be transformed into the index data type, the value won't be included in the index.
DB2 supports manipulation of XML data through two extensions: SQL/XML and XQuery. Both are query languages, although support for data modification using XQuery is still under development. 1 SQL/XML is an extension of the SQL standard, although, as with most DBMSs, the implementation is proprietary. XQuery is also an open standard.
1At the time this book was written, the most recent update of the XQuery standard seemed to be from August 2008. You can find it at www.w3.org/TR/xqupdate/.
SQL/XML contains extensions to standard SQL to retrieve either entire XML documents or specific elements from a document. When referencing specific elements in an XML document, you include the path through the document storage hierarchy to reach the element, just as it was done with creating an index on a single data element.

Oracle

Oracle provides three storage options for XML documents:
▪ Store an entire XML document as a character large object (CLOB). Because the document is stored as a whole, it cannot be validated against an XML schema and provides slow performance on relational queries. CLOB storage also uses the most disk space.
▪ Parse the document with an XML schema (including validation) and store the data in binary form. This is the easiest storage method to use, and although its performance on relational queries isn't optimal, it does outperform CLOB storage.
▪ Store parsed data in an object-relational format, a combination of relational and object-oriented data modeling. The result is slow performance when an XML document is added to the database but the best query performance and minimal use of disk space.
Like DB2, Oracle provides a data type for a column in a relation to store XML data (XMLType). An XMLType column in a relation can accept data from a previously stored CLOB, a character column (VARCHAR2), a correctly formatted SQL INSERT statement, or the query of another XMLType column.
An SQL INSERT statement to add a row to the stocked products table in Figure 18-4 might look like Figure 18-7. Notice that the XML specification is preceded by the data type to alert the DBMS that what follows is in XML format.
B9780123747303000206/gr7.jpg is missing
Figure 18-7
Using SQL to insert data into an Oracle XMLType column.
Oracle's SQL implementation allows querying of XMLType columns using the standard SQL SELECT command. The product also supports SQL/XML for querying relational data to produce XML documents as output and provides an XQuery implementation.
For Further Reading
Chaudhri, Akaml B; Rashid, Awais; Zicari, Roberto, XML Data Management: Native XML and XML-Extended Database Systems. (2003) Addison-Wesley Professional.
IBM Redbooks, Integrating XML with DB2 XML Extender and DB2 Text Extender. (2000) IBM.
Melton, Jim; Buxton, Stephen, Querying XML: XQuery, XPath, and SQL/XML in Context. (2006) Morgan Kaufmann.
Powell, Galvin, Beginning XML Databases. (2006) Wrox.
Quin, Liam, Open Source XML Database Toolkit: Resources and Techniques for Improved Development. (2000) Wiley.
Scardina, Mark; Chang, Ben; Wang, Jinyu, Oracle Database 10g XML & SQL. (2004) McGraw-Hill Osborne Media.
Williams, Kevin; Brudage, Michael; Dengler, Patrick; Gabriel, Jeff; Hoskinson, Andy; Kay, Michael; Maxwell, Thomas; Ochoa, Marcelo; Papa, Johnny; Vanmane, Mohan, Professional XML Databases. (2000) Wrox.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset