Chapter 11. Other XML Technologies

The System.Xml namespace comprises the following namespaces and core classes:

System.Xml.*
XmlReader and XmlWriter
High-performance, forward-only cursors for reading or writing an XML stream
XmlDocument
Represents an XML document in a W3C-style DOM (obsolete)
System.Xml.XLinq
Modern LINQ-centric DOM for working with XML (see Chapter 10)
System.Xml.XmlSchema
Infrastructure and API for (W3C) XSD schemas
System.Xml.Xsl
Infrastructure and API (XslCompiledTransform) for performing (W3C) XSLT transformations of XML
System.Xml.Serialization
Supports the serialization of classes to and from XML (see Chapter 17)

W3C is an abbreviation for World Wide Web Consortium, where the XML standards are defined.

XmlConvert, the static class for parsing and formatting XML strings, is covered in Chapter 6.

XmlReader

XmlReader is a high-performance class for reading an XML stream in a low-level, forward-only manner.

Consider the following XML file:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<customer id="123" status="archived">
  <firstname>Jim</firstname>
  <lastname>Bo</lastname>
</customer>

To instantiate an XmlReader, you call the static XmlReader.Create method, passing in a Stream, a TextReader, or a URI string. For example:

using (XmlReader reader = XmlReader.Create ("customer.xml"))
  ...
Note

Because XmlReader lets you read from potentially slow sources (Streams and URIs), it offers asynchronous versions of most of its methods so that you can easily write nonblocking code. We’ll cover asynchrony in detail in Chapter 14.

To construct an XmlReader that reads from a string:

XmlReader reader = XmlReader.Create (
  new System.IO.StringReader (myString));

You can also pass in an XmlReaderSettings object to control parsing and validation options. The following three properties on XmlReaderSettings are particularly useful for skipping over superfluous content:

bool IgnoreComments                  // Skip over comment nodes?
bool IgnoreProcessingInstructions    // Skip over processing instructions?
bool IgnoreWhitespace                // Skip over whitespace?

In the following example, we instruct the reader not to emit whitespace nodes, which are a distraction in typical scenarios:

XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;

using (XmlReader reader = XmlReader.Create ("customer.xml", settings))
  ...

Another useful property on XmlReaderSettings is ConformanceLevel. Its default value of Document instructs the reader to assume a valid XML document with a single root node. This is a problem if you want to read just an inner portion of XML containing multiple nodes:

<firstname>Jim</firstname>
<lastname>Bo</lastname>

To read this without throwing an exception, you must set ConformanceLevel to Fragment.

XmlReaderSettings also has a property called CloseInput, which indicates whether to close the underlying stream when the reader is closed (there’s an analogous property on XmlWriterSettings called CloseOutput). The default value for CloseInput and CloseOutput is false.

Reading Nodes

The units of an XML stream are XML nodes. The reader traverses the stream in textual (depth-first) order. The Depth property of the reader returns the current depth of the cursor.

The most primitive way to read from an XmlReader is to call Read. It advances to the next node in the XML stream, rather like MoveNext in IEnumerator. The first call to Read positions the cursor at the first node. When Read returns false, it means the cursor has advanced past the last node, at which point the XmlReader should be closed and abandoned.

In this example, we read every node in the XML stream, outputting each node type as we go:

XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;

using (XmlReader reader = XmlReader.Create ("customer.xml", settings))
  while (reader.Read())
  {
    Console.Write (new string (' ',reader.Depth*2));  // Write indentation
    Console.WriteLine (reader.NodeType);
  }

The output is as follows:

XmlDeclaration
Element
  Element
    Text
  EndElement
  Element
    Text
  EndElement
EndElement
Note

Attributes are not included in Read-based traversal (see the section “Reading Attributes” later in this chapter).

NodeType is of type XmlNodeType, which is an enum with these members:

None
XmlDeclaration
Element
EndElement
Text
Attribute
Comment
Entity
EndEntity
EntityReference
ProcessingInstruction
CDATA
Document
DocumentType
DocumentFragment
Notation
Whitespace
SignificantWhitespace

Two string properties on XmlReader provide access to a node’s content: Name and Value. Depending on the node type, either Name or Value (or both) is populated:

XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;
settings.DtdProcessing = DtdProcessing.Parse;    // Required to read DTDs

using (XmlReader r = XmlReader.Create ("customer.xml", settings))
  while (r.Read())
  {
    Console.Write (r.NodeType.ToString().PadRight (17, '-'));
    Console.Write ("> ".PadRight (r.Depth * 3));

    switch (r.NodeType)
    {
      case XmlNodeType.Element:
      case XmlNodeType.EndElement:
        Console.WriteLine (r.Name); break;

      case XmlNodeType.Text:
      case XmlNodeType.CDATA:
      case XmlNodeType.Comment:
      case XmlNodeType.XmlDeclaration:
        Console.WriteLine (r.Value); break;

      case XmlNodeType.DocumentType:
        Console.WriteLine (r.Name + " - " + r.Value); break;

      default: break;
    }
  }

To demonstrate this, we’ll expand our XML file to include a document type, entity, CDATA, and comment:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE customer [ <!ENTITY tc "Top Customer"> ]>
<customer id="123" status="archived">
  <firstname>Jim</firstname>
  <lastname>Bo</lastname>
  <quote><![CDATA[C#'s operators include: < > &]]></quote>
  <notes>Jim Bo is a &tc;</notes>
  <!--  That wasn't so bad! -->
</customer>

An entity is like a macro; a CDATA is like a verbatim string (@"...") in C#. Here’s the result:

XmlDeclaration---> version="1.0" encoding="utf-8"
DocumentType-----> customer -  <!ENTITY tc "Top Customer">
Element----------> customer
Element---------->  firstname
Text------------->     Jim
EndElement------->  firstname
Element---------->  lastname
Text------------->     Bo
EndElement------->  lastname
Element---------->  quote
CDATA------------>     C#'s operators include: < > &
EndElement------->  quote
Element---------->  notes
Text------------->     Jim Bo is a Top Customer
EndElement------->  notes
Comment---------->    That wasn't so bad!
EndElement------->  customer

XmlReader automatically resolves entities, so in our example, the entity reference &tc; expands into Top Customer.

Reading Elements

Often, you already know the structure of the XML document that you’re reading. To help with this, XmlReader provides a range of methods that read while presuming a particular structure. This simplifies your code, as well as performing some validation at the same time.

Note

XmlReader throws an XmlException if any validation fails. XmlException has LineNumber and LinePosition properties indicating where the error occurred—logging this information is essential if the XML file is large!

ReadStartElement verifies that the current NodeType is Element, and then calls Read. If you specify a name, it verifies that it matches that of the current element.

ReadEndElement verifies that the current NodeType is EndElement, and then calls Read.

For instance, we could read this:

<firstname>Jim</firstname>

as follows:

reader.ReadStartElement ("firstname");
Console.WriteLine (reader.Value);
reader.Read();
reader.ReadEndElement();

The ReadElementContentAsString method does all of this in one hit. It reads a start element, a text node, and an end element, returning the content as a string:

string firstName = reader.ReadElementContentAsString ("firstname", "");

The second argument refers to the namespace, which is blank in this example. There are also typed versions of this method, such as ReadElementContentAsInt, which parse the result. Returning to our original XML document:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<customer id="123" status="archived">
  <firstname>Jim</firstname>
  <lastname>Bo</lastname>
  <creditlimit>500.00</creditlimit>    <!-- OK, we sneaked this in! -->
</customer>

We could read it in as follows:

XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;

using (XmlReader r = XmlReader.Create ("customer.xml", settings))
{
  r.MoveToContent();                // Skip over the XML declaration
  r.ReadStartElement ("customer");
  string firstName    = r.ReadElementContentAsString ("firstname", "");
  string lastName     = r.ReadElementContentAsString ("lastname", "");
  decimal creditLimit = r.ReadElementContentAsDecimal ("creditlimit", "");

  r.MoveToContent();      // Skip over that pesky comment
  r.ReadEndElement();     // Read the closing customer tag
}
Note

The MoveToContent method is really useful. It skips over all the fluff: XML declarations, whitespace, comments, and processing instructions. You can also instruct the reader to do most of this automatically through the properties on XmlReaderSettings.

Optional elements

In the previous example, suppose that <lastname> was optional. The solution to this is straightforward:

r.ReadStartElement ("customer");
string firstName    = r. ReadElementContentAsString ("firstname", "");
string lastName     = r.Name == "lastname"
                      ? r.ReadElementContentAsString() : null;
decimal creditLimit = r.ReadElementContentAsDecimal ("creditlimit", "");

Random element order

The examples in this section rely on elements appearing in the XML file in a set order. If you need to cope with elements appearing in any order, the easiest solution is to read that section of the XML into an X-DOM. We describe how to do this later in the section “Patterns for Using XmlReader/XmlWriter”.

Empty elements

The way that XmlReader handles empty elements presents a horrible trap. Consider the following element:

<customerList></customerList>

In XML, this is equivalent to:

<customerList/>

And yet, XmlReader treats the two differently. In the first case, the following code works as expected:

reader.ReadStartElement ("customerList");
reader.ReadEndElement();

In the second case, ReadEndElement throws an exception, because there is no separate “end element” as far as XmlReader is concerned. The workaround is to check for an empty element as follows:

bool isEmpty = reader.IsEmptyElement;
reader.ReadStartElement ("customerList");
if (!isEmpty) reader.ReadEndElement();

In reality, this is a nuisance only when the element in question may contain child elements (such as a customer list). With elements that wrap simple text (such as firstname), you can avoid the whole issue by calling a method such as ReadElementContentAsString. The ReadElementXXX methods handle both kinds of empty elements correctly.

Other ReadXXX methods

Table 11-1 summarizes all ReadXXX methods in XmlReader. Most of these are designed to work with elements. The sample XML fragment shown in bold is the section read by the method described.

Table 11-1. Read methods
Members Works on NodeType Sample XML fragment Input parameters Data returned
ReadContentAsXXX Text <a>x</a>   x
ReadString Text <a>x</a>   x
ReadElementString Element <a>x</a>   x
ReadElementContentAsXXX Element <a>x</a>   x
ReadInnerXml Element <a>x</a>   x
ReadOuterXml Element <a>x</a>   <a>x</a>
ReadStartElement Element <a>x</a>    
ReadEndElement Element <a>x</a>    
ReadSubtree Element <a>x</a>   <a>x</a>
ReadToDescendant Element <a>x<b></b></a> "b"  
ReadToFollowing Element <a>x<b></b></a> "b"  
ReadToNextSibling Element <a>x</a><b></b> "b"  
ReadAttributeValue Attribute See “Reading Attributes”    

The ReadContentAsXXX methods parse a text node into type XXX. Internally, the XmlConvert class performs the string-to-type conversion. The text node can be within an element or an attribute.

The ReadElementContentAsXXX methods are wrappers around corresponding ReadContentAsXXX methods. They apply to the element node, rather than the text node enclosed by the element.

Note

The typed ReadXXX methods also include versions that read base 64 and BinHex formatted data into a byte array.

ReadInnerXml is typically applied to an element, and it reads and returns an element and all its descendants. When applied to an attribute, it returns the value of the attribute.

ReadOuterXml is the same as ReadInnerXml, except it includes rather than excludes the element at the cursor position.

ReadSubtree returns a proxy reader that provides a view over just the current element (and its descendants). The proxy reader must be closed before the original reader can be safely read again. At the point the proxy reader is closed, the cursor position of the original reader moves to the end of the subtree.

ReadToDescendant moves the cursor to the start of the first descendant node with the specified name/namespace.

ReadToFollowing moves the cursor to the start of the first node—regardless of depth—with the specified name/namespace.

ReadToNextSibling moves the cursor to the start of the first sibling node with the specified name/namespace.

ReadString and ReadElementString behave like ReadContentAsString and ReadElementContentAsString, except that they throw an exception if there’s more than a single text node within the element. In general, these methods should be avoided because they throw an exception if an element contains a comment.

Reading Attributes

XmlReader provides an indexer giving you direct (random) access to an element’s attributes—by name or position. Using the indexer is equivalent to calling GetAttribute.

Given the following XML fragment:

<customer id="123" status="archived"/>

we could read its attributes as follows:

Console.WriteLine (reader ["id"]);              // 123
Console.WriteLine (reader ["status"]);          // archived
Console.WriteLine (reader ["bogus"] == null);   // True
Warning

The XmlReader must be positioned on a start element in order to read attributes. After calling ReadStartElement, the attributes are gone forever!

Although attribute order is semantically irrelevant, you can access attributes by their ordinal position. We could rewrite the preceding example as follows:

Console.WriteLine (reader [0]);            // 123
Console.WriteLine (reader [1]);            // archived

The indexer also lets you specify the attribute’s namespace—if it has one.

AttributeCount returns the number of attributes for the current node.

Attribute nodes

To explicitly traverse attribute nodes, you must make a special diversion from the normal path of just calling Read. A good reason to do so is if you want to parse attribute values into other types, via the ReadContentAsXXX methods.

The diversion must begin from a start element. To make the job easier, the forward-only rule is relaxed during attribute traversal: you can jump to any attribute (forward or backward) by calling MoveToAttribute.

Note

MoveToElement returns you to the start element from anyplace within the attribute node diversion.

Returning to our previous example:

<customer id="123" status="archived"/>

we can do this:

reader.MoveToAttribute ("status");
string status = reader.ReadContentAsString();

reader.MoveToAttribute ("id");
int id = reader.ReadContentAsInt();

MoveToAttribute returns false if the specified attribute doesn’t exist.

You can also traverse each attribute in sequence by calling the MoveToFirstAttribute and then the MoveToNextAttribute methods:

if (reader.MoveToFirstAttribute())
  do
  {
    Console.WriteLine (reader.Name + "=" + reader.Value);
  }
  while (reader.MoveToNextAttribute());

// OUTPUT:
id=123
status=archived

Namespaces and Prefixes

XmlReader provides two parallel systems for referring to element and attribute names:

  • Name

  • NamespaceURI and LocalName

Whenever you read an element’s Name property or call a method that accepts a single name argument, you’re using the first system. This works well if no namespaces or prefixes are present; otherwise, it acts in a crude and literal manner. Namespaces are ignored, and prefixes are included exactly as they were written. For example:

Sample fragment Name
<customer ...> customer
<customer xmlns='blah' ...> customer
<x:customer ...> x:customer

The following code works with the first two cases:

reader.ReadStartElement ("customer");

The following is required to handle the third case:

reader.ReadStartElement ("x:customer");

The second system works through two namespace-aware properties: NamespaceURI and LocalName. These properties take into account prefixes and default namespaces defined by parent elements. Prefixes are automatically expanded. This means that NamespaceURI always reflects the semantically correct namespace for the current element, and LocalName is always free of prefixes.

When you pass two name arguments into a method such as ReadStartElement, you’re using this same system. For example, consider the following XML:

<customer xmlns="DefaultNamespace" xmlns:other="OtherNamespace">
  <address>
    <other:city>
    ...

We could read this as follows:

reader.ReadStartElement ("customer", "DefaultNamespace");
reader.ReadStartElement ("address",  "DefaultNamespace");
reader.ReadStartElement ("city",     "OtherNamespace");

Abstracting away prefixes is usually exactly what you want. If necessary, you can see what prefix was used through the Prefix property and convert it into a namespace by calling LookupNamespace.

XmlWriter

XmlWriter is a forward-only writer of an XML stream. The design of XmlWriter is symmetrical to XmlReader.

As with XmlTextReader, you construct an XmlWriter by calling Create with an optional settings object. In the following example, we enable indenting to make the output more human-readable, and then write a simple XML file:

XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;

using (XmlWriter writer = XmlWriter.Create ("..\..\foo.xml", settings))
{
  writer.WriteStartElement ("customer");
  writer.WriteElementString ("firstname", "Jim");
  writer.WriteElementString ("lastname"," Bo");
  writer.WriteEndElement();
}

This produces the following document (the same as the file we read in the first example of XmlReader):

<?xml version="1.0" encoding="utf-8" ?>
<customer>
  <firstname>Jim</firstname>
  <lastname>Bo</lastname>
</customer>

XmlWriter automatically writes the declaration at the top unless you indicate otherwise in XmlWriterSettings, by setting OmitXmlDeclaration to true or ConformanceLevel to Fragment. The latter also permits writing multiple root nodes—something that otherwise throws an exception.

The WriteValue method writes a single text node. It accepts both string and nonstring types such as bool and DateTime, internally calling XmlConvert to perform XML-compliant string conversions:

writer.WriteStartElement ("birthdate");
writer.WriteValue (DateTime.Now);
writer.WriteEndElement();

In contrast, if we call:

WriteElementString ("birthdate", DateTime.Now.ToString());

the result would be both non-XML-compliant and vulnerable to incorrect parsing.

WriteString is equivalent to calling WriteValue with a string. XmlWriter automatically escapes characters that would otherwise be illegal within an attribute or element, such as & < >, and extended Unicode characters.

Writing Attributes

You can write attributes immediately after writing a start element:

writer.WriteStartElement ("customer");
writer.WriteAttributeString ("id", "1");
writer.WriteAttributeString ("status", "archived");

To write nonstring values, call WriteStartAttribute, WriteValue, and then WriteEndAttribute.

Writing Other Node Types

XmlWriter also defines the following methods for writing other kinds of nodes:

WriteBase64       // for binary data
WriteBinHex       // for binary data
WriteCData
WriteComment
WriteDocType
WriteEntityRef
WriteProcessingInstruction
WriteRaw
WriteWhitespace

WriteRaw directly injects a string into the output stream. There is also a WriteNode method that accepts an XmlReader, echoing everything from the given XmlReader.

Namespaces and Prefixes

The overloads for the Write* methods allow you to associate an element or attribute with a namespace. Let’s rewrite the contents of the XML file in our previous example. This time we will associate all the elements with the http://oreilly.com namespace, declaring the prefix o at the customer element:

writer.WriteStartElement ("o", "customer", "http://oreilly.com");
writer.WriteElementString ("o", "firstname", "http://oreilly.com", "Jim");
writer.WriteElementString ("o", "lastname", "http://oreilly.com", "Bo");
writer.WriteEndElement();

The output is now as follows:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<o:customer xmlns:o='http://oreilly.com'>
  <o:firstname>Jim</o:firstname>
  <o:lastname>Bo</o:lastname>
</o:customer>

Notice how for brevity XmlWriter omits the child element’s namespace declarations when they are already declared by the parent element.

Patterns for Using XmlReader/XmlWriter

Working with Hierarchical Data

Consider the following classes:

public class Contacts
{
  public IList<Customer> Customers = new List<Customer>();
  public IList<Supplier> Suppliers = new List<Supplier>();
}

public class Customer { public string FirstName, LastName; }
public class Supplier { public string Name;                }

Suppose you want to use XmlReader and XmlWriter to serialize a Contacts object to XML as in the following:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<contacts>
   <customer id="1">
      <firstname>Jay</firstname>
      <lastname>Dee</lastname>
   </customer>
   <customer>                     <!-- we'll assume id is optional -->
      <firstname>Kay</firstname>
      <lastname>Gee</lastname>
   </customer>
   <supplier>
      <name>X Technologies Ltd</name>
   </supplier>
</contacts>

The best approach is not to write one big method, but to encapsulate XML functionality in the Customer and Supplier types themselves by writing ReadXml and WriteXml methods on these types. The pattern in doing so is straightforward:

  • ReadXml and WriteXml leave the reader/writer at the same depth when they exit.

  • ReadXml reads the outer element, whereas WriteXml writes only its inner content.

Here’s how we would write the Customer type:

public class Customer
{
  public const string XmlName = "customer";
  public int? ID;
  public string FirstName, LastName;

  public Customer () { }
  public Customer (XmlReader r) { ReadXml (r); }

  public void ReadXml (XmlReader r)
  {
    if (r.MoveToAttribute ("id")) ID = r.ReadContentAsInt();
    r.ReadStartElement();
    FirstName = r.ReadElementContentAsString ("firstname", "");
    LastName = r.ReadElementContentAsString ("lastname", "");
    r.ReadEndElement();
  }

  public void WriteXml (XmlWriter w)
  {
    if (ID.HasValue) w.WriteAttributeString ("id", "", ID.ToString());
    w.WriteElementString ("firstname", FirstName);
    w.WriteElementString ("lastname", LastName);
  }
}

Notice that ReadXml reads the outer start and end element nodes. If its caller did this job instead, Customer couldn’t read its own attributes. The reason for not making WriteXml symmetrical in this regard is twofold:

  • The caller might need to choose how the outer element is named.

  • The caller might need to write extra XML attributes, such as the element’s subtype (which could then be used to decide which class to instantiate when reading back the element).

Another benefit of following this pattern is that it makes your implementation compatible with IXmlSerializable (see Chapter 17).

The Supplier class is analogous:

public class Supplier
{
  public const string XmlName = "supplier";
  public string Name;

  public Supplier () { }
  public Supplier (XmlReader r) { ReadXml (r); }

  public void ReadXml (XmlReader r)
  {
    r.ReadStartElement();
    Name = r.ReadElementContentAsString ("name", "");
    r.ReadEndElement();
  }

  public void WriteXml (XmlWriter w)
  {
    w.WriteElementString ("name", Name);
  }
}

With the Contacts class, we must enumerate the customers element in ReadXml, checking whether each subelement is a customer or a supplier. We also have to code around the empty element trap:

public void ReadXml (XmlReader r)
{
  bool isEmpty = r.IsEmptyElement;           // This ensures we don't get
  r.ReadStartElement();                      // snookered by an empty
  if (isEmpty) return;                       // <contacts/> element!
  while (r.NodeType == XmlNodeType.Element)
  {
    if (r.Name == Customer.XmlName)      Customers.Add (new Customer (r));
    else if (r.Name == Supplier.XmlName) Suppliers.Add (new Supplier (r));
    else
      throw new XmlException ("Unexpected node: " + r.Name);
  }
  r.ReadEndElement();
}

public void WriteXml (XmlWriter w)
{
  foreach (Customer c in Customers)
  {
    w.WriteStartElement (Customer.XmlName);
    c.WriteXml (w);
    w.WriteEndElement();
  }
  foreach (Supplier s in Suppliers)
  {
    w.WriteStartElement (Supplier.XmlName);
    s.WriteXml (w);
    w.WriteEndElement();
  }
}

Mixing XmlReader/XmlWriter with an X-DOM

You can fly in an X-DOM at any point in the XML tree where XmlReader or XmlWriter becomes too cumbersome. Using the X-DOM to handle inner elements is an excellent way to combine X-DOM’s ease of use with the low-memory footprint of XmlReader and XmlWriter.

Using XmlReader with XElement

To read the current element into an X-DOM, you call XNode.ReadFrom, passing in the XmlReader. Unlike XElement.Load, this method is not “greedy” in that it doesn’t expect to see a whole document. Instead, it reads just the end of the current subtree.

For instance, suppose we have an XML logfile structured as follows:

<log>
  <logentry id="1">
    <date>...</date>
    <source>...</source>
    ...
  </logentry>
  ...
</log>

If there were a million logentry elements, reading the whole thing into an X-DOM would waste memory. A better solution is to traverse each logentry with an XmlReader, and then use XElement to process the elements individually:

XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;

using (XmlReader r = XmlReader.Create ("logfile.xml", settings))
{
  r.ReadStartElement ("log");
  while (r.Name == "logentry")
  {
    XElement logEntry = (XElement) XNode.ReadFrom (r);
    int id = (int) logEntry.Attribute ("id");
    DateTime date = (DateTime) logEntry.Element ("date");
    string source = (string) logEntry.Element ("source");
    ...
  }
  r.ReadEndElement();
}

If you follow the pattern described in the previous section, you can slot an XElement into a custom type’s ReadXml or WriteXml method without the caller ever knowing you’ve cheated! For instance, we could rewrite Customer’s ReadXml method as follows:

public void ReadXml (XmlReader r)
{
  XElement x = (XElement) XNode.ReadFrom (r);
  FirstName = (string) x.Element ("firstname");
  LastName = (string) x.Element ("lastname");
}

XElement collaborates with XmlReader to ensure that namespaces are kept intact and prefixes are properly expanded—even if defined at an outer level. So, if our XML file read like this:

<log xmlns="http://loggingspace">
  <logentry id="1">
  ...

the XElements we constructed at the logentry level would correctly inherit the outer namespace.

Using XmlWriter with XElement

You can use an XElement just to write inner elements to an XmlWriter. The following code writes a million logentry elements to an XML file using XElement—without storing the whole thing in memory:

using (XmlWriter w = XmlWriter.Create ("log.xml"))
{
  w.WriteStartElement ("log");
  for (int i = 0; i < 1000000; i++)
  {
    XElement e = new XElement ("logentry",
                   new XAttribute ("id", i),
                   new XElement ("date", DateTime.Today.AddDays (-1)),
                   new XElement ("source", "test"));
    e.WriteTo (w);
  }
  w.WriteEndElement ();
}

Using an XElement incurs minimal execution overhead. If we amend this example to use XmlWriter throughout, there’s no measurable difference in execution time.

XSD and Schema Validation

The content of a particular XML document is nearly always domain-specific, such as a Microsoft Word document, an application configuration document, or a web service. For each domain, the XML file conforms to a particular pattern. There are several standards for describing the schema of such a pattern, to standardize and automate the interpretation and validation of XML documents. The most widely accepted standard is XSD, short for XML Schema Definition. Its precursors, DTD and XDR, are also supported by System.Xml.

Consider the following XML document:

<?xml version="1.0"?>
<customers>
  <customer id="1" status="active">
    <firstname>Jim</firstname>
    <lastname>Bo</lastname>
  </customer>
  <customer id="1" status="archived">
    <firstname>Thomas</firstname>
    <lastname>Jefferson</lastname>
  </customer>
</customers>

We can write an XSD for this document as follows:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified"
           elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="customers">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" name="customer">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="firstname" type="xs:string" />
              <xs:element name="lastname" type="xs:string" />
            </xs:sequence>
            <xs:attribute name="id" type="xs:int" use="required" />
            <xs:attribute name="status" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

As you can see, XSD documents are themselves written in XML. Furthermore, an XSD document is describable with XSD—you can find that definition at http://www.w3.org/2001/xmlschema.xsd.

Performing Schema Validation

You can validate an XML file or document against one or more schemas before reading or processing it. There are a number of reasons to do so:

  • You can get away with less error checking and exception handling.

  • Schema validation picks up errors you might otherwise overlook.

  • Error messages are detailed and informative.

To perform validation, plug a schema into an XmlReader, an XmlDocument, or an X-DOM object, and then read or load the XML as you would normally. Schema validation happens automatically as content is read, so the input stream is not read twice.

Validating with an XmlReader

Here’s how to plug a schema from the file customers.xsd into an XmlReader:

XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.Schema;
settings.Schemas.Add (null, "customers.xsd");

using (XmlReader r = XmlReader.Create ("customers.xml", settings))
  ...

If the schema is inline, set the following flag instead of adding to Schemas:

settings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;

You then Read as you would normally. If schema validation fails at any point, an XmlSchemaValidationException is thrown.

Note

Calling Read on its own validates both elements and attributes: you don’t need to navigate to each individual attribute for it to be validated.

If you want only to validate the document, you can do this:

using (XmlReader r = XmlReader.Create ("customers.xml", settings))
  try { while (r.Read()) ; }
  catch (XmlSchemaValidationException ex)
  {
    ...
  }

XmlSchemaValidationException has properties for the error Message, LineNumber, and LinePosition. In this case, it only tells you about the first error in the document. If you want to report on all errors in the document, you instead must handle the ValidationEventHandler event:

XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.Schema;
settings.Schemas.Add (null, "customers.xsd");
settings.ValidationEventHandler += ValidationHandler;
using (XmlReader r = XmlReader.Create ("customers.xml", settings))
  while (r.Read()) ;

When you handle this event, schema errors no longer throw exceptions. Instead, they fire your event handler:

static void ValidationHandler (object sender, ValidationEventArgs e)
{
  Console.WriteLine ("Error: " + e.Exception.Message);
}

The Exception property of ValidationEventArgs contains the XmlSchemaValidationException that would have otherwise been thrown.

Note

The System.Xml namespace also contains a class called XmlValidatingReader. This was used to perform schema validation prior to Framework 2.0, and it is now deprecated.

Validating an X-DOM

To validate an XML file or stream while reading into an X-DOM, you create an XmlReader, plug in the schemas, and then use the reader to load the DOM:

XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.Schema;
settings.Schemas.Add (null, "customers.xsd");

XDocument doc;
using (XmlReader r = XmlReader.Create ("customers.xml", settings))
  try { doc = XDocument.Load (r); }
  catch (XmlSchemaValidationException ex) { ... }

You can also validate an XDocument or XElement that’s already in memory by calling extension methods in System.Xml.Schema. These methods accept an XmlSchemaSet (a collection of schemas) and a validation event handler:

XDocument doc = XDocument.Load (@"customers.xml");
XmlSchemaSet set = new XmlSchemaSet ();
set.Add (null, @"customers.xsd");
StringBuilder errors = new StringBuilder ();
doc.Validate (set, (sender, args) => { errors.AppendLine
                                       (args.Exception.Message); }
             );
Console.WriteLine (errors.ToString());

XSLT

XSLT stands for Extensible Stylesheet Language Transformations. It is an XML language that describes how to transform one XML language into another. The quintessential example of such a transformation is transforming an XML document (that typically describes data) into an XHTML document (that describes a formatted document).

Consider the following XML file:

<customer>
  <firstname>Jim</firstname>
  <lastname>Bo</lastname>
</customer>

The following XSLT file describes such a transformation:

<?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
  <xsl:template match="/">
    <html>
      <p><xsl:value-of select="//firstname"/></p>
      <p><xsl:value-of select="//lastname"/></p>
    </html>
  </xsl:template>
</xsl:stylesheet>

The output is as follows:

<html>
  <p>Jim</p>
  <p>Bo</p>
</html>

The System.Xml.Xsl.XslCompiledTransform transform class efficiently performs XSLT transforms. It renders XmlTransform obsolete. XslCompiledTransform works very simply:

XslCompiledTransform transform = new XslCompiledTransform();
transform.Load ("test.xslt");
transform.Transform ("input.xml", "output.xml");

Generally, it’s more useful to use the overload of Transform that accepts an XmlWriter rather than an output file, so you can control the formatting.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset