The System.Xml namespace comprises the following namespaces and core classes:
System.Xml.*
System.Xml.Linq
System.Xml.Schema
System.Xml.Xsl (including XslCompiledTransform) for performing (W3C) XSLT transformations of XML
System.Xml.Serialization
W3C is an abbreviation for World Wide Web Consortium, where the XML standards are defined.
XmlConvert
, the static class for parsing and formatting XML strings, is covered in Chapter 6.
XmlReader
is a high-performance class for reading an XML stream in a low-level, forward-only manner.
Consider the following XML file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?> <customer id="123" status="archived"> <firstname>Jim</firstname> <lastname>Bo</lastname> </customer>
To instantiate an XmlReader
, you call the static XmlReader.Create
method, passing in a Stream
, a TextReader
, or a URI string. For example:
using (XmlReader reader = XmlReader.Create ("customer.xml")) ...
Because XmlReader
lets you read from potentially slow sources (Stream
s and URIs), it offers asynchronous versions of most of its methods so that you can easily write nonblocking code. We’ll cover asynchrony in detail in Chapter 14.
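As a minimal sketch of a nonblocking read loop: the XML string below is a hypothetical stand-in for a slow source, and the program assumes a C# version that supports top-level statements. Note that the asynchronous methods work only when Async is set to true on the XmlReaderSettings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical sample document, inlined so that the sketch runs without a file.
string xml = "<customer id=\"123\"><firstname>Jim</firstname><lastname>Bo</lastname></customer>";

// The async methods throw unless Async is enabled on the settings object.
var settings = new XmlReaderSettings { Async = true, IgnoreWhitespace = true };

var nodeTypes = new List<XmlNodeType>();
using (XmlReader reader = XmlReader.Create (new StringReader (xml), settings))
  while (await reader.ReadAsync())          // Asynchronous counterpart of Read
    nodeTypes.Add (reader.NodeType);

Console.WriteLine (string.Join (" ", nodeTypes));
```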
To construct an XmlReader
that reads from a string:
XmlReader reader = XmlReader.Create ( new System.IO.StringReader (myString));
You can also pass in an XmlReaderSettings
object to control parsing and validation options. The following three properties on XmlReaderSettings
are particularly useful for skipping over superfluous content:
bool IgnoreComments                  // Skip over comment nodes?
bool IgnoreProcessingInstructions    // Skip over processing instructions?
bool IgnoreWhitespace                // Skip over whitespace?
In the following example, we instruct the reader not to emit whitespace nodes, which are a distraction in typical scenarios:
XmlReaderSettings settings = new XmlReaderSettings(); settings.IgnoreWhitespace = true; using (XmlReader reader = XmlReader.Create ("customer.xml", settings)) ...
Another useful property on XmlReaderSettings
is ConformanceLevel
. Its default value of Document
instructs the reader to assume a valid XML document with a single root node. This is a problem if you want to read just an inner portion of XML containing multiple nodes:
<firstname>Jim</firstname> <lastname>Bo</lastname>
To read this without throwing an exception, you must set ConformanceLevel
to Fragment
.
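The following sketch shows one way this could look, reading the two-element fragment from a string and collecting the element names. The variable names are illustrative, and the program assumes top-level statements are available.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// A fragment with two root-level elements (not a well-formed document).
string fragment = "<firstname>Jim</firstname> <lastname>Bo</lastname>";

var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };

var names = new List<string>();
using (XmlReader reader = XmlReader.Create (new StringReader (fragment), settings))
  while (reader.Read())
    if (reader.NodeType == XmlNodeType.Element)
      names.Add (reader.Name);

Console.WriteLine (string.Join (", ", names));
```

With the default ConformanceLevel of Document, the same loop would throw an XmlException on reaching the second root-level element.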
XmlReaderSettings
also has a property called CloseInput
, which indicates whether to close the underlying stream when the reader is closed (there’s an analogous property on XmlWriterSettings
called CloseOutput
). The default value for CloseInput
and CloseOutput
is false
.
The units of an XML stream are XML nodes. The reader traverses the stream in textual (depth-first) order. The Depth
property of the reader returns the current depth of the cursor.
The most primitive way to read from an XmlReader
is to call Read
. It advances to the next node in the XML stream, rather like MoveNext
in IEnumerator
. The first call to Read
positions the cursor at the first node. When Read
returns false
, it means the cursor has advanced past the last node, at which point the XmlReader
should be closed and abandoned.
In this example, we read every node in the XML stream, outputting each node type as we go:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;

using (XmlReader reader = XmlReader.Create ("customer.xml", settings))
  while (reader.Read())
  {
    Console.Write (new string (' ', reader.Depth * 2));   // Write indentation
    Console.WriteLine (reader.NodeType);
  }
The output is as follows:
XmlDeclaration
Element
  Element
    Text
  EndElement
  Element
    Text
  EndElement
EndElement
Attributes are not included in Read
-based traversal (see the section “Reading Attributes” later in this chapter).
NodeType
is of type XmlNodeType
, which is an enum with these members:
None XmlDeclaration Element EndElement Text Attribute |
Comment Entity EndEntity EntityReference ProcessingInstruction CDATA |
Document DocumentType DocumentFragment Notation Whitespace SignificantWhitespace |
Two string
properties on XmlReader
provide access to a node’s content: Name
and Value
. Depending on the node type, either Name
or Value
(or both) is populated:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;
settings.DtdProcessing = DtdProcessing.Parse;    // Required to read DTDs

using (XmlReader r = XmlReader.Create ("customer.xml", settings))
  while (r.Read())
  {
    Console.Write (r.NodeType.ToString().PadRight (17, '-'));
    Console.Write ("> ".PadRight (r.Depth * 3));

    switch (r.NodeType)
    {
      case XmlNodeType.Element:
      case XmlNodeType.EndElement:
        Console.WriteLine (r.Name); break;

      case XmlNodeType.Text:
      case XmlNodeType.CDATA:
      case XmlNodeType.Comment:
      case XmlNodeType.XmlDeclaration:
        Console.WriteLine (r.Value); break;

      case XmlNodeType.DocumentType:
        Console.WriteLine (r.Name + " - " + r.Value); break;

      default: break;
    }
  }
To demonstrate this, we’ll expand our XML file to include a document type, entity, CDATA, and comment:
<?xml version="1.0" encoding="utf-8" ?> <!DOCTYPE customer [ <!ENTITY tc "Top Customer"> ]> <customer id="123" status="archived"> <firstname>Jim</firstname> <lastname>Bo</lastname> <quote><![CDATA[C#'s operators include: < > &]]></quote> <notes>Jim Bo is a &tc;</notes> <!-- That wasn't so bad! --> </customer>
An entity is like a macro; a CDATA is like a verbatim string (@"..."
) in C#. Here’s the result:
XmlDeclaration---> version="1.0" encoding="utf-8"
DocumentType-----> customer - <!ENTITY tc "Top Customer">
Element----------> customer
Element---------->  firstname
Text------------->     Jim
EndElement------->  firstname
Element---------->  lastname
Text------------->     Bo
EndElement------->  lastname
Element---------->  quote
CDATA------------>     C#'s operators include: < > &
EndElement------->  quote
Element---------->  notes
Text------------->     Jim Bo is a Top Customer
EndElement------->  notes
Comment---------->   That wasn't so bad!
EndElement-------> customer
XmlReader
automatically resolves entities, so in our example, the entity reference &tc;
expands into Top Customer
.
Often, you already know the structure of the XML document that you’re reading. To help with this, XmlReader
provides a range of methods that read while presuming a particular structure. This simplifies your code and performs some validation at the same time.
XmlReader
throws an XmlException
if any validation fails. XmlException
has LineNumber
and LinePosition
properties indicating where the error occurred—logging this information is essential if the XML file is large!
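As a rough illustration of logging this information, the sketch below feeds the reader a hypothetical document whose second line holds an unexpected element, and then reports where the failure occurred. The input string and variable names are assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical input: line 2 holds <surname> where <firstname> is expected.
string xml = "<customer>\n  <surname>Bo</surname>\n</customer>";

int errorLine = 0, errorPosition = 0;
try
{
  using (XmlReader r = XmlReader.Create (new StringReader (xml)))
  {
    r.ReadStartElement ("customer");
    r.ReadStartElement ("firstname");    // Throws: the next element is <surname>
  }
}
catch (XmlException ex)
{
  errorLine = ex.LineNumber;
  errorPosition = ex.LinePosition;
  Console.WriteLine ($"XML error at line {errorLine}, position {errorPosition}: {ex.Message}");
}
```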
ReadStartElement
verifies that the current NodeType
is Element
, and then calls Read
. If you specify a name, it verifies that it matches that of the current element.
ReadEndElement
verifies that the current NodeType
is EndElement
, and then calls Read
.
For instance, we could read this:
<firstname>Jim</firstname>
as follows:
reader.ReadStartElement ("firstname"); Console.WriteLine (reader.Value); reader.Read(); reader.ReadEndElement();
The ReadElementContentAsString
method does all of this in one hit. It reads a start element, a text node, and an end element, returning the content as a string:
string firstName = reader.ReadElementContentAsString ("firstname", "");
The second argument refers to the namespace, which is blank in this example. There are also typed versions of this method, such as ReadElementContentAsInt
, which parse the result. Returning to our original XML document:
<?xml version="1.0" encoding="utf-8" standalone="yes"?> <customer id="123" status="archived"> <firstname>Jim</firstname> <lastname>Bo</lastname> <creditlimit>500.00</creditlimit> <!-- OK, we sneaked this in! --> </customer>
We could read it in as follows:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;

using (XmlReader r = XmlReader.Create ("customer.xml", settings))
{
  r.MoveToContent();                // Skip over the XML declaration
  r.ReadStartElement ("customer");
  string firstName    = r.ReadElementContentAsString ("firstname", "");
  string lastName     = r.ReadElementContentAsString ("lastname", "");
  decimal creditLimit = r.ReadElementContentAsDecimal ("creditlimit", "");

  r.MoveToContent();      // Skip over that pesky comment
  r.ReadEndElement();     // Read the closing customer tag
}
The MoveToContent
method is really useful. It skips over all the fluff: XML declarations, whitespace, comments, and processing instructions. You can also instruct the reader to do most of this automatically through the properties on XmlReaderSettings
.
In the previous example, suppose that <lastname>
was optional. The solution to this is straightforward:
r.ReadStartElement ("customer");
string firstName    = r.ReadElementContentAsString ("firstname", "");
string lastName     = r.Name == "lastname"
                      ? r.ReadElementContentAsString() : null;
decimal creditLimit = r.ReadElementContentAsDecimal ("creditlimit", "");
The examples in this section rely on elements appearing in the XML file in a set order. If you need to cope with elements appearing in any order, the easiest solution is to read that section of the XML into an X-DOM. We describe how to do this later in the section “Patterns for Using XmlReader/XmlWriter”.
The way that XmlReader
handles empty elements presents a horrible trap. Consider the following element:
<customerList></customerList>
In XML, this is equivalent to:
<customerList/>
And yet, XmlReader
treats the two differently. In the first case, the following code works as expected:
reader.ReadStartElement ("customerList"); reader.ReadEndElement();
In the second case, ReadEndElement
throws an exception, because there is no separate “end element” as far as XmlReader
is concerned. The workaround is to check for an empty element as follows:
bool isEmpty = reader.IsEmptyElement; reader.ReadStartElement ("customerList"); if (!isEmpty) reader.ReadEndElement();
In reality, this is a nuisance only when the element in question may contain child elements (such as a customer list). With elements that wrap simple text (such as firstname
), you can avoid the whole issue by calling a method such as ReadElementContentAsString
. The ReadElementXXX
methods handle both kinds of empty elements correctly.
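To make the workaround concrete, here is a small sketch that counts the child elements of a customerList element and copes with both empty forms. CountChildren is a hypothetical helper written for this illustration, not part of XmlReader.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper: counts the child elements of <customerList>.
int CountChildren (string xml)
{
  var settings = new XmlReaderSettings { IgnoreWhitespace = true };
  using XmlReader reader = XmlReader.Create (new StringReader (xml), settings);

  reader.MoveToContent();
  bool isEmpty = reader.IsEmptyElement;     // Must be checked before reading past the start tag
  reader.ReadStartElement ("customerList");
  if (isEmpty) return 0;                    // <customerList/> has no separate end element

  int count = 0;
  while (reader.NodeType == XmlNodeType.Element)
  {
    reader.Skip();                          // Skip one child element and all of its content
    count++;
  }
  reader.ReadEndElement();
  return count;
}

Console.WriteLine (CountChildren ("<customerList></customerList>"));
Console.WriteLine (CountChildren ("<customerList/>"));
Console.WriteLine (CountChildren ("<customerList><customer/><customer>x</customer></customerList>"));
```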
Table 11-1 summarizes all ReadXXX
methods in XmlReader
. Most of these are designed to work with elements. The sample XML fragment shown in bold is the section read by the method described.
Members | Works on NodeType | Sample XML fragment | Input parameters | Data returned |
---|---|---|---|---|
ReadContentAsXXX | Text | <a>x</a> | | x |
ReadString | Text | <a>x</a> | | x |
ReadElementString | Element | <a>x</a> | | x |
ReadElementContentAsXXX | Element | <a>x</a> | | x |
ReadInnerXml | Element | <a>x</a> | | x |
ReadOuterXml | Element | <a>x</a> | | <a>x</a> |
ReadStartElement | Element | <a>x</a> | | |
ReadEndElement | EndElement | <a>x</a> | | |
ReadSubtree | Element | <a>x</a> | | <a>x</a> |
ReadToDescendant | Element | <a>x<b></b></a> | "b" | |
ReadToFollowing | Element | <a>x<b></b></a> | "b" | |
ReadToNextSibling | Element | <a>x</a><b></b> | "b" | |
ReadAttributeValue | Attribute | See “Reading Attributes” | | |
The ReadContentAsXXX
methods parse a text node into type XXX
. Internally, the XmlConvert
class performs the string-to-type conversion. The text node can be within an element or an attribute.
The ReadElementContentAsXXX
methods are wrappers around corresponding ReadContentAsXXX
methods. They apply to the element node, rather than the text node enclosed by the element.
The typed ReadXXX
methods also include versions that read base 64 and BinHex formatted data into a byte array.
ReadInnerXml
is typically applied to an element, and it reads and returns the content of that element (all its descendants, as markup), excluding the element itself. When applied to an attribute, it returns the value of the attribute.
ReadOuterXml
is the same as ReadInnerXml
, except it includes rather than excludes the element at the cursor position.
ReadSubtree
returns a proxy reader that provides a view over just the current element (and its descendants). The proxy reader must be closed before the original reader can be safely read again. At the point the proxy reader is closed, the cursor position of the original reader moves to the end of the subtree.
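A minimal sketch of ReadSubtree in use, assuming a hypothetical inline log document: each logentry is handed to a proxy reader, which is disposed before the outer reader moves on.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical sample data for the illustration.
string xml = "<log>" +
             "<logentry id=\"1\"><source>a</source></logentry>" +
             "<logentry id=\"2\"><source>b</source></logentry>" +
             "</log>";

var sources = new List<string>();
using (XmlReader reader = XmlReader.Create (new StringReader (xml)))
  while (reader.ReadToFollowing ("logentry"))
  {
    // The proxy reader sees only the current <logentry> and its descendants.
    using (XmlReader entry = reader.ReadSubtree())
      while (entry.Read())
        if (entry.NodeType == XmlNodeType.Element && entry.Name == "source")
          sources.Add (entry.ReadElementContentAsString());

    // Disposing the proxy closes it, leaving the outer reader at the end of
    // this subtree, ready for the next ReadToFollowing call.
  }

Console.WriteLine (string.Join (", ", sources));
```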
ReadToDescendant
moves the cursor to the start of the first descendant node with the specified name/namespace.
ReadToFollowing
moves the cursor to the start of the first node—regardless of depth—with the specified name/namespace.
ReadToNextSibling
moves the cursor to the start of the first sibling node with the specified name/namespace.
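The navigation methods can be sketched as follows. The document is a hypothetical one built for this illustration; both methods return true when a matching element is found.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical sample: two <b> elements separated by an unrelated sibling.
string xml = "<a><x>1</x><b>first</b><c/><b>second</b></a>";

using XmlReader reader = XmlReader.Create (new StringReader (xml));
reader.MoveToContent();                                   // Positioned on <a>

bool foundDescendant = reader.ReadToDescendant ("b");     // Moves to the first <b> inside <a>
string first = reader.ReadElementContentAsString();       // Reads it; the cursor is now on <c/>

bool foundSibling = reader.ReadToNextSibling ("b");       // Moves to the next <b> at the same depth
string second = reader.ReadElementContentAsString();

Console.WriteLine (first + ", " + second);
```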
ReadString
and ReadElementString
behave like ReadContentAsString
and ReadElementContentAsString
, except that they throw an exception if there’s more than a single text node within the element. In general, these methods should be avoided because they throw an exception if an element contains a comment.
XmlReader
provides an indexer giving you direct (random) access to an element’s attributes—by name or position. Using the indexer is equivalent to calling GetAttribute
.
Given the following XML fragment:
<customer id="123" status="archived"/>
we could read its attributes as follows:
Console.WriteLine (reader ["id"]);              // 123
Console.WriteLine (reader ["status"]);          // archived
Console.WriteLine (reader ["bogus"] == null);   // True
The XmlReader
must be positioned on a start element in order to read attributes. After calling ReadStartElement
, the attributes are gone forever!
Although attribute order is semantically irrelevant, you can access attributes by their ordinal position. We could rewrite the preceding example as follows:
Console.WriteLine (reader [0]);            // 123
Console.WriteLine (reader [1]);            // archived
The indexer also lets you specify the attribute’s namespace—if it has one.
AttributeCount
returns the number of attributes for the current node.
To explicitly traverse attribute nodes, you must make a special diversion from the normal path of just calling Read
. A good reason to do so is if you want to parse attribute values into other types, via the ReadContentAsXXX
methods.
The diversion must begin from a start element. To make the job easier, the forward-only rule is relaxed during attribute traversal: you can jump to any attribute (forward or backward) by calling MoveToAttribute
.
MoveToElement
returns you to the start
element from anyplace within the attribute node diversion.
Returning to our previous example:
<customer id="123" status="archived"/>
we can do this:
reader.MoveToAttribute ("status"); string status = reader.ReadContentAsString(); reader.MoveToAttribute ("id"); int id = reader.ReadContentAsInt();
MoveToAttribute
returns false
if the specified attribute doesn’t exist.
You can also traverse each attribute in sequence by calling the MoveToFirstAttribute
and then the MoveToNextAttribute
methods:
if (reader.MoveToFirstAttribute())
  do
  {
    Console.WriteLine (reader.Name + "=" + reader.Value);
  }
  while (reader.MoveToNextAttribute());

// OUTPUT:
id=123
status=archived
XmlReader
provides two parallel systems for referring to element and attribute names:
Name
NamespaceURI and LocalName
Whenever you read an element’s Name
property or call a method that accepts a single name
argument, you’re using the first system. This works well if no namespaces or prefixes are present; otherwise, it acts in a crude and literal manner. Namespaces are ignored, and prefixes are included exactly as they were written. For example:
Sample fragment | Name |
---|---|
<customer ...> | customer |
<customer xmlns='blah' ...> | customer |
<x:customer ...> | x:customer |
The following code works with the first two cases:
reader.ReadStartElement ("customer");
The following is required to handle the third case:
reader.ReadStartElement ("x:customer");
The second system works through two namespace-aware properties: NamespaceURI
and LocalName
. These properties take into account prefixes and default namespaces defined by parent elements. Prefixes are automatically expanded. This means that NamespaceURI
always reflects the semantically correct namespace for the current element, and LocalName
is always free of prefixes.
When you pass two name arguments into a method such as ReadStartElement
, you’re using this same system. For example, consider the following XML:
<customer xmlns="DefaultNamespace" xmlns:other="OtherNamespace"> <address> <other:city> ...
We could read this as follows:
reader.ReadStartElement ("customer", "DefaultNamespace"); reader.ReadStartElement ("address", "DefaultNamespace"); reader.ReadStartElement ("city", "OtherNamespace");
Abstracting away prefixes is usually exactly what you want. If necessary, you can see what prefix was used through the Prefix
property and convert it into a namespace by calling LookupNamespace
.
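The two naming systems can be compared side by side in a short sketch. It uses the sample XML above, completed with a hypothetical city value so that the document is well formed.

```csharp
using System;
using System.IO;
using System.Xml;

// The sample from the text, closed off with a made-up city value.
string xml = "<customer xmlns=\"DefaultNamespace\" xmlns:other=\"OtherNamespace\">" +
             "<address><other:city>Oslo</other:city></address></customer>";

using XmlReader reader = XmlReader.Create (new StringReader (xml));
reader.ReadStartElement ("customer", "DefaultNamespace");
reader.ReadStartElement ("address", "DefaultNamespace");

// The cursor is now on <other:city>.
string name      = reader.Name;                              // Literal name, prefix included
string localName = reader.LocalName;                         // Name without the prefix
string prefix    = reader.Prefix;                            // Prefix exactly as written
string ns        = reader.NamespaceURI;                      // Namespace the prefix expands to
string resolved  = reader.LookupNamespace (reader.Prefix);   // Same namespace, looked up by prefix

Console.WriteLine ($"{name} | {localName} | {prefix} | {ns} | {resolved}");
```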
XmlWriter
is a forward-only writer of an XML stream. The design of XmlWriter
is symmetrical to XmlReader
.
As with XmlReader
, you construct an XmlWriter
by calling Create
with an optional settings
object. In the following example, we enable indenting to make the output more human-readable, and then write a simple XML file:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;

using (XmlWriter writer = XmlWriter.Create (@"..\..\foo.xml", settings))
{
  writer.WriteStartElement ("customer");
  writer.WriteElementString ("firstname", "Jim");
  writer.WriteElementString ("lastname", "Bo");
  writer.WriteEndElement();
}
This produces the following document (the same as the file we read in the first example of XmlReader, minus the attributes):
<?xml version="1.0" encoding="utf-8" ?> <customer> <firstname>Jim</firstname> <lastname>Bo</lastname> </customer>
XmlWriter
automatically writes the declaration at the top unless you indicate otherwise in XmlWriterSettings
, by setting OmitXmlDeclaration
to true
or ConformanceLevel
to Fragment
. The latter also permits writing multiple root nodes—something that otherwise throws an exception.
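As a brief sketch of the Fragment setting, the following writes two root-level elements to a StringBuilder (chosen here only so the result is easy to inspect):

```csharp
using System;
using System.Text;
using System.Xml;

var settings = new XmlWriterSettings { ConformanceLevel = ConformanceLevel.Fragment };

var sb = new StringBuilder();
using (XmlWriter writer = XmlWriter.Create (sb, settings))
{
  // Two root-level elements: allowed only because of ConformanceLevel.Fragment.
  writer.WriteElementString ("firstname", "Jim");
  writer.WriteElementString ("lastname", "Bo");
}

string output = sb.ToString();
Console.WriteLine (output);
```

With the default settings, the second WriteElementString call would throw, because a document can have only one root element.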
The WriteValue
method writes a single text node. It accepts both string and nonstring types such as bool
and DateTime
, internally calling XmlConvert
to perform XML-compliant string conversions:
writer.WriteStartElement ("birthdate"); writer.WriteValue (DateTime.Now); writer.WriteEndElement();
In contrast, if we call:
writer.WriteElementString ("birthdate", DateTime.Now.ToString());
the result would be both non-XML-compliant and vulnerable to incorrect parsing.
WriteString
is equivalent to calling WriteValue
with a string. XmlWriter
automatically escapes characters that would otherwise be illegal within an attribute or element, such as & < >
, and extended Unicode characters.
You can write attributes immediately after writing a start
element:
writer.WriteStartElement ("customer"); writer.WriteAttributeString ("id", "1"); writer.WriteAttributeString ("status", "archived");
To write nonstring values, call WriteStartAttribute
, WriteValue
, and then WriteEndAttribute
.
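That sequence might look like the following sketch, which writes an int and a bool as attribute values. The archived attribute is a hypothetical example, and the output goes to a StringBuilder for easy inspection.

```csharp
using System;
using System.Text;
using System.Xml;

var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

var sb = new StringBuilder();
using (XmlWriter writer = XmlWriter.Create (sb, settings))
{
  writer.WriteStartElement ("customer");

  writer.WriteStartAttribute ("id");
  writer.WriteValue (123);               // Typed write: converted to an XML-compliant string
  writer.WriteEndAttribute();

  writer.WriteStartAttribute ("archived");
  writer.WriteValue (true);              // Written as "true", not "True"
  writer.WriteEndAttribute();

  writer.WriteEndElement();
}

string output = sb.ToString();
Console.WriteLine (output);
```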
XmlWriter
also defines the following methods for writing other kinds of nodes:
WriteBase64       // for binary data
WriteBinHex       // for binary data
WriteCData
WriteComment
WriteDocType
WriteEntityRef
WriteProcessingInstruction
WriteRaw
WriteWhitespace
WriteRaw
directly injects a string into the output stream. There is also a WriteNode
method that accepts an XmlReader
, echoing everything from the given XmlReader
.
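A short sketch of WriteNode, copying a hypothetical inline document from a reader to a writer:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical source document.
string xml = "<customer id=\"123\"><firstname>Jim</firstname></customer>";

var sb = new StringBuilder();
var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

using (XmlReader reader = XmlReader.Create (new StringReader (xml)))
using (XmlWriter writer = XmlWriter.Create (sb, settings))
  writer.WriteNode (reader, true);    // true = also copy default attributes

string copy = sb.ToString();
Console.WriteLine (copy);
```

Because the reader has not yet been read from when WriteNode is called, the whole document is copied.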
The overloads for the Write*
methods allow you to associate an element or attribute with a namespace. Let’s rewrite the contents of the XML file in our previous example. This time we will associate all the elements with the http://oreilly.com namespace, declaring the prefix o
at the customer
element:
writer.WriteStartElement ("o", "customer", "http://oreilly.com"); writer.WriteElementString ("o", "firstname", "http://oreilly.com", "Jim"); writer.WriteElementString ("o", "lastname", "http://oreilly.com", "Bo"); writer.WriteEndElement();
The output is now as follows:
<?xml version="1.0" encoding="utf-8"?>
<o:customer xmlns:o="http://oreilly.com">
  <o:firstname>Jim</o:firstname>
  <o:lastname>Bo</o:lastname>
</o:customer>
Notice how for brevity XmlWriter
omits the child element’s namespace declarations when they are already declared by the parent element.
Consider the following classes:
public class Contacts { public IList<Customer> Customers = new List<Customer>(); public IList<Supplier> Suppliers = new List<Supplier>(); } public class Customer { public string FirstName, LastName; } public class Supplier { public string Name; }
Suppose you want to use XmlReader
and XmlWriter
to serialize a Contacts
object to XML as in the following:
<?xml version="1.0" encoding="utf-8" standalone="yes"?> <contacts> <customer id="1"> <firstname>Jay</firstname> <lastname>Dee</lastname> </customer> <customer> <!-- we'll assume id is optional --> <firstname>Kay</firstname> <lastname>Gee</lastname> </customer> <supplier> <name>X Technologies Ltd</name> </supplier> </contacts>
The best approach is not to write one big method, but to encapsulate XML functionality in the Customer
and Supplier
types themselves by writing ReadXml
and WriteXml
methods on these types. The pattern in doing so is straightforward:
ReadXml
and WriteXml
leave the reader/writer at the same depth when they exit.
ReadXml
reads the outer element, whereas WriteXml
writes only its inner content.
Here’s how we would write the Customer
type:
public class Customer { public const string XmlName = "customer"; public int? ID; public string FirstName, LastName; public Customer () { } public Customer (XmlReader r) { ReadXml (r); } public void ReadXml (XmlReader r) { if (r.MoveToAttribute ("id")) ID = r.ReadContentAsInt(); r.ReadStartElement(); FirstName = r.ReadElementContentAsString ("firstname", ""); LastName = r.ReadElementContentAsString ("lastname", ""); r.ReadEndElement(); } public void WriteXml (XmlWriter w) { if (ID.HasValue) w.WriteAttributeString ("id", "", ID.ToString()); w.WriteElementString ("firstname", FirstName); w.WriteElementString ("lastname", LastName); } }
Notice that ReadXml
reads the outer start and end element nodes. If its caller did this job instead, Customer
couldn’t read its own attributes. The reason for not making WriteXml
symmetrical in this regard is twofold:
The caller might need to choose how the outer element is named.
The caller might need to write extra XML attributes, such as the element’s subtype (which could then be used to decide which class to instantiate when reading back the element).
Another benefit of following this pattern is that it makes your implementation compatible with IXmlSerializable
(see Chapter 17).
The Supplier
class is analogous:
public class Supplier { public const string XmlName = "supplier"; public string Name; public Supplier () { } public Supplier (XmlReader r) { ReadXml (r); } public void ReadXml (XmlReader r) { r.ReadStartElement(); Name = r.ReadElementContentAsString ("name", ""); r.ReadEndElement(); } public void WriteXml (XmlWriter w) { w.WriteElementString ("name", Name); } }
With the Contacts
class, we must enumerate the contacts
element in ReadXml
, checking whether each subelement is a customer or a supplier. We also have to code around the empty element trap:
public void ReadXml (XmlReader r)
{
  bool isEmpty = r.IsEmptyElement;           // This ensures we don't get
  r.ReadStartElement();                      // snookered by an empty
  if (isEmpty) return;                       // <contacts/> element!
  while (r.NodeType == XmlNodeType.Element)
  {
    if (r.Name == Customer.XmlName)      Customers.Add (new Customer (r));
    else if (r.Name == Supplier.XmlName) Suppliers.Add (new Supplier (r));
    else
      throw new XmlException ("Unexpected node: " + r.Name);
  }
  r.ReadEndElement();
}

public void WriteXml (XmlWriter w)
{
  foreach (Customer c in Customers)
  {
    w.WriteStartElement (Customer.XmlName);
    c.WriteXml (w);
    w.WriteEndElement();
  }
  foreach (Supplier s in Suppliers)
  {
    w.WriteStartElement (Supplier.XmlName);
    s.WriteXml (w);
    w.WriteEndElement();
  }
}
You can fly in an X-DOM at any point in the XML tree where XmlReader
or XmlWriter
becomes too cumbersome. Using the X-DOM to handle inner elements is an excellent way to combine X-DOM’s ease of use with the low-memory footprint of XmlReader
and XmlWriter
.
To read the current element into an X-DOM, you call XNode.ReadFrom
, passing in the XmlReader
. Unlike XElement.Load
, this method is not “greedy” in that it doesn’t expect to see a whole document. Instead, it reads just to the end of the current subtree.
For instance, suppose we have an XML logfile structured as follows:
<log> <logentry id="1"> <date>...</date> <source>...</source> ... </logentry> ... </log>
If there were a million logentry
elements, reading the whole thing into an X-DOM would waste memory. A better solution is to traverse each logentry
with an XmlReader
, and then use XElement
to process the elements individually:
XmlReaderSettings settings = new XmlReaderSettings(); settings.IgnoreWhitespace = true; using (XmlReader r = XmlReader.Create ("logfile.xml", settings)) { r.ReadStartElement ("log"); while (r.Name == "logentry") { XElement logEntry = (XElement) XNode.ReadFrom (r); int id = (int) logEntry.Attribute ("id"); DateTime date = (DateTime) logEntry.Element ("date"); string source = (string) logEntry.Element ("source"); ... } r.ReadEndElement(); }
If you follow the pattern described in the previous section, you can slot an XElement
into a custom type’s ReadXml
or WriteXml
method without the caller ever knowing you’ve cheated! For instance, we could rewrite Customer
’s ReadXml
method as follows:
public void ReadXml (XmlReader r) { XElement x = (XElement) XNode.ReadFrom (r); FirstName = (string) x.Element ("firstname"); LastName = (string) x.Element ("lastname"); }
XElement
collaborates with XmlReader
to ensure that namespaces are kept intact and prefixes are properly expanded—even if defined at an outer level. So, if our XML file read like this:
<log xmlns="http://loggingspace"> <logentry id="1"> ...
the XElements
we constructed at the logentry
level would correctly inherit the outer namespace.
You can use an XElement
just to write inner elements to an XmlWriter
. The following code writes a million logentry
elements to an XML file using XElement
—without storing the whole thing in memory:
using (XmlWriter w = XmlWriter.Create ("log.xml")) { w.WriteStartElement ("log"); for (int i = 0; i < 1000000; i++) { XElement e = new XElement ("logentry", new XAttribute ("id", i), new XElement ("date", DateTime.Today.AddDays (-1)), new XElement ("source", "test")); e.WriteTo (w); } w.WriteEndElement (); }
Using an XElement
incurs minimal execution overhead. If we amend this example to use XmlWriter
throughout, there’s no measurable difference in execution time.
The content of a particular XML document is nearly always domain-specific, such as a Microsoft Word document, an application configuration document, or a web service. For each domain, the XML file conforms to a particular pattern. There are several standards for describing the schema of such a pattern, to standardize and automate the interpretation and validation of XML documents. The most widely accepted standard is XSD, short for XML Schema Definition. Its precursors, DTD and XDR, are also supported by System.Xml
.
Consider the following XML document:
<?xml version="1.0"?> <customers> <customer id="1" status="active"> <firstname>Jim</firstname> <lastname>Bo</lastname> </customer> <customer id="1" status="archived"> <firstname>Thomas</firstname> <lastname>Jefferson</lastname> </customer> </customers>
We can write an XSD for this document as follows:
<?xml version="1.0" encoding="utf-8"?> <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="customers"> <xs:complexType> <xs:sequence> <xs:element maxOccurs="unbounded" name="customer"> <xs:complexType> <xs:sequence> <xs:element name="firstname" type="xs:string" /> <xs:element name="lastname" type="xs:string" /> </xs:sequence> <xs:attribute name="id" type="xs:int" use="required" /> <xs:attribute name="status" type="xs:string" use="required" /> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType> </xs:element> </xs:schema>
As you can see, XSD documents are themselves written in XML. Furthermore, an XSD document is describable with XSD—you can find that definition at http://www.w3.org/2001/xmlschema.xsd.
You can validate an XML file or document against one or more schemas before reading or processing it. There are a number of reasons to do so:
You can get away with less error checking and exception handling.
Schema validation picks up errors you might otherwise overlook.
Error messages are detailed and informative.
To perform validation, plug a schema into an XmlReader
, an XmlDocument
, or an X-DOM object, and then read or load the XML as you would normally. Schema validation happens automatically as content is read, so the input stream is not read twice.
Here’s how to plug a schema from the file customers.xsd into an XmlReader
:
XmlReaderSettings settings = new XmlReaderSettings(); settings.ValidationType = ValidationType.Schema; settings.Schemas.Add (null, "customers.xsd"); using (XmlReader r = XmlReader.Create ("customers.xml", settings)) ...
If the schema is inline, set the following flag instead of adding to Schemas
:
settings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;
You then Read
as you would normally. If schema validation fails at any point, an XmlSchemaValidationException
is thrown.
Calling Read
on its own validates both elements and attributes: you don’t need to navigate to each individual attribute for it to be validated.
If you want only to validate the document, you can do this:
using (XmlReader r = XmlReader.Create ("customers.xml", settings)) try { while (r.Read()) ; } catch (XmlSchemaValidationException ex) { ... }
XmlSchemaValidationException
has properties for the error Message
, LineNumber
, and LinePosition
. In this case, it only tells you about the first error in the document. If you want to report on all errors in the document, you instead must handle the ValidationEventHandler
event:
XmlReaderSettings settings = new XmlReaderSettings(); settings.ValidationType = ValidationType.Schema; settings.Schemas.Add (null, "customers.xsd"); settings.ValidationEventHandler += ValidationHandler; using (XmlReader r = XmlReader.Create ("customers.xml", settings)) while (r.Read()) ;
When you handle this event, schema errors no longer throw exceptions. Instead, they fire your event handler:
static void ValidationHandler (object sender, ValidationEventArgs e) { Console.WriteLine ("Error: " + e.Exception.Message); }
The Exception
property of ValidationEventArgs
contains the XmlSchemaValidationException
that would have otherwise been thrown.
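Putting the pieces together, here is a self-contained sketch that collects every validation error rather than stopping at the first. The schema and document are hypothetical, inlined as strings so that no files are needed; the schema is added through the XmlSchemaSet.Add overload that accepts an XmlReader.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical schema: <customer> needs an int id attribute and a <firstname> child.
string xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='customer'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='firstname' type='xs:string' />
      </xs:sequence>
      <xs:attribute name='id' type='xs:int' use='required' />
    </xs:complexType>
  </xs:element>
</xs:schema>";

// Hypothetical document that breaks the schema (non-numeric id, wrong child element).
string xml = "<customer id='abc'><surname>Bo</surname></customer>";

var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
using (XmlReader schemaReader = XmlReader.Create (new StringReader (xsd)))
  settings.Schemas.Add (null, schemaReader);

// With a handler attached, schema errors are reported here instead of being thrown.
var errors = new List<string>();
settings.ValidationEventHandler += (sender, e) =>
  errors.Add ($"Line {e.Exception.LineNumber}: {e.Message}");

using (XmlReader r = XmlReader.Create (new StringReader (xml), settings))
  while (r.Read()) { }

foreach (string error in errors) Console.WriteLine (error);
```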
To validate an XML file or stream while reading into an X-DOM, you create an XmlReader
, plug in the schemas, and then use the reader to load the DOM:
XmlReaderSettings settings = new XmlReaderSettings(); settings.ValidationType = ValidationType.Schema; settings.Schemas.Add (null, "customers.xsd"); XDocument doc; using (XmlReader r = XmlReader.Create ("customers.xml", settings)) try { doc = XDocument.Load (r); } catch (XmlSchemaValidationException ex) { ... }
You can also validate an XDocument
or XElement
that’s already in memory by calling extension methods in System.Xml.Schema
. These methods accept an XmlSchemaSet
(a collection of schemas) and a validation event handler:
XDocument doc = XDocument.Load (@"customers.xml"); XmlSchemaSet set = new XmlSchemaSet (); set.Add (null, @"customers.xsd"); StringBuilder errors = new StringBuilder (); doc.Validate (set, (sender, args) => { errors.AppendLine (args.Exception.Message); } ); Console.WriteLine (errors.ToString());
XSLT stands for Extensible Stylesheet Language Transformations. It is an XML language that describes how to transform one XML language into another. The quintessential example of such a transformation is transforming an XML document (that typically describes data) into an XHTML document (that describes a formatted document).
Consider the following XML file:
<customer> <firstname>Jim</firstname> <lastname>Bo</lastname> </customer>
The following XSLT file describes such a transformation:
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> <xsl:template match="/"> <html> <p><xsl:value-of select="//firstname"/></p> <p><xsl:value-of select="//lastname"/></p> </html> </xsl:template> </xsl:stylesheet>
The output is as follows:
<html> <p>Jim</p> <p>Bo</p> </html>
The System.Xml.Xsl.XslCompiledTransform
class efficiently performs XSLT transforms. It renders XslTransform
obsolete. XslCompiledTransform
works very simply:
XslCompiledTransform transform = new XslCompiledTransform(); transform.Load ("test.xslt"); transform.Transform ("input.xml", "output.xml");
Generally, it’s more useful to use the overload of Transform
that accepts an XmlWriter
rather than an output file, so you can control the formatting.
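As an illustration of that overload, the sketch below runs the stylesheet from this section against the customer document and sends the result to an XmlWriter configured for indented output. Both inputs are inlined as strings, and the stylesheet's XML declaration is dropped for brevity; these are choices made for the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

string xslt = @"<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
  <xsl:template match='/'>
    <html>
      <p><xsl:value-of select='//firstname'/></p>
      <p><xsl:value-of select='//lastname'/></p>
    </html>
  </xsl:template>
</xsl:stylesheet>";

string xml = "<customer><firstname>Jim</firstname><lastname>Bo</lastname></customer>";

var transform = new XslCompiledTransform();
using (XmlReader stylesheet = XmlReader.Create (new StringReader (xslt)))
  transform.Load (stylesheet);

// The writer settings, not the transform, decide how the output is formatted.
var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };

var sb = new StringBuilder();
using (XmlReader input = XmlReader.Create (new StringReader (xml)))
using (XmlWriter output = XmlWriter.Create (sb, settings))
  transform.Transform (input, output);

string result = sb.ToString();
Console.WriteLine (result);
```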