Using the XML module to parse XML data

In this next section, I will walk through some of the steps for using Python to parse and process XML data in a basic project to convert a dataset from XML to JSON.

In Python, XML is represented as a tree-like structure and parsed using the xml.etree.ElementTree module. Navigating this tree is somewhat more involved than navigating JSON data, because the structure of XML does not map as neatly onto Python data structures.

The first step to processing XML data is to read the XML data into Python's tree-like XML representation with the xml.etree.ElementTree module, using the following steps:

  1. Import the xml.etree.ElementTree module.
  2. Open the file containing the XML data.
  3. Use the ElementTree.parse() function to create an ElementTree object.
  4. Use the getroot() method of the ElementTree object to return an element object representing the root of the element tree.

The result is a representation of the XML data in Python that you can navigate starting from the root element, the element at the base of the XML tree. In the following demonstration, I've created a Python script called xml_to_json.py, which opens the XML dataset for this chapter, parses the data, and navigates to the root element:

import json
from xml.etree import ElementTree

fin = open("../data/input_data/wikipedia.xml","r")
tree = ElementTree.parse(fin)
root = tree.getroot()

fin.close()

The root variable at the end of the previous example is an element object. Specifically, it is the element object representing the element at the base of the tree. 

An element object is Python's representation of an XML element, similar to the way dictionaries and lists are Python's representation of JSON structures. With element objects, it is possible to navigate the XML tree by accessing a particular element's attributes, internal text, or child elements.
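As a minimal sketch of those three kinds of access, the following parses a small XML string and reads the attributes, text, and child elements of the result. The sample document and its tag names are invented for illustration and are not part of this chapter's dataset.

```python
from xml.etree import ElementTree

# Hypothetical sample document, invented for illustration only.
SAMPLE_XML = """
<library name="city">
    <book id="1">Dune</book>
    <book id="2">Emma</book>
</library>
"""

# fromstring() parses a string and returns the root element directly.
root = ElementTree.fromstring(SAMPLE_XML.strip())

print(root.tag)      # tag name of the root element
print(root.attrib)   # attributes of the root element, as a dictionary

# Iterating over an element yields its child elements in document order.
for child in root:
    print(child.tag, child.attrib["id"], child.text)
```

Here ElementTree.fromstring() stands in for the open-and-parse steps used in xml_to_json.py, so the sketch can run without a data file.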

Once you have obtained an element object that represents the root element, the next step is to navigate from the document root to the part of the XML that contains the data you are looking for. As with JSON data, XML data sources often have good documentation to describe the structure and contents of a dataset. However, if documentation is not available for a particular dataset, it may be helpful to conduct a quick exploration of the dataset in order to find how the data is structured.

The element.getchildren() function of the element object can be used to retrieve a list of element objects that are the child elements of a particular element. Note that getchildren() was deprecated and then removed in Python 3.9; list(element) returns the same list and works on every version. In the next few steps, I will repeatedly retrieve child elements in this way to search through the XML and display the contents of the XML data at various levels. This exploration will be similar to the exploration used in the previous chapter to navigate JSON data. The goal is to find the core content of the dataset within all of the metadata.

In the following continuation of xml_to_json.py, the child elements of the root element are retrieved as a list and printed to the output. The script calls list(root) rather than root.getchildren(), because getchildren() was removed in Python 3.9; the two return the same list:

....
root = tree.getroot()

## search through the element tree
## for a long list of elements
data = list(root)  # same result as root.getchildren()
print(data)

fin.close()

Running xml_to_json.py at this stage should print the list of element objects that are the children of the root element.

In the printout, it is possible to see the tag name of each element in the list of child elements. You can descend into one of the child elements by selecting its index from the list of child elements. The result is an element object representing the selected child element. The following demonstrates how you might retrieve the child elements of the second child element of the root element (list(element) is used in place of element.getchildren(), which was removed in Python 3.9):

data = list(root[1])  # root[1] is the second child of the root

This can be a rather primitive but effective way of navigating the contents of an XML file if the file is too big to open in a text editor.
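The same manual exploration can be sketched as a small helper that prints an outline of the tree down to a fixed depth. The outline() function, its parameters, and the sample XML below are hypothetical names invented for this illustration; they are not part of xml_to_json.py or of the ElementTree module.

```python
from xml.etree import ElementTree


def outline(element, max_depth=2, max_children=5, depth=0):
    """Return indented lines giving each tag and its number of children."""
    children = list(element)
    lines = ["  " * depth + f"{element.tag} ({len(children)} children)"]
    if depth < max_depth:
        # Only descend into the first few children so that a long list
        # of data entries does not flood the output.
        for child in children[:max_children]:
            lines.extend(outline(child, max_depth, max_children, depth + 1))
    return lines


# Hypothetical sample document, invented for illustration only.
SAMPLE_XML = (
    "<api><meta/><query><pages>"
    "<page id='1'/><page id='2'/>"
    "</pages></query></api>"
)
root = ElementTree.fromstring(SAMPLE_XML)
print("\n".join(outline(root)))
```

An element that reports an unusually large number of children is a good candidate for the list of data entries you are looking for.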

After a few iterations of searching through the children, the following continuation of xml_to_json.py produces a relatively long list of elements that likely contain the relevant data:

....

## search through the element tree
## for a long list of elements

## list(element) replaces element.getchildren(), removed in Python 3.9
data = list(root[1][1])
print(data)

fin.close()

At this stage, running xml_to_json.py should print out a long list of element objects.

There are two more features of element objects which can be useful in navigating and parsing XML data. The first is the element.tag value of an element object, which is a string containing the tag of the element. The second is the element.attrib value of an element object, which is a dictionary in which the keys correspond to the attribute names and the values correspond to the attribute values.
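As a rough sketch of how these two values help when exploring, the following tallies the tag names in a list of elements and collects every attribute name that appears. This gives a quick check of whether the entries share one tag and one set of attributes. The sample XML is invented for illustration.

```python
from collections import Counter
from xml.etree import ElementTree

# Hypothetical sample document, invented for illustration only.
SAMPLE_XML = """
<pages>
    <page title="A" views="10"/>
    <page title="B" views="20"/>
    <note text="ignored"/>
</pages>
"""
entries = list(ElementTree.fromstring(SAMPLE_XML.strip()))

# Count how often each tag name occurs among the entries.
tag_counts = Counter(entry.tag for entry in entries)

# Collect every attribute name used by any entry.
attribute_names = {name for entry in entries for name in entry.attrib}

print(tag_counts)
print(sorted(attribute_names))
```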

At this stage of xml_to_json.py, the data variable contains a list of element objects corresponding to a list of data entries in the XML data. In the following continuation of xml_to_json.py, the tag name and attributes of the first element are printed to the output.

....
## list(element) replaces element.getchildren(), removed in Python 3.9
data = list(root[1][1])
# print(data)
print("item_tag:")
print(data[0].tag)
print("item_attributes:")
print(data[0].attrib)

fin.close()

Running xml_to_json.py at this stage shows that the .attrib value of each element is a dictionary in which the keys correspond to the data variables and the dictionary values correspond to the data values. This means converting to JSON will just involve retrieving the .attrib value of each element and placing the result into an array. The resulting array can be written to an output JSON file using the json module.

In the following continuation of xml_to_json.py, an array called json_data is created. A for loop iterates over each of the element objects and retrieves the .attrib value of each, placing the resulting dictionaries in the json_data array. The json_data array is then written to a JSON file called wikipedia.json.

....
# print(data[0].attrib)

## iterate over the xml data converting
## each entry to json format
json_data=[]
for entry in data:
    json_data.append(entry.attrib)

## output the new data to a json file
fout = open("../data/output_data/wikipedia.json","w")
json.dump(json_data,fout,indent=4)

fin.close()
fout.close()

Running xml_to_json.py should now produce a JSON version of the data in the output_data folder!
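To try the conversion step without the chapter's data files, here is a self-contained sketch that applies the same idea to an in-memory XML string and produces JSON text with json.dumps(). The sample XML is invented for illustration and is much simpler than the real dataset.

```python
import json
from xml.etree import ElementTree

# Hypothetical sample document, invented for illustration only.
SAMPLE_XML = (
    "<pages>"
    "<page title='A' views='10'/>"
    "<page title='B' views='20'/>"
    "</pages>"
)
root = ElementTree.fromstring(SAMPLE_XML)

# Copy each entry's attribute dictionary into a list of plain dicts.
json_data = [dict(entry.attrib) for entry in root]

# Serialize to a JSON string instead of writing to a file.
json_text = json.dumps(json_data, indent=4)
print(json_text)
```

Note that XML attribute values are always strings, so a value such as views stays a string in the JSON output unless you convert it yourself.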

This has been a basic introduction to working with XML data. To read more about the xml.etree.ElementTree module and working with XML in Python, I have made a link available in the Links and Further Reading document in the external resources.
