Getting started with Libxml2 and Python

Getting to grips with Libxml2 and Python can be a frustrating experience, particularly as in-depth, accurate Python documentation is hard to find on the Web.

Many Python developers seem to dislike the Libxml2 bindings, as they are 'un-Pythonic' and much too C-like. This, however, misses the point of Libxml2: the library is portable, mature, extremely full-featured and *very* fast.

In the process of writing this tutorial, I hung out in the #xml channel on irc.gnome.org, and subscribed to the xml@gnome.org mailing list - I was given a lot of help when things weren't obvious! Although there's not a massive amount of activity on IRC, or in the mailing list on a daily basis, I would definitely recommend spending some time browsing the archive - or using Google to search it when you have questions. Additionally, I have found the people in the Libxml2 community very helpful.

Manipulating XML using Libxml2 is fairly straightforward once you have a couple of working examples. That tends to be the problem in Python, where finding working examples is a bit of a hit-and-miss affair.

The first place to look is in the examples folder in the documentation installed with your release (/usr/share/doc/libxml2-python-2.6.27/examples on my machine).

TODO: where are the examples on a number of distributions/platforms?

Also, take a moment to scan through libxml2.py itself - this is the Python wrapper and is a good place to look if you are hunting for a particular function. There is plenty of information in the wrapper, as all the docstrings have been populated, so you can always get information with something like

	print libxml2.parseFile.__doc__

for any particular function.

Also remember that you can list the available methods for any Python object by using the dir function. The most immediately useful objects are xmlCore, xmlNode and xmlDoc, so

	dir(libxml2.xmlCore)

is your friend when working out what functions are available to you.
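
Scanning the output of dir by eye gets tedious on a class with a couple of hundred methods. The following is a minimal sketch of a filter that keeps only the names containing a given fragment. The name find_names is my own invention and not part of Libxml2, and the demonstration uses the built-in str type so that it runs whether or not the bindings are installed.

```python
def find_names(obj, fragment):
    """Return the public attribute names of obj that contain fragment."""
    fragment = fragment.lower()
    return [name for name in dir(obj)
            if not name.startswith('_') and fragment in name.lower()]


# Demonstrated on a built-in type. With the bindings installed, the same
# call works on the wrapper, e.g. find_names(libxml2.xmlCore, 'xpath').
strip_methods = find_names(str, 'strip')
```

Combining this with the __doc__ lookup shown earlier gives a quick way of narrowing down which function you are after.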

I'm going to assume that you know a bit about XML, at least enough to recognise an XML document when you see one, and hopefully enough about Python to know where to find the documentation!

Installing Libxml2

TODO: installation examples for a number of distros/platforms.

Loading a document

The first thing you want to do in XML will be to load a document of some sort. For a new Libxml2 user, this is where the confusion starts! It is worth remembering that, in general, the Python bindings are automatically generated: there is an equivalent Python function for every C function, and sometimes this leads to unnecessary, or apparently duplicated, Python functions.

The library contains a number of different functions we can use to load an XML document:

parseDoc, parseFile, parseMemory, readDoc, readFd, readFile, readMemory, recoverDoc and recoverFile

All of these functions return an xmlDoc object. Examples for using each of these follow:


parseDoc(cur) - load an XML document from memory (a string)

	doc = libxml2.parseDoc("""<?xml version="1.0"?>
	<root>Hello world!</root>""")	

parseMemory(buffer, size) - load an XML document from memory

	doc = libxml2.parseMemory(xml, len(xml))

This function performs exactly the same job as parseDoc from a Python perspective.


parseFile(filename) - load an XML document from a file

	
	doc = libxml2.parseFile('test.xml')

readDoc(cur, URL, encoding, options) - load an XML document from memory (a string)

This version of the function allows you to specify options on a per-document basis. The parseDoc version uses the parser defaults (in practice, the parser global settings, which can also be modified using global functions).

In most cases,

		doc = libxml2.readDoc('<foo/>',None,None,0)

will be equivalent to

		doc = libxml2.parseDoc('<foo/>')

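When using XSL, I have found it better to force entities to be resolved before running the transform, in which case it is useful to pass the XML_PARSE_NOENT option (note that the URL and encoding arguments must still be supplied, even if they are None):

	doc = libxml2.readDoc(xml, None, None, libxml2.XML_PARSE_NOENT)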

readFd(fd, URL, encoding, options) - load an XML document from a file descriptor

readFile(filename, encoding, options) - load an XML document from a file allowing the specification of per-document options.


readMemory(buffer, size, URL, encoding, options) - for Python, equivalent to using readDoc


recoverDoc(cur) - this is equivalent to parseDoc, except that even broken XML will result in an XML tree being created.

	doc = libxml2.recoverDoc('<foo><broken></foo>')

will report a parser error, but once the error has been handled, doc will contain:

	<?xml version="1.0"?>
	<foo><broken/></foo>

recoverFile(filename) - same as recoverDoc, but for files.


In the simplest case, to load a file from disk you can do:

	doc = libxml2.parseFile( 'test.xml' )

Managing your memory

Ugh, nasty memory management. Isn't that why we're using Python, to avoid all that stuff?

Libxml2 does not automatically clean up the memory it uses, so when you finish working with your xmlDoc object, you need to remember to call freeDoc. The same is true of XPath evaluation contexts created with xpathNewContext: call xpathFreeContext on them when you are done.

OK, so what we have now is something like the following:

	doc = libxml2.parseFile( 'test.xml' )
	# Do some stuff with the document here!
	doc.freeDoc()

It doesn't matter which function you use to create your xmlDoc object - each of them returns the same thing, so just remember to call freeDoc on it when you are done and all will be well.

There, that wasn't so hard was it? :-)
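
If you would rather not remember the freeDoc call at all, one option is to wrap the load/free pair in a context manager. This is only a sketch: the name parsed is hypothetical and not something the bindings provide. It relies on nothing more than the loader returning an object with a freeDoc method, which is the case for the loading functions described above.

```python
import contextlib


@contextlib.contextmanager
def parsed(loader, *args):
    """Load a document with loader(*args) and free it on exit.

    loader is any callable returning an object with a freeDoc() method,
    for example libxml2.parseFile or libxml2.parseDoc.
    """
    doc = loader(*args)
    try:
        yield doc
    finally:
        # Runs even if the body of the with block raises an exception.
        doc.freeDoc()


# Usage, assuming the bindings are installed:
#
#     with parsed(libxml2.parseFile, 'test.xml') as doc:
#         root = doc.getRootElement()
```

The benefit over the plain version is that the document is still freed when the code working on it raises an exception.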

Working with the document

Now that we have a working document, and know how to dispose of it when we're done, it is time to look at a number of common XML operations and see how we can do those using Libxml2 and Python.

Elements

The xmlDoc object has a large number of methods. As well as its own collection, it inherits from xmlNode, which inherits from xmlCore; this gives you over 200 available methods to read up on! This is fairly daunting when you can't find an example that shows you how to perform simple tasks, but don't worry: in practice we can get by in most situations with a small fraction of these.

All well-formed XML documents contain a single root element, which contains all the other nodes.

You can get a reference to the root element using getRootElement on the document object. The root element is an xmlNode object, just like all other nodes in the document. Working with nodes is fairly straightforward:

	>>> import libxml2
	>>> doc = libxml2.parseDoc( '<foo>Hello world.</foo>' )
	>>> root = doc.getRootElement()
	>>> print root.name
	foo
	>>> print root.content
	Hello world.
	>>> root.setProp('bar', 'an attribute')
	<xmlAttr (bar) object at 0x13c00d0>
	>>> root.prop('bar')
	'an attribute'
	>>> print root.serialize()
	<foo bar="an attribute">Hello world.</foo>
	>>> doc.freeDoc()

The serialize method can be called on a single node or on the document, and provides a string representation of whichever one it is called on.

Navigating through the document is not much more difficult - we can use the node properties (from the xmlCore ancestor object) to find the child nodes:

	child = root.children
	# the children property returns the FIRST child of a node
	while child is not None:
		if child.type == "element":
			# do something with the child node
			print child.name
		child = child.next

Accessing the attributes of a node is possible in a similar way

	import libxml2
	doc = libxml2.parseDoc('<foo att1="value 1" att2="value 2"/>')
	root = doc.getRootElement()
	for property in root.properties:
		if property.type=='attribute':
			# do something with the attributes
			print property.name
			print property.content
	doc.freeDoc()

Notice that both loops test the type of each node. This is because most documents contain additional nodes, such as the text nodes that hold the whitespace between elements, alongside the specific node types we are interested in.
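
The walk-and-test pattern used for the children can be folded into a small generator so that the loop body only ever sees the nodes you care about. This is a sketch under the assumption that nodes expose the type and next properties used above. The name iter_siblings is hypothetical and not part of the bindings.

```python
def iter_siblings(first, node_type='element'):
    """Yield each node of the given type in a chain of siblings.

    first is the first node of the chain, such as node.children or
    node.properties; it may be None when the chain is empty.
    """
    node = first
    while node is not None:
        if node.type == node_type:
            yield node
        node = node.next


# Usage, assuming root is an xmlNode obtained as shown earlier:
#
#     for child in iter_siblings(root.children):
#         ...
#     for attr in iter_siblings(root.properties, 'attribute'):
#         ...
```

Because the generator copes with None, it also works on nodes that have no children or no attributes at all.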

XPath

Navigating a document in this manner is straightforward but tedious, and requires visiting every node in the document until you get to the specific one you need. More often, you want to retrieve a set of nodes or a single node matching some specific criteria. This is where XPath comes in, and Libxml2 has full support for XPath.

XPath queries can be run against the document or a specific element in the document, but in either case the procedure is the same.

The xmlsoft.org Python page suggests the following:

	doc = libxml2.parseFile("test.xml")
	ctxt = doc.xpathNewContext()
	result = ctxt.xpathEval("//*")
	# do something with the result
	
	doc.freeDoc()
	ctxt.xpathFreeContext()

which involves creating an XPath context, running a query against it and then freeing the context when finished. If you have a lot of queries to run, then this is the best way to work, as the context can be re-used for each query.

In practice, the xmlCore object provides a helper function which wraps this up for you. For single queries, running xpathEval directly on the node will suffice. Just be aware that each query creates and destroys its own context, which is going to be slower than the above implementation.

An XPath query will return a typed result, corresponding to the four basic types mentioned in the introduction section of the XPath Specification. Where the result is a node-set, this will be a Python list of nodes, which makes it easy to perform an operation on many nodes at once.

	import libxml2
	doc = libxml2.parseFile('test.xml')
	# select every element in the document
	result = doc.xpathEval('//*')
	for node in result:
		print node.name
	doc.freeDoc()

Apart from the call to freeDoc, it is hard to see how it could be much more Pythonic.

Namespaces

Dealing with XML Namespaces is possible as well.

Here we create an XML document and declare a namespace on the root element.

	import libxml2

	doc = libxml2.newDoc('1.0')
	root = libxml2.newNode('foo')
	doc.setRootElement(root)

	#Register the toto namespace
	ns = root.newNs('http://toto.org', 'toto')

	root.setNs(ns)  #put this node in the namespace

	#Add to the root node a property in this namespace
	root.setNsProp(ns, 'Id', str(12345))

	print doc.serialize()

This produces:

	<?xml version="1.0"?>
	<toto:foo xmlns:toto="http://toto.org" toto:Id="12345"/>

Namespaces can also be dealt with in XPath, provided you register the namespace with the XPath context object.

	import libxml2
	
	doc = libxml2.parseDoc("""
	<foo xmlns:MYNS="http://somewhere.fr">
	   <MYNS:a id="a1"/>
	   <a      id="a2"/>
	</foo>
	""")
	
	ctxt = doc.xpathNewContext()
	#you can choose any name, the URI is the namespace identifier
	ctxt.xpathRegisterNs("OtherName", "http://somewhere.fr") 
	
	# select the 'a' node in the somewhere.fr namespace
	result = ctxt.xpathEval('//OtherName:a')
	for node in result:	
		print node.name, "id=%s"%node.prop("id")  #will display "a id=a1"

	ctxt.xpathFreeContext()
	doc.freeDoc()

If the document declares a default namespace (xmlns="..." with no prefix), you will still have to register it in XPath under a prefix of your choice in order to use it in an XPath expression.
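
Creating the context, registering the namespaces and freeing the context afterwards can all be bundled together. The following is a minimal sketch rather than anything the bindings offer: xpath_context is a hypothetical name, and it only uses the xpathNewContext, xpathRegisterNs and xpathFreeContext calls shown above.

```python
import contextlib


@contextlib.contextmanager
def xpath_context(doc, namespaces=None):
    """Create an XPath context for doc and free it on exit.

    namespaces is an optional dict mapping the prefixes you want to use
    in your expressions to namespace URIs.
    """
    ctxt = doc.xpathNewContext()
    try:
        for prefix, uri in sorted((namespaces or {}).items()):
            ctxt.xpathRegisterNs(prefix, uri)
        yield ctxt
    finally:
        # The context is freed even if a query raises an exception.
        ctxt.xpathFreeContext()


# Usage, assuming doc was parsed from a document whose root element
# declares xmlns="http://somewhere.fr" as its default namespace:
#
#     with xpath_context(doc, {'d': 'http://somewhere.fr'}) as ctxt:
#         for node in ctxt.xpathEval('//d:a'):
#             ...
```

The context can be re-used for as many queries as you like inside the with block, which keeps the speed advantage of the explicit approach.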


Writing to a file

To write the contents of your XML document to a file, just use the saveTo method:

	f = open('output.xml','w')
	doc.saveTo(f)
	f.close()

The saveTo method is also part of xmlCore, so you can use it to save the contents of just a single node and its children as well as the whole document.

It is also worth noting that both saveTo and serialize can accept an encoding parameter, which allows the conversion of a document from one encoding to another. Libxml2 itself uses UTF-8 internally, and will convert the document when loading and serialising.

	>>> doc = libxml2.parseDoc("""<root><foo>hello</foo></root>""")
	>>> text = doc.serialize()
	>>> print text
	<?xml version="1.0"?>
	<root><foo>hello</foo></root>

	>>> text = doc.serialize("iso-8859-1")
	>>> print text
	<?xml version="1.0" encoding="iso-8859-1"?>
	<root><foo>hello</foo></root>

Modifying documents

To add a new node to a document, first we must create the node and then add it as a child of the element it belongs to.

	import libxml2
	doc = libxml2.parseDoc('<foo/>')
	root = doc.getRootElement()
	newNode = libxml2.newNode('bar')
	root.addChild(newNode)

At this stage, our document contains

	<?xml version="1.0"?>
	<foo><bar/></foo>

Using the setContent method of newNode, we can set its text content:

	newNode.setContent('Hello')

We can append some content to our element by calling addContent,

	newNode.addContent(' world')

which gives us

	<?xml version="1.0"?>
	<foo><bar>Hello world</bar></foo>

Creating or setting an attribute is easy too: we use the setProp method.

	newNode.setProp('attribute', 'the value')

If the attribute doesn't exist, it will be created; otherwise it will just have its content changed.

Adding nodes at a particular location in the hierarchy is possible using addNextSibling, or addPrevSibling. These operate in the same way as addChild, except they operate on the node you wish to add next to, rather than to the parent.

	sibling = libxml2.newNode('bar2') 
	newNode.addPrevSibling(sibling)

gives

	<?xml version="1.0"?>
	<foo><bar2/><bar attribute="the value">Hello world</bar></foo>

whereas

	sibling = libxml2.newNode('bar2') 
	newNode.addNextSibling(sibling)

gives

	<?xml version="1.0"?>
	<foo><bar attribute="the value">Hello world</bar><bar2/></foo>

To insert text into the document, you create a text node with some content and add it in the same way

	text = libxml2.newText('some text\n')
	newNode.addNextSibling(text)

which leaves us with

	<?xml version="1.0"?>
	<foo><bar2/><bar attribute="the value">Hello world</bar>some text
	</foo>

To create content and nodes, the useful Libxml2 helper functions are newComment, newText and newNode. You can also create a new node by copying one that already exists. The xmlNode object has copyNode and copyProp methods which can be useful here.

To add these new nodes into a document, you need to use one of the following methods (directly on nodes rather than on the document): addChild, addContent, addNextSibling or addPrevSibling.

XSLT

Libxml2 has a companion library called libxslt which provides support for XSL Transformations. I find the following example provides most of the useful information for a Python coder:

	import libxml2
	import libxslt

	def runTransform(xmlFile,xslFile):
		out = ''
		sourcedoc = libxml2.parseFile( xmlFile )
		styledoc = libxml2.parseFile( xslFile )
		style = libxslt.parseStylesheetDoc(styledoc)
		result = style.applyStylesheet(sourcedoc, None)
		out = style.saveResultToString( result )
		# freeing the stylesheet also frees styledoc
		style.freeStylesheet()
		result.freeDoc()
		sourcedoc.freeDoc()
		return out

Notice that there are three documents involved, the source, the stylesheet and the result, and each of them needs to be explicitly freed (freeing the compiled stylesheet also frees the document it was parsed from). The starting point for documentation can be found at http://xmlsoft.org/XSLT/python.html.

Libxml2 and HTML

If you have spent any time poking around libxml2.py, you will probably have noticed a number of functions that start with html. This is because Libxml2 has an HTML parser built in that does a pretty good job of loading real world (in other words horribly broken) HTML documents. You can then use the features we have previously discussed to read or modify the HTML.

The following example will load pretty much any HTML file into an xmlDoc object

	parse_options = libxml2.HTML_PARSE_RECOVER + \
		libxml2.HTML_PARSE_NOERROR + \
		libxml2.HTML_PARSE_NOWARNING
	doc = libxml2.htmlReadDoc(html, '', None, parse_options)

Here is a more complete example, which extracts all the links from the Guardian newspaper Website home page and prints the href attribute.

	import urllib2
	import libxml2

	# Load the page into a string
	f = urllib2.urlopen('http://www.guardian.co.uk')
	html = f.read()
	f.close()

	parse_options = libxml2.HTML_PARSE_RECOVER + \
		libxml2.HTML_PARSE_NOERROR + \
		libxml2.HTML_PARSE_NOWARNING
	doc = libxml2.htmlReadDoc(html,'',None,parse_options)
	links = doc.xpathEval('//a')
	for link in links:
		href = link.xpathEval('attribute::href')
		if len(href) > 0:
			href = href[0].content	
			print href
	doc.freeDoc()

For a more comprehensive example, see example of scraping content from a website.

Schema

One may validate an XML instance against a W3C schema, as shown below:

	# inspired from the test suite file "xstc/xstc.py"
	# thanks to Kasimier Buchcik
	#
	import libxml2

	ctxt = libxml2.schemaNewParserCtxt("my-schema.wxs")
	schema = ctxt.schemaParse()
	del ctxt

	validationCtxt = schema.schemaNewValidCtxt()

	doc = libxml2.parseFile("test.xml")

	#instance_Err = validationCtxt.schemaValidateFile(filePath, 0)
	instance_Err = validationCtxt.schemaValidateDoc(doc)

	del validationCtxt
	del schema
	doc.freeDoc()

	if instance_Err != 0:
		print "VALIDATION FAILED"
	else:
		print "VALIDATED"
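
The example above treats every non-zero result as a validation failure. The C documentation for xmlSchemaValidateDoc distinguishes three outcomes: 0 when the document is valid, a positive error code when it is not, and -1 for an internal or API error. Here is a small sketch of a helper that reports them separately. The name describe_validation and the wording of the messages are my own.

```python
def describe_validation(code):
    """Turn a schemaValidateDoc return code into a short message."""
    if code == 0:
        return "VALIDATED"
    if code < 0:
        # Something went wrong inside the library or in how it was called.
        return "INTERNAL ERROR (%d)" % code
    # A positive value is the error code of the validation failure.
    return "VALIDATION FAILED (error %d)" % code


# Usage, with instance_Err obtained as in the example above:
#
#     message = describe_validation(instance_Err)
```

Telling the two failure cases apart is mostly useful when debugging, since an internal error usually points at the schema or the calling code rather than at the instance document.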

Known Problems

Node equality Problem

The usual equality test (==) does not work as you would expect :-( Look at this:

	>>> import libxml2
	>>> doc = libxml2.parseDoc('<foo/>')
	>>> root1 = doc.getRootElement()
	>>> root2 = doc.getRootElement()
	>>> root1 == root2
	False 

(note: This issue affects earlier builds of Libxml2 for Python. It is referred to in http://bugzilla.gnome.org/show_bug.cgi?id=345779 and appears to be resolved in current builds)

Using libxml2-2.6.27, this produces the expected result.

	>>> import libxml2 
	>>> doc = libxml2.parseDoc('<foo/>')
	>>> root1 = doc.getRootElement()
	>>> root2 = doc.getRootElement()
	>>> root1 == root2
	True