Getting started with Libxml2 and Python
Getting to grips with Libxml2 and Python can be a frustrating experience, particularly as in-depth, accurate Python documentation is hard to find on the Web.
Many Python developers seem to dislike the Libxml2 bindings, as they are 'un-Pythonic' and much too C-like. This, however, misses the point of Libxml2: the library is portable, mature, extremely full-featured and *very* fast.
In the process of writing this tutorial, I hung out in the #xml channel on irc.gnome.org, and subscribed to the xml@gnome.org mailing list - I was given a lot of help when things weren't obvious! Although there's not a massive amount of activity on IRC, or in the mailing list on a daily basis, I would definitely recommend spending some time browsing the archive - or using Google to search it when you have questions. Additionally, I have found the people in the Libxml2 community very helpful.
Manipulating XML using Libxml2 is fairly straightforward once you have a couple of working examples. In Python, however, that tends to be the problem: finding working examples is a bit of a hit-and-miss affair.
The first place to look is in the examples folder in the documentation installed with your release (/usr/share/doc/libxml2-python-2.6.27/examples on my machine).
TODO: where are the examples on a number of distributions/platforms?
Also, take a moment to scan through libxml2.py itself - this is the Python wrapper and is a good place to look if you are hunting for a particular function. There is plenty of information in the wrapper, as all the docstrings have been populated, so you can always get information with a statement like
print libxml2.parseFile.__doc__
for any particular function.
Also remember that you can list the available methods for any Python object by using the dir function. The most immediately useful objects are xmlCore, xmlNode and xmlDoc, so
dir(libxml2.xmlCore)
is your friend when working out what functions are available to you.
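Because dir returns several hundred names for these objects, it can help to filter the list. The helper below is only a sketch and is not part of Libxml2; find_members is a name I have made up for illustration, and it works on any Python object.

```python
def find_members(obj, fragment):
    """Return the sorted names from dir(obj) that contain fragment.

    The match is case-insensitive.
    """
    fragment = fragment.lower()
    return sorted(name for name in dir(obj) if fragment in name.lower())

# Hypothetical usage, assuming libxml2 is importable:
#   import libxml2
#   find_members(libxml2.xmlCore, 'xpath')
```

Searching for a fragment such as 'xpath' or 'sibling' is usually quicker than reading the whole listing.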
I'm going to assume that you know a bit about XML, at least enough to recognise an XML document when you see one, and hopefully enough about Python to know where to find the documentation!
Installing Libxml2
TODO: installation examples for a number of distros/platforms.
Loading a document
The first thing you want to do in XML will be to load a document of some sort. As a new Libxml2 user, this is where our confusion starts! It is worth remembering that in general, the Python bindings are automatically generated - therefore there is an equivalent Python function for every C function, and sometimes this can lead to unnecessary, or apparently duplicated Python functions.
The library contains a number of different functions we can use to load an XML document:
parseDoc, parseFile, parseMemory, readDoc, readFd, readFile, readMemory, recoverDoc and recoverFile
All of these functions return an xmlDoc object. Examples for using each of these follow:
parseDoc(cur) - load an XML document from memory (a string)
doc = libxml2.parseDoc("""<?xml version="1.0"?>
<root>Hello world!</root>""")
parseMemory(buffer, size) - load an XML document from memory
doc = libxml2.parseMemory(xml, len(xml))
This function performs exactly the same job as parseDoc from a Python perspective.
parseFile(filename) - load an XML document from a file
doc = libxml2.parseFile('test.xml')
readDoc(cur, URL, encoding, options) - load an XML document from memory (a string)
This version of the function allows you to specify options on a per-document basis. The parseDoc version uses the parser defaults (in practice, the parser global settings, which can also be modified using global functions).
In most cases,
doc = libxml2.readDoc('<foo/>',None,None,0)
will be equivalent to
doc = libxml2.parseDoc('<foo/>')
When using XSL, I have found it better to force entities to be resolved before running the transform, in which case it is useful to use the following:
doc = libxml2.readDoc(xml, None, None, libxml2.XML_PARSE_NOENT)
readFd(fd, URL, encoding, options) - load an XML document from a file descriptor
readFile(filename, encoding, options) - load an XML document from a file allowing the specification of per-document options.
readMemory(buffer, size, URL, encoding, options) - for Python, equivalent to using readDoc
recoverDoc(cur) - this works like parseDoc, except that even broken XML will result in a valid XML tree being created.
doc = libxml2.recoverDoc('<foo><broken></foo>')
will raise a parser error, but after the error has been handled, doc will contain:
<?xml version="1.0"?>
<foo><broken/></foo>
recoverFile(filename) - same as recoverDoc, but for files.
In the simplest case, to load a file from disk you can do:
doc = libxml2.parseFile( 'test.xml' )
Managing your memory
Ugh, nasty memory management. Isn't that why we're using Python, to avoid all that stuff?
The Python bindings do not automatically clean up the memory that Libxml2 uses, so when you finish working with your xmlDoc object, you need to remember to call freeDoc. The same is true of XPath evaluation contexts created with xpathNewContext: call xpathFreeContext on them when you are done.
OK, so what we have now is something like the following:
doc = libxml2.parseFile( 'test.xml' )
# Do some stuff with the document here!
doc.freeDoc()
It doesn't matter which function you use to create your xmlDoc object - each of them returns the same thing, so just remember to call freeDoc on it when you are done and all will be well.
There, that wasn't so hard was it? :-)
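If you would rather not rely on remembering the freeDoc call, one option is to wrap it in a context manager so that it runs even when an exception is raised. This is a sketch of my own rather than something the bindings provide; freeing is a hypothetical name, and the only thing it assumes about its argument is a freeDoc method.

```python
from contextlib import contextmanager

@contextmanager
def freeing(doc):
    """Yield doc, then call doc.freeDoc() even if the body raises."""
    try:
        yield doc
    finally:
        doc.freeDoc()

# Hypothetical usage, assuming libxml2 is importable and test.xml exists:
#   with freeing(libxml2.parseFile('test.xml')) as doc:
#       root = doc.getRootElement()
```

Do not keep references to the document or its nodes after the with block ends, since the underlying memory has been released by then.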
Working with the document
Now that we have a working document and know how to dispose of it when we're done, it is time to look at a number of common XML operations and see how we can do those using Libxml2 and Python.
Elements
The xmlDoc object has a large number of methods. As well as its own collection, it inherits from xmlNode, which inherits from xmlCore; this gives you over 200 available methods to read up on! This is fairly daunting when you can't find an example that shows you how to perform simple tasks, but don't worry: in practice we can get by in most situations with a small fraction of these.
All well-formed XML documents contain a single root element, which contains all the other elements.
You can get a reference to the root element using getRootElement on the document object. The root element is an xmlNode object, just like all other nodes in the document. Working with nodes is fairly straightforward:
>>> import libxml2
>>> doc = libxml2.parseDoc( '<foo>Hello world.</foo>' )
>>> root = doc.getRootElement()
>>> print root.name
foo
>>> print root.content
Hello world.
>>> root.setProp('bar', 'an attribute')
<xmlAttr (bar) object at 0x13c00d0>
>>> root.prop('bar')
'an attribute'
>>> print root.serialize()
<foo bar="an attribute">Hello world.</foo>
>>> doc.freeDoc()
The serialize method can be called on a single node or on the whole document, and returns a string representation of whichever it was called on.
Navigating through the document is not much more difficult - we can use the node properties (from the xmlCore ancestor object) to find the child nodes:
child = root.children # the children property returns the FIRST child of a node
while child is not None:
    if child.type == "element":
        # do something with the child node
        print child.name
    child = child.next
Accessing the attributes of a node is possible in a similar way
import libxml2
doc = libxml2.parseDoc('<foo att1="value 1" att2="value 2"/>')
root = doc.getRootElement()
for property in root.properties:
    if property.type=='attribute':
        # do something with the attributes
        print property.name
        print property.content
doc.freeDoc()
Notice that both loops test the type of each node. This is because most documents contain additional nodes, such as whitespace text nodes, alongside the specific node types we are interested in.
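The child-walking loop shown earlier can be wrapped in a generator so that the type test lives in one place. This is only a sketch; element_children is a hypothetical name, and it relies on nothing beyond the children, next and type properties already used above.

```python
def element_children(node):
    """Yield each direct child of node whose type is 'element'."""
    child = node.children          # first child, or None if there are none
    while child is not None:
        if child.type == "element":
            yield child
        child = child.next         # next sibling, or None at the end

# Hypothetical usage with a parsed document:
#   for child in element_children(doc.getRootElement()):
#       handle(child.name)
```

Text, comment and other non-element nodes are skipped silently, which is usually what you want when the document has been indented for readability.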
XPath
Navigating a document in this manner is straightforward, but tedious, and requires accessing every node in the document until you get to the specific one you need. More often, you want to retrieve a set of nodes or a single node matching some specific criteria. This is where XPath comes in, and Libxml2 has full support for XPath.
XPath queries can be run against the document or a specific element in the document, but in either case the procedure is the same.
The xmlsoft.org Python page suggests the following:
doc = libxml2.parseFile("test.xml")
ctxt = doc.xpathNewContext()
result = ctxt.xpathEval("//*")
# do something with the result
doc.freeDoc()
ctxt.xpathFreeContext()
which involves creating an XPath context, running a query against it and then freeing the context when finished. If you have a lot of queries to run, then this is the best way to work, as the context can be re-used for each query.
In practice, the xmlCore object provides a helper function which wraps this up for you. For single queries running xpathEval directly on the node will suffice, just be aware that each query creates and destroys its own context, which is going to be slower than the above implementation.
An XPath query returns a typed result, corresponding to the four basic types mentioned in the introduction section of the XPath Specification. Where the result is a node-set, this will be a Python list of nodes, which makes it easy to perform an operation on many nodes at once.
import libxml2
doc = libxml2.parseFile('test.xml')
# select every element in the document
result = doc.xpathEval('//*')
for node in result:
    print node.name
doc.freeDoc()
Apart from the call to freeDoc, it is hard to see how it could be much more Pythonic.
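When you do want the reusable context described above, the create and free calls can be paired up in a context manager. Again, this is a sketch rather than part of the bindings; xpath_context is a hypothetical name, and it uses only xpathNewContext and xpathFreeContext.

```python
from contextlib import contextmanager

@contextmanager
def xpath_context(doc):
    """Yield a reusable XPath context for doc and free it afterwards."""
    ctxt = doc.xpathNewContext()
    try:
        yield ctxt
    finally:
        # runs even if one of the queries raises
        ctxt.xpathFreeContext()

# Hypothetical usage with a parsed document:
#   with xpath_context(doc) as ctxt:
#       titles = ctxt.xpathEval('//title')
#       links = ctxt.xpathEval('//a')
```

The document itself still needs its own freeDoc call once the with block has finished.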
Namespaces
Dealing with XML Namespaces is possible as well.
Here we create an XML document and declare a namespace on the root element.
import libxml2
doc = libxml2.newDoc('1.0')
root = libxml2.newNode('foo')
doc.setRootElement(root)
#Register the toto namespace
ns = root.newNs('http://toto.org', 'toto')
root.setNs(ns) #put this node in the namespace
#Add to the root node a property in this namespace
root.setNsProp(ns, 'Id', str(12345))
print doc.serialize()
This produces:
<?xml version="1.0"?>
<toto:foo xmlns:toto="http://toto.org" toto:Id="12345"/>
Namespaces can also be dealt with in XPath, provided you register the namespace with the XPath context object.
import libxml2
doc = libxml2.parseDoc("""
<foo xmlns:MYNS="http://somewhere.fr">
<MYNS:a id="a1"/>
<a id="a2"/>
</foo>
""")
ctxt = doc.xpathNewContext()
#you can choose any name, the URI is the namespace identifier
ctxt.xpathRegisterNs("OtherName", "http://somewhere.fr")
# select the 'a' node in the somewhere.fr namespace
result = ctxt.xpathEval('//OtherName:a')
for node in result:
    print node.name, "id=%s"%node.prop("id") #will display "a id=a1"
ctxt.xpathFreeContext()
doc.freeDoc()
If the document declares a default namespace, you will have to register it in XPath under a prefix of your choice in order to use it in an XPath expression.
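If a document uses several namespaces, the registration step can be driven from a dictionary. The helper below is a sketch; register_namespaces and the prefix 'd' are names I have chosen for illustration, and the only Libxml2 call involved is xpathRegisterNs.

```python
def register_namespaces(ctxt, namespaces):
    """Register every prefix -> URI pair from namespaces on an XPath context."""
    for prefix, uri in namespaces.items():
        ctxt.xpathRegisterNs(prefix, uri)
    return ctxt

# Hypothetical usage for a document whose root element declares
# xmlns="http://somewhere.fr" as its default namespace:
#   ctxt = doc.xpathNewContext()
#   register_namespaces(ctxt, {'d': 'http://somewhere.fr'})
#   nodes = ctxt.xpathEval('//d:a')   # '//a' would not match the namespaced elements
#   ctxt.xpathFreeContext()
```

The prefix used in the query does not have to match anything in the document; only the URI matters.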
Writing to a file
To write the contents of your XML document to a file, just use the saveTo method:
f = open('output.xml','w')
doc.saveTo(f)
f.close()
The saveTo method is also part of xmlCore, so you can use it to save the contents of just a single node and its children as well as the whole document.
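To make sure the file is always closed, the open and saveTo calls can be combined in a with block. This is a small sketch; save_document is a hypothetical name, and it assumes only that the object passed in has the saveTo method described above.

```python
def save_document(node, path):
    """Write node (or a whole document) to path and always close the file."""
    with open(path, 'w') as f:
        node.saveTo(f)

# Hypothetical usage with a parsed document:
#   save_document(doc, 'output.xml')
#   save_document(doc.getRootElement(), 'root-only.xml')
```

The with block closes the file even if saveTo fails part of the way through.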
It is also worth noting that both saveTo and serialize can accept an encoding parameter, which allows the conversion of a document from one encoding to another. Libxml2 itself uses UTF-8 internally, and will convert the document when loading and serialising.
>>> doc = libxml2.parseDoc("""<root><foo>hello</foo></root>""")
>>> str = doc.serialize()
>>> print str
<?xml version="1.0"?>
<root><foo>hello</foo></root>
>>> str = doc.serialize("iso-8859-1")
>>> print str
<?xml version="1.0" encoding="iso-8859-1"?>
<root><foo>hello</foo></root>
Modifying documents
To add a new node to a document, first we must create the node and then add it as a child of the element it belongs to.
import libxml2
doc = libxml2.parseDoc('<foo/>')
root = doc.getRootElement()
newNode = libxml2.newNode('bar')
root.addChild(newNode)
At this stage, our document contains
<?xml version="1.0"?>
<foo><bar/></foo>
Using the setContent method of newNode, we can do:
newNode.setContent('Hello')
We can append some content to our new node with addContent:
newNode.addContent(' world')
which gives us
<?xml version="1.0"?>
<foo><bar>Hello world</bar></foo>
Creating or setting an attribute is easy too; we use the setProp method.
newNode.setProp('attribute', 'the value')
If the attribute doesn't exist, it will be created; otherwise it will just have its content changed.
Adding nodes at a particular location in the hierarchy is possible using addNextSibling, or addPrevSibling. These operate in the same way as addChild, except they operate on the node you wish to add next to, rather than to the parent.
sibling = libxml2.newNode('bar2')
newNode.addPrevSibling(sibling)
gives
<?xml version="1.0"?>
<foo><bar2/><bar attribute="the value">Hello world</bar></foo>
whereas
sibling = libxml2.newNode('bar2')
newNode.addNextSibling(sibling)
gives
<?xml version="1.0"?>
<foo><bar attribute="the value">Hello world</bar><bar2/></foo>
To insert text into the document, you create a text node with some content and add it in the same way
text = libxml2.newText('some text\n')
newNode.addNextSibling(text)
which leaves us with
<?xml version="1.0"?>
<foo><bar2/><bar attribute="the value">Hello world</bar>some text
</foo>
To create content and nodes, the useful Libxml2 helper functions are newComment, newText and newNode. You can also create a new node by copying one that already exists. The xmlNode object has copyNode and copyProp methods which can be useful here.
To add these new nodes into a document, you need to use one of the following methods (directly on nodes rather than on the document): addChild, addContent, addNextSibling or addPrevSibling.
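The attach, set-content and set-attribute steps from this section can be bundled into one call. The function below is a sketch; append_element is a hypothetical name, and it uses only the addChild, setContent and setProp methods shown above.

```python
def append_element(parent, child, text=None, attrs=None):
    """Attach child to parent, then set its text and attributes."""
    parent.addChild(child)
    if text is not None:
        child.setContent(text)
    for name, value in (attrs or {}).items():
        child.setProp(name, value)
    return child

# Hypothetical usage, assuming libxml2 is importable and root is an xmlNode:
#   append_element(root, libxml2.newNode('bar'),
#                  'Hello world', {'attribute': 'the value'})
```

The caller still creates the node with newNode, so the helper works equally well with copied nodes.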
XSLT
Libxml2 has a companion library called libxslt which provides support for XSL Transformations. I find the following example provides most of the useful information for a Python coder:
import libxml2
import libxslt

def runTransform(xmlFile, xslFile):
    out = ''
    sourcedoc = libxml2.parseFile(xmlFile)
    styledoc = libxml2.parseFile(xslFile)
    style = libxslt.parseStylesheetDoc(styledoc)
    result = style.applyStylesheet(sourcedoc, None)
    out = style.saveResultToString(result)
    # freeing the stylesheet also frees styledoc
    style.freeStylesheet()
    result.freeDoc()
    sourcedoc.freeDoc()
    return out
Notice that there are three documents involved, each of which needs to be explicitly freed: the source, the stylesheet and the result. The starting point for documentation is http://xmlsoft.org/XSLT/python.html.
Libxml2 and HTML
If you have spent any time poking around libxml2.py, you will probably have noticed a number of functions that start with html. This is because Libxml2 has an HTML parser built in that does a pretty good job of loading real world (in other words horribly broken) HTML documents. You can then use the features we have previously discussed to read or modify the HTML.
The following example will load pretty much any HTML file into an xmlDoc object
parse_options = libxml2.HTML_PARSE_RECOVER + \
                libxml2.HTML_PARSE_NOERROR + \
                libxml2.HTML_PARSE_NOWARNING
doc = libxml2.htmlReadDoc(html, '', None, parse_options)
Here is a more complete example, which extracts all the links from the Guardian newspaper Website home page and prints the href attribute.
import urllib2
import libxml2
# Load the page into a string
f = urllib2.urlopen('http://www.guardian.co.uk')
html = f.read()
f.close()
parse_options = libxml2.HTML_PARSE_RECOVER + \
libxml2.HTML_PARSE_NOERROR + \
libxml2.HTML_PARSE_NOWARNING
doc = libxml2.htmlReadDoc(html,'',None,parse_options)
links = doc.xpathEval('//a')
for link in links:
    href = link.xpathEval('attribute::href')
    if len(href) > 0:
        href = href[0].content
        print href
doc.freeDoc()
For a more comprehensive example, see example of scraping content from a website.
Schema
One may validate an XML instance against a W3C schema, as shown below:
# inspired from the test suite file "xstc/xstc.py"
# thanks to Kasimier Buchcik
#
import libxml2
ctxt = libxml2.schemaNewParserCtxt("my-schema.wxs")
schema = ctxt.schemaParse()
del ctxt
validationCtxt = schema.schemaNewValidCtxt()
doc = libxml2.parseFile("test.xml")
#instance_Err = validationCtxt.schemaValidateFile(filePath, 0)
instance_Err = validationCtxt.schemaValidateDoc(doc)
del validationCtxt
del schema
doc.freeDoc()
if instance_Err != 0:
    print "VALIDATION FAILED"
else:
    print "VALIDATED"
Known Problems
Node equality Problem
The usual equality test (==) does not work :-( Look at this:
>>> import libxml2
>>> doc = libxml2.parseDoc('<foo/>')
>>> root1 = doc.getRootElement()
>>> root2 = doc.getRootElement()
>>> root1 == root2
False
(note: This issue affects earlier builds of Libxml2 for Python. It is referred to in http://bugzilla.gnome.org/show_bug.cgi?id=345779 and appears to be resolved in current builds)
Using libxml2-2.6.27, this produces the expected result.
>>> import libxml2
>>> doc = libxml2.parseDoc('<foo/>')
>>> root1 = doc.getRootElement()
>>> root2 = doc.getRootElement()
>>> root1 == root2
True
