XML parsing – python
Apr 12th, 2007 by harijay
[youtube=http://www.youtube.com/watch?v=L-_6tiTR8v0]
XML is a great way to organize information. I first learnt of the power of XML to systematize information when I used it to output a whole bunch of search results from NCBI in the TinySeq XML format. Once I had this XML document, I could read it into Excel and then very easily analyze the information, since it was nicely laid out as an Excel sheet.
Backpackit, a service I use to take notes detailing my experimental research results, outputs all of the account data in XML format. Before I can move this data elsewhere, it helps for me to understand the data structure. So the first task I set out to do was to parse the XML output.
I decided to use Python for this, because I felt using Java here would be like using an elephant to crush a fly (or whatever the expression is). Also, a lot of the data is text, and I have always used Perl to handle text. So a general basis for my codeitch will be: what I did in Perl before, I would like to do in Python now. Java will be reserved for more heavyweight tasks.
What I needed my program to do was:
- Read the XML output
- Create objects for each element or node in the output
I can then imagine that, once I have these objects, I can ask questions like how many objects have embedded images, how many objects have outgoing links, etc.
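A minimal sketch of that kind of question, using xml.dom.minidom from the standard library. The sample XML, the element names (`note`, `body`, `img`, `a`) and the helper `count_notes_with` are all made up for illustration; the real Backpack export layout is not shown here and will differ.

```python
from xml.dom import minidom

# Hypothetical sample data: the actual Backpack export schema is an assumption.
SAMPLE_XML = """<notes>
  <note id="1"><body>Gel image: <img src="gel.png"/></body></note>
  <note id="2"><body>See <a href="http://example.org">protocol</a></body></note>
  <note id="3"><body>Plain text only</body></note>
</notes>"""


def count_notes_with(doc, tag_name):
    """Count <note> elements that contain at least one descendant named tag_name."""
    return sum(
        1
        for note in doc.getElementsByTagName("note")
        if note.getElementsByTagName(tag_name)  # empty NodeList is falsy
    )


# Parse the text into a DOM document; every element becomes a node object.
doc = minidom.parseString(SAMPLE_XML)

notes_with_images = count_notes_with(doc, "img")
notes_with_links = count_notes_with(doc, "a")
print(notes_with_images, notes_with_links)
```

With the real export, the tag names passed to the helper would have to match whatever the file actually uses.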
The “Dive Into Python” book gave me a quick introduction to the xml.dom package. I then ran into some encoding (codec) issues and learnt all about the “utf8” and “iso8859” character encodings. Once I learnt how to handle the UnicodeEncodeError, I had a full-fledged three-line program that parsed my input file, created the document object and, as proof of successful parsing, printed my XML file back out.
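The program itself is not reproduced in this post, so the following is only a sketch of the same idea in present-day Python 3, not the original code. The input string is a made-up example with one non-ASCII character; for a real file, `minidom.parse("export.xml")` (hypothetical filename) would replace `parseString`. It shows where a UnicodeEncodeError comes from and how an explicit UTF-8 encode avoids it.

```python
from xml.dom import minidom

# Hypothetical input containing a non-ASCII character (the micro sign).
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<note>5 \u00b5l of buffer</note>"
).encode("utf-8")

doc = minidom.parseString(xml_bytes)  # build the document object
text = doc.toxml()                    # serialize back to a str

# Encoding to a codec that cannot represent the character raises
# UnicodeEncodeError; this is the kind of failure described above.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Choosing UTF-8 explicitly works, since it can represent any character.
encoded = text.encode("utf-8")
print(encoded)  # bytes repr, safe to print on any terminal
```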
The screencast above documents my travails.