XML is a great way to organize information. I first learnt of the power of XML to systematize information when I used it to output a whole bunch of search results from NCBI in the TinySeq XML format. Once I had this XML document, I could read it into Excel and then very easily analyze the information, since it was nicely laid out as an Excel sheet.
Backpackit, a service I use to take notes detailing my experimental research results, outputs all of the account data in XML format. Before I can move this data elsewhere, it helps to understand the data structure. So the first task I set myself was to parse the XML output.
I decided to use Python for this, because I felt using Java here would be like using an elephant to crush a fly (or whatever the expression is). Also, a lot of the data is text, and I have always used Perl to handle text. So the general basis for my switch will be: what I did in Perl before, I would like to do in Python now. Java will be kept for more heavyweight tasks.
What I needed my program to do was:
- Read the XML output
- Create objects for each element or node in the output
I can then imagine that, once I have these objects, I can ask questions like how many objects have embedded images, how many objects have outgoing links, and so on.
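To make that kind of question concrete, here is a minimal sketch of how it might look with xml.dom.minidom. I have not described the real Backpackit export layout here, so the element names below (`notes`, `note`, `body`, `img`, `a`) and the sample data are purely hypothetical stand-ins.

```python
from xml.dom import minidom

# Hypothetical sample data: the real export will use different element names.
SAMPLE = """<notes>
  <note id="1"><body>Gel image <img src="gel.png"/></body></note>
  <note id="2"><body>See <a href="http://example.org/protocol">protocol</a></body></note>
  <note id="3"><body>Plain text only</body></note>
</notes>"""


def count_with_descendant(doc, item_tag, descendant_tag):
    """Count item_tag elements that contain at least one descendant_tag element."""
    return sum(
        1
        for item in doc.getElementsByTagName(item_tag)
        if item.getElementsByTagName(descendant_tag)  # empty NodeList is falsy
    )


doc = minidom.parseString(SAMPLE)
notes_with_images = count_with_descendant(doc, "note", "img")
notes_with_links = count_with_descendant(doc, "note", "a")
print(notes_with_images, notes_with_links)
```

The same helper should work for any pair of tag names, so once the real structure is known only the two strings need to change.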
The “Dive Into Python” book gave me a quick introduction to the xml.dom package. I then ran into some encoding (codec) issues and learnt all about the “utf8” and “iso8859” character encodings. Once I learnt how to handle the UnicodeEncodeError, I had a full-fledged three-line program that parsed my input file, created the document object and, as proof of successful parsing, printed my XML file back out.
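For readers who want to try the same thing, the following is a rough reconstruction of such a parse-and-print program, not my exact code. It is written for Python 3 and assumes the input is UTF-8. The function name `parse_and_echo` and the file name in the usage comment are made up for illustration. Asking `toxml` for an explicit encoding returns bytes, which is one way to sidestep a UnicodeEncodeError when the text contains non-ASCII characters.

```python
from xml.dom import minidom


def parse_and_echo(xml_bytes):
    """Parse XML bytes into a DOM document and serialize it back as UTF-8 bytes."""
    doc = minidom.parseString(xml_bytes)  # build the document object
    return doc.toxml(encoding="utf-8")    # bytes, so no implicit ASCII encoding step


# Usage with a hypothetical export file:
#     import sys
#     with open("backpack_export.xml", "rb") as f:
#         sys.stdout.buffer.write(parse_and_echo(f.read()))
```

Writing the bytes to `sys.stdout.buffer` rather than calling `print` on a string leaves the terminal's own encoding out of the picture.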
The screencast above documents my travails.