Monthly Archives: April 2007

XML parsing – python

[youtube=http://www.youtube.com/watch?v=L-_6tiTR8v0]

XML is a great way to organize information. I first learnt of its power to systematize information when I used it to output a whole bunch of search results from NCBI in the TinySeq XML format. Once I had this XML document, I could read it into Excel and then analyze the information very easily, since it was nicely laid out as an Excel sheet.

Backpackit, a service I use to take notes detailing my experimental research results, outputs all of the account data in XML format. Before I can move this data elsewhere, it helps to understand the data structure. So the first task I set out to do was to parse the XML output.

I decided to use Python for this, because I felt using Java here would be like using an elephant to crush a fly (or whatever the expression is). Also, a lot of the data is text, and I have always used Perl to handle text. So a general basis for my codeitch will be: what I did in Perl before, I would like to do in Python now. Java will be used for more heavyweight tasks.

What I needed my program to do was :

  1. Read the XML output
  2. Create objects for each element or node in the output

I can then imagine that once I have these objects, I can ask questions like how many objects have embedded images, how many objects have outgoing links, etc.
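The idea of wrapping each node in an object and then asking questions of those objects might be sketched roughly as follows. This is only an illustration: the sample XML, the element names (note, img, a) and the Note class are all made up here, since the real structure of the Backpackit export is not shown in this post.

```python
from xml.dom import minidom

# Hypothetical export fragment: the <note>, <img> and <a> element
# names are assumptions for illustration, not the real Backpackit schema.
SAMPLE = """<notes>
  <note><img src="gel1.png"/>Ran the gel.</note>
  <note><a href="http://example.org/protocol">Protocol</a></note>
  <note>Plain text only.</note>
</notes>"""


class Note:
    """Wraps one <note> element and answers simple questions about it."""

    def __init__(self, element):
        self.element = element

    def has_images(self):
        # True if any <img> element is nested inside this note.
        return len(self.element.getElementsByTagName("img")) > 0

    def has_links(self):
        # True if any <a> element is nested inside this note.
        return len(self.element.getElementsByTagName("a")) > 0


def load_notes(xml_text):
    """Parse the XML text and create one Note object per <note> element."""
    document = minidom.parseString(xml_text)
    return [Note(el) for el in document.getElementsByTagName("note")]


notes = load_notes(SAMPLE)
with_images = sum(1 for note in notes if note.has_images())
with_links = sum(1 for note in notes if note.has_links())
```

With objects like these, "how many notes have embedded images" becomes a one-line count over the list.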

The “Dive Into Python” book gave me a quick introduction to the xml.dom package. I then ran into some encoding (codec) issues and learnt all about the “utf8” and “iso8859” character encodings. Once I learnt how to handle the UnicodeEncodeError, I had a full-fledged three-line program that parsed my input file, created the document object and, as proof of successful parsing, printed my XML file back out.
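A program along these lines might look something like the sketch below. It is not the exact code from the screencast: it is written for current Python, uses xml.dom.minidom from the standard library, and parses a made-up inline sample instead of the real export file. Serializing explicitly to UTF-8 bytes is one way (an assumption about the fix, not necessarily the one I used) to sidestep a UnicodeEncodeError when the text contains non-ASCII characters.

```python
import sys
from xml.dom import minidom

# Hypothetical sample with one non-ASCII character; the real export
# and its element names are not reproduced here.
SAMPLE = '<?xml version="1.0" encoding="utf-8"?><notes><note>caf\u00e9 gel run</note></notes>'


def roundtrip(xml_text):
    """Parse XML text into a DOM document and serialize it back as UTF-8 bytes."""
    document = minidom.parseString(xml_text.encode("utf-8"))
    return document.toxml(encoding="utf-8")


if __name__ == "__main__":
    # Writing bytes to the underlying buffer avoids any implicit
    # encoding step that could fail on an ASCII-only terminal.
    sys.stdout.buffer.write(roundtrip(SAMPLE) + b"\n")
```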

The screencast above documents my travails.

Moving Data from Backpackit to Jotspot

First, a tangent: in order to reduce desktop clutter and focus on the task at hand, I have decided to experiment with “A distraction free Desktop” as screencast by Jon Udell. His accompanying blog post pointed me to a great Mac utility to clean up my desktop, and also to iTerm, which I believe is a very good alternative to the Terminal app offered by Apple.

One of the first tasks I want to focus on is how to organize my data from Backpackit to enable an impending move to Jotspot. There are several issues to be handled.

  • First, a lot of my image and file links are hosted on my university account. I need to keep track of those files and links, as I plan to consolidate all of them and store them in one place.
  • A lot of my text is free form. I want to give it more structure so that I can make better sense of it. For example, an entire experiment is detailed in one long paragraph instead of a more organized Goal, Method, Conclusion type of structure.
  • My image links are quite “stupid” and cannot be queried in any smart way. For example, all my protein gel images have random file names. I have to come up with a smart link or file naming system to bring more order and query-ability to the gel images.
  • A significant number of my graphs and plotted data are embedded as PNG or JPEG files. I need to plot data dynamically using JavaScript or Flash. Also, JPEG plots are not query-able!
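As a first step towards the link bookkeeping described in the list above, something like the following sketch could group every URL found in a block of note text by its host, so that the files living on one account can be listed together. The regular expression, the function name and the sample hosts are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Deliberately simple URL matcher (an assumption, not a full URL grammar):
# "http://" or "https://" followed by anything up to whitespace or a quote.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def links_by_host(text):
    """Return a dict mapping each host name to the list of URLs found for it."""
    grouped = defaultdict(list)
    for url in URL_PATTERN.findall(text):
        grouped[urlparse(url).netloc].append(url)
    return dict(grouped)


# Hypothetical note text; the host names are placeholders.
SAMPLE_TEXT = (
    "Gel image: http://people.university.example/~me/img001.jpg "
    "and protocol at http://example.org/protocol.html"
)

grouped_links = links_by_host(SAMPLE_TEXT)
```

Looking up the university host in the resulting dictionary then gives the list of files that still need to be moved.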

Codeitch, or the desire to codify and systematize everything I do

This blog is roughly about my attempts to codify, algorithmize and systematize everything I do. In it I will hopefully detail my march towards coding and getting proficient in a bunch of computer languages. After a long process of looking around, I have narrowed my focus to the following three languages, in no particular order: Java, Python and JavaScript.

The reasons for these will hopefully emerge as I begin posting, but I will try to spell them out here.

Java: I like Java for two reasons. First, it is one of the most widely used languages in the enterprise space. The second, and very important, reason is the Java IDEs. Both Java IDEs I use, namely Eclipse and NetBeans, are free and amazingly featured. The code prompting available in both IDEs makes mastering an API a lot easier than learning the same functionality in other languages or platforms. Also, I love Javadoc! It really makes picking up new APIs a little easier.

Python: My first crack at automating anything came with Perl scripting. I will not lie: if I have to do anything today, I will first use Perl. But after several years of Perl use I found I was re-using very little code. I have to get more object oriented in the way I code, and I never quite got the hang of Perl objects and its namespace conventions! Python, which is at its heart an object oriented scripting language with libraries that easily rival Java's, was a natural scripting alternative. Learning Python, I hope, will teach me how to script smart objects that will beg to be reused.

JavaScript: This is a surprising bedfellow to my codeitch. I want to learn JavaScript simply because it is becoming very fashionable. Google Maps and Gmail have AJAX at their core, and JavaScript is the J in AJAX. Plus, I have always fancied having a web frontend to everything I do, and I am sure JavaScript will beg to be used when that happens.

The above are the three languages I want to master.

Apart from these, there are two platforms that I want to get comfortable with:

Excel: Everyone in the business world uses Excel. Spreadsheets were the PC's killer app, and Excel is, in my mind, Microsoft's great product. I have seen the amazing things you can do in Excel without writing a single line of code, and I want to learn to use its power.

Matlab: This platform from MathWorks is the bread and butter of engineering computation. I am anything but an engineer, but I have seen Matlab's power when it comes to simulations. A lot of the very academic questions that I have in my research could really benefit from my learning the Matlab platform. And no, I am not fully convinced of why exactly I need to use Matlab; maybe I will find a more concrete reason.
