XML is a great way to organize information. I first learnt of the power of XML to systematize information when I used it to output a whole bunch of search results from NCBI in the TinySeq XML format. Once I had this XML document, I could read it into Excel and then very easily analyze the information, since it was nicely laid out as an Excel sheet.
Backpackit, a service I use to take notes detailing my experimental research results, outputs all of the account data in XML format. Before I can move this data elsewhere, it helps for me to understand the data structure. So the first task I set out to do was to parse the XML output.
I decided to use Python for this, because I felt using Java here would be like using an elephant to crush a fly (or whatever the expression is). Also, a lot of the data is text, and I have always used Perl to handle text. So the general basis for my switch will be: what I did in Perl before, I would like to do in Python now. Java will be used for more heavyweight tasks.
What I needed my program to do was:
- Read the XML output
- Create objects for each element or node in the output
I can then imagine that once I have these objects I can ask questions like how many objects have embedded images, how many objects have outgoing links, and so on.
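The two steps above, and the kind of questions I want to ask afterwards, could be sketched roughly as follows. This is only a minimal sketch: the sample XML and its tag names (backpack, page, note, img, a) are made up for illustration and are not the real Backpackit export format, and the Node class and helper functions are my own hypothetical names.

```python
import xml.dom.minidom
from dataclasses import dataclass, field

# Made-up sample data; the real export will have a different structure.
SAMPLE = """<backpack>
  <page title="Gel run 1">
    <note>Ran the gel. <img src="http://example.edu/gel1.jpg"/></note>
    <note>See <a href="http://example.org/protocol">protocol</a></note>
  </page>
  <page title="Ideas">
    <note>No media here</note>
  </page>
</backpack>"""


@dataclass
class Node:
    """One object per XML element (hypothetical helper class)."""
    tag: str
    attributes: dict
    text: str
    children: list = field(default_factory=list)


def build(element):
    """Recursively turn a DOM element into a Node object."""
    text = "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()
    children = [
        build(child) for child in element.childNodes
        if child.nodeType == child.ELEMENT_NODE
    ]
    return Node(element.tagName, dict(element.attributes.items()), text, children)


def walk(node):
    """Yield a node and then all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def contains(node, tag):
    """True if the node or any descendant has the given tag."""
    return any(descendant.tag == tag for descendant in walk(node))


root = build(xml.dom.minidom.parseString(SAMPLE).documentElement)
notes = [node for node in walk(root) if node.tag == "note"]
notes_with_images = sum(contains(note, "img") for note in notes)
notes_with_links = sum(contains(note, "a") for note in notes)
print(len(notes), notes_with_images, notes_with_links)
```

Once the real tag names are known, only the sample data and the tags passed to contains() should need to change.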
The “Dive Into Python” book gave me a quick introduction to the xml.dom package. I then ran into some encoding (codec) issues and learnt all about the “utf8” and “iso8859” character encodings. Once I learnt how to handle the UnicodeEncodeError, I had a full-fledged three-line program that parsed my input file, created the document object and, as proof of successful parsing, printed my XML file back out.
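For reference, a parse-and-write-back program along these lines might look like the sketch below. It is not the exact program from the screencast: it assumes Python 3 and the standard xml.dom.minidom module, the file names and sample content are invented, and the roundtrip function is a hypothetical helper. Writing the output with an explicit UTF-8 encoding is one way to avoid a UnicodeEncodeError when the data contains non-ASCII characters.

```python
import os
import tempfile
import xml.dom.minidom


def roundtrip(in_path, out_path):
    """Parse in_path into a DOM document and write it back out as UTF-8."""
    document = xml.dom.minidom.parse(in_path)
    # An explicit encoding avoids relying on the platform default,
    # which may not be able to represent every character.
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(document.toxml())
    return document


# Demo on an invented ISO-8859-1 file containing non-ASCII characters.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "export.xml")
copy = os.path.join(workdir, "copy.xml")
with open(source, "w", encoding="iso-8859-1") as handle:
    handle.write(
        '<?xml version="1.0" encoding="iso-8859-1"?>'
        "<note>caf\u00e9 30 \u00b5l</note>"
    )

document = roundtrip(source, copy)
print(document.documentElement.firstChild.data)
```

The parser reads the encoding from the XML declaration, so the input can be ISO-8859-1 while the copy is written as UTF-8.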
The screencast above documents my travails.
First, a tangent: in order to reduce desktop clutter and focus on the task at hand, I have decided to experiment with “A distraction free Desktop” as screencast by Jon Udell. His accompanying blog post pointed me to a great Mac utility to clean up my desktop, and also to iTerm, which I believe is a very good alternative to the Terminal app offered by Apple.
One of the first tasks I want to focus on is how to organize my data from Backpackit to enable an impending move to JotSpot. There are several issues to be handled.
- First, a lot of my image and file links are hosted on my university account. I need to keep track of those files and links, as I plan to consolidate all of them and store them in one place.
- A lot of my text is free form. I want to give it more of a structure so that I can make better sense of it. For example, an entire experiment is detailed in one long paragraph instead of a more organized Goal, Method, Conclusion type structure.
- My image links are quite “stupid” and cannot be queried in any smart way. For example, all my protein gel images have random file names. I have to come up with a smart link or file naming system to bring more order and query-ability to the gel images.
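For the first issue in the list, keeping track of which links point where, a first pass could be as simple as pulling every URL out of the note text and grouping the URLs by host. The sketch below assumes plain http/https links embedded in text; the sample text, the host names and the links_by_host function are all made up for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Rough pattern for http/https links inside free-form text.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def links_by_host(text):
    """Group every URL found in text by the host it lives on."""
    grouped = defaultdict(list)
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,;)")  # drop trailing sentence punctuation
        grouped[urlparse(url).netloc].append(url)
    return dict(grouped)


# Invented sample note text.
sample = (
    "Gel from Monday: http://www.example.edu/~me/img_0042.jpg and the "
    "protocol at http://example.org/protocol.pdf, repeat run at "
    "http://www.example.edu/~me/a1.jpg."
)

for host, urls in links_by_host(sample).items():
    print(host, len(urls))
```

Everything listed under the university host is then the set of files that needs to be moved and consolidated.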
This blog is roughly about my attempts to codify, algorithmize and systematize everything I do. In it I will hopefully detail my march to coding and getting proficient in a bunch of computer languages. After a long process of looking around, I have narrowed my focus to the following three languages, in no particular order.
The reasons for these will hopefully emerge as I begin posting, but I will try to spell them out here.
Java: I like Java for two reasons. It's one of the most widely used languages in the enterprise space, and the second and very important reason is the Java IDEs. Both Java IDEs I use, namely Eclipse and NetBeans, are free and amazingly featured. The code prompting available in both IDEs makes mastering an API a lot easier than learning the same functionality using other languages or platforms. Also, I love Javadoc! It really makes picking up new APIs a little easier.
Python: My first crack at automating anything came with Perl scripting. I will not lie: if I have to do anything today, I will first use Perl. But after several years of Perl use I found I was re-using very little code. I have to get more object-oriented in the way I code, and I never quite got the hang of Perl objects and its namespace conventions! Python, which is at its heart a purely object-oriented scripting language with libraries that easily rival Java's, was a natural scripting alternative. Learning Python, I hope, will teach me how to script smart objects that will beg to be reused.
The above are the three languages I want to master.
Apart from these, there are two platforms that I want to get comfortable with, and they are:
Excel: Everyone in the business world uses Excel. Spreadsheets were the PC's killer app, and Excel is in my mind Microsoft's great product. I have seen the amazing things you can do in Excel without writing a single line of code, and I want to learn to use its power.
Matlab: This platform from MathWorks is the bread and butter of engineering computation. I am anything but an engineer, but I have seen Matlab's power when it comes to simulations. A lot of the very academic questions that I have in my research can really benefit from learning the Matlab platform. And no, I am not fully convinced on why exactly I need to use Matlab; maybe I will find a more concrete reason.