Monthly Archives: November 2008

Wanna learn Python? Join comp.lang.python: Regexp matching of fixed-digit numbers

One of the fun parts of teaching yourself a programming language is getting stuck on a trivial problem. When this happens I generally try a few things, most of them involving searching. I have tabulated these activities along with their apparent value to me, both for solving the immediate problem in the short term and for improving my long-term understanding of the subject and my tacit knowledge.

Table describing merits of newsgroup memberships

In most cases the above activities take almost an hour. When, despite all this, I don't get an answer, I start to get frustrated. Most of my frustration stems from being unable to define the problem, because I don't have a good enough grasp of the vocabulary needed to improve my searching and querying.

This is when I resort to what I call my ultimate resource: I get outside human help and contact a newsgroup. For almost 100% of these problems I get a very satisfying answer from these forums. Forums also have the social benefit of expanding my network and, importantly, creating a lasting record of the solution.

Let me illustrate all of this with an actual example. I had to develop a regular expression to parse text from a file and extract only those sentences that contained four-digit numbers.

For example:

I have 1234 dollars left in the bank (desired result: MATCH)

I dont have 12345 issues to deal with (desired result: SKIP)

My initial regular expression was

[python]

import re
p = re.compile(r'\d{4}')  # four consecutive digits, anywhere in the text

[/python]

Based on what I knew, this would search for “exactly” four-digit numbers. But I soon realized that this regexp matched both 1234 and 12345. What r'\d{4}' actually says is “match four consecutive digits”, with no constraints on what is on either side (in “12345” it simply matches the first four digits). So all of “1234”, “12345”, “a1234abcd” and “1234asdf” were matched.
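To make that concrete, here is a small sketch using only the standard re module. The helper name first_match is made up for illustration, and the sample strings are the ones quoted above:

```python
import re

# The unanchored pattern from the post: four digits, anywhere in the string.
pattern = re.compile(r'\d{4}')


def first_match(text):
    """Return the first substring the pattern matches, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


samples = [
    "I have 1234 dollars left in the bank",
    "I dont have 12345 issues to deal with",
    "a1234abcd",
    "1234asdf",
]

for s in samples:
    # Every sample matches, because nothing constrains the neighbouring characters.
    print(repr(s), "->", first_match(s))
```

Every sample is reported as a match, including the five-digit one, where the pattern simply stops after the first four digits.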

Clearly, “exact” in the definition of a regular expression means something different from what “exact” means to a non-savant regexp writer.

I then assumed that this had something to do with greedy vs. non-greedy matching (hence the approach in row 2 of the table above). This too quickly proved to be a non-starter. I was clearly stuck and had no idea how to proceed with solving the problem, or even how to go about finding more information.

That's when I wrote in to the comp.lang.python newsgroup. Within 15 minutes I had a reply. Although it wasn't the right answer, the promptness of the reply was well worth it. A few minutes later I had a few suggestions, all of which listed multiple approaches to my problem. I quickly learnt that I had not defined the regexp fully enough: I also had to codify the fact that the exact match had to be a separate “word” unto itself.

Three other posts helped me arrive at the perfect solution, the regexp

r'\b\d{4}\b' “match a four-digit number with a word boundary on either side”

Or more intricate examples that allowed me to learn more about regexp syntax, e.g. expressions containing negative look-behinds and look-aheads, extension notation (the (?...) constructs) and non-grouping (non-capturing) parentheses:

r'(?<!\d)\d{4}(?!\d)' “a four-digit number not preceded by a digit and not followed by a digit”

Or

r'(?:\D|\b)\d{4}(?:\D|\b)' “a four-digit number surrounded by non-digits or word boundaries”
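The three suggestions can be compared side by side with a small sketch like the one below. The dictionary PATTERNS and the helper matching_patterns are names invented for this illustration, not something taken from the newsgroup replies, and the last two test strings are the extra cases mentioned earlier, included to show where the patterns differ:

```python
import re

# The three patterns suggested on the newsgroup, keyed by a descriptive label.
PATTERNS = {
    "word_boundary": re.compile(r'\b\d{4}\b'),
    "lookaround": re.compile(r'(?<!\d)\d{4}(?!\d)'),
    "non_grouping": re.compile(r'(?:\D|\b)\d{4}(?:\D|\b)'),
}


def matching_patterns(text):
    """Return the labels of the patterns that find a four-digit number in text."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(text)]


tests = [
    "I have 1234 dollars left in the bank",   # desired: match
    "I dont have 12345 issues to deal with",  # desired: skip
    "a1234abcd",                              # digits glued to letters
    "1234asdf",
]

for t in tests:
    print(repr(t), "->", matching_patterns(t))
```

All three accept the first sentence and reject the second. They disagree on “a1234abcd” and “1234asdf”: the word-boundary version rejects them because letters count as word characters, while the other two accept them. Which behaviour is right depends on whether the number has to be a separate word or merely not part of a longer number.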

So, barely 20 minutes after my post to the newsgroup, I had not only solved my particular problem but, more importantly, considerably increased my knowledge of regular expression behavior and syntax. Most of the replies had not just applied their solutions to my test case, but had also tried to clear up my misunderstandings of how regexps work. Some posts even provided complete code and novel test cases with results.

Joining an active newsgroup and posing your questions with some effort almost always gets you an answer. In all of my experience on these newsgroups, the more effort you spend in defining your problem, the more value you gain from the interaction.

I am quite convinced that joining a newsgroup is well worth it when you start learning Python or any other language or platform.

Why Bioinformaticians have to grin and bear it!

If anyone feels this post is provocatively titled, I can only offer as defence the sagging traffic to this blog.

But seriously! I am writing to defend biologists against what I perceive as a tendency among bioinformaticians to complain about them and their failure to respect things like structured data, file formats, data portability and various other concepts at centre stage in data management.

The bottom line is that experimentalists do what they are trained to do best: experiment. In most cases (and sadly, I agree) experiment design does not extend to the data storage or management level. People tend to store data in ways that make sense to them, and mostly only to them. We are all fortunately wired differently, and most things make sense to an experimenter only when ordered or “structured” in the way that he or she likes.

To offer a simple example, I have often found others commenting that they cannot make sense of the table layout I have chosen or the order in which I load my protein samples on my electrophoresis gel. In most cases I disagree with the offered suggestion and persist with my “bad” ways!

In defence I offer the statement: “It is never difficult to reformat data; it is always difficult to repeat an experiment.”

To elaborate: forcing formats or structure on an experimenter is often far more difficult than reformatting the data using a computational approach. An experimental workflow can incorporate structure extending all the way to the data storage level, but this often increases the level of effort for the experimenter and may even complicate the experiment itself. Experimentation is difficult; bioinformatics is easier!

Therefore I say that bioinformatics will have to live with the burden of data-munging! The only way out of it is to catch young experimenters and teach them some aspects of data mining or electronic record management at the very same time that we teach them how to conduct an experiment.

Don't get me wrong, I am not condoning poor experimental record keeping or unstructured data! In most cases a simple re-think can ensure that our spreadsheets are more comprehensible or structured. But it will always be easier to write a thousand-line reformatting script than to force an experiment to output its data in a format that will make a resident bioinformatician happy.

Till a generation of structured-data-aware scientists takes to the bench in a big way, I am sorry to say, bioinformaticians will just have to grin and bear it!

Refs: “The Saunders Principle”, “Comment on the Saunders Principle from Chris a Miller”