Wanna learn Python? Join comp.lang.python: Regexp matching of fixed-digit numbers
Nov 24th, 2008 by harijay
One of the fun parts of learning a programming language yourself is getting stuck on a trivial problem. When this happens I generally try a few things, most of which involve searching. I have tabulated these activities along with their apparent value to me, both for solving the immediate problem in the short term and for my long-term understanding of the subject, in addition to improving my tacit knowledge.

In most cases the above activities take almost an hour. When despite all this I don't get an answer, I start to get frustrated. Most of my frustration stems from being unable to define the problem, because I don't have a good enough grasp of the vocabulary to improve my searching and querying.
This is when I resort to what I call my ultimate resource: I get outside human help and contact a newsgroup. For almost 100% of these problems I get a very satisfying answer from these forums. Forums also have the social benefit of expanding my network and, importantly, of creating a lasting record of the solution.
Let me illustrate all of this with an actual example. I had to develop a regular expression to parse text from a file and extract only the sentences that contained four-digit numbers.
For example:
I have 1234 dollars left in the bank (result desired: MATCH)
I dont have 12345 issues to deal with (result desired: SKIP)
My initial regular expression was:
[python]
import re
p = re.compile(r'\d{4}')
[/python]
Based on what I knew, this would search for "exactly" four-digit numbers. But I soon realized that this regexp matched both 1245 and 12345. When you write r'\d{4}', what you are actually saying is "match exactly four digits", with no constraints on what is on either side. So all of "1234", "12345", "a1234abcd" and "1234asdf" were matched.
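Here is a small sketch of that behaviour, using the strings mentioned above. The `samples` list and the extra "123" entry are only there for illustration. `search` reports a hit wherever four consecutive digits occur, whatever their neighbours are.

```python
import re

p = re.compile(r'\d{4}')

# Illustrative inputs; "123" is added to show a case with too few digits.
samples = ["1234", "12345", "a1234abcd", "1234asdf", "123"]

# search() succeeds as soon as any four consecutive digits are found,
# even when they sit inside a longer run of digits or letters.
results = {s: bool(p.search(s)) for s in samples}

for s, hit in results.items():
    print(s, "->", "match" if hit else "no match")
```

Only "123" is rejected here, because it never contains four digits in a row.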
Clearly, "exact" in the definition of a regular expression means something different from what "exact" means to a non-savant regexp writer.
I then assumed that this had something to do with greedy vs non-greedy matching (hence the approach in row 2 of the table above). This too quickly proved to be a non-starter. I was clearly stuck and had no idea how to proceed with solving the problem, or even how to go about finding more information.
That's when I wrote in to the comp.lang.python newsgroup. Within 15 minutes I had a reply. Although it wasn't the right answer, the promptness of the reply was well worth it. A few minutes later I had a few more suggestions, which between them listed multiple approaches to my problem. I quickly learnt that I had not defined the regexp fully enough. I also had to codify the fact that the exact match had to be a separate "word" unto itself.
Three other posts helped me arrive at the perfect solution, the regexp
r'\b\d{4}\b' ("match a four-digit number with a word boundary on either side"),
or more intricate examples that let me learn more about regexp syntax, for example expressions using extension notation, negative lookbehinds and non-capturing parentheses:
r'(?<!\d)\d{4}(?!\d)' ("a four-digit number not preceded by a digit and not followed by a digit")
or
r'(?:\D|\b)\d{4}(?:\D|\b)' ("a four-digit number surrounded by non-digits or word boundaries")
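To compare the three suggestions side by side, here is a minimal sketch that filters a list of lines with each pattern. The helper `matching_lines`, the labels in `patterns` and the third test line are my own names and inputs for illustration, not something from the newsgroup replies.

```python
import re

# The three patterns from the replies, under illustrative labels.
patterns = {
    "word boundary": re.compile(r'\b\d{4}\b'),
    "lookaround": re.compile(r'(?<!\d)\d{4}(?!\d)'),
    "non-digit or boundary": re.compile(r'(?:\D|\b)\d{4}(?:\D|\b)'),
}

lines = [
    "I have 1234 dollars left in the bank",
    "I dont have 12345 issues to deal with",
    "a1234abcd",  # extra case: digits glued to letters
]

def matching_lines(pattern, lines):
    """Return the lines in which the pattern is found somewhere."""
    return [line for line in lines if pattern.search(line)]

hits = {name: matching_lines(p, lines) for name, p in patterns.items()}

for name, found in hits.items():
    print(name, "->", found)
```

All three keep the "1234 dollars" line and skip the "12345 issues" line. They differ on "a1234abcd": the word-boundary version rejects it because letters count as word characters, while the other two accept it since the neighbours are merely non-digits. Note also that the third pattern consumes the neighbouring non-digit characters as part of the match, whereas the lookaround version matches only the four digits.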
About 20 minutes after my post to the newsgroup, I had not only solved my particular problem but, more importantly, also considerably increased my knowledge of regular expression behavior and syntax. Most of the replies had not just applied their solutions to my test case, but had also tried to clear up my misunderstandings of how regexps work. Some posts even provided whole code and novel test cases with results.
Joining an active newsgroup and posing your questions with some effort always gets you an answer. In all of my experience on these newsgroups, the more effort you spend in defining your problem, the more value you gain from the interaction.
I am quite convinced that joining a newsgroup is well worth it when you start learning Python or any other language or platform.