When RegEx meets WordML

Development May 23rd, 2008

One of the excuses to keep me updating this blog is some exhausting logistics work I have had to tackle over the last couple of weeks. Long story short, the requirement is to load an Excel file, filter it with a lookup table, retrieve extra information from a line-based text file, and render a docx file with some words highlighted. Let’s decompose the problem into tasks, one by one:

Retrieve extra information from a line-based text file
A typical regular expression match example.
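
A minimal sketch of that kind of match, assuming a hypothetical `key: value` line layout (the real file format is not shown in this post); `LINE_PATTERN` and `parse_lines` are made-up names for illustration only.

```python
import re

# Hypothetical line format: "<key>: <free text>"; the real file layout may differ.
LINE_PATTERN = re.compile(r"^(?P<key>\w+)\s*:\s*(?P<value>.*\S)\s*$")

def parse_lines(lines):
    """Return a dict mapping each key to its value, skipping lines that do not match."""
    records = {}
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            records[match.group("key")] = match.group("value")
    return records
```

Since a file object iterates line by line, something like `parse_lines(open(path))` would feed it directly.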

Render the docx file with some words highlighted
This task seems easy: as you know, a docx file is ultimately a zipped Office Open XML package, aka text. We can even replace all the words in one shot, as this recipe suggests. Assume the example sentence is:
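
Getting at that text could look roughly like the sketch below. It assumes the body lives in `word/document.xml`, which is where Word normally stores it; the helper names `read_document_xml` and `write_document_xml` are hypothetical.

```python
import zipfile

def read_document_xml(docx_file):
    """Return the main document part of a .docx package as text.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: the body is stored at the usual word/document.xml location.
        return package.read("word/document.xml").decode("utf-8")

def write_document_xml(src_file, dst_file, new_xml):
    """Copy a .docx package, swapping in new_xml as word/document.xml."""
    with zipfile.ZipFile(src_file) as src, \
            zipfile.ZipFile(dst_file, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = new_xml.encode("utf-8")
            # Every other part of the package is copied through unchanged.
            dst.writestr(item, data)
```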
Kun loves programming and beer, would you buy me one beer?
The words to be highlighted are programming and beer.

Microsoft Office Word 2007 breaks the sentence into 7 pieces: Kun loves_, programming, _and_, beer, , would you buy me one_, beer and ?, where _ stands for a leading or trailing space. Each piece is rendered with either the normal style or the highlighted style. That is quite messy.

WordML may support embedded styles somewhere in the bible, but I am going to live with that since it is crunch time, and we can cheat: have you noticed that our highlighted words are always followed by normal text? So we can put the whole sentence in a normal-style enclosure; whenever the RegEx hits a match, we break the enclosure, insert the highlighted word with the highlighted style, then start a new normal enclosure. Brilliant!
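
The cheat can be sketched as follows. The run markup here is an assumption (a bare `w:r`/`w:t` pair for normal text and a yellow `w:highlight` run property for hits; a real document carries more properties), and `highlight_naive` is a made-up name.

```python
import re

# Assumed run markup; a real document may carry more run properties.
NORMAL_OPEN = "<w:r><w:t>"
NORMAL_CLOSE = "</w:t></w:r>"
HIGHLIGHT_RUN = '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>%s</w:t></w:r>'

def highlight_naive(sentence, words):
    """Wrap the sentence in one normal run and break it open around each hit.

    Assumes the sentence contains no XML special characters (&, <, >).
    """
    pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(w) for w in words))

    def break_enclosure(match):
        # Close the normal run, emit the highlighted run, reopen a normal run.
        return NORMAL_CLOSE + HIGHLIGHT_RUN % match.group(1) + NORMAL_OPEN

    return NORMAL_OPEN + pattern.sub(break_enclosure, sentence) + NORMAL_CLOSE
```

For the example sentence this yields four normal runs and three highlighted ones, with the spaces sitting at the edges of the normal runs.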

Hold on, the text is rendered in Word 2007 as:
Kun lovesprogrammingandbeer, would you buy me onebeer?
According to the WordML spec, or the scream of Jeni:

It is also notable that since leading and trailing whitespace is not normally significant in XML; some runs require a designating specifying that their whitespace is significant via the xml:space element.

So the formal solution for this quiz is to add the xml:space="preserve" attribute whenever the normal text has leading and/or trailing space(s). In our case, Kun loves_, _and_ and , would you buy me one_ need that attribute but ? does not. The versatile re.sub also accepts a callback function instead of a string for more complicated substitutions like this. Whenever a highlighted word is followed by a space, the succeeding normal text needs to preserve that space, so we can build the pattern like this:

pattern = re.compile(r"(?<=\W(%s))(\s)" % "|".join(the_list_of_to_be_highlighted_words))

In the callback function, we set the attribute if group(1) is matched. Some corner cases need more post-processing: we need to set the attribute if the highlighted word is not at the head of the line; otherwise we need to eliminate the unnecessary normal enclosure.
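
One caveat: Python’s re module only accepts fixed-width look-behind, so a look-behind built from words of different lengths (programming, beer) will not compile. The sketch below is therefore not the callback approach itself but a swapped-in alternative using re.split with a capturing group; the run markup and the names `normal_run` and `highlight` are assumptions for illustration.

```python
import re

# Assumed markup for a highlighted run; a real document may use a named style.
HIGHLIGHT_RUN = '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>%s</w:t></w:r>'

def normal_run(text):
    """Emit a normal run, marking whitespace as significant only when needed."""
    if text != text.strip():
        return '<w:r><w:t xml:space="preserve">%s</w:t></w:r>' % text
    return "<w:r><w:t>%s</w:t></w:r>" % text

def highlight(sentence, words):
    """Render the sentence as alternating normal and highlighted runs.

    Assumes the sentence contains no XML special characters (&, <, >).
    """
    pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(w) for w in words))
    # With one capturing group, re.split alternates: normal, hit, normal, hit, ...
    pieces = pattern.split(sentence)
    runs = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            runs.append(HIGHLIGHT_RUN % piece)
        elif piece:
            # Empty normal pieces (a hit at the head or tail of the line) are dropped.
            runs.append(normal_run(piece))
    return "".join(runs)
```

Dropping the empty pieces covers the corner case of a highlighted word at the head of the line without a separate post-processing pass.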

Or we can set xml:space="preserve" on all normal text, at the cost of a few extra bytes. It is not perfect but good enough.

I will talk about the CSV later.

PyAWS 0.3.0 released

Development, Web May 6th, 2008

After 6 months, PyAWS 0.3.0 is finally released. You can check out the tarball here.

I almost abandoned this project as I found the XSLT approach more appealing: it is ideal for AJAX applications and easy to integrate via simplejson on the server side. Furthermore, I joined Microsoft, moved to Canada, and had less spare time for a hobby project I was less interested in. The last straw was the unexpected complexity of the BIG FAT refactoring.

Then recently I got an email from a PyAWS user, who reported a bug: an unexpected result from the ListLookup operation. It is so good to hear from users that this library still benefits somebody in the world. So I picked it up, completed the refactoring and released it today. The library is still in active development: the code style stinks, the documentation sucks and, most of all, testing is lacking. I will explain that a little bit here.

I am a big fan of TDD personally, and we have respected testing troops to help build our products at MSFT as well. However, the complexity of PyAWS is far beyond my capacity: there are tens of operations and dozens of response groups, and response groups may be combined, which makes it extremely difficult to cover all the paths. To make it worse, AWS is dynamic: there is no guarantee that consecutive queries return the same result. I may consider automation to facilitate the unit tests. If you have better ideas, please leave a comment here.

Django’s D-day

Development, Web April 7th, 2008

Google just released Google App Engine, with a Python development environment. The environment comes loaded with WSGI, and Django 0.96 “for convenience”.

I just checked the Datastore API; it is a copycat of the Django reference. Google’s engineers hacked Django’s Model to support Google’s datastore, aka BigTable. Bang! Google Account is also supported via the User API; no idea whether it is integrated with Django’s authentication framework, though.

I am so glad that Google has made such a move; I bet the number of Django users will grow exponentially in the next couple of months. Today is Django’s D-day.

Suds makes the soapy world less slippery

Development April 5th, 2008

In the last post, I was whining about the bumps in the road when trying to consume a SOAP web service using Python. Thanks to Olosta’s suggestion, I found Suds.

The cute yellow rubber duck makes the soapy world less slippery. There is no need to generate code with an external tool like wsdl.exe for C#; just load the WSDL at runtime, and the ServiceProxy object dynamically generates the function calls for you. It is still actively developed; salute to joetel.

Something needs tailoring to adapt to Microsoft Office SharePoint Server: connection persistence. As you may know, the default authentication used by SharePoint web services is NTLM, undocumented but well known to the public. NTLM authenticates the connection, so in the current suds implementation each method invocation incurs a redundant NTLM negotiation-challenge-and-response. I will dig more into this issue; stay tuned.

Who would be an old school Python developer?

Development April 3rd, 2008

Two posts (here and here) on programming.reddit.com discussed the state-of-the-art Python IDEs. Two of them really aroused my interest: Komodo Edit and IronPython Studio, the latter honorably mentioned in the comments.

Komodo Edit is the shrunk-and-free version of ActiveState’s flagship Komodo IDE. It is rooted in the same technology as Firefox: it uses the XUL framework to render the UI and the same add-on mechanism to support 3rd-party packages. The UI is quite clean, or in other words, lacking eye candy:

Komodo Edit in action

Furthermore, thanks to ActiveState’s generosity, there is an open source initiative, openkomodo’s Snapdragon project, to build a full-fledged IDE upon Komodo Edit’s code base. Though I suffer from the huge memory footprint of Firefox from time to time, I still believe this is a much more lightweight IDE than the versatile Eclipse.

Another option is IronPython Studio, based upon award-winning Microsoft Visual Studio technology. Whether you like it or not, we have to admit that lots of programmers feel at home with a familiar interface. However, strictly speaking it is not a Python IDE: you are locked into IronPython, and most likely you will not resist the temptation to use .NET and WPF. And at the end of the day, the ultimate question may emerge: “Why not use C#? The syntax is quite similar, and we are no longer treated as second-class developers.” I doubt that Silverlight will make a difference if you are not a Web developer.

Last but not least, a question that came up as I read through all the comments: I was quite amazed to find so few comments from the die-hard old school guys. Here is one comment about using Emacs and python mode, but how about Vim users? Did they just disregard this kind of flame-prone discussion, or have they already lost faith in convincing the rest of the world?

So if you happen to be a heavy-weight Vim user and program in Python, I would appreciate it if you could drop a message here to share your experience.