When RegEx meets WordML


Last week I was busy with some exhausting logistics work. The requirement was to load an Excel file, filter it against a lookup table, retrieve extra information from a line-based text file, and render a docx file with some words highlighted. Let's decompose the problem into tasks, one by one:

Retrieve extra information from a line-based text file. This is a typical regular expression matching exercise.

Render the docx file with some words highlighted. This task is quite trivial: as you may know, a docx file is ultimately a zipped Office Open XML package, aka text. We can even replace all the words in one shot, as this recipe suggests. Assume the example sentence is:

Kun loves programming and beer, would you buy me some beer?

The to-be highlighted words are programming and beer.

Microsoft Office Word 2007 breaks the sentence into 8 pieces: "Kun loves", "programming", "and", "beer", ",", "would you buy me some", "beer" and "?". Each piece is rendered as a separate run with either the normal style or the highlighted style. That is quite messy.
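Since the docx is just a zip archive, the runs Word produces can be inspected (and later replaced) with nothing but the standard library. The following is a minimal sketch, assuming the main document part lives at word/document.xml and is UTF-8 encoded, as it is in files saved by Word; the function names are made up for illustration.

```python
import io
import zipfile

MAIN_PART = "word/document.xml"  # assumed location of the main document part


def read_document_xml(docx_file):
    """Return the main document part of a .docx as a string."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(MAIN_PART).decode("utf-8")


def write_document_xml(src, dst, new_xml):
    """Copy the archive src to dst, swapping in new_xml for the main part."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == MAIN_PART:
                data = new_xml.encode("utf-8")
            # Every other part (styles, relationships, ...) is copied as-is.
            zout.writestr(item, data)
```

Both functions accept a path or a file-like object, so the same code works on a file on disk or on an in-memory io.BytesIO buffer.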

WordML may support embedded styles somewhere in the bible, but I am going to live with that since it is crunch time, and we can cheat: have you noticed that our highlighted words are always followed by normal text? So we can put the whole sentence in a normal-style enclosure; whenever the RegEx hits a match, we close the enclosure, insert the matched word in a highlighted-style enclosure, then open a new normal enclosure.
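The cheat can be sketched as a single re.sub call with a callback. This is only an illustration: the run markup constants, the yellow w:highlight used as the "highlighted style", and the name highlight_naive are my assumptions, and the words are assumed to be plain alphanumeric strings.

```python
import re
from xml.sax.saxutils import escape

# Assumed run markup; a real document may carry more run properties.
NORMAL_OPEN = "<w:r><w:t>"
HIGHLIGHT_OPEN = '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>'
CLOSE = "</w:t></w:r>"


def highlight_naive(sentence, words):
    """Wrap sentence in one normal run and break it at every match."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, words)))

    def break_enclosure(match):
        # Close the normal run, emit the highlighted run, reopen a normal run.
        return CLOSE + HIGHLIGHT_OPEN + match.group(1) + CLOSE + NORMAL_OPEN

    # Escape &, < and > first so the sentence is safe inside XML text.
    return NORMAL_OPEN + pattern.sub(break_enclosure, escape(sentence)) + CLOSE
```

For the example sentence with the words programming and beer, this produces seven runs, three of them highlighted.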

Hold on, the text is rendered in Word 2007 as: Kun lovesprogrammingandbeer, would you buy me somebeer? According to the WordML spec, or the scream of Jeni:

It is also notable that since leading and trailing whitespace is not normally significant in XML; some runs require a designating specifying that their whitespace is significant via the xml:space element.

So the formal solution for this quiz is to add the xml:space="preserve" attribute whenever the normal text has leading and/or trailing space(s). The versatile re.sub also accepts a callback function instead of a replacement string for more complicated substitutions like this. Whenever a highlighted word is followed by a space, the following normal text needs to preserve that space, so we can build the pattern like this:

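import re
pattern = re.compile(r"\b(%s)\b(\s?)" % "|".join(map(re.escape, the_list_of_to_be_highlighted_words)))

Here group(1) is the highlighted word and group(2) is the whitespace that follows it, if any. In the callback function, we set the attribute on the next normal enclosure if group(2) is not empty. Some corner cases need more post-processing: if the highlighted word is not at the head of the line, the normal text before it ends with a space and needs the attribute too; otherwise the leading normal enclosure is empty and we need to eliminate it.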
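To make the corner cases concrete, here is a minimal sketch that handles them in one pass. Note that it swaps the re.sub callback for re.split, which hands back the normal text and the matched words as alternating pieces, so each piece can decide for itself whether it needs xml:space="preserve". The helper names and the yellow w:highlight markup are assumptions for illustration.

```python
import re
from xml.sax.saxutils import escape

# Assumed run properties for the highlighted style.
HIGHLIGHT_RPR = '<w:rPr><w:highlight w:val="yellow"/></w:rPr>'


def normal_run(text):
    """Normal run; empty text yields no run, edge whitespace is preserved."""
    if not text:
        return ""  # e.g. the highlighted word sits at the head of the line
    attr = ' xml:space="preserve"' if text != text.strip() else ""
    return "<w:r><w:t%s>%s</w:t></w:r>" % (attr, escape(text))


def highlighted_run(text):
    """Run carrying the highlighted style."""
    return "<w:r>%s<w:t>%s</w:t></w:r>" % (HIGHLIGHT_RPR, escape(text))


def highlight(sentence, words):
    pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, words)))
    # With one capturing group, split() returns
    # [normal, word, normal, word, ..., normal].
    pieces = pattern.split(sentence)
    runs = []
    for index, piece in enumerate(pieces):
        runs.append(highlighted_run(piece) if index % 2 else normal_run(piece))
    return "".join(runs)
```

With the example sentence, the three normal runs that touch a highlighted word get the attribute, while the final "?" run does not.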

Or we can set xml:space="preserve" on all normal text, at the cost of a few extra bytes per run. It is not perfect but good enough.
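The blanket variant is about as short as it gets. A sketch, assuming the text elements were written as bare <w:t> tags without attributes (preserve_all is a made-up name):

```python
def preserve_all(document_xml):
    """Add xml:space="preserve" to every attribute-less <w:t> element."""
    # "<w:t>" includes the closing bracket, so <w:tab/> or <w:tbl> never match.
    return document_xml.replace("<w:t>", '<w:t xml:space="preserve">')
```

The overhead is 21 bytes per text element before compression, which the zip container largely absorbs.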

I will talk about the CSV later.