WYS is not always WYG in python.re

pythonregex

After almost two months of hard work, I finally checked in the feature, and tonight I decided to relax with some leisurely Python programming.

The side project is quite trivial: fetch the HTML content, search for keywords in the thread, and build a table of contents with links for navigation. The only intriguing bit that makes this post worth five minutes of your time is that the page is in Chinese and encoded in GB2312.

Long story short, I am trying to extract the total number of posts in the thread with this regular expression:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)')

The first catch is that I have to declare the encoding of the source file, because the Python interpreter complains:

SyntaxError: Non-ASCII character '\xe6' in file ./elevator.py on line 17, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

OK, I will stick to UTF-8, so I add this declaration as the second line:

# -*- coding: utf-8 -*-

It still does not work, and the dumped content of the page is a total mess. Oops, I forgot to decode the content to Unicode. Use codecs to wrap the handle opened by urlopen:

import codecs
import urllib

gb = codecs.lookup('gb2312')
# load the page and decode it from GB2312 to Unicode
content = gb.streamreader(urllib.urlopen(url)).read()
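For readers on a current interpreter, here is a minimal sketch of the same decoding step. It is written for Python 3 (an assumption, since the post itself targets Python 2), and the sample markup and the in-memory stream are made up to stand in for the real page and for urlopen:

```python
import codecs
import io

# Assumed sample: the bytes a GB2312 page would send over the wire.
raw = '<b class="page">总数 42</b>'.encode('gb2312')

# Wrap a byte stream in a decoding reader, as the post does with urlopen().
reader = codecs.getreader('gb2312')(io.BytesIO(raw))
content = reader.read()

# Decoding the whole byte string in one go yields the same text.
same = raw.decode('gb2312')
```

Either way, everything after this point should work on decoded text rather than on the raw bytes.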

And don't forget to add either the Unicode prefix (u'...') or the re.UNICODE flag to the pattern:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)', re.UNICODE)

Still no luck. Yet the same pattern works in the Python console against faked data, and it also works in the script if we change it a little:

pattern = re.compile('(?<=<b class="page">.{2} )(?P<total>\d+)', re.UNICODE)

Looks like the troublemaker is the non-Latin characters: 总数. Let's play a little in the pdb console:

(Pdb) '总数'
'\xe6\x80\xbb\xe6\x95\xb0'
(Pdb) '总数'.decode('utf8')
u'\u603b\u6570'
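The pdb output suggests why the match fails: the un-prefixed literal holds six UTF-8 bytes, while the decoded page holds two code points. The sketch below tries to reproduce that mismatch on Python 3 (an assumption; the post uses Python 2). Decoding the bytes as Latin-1 is only a rough model of how a byte-string pattern ends up being compared against Unicode text, and the sample content is invented:

```python
import re

content = '<b class="page">总数 42</b>'        # assumed sample of the decoded page
utf8_bytes = '总数'.encode('utf-8')             # what the un-prefixed literal held
as_code_points = utf8_bytes.decode('latin-1')   # six characters, one per byte

# Pattern built from the six byte values versus the two real characters.
byte_style = re.compile('(?<=<b class="page">' + as_code_points + r' )(?P<total>\d+)')
text_style = re.compile(r'(?<=<b class="page">总数 )(?P<total>\d+)')

missed = byte_style.search(content)   # no match: six characters never line up with two
found = text_style.search(content)    # match: the same two code points on both sides
```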

And it finally works with the hard-coded Unicode escapes in a Unicode literal:

pattern = re.compile(u'(?<=<b class="page">\u603b\u6570 )(?P<total>\d+)', re.UNICODE)

We can use the decode method to avoid the ugly escapes and keep the pattern readable:

pattern = re.compile('(?<=<b class="page">总数 )(?P<total>\d+)'.decode('utf-8'), re.UNICODE)

Note that the codec passed to decode MUST match the encoding declared at the top of the source file.
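A small sketch of what goes wrong when the two disagree, again assuming Python 3 and using an encode call to imitate the bytes stored in a UTF-8 source file:

```python
# Bytes as they would sit in a source file declared as utf-8.
source_bytes = '总数'.encode('utf-8')

# Decoding with the declared codec restores the original text.
right = source_bytes.decode('utf-8')

# Decoding with a different codec either fails or yields other characters.
try:
    wrong = source_bytes.decode('gb2312')
except UnicodeDecodeError:
    wrong = None
```

Whichever branch is taken, the result is not the text that was typed in the editor, so the pattern built from it cannot match.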

Some speculation based on these observations:

  • re.UNICODE does not force Unicode mode; it only redefines escapes such as \b and \w.
  • A Unicode pattern matched against a Unicode string implicitly invokes Unicode mode. That explains why some patterns work only in the Python console: there, both the pattern and the data are UTF-8 byte strings, so re really runs in 8-bit mode!
  • The Python interpreter does not convert a plain string literal to Unicode even though the source encoding is declared.
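The first point can be checked with a short experiment. The sketch assumes Python 3, where the roles are flipped: Unicode semantics are the default for text patterns and re.ASCII switches them off. The sample text is made up:

```python
import re

text = '总数 42'

# With the default (Unicode) semantics, \w accepts the Chinese characters.
unicode_word = re.search(r'\w+', text).group()

# re.ASCII narrows \w to [a-zA-Z0-9_], so the first match is the digits.
ascii_word = re.search(r'\w+', text, re.ASCII).group()

# A literal in the pattern is compared character by character either way.
literal_hit = re.search('总数', text, re.ASCII)
```

The flag changes what the character classes mean, but it has no say in whether the literal characters of the pattern and of the subject are the same kind of string.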

Please leave your insights in the comments. Thanks.

UPDATE: First of all, thanks for all the comments. It seems I made a typo when testing the pattern with the Unicode prefix. Here are the test cases:

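patterns = [
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)'.decode('utf8'), re.UNICODE),
    re.compile(ur'(?<=<b class="page">总数 )(?P<total>\d+)', re.UNICODE),
    re.compile(u'(?<=<b class="page">总数 )(?P<total>\d+)', re.UNICODE),
    re.compile('(?<=<b class="page">总数 )(?P<total>\d+)', re.UNICODE),
]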

print [ pattern.search(s) for pattern in patterns ]

The output is:

[<_sre.SRE_Match object at 0xb7c18260>, <_sre.SRE_Match object at 0xb7c18360>, <_sre.SRE_Match object at 0xb7c183a0>, None]
