Nitty-gritty Regular Expression Details


I played the super nerdy and fun Regex Crossword, which honed my rusty regex-fu and taught me some nitty-gritty regex details (the demonstrations below use the IPython REPL):

+ matches 1 or more repetitions of the preceding RE, not of the content it matched.

This is less obvious when the preceding RE is a group, because each repetition may match a different alternative:

In [17]: re.match(r'G(H|O)+', 'GHO')
Out[17]: <_sre.SRE_Match at 0x1049cbeb8>

To match the same content that the group captured, use a backreference, \number:

In [18]: re.match(r'G(H|O)\1', 'GHH')
Out[18]: <_sre.SRE_Match at 0x1049cbf30>
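
As a rough sketch of the difference, the script below runs both patterns and checks what each one captures. The variable names are made up for illustration.

```python
import re

# '+' repeats the pattern (H|O), so each repetition may pick a different letter.
repeated = re.match(r'G(H|O)+', 'GHO')
print(repeated.group(0))  # the whole match
print(repeated.group(1))  # a repeated group keeps only its last repetition

# \1 must match exactly the text that group 1 captured.
same_letter = re.match(r'G(H|O)\1', 'GHH')
other_letter = re.match(r'G(H|O)\1', 'GHO')
print(same_letter is not None)  # 'H' followed by 'H' satisfies \1
print(other_letter is None)     # 'H' followed by 'O' does not
```

Note that after a repeated group, group(1) holds only the final repetition, which is another reason to reach for \number when the content itself matters.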

- in [] is interpreted as a literal "-" only if it is escaped (e.g. [a\-z]) or placed as the first or last character.

A - anywhere else in [] defines a range that is inclusive at both ends and ordered by ASCII or Unicode code point. I was bitten hard by this when sanitizing the names of uploaded files: it cost me an immediately pushed hotfix and some humiliation from my colleague.
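
To make the pitfall concrete, here is a hypothetical sanitizer (not the code from the incident above, which is not shown here) that is meant to keep only lowercase letters, digits, ".", "-" and "_". The pattern names and the sample file name are assumptions for illustration.

```python
import re

# Hypothetical sanitizer: strip everything except a-z, 0-9, '.', '-' and '_'.
# Buggy: '.-_' is read as a range from '.' (0x2E) to '_' (0x5F), which also
# covers '/', ':', '<', '>', '?', '@', the uppercase letters and more.
BUGGY = re.compile(r'[^a-z0-9.-_]')

# Fixed: '-' is the last character of the class, so it is a literal '-'.
FIXED = re.compile(r'[^a-z0-9._-]')

name = '../etc/passwd'
buggy_result = BUGGY.sub('', name)
fixed_result = FIXED.sub('', name)
print(buggy_result)  # the slashes survive because '/' falls inside the range
print(fixed_result)  # the slashes are removed
```

Putting the - last (or escaping it) costs nothing and removes the ambiguity, so it seems a reasonable habit for any character class that needs a literal hyphen.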