Python grouping regex

If you’re a regex guru and you know why you came here, you can go straight to the brief explanation. If not, just keep reading.

I found a workaround for Python bug 1519638. It certainly won’t solve every puzzle out there, but it stops the sub method from breaking when the replacement string uses backreferences.

The problem

If you would like to replace this:

<label for="author"><small>Name

With this:

<label for="author"><small>Naam

And you’re not sure whether the <small> tag is there, you would group the characters “<small>” and add a question mark to make them optional. BTW, running a replace on just “Name” is not an option because it would mess up other parts of the file in question.

Example updated. Thanx dbr!

The solution

Using a compiled regex pattern, a solution might look like this:

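import re

# No re.VERBOSE here: it would make the regex ignore the space in 'label for'.
reg = re.compile(r'(<label for="author">)(<small>)?(Name)',
    re.MULTILINE | re.DOTALL)
replace = r'\g<1>\g<2>Naam'  # keep groups 1 and 2, swap "Name" for "Naam"
search = reg.sub(replace, data)  # data holds the contents of the file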

In this case the replacement string uses backreferences to the groups, which are the subexpressions within the parentheses in the search pattern.
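To see the backreferences at work, here is a minimal sketch for the case where the “<small>” tag is present. The sample string in data is made up for illustration (the real input would come from the file), and the replacement is written with the \g<n> backreference syntax:

```python
import re

# Hypothetical sample input; the real data would be read from the file.
data = '<label for="author"><small>Name'

pattern = re.compile(r'(<label for="author">)(<small>)?(Name)')

# \g<1> and \g<2> copy the first two groups back, "Naam" replaces group 3.
result = pattern.sub(r'\g<1>\g<2>Naam', data)
print(result)  # <label for="author"><small>Naam
```

Because both groups took part in the match, nothing goes wrong here.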

The oops

However, if the “<small>” tag is not there, the sub call raises an exception.

$ python regex.py
Traceback (most recent call last):
  File "regex.py", line 14, in <module>
    search = reg.sub(replace, data)
  File "/usr/lib/python2.5/re.py", line 274, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/lib/python2.5/sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

This happens because the second group, referenced as “\g<2>” in the replacement string, did not take part in the match and so yields None instead of an empty string. That is (or seems to be) the bug.
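The None can be inspected directly on the match object, without going through sub. This is only a sketch with a made-up input string. Note that whether sub itself raises depends on the interpreter: the traceback above comes from Python 2.5, while from Python 3.5 onward sub substitutes an empty string for unmatched groups.

```python
import re

pattern = re.compile(r'(<label for="author">)(<small>)?(Name)')

# Hypothetical input without the optional <small> tag.
match = pattern.search('<label for="author">Name')

# Group 2 did not take part in the match, so it is None, not ''.
groups = match.groups()
print(groups)
```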

Solving the oops

This can be resolved by replacing the optional notation “(<small>)?” with an alternation “(|<small>)”. When the “<small>” tag is absent, the group matches on the empty subexpression, so it actually returns an empty string and the sub call won’t raise the exception.
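Here is a small sketch of the workaround applied to the label example. Both input strings are invented for illustration; the pattern only differs from the failing one in its second group:

```python
import re

# The second group is an alternation with an empty first branch.
pattern = re.compile(r'(<label for="author">)(|<small>)(Name)')
replacement = r'\g<1>\g<2>Naam'

# Hypothetical inputs: one without and one with the <small> tag.
without_tag = pattern.sub(replacement, '<label for="author">Name')
with_tag = pattern.sub(replacement, '<label for="author"><small>Name')

print(without_tag)  # <label for="author">Naam
print(with_tag)     # <label for="author"><small>Naam
```

When the tag is present the empty branch is tried first, fails on the following “Name”, and the regex engine backtracks to the “<small>” branch, so both cases are handled.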

In other words …

Brief explanation

When doing a search and replace with sub, replace each optional group with a group written as an alternation that has one empty subexpression. So instead of this “(.+?)?” use this “(|.+?)” (without the double quotes).

If nothing is matched by this group, the empty subexpression matches. An empty string is then returned instead of None, and the sub method executes normally instead of raising the “unmatched group” error.
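The general rule can be checked on a toy pattern. The surrounding “a” and “b” literals and the test strings are assumptions chosen only to keep the sketch short:

```python
import re

optional = re.compile(r'a(.+?)?b')      # optional group
alternation = re.compile(r'a(|.+?)b')   # alternation with an empty branch

# On 'ab' there is nothing for the group to match.
optional_value = optional.search('ab').group(1)
alternation_value = alternation.search('ab').group(1)
print(repr(optional_value))     # None
print(repr(alternation_value))  # ''

# With the alternation, sub has a string to insert in every case.
replaced = alternation.sub(r'[\g<1>]', 'ab axb')
print(replaced)  # [] [x]
```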

That’s all folks …