Je t’embrasse Salutations from Silicon Valley, California

2 Aug 2009

RegEx – Lists & Hashes with Escape Characters

I recently ran across a need to match escape characters within a regular expression. The key here is that you need to make use of a fun little RegEx feature called the "Zero-Width, Positive, Look-Behind Assertion". It's a loaded title, but it crams a LOT of functionality into a little space... none, in fact, since the assertion itself consumes no characters. This all stems from a post I made on Python-Forum...

Just FYI, so that nobody hassles me later, you need to import Python's regular expression module (re) in order to use any of these examples.

import re

Problem #1: How do you represent an "escaped character" in a Regular Expression?
Solution #1: Use a positive look-behind assertion that requires a single backslash immediately before the character being matched.

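# Raw string, so the backslashes reach the regex engine untouched.
text = r'\f\o\o'
# Any character, then a character that has a backslash directly behind it.
regex = r'(.(?:(?<=\\).))'
for m in re.finditer(regex, text):
    print(m.groups())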
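
To make the "zero-width" and "positive" parts of that title concrete, here is a minimal sketch. The sample strings are my own illustrations, not taken from the forum post; the only assumption is Python's standard re module.

```python
import re

# Zero-width: the assertion matches a position, not a character.
# In r'a\b' the only position with a backslash directly behind it is index 2.
m = re.search(r'(?<=\\)', r'a\b')
print(repr(m.group()), m.span())  # '' (2, 2)

# Positive vs. negative look-behind on the same (illustrative) text.
text = r'a.b\.c'
escaped = [d.start() for d in re.finditer(r'(?<=\\)\.', text)]    # dot preceded by a backslash
unescaped = [d.start() for d in re.finditer(r'(?<!\\)\.', text)]  # dot NOT preceded by a backslash
print(escaped, unescaped)  # [4] [1]
```

The positive form, (?<=...), is the one used throughout this post; the negative form, (?<!...), is its mirror image.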

Problem #2: It is not possible to put the previous regex inside a character set. So how then do you create a set of "escaped characters + non-terminators"?
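Solution #2: Make each "character" in the set optional with "?", then repeat the pair: an optional escaped character followed by an optional non-terminator.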

text = r'{foo:bar}{a:\/foo\.bar}'
# Optional escaped character, then optional non-terminator, repeated between "{" and "}".
regex = r'(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})'
for m in re.finditer(regex, text):
    print(m.groups())
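
As a rough illustration of why the assertion cannot live inside a character set, the sketch below first tries it and then runs the workaround. The probe string r'a\.b(' and the names literal_class and SET_RUN are hypothetical, chosen only for this example.

```python
import re

# Inside [...] the assertion syntax loses its meaning: this class just
# matches any one of the literal characters ( ? < = \ ) .
literal_class = re.findall(r'[(?<=\\).]', r'a\.b(')
print(literal_class)  # ['\\', '.', '(']

# The workaround: an optional escaped character followed by an optional
# non-terminator, repeated between the braces.
SET_RUN = re.compile(r'\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\}')
found = SET_RUN.findall(r'{foo:bar}{a:\/foo\.bar}')
print(found)  # ['{foo:bar}', '{a:\\/foo\\.bar}']
```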

Problem #3: How do you keep multiple occurrences from matching?
Solution #3: Explicit match using anchors. In other words, use "^" and "$" to bound your matching appropriately.

text = r'{a:\/foo\.bar}'
# "^" and "$" force the single braced group to span the whole string.
regex = r'^(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})$'
for m in re.finditer(regex, text):
    print(m.groups())
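
A small sketch of what the anchors buy you, using the same braced pattern. The names BODY, anchored, single and double are mine, and re.fullmatch is shown only as a possible alternative on Python 3.4 or later; it is not what the original post used.

```python
import re

BODY = r'\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\}'
anchored = re.compile('^(' + BODY + ')$')

single = r'{a:\/foo\.bar}'
double = r'{foo:bar}{a:\/foo\.bar}'

# With anchors, only a string that is exactly one braced group matches.
print(anchored.match(single) is not None)  # True
print(anchored.match(double) is not None)  # False

# re.fullmatch applies the same "whole string" constraint without
# writing the anchors into the pattern.
print(re.fullmatch(BODY, single) is not None)  # True
print(re.fullmatch(BODY, double) is not None)  # False
```

One difference to keep in mind: "$" also matches just before a trailing newline, while re.fullmatch does not.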

Problem #4: How do you match practically any character that could occur within 2 bounding characters (like /regex/ in a sed-esque fashion)?
Solution #4: Exactly like before, except now the only character we can't have in the middle is a plain (non-escaped) bounding character.

text = r'/g[@#$%\/{foo:bar}\/^&*()]/{a:\/foo\.bar}'
# Only a plain (non-escaped) "/" is excluded between the two delimiters.
regex = r'(\/(?:(?:(?<=\\).)?[^\/]?)+\/)'
for m in re.finditer(regex, text):
    print(m.groups())
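
The same idea can be sketched for an arbitrary single-character delimiter. The helper name delimited is hypothetical (it is not part of re), and it assumes the delimiter is exactly one character; re.escape is used so that delimiters such as "|" stay literal.

```python
import re

def delimited(delim):
    """Build a pattern for text bounded by an unescaped single-character delimiter."""
    d = re.escape(delim)
    # Optional escaped character, then optional non-delimiter, repeated.
    return re.compile(d + r'(?:(?:(?<=\\).)?[^' + d + r']?)+' + d)

slashes = delimited('/').findall(r'/g[@#$%\/{foo:bar}\/^&*()]/{a:\/foo\.bar}')
print(slashes)  # ['/g[@#$%\\/{foo:bar}\\/^&*()]/']

pipes = delimited('|').findall(r'|a\|b|c')
print(pipes)  # ['|a\\|b|']
```

One caveat: the look-behind inspects only a single preceding character, so an escaped backslash directly before a delimiter (as in \\/) still makes that delimiter look escaped.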

Problem #5: How do we avoid matching when two valid matches are split, as in "/foo/{a:b}/bar/"?
Solution #5: Explicitly define the possible combinations of each regex subsection.

text1 = r'{a:\/foo\.bar}/foo\/bar/'
text2 = r'/foo\/bar/{a:\/foo\.bar}'
# Combination 1: braced section first, then the slash-delimited section.
regex = r'^(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})(\/(?:(?:(?<=\\).)?[^\/]?)+\/)$'
for m in re.finditer(regex, text1):
    print(m.groups())
# Combination 2: slash-delimited section first, then the braced section.
regex = r'^(\/(?:(?:(?<=\\).)?[^\/]?)+\/)(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})$'
for m in re.finditer(regex, text2):
    print(m.groups())
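
If you would rather not keep two separate patterns, one possible variation is to fold both orderings into a single alternation. This is a sketch, not the approach from the original post; the names BRACED, SLASHED, EITHER_ORDER and split_pair are hypothetical.

```python
import re

BRACED = r'\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\}'
SLASHED = r'\/(?:(?:(?<=\\).)?[^\/]?)+\/'

# Both allowed orderings, anchored so nothing may sit between or around them.
EITHER_ORDER = re.compile(
    '^(?:(' + BRACED + ')(' + SLASHED + ')|(' + SLASHED + ')(' + BRACED + '))$'
)

def split_pair(text):
    """Return the two sections in the order they appear, or None if text is not a valid pair."""
    m = EITHER_ORDER.match(text)
    if m is None:
        return None
    # Groups from the alternative that did not take part are None; drop them.
    return tuple(g for g in m.groups() if g is not None)

print(split_pair(r'{a:\/foo\.bar}/foo\/bar/'))  # ('{a:\\/foo\\.bar}', '/foo\\/bar/')
print(split_pair(r'/foo\/bar/{a:\/foo\.bar}'))  # ('/foo\\/bar/', '{a:\\/foo\\.bar}')
print(split_pair('/foo/{a:b}/bar/'))            # None
```

The split example from the problem statement is rejected because neither ordering can account for the trailing "/bar/".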