Lookahead (or -behind) assertions in regular expressions

Lookahead and lookbehind assertions are a feature of regular expressions that’s not needed as often as others. The .NET regular expression implementation fully supports them, and they are covered in its documentation. But what exactly are they good for? What do you do with them?

One answer is that they can be quite important when using regular expressions to replace parts of strings. The reason is that a regex replacement always replaces the complete matched text. It’s possible to use substitutions to build a replacement string, but there are situations where it’s extremely difficult, if not impossible, to construct the correct replacement string. Consider a string like this:

c:program files;e:tmp%var1%;d:apps%var3%foo;e:tmp%var2%

Now assume for a moment that you want to replace the %varX% expressions to say $varX$ instead, but only if they mark the end of a “part” of the original string. In other words, you want to catch var1 and var2, but not var3.

So, to start with, the following regex finds all the relevant parts: (?<var>%(?<varname>[^%]+)%)(;|$). In English, this expression finds expressions of the form %...%, which are followed by either ; or the end of the line. It has a group called var that holds the complete %...% part, and another one called varname that holds only the part between the %’s. Nice.
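To see what this expression actually matches, here is a minimal sketch. It uses Python’s re module rather than .NET, so treat it as a translation under that assumption: Python spells named groups (?P<name>...) where .NET uses (?<name>...). The names PATTERN, text and matches are only there for illustration.

```python
import re

# Python writes named groups as (?P<name>...); .NET writes (?<name>...).
PATTERN = re.compile(r"(?P<var>%(?P<varname>[^%]+)%)(;|$)")

text = "c:program files;e:tmp%var1%;d:apps%var3%foo;e:tmp%var2%"

# Collect (whole match, var group, varname group) for every hit.
matches = [
    (m.group(0), m.group("var"), m.group("varname"))
    for m in PATTERN.finditer(text)
]

for whole, var, varname in matches:
    print(repr(whole), repr(var), repr(varname))
# '%var1%;' '%var1%' 'var1'
# '%var2%' '%var2%' 'var2'
```

Note that the whole match for var1 includes the trailing ;, while %var3% is skipped because it is followed by foo.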

Now for the replacement. As I said, the replacement always replaces the whole matched string, so that includes the delimiter that was found, either the ; or the end of the line. So, if we use a replacement like $$${varname}$$, the ; that used to follow %var1% is gone and the string looks like this: c:program files;e:tmp$var1$d:apps%var3%foo;e:tmp$var2$.
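The lost delimiter is easy to reproduce. Again a sketch in Python, not .NET: Python’s replacement syntax refers to a named group as \g<varname> instead of ${varname}, and a $ in a Python replacement string is already literal, so the $$ escaping isn’t needed there.

```python
import re

# Same expression as before, in Python's named-group syntax.
PATTERN = re.compile(r"(?P<var>%(?P<varname>[^%]+)%)(;|$)")

text = "c:program files;e:tmp%var1%;d:apps%var3%foo;e:tmp%var2%"

# \g<varname> plays the role of .NET's ${varname}; "$" is a plain character here.
result = PATTERN.sub(r"$\g<varname>$", text)

print(result)
# c:program files;e:tmp$var1$d:apps%var3%foo;e:tmp$var2$
```

The ; between $var1$ and d:apps has disappeared because it was part of the match.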

This is where the lookahead assertion comes in handy. With it, we can tell the regular expression engine that while we care about the text that follows the %...% part, we don’t want it to become part of the match! It’s simple: we modify the regex to look like this: (?<var>%(?<varname>[^%]+)%)(?=;|$). The only change is that the final group now starts with ?=, turning (;|$) into (?=;|$).

What this means is simple: look ahead to see whether the following characters actually match the expression in the parens. But don’t include the matched characters in the expression’s match! Now this is exactly what we want. Because the delimiter is no longer part of the match, it’s not going to be replaced either. By replacing (?<var>%(?<varname>[^%]+)%)(?=;|$) with $$${varname}$$, we get c:program files;e:tmp$var1$;d:apps%var3%foo;e:tmp$var2$.
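Here is the same replacement with the lookahead in place, once more as a Python sketch rather than the .NET original. The (?=...) syntax is the same in both; only the named-group and replacement spellings differ. The function name replace_trailing_vars is made up for this example.

```python
import re

# (?=;|$) checks for ";" or end of string without consuming it.
LOOKAHEAD = re.compile(r"(?P<var>%(?P<varname>[^%]+)%)(?=;|$)")


def replace_trailing_vars(text: str) -> str:
    """Turn %name% into $name$ where it ends a ;-separated part."""
    return LOOKAHEAD.sub(r"$\g<varname>$", text)


text = "c:program files;e:tmp%var1%;d:apps%var3%foo;e:tmp%var2%"

# The matched text no longer contains the delimiter.
matched = [m.group(0) for m in LOOKAHEAD.finditer(text)]
print(matched)
# ['%var1%', '%var2%']

result = replace_trailing_vars(text)
print(result)
# c:program files;e:tmp$var1$;d:apps%var3%foo;e:tmp$var2$
```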

The alternative

In this case, there is an alternative way to do the same thing. We’d modify the expression like this: (?<var>%(?<varname>[^%]+)%)(?<delimiter>;|$). The delimiter is then captured in the group delimiter, and we can put it back with a modified replacement string: $$${varname}$$${delimiter}. But this solution requires changes to both the matching expression and the replacement string, and it can get extremely confusing when the expressions are more complicated or there are simply more alternatives to watch out for. Lookahead and lookbehind assertions simply keep the surrounding text out of the match, so it’s never replaced in the first place, which is a much more elegant way.
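For comparison, a sketch of the capture-and-reinsert alternative, again translated to Python as an assumption (\g<delimiter> stands in for ${delimiter}). At the end of the string the delimiter group matches the empty string, so nothing extra is inserted there.

```python
import re

# The delimiter is consumed by the match but captured so it can be put back.
CAPTURING = re.compile(r"(?P<var>%(?P<varname>[^%]+)%)(?P<delimiter>;|$)")
# Lookahead version, for comparison.
LOOKAHEAD = re.compile(r"(?P<var>%(?P<varname>[^%]+)%)(?=;|$)")

text = "c:program files;e:tmp%var1%;d:apps%var3%foo;e:tmp%var2%"

# The replacement has to re-emit whatever delimiter was matched.
with_capture = CAPTURING.sub(r"$\g<varname>$\g<delimiter>", text)
with_lookahead = LOOKAHEAD.sub(r"$\g<varname>$", text)

print(with_capture)
# c:program files;e:tmp$var1$;d:apps%var3%foo;e:tmp$var2$
print(with_capture == with_lookahead)
# True
```

Both give the same result here; the difference is that the capturing version forces the replacement string to know about every piece of context the expression consumed.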
