.NET Regex balanced matches failure evaluation

While doing some regex demos, I noticed some strange behavior that doesn’t seem to comply with the docs. I’m working with balancing group definitions as described in the .NET regular expression documentation. In conjunction with those, it should be possible to use the syntax (?(start)(?!)) to fail the expression in case the balanced elements don’t even out (or rather, if there aren’t enough end elements). The documentation describes this conditional expression in these words: Matches [] if [] a named [] capturing group has a match
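For orientation, here is a minimal sketch of how the two constructs are commonly combined, separate from the tests that follow. The pattern, the class name BalancedSketch and the helper IsBalanced are illustrative assumptions, not quoted from the docs. Note that this sketch anchors the pattern with ^ and $, which the test expression further down does not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedSketch {
  // Illustrative pattern (an assumption, not quoted from the docs):
  //   (?<start>\()     records a capture on the 'start' group for each "("
  //   (?<-start>\))    removes one capture for each ")"; fails if none is left
  //   (?(start)(?!))   forces a failure if any 'start' capture remains
  private const string Pattern =
    @"^(?:(?<start>\()|(?<-start>\))|[^()])*(?(start)(?!))$";

  // Hypothetical helper: true if every "(" in the input has a matching ")".
  public static bool IsBalanced(string input) {
    return Regex.IsMatch(input, Pattern);
  }

  public static void Main() {
    Console.WriteLine(IsBalanced("(10 * (3 + 4))")); // True
    Console.WriteLine(IsBalanced("(10 * (3 + 4)"));  // False
  }
}
```

Because of the anchors, the engine cannot satisfy this pattern by matching only part of the input.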

In my tests, this does not appear to work correctly. Here are my tests and results.

I’m starting with this code (you can copy it into any class and run it as is, given using directives for System and System.Text.RegularExpressions):

    public static void BalancedParens() {
      var rs = @"
[^()]*                  # there might be stuff at the start
(?<formula>             # formula group captures the complete () part
(                       # this group starts a () part

  (                     # this is the start of an opening paren group
    [^()]*              # stuff before the opening paren
    (?<start>\()        # the opening paren itself
    [^()]*              # stuff after the opening paren
  )+                    # we can have lots of parens opening

  (                     # this is the start of the closing paren group
    (?<end-start>\))    # the closing paren itself - 
                        # the special name 'closes' the corresponding start group
  )+                    # we should have lots of parens closing, too

)*                      # the end of the nested paren groups
)                       # the end of the formula group
#(?!)
#(?(start)(?!))
";
      var r = new Regex(rs, RegexOptions.IgnorePatternWhitespace);

      var text = "formula: (10 * (3 + (7 - 5) + (2 + 6))) <- this is important";
      var match = r.Match(text);

      if (match.Success) {
        Console.WriteLine("Match: " + match.Value);
        Console.WriteLine("Formula: " + match.Groups["formula"]);
        Console.WriteLine("start captures:");
        var starts = match.Groups["start"];
        foreach (var start in starts.Captures) {
          Console.WriteLine(start);
        }

        Console.WriteLine("end captures:");
        var ends = match.Groups["end"];
        foreach (var end in ends.Captures) {
          Console.WriteLine(end);
        }
      }
      else
        Console.WriteLine("Match not successful");
    }

For reference purposes, the output rendered is this:

Match: formula: (10 * (3 + (7 - 5) + (2 + 6)))
Formula: (10 * (3 + (7 - 5) + (2 + 6)))
start captures:
end captures:
7 - 5
2 + 6
3 + (7 - 5) + (2 + 6)
10 * (3 + (7 - 5) + (2 + 6))

As you can see, I have included two elements at the end of my expression, both commented out in the code above. The final line holds the documented conditional expression that should fail the expression if the start group still has content at the end; in other words, if there are more starting elements (opening parens in my case) than ending elements (closing parens).

To test this, I first “break” my input text by removing one of the closing parens, so this remains: "formula: (10 * (3 + (7 - 5) + (2 + 6)) <- this is important"

Running the application again, the output is now this:

Match: formula: (10 * (3 + (7 - 5) + (2 + 6))
Formula: (10 * (3 + (7 - 5) + (2 + 6))
start captures:
(
end captures:
7 - 5
2 + 6
3 + (7 - 5) + (2 + 6)

This is correct. The start group still contains one opening paren because there was no closing counterpart found in the input.
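The same observation can be made explicit by counting what is left on the start group after a match. The sketch below is only an illustration: it uses a compact, single-line form of the expression above (without the two commented-out lines at the end), and the names LeftoverSketch and UnclosedOpens are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeftoverSketch {
  // Compact form of the test expression, without either trailing (?!) line.
  private const string Pattern =
    @"[^()]*(?<formula>(([^()]*(?<start>\()[^()]*)+((?<end-start>\)))+)*)";

  // Returns the number of opening parens still recorded on the 'start' group,
  // i.e. those that no closing paren has removed.
  public static int UnclosedOpens(string text) {
    var match = Regex.Match(text, Pattern);
    return match.Groups["start"].Captures.Count;
  }

  public static void Main() {
    // Balanced input from the first run: prints 0
    Console.WriteLine(UnclosedOpens(
      "formula: (10 * (3 + (7 - 5) + (2 + 6))) <- this is important"));
    // "Broken" input with one closing paren removed: prints 1
    Console.WriteLine(UnclosedOpens(
      "formula: (10 * (3 + (7 - 5) + (2 + 6)) <- this is important"));
  }
}
```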

Now I remove the comment sign from the last line of my regular expression and run yet again. I expect match.Success to return False now. Here's what I get instead:

Match: formula: 
Formula: 
start captures:
end captures:

Instead of failing the expression, the match is now empty. Two things I learn from that:

  1. The engine has obviously reacted to the inclusion of the (?(start)(?!)) expression, since the behavior is no longer the same as it was before.

  2. The intended result of failing the expression was not reached. The behavior I see instead seems rather inexplicable.

On that basis, I had the idea to check whether (?!) on its own can really fail the match, in the sense of setting Success to False. I comment out the last line of the expression again and enable the one right before it. Now the expression should always fail. And it does, regardless of the input string.

Match not successful

The next idea is that perhaps the conditional construct itself doesn't work. However, a quick independent check shows that, under simple circumstances, it behaves as intended.

Console.WriteLine(Regex.Match("abc", "(?<foo>a)(?(foo)(?!))").Success);

This code renders False as expected.
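To see both outcomes of the conditional side by side, the following sketch makes the foo group optional. The pattern and the names ConditionalSketch and Matches are assumptions for illustration; the pattern is anchored with ^ and $ so that there is only one position at which it can match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalSketch {
  // 'foo' is optional here; (?(foo)(?!)) forces a failure only if it captured.
  private const string Pattern = @"^(?<foo>a)?b(?(foo)(?!))$";

  // Hypothetical helper wrapping Regex.IsMatch for the pattern above.
  public static bool Matches(string input) {
    return Regex.IsMatch(input, Pattern);
  }

  public static void Main() {
    Console.WriteLine(Matches("b"));  // True: foo has no capture, nothing to fail on
    Console.WriteLine(Matches("ab")); // False: foo captured "a", so (?!) is reached
  }
}
```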

Up to this point, the matter remains a mystery to me. I'm either missing something, or this is a special case that triggers some kind of unintended behavior in the engine. Here are a few further thoughts:

  • I ran the code on .NET 4.5 and on Mono; the results are the same.

  • I considered whether the match might be regarded successful by the engine for some reason, in spite of the "fail" triggered by the (?!) element. My point of view is, though, that the expression should definitely fail. This is supported by the fact that (?!) triggers the "fail" I want when used outside the conditional construct.
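One more check might help narrow this down. It rests on a hypothesis, not on anything established above: the test expression is not anchored and its outer group is quantified with *, so once the conditional fails, the engine may be able to backtrack to zero iterations of that group. That would leave start empty and match only the leading text, which is consistent with the "Match: formula: " output. The sketch below compares the compact form of the expression with a hypothetical variant that has to account for the whole input (^ at the start, [^()]*$ after the conditional). The class and method names are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorProbe {
  // Compact form of the test expression with the conditional enabled.
  private const string Core =
    @"[^()]*(?<formula>(([^()]*(?<start>\()[^()]*)+((?<end-start>\)))+)*)(?(start)(?!))";

  // The expression as tested in this post: free to match any part of the input.
  public static Match Unanchored(string text) {
    return Regex.Match(text, Core);
  }

  // Hypothetical variant: must start at the beginning and, after the
  // conditional, only paren-free text may follow up to the end of the input.
  public static Match Anchored(string text) {
    return Regex.Match(text, "^" + Core + @"[^()]*$");
  }

  public static void Main() {
    var broken = "formula: (10 * (3 + (7 - 5) + (2 + 6)) <- this is important";
    var u = Unanchored(broken);
    Console.WriteLine("Unanchored: " + u.Success + " [" + u.Value + "]");
    Console.WriteLine("Anchored:   " + Anchored(broken).Success);
  }
}
```

If the anchored variant reports False for the broken input while the unanchored one still succeeds with a shortened match, that would point at backtracking rather than at the conditional construct itself.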

Please let me know if you have any ideas. It would be great to understand what's going on here, assuming there is a plausible reason for it!
