Regular expressions: better to test for multiple captures or matches?

This was a question in a newsgroup today: given the choice, is it better to test for multiple matches of a regular expression or to let the expression match multiple captures at the same time. My reply was that I thought multiple captures might involve less overhead in the regular expression engine because the capture matching is an intrinsic part of the engine algorithm, while finding multiple matches means running the expression against the input multiple times. But then I thought that the final outcome would depend much on the implementation details of the engine at hand, because there are a lot of other factors in favor of both approaches.

For example, expressions matching multiple captures tend to be a lot more complicated than those looking only for a single match. This could mean that multiple captures wouldn’t be as fast as one could think. Plus, the engine will be looking for multiple matches whether or not it will find any, this could slow down the multiple captures approach further.

Finally I thought that a carefully implemented engine wouldn’t probably show a great difference between the two approaches, but that I wouldn’t be surprised if any given implementation would actually show a sizeable difference. To find out how the .NET implementation performs, I did the following test.

I had a simple input string: one two three four five six seven eight nine ten . This string was matched first against the simple match expression [^s]+ , which would return one match per number string, and then against the multiple capture expression (?[^s]+ )*. The code then checked that in each case exactly ten matches/captures had been retrieved and that the content of the matches/captures was correct. This complete test was performed 1000000 times for each approach.

The result is interesting. The multiple match approach took 11.80 seconds to run a million times, while the multiple capture approach took only 8.46 seconds. So although the multiple capture algorithm has one more significant line of code (the one where the group is retrieved from the match by its name), it’s significantly faster than the multiple match approach.

I’m not completely sure if this result will be the same with expressions of varying complexity, but the difference is that large that I think it supports my initial idea about the nature of the algoriths involved in both approaches.

The test program I used wasn’t extremely long, so I thought in case someone wants to do his own testing I’d publish it here. Here you go:

class Program {
  static readonly string input = "one two three four five six seven eight nine ten ";
  static readonly string simplePattern = @"[^s]+ ";
  static readonly string multipleCapturePattern = @"(?<entry>" + simplePattern + @")*";
  static readonly int runCount = 1000000;
 
  static void Main(string[] args) {
    Measure(new MethodInvoker(MultipleMatches), "Multiple matches");
    Measure(new MethodInvoker(MultipleCaptures), "Multiple captures");

    Console.ReadLine( );
  }
        
  delegate void MethodInvoker( );

  static void Measure(MethodInvoker methodInvoker, string name) {
    DateTime startTime = DateTime.Now;
    methodInvoker.Invoke( );
    TimeSpan duration = DateTime.Now - startTime;

    Console.WriteLine("Process '{0}' took: {1}", name, duration);
  }

  static void MultipleMatches( ) {
    Regex regex = new Regex(simplePattern, RegexOptions.None);
    for (int i = 0; i < runCount; i++) {
      MatchCollection matchCollection = regex.Matches(input);
      Debug.Assert(matchCollection.Count == 10);
      Debug.Assert(matchCollection[0].Value == "one ");
      Debug.Assert(matchCollection[1].Value == "two ");
      Debug.Assert(matchCollection[2].Value == "three ");
      Debug.Assert(matchCollection[3].Value == "four ");
      Debug.Assert(matchCollection[4].Value == "five ");
      Debug.Assert(matchCollection[5].Value == "six ");
      Debug.Assert(matchCollection[6].Value == "seven ");
      Debug.Assert(matchCollection[7].Value == "eight ");
      Debug.Assert(matchCollection[8].Value == "nine ");
      Debug.Assert(matchCollection[9].Value == "ten ");
    }
  }

  static void MultipleCaptures( ) {
    Regex regex = new Regex(multipleCapturePattern, RegexOptions.None);
    for (int i = 0; i < runCount; i++) {
      Match match = regex.Match(input);
      CaptureCollection captureCollection = match.Groups["entry"].Captures;
      Debug.Assert(captureCollection.Count == 10);
      Debug.Assert(captureCollection[0].Value == "one ");
      Debug.Assert(captureCollection[1].Value == "two ");
      Debug.Assert(captureCollection[2].Value == "three ");
      Debug.Assert(captureCollection[3].Value == "four ");
      Debug.Assert(captureCollection[4].Value == "five ");
      Debug.Assert(captureCollection[5].Value == "six ");
      Debug.Assert(captureCollection[6].Value == "seven ");
      Debug.Assert(captureCollection[7].Value == "eight ");
      Debug.Assert(captureCollection[8].Value == "nine ");
      Debug.Assert(captureCollection[9].Value == "ten ");
    }
  }
}

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s