Lately, I’ve been reading several blog articles about regular expressions. I’d like to throw in my vote for the most important sentence in Jeff Atwood’s article Regex use vs. Regex abuse:

All developers should learn to use basic regular expressions, because they’ll produce better, more flexible, more maintainable code with them.

Unlike what Jeff goes on to say in his My Buddy, Regex article (which also features a very useful tool, RegexBuddy), I actually do subscribe to the UNIX religion, in many ways. But subscribed or not, I think it’s important to look over the fence, because every technology has its good sides, whether you like the overall philosophy or not. I recently commented on yet another article, in Frans Bouma’s blog, titled Regex fun, which is a really good example of how regular expressions can be used in all kinds of applications.

In this article, I want to show some of the basics, and I’ll provide some links to further information at the end. Now, what do regular expressions really do? The most important thing they do is “matching”. This simply means that the expression is “matched” against another string, and the first question that can be answered this way is whether it’s a match or not. In other words, does the “content” string adhere to the pattern described by the regular expression? Some very simple examples:

Regular expression    Content        Match?
Hello                 Hello there    Yes (and yes, “Hello” is really a regular expression)
Wow                   Hello there    No
lo.t                  Hello there    Yes
foo*.bat              fooooo.bat     Yes
foo*.bat              foobar.bat     No

From the examples, you can see that there are special characters in use: the . (dot) and the * (asterisk). The dot stands for a single arbitrary character, which is easy to understand. But what’s with the asterisk? Why isn’t the last example a match? The reason lies in the most important difference from “normal” wildcard systems, like the one used in DOS: in regular expressions, the asterisk is a so-called quantifier and not a placeholder (like the dot). A quantifier defines how many times the expression right in front of it may be repeated. The asterisk quantifier stands for zero to an indefinite number of occurrences. That’s why foo*.bat matches fooooo.bat, but not foobar.bat: the asterisk only repeats the o in front of it, and the single dot can stand for just one character, not for everything between foo and bat.
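To try the table rows yourself, here is a minimal sketch in Python (my choice of language, the article itself isn’t tied to one). The helper name is_match is made up for illustration. It uses re.search from the standard library, which looks for the pattern anywhere inside the content string, and that is the kind of matching the table assumes.

```python
import re

# (pattern, content) pairs taken from the table of examples
examples = [
    ("Hello", "Hello there"),
    ("Wow", "Hello there"),
    ("lo.t", "Hello there"),
    ("foo*.bat", "fooooo.bat"),
    ("foo*.bat", "foobar.bat"),
]

def is_match(pattern, content):
    # Hypothetical helper: True if the pattern is found anywhere in content
    return re.search(pattern, content) is not None

for pattern, content in examples:
    answer = "Yes" if is_match(pattern, content) else "No"
    print(pattern, "|", content, "|", answer)
```

Note that the dot in foo*.bat is a placeholder here, so it would also accept something like fooXbat; to match a literal dot you would have to escape it as \. in the pattern.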

There are three other quantifiers: the ? (question mark), for zero or one occurrence; the + (plus), which stands for one to an indefinite number of occurrences; and the construct {m,n}, which allows occurrence counts from m to n, inclusive.
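A quick way to see the three quantifiers at work is the following Python sketch. The helper whole and the sample words are my own illustration, not part of the article. It uses re.fullmatch, which only succeeds if the entire string fits the pattern, so the occurrence counts become visible.

```python
import re

def whole(pattern, content):
    # Hypothetical helper: True only if the ENTIRE content fits the pattern
    return re.fullmatch(pattern, content) is not None

# ? : zero or one occurrence of the "u" in front of it
print(whole("colou?r", "color"))    # True  (zero u)
print(whole("colou?r", "colour"))   # True  (one u)
print(whole("colou?r", "colouur"))  # False (two u)

# + : one to an indefinite number of occurrences of the "o"
print(whole("fo+", "f"))            # False (zero o)
print(whole("fo+", "fooo"))         # True  (three o)

# {m,n} : between m and n occurrences, inclusive
print(whole("o{2,3}", "o"))         # False (one)
print(whole("o{2,3}", "oo"))        # True  (two)
print(whole("o{2,3}", "ooo"))       # True  (three)
print(whole("o{2,3}", "oooo"))      # False (four)
```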

Back to the dot for a moment: while the dot stands for a single arbitrary character, there’s another placeholder construct that’s often used: character classes. Written in brackets, they list all the characters that are allowed in a specific position, like this: [abc]. For example, [Hh]ello would match Hello as well as hello. Ranges may be used inside the brackets, as in [A-Za-z0-9], which matches any one of the 26 letters of the alphabet, upper or lower case, or any digit from 0 to 9.
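Here is a small Python sketch of the bracket construct. The variable names and the test strings are only for illustration. The first pattern is searched for anywhere in the string; the second one is combined with the + quantifier and checked against the whole string.

```python
import re

# [Hh] allows exactly one character in this position: H or h
greeting = re.compile("[Hh]ello")
print(greeting.search("Hello there") is not None)  # True
print(greeting.search("hello there") is not None)  # True
print(greeting.search("jello there") is not None)  # False

# A range-based class plus the + quantifier: one or more letters or digits
alphanumeric = re.compile("[A-Za-z0-9]+")
print(alphanumeric.fullmatch("Regex101") is not None)   # True
print(alphanumeric.fullmatch("Regex 101") is not None)  # False, the space is not in the class
```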

The last of the most important constructs are the parentheses and the | (pipe sign). They are used to group other elements together and to create either/or scenarios. The pipe is simple: the expression Hello|Hi would match either Hello or Hi, nothing else. The parentheses are most important in combination with quantifiers: if a quantifier follows an expression in parentheses, it applies to the complete expression inside them. For example, say you have to match a string that goes like this:

Apples,Bananas,Pears,Tomatoes

Obviously, the following expression could do:

[A-Za-z]+,[A-Za-z]+,[A-Za-z]+,[A-Za-z]+

But what if another line contained more than those four words? Or fewer than four? Now, using parentheses, you could rewrite the expression like this:

([A-Za-z]+,)+

To the close observer: of course, this wouldn’t be perfect, because it would allow a final comma to appear at the end of the line (and if the whole line has to match, it would even require one). I’m aware of this, but the example is supposed to be simple :-)
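For those who want to play with the grouping and either/or examples, here is a hedged Python sketch. The variable names are mine, and the pattern called strict is my own suggestion for avoiding the trailing-comma issue, not something from the examples above. re.fullmatch checks the whole line, re.search finds the pattern anywhere in it.

```python
import re

line = "Apples,Bananas,Pears,Tomatoes"

fixed = "[A-Za-z]+,[A-Za-z]+,[A-Za-z]+,[A-Za-z]+"  # exactly four words
grouped = "([A-Za-z]+,)+"                           # the group "word plus comma", repeated
strict = "[A-Za-z]+(,[A-Za-z]+)*"                   # assumption: one word, then ",word" repeated

# The fixed pattern only fits lines with exactly four words
print(re.fullmatch(fixed, line) is not None)              # True
print(re.fullmatch(fixed, "Apples,Bananas") is not None)  # False

# The grouped pattern needs a comma after every word when the whole line must match
print(re.fullmatch(grouped, line) is not None)            # False
print(re.fullmatch(grouped, line + ",") is not None)      # True

# Searched for anywhere, it still finds the part up to the last comma
print(re.search(grouped, line).group(0))                  # Apples,Bananas,Pears,

# The stricter variant accepts any number of words and rejects a trailing comma
print(re.fullmatch(strict, line) is not None)             # True
print(re.fullmatch(strict, "Apples,Bananas") is not None) # True
print(re.fullmatch(strict, line + ",") is not None)       # False

# The pipe: either Hello or Hi, nothing else
print(re.fullmatch("Hello|Hi", "Hi") is not None)         # True
print(re.fullmatch("Hello|Hi", "Hey") is not None)        # False
```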

Okay, that’s all I want to say about the topic at this point, because the purpose of the article is only to arouse some interest in regular expressions. I left out a lot of stuff, but mainly in the area of useful functionality, not intimidating syntax. I really recommend you follow some of the links below and find out more about this useful technology. Also, feel free to ask if you have specific questions about my examples or anything. If there’s demand, maybe I’ll write another article :-)

Some interesting content about regular expressions: