Modern regex engines have some powerful features that are not used very often, perhaps because regexes are considered cryptic and hard to begin with. In this blog post I want to document a few of my favourite tricks.
The x modifier is a good way to make your cryptic regexes more readable. If this modifier is set, whitespace characters in the pattern are ignored. Everything from an unescaped # to the end of the line is ignored as well. This means that you can spread your regex over several lines and add a comment to each line.
Say, for example, you want to write a validation regex in PHP for a custom ID. It might look something like this:
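A minimal sketch of such a pattern follows. The ID format (two uppercase letters, a year, and a five-digit number, as in AB-2024-00042) is made up purely for illustration, as are the sample inputs.

```php
<?php
// Hypothetical ID format, invented for this example: AB-2024-00042
$pattern = '~
    ^               # start of the string
    [A-Z]{2}        # two uppercase letters
    -               # a literal dash
    (?:19|20)\d{2}  # a four-digit year starting with 19 or 20
    -               # a literal dash
    \d{5}           # a five-digit sequence number
    $               # end of the string
~x';

var_dump(preg_match($pattern, 'AB-2024-00042')); // int(1)
var_dump(preg_match($pattern, 'ab-2024-42'));    // int(0)
```

The sketch uses ~ as the delimiter so that a stray / in a comment cannot end the pattern early. PHP looks for the closing delimiter before the comments are interpreted, so the delimiter character must not appear unescaped inside them.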
Keep in mind that if you need to match a literal space or a number sign (#), you need to escape it or put it in a character class:
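Both options can be shown in one short sketch. The "issue #42" input is a hypothetical example; the space is protected by a character class and the number sign by a backslash.

```php
<?php
$pattern = '~
    issue   # the literal word
    [ ]     # a literal space, protected by a character class
    \#      # a literal number sign, escaped
    \d+     # one or more digits
~x';

var_dump(preg_match($pattern, 'issue #42')); // int(1)
var_dump(preg_match($pattern, 'issue#42'));  // int(0)
```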
Check it out on regex101.
\K escape sequence
Let’s say we want to match helloworld only when it is preceded by foo somewhere in the string. Most regex engines do not support lookbehinds of arbitrary length. See the output of the following PHP code:
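A sketch of what such an attempt might look like is given below; the input string is an assumption. The exact wording of the warning depends on the PCRE version, so only the return value is inspected here.

```php
<?php
$input = 'foo bar helloworld';

// A lookbehind of unbounded length: PCRE refuses to compile this pattern.
// The @ only silences the warning so the return value can be inspected.
$result = @preg_match('/(?<=foo.*)helloworld/', $input, $matches);

var_dump($result); // bool(false): compilation failed, no match was attempted
```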
The \K escape sequence can be a nice workaround to this problem. When it is used in a regex, you are telling the engine to “forget” what has been matched so far: the text before \K still has to match, but it is left out of the reported match. The above example can be rewritten as follows:
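One way the rewrite could look, again with an assumed input string:

```php
<?php
$input = 'foo bar helloworld';

// "foo.*" must still match, but \K drops it from the reported result.
preg_match('/foo.*\Khelloworld/', $input, $matches, PREG_OFFSET_CAPTURE);

var_dump($matches[0][0]); // string(10) "helloworld"
var_dump($matches[0][1]); // int(8)
```

The reported offset points at helloworld rather than at foo, which is the same result a variable-length lookbehind would have given.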
The only downside to this technique is that it “consumes” what we have matched so far, meaning that we won’t be able to have overlapping matches. Say we have the following input string:
The following steps happen roughly inside the regex engine:
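As a rough illustration, assume an input with one foo followed by two occurrences of helloworld, and a lazy .*? so that the nearest helloworld is found first. Both are assumptions made for this sketch. The comments trace approximately what the engine does:

```php
<?php
// Assumed input: one "foo" followed by two "helloworld" occurrences.
$input = 'foo helloworld helloworld';

// Rough trace of the engine, with a lazy .*? so the nearest helloworld wins:
// 1. "foo" matches at offset 0.
// 2. .*? grows one character at a time until "helloworld" matches at offset 4.
// 3. \K discards "foo " from the reported match, but offsets 0-13 are consumed.
// 4. The search resumes at offset 14, where no "foo" remains, so it stops.
$count = preg_match_all('/foo.*?\Khelloworld/', $input, $matches, PREG_OFFSET_CAPTURE);

var_dump($count);            // int(1)
var_dump($matches[0][0][1]); // int(4)
```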
The first match consumed a part of the string. The engine continues from where it left off to find another match. However, no more foo is to be found, so the second helloworld does not get matched.
Perl and PCRE-enabled engines support “Backtracking Control Verbs”, which are extensively covered on rexegg. I want to cover a simple yet powerful trick that combines two of these control verbs. Say you want to match all instances of hello\d+ that are not enclosed in angle brackets (<>). This is easily achieved using (*SKIP) and (*FAIL):
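A small sketch of the trick, using a made-up input string that mixes bracketed and unbracketed occurrences:

```php
<?php
$input = 'hello1 <hello2> hello3 <hello42>';

// First alternative: match the bracketed form, then skip past it and fail.
// Second alternative: match what we actually want.
preg_match_all('/<hello\d+>(*SKIP)(*FAIL)|hello\d+/', $input, $matches);

print_r($matches[0]); // contains "hello1" and "hello3" only
```

The bracketed occurrences are matched by the first alternative and then thrown away, so the second alternative never gets a chance to match the text inside them.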
Without going too much into the internal workings of (*SKIP) and (*FAIL), you can use the combination /pattern1(*SKIP)(*FAIL)|pattern2/ to instruct the regex engine to match pattern2 while excluding pattern1.
If you’re interested in exactly how this trick works, make sure to check out the rexegg article.