Regex subroutines and recursion
Subroutines and recursion are powerful expressions that are sometimes ignored or forgotten. Maybe because the number of languages that support it are relatively scarce? If you’re using PCRE, PHP which uses PCRE under the hood, Perl or Python with the
regex module then keep reading!
Let’s warmup with backreferences. Say you want to match the following expression
word -> word where the left and right side are the same words. You’ll usually end up using backreferences:
Aside from the fact that this regex could be simplified to
(\w+) -> \1, the regex is pretty straightforward, we use a capturing group to match a word, then reference what we have matched with
But what if we don’t want to match the exact word but reuse the capturing group?
Continuing with our previous example, say we want to match instances such as
qux -> baz, usually we would write the following expression:
Good so far but we can do better. With subroutines we can reuse defined groups:
The expression above matches a word, puts it in group #1, then reuses the expression in group #1 with
(?1). The results:
There are different variations of subroutines:
(?1)will call group #1, effectively if you have group #2 you can also use
(?2)and so on.
(?+1)will call the next group. Using our example above, we could rewrite it as:
(?+1) -> ([a-zA-Z0-9_]+). Of course, using
(?+2)is also possible if the second next group exists and so on.
(?-1)same as above but will call the previous group. Using our example above, we could rewrite is as:
([a-zA-Z0-9_]+) -> (?-1).
(?&name)also known as named subroutines. Using our example above, we could rewrite is as:
(?P<word>[a-zA-Z0-9_]+) -> (?&word).
Realword subroutine example
Say we want to match UUID’s. UUID’s are 32 hexadecimal characters separated by hyphens adhering the following pattern:
8-4-4-4-12. The vanilla regex pattern would look like this:
Let’s use subroutines instead. As we can see the basic unit is 4 hex characters, let’s use that as a basis and complete the pattern:
Much shorter! We can go a step further by repeating this part
4-4-4 to make it even more shorter:
Now that’s a bad idea since regexes are hard to read to begin with for most people. It’s better to make it as readable as possible. We can achieve this by using the
x modifier to allow for whitespaces in our regex. This also has the advantage that we can add comments. We’ll also use named groups instead:
Now recursion is basically a subroutine that calls itself. The most common example is to match balanced brackets:
Worth noting that
(?R) is the same as
(?0) since group #0 is basically the whole pattern. Some test cases:
There’s a trick worth noting which is the DEFINE trick. I think it’s easiest if we reference directly from the PCRE manual:
Basically we can put all kinds of patterns inside the DEFINE area and use it later on in our regex. Back to the UUID example, we could write:
Combining everything in the real world
Say we have developed our own syntax which resembles JSON in some sense:
If want to validate such syntax we’ll need recursion too since an object can in itself contain another object:
The above regex can be found online on regex101, an awesome online regex fiddler. Make sure to select the right language on the left and set the right regex modifier.
This StackOverflow thread took it to the next level by validating JSON entirely with regex.
The techniques described here are probably useful, but in general you either want to use a full-fledged parser or use one of the standard libraries of your programming environment. For example, to validate JSON strings, you could load it with your favorite function/library. It will probably throw an exception if it’s invalid :_)