Hamza's random blogposts

Blogposts for future self reference

28 July 2018

Regex subroutines and recursion

Tags: regex

Introduction

Subroutines and recursion are powerful expressions that are sometimes ignored or forgotten. Maybe because the number of languages that support it are relatively scarce? If you’re using PCRE, PHP which uses PCRE under the hood, Perl or Python with the regex module then keep reading!

Warmup

Let’s warmup with backreferences. Say you want to match the following expression word -> word where the left and right side are the same words. You’ll usually end up using backreferences:

([a-zA-Z0-9_]+) -> \1

Aside from the fact that this regex could be simplified to (\w+) -> \1, the regex is pretty straightforward, we use a capturing group to match a word, then reference what we have matched with \1.

foo -> foo  [ match ]
bar -> bar  [ match ]
qux -> baz  [ won't match ]
baz qux     [ won't match ]

But what if we don’t want to match the exact word but reuse the capturing group?

Subroutines

Continuing with our previous example, say we want to match instances such as qux -> baz, usually we would write the following expression:

[a-zA-Z0-9_]+ -> [a-zA-Z0-9_]+

Good so far but we can do better. With subroutines we can reuse defined groups:

([a-zA-Z0-9_]+) -> (?1)

The expression above matches a word, puts it in group #1, then reuses the expression in group #1 with (?1). The results:

foo -> foo  [ match ]
bar -> bar  [ match ]
qux -> baz  [ match ]
baz qux     [ won't match ]

There are different variations of subroutines:

Realword subroutine example

Say we want to match UUID’s. UUID’s are 32 hexadecimal characters separated by hyphens adhering the following pattern: 8-4-4-4-12. The vanilla regex pattern would look like this:

[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}

Let’s use subroutines instead. As we can see the basic unit is 4 hex characters, let’s use that as a basis and complete the pattern:

([a-fA-F0-9]{4})(?1)-(?1)-(?1)-(?1)-(?1){3}

Much shorter! We can go a step further by repeating this part 4-4-4 to make it even more shorter:

([a-fA-F0-9]{4})(?1)(-(?1)){3}-(?1){3}

Now that’s a bad idea since regexes are hard to read to begin with for most people. It’s better to make it as readable as possible. We can achieve this by using the x modifier to allow for whitespaces in our regex. This also has the advantage that we can add comments. We’ll also use named groups instead:

(?P<hex>[a-fA-F0-9]{4})(?&hex) # 8 hex combo
-(?&hex)-(?&hex)-(?&hex)       # 4-4-4 hex combo
-(?&hex){3}                    # 12 hex combo

Recursion

Now recursion is basically a subroutine that calls itself. The most common example is to match balanced brackets:

\{                 # Match opening tag
    (?:            # Non capturing-group
        [^{}]      # Match anything that's not an opening or closing tag
        |          # Else (which means we encountered an opening or closing tag)
        (?R)       # Recurse the whole pattern
    )*             # Repeat zero or more times, we could also use + instead of * if we don't want to match empty tags
\}                 # Match close tag

Worth noting that (?R) is the same as (?0) since group #0 is basically the whole pattern. Some test cases:

foo          [ won't match ]
{}           [ match ]
{foo}        [ match ]
{{ bar }}    [ match ]
{{{ baz }}}  [ match ]

(?(DEFINE)) trick

There’s a trick worth noting which is the DEFINE trick. I think it’s easiest if we reference directly from the PCRE manual:

 Defining subpatterns for use by reference only

       If  the  condition  is  the string (DEFINE), and there is no subpattern
       with the name DEFINE, the condition is  always  false.  In  this  case,
       there  may  be  only  one  alternative  in the subpattern. It is always
       skipped if control reaches this point  in  the  pattern;  the  idea  of
       DEFINE  is that it can be used to define subroutines that can be refer-
       enced from elsewhere. (The use of subroutines is described below.)  For
       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
       could be written like this (ignore white space and line breaks):

         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
         \b (?&byte) (\.(?&byte)){3} \b

       The first part of the pattern is a DEFINE group inside which a  another
       group  named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When  matching  takes  place,
       this  part  of  the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the  named  group
       to  match the four dot-separated components of an IPv4 address, insist-
       ing on a word boundary at each end.

Basically we can put all kinds of patterns inside the DEFINE area and use it later on in our regex. Back to the UUID example, we could write:

(?(DEFINE)                     # regex definitions
    (?P<hex>[a-fA-F0-9]{4})    # match 4 hex chars
)
# Matching starts now...
(?&hex){2}                     # 8 hex combo
-(?&hex)-(?&hex)-(?&hex)       # -4-4-4 hex combo
-(?&hex){3}                    # -12 hex combo

Combining everything in the real world

Say we have developed our own syntax which resembles JSON in some sense:

{
  foo : bar,
  baz : [aa, bb, ccc],
  123 : 123,
  obj : { foo : bar}
}

If want to validate such syntax we’ll need recursion too since an object can in itself contain another object:

(?(DEFINE)                     # regex definitions
    (?P<string>[a-zA-Z0-9]+)
    (?P<value>\s*(?:(?&string)|(?&array)|(?&object))\s*)
    (?P<array>\s*\[\s*(?&value)\s*(?:\s*,\s*(?&value)\s*)*\])
    (?P<pair>(?&string)\s*:\s*(?&value))
    (?P<object>\{\s*(?&pair)\s*(?:\s*,\s*(?&pair)\s*)*\})
)
(?&object)

The above regex can be found online on regex101, an awesome online regex fiddler. Make sure to select the right language on the left and set the right regex modifier.

This StackOverflow thread took it to the next level by validating JSON entirely with regex.

The techniques described here are probably useful, but in general you either want to use a full-fledged parser or use one of the standard libraries of your programming environment. For example, to validate JSON strings, you could load it with your favorite function/library. It will probably throw an exception if it’s invalid :_)