Time to re-evaluate

Last time we looked at how to remove superfluous information from our input strings and we came up against a problem with the string:

Olive oil (2 tbsp)  

In this instance, the (2 tbsp) would be removed and we would lose our amount information. It’s not nice to have to extract information in two passes - we’d like to classify correctly and then extract the information we need.

To avoid classifying the amount as superfluous, we need to improve our rule so it doesn’t match if the bracketed content is an amount. To do that we need negative lookahead, whereby if we look ahead and see a particular item, we don’t match.

At this point we could go ahead and improve our matching algorithm to implement the negative lookahead feature. However, it’s worth taking a step back. It looks like we are steadily re-implementing regexes, which (interesting though it may be) is not exactly the most efficient use of our time and energy. Is there a way that we can harness the existing regex engine to do this for us and consequently, give us all the other nice regex features for free?

This is also a good opportunity to assess what value we are adding over just using a regex - do we even need this library?!

To answer these questions let’s consider how we might implement the same concepts with regexes alone.

Simple rule with regex

Let’s consider one of our simpler rules:

/pinch/ is unit

We need to consider how to do the matching implied by this rule and how to store any resulting classifications. If we use named capture groups then we can represent the classification quite neatly with the following regex:

(?<unit>pinch)

We could represent the classification by inserting mark-up into the string such that pinch is replaced with <unit>pinch</unit>. The rule can then be represented as a regex substitution using $ notation to refer to the capture group:

(?<unit>pinch) => <unit>$unit<\/unit>

Now we can classify the pinch unit in a string such as salt (a pinch) yielding:

salt (a <unit>pinch</unit>)  

It’s worth noting that not all regex flavours support named capture groups but Python does and it is always possible to translate them to numbered capture groups where they are not supported.

Matching types

We’ve written a simple pattern match rule quite easily but what if we now want to match using our classified types? Let’s assume we have classified our example string so we have a number type and a unit type:

salt (<number>a</number> <unit>pinch</unit>)  

Now we want to classify the combination of number followed by unit as an amount. We need a regex that matches the type mark-up:

\<number\>(?:(?!\</number\>).)*\</number\>\s*\<unit\>(?:(?!\</unit\>).)*\</unit\>

This expression matches the amount with all its mark-up. We can use the same substitution notation as above to define the complete rule:

(?<amount>\<number\>(?:(?!\</number\>).)*\</number\>\s*\<unit\>(?:(?!\</unit\>).)*\</unit\>) => <amount>$amount</amount>

Now our output string looks like:

salt (<amount><number>a</number> <unit>pinch</unit></amount>)  

Finally, we can match our ingredient name using this rule:

(?<ingredient>(?:(?!\s*\().)*)\(?:\<amount\>(?:(?!\</amount\>).)*\</amount\>)\) => <ingredient>$ingredient</ingredient> (<amount>$amount</amount>)

And we end up with this:

<ingredient>salt</ingredient> (<amount><number>a</number> <unit>pinch</unit></amount>)  

This is going quite well but there are some issues lurking. What if we had been more enthusiastic with our typing and classified number as value? It would appear as <value><number>a</number></value> and our amount rule would break because number would no longer be adjacent to unit. We need a way to account for superclass mark-up.

We could try looking for (?:\<[^\>]+\>)* before and (?:\</[^\>]\>)* after our existing expression. That will grab all the outer tags. The only trouble is that we might end up with one or more mismatched tags, for example:

<a><b><c>x</c>y</b>z</a>  

This would end up matching more outer tags on the left for all cases except a. To ensure that the start and end tag are matched, we need to employ a backreference. This lets us capture the first tag name and use it in the pattern for the last tag. Here’s what the complete regex looks like for amount:

(?<amount>(?:\<(?<tag>[^\>]+)\>)(?:\<[^\>]+\>)*\<number\>(?:(?!\</number\>).)*\</number\>(?:\</[^\>]+\>)*(?:\</\k<tag>\>)\s*(?:\<(?<tag>[^\>]+)\>)(?:\<[^\>]+\>)*\<unit\>(?:(?!\</unit\>).)*\</unit\>(?:\</[^\>]+\>)*(?:\</\k<tag>\>)) => <amount>$amount</amount>

It’s getting pretty unwieldy and it’s still not right because now we require at least one outer type. Handling the case of no outer types would virtually double the length of the regex. Surely there must be a better way?

We could just capture more than we need and when forming the output string we can discard the extra tags. The downside is that we’re moving away from plain regexes and building extra functionality around them but that was inevitable at some point. As a minimum we’ll want a framework for building and running all the regexes. That gets us back to the (relatively) simple form:

(?<amount>(?:\<[^\>]+\>)*\<number\>(?:(?!\</number\>).)*\</number\>(?:\</[^\>]+\>)*\s*(?:\<[^\>]+\>)*\<unit\>(?:(?!\</unit\>).)*\</unit\>(?:\</[^\>]+\>)*) => <amount>$amount</amount>

Tokenisation

In our previous implementation we converted the input string to a list of tokens and the algorithm operated on those tokens. Using regexes to operate over the whole input string, as we are now, this is no longer possible because regex operates on a single string.

Instead we have made the assumption that the boundary of each token is marked by either:

  • a space character
  • a mark-up start or end tag

Therefore we should still run our tokeniser function on the input string but we’ll take the token list output and join it to form a space delimited token string:

tokens = self.tokeniser.tokenise(in_str)  
token_str = ‘ ’.join(tokens)  

One side effect of this approach is that spaces in the input string will always mark the boundary of a token unless they are replaced with an alternative character prior to being consumed.

Syntax

It would be nice to simplify the syntax for our regexes. We could extend the regex syntax to more concisely handle some of our requirements. We could then transpile the extended syntax into bog standard regexes.

The main thing we need to improve is type matching. We need some new syntax for identifying a type and matching with a type. Let’s borrow <<type>> from UML. For defining a type we can extend the named capture group syntax:

(?<<unit>>pinch)

And for matching we can add a $ symbol:

(?<<$unit>>)

We can now write our amount rule much more concisely:

(?<<amount>>(?<<$number>>)(?<<$unit>>) => <amount>$amount</amount>

Our transpiler will convert this into the complete regex syntax ready to be run through a regex engine.

Next time

We’ll take the rough ideas we’ve formed in this article and try to form them into a Python implementation.