Classifying text with Python regex

Last time we noodled on the possibility of using regex substitutions to classify a string by applying type mark-up. Today, let’s implement a simple classifier in Python using this approach.

Python regex substitutions

The re module provides a method for substituting non-overlapping matches in an input string with an expanded template string:

re.sub(pattern, template, input)  

Fortunately for us, the template string can reference named groups within the pattern. A named group is defined using the following notation:


And it is expanded in the template string using the following notation:


Support for named groups is very useful because it means we don’t have to translate our type names to group numbers, which would need an additional pre-processing step.

Type syntax

We’re going to augment the python regex syntax with some new syntax for capturing, matching and substituting types.


To capture a type, we define a special kind of capturing group:

{(?<my-type>the pattern)}

We’re using curly braces to distinguish this from a regular group. The combination of { and ( with the curly brace on the outside is unlikely to occur in a regular regex.


To match a type, we use the double angle bracket notation in the pattern:


This will match if the specified type is found in the input string.


To substitute a type, we also use the double angle bracket notation but this time in the template string. This replaces the matched string with a typed version of the string.


Expands to:

<my-type>captured content</my-type>  

A simple example

Let’s consider one of our simpler rules. If we want to classify pinch as a unit we could define the following substitution using our new syntax:

{(?<unit>pinch)} => <<unit>>

This would transform the string:

salt (a pinch)  


salt (a <unit>pinch</unit)  

Making it work

Of course, we can’t simply pass our pattern and template to re.sub and expect it to work because it doesn’t support our new syntax. If we want re.sub to understand us we’ll have to translate our pattern and template into plain regex format.

We need to match instances of our new syntax and substitute them with a supported equivalent. We can do this using - you guessed it - re.sub!

Type capture translation

First let’s look at translating our type capturing syntax. Here’s the pattern from our example again:


It’s fairly simple to concoct a regex that matches this syntax and captures the type name and the content that follows it:


We can replace this with a regular regex capture group:


We need a bit more though. The string pinch could be matched in the middle of another word so we should make sure this is a whole word match. We might think of using \b to check for a word boundary but that’s a bit over-zealous here because we’re really looking for a token boundary, which is denoted by a space or a mark-up tag. How about [\> ] and [\< ] then? Well, they won’t work at the start/end of the input string because there’s no preceding character to match. We can flip this around though, by using a double negation:

(?<![^\> ])

This uses negative lookbehind to assert that the preceding character is not any character besides a space or an angle bracket. The advantage of negative lookbehind here is that if we’re at the start of the input string the assertion still holds despite there being no preceding character. A similar approach can be employed using negative lookahead for the end.

Incorporating these assertions along with an optional space character at the start and end, we arrive at the following substitution string:

 ?(?<![^\> ])(?P<unit>pinch)(?![^\< ]) ?

Generalising this to a substitution template string (where \g<name> expands to the captured content of the named group) we get:

 ?(?<![^\> ])(?P<\g<type>>\g<content>)(?![^\< ]) ?

We can now write our complete substitution statement for replacing type capture groups:

out = re.sub(r'\{\(\?\<(?P<type>[a-z\-]+)\>(?P<content>.*)\)\}', r' ?(?<![^\> ])(?P<\g<type>>\g<content>)(?![^\< ]) ?', input)  

Type substitution translation

Now we’ll move on to the substitution. Let’s examine our example again. Here’s the substitution part:


We can match this trivially with the following regex:


We want to surround the captured string with tags indicating its type, i.e.


We also need to add a space character before and after to keep the token separation. Therefore the substitution is simply:

' <\g<type>>\\g<\g<type>></\g<type>> '  

(Quotes added to highlight leading/trailing whitespace)

Simple example program

If we steal our tokenise function from PyFathom, we now have enough bits and pieces to write a program to solve our simple example:

input = 'salt (a pinch)'

# tokenise the input
def tokenise(s):  
    return [t for t in re.split('([a-zA-Z][a-zA-Z\\-]*|[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞\\d]+|[^\\w ])', s) if t.strip() != '']
tok_str = ' '.join(tokenise(input))

# translate pattern
unit_pattern = re.sub(r'\{\(\?\<(?P<type>[a-z\-]+)\>(?P<content>.*)\)\}', r' ?(?<![^\> ])(?P<\g<type>>\g<content>)(?![^\< ]) ?', unit_pattern)

# translate substitution
unit_sub = re.sub(r'\<\<(?P<type>[a-z\-]+)\>\>', r' <\g<type>>\\g<\g<type>></\g<type>> ', unit_sub)

# classify input
classified = re.sub(unit_pattern, unit_sub, tok_str)  

This produces the desired output:

salt (a <unit>pinch</unit>)  

This looks like a bit of a faff to do something simple at the moment but the key achievement here is that we can reuse those translation regexes on many different rules. We’re not quite done though because our simple example didn’t do any type matching.

Type matching translation

Let’s expand our example a bit with a rule that matches a type and classifies it as another type:

word_pattern = '{(?<word><<unit>>)}'  
word_sub = '<<word>>'  

This matches a unit and classifies it as a word. We can match our syntax with the following simple regex:


What regex do we replace this with though? In the simplest case we just want to match:


Where ... is any text other than the unit closing tag. We can match this straightforwardly using the following regex:


However, as we saw in the last article, this is not going to suffice because we would like to capture any assigned classes along with the specific matched class. For example, consider:

<mammal><bear>polar bear</bear></mammal>eats person  

If we classify <<bear>>eats person as scary then we want to see:

<scary><mammal><bear>polar bear</bear></mammal>eats person</scary>  

Rather than the equally correct but slightly less pleasant form:

<mammal><scary><bear>polar bear</bear></mammal>eats person</scary>  

Where the mammal and scary tags cross over.

It’s worth noting that the order of the tags doesn’t imply any kind of superclass relationship. It is often the case that a superclass encloses a subclass but this is a side-effect of the way classification is naturally performed bottom-up and is not strictly enforced by the mark-up syntax.

In fact, there are cases where it is not possible to have a neat superclass-subclass relationship such as where one class overlaps another but neither is a subset of the other, e.g.

<buttery>butter, <beany>butter beans</buttery>, black beans<\beany>  

The buttery parts of this phrase overlap the beany parts and neither encloses the other.

We can grab the additional enclosing classes by searching for <tag-name> before and </tag-name> afterwards. However, we need to make sure we have the same number of opening and closing tags. We can do this by checking the first and last tag name match as we saw in the previous article. We still have to handle the case where there are no enclosing tags and therefore we end up replacing it with the following regex:

 ?(?:\<(?P<start_\g<type>>[a-z\-]+)\>(?:\<[a-z\-]+\>)*\<\g<type>\>(?:(?!\<\/\g<type>\>).)*\<\/\g<type>\>(?:\<\/[a-z\-]\>)*\<\/(?P=start_\g<type>)\>|\<\g<type>\>(?:(?!\<\/\g<type>\>).)*\<\/\g<type>\>) ?

This is rather unwieldy but we only have to understand it once and then we can forget about it and use the enhanced syntax that it translates from. In essence this regex says:

  • try to match any start tag
  • followed by zero or more start tags
  • followed by the start tag we’re interested in
  • followed by anything that isn’t the end tag we’re interested in
  • followed by the end tag we’re interested in
  • followed by zero or more end tags
  • followed by an end tag that matches the first start tag

and if that doesn’t match:

  • try to match the start tag we’re interested in
  • followed by anything that isn’t the end tag we’re interested in
  • followed by the end tag we’re interested in

We also added ' ?' at the start and end to soak up extraneous whitespace between tokens.

One last complication is: what happens if we want to capture the same type twice? As it stands the regex will blow up because of duplicate group names. We need a way to differentiate them so we’ll add an optional index after the type, e.g.

word_pattern = '{(?<word><<unit1>><<unit2>>)}'  

To support this we need to change our replacement rule to match:


This matches the type for use in the tags and the type plus optional index for use in the group name. We then replace our template with:

 ?(?:\<(?P<start_\g<type_and_index>>[a-z\-]+)\>(?:\<[a-z\-]+\>)*\<\g<type>\>(?:(?!\<\/\g<type>\>).)*\<\/\g<type>\>(?:\<\/[a-z\-]\>)*\<\/(?P=start_\g<type_and_index>)\>|\<\g<type>\>(?:(?!\<\/\g<type>\>).)*\<\/\g<type>\>) ?

Which is the same as before except we are using type_and_index in the group name instead of type.

Enhanced regex substitution function

Let’s wrap up our translation regexes in a function as we’ll want to re-use them:

# enhanced regex substitution
def re_sub(pattern, template, input):

    # translate pattern to python regex pattern
    pattern = re.sub(r'\{\(\?\<(?P<type>[a-z\-]+)\>(?P<content>.*)\)\}', r' ?(?<![^\> ])(?P<\g<type>>\g<content>)(?![^\< ]) ?', pattern)
    pattern = re.sub(r'\<\<(?P<type_and_index>(?P<type>[a-z\-]+)(?:[0-9]+)?)\>\>', r' ?(?:\<(?P<start_\g<type_and_index>>[a-z\-]+)\>(?:\<[a-z\-]+\>)*\<\g<type>\>(?:(?!\<\/\g<type>\>).)*\<\/\g<type>\>(?:\<\/[a-z\-]\>)*\<\/(?P=start_\g<type_and_index>)\>|\<\g<type>\>(?:(?!\<\/\g<type>\>).)*\<\/\g<type>\>) ?', pattern)

    # translate template to python regex template
    template = re.sub(r'\<\<(?P<type>[a-z\-]+)\>\>', r' <\g<type>>\\g<\g<type>></\g<type>> ', template)

    # substitute using the translated pattern, template
    return re.sub(pattern, template, input)

Now we can call re_sub instead of re.sub and use our enhanced type capture/match/substitute syntax.

Let’s use this to solve our classification problem:

input = ' '.join(tokenise('salt (a pinch)'))

output = re_sub('{(?<unit>pinch)}', '<<unit>>', input)  
output = re_sub('{(?<word><<unit>>)}', '<<word>>', output)


This gives the desired result:

salt ( a <word><unit>pinch</unit></word>)  

And there it is! We’ve built a regex substitution function that neatly handles type annotation using the standard regex sub method as our building block.

Next time

We’ll go back to our ingredient quantities example and see if we can apply our new classifier to solve it. In particular, whether we now have enough tools at our disposal to attempt to extract a canonical ingredient name.