A simple method for classifying text

The problem

Suppose we have a string such as “3 tbsp tomato ketchup” and we want to extract:

  • ingredient name
  • unit
  • quantity

It shouldn’t be too difficult to split this into the individual parts, right?

Regex to the rescue?

If we know that the string will always have the format {quantity} {unit} {ingredient name} then the values can be extracted trivially with a suitable regex such as (\d+) (\w+) (.+). However, things are rarely so regular, especially when scraping data from the web.
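As a quick check of the fixed-format case, here is a minimal sketch using Python's standard re module. The function name parse_fixed is our own invention, and the second test string is a made-up example of input that does not fit the format.

```python
import re

# Pattern for the fixed format: {quantity} {unit} {ingredient name}
FIXED = re.compile(r"(\d+) (\w+) (.+)")

def parse_fixed(text):
    """Return (quantity, unit, name), or None if the text does not fit the format."""
    m = FIXED.fullmatch(text)
    return m.groups() if m else None

print(parse_fixed("3 tbsp tomato ketchup"))  # ('3', 'tbsp', 'tomato ketchup')
print(parse_fixed("a pinch of salt"))        # None
```

Note that every captured value is still a string; converting the quantity to a number would be a separate step.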

Suppose that our next string is “6 - 7 cups of three different types of vegetables, chopped into bite size pieces”. Now our regex solution is in deep water.

Of course, we could extend our regex to cater for the new features in this second string: (\d+(?:\s-\s\d+)?) (\w+) (?:of )?([^,]+) but it’s looking bad already and we’ve only considered two possible formats.
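To see how such an extended pattern behaves, here is a small sketch that tries it on both strings. The pattern is written with balanced groups and with the optional “of ” carrying its own trailing space; the name EXTENDED is ours.

```python
import re

# Quantity (single number or "a - b" range), unit, optional "of", then
# everything up to the first comma as the ingredient name.
EXTENDED = re.compile(r"(\d+(?:\s-\s\d+)?) (\w+) (?:of )?([^,]+)")

examples = (
    "3 tbsp tomato ketchup",
    "6 - 7 cups of three different types of vegetables, chopped into bite size pieces",
)

for text in examples:
    m = EXTENDED.match(text)
    print(m.groups() if m else None)

# ('3', 'tbsp', 'tomato ketchup')
# ('6 - 7', 'cups', 'three different types of vegetables')
```

It works for these two inputs, but the range is still one opaque string and the text after the comma is silently discarded.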

How about a separate regex for each case and we choose the one that matches? But what if more than one matches? And in what order should we try them all? It’s not obvious.
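The ambiguity is easy to reproduce. In the sketch below, the two patterns, their names and the example string “2 cups of flour” are all hypothetical, chosen only to show two formats matching the same input and disagreeing about the result.

```python
import re

# Two hypothetical per-format patterns.
PATTERNS = {
    "simple": re.compile(r"(\d+) (\w+) (.+)"),
    "with_of": re.compile(r"(\d+) (\w+) of (.+)"),
}

def matching_formats(text):
    """Return the name of every pattern that matches the whole text."""
    return [name for name, p in PATTERNS.items() if p.fullmatch(text)]

text = "2 cups of flour"
for name in matching_formats(text):
    print(name, PATTERNS[name].fullmatch(text).groups())

# simple ('2', 'cups', 'of flour')
# with_of ('2', 'cups', 'flour')
```

Both patterns match, and which ingredient name we end up with depends entirely on the order in which we try them.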

Iterative classification

Extracting everything we need in one regex match per string format looks unwieldy. Perhaps we can lower our ambitions and match simpler parts of the input string. For example, we know that tbsp is almost certainly a unit and we can be fairly confident in designating \d+ to be a number.
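Matching these simpler parts might look something like the following sketch. Splitting on whitespace and the names TOKEN_TYPES and classify_tokens are assumptions made for illustration only.

```python
import re

# Token-level patterns mentioned above; anything else stays unclassified.
TOKEN_TYPES = [
    (re.compile(r"\d+"), "number"),
    (re.compile(r"tbsp"), "unit"),
]

def classify_tokens(text):
    """Label each whitespace-separated token with a type, or None if unknown."""
    labelled = []
    for token in text.split():
        kind = next((k for p, k in TOKEN_TYPES if p.fullmatch(token)), None)
        labelled.append((token, kind))
    return labelled

print(classify_tokens("3 tbsp tomato ketchup"))
# [('3', 'number'), ('tbsp', 'unit'), ('tomato', None), ('ketchup', None)]
```

On its own this tells us nothing about the ingredient name, which is why the surrounding context needs to come into play.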

If we could build on these trivial definitions to assemble a set of rules that describe the essential grammar of the text, while maintaining a semblance of readability, then we’d be doing better than the pure regex approach.

To describe our example so far we might define the following classifications:

/\d+/ is number
number,/-/,number is range  
/tbsp/ is unit
/cups?/ is unit
number|range,unit,/of/? is amount  
amount,/\w+/+ is amount,ingredient  

These vary in complexity: some assign a type to a single pattern, while others assign a type to a pattern based on the types of its neighbours. However, they are all fairly readable.

These rules would be applied iteratively until either the required classifications have been made or no new classifications emerge. In the latter case, more rules would be needed to classify the example in hand.
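To make the “apply until nothing changes” loop concrete, here is one possible sketch. It is not a finished implementation: it handles only the first four rules, leaves out optional (?), repeated (+) and alternative (|) elements, and assumes a simple tokenizer that splits the input into words and punctuation marks. The names Node, RULES, tokenize, apply_rule and classify are all ours.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "raw" until some rule classifies the node
    text: str
    children: list = field(default_factory=list)

# The first four rules, in order. Each element is either a type name or a
# /regex/ that must match a whole unclassified token.
RULES = [
    ([r"/\d+/"], "number"),
    (["number", r"/-/", "number"], "range"),
    ([r"/tbsp/"], "unit"),
    ([r"/cups?/"], "unit"),
]

def tokenize(text):
    # Assumption of this sketch: words and single punctuation marks are tokens.
    return [Node("raw", t) for t in re.findall(r"\w+|[^\w\s]", text)]

def element_matches(element, node):
    if element.startswith("/") and element.endswith("/"):
        return node.kind == "raw" and re.fullmatch(element[1:-1], node.text) is not None
    return node.kind == element

def apply_rule(elements, kind, nodes):
    """Wrap every run of adjacent nodes matching `elements` in a node of `kind`."""
    out, i, changed = [], 0, False
    while i < len(nodes):
        window = nodes[i:i + len(elements)]
        if len(window) == len(elements) and all(map(element_matches, elements, window)):
            out.append(Node(kind, " ".join(n.text for n in window), window))
            i += len(elements)
            changed = True
        else:
            out.append(nodes[i])
            i += 1
    return out, changed

def classify(text):
    nodes = tokenize(text)
    changed = True
    while changed:                 # repeat until a full pass changes nothing
        changed = False
        for elements, kind in RULES:
            nodes, hit = apply_rule(elements, kind, nodes)
            changed = changed or hit
    return nodes

result = classify("6 - 7 cups of three different types of vegetables, chopped into bite size pieces")
print([(n.kind, n.text) for n in result[:3]])
# [('range', '6 - 7'), ('unit', 'cups'), ('raw', 'of')]
```

Because a classified node is no longer “raw”, the regex rules cannot match it a second time, which is what lets the loop settle. A rule that rewrote a type to itself would never settle, so a fuller version would need to guard against that.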

Let’s look at how this works with “6 - 7 cups of three different types of vegetables, chopped into bite size pieces”. The first rule is /\d+/ is number, meaning anything made of one or more digits is a number. Writing the classifications using markup, this yields:

<number>6</number>-<number>7</number>cups of three different types of vegetables, chopped into bite size pieces

The second rule is number,/-/,number is range, which classifies number, hyphen, number as a range. Our string now looks like:

<range><number>6</number>-<number>7</number></range>cups of three different types of vegetables, chopped into bite size pieces

The third rule is like the first but classifies tbsp as a unit. It doesn’t match anything in this string. The fourth rule is similar again: /cups?/ is unit, looking to classify cup or cups as a unit. Here, we do get a match:

<range><number>6</number>-<number>7</number></range><unit>cups</unit>of three different types of vegetables, chopped into bite size pieces

Next time

We’ll look at how to implement this idea in Python and what happens when we consider some more example input strings.