Last time we built a simple text classifier from regexes. Today we’ll go back to our ingredient quantities example and see how far this new classifier can carry us towards extracting a canonical ingredient name and quantity from a relatively free-form input string.

The new rules

If you’ve been following this project for a while then you’ll recall that we had a nice set of rules with the previous classifier that were of the general form:

pattern_or_type_1, pattern_or_type_2, ... => type_3, type_4, ...

To use our new regex classifier, we need to convert these to equivalent rules of the form:

rule(r'{(?<type_3>pattern_or_type_1)}{(?<type_4>pattern_or_type_2)}...', r'<<type_3>><<type_4>>...')

If we apply this mapping to our original rules then we end up with the following:

rules = [
	# units
	rule('{(?<unit>pinch)}', '<<unit>>'),
	rule('{(?<unit>mls?|mL|cc|millilitres?|milliliters?)}', '<<unit>>'),
	rule('{(?<unit>tsps?|t|teaspoons?)}', '<<unit>>'),
	rule('{(?<unit>tbsps?|Tbsps?|T|tbl|tbs|tablespoons?)}', '<<unit>>'),
	rule('{(?<unit>fl ?oz|fluid ounces?)}', '<<unit>>'),
	rule('{(?<unit>cups?)}', '<<unit>>'),
	rule('{(?<unit>p|pts?|pints?)}', '<<unit>>'),
	rule('{(?<unit>ls?|L|litres?|liters?)}', '<<unit>>'),
	rule('{(?<unit>gals?|gallons?)}', '<<unit>>'),
	rule('{(?<unit>dls?|dL|decilitre|deciliter)}', '<<unit>>'),
	rule('{(?<unit>gs?|grams?|grammes?)}', '<<unit>>'),
	rule('{(?<unit>oz|ounces?)}', '<<unit>>'),
	rule('{(?<unit>lbs?|#|pounds?)}', '<<unit>>'),
	rule('{(?<unit>kgs?|kilos?|kilograms?)}', '<<unit>>'),
	
	# numbers
	rule(r'{(?<number>\d+/?,/\d+\/\d+)}', '<<number>>'),
	rule(r'{(?<number>\d+(\.\d+)?)}', '<<number>>'),
	rule(r'{(?<number>\d*[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞])}', '<<number>>'),
	rule('{(?<number_word>a)}', '<<number_word>>'),
	
	# ingredient quantities
	rule('{(?<range><<number1>>(?:-|–)<<number2>>)}', '<<range>>'),
	rule(r'{(?<amount>(?:<<range>>|<<number>>|<<number_word>>)(?:-|–)?<<unit>>?\.?(?:of)?)}', '<<amount>>'),

These are all straightforward mappings from the original, except that the last original rule is missing. Why? Because it wasn’t a very tight definition and tended to classify some things incorrectly. Now that we have a richer syntax, let’s see if we can do better.

For now, we’ll just handle one specific case where the parenthesised amount follows the ingredient name. To handle that case, we add the following rule:

	rule(r'{(?<ingredient>.*)} \({(?<quantity><<amount>>)} \)', '<<quantity>><<ingredient>>'),
]

We’ll improve this later to handle other formats.
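The rule helper itself comes from the previous post and isn’t repeated here. Since classify (shown further down) reads r.p and r.s from each rule, a minimal stand-in might look like the sketch below. The namedtuple and the meaning given to its fields are assumptions inferred from that usage, not the original definition.

```python
from collections import namedtuple

# Assumed stand-in for the `rule` helper from the previous post:
# p is the pattern in our extended syntax, s is the substitution template.
rule = namedtuple('rule', ['p', 's'])

example = rule(r'{(?<unit>cups?)}', '<<unit>>')
print(example.p)  # the pattern half
print(example.s)  # the substitution half
```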

Just add data

Let’s throw the 300 ingredient quantities we scraped previously into our classifier and start fixing issues.

The first item in the list is:

'2 tablespoons raisins (optional)'

Since we made that last rule more specific, this one doesn’t match anymore. We need a new rule:

rule('{(?<quantity><<amount>>)}{(?<ingredient>.*)}', '<<quantity>><<ingredient>>')

We might observe that quantity and amount are essentially the same, so the extra classification is redundant. All we really want is to match amount and output it as-is in the substitution, which a more concise rule can express:

rule('<<amount>>{(?<ingredient>.*)}', '<<amount>><<ingredient>>')

This looks good but it doesn’t work because of a little bug in the type matching template. We are currently not capturing the content of the matched type in a named group - we’re using a non-capturing group. To fix this we need to change the start of the group enclosing the type from (?: to (?P<\g<type_and_index>> yielding the following template string:

(?: ?(?P<\g<type_and_index>>\<(?P<start_\g<type_and_index>>[a-z_]+)\>(?:\<[a-z_]+\>)*\<\g<type>\>(?:(?!\<\/\g<type>\>).)*\<\/\g<type>\>(?:\<\/[a-z_]+\>)*\<\/(?P=start_\g<type_and_index>)\>|\<\g<type>\>(?:(?!\<\/\g<type>\>).)*\<\/\g<type>\>) ?)

Now our rule produces a result:

 <amount><amount><number>2</number> <unit>tablespoons</unit></amount></amount>  <ingredient>raisins ( optional )</ingredient> 

It’s almost right but an additional classification for amount has been added, which we’d rather do without.

We can fix this by capturing all the different bits separately. This is a bit messy because we end up with four separate capture groups:

  • the type opening/closing tags and content
  • the surrounding type opening tags
  • the content
  • the surrounding type closing tags

The first is populated if there are no surrounding tags; otherwise the other three are populated. We’ll apply a prefix to each group name to distinguish them and, while we’re at it, capitalise those prefixes to ensure they can’t clash with another group name. All of this leads to the following translation:

pattern = re.sub(r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)(?:[0-9]+)?)\>\>', r'(?: ?(?:(?P<TPRE_\g<type_and_index>>\<(?P<TSTART_\g<type_and_index>>[a-z_]+)\>(?:\<[a-z_]+\>)*)\<\g<type>\>(?P<TMAIN_\g<type_and_index>>(?:(?!\<\/\g<type>\>).)*)\<\/\g<type>\>(?P<TPOST_\g<type_and_index>>(?:\<\/[a-z_]+\>)*\<\/(?P=TSTART_\g<type_and_index>)\>)|\<\g<type>\>(?P<T_\g<type_and_index>>(?:(?!\<\/\g<type>\>).)*)\<\/\g<type>\>) ?)', pattern)

We then change the substitution translation to insert the content of all these groups, in the appropriate order between the opening and closing tags for the type:

	template = re.sub(r'\<\<(?P<type>[a-z_]+)\>\>', r' <\g<type>>\\g<T_\g<type>>\\g<TPRE_\g<type>>\\g<TMAIN_\g<type>>\\g<TPOST_\g<type>></\g<type>> ', template)

We’re very close but this will fail because the new groups we’ve added don’t exist in the translated type capture. We need to add the T_ prefix to the type group and add empty capture groups for TPRE_, TMAIN_ and TPOST_:

	pattern = re.sub(r'\{\(\?\<(?P<type>[a-z_]+)\>(?P<content>.*?)\)\}', r' ?(?<![^\> ])(?P<T_\g<type>>\g<content>)(?![^\< ]) ?(?P<TPRE_\g<type>>)(?P<TMAIN_\g<type>>)(?P<TPOST_\g<type>>)', pattern)

This now works as expected. It’s a shame we need those empty groups, but Python’s re module raises an error if the substitution template refers to a group that doesn’t exist in the pattern. While we’re fixing extraneous classifications, let’s take a look at the other ingredient rule:

{(?<ingredient>.*)} \({(?<quantity><<amount>>)} \)

We can remove the unnecessary reclassification of amount as quantity here too:

{(?<ingredient>.*)} \(<<amount>>\)
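The need for the empty TPRE_, TMAIN_ and TPOST_ groups can be seen in isolation. The sketch below uses made-up group names (T_x, TPRE_x), a toy pattern and a hypothetical try_sub helper, purely to illustrate how Python’s re module treats a template that refers to a group the pattern doesn’t define.

```python
import re

# Template that refers to two named groups.
template = r'<x>\g<T_x>\g<TPRE_x></x>'

def try_sub(pattern, text):
    """Return the substituted text, or None if the template is rejected."""
    try:
        return re.sub(pattern, template, text)
    except (re.error, IndexError):
        # An unknown group name in the template is reported as an error.
        return None

# Pattern without a TPRE_x group: the template cannot be applied.
missing = try_sub(r'(?P<T_x>\d+)', '42')

# Adding an empty placeholder group makes the same template valid.
padded = try_sub(r'(?P<T_x>\d+)(?P<TPRE_x>)', '42')

print(missing)
print(padded)
```

The placeholder group always matches the empty string, so it contributes nothing to the output; it only has to exist.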

More examples

Now that our classifier captures matched types properly, we can go back to checking that it classifies the rest of our data correctly.

Some examples have not been classified because they are just an ingredient without an amount. We’ll probably treat those as an ingredient name with an implicit quantity of 1, but we’ll think about them later, once we’ve dealt with the obvious cases.

The first obvious error is in handling fractional numbers of the form 1 / 4, as the following classified string demonstrates:

 <amount><number>1</number></amount>  <ingredient>/ <amount><number>4</number> <unit>teaspoon</unit></amount> cayenne</ingredient> 

The numerator and denominator have been recognised as separate numbers rather than considering the fraction as a single entity. This is due to an error in translating the rule into the new format. The pattern for matching a fractional number should look like:

{(?<number>(?:\d* )?\d+ \/ \d+)}

This yields the following result:

 <amount><number> <number>1</number></amount>  <ingredient>/ <amount><number>4</number> </amount> </number> <unit>teaspoon</unit> cayenne</ingredient> 

Still not quite right because the 1 and 4 are still being classified separately. If we want to avoid that, we need to combine the number rules with the | operator so only one can match:

{(?<number>(?:\d* )?\d+ \/ \d+|\d+(\.\d+)?|\d*[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞])}

This yields:

 <amount><number>1 / 4</number> <unit>teaspoon</unit></amount>  <ingredient>cayenne</ingredient> 
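The effect of combining the alternatives can be reproduced with plain Python regexes. This is a simplified sketch: the NUMBER pattern below drops the classifier’s {(?<number>...)} wrapper and approximates its boundary checks with whitespace lookarounds, and the numbers helper is hypothetical. All of these are assumptions made for illustration.

```python
import re

# The three number forms joined with |, tried left to right at each position.
NUMBER = re.compile(
    r'(?<!\S)(?:'
    r'(?:\d* )?\d+ / \d+'          # fractions such as "1 / 4" or "1 1 / 2"
    r'|\d+(?:\.\d+)?'              # integers and decimals
    r'|\d*[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞]'   # vulgar fraction characters
    r')(?!\S)'
)

def numbers(text):
    """Return every number found in text, each as a single match."""
    return NUMBER.findall(text)

print(numbers('1 / 4 teaspoon cayenne'))
print(numbers('2.5 g salt'))
```

Because the fraction alternative is listed first and matching resumes after the end of each match, the 1 and the 4 are never offered to the plain-integer alternative, which mirrors the single number tag in the classified output above.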

Exactly what we want! What other errors can we spot? The ingredient has not been classified correctly in this example:

umeboshi paste <amount><number>4</number> <unit>tbsp</unit></amount>  <ingredient>, see notes below</ingredient> 

Let’s look again at our rule:

<<amount>>{(?<ingredient>.*)}

We want to be able to match an ingredient preceding an amount. We’d ideally like to replace this with:

<<amount>>{(?<ingredient>.*)}|{(?<ingredient>.*)}<<amount>>

But that’s not going to work because we have duplicate group names, and Python’s re module rejects those even when they sit in mutually exclusive alternatives, as they do here. To fix it we have to number the groups:

<<amount1>>{(?<ingredient1>.*)}|{(?<ingredient2>.*)}<<amount2>>

This works, but only after fixing a bug in the translation regexes: we need to add [0-9]* to the group name in the type capture translation regex pattern, and also make sure type_and_index is used for all the group names and references rather than type, which should only be used for outputting the type tags. While we’re changing this we can also replace (?:[0-9]+)? with the rather more concise [0-9]*. The complete updated re_sub function looks like this:

import re

def re_sub(pattern, template, input):
	
	# translate pattern to python regex pattern
	pattern = re.sub(r'\{\(\?\<(?P<type_and_index>[a-z_]+[0-9]*)\>(?P<content>.*?)\)\}', r' ?(?<![^\> ])(?P<T_\g<type_and_index>>\g<content>)(?![^\< ]) ?(?P<TPRE_\g<type_and_index>>)(?P<TMAIN_\g<type_and_index>>)(?P<TPOST_\g<type_and_index>>)', pattern)
	pattern = re.sub(r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>', r'(?: ?(?:(?P<TPRE_\g<type_and_index>>\<(?P<TSTART_\g<type_and_index>>[a-z_]+)\>(?:\<[a-z_]+\>)*)\<\g<type>\>(?P<TMAIN_\g<type_and_index>>(?:(?!\<\/\g<type>\>).)*)\<\/\g<type>\>(?P<TPOST_\g<type_and_index>>(?:\<\/[a-z_]+\>)*\<\/(?P=TSTART_\g<type_and_index>)\>)|\<\g<type>\>(?P<T_\g<type_and_index>>(?:(?!\<\/\g<type>\>).)*)\<\/\g<type>\>) ?)', pattern)
	
	# translate template to python regex template
	template = re.sub(r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>', r' <\g<type>>\\g<T_\g<type_and_index>>\\g<TPRE_\g<type_and_index>>\\g<TMAIN_\g<type_and_index>>\\g<TPOST_\g<type_and_index>></\g<type>> ', template)
	
	# substitute using the translated pattern, template
	return re.sub(pattern, template, input)
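The restriction that forced the numbered groups is easy to confirm in isolation. The sketch below uses toy patterns and hypothetical names (item, item1, item2, swap) rather than the classifier’s own translation. It also shows that groups from the branch that didn’t match expand to empty strings in the template (Python 3.5 and later), which is worth keeping in mind when reading the output of the numbered rule.

```python
import re

# Duplicate names are rejected at compile time, even across alternatives.
try:
    re.compile(r'(?P<item>\w+) first|second (?P<item>\w+)')
    duplicate_ok = True
except re.error:
    duplicate_ok = False

# Numbering the names makes the pattern valid.
numbered = re.compile(r'(?P<item1>\w+) first|second (?P<item2>\w+)')

def swap(text):
    # Groups from the branch that did not match expand to ''.
    return numbered.sub(r'[\g<item1>][\g<item2>]', text)

print(duplicate_ok)
print(swap('apple first'))
print(swap('second pear'))
```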

We’ll have to update the substitution part of our rule to include all the numbered groups. The complete rule is now:

rule('<<amount1>>{(?<ingredient1>.*)}|{(?<ingredient2>.*)}<<amount2>>', '<<amount1>><<ingredient1>><<amount2>><<ingredient2>>')

This produces the following result:

 <amount></amount>  <ingredient></ingredient>  <amount><number>4</number> <unit>tbsp</unit></amount>  <ingredient>umeboshi paste </ingredient> , see notes below

It’s basically correct but we see a few empty classifications in there as a result of not all the type groups matching. The easiest way to fix this is to simply remove empty groups afterwards using a regex replace in the classify function:

def classify(rules, s):
	for r in rules:
		s = re_sub(r.p, r.s, s)
	return re.sub(r'\<(?P<type>[^\>]+)\>\<\/(?P=type)\>', '', s)

It would be nice not to need this removal step, but handling it in the original substitution looks difficult, and it is legitimate to assert that an empty classification is equivalent to no classification.

We now get exactly what we want:

<amount><number>4</number> <unit>tbsp</unit></amount>  <ingredient>umeboshi paste </ingredient> , see notes below
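The clean-up step can also be tried on its own. The helper name strip_empty below is hypothetical; it uses a regex equivalent to the one in classify, with the unnecessary escapes removed.

```python
import re

# Matches an opening tag immediately followed by its own closing tag.
EMPTY_TAG = re.compile(r'<(?P<type>[^>]+)></(?P=type)>')

def strip_empty(s):
    """Remove classifications that have no content, in a single pass."""
    return EMPTY_TAG.sub('', s)

print(strip_empty('<amount></amount><ingredient>umeboshi paste</ingredient>'))
```

A single pass leaves an outer tag behind if it only becomes empty once an inner one is removed, so nested empty classifications would need the substitution repeated.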

This is great but there are a couple of niggling problems with having to splice together logically separate rules like we’ve done here and previously with the number rules:

  • it reduces the clarity of the rules
  • it doesn’t scale very well

Next time

We’ll look at how we might separate our rules again to improve clarity and scalability.