Last time we built a classifier that uses a monte-carlo tree search algorithm to find a sequence of rules that can be applied to an input string to produce an output string with the desired classifications.

This time we’ll make a couple of improvements to improve the classification accuracy and then extract the ingredients and quantities from our list of 300 ingredients.

Some clean-up

Before we make the real changes let’s clean up the code a bit. We can move our regex translation into the rule constructor and add match and sub methods that use the translated pattern and substitution strings. Let’s also use simplified translation regexes because we don’t need to match enclosing tags for our purposes. The new rule class looks like this:

class rule:
	
	def __init__(self, pattern, substitution):
		self.p = rule._translate_type_captures(rule._translate_type_matches(pattern))
		self.s = rule._translate_type_substitutions(substitution)
		
	def match(self, s):
		return re.match(self.p, s)
		
	def sub(self, s):
		return re.sub(self.p, self.s, s)
		
	def _translate_type_captures(s):
		
		pat = r'\{\(\?\<(?P<type_and_index>[a-z_]+[0-9]*)\>(?P<content>.*?)\)\}'
		
		rep = r' ?(?<![^\> ])(?P<T_\g<type_and_index>>\g<content>)(?![^\< ]) ?'
		
		return re.sub(pat, rep, s)
		
	def _translate_type_matches(s):
		
		pat = r'\<\<!(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
		
		rep = r'(?! ?\<\g<type>\>)'
		
		s2 = re.sub(pat, rep, s)
		
		pat = r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
		
		rep = r' ?\<\g<type>\>(?P<T_\g<type_and_index>>(?:(?!\<\/\g<type>\>).)*)\<\/\g<type>\> ?'
		
		return re.sub(pat, rep, s2)
		
	def _translate_type_substitutions(s):
		
		pat = r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
		
		rep = r' <\g<type>>\\g<T_\g<type_and_index>></\g<type>> '
		
		return re.sub(pat, rep, s)

We’ll also make a minor change to our tokenise function to return a space delimited string instead of an array of tokens - as we always end up converting to a string anyway and we’ll replace . with \.:

def tokenise(s):
	return ' '.join([t for t in re.split(r'([a-zA-Zñ][a-zA-Zñ\-]*|\d+\.\d+|[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞\d]+|[^\w ])', s) if t.strip() != ''])

Keep winning!

Our first improvement is to allow more than one win along a path. This is useful because we may expect to match a pattern multiple times in the input string. Currently, we stop when the first match is found or we have to specify multiple matches in the pattern, which then fails to reward fewer matches resulting in missed classifications in that case. A better solution is to weight the reward based on the number of matches:

win_weight = len(rout.findall(s))

Or even better, we could use the total length of the matched groups:

win_weight = len(''.join([''.join(x) if isinstance(x, tuple) else x for x in rout.findall(s)]))

This requires findall to be exposed by the rule class instead of match:

	def findall(self, s):
		return re.findall(self.p, s)

We then need a way to weight a win using this factor. A simple solution is to pass the weight into the explored_move.win method and instead of incrementing wins by 1 we add this number. That means if we find two matches then we chalk up two wins:

def win(self, i):
		self.wins += i
		if self.prev_move is not None:
			self.prev_move.win(i)

Instead of calling win as soon as we find the first match, we want to keep track of the best win (most matches) and record that when we reach the end of the path. We’ll also keep track of the node where the win occurred as we don’t want to attribute wins to subsequent moves along the path.

Before the loop over depth, we initialise the best win so far:

		best_win = 0
		best_win_last_move = None

At the end of each iteration, we accumulate the best win so far:

			win_weight = len(''.join([''.join(x) if isinstance(x, tuple) else x for x in rout.findall(s)]))
			if win_weight > best_win:
				best_win = win_weight
				best_win_last_move = cur_move

After the loop, we record the best win that was found (if applicable) from the move where it occurred:

		if best_win > 0:
			best_win_last_move.win(best_win)

Get with the program

So far we have called the classifier with a pattern representing the desired final state. The search algorithm finds possible paths to reach that state and chooses the path that is most successful. However, there is a problem with this approach: if there are two possible routes to a win then the algorithm will tend to favour the simplest.

In our ingredient quantity example, we define an amount as being a number and a unit, e.g.

<amount><number>2</number><unit>tbsp</unit></amount>

In some cases, though, there is no unit (e.g. 1 banana). So we also allow an amount to be just a number:

<amount><number>1</number></amount>

But that means our classifier tends to incorrectly classify our first example as:

<amount><number>2</number></amount>tbsp

We need to give the classifier some more information - it needs to know something about the sequence of classifications. In other words we need to give it a program of patterns. Instead of just an end state, this program gives the waypoints along the way.

For the above example, our patterns could be:

<<number>>
<<unit>>
<<amount>>
<<ingredient>>

We implement this as multiple calls to play() (one for each pattern in the program, in order) with the output of the previous call piped into the input of the next:

def program(rules, sin, pouts, paths, max_depth=1000):
	s = sin
	for pout in pouts:
		s = play(rules, s, pout, paths, max_depth)
	return s

Running the classifier

Let’s update our rules a bit and run the classifier. The main change is to assume that we always have an amount and not just a number adjacent to an ingredient. That’s because numbers are wrapped in amounts by a prior stage before we attempt to match ingredients. The complete updated rule set is:

rules = [
	# units
	rule('{(?<unit>pinch)}', '<<unit>>'),
	rule('{(?<unit>cloves?)}', '<<unit>>'),
	rule('{(?<unit>mls?|mL|cc|millilitres?|milliliters?)}', '<<unit>>'),
	rule('{(?<unit>tsps?|t|teaspoons?)}', '<<unit>>'),
	rule('{(?<unit>tbsps?|Tbsps?|T|tbl|tbs|tablespoons?)}', '<<unit>>'),
	rule('{(?<unit>fl ?oz|fluid ounces?)}', '<<unit>>'),
	rule('{(?<unit>cups?)}', '<<unit>>'),
	rule('{(?<unit>p|pts?|pints?)}', '<<unit>>'),
	rule('{(?<unit>ls?|L|litres?|liters?)}', '<<unit>>'),
	rule('{(?<unit>gals?|gallons?/)}', '<<unit>>'),
	rule('{(?<unit>dls?|dL|decilitre|deciliter)}', '<<unit>>'),
	rule('{(?<unit>gs?|grams?|grammes?)}', '<<unit>>'),
	rule('{(?<unit>oz|ounces?)}', '<<unit>>'),
	rule('{(?<unit>lbs?|#|pounds?)}', '<<unit>>'),
	rule('{(?<unit>kgs?|kilos?|kilograms?)}', '<<unit>>'),
	
	# numbers
	rule('{(?<number>(?:\d* )?\d+ ?\/ ?\d+|\d*\s?[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞]|\d+(\.\d+)?|\d+(?:-|–)\d+|an?)}', '<<number>>'),
	
	# amounts
	rule('{(?<amount>to taste)}', '<<amount>>'),
	rule('{(?<amount><<number1>><<unit1>>\+<<number2>><<unit2>>)}', '<<amount>>'),
	rule('{(?<amount><<number>>(?:-|–)?<<unit>>\.?(?: of)?)}', '<<amount>>'),
	rule('{(?<amount><<number>><<!unit>>(?: of)?)}', '<<amount>>'),
	
	# approximations
	rule('{(?<approx>about <<amount>>)}', '<<approx>>'),
	
	# ingredient quantities
	rule('{(?<ingredient>\s?\w[\s\w]+)} \(<<amount>>\)', '<<amount>><<ingredient>>'),
	rule('<<amount>>{(?<ingredient>\s?\w[\s\w]+)}', '<<amount>><<ingredient>>'),
	rule('<<amount1>> \( <<amount2>> \) {(?<ingredient>\s?\w[\s\w]+)}', '<<amount1>><<amount2>><<ingredient>>'),
	rule('<<amount>>{(?<ingredient>\s?\w[\s\w]+)}.*\(<<approx>>\)', '<<approx>><<ingredient>>'),
	rule('^{(?<ingredient>\s?\w[\s\w]+)}<<amount1>>(?: \(<<amount2>>\))?', '<<amount1>><<ingredient>>')
]

Running the classifier yields good results. Filtering out the correct classifications gives the following list of errors/failures:

--salt and pepper
--Lemon pepper
--Lime juice (to serve)
--Green onions
--Toasted sesame seeds
--Avocado basil sauce (recipe below)
--Caramelized kimchi (recipe below)
--Salt
--Pepper
plus 
plus
jalape ñ o chile
--Fresh cracked pepper
jalape ñ o chile
--120g gluten-free wild rice
--Coriander
--Lime
--Black pepper
--1 (1-inch) knob of ginger
--4 cloves
--Generous dash of black pepper
--black pepper, to taste
Tablespoons flour
--Salt and pepper
--Chinese five-spice ¼ tsp
--1 15-ounce can chickpeas (rinsed, drained, and dried)
--3 Tbsp gluten-free panko bread crumbs
--1 15-ounce can tomato sauce
--Red pepper flakes
--Vegan parmesan cheese
--Fresh basil
--ground pepper
large brown
--Salt and Pepper
--handful of fresh Cilantro - optional
--salt and pepper , for seasoning
--sesame seeds , for sprinkling
--leftover marinade from the mushrooms
--pasta of your choice
--salt, pepper
--1 tbsp + 1 tsp all-purpose flour use cornstarch to make it gluten free
--1 tbsp (15 ml) extra-virgin olive oil
plus 
--1 (1 lb 454 g) box of spaghetti

43 errors - not too bad from 300 inputs. Let’s group together the similar failures by cause and attempt to fix them:

Tokeniser not handling ñ correctly

jalape ñ o chile
jalape ñ o chile

This one is an easy fix. We just need to add the missing character to the regex in the character classes we’re using to match words:

def tokenise(s):
	return ' '.join([t for t in re.split('([a-zA-Zñ][a-zA-Zñ\\-]*|\\d+.\\d+|[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞\\d]+|[^\\w ])', s) if t.strip() != ''])

Unrecognised unit

Tablespoons flour

This one is also trivial. We just need to add Tablespoons (with a capital T) to the tablespoons rule:

rule('{(?<unit>tbsps?|Tbsps?|T|tbl|tbs|tablespoons?|Tablespoons?)}', '<<unit>>')

Unrecognised amounts

--Lime juice (to serve)
--salt and pepper , for seasoning
--sesame seeds , for sprinkling
--Generous dash of black pepper

These are also easy to fix with the addition of some simple rules:

	rule(r'{(?<amount>to taste)}', '<<amount>>'),
	rule(r'{(?<amount>to serve)}', '<<amount>>'),
	rule(r'{(?<amount>for \w+ing)}', '<<amount>>'),
	rule(r'{(?<amount>[gG]enerous dash of)}', '<<amount>>')

Unrecognised ingredient formats

--black pepper, to taste
play
--120g gluten-free wild rice
--1 (1-inch) knob of ginger
--4 cloves
--Chinese five-spice ¼ tsp
--1 15-ounce can chickpeas (rinsed, drained, and dried)
--3 Tbsp gluten-free panko bread crumbs
--1 15-ounce can tomato sauce
large brown
--1 tbsp + 1 tsp all-purpose flour use cornstarch to make it gluten free
--1 tbsp (15 ml) extra-virgin olive oil
plus 
--1 (1 lb 454 g) box of spaghetti

These are due to missing/incorrect ingredient rules. We can continue to add rules to fix these issues.

rules = [
	# units
	rule(r'{(?<unit>pinch)}', '<<unit>>'),
	rule(r'{(?<unit>cloves?)}', '<<unit>>'),
	rule(r'{(?<unit>mls?|mL|cc|millilitres?|milliliters?)}', '<<unit>>'),
	rule(r'{(?<unit>tsps?|t|teaspoons?)}', '<<unit>>'),
	rule(r'{(?<unit>[tT]bsps?|T|tbl|tbs|[tT]ablespoons?)}', '<<unit>>'),
	rule(r'{(?<unit>fl ?oz|fluid ounces?)}', '<<unit>>'),
	rule(r'{(?<unit>cups?)}', '<<unit>>'),
	rule(r'{(?<unit>p|pts?|pints?)}', '<<unit>>'),
	rule(r'{(?<unit>ls?|L|litres?|liters?)}', '<<unit>>'),
	rule(r'{(?<unit>gals?|gallons?/)}', '<<unit>>'),
	rule(r'{(?<unit>dls?|dL|decilitre|deciliter)}', '<<unit>>'),
	rule(r'{(?<unit>gs?|grams?|grammes?)}', '<<unit>>'),
	rule(r'{(?<unit>oz|ounces?)}', '<<unit>>'),
	rule(r'{(?<unit>lbs?|#|pounds?)}', '<<unit>>'),
	rule(r'{(?<unit>kgs?|kilos?|kilograms?)}', '<<unit>>'),
	
	# numbers
	rule(r'{(?<number>(?:\d* )?\d+ ?\/ ?\d+|\d*\s?[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞]|\d+(\.\d+)?)}', '<<number>>'),
	rule(r'{(?<number>an?)}', '<<number>>'),
	
	# amounts
	rule(r'{(?<amount>to taste)}', '<<amount>>'),
	rule(r'{(?<amount>to serve)}', '<<amount>>'),
	rule(r'{(?<amount>for \w+ing)}', '<<amount>>'),
	rule(r'{(?<amount>[gG]enerous dash)}', '<<amount>>'),
	rule(r'{(?<amount><<number1>><<unit1>>\+<<number2>><<unit2>>)}', '<<amount>>'),
	rule(r'{(?<amount><<number>>[\-–]?<<unit>>\.?)}', '<<amount>>'),
	rule(r'{(?<amount><<number>><<!unit>>(?! [\-–] \<number\>)(?: of)?)}', '<<amount>>'),
	
	# ingredients
	rule(r'^<<amount1>>{(?<ingredient>\w[\s\w-]+)}', '<<amount1>><<ingredient>>'),
	rule(r'^<<amount1>><<amount2>>{(?<ingredient>\w[\s\w-]+)}', '<<amount1>><<amount2>><<ingredient>>'),
	rule(r'^<<amount1>> \( <<amount2>> \) {(?<ingredient>\w[\s\w-]+)}', '<<amount1>><<amount2>><<ingredient>>'),
	rule(r'^<<amount1>> \( <<amount2>><<amount3>> \) {(?<ingredient>\w[\s\w-]+)}', '<<amount1>><<amount2>><<amount3>><<ingredient>>'),
	rule(r'^<<amount1>> \+ <<amount2>>{(?<ingredient>\w[\s\w-]+)}', '<<amount1>>+<<amount2>><<ingredient>>'),
	rule(r'^<<amount1>> plus <<amount2>>{(?<ingredient>\w[\s\w-]+)}', '<<amount1>>+<<amount2>><<ingredient>>'),
	rule(r'^<<amount1>> \( <<amount2>> \) plus <<amount3>>{(?<ingredient>\w[\s\w-]+)}', '<<amount1>><<amount2>><<amount3>><<ingredient>>'),
	rule(r'^<<amount1>> [\-–] <<amount2>>{(?<ingredient>\w[\s\w-]+)}', '<<amount1>>-<<amount2>><<ingredient>>'),
	rule(r'^<<amount1>> \| <<amount2>>{(?<ingredient>\w[\s\w-]+)}', '<<amount1>>-<<amount2>><<ingredient>>'),
	rule(r'^{(?<ingredient>\w[\s\w-]+)}<<amount1>>', '<<amount1>><<ingredient>>'),
	rule(r'^{(?<ingredient>\w[\s\w-]+)},<<amount1>>', '<<amount1>><<ingredient>>'),
]

This is quite a painful process though as new rules can break previous ones unexpectedly.

No amount information

That still leaves these remaining items that we can’t extract because the information does not exist:

--salt and pepper
--Lemon pepper
--Green onions
--Toasted sesame seeds
--Avocado basil sauce (recipe below)
--Caramelized kimchi (recipe below)
--Salt
--Pepper
--Fresh cracked pepper
--Coriander
--Lime
--Black pepper
--Salt and pepper
--Red pepper flakes
--Vegan parmesan cheese
--Fresh basil
--ground pepper
--Salt and Pepper
--handful of fresh Cilantro - optional
--leftover marinade from the mushrooms
--pasta of your choice
--salt, pepper

Conclusion

We have followed the approach of identifying the quantity part of an ingredient quantity and treating the residual as the ingredient. On the whole, this has worked quite well and has the advantage of extracting ingredients with minimal actual ingredient knowledge (we didn’t need to tell the classifier that a ‘carrot’ is an ingredient, for example). However, there are some clear drawbacks:

  • The extracted ingredients contain extra information and representational differences
  • Where we didn’t find a quantity we have to assume the whole string is an ingredient

These issues make a lot of work for us in achieving a canonical form and make it harder to detect a failed classification.

For the complete source of this example take a look at: https://gist.github.com/jeremyorme/52319cdb8f34f4bc6caeb5f618c6cd4a

Next time

We’ll think about how to teach our classifier specific ingredients instead of identifying them as the residual after removing quantities and consider how we might do this efficiently.