Recap

In the previous article we built a classifier that would take an input string, along with some knowledge about its structure, and produce an output string marked up with classifications. This worked nicely for the single case at hand, but the question is: how does the required knowledge scale with the number of input strings?

Every time we add a new string to our input dataset we potentially need some new knowledge to be able to correctly classify it. The extent to which new knowledge is required depends on the scope and regularity of the inputs. If they conform to a strict format then providing all the knowledge up-front is trivial. If we’re trying to classify any sentence in the English language, it may take longer.

Input dataset

Let’s continue with the same theme and consider more ingredient quantity examples. These clearly don’t conform to a single format, but equally their scope seems manageable: there are only so many units, the strings are generally fairly short and there only seem to be a few sentence formations in use. We can also readily scrape a bunch of them from the web as there are lots of sites with well-structured recipe descriptions.

For this test we’ll consider 100 such strings plucked from various recipe websites.

Pre-knowledge

Before we even look at our dataset, we can expand our knowledge definition in a systematic way. We can have a decent stab at the set of known units and we can expand our definition of what a number looks like.

Let’s add all the units we can think of...

units = '''
/ml/ is unit
/tsp/ is unit
/tbsp/ is unit
/floz/ is unit
/pint/ is unit
/l/ is unit
/gal/ is unit
/g/ is unit
/oz/ is unit
/lb/ is unit
/kg/ is unit
'''

We could have put these all into a single rule by using the | operator inside the regex pattern. However, we can anticipate needing additional representations of each unit (e.g. tbsp == tablespoon) so for readability it’s nice to keep each unit separate.
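
For comparison, the collapsed single-rule version would be:

/ml|tsp|tbsp|floz|pint|l|gal|g|oz|lb|kg/ is unit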

Let’s add in any alternative representations that we can anticipate, considering potential long-form names, plurals and alternative English spellings. To help us out we’ll take a look at Wikipedia:

units = '''
/ml|mL|cc|millilitres?|milliliters?/ is unit
/tsp|t|teaspoons?/ is unit
/tbsp|T|tbl|tbs|tablespoons?/ is unit
/floz/ is unit
/fl/,/oz/ is unit
/fluid/,/ounces?/ is unit
/p|pt|pints?/ is unit
/l|L|litres?|liters?/ is unit
/gal|gallons?/ is unit
/dl|dL|decilitre|deciliter/ is unit
/g|grams?|grammes?/ is unit
/oz|ounces?/ is unit
/lb|#|pounds?/ is unit
/kg|kilos?|kilograms?/ is unit
'''

We’ve surely missed a few corner cases but this is a good start so let’s move on and take a look at our number definition (/\d+/ is number). This is quite a narrow definition of what a number looks like (a sequence of one or more digits). With minimal head scratching we could anticipate:

  • A decimal point (2.5)
  • Fractions (2 1/2)
  • Fractional characters (2½)

This gives us the following rules:

numbers = '''
/\d+/?,/\d+\/\d+/ is number
/\d+(\.\d+)?/ is number
/\d*[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞]/ is number
'''
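
As a quick regex-level sanity check (a minimal sketch using Python’s re module, outside the classifier; the middle rule is two comma-separated token patterns, so its fraction part is checked on its own):

import re

assert re.fullmatch(r'\d+(\.\d+)?', '2.5')              # decimal point
assert re.fullmatch(r'\d+/\d+', '1/2')                  # fraction token (written \/ in the rule to escape the / delimiter)
assert re.fullmatch(r'\d*[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞]', '2½')      # fractional character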

Verification

We now want to run our classifier and verify that each input string is correctly classified. We could write a test that compares the classified output with the expected output for each input string. That’s a lot of tedious work though - we’d have to manually classify all our input strings to get those expected output values. A more efficient approach is to extract the classified ingredient and print it next to the original string so we can quickly scan down the list and spot any erroneous cases.

For this, we’ll need a new function to grab the bits of the string with the type we’re interested in (i.e. ingredient). It’s simply going to iterate through the tokens and use the type matcher to decide whether to include each one. If the type matches, the token is appended to the output string. Here’s the code:

def extract_typed(tokens, types, type):
	token_str = ''
	matcher = type_matcher(type, '')
	# Walk the tokens and keep only those whose classification matches the
	# requested type, joining them with single spaces
	for t, token in enumerate(tokens):
		if matcher.match(tokens, t, types):
			if len(token_str) > 0:
				token_str += ' '
			token_str += token
	return token_str

Now we just need to modify our main program to iterate over the input strings:

def extract_ingredient(in_str):
	tokens = tokenise(in_str)
	rules = classifier.parse(knowledge)
	classifications = classifier.classify(tokens, rules)
	return extract_typed(tokens, classifications, 'ingredient')

out_strs = []
for in_str in in_strs:
	out_strs.append(extract_ingredient(in_str))
	
for i in range(len(in_strs)):
	print('"' + in_strs[i] + '"')
	print('=> "' + out_strs[i] + '"')
	print('')

The data

And finally, we get to the data! This is a set of 100 ingredient quantities taken from several different recipe websites:

in_strs = [
	'6 – 7 cups of Three different types of vegetables*, chopped into bite-sized pieces',
	'1 1/2 tsp. Onion Powder',
	'1 tsp. Garlic Powder',
	'1 tsp. Ground Ginger',
	'1/4 cup Orange Marmalade',
	'2 1/2 T. Soy Sauce, Tamari or Bragg’s Aminos',
	'2 T. Water',
	'15 oz. can Black Beans, drained and rinsed',
	'3 – 4 cups Cooked Rice or Quinoa (heat up the frozen type when in a pinch)',
	'1 butternut squash (around 1 kg)',
	
	'2 tbsp lemon juice',
	'75g grated vegan parmesan (or 15g Nutritional yeast flakes)',
	'½ tsp powered garlic',
	'1 tsp. mustard powder',
	'1 tsp. grated nutmeg',
	'300ml Alpro Oat Unsweetened drink',
	'2 tbsp extra virgin olive oil, plus a little extra for oiling the squash',
	'1 bunch of sage',
	'100g hazelnuts, roughly chopped',
	'400g macaroni',
	
	'Salt and pepper',
	'umeboshi paste 4 tbsp, see notes below',
	'Chinese black vinegar 2 tsp',
	'toasted sesame oil 2 tsp',
	'dark soy sauce 2 tbsp',
	'shaoxing wine 2 tsp',
	'garlic 2 cloves, peeled',
	'Chinese five-spice ¼ tsp',
	'dried chilli flakes ¼ tsp',
	'tempeh 300g block, see notes below',
	
	'pineapple ½, peeled and cored',
	'spring onions 2, finely chopped',
	'coriander ½ a small bunch, leaves picked',
	'dried chilli flakes a pinch',
	'echalion shallots 2, halved and thinly sliced',
	'red chilli 1, deseeded and thinly sliced',
	'lemons 2, juiced',
	'farro 180g, see notes below',
	'agave syrup 1 tbsp',
	'rapeseed oil 4 tbsp',
	
	'pecans 125g',
	'flat-leaf parsley a bunch, roughly chopped',
	'rocket 70g',
	'peaches 3, slightly under-ripe, halved and stoned',
	'thyme 4 sprigs, leaves picked',
	'1 15-ounce can chickpeas (rinsed, drained, and dried)',
	'1 Tbsp olive oil',
	'1 Tbsp dried or fresh oregano',
	'1 pinch sea salt',
	'2 tsp garlic powder',
	
	'3 Tbsp gluten-free panko bread crumbs',
	'1 Tbsp vegan parmesan cheese',
	'1 Tbsp olive oil',
	'3 cloves garlic, minced (3 cloves yield ~1 1/2 Tbsp or 9 g)',
	'1/4 cup carrots (very finely diced)',
	'1 15-ounce can tomato sauce',
	'1 Tbsp dried or fresh oregano',
	'1 Tbsp vegan parmesan cheese (plus more to taste)',
	'1-2 Tbsp sweetener of choice (such as organic cane sugar or coconut sugar // optional)',
	'10 large carrots (ribboned with a vegetable peeler // or sub 8 ounces pasta of choice per 10 carrots)',
	
	'Red pepper flakes',
	'Vegan parmesan cheese',
	'Fresh basil',
	'3/4 cup dried chickpeas (soaked and cooked, see step 1)',
	'1/2 red onion (peeled)',
	'2 garlic cloves (peeled)',
	'1/4 teaspoon salt',
	'1/4 teaspoon paprika powder',
	'1/4 cup fresh parsley',
	'1 tablespoon lemon juice',
	
	'1 teaspoon olive oil',
	'1-2 tablespoons besan/chickpea flour',
	'1 avocado',
	'1/2 teaspoon lime juice',
	'1/4 teaspoon salt',
	'ground pepper',
	'4 pretzel buns',
	'1/2 cup baby spinach',
	'1/2 cup arugula',
	'3 tablespoons Olive Oil',
	
	'1 large brown/yellow/white Onion diced',
	'4 Garlic Cloves minced',
	'1 large Zucchini sliced, then quartered',
	'4 celery stalks sliced',
	'6 medium sized Tomatoes diced',
	'3 Bell Peppers 1 green, 2 yellow,red, or orange',
	'3 tablespoons favorite Chili Powder storebought or homemade',
	'1 tablespoon Cumin',
	'2 tablespoons Paprika',
	'1 teaspoon Smoked Paprika',
	
	'4 1/2 cups 900ml Tomato Puree',
	'4 cups 800ml Water',
	'3 cups Beans of choice - Kidney Black, Pinto etc. (soaked and cooked, or canned)',
	'2 cups Corn',
	'Salt and Pepper',
	'1 Avocado - optional diced',
	'handful of fresh Cilantro - optional',
	'3 cups butternut squash (cubed*)',
	'3 cloves garlic (whole // skin removed)',
	'2 Tbsp olive oil (divided)'
]

When we run this lot through our classifier, we see that many of the strings fail to match, resulting in an empty ingredient string, e.g.

"1 1/2 tsp. Onion Powder"
=> ""

Let’s focus on these first and condense the output to show only the input strings that produced no match, changing our output code to the following:

for i in range(len(in_strs)):
	if len(out_strs[i]) == 0:
		print(in_strs[i])

This gives us the following list of input strings that failed to classify:

1 1/2 tsp. Onion Powder
1 tsp. Garlic Powder
1 tsp. Ground Ginger
2 1/2 T. Soy Sauce, Tamari or Bragg’s Aminos
2 T. Water
15 oz. can Black Beans, drained and rinsed
1 butternut squash (around 1 kg)
1 tsp. mustard powder
1 tsp. grated nutmeg
1 bunch of sage
Salt and pepper
umeboshi paste 4 tbsp, see notes below
Chinese black vinegar 2 tsp
toasted sesame oil 2 tsp
dark soy sauce 2 tbsp
shaoxing wine 2 tsp
garlic 2 cloves, peeled
Chinese five-spice ¼ tsp
dried chilli flakes ¼ tsp
tempeh 300g block, see notes below
pineapple ½, peeled and cored
spring onions 2, finely chopped
coriander ½ a small bunch, leaves picked
dried chilli flakes a pinch
echalion shallots 2, halved and thinly sliced
red chilli 1, deseeded and thinly sliced
lemons 2, juiced
farro 180g, see notes below
agave syrup 1 tbsp
rapeseed oil 4 tbsp
pecans 125g
flat-leaf parsley a bunch, roughly chopped
rocket 70g
peaches 3, slightly under-ripe, halved and stoned
thyme 4 sprigs, leaves picked
1 15-ounce can chickpeas (rinsed, drained, and dried)
1 Tbsp olive oil
1 Tbsp dried or fresh oregano
1 pinch sea salt
3 Tbsp gluten-free panko bread crumbs
1 Tbsp vegan parmesan cheese
1 Tbsp olive oil
3 cloves garlic, minced (3 cloves yield ~1 1/2 Tbsp or 9 g)
1 15-ounce can tomato sauce
1 Tbsp dried or fresh oregano
1 Tbsp vegan parmesan cheese (plus more to taste)
1-2 Tbsp sweetener of choice (such as organic cane sugar or coconut sugar // optional)
Red pepper flakes
Vegan parmesan cheese
Fresh basil
1/2 red onion (peeled)
2 garlic cloves (peeled)
1-2 tablespoons besan/chickpea flour
1 avocado
ground pepper
4 pretzel buns
1 large brown/yellow/white Onion diced
4 Garlic Cloves minced
1 large Zucchini sliced, then quartered
4 celery stalks sliced
6 medium sized Tomatoes diced
3 Bell Peppers 1 green, 2 yellow,red, or orange
Salt and Pepper
1 Avocado - optional diced
handful of fresh Cilantro - optional
3 cloves garlic (whole // skin removed)
2 Tbsp olive oil (divided)

That’s 67% of our input set that we couldn’t classify. Let’s take a look at the failed strings and see if we can find some common explanations for them not matching:

  • a period (.) may follow a unit, signifying the abbreviation - e.g. 1 tsp. Garlic Powder
  • an amount may have no unit - e.g. 1 butternut squash (around 1 kg)
  • there may be no amount preceding an ingredient - e.g. Salt and pepper
  • the amount may follow the ingredient - e.g. Chinese black vinegar 2 tsp
  • units can start with a capital letter - e.g. 1 Tbsp olive oil
  • pinch is a unit - e.g. 1 pinch sea salt

Improving our knowledge

We need to add some new rules and modify our existing ones to handle the new cases we’ve found.

We can handle a period after a unit by modifying the rule range|number,unit,/of/? is amount to include an optional period: range|number,unit,/\./?,/of/? is amount.

We can modify the same rule again to make the unit optional: range|number,unit?,/\./?,/of/? is amount.

We can allow an ingredient to have no preceding amount by modifying the rule amount,/\w+/+ is ,ingredient to make amount optional: amount?,/\w+/+ is ,ingredient.

We can modify the same rule again to allow the amount to follow the ingredient: amount?,/\w+/+,amount? is ,ingredient,.

We can update the tbsp unit rule with the alternative casing: /tbsp|Tbsp|T|tbl|tbs|tablespoons?/ is unit.

Finally, we can add a new rule for the pinch unit: /pinch/ is unit.
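
Putting those changes together, the affected rules now read something like this (the variable name is illustrative; the full knowledge string still includes the unit and number rules from earlier):

updates = '''
/tbsp|Tbsp|T|tbl|tbs|tablespoons?/ is unit
/pinch/ is unit
range|number,unit?,/\./?,/of/? is amount
amount?,/\w+/+,amount? is ,ingredient,
'''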

Re-running with our new knowledge yields no empty output strings, but there may still be errors, so let’s go back to printing the input string and extracted ingredient next to each other and scan down the list. We can quickly see that the following lines are dubious:

"½ tsp powered garlic"
=> "tsp powered garlic"

"umeboshi paste 4 tbsp, see notes below"
=> "umeboshi paste 4 tbsp see notes below"

"Chinese black vinegar 2 tsp"
=> "Chinese black vinegar 2 tsp"

"toasted sesame oil 2 tsp"
=> "toasted sesame oil 2 tsp"

"dark soy sauce 2 tbsp"
=> "dark soy sauce 2 tbsp"

"shaoxing wine 2 tsp"
=> "shaoxing wine 2 tsp"

"garlic 2 cloves, peeled"
=> "garlic 2 cloves peeled"

"Chinese five-spice ¼ tsp"
=> "Chinese tsp"

"dried chilli flakes ¼ tsp"
=> "dried chilli flakes tsp"

"tempeh 300g block, see notes below"
=> "tempeh 300 g block see notes below"

"spring onions 2, finely chopped"
=> "spring onions 2 finely chopped"

"dried chilli flakes a pinch"
=> "dried chilli flakes a pinch"

"echalion shallots 2, halved and thinly sliced"
=> "echalion shallots 2 halved and thinly sliced"

"red chilli 1, deseeded and thinly sliced"
=> "red chilli 1 deseeded and thinly sliced"

"lemons 2, juiced"
=> "lemons 2 juiced"

"farro 180g, see notes below"
=> "farro 180 g see notes below"

"agave syrup 1 tbsp"
=> "agave syrup 1 tbsp"

"rapeseed oil 4 tbsp"
=> "rapeseed oil 4 tbsp"

"pecans 125g"
=> "pecans 125 g"

"rocket 70g"
=> "rocket 70 g"

"peaches 3, slightly under-ripe, halved and stoned"
=> "peaches 3 slightly halved and stoned"

"thyme 4 sprigs, leaves picked"
=> "thyme 4 sprigs leaves picked"

"1 15-ounce can chickpeas (rinsed, drained, and dried)"
=> "ounce can chickpeas rinsed drained and dried"

"1 15-ounce can tomato sauce"
=> "ounce can tomato sauce"

"handful of fresh Cilantro - optional"
=> "handful of fresh Cilantro optional"

Many of these failures have the same structure with an amount in the middle of the string - e.g. umeboshi paste 4 tbsp, see notes below. Let’s look at the classifications for this string and try to understand what’s going wrong:

[{s:2, e:2, t:number}, {s:3, e:3, t:unit}, {s:2, e:3, t:amount}, {s:0, e:3, t:ingredient}, {s:5, e:7, t:ingredient}]

We can see it has classified umeboshi paste 4 tbsp as an ingredient. It looks like our pattern is too greedy and has gobbled up the amount as well as the ingredient. Let’s look at the rule again: amount?,/\w+/+,amount? is ,ingredient,.

Indeed, \w will match numbers as well as letters so it will gobble up the 4 and the tbsp. We can make it more prescriptive by specifying letters only: amount?,/[a-zA-Z\-]+/+,amount? is ,ingredient,.
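
To see the difference (a quick check with Python’s re module, outside the classifier):

import re

# \w matches digits as well as letters, which is why the old ingredient pattern
# swallowed the '4'; a letters-only class rejects the digit token
print(bool(re.fullmatch(r'\w+', '4')))          # True
print(bool(re.fullmatch(r'[a-zA-Z\-]+', '4')))  # False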

What we really wanted here was a lazy matcher that would give up as soon as the next matcher could take over. The advantage of that is that we could still have numbers in our ingredient names. Perhaps we’ll come back to that later.

Now let’s turn our attention to ½ tsp powered garlic. Something’s gone wrong here because tsp is ending up in the ingredient. Here are the tokens: ['½ ', 'tsp', 'powered', 'garlic']. We can see that there is whitespace after the ½ character, which prevents it from matching as a number.

We have two options here: we could add \s to the number pattern to allow whitespace to appear, but we don’t really want that whitespace in the token. Alternatively, we could modify our tokeniser - after all, we don’t expect to see whitespace within a number and it only happens for the fraction symbols. Let’s examine the tokeniser regex:

([a-zA-Z][a-zA-Z\\-]*|\\d+|[^\\w ])

The number representation doesn’t account for our expanded number definition including fraction symbols. We can rectify this by changing it to:

([a-zA-Z][a-zA-Z\\-]*|[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞\\d]+|[^\\w ])
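
For context, here’s roughly what the tokenise function looks like with the updated pattern (the real implementation comes from the previous article; this sketch assumes a simple regex split that keeps the captured tokens and drops empty or whitespace-only pieces):

import re

TOKEN_PATTERN = '([a-zA-Z][a-zA-Z\\-]*|[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞\\d]+|[^\\w ])'

def tokenise(in_str):
	pieces = re.split(TOKEN_PATTERN, in_str)
	return [piece for piece in pieces if piece.strip()]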

Re-running, we see the extraneous whitespace is gone:

['½', 'tsp', 'powered', 'garlic']

And the ingredient is now correctly extracted as powered garlic.

Conclusion

We made a first stab at providing the necessary knowledge by taking definitions on Wikipedia as a basis. We then classified 100 ingredient quantity strings extracted from an assortment of websites, pulled out the ingredient name part of each string and scanned for errors. We had to provide eight new pieces of knowledge to the classifier to produce a satisfactory result.

Next time

We’ll repeat this process a few times with more data and see how the required knowledge grows with respect to the input set.