Last time we used our regex classifier and a set of rules to infer which part of each input string contained the ingredient. That let us identify ingredients with very little prior knowledge, but a lack of identifying information and a long tail of different formats meant it failed in about 10% of cases. It also produced non-canonical results, with extraneous information and inconsistent formatting. This time we’ll take a more direct approach.

Simple substitution

Let’s work the other way around. We’ll assume we have a database of ingredients in which each entry contains one or more keywords (or phrases) that we can search for in each string to identify the presence of that ingredient.

Now the classification process becomes trivial. We simply str.replace instances of each keyword with that keyword surrounded by type mark-up. The type mark-up can identify the specific canonical ingredient (e.g. carrot) as well as the more general class (ingredient). Given a dictionary of ingredients in json format:

import json

ings = json.loads('''{
	"mixed-vegetables": {
		"keywords": [
			"different types of veg"
		]
	}
}''')

We can iterate over the keywords replacing them with the classifications:

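def classify_ingredients(ings, s):
	# Wrap every occurrence of each keyword in <ingredient><canonical-name> mark-up.
	sout = s
	for canonical_name, ingredient in ings.items():
		for keyword in ingredient['keywords']:
			# str.replace is case-sensitive and also matches inside longer words.
			sout = sout.replace(keyword, '<ingredient><' + canonical_name + '>' + keyword + '</' + canonical_name + '></ingredient>')
	return sout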
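
One caveat with chained str.replace calls is that each pass also scans the mark-up inserted by earlier passes, so a keyword that occurs inside an earlier keyword or tag name could be wrapped a second time. A minimal sketch of one way around this, assuming the same dictionary layout, is to match all keywords in a single pass with a regular expression, trying longer keywords first. The function name classify_ingredients_once and the olive-oil entry are hypothetical and only here for illustration:

```python
import re

def classify_ingredients_once(ings, s):
    # Map every keyword back to its canonical name.
    canonical = {kw: name for name, ing in ings.items() for kw in ing['keywords']}
    if not canonical:
        return s
    # Longest keywords first so 'olive oil' wins over 'olive'.
    keywords = sorted(canonical, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(kw) for kw in keywords))

    def mark_up(m):
        name = canonical[m.group(0)]
        return '<ingredient><' + name + '>' + m.group(0) + '</' + name + '></ingredient>'

    # re.sub scans the original text once, so inserted tags are never re-matched.
    return pattern.sub(mark_up, s)

# Hypothetical entries, for illustration only.
example_ings = {
    'olive-oil': {'keywords': ['olive oil']},
    'black-olives': {'keywords': ['olive']},
}
print(classify_ingredients_once(example_ings, '1 tbsp olive oil'))
```

Here the whole phrase is tagged as olive-oil, and the shorter keyword never gets a chance to match inside it.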

Classifying a real recipe

Let’s have a go at classifying the ingredients from a real recipe. We’ll take a recipe from Veganuary.com (https://uk.veganuary.com/recipes/zoodles-with-basil-and-avocado-pesto). Here is the ingredient list:

recipe = '''2 courgettes (zucchini)  
1 carrot  
1 avocado  
1 bunch basil  
1 tbsp lemon juice  
2 tbsp nutritional yeast  
10 olives, sliced  
4 garlic cloves, roasted  
2 tomatoes, roasted  
Pinch of chilli powder or smoked paprika'''

We’ll iterate over each line, calling our classifier, extracting the ingredients and outputting them (or the input line prefixed with -- if no ingredient was found):

for iq in recipe.splitlines():
	s = classify_ingredients(ings, iq)
	ingredients = extract_typed(s, ['ingredient'])[0]
	print(', '.join(ingredients) if len(ingredients) > 0 else '--' + iq)
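
The extract_typed helper comes from the previous post and isn’t reproduced here. For readers following along without it, here is a rough stand-in that is consistent with how it is called above: it returns one list per requested type, each holding the inner text of every matching tag. The name extract_typed_sketch and its exact behaviour are assumptions, not the original implementation:

```python
import re

def extract_typed_sketch(s, types):
    # Hypothetical stand-in for extract_typed from the previous post:
    # one list per requested type, holding the inner text of each tag.
    return [
        re.findall('<' + re.escape(t) + '>(.*?)</' + re.escape(t) + '>', s)
        for t in types
    ]

s = '2 <ingredient><courgette>courgette</courgette></ingredient>s'
print(extract_typed_sketch(s, ['ingredient'])[0])
```

For the sample string this yields a one-element list containing the inner `<courgette>` mark-up, which is the form the output in this post takes.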

This produces the following output:

--2 courgettes (zucchini)  
--1 carrot  
--1 avocado  
--1 bunch basil  
--1 tbsp lemon juice  
--2 tbsp nutritional yeast  
--10 olives, sliced  
--4 garlic cloves, roasted  
--2 tomatoes, roasted  
--Pinch of chilli powder or smoked paprika

None of the ingredients were matched because we only have mixed veg in our known ingredients. Let’s update our ingredients and save them in a separate data file called ingredients.json as this could potentially get quite big:

{
	"mixed-vegetables": {
		"keywords": [
			"different types of veg"
		]
	},
	"courgette": {
		"keywords": [
			"courgette",
			"zucchini"
		]
	},
	"carrot": {
		"keywords": [
			"carrot"
		]
	},
	"avocado": {
		"keywords": [
			"avocado"
		]
	},
	"basil": {
		"keywords": [
			"basil"
		]
	},
	"lemon": {
		"keywords": [
			"lemon"
		]
	},
	"yeast": {
		"keywords": [
			"yeast"
		]
	},
	"black-olives": {
		"keywords": [
			"olive"
		]
	},
	"garlic": {
		"keywords": [
			"garlic"
		]
	},
	"tomato": {
		"keywords": [
			"tomato"
		]
	},
	"chilli-powder": {
		"keywords": [
			"chilli powder"
		]
	},
	"paprika": {
		"keywords": [
			"paprika"
		]
	}
}

We can modify our json reading code to load this file:

with open('ingredients.json') as f:
	ings = json.load(f)

Then we re-run the classifier, which generates the following output:

<courgette>courgette</courgette>, <courgette>zucchini</courgette>
<carrot>carrot</carrot>
<avocado>avocado</avocado>
<basil>basil</basil>
<lemon>lemon</lemon>
<yeast>yeast</yeast>
<black-olives>olive</black-olives>
<garlic>garlic</garlic>
<tomato>tomato</tomato>
<chilli-powder>chilli powder</chilli-powder>, <paprika>paprika</paprika>

Everything is correctly classified! The drawback, as stated before, is that we now have to provide a lot more information to the classifier.

Do something useful

The advantage of reliable classification to a canonical ingredient is that we can easily reference additional related data. Logically, all we need is an additional table with a column for the canonical name and columns of additional data. For example, we could create a table that links each canonical ingredient to an entry in the FoodData Central database (https://fdc.nal.usda.gov) - we’ll save this as fdc_map.csv:

canonical-name,fdc-id
mixed-vegetables,170471
courgette,169291
carrot,342354
avocado,341528
basil,342608
lemon,341433
yeast,343391
black-olives,343670
garlic,342614
tomato,342502
chilli-powder,171319
paprika,171329
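
The nutrient look-up further down reads from a dictionary called fdc_map, but the post doesn’t show it being loaded. A minimal sketch, assuming fdc_map is simply a dictionary from canonical name to FDC ID (both kept as strings), might look like this. The in-memory sample stands in for the file so the snippet runs on its own:

```python
import csv
import io

# Two rows copied from fdc_map.csv above; in the real program you would
# pass open('fdc_map.csv') to DictReader instead of this in-memory sample.
sample = io.StringIO('canonical-name,fdc-id\ncourgette,169291\ncarrot,342354\n')

# Assumed shape: canonical name -> FDC ID, both kept as strings.
fdc_map = {row['canonical-name']: row['fdc-id'] for row in csv.DictReader(sample)}
print(fdc_map['courgette'])
```

Keeping the IDs as strings means they can be concatenated directly into look-up keys later on.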

Now we can report the nutrient information for each ingredient in the recipe. We just look up the FDC ID (the fdc-id column) using our map and use it to index into the FDC nutrient table.

For this example we’ll just look up the macro-nutrient totals, so for simplicity we’ll take the following subset of the nutrient names table and save it as nutrients.csv:

nutrient_id,name
1003,Protein
1004,Fat
1005,Carbohydrate

We can load this as follows:

import csv

with open('nutrients.csv') as f:
	nutrients = list(csv.DictReader(f))

Here is the subset of nutrient data we need, saved as ingredient_nutrients.csv, which contains the above nutrients for each of the ingredients we’re using in this example:

id,fdc_id,nutrient_id,amount,data_points,derivation_id,min,max,median,footnote,min_year_acquired
"1526524","170471","1003","3.33","36","","","","","",""
"1526500","170471","1004","0.52","36","","","","","",""
"1526496","170471","1005","13.47","0","49","","","","",""
"1430596","169291","1003","1.21","10","43","0.91","1.5","","",""
"1430544","169291","1004","0.32","7","42","0.1","0.45","","",""
"1430545","169291","1005","3.11","0","49","","","","",""
"2696692","342354","1003","0.93","","","","","","",""
"2696693","342354","1004","0.24","","","","","","",""
"2696694","342354","1005","9.58","","","","","","",""
"2643002","341528","1003","2","","","","","","",""
"2643003","341528","1004","14.66","","","","","","",""
"2643004","341528","1005","8.53","","","","","","",""
"2713202","342608","1003","3.15","","","","","","",""
"2713203","342608","1004","0.64","","","","","","",""
"2713204","342608","1005","2.65","","","","","","",""
"2636827","341433","1003","1.1","","","","","","",""
"2636828","341433","1004","0.3","","","","","","",""
"2636829","341433","1005","9.32","","","","","","",""
"2764097","343391","1003","40.44","","","","","","",""
"2764098","343391","1004","7.61","","","","","","",""
"2764099","343391","1005","41.22","","","","","","",""
"2782232","343670","1003","0.88","","","","","","",""
"2782233","343670","1004","9.54","","","","","","",""
"2782234","343670","1005","6.06","","","","","","",""
"2713592","342614","1003","6.36","","","","","","",""
"2713593","342614","1004","0.5","","","","","","",""
"2713594","342614","1005","33.06","","","","","","",""
"2706312","342502","1003","0.88","","","","","","",""
"2706313","342502","1004","0.2","","","","","","",""
"2706314","342502","1005","3.89","","","","","","",""
"1600381","171319","1003","13.46","1","1","","","","",""
"1600336","171319","1004","14.28","1","1","","","","",""
"1600337","171319","1005","49.7","0","49","","","","",""
"1601325","171329","1003","14.14","3","1","13.96","14.47","","",""
"1601334","171329","1004","12.89","3","1","11.48","14.8","","",""
"1601368","171329","1005","53.99","0","49","","","","",""

We can load these into a map for quick look-up by ingredient (fdc_id) and nutrient (nutrient_id):

nutrient_map = {}
with open('ingredient_nutrients.csv') as f:
	nut_map_reader = csv.DictReader(f)
	for row in nut_map_reader:
		nutrient_map[row['fdc_id'] + ':' + row['nutrient_id']] = row['amount']
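
As a quick check of the key format, here is a small sketch using two rows from the table above. The rows are trimmed to the columns we actually use and read from an in-memory string rather than ingredient_nutrients.csv, purely so the snippet is self-contained:

```python
import csv
import io

# Two courgette rows from the table above, trimmed to the columns we use.
sample = io.StringIO(
    'id,fdc_id,nutrient_id,amount\n'
    '"1430596","169291","1003","1.21"\n'
    '"1430544","169291","1004","0.32"\n'
)

nutrient_map = {}
for row in csv.DictReader(sample):
    nutrient_map[row['fdc_id'] + ':' + row['nutrient_id']] = row['amount']

# Courgette (169291), protein (1003); amounts stay as strings.
print(nutrient_map['169291:1003'])
```

Note that csv.DictReader returns every value as a string, so an amount would need converting with float() before doing any arithmetic on it.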

Finally, we can modify our program to output the nutrient information for each discovered ingredient.

We write a helper to look up the nutrient info from the canonical name and generate a string representation:

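def nutrient_info(canonical_ing):
	# Map the canonical name to its FDC ID, then look up each nutrient amount.
	fdc_id = fdc_map[canonical_ing]
	# The nutrient names table above labels its ID column nutrient_id.
	return ', '.join([(n['name'] + '=' + nutrient_map[fdc_id + ':' + n['nutrient_id']]) for n in nutrients])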

Then we can use that in our program where we output the ingredient:

for iq in recipe.splitlines():
	s = classify_ingredients(ings, iq)
	ingredients = [first_tag(i) for i in extract_typed(s, ['ingredient'])[0]]
	print(', '.join([(i + ' (' + nutrient_info(i) + ')') for i in ingredients]) if len(ingredients) > 0 else '--' + iq)
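
The loop above also relies on a first_tag helper that isn’t defined in this post. Judging from how it is used, it takes a marked-up string such as `<courgette>courgette</courgette>` and returns the name of its first tag. A hypothetical version, named first_tag_sketch to make clear it is an assumption rather than the original, could be:

```python
import re

def first_tag_sketch(s):
    # Return the name of the first opening tag in s, or None if there is none.
    m = re.search(r'<([^/<>\s]+)>', s)
    return m.group(1) if m else None

print(first_tag_sketch('<courgette>courgette</courgette>'))
```

Because the tag name is the canonical ingredient name, the result can be passed straight to nutrient_info.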

This produces the following output:

courgette (Protein=1.21, Fat=0.32, Carbohydrate=3.11), courgette (Protein=1.21, Fat=0.32, Carbohydrate=3.11)
carrot (Protein=0.93, Fat=0.24, Carbohydrate=9.58)
avocado (Protein=2, Fat=14.66, Carbohydrate=8.53)
basil (Protein=3.15, Fat=0.64, Carbohydrate=2.65)
lemon (Protein=1.1, Fat=0.3, Carbohydrate=9.32)
yeast (Protein=40.44, Fat=7.61, Carbohydrate=41.22)
black-olives (Protein=0.88, Fat=9.54, Carbohydrate=6.06)
garlic (Protein=6.36, Fat=0.5, Carbohydrate=33.06)
tomato (Protein=0.88, Fat=0.2, Carbohydrate=3.89)
chilli-powder (Protein=13.46, Fat=14.28, Carbohydrate=49.7), paprika (Protein=14.14, Fat=12.89, Carbohydrate=53.99)

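Exactly what we were looking for! (Courgette appears twice on the first line because both of its keywords, courgette and zucchini, matched.)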

Next time

We’ll take this a bit further and try to calculate the nutrient values for the actual quantities of the ingredients in the recipe (rather than just per 100g). Then we’ll aggregate them to generate the nutrient info for the recipe as a whole.