Last time we generated a list of ingredients with mappings to items in the FoodData Central (FDC) database and very rough estimates for unit size and density. This time we’ll plug that data into the ingredient classifier we built previously and generate the aggregate nutrition values for an example recipe. Then we’ll look at refining some of the data and testing with a wider range of recipes.

Keywords

Our ingredient list from last time requires some manual work to resolve the following issues with the auto-generated keywords:

  • too many words (we want the smallest keyword/phrase to maximise the number of matches) - e.g. red tomatoes -> tomatoes
  • plurals (we want to remove plurals where the singular matches both) - e.g. tomatoes -> tomato
  • alternative names (we want to include all known alternative names) - e.g. zucchini -> zucchini|courgette
  • duplicates (a given keyword should only appear in one ingredient)
  • unwanted items (some items aren’t suitable as an ingredient)

We’ll change the matching logic in the ingredient classifier a bit so that instead of replacing every match, we find the longest match and replace that. Otherwise we run into all sorts of trouble with multiple matches. For example, we might have the keyword tomato for ingredient red-tomato and green tomato for ingredient green-tomato. If we see tomato we’d like to match the red-tomato ingredient but not if it is part of the string green tomato, in which case we’d like to match the green-tomato ingredient. We can achieve this by changing our classify_ingredients function as follows:


def classify_ingredients(s):
	# Collect every keyword that occurs in the input, remembering
	# which ingredient each one belongs to.
	candidates = []
	names = {}
	for name, ingredient in ingredients.items():
		for keyword in ingredient['keywords']:
			if keyword in s:
				candidates.append(keyword)
				names[keyword] = name

	# No keyword matched: return the input unchanged.
	if len(candidates) == 0:
		return s

	# Sort by length so the longest (most specific) keyword is last.
	sorted_candidates = sorted(candidates, key=len)
	keyword = sorted_candidates[-1]
	name = names[keyword]
	return s.replace(keyword, '<ingredient><' + name + '>' + keyword + '</' + name + '></ingredient>')

Here we store a list of candidates (those keywords that were found in the input string), sort them by length and select the longest one. This results in the most specific keyword match being selected.
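
To see the longest-match rule in action, here is a minimal, self-contained sketch. The two-entry ingredients table is made up for illustration, and the function is a condensed variant of the one above (it uses max rather than a sort), not the exact code from the classifier:

```python
# Hypothetical two-entry ingredient table, for illustration only.
ingredients = {
    'red-tomato': {'keywords': ['tomato']},
    'green-tomato': {'keywords': ['green tomato']},
}


def classify_ingredients(s):
    # Map every keyword found in the input to its ingredient name.
    names = {}
    for name, ingredient in ingredients.items():
        for keyword in ingredient['keywords']:
            if keyword in s:
                names[keyword] = name

    # No keyword matched: return the input unchanged.
    if not names:
        return s

    # The longest keyword is the most specific match.
    keyword = max(names, key=len)
    name = names[keyword]
    tagged = '<ingredient><' + name + '>' + keyword + '</' + name + '></ingredient>'
    return s.replace(keyword, tagged)


print(classify_ingredients('2 tomatoes, roasted'))
print(classify_ingredients('1 green tomato'))
print(classify_ingredients('1 tbsp lemon juice'))
```

The first string is tagged as red-tomato, the second as green-tomato even though tomato also matches, and the third comes back unchanged because no keyword is found.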

Handling duplicates

We’d like to avoid duplicate keywords because only one of them will ever be selected and it isn’t obvious which one.

Let’s create a simple script that records the line each keyword first appears on and if another occurrence is found, reports the keyword name and line numbers of both occurrences:

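import csv

# Report any keyword that appears on more than one line of the CSV.
with open('manual-ingredients.csv', newline='') as f:
	rows = csv.DictReader(f)
	found = {}  # keyword -> line number of its first occurrence
	# Line 1 is the header, so the first data row is line 2.
	for line, r in enumerate(rows, start=2):
		for k in r['keywords'].split('|'):
			if k == '':
				continue
			if k in found:
				print(str(line) + ': duplicate "' + k + '" from line ' + str(found[k]))
			else:
				found[k] = line

print('done.')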

Now we can go through the reported duplicates and either remove or rename one of them.

Generating ingredients JSON

Having beaten our ingredient list into reasonable shape, we’d like to run our ingredient classifier with our new data. To do that we must first convert our CSV data into JSON, which we can do with another simple script that builds a dictionary from the CSV rows and serialises it as JSON:

import csv
import json

ings = {}
with open('manual-ingredients.csv', newline='') as f:
	rows = csv.DictReader(f)

	for r in rows:
		# Skip rows that have no keywords at all.
		if r['keywords'].strip() == '':
			continue
		# The first keyword, with spaces replaced by hyphens, becomes the ingredient name.
		name = r['keywords'].split('|')[0].strip().replace(' ', '-')
		ings[name] = {
			'keywords': [k for k in r['keywords'].split('|') if k.strip() != ''],
			'density': float(r['density']),
			# Multiple FDC IDs are separated by '+'.
			'component_ids': [k for k in r['component_ids'].split('+') if k.strip() != ''],
			'unit_mass': float(r['unit_mass'])
		}

print(json.dumps(ings, indent=2))
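
To make the shape of the data concrete, the sketch below runs the same conversion over an in-memory CSV. Only the column names come from the script above; the row values and the FDC IDs are placeholders, not real entries:

```python
import csv
import io
import json

# Made-up sample rows; the IDs 000001 etc. are placeholders, not real FDC IDs.
sample = io.StringIO(
    'keywords,density,component_ids,unit_mass\n'
    'zucchini|courgette,1.0,000001+000002,100\n'
    ',1.0,000003,100\n'
)

ings = {}
for r in csv.DictReader(sample):
    # Rows without keywords are skipped, as in the script above.
    if r['keywords'].strip() == '':
        continue
    name = r['keywords'].split('|')[0].strip().replace(' ', '-')
    ings[name] = {
        'keywords': [k for k in r['keywords'].split('|') if k.strip() != ''],
        'density': float(r['density']),
        'component_ids': [c for c in r['component_ids'].split('+') if c.strip() != ''],
        'unit_mass': float(r['unit_mass']),
    }

print(json.dumps(ings, indent=2))
```

The result has a single zucchini entry with two keywords and two component IDs; the second row is dropped because its keywords column is empty.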

Handling no name match

As it stands, the aggregator will blow up if there’s no match because of this line in both total_grams and nutrient_grams:

		return g / len(self.names)

If there are no matched names we’ll get a division by zero. In that case g will be zero anyway, so a simple fix is to bound len(self.names) below at 1 using max:

		return g / max(len(self.names), 1)
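
As a quick illustration of the guard in isolation, here is a tiny sketch. The function average_over_names is a hypothetical stand-in for the final line of the two methods, not part of the aggregator:

```python
def average_over_names(g, names):
    # max(..., 1) keeps the divisor at 1 or more, so an empty list gives g / 1.
    return g / max(len(names), 1)


print(average_over_names(0, []))              # no match: 0.0 rather than an exception
print(average_over_names(300.0, ['a', 'b']))  # two matches: 150.0
```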

Handling multiple components

We can have several component FDC IDs for an ingredient. We handle this by calculating the average nutrient value for the set. The complete nutrient_grams method looks like this:

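	def nutrient_grams(self, nutrient_id):
		g = 0
		for name in self.names:
			# Sum the nutrient mass over every component FDC ID
			# (nutrient values are per 100 g of food).
			i_g = 0
			for fdc_id in ingredients[name]['component_ids']:
				i_g += nutrient_values[fdc_id + ':' + nutrient_id] * self.total_grams_of(name) / 100
			# Average over the components of this ingredient.
			g += i_g / len(ingredients[name]['component_ids'])
		# Average over the matched names; max guards against an empty list.
		return g / max(len(self.names), 1)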

Here we simply accumulate the nutrient value over all the components and divide by the number of components. That value is then accumulated for each ingredient in the recipe as before.
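
The component averaging can be checked by hand with a small standalone sketch. The function nutrient_grams_for, the FDC IDs, the nutrient ID and the nutrient values are all hypothetical; the only things carried over from the method above are the key format and the per-100 g scaling:

```python
# Hypothetical data: the IDs and values are placeholders, not real FDC entries.
ingredients = {'carrot': {'component_ids': ['111', '222']}}
# Keyed as '<fdc_id>:<nutrient_id>', in grams of nutrient per 100 g of food.
nutrient_values = {'111:protein': 1.0, '222:protein': 0.5}


def nutrient_grams_for(name, nutrient_id, grams):
    # Accumulate the nutrient mass for each component, then average.
    component_ids = ingredients[name]['component_ids']
    total = 0
    for fdc_id in component_ids:
        total += nutrient_values[fdc_id + ':' + nutrient_id] * grams / 100
    return total / len(component_ids)


# 200 g of carrot: (2.0 + 1.0) / 2 = 1.5 g of protein
print(nutrient_grams_for('carrot', 'protein', 200.0))
```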

Classify a recipe

Now we have enough to classify and aggregate our example recipe from before. This produces the following output:

2 courgettes (zucchini)  
number: 2.0, unit: default, names: ['zucchini'], grams: 200.0, Protein: 5.42, Fat: 0.8, Carbohydrate: 6.22

1 carrot  
number: 1.0, unit: default, names: ['carrot'], grams: 100.0, Protein: 0.68, Fat: 0.2918181818181818, Carbohydrate: 7.220909090909092

1 avocado  
number: 1.0, unit: default, names: ['avocado'], grams: 100.0, Protein: 2.063333333333333, Fat: 13.376666666666665, Carbohydrate: 8.33

1 bunch basil  
number: 1.0, unit: default, names: ['basil'], grams: 100.0, Protein: 13.065, Fat: 2.355, Carbohydrate: 25.2

1 tbsp lemon juice  
number: 1.0, unit: tbsp, names: ['lemon-juice'], grams: 17.7582, Protein: 0.06215369999999999, Fat: 0.04261967999999999, Carbohydrate: 1.2253158

2 tbsp nutritional yeast  
number: 2.0, unit: tbsp, names: ['yeast-extract'], grams: 35.5164, Protein: 8.481316319999998, Fat: 0.3196476, Carbohydrate: 7.25244888

10 olives, sliced  
number: 10.0, unit: default, names: ['olive'], grams: 1000.0, Protein: 9.466666666666667, Fat: 110.3, Carbohydrate: 51.63333333333333

4 garlic cloves, roasted  
number: 4.0, unit: default, names: ['garlic'], grams: 400.0, Protein: 25.44, Fat: 2.0, Carbohydrate: 132.24

2 tomatoes, roasted  
number: 2.0, unit: default, names: ['red-tomato'], grams: 200.0, Protein: 1.98, Fat: 0.9675, Carbohydrate: 10.4275

Pinch of chilli powder or smoked paprika
number: 1.0, unit: pinch, names: ['chili-powder'], grams: 0.73992, Protein: 0.099593232, Fat: 0.105660576, Carbohydrate: 0.36774024000000005


Total mass: 2154.0145199999997
Protein: 66.758063252
Fat: 130.55891270448484
Carbohydrate: 250.11724734424243

Note that I modified the program to store the original ingredient quantity string and output it along with the aggregated output to improve readability.

Improving the dataset

Some of the numbers are clearly inaccurate. For example, 10 olives do not weigh 1 kg. The issue here is that the blanket estimate of 100 g per unit is way off in some cases, such as the olive. Fortunately, the number of ingredients is not so large as to preclude manually sourcing unit size data.

Replacing the blanket estimate with manually sourced unit masses gives us a rather better result:

2 courgettes (zucchini)  
number: 2.0, unit: default, names: ['zucchini'], grams: 392.0, Protein: 10.623199999999999, Fat: 1.568, Carbohydrate: 12.191199999999998

1 carrot  
number: 1.0, unit: default, names: ['carrot'], grams: 61.0, Protein: 0.4148, Fat: 0.1780090909090909, Carbohydrate: 4.404754545454545

1 avocado  
number: 1.0, unit: default, names: ['avocado'], grams: 136.0, Protein: 2.8061333333333334, Fat: 18.19226666666667, Carbohydrate: 11.3288

1 bunch basil  
number: 1.0, unit: default, names: ['basil'], grams: 70.0, Protein: 9.145500000000002, Fat: 1.6485, Carbohydrate: 17.639999999999997

1 tbsp lemon juice  
number: 1.0, unit: tbsp, names: ['lemon-juice'], grams: 17.7582, Protein: 0.06215369999999999, Fat: 0.04261967999999999, Carbohydrate: 1.2253158

2 tbsp nutritional yeast  
number: 2.0, unit: tbsp, names: ['yeast'], grams: 35.5164, Protein: 8.481316319999998, Fat: 0.3196476, Carbohydrate: 7.25244888

10 olives, sliced  
number: 10.0, unit: default, names: ['olive'], grams: 27.0, Protein: 0.2556, Fat: 2.9781, Carbohydrate: 1.3941

4 garlic cloves, roasted  
number: 4.0, unit: default, names: ['garlic'], grams: 12.0, Protein: 0.7632000000000001, Fat: 0.06, Carbohydrate: 3.9672

2 tomatoes, roasted  
number: 2.0, unit: default, names: ['tomato'], grams: 246.0, Protein: 2.4354, Fat: 1.190025, Carbohydrate: 12.825825000000002

Pinch of chilli powder or smoked paprika
number: 1.0, unit: pinch, names: ['chili-powder'], grams: 0.73992, Protein: 0.099593232, Fat: 0.105660576, Carbohydrate: 0.36774024000000005


Total mass: 998.01452
Protein: 35.08689658533333
Fat: 26.28282861357576
Carbohydrate: 72.59738446545454

This can still be improved by providing more accurate densities but even without that the errors are of a lesser order than before.
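
As a sanity check, the unit masses behind the improved output can be recovered by dividing each default-unit mass by its count. The sketch below uses only the numbers printed above; the variable names are my own:

```python
# (count, total grams) for the default-unit lines in the improved output.
improved = {
    'zucchini': (2, 392.0),
    'carrot': (1, 61.0),
    'avocado': (1, 136.0),
    'basil': (1, 70.0),
    'olive': (10, 27.0),
    'garlic': (4, 12.0),
    'tomato': (2, 246.0),
}

# Grams per single unit of each ingredient.
unit_mass = {name: grams / count for name, (count, grams) in improved.items()}
print(unit_mass)

# Masses of the lines measured in tbsp or pinch, copied from the output.
measured = [17.7582, 35.5164, 0.73992]
total = sum(grams for _, grams in improved.values()) + sum(measured)
print(round(total, 5))
```

This gives 2.7 g per olive and 3 g per garlic clove, against the 100 g assumed before, and the masses add up to the reported total.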

Next time

We'll improve the data by calculating densities.

Note: updates will be significantly sparser due to the coronavirus situation. I write these articles to pass the time on the train during my daily commute, which isn't happening while I work from home. I have every confidence that normal service will resume at some point.