Last time we built a classifier that reliably identified known ingredients and linked to their nutritional values in the FoodData Central database (https://fdc.nal.usda.gov/). This time we’re going to try to aggregate those values over all the ingredients in a recipe.

Aggregating nutrient values

The FDC database provides nutrient values per 100g of each ingredient. It is meaningless to add up those values directly unless we believe every recipe contains 100g of each ingredient, which sounds rather unlikely. First, we need to scale those values according to how much of the ingredient is required in the recipe. To be able to do that we need to:

  • Extract the amounts from the ingredient strings (e.g. 1 tbsp)
  • Convert the amounts to a mass in grams

When we have the mass of the ingredient in grams we simply multiply the nutrient values by ing_mass / 100. For example, avocado contains 14.66 g of fat per 100 g. If our recipe calls for 50 g of avocado then we calculate that the avocado contributes 14.66 * 50 / 100 = 7.33 grams of fat to the recipe. Once we have calculated the individual contribution of each ingredient to a given nutrient we can add them all up to find the total amount of that nutrient contained in the meal as a whole.
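The scaling step can be sketched as a tiny helper. This is only an illustration: the name scale_nutrient is an assumption and does not appear in the original code.

```python
def scale_nutrient(value_per_100g, ingredient_mass_g):
    """Scale a per-100 g nutrient value to the mass actually used."""
    return value_per_100g * ingredient_mass_g / 100.0

# Avocado: 14.66 g of fat per 100 g, and the recipe uses 50 g
avocado_fat_g = scale_nutrient(14.66, 50)
print(round(avocado_fat_g, 2))  # 7.33
```

Summing the result of this calculation over every ingredient gives the recipe total for that nutrient.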

Extracting amounts

We have extracted amounts from ingredient quantity strings in a previous article but there is a bit more to do here to make sure we pick the right amount(s) in cases where more than one is present.

To extract the raw amounts we’ll use the rule class we devised previously:


import re

class rule:
	# A rewrite rule whose typed pattern and substitution are translated to plain regex

	def __init__(self, pattern, substitution):
		self.p = rule._translate_type_captures(rule._translate_type_matches(pattern))
		self.s = rule._translate_type_substitutions(substitution)

	def sub(self, s):
		return re.sub(self.p, self.s, s)

	@staticmethod
	def _translate_type_captures(s):
		# {(?<type>...)} becomes a named group T_type bounded by spaces, tags or string ends
		pat = r'\{\(\?\<(?P<type_and_index>[a-z_]+[0-9]*)\>(?P<content>.*?)\)\}'
		rep = r' ?(?<![^\> ])(?P<T_\g<type_and_index>>\g<content>)(?![^\< ]) ?'
		return re.sub(pat, rep, s)

	@staticmethod
	def _translate_type_matches(s):
		# <<!type>> becomes a negative lookahead for an opening <type> tag
		pat = r'\<\<!(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
		rep = r'(?! ?\<\g<type>\>)'
		s2 = re.sub(pat, rep, s)
		# <<type>> matches a <type>...</type> element and captures its content as T_type
		pat = r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
		rep = r' ?\<\g<type>\>(?P<T_\g<type_and_index>>(?:(?!\<\/\g<type>\>).)*)\<\/\g<type>\> ?'
		return re.sub(pat, rep, s2)

	@staticmethod
	def _translate_type_substitutions(s):
		# <<type>> in a substitution re-emits the captured text wrapped in <type> tags
		pat = r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
		rep = r' <\g<type>>\\g<T_\g<type_and_index>></\g<type>> '
		return re.sub(pat, rep, s)

For this application we can simply apply our rules sequentially to classify the amounts in a string:

def classify_amounts(s):
	for r in amount_rules:
		s = r.sub(s)
	return s

Then we can use extract_typed to grab the bits we’re interested in. We’ll simplify it to just accept one type and to match anything up to the corresponding closing tag:

def extract_typed(s, t):
	return re.findall(r'\<' + t + r'\>((?:(?!\</' + t + r'\>).)*)\</' + t + r'\>', s)
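To see what this returns, here is a small standalone check. The classified string is an assumed example of classifier output, and extract_typed is repeated so that the snippet runs on its own.

```python
import re

def extract_typed(s, t):
    # Capture everything between <t> and the next </t>
    return re.findall(r'\<' + t + r'\>((?:(?!\</' + t + r'\>).)*)\</' + t + r'\>', s)

classified = '<amount> <number>2</number> <unit> <tbsp>tbsp</tbsp> </unit> </amount>'

print(extract_typed(classified, 'number'))  # ['2']
print(extract_typed(classified, 'unit'))    # [' <tbsp>tbsp</tbsp> ']
```

Note that the content of a tag is returned verbatim, including any nested tags and surrounding spaces.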

Now let’s think about the rules themselves. Previously we would match something like tbsp or tbs and mark it as type unit. We’ll need to know that these are both tablespoons to be able to translate the quantities into grams. Therefore let’s be more specific and use a rule like this:

	rule(r'{(?<tbsp>[tT]bsps?|T|tbl|tbs|[tT]ablespoons?)}', ' <unit><<tbsp>></unit> '),

This results in a double classification so that, for example, tbl would be replaced with <unit><tbsp>tbl</tbsp></unit>. We still want to classify as a unit so we can match it as part of an amount but the extra tbsp classification enables us to map to an appropriate scaling factor without having to consider all the possible representations of a tablespoon unit.
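As a rough sketch of why the inner tag helps, the snippet below reads the canonical unit name from a doubly classified unit and uses it as a lookup key. The names canonical_unit and ML_PER_UNIT are assumptions for illustration only; the two factors are the ones used in the conversion map later in this article.

```python
import re

# Millilitres per unit, keyed by canonical tag name (illustrative subset)
ML_PER_UNIT = {'tsp': 5.91939, 'tbsp': 17.7582}

def canonical_unit(classified_unit):
    """Return the name of the tag nested directly inside <unit>, if any."""
    match = re.search(r'<unit>\s*<([a-z_]+)>', classified_unit)
    return match.group(1) if match else None

# Every spelling resolves to the same key, so one table entry covers them all
for spelling in ('tbl', 'Tbsp', 'tablespoons'):
    tagged = '<unit><tbsp>' + spelling + '</tbsp></unit>'
    name = canonical_unit(tagged)
    print(spelling, '->', name, ML_PER_UNIT[name])
```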

We can repeat this for our other units:

amount_rules = [
	# imprecise cooking units
	rule(r'{(?<pinch>[pP]inch(?:es)?)}', ' <unit><<pinch>></unit> '),
	rule(r'{(?<dash>[dD]ash)}', ' <unit><<dash>></unit> '),
	
	# general units of volume
	rule(r'{(?<ml>mls?|mL|cc|millilitres?|milliliters?)}', ' <unit><<ml>></unit> '),
	rule(r'{(?<tsp>tsps?|t|teaspoons?)}', ' <unit><<tsp>></unit> '),
	rule(r'{(?<tbsp>[tT]bsps?|T|tbl|tbs|[tT]ablespoons?)}', ' <unit><<tbsp>></unit> '),
	rule(r'{(?<floz>fl ?oz|fluid ounces?)}', ' <unit><<floz>></unit> '),
	rule(r'{(?<cup>cups?)}', ' <unit><<cup>></unit> '),
	rule(r'{(?<pt>p|pts?|pints?)}', ' <unit><<pt>></unit> '),
	rule(r'{(?<l>ls?|L|litres?|liters?)}', ' <unit><<l>></unit> '),
	rule(r'{(?<gal>gals?|gallons?)}', ' <unit><<gal>></unit> '),
	rule(r'{(?<dl>dls?|dL|decilitres?|deciliters?)}', ' <unit><<dl>></unit> '),
	
	# general units of mass
	rule(r'{(?<kg>kgs?|kilos?|kilograms?)}', ' <unit><<kg>></unit> '),
	rule(r'{(?<g>gs?|grams?|grammes?)}', ' <unit><<g>></unit> '),
	rule(r'{(?<oz>oz|ounces?)}', ' <unit><<oz>></unit> '),
	rule(r'{(?<lb>lbs?|#|pounds?)}', ' <unit><<lb>></unit> '),

Then we use our tried and tested number matching rules:

	# numbers
	rule(r'{(?<number>(?:\d* )?\d+ ?\/ ?\d+|\d*\s?[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞]|\d+(\.\d+)?)}', '<<number>>'),
	rule(r'{(?<number>an?)}', '<<number>>'),

And a general rule to handle amounts containing a number and/or a unit:

	# general amounts
	rule(r'{(?<amount><<number1>>[\-–]?<<unit1>>|<<number2>>|<<unit2>>)}', '<<amount>>')
]

Recall the recipe that we used in the previous article:


recipe = '''
2 courgettes (zucchini)  
1 carrot  
1 avocado  
1 bunch basil  
1 tbsp lemon juice  
2 tbsp nutritional yeast  
10 olives, sliced  
4 garlic cloves, roasted  
2 tomatoes, roasted  
Pinch of chilli powder or smoked paprika
'''

We want to convert each of the lines in this recipe into a structured ingredient quantity that we can query easily. Let’s create a class called ingredient_quantity to encapsulate the number, unit and ingredient name(s):

class ingredient_quantity:
	def __init__(self, number, unit, names):
		self.number = number
		self.unit = unit
		self.names = names
		
	def __repr__(self):
		return 'number: ' + str(self.number) + ', unit: ' + self.unit + ', names: ' + str(self.names)

	@classmethod
	def from_string(cls, iq_str):
		pass

We’ll create an instance of this class for each non-blank line in the input recipe:

iqs = [ingredient_quantity.from_string(iq_str) for iq_str in recipe.splitlines() if iq_str.strip() != '']

Now let’s implement the from_string class method. The first thing we need to do is classify the string. We’ll tokenise and classify_ingredients as before and then we’ll call our new classify_amounts method to add in the amount typing:

	@classmethod
	def from_string(cls, iq_str):
		s = tokenise(iq_str)
		s = classify_ingredients(s)
		s = classify_amounts(s)

That gets us a nicely classified string, e.g.

'<amount> <number>2</number> <unit> <tbsp>tbsp</tbsp> </unit> </amount>'

Now we just need to extract the number and the unit from the amount. We can go through each amount and store a number or unit as we find it. If both a number and a unit are found in the same amount then we’ll take those and stop, because a number and unit that appear together are a better choice than a number and unit taken from separate amounts. We also extract the canonical ingredient names, each of which is the first tag nested inside an ingredient tag. Here is the complete from_string method:

	@classmethod
	def from_string(cls, iq_str):
		s = tokenise(iq_str)
		s = classify_ingredients(s)
		s = classify_amounts(s)
		
		amounts = extract_typed(s, 'amount')
		number = '1'
		unit = 'default'
		for amount in amounts:
			numbers = extract_typed(amount, 'number')
			units = extract_typed(amount, 'unit')
			if len(numbers) > 0:
				number = numbers[0]
			if len(units) > 0:
				unit = first_tag(units[0])
			if len(numbers) > 0 and len(units) > 0:
				break
				
		names = [first_tag(i) for i in extract_typed(s, 'ingredient')]
		
		return cls(float(number), unit, names)

If no number is found we assume 1, and if no unit is found we assume default (meaning the average size of one item of the ingredient). There are quantities that this won’t work for: float() cannot parse fractions such as 1/2 or ½, nor the words a and an that the number rule also accepts. It’ll do for now and we can come back and improve it later. The first_tag helper function is implemented as a simple regex find:

def first_tag(s):
	# Return the name of the first tag in s, or None if s contains no tag
	m = re.search(r'\<([^\>]*)\>', s)
	return m.group(1) if m else None
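One way the fraction limitation mentioned above might be addressed later is a small parser used in place of float(). This is a rough sketch only: parse_number is a hypothetical helper that is not part of the original code, and it relies on fractions.Fraction and unicodedata.numeric from the standard library.

```python
import re
import unicodedata
from fractions import Fraction

def parse_number(text):
    """Parse '2', '1.5', '1/2', '1 1/2', '½', '1½', 'a' or 'an' into a float."""
    text = text.strip()
    if text.lower() in ('a', 'an'):
        return 1.0  # treat the indefinite article as one
    total = 0.0
    plain = ''
    for ch in text:
        if ch in '0123456789./ ':
            plain += ch
        else:
            # Single-character fractions such as ½ carry a numeric value;
            # any other character raises ValueError here
            total += unicodedata.numeric(ch)
            plain += ' '
    # Allow spaces around the slash, e.g. '1 / 2'
    plain = re.sub(r'\s*/\s*', '/', plain)
    for token in plain.split():
        total += float(Fraction(token))
    return total

print(parse_number('1 1/2'))  # 1.5
print(parse_number('½'))      # 0.5
```

With a helper like this, from_string could call parse_number(number) instead of float(number).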

Let’s print out our ingredient_quantity instances:

for iq in iqs:
	print(str(iq))

This gives the following output:

number: 2.0, unit: default, names: ['courgette', 'courgette']
number: 1.0, unit: default, names: ['carrot']
number: 1.0, unit: default, names: ['avocado']
number: 1.0, unit: default, names: ['basil']
number: 1.0, unit: tbsp, names: ['lemon']
number: 2.0, unit: tbsp, names: ['yeast']
number: 10.0, unit: default, names: ['black-olives']
number: 4.0, unit: default, names: ['garlic']
number: 2.0, unit: default, names: ['tomato']
number: 1.0, unit: pinch, names: ['chilli-powder', 'paprika']

This is looking very respectable. We’ve extracted all the information we need from the recipe. All that remains is to convert the values to a common system of units and aggregate their values.

Converting to base units

Our ingredient quantities are specified using numerous different units. Ultimately we want everything to be in grams because our nutrient database (FDC) specifies quantities per 100 grams.

Generally, there are two different fundamental types of unit in a recipe:

  • Mass (g, oz, lb, ...)
  • Volume (ml, tbsp, cup, ...)

There are also quantities such as 2 tomatoes where it appears that no unit is present. In fact, the unit here is implicit in the ingredient (i.e. a typical tomato weighs 120 grams). These could be mass or volume depending on the unit we choose to store the typical quantity. Let’s assume we already have the size in grams for a unit of each ingredient and we can simply multiply by this value to get the total mass for the number of units the recipe demands.

That leaves us with those ingredient quantities that are not specified in grams. We first need to multiply by a conversion factor to convert from a unit such as oz to the base unit (e.g. g). We can create a conversion map that stores the conversion factors for the units we know about:

{
	"g": {
		"factor": 1.0,
		"base_unit": "g"
	},
	"oz": {
		"factor": 28.3495,
		"base_unit": "g"
	},
	"lb": {
		"factor": 453.592,
		"base_unit": "g"
	},
	"kg": {
		"factor": 1000.0,
		"base_unit": "g"
	},
	"ml": {
		"factor": 1.0,
		"base_unit": "ml"
	},
	"cc": {
		"factor": 1.0,
		"base_unit": "ml"
	},
	"pinch": {
		"factor": 0.73992,
		"base_unit": "ml"
	},
	"tsp": {
		"factor": 5.91939,
		"base_unit": "ml"
	},
	"tbsp": {
		"factor": 17.7582,
		"base_unit": "ml"
	},
	"floz": {
		"factor": 28.4131,
		"base_unit": "ml"
	},
	"cup": {
		"factor": 284.131,
		"base_unit": "ml"
	},
	"pt": {
		"factor": 568.261,
		"base_unit": "ml"
	},
	"dl": {
		"factor": 100.0,
		"base_unit": "ml"
	},
	"l": {
		"factor": 1000.0,
		"base_unit": "ml"
	},
	"gal": {
		"factor": 4546.09,
		"base_unit": "ml"
	}
}

Notice for the volume units that the base unit is ml, not g. The volume factors are imperial measures (for example, a pint of 568.261 ml); US customary equivalents differ. Mass and volume are related to each other by density, which is specific to the ingredient. Let’s assume we have the density specified in g/ml for each ingredient. We can multiply by the density to convert from ml to g.
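The two-step conversion (unit to base unit, then ml to g via density) can be sketched on its own as follows. The function name to_grams is an assumption, the units dictionary is a two-entry subset of the map above, and the density of 1.0 g/ml is a placeholder rather than a measured value.

```python
# Two entries copied from the conversion map
units = {
    'oz':   {'factor': 28.3495, 'base_unit': 'g'},
    'tbsp': {'factor': 17.7582, 'base_unit': 'ml'},
}

def to_grams(number, unit, density_g_per_ml=None):
    """Convert a quantity to grams, using density only for volume units."""
    entry = units[unit]
    base_amount = number * entry['factor']
    if entry['base_unit'] == 'g':
        return base_amount
    if density_g_per_ml is None:
        raise ValueError('a density is needed to convert ' + unit + ' to grams')
    return base_amount * density_g_per_ml

print(to_grams(4, 'oz'))         # mass unit: no density needed
print(to_grams(2, 'tbsp', 1.0))  # volume unit with a placeholder density
```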

Bringing all this together gives us the following method on ingredient_quantity for converting to its grams equivalent:

	def total_grams_of(self, name):
		if self.unit == 'default':
			unit_mass = ingredients[name]['unit_mass']
			return self.number * unit_mass
			
		factor = units[self.unit]['factor']
		base_number = self.number * factor
		
		base_unit = units[self.unit]['base_unit']
		if base_unit == 'g':
			return base_number
			
		density = ingredients[name]['density']
		return base_number * density

This produces a value for a single ingredient name. A line may name more than one ingredient (e.g. chilli powder or smoked paprika), so we provide a method that averages over all the names:

	def total_grams(self):
		g = 0
		for name in self.names:
			g += self.total_grams_of(name)
		return g / len(self.names)

Finally, we need a method to return the amount (in grams) of a given nutrient. Here we simply look up the nutrient value using the fdc_id (derived from the canonical ingredient name) and the supplied nutrient id and multiply by the number of grams returned by total_grams_of divided by 100:

	def nutrient_grams(self, nutrient_id):
		g = 0
		for name in self.names:
			fdc_id = ingredients[name]['fdc_id']
			g += nutrient_values[fdc_id + ':' + nutrient_id] * self.total_grams_of(name) / 100
		return g / len(self.names)
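To make the averaging concrete, here is a standalone sketch of the same calculation written as a plain function. Only the avocado fat value (14.66 g per 100 g) comes from this article; the ids, the two placeholder ingredients and their values are invented purely to show the arithmetic.

```python
# Hypothetical lookup tables (ids and placeholder values are made up)
ingredients = {
    'avocado':       {'fdc_id': 'A'},
    'placeholder-a': {'fdc_id': 'B'},
    'placeholder-b': {'fdc_id': 'C'},
}
nutrient_values = {'A:fat': 14.66, 'B:fat': 10.0, 'C:fat': 20.0}

def nutrient_grams(names, grams_by_name, nutrient_id):
    """Average the nutrient contribution over alternative ingredient names."""
    total = 0.0
    for name in names:
        fdc_id = ingredients[name]['fdc_id']
        per_100g = nutrient_values[fdc_id + ':' + nutrient_id]
        total += per_100g * grams_by_name[name] / 100
    return total / len(names)

# One name: 50 g of avocado
print(nutrient_grams(['avocado'], {'avocado': 50}, 'fat'))

# Two alternatives of 50 g each: (5.0 + 10.0) / 2
print(nutrient_grams(['placeholder-a', 'placeholder-b'],
                     {'placeholder-a': 50, 'placeholder-b': 50}, 'fat'))
```

Averaging treats each alternative as equally likely, which seems a reasonable guess when a recipe says "this or that".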

Note that we added additional data into the ingredients.json file:

  • fdc_id
  • density
  • unit_mass
  • unit_name (for commentary only)

Adding it all up

The last step is to aggregate the nutrient values over all of the ingredients to get the totals for the recipe. This just involves iterating over the ingredient quantities and summing the value of iq.nutrient_grams(nutrient_id) for each nutrient:

totals = [0.0] * len(nutrients)
total_mass = 0.0
for iq in iqs:
	for n, nutrient_id in enumerate(nutrients.keys()):
		totals[n] += iq.nutrient_grams(nutrient_id)
	total_mass += iq.total_grams()
		
print()
print('Total mass: ' + str(total_mass))
for n, nutrient_id in enumerate(nutrients.keys()):
	print(nutrients[nutrient_id] + ': ' + str(totals[n]))

We also sum iq.total_grams() here to get the total mass of the recipe.

Source code

That’s it! We have extracted amount and ingredient information from each line of a recipe, converted the values to grams, mapped them to nutrient data in our database and then aggregated over the whole recipe. The complete source code for this example is on GitHub (https://gist.github.com/jeremyorme/34504e4966763f1170474fc978f44ddf).

Next time

We’ll think about acquiring enough data to be able to process a wide range of recipes.