Last time we built a classifier that reliably identified known ingredients and linked to their nutritional values in the Food Data Central database (https://fdc.nal.usda.gov/). This time we’re going to try to aggregate those values over all the ingredients in a recipe.
Aggregating nutrient values
The FDC database provides nutrient values per 100g of each ingredient. It is meaningless to add up those values directly unless we believe every recipe contains 100g of each ingredient, which sounds rather unlikely. First, we need to scale those values according to how much of the ingredient is required in the recipe. To be able to do that we need to:
- Extract the amounts from the ingredient strings (e.g. 1 tbsp)
- Convert the amounts to a mass in grams
When we have the mass of the ingredient in grams we simply multiply the nutrient values by ing_mass / 100. For example, avocado contains 14.66g of fat per 100g. If our recipe calls for 50g of avocado then the avocado contributes 14.66 * 50 / 100 = 7.33 grams of fat to the recipe. Once we have calculated the individual contribution of each ingredient to a given nutrient we can add them all up to find the total amount of that nutrient contained in the meal as a whole.
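The scaling step can be sketched as a one-line helper. The function name nutrient_contribution is made up for this illustration; the avocado figures are the ones quoted above.

```python
def nutrient_contribution(per_100g, ing_mass):
    """Grams of a nutrient contributed by ing_mass grams of an ingredient."""
    return per_100g * ing_mass / 100


# Avocado: 14.66g of fat per 100g, and the recipe calls for 50g.
avocado_fat = nutrient_contribution(14.66, 50)
print(round(avocado_fat, 2))  # 7.33
```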
Extracting amounts
We have extracted amounts from ingredient quantity strings in a previous article but there is a bit more to do here to make sure we pick the right amount(s) in cases where more than one is present.
To extract the raw amounts we’ll use the rule class we devised previously:
import re

class rule:
    def __init__(self, pattern, substitution):
        self.p = rule._translate_type_captures(rule._translate_type_matches(pattern))
        self.s = rule._translate_type_substitutions(substitution)

    def sub(self, s):
        return re.sub(self.p, self.s, s)

    @staticmethod
    def _translate_type_captures(s):
        # {(?<type>...)} becomes a named capture bounded by spaces or tags
        pat = r'\{\(\?\<(?P<type_and_index>[a-z_]+[0-9]*)\>(?P<content>.*?)\)\}'
        rep = r' ?(?<![^\> ])(?P<T_\g<type_and_index>>\g<content>)(?![^\< ]) ?'
        return re.sub(pat, rep, s)

    @staticmethod
    def _translate_type_matches(s):
        # <<!type>> becomes a negative lookahead for a tag of that type
        pat = r'\<\<!(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
        rep = r'(?! ?\<\g<type>\>)'
        s2 = re.sub(pat, rep, s)
        # <<type>> matches an already tagged span and captures its content
        pat = r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
        rep = r' ?\<\g<type>\>(?P<T_\g<type_and_index>>(?:(?!\<\/\g<type>\>).)*)\<\/\g<type>\> ?'
        return re.sub(pat, rep, s2)

    @staticmethod
    def _translate_type_substitutions(s):
        # <<type>> in the substitution re-emits the captured content inside a tag
        pat = r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
        rep = r' <\g<type>>\\g<T_\g<type_and_index>></\g<type>> '
        return re.sub(pat, rep, s)
For this application we can simply apply our rules sequentially to classify the amounts in a string:
def classify_amounts(s):
    for r in amount_rules:
        s = r.sub(s)
    return s
Then we can use extract_typed to grab the bits we’re interested in. We’ll simplify it to just accept one type and to match anything up to the corresponding closing tag:
def extract_typed(s, t):
    return re.findall(r'\<' + t + r'\>((?:(?!\</' + t + r'\>).)*)\</' + t + r'\>', s)
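As a quick check of how this behaves, here is a minimal, self-contained sketch that runs the same pattern over a hand-written classified string. The sample string is an assumption about what the classifier produces, not captured output.

```python
import re


def extract_typed(s, t):
    # Capture everything between <t> and the nearest following </t>.
    return re.findall(r'<' + t + r'>((?:(?!</' + t + r'>).)*)</' + t + r'>', s)


classified = '<amount> <number>2</number> <unit> <tbsp>tbsp</tbsp> </unit> </amount>'
numbers = extract_typed(classified, 'number')
units = extract_typed(classified, 'unit')
print(numbers)  # ['2']
print(units)    # [' <tbsp>tbsp</tbsp> ']
```

Note that the content of a unit still contains its inner tag, which is what lets us recover the canonical unit name afterwards.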
Now let’s think about the rules themselves. Previously we would match something like tbsp or tbs and mark it as type unit. We’ll need to know that these are both tablespoons to be able to translate the quantities into grams. Therefore let’s be more specific and use a rule like this:
rule(r'{(?<tbsp>[tT]bsps?|T|tbl|tbs|[tT]ablespoons?)}', ' <unit><<tbsp>></unit> '),
This results in a double classification so that, for example, tbl would be replaced with <unit><tbsp>tbl</tbsp></unit>. We still want to classify as a unit so we can match it as part of an amount but the extra tbsp classification enables us to map to an appropriate scaling factor without having to consider all the possible representations of a tablespoon unit.
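To see the double classification in isolation, here is a simplified stand-in that uses plain re.sub rather than the rule class. The names UNIT_PATTERNS and tag_units are made up for this sketch, and unlike the real rules it does not pad the tags with spaces.

```python
import re

# Canonical unit name -> alternatives that should map to it (two units only).
UNIT_PATTERNS = {
    'tbsp': r'[tT]bsps?|T|tbl|tbs|[tT]ablespoons?',
    'tsp': r'tsps?|t|teaspoons?',
}


def tag_units(s):
    for canonical, pattern in UNIT_PATTERNS.items():
        # Only match whole space-delimited tokens, then wrap them in both tags.
        token = r'(?<![^ ])(' + pattern + r')(?![^ ])'
        wrapped = r'<unit><' + canonical + r'>\1</' + canonical + r'></unit>'
        s = re.sub(token, wrapped, s)
    return s


print(tag_units('2 tbl lemon juice'))
# 2 <unit><tbsp>tbl</tbsp></unit> lemon juice
```

Whatever spelling the recipe used, the inner tag is always tbsp, so a later lookup needs only one key per unit.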
We can repeat this for our other units:
amount_rules = [
    # imprecise cooking units
    rule(r'{(?<pinch>[pP]inch(?:es)?)}', ' <unit><<pinch>></unit> '),
    rule(r'{(?<dash>[dD]ash)}', ' <unit><<dash>></unit> '),
    # general units of volume
    rule(r'{(?<ml>mls?|mL|cc|millilitres?|milliliters?)}', ' <unit><<ml>></unit> '),
    rule(r'{(?<tsp>tsps?|t|teaspoons?)}', ' <unit><<tsp>></unit> '),
    rule(r'{(?<tbsp>[tT]bsps?|T|tbl|tbs|[tT]ablespoons?)}', ' <unit><<tbsp>></unit> '),
    rule(r'{(?<floz>fl ?oz|fluid ounces?)}', ' <unit><<floz>></unit> '),
    rule(r'{(?<cup>cups?)}', ' <unit><<cup>></unit> '),
    rule(r'{(?<pt>p|pts?|pints?)}', ' <unit><<pt>></unit> '),
    rule(r'{(?<l>ls?|L|litres?|liters?)}', ' <unit><<l>></unit> '),
    rule(r'{(?<gal>gals?|gallons?)}', ' <unit><<gal>></unit> '),
    rule(r'{(?<dl>dls?|dL|decilitre|deciliter)}', ' <unit><<dl>></unit> '),
    # general units of mass
    rule(r'{(?<kg>kgs?|kilos?|kilograms?)}', ' <unit><<kg>></unit> '),
    rule(r'{(?<g>gs?|grams?|grammes?)}', ' <unit><<g>></unit> '),
    rule(r'{(?<oz>oz|ounces?)}', ' <unit><<oz>></unit> '),
    rule(r'{(?<lb>lbs?|#|pounds?)}', ' <unit><<lb>></unit> '),
Then we use our tried and tested number matching rules:
    # numbers
    rule(r'{(?<number>(?:\d* )?\d+ ?\/ ?\d+|\d*\s?[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞]|\d+(\.\d+)?)}', '<<number>>'),
    rule(r'{(?<number>an?)}', '<<number>>'),
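These rules match mixed fractions such as 1 1/2, single fraction characters such as ½, and the words a and an, none of which float() can parse directly. One possible way to turn such matches into a value is sketched below. The parse_number helper is hypothetical and is not part of the code in this article.

```python
import re
import unicodedata
from fractions import Fraction


def parse_number(text):
    """Convert '2', '1.5', '1/2', '1 1/2', '½', '1½', 'a' or 'an' to a float."""
    text = text.strip()
    if text in ('a', 'an'):
        return 1.0
    total = 0.0
    # A trailing fraction character such as ½ has Unicode category 'No'.
    if text and unicodedata.category(text[-1]) == 'No':
        total += unicodedata.numeric(text[-1])
        text = text[:-1]
    # Close up '1 / 2' so that each space-separated part parses on its own.
    text = re.sub(r'\s*/\s*', '/', text)
    for part in text.split():
        total += float(Fraction(part))
    return total


print(parse_number('1 1/2'))  # 1.5
print(parse_number('½'))      # 0.5
```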
And a general rule to handle amounts containing a number and/or a unit:
    # general amounts
    rule(r'{(?<amount><<number1>>[\-–]?<<unit1>>|<<number2>>|<<unit2>>)}', '<<amount>>')
]
Recall the recipe that we used in the previous article:
recipe = '''
2 courgettes (zucchini)
1 carrot
1 avocado
1 bunch basil
1 tbsp lemon juice
2 tbsp nutritional yeast
10 olives, sliced
4 garlic cloves, roasted
2 tomatoes, roasted
Pinch of chilli powder or smoked paprika
'''
We want to convert each of the lines in this recipe into a structured ingredient quantity that we can query easily. Let’s create a class called ingredient_quantity to encapsulate the number, unit and ingredient name(s):
class ingredient_quantity:
    def __init__(self, number, unit, names):
        self.number = number
        self.unit = unit
        self.names = names

    def __repr__(self):
        return 'number: ' + str(self.number) + ', unit: ' + self.unit + ', names: ' + str(self.names)

    @classmethod
    def from_string(cls, iq_str):
        pass
We’ll create an instance of this class for each non-blank line in the input recipe:
iqs = [ingredient_quantity.from_string(iq_str) for iq_str in recipe.splitlines() if iq_str.strip() != '']
Now let’s implement the from_string class method. The first thing we need to do is classify the string. We’ll tokenise and classify_ingredients as before and then we’ll call our new classify_amounts method to add in the amount typing:
    @classmethod
    def from_string(cls, iq_str):
        s = tokenise(iq_str)
        s = classify_ingredients(s)
        s = classify_amounts(s)
That gets us a nicely classified string, e.g.
'<amount> <number>2</number> <unit> <tbsp>tbsp</tbsp> </unit> </amount>'
Now we just need to extract the number and the unit from the amount. We can go through each amount and store a number or unit as we find it. If both a number and a unit are found in the same amount then we’ll take those and stop, because a number and unit that appear together are a better choice than a number and unit taken from separate amounts. We also extract the canonical ingredient names, each of which is the first tag inside an ingredient. Here is the complete from_string method:
    @classmethod
    def from_string(cls, iq_str):
        s = tokenise(iq_str)
        s = classify_ingredients(s)
        s = classify_amounts(s)
        amounts = extract_typed(s, 'amount')
        number = '1'
        unit = 'default'
        for amount in amounts:
            numbers = extract_typed(amount, 'number')
            units = extract_typed(amount, 'unit')
            if len(numbers) > 0:
                number = numbers[0]
            if len(units) > 0:
                unit = first_tag(units[0])
            if len(numbers) > 0 and len(units) > 0:
                break
        names = [first_tag(i) for i in extract_typed(s, 'ingredient')]
        return cls(float(number), unit, names)
If no number is found, we assume 1 and if no unit is found we assume default (meaning the average size of one item of the ingredient). There are quantities that this won’t work for (e.g. numbers with symbolic fractions) but it’ll do for now and we can come back and improve it later. The first_tag helper function is implemented as a simple regex find:
def first_tag(s):
    for t in re.finditer(r'\<([^\>]*)\>', s):
        return t.group(1)
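To check the selection logic when a line contains more than one amount, here is a self-contained sketch that pulls the loop out into a standalone function. The name pick_number_and_unit and both classified strings are hypothetical; they are hand-written, not real classifier output.

```python
import re


def extract_typed(s, t):
    return re.findall(r'<' + t + r'>((?:(?!</' + t + r'>).)*)</' + t + r'>', s)


def first_tag(s):
    m = re.search(r'<([^>]*)>', s)
    return m.group(1) if m else None


def pick_number_and_unit(classified):
    number, unit = '1', 'default'
    for amount in extract_typed(classified, 'amount'):
        numbers = extract_typed(amount, 'number')
        units = extract_typed(amount, 'unit')
        if numbers:
            number = numbers[0]
        if units:
            unit = first_tag(units[0])
        if numbers and units:
            break  # a number and unit in the same amount win outright
    return float(number), unit


# Hand-written classification of something like "1 tin (400 g) tomatoes".
two_amounts = ('<amount> <number>1</number> </amount> tin '
               '<amount> <number>400</number> <unit> <g>g</g> </unit> </amount> tomatoes')
unit_only = '<amount> <unit> <pinch>Pinch</pinch> </unit> </amount> of salt'

print(pick_number_and_unit(two_amounts))  # (400.0, 'g')
print(pick_number_and_unit(unit_only))    # (1.0, 'pinch')
```

The first amount sets the number to 1, but the second amount has both a number and a unit, so it overrides the earlier value and ends the search.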
Let’s print out our ingredient_quantity instances:
for iq in iqs:
    print(str(iq))
This gives the following output:
number: 2.0, unit: default, names: ['courgette', 'courgette']
number: 1.0, unit: default, names: ['carrot']
number: 1.0, unit: default, names: ['avocado']
number: 1.0, unit: default, names: ['basil']
number: 1.0, unit: tbsp, names: ['lemon']
number: 2.0, unit: tbsp, names: ['yeast']
number: 10.0, unit: default, names: ['black-olives']
number: 4.0, unit: default, names: ['garlic']
number: 2.0, unit: default, names: ['tomato']
number: 1.0, unit: pinch, names: ['chilli-powder', 'paprika']
This is looking very respectable. We’ve extracted all the information we need from the recipe. All that remains is to convert the values to a common system of units and aggregate their values.
Converting to base units
Our ingredient quantities are specified using numerous different units. Ultimately we want everything to be in grams because our nutrient database (FDC) specifies quantities per 100 grams.
Generally, there are two different fundamental types of unit in a recipe:
- Mass (g, oz, lb, ...)
- Volume (ml, tbsp, cup, ...)
There are also quantities such as 2 tomatoes where it appears that no unit is present. In fact, the unit here is implicit in the ingredient (i.e. a typical tomato weighs 120 grams). These could be mass or volume depending on the unit we choose to store the typical quantity. Let’s assume we already have the size in grams for a unit of each ingredient and we can simply multiply by this value to get the total mass for the number of units the recipe demands.
That leaves us with those ingredient quantities that are not specified in grams. We first need to multiply by a conversion factor to convert from a unit such as oz to the base unit (e.g. g). We can create a conversion map that stores the conversion factors for the units we know about:
{
    "g": {"factor": 1.0, "base_unit": "g"},
    "oz": {"factor": 28.3495, "base_unit": "g"},
    "lb": {"factor": 453.592, "base_unit": "g"},
    "kg": {"factor": 1000.0, "base_unit": "g"},
    "ml": {"factor": 1.0, "base_unit": "ml"},
    "cc": {"factor": 1.0, "base_unit": "ml"},
    "pinch": {"factor": 0.73992, "base_unit": "ml"},
    "tsp": {"factor": 5.91939, "base_unit": "ml"},
    "tbsp": {"factor": 17.7582, "base_unit": "ml"},
    "floz": {"factor": 28.4131, "base_unit": "ml"},
    "cup": {"factor": 284.131, "base_unit": "ml"},
    "pt": {"factor": 568.261, "base_unit": "ml"},
    "dl": {"factor": 100.0, "base_unit": "ml"},
    "l": {"factor": 1000.0, "base_unit": "ml"},
    "gal": {"factor": 4546.09, "base_unit": "ml"}
}
Notice for the volume units that the base unit is ml, not g. Mass and volume are related to each other by density, which is specific to the ingredient. Let’s assume we have the density specified in g/ml for each ingredient. We can multiply by the density to convert from ml to g.
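The two-step conversion can be tried out on its own with a small subset of the map. This is only a sketch: the to_grams function is hypothetical, and the density of 1.0 g/ml is a placeholder rather than a measured value. It also raises a clear error for a unit with no entry, which matters because the rules classify dash but the map above lists no factor for it.

```python
# Subset of the conversion map, copied from the values above.
units = {
    'g': {'factor': 1.0, 'base_unit': 'g'},
    'oz': {'factor': 28.3495, 'base_unit': 'g'},
    'tbsp': {'factor': 17.7582, 'base_unit': 'ml'},
}


def to_grams(number, unit, density=None):
    """Convert a quantity to grams; density (g/ml) is needed for volume units."""
    if unit not in units:
        raise ValueError('no conversion factor for unit: ' + unit)
    base_number = number * units[unit]['factor']
    if units[unit]['base_unit'] == 'g':
        return base_number
    if density is None:
        raise ValueError('density is required to convert ' + unit + ' to grams')
    return base_number * density


print(to_grams(2, 'oz'))
print(to_grams(2, 'tbsp', density=1.0))  # placeholder density
```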
Bringing all this together gives us the following method on ingredient_quantity for converting to its grams equivalent:
    def total_grams_of(self, name):
        if self.unit == 'default':
            unit_mass = ingredients[name]['unit_mass']
            return self.number * unit_mass
        factor = units[self.unit]['factor']
        base_number = self.number * factor
        base_unit = units[self.unit]['base_unit']
        if base_unit == 'g':
            return base_number
        density = ingredients[name]['density']
        return base_number * density
This produces a value for a given ingredient name and there may be more than one ingredient name so we provide a method that averages for all the names:
    def total_grams(self):
        g = 0
        for name in self.names:
            g += self.total_grams_of(name)
        return g / len(self.names)
Finally, we need a method to return the amount (in grams) of a given nutrient. Here we simply look up the nutrient value using the fdc_id (derived from the canonical ingredient name) and the supplied nutrient id and multiply by the number of grams returned by total_grams_of divided by 100:
    def nutrient_grams(self, nutrient_id):
        g = 0
        for name in self.names:
            fdc_id = ingredients[name]['fdc_id']
            g += nutrient_values[fdc_id + ':' + nutrient_id] * self.total_grams_of(name) / 100
        return g / len(self.names)
Note that we added additional data into the ingredients.json file:
- fdc_id
- density
- unit_mass
- unit_name (for commentary only)
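For illustration, an entry with these four fields might look something like the sketch below. Only the 120g unit mass comes from the text above. The fdc_id, density and unit_name values are placeholders, and the exact layout of the real file is an assumption. The fdc_id is written as a string because nutrient_grams concatenates it with the nutrient id.

```python
import json

# Hypothetical ingredients.json content; placeholder values except unit_mass.
ingredients_json = '''
{
    "tomato": {
        "fdc_id": "000000",
        "density": 1.0,
        "unit_mass": 120,
        "unit_name": "medium tomato"
    }
}
'''
ingredients = json.loads(ingredients_json)

# "2 tomatoes" has no explicit unit, so it is scaled by the unit mass.
tomato_grams = 2 * ingredients['tomato']['unit_mass']
print(tomato_grams)  # 240
```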
Adding it all up
The last step is to aggregate the nutrient values over all of the ingredients to get the totals for the recipe. This just involves iterating over the ingredient quantities and summing the value of iq.nutrient_grams(nutrient_id) for each nutrient:
totals = [0.0 for n in nutrients.items()]
total_mass = 0.0
for iq in iqs:
    for n, nutrient_id in enumerate(nutrients.keys()):
        totals[n] += iq.nutrient_grams(nutrient_id)
    total_mass += iq.total_grams()
print()
print('Total mass: ' + str(total_mass))
for n, nutrient_id in enumerate(nutrients.keys()):
    print(nutrients[nutrient_id] + ': ' + str(totals[n]))
We also sum iq.total_grams() here to get the total mass of the recipe.
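The same aggregation could also be keyed by nutrient rather than by list position, so there is no index to keep in step with the nutrients dictionary. Here is a minimal sketch of that alternative on made-up inputs. Only the avocado fat figure comes from this article; the second ingredient is invented purely to show the summation.

```python
from collections import defaultdict

# (mass in grams, {nutrient name: grams per 100g}) for each ingredient.
contributions = [
    (50.0, {'fat': 14.66}),                 # avocado figure from the article
    (200.0, {'fat': 1.0, 'protein': 2.0}),  # invented values for illustration
]

totals = defaultdict(float)
total_mass = 0.0
for mass, per_100g in contributions:
    total_mass += mass
    for nutrient, value in per_100g.items():
        totals[nutrient] += value * mass / 100

print('Total mass: ' + str(total_mass))
for nutrient, grams in totals.items():
    print(nutrient + ': ' + str(round(grams, 2)))
```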
Source code
That’s it! We have extracted amount and ingredient information from each line of a recipe, converted the values to grams, mapped them to nutrient data in our database and then aggregated over the whole recipe. The complete source code for this example is on GitHub (https://gist.github.com/jeremyorme/34504e4966763f1170474fc978f44ddf).
Next time
We’ll think about acquiring enough data to be able to process a wide range of recipes.