# Aggregation is the name of the game

Last time we built a classifier that reliably identified known ingredients and linked to their nutritional values in the Food Data Central database (https://fdc.nal.usda.gov/). This time we’re going to try and aggregate those values over all the ingredients in a recipe.

## Aggregating nutrient values

The FDC database provides nutrient values per 100g of each ingredient. It is meaningless to add up those values directly unless we believe every recipe contains 100g of each ingredient, which sounds rather unlikely. First, we need to scale those values according to how much of the ingredient is required in the recipe. To be able to do that we need to:

- Extract the amounts from the ingredient strings (e.g. `1 tbsp`)
- Convert the amounts to a mass in grams

When we have the mass of the ingredient in grams we simply multiply the nutrient values by `ing_mass / 100`. For example, avocado contains 14.66g Fat / 100g. If our recipe calls for 50g of avocado then we calculate that the avocado contributes `14.66 * 50 / 100` = `7.33` grams of fat to the recipe. Once we have calculated the individual contribution of each ingredient to a given nutrient we can add them all up to find the total amount of that nutrient contained in the meal as a whole.
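The scaling step can be sketched in a few lines. This is only an illustration: `scaled_nutrient` is a hypothetical helper name, the avocado figures come from the paragraph above, and the second row is a made-up placeholder included purely to show the summation.

```python
def scaled_nutrient(per_100g, ing_mass_g):
    """Grams of a nutrient contributed by ing_mass_g grams of an ingredient."""
    return per_100g * ing_mass_g / 100

# (nutrient per 100 g, mass of ingredient in grams)
contributions = [
    scaled_nutrient(14.66, 50),  # avocado fat, figures from the text
    scaled_nutrient(100.0, 10),  # placeholder second ingredient, not real data
]

# The recipe total for a nutrient is the sum of the scaled contributions
total_fat = sum(contributions)

print(round(contributions[0], 2))
print(round(total_fat, 2))
```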

## Extracting amounts

We have extracted amounts from ingredient quantity strings in a previous article but there is a bit more to do here to make sure we pick the right amount(s) in cases where more than one is present.

To extract the raw amounts we’ll use the rule class we devised previously:

```
import re

class rule:
    def __init__(self, pattern, substitution):
        self.p = rule._translate_type_captures(rule._translate_type_matches(pattern))
        self.s = rule._translate_type_substitutions(substitution)

    def sub(self, s):
        return re.sub(self.p, self.s, s)

    @staticmethod
    def _translate_type_captures(s):
        # {(?<type>...)} becomes a named group T_type that only matches when
        # bounded by spaces, tags or the ends of the string
        pat = r'\{\(\?\<(?P<type_and_index>[a-z_]+[0-9]*)\>(?P<content>.*?)\)\}'
        rep = r' ?(?<![^\> ])(?P<T_\g<type_and_index>>\g<content>)(?![^\< ]) ?'
        return re.sub(pat, rep, s)

    @staticmethod
    def _translate_type_matches(s):
        # <<!type>> becomes a negative lookahead for an opening <type> tag
        pat = r'\<\<!(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
        rep = r'(?! ?\<\g<type>\>)'
        s2 = re.sub(pat, rep, s)
        # <<type>> matches an existing <type>...</type> element and captures its content
        pat = r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
        rep = r' ?\<\g<type>\>(?P<T_\g<type_and_index>>(?:(?!\<\/\g<type>\>).)*)\<\/\g<type>\> ?'
        return re.sub(pat, rep, s2)

    @staticmethod
    def _translate_type_substitutions(s):
        # <<type>> in the substitution wraps the captured content in <type>...</type>
        pat = r'\<\<(?P<type_and_index>(?P<type>[a-z_]+)[0-9]*)\>\>'
        rep = r' <\g<type>>\\g<T_\g<type_and_index>></\g<type>> '
        return re.sub(pat, rep, s)
```

For this application we can simply apply our rules sequentially to classify the amounts in a string:

```
def classify_amounts(s):
    for r in amount_rules:
        s = r.sub(s)
    return s
```

Then we can use `extract_typed` to grab the bits we’re interested in. We’ll simplify it to just accept one type and to match anything up to the corresponding closing tag:

```
def extract_typed(s, t):
    return re.findall(r'\<' + t + r'\>((?:(?!\</' + t + r'\>).)*)\</' + t + r'\>', s)
```
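As a quick check of what `extract_typed` returns, here is a small usage sketch. The `classified` string is hand-written to resemble the output of the classification rules rather than produced by them, and the function is repeated so the snippet runs on its own.

```python
import re

def extract_typed(s, t):
    # Capture everything between <t> and the next </t>
    return re.findall(r'<' + t + r'>((?:(?!</' + t + r'>).)*)</' + t + r'>', s)

# Hand-written stand-in for a classified ingredient string
classified = '<amount> <number>2</number> <unit> <tbsp>tbsp</tbsp> </unit> </amount>'

print(extract_typed(classified, 'number'))  # ['2']
print(extract_typed(classified, 'unit'))    # [' <tbsp>tbsp</tbsp> ']
```

Note that the content of an outer tag is returned with any nested tags intact, which is what lets us dig further into an `amount` afterwards.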

Now let’s think about the rules themselves. Previously we would match something like `tbsp` or `tbs` and mark it as type `unit`. We’ll need to know that these are both tablespoons to be able to translate the quantities into grams. Therefore let’s be more specific and use a rule like this:

```
rule(r'{(?<tbsp>[tT]bsps?|T|tbl|tbs|[tT]ablespoons?)}', ' <unit><<tbsp>></unit> '),
```

This results in a double classification so that, for example, `tbl` would be replaced with `<unit><tbsp>tbl</tbsp></unit>`. We still want to classify as a `unit` so we can match it as part of an amount but the extra `tbsp` classification enables us to map to an appropriate scaling factor without having to consider all the possible representations of a tablespoon unit.
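To see why the inner tag is useful, consider this rough sketch. The names `canonical_unit` and `ML_PER_UNIT` are hypothetical and not part of the article's code, and the tagged strings are written by hand rather than generated by the rule. Whatever the surface form, the nested tag name is the same, so a single lookup is enough.

```python
import re

# Millilitres per unit, keyed by the canonical (inner) tag name
ML_PER_UNIT = {'tsp': 5.91939, 'tbsp': 17.7582}

def canonical_unit(classified_unit):
    """Return the name of the tag nested directly inside <unit>, or None."""
    m = re.search(r'<unit>\s*<([a-z_]+)>', classified_unit)
    return m.group(1) if m else None

for surface in ['tbl', 'Tbsp', 'tablespoons']:
    tagged = '<unit><tbsp>' + surface + '</tbsp></unit>'
    unit = canonical_unit(tagged)
    print(surface, '->', unit, ML_PER_UNIT[unit])
```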

We can repeat this for our other units:

```
amount_rules = [
    # imprecise cooking units
    rule(r'{(?<pinch>[pP]inch(?:es)?)}', ' <unit><<pinch>></unit> '),
    rule(r'{(?<dash>[dD]ash)}', ' <unit><<dash>></unit> '),
    # general units of volume
    rule(r'{(?<ml>mls?|mL|cc|millilitres?|milliliters?)}', ' <unit><<ml>></unit> '),
    rule(r'{(?<tsp>tsps?|t|teaspoons?)}', ' <unit><<tsp>></unit> '),
    rule(r'{(?<tbsp>[tT]bsps?|T|tbl|tbs|[tT]ablespoons?)}', ' <unit><<tbsp>></unit> '),
    rule(r'{(?<floz>fl ?oz|fluid ounces?)}', ' <unit><<floz>></unit> '),
    rule(r'{(?<cup>cups?)}', ' <unit><<cup>></unit> '),
    rule(r'{(?<pt>p|pts?|pints?)}', ' <unit><<pt>></unit> '),
    rule(r'{(?<l>ls?|L|litres?|liters?)}', ' <unit><<l>></unit> '),
    rule(r'{(?<gal>gals?|gallons?)}', ' <unit><<gal>></unit> '),
    rule(r'{(?<dl>dls?|dL|decilitre|deciliter)}', ' <unit><<dl>></unit> '),
    # general units of mass
    rule(r'{(?<kg>kgs?|kilos?|kilograms?)}', ' <unit><<kg>></unit> '),
    rule(r'{(?<g>gs?|grams?|grammes?)}', ' <unit><<g>></unit> '),
    rule(r'{(?<oz>oz|ounces?)}', ' <unit><<oz>></unit> '),
    rule(r'{(?<lb>lbs?|#|pounds?)}', ' <unit><<lb>></unit> '),
```

Then we use our tried and tested number matching rules:

```
    # numbers
    rule(r'{(?<number>(?:\d* )?\d+ ?\/ ?\d+|\d*\s?[½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞]|\d+(\.\d+)?)}', '<<number>>'),
    rule(r'{(?<number>an?)}', '<<number>>'),
```

And a general rule to handle amounts containing a number and/or a unit:

```
    # general amounts
    rule(r'{(?<amount><<number1>>[\-–]?<<unit1>>|<<number2>>|<<unit2>>)}', '<<amount>>')
]
```

Recall the recipe that we used in the previous article:

```
recipe = '''
2 courgettes (zucchini)
1 carrot
1 avocado
1 bunch basil
1 tbsp lemon juice
2 tbsp nutritional yeast
10 olives, sliced
4 garlic cloves, roasted
2 tomatoes, roasted
Pinch of chilli powder or smoked paprika
'''
```

We want to convert each of the lines in this recipe into a structured ingredient quantity that we can query easily. Let’s create a class called `ingredient_quantity` to encapsulate the number, unit and ingredient name(s):

```
class ingredient_quantity:
    def __init__(self, number, unit, names):
        self.number = number
        self.unit = unit
        self.names = names

    def __repr__(self):
        return 'number: ' + str(self.number) + ', unit: ' + self.unit + ', names: ' + str(self.names)

    @classmethod
    def from_string(cls, iq_str):
        pass
```

We’ll create an instance of this class for each non-blank line in the input recipe:

```
iqs = [ingredient_quantity.from_string(iq_str) for iq_str in recipe.splitlines() if iq_str.strip() != '']
```

Now let’s implement the `from_string` class method. The first thing we need to do is classify the string. We’ll `tokenise` and `classify_ingredients` as before and then we’ll call our new `classify_amounts` method to add in the amount typing:

```
    @classmethod
    def from_string(cls, iq_str):
        s = tokenise(iq_str)
        s = classify_ingredients(s)
        s = classify_amounts(s)
```

That gets us a nicely classified string, e.g.

```
'<amount> <number>2</number> <unit> <tbsp>tbsp</tbsp> </unit> </amount>'
```

Now we just need to extract the `number` and the `unit` from the `amount`. We can go through each `amount` and store a `number` or `unit` as we find it. If both a `number` and `unit` are found in the same `amount` then we’ll take those and stop because their affinity implies a better choice than a `number` and `unit` from separate `amount`s. We also extract the ingredient canonical names contained in the first tag inside the `ingredient`s. Here is the complete `from_string` method:

```
    @classmethod
    def from_string(cls, iq_str):
        s = tokenise(iq_str)
        s = classify_ingredients(s)
        s = classify_amounts(s)
        amounts = extract_typed(s, 'amount')
        number = '1'
        unit = 'default'
        for amount in amounts:
            numbers = extract_typed(amount, 'number')
            units = extract_typed(amount, 'unit')
            if len(numbers) > 0:
                number = numbers[0]
            if len(units) > 0:
                unit = first_tag(units[0])
            if len(numbers) > 0 and len(units) > 0:
                break
        names = [first_tag(i) for i in extract_typed(s, 'ingredient')]
        return cls(float(number), unit, names)
```

If no `number` is found, we assume `1` and if no `unit` is found we assume `default` (meaning the average size of one item of the ingredient). There are quantities that this won’t work for (e.g. numbers with symbolic fractions) but it’ll do for now and we can come back and improve it later. The `first_tag` helper function is implemented as a simple regex find:

```
def first_tag(s):
    for t in re.finditer(r'\<([^\>]*)\>', s):
        return t.group(1)
```
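To see the selection logic in isolation, here is a self-contained sketch that mirrors the loop in `from_string` without needing `tokenise` or `classify_ingredients`. The function name `number_and_unit` is hypothetical, `first_tag` is written with `re.search` for brevity, and the input strings are hand-written stand-ins for the output of `classify_amounts`.

```python
import re

def extract_typed(s, t):
    return re.findall(r'<' + t + r'>((?:(?!</' + t + r'>).)*)</' + t + r'>', s)

def first_tag(s):
    m = re.search(r'<([^>]*)>', s)
    return m.group(1) if m else None

def number_and_unit(classified):
    """Pick a number and unit from an already classified string."""
    number, unit = '1', 'default'
    for amount in extract_typed(classified, 'amount'):
        numbers = extract_typed(amount, 'number')
        units = extract_typed(amount, 'unit')
        if numbers:
            number = numbers[0]
        if units:
            unit = first_tag(units[0])
        if numbers and units:
            break  # a number and unit in the same amount take priority
    return float(number), unit

both = '<amount> <number>2</number> <unit> <tbsp>tbsp</tbsp> </unit> </amount>'
split = ('<amount> <number>3</number> </amount> x '
         '<amount> <number>400</number> <unit> <g>g</g> </unit> </amount>')
unit_only = '<amount> <unit> <pinch>Pinch</pinch> </unit> </amount> of chilli powder'

print(number_and_unit(both))       # (2.0, 'tbsp')
print(number_and_unit(split))      # (400.0, 'g')
print(number_and_unit(unit_only))  # (1.0, 'pinch')
print(number_and_unit('olives'))   # (1.0, 'default')
```

In the `split` case the lone `3` is overwritten by the later amount that carries both a number and a unit, which is the affinity rule described above.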

Let’s print out our `ingredient_quantity` instances:

```
for iq in iqs:
    print(str(iq))
```

This gives the following output:

```
number: 2.0, unit: default, names: ['courgette', 'courgette']
number: 1.0, unit: default, names: ['carrot']
number: 1.0, unit: default, names: ['avocado']
number: 1.0, unit: default, names: ['basil']
number: 1.0, unit: tbsp, names: ['lemon']
number: 2.0, unit: tbsp, names: ['yeast']
number: 10.0, unit: default, names: ['black-olives']
number: 4.0, unit: default, names: ['garlic']
number: 2.0, unit: default, names: ['tomato']
number: 1.0, unit: pinch, names: ['chilli-powder', 'paprika']
```

This is looking very respectable. We’ve extracted all the information we need from the recipe. All that remains is to convert the values to a common system of units and aggregate their values.

## Converting to base units

Our ingredient quantities are specified using numerous different units. Ultimately we want everything to be in grams because our nutrient database (FDC) specifies quantities per 100 grams.

Generally, there are two different fundamental types of unit in a recipe:

- Mass (g, oz, lb, ...)
- Volume (ml, tbsp, cup, ...)

There are also quantities such as `2 tomatoes` where it appears that no unit is present. In fact, the unit here is implicit in the ingredient (i.e. a typical tomato weighs 120 grams). These could be mass or volume depending on the unit we choose to store the typical quantity. Let’s assume we already have the size in grams for a unit of each ingredient and we can simply multiply by this value to get the total mass for the number of units the recipe demands.

That leaves us with those ingredient quantities that are not specified in grams. We first need to multiply by a conversion factor to convert from a unit such as `oz` to the base unit (e.g. `g`). We can create a conversion map that stores the conversion factors for the units we know about:

```
{
    "g": {
        "factor": 1.0,
        "base_unit": "g"
    },
    "oz": {
        "factor": 28.3495,
        "base_unit": "g"
    },
    "lb": {
        "factor": 453.592,
        "base_unit": "g"
    },
    "kg": {
        "factor": 1000.0,
        "base_unit": "g"
    },
    "ml": {
        "factor": 1.0,
        "base_unit": "ml"
    },
    "cc": {
        "factor": 1.0,
        "base_unit": "ml"
    },
    "pinch": {
        "factor": 0.73992,
        "base_unit": "ml"
    },
    "tsp": {
        "factor": 5.91939,
        "base_unit": "ml"
    },
    "tbsp": {
        "factor": 17.7582,
        "base_unit": "ml"
    },
    "floz": {
        "factor": 28.4131,
        "base_unit": "ml"
    },
    "cup": {
        "factor": 284.131,
        "base_unit": "ml"
    },
    "pt": {
        "factor": 568.261,
        "base_unit": "ml"
    },
    "dl": {
        "factor": 100.0,
        "base_unit": "ml"
    },
    "l": {
        "factor": 1000.0,
        "base_unit": "ml"
    },
    "gal": {
        "factor": 4546.09,
        "base_unit": "ml"
    }
}
```

Notice for the volume units that the base unit is `ml`, not `g`. Mass and volume are related to each other by density, which is specific to the ingredient. Let’s assume we have the density specified in `g/ml` for each ingredient. We can multiply by the density to convert from `ml` to `g`.
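The two-stage conversion can be sketched as follows. This is a simplified illustration: `to_grams` is a hypothetical helper, the `units` dict holds just two entries copied from the conversion map above, and the density of `1.03` g/ml is an illustrative placeholder rather than a measured value.

```python
# Two entries copied from the conversion map above
units = {
    'oz':   {'factor': 28.3495, 'base_unit': 'g'},
    'tbsp': {'factor': 17.7582, 'base_unit': 'ml'},
}

def to_grams(number, unit, density_g_per_ml=None):
    """Convert a quantity to grams; volume units also need a density in g/ml."""
    entry = units[unit]
    base_number = number * entry['factor']
    if entry['base_unit'] == 'g':
        return base_number
    if density_g_per_ml is None:
        raise ValueError('density required to convert ' + unit + ' to grams')
    return base_number * density_g_per_ml

print(to_grams(8, 'oz'))                            # mass unit: factor only
print(to_grams(2, 'tbsp', density_g_per_ml=1.03))   # volume unit: factor, then density
```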

Bringing all this together gives us the following method on `ingredient_quantity` for converting to its grams equivalent:

```
    def total_grams_of(self, name):
        if self.unit == 'default':
            unit_mass = ingredients[name]['unit_mass']
            return self.number * unit_mass
        factor = units[self.unit]['factor']
        base_number = self.number * factor
        base_unit = units[self.unit]['base_unit']
        if base_unit == 'g':
            return base_number
        density = ingredients[name]['density']
        return base_number * density
```

This produces a value for a given ingredient name and there may be more than one ingredient name so we provide a method that averages for all the names:

```
    def total_grams(self):
        g = 0
        for name in self.names:
            g += self.total_grams_of(name)
        return g / len(self.names)
```

Finally, we need a method to return the amount (in grams) of a given nutrient. Here we simply look up the nutrient value using the `fdc_id` (derived from the canonical ingredient name) and the supplied nutrient id and multiply by the number of grams returned by `total_grams_of` divided by `100`:

```
    def nutrient_grams(self, nutrient_id):
        g = 0
        for name in self.names:
            fdc_id = ingredients[name]['fdc_id']
            g += nutrient_values[fdc_id + ':' + nutrient_id] * self.total_grams_of(name) / 100
        return g / len(self.names)
```
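As a worked illustration of the averaging, take a line that offers alternatives, such as `Pinch of chilli powder or smoked paprika`. The sketch below is a standalone function rather than the method itself, and every number in it is a made-up placeholder, not a value from FDC.

```python
# Placeholder data, for illustration only
fat_per_100g = {'chilli-powder': 14.0, 'paprika': 13.0}
grams_of = {'chilli-powder': 2.0, 'paprika': 1.8}

def nutrient_grams(names, per_100g, grams):
    """Average the scaled nutrient contribution over alternative ingredient names."""
    total = sum(per_100g[name] * grams[name] / 100 for name in names)
    return total / len(names)

print(nutrient_grams(['chilli-powder', 'paprika'], fat_per_100g, grams_of))
```

Each alternative is scaled by its own mass before averaging, so ingredients with different densities are handled separately.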

Note that we added additional data into the `ingredients.json` file:

- `fdc_id`
- `density`
- `unit_mass`
- `unit_name` (for commentary only)
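An entry with these fields might look something like the following once loaded into Python. The exact layout of the file is an assumption here, and apart from the 120 g typical tomato mass mentioned earlier, the values are placeholders (in particular the `fdc_id` is not a real FDC identifier). The id is kept as a string because `nutrient_grams` concatenates it with the nutrient id.

```python
# Hypothetical shape of one loaded ingredients.json entry
ingredients = {
    'tomato': {
        'fdc_id': '000000',            # placeholder, not a real FDC id
        'density': 1.0,                # g/ml, placeholder
        'unit_mass': 120.0,            # grams per item, as quoted for a typical tomato
        'unit_name': 'medium tomato',  # commentary only, placeholder
    }
}

# "2 tomatoes" has the default unit, so the mass is number * unit_mass
number = 2
print(number * ingredients['tomato']['unit_mass'])  # 240.0
```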

## Adding it all up

The last step is to aggregate the nutrient values over all of the ingredients to get the totals for the recipe. This just involves iterating over the ingredient quantities and summing the value of `iq.nutrient_grams(nutrient_id)` for each nutrient:

```
totals = [0.0 for n in nutrients.items()]
total_mass = 0.0
for iq in iqs:
    for n, nutrient_id in enumerate(nutrients.keys()):
        totals[n] += iq.nutrient_grams(nutrient_id)
    total_mass += iq.total_grams()
print()
print('Total mass: ' + str(total_mass))
for n, nutrient_id in enumerate(nutrients.keys()):
    print(nutrients[nutrient_id] + ': ' + str(totals[n]))
```

We also sum `iq.total_grams()` here to get the total mass of the recipe.
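The same aggregation can also be written with totals keyed by nutrient id instead of list position. The sketch below is one possible variant, not the article's code: `StubQuantity` is a hypothetical stand-in for `ingredient_quantity` with precomputed values, and the nutrient ids and figures are placeholders.

```python
class StubQuantity:
    """Stand-in for ingredient_quantity that returns precomputed values."""
    def __init__(self, grams, nutrient_values):
        self._grams = grams
        self._nutrient_values = nutrient_values

    def total_grams(self):
        return self._grams

    def nutrient_grams(self, nutrient_id):
        return self._nutrient_values[nutrient_id]

# Placeholder nutrient ids mapped to display names
nutrients = {'fat': 'Fat', 'protein': 'Protein'}
iqs = [
    StubQuantity(50.0, {'fat': 7.33, 'protein': 1.0}),
    StubQuantity(240.0, {'fat': 0.5, 'protein': 2.0}),
]

# Sum each nutrient over all ingredient quantities
totals = {nutrient_id: sum(iq.nutrient_grams(nutrient_id) for iq in iqs)
          for nutrient_id in nutrients}
total_mass = sum(iq.total_grams() for iq in iqs)

print('Total mass: ' + str(total_mass))
for nutrient_id, name in nutrients.items():
    print(name + ': ' + str(round(totals[nutrient_id], 2)))
```

Keying by id avoids relying on `enumerate` producing the same order in both loops, at the cost of iterating over the quantities once per nutrient.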

## Source code

That’s it! We have extracted amount and ingredient information from each line of a recipe, converted the values to grams, mapped them to nutrient data in our database and then aggregated over the whole recipe. The complete source code for this example is on GitHub (https://gist.github.com/jeremyorme/34504e4966763f1170474fc978f44ddf).

## Next time

We’ll think about acquiring enough data to be able to process a wide range of recipes.