Last time we successfully processed a recipe consisting of a list of free(ish)-text ingredient quantities, linked to items in a food database and calculated the number of grams of each macro-nutrient in the recipe.

This worked well, but to accept a wide range of recipes we need (in the words of Johnny-5) input. Lots of it. For each ingredient that could be used in a recipe, we need to know:

  • keywords used to identify it
  • the id of the corresponding food item in the Food Data Central database
  • its density in g/ml (if it can be specified by volume)
  • the size of a medium size unit (if it can be specified in units)

Given that there are potentially tens of thousands of ingredients we might want to support, this is a daunting task.

Ways to source the data

There are a few ways we might think about getting hold of the data we need:

  • buy/license it
  • pay somebody to enter it
  • approximate it
  • automate it
  • enter it ourselves
  • persuade users to enter it

Let’s assume that we are a very early stage start-up without funding (or with minimal funding) and we don’t have the cash to hire a team of data entry people or to bung a third party a hefty licence fee. We might also like to avoid relying on a third party and dealing with the inevitable restrictions on data access, lack of control, potential API performance issues and slow release cycles that may hamper our development. So let’s rule out the first two options. That leaves:

  • approximate it
  • automate it
  • enter it ourselves
  • persuade users to enter it

Each of these has pros and cons.


Approximation

This means using an estimate in place of real data. It could involve:

  • setting some value to a constant
  • using a look-up table indexed by another property
  • using an approximate formula to calculate a value

Approximation is the easiest way to fill in data, assuming an acceptable approximation exists. However, it is likely to be the least accurate.


Automation

This involves fetching real data, but doing so automatically using, for example, a web scraper.

Once we have programmed a bot to do the work for us, automation is a fast and cheap way to gather data compared to manual entry. However, it relies on the data being publicly accessible and systematically organised.
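As a sketch of the scraping idea, suppose (hypothetically) the data we want is published in simple HTML tables. Python's standard library can pull the cell values out without any third-party dependencies; the page structure here is entirely made up for illustration:

```python
from html.parser import HTMLParser

# Extract the text of every <td> cell from a page of (hypothetical)
# ingredient-density data.
class CellExtractor(HTMLParser):
	def __init__(self):
		super().__init__()
		self.in_cell = False
		self.cells = []

	def handle_starttag(self, tag, attrs):
		if tag == 'td':
			self.in_cell = True

	def handle_endtag(self, tag):
		if tag == 'td':
			self.in_cell = False

	def handle_data(self, data):
		if self.in_cell:
			self.cells.append(data.strip())

parser = CellExtractor()
parser.feed('<table><tr><td>olive oil</td><td>0.91</td></tr></table>')
# parser.cells is now ['olive oil', '0.91']
```

A real scraper would also need to fetch pages, handle pagination and respect the site's terms of use, but the parsing core is this small.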

Manual entry

If we find and enter data ourselves we can ensure high quality, but this approach is costly: it takes far more of our time than the other approaches, so we want to reserve it for where it is really needed.

Crowdsourcing from users

Crowdsourcing data from our users could be a great way to get a lot of good data but there is a chicken-and-egg problem. Without data we have no app and therefore no users. There is also potential for malign activity so we need our crowdsourcing system to prevent false data affecting our app.

What source should we use?

In practice, we’ll want to use several of the above sources. For example, we may find reasonable approximations for some data and be able to automate a large part of the remainder. Then we may need to enter/verify some data manually ourselves, particularly where accuracy is important.

These sources may provide enough data to bootstrap a sufficiently useful app such that people would be incentivised to contribute to improving the data as part of using the app for their own benefit.

What about our ingredient data?

Let’s apply the above thinking to our ingredient data. First of all we need to decide what ingredients we’ll support. This is likely to be a long list so it would be handy to automate it. We have a long list of ingredients in the Food Data Central database that we could use as a starting point and if we select from there we get the link to the nutrition information for free.

There are a couple of issues we need to deal with:

  • duplicates - the FDC database has the same ingredient multiple times
  • naming - FDC is a US database and some of the ingredient names differ from the UK

Once we have our ingredient list we need to find:

  • search keywords
  • density values
  • unit size values

We can probably automate a large part of the keyword generation by simply taking the first part of the ingredient name from FDC. This will need manual checking but it’s much faster to scan down a list checking values and making occasional corrections than to add each one.

It may be reasonable to approximate density and unit size in the first instance and then gradually replace the approximations with real values. As a first pass we could assign ingredients to an approximation group (e.g. water-based, oil-based, powder) and assign some approximate values based on the group (e.g. 1.0, 0.8, 0.5).
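That first pass might look something like this; the groups and values are illustrative placeholders, not real measurements, and the function name is mine:

```python
# Illustrative first-pass density approximation (g/ml) by coarse group.
DENSITY_BY_GROUP = {
	'water-based': 1.0,	# juices, brines, most fruit and veg
	'oil-based': 0.8,	# cooking oils, dressings
	'powder': 0.5,		# flours, ground spices
}

def approx_density(group):
	# default to water-like density for anything not yet classified
	return DENSITY_BY_GROUP.get(group, 1.0)
```

Replacing a group value with a measured per-ingredient value later is then a purely additive change.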

Generating the ingredient list

We’ll start with the SR legacy foods data from FDC. This can be downloaded from here. Inside the zip is a file called food.csv that contains the food description along with its fdc_id, food_category_id and some other bits and bobs we don’t care about.

We can filter out a lot of foods based on their category. We’ll make our program vegan friendly by ignoring categories related to animal products and we’ll also ignore foods that aren’t really ingredients such as ‘Fast Food’. That leaves the following categories:

include_cats = [
	'2',	# Spices and Herbs
	'4',	# Fats and Oils
	'9',	# Fruits and Fruit Juices
	'11',	# Vegetables and Vegetable Products
	'12',	# Nut and Seed Products
	'14',	# Beverages
	'16',	# Legumes and Legume Products
	'20',	# Cereal Grains and Pasta
	'28'	# Alcoholic Beverages
]

Let’s write some Python to load the food.csv file and filter to the categories we are interested in:

import csv

class ingredient:
	def __init__(self, description, fdc_id):
		self.description = description
		self.fdc_id = fdc_id
	@classmethod
	def from_row(cls, r):
		return cls(r['description'], r['fdc_id'])
	def __repr__(self):
		return str((self.description, self.fdc_id))
	def __lt__(self, other):
		return self.description < other.description

with open('food.csv') as f:
	rows = csv.DictReader(f)
	ings = [ingredient.from_row(r) for r in rows if r['food_category_id'] in include_cats]
for ing in sorted(ings):
	print(ing)

This produces output along the lines of...

('Abiyuch, raw', '167782')
('Acerola juice, raw', '171687')
('Acerola, (west indian cherry), raw', '171686')
('Alcoholic Beverage, wine, table, red, Barbera', '173191')

Let’s obliterate any remaining traces of animal products and branded products with another filter, this time on the description itself. We’ll exclude items that contain any of our forbidden words (we can continue to add to this as we discover any unwanted items):

forbidden = [
	'corn dog',
	'new zealand',
	'house foods',
	'the coca-cola company',
]

def is_forbidden(ing):
	# case-insensitive substring match against the forbidden list
	ing_lower = ing.lower()
	return any(f in ing_lower for f in forbidden)

with open('food.csv') as f:
	rows = csv.DictReader(f)
	ings = [ingredient.from_row(r) for r in rows
		if r['food_category_id'] in include_cats
		and not is_forbidden(r['description'])]

Let’s add another member to __init__ to store the keywords and a method to generate them:

import re

class ingredient:
	def __init__(self, description, fdc_id):
		# ...
		self.keywords = self.generate_keywords()

	# ...

	def generate_keywords(self):
		return re.sub(r'(?P<kwd>^[^,]*).*', r'\g<kwd>', self.description).lower()

This sets the keywords to be the lower-cased part of the description before the first comma (or the whole description, if no comma is present). Printing out ing.keywords, we can see some good results but a lot of repeated entries, e.g.

acerola juice
alcoholic beverage
alcoholic beverage
alcoholic beverage
alcoholic beverage
alcoholic beverage

Why the repetition? There are two reasons:

  • different cooking methods for one ingredient (raw, cooked, boiled etc.)
  • different ingredients in a common family

The first case we can consider as one ingredient. In the second case we need separate ingredients. For example, if we look down the list we see beans repeated a large number of times. Printing the ingredient description we can see there are many different bean varieties and not just cooking methods:

Beans, adzuki, mature seed, cooked, boiled, with salt
Beans, adzuki, mature seeds, canned, sweetened
Beans, adzuki, mature seeds, cooked, boiled, without salt
Beans, adzuki, mature seeds, raw
Beans, baked, canned, no salt added
Beans, baked, canned, plain or vegetarian
Beans, baked, canned, with franks
Beans, baked, home prepared
Beans, black turtle, mature seeds, canned
Beans, black turtle, mature seeds, cooked, boiled, with salt
Beans, black turtle, mature seeds, cooked, boiled, without salt
Beans, black turtle, mature seeds, raw

We can fix this by modifying generate_keywords to convert Beans, xxx, to xxx beans,:

	def generate_keywords(self):
		s = self.description
		s = re.sub(r'^(?P<beans>Beans), (?P<bean_type>[^,]*)', r'\g<bean_type> beans', s)
		return re.sub(r'(?P<kwd>^[^,]*).*', r'\g<kwd>', s).lower()

The same logic applies to various other foods and we can generalise this to any food with sub-types:

	def generate_keywords(self):
		s = self.description
		s = re.sub(r'^(?P<type>Beans|Cabbage|Corn|Lettuce|Mushrooms|Peppers|Pickles|Potatoes|Seaweed|Squash|Tomatoes|Vegetables|Vinegar|Oil|Pasta|Rice), (?P<sub_type>[^,]*)', r'\g<sub_type> \g<type>', s)
		return re.sub(r'(?P<kwd>^[^,]*).*', r'\g<kwd>', s).lower()
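To sanity-check the rule, here is the same pair of substitutions as a standalone function (the helper name is mine, not part of the program above):

```python
import re

def subtype_keywords(description):
	# move the sub-type in front of the type, then keep everything
	# before the first comma, lower-cased
	s = re.sub(r'^(?P<type>Beans|Cabbage|Corn|Lettuce|Mushrooms|Peppers|Pickles|Potatoes|Seaweed|Squash|Tomatoes|Vegetables|Vinegar|Oil|Pasta|Rice), (?P<sub_type>[^,]*)', r'\g<sub_type> \g<type>', description)
	return re.sub(r'(?P<kwd>^[^,]*).*', r'\g<kwd>', s).lower()

subtype_keywords('Beans, adzuki, mature seeds, raw')	# 'adzuki beans'
subtype_keywords('Tomatoes, sun-dried')			# 'sun-dried tomatoes'
```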

There are also some foods where we’d like to replace Type, sub-type with just sub-type. For example Spices, basil should be replaced with basil, not basil Spices. We can do that by adding another re.sub:

		s = re.sub(r'^(?P<type>Alcoholic beverages?|Spices|Beverages|Nuts), (?P<sub_type>[^,]*)', r'\g<sub_type>', s)
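Again, a standalone check of what this rule does on its own (helper name is mine):

```python
import re

def strip_type(description):
	# drop the leading category, keeping the sub-type onwards
	return re.sub(r'^(?P<type>Alcoholic beverages?|Spices|Beverages|Nuts), (?P<sub_type>[^,]*)', r'\g<sub_type>', description)

strip_type('Spices, basil, dried')	# 'basil, dried'
```

The later keep-everything-before-the-first-comma rule then reduces 'basil, dried' to just 'basil'.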

In the case of wine, we need to do something different again because there are a lot of redundant sub-categories (e.g. Alcoholic beverage, wine, table, white, Sauvignon Blanc). We’ll add another re.sub:

		s = re.sub(r'^Alcoholic [Bb]everage, wine, table, (?P<colour>[^,]*).*', r'\g<colour> wine', s)

Outputting the data

Now we’ve done the hard part, let’s output the data in csv format so we can load it into a spreadsheet for manual processing.

We want to aggregate ingredients with the same keywords because, when we match the keywords, we want that to identify a single item. Therefore, we’ll need a list of the FDC IDs of the component ingredients. We’ll separate them with a + to distinguish from the , used to separate csv fields:

def component_ids(ings):
	return '+'.join([i.fdc_id for i in ings])

Next it would be useful to have a name for the ingredient. We can use the common part of the names of the component ingredients. Here we use a brute force approach for finding the common part by comparing characters in the same position of each description at an increasing index until a mismatch is found (the special cases of zero/one ingredient are handled separately):

def common_name(ings):
	if len(ings) < 1:
		return ''
	if len(ings) == 1:
		return ings[0].description
	def mismatch(ings, i):
		for ing in ings:
			# treat a shorter description as a mismatch to avoid an IndexError
			yield i >= len(ing.description) or ing.description[i] != ings[0].description[i]
	for i in range(len(ings[0].description)):
		if any(mismatch(ings, i)):
			return ings[0].description[:i].rstrip(' ,')
	# no mismatch found: the first description is a prefix of all the others
	return ings[0].description.rstrip(' ,')
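As an aside, the standard library can do the character-level prefix comparison for us: os.path.commonprefix works on arbitrary strings despite its name, leaving only the trailing punctuation to strip:

```python
import os.path

names = [
	'Beans, black turtle, mature seeds, canned',
	'Beans, black turtle, mature seeds, raw',
]
os.path.commonprefix(names).rstrip(' ,')	# 'Beans, black turtle, mature seeds'
```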

Armed with these two functions we can print our csv data. First let’s group together ingredients by their keywords in a map:

kwd_to_ing = {}
for ing in sorted(ings):
	if ing.keywords not in kwd_to_ing:
		kwd_to_ing[ing.keywords] = []
	kwd_to_ing[ing.keywords].append(ing)

Now we’ll print out the data, one csv row per keyword:

for k,v in kwd_to_ing.items():
	print(','.join([k, component_ids(v), '1.0', '100.0', '"' + common_name(v) + '"']))

That gives us output that looks like this:

abiyuch,167782,1.0,100.0,"Abiyuch, raw"
acerola juice,171687,1.0,100.0,"Acerola juice, raw"
acerola,171686,1.0,100.0,"Acerola, (west indian cherry), raw"
red wine,173191+173208+174834+174833+174840+173194+171872+174838+171925+174109+173193+174835+174839+174836+173192,1.0,100.0,"Alcoholic Beverage, wine, table, red"

Note that we have employed the wild approximations that all ingredients have a density of 1.0 and a unit size of 100.0g. We’ll come back to those later but they’ll do for now to enable us to do our calculations.
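To see what those placeholders imply, a volume quantity converts to grams via the density and a unit quantity via the unit size. This is just a sketch of the arithmetic; the conversion factor is approximate and the function names are mine:

```python
ML_PER_TBSP = 15.0	# approximate metric tablespoon

def grams_from_volume(ml, density_g_per_ml):
	# volume-specified quantities: grams = millilitres x density
	return ml * density_g_per_ml

def grams_from_units(n_units, unit_size_g):
	# unit-specified quantities: grams = count x size of one medium unit
	return n_units * unit_size_g

grams_from_volume(2 * ML_PER_TBSP, 1.0)	# 30.0 g with the 1.0 placeholder
grams_from_units(3, 100.0)		# 300.0 g with the 100.0 g placeholder
```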

The complete source for this example is available here:

Next time

We’ll use our (massaged) ingredient data to calculate the nutritional content of a recipe.