Last time we made our web app more robust by testing edge cases and fixing bugs. This time we'll refine the ingredient data to improve performance and accuracy.

There are a couple of issues with our ingredient database that can reduce the quality and performance of the classifier.

  • Some ingredients reference a lot of component foods, all of which must be fetched from FDC to calculate average nutrient densities.
  • Some ingredients amalgamate component foods that should belong to separate ingredients.

The first issue degrades performance unnecessarily, and the second reduces the accuracy of the results.

In fact, these are often two sides of the same coin: ingredients that reference a lot of component foods are typically the ones that should be split into more specific ingredients.

Examining the data

How can we resolve this? Firstly, we need to see what we're dealing with. Let's filter the database to remove ingredients with a single component food and look up the names of the component foods for those that remain. We can then scan through the data and get a feel for what the next step should be. Let's churn out a bit of Python:

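import json
import csv

# Map each FDC id to its description, read from the FDC food.csv file.
food_names = {}
with open('food.csv', newline='') as f:
	for row in csv.DictReader(f):
		food_names[row['fdc_id']] = row['description']

class ing:
	def __init__(self, key_phrases, component_ids, unit_mass, density):
		self.key_phrases = key_phrases
		self.component_ids = component_ids
		self.unit_mass = unit_mass
		self.density = density

def main(ings):
	# Keep only the ingredients that reference more than one component food.
	multis = [(n,i) for n,i in ings.items() if len(i.component_ids) > 1]

	for n,i in multis:
		print('* ' + n)
		for c in i.component_ids:
			# Fall back to a placeholder if an id is missing from food.csv.
			print('  - ' + c + ': ' + food_names.get(c, '(unknown)'))
		print()
	# Count of multi-component ingredients out of the total.
	print(str(len(multis)) + ' / ' + str(len(ings)))

# Maps each ingredient name to an ing instance (contents elided).
ing_map = { ... }

main(ing_map)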

Looking at the results we can see a few patterns:

Sub-types

Some ingredients contain several sub-types. Take a look at white wine, for example:

* white-wine
  - 174837: Alcoholic beverage, wine, table, white
  - 174110: Alcoholic beverage, wine, table, white, Chardonnay
  - 173195: Alcoholic beverage, wine, table, white, Chenin Blanc
  - 173196: Alcoholic beverage, wine, table, white, Fume Blanc
  - 174843: Alcoholic beverage, wine, table, white, Gewurztraminer
...

In this excerpt we can see, in addition to the generic white wine on the first line, numerous sub-types:

  • chardonnay
  • chenin blanc
  • fume blanc
  • gewurztraminer

These can be moved to separate, more specific ingredients. That gives more accurate results when a specific sub-type is named, and it reduces the calculation cost for the generic white wine because there will only be one component to retrieve.
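As a rough illustration, a split like this could be drafted automatically before checking it by hand. The sketch below assumes the shortest description is the generic food and that each sub-type's description extends it with ", <sub-type>", which holds for the white wine excerpt but is not guaranteed elsewhere. The function name split_sub_types and the naming of the new ingredient keys are hypothetical.

```python
def split_sub_types(key, components):
    """Split components into a generic ingredient plus one ingredient per sub-type.

    components maps FDC id -> description. Assumption: the shortest
    description is the generic food, and a sub-type's description is the
    generic description followed by ', <sub-type>'.
    """
    generic_id = min(components, key=lambda c: len(components[c]))
    prefix = components[generic_id] + ', '
    result = {key: [generic_id]}
    for fdc_id, description in components.items():
        if fdc_id == generic_id:
            continue
        if description.startswith(prefix):
            # Hypothetical naming: lower-case the sub-type and hyphenate it.
            sub_type = description[len(prefix):].lower().replace(' ', '-')
            result.setdefault(sub_type, []).append(fdc_id)
        else:
            # Not recognisably a sub-type: leave it with the generic ingredient.
            result[key].append(fdc_id)
    return result

white_wine = {
    '174837': 'Alcoholic beverage, wine, table, white',
    '174110': 'Alcoholic beverage, wine, table, white, Chardonnay',
    '173195': 'Alcoholic beverage, wine, table, white, Chenin Blanc',
}
print(split_sub_types('white-wine', white_wine))
# {'white-wine': ['174837'], 'chardonnay': ['174110'], 'chenin-blanc': ['173195']}
```

The generic white wine ends up with a single component, which is the performance gain described above.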

Cooking modifiers

Some components vary by cooking modifiers. Consider the humble yam:

* yam
  - 170551: Yam, cooked, boiled, drained, or baked, with salt
  - 170072: Yam, cooked, boiled, drained, or baked, without salt
  - 170071: Yam, raw

Here we have two possibilities:

  • boiled
  • raw

More generally, we may find other cooking modifiers such as:

  • fried
  • roasted
  • grilled

It may be useful to discriminate between these, so we'll split our ingredient. The generic ingredient should cover the raw or uncooked food and is used when no cooking modifier is given. Other versions carry the modifier after the name. In the case of the yam we'd have:

  • yam
  • yam boiled
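To make the naming rule concrete, here is a minimal sketch that groups component foods by the first cooking modifier found in their description. The modifier list and its priority order are assumptions taken from the examples above, and the names variant_name and group_by_variant are hypothetical.

```python
import re

# Assumed modifier list; the first match in this order wins.
COOKING_MODIFIERS = ['boiled', 'fried', 'roasted', 'grilled']

def variant_name(base, description):
    """Return base plus the first cooking modifier present, else base alone."""
    words = set(re.findall(r'[a-z]+', description.lower()))
    for modifier in COOKING_MODIFIERS:
        if modifier in words:
            return base + ' ' + modifier
    return base

def group_by_variant(base, components):
    """Group FDC ids (id -> description) under their variant names."""
    groups = {}
    for fdc_id, description in components.items():
        groups.setdefault(variant_name(base, description), []).append(fdc_id)
    return groups

yam = {
    '170551': 'Yam, cooked, boiled, drained, or baked, with salt',
    '170072': 'Yam, cooked, boiled, drained, or baked, without salt',
    '170071': 'Yam, raw',
}
print(group_by_variant('yam', yam))
# {'yam boiled': ['170551', '170072'], 'yam': ['170071']}
```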

Removables

There are certain components that we can simply discard such as:

  • branded products
  • added salt, sugar, ascorbic acid etc.
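A filter along these lines could flag removable components for review. It is only a sketch: the phrase pattern is an assumption based on descriptions like the yam's "with salt", and the check for branded products assumes each food record carries a data_type value of 'branded_food', which should be verified against the FDC data actually in use. The name is_removable is hypothetical.

```python
import re

# Assumed phrasing for additions; extend as other cases turn up in the data.
ADDED = re.compile(r'\bwith (added )?(salt|sugar|ascorbic acid)\b', re.IGNORECASE)

def is_removable(description, data_type=''):
    """Return True for branded products or foods with listed additions."""
    # Assumption: branded foods are marked with data_type 'branded_food'.
    return data_type == 'branded_food' or bool(ADDED.search(description))

print(is_removable('Yam, cooked, boiled, drained, or baked, with salt'))     # True
print(is_removable('Yam, cooked, boiled, drained, or baked, without salt'))  # False
```

Note that "without salt" is not matched, because the pattern requires the whole word "with" followed by a space.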

Canned

Some foods are available in canned versions such as the pineapple:

* pineapple
  - 169945: Pineapple, canned, extra heavy syrup pack, solids and liquids
  - 169944: Pineapple, canned, heavy syrup pack, solids and liquids
  - 167767: Pineapple, canned, juice pack, drained
  - 169126: Pineapple, canned, juice pack, solids and liquids
  - 169127: Pineapple, canned, light syrup pack, solids and liquids
  - 169125: Pineapple, canned, water pack, solids and liquids
  - 169946: Pineapple, frozen, chunks, sweetened
  - 169124: Pineapple, raw, all varieties
  - 168194: Pineapple, raw, extra sweet variety
  - 168193: Pineapple, raw, traditional varieties

In this case we'll create another ingredient for the canned version with the word "can" at the end, so for pineapple we'd have:

  • pineapple
  • pineapple can

Frozen

Some foods have frozen alternatives such as the raspberry:

* raspberry
  - 167756: Raspberries, canned, red, heavy syrup pack, solids and liquids
  - 167757: Raspberries, frozen, red, sweetened
  - 168209: Raspberries, frozen, red, unsweetened
  - 167808: Raspberries, puree, seedless
  - 167809: Raspberries, puree, with seeds
  - 167755: Raspberries, raw

In this case we can split the frozen variant into a new ingredient, yielding:

  • raspberry
  • raspberry frozen

With/Without skin

An example of skin/no skin is the apple:

* apple
  - 171688: Apples, raw, with skin (Includes foods for USDA's Food Distribution Program)
  - 171689: Apples, raw, without skin
  - 173928: Apples, raw, without skin, cooked, boiled

The default should be the with-skin variant, and the without-skin variant should be signified using the keyword skinned:

  • apple
  • apple skinned
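The canned, frozen and skin conventions all follow the same shape: a phrase in the FDC description maps to a keyword appended to the ingredient name. One way to capture that is a small rule table, sketched below. The table contents mirror the keywords chosen in this post, while the name ingredient_key and the first-match-wins behaviour are assumptions.

```python
# (phrase in the FDC description, keyword appended to the ingredient name)
SUFFIX_RULES = [
    ('canned', 'can'),
    ('frozen', 'frozen'),
    ('without skin', 'skinned'),
]

def ingredient_key(base, description):
    """Return the ingredient name for a component, using the first matching rule."""
    text = description.lower()
    for phrase, keyword in SUFFIX_RULES:
        if phrase in text:
            return base + ' ' + keyword
    return base

print(ingredient_key('pineapple', 'Pineapple, canned, juice pack, drained'))  # pineapple can
print(ingredient_key('raspberry', 'Raspberries, frozen, red, unsweetened'))   # raspberry frozen
print(ingredient_key('apple', 'Apples, raw, without skin'))                   # apple skinned
print(ingredient_key('apple', 'Apples, raw, with skin'))                      # apple
```

A description that matches more than one rule only gets the first keyword, so combinations would still need attention by hand.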

Drained vs. Solids and Liquids

For some canned items there are variants that include the liquid in which the food is packed. Where possible, let's just take the drained variant in such cases, as it is unlikely that the liquid will be specified or required.
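A sketch of that preference, assuming the descriptions use the words "drained" and "solids and liquids" as they do in the pineapple listing (the function name prefer_drained is hypothetical):

```python
def prefer_drained(components):
    """Keep only drained variants if any exist; otherwise keep everything.

    components maps FDC id -> description.
    """
    drained = {
        fdc_id: description
        for fdc_id, description in components.items()
        if 'drained' in description.lower()
        and 'solids and liquids' not in description.lower()
    }
    # "Where possible": fall back to the full set when nothing is drained.
    return drained or components

canned_pineapple = {
    '167767': 'Pineapple, canned, juice pack, drained',
    '169126': 'Pineapple, canned, juice pack, solids and liquids',
    '169125': 'Pineapple, canned, water pack, solids and liquids',
}
print(prefer_drained(canned_pineapple))
# {'167767': 'Pineapple, canned, juice pack, drained'}
```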

Mature vs. Immature seeds

Let's assume that the mature seed is the default choice as it tends to have greater nutritional value than the immature seed. The immature seed should be identified with the keyword immature.

Alcohol content

Different proofs are available for some alcoholic beverages such as vodka:

* vodka
  - 173664: Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 100 proof
  - 174815: Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 80 proof
  - 171919: Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 86 proof
  - 171920: Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 90 proof
  - 173663: Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 94 proof

We shall simply select the middle value (in this case 90 proof), as it is unlikely that users will want to specify this.
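Picking the middle value can be sketched as follows, assuming the proof always appears in the description as "<number> proof". The function name middle_proof is hypothetical, and with an even number of components this version takes the upper of the two middle values.

```python
import re

def middle_proof(components):
    """Return the FDC id whose proof is the middle of the sorted proofs.

    components maps FDC id -> description. Returns None if no description
    mentions a proof.
    """
    proofs = []
    for fdc_id, description in components.items():
        match = re.search(r'(\d+) proof', description)
        if match:
            proofs.append((int(match.group(1)), fdc_id))
    if not proofs:
        return None
    proofs.sort()
    return proofs[len(proofs) // 2][1]

vodka = {
    '173664': 'Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 100 proof',
    '174815': 'Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 80 proof',
    '171919': 'Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 86 proof',
    '171920': 'Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 90 proof',
    '173663': 'Alcoholic beverage, distilled, all (gin, rum, vodka, whiskey) 94 proof',
}
print(middle_proof(vodka))  # 171920 (90 proof)
```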

Enriched/unenriched

Some grains may be enriched with vitamins or minerals. We shall take the unenriched component by default and provide the enriched component with the keyword enriched.

Manual fixing

In this instance, because there are several sub-types that can interact and the total number of ingredients affected is fairly small (284), it makes sense to work through the data manually. This is also a good opportunity to spot other issues that may be present.

Next time

We'll try to persuade people to try our app!