Last time we implemented a lazy matcher, which gave us enough tools to separate the ingredient name from the amount (at least for our test dataset). This is great because now we can do things like convert the amounts to standard units such as grams or millilitres.
However, the ingredient name is not quite as easy to work with because the same ingredient can have multiple representations, which makes it awkward to do aggregation or other statistical operations. What we need is a canonical form.
This is a single definitive representation for a given item. Let’s consider why our text is not canonical to begin with. There are several possible causes:
- inconsistent casing or punctuation
- synonyms or local differences
- superfluous information
- varying specificity
To produce a canonical form we need to find ways to address all of these issues.
If we can reduce ingredient names to a canonical form then we gain the ability to aggregate ingredients, because we know that matching names refer to the same thing. We could store nutrition information indexed by the canonical name and aggregate it to produce reports of nutritional values for a whole recipe, or a collection of recipes.
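As a rough illustration of why this matters, here is a minimal sketch of aggregation keyed on a canonical name. The function name `total_by_ingredient` and the input amounts are made up for the example, and it assumes the amounts have already been converted to grams.

```python
from collections import defaultdict


def total_by_ingredient(items):
    """Sum amounts (in grams) for items that share a canonical name."""
    totals = defaultdict(float)
    for canonical_name, grams in items:
        totals[canonical_name] += grams
    return dict(totals)


# Illustrative input only: (canonical name, amount in grams).
items = [("carrot", 100.0), ("courgette", 250.0), ("carrot", 50.0)]
print(total_by_ingredient(items))
```

The two carrot entries only merge because their names are identical, which is exactly what a canonical form is meant to guarantee.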
Let’s look at each of the causes of name divergence listed above and think about how we might solve them.
Inconsistent casing or punctuation
This is probably the easiest issue to solve. We can apply a consistent casing scheme, which could be as simple as converting everything to lowercase.
Equally, we can resolve differences in punctuation by simply removing all the punctuation. The only question is when to do this, as punctuation could prove useful for dealing with some of the other issues.
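A minimal sketch of this step in plain Python might look like the following. The function name is hypothetical and this is not part of PyFathom. It also makes one assumption beyond the text: punctuation is replaced with a space rather than deleted outright, so that adjacent words do not run together.

```python
import string


def normalise_case_and_punctuation(name: str) -> str:
    """Lowercase the name and replace ASCII punctuation with spaces."""
    # Map every ASCII punctuation character to a space so that
    # "carrot,raw" does not collapse into "carrotraw".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = name.lower().translate(table)
    # Collapse the runs of whitespace left behind by the replaced punctuation.
    return " ".join(cleaned.split())


print(normalise_case_and_punctuation("Carrot, Raw (sliced)"))
```

Note that this only covers ASCII punctuation, and it throws away the commas and parentheses that later steps might want to use, which is why the timing of this step matters.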
Synonyms and local differences
We can have more than one word for the same item, usually due to different words used in different localities. For example: courgette (UK) vs. zucchini (US). We can resolve this problem fairly easily by choosing a locality and mapping all words into that locality using a dictionary. The only laborious part is ensuring the dictionary is complete.
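The dictionary approach can be sketched as below. The mapping `US_TO_UK` and the function `localise` are hypothetical names, the choice of the UK as the target locality is arbitrary, and the dictionary is deliberately tiny.

```python
# Hypothetical mapping from US terms to the chosen (UK) locality.
US_TO_UK = {
    "zucchini": "courgette",
    "eggplant": "aubergine",
    "cilantro": "coriander",
}


def localise(name: str, synonyms: dict[str, str] = US_TO_UK) -> str:
    """Replace each known word with its preferred synonym."""
    # Words that are not in the dictionary pass through unchanged.
    return " ".join(synonyms.get(word, word) for word in name.split())


print(localise("grated zucchini"))
```

Because this sketch looks up one word at a time, it would not handle synonyms that span several words. It also assumes the casing has already been made consistent.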
Superfluous information
We often find that along with the essential information, the scraped text will contain extra information of varying degrees of relevance. In most cases we want to strip this away to leave just the item name, although sometimes we might want to keep some of it (see varying specificity below).
It can be difficult to reliably exclude superfluous information because it is not always clearly separated from the essential text. Our strategy for removing it is likely to depend heavily on our knowledge of the input dataset, but we need to be careful not to prune so aggressively that we hamper the processing of new data. One strategy we could use is to remove content enclosed in parentheses. We’ll come back to this later and see what makes sense in the context of real data.
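The parentheses strategy could be sketched with a regular expression, as below. This is an illustration in plain Python rather than PyFathom, the function name is hypothetical, and it assumes the parentheses are not nested.

```python
import re

# Matches one parenthesised group that contains no further parentheses.
PARENTHESISED = re.compile(r"\([^()]*\)")


def strip_parenthesised(text: str) -> str:
    """Remove parenthesised asides and tidy the remaining whitespace."""
    without_asides = PARENTHESISED.sub(" ", text)
    return " ".join(without_asides.split())


print(strip_parenthesised("plain flour (sifted)"))
print(strip_parenthesised("carrots (about 200g) peeled"))
```

A single pass like this only removes the innermost group when parentheses are nested, so real data may call for something more careful.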
Varying specificity
This is similar to superfluous information, but here we are talking about how specific the ingredient name should be. For example, let’s consider the humble carrot:
- carrot
- carrot, raw
- carrot, raw, sliced
Here we have three carrots of varying specificity. What information is important? The name ‘carrot’ is obviously crucial. The attributes ‘raw’ and ‘sliced’ may be important depending on how we plan to use the data. For example, a nutrition database may have slightly different entries for raw vs. cooked carrot.
We also need to consider the completeness of the input set. If an attribute is not always specified and there is no sensible default to use in its absence then perhaps we have to accept that our canonical form can’t be so precise and throw it away.
If we’re looking to index into a particular database then we may safely drop attributes that do not feature in our database.
Supposing we elect to include some attributes in our canonical form, there is a potential for differences due to inconsistent ordering of the attributes. We can resolve this problem simply by adhering to a fixed ordering scheme for the attributes. That requires us to classify and extract the attributes and then reform the string in a standard format.
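The classify, extract and reform steps might be sketched as follows. The attribute classes, their ordering and the function name are all assumptions made for illustration. The sketch also assumes the name and its attributes have already been separated, and that unclassified attributes are dropped.

```python
# Hypothetical attribute classes, listed in the order they should appear.
ATTRIBUTE_ORDER = ["state", "preparation"]

# Hypothetical classification of the attributes we know about.
ATTRIBUTE_CLASSES = {
    "raw": "state",
    "cooked": "state",
    "sliced": "preparation",
    "grated": "preparation",
}


def canonical_name(name: str, attributes: list[str]) -> str:
    """Rebuild the name with its attributes in a fixed class order."""
    by_class = {}
    for attribute in attributes:
        attribute_class = ATTRIBUTE_CLASSES.get(attribute)
        # Attributes we cannot classify are dropped; if two attributes
        # share a class, the later one wins.
        if attribute_class is not None:
            by_class[attribute_class] = attribute
    ordered = [by_class[c] for c in ATTRIBUTE_ORDER if c in by_class]
    return ", ".join([name] + ordered)


print(canonical_name("carrot", ["sliced", "raw"]))
print(canonical_name("carrot", ["raw", "sliced"]))
```

Both calls produce the same string, so the order in which the attributes appeared in the input no longer matters.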
We’ll think about what additional PyFathom functionality we need to be able to transform our ingredient names to a suitable canonical form.