Last time we considered the causes of multiple representations of an item and how we might resolve them to produce a canonical form. As a quick reminder, the causes were:
- inconsistent casing or punctuation
- synonyms or local differences
- superfluous information
- varying specificity
Today, we’ll take our existing dataset of ingredient quantities and attempt to produce a canonical ingredient name.
Let’s start with superfluous information as it’s probably going to be the hardest one to deal with. What superfluous information is evident in our dataset? Here are some examples:
- '6 – 7 cups of Three different types of vegetables*, chopped into bite-sized pieces'
- '2 1/2 T. Soy Sauce, Tamari or Bragg’s Aminos'
- '15 oz. can Black Beans, drained and rinsed'
- '3 – 4 cups Cooked Rice or Quinoa (heat up the frozen type when in a pinch)'
- '75g grated vegan parmesan (or 15g Nutritional yeast flakes)'
- '2 tbsp extra virgin olive oil, plus a little extra for oiling the squash'
- 'red chilli 1, deseeded and thinly sliced'
- '1 15-ounce can chickpeas (rinsed, drained, and dried)'
- '1 Tbsp vegan parmesan cheese (plus more to taste)'
- '3 tablespoons favorite Chili Powder storebought or homemade'
- '3 cups Beans of choice - Kidney Black, Pinto etc. (soaked and cooked, or canned)'
- 'handful of fresh Cilantro - optional'
There are a few common patterns that we can spot that we could reasonably assume will apply fairly generally:
- processing verbs such as 'sliced', connected by logical operators such as 'or' and possibly adjacent to adverbs such as 'thinly'
- the word 'optional'
- content in parentheses
We should be able to remove this content safely, although if parentheses are used to hold essential content then that content will be lost. For example, consider:
2 tbsp olive oil (and another 3 tbsp)
In this case the quantity in the brackets was important. We relied on the convention that information in brackets is side information rather than essential. That is generally the case, but this example breaks the convention. We would therefore have to add additional knowledge to cater for such cases if they arise.
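To make the risk concrete, here is a minimal Python sketch that removes parenthesised content but also reports any removed part that looks like a quantity. The helper name `strip_parentheses` and the tiny unit list are assumptions made for this illustration; they are not the classifier rules used in this series.

```python
import re

# Hypothetical patterns for illustration only.
PAREN = re.compile(r"\s*\(([^()]*)\)")
QUANTITY = re.compile(r"\d+\s*(?:tbsp|tsp|cups?|g|oz)\b", re.IGNORECASE)

def strip_parentheses(text):
    """Return (cleaned text, removed parts that look like quantities)."""
    contents = PAREN.findall(text)                        # text inside each (...)
    flagged = [c for c in contents if QUANTITY.search(c)]  # possible essential info
    cleaned = PAREN.sub("", text).strip()
    return cleaned, flagged

print(strip_parentheses("2 tbsp olive oil (and another 3 tbsp)"))
print(strip_parentheses("1 Tbsp vegan parmesan cheese (plus more to taste)"))
```

A non-empty second value is a hint that the convention has been broken and the string may need special handling.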
In addition to the above, there are also some trickier cases:
- content following the word 'plus'
- alternatives separated by commas
Content following a plus is most likely not wanted to form a canonical name. It may, however, still contain useful information. For example, consider the following string:
‘2 tbsp olive oil plus 1 tbsp for the sauce’
For canonicalisation we can safely remove everything from the plus onwards. For calculating quantities, we need to include that extra 1 tbsp. We also need to be a bit careful in case our input string includes plus as part of the initial quantity, e.g.
1 tbsp plus 1 tsp olive oil
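One way to handle both cases is sketched below in Python. It assumes that a plus belongs to the initial quantity only when it sits between number-and-unit pairs at the very start of the string; the function name `split_plus_tail` and the short unit list are hypothetical.

```python
import re

# Assumed, deliberately small unit list for the example.
UNIT = r"(?:tbsp|tsp|cups?|g|oz)"
# A leading quantity, optionally extended by "plus <number> <unit>".
LEADING_QUANTITY = re.compile(
    rf"^\s*\d+\s*{UNIT}\b(?:\s+plus\s+\d+\s*{UNIT}\b)*", re.IGNORECASE
)
# Everything from "plus" (and an optional preceding comma) to the end.
PLUS_TAIL = re.compile(r"\s*,?\s*\bplus\b.*$", re.IGNORECASE)

def split_plus_tail(text):
    """Return (text before the trailing plus, the plus clause itself)."""
    m = LEADING_QUANTITY.match(text)
    head_end = m.end() if m else 0
    tail = PLUS_TAIL.search(text, head_end)   # ignore any plus inside the quantity
    if tail is None:
        return text, ""
    return text[:tail.start()], text[tail.start():].strip(" ,")

print(split_plus_tail("2 tbsp olive oil plus 1 tbsp for the sauce"))
print(split_plus_tail("1 tbsp plus 1 tsp olive oil"))
```

The first value feeds canonicalisation, while the second keeps the extra quantity available for later calculation rather than throwing it away.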
Alternatives are also tricky to deal with. We could simply take the first and discard the rest but what if we can’t match the first in our database but we could have matched one of the discarded alternatives? To be safe, we probably want to produce canonical names for all of the alternatives.
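As a rough sketch of that idea in Python, we could split a name on commas and on the word or, then try each candidate in turn. Both function names and the `known_names` set are hypothetical stand-ins for whatever database lookup we end up with.

```python
import re

# Alternatives are separated by a comma or by the word "or".
SEPARATOR = re.compile(r"\s*,\s*|\s+or\s+", re.IGNORECASE)

def split_alternatives(name):
    """Split an ingredient name into its candidate alternatives."""
    return [part.strip() for part in SEPARATOR.split(name) if part.strip()]

def first_known(candidates, known_names):
    """Return the first candidate found in known_names (lower-cased), else None."""
    for candidate in candidates:
        if candidate.lower() in known_names:
            return candidate.lower()
    return None

candidates = split_alternatives("Soy Sauce, Tamari or Bragg's Aminos")
print(candidates)
print(first_known(candidates, {"tamari"}))
```

Note that commas also introduce processing instructions, as in 'Black Beans, drained and rinsed', so a split like this only makes sense once that content has been removed.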
So now we know what information we want to get rid of - how are we going to do it?
Classification to the rescue
We already have some useful classification functionality at our disposal. Let’s see how far we can get with classifying superfluous information and what additional features we might need/like.
First, we’ll try and remove content in parentheses. We can classify this using a simple rule:
/\(/,/\w+/*?,/\)/ is superfluous
We can see in the following example that the content in parentheses has been classified correctly as superfluous:
<amount><number>5½</number><unit>tablespoons</unit></amount><ingredient>tahini</ingredient><superfluous>(<ingredient>you can sub cashew butter</ingredient>)</superfluous>
However, it is still being erroneously classified as ingredient as well. There are a few ways we could go about fixing this:
- implement negative lookahead and use it in the ingredient rule
- use the lazy matcher in the ingredient rule and match superfluous at the end
- remove the
A more serious issue arises where useful information is contained in parentheses. Consider the input string:
Olive oil (2 tbsp)
This is perfectly legitimate but the quantity will be classified as superfluous by our rule. If we’re planning to delete superfluous tokens then we’ll lose essential information.
In truth, it’s not the end of the world because we could extract the quantities before removing the superfluous information and then extract the names. However, it’s not very elegant to deliberately mis-classify all over the place and it would be much nicer to be able to have one cleaned output and extract the required information from that.
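For reference, the two-pass workaround might look something like the following Python sketch. The function name `extract_then_clean` and the small unit list are assumptions for the example, and plain regular expressions stand in for our classifier.

```python
import re

# Hypothetical patterns; a real unit list would be much longer.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(tbsp|tsp|cups?|g|oz)\b", re.IGNORECASE)
PAREN = re.compile(r"\s*\([^()]*\)")

def extract_then_clean(text):
    """Pass 1 collects quantities from the raw string; pass 2 derives the name."""
    quantities = QUANTITY.findall(text)        # includes quantities inside parentheses
    without_parens = PAREN.sub("", text)       # now it is safe to drop them
    name = QUANTITY.sub("", without_parens).strip()
    return quantities, name

print(extract_then_clean("Olive oil (2 tbsp)"))
print(extract_then_clean("2 tbsp olive oil (and another 3 tbsp)"))
```

It works, but it also shows the inelegance: the same string is processed twice, and neither pass yields a single cleaned output.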
Before we decide how to proceed let’s pause and take a look at one of the other instances of superfluous information - the processing verbs. We can identify these with a rule like this:
/\w*[^e]ed/ is superfluous
This will classify all words ending ed (but not eed) as being superfluous. For example:
We can extend this to classify adverbs ending in ly as superfluous too:
/\w*([^e]ed|ly)/ is superfluous
Finally, let’s add a rule to classify those other pesky words like 'and', 'or' and 'optional':
/and|or|optional/ is superfluous
We’ll surely come across more to add to that list but that’ll do for now.
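To get a feel for how these rules behave, here is a small Python approximation. It assumes each rule must match a whole token, which is a guess about our rule syntax, and the function names are made up for the example.

```python
import re

# The same patterns as the rules above, applied to whole tokens.
SUPERFLUOUS = [
    re.compile(r"\w*([^e]ed|ly)"),
    re.compile(r"and|or|optional"),
]

def is_superfluous(token):
    """True if any rule matches the entire token."""
    return any(pattern.fullmatch(token) for pattern in SUPERFLUOUS)

def strip_superfluous(text):
    """Tokenise on word characters and drop superfluous tokens."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if not is_superfluous(t.lower())]

print(strip_superfluous("15 oz. can Black Beans, drained and rinsed"))
print(strip_superfluous("red chilli 1, deseeded and thinly sliced"))
```

The second call drops 'red' along with the processing words, because 'red' also ends in ed. A purely spelling-based rule will produce false positives like this, which is a hint that we will need more than regular expressions.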
We’ve made some progress here with the tools at our disposal. Still, we have some way to go to achieve a canonical name.
We’ll continue on our mission to canonicalise, looking in particular at the features we need to improve our classifications.