Last time we tested our classifier with 100 randomly picked ingredient quantity strings and discovered eight additional pieces of knowledge were required for accurate classification. A few avid readers correctly observed that I missed a couple of issues last time, so in fact we needed ten new pieces of knowledge.

Firstly, a hyphen can appear between the number and the unit in an amount, so the amount rule should be: range|number,/\-/?,unit?,/\./?,/of/? is amount, yielding:

"1 15-ounce can chickpeas (rinsed, drained, and dried)"
=> "can chickpeas rinsed drained and dried"
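To make the effect of the optional hyphen concrete, here is a minimal sketch that approximates the amount rule with an ordinary Python regular expression. It is not our rule engine: the names NUMBER, UNIT, AMOUNT and strip_amounts are made up for illustration, and the unit list is deliberately tiny.

```python
import re

# Hypothetical approximation of the amount rule using Python's re module.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:ounces?|oz|tablespoons?|teaspoons?|cups?)"  # small illustrative subset

# number, optional hyphen, optional unit, optional full stop, optional "of"
AMOUNT = re.compile(rf"{NUMBER}\s*-?\s*(?:{UNIT}\b)?\.?\s*(?:of\b)?", re.IGNORECASE)


def strip_amounts(text: str) -> str:
    """Remove every amount, then keep only the word-like tokens."""
    without_amounts = AMOUNT.sub(" ", text)
    words = re.findall(r"[a-zA-Z\-]+", without_amounts)
    return " ".join(words)


print(strip_amounts("1 15-ounce can chickpeas (rinsed, drained, and dried)"))
# can chickpeas rinsed drained and dried
```

Without the `-?` in AMOUNT, the match would stop after "15" and leave "-ounce" behind in the ingredient text.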

Secondly, the word "a" (meaning ‘one of’) can appear in place of a number, preceding a unit, so the rule should be: range|number|number-word,/\-/?,unit?,/\./?,/of/? is amount and we also need /a/ is number-word. This yields:

"dried chilli flakes a pinch"
=> "dried chilli flakes a pinch"

But we expected to see "dried chilli flakes" after that change, so what’s going on? We’ve come unstuck again because the ingredient pattern matches greedily, all the way to the end of the string, leaving nothing for the trailing amount. We need a lazy matcher to resolve this; we can’t simply exclude "a" from ingredient text as we did with numbers, because there are plenty of legitimate cases where it appears. We’ll come back to the lazy matcher next time, as it’s off-topic for this post.
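In the meantime, the difference between greedy and lazy matching is easy to see with ordinary regular expressions. The sketch below uses Python's re module rather than our rule engine, and TRAILING_AMOUNT is a hypothetical stand-in that only recognises "a pinch"; it is just enough to show the effect.

```python
import re

# Hypothetical stand-in for a trailing amount: only recognises "a pinch".
TRAILING_AMOUNT = r"(?:\s+a\s+pinch)?"
WORD = r"[a-zA-Z\-]+"

# Greedy: the word repetition (*) grabs as many words as it can.
greedy = re.compile(rf"^(?P<ingredient>{WORD}(?:\s+{WORD})*){TRAILING_AMOUNT}$")
# Lazy: the word repetition (*?) grabs as few words as it can.
lazy = re.compile(rf"^(?P<ingredient>{WORD}(?:\s+{WORD})*?){TRAILING_AMOUNT}$")

text = "dried chilli flakes a pinch"
print(greedy.match(text).group("ingredient"))  # dried chilli flakes a pinch
print(lazy.match(text).group("ingredient"))    # dried chilli flakes
```

The lazy version still matches an ingredient with no trailing amount in full, because the end-of-string anchor forces it to keep extending until the whole input is consumed.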

Batch 2

Let’s classify the next 100 ingredient quantities. The scraped input strings are:

in_strs = [
	'180g | 1 cup uncooked brown rice',
	'½ small butternut squash , cubed',
	'2 small sweet potatoes , cubed',
	'1 tablespoons olive oil',
	'6 mushrooms',
	'4 handfuls of raw spinach',
	'60g | 1/2 cup edamame beans',
	'2 green onions , chopped',
	'1 teaspoon sesame oil',
	'1 tablespoon Tamari , or soy sauce',
	'1 teaspoon maple syrup',
	'salt and pepper , for seasoning',
	'sesame seeds , for sprinkling',
	'2 squares of silver foil',
	'2 roasted bulbs of garlic (see instructions)',
	'½ lemon juice only',
	'5½ tablespoons tahini (you can sub cashew butter)',
	'leftover marinade from the mushrooms',
	'80mls | 1/3 cup water',
	'1 tablespoon maple syrup',
	'1 tablespoon Tamari or soy sauce',
	'pasta of your choice',
	'1 small butternut squash',
	'1 yellow onion',
	'2 cloves of garlic',
	'1 tsp fresh chopped sage',
	'1 tsp fresh rosemary',
	'1 tsp herbes de provence',
	'½ tsp red pepper flakes',
	'2 cups of vegetable stock or water',
	'3-4 tbsp coconut milk (optional)',
	'1 lime, juice',
	'1 large handful of pecans',
	'salt, pepper',
	'2 tbsp olive oil divided',
	'2 lb mushroom caps we prefer baby Portobellos, stems off and sliced thick',
	'6 large carrots peeled and sliced into 1 inch 2.5 cm circles',
	'1 large yellow onion peeled and diced',
	'1 large shallot peeled and sliced thin',
	'2 garlic cloves peeled and minced',
	'2 cups vegetable broth',
	'1 ½ cups red wine',
	'1 tbsp tomato paste',
	'2 tsp ground sea salt or to taste',
	'2 tbsp fresh thyme leaves plus extra for garnishing',
	'2 tsp dried Italian seasoning',
	'Black pepper to taste',
	'1 tbsp + 1 tsp all-purpose flour use cornstarch to make it gluten free',
	'1/3 cup water',
	'1 1 lb package of fettucine',
	'1 tbsp (15 ml) extra-virgin olive oil',
	'1 medium red onion, finely diced',
	'3 garlic cloves, minced',
	'26 oz (794 g) crushed tomatoes ( I use Pomi brand)',
	'1/3 cup (80 ml) red wine (or 3 tbsp (45 ml) balsamic vinegar)',
	'1 tbsp (2 g) dried Italian seasoning',
	'3 tsps (15 g) ground sea salt, or to taste',
	'2 tsp (1 g) red pepper flakes (optional)',
	'Ground black pepper to taste',
	'½ cup (20 g) fresh basil leaves, chopped',
	'¼ cup (10 g) flat leaf parsley leaves, chopped',
	'15 oz (425 g) black beans, drained (reserve ¼ cup (60 ml) of the juice) and rinsed well',
	'1 tbsp (15 ml) plus 1 tsp extra-virgin olive oil',
	'1 large Portobello mushroom cap – gills removed and sliced thin',
	'1 shallot, peeled and sliced thin',
	'2 garlic cloves, minced',
	'¼ cup (10 g) flat leaf parsley leaves',
	'1 tbsp (2 g) dried Italian seasoning',
	'½ cup (61 g) breadcrumbs (use gluten free if desired)',
	'½ cup (86 g) cornmeal',
	'2 tsp (6 g) tapioca starch',
	'1 tsp ground sea salt',
	'Black pepper to taste',
	'1 (1 lb 454 g) box of spaghetti',
	'2 tbsp (30 g) sea salt',
	'3 pounds zucchini (2 to 3 inches in diameter - for making the "zoodles")',
	'1 head of cauliflower (broken into large florets)',
	'2 carrots (peeled)',
	'8 ounces crimini mushrooms (cleaned and stems trimmed)',
	'1 medium yellow onion (halved or quartered)',
	'3 cloves garlic (peeled)',
	'1 cup walnuts',
	'2 28 ounce cans crushed tomatoes (I love the Muir Glen brand)',
	'1/4 cup sundried tomatoes',
	'2 tablespoon nutritional yeast (optional (adds a savory quality))',
	'1 teaspoon salt',
	'1 teaspoon dried oregano',
	'1 teaspoon dried basil',
	'1 teaspoon maple syrup (to taste)',
	'1 large sweet potato, peeled and cubed (about 2 cups)',
	'1 tablespoon olive oil',
	'¼ yellow onion, diced (about ½ cup)',
	'2 cloves garlic, minced',
	'1 teaspoon garam masala',
	'1 teaspoon curry powder',
	'¼ teaspoon cumin',
	'⅛ teaspoon red pepper/cayenne',
	'½ teaspoon sea salt',
	'1 15 ounce can diced tomatoes (low sodium if available)',
	'1 15 ounce can garbanzo beans (drained & rinsed)',
	'1 14 ounce can light coconut milk'
]

This time our classifier is more successful: it mis-classifies only a couple of input strings, and all for the same reason - we don’t handle plurals of abbreviated units (e.g. mls). So this iteration requires only one new piece of knowledge. Let’s update our unit rules to handle plural abbreviations:

units = '''
/pinch/ is unit
/mls?|mL|cc|millilitres?|milliliters?/ is unit
/tsps?|t|teaspoons?/ is unit
/tbsps?|Tbsps?|T|tbl|tbs|tablespoons?/ is unit
/floz/ is unit
/fl/,/oz/ is unit
/fluid/,/ounces?/ is unit
/p|pts?|pints?/ is unit
/ls?|L|litres?|liters?/ is unit
/gals?|gallons?/ is unit
/dls?|dL|decilitre|deciliter/ is unit
/gs?|grams?|grammes?/ is unit
/oz|ounces?/ is unit
/lbs?|#|pounds?/ is unit
/kgs?|kilos?|kilograms?/ is unit
'''

Wait a minute, you say, one piece of information? But you’ve added s? to a bunch of rules! Yes, but the one piece of knowledge is that abbreviated units can have plurals - that is independent of the representation, which happens to require us to add s? in several places. In fact, if we’d chosen a different representation - grouping all the abbreviated units in the same rule - then we could have encoded this new knowledge with a single s? at the end of the pattern.
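As a rough illustration of that alternative representation, here is a sketch in plain Python regular expressions rather than our rule syntax. The names ABBREVIATIONS, ABBREVIATED_UNIT and is_abbreviated_unit are hypothetical, and the list is a lower-case subset of the abbreviations in the rules above.

```python
import re

# Hypothetical grouping of abbreviated units (subset of the rules above).
ABBREVIATIONS = ["ml", "tsp", "tbsp", "pt", "l", "gal", "dl", "g", "lb", "kg"]
ALTERNATION = "|".join(ABBREVIATIONS)

# A single trailing "s?" encodes "abbreviated units can be plural" for the whole group.
ABBREVIATED_UNIT = re.compile(rf"(?:{ALTERNATION})s?")


def is_abbreviated_unit(token: str) -> bool:
    """True if the whole token is an abbreviated unit, singular or plural."""
    return ABBREVIATED_UNIT.fullmatch(token) is not None


for token in ["mls", "tsps", "kg", "lbs", "cups"]:
    print(token, is_abbreviated_unit(token))
```

The trade-off is that a grouped rule treats every abbreviation the same way, so an exception (an abbreviation that never takes a plural, say) would need to be pulled back out into its own rule.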

Batch 3

Let’s try another 100 ingredients:

in_strs = [
	'2 tablespoons raisins (optional)',
	'2 handfuls torn kale leaves',
	'3 cups prepared red quinoa or grain of your choice',
	'2 tablespoons cilantro, roughly chopped',
	'salt and pepper',
	'4 large portabella mushrooms',
	'2 T maple syrup',
	'2 T low sodium tamari/ liquid aminos',
	'1 T sesame oil',
	'2 cloves garlic, minced',
	'Lemon pepper',
	'Lime juice (to serve)',
	'Green onions',
	'Toasted sesame seeds',
	'Greens (I used a supergreens mix of baby spinach + mizuna)',
	'Avocado basil sauce (recipe below)',
	'Caramelized kimchi (recipe below)',
	'1 small head cauliflower, florets removed',
	'1 tablespoon olive oil',
	'6 ounces buckwheat soba noodles',
	'1/3 cup fresh cilantro leaves, chopped',
	'2 tablespoon toasted hemp or sesame seeds',
	'½ lime, juiced (optional)',
	'1 tablespoon plus ½ teaspoon freshly grated ginger',
	'2 tablespoons plus 1 teaspoon low-sodium soy sauce or tamari',
	'1 tablespoon dark sesame oil',
	'1 tablespoon unseasoned rice vinegar',
	'1 teaspoon honey (Vegans can sub brown sugar or agave.)',
	'½-1 teaspoon crushed red pepper flakes (depending on how much heat you like)',
	'¼ cup thinly sliced scallions, white and light green parts only (about 4 scallions)',
	'1 tbsp oil',
	'1 tsp cumin seeds',
	'1/2 yellow onion, finely chopped',
	'1/2 jalapeño chile, minced',
	'1 package of chicken style strips or pieces – for example Fry’s or Quorn, 1inch tofu strips, pre grilled, pan or deep fried, texturized vegetable protein)',
	'1 cilantro bunch',
	'1 tsp salt',
	'Fresh cracked pepper',
	'1 tsp oil',
	'1 Poblano chile, minced',
	'1/2 jalapeño chile, minced',
	'1 yellow banana pepper, chopped',
	'3 garlic cloves, peeled',
	'8 tomatillos',
	'1/4 cup water',
	'1/2 yellow or white onion, quartered',
	'1 tsp salt',
	'8-12 corn tortillas',
	'2 cups cabbage, finely shredded',
	'2-4 limes, quartered',
	'1/4 cup cilantro, chopped',
	'120g gluten-free wild rice',
	'350ml water',
	'280g tofu (1 block), medium to firm',
	'¼ teaspoon turmeric',
	'1 teaspoon coconut oil',
	'1 small onion, long thin slices',
	'½ clove garlic',
	'¼ teaspoon himalayan salt',
	'150g red and yellow peppers, chopped',
	'80g broccoli, chopped',
	'3 tablespoons soy sauce (make sure it is a gluten-free kind)',
	'Black pepper',
	'1 1/2 cups cooked chickpeas',
	'2 teaspoons safflower or other neutral oil',
	'1/4 teaspoon cayenne',
	'1/4 teaspoon ground cinnamon',
	'1/2 teaspoon Garam Masala',
	'1/4 teaspoon salt',
	'3/4 cup chopped red onion',
	'1 (1-inch) knob of ginger',
	'3 cloves garlic',
	'2 tablespoons water',
	'1 teaspoon safflower or other neutral oil',
	'1/4 teaspoon cumin seeds',
	'2 bay leaves',
	'4 cloves',
	'1 1/4 cups canned or culinary coconut milk',
	'3/4 cup ripe mango pulp or puree (unsweetened or lightly sweetened canned)',
	'1/2 teaspoon salt',
	'2 teaspoons apple cider vinegar',
	'Generous dash of black pepper',
	'1/4 teaspoon Garam Masala, for garnish',
	'2 tablespoons chopped cilantro, for garnish',
	'1 cup unroasted cashews, soaked in hot water for at least an hour',
	'3½ cups water',
	'2 cloves garlic',
	'¼ cup nutritional yeast',
	'1½ Tablespoons white miso',
	'1 teaspoon lemon juice',
	'1 teaspoon sea salt',
	'black pepper, to taste',
	'¼ teaspoon nutmeg',
	'2 Tablespoons flour',
	'1lb of fettuccine or any kind of pasta',
	'1 1/4 lb / 565 g kabocha squash'
]

Again, there’s only one new piece of knowledge required - that two amounts can be added together with the word "plus" between them. We can fix this by changing the ingredient rule to: amount?,/plus/?,amount?,/[a-zA-Z\-]+/+,amount? is ,,,ingredient, yielding:

"1 tablespoon plus ½ teaspoon freshly grated ginger"
=> "freshly grated ginger"
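The shape of that rule can be sketched with a Python regular expression. This is only an approximation under simplifying assumptions: NUMBER, UNIT, AMOUNT, INGREDIENT_LINE and ingredient_text are hypothetical names, the number and unit patterns cover just a few cases, and the trailing amount from the real rule is left out.

```python
import re
from typing import Optional

# Hypothetical, deliberately small patterns for numbers and units.
NUMBER = r"(?:\d+(?:/\d+)?|[½¼¾⅓⅛])"
UNIT = r"(?:tablespoons?|teaspoons?|tbsp|tsp|cups?)"
AMOUNT = rf"{NUMBER}\s*(?:{UNIT}\b)?"

# amount?, "plus"?, amount?, then one or more ingredient words
INGREDIENT_LINE = re.compile(
    rf"^\s*(?:{AMOUNT})?\s*(?:plus\b)?\s*(?:{AMOUNT})?\s*"
    r"(?P<ingredient>[a-zA-Z\-]+(?:\s+[a-zA-Z\-]+)*)"
)


def ingredient_text(line: str) -> Optional[str]:
    """Return the ingredient words after any leading amounts, or None."""
    match = INGREDIENT_LINE.match(line)
    return match.group("ingredient") if match else None


print(ingredient_text("1 tablespoon plus ½ teaspoon freshly grated ginger"))
# freshly grated ginger
```

Because both amounts and the "plus" are optional, the same pattern still handles the ordinary single-amount case such as "1 tbsp olive oil".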

Further batches

We can repeat this process for more batches and observe that, with each batch, additional knowledge is required less often and relates to more esoteric cases. That is not surprising: ingredient lists are written to be easily understood, and there is only a limited number of established conventions for representing them. Writers are unlikely to deviate far from those conventions, so we can expect the long tail of less common cases to be fairly short in this context.

Next time

This approach seems like it could be useful for some applications, so we’ll look at tidying up the code and creating a GitHub project, and consider what other features might be useful, such as the lazy matcher discussed previously.