Getting lazy

Last time we did a bit of boring admin to tidy up the code and package it. So let’s do something more interesting today and look at lazy matching.

To recap, we wanted a lazy matcher a few articles back to solve the problem of our pattern matching all the way to the end of the string instead of stopping at the ingredient quantity. For example, the string:

salt a pinch  

Would be classified as:

<ingredient>salt a pinch</ingredient>  

Instead of:

<ingredient>salt</ingredient><amount>a pinch</amount>  

(Some types omitted for clarity)

This problem would have been more widespread if we hadn’t cheated and disallowed numbers from ingredients. Otherwise we would have seen additional failures for cases such as:

black beans 200g  

The 200g would have been included in the ingredient.

At the time, I mentioned that we needed a lazy matcher that would work the same way as the lazy matcher in regexes. Essentially, a lazy matcher only claims the current token if the next matcher doesn’t match it. If the next matcher is also lazy then it passes the buck again, to the matcher after it, and so on until a greedy matcher is reached or we run out of matchers; that rightmost matcher then gets the first try. If it fails, we work our way back, trying the previous matcher, and this repeats until we reach the first one. If that fails too then we bail out.
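
For reference, this is the behaviour we’re borrowing. In Python’s re module a trailing ? makes a quantifier lazy, and that is exactly what rescues the amount in the black beans example:

import re

line = 'black beans 200g'

# greedy: .+ swallows the whole line, so the optional amount group matches nothing
print(re.match(r'(.+)\s*(\d+g)?$', line).groups())
# ('black beans 200g', None)

# lazy: .+? gives up tokens as soon as the rest of the pattern can match
print(re.match(r'(.+?)\s*(\d+g)?$', line).groups())
# ('black beans', '200g')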

Writing tests

The first thing we should do is write some tests to capture the required functionality. At this point it becomes apparent that we don’t have any tests yet for the existing functionality. Let’s not get bogged down in discussion about testing - I shall just go ahead and add the missing tests before continuing with this article...

Unsurprisingly, adding those tests revealed a few edge-case bugs in the implementation, which I have fixed. Now we have a test suite in test_classifier.py and we can add tests for our new quantifiers:

def test_lazy_zero_many_pattern_match_many():
    c = classifier(r'/\w+/*?,/abc/ is foo')
    assert str(c.classify('xyz xyz abc xyz')) == '<foo>xyz xyz abc</foo>xyz'

def test_lazy_zero_many_pattern_match_zero():
    c = classifier(r'/\w+/*?,/abc/ is foo')
    assert str(c.classify('abc xyz')) == '<foo>abc</foo>xyz'

def test_lazy_one_many_pattern_match_many():
    c = classifier(r'/\w+/+?,/abc/ is foo')
    assert str(c.classify('xyz xyz abc xyz')) == '<foo>xyz xyz abc</foo>xyz'

def test_lazy_one_many_pattern_match_zero():
    c = classifier(r'/\w+/+?,/abc/ is foo')
    assert str(c.classify('abc xyz')) == 'abc xyz'

def test_lazy_zero_many_dual_pattern_match_zero_many():  
    c = classifier('/xyz/*?,/pqr/*?,/abc/ is foo')
    assert str(c.classify('pqr pqr abc xyz')) == '<foo>pqr pqr abc</foo>xyz'

def test_lazy_zero_many_dual_pattern_match_many_many():  
    c = classifier('/xyz/*?,/pqr/*?,/abc/ is foo')
    assert str(c.classify('xyz xyz pqr pqr abc xyz')) == '<foo>xyz xyz pqr pqr abc</foo>xyz'

def test_lazy_zero_many_dual_pattern_match_many_zero():  
    c = classifier('/xyz/*?,/pqr/*?,/abc/ is foo')
    assert str(c.classify('xyz xyz abc xyz')) == '<foo>xyz xyz abc</foo>xyz'

def test_lazy_zero_many_dual_pattern_match_zero_zero():  
    c = classifier('/xyz/*?,/pqr/*?,/abc/ is foo')
    assert str(c.classify('abc xyz')) == '<foo>abc</foo>xyz'

def test_lazy_zero_many_dual_pattern_match_zero_many_not_start():  
    c = classifier('/xyz/*?,/pqr/*?,/abc/ is foo')
    assert str(c.classify('zzz pqr pqr abc xyz')) == 'zzz<foo>pqr pqr abc</foo>xyz'

Extending the matcher

First of all we need to change our pattern matcher to support the new cases. In addition to *, + and ?, we now want to support *? (the lazy zero-or-more matcher) and +? (the lazy one-or-more matcher). Currently the code relies on the quantifier being the last character of the expression, but that won’t work any more, so we need to parse out the pattern and the quantifier properly. The modified matcher looks like this:

class pattern_matcher(matcher):

    # the pattern body: escaped backslashes, escaped slashes, or anything that isn't a slash
    _fmt_pat = r'(?:\\\\|\\\/|[^\/])+'
    # the quantifier: + or * (each optionally followed by ? for the lazy variant) or ?
    _fmt_qtf = r'(?:\+\??|\*\??|\?)?'
    # internal version with capturing groups for the pattern and quantifier
    _fmt_caps = r'\/(' + _fmt_pat + r')\/(' + _fmt_qtf + ')'

    fmt = r'\/' + _fmt_pat + r'\/' + _fmt_qtf

    def __init__(self, pattern, is_type):
        super(pattern_matcher, self).__init__(is_type)
        m = re.match(pattern_matcher._fmt_caps, pattern)
        self.pattern = '^(?:' + m.group(1) + ')$'
        self.quantifier = m.group(2) or '/'  # '/' stands in for "exactly one"

    # ...

All we’ve done here is extend the quantifier part of the regex to allow a ? after a * or + and store an internal version of the regex (_fmt_caps) that has capturing groups for the pattern and quantifier. We then use that regex to match the provided expression and extract the pattern and quantifier from groups 1 and 2 respectively.
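
As a quick sanity check we can assert that the two parts come out as expected. This is just a sketch - it assumes pattern_matcher can be constructed directly, with a plain string for is_type:

def test_pattern_matcher_parses_lazy_quantifier():
    m = pattern_matcher(r'/\w+/+?', 'foo')
    assert m.pattern == r'^(?:\w+)$'
    assert m.quantifier == '+?'

def test_pattern_matcher_defaults_quantifier():
    # no quantifier given, so we fall back to the single-match marker
    m = pattern_matcher(r'/\w+/', 'foo')
    assert m.quantifier == '/'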

Extending the rule

Now comes the harder bit. We need to extend the algorithm inside the rule class to handle the new quantifier types as described previously.

The first thing we need to do is advance the matcher index past the lazy matchers the first time we try each matcher from a given starting token. We’ll need to work our way back on subsequent attempts though, so instead of incrementing j let’s keep another count d and index the matcher using j + d. When we enter the loop with d > 0 it must be a subsequent attempt, so we decrement d; otherwise it’s a first attempt and we look ahead. Here’s the new code, which goes just inside the j loop (replacing matcher = self.matchers[j]):

                if d > 0:

                    # subsequent attempt from this starting token, so fall back to the previous matcher
                    d -= 1

                else:

                    # if need match and have match then we can look ahead past this matcher
                    if self.matchers[j].quantifier == '+?' and k > 0:
                        d += 1

                    # look ahead past lazy matchers that don't need at least one match
                    while j + d < len(self.matchers) - 1 and self.matchers[j + d].quantifier == '*?':
                        d += 1

                matcher = self.matchers[j + d]

Note here that we can only look ahead past one +? matcher and only if we already have one match. One way to think of this is that /xyz/+? is equivalent to /xyz/,/xyz/*? so we have to get past that first required match before treating it like a *? matcher. Then if we hit another one it’s like running into a non-lazy matcher so we can’t look forward past it. Hence, the loop above only jumps *? matchers.
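
That equivalence is easy to pin down as a test. A sketch, reusing the classifier setup from the tests above:

def test_lazy_one_many_equivalent_to_match_then_lazy_zero_many():
    lazy_one = classifier('/xyz/+?,/abc/ is foo')
    match_then_lazy = classifier('/xyz/,/xyz/*?,/abc/ is foo')
    for text in ['xyz abc', 'xyz xyz xyz abc', 'abc']:
        assert str(lazy_one.classify(text)) == str(match_then_lazy.classify(text))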

We’ve done most of the hard work with that change. Now we just need to handle updating the matcher index after the matcher has executed. If there was a match and we were looking ahead, we need to move the matcher index j to the matcher that produced the match, i.e. j + d. We’ll also want to fold k into n and reset it, since we’ve moved to a new matcher - remember, n counts matches for the entire rule whereas k counts matches for the current matcher. Here’s the code to slot in under the if matcher.match(tokens[t], t, types): check:

                    if d > 0:

                        # move to matcher that matched token
                        j += d
                        d = 0

                        # matcher match count now belongs in rule match count as current matcher changed
                        n += k
                        k = 0

In the else case (where the match failed) we keep the same logic, but we need to wrap it in a check that d == 0 because we only want to advance to the next matcher once we have tried all the look-ahead offsets. However, we do need to update the test for requiring a match to include the new +? quantifier:

                        # if first match attempt and need match
                        if k == 0 and matcher.quantifier in ['+', '+?', '/']:
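
To see the two halves working together, here’s the dual-lazy scenario from the tests above with the trace written out as comments (a sketch of what happens internally under the code we just added):

def test_lazy_zero_many_dual_pattern_trace():
    c = classifier('/xyz/*?,/pqr/*?,/abc/ is foo')
    # token 1 'xyz': look ahead to /abc/ (d=2) - fail, back to /pqr/ (d=1) - fail, match /xyz/ (d=0)
    # token 2 'xyz': the same dance again
    # token 3 'abc': the look-ahead succeeds first time, so j jumps straight to the last matcher
    #                and /pqr/ ends up matching nothing, exactly as a lazy matcher should
    # token 4 'xyz': fails against /abc/ with d == 0, so the rule completes without it
    assert str(c.classify('xyz xyz abc xyz')) == '<foo>xyz xyz abc</foo>xyz'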

That’s it! We have a lazy matcher and we can fix our rule:

amounts?,/plus/?,amount?,/[a-zA-Z\-]+/+,amount? is ,,,ingredient,  

We change /[a-zA-Z\-]+/+ to /[\w\-]+/+? to allow it to accept numbers again, but to match lazily so that it stops matching as soon as an amount is found.
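
The fixed rule then reads:

amounts?,/plus/?,amount?,/[\w\-]+/+?,amount? is ,,,ingredient,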

Next time

We’ll look at how we might convert our discovered ingredient name into a canonical form. This becomes important as soon as we want to do further processing of our data.