Following on from the previous article, we’re going to look at implementing our text classifier in Python.
Let’s start with the input string we considered last time:
6 – 7 cups of three different types of vegetables*, chopped into bite-sized pieces

We’re going to classify by matching tokens or a series of tokens, so the first thing to do is split the string accordingly. We’ll split the string into numbers, words and other characters, and strip off any whitespace, using this tokenise function:
import re

def tokenise(in_str):
    return [t for t in re.split(r'([a-zA-Z][a-zA-Z\-]*|\d+|[^\w ])', in_str)
            if t.strip() != '']
This produces the following array of tokens:
['6', '–', '7', 'cups', 'of', 'three', 'different', 'types', 'of', 'vegetables', '*', ',', 'chopped', 'into', 'bite-sized', 'pieces']
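To reproduce this, here’s the function run against the exact string that yields the token array above (note the en dash, the asterisk and the hyphenated “bite-sized”, all of which survive as tokens):

```python
import re

def tokenise(in_str):
    # split into hyphenated words, runs of digits and single punctuation
    # characters, discarding whitespace-only fragments
    return [t for t in re.split(r'([a-zA-Z][a-zA-Z\-]*|\d+|[^\w ])', in_str)
            if t.strip() != '']

tokens = tokenise('6 – 7 cups of three different types of vegetables*, '
                  'chopped into bite-sized pieces')
print(tokens)
```

Because the splitting pattern is a single capturing group, re.split returns the captured tokens as well as the text between them; the whitespace-only pieces are then filtered out.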
Let’s take a look at our classifier rules again:
/\d+/ is number
number,/-|–/,number is range
/tbsp/ is unit
/cups?/ is unit
number|range,unit,/of/? is amount
amount,/\w+/+ is amount,ingredient
There are two fundamental kinds of classification going on here:
- /some-pattern/ is some-type
- some-type is some-other-type
Each of these can be represented by a matcher class that looks like:
class my_matcher:
    def match(self, token, idx, types):
        # specific match logic
        # ...
We’ll call the match() method to see if a particular token is a match. It’ll need the token and some other bits that we’ll come to later.
We can convert each of our rules to an array of these matchers and call their match() methods in sequence with the elements of our token array to classify groups of tokens. We’re getting ahead of ourselves though - let’s look at matching individual tokens first.
The pattern matcher checks whether a token matches a specific pattern encoded in a regex. An example is the rule /\d+/ is number, which checks for digits and classifies them with the type number. Its match() method uses re.match to check whether the token matches:
def match(self, token, idx, types):
    return re.match(self.pattern, token) is not None
self.pattern is a string containing the regex pattern and it’ll be set in the matcher when it’s initialised.
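Pulling that together, a complete pattern matcher might look like the following (a sketch - the constructor is my assumption about how self.pattern gets initialised):

```python
import re

class pattern_matcher:
    def __init__(self, pattern):
        # the regex taken from between the slashes of a rule like /\d+/
        self.pattern = pattern

    def match(self, token, idx, types):
        # idx and types are unused here, but keep the matcher interface uniform
        return re.match(self.pattern, token) is not None

m = pattern_matcher(r'\d+')
print(m.match('6', 0, []))     # → True
print(m.match('cups', 3, []))  # → False
```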
The type matcher checks if a token has already been classified with a particular type. If it has, then it is also classified with the new type. For example, in the rule number,/-|–/,number is range, for the first token in the sequence we are looking for a number to classify as also being part of a range. The match() method simply searches the existing types, for which we need those two other parameters (idx is the index of the token being matched and types is an array of all the types classified so far):
def match(self, token, idx, types):
    for t in types:
        if t.start_token <= idx and t.end_token >= idx and t.type in self.types:
            return True
    return False
self.types is an array of type strings and will be set in the matcher when it’s initialised.
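A complete type matcher might then look like this (again a sketch; the constructor and the namedtuple stand-in for the classification storage described below are my assumptions):

```python
from collections import namedtuple

# lightweight stand-in for the classification storage class
classification = namedtuple('classification', 'start_token end_token type')

class type_matcher:
    def __init__(self, match_types):
        # the type names this matcher accepts, e.g. ['number', 'range']
        self.types = match_types

    def match(self, token, idx, types):
        # the token matches if any stored classification covers its index
        # and carries one of the accepted types
        for t in types:
            if t.start_token <= idx and t.end_token >= idx and t.type in self.types:
                return True
        return False

existing = [classification(0, 0, 'number'), classification(2, 2, 'number')]
m = type_matcher(['number'])
print(m.match('6', 0, existing))  # → True
print(m.match('–', 1, existing))  # → False
```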
Matching multiple tokens
Looking at our set of rules it is clear that classifying individual tokens isn’t enough. Consider the rule number,/-|–/,number is range. This classifies a particular pattern of three tokens as a range.

To classify a range we need to match each token in the sequence, and only if they all match can we designate those tokens as a range.
We haven’t thought much about how the types are stored yet. We can see that tokens can have many type classifications, and we’ve just seen that a type classification can have multiple tokens (the range type is applied to 3 consecutive tokens). As we are generally matching consecutive tokens, a good first punt would be an array of type classifications, each containing:
- type name
- start token
- end token
A suitable class would therefore look something like the following:
class classification:
    def __init__(self, start_token, end_token, type):
        self.start_token = start_token
        self.end_token = end_token
        self.type = type
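So when the rule number,/-|–/,number is range matches the tokens '6', '–' and '7' at indices 0 to 2 of our token array, we’d store something like this (a self-contained repeat of the class above):

```python
class classification:
    def __init__(self, start_token, end_token, type):
        self.start_token = start_token
        self.end_token = end_token
        self.type = type

# the range '6 – 7' spans token indices 0..2 of our example token array
c = classification(0, 2, 'range')
print(c.type, c.start_token, c.end_token)  # → range 0 2
```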
Now we have our individual token matchers and a storage structure for classifications we can write an algorithm to test each token sequence and store each discovered classification.
The matching algorithm
The idea is that for each rule we start from the leftmost token and move right one token at a time trying to match them using the array of matchers in the rule. If any one doesn’t match then the attempt is discarded and we try again, moving our start point along by one. If all the matchers in the rule do match then we store the classification and start looking for a new match from after the last token that was matched (we don’t overlap matches).
Sound simple? Well actually it’s a bit more complicated than that, because some of our matchers are able to match more than one token and some are allowed to match no tokens, depending on the trailing quantifier (+, *, ?). So for + and * we need to keep matching tokens until we fail to match or hit the end. For * and ? we need to allow them to match no tokens and continue matching the rest of the rule. (In the implementation below, a matcher with no quantifier - one that must match exactly one token - is represented by the quantifier '/'.)
Let’s have a go at implementing it:
def match(self, tokens, types):
    # for each token
    i = 0
    while i < len(tokens):
        new_types = []
        # reset number of matched tokens for rule
        n = 0
        # reset match count for matcher
        k = 0
        # for each matcher
        j = 0
        while j < len(self.matchers):
            matcher = self.matchers[j]
            # calculate token index, t
            t = i + n + k
            # match token using current matcher
            if matcher.match(tokens[t], t, types):
                # if is-type is given
                if len(matcher.is_type) > 0:
                    # if is-type per matcher
                    if self.is_type is None:
                        # if first match for matcher
                        if k == 0:
                            # create classification
                            new_types.append(classification(t, t, matcher.is_type))
                        else:
                            # update classification
                            new_types[-1].end_token = t
                    else:
                        # if first match for rule
                        if j == 0 and k == 0:
                            # create classification
                            new_types.append(classification(t, t, self.is_type))
                        else:
                            # update classification
                            new_types[-1].end_token = t
                # if can only match one token
                if matcher.quantifier in ['/', '?']:
                    # next matcher
                    j += 1
                    # increment match count for rule
                    n += 1
                else:
                    # next match with current matcher
                    k += 1
                # if end then break
                if i + n + k == len(tokens):
                    # add matcher match count to rule match count
                    n += k
                    break
            else:
                # if first match attempt and need match
                if k == 0 and matcher.quantifier in ['+', '/']:
                    # reset matched types and break
                    new_types = []
                    break
                else:
                    # next matcher
                    j += 1
                    # add matcher match count to rule match count
                    n += k
                    # reset matcher match count
                    k = 0
        # if rule matched
        if len(new_types) > 0:
            # add types from this rule to the type store
            for new_type in new_types:
                types.append(new_type)
            # add rule match count to start index
            i += n
        else:
            # next start point
            i += 1
Note the use of matcher.is_type - this is set into the matcher when it is initialised and contains the type on the right side of the is, which is the type that a token will be classified with in case of a match. There is also self.is_type, which is set where a rule only has one type on the right of the is.
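To see the pieces working together end to end, here’s a condensed sketch (the class names are hypothetical, and it simplifies the full algorithm above to rules whose matchers each consume exactly one token - i.e. no +, * or ? quantifiers and no per-matcher is_type):

```python
import re

class classification:
    def __init__(self, start_token, end_token, type):
        self.start_token = start_token
        self.end_token = end_token
        self.type = type

class pattern_matcher:
    def __init__(self, pattern):
        self.pattern = pattern
    def match(self, token, idx, types):
        return re.match(self.pattern, token) is not None

class type_matcher:
    def __init__(self, match_types):
        self.types = match_types
    def match(self, token, idx, types):
        return any(t.start_token <= idx <= t.end_token and t.type in self.types
                   for t in types)

def apply_rule(matchers, is_type, tokens, types):
    # left-to-right scan: each matcher consumes exactly one token
    i = 0
    while i + len(matchers) <= len(tokens):
        if all(m.match(tokens[i + j], i + j, types)
               for j, m in enumerate(matchers)):
            types.append(classification(i, i + len(matchers) - 1, is_type))
            i += len(matchers)  # no overlapping matches
        else:
            i += 1

tokens = ['6', '–', '7', 'cups']
types = []
# /\d+/ is number
apply_rule([pattern_matcher(r'\d+')], 'number', tokens, types)
# number,/-|–/,number is range
apply_rule([type_matcher(['number']), pattern_matcher('-|–'),
            type_matcher(['number'])], 'range', tokens, types)
# /cups?/ is unit
apply_rule([pattern_matcher(r'cups?')], 'unit', tokens, types)
for t in types:
    print(t.type, t.start_token, t.end_token)
```

Running this classifies '6' and '7' as numbers, then '6 – 7' (tokens 0 to 2) as a range, and 'cups' (token 3) as a unit - note how the second rule relies on the classifications stored by the first.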
Take a look at the complete source code here: https://gist.github.com/jeremyorme/4fb29a16d1ff534f5f75f65071e74cc5
We’ll expand our dataset and see if this approach holds water.