Classifying text in Python

Following on from the previous article, we’re going to look at implementing our text classifier in Python.


Let’s start with the input string we considered last time: 6 - 7 cups of three different types of vegetables, chopped into bite size pieces. We’re going to classify by matching tokens or a series of tokens so the first thing to do is split the string accordingly. We’ll split the string into numbers, words and other characters and strip off any whitespace using this tokenise function:

def tokenise(in_str):  
    return [t for t in re.split('([a-zA-Z][a-zA-Z\\-]*|\\d+|[^\\w ])', in_str) if t.strip() != '']

This produces the following array of tokens:

['6', '–', '7', 'cups', 'of', 'three', 'different', 'types', 'of', 'vegetables', '*', ',', 'chopped', 'into', 'bite-sized', 'pieces']

Classifying tokens

Let’s take a look at our classifier rules again:

/\d+/ is number
number,/-|–/,number is range  
/tbsp/ is unit
/cups?/ is unit
number|range,unit,/of/? is amount  
amount,/\w+/+ is amount,ingredient  

There are two fundamental kinds of classification going on here:

  • /some-pattern/ is some-type
  • some-type is some-other-type

Each of these can be represented by a matcher class that looks like:

class my_matcher:

    def match(self, token, idx, types):
        # specific match logic
        # ...

We’ll call the match() method to see if a particular token is a match. It’ll need the token and some other bits that we’ll come to later.

We can convert each of our rules to an array of these matchers and call their match() methods in sequence with the elements of our token array to classify groups of tokens. We’re getting ahead of ourselves though - let’s look at matching individual tokens first.

Pattern matcher

The pattern matcher checks if a token matches a specific pattern encoded in a regex. An example is in the rule /\d+/ is number, which checks for digits and classifies them with type number. Its match() method uses re.match to check the token matches:

def match(self, token, idx, types):  
    return re.match(self.pattern, token) is not None

self.pattern is a string containing the regex pattern and it’ll be set in the matcher when it’s initialised.

Type matcher

The type matcher checks if a token has already been classified with a particular type. If it has then it is also classified with the new type. For example, in the rule number,/-/,number is range for the first token in the sequence we are looking for a number to classify as also being a range. Its match() method simply searches the existing types, for which we need those two other parameters (idx is the index of token and types is an array of all the types classified so far):

def match(self, token, idx, types):  
    for t in types:
        if t.start_token <= idx and t.end_token >= idx and t.type in self.types:
            return True
        return False

self.types is an array of type strings and will be set in the matcher when it’s initialised.

Matching multiple tokens

Looking at our set of rules it is clear that classifying individual tokens isn’t enough. Consider the rule number,/-/,number is range. This classifies a particular pattern of three tokens as a range.

To classify a range we need to match each token in the sequence and only if they all match can we designate those tokens as a range.

We haven’t thought much about how the types are stored yet. We can see that tokens can have many type classifications and we’ve just seen that a type classification can have multiple tokens (the range type is applied to 3 consecutive tokens). As we are generally matching consecutive tokens, a good first punt would be an array of type classifications each containing:

  • type name
  • start token
  • end token

A suitable class would therefore look something like the following:

class classification:  
    def __init__(self, start_token, end_token, type):
        self.start_token = start_token
        self.end_token = end_token
        self.type = type

Now we have our individual token matchers and a storage structure for classifications we can write an algorithm to test each token sequence and store each discovered classification.

The matching algorithm

The idea is that for each rule we start from the leftmost token and move right one token at a time trying to match them using the array of matchers in the rule. If any one doesn’t match then the attempt is discarded and we try again, moving our start point along by one. If all the matchers in the rule do match then we store the classification and start looking for a new match from after the last token that was matched (we don’t overlap matches).

Sound simple? Well actually it’s a bit more complicated than that because some of our matchers are able to match more than one token and some are allowed match no tokens depending on the trailing quantifier (+, *, ?). So for + and * we need to keep matching tokens until we fail to match or hit the end. For * and ? we need to allow them to not match any tokens and continue to match the rest of the rule.

Let’s have a go at implementing it:

    def match(self, tokens, types):

        # for each token
        i = 0
        while i < len(tokens):

            new_types = []

            # reset number of matched tokens for rule
            n = 0

            # reset match count for matcher
            k = 0

            # for each matcher
            j = 0
            while j < len(self.matchers):
                matcher = self.matchers[j]

                # calculate token index, t
                t = i + n + k

                # match token using current matcher
                if matcher.match(tokens[t], t, types):

                    # if is-type is given
                    if len(matcher.is_type) > 0:

                        # if is-type per matcher
                        if self.is_type is None:

                            # if first match for matcher
                            if k == 0:

                                # create classification
                                new_types.append(classification(t, t, matcher.is_type))


                                # update classification
                                new_types[-1].end_token = t


                            # if first match for rule
                            if j == 0 and k == 0:

                                # create classification
                                new_types.append(classification(t, t, self.is_type))


                                # update classification
                                new_types[-1].end_token = t

                    # if can only match one token
                    if matcher.quantifier in ['/', '?']:

                        # next matcher
                        j += 1

                        # increment match count for rule
                        n += 1


                        # next match with current matcher
                        k += 1

                    # if end then break
                    if i + n + k == len(tokens):

                        # add matcher match count to rule match count
                        n += k



                    # if first match attempt and need match
                    if k == 0 and matcher.quantifier in ['+', '/']:

                        # reset matched types and break
                        new_types = []


                        # next matcher
                        j += 1

                        # add matcher match count to rule match count
                        n += k

                        # reset matcher match count
                        k = 0

            # if rule matched
            if len(new_types) > 0:

                # add types from this rule to the type store
                for new_type in new_types:

                # add rule match count to start index
                i += n

                # next start point
                i += 1

Note the use of matcher.is_type - this is set into the matcher when it is initialised and contains the type on the right side of the is, which is the type that a token will be classified with in case of a match. There is also self.is_type, which is set where a rule only has one type on the right of the is.

Take a look at the complete source code here:

Next time

We’ll expand our dataset and see if this approach holds water.