A text comprehension library for Python

Over the past few articles we’ve been writing and testing an algorithm to comprehend text content. Before we make any further improvements, we’ll revisit the API design and formalise it into a library stored in a GitHub repo so we can better manage changes.

PyFathom

I’ve decided to call it PyFathom because it fathoms the meaning of a text string and is written in Python (and all other suitable synonyms have been claimed!)

Revisiting the API

The current API is based on several static functions. This was fine during development, but it means the caller has to pass tokens and classifications from one function to the next even though, most of the time, they are only interested in the final result. It would be nice to keep that intermediate data hidden away and minimise the number of calls needed to do a classification.

Therefore we’ll change classifier so that we construct a classifier object with its knowledge and call classify, passing in the input string to be classified. The classify method will return a classifications object that wraps up the tokens and classifications and exposes a method, extract_typed, to return the substring with the specified type. Our new API looks like this:

class classifier:

    def __init__(self, knowledge, tokeniser=default_tokeniser()):
        ...

    def classify(self, in_str):
        ...
        return classifications(...)

class classifications:
    def __init__(self, token_list, classification_list):
        ...

    def extract_typed(self, type):
        ...
        return token_str

So a typical invocation would look like this:

cls = classifier(knowledge)
result = cls.classify(in_str)  # named 'result' so it doesn't shadow the classifications class
ingredient = result.extract_typed('ingredient')
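
To make the shape of the new API concrete, here is a toy, self-contained sketch. It is not PyFathom's real implementation: the dictionary-based knowledge, the whitespace tokeniser, the `Classification` tuple and the `Toy...` class names are all stand-ins invented for illustration. It only shows how a classifier object can hold its knowledge and how the returned object hides the tokens and classifications from the caller.

```python
from collections import namedtuple

# Hypothetical record: a type applied to an inclusive range of token indices.
Classification = namedtuple('Classification', 'type start_token end_token')


class ToyClassifications:
    """Wraps the tokens and classifications so callers never handle them directly."""

    def __init__(self, token_list, classification_list):
        self.token_list = token_list
        self.classification_list = classification_list

    def extract_typed(self, type):
        # Lump every token covered by a matching classification into one string.
        tokens = []
        for c in self.classification_list:
            if c.type == type:
                tokens.extend(self.token_list[c.start_token:c.end_token + 1])
        return ' '.join(tokens)


class ToyClassifier:
    """Stand-in classifier: 'knowledge' is just a dict mapping token -> type."""

    def __init__(self, knowledge, tokeniser=str.split):
        self.knowledge = knowledge
        self.tokeniser = tokeniser

    def classify(self, in_str):
        tokens = self.tokeniser(in_str)
        found = [Classification(self.knowledge[t], i, i)
                 for i, t in enumerate(tokens) if t in self.knowledge]
        return ToyClassifications(tokens, found)


toy = ToyClassifier({'teaspoon': 'unit', 'salt': 'ingredient'})
result = toy.classify('1 teaspoon salt')
print(result.extract_typed('ingredient'))  # salt
```

The caller makes two calls and never touches a token list, which is the point of the redesign.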

Extracting multiple items

When we call extract_typed, it finds all the tokens that match the specified type and lumps them together into the same string. There may have been more than one classification of that type, but that information is lost. Let’s change extract_typed to preserve the boundaries and return a list of strings containing the separate matches.

In the new implementation, we’ll find the classifications that match the type we’re interested in and for each we’ll return a string containing the tokens that classification refers to (separated by spaces). Here’s the code:

    def extract_typed(self, type):
        def get_token_str(c):
            # Join the tokens the classification covers (inclusive range) with spaces.
            return ' '.join(self.token_list[c.start_token:c.end_token + 1])

        return [get_token_str(c) for c in self.classification_list if c.type == type]

We now see potentially multiple strings per input string, e.g.

"1/4 teaspoon Garam Masala, for garnish"
=> ['Garam Masala', 'for garnish']
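
The result above can be reproduced in isolation with hand-built data. This is a sketch under assumptions: the token split and the `Classification` tuple below are made up for the example (the real tokeniser and classification objects may differ), and the only thing taken from the method above is that each classification has a type, a start_token and an inclusive end_token.

```python
from collections import namedtuple

# Hypothetical stand-in for whatever classification object the classifier produces.
Classification = namedtuple('Classification', 'type start_token end_token')


class Classifications:
    def __init__(self, token_list, classification_list):
        self.token_list = token_list
        self.classification_list = classification_list

    def extract_typed(self, type):
        # One string per matching classification, so boundaries are preserved.
        return [' '.join(self.token_list[c.start_token:c.end_token + 1])
                for c in self.classification_list if c.type == type]


# Assumed tokenisation of "1/4 teaspoon Garam Masala, for garnish".
tokens = ['1/4', 'teaspoon', 'Garam', 'Masala', ',', 'for', 'garnish']
found = [
    Classification('unit', 1, 1),
    Classification('ingredient', 2, 3),
    Classification('ingredient', 5, 6),
]

matches = Classifications(tokens, found).extract_typed('ingredient')
print(matches)  # ['Garam Masala', 'for garnish']
```

A type with no matching classification simply yields an empty list, so callers can loop over the result without a special case.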

Markup

We can now move our markup method to the classifications class. It makes sense to rename it to __str__, as it produces a string representation of the classifications for debugging purposes. We can now call str(cls.classify(in_str)) and get markup output such as:

"1/4 teaspoon Garam Masala, for garnish"
=> "<number><amount>1</amount></number>/<amount><number>4</number><unit>teaspoon</unit></amount><ingredient>Garam Masala</ingredient>,<ingredient>for garnish</ingredient>"
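
For readers who want to see how a __str__ of this kind might be put together, here is one possible sketch. It is a guess for illustration, not the library's actual markup code: the `MarkupClassifications` name and the `Classification` tuple are hypothetical, it assumes classifications nest cleanly (no partial overlaps), and it separates tokens with spaces, so its output is formatted slightly differently from the output shown above.

```python
from collections import namedtuple

# Hypothetical record: a type applied to an inclusive range of token indices.
Classification = namedtuple('Classification', 'type start_token end_token')


class MarkupClassifications:
    def __init__(self, token_list, classification_list):
        self.token_list = token_list
        self.classification_list = classification_list

    def __str__(self):
        indexed = list(enumerate(self.classification_list))
        pieces = []
        for i, token in enumerate(self.token_list):
            # Spans starting here: the longest (outermost) opens first.
            opens = sorted((-c.end_token, n, c.type)
                           for n, c in indexed if c.start_token == i)
            # Spans ending here: the latest-starting (innermost) closes first.
            closes = sorted((-c.start_token, -n, c.type)
                            for n, c in indexed if c.end_token == i)
            pieces.append(''.join('<%s>' % t for _, _, t in opens)
                          + token
                          + ''.join('</%s>' % t for _, _, t in closes))
        return ' '.join(pieces)


tokens = ['1/4', 'teaspoon', 'Garam', 'Masala']
found = [
    Classification('amount', 0, 1),
    Classification('number', 0, 0),
    Classification('unit', 1, 1),
    Classification('ingredient', 2, 3),
]
marked_up = str(MarkupClassifications(tokens, found))
print(marked_up)
# <amount><number>1/4</number> <unit>teaspoon</unit></amount> <ingredient>Garam Masala</ingredient>
```

Because the logic lives in __str__, print() and string formatting pick it up automatically, which is handy when debugging.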

The file utils.py is now empty and can be removed.

Source code

The complete source code is on GitHub. Enjoy!

You can get the package using pip:

pip install pyfathom

Next time

We’ll look at implementing the lazy matcher that we needed a couple of articles ago.