Earley-style chart parsing made easy

(by Peter Ljunglöf, peter.ljunglof@gu.se)

Chart parsing is a pretty old technique, originating in the 60's with CKY (Cocke-Kasami-Younger). CKY is a really simple and efficient algorithm, but it only works for grammars in Chomsky normal form. The first efficient parsing algorithm for general context-free grammars was Jay Earley's algorithm from 1970, and that algorithm is what I will try to explain in this notebook.

Earley's algorithm is still the best there is in terms of complexity – no practical context-free parsing algorithm has beaten its cubic complexity – but there have been several variants, extensions and optimisations. Still, Earley's algorithm serves as a very good starting point.

One thing that has bothered me for a long time is that all tutorials on chart parsing, and on Earley's algorithm in particular, are so complicated – or they don't explain the details of what you have to do to make it efficient.

(A boastful side note) One exception is my own chart parsing tutorial from 2002, but it was for the programming language Haskell. The tutorial itself is chapter 5 in my licentiate thesis, and the code is in a GitHub repo.

So, here is a notebook that will try to explain the Earley chart parsing algorithm, and how to implement it in Python.

Introduction

We will use named tuples for a quick-and-dirty way of creating new classes. The islice function will be used for taking a prefix of a generator later. None of them are really important for the parsing algorithms themselves.

In [800]:
from collections import namedtuple
from itertools import islice

Nonterminals and terminals

We assume knowledge of context-free grammars. Context-free categories (nonterminals) are represented as strings, just as words (terminals).

This means that we don't make a distinction between nonterminals (the categories) and terminals (the words or tokens). We happen to use capitalised names for the nonterminals and lowercase for the terminals, but that's only a convention. Enforcing a strict division of the symbols into nonterminals and terminals would only complicate the implementations.

It is quite straightforward to impose this distinction if you want to, so this is left as an exercise for the interested reader (a minimal sketch is shown after the example grammar below).

Grammar rules

A context-free rule A –> B1 ... Bn is represented as a pair, where the first element is the left-hand side category and the second element is the sequence of right-hand side symbols. To make it more readable, we define it as a named tuple:

In [475]:
Rule = namedtuple('Rule', "lhs rhs")
Rule.__str__ = lambda self: "%s --> %s" % (self.lhs, " ".join(map(str, self.rhs)))

Note that we redefine the pretty-printing of rules, so that:

In [809]:
r = Rule('S', ('NP', 'VP'))
print('str(r) = %s' % (r,))
print('repr(r) = %r' % (r,))
print('r.lhs = %s, r.rhs = %s' % (r.lhs, r.rhs))
str(r) = S --> NP VP
repr(r) = Rule(lhs='S', rhs=('NP', 'VP'))
r.lhs = S, r.rhs = ('NP', 'VP')

Context-free grammar

A context-free grammar is simply a collection of grammar rules. Here is the example grammar that we will use throughout this tutorial:

In [824]:
grammar = [ Rule('S',   ('NP','VP')),
            Rule('VP',  ('Verb',)),       Rule('VP',  ('Verb','NP')),   Rule('VP', ('VP','PP')),
            Rule('NP',  ('Det','Noun')),  Rule('NP', ('NP','PP')),
            Rule('PP',  ('Prep','NP')),
            Rule('Verb',('sees',)),
            Rule('Det', ('the',)),    Rule('Det', ('a',)),
            Rule('Prep',('under',)),  Rule('Prep',('with',)),   Rule('Prep',('in',)), 
            Rule('Noun',('zebra',)),  Rule('Noun',('lion',)),   Rule('Noun',('tree',)),
            Rule('Noun',('park',)),   Rule('Noun',('telescope',)),
          ]
In [825]:
for rule in grammar:
    print(rule)
S --> NP VP
VP --> Verb
VP --> Verb NP
VP --> VP PP
NP --> Det Noun
NP --> NP PP
PP --> Prep NP
Verb --> sees
Det --> the
Det --> a
Prep --> under
Prep --> with
Prep --> in
Noun --> zebra
Noun --> lion
Noun --> tree
Noun --> park
Noun --> telescope
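
Going back to the earlier remark about distinguishing nonterminals from terminals, here is a minimal sketch of how the distinction could be imposed (these helper functions are not used in the rest of the notebook): the nonterminals are simply the symbols that occur as the left-hand side of some rule.

In [ ]:
# A minimal sketch, not used below: treat a symbol as a nonterminal
# exactly when it occurs as the left-hand side of some grammar rule.
def nonterminals(grammar):
    return {rule.lhs for rule in grammar}

def is_terminal(symbol, grammar):
    return symbol not in nonterminals(grammar)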

Left corners

Later we will need a way to find the grammar rules we want efficiently. The Earley algorithm looks for grammar rules whose right-hand sides start with a given symbol. This symbol is called the left corner of the rule.

As an example, given the category NP, the matching rules from the example grammar are S –> NP VP and NP –> NP PP.

The following function transforms a grammar into a Python dictionary that returns all matching rules for a given left corner:

In [826]:
def leftcorner_dict(grammar):
    leftcorners = {}
    for rule in grammar:
        leftcorners.setdefault(rule.rhs[0], []).append(rule)
    return leftcorners
In [834]:
for lc, rules in leftcorner_dict(grammar).items():
    print("%-10s" % (lc,), ": ", "".join("%-18s" % (r,) for r in rules))
NP         :  S --> NP VP       NP --> NP PP      
Verb       :  VP --> Verb       VP --> Verb NP    
VP         :  VP --> VP PP      
Det        :  NP --> Det Noun   
Prep       :  PP --> Prep NP    
sees       :  Verb --> sees     
the        :  Det --> the       
a          :  Det --> a         
under      :  Prep --> under    
with       :  Prep --> with     
in         :  Prep --> in       
zebra      :  Noun --> zebra    
lion       :  Noun --> lion     
tree       :  Noun --> tree     
park       :  Noun --> park     
telescope  :  Noun --> telescope

Topdown rules

Another way of finding grammar rules is to search for a given left-hand side. So here's the corresponding function that transforms a grammar into a dictionary that returns all matching rules for a given left-hand side:

In [831]:
def topdown_dict(rules):
    topdowns = {}
    for r in rules:
        topdowns.setdefault(r.lhs, []).append(r)
    return topdowns
In [833]:
for lhs, rules in topdown_dict(grammar).items():
    print("%-10s" % (lhs,), ": ", "".join("%-18s" % (r,) for r in rules))
S          :  S --> NP VP       
VP         :  VP --> Verb       VP --> Verb NP    VP --> VP PP      
NP         :  NP --> Det Noun   NP --> NP PP      
PP         :  PP --> Prep NP    
Verb       :  Verb --> sees     
Det        :  Det --> the       Det --> a         
Prep       :  Prep --> under    Prep --> with     Prep --> in       
Noun       :  Noun --> zebra    Noun --> lion     Noun --> tree     Noun --> park     Noun --> telescope

No empty rules

Note that empty rules (i.e., rules with empty right-hand sides) are forbidden! The reason for this is to make the code as compact and readable as possible. This is a disadvantage, because sometimes it is much easier to write grammars using empty rules.

Fortunately, there are methods for transforming grammars with empty rules into grammars without them. Alternatively, the algorithms presented here can be augmented to cope with empty rules.
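
Before moving on, here is a small sanity check (just a sketch, not used by the parsers) that a grammar really is free from empty rules; the helper name check_no_empty_rules is only for illustration.

In [ ]:
# A small sanity check (sketch): make sure no rule has an empty right-hand side.
def check_no_empty_rules(grammar):
    for rule in grammar:
        assert rule.rhs, "empty rule not allowed: %s -->" % rule.lhs

check_no_empty_rules(grammar)   # the example grammar passes silently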

Example sentences

We also need some example sentences to play around with. Here is a function that generates ever-longer sentences showing off the PP-attachment problem:

In [837]:
def example(n):
    prefix = "the lion sees a zebra".split()
    suffix = "under a tree with a telescope in the park".split()
    return prefix + (suffix * (n//3+1))[:n*3]
In [838]:
for n in range(5):
    print("example(%d) = %s" % (n, " ".join(example(n))))
example(0) = the lion sees a zebra
example(1) = the lion sees a zebra under a tree
example(2) = the lion sees a zebra under a tree with a telescope
example(3) = the lion sees a zebra under a tree with a telescope in the park
example(4) = the lion sees a zebra under a tree with a telescope in the park under a tree

Parse trees

We take a simplistic approach to parse trees: a tree is a pair of a root node and a sequence of its children. To make trees more readable, we define them as a named tuple.

In [855]:
Tree = namedtuple('Tree', "root  children")
Tree.__str__ = lambda tree: ( "%s[%s]" % (tree.root, " ".join(map(str, tree.children))) 
                              if tree.children else str(tree.root) 
                            )
In [856]:
tree = Tree('S', (Tree('NP', (Tree('Det',  (Tree('the', ()),)),
                              Tree('Noun', (Tree('cat', ()),)),)), 
                  Tree('VP', (Tree('Verb', (Tree('sleeps', ()),)),)),))
print("repr(tree):", repr(tree))
print("\nstr(tree):", str(tree))
repr(tree): Tree(root='S', children=(Tree(root='NP', children=(Tree(root='Det', children=(Tree(root='the', children=()),)), Tree(root='Noun', children=(Tree(root='cat', children=()),)))), Tree(root='VP', children=(Tree(root='Verb', children=(Tree(root='sleeps', children=()),)),))))

str(tree): S[NP[Det[the] Noun[cat]] VP[Verb[sleeps]]]

Edges (or items, or states)

The basic element in Earley parsing, and in chart parsing in general, is the edge, also known as an item or a state. An edge consists of a span and a dotted rule, and says that part of a grammar rule has been found spanning some part of the input. The following is the general form of an edge:

[i–k: A –> B1 ... Bm • Bn ... Bp]

The meaning of an edge is twofold:

  1. What has been found: "the categories B1 ... Bm have been found spanning the words i+1, i+2, ..., k of the input sentence".
  2. What we look for: "if we find the categories Bn ... Bp spanning the words k+1, ..., r, then we know that there is an A spanning the words i+1, ..., r".

The idea with chart parsing is that we advance the right-hand side of a grammar rule one symbol at a time, and the dot denotes how far we have come. Another idea is to not do unnecessary work: if there are two possible ways of reaching a parsing state, we do not have to do the same work twice.

In Python we can define edges as a 6-tuple, consisting of the span (start and end), the grammar rule (lhs and rhs), the dot position (dot), and finally an optional parse result (result):

In [870]:
Edge = namedtuple('Edge', "start end lhs rhs dot result", defaults=[None])
Edge.__str__ = lambda e: ( "[%s-%s: %s --> %s . %s]" 
                           % (e.start, e.end, e.lhs, " ".join(e.rhs[:e.dot]), " ".join(e.rhs[e.dot:]))
                         )

Note that we do not show the optional result when pretty-printing the edge. This is because it can grow quite large, and then the edge becomes unreadable.

The following shows two edges for the same grammar rule. If the example sentence is "the lion sees a zebra under a tree", then the first edge says that an NP has been found spanning the words "the lion", while the second one says that an NP followed by a VP has been found spanning the words "the lion sees a zebra":

In [871]:
edge1 = Edge(0, 2, 'S', ('NP', 'VP'), 1); print(edge1)
edge2 = Edge(0, 5, 'S', ('NP', 'VP'), 2); print(edge2)
[0-2: S --> NP . VP]
[0-5: S --> NP VP . ]

The second edge has a special status, because it is complete, or passive. This edge says that it has found a complete S phrase between positions 0–5, and that this phrase is derived from the grammar rule S –> NP VP.

The following method tests whether an edge is passive or not:

In [872]:
Edge.passive = lambda e: e.dot == len(e.rhs)
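
As a quick check: edge1 above is still looking for a VP, while edge2 has found its whole right-hand side, so only edge2 should count as passive.

In [ ]:
print(edge1.passive())   # False: the dot has not reached the end of the right-hand side
print(edge2.passive())   # True: the dot is at the end, so the edge is passive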

The chart

Conceptually, the chart is a collection of edges that we build while parsing. When parsing is complete, the chart consists of all edges that can be inferred from the input sentence.

After that we can extract the parse trees (or whatever information we want) from the final chart. For extracting parse trees it's enough to only consider the passive edges in the chart, so our parsing functions will only return the passive edges.

Earley parsing

The main idea with Earley parsing is that we go through the input one word at a time, and calculate all possible edges that can be inferred after seeing this new word. This means that we can partition the chart into edgesets, one per input word (or rather, per input position). The chart is therefore an array of edgesets – we call this an Earley chart.

The Earley parser consists of three "inference steps", called Scan, Predict and Complete, each of which adds edges to an edgeset:

  • Scan: Add a passive edge [j–k: w –> •] to edgeset k for every word w between positions j and k (i.e., where j is k-1).

  • Predict: If there is a passive edge [j–k: B –> ... •] in edgeset k, and there is a grammar rule A –> B C ... looking for a B, then add a new edge [j–k: A –> B • C ...] to edgeset k.

  • Complete: If there is a passive edge [j–k: B –> ... •] in edgeset k, and there is an active edge [i–j: A –> ... • B C ...] in edgeset j, then add a new edge [i–k: A –> ... B • C ...] to edgeset k.

Note that Predict and Complete are both triggered by an edge from the same edgeset k. This means that the set already has to be nonempty for them to be able to add new edges. The edgeset is initialised by Scan, and then Predict and Complete apply until no new edges can be added.

Another way to view this is that edgeset k is the set of all inferences that we can draw about phrases ending in word k, from the grammar and the input sentence. Scan is then an axiom, and Predict and Complete are inference rules. We initialise our knowledge from the axiom, and then apply the inference rules until we reach a fixpoint. For example, for the input "the lion sees ...", edgeset 1 is initialised by Scan with the edge [0–1: the –> •]; Predict then adds [0–1: Det –> the •] (from the rule Det –> the), and from that new passive edge it adds [0–1: NP –> Det • Noun] (from the rule NP –> Det Noun), after which nothing more can be inferred.

One way to implement this is to use fixed-point recursion, which is very easy to do in a functional programming language (see e.g. my Haskell implementations). Here we will instead use a todo-list (an "agenda"). The agenda contains the edges that should be added to the edgeset but have not yet been processed by the inference rules. When the agenda becomes empty, the edgeset is complete. The agenda is initialised by the Scan axiom, and then we use a while-loop where we apply the Predict and Complete inference rules.

This technique is called dynamic programming.

Implementation 1

For each input word and position k, we initialise the agenda to a single passive edge. Then we repeatedly remove one edge from the agenda until it is exhausted. We add the edge to the edgeset (if it's not already there), and if it is passive, we apply Predict and Complete, adding the new edges to the agenda.

In [925]:
def earley1(grammar, input):
    # The 0th edgeset is always empty, because there are no edges ending in position 0
    chart = [set()]
    
    # Enumerate each word in the input, starting from k=1
    for k, word in enumerate(input, 1):
        edgeset = set()
        
        # Scan: add one single passive edge for the input word
        agenda = [ Edge(k-1, k, word, (), 0) ]
        
        while agenda:
            edge = agenda.pop()
            
            # Add the edge to the edgeset (if it's not already there)
            if edge not in edgeset:
                edgeset.add(edge)
                
                # If the edge is passive we can apply the inference rules
                if edge.passive():
                    
                    # Predict: find grammar rules looking for edge.lhs
                    agenda.extend(
                        Edge(edge.start, k, lhs, rhs, 1)
                        for (lhs, rhs) in grammar 
                        if edge.lhs == rhs[0]
                    )
                    
                    # Complete: find active edges in old edgesets looking for edge.lhs
                    agenda.extend(
                        Edge(e.start, k, e.lhs, e.rhs, e.dot+1) 
                        for e in chart[edge.start] 
                        if not e.passive() 
                        if edge.lhs == e.rhs[e.dot]
                    )
        
        # Add the edgeset to the end of the chart
        chart.append(edgeset)
    
    # Filter all passive edges from the chart, and return them
    return [ [e for e in edgeset if e.passive()] 
             for edgeset in chart 
           ]

So, what can we do with the parse result – the chart? Well, for one thing we can test whether parsing succeeded, by looking for a passive edge that spans the whole input and whose left-hand side is the starting category of the grammar:

In [874]:
def success(chart, cat, start=0, end=-1):
    return any(
        edge.start == start and edge.lhs == cat and edge.passive()
        for edge in chart[end]
    )
In [936]:
sent1 = example(3); sent2 = sent1[:6]
print('Parsing "%s": %s' % (' '.join(sent1), success(earley1(grammar, sent1), 'S')))
print('Parsing "%s": %s' % (' '.join(sent2), success(earley1(grammar, sent2), 'S')))
Parsing "the lion sees a zebra under a tree with a telescope in the park": True
Parsing "the lion sees a zebra under": False

Another thing we can do is to return the number of edges in the chart (which will be useful when we try to debug the implementations):

In [875]:
def chartsize(chart):
    return sum(map(len, chart))

We can also print the edges in the chart. Here is a generic version where we can specify what edgesets should be printed, and a limit on the number of edges printed:

In [894]:
def print_chart(chart, positions=None, cutoff=None):
    print("Chart size: %d edges" % chartsize(chart))
    for (k, edgeset) in enumerate(chart):
        if edgeset and (positions is None or k in positions or (k-len(chart)) in positions):
            print("%d edges ending in position %s:" % (len(edgeset), k))
            for n, edge in enumerate(sorted(edgeset)):
                if cutoff and n >= cutoff:
                    print("    ...")
                    break
                print("   ", edge)

Now we are ready to test our parsers on some input. The first argument is the parsing function (i.e., earley1 above), and the rest are hopefully self-explanatory:

In [895]:
def test(parser, grammar, cat, sentence, positions=None, cutoff=8):
    nwords = len(sentence)
    if nwords <= 15:
        print('Parsing %d words: "%s"' % (nwords, " ".join(sentence)))
    else:
        print('Parsing %d words: "%s ... %s"' 
              % (nwords, " ".join(sentence[:3]), " ".join(sentence[-9:])))
    chart = parser(grammar, sentence)
    print("Yay, success!!" if success(chart, cat) else "Meh, failure:(")
    print_chart(chart, positions, cutoff)
In [898]:
test(earley1, grammar, 'S', example(3), positions=[1,2,-2,-1])
Parsing 14 words: "the lion sees a zebra under a tree with a telescope in the park"
Yay, success!!
Chart size: 58 edges
2 edges ending in position 1:
    [0-1: Det --> the . ]
    [0-1: the -->  . ]
3 edges ending in position 2:
    [0-2: NP --> Det Noun . ]
    [1-2: Noun --> lion . ]
    [1-2: lion -->  . ]
2 edges ending in position 13:
    [12-13: Det --> the . ]
    [12-13: the -->  . ]
12 edges ending in position 14:
    [0-14: S --> NP VP . ]
    [2-14: VP --> VP PP . ]
    [2-14: VP --> Verb NP . ]
    [3-14: NP --> NP PP . ]
    [5-14: PP --> Prep NP . ]
    [6-14: NP --> NP PP . ]
    [8-14: PP --> Prep NP . ]
    [9-14: NP --> NP PP . ]
    ...
In [899]:
test(earley1, grammar, 'S', example(1)[:6])
Parsing 6 words: "the lion sees a zebra under"
Meh, failure:(
Chart size: 18 edges
2 edges ending in position 1:
    [0-1: Det --> the . ]
    [0-1: the -->  . ]
3 edges ending in position 2:
    [0-2: NP --> Det Noun . ]
    [1-2: Noun --> lion . ]
    [1-2: lion -->  . ]
4 edges ending in position 3:
    [0-3: S --> NP VP . ]
    [2-3: VP --> Verb . ]
    [2-3: Verb --> sees . ]
    [2-3: sees -->  . ]
2 edges ending in position 4:
    [3-4: Det --> a . ]
    [3-4: a -->  . ]
5 edges ending in position 5:
    [0-5: S --> NP VP . ]
    [2-5: VP --> Verb NP . ]
    [3-5: NP --> Det Noun . ]
    [4-5: Noun --> zebra . ]
    [4-5: zebra -->  . ]
2 edges ending in position 6:
    [5-6: Prep --> under . ]
    [5-6: under -->  . ]

Implementation 2: Optimising the code

All this is very good, but can we optimise the implementation, and if so how much? Let's test a very long sentence:

In [937]:
%time test(earley1, grammar, 'S', example(200), positions=[-1])
Parsing 605 words: "the lion sees ... in the park under a tree with a telescope"
Yay, success!!
Chart size: 42216 edges
406 edges ending in position 605:
    [0-605: S --> NP VP . ]
    [2-605: VP --> VP PP . ]
    [2-605: VP --> Verb NP . ]
    [3-605: NP --> NP PP . ]
    [5-605: PP --> Prep NP . ]
    [6-605: NP --> NP PP . ]
    [8-605: PP --> Prep NP . ]
    [9-605: NP --> NP PP . ]
    ...
CPU times: user 4.34 s, sys: 117 ms, total: 4.46 s
Wall time: 4.59 s

This means that earley1 can parse a 600-word sentence in 4 seconds, building a chart of 42,000 passive edges. Can we beat that?

There are some places where earley1 does unnecessary work – Predict searches through the whole grammar just to find a couple of matching rules, and Complete searches through the whole edgeset j. The solution is to use suitable data structures:

  • Predict: Instead of searching through the whole grammar, we can use an optimised data structure. What we want to find is all rules with a given left corner, so let's transform the grammar into a dictionary keyed on left corners. We already have a function for that, leftcorner_dict.

  • Complete: The same problem here – we have to loop through the whole edgeset j. Instead, we can convert each edgeset into a dictionary that maps the symbol an edge is looking for (its left corner) to all edges looking for that symbol.

In [982]:
def earley2(grammar, input):
    # Transform the grammar into a more efficient data structure
    leftcorners = leftcorner_dict(grammar)
    
    chart = [{None: set()}]
    for k, sym in enumerate(input, 1):
        # Divide the edgeset into a dictionary where the keys are the left corners of the edges
        lc_edgesets = {}

        # Scan
        agenda = [ Edge(k-1, k, sym, (), 0) ]

        while agenda:
            edge = agenda.pop()
            
            # Get the edgeset for the given left corner
            leftc = None if edge.passive() else edge.rhs[edge.dot]
            edgeset = lc_edgesets.setdefault(leftc, set())
            
            # Add the edge to the edgeset, if it's not already there
            if edge not in edgeset:
                edgeset.add(edge)
                
                # If the edge is passive...
                if edge.passive():
                    # ...let's apply Predict:
                    agenda.extend(
                        Edge(edge.start, k, lhs, rhs, 1)
                        for (lhs, rhs) in leftcorners.get(edge.lhs, [])
                    )
                    # ...and Complete:
                    agenda.extend(
                        Edge(e.start, k, e.lhs, e.rhs, e.dot+1) 
                        for e in chart[edge.start].get(edge.lhs, ())
                    )
        # Append the dictionary of edgesets to the chart
        chart.append(lc_edgesets)
    
    # Return only the passive edges from the chart
    return [ lc_edgesets[None] for lc_edgesets in chart ]

Note that the passive edges don't have a left corner, because they aren't looking for anything anymore. Instead we want to be able to filter out only the passive edges from a chart. For that reason we use a special key which cannot be a left corner in the edgeset dictionary – we choose the key None for the passive edges.
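
To make the data structure concrete, here is a small illustration (separate from the parser itself) of how two edges end up under different keys: an active edge is filed under the symbol it is looking for, and a passive edge under None.

In [ ]:
# A small illustration, not part of the parser: grouping edges by their left corner.
e_active  = Edge(0, 2, 'S',  ('NP', 'VP'),    1)   # active: looking for a VP
e_passive = Edge(0, 2, 'NP', ('Det', 'Noun'), 2)   # passive: not looking for anything
lc_edgesets = {}
for e in (e_active, e_passive):
    key = None if e.passive() else e.rhs[e.dot]
    lc_edgesets.setdefault(key, set()).add(e)
for key, edges in lc_edgesets.items():
    print(key, ":", ", ".join(map(str, edges)))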

In [940]:
%time test(earley2, grammar, 'S', example(200), positions=[-1])
Parsing 605 words: "the lion sees ... in the park under a tree with a telescope"
Yay, success!!
Chart size: 42216 edges
406 edges ending in position 605:
    [0-605: S --> NP VP . ]
    [2-605: VP --> VP PP . ]
    [2-605: VP --> Verb NP . ]
    [3-605: NP --> NP PP . ]
    [5-605: PP --> Prep NP . ]
    [6-605: NP --> NP PP . ]
    [8-605: PP --> Prep NP . ]
    [9-605: NP --> NP PP . ]
    ...
CPU times: user 2.26 s, sys: 10.2 ms, total: 2.27 s
Wall time: 2.27 s

Wow! Just by using better data structures, we made the implementation twice as fast.

Implementation 3: Building parse trees

Now we want to add some structure to the results, and we will use parse trees for that.

Luckily the edges have an optional field, result, which we can use to store the parse trees that have been found by an edge. That is, together with the edge [i–k: A –> B1 ... Bm • Bn ... Bp], we store the trees that have been found for B1 ... Bm.

The only change to earley2 is that Scan, Predict and Complete include parse trees in the newly created edges. Everything else stays the same.

In [981]:
def earley3(grammar, input):
    leftcorners = leftcorner_dict(grammar)    
    chart = [{None: set()}]
    for k, sym in enumerate(input, 1):
        lc_edgesets = {}
        
        # Scan: since there are no children, the result will be the empty sequence ()
        agenda = [ Edge(k-1, k, sym, (), 0, ()) ]

        while agenda:
            edge = agenda.pop()
            leftc = None if edge.passive() else edge.rhs[edge.dot]
            edgeset = lc_edgesets.setdefault(leftc, set())
            if edge not in edgeset:
                edgeset.add(edge)
                
                if edge.passive():
                    # Build the tree for the recognised passive edge
                    tree = Tree(edge.lhs, edge.result)

                    # Predict: the new edge has recognised the passive edge,
                    # so we add its tree to the results
                    agenda.extend(
                        Edge(edge.start, k, lhs, rhs, 1, (tree,))
                        for (lhs, rhs) in leftcorners.get(edge.lhs, [])
                    )
                    # Complete: the new edge has recognised one more tree,
                    # so we add it after the already recognised trees
                    agenda.extend(
                        Edge(e.start, k, e.lhs, e.rhs, e.dot+1, e.result+(tree,)) 
                        for e in chart[edge.start].get(edge.lhs, ())
                    )
        chart.append(lc_edgesets)
    
    return [ lc_edgesets[None] for lc_edgesets in chart ]

Now, what can we do with parse results? First, instead of just returning whether parsing succeeded, we can return all results that are found by the successful passive edges.

Note that we do not return a list of all results, but a generator. The reason is that we can stop the loop prematurely if we only need one result, and it also saves memory.

In [942]:
def results(chart, cat, start=0, end=-1):
    for edge in chart[end]:
        if edge.start == start and edge.lhs == cat and edge.passive():
            yield edge.result
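
As a small usage sketch of this laziness, we can pick out just the first result without computing the others:

In [ ]:
# A small usage sketch: take only the first parse result, lazily.
chart = earley3(grammar, example(1))
first = next(results(chart, 'S'), None)
print(first is not None)   # True: the sentence has at least one parse
print(len(first))          # 2: the children found for S --> NP VP, i.e. an NP tree and a VP tree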

Note that the returned results are not the parse trees for the starting category, but the children trees. To get the final parse trees we have to combine the children with the starting category:

In [913]:
def extract_trees3(chart, cat):
    for children in results(chart, cat):
        yield Tree(cat, children)

Finally, we can implement a parser, taking a grammar, a starting category and the input sentence, yielding parse trees:

In [943]:
def parse3(grammar, cat, sentence):
    chart = earley3(grammar, sentence)
    return extract_trees3(chart, cat)
In [954]:
for n in range(3):
    print('Parsing "%s":' % (' '.join(example(n)),))
    for i, tree in enumerate(parse3(grammar, 'S', example(n)), 1):
        print('%3d. %s' % (i, tree))
    print()
Parsing "the lion sees a zebra":
  1. S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[Det[a] Noun[zebra]]]]

Parsing "the lion sees a zebra under a tree":
  1. S[NP[Det[the] Noun[lion]] VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep[under] NP[Det[a] Noun[tree]]]]]
  2. S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[Det[a] Noun[tree]]]]]]

Parsing "the lion sees a zebra under a tree with a telescope":
  1. S[NP[Det[the] Noun[lion]] VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep[under] NP[NP[Det[a] Noun[tree]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]]]
  2. S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[Det[a] Noun[tree]]]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]]
  3. S[NP[Det[the] Noun[lion]] VP[VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep[under] NP[Det[a] Noun[tree]]]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]
  4. S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[NP[Det[a] Noun[tree]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]]]]
  5. S[NP[Det[the] Noun[lion]] VP[VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[Det[a] Noun[tree]]]]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]

Hmm... the number of parse trees seems to increase quickly. Let's investigate the growth rate:

In [971]:
print("Example    Length    Trees   T(n)/T(n-1)")
oldtrees = 1
for n in range(1, 11):
    sent = example(n)
    ntrees = len(list(parse3(grammar, 'S', sent)))
    print("%5d%10d%10d%10.1f" % (n, len(sent), ntrees, ntrees/oldtrees))
    oldtrees = ntrees
Example    Length    Trees   T(n)/T(n-1)
    1         8         2       2.0
    2        11         5       2.5
    3        14        14       2.8
    4        17        42       3.0
    5        20       132       3.1
    6        23       429       3.2
    7        26      1430       3.3
    8        29      4862       3.4
    9        32     16796       3.5
   10        35     58786       3.5

The growth rate seems to be converging towards a constant, which means that the number of trees grows exponentially. In fact, the tree counts form the sequence of Catalan numbers, which is discussed further in the NLTK book, chapter 8 (section 6.2, Pernicious Ambiguity).
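
As a quick sanity check (a small sketch, where the helper catalan is just for illustration), the n-th Catalan number is C(n) = (2n choose n) / (n+1), and the tree counts in the table above are exactly C(n+1):

In [ ]:
# A small sanity check: the tree counts above are the Catalan numbers, shifted by one.
from math import comb

def catalan(n):
    return comb(2*n, n) // (n + 1)

print([catalan(n+1) for n in range(1, 11)])
# prints: [2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786]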

Now, what about the parsing efficiency? Let's test example 2:

In [917]:
test(earley3, grammar, 'S', example(2), positions=[-1])
Parsing 11 words: "the lion sees a zebra under a tree with a telescope"
Yay, success!!
Chart size: 51 edges
18 edges ending in position 11:
    [0-11: S --> NP VP . ]
    [0-11: S --> NP VP . ]
    [0-11: S --> NP VP . ]
    [0-11: S --> NP VP . ]
    [0-11: S --> NP VP . ]
    [2-11: VP --> VP PP . ]
    [2-11: VP --> VP PP . ]
    [2-11: VP --> VP PP . ]
    ...

And compare this with the previous parser:

In [973]:
test(earley2, grammar, 'S', example(2), positions=[-1])
Parsing 11 words: "the lion sees a zebra under a tree with a telescope"
Yay, success!!
Chart size: 42 edges
10 edges ending in position 11:
    [0-11: S --> NP VP . ]
    [2-11: VP --> VP PP . ]
    [2-11: VP --> Verb NP . ]
    [3-11: NP --> NP PP . ]
    [5-11: PP --> Prep NP . ]
    [6-11: NP --> NP PP . ]
    [8-11: PP --> Prep NP . ]
    [9-11: NP --> Det Noun . ]
    ...

Hey, our new parser creates more edges! In the final position, the new parser creates 18 edges, whereas the old one only created 10. And several edges look identical... hmm, what's happening here?

The problem is that since we now include the parse trees in the edges, every parse tree gets its own edge. You can see this above, where there are five S –> NP VP edges spanning the whole input, and in the table above we saw that this sentence has exactly five trees.

Let's compare the chart sizes between earley2 and earley3:

In [975]:
print("   n      words     chart2     chart3")
for n in range(1, 11):
    sent = example(n)
    chart2 = earley2(grammar, sent)
    chart3 = earley3(grammar, sent)
    print("%4d %8d %10d %10d" % (n, len(sent), chartsize(chart2), chartsize(chart3)))
   n      words     chart2     chart3
   1        8         28         29
   2       11         42         51
   3       14         58         98
   4       17         76        220
   5       20         96        578
   6       23        118       1704
   7       26        142       5393
   8       29        168      17805
   9       32        196      60377
  10       35        226     208587

This doesn't look very good. The chart for earley3 grows exponentially, whereas the earley2 chart only grows polynomially (roughly quadratically). Let's test the parsing times:

In [976]:
print("Testing earley2")
%time test(earley2, grammar, 'S', example(10), positions=[])
print()
print("Testing earley3")
%time test(earley3, grammar, 'S', example(10), positions=[])
Testing earley2
Parsing 35 words: "the lion sees ... with a telescope in the park under a tree"
Yay, success!!
Chart size: 226 edges
CPU times: user 2.27 ms, sys: 129 µs, total: 2.4 ms
Wall time: 2.42 ms

Testing earley3
Parsing 35 words: "the lion sees ... with a telescope in the park under a tree"
Yay, success!!
Chart size: 208587 edges
CPU times: user 4.35 s, sys: 14.1 ms, total: 4.36 s
Wall time: 4.36 s

Apparently our attempt to include the parse trees in the edges didn't work out well. Can we do something else instead?

Implementation 4: Building a forest

Instead of collecting the parse trees, let's collect the recognised phrases. By that I mean recording which child categories have been found, and where in the input. That is, instead of storing parse trees, we store phrases, which are tuples of the form (A,i,j) (pretty-printed as A{i-j}).

In [978]:
Phrase = namedtuple('Phrase', "cat start end")
Phrase.__str__ = lambda phr: "%s{%d-%d}" % phr

The only difference from before is that instead of storing full trees, we store phrases in Scan, Predict and Complete:

In [983]:
def earley4(grammar, input):
    leftcorners = leftcorner_dict(grammar)    
    chart = [{None: set()}]
    for k, sym in enumerate(input, 1):
        lc_edgesets = {}
        
        # Scan: since there are no children, the result will be the empty sequence ()
        agenda = [ Edge(k-1, k, sym, (), 0, ()) ]

        while agenda:
            edge = agenda.pop()
            leftc = None if edge.passive() else edge.rhs[edge.dot]
            edgeset = lc_edgesets.setdefault(leftc, set())
            if edge not in edgeset:
                edgeset.add(edge)
                
                if edge.passive():
                    # Build the phrase for the recognised passive edge
                    phrase = Phrase(edge.lhs, edge.start, edge.end)

                    # Predict: the new edge has recognised the passive edge,
                    # so we add its phrase to the results
                    agenda.extend(
                        Edge(edge.start, k, lhs, rhs, 1, (phrase,))
                        for (lhs, rhs) in leftcorners.get(edge.lhs, [])
                    )
                    # Complete: the new edge has recognised one more phrase,
                    # so we add it after the already recognised phrases
                    agenda.extend(
                        Edge(e.start, k, e.lhs, e.rhs, e.dot+1, e.result+(phrase,)) 
                        for e in chart[edge.start].get(edge.lhs, ())
                    )
        chart.append(lc_edgesets)
    
    return [ lc_edgesets[None] for lc_edgesets in chart ]

So, did this change improve anything?

In [984]:
print("Testing earley3")
%time test(earley3, grammar, 'S', example(10), positions=[])
print()
print("Testing earley4")
%time test(earley4, grammar, 'S', example(10), positions=[])
Testing earley3
Parsing 35 words: "the lion sees ... with a telescope in the park under a tree"
Yay, success!!
Chart size: 208587 edges
CPU times: user 3.3 s, sys: 9.79 ms, total: 3.31 s
Wall time: 3.32 s

Testing earley4
Parsing 35 words: "the lion sees ... with a telescope in the park under a tree"
Yay, success!!
Chart size: 436 edges
CPU times: user 3.68 ms, sys: 120 µs, total: 3.8 ms
Wall time: 3.77 ms

It seems so! But now we don't have any parse trees, just phrases – can we build trees from these phrases?

Yes we can, by first converting the final chart to a "parse forest". A forest is really a kind of context-free grammar, where the symbols are phrases. So first we convert the chart to a forest:

In [990]:
def extract_forest4(chart):
    return [ Rule(Phrase(edge.lhs, edge.start, edge.end), edge.result)
             for edgeset in chart
             for edge in edgeset
             if edge.passive()
           ]
In [991]:
print('Forest after parsing "%s":' % ' '.join(example(0)))
forest = extract_forest4(earley4(grammar, example(0)))
for rule in forest:
    print("  ", rule)
Forest after parsing "the lion sees a zebra":
   the{0-1} --> 
   Det{0-1} --> the{0-1}
   NP{0-2} --> Det{0-1} Noun{1-2}
   lion{1-2} --> 
   Noun{1-2} --> lion{1-2}
   VP{2-3} --> Verb{2-3}
   S{0-3} --> NP{0-2} VP{2-3}
   Verb{2-3} --> sees{2-3}
   sees{2-3} --> 
   a{3-4} --> 
   Det{3-4} --> a{3-4}
   VP{2-5} --> Verb{2-3} NP{3-5}
   Noun{4-5} --> zebra{4-5}
   NP{3-5} --> Det{3-4} Noun{4-5}
   S{0-5} --> NP{0-2} VP{2-5}
   zebra{4-5} --> 

Extracting the trees from a forest

From this parse forest we can extract all parse trees, by doing a nondeterministic recursive-descent enumeration. We define two mutually recursive functions, yield_tree and yield_children:

  • yield_tree takes a phrase A{i–k} as input and calculates all trees in the forest, with A as root category and spanning the positions i–k. It does this by calling yield_children to get all possible combinations of children trees.
  • yield_children takes a sequence of phrases as input. It goes through each phrase in turn, calculates a tree for that phrase, and then children trees for the remaining phrases, and finally combines them.

Since both functions are nondeterministic, we use the yield keyword instead of return, which lets them yield results several times.

One optimisation that we do is to convert the forest into a dictionary where we can look up the left-hand side phrases of the forest. This is done by the function topdown_dict that we defined in the beginning:

In [999]:
def extract_trees4(forest, cat, start=0, end=None):
    topdowns = topdown_dict(forest)
    if end is None:
        end = max(lhs.end for lhs in topdowns)

    def yield_tree(lhs):
        # For every rule in the forest that has 'lhs' as left-hand side...
        for rule in topdowns.get(lhs, ()):
            # ...and for every combination of children trees...
            for children in yield_children(rule.rhs, 0):
                # ...build a tree for the given input:
                yield Tree(lhs.cat, children)

    def yield_children(rhs, i):
        if i == len(rhs):
            # If there are no phrases in 'rhs' to process, we yield the empty sequence of trees:
            yield ()
        else:
            # Otherwise, for every possible tree for the phrase at position 'i'...
            for tree in yield_tree(rhs[i]):
                # ...and for every combination of trees for the rest of the phrases...
                for trees in yield_children(rhs, i+1):
                    # ...add the tree in front of the trees:
                    yield (tree,) + trees
    
    return yield_tree(Phrase(cat, start, end))

Now we can define a new parsing function, that first parses to a chart, then converts the chart to a forest, and after that extracts the trees, one at the time:

In [1000]:
def parse4(grammar, cat, sentence):
    chart = earley4(grammar, sentence)
    forest = extract_forest4(chart)
    return extract_trees4(forest, cat)

Now we can try it out:

In [1001]:
for n in range(3):
    print('Parsing "%s":' % (' '.join(example(n)),))
    for i, tree in enumerate(parse4(grammar, 'S', example(n)), 1):
        print('%3d. %s' % (i, tree))
    print()
Parsing "the lion sees a zebra":
  1. S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[Det[a] Noun[zebra]]]]

Parsing "the lion sees a zebra under a tree":
  1. S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[Det[a] Noun[tree]]]]]]
  2. S[NP[Det[the] Noun[lion]] VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep[under] NP[Det[a] Noun[tree]]]]]

Parsing "the lion sees a zebra under a tree with a telescope":
  1. S[NP[Det[the] Noun[lion]] VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep[under] NP[NP[Det[a] Noun[tree]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]]]
  2. S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[Det[a] Noun[tree]]]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]]
  3. S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[NP[Det[a] Noun[tree]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]]]]
  4. S[NP[Det[the] Noun[lion]] VP[VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[Det[a] Noun[tree]]]]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]
  5. S[NP[Det[the] Noun[lion]] VP[VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep[under] NP[Det[a] Noun[tree]]]] PP[Prep[with] NP[Det[a] Noun[telescope]]]]]

Ok, that worked. How about efficiency, compared to our previous parser? Let's see how long it takes first to parse a sentence, then to extract the first tree, then the second tree, and then another 1000 trees. It's important that we don't build the whole list of trees, so we use the islice function to extract the 1000 trees without exhausting the trees iterator.

First, we try our previous parser, parse3:

In [1002]:
%time print("Parsing"); trees = parse3(grammar, 'S', example(10))
%time print("\nTree 1: %.80s..." % (next(trees),))
%time print("\nTree 2: %.80s..." % (next(trees),))
%time print("\nNext 1000 trees: %d" % (len(list(islice(trees, 1000)))))
Parsing
CPU times: user 4.4 s, sys: 15.5 ms, total: 4.41 s
Wall time: 4.42 s

Tree 1: S[NP[Det[the] Noun[lion]] VP[VP[VP[VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Pr...
CPU times: user 185 µs, sys: 3 µs, total: 188 µs
Wall time: 193 µs

Tree 2: S[NP[Det[the] Noun[lion]] VP[VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[und...
CPU times: user 238 µs, sys: 56 µs, total: 294 µs
Wall time: 301 µs

Next 1000 trees: 1000
CPU times: user 2.34 ms, sys: 64 µs, total: 2.4 ms
Wall time: 2.42 ms

The parsing takes a lot of time, but then extracting the trees is instantaneous.

How about our new parser, parse4?

In [1003]:
%time print("Parsing"); trees = parse4(grammar, 'S', example(10))
%time print("\nTree 1: %.80s..." % (next(trees),))
%time print("\nTree 2: %.80s..." % (next(trees),))
%time print("\nNext 1000 trees: %d" % (len(list(islice(trees, 1000)))))
Parsing
CPU times: user 88 ms, sys: 1.4 ms, total: 89.4 ms
Wall time: 89 ms

Tree 1: S[NP[Det[the] Noun[lion]] VP[VP[VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep...
CPU times: user 702 µs, sys: 792 µs, total: 1.49 ms
Wall time: 2.1 ms

Tree 2: S[NP[Det[the] Noun[lion]] VP[VP[VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep...
CPU times: user 201 µs, sys: 5 µs, total: 206 µs
Wall time: 208 µs

Next 1000 trees: 1000
CPU times: user 94.4 ms, sys: 2.54 ms, total: 96.9 ms
Wall time: 95 ms

Wow, that was good! Parsing is suddenly super-fast, but as you can see, extracting each tree takes slightly longer than before.

Even more optimisations

Are we done now? Not quite... let's compare our latest parser earley4 with our optimised earley2 (which did not store any parse results), on a 300-word sentence. We don't extract any trees, just build the chart:

In [1006]:
print("Testing earley2")
%time test(earley2, grammar, 'S', example(100), positions=[])
print()
print("Testing earley4")
%time test(earley4, grammar, 'S', example(100), positions=[])
Testing earley2
Parsing 305 words: "the lion sees ... with a telescope in the park under a tree"
Yay, success!!
Chart size: 11116 edges
CPU times: user 320 ms, sys: 3.76 ms, total: 324 ms
Wall time: 322 ms

Testing earley4
Parsing 305 words: "the lion sees ... with a telescope in the park under a tree"
Yay, success!!
Chart size: 182716 edges
CPU times: user 2.36 s, sys: 6.71 ms, total: 2.37 s
Wall time: 2.37 s

Hmm... it seems we're still not there. Although the new parser earley4 is a huge improvement over earley3, it is still much slower than earley2 – around seven times slower on this sentence.

Why is that? It's because there can be several ways to combine children phrases into one parent phrase. E.g., the phrase NP{3–305} can be derived from the phrases NP{3–5} PP{5–305}, or NP{3–8} PP{8–305}, or ..., or NP{3–302} PP{302–305}. Each of these combinations results in its own edge, whereas earley2 only creates one NP edge spanning 3–305.

So, can we do something about this? Well, we can try to convert the chart returned by earley2 into a parse forest. The edges now don't contain any children phrases, so instead we have to calculate them on the fly when building the forest.

In [1031]:
def extract_forest2b(chart):
    # First we build a dictionary from the chart,
    # which maps (lhs,start) into all possible end positions.
    topdowns = {}
    for edgeset in chart:
        for edge in edgeset:
            if edge.passive():
                topdowns.setdefault((edge.lhs, edge.start), set()) \
                    .add(Phrase(edge.lhs, edge.start, edge.end))

    # This function annotates the given right-hand side as a sequence of phrases.
    # This is nondeterministic, since one right-hand side can be recognised using different spans.
    def collect_phrases(rhs, dot, start, end):
        if not rhs:
            yield ()
        elif start == end and dot == len(rhs):
            yield ()
        elif start < end and dot < len(rhs):
            for phr in topdowns.get((rhs[dot], start), ()):
                for phrases in collect_phrases(rhs, dot+1, phr.end, end):
                    yield (phr,) + phrases

    # Now we can transform each edge in the chart into (possibly several) forest rules.
    return [ Rule(Phrase(edge.lhs, edge.start, edge.end), phrases)
             for edgeset in chart
             for edge in edgeset
             if edge.passive()
             for phrases in collect_phrases(edge.rhs, 0, edge.start, edge.end)
           ]
In [1032]:
forest = extract_forest2b(earley2(grammar, example(0)))
for rule in forest:
    print(rule)
Det{0-1} --> the{0-1}
the{0-1} --> 
Noun{1-2} --> lion{1-2}
lion{1-2} --> 
NP{0-2} --> Det{0-1} Noun{1-2}
Verb{2-3} --> sees{2-3}
VP{2-3} --> Verb{2-3}
S{0-3} --> NP{0-2} VP{2-3}
sees{2-3} --> 
Det{3-4} --> a{3-4}
a{3-4} --> 
VP{2-5} --> Verb{2-3} NP{3-5}
NP{3-5} --> Det{3-4} Noun{4-5}
S{0-5} --> NP{0-2} VP{2-5}
zebra{4-5} --> 
Noun{4-5} --> zebra{4-5}

This is exactly the same forest as the one extract_forest4 produced earlier (modulo the order of the rules).

Now we can define a parser which uses this forest extraction and then extracts the trees from the forest:

In [1033]:
def parse2b(grammar, cat, sentence):
    chart = earley2(grammar, sentence)
    forest = extract_forest2b(chart)
    return extract_trees4(forest, cat)

And behold, it works too:

In [1034]:
for tree in parse2b(grammar, 'S', example(1)):
    print(tree)
S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under] NP[Det[a] Noun[tree]]]]]]
S[NP[Det[the] Noun[lion]] VP[VP[Verb[sees] NP[Det[a] Noun[zebra]]] PP[Prep[under] NP[Det[a] Noun[tree]]]]]

But, have we gained anything, when it comes to efficiency?

In [1035]:
%time print("Parsing"); trees = parse2b(grammar, 'S', example(100))
%time print("\nTree 1: %.80s..." % (next(trees),))
%time print("\nTree 2: %.80s..." % (next(trees),))
%time print("\nNext 1000 trees: %d" % (len(list(islice(trees, 1000)))))
Parsing
CPU times: user 5.66 s, sys: 17.7 ms, total: 5.67 s
Wall time: 5.69 s

Tree 1: S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[NP[NP[Det[a] Noun[zebra]] PP[Prep[...
CPU times: user 3.53 ms, sys: 60 µs, total: 3.59 ms
Wall time: 3.61 ms

Tree 2: S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[NP[NP[Det[a] Noun[zebra]] PP[Prep[...
CPU times: user 927 µs, sys: 34 µs, total: 961 µs
Wall time: 971 µs

Next 1000 trees: 1000
CPU times: user 145 ms, sys: 1.43 ms, total: 146 ms
Wall time: 146 ms

Oh... this was even worse than before. Parsing with parse2b took over 5 seconds on this sentence, more than twice as long as just building the earley4 chart.

It's extracting the forest that takes time:

In [1036]:
%time print("Calling earley2"); chart = earley2(grammar, example(100))
print()
%time print("Calling extract_forest2"); forest = extract_forest2b(chart)
Calling earley2
CPU times: user 318 ms, sys: 3.18 ms, total: 321 ms
Wall time: 320 ms

Calling extract_forest2
CPU times: user 5.64 s, sys: 18 ms, total: 5.65 s
Wall time: 5.67 s

Extracting the trees directly from the chart

So, what can we do about this? We can combine extract_forest2b and extract_trees4 into a single function that extracts the trees directly from the earley2 chart, without building the intermediate forest.

In [1037]:
def extract_trees2c(chart, cat, start=0, end=None):
    if end is None:
        end = len(chart) - 1

    # We build a dictionary which maps (lhs, start) to all passive edges with that left-hand side and start position.
    topdowns = {}
    for edgeset in chart:
        for edge in edgeset:
            if edge.passive():
                topdowns.setdefault((edge.lhs, edge.start), []).append(edge)

    # Build trees for category 'lhs', that start in position 'start'.
    # We also have a predicate 'test_end' which returns False if we have passed the end of the phrase.
    def yield_tree(lhs, start, test_end):
        # For every rule in the forest that has 'lhs' as left-hand side...
        for edge in topdowns.get((lhs, start), ()):
            # ...such that it does not go past the end of the phrase...
            if test_end(edge.end):
                # ...and for every combination of children trees...
                for children in yield_children(edge.rhs, 0, start, edge.end):
                    # ...build a tree for the given input:
                    yield Tree(lhs, children), edge.end

    def yield_children(rhs, dot, start, end):
        if not rhs:
            # If this is an edge created by Scan, yield the empty sequence of trees:
            yield ()
        elif start == end and dot == len(rhs):
            # If there are no phrases in 'rhs' to process, yield the empty sequence of trees:
            yield ()
        elif start < end and dot < len(rhs):
            # If there is only one symbol left in 'rhs', the next phrase has to end exactly at 'end',
            # otherwise the next phrase has to end somewhere before 'end':
            test_end = (lambda e:e==end) if dot == len(rhs)-1 else (lambda e:e<end)
            # For every possible tree for the phrase at position 'dot'...
            for tree, mid in yield_tree(rhs[dot], start, test_end):
                # ...and for every combination of trees for the rest of the phrases...
                for trees in yield_children(rhs, dot+1, mid, end):
                    # ...add the tree in front of the trees:
                    yield (tree,) + trees

    for tree, _ in yield_tree(cat, start, lambda e:e==end):
        yield tree
In [1038]:
def parse2c(grammar, cat, sentence):
    chart = earley2(grammar, sentence)
    return extract_trees2c(chart, cat)
In [1040]:
%time print("Parsing"); trees = parse2c(grammar, 'S', example(100))
%time print("\nTree 1: %.80s..." % (next(trees),))
%time print("\nTree 2: %.80s..." % (next(trees),))
%time print("\nNext 1000 trees: %d" % (len(list(islice(trees, 1000)))))
Parsing
CPU times: user 326 ms, sys: 4.45 ms, total: 330 ms
Wall time: 332 ms

Tree 1: S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under]...
CPU times: user 15.5 ms, sys: 298 µs, total: 15.8 ms
Wall time: 15.8 ms

Tree 2: S[NP[Det[the] Noun[lion]] VP[Verb[sees] NP[NP[Det[a] Noun[zebra]] PP[Prep[under]...
CPU times: user 2.14 ms, sys: 116 µs, total: 2.26 ms
Wall time: 2.28 ms

Next 1000 trees: 1000
CPU times: user 2.37 s, sys: 7.4 ms, total: 2.37 s
Wall time: 2.37 s

Now we're there! Parsing is very fast, and building every tree is also very fast. To build very many trees still takes time, but we probably don't want 1000 trees to choose from.

Extensions

Using a function for categorising the words

Allowing empty grammar rules

Adding probabilities to the grammar