(by Peter Ljunglöf, peter.ljunglof@gu.se)
Chart parsing is a pretty old technique, originating in the 60's with CKY (Cocke-Kasami-Younger). CKY is a really simple and efficient algorithm, but it only works for grammars in Chomsky normal form. The first efficient parsing algorithm for general context-free grammars was Jay Earley's algorithm from 1970, and that algorithm is what I will try to explain in this notebook.
Earley's algorithm is still the best there is in terms of complexity – no practical context-free parsing algorithm has beaten its cubic complexity – but there have been several variants, extensions and optimisations. Earley's algorithm still serves as a very good starting point.
One thing that has bothered me for a long time is that all tutorials on chart parsing, and in particular on Earley's algorithm, are so complicated, or they don't explain the details of what you have to do to make it efficient.
(A boastful side note) One exception is my own chart parsing tutorial from 2002, but it was for the programming language Haskell. The tutorial itself is chapter 5 in my licentiate thesis, and the code is in a GitHub repo.
So, here is a notebook that will try to explain the Earley chart parsing algorithm, and how to implement it in Python.
We will use named tuples as a quick-and-dirty way of creating new classes. The islice function will be used later for taking a prefix of a generator. Neither of them is really important for the parsing algorithms themselves.
from collections import namedtuple
from itertools import islice
We assume knowledge of context-free grammars. Context-free categories (nonterminals) are represented as strings, just as words (terminals).
This means that we don't make a distinction between nonterminals (the categories) and terminals (the words or tokens). We happen to use capitalised names for the nonterminals and lowercase for the terminals, but that's only a convention. Enforcing a strict division of the symbols into nonterminals and terminals would only complicate the implementations.
It is quite straightforward to impose this distinction if you want to, so this is left as an exercise for the interested reader.
A context-free rule A –> B1 ... Bn is represented as a pair, where the first element is the left-hand side category and the second element is a sequence of the right-hand side symbols. To make it more readable, we define it as a named tuple:
Rule = namedtuple('Rule', "lhs rhs")
Rule.__str__ = lambda self: "%s --> %s" % (self.lhs, " ".join(map(str, self.rhs)))
Note that we redefine the pretty-printing of rules, so that:
r = Rule('S', ('NP', 'VP'))
print('str(r) = %s' % (r,))
print('repr(r) = %r' % (r,))
print('r.lhs = %s, r.rhs = %s' % (r.lhs, r.rhs))
A context-free grammar is simply a collection of grammar rules. Here is the example grammar that we will use throughout this tutorial:
grammar = [ Rule('S', ('NP','VP')),
Rule('VP', ('Verb',)), Rule('VP', ('Verb','NP')), Rule('VP', ('VP','PP')),
Rule('NP', ('Det','Noun')), Rule('NP', ('NP','PP')),
Rule('PP', ('Prep','NP')),
Rule('Verb',('sees',)),
Rule('Det', ('the',)), Rule('Det', ('a',)),
Rule('Prep',('under',)), Rule('Prep',('with',)), Rule('Prep',('in',)),
Rule('Noun',('zebra',)), Rule('Noun',('lion',)), Rule('Noun',('tree',)),
Rule('Noun',('park',)), Rule('Noun',('telescope',)),
]
for rule in grammar:
print(rule)
Later we will need a way to find the grammar rules we want efficiently. The Earley algorithm looks for grammar rules whose right-hand sides start with a given symbol. This symbol is called the left corner of the rule.
As an example, given the category NP, the matching rules from the example grammar are S –> NP VP and NP –> NP PP.
The following function transforms a grammar into a Python dictionary that returns all matching rules for a given left corner:
def leftcorner_dict(grammar):
leftcorners = {}
for rule in grammar:
leftcorners.setdefault(rule.rhs[0], []).append(rule)
return leftcorners
for lc, rules in leftcorner_dict(grammar).items():
print("%-10s" % (lc,), ": ", "".join("%-18s" % (r,) for r in rules))
Another way of finding grammar rules is to search for a given left-hand side. So here's the corresponding function that transforms a grammar into a dictionary that returns all matching rules for a given left-hand side:
def topdown_dict(rules):
topdowns = {}
for r in rules:
topdowns.setdefault(r.lhs, []).append(r)
return topdowns
for lhs, rules in topdown_dict(grammar).items():
print("%-10s" % (lhs,), ": ", "".join("%-18s" % (r,) for r in rules))
Note that empty rules (i.e., rules with empty right-hand sides) are forbidden! The reason for this is to make the code as compact and readable as possible. This is a disadvantage, because sometimes it's really easier to write grammars using empty rules.
Fortunately there are methods for transforming grammars with empty rules into grammars without them. Alternatively, the algorithms presented here can be augmented to cope with empty rules.
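As a sketch of such a method (this is my addition, not part of the tutorial), here is the standard nullable-elimination transformation: compute the nullable categories, and then add rule variants where nullable right-hand side symbols are dropped.
def remove_empty_rules(grammar):
    # Added sketch of the standard nullable-elimination transformation.
    # Note: if the start symbol can derive the empty string, that parse is lost.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for rule in grammar:
            if rule.lhs not in nullable and all(sym in nullable for sym in rule.rhs):
                nullable.add(rule.lhs)
                changed = True
    newrules = set()
    for rule in grammar:
        # For every way of dropping nullable symbols from the right-hand side,
        # add a rule variant (but never an empty one).
        variants = [()]
        for sym in rule.rhs:
            variants = [v + (sym,) for v in variants] + (variants if sym in nullable else [])
        for rhs in variants:
            if rhs:
                newrules.add(Rule(rule.lhs, rhs))
    return list(newrules)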
We also need some example sentences to play around with. Here is an ever-growing list of sentences showing off the PP-attachment problem:
def example(n):
prefix = "the lion sees a zebra".split()
suffix = "under a tree with a telescope in the park".split()
return prefix + (suffix * (n//3+1))[:n*3]
for n in range(5):
print("example(%d) = %s" % (n, " ".join(example(n))))
We take the simplistic approach to parse trees: A tree is a pair of a root node and a sequence of its children. To make them more readable, we define them as a named tuple.
Tree = namedtuple('Tree', "root children")
Tree.__str__ = lambda tree: ( "%s[%s]" % (tree.root, " ".join(map(str, tree.children)))
if tree.children else str(tree.root)
)
tree = Tree('S', (Tree('NP', (Tree('Det', (Tree('the', ()),)),
Tree('Noun', (Tree('cat', ()),)),)),
Tree('VP', (Tree('Verb', (Tree('sleeps', ()),)),)),))
print("repr(tree):", repr(tree))
print("\nstr(tree):", str(tree))
The basic element in Earley parsing, and in chart parsing in general, is the edge, also known as an item or a state. The edge consists of a span and a dotted rule, which tells that a part of a grammar rule has been found spanning some part of the input. The following is the general form for an edge:
[i–k: A –> B1 ... Bm • Bn ... Bp]
The meaning of an edge is twofold: the categories B1 ... Bm before the dot have already been found, spanning positions i to k of the input; and if the categories Bn ... Bp after the dot can also be found, starting in position k, then we will have found a complete A phrase.
The idea with chart parsing is that we advance the right-hand side of a grammar rule one symbol at a time, and the dot denotes how far we have come. Another idea is to not do unnecessary work: if there are two possible ways of reaching a parsing state, we do not have to do the same work twice.
In Python we can define edges as a 6-tuple, consisting of the span (start and end), the grammar rule (lhs and rhs), the dot position (dot), and finally an optional parse result (result):
Edge = namedtuple('Edge', "start end lhs rhs dot result", defaults=[None])
Edge.__str__ = lambda e: ( "[%s-%s: %s --> %s . %s]"
% (e.start, e.end, e.lhs, " ".join(e.rhs[:e.dot]), " ".join(e.rhs[e.dot:]))
)
Note that we do not show the optional result when pretty-printing the edge. This is because it can grow quite large, and then the edge becomes unreadable.
The following shows two edges for the same grammar rule. If the example sentence is "the lion sees a zebra under a tree", then the first one says that it has found an NP spanning the words "the lion", while the second one says that it has found an NP followed by a VP spanning the words "the lion sees a zebra":
edge1 = Edge(0, 2, 'S', ('NP', 'VP'), 1); print(edge1)
edge2 = Edge(0, 5, 'S', ('NP', 'VP'), 2); print(edge2)
The second edge has a special status, because it is complete, or passive. This edge says that it has found a complete S phrase between positions 0–5, and that this phrase is derived from the grammar rule S –> NP VP.
The following method tests whether an edge is passive or not:
Edge.passive = lambda e: e.dot == len(e.rhs)
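For instance (a small added check), applying it to the two edges above shows that only the second one is passive:
print(edge1, edge1.passive())
print(edge2, edge2.passive())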
Conceptually, the chart is a collection of edges that we build while parsing. When parsing is complete, the chart consists of all edges that can be inferred from the input sentence.
After that we can extract the parse trees (or whatever information we want) from the final chart. For extracting parse trees it's enough to only consider the passive edges in the chart, so our parsing functions will only return the passive edges.
The main idea with Earley parsing is that we go through the input one word at a time, and calculate all possible edges that can be inferred after seeing this new word. This means that we can partition the chart into edgesets, one per input word (or rather, input position). The chart is therefore an array of edgesets – we call this an Earley chart.
The Earley parser consists of three "inference steps", called Scan, Predict and Complete, which all add edges to an edgeset:
Scan: Add a passive edge [j–k: w –> •] to edgeset k for every word w between positions j and k (i.e., where j is k-1).
Predict: If there is a passive edge [j–k: B –> ... •] in edgeset k, and there is a grammar rule A –> B C ... looking for a B, then add a new edge [j–k: A –> B • C ...] to edgeset k.
Complete: If there is a passive edge [j–k: B –> ... •] in edgeset k, and there is an active edge [i–j: A –> ... • B C ...] in edgeset j, then add a new edge [i–k: A –> ... B • C ...] to edgeset k.
Note that Predict and Complete have a trigger edge from the same edgeset k. This means that the set already has to be nonempty for them to be able to add new edges. The edgeset is initialised by Scan, and then Predict and Complete can act until no new edges can be added.
Another way to view this is that edgeset k is the set of all inferences that we can draw about phrases ending in word k, from the grammar and the input sentence. Scan is then an axiom, and Predict and Complete are inference rules. We initialise our knowledge from the axiom, and then apply the inference rules until we reach a fixpoint.
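To make this concrete, here is a small hand-constructed illustration (an added example, not from the original notebook) of the edges that Scan and Predict put into edgeset 1 after reading the first word "the" of our example sentence:
# Added illustration: the edges in edgeset 1 after the word "the".
e_scan = Edge(0, 1, 'the', (), 0)               # Scan: a passive edge for the word itself
e_det  = Edge(0, 1, 'Det', ('the',), 1)         # Predict: triggered by e_scan and the rule Det --> the
e_np   = Edge(0, 1, 'NP', ('Det', 'Noun'), 1)   # Predict: triggered by e_det and the rule NP --> Det Noun
for e in (e_scan, e_det, e_np):
    print(e, "(passive)" if e.passive() else "(active)")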
One way to implement this is to use fixed-point recursion, which is very easy to do in a functional programming language (see e.g. my Haskell implementations). Here we will instead use a todo-list (an "agenda"). The agenda contains the items that should be added to the edgeset but have not been processed by the inference rules yet. When the agenda becomes empty, the edgeset is completed. The agenda is initialised by the Scan axiom, and then we use a while-loop where we apply the Predict and Complete inference rules.
This technique is called dynamic programming.
For each input word and position k, we initialise the agenda to a single passive edge. Then we repeatedly remove one edge from the agenda until it is exhausted. We add the edge to the edgeset (if it's not already there), and if it is passive, we apply Predict and Complete, adding the new edges to the agenda.
def earley1(grammar, input):
# The 0th edgeset is always empty, because there are no edges ending in position 0
chart = [set()]
# Enumerate each word in the input, starting from k=1
for k, word in enumerate(input, 1):
edgeset = set()
# Scan: add one single passive edge for the input word
agenda = [ Edge(k-1, k, word, (), 0) ]
while agenda:
edge = agenda.pop()
# Add the edge to the edgeset (if it's not already there)
if edge not in edgeset:
edgeset.add(edge)
# If the edge is passive we can apply the inference rules
if edge.passive():
# Predict: find grammar rules looking for edge.lhs
agenda.extend(
Edge(edge.start, k, lhs, rhs, 1)
for (lhs, rhs) in grammar
if edge.lhs == rhs[0]
)
# Complete: find active edges in old edgesets looking for edge.lhs
agenda.extend(
Edge(e.start, k, e.lhs, e.rhs, e.dot+1)
for e in chart[edge.start]
if not e.passive()
if edge.lhs == e.rhs[e.dot]
)
# Add the edgeset to the end of the chart
chart.append(edgeset)
# Filter all passive edges from the chart, and return them
return [ [e for e in edgeset if e.passive()]
for edgeset in chart
]
So, what can we do with the parse result – the chart? Well, for one we can test if the parsing succeeded, by looking for a passive edge spanning the whole input, with left-hand side being the starting category of the grammar:
def success(chart, cat, start=0, end=-1):
return any(
edge.start == start and edge.lhs == cat and edge.passive()
for edge in chart[end]
)
sent1 = example(3); sent2 = sent1[:6]
print('Parsing "%s": %s' % (' '.join(sent1), success(earley1(grammar, sent1), 'S')))
print('Parsing "%s": %s' % (' '.join(sent2), success(earley1(grammar, sent2), 'S')))
Another thing we can do is to return the number of edges in the chart (which will be useful when we try to debug the implementations):
def chartsize(chart):
return sum(map(len, chart))
We can also print the edges in the chart. Here is a generic version where we can specify what edgesets should be printed, and a limit on the number of edges printed:
def print_chart(chart, positions=None, cutoff=None):
print("Chart size: %d edges" % chartsize(chart))
for (k, edgeset) in enumerate(chart):
if edgeset and (positions is None or k in positions or (k-len(chart)) in positions):
print("%d edges ending in position %s:" % (len(edgeset), k))
for n, edge in enumerate(sorted(edgeset)):
if cutoff and n >= cutoff:
print(" ...")
break
print(" ", edge)
Now we are ready to test our parsers on some input. The first argument is the parsing function (i.e., earley1 above), and the rest are hopefully self-explanatory:
def test(parser, grammar, cat, sentence, positions=None, cutoff=8):
nwords = len(sentence)
if nwords <= 15:
print('Parsing %d words: "%s"' % (nwords, " ".join(sentence)))
else:
print('Parsing %d words: "%s ... %s"'
% (nwords, " ".join(sentence[:3]), " ".join(sentence[-9:])))
chart = parser(grammar, sentence)
print("Yay, success!!" if success(chart, cat) else "Meh, failure:(")
print_chart(chart, positions, cutoff)
test(earley1, grammar, 'S', example(3), positions=[1,2,-2,-1])
test(earley1, grammar, 'S', example(1)[:6])
All this is very good, but can we optimise the implementation, and if so how much? Let's test a very long sentence:
%time test(earley1, grammar, 'S', example(200), positions=[-1])
This means that earley1 can parse a 600-word sentence in 4 seconds, building a chart of 42,000 passive edges. Can we beat that?
There are some places where earley1 does unnecessary work – Predict searches through the whole grammar just to find a couple of matching rules, and Complete searches through the whole edgeset j. The solution is to use suitable data structures:
Predict: Instead of searching through the whole grammar, we can use an optimised data structure. What we want to find is all rules that have a given left corner, so let's transform the grammar into a dictionary keyed on the left corners. We already have a function for that, leftcorner_dict.
Complete: The same problem here – we have to loop through the whole edgeset j. Instead, we can convert the edgeset into a dictionary from a left corner to all edges with that left corner.
def earley2(grammar, input):
# Transform the grammar into a more efficient data structure
leftcorners = leftcorner_dict(grammar)
chart = [{None: set()}]
for k, sym in enumerate(input, 1):
# Divide the edgeset into a dictionary where the keys are the left corners of the edges
lc_edgesets = {}
# Scan
agenda = [ Edge(k-1, k, sym, (), 0) ]
while agenda:
edge = agenda.pop()
# Get the edgeset for the given left corner
leftc = None if edge.passive() else edge.rhs[edge.dot]
edgeset = lc_edgesets.setdefault(leftc, set())
# Add the edge to the edgeset, if it's not already there
if edge not in edgeset:
edgeset.add(edge)
# If the edge is passive...
if edge.passive():
# ...let's apply Predict:
agenda.extend(
Edge(edge.start, k, lhs, rhs, 1)
for (lhs, rhs) in leftcorners.get(edge.lhs, [])
)
# ...and Complete:
agenda.extend(
Edge(e.start, k, e.lhs, e.rhs, e.dot+1)
for e in chart[edge.start].get(edge.lhs, ())
)
# Append the dictionary of edgesets to the chart
chart.append(lc_edgesets)
# Return only the passive edges from the chart
return [ lc_edgesets[None] for lc_edgesets in chart ]
Note that the passive edges don't have a left corner, because they aren't looking for anything anymore. Instead we want to be able to filter out only the passive edges from a chart. For that reason we use a special symbol which cannot be a left corner as the key in the edgeset dictionary – we choose the key None for these passive edges.
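A small added check (not in the original text): since earley2 returns only the sets stored under the None key, the resulting chart consists solely of passive edges.
chart = earley2(grammar, example(0))
print(all(edge.passive() for edgeset in chart for edge in edgeset))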
%time test(earley2, grammar, 'S', example(200), positions=[-1])
Wow! Just by using better data structures, we made the implementation twice as fast.
Now we want to add some structure to the results, and we will use parse trees for that.
Luckily the edges have an optional argument, result, which we can use. We will use it to store the parse trees that have been found by an edge. I.e., together with the edge [i–k: A –> B1 ... Bm • Bn ... Bp], we will store the trees that have been found for B1 ... Bm.
The only changes to earley2 are that Scan, Predict and Complete include parse trees in the newly created edges. Everything else stays the same.
def earley3(grammar, input):
leftcorners = leftcorner_dict(grammar)
chart = [{None: set()}]
for k, sym in enumerate(input, 1):
lc_edgesets = {}
# Scan: since there are no children, the result will be the empty sequence ()
agenda = [ Edge(k-1, k, sym, (), 0, ()) ]
while agenda:
edge = agenda.pop()
leftc = None if edge.passive() else edge.rhs[edge.dot]
edgeset = lc_edgesets.setdefault(leftc, set())
if edge not in edgeset:
edgeset.add(edge)
if edge.passive():
# Build the tree for the recognised passive edge
tree = Tree(edge.lhs, edge.result)
# Predict: the new edge has recognised the passive edge,
# so we add its tree to the results
agenda.extend(
Edge(edge.start, k, lhs, rhs, 1, (tree,))
for (lhs, rhs) in leftcorners.get(edge.lhs, [])
)
                    # Complete: the new edge has recognised one more tree,
                    # so we add it after the already recognised trees
agenda.extend(
Edge(e.start, k, e.lhs, e.rhs, e.dot+1, e.result+(tree,))
for e in chart[edge.start].get(edge.lhs, ())
)
chart.append(lc_edgesets)
return [ lc_edgesets[None] for lc_edgesets in chart ]
Now, what can we do with parse results? First, instead of just returning whether parsing succeeded, we can return all results that are found by succeeding passive edges.
Note that we do not return a list of all results, but instead a generator. The reason is that we can stop the loop prematurely if we only need one result, and furthermore it also saves memory.
def results(chart, cat, start=0, end=-1):
for edge in chart[end]:
if edge.start == start and edge.lhs == cat and edge.passive():
yield edge.result
Note that the returned results are not the parse trees for the starting category, but the children trees. To get the final parse trees we have to combine the children with the starting category:
def extract_trees3(chart, cat):
for children in results(chart, cat):
yield Tree(cat, children)
Finally, we can implement a parser, taking a grammar, a starting category and the input sentence, yielding parse trees:
def parse3(grammar, cat, sentence):
chart = earley3(grammar, sentence)
return extract_trees3(chart, cat)
for n in range(3):
print('Parsing "%s":' % (' '.join(example(n)),))
for i, tree in enumerate(parse3(grammar, 'S', example(n)), 1):
print('%3d. %s' % (i, tree))
print()
Hmm... the number of parse trees seems to increase. Let's investigate their growth rate:
print("Example Length Trees T(n)/T(n-1)")
oldtrees = 1
for n in range(1, 11):
sent = example(n)
ntrees = len(list(parse3(grammar, 'S', sent)))
print("%5d%10d%10d%10.1f" % (n, len(sent), ntrees, ntrees/oldtrees))
oldtrees = ntrees
The growth rate seems to be converging on a constant, which means that the number of trees grows exponentially. In fact, the number of trees forms the sequence of Catalan numbers, which is discussed further in the NLTK book, chapter 8 (section 6.2, Pernicious Ambiguity).
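As a quick added sanity check (the exact offset is my assumption: example(n) contains n prepositional phrases, so I expect C(n+1) trees, where C is the Catalan sequence), we can compare the tree counts directly:
from math import comb

def catalan(n):
    # The n-th Catalan number: C(n) = binom(2n, n) / (n+1).
    return comb(2*n, n) // (n + 1)

for n in range(1, 6):
    ntrees = len(list(parse3(grammar, 'S', example(n))))
    print("example(%d): %d trees, Catalan(%d) = %d" % (n, ntrees, n+1, catalan(n+1)))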
Now, what about the parsing efficiency? Let's test example 2:
test(earley3, grammar, 'S', example(2), positions=[-1])
And compare this with the previous parser:
test(earley2, grammar, 'S', example(2), positions=[-1])
Hey, our new parser creates more edges! In the final position, the new parser creates 18 edges, where the old one only created 10. And several edges seem to be equivalent... hmm, what's happening here?
The problem is that since we now include the parse trees in the edges, every parse tree gets its own edge. You can see this above, where there are five S –> NP VP edges spanning the whole input, and in the table above we saw that the sentence has exactly five trees.
Let's compare the chart sizes between earley2 and earley3:
print(" n words chart2 chart3")
for n in range(1, 11):
sent = example(n)
chart2 = earley2(grammar, sent)
chart3 = earley3(grammar, sent)
print("%4d %8d %10d %10d" % (n, len(sent), chartsize(chart2), chartsize(chart3)))
This doesn't look very good. The chart for earley3 grows exponentially, compared to the earley2 chart which seems to grow almost linearly. Let's test the parsing times:
print("Testing earley2")
%time test(earley2, grammar, 'S', example(10), positions=[])
print()
print("Testing earley3")
%time test(earley3, grammar, 'S', example(10), positions=[])
Apparently our attempt to include the parse trees in the edges didn't work out well. Can we do something else instead?
Instead of collecting the parse trees, let's collect the recognised phrases.
By that I mean collecting which child categories have been found, and where in the input. I.e., instead of storing parse trees, we store phrases, which are tuples of the form (A,i,j) (or A{i-j} as they are pretty-printed).
Phrase = namedtuple('Phrase', "cat start end")
Phrase.__str__ = lambda phr: "%s{%d-%d}" % phr
The only difference from before is that instead of storing full trees, we store phrases in Scan, Predict and Complete:
def earley4(grammar, input):
leftcorners = leftcorner_dict(grammar)
chart = [{None: set()}]
for k, sym in enumerate(input, 1):
lc_edgesets = {}
# Scan: since there are no children, the result will be the empty sequence ()
agenda = [ Edge(k-1, k, sym, (), 0, ()) ]
while agenda:
edge = agenda.pop()
leftc = None if edge.passive() else edge.rhs[edge.dot]
edgeset = lc_edgesets.setdefault(leftc, set())
if edge not in edgeset:
edgeset.add(edge)
if edge.passive():
# Build the phrase for the recognised passive edge
phrase = Phrase(edge.lhs, edge.start, edge.end)
# Predict: the new edge has recognised the passive edge,
# so we add its phrase to the results
agenda.extend(
Edge(edge.start, k, lhs, rhs, 1, (phrase,))
for (lhs, rhs) in leftcorners.get(edge.lhs, [])
)
                    # Complete: the new edge has recognised one more phrase,
                    # so we add it after the already recognised phrases
agenda.extend(
Edge(e.start, k, e.lhs, e.rhs, e.dot+1, e.result+(phrase,))
for e in chart[edge.start].get(edge.lhs, ())
)
chart.append(lc_edgesets)
return [ lc_edgesets[None] for lc_edgesets in chart ]
So, did this change improve anything?
print("Testing earley3")
%time test(earley3, grammar, 'S', example(10), positions=[])
print()
print("Testing earley4")
%time test(earley4, grammar, 'S', example(10), positions=[])
It seems so! But now we don't have any parse trees, just phrases – can we build trees from these phrases?
Yes we can, by first converting the final chart to a "parse forest". A forest is really a kind of context-free grammar, where the symbols are phrases. So first we convert the chart to a forest:
def extract_forest4(chart):
return [ Rule(Phrase(edge.lhs, edge.start, edge.end), edge.result)
for edgeset in chart
for edge in edgeset
if edge.passive()
]
print('Forest after parsing "%s":' % ' '.join(example(0)))
forest = extract_forest4(earley4(grammar, example(0)))
for rule in forest:
print(" ", rule)
From this parse forest we can extract all parse trees, by doing a nondeterministic recursive-descent enumeration. We define two mutually recursive functions, yield_tree and yield_children:
yield_tree takes a phrase A{i–k} as input and calculates all trees in the forest with A as root category and spanning the positions i–k. It does this by calling yield_children to get all possible combinations of children trees.
yield_children takes a sequence of phrases as input. It goes through each phrase in turn, calculates a tree for that phrase and children trees for the remaining phrases, and finally combines them.
Since both functions are nondeterministic, we use the yield keyword instead of return, which makes them yield results several times.
One optimisation that we do is to convert the forest into a dictionary where we can look up the left-hand side phrases of the forest. This is done by the function topdown_dict that we defined in the beginning:
def extract_trees4(forest, cat, start=0, end=None):
topdowns = topdown_dict(forest)
if end is None:
end = max(lhs.end for lhs in topdowns)
def yield_tree(lhs):
# For every rule in the forest that has 'lhs' as left-hand side...
for rule in topdowns.get(lhs, ()):
# ...and for every combination of children trees...
for children in yield_children(rule.rhs, 0):
# ...build a tree for the given input:
yield Tree(lhs.cat, children)
def yield_children(rhs, i):
if i == len(rhs):
# If there are no phrases in 'rhs' to process, we yield the empty sequence of trees:
yield ()
else:
# Otherwise, for every possible tree for the phrase at position 'i'...
for tree in yield_tree(rhs[i]):
# ...and for every combination of trees for the rest of the phrases...
for trees in yield_children(rhs, i+1):
# ...add the tree in front of the trees:
yield (tree,) + trees
return yield_tree(Phrase(cat, start, end))
Now we can define a new parsing function, that first parses to a chart, then converts the chart to a forest, and after that extracts the trees, one at the time:
def parse4(grammar, cat, sentence):
chart = earley4(grammar, sentence)
forest = extract_forest4(chart)
return extract_trees4(forest, cat)
Now we can try it out:
for n in range(3):
print('Parsing "%s":' % (' '.join(example(n)),))
for i, tree in enumerate(parse4(grammar, 'S', example(n)), 1):
print('%3d. %s' % (i, tree))
print()
Ok, that worked. How about efficiency, compared to our previous parser?
Let's see how long it takes to first parse a sentence, then extract the first tree, then the second tree, and then another 1000 trees. It's important that we don't build the whole list of trees, so we use the islice function to extract the 1000 trees without exhausting the trees iterator.
First, we try our previous parser, parse3:
%time print("Parsing"); trees = parse3(grammar, 'S', example(10))
%time print("\nTree 1: %.80s..." % (next(trees),))
%time print("\nTree 2: %.80s..." % (next(trees),))
%time print("\nNext 1000 trees: %d" % (len(list(islice(trees, 1000)))))
The parsing takes a lot of time, but then extracting the trees is instantaneous.
How about our new parser, parse4?
%time print("Parsing"); trees = parse4(grammar, 'S', example(10))
%time print("\nTree 1: %.80s..." % (next(trees),))
%time print("\nTree 2: %.80s..." % (next(trees),))
%time print("\nNext 1000 trees: %d" % (len(list(islice(trees, 1000)))))
Wow, that was good! Parsing is suddenly super-fast, but as you can see, extracting each tree takes slightly longer than before.
Are we done now? Not quite... let's compare our latest parser earley4 with our optimised earley2 (which did not store any parse results), on a 300-word sentence. We don't extract any trees, just build the chart:
print("Testing earley2")
%time test(earley2, grammar, 'S', example(100), positions=[])
print()
print("Testing earley4")
%time test(earley4, grammar, 'S', example(100), positions=[])
Hmm... it seems we're still not there. Although the new parser earley4 is a huge improvement compared to earley3, it is still an order of magnitude slower than earley2.
Why is that? It's because there can be several ways to combine children phrases into one parent phrase. E.g., the phrase NP{3–305} can be derived from the phrases NP{3–5} PP{5–305}, or NP{3–8} PP{8–305}, or ..., or NP{3–302} PP{302–305}. All these combinations result in one edge each, whereas earley2 would only create one NP edge spanning 3–305.
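To make this concrete, here is a small added check: both parsers recognise exactly the same set of phrases, but earley4 stores one passive edge per way of deriving each phrase.
# Added check: same distinct phrases, but many more edges in the earley4 chart.
sent = example(10)
chart2 = earley2(grammar, sent)
chart4 = earley4(grammar, sent)
spans2 = {(e.lhs, e.start, e.end) for edgeset in chart2 for e in edgeset}
spans4 = {(e.lhs, e.start, e.end) for edgeset in chart4 for e in edgeset}
print("distinct phrases:", len(spans2), "vs", len(spans4))
print("passive edges:   ", chartsize(chart2), "vs", chartsize(chart4))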
So, can we do something about this? Well, we can try to convert the chart returned by earley2 into a parse forest. The edges now don't contain any children phrases, so instead we have to calculate them on the fly when building the forest.
def extract_forest2b(chart):
# First we build a dictionary from the chart,
# which maps (lhs,start) into all possible end positions.
topdowns = {}
for edgeset in chart:
for edge in edgeset:
if edge.passive():
topdowns.setdefault((edge.lhs, edge.start), set()) \
.add(Phrase(edge.lhs, edge.start, edge.end))
# This function annotates the given right-hand side as a sequence of phrases.
# This is nondeterministic, since one right-hand side can be recognised using different spans.
def collect_phrases(rhs, dot, start, end):
if not rhs:
yield ()
elif start == end and dot == len(rhs):
yield ()
elif start < end and dot < len(rhs):
for phr in topdowns.get((rhs[dot], start), ()):
for phrases in collect_phrases(rhs, dot+1, phr.end, end):
yield (phr,) + phrases
# Now we can transform each edge in the chart into (possibly several) forest rules.
return [ Rule(Phrase(edge.lhs, edge.start, edge.end), phrases)
for edgeset in chart
for edge in edgeset
if edge.passive()
for phrases in collect_phrases(edge.rhs, 0, edge.start, edge.end)
]
forest = extract_forest2b(earley2(grammar, example(0)))
for rule in forest:
print(rule)
This is exactly the same forest as when we tested extract_forest4 previously.
Now we can define a parser which uses this forest extraction and then extracts the trees from the forest:
def parse2b(grammar, cat, sentence):
chart = earley2(grammar, sentence)
forest = extract_forest2b(chart)
return extract_trees4(forest, cat)
And behold, it works too:
for tree in parse2b(grammar, 'S', example(1)):
print(tree)
But, have we gained anything, when it comes to efficiency?
%time print("Parsing"); trees = parse2b(grammar, 'S', example(100))
%time print("\nTree 1: %.80s..." % (next(trees),))
%time print("\nTree 2: %.80s..." % (next(trees),))
%time print("\nNext 1000 trees: %d" % (len(list(islice(trees, 1000)))))
Oh... this was even worse than before: parse2b took 4 seconds, which is about 50% slower than before.
It's extracting the forest that takes time:
%time print("Calling earley2"); chart = earley2(grammar, example(100))
print()
%time print("Calling extract_forest2"); forest = extract_forest2b(chart)
So, what can we do about this? We can combine extract_forest2b with extract_trees4.
def extract_trees2c(chart, cat, start=0, end=None):
if end is None:
end = len(chart) - 1
# We build a dictionary which maps (lhs,start) into all possible end positions.
topdowns = {}
for edgeset in chart:
for edge in edgeset:
if edge.passive():
topdowns.setdefault((edge.lhs, edge.start), []).append(edge)
# Build trees for category 'lhs', that start in position 'start'.
# We also have a predicate 'test_end' which returns False if we have passed the end of the phrase.
def yield_tree(lhs, start, test_end):
# For every rule in the forest that has 'lhs' as left-hand side...
for edge in topdowns.get((lhs, start), ()):
            # ...such that it does not go past the end of the phrase...
if test_end(edge.end):
# ...and for every combination of children trees...
for children in yield_children(edge.rhs, 0, start, edge.end):
# ...build a tree for the given input:
yield Tree(lhs, children), edge.end
def yield_children(rhs, dot, start, end):
if not rhs:
# If this is an edge created by Scan, yield the empty sequence of trees:
yield ()
elif start == end and dot == len(rhs):
# If there are no phrases in 'rhs' to process, yield the empty sequence of trees:
yield ()
elif start < end and dot < len(rhs):
            # If there is only one symbol left in 'rhs', the next phrase has to end in 'end',
            # otherwise the next phrase has to end somewhere before 'end':
test_end = (lambda e:e==end) if dot == len(rhs)-1 else (lambda e:e<end)
# For every possible tree for the phrase at position 'dot'...
for tree, mid in yield_tree(rhs[dot], start, test_end):
# ...and for every combination of trees for the rest of the phrases...
for trees in yield_children(rhs, dot+1, mid, end):
# ...add the tree in front of the trees:
yield (tree,) + trees
for tree, _ in yield_tree(cat, start, lambda e:e==end):
yield tree
def parse2c(grammar, cat, sentence):
chart = earley2(grammar, sentence)
return extract_trees2c(chart, cat)
%time print("Parsing"); trees = parse2c(grammar, 'S', example(100))
%time print("\nTree 1: %.80s..." % (next(trees),))
%time print("\nTree 2: %.80s..." % (next(trees),))
%time print("\nNext 1000 trees: %d" % (len(list(islice(trees, 1000)))))
Now we're there! Parsing is very fast, and building every tree is also very fast. To build very many trees still takes time, but we probably don't want 1000 trees to choose from.
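As a final added usage sketch, extracting a single parse tree of a longer sentence is now cheap:
# Added example: pick out just the first parse tree of a 155-word sentence.
trees = parse2c(grammar, 'S', example(50))
print("%.80s..." % (next(trees),))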