Monday, February 8, 2010

Trees in NL parser

Consider you're a parser analyzing the famous The horse raced past the barn fell. Additionally, you're operating in an incremental left-to-right fashion, considering single active parse variant at each moment. Independently of the theory, you'll probably combine the horse and past the barn into some kind of units, say NP1 and PP2. Before you see the last word, you'll probably consider that NP1 is the subject of the finite verb raced, and PP2 is its adjunct. But then the fell comes. At this moment, you'll understand that the single active parse variant you have is wrong, return to raced and reanalyze it as a passive participle (The horse [that was] raced past the barn fell). This is all well known and boring.

What's not boring is the implications for a computational parser operating in a similar way. First, it should be able somehow to restore its state right after NP1 and reparse from there, choosing the passive participle alternative of raced. Second, it would be nice not to reparse the PP2 again. This word sequence will anyway produce the same result and function as an adjunct, though now in the relative clause.

Since we're going to store the parsing state and analyze it in various ways (e.g. find slots to fill), we'd better store it as a data. What kind of data? The standard answer is, an (incomplete) tree. I'll show that's not convenient.

Let's consider mutable trees first. Nodes can be added or removed in such trees to any positions at any time. Therefore to store anything we need first to copy it completely and put aside, so that subsequent changes in the active tree aren't reflected in the copy. This applies both to parser state after NP1 and to the i'd-like-it-cached PP2. But we don't want to copy the parser state at each moment just in case we'll have to backtrack to here in future: we'll run out of memory very soon.

An alternative approach I see is not to copy anything, but to log thoroughly every change you make to the tree in an undoable manner. When a need for backtracking arises, we could just go back through the log undoing every single change and restart. This solves the state-saving problem, but doesn't offer caching for what's undone. We'll have to reparse PP2 anew.

So, mutable trees don't appear to be useful for our particular parser. Let's look at the immutable trees. Their nodes are created once and fixed forever. No nodes can be added or removed. The only way to attach a constituent to a tree is to produce a new tree with the constituent already in the right place. The parts of the old tree not directly dominating the attachment place can be reused in the new tree, everything else has to be recreated. Luckily, the reused part is usually much bigger, so not very much memory is consumed. So, we could easily just save current parsing state at each moment without fear. And we can also cache the PP2 before reparse since no one can hurt it anymore, it's immutable.

But all this is not for free. You should keep track of all the places in the partial tree where it's possible to attach. For example, in 'I gave the ...' there are 2 slots awaiting to be filled: the noun after the and second argument of gave. In the tree they should be somehow represented, and at least one of them won't be on the top level.

The slots to be filled may also not be marked explicitly. In I saw a girl with a telescope the PP with a telescope part may be attached either to noun girl or to the verb saw. Since it's not required by any of them, the parser should walk down the whole tree and find all the constituents touching the right side, and then trying to combine them with the PP. Moreover, psycholinguists observe that most often such an adjunct PP is attached to the downmost suitable constituent. Probably it's easier that way for humans, but this would be very resource-consuming operation in the computational parser.

Finally, the attaching place may even be located not on the right boundary at all. It's the infrequent but still important case of discontinuous constituents, usually occurring in the free word order languages like Russian, Latin or Walpiri. English also has such examples:

The question has been asked, if there's life on Mars.
"Tomorrow," he said, "I'll return".
Yesterday, however, they left.

In the first example, we should attach the subordinate clause if there's life on Mars to the NP the question, which doesn't touch the right boundary at all. The tree we have at that moment is something like

[S: [NP: the question] [VP: has been asked]]

and represents a complete grammatical sentence. So, do we have to search the whole tree to find the constituent to attach to? Nonsense. But I have no idea on what the reasonable attachment strategy should be, that uses immutable trees and covers the presented data.

Of course, all these problems are solvable. It's just me that doesn't want to implement those solutions due to their clumsiness, complexity and possible bug-riddenness. I'd better just use another intermediate parsing representation for the serial parser. Which one? That's another story.

No comments: