Friday, July 20, 2012

Visibility ambiguity

I'm now dealing with a quite interesting kind of ambiguity which I call visibility ambiguity. Consider the following three sentences (I might have put too many commas; imagine it's Russian where they're obligatory):
  1. But there, thinking about her words, we got sad.
  2. But there, thinking about her words, which were so cruel, we got sad.
  3. But there, thinking about her words, intonation and gestures, we got sad.
The sentences are the same until the second comma after which they diverge dramatically. To process the next word the parser should determine which previous state it can attach to. In 1, we is attached to there and the top-level sentence is continued. In 2, which relates to words plus comma, a relative clause begins. In 3, there's a conjunction: words gets merged with intonation and re-attached together to about and her.

Obviously, all three attachment sites should be visible to the parser: there, words and comma. And all three continuation possibilities should be explored.

Which is where it gets complicated, because these attachment sites are also incompatible with each other. Once you've attached a word following route 1, you cannot attach the next one to route 2. The parser should regard such attachments as mutually incompatible. There are at least three diverging routes after every comma, each defined by its own set of constructions. And all constructions from different sets should be marked as contradictory, pair-wise. Which is quite a few pairs.

So it's not only that the number of possible variants will be big. It's also that the number of incompatibility relations to track will be much bigger, something pretty exponential. Given that I normally log all the visible attachment sites after each word during parsing, the log will be hard to analyze.

And what troubles me: people don't do this, they don't track all the possibilities. Instead, they quite quickly choose a variant which suits best, purge everything else from the memory and pursue only this parsing route. If it goes wrong eventually, they perform some kind of reanalysis. But the heuristics are tuned well enough and normally such a need never arises.

Right now my parser maintains all the structures needed for all possible analyses, and switching between them is very simple. But that just doesn't scale well enough to support visibility ambiguities. So it seems that the time has come to teach my parser the human strategy: choosing one route and then reparsing when anything doesn't fit well.

No comments: