Saturday, February 13, 2010

Charts and clouds

As Doug Arnold pointed out, attacking trees and looking for a replacement isn't new. It has already been done, so I didn't actually invent anything revolutionary. The construction cloud is actually a well-known structure called a chart, as in chart parsing. The chart is designed to keep all possible parse variants in a compact way, to ease backtracking and reparsing. And my cloud does exactly this, in a very similar fashion.

One difference is the kind of information stored in the chart. In the classical approach, it's a bunch of records like "there may be an NP constituent from word 3 to word 7". No internal structure of these constituents is stored. If someone needs the complete tree, it can be computed on demand, since all the daughter phrases are also present in the chart. And you're lucky if there's only one variant of such a subtree.

A cloud is a collection of records like "constructions X, Y and Z may form a construction T". Since constructions don't tend to have a deep hierarchy, each of them stores its children, so we don't have to compute the subtree on demand. But a subtree rarely covers the whole sentence. Therefore the task is to find a subset of all constructions that covers the input and whose constructions don't contradict each other.
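The contrast between the two record shapes can be sketched in Python. The classes below are my own minimal rendering, not anyone's actual implementation; all names are illustrative:

```python
from dataclasses import dataclass
from typing import Tuple

# A classical chart edge: just a span and a category, no internal structure.
@dataclass(frozen=True)
class ChartEdge:
    category: str  # e.g. "NP"
    start: int     # index of the first word covered
    end: int       # index of the last word covered

# A cloud record: a composite construction that stores its children directly.
@dataclass(frozen=True)
class Construction:
    name: str                             # e.g. "Prepositional"
    children: Tuple["Construction", ...]  # the constructions it combines

# "There may be an NP constituent from word 3 to word 7."
np_edge = ChartEdge("NP", 3, 7)

# "Constructions of and man may form a Prepositional construction."
of = Construction("of", ())
man = Construction("man", ())
prep = Construction("Prepositional", (of, man))
```

With the chart edge, recovering the subtree means searching the chart for matching daughter edges; with the cloud record, the children are immediately at hand.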

Another difference is that the chart is usually tightly coupled to the parser. A typical chart parser computes all the results in parallel, storing every possible intermediate constituent in the chart. I, on the other hand, am into single-variant serial parsing with error recovery. Of course, nothing prevents us from implementing that kind of parser with a phrase-structure-based chart: just drive one variant instead of all of them. One of the reasons I prefer dependency-like constructions over constituents is that free-word-order languages appear to be easier to describe in terms of constructions.

Wednesday, February 10, 2010

But why cloud?

Once upon a time I was having problems with trees. I decided I preferred a construction cloud instead. Here's why.

Like the syntactic groups grammar, the construction cloud accommodates both phrase structure and dependency grammars. A construction relating two words can be thought of as a dependency relation between them, with the rare exception that this 'dependency' can itself participate in another construction, as with Prepositional and Genitive from the example.

The construction cloud also doesn't object to discontinuous constructions. If the parser managed to build one from the text, then it's acceptable.

Unlike phrase structure, dependency and categorial grammars, it's possible for a construction to have more than one parent. This allows overlapping of various kinds, including, for example, separate constructions for syntactic agreement, semantic relatedness and prosodic cues to information structure.

From the programmer's perspective, it's much easier to maintain the list of attachable constructions during parsing. They are no longer buried deep inside a tree, but stored together in a flat list. Being immutable, they are easy to cache for a future reparse. Rollback in case of reanalysis is also simple: just look at the constructions ending before the reparse point to restore the context. Finally, nobody prevents you from having constructions from several parse variants in the same cloud simultaneously.
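Here's a toy sketch of that rollback, assuming each construction records the index of the last word it covers (the names and spans below are made up for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Construction:
    name: str
    start: int  # index of the first word covered
    end: int    # index of the last word covered

# The cloud is just a flat list; nothing is buried inside a tree.
cloud = [
    Construction("the", 0, 0),
    Construction("horse", 1, 1),
    Construction("NounPhrase", 0, 1),
    Construction("raced", 2, 2),
    Construction("Subject", 0, 2),
]

def rollback(cloud, reparse_point):
    """Restore the context: keep only constructions ending before the reparse point."""
    return [c for c in cloud if c.end < reparse_point]

# Reanalyzing from word 2 drops everything that touches it.
context = rollback(cloud, 2)
```

Since the records are immutable, the dropped ones can still be kept around as a cache in case the reparse rebuilds them identically.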

Isn't it convincing?

Tuesday, February 9, 2010

Cloud of constructions

What I propose to use as the parser representation instead of trees is a collection of constructions. I call it a cloud.

Base constructions are words. When several constructions (e.g. words) tend to go together and form something semantically meaningful, they form a composite construction and become its arguments, or children. Composite constructions may in turn be children of even higher-order constructions, although not very often.

Consider I see a little silhouetto of a man as an example. Each word is a construction, so it's already in the cloud. I+see form a construction whose meaning is to specify see's subject as I. A similar thing happens with see+silhouetto, though this time it's an object. Both a and little provide additional information about silhouetto: it's indefinite and, well, little. Again, indefiniteness is added to man in the a+man construction. The preposition of and man form a prepositional construction, which then itself forms a genitive construction with silhouetto. The result can be nicely visualized:

Doesn't this look like a strangely-colored cloud?
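The same cloud can be written down in code. Below, a construction is simply a (name, children) tuple, words are plain strings, and the labels (Subject, Genitive, ...) are my own shorthand, not anything fixed by the theory:

```python
def cons(name, *children):
    # A composite construction: a label plus the constructions it combines.
    return (name, children)

subject       = cons("Subject", "I", "see")
obj           = cons("Object", "see", "silhouetto")
indef_sil     = cons("Indefinite", "a", "silhouetto")
attribute     = cons("Attribute", "little", "silhouetto")
indef_man     = cons("Indefinite", "a", "man")
prepositional = cons("Prepositional", "of", "man")
# The prepositional construction itself becomes a child here:
genitive      = cons("Genitive", "silhouetto", prepositional)

# The cloud is the flat collection of everything, words included.
cloud = set("I see a little silhouetto of a man".split()) | {
    subject, obj, indef_sil, attribute, indef_man, prepositional, genitive}
```

Note how genitive takes a composite construction as an argument, which is exactly the higher-order case mentioned above.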

Of course, composite constructions don't need to have exactly 2 arguments. For example, the more X the more Y construction has 6 arguments, 4 of which are constant words. X is Y to Z (as in John is hard to convince) has 5 arguments with 3 slots (actually 4: is isn't fixed either, being replaceable with was).
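One way to picture such a multi-argument construction is as a template where the constant words are literals and the slots are placeholders. The match helper below is purely hypothetical, a sketch rather than a grammar engine:

```python
# "the more X the more Y": 6 arguments, 4 of them constant words.
# None marks a slot to be filled by another construction.
the_more_the_more = ("the", "more", None, "the", "more", None)

def match(template, tokens):
    """Return the slot fillers if tokens fit the template, else None."""
    if len(tokens) != len(template):
        return None
    fillers = []
    for slot, token in zip(template, tokens):
        if slot is None:
            fillers.append(token)   # a free slot accepts anything
        elif slot != token:
            return None             # a constant word must match exactly
    return fillers

fillers = match(the_more_the_more,
                ["the", "more", "merrier", "the", "more", "fun"])
```

A real grammar would of course let slots be filled by whole constructions rather than single tokens, and would treat is/was as a constrained slot rather than a constant.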

And yes, clouds are better than trees.

What else is wrong with the trees?

I've already explained why I don't like using trees as an intermediate parser representation. Here's why I don't want to use them as the final parser representation either, i.e. for capturing the syntactic structure of a text.

1. Double attachment

I wrote a letter to my friend. Did I write something (a letter) to my friend, or was it the letter to my friend that I wrote? I have no idea: both alternatives seem plausible to me. So why should I make an unmotivated choice if I can have to my friend attached to both wrote and letter? But the result won't be a tree.
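In a flat collection of records, this double attachment is unproblematic: the shared part simply ends up with two parents. A minimal sketch, with all construction names invented:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Construction:
    name: str
    children: Tuple["Construction", ...] = ()

to_my_friend = Construction("to my friend")
wrote = Construction("wrote")
letter = Construction("letter")

# Nothing stops one construction from having two parents:
recipient = Construction("Recipient", (wrote, to_my_friend))
attribute = Construction("Attribute", (letter, to_my_friend))

cloud = {to_my_friend, wrote, letter, recipient, attribute}
parents = [c for c in cloud if to_my_friend in c.children]
```

The result is a directed acyclic graph rather than a tree, which is exactly what this example calls for.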

2. Overlapping

zum means zu (preposition) + dem (dative definite article). If the tree has separate slots for prepositions and articles (e.g. PP and DP heads), this word would go to both, which is not tree-like.

Welsh has first-conjunct subject-verb agreement. For example, in the Welsh translation of the (very "natural") I and my brother saw ourselves, saw agrees with the singular I, whilst ourselves agrees with the plural compound I and my brother. If we believe that agreeing items should be directly connected in the syntactic structure (not necessarily true, but it would be nice), then saw should be directly connected both with I (agreement) and with I and my brother (subject).

However, Doctor, I'll stay here. IMHO any text should be analyzed as is. In particular, punctuation should be processed in the same way as words, since it's part of the text and is subject to the grammatical rules of the language called Written English. But however at the beginning of a clause requires a comma after it. The title of the Doctor being addressed also requires surrounding commas. So, where does the comma between however and Doctor belong? If you ask me, it belongs to both places, and this again isn't tree-like.

3. Discontinuous constituents

Actually, there's nothing wrong with them from the programmer's perspective. But both phrase-structure and dependency grammars prohibit discontinuous constituents: for theoretical reasons, they don't allow the lines connecting their nodes to cross.

So, for me trees are also not very convincing as the 'ideal' syntactic representation. But is there an alternative? Yes, and more than one: attribute-value matrices, for example. Or constructions. But that's another story.

Monday, February 8, 2010

Trees in NL parser

Imagine you're a parser analyzing the famous The horse raced past the barn fell. Additionally, you're operating in an incremental left-to-right fashion, entertaining a single active parse variant at each moment. Whatever the theory, you'll probably combine the horse and past the barn into some kind of units, say NP1 and PP2. Before you see the last word, you'll probably consider NP1 the subject of the finite verb raced, and PP2 its adjunct. But then fell comes. At this moment you'll realize that the single active parse variant you have is wrong, return to raced and reanalyze it as a passive participle (The horse [that was] raced past the barn fell). This is all well known and boring.

What's not boring is the implications for a computational parser operating in a similar way. First, it should somehow be able to restore its state right after NP1 and reparse from there, choosing the passive-participle alternative of raced. Second, it would be nice not to reparse PP2 again: this word sequence will produce the same result anyway and function as an adjunct, though now in the relative clause.

Since we're going to store the parsing state and analyze it in various ways (e.g. find slots to fill), we'd better store it as data. What kind of data? The standard answer is an (incomplete) tree. I'll show why that's not convenient.

Let's consider mutable trees first. Nodes can be added to or removed from any position in such trees at any time. Therefore, to store anything we first need to copy it completely and set it aside, so that subsequent changes in the active tree aren't reflected in the copy. This applies both to the parser state after NP1 and to the I'd-like-it-cached PP2. But we don't want to copy the parser state at every moment just in case we'll have to backtrack there in the future: we'd run out of memory very soon.

An alternative approach I see is not to copy anything, but to thoroughly log every change made to the tree in an undoable manner. When a need for backtracking arises, we can just walk back through the log, undoing every single change, and restart. This solves the state-saving problem, but offers no caching for what's undone: we'd still have to reparse PP2 anew.
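A minimal sketch of such an undo log, assuming each mutation is recorded together with a closure that reverses it (the tree here is a trivial dict, just enough to show the mechanics):

```python
class UndoLog:
    """Records inverse actions so tree mutations can be rolled back."""

    def __init__(self):
        self._undo = []

    def record(self, undo_action):
        self._undo.append(undo_action)

    def checkpoint(self):
        return len(self._undo)

    def backtrack(self, checkpoint):
        # Run the inverse actions in reverse order until the checkpoint.
        while len(self._undo) > checkpoint:
            self._undo.pop()()

tree = {"S": []}   # a stand-in for a real mutable parse tree
log = UndoLog()

def attach(parent, child):
    tree[parent].append(child)
    log.record(lambda: tree[parent].remove(child))

mark = log.checkpoint()
attach("S", "NP1")
attach("S", "PP2")
log.backtrack(mark)  # both attachments are undone...
```

...and PP2 is simply gone: nothing in this scheme preserves it for reuse, which is precisely the caching problem.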

So, mutable trees don't appear to be useful for our particular parser. Let's look at immutable trees. Their nodes are created once and fixed forever; no nodes can be added or removed. The only way to attach a constituent to a tree is to produce a new tree with the constituent already in the right place. The parts of the old tree not directly dominating the attachment place can be reused in the new tree; everything else has to be recreated. Luckily, the reused part is usually much bigger, so not much memory is consumed. We can now easily save the current parsing state at any moment without fear. And we can also cache PP2 before the reparse, since no one can hurt it anymore: it's immutable.
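This path-copying idea can be sketched like so. attach_rightmost is an illustrative helper (attaching under the rightmost leaf, the simplest case): it rebuilds only the spine down the right edge and reuses everything else untouched:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    label: str
    children: Tuple["Node", ...] = ()

def attach_rightmost(root, new_child):
    """Return a NEW tree with new_child under the rightmost leaf.
    Only nodes dominating the attachment place are recreated."""
    if not root.children:
        return Node(root.label, (new_child,))
    rebuilt = attach_rightmost(root.children[-1], new_child)
    return Node(root.label, root.children[:-1] + (rebuilt,))

np = Node("NP", (Node("the"), Node("horse")))
vp = Node("VP", (Node("raced"),))
s = Node("S", (np, vp))

s2 = attach_rightmost(s, Node("PP"))
```

The old tree s is untouched and can serve as a saved parser state, while s2 shares the whole NP subtree with it.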

But all this doesn't come for free. You have to keep track of all the places in the partial tree where attachment is possible. For example, in I gave the ... there are 2 slots waiting to be filled: the noun after the and the second argument of gave. Both must somehow be represented in the tree, and at least one of them won't be on the top level.

The slots to be filled may also not be marked explicitly. In I saw a girl with a telescope, the PP with a telescope may be attached either to the noun girl or to the verb saw. Since neither requires it, the parser has to walk down the whole tree, find all the constituents touching the right boundary, and then try to combine each of them with the PP. Moreover, psycholinguists observe that such an adjunct PP is most often attached to the lowest suitable constituent. Probably it's easier that way for humans, but in a computational parser this would be a very resource-consuming operation.
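The right-boundary walk itself is short to write, but it has to be repeated at every attachment decision. A sketch, with nodes as plain (label, children) pairs and all labels hypothetical:

```python
def right_edge(node):
    """Yield the constituents touching the right boundary, deepest first."""
    label, children = node
    if children:
        yield from right_edge(children[-1])
    yield node

# "I saw a girl" just before "with a telescope" arrives:
partial = ("S", [("NP", [("I", [])]),
                 ("VP", [("saw", []),
                         ("NP", [("a", []), ("girl", [])])])])

# Attachment candidates for the incoming PP, lowest constituent first,
# matching the preference psycholinguists report for humans.
candidates = [label for label, _ in right_edge(partial)]
```

Deepest-first ordering mirrors the low-attachment preference, but note that the walk visits the whole right spine on every incoming word.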

Finally, the attachment place may not be located on the right boundary at all. This is the infrequent but still important case of discontinuous constituents, usually occurring in free-word-order languages like Russian, Latin or Warlpiri. English also has such examples:

The question has been asked, if there's life on Mars.
"Tomorrow," he said, "I'll return".
Yesterday, however, they left.

In the first example, we should attach the subordinate clause if there's life on Mars to the NP the question, which doesn't touch the right boundary at all. The tree we have at that moment is something like

[S: [NP: the question] [VP: has been asked]]

and represents a complete grammatical sentence. So, do we have to search the whole tree to find the constituent to attach to? Nonsense. But I have no idea what a reasonable attachment strategy would look like that uses immutable trees and covers the presented data.

Of course, all these problems are solvable. I just don't want to implement those solutions, due to their clumsiness, complexity and likely bug-riddenness. I'd rather use a different intermediate representation for the serial parser. Which one? That's another story.