Tuesday, December 16, 2008

NLParser architecture

A typical NLP program faces lots of tasks. They include tokenization, lemmatization, dealing with word order, resolving ambiguities, building some structure, etc. The parser suggested here will deal with only one of those problems: word order. The task will be to convert token stream to the simplest structure, independent of it. It appears that functional composition is a perfect example of such a structure. For the sentence 'A kiss is a terrible thing to waste' this structure will look like

(is (a kiss) ((to waste) (a (terrible thing))))

or, in a more usual notation:

is(a(kiss))(to(waste)(a(terrible(thing))))

This would be the result of parser: just an expression, where some functions are invoked being passed another invocations' results as arguments. Functions can be of any kind, it depends of an interpreter. The parser may not only combine the tokens it gets into function calls, but also insert some auxiliary functions, e.g. to mark the missing word order info. In free word order languages like Russian where some meanings are expressed solely by word order this will be the case.

Basically the parser will deal with two interpreters. One will be syntactic, where the result of '(terrible thing)' will be almost the same as simple 'thing' (a noun phrase). This result will be used by parser itself, to help constructing the functional tree. The parser will known about all the possible values he syntactic interpreter could produce.

The syntactic evaluator will probably produce single result for any functional composition, i.e. it will not support alternative word meanings. The parser will probably concurrently support several different functional compositions until it can make a choice. It should also be able to rollback any choice it had made and reparse from that moment based on new information, if at some moment all the interpretations it has prove to be false.

The second evaluator will be of semantic nature, problem-dependent and replaceable. It will appear to the parser as a black box. It will be also invoked by the parser, but only to see whether an application (A B) will produce any results, i.e. it will have some semantic sense. The other thing we could do with anything is to compare by equality. So we'll see whether the results of two different compositions are in fact the same. For example, with several adjectives accompanying one noun it often doesn't matter in which order they go, so '(green (colorless ideas))' and '(colorless (green ideas))' will produce the same result in such tasks. This would mean that our parser won't have to maintain these two variants, one of them will be sufficient.

At each moment the parser is going to produce new function application, say, having A and B, to bind them via (A B), it evaluates '(A B)' syntactically and semantically. If both results are good, then the tree '(A B)' will certainly live further. In other cases, the parser has to choose the most promising variants from the set it has (remember, it maintains several concurrent trees). The variant that has only syntactic interpretation (but no semantic) will be better than the semantic-only variant. The variant having no interpretations will also have no chance of survival. Some psycholinguists say this is really similar to what the human brains do.

That's it, in short. The only thing left is to invent some semantics interpreter and implement this parser.

No comments: