Tuesday, December 16, 2008

NLParser architecture

A typical NLP program faces lots of tasks. They include tokenization, lemmatization, dealing with word order, resolving ambiguities, building some structure, etc. The parser suggested here will deal with only one of those problems: word order. The task will be to convert token stream to the simplest structure, independent of it. It appears that functional composition is a perfect example of such a structure. For the sentence 'A kiss is a terrible thing to waste' this structure will look like

(is (a kiss) ((to waste) (a (terrible thing))))

or, in a more usual notation:

is(a(kiss))(to(waste)(a(terrible(thing))))

This would be the result of parser: just an expression, where some functions are invoked being passed another invocations' results as arguments. Functions can be of any kind, it depends of an interpreter. The parser may not only combine the tokens it gets into function calls, but also insert some auxiliary functions, e.g. to mark the missing word order info. In free word order languages like Russian where some meanings are expressed solely by word order this will be the case.

Basically the parser will deal with two interpreters. One will be syntactic, where the result of '(terrible thing)' will be almost the same as simple 'thing' (a noun phrase). This result will be used by parser itself, to help constructing the functional tree. The parser will known about all the possible values he syntactic interpreter could produce.

The syntactic evaluator will probably produce single result for any functional composition, i.e. it will not support alternative word meanings. The parser will probably concurrently support several different functional compositions until it can make a choice. It should also be able to rollback any choice it had made and reparse from that moment based on new information, if at some moment all the interpretations it has prove to be false.

The second evaluator will be of semantic nature, problem-dependent and replaceable. It will appear to the parser as a black box. It will be also invoked by the parser, but only to see whether an application (A B) will produce any results, i.e. it will have some semantic sense. The other thing we could do with anything is to compare by equality. So we'll see whether the results of two different compositions are in fact the same. For example, with several adjectives accompanying one noun it often doesn't matter in which order they go, so '(green (colorless ideas))' and '(colorless (green ideas))' will produce the same result in such tasks. This would mean that our parser won't have to maintain these two variants, one of them will be sufficient.

At each moment the parser is going to produce new function application, say, having A and B, to bind them via (A B), it evaluates '(A B)' syntactically and semantically. If both results are good, then the tree '(A B)' will certainly live further. In other cases, the parser has to choose the most promising variants from the set it has (remember, it maintains several concurrent trees). The variant that has only syntactic interpretation (but no semantic) will be better than the semantic-only variant. The variant having no interpretations will also have no chance of survival. Some psycholinguists say this is really similar to what the human brains do.

That's it, in short. The only thing left is to invent some semantics interpreter and implement this parser.

Thursday, December 11, 2008

Academic officers: top-down contextual ambiguity resolution

'The officers taught at the academy'. This sentence appears unambiguous to us until we are presented with an upgraded version: 'The officers taught at the academy were very demanding'. Functional structures for them will be:

(1) ((at (taught (the officers))) (the academy))
(2) ((were ((at (taught (the officers))) (the academy))) (very demanding))

These structures share common part, namely the whole first sentence. But this common part should be evaluated to different results: in (1) it would be complete clause - sentence, while in (2) it's only a noun phrase. So it appears that the first sentence should actually have two valid interpretations, but we've already understood it unambiguously!

This can be resolved by implying that the whole construction should always have clause type. So when we assume 'taught' having two alternatives and calculate the (1) expression value with both, we could safely remove the noun phrase since it can't be final result. This top-level clausal implication can be even explicated by introducing one more function here, whose name could be '.' (dot). The only thing for it to do could be just to filter clauses out of all its argument's alternatives. But there's another way.

Let's assume that every functional language expression has only one interpretation. It's simple and it looks like humans do the same. So when (1) is considered in its standalone version, it will be unambiguously a clause. But then it appears that the sentence is not finished, and we get 'were' applied to this complex expression, which can't take clauses as arguments. So we now have to evaluate (1) in a different way, knowing now that its context requires it to be noun phrase. It should be not hard to find another alternative of 'taught' that will suit these new requirements: a passive participle.

At the top, a clause type will be expected, on other levels it will depend on a particular function. Function 'were' will want a noun phrase (and so will 'at' and 'the' do), the resulting function '(were ...)' will need a gerund ('very demanding').

Technically, the evaluator has to get the context requirements from somewhere. We can't calculate the expression result part-by-part, starting with '(the officers)' and then applying other functions to it, we should have the whole expression in our hands. It also means that functions don't take arguments as eagerly computed values any more, they take arguments as data instead and evaluate that data by themselves in a right context. Hello, LISP's macros.

Wednesday, December 10, 2008

Time flies like an arrow: an ambiguity resolution example

Let's suppose we have to analyze a well-known ambiguous phrase 'Time flies like an arrow'. Here all three first words can be sentence predicates. Presented as a function composition, these alternatives will look as follows (in LISP notation, with function separated from its argument by space, both inside parentheses):

(1) 'Time' is a verb: ((like (an arrow)) (time flies))
(2) 'Flies' is a verb: ((like (an arrow)) (flies time))
(3) 'Like' is a verb: ((like (time flies)) (an arrow))

All of these applicative representations are self-disambiguating. Let's list the lexical word ambiguities. 'Time' can be either singular noun or a verb, 'flies' - plural noun or a singular third-person finite verb, 'like' - a verb or an adverb.

The result of '(an arrow)' occurring in all three alternatives is always the same, it means some object (it may be ambiguous, but the context doesn't provide more information, and the fact it's inanimate noun should satisfy us now).

We see 'time' being applied to 'flies' in the first and third variants. The result is ambiguous, it's either an order to time some flies (e.g. measure the time of their flight) or a description of some flies (maybe they fly through time from future to past).

If we exchange the function-argument roles, we get '(flies time)' from the second variant. Here the situation is better: 'flies' as noun doesn't have actants, therefore can't be a function, so only its verb-alternative remains. The latter also can't be applied to verbs in any way, so 'time' is noun there now, and '(flies time)' has clause type.

As for the '(like (an arrow))' construction, of course 'like' here can be an adverb describing the similarity of something to an arrow, but it can also be a verb. It can't be a usual narrative sentence predicate, since then it would require a subject, which should be passed to it as first argument by parser. And '(an arrow)' can't be this subject since it's 3d person singular requiring the verb to end with 's'. But 'like' here can still be imperative or infinitive verb. Both can't be applied then to clause '(flies time)', which disambiguates the second variant completely, giving the usual proverb about time which is flying too fast.

The verb 'like' also has only one direct object, which will be filled by '(an arrow)', so the resulting construction can't be applied to '(time flies)': we remember, it's either an imperative clause, or a noun phrase. This leaves only adverbial alternative for 'like'. It can modify only clause, so '(time flies)' must be really an imperative clause. Congratulations, we've disambiguated (1). We have to measure the flight time of some flies, and do this in a precise way, like an arrow, whatever this could mean.

Only the third variant is left. Here 'like' is applied to '(time flies)' and the result is applied to '(an arrow)' then. Imperative variant of '(time files)' can't be an argument of both verb and noun variants of 'like', and the noun 'like' doesn't have actants. This leaves us with verb 'like' eating noun phrase '(time flies)', which can be its subject when the verb is finite or its object when the verb is imperative or infinitive. An attempt to eat also '(an arrow)' then removes the non-evident imperative and infinitive alternatives, making this interpretation also unambiguous. We will know that all the future-to past flies are fond of some arrow.

Saturday, December 6, 2008

Статейко

Проект статьи про представление текста в виде композиции функций