Sepulkarium: наука

Showing posts with label наука. Show all posts

Tuesday, December 16, 2008

NLParser architecture

A typical NLP program faces lots of tasks. They include tokenization, lemmatization, dealing with word order, resolving ambiguities, building some structure, etc. The parser suggested here will deal with only one of those problems: word order. The task will be to convert token stream to the simplest structure, independent of it. It appears that functional composition is a perfect example of such a structure. For the sentence 'A kiss is a terrible thing to waste' this structure will look like

(is (a kiss) ((to waste) (a (terrible thing))))

or, in a more usual notation:

is(a(kiss))(to(waste)(a(terrible(thing))))

This would be the result of parser: just an expression, where some functions are invoked being passed another invocations' results as arguments. Functions can be of any kind, it depends of an interpreter. The parser may not only combine the tokens it gets into function calls, but also insert some auxiliary functions, e.g. to mark the missing word order info. In free word order languages like Russian where some meanings are expressed solely by word order this will be the case.

Basically the parser will deal with two interpreters. One will be syntactic, where the result of '(terrible thing)' will be almost the same as simple 'thing' (a noun phrase). This result will be used by parser itself, to help constructing the functional tree. The parser will known about all the possible values he syntactic interpreter could produce.

The syntactic evaluator will probably produce single result for any functional composition, i.e. it will not support alternative word meanings. The parser will probably concurrently support several different functional compositions until it can make a choice. It should also be able to rollback any choice it had made and reparse from that moment based on new information, if at some moment all the interpretations it has prove to be false.

The second evaluator will be of semantic nature, problem-dependent and replaceable. It will appear to the parser as a black box. It will be also invoked by the parser, but only to see whether an application (A B) will produce any results, i.e. it will have some semantic sense. The other thing we could do with anything is to compare by equality. So we'll see whether the results of two different compositions are in fact the same. For example, with several adjectives accompanying one noun it often doesn't matter in which order they go, so '(green (colorless ideas))' and '(colorless (green ideas))' will produce the same result in such tasks. This would mean that our parser won't have to maintain these two variants, one of them will be sufficient.

At each moment the parser is going to produce new function application, say, having A and B, to bind them via (A B), it evaluates '(A B)' syntactically and semantically. If both results are good, then the tree '(A B)' will certainly live further. In other cases, the parser has to choose the most promising variants from the set it has (remember, it maintains several concurrent trees). The variant that has only syntactic interpretation (but no semantic) will be better than the semantic-only variant. The variant having no interpretations will also have no chance of survival. Some psycholinguists say this is really similar to what the human brains do.

That's it, in short. The only thing left is to invent some semantics interpreter and implement this parser.

Thursday, December 11, 2008

Academic officers: top-down contextual ambiguity resolution

'The officers taught at the academy'. This sentence appears unambiguous to us until we are presented with an upgraded version: 'The officers taught at the academy were very demanding'. Functional structures for them will be:

(1) ((at (taught (the officers))) (the academy))
(2) ((were ((at (taught (the officers))) (the academy))) (very demanding))

These structures share common part, namely the whole first sentence. But this common part should be evaluated to different results: in (1) it would be complete clause - sentence, while in (2) it's only a noun phrase. So it appears that the first sentence should actually have two valid interpretations, but we've already understood it unambiguously!

This can be resolved by implying that the whole construction should always have clause type. So when we assume 'taught' having two alternatives and calculate the (1) expression value with both, we could safely remove the noun phrase since it can't be final result. This top-level clausal implication can be even explicated by introducing one more function here, whose name could be '.' (dot). The only thing for it to do could be just to filter clauses out of all its argument's alternatives. But there's another way.

Let's assume that every functional language expression has only one interpretation. It's simple and it looks like humans do the same. So when (1) is considered in its standalone version, it will be unambiguously a clause. But then it appears that the sentence is not finished, and we get 'were' applied to this complex expression, which can't take clauses as arguments. So we now have to evaluate (1) in a different way, knowing now that its context requires it to be noun phrase. It should be not hard to find another alternative of 'taught' that will suit these new requirements: a passive participle.

At the top, a clause type will be expected, on other levels it will depend on a particular function. Function 'were' will want a noun phrase (and so will 'at' and 'the' do), the resulting function '(were ...)' will need a gerund ('very demanding').

Technically, the evaluator has to get the context requirements from somewhere. We can't calculate the expression result part-by-part, starting with '(the officers)' and then applying other functions to it, we should have the whole expression in our hands. It also means that functions don't take arguments as eagerly computed values any more, they take arguments as data instead and evaluate that data by themselves in a right context. Hello, LISP's macros.

Wednesday, December 10, 2008

Time flies like an arrow: an ambiguity resolution example

Let's suppose we have to analyze a well-known ambiguous phrase 'Time flies like an arrow'. Here all three first words can be sentence predicates. Presented as a function composition, these alternatives will look as follows (in LISP notation, with function separated from its argument by space, both inside parentheses):

(1) 'Time' is a verb: ((like (an arrow)) (time flies))
(2) 'Flies' is a verb: ((like (an arrow)) (flies time))
(3) 'Like' is a verb: ((like (time flies)) (an arrow))

All of these applicative representations are self-disambiguating. Let's list the lexical word ambiguities. 'Time' can be either singular noun or a verb, 'flies' - plural noun or a singular third-person finite verb, 'like' - a verb or an adverb.

The result of '(an arrow)' occurring in all three alternatives is always the same, it means some object (it may be ambiguous, but the context doesn't provide more information, and the fact it's inanimate noun should satisfy us now).

We see 'time' being applied to 'flies' in the first and third variants. The result is ambiguous, it's either an order to time some flies (e.g. measure the time of their flight) or a description of some flies (maybe they fly through time from future to past).

If we exchange the function-argument roles, we get '(flies time)' from the second variant. Here the situation is better: 'flies' as noun doesn't have actants, therefore can't be a function, so only its verb-alternative remains. The latter also can't be applied to verbs in any way, so 'time' is noun there now, and '(flies time)' has clause type.

As for the '(like (an arrow))' construction, of course 'like' here can be an adverb describing the similarity of something to an arrow, but it can also be a verb. It can't be a usual narrative sentence predicate, since then it would require a subject, which should be passed to it as first argument by parser. And '(an arrow)' can't be this subject since it's 3d person singular requiring the verb to end with 's'. But 'like' here can still be imperative or infinitive verb. Both can't be applied then to clause '(flies time)', which disambiguates the second variant completely, giving the usual proverb about time which is flying too fast.

The verb 'like' also has only one direct object, which will be filled by '(an arrow)', so the resulting construction can't be applied to '(time flies)': we remember, it's either an imperative clause, or a noun phrase. This leaves only adverbial alternative for 'like'. It can modify only clause, so '(time flies)' must be really an imperative clause. Congratulations, we've disambiguated (1). We have to measure the flight time of some flies, and do this in a precise way, like an arrow, whatever this could mean.

Only the third variant is left. Here 'like' is applied to '(time flies)' and the result is applied to '(an arrow)' then. Imperative variant of '(time files)' can't be an argument of both verb and noun variants of 'like', and the noun 'like' doesn't have actants. This leaves us with verb 'like' eating noun phrase '(time flies)', which can be its subject when the verb is finite or its object when the verb is imperative or infinitive. An attempt to eat also '(an arrow)' then removes the non-evident imperative and infinitive alternatives, making this interpretation also unambiguous. We will know that all the future-to past flies are fond of some arrow.

Saturday, December 6, 2008

Статейко

Проект статьи про представление текста в виде композиции функций

Saturday, December 29, 2007

Как писать парсеры

Обычно парсеры языков программирования пишутся в виде грамматики, которая либо скармливается потом генератору, либо уже представляет собой код с использованием парсер-комбинаторов. К сожалению, если язык сложен, грамматика его тоже непроста. А сложным, надо сказать, становится любой язык, когда в него нужно добавить разумную обработку ошибок и восстановление после таковых, и при этом выдать какое-то достаточно вменяемое синтаксическое дерево на выходе. Ровно это необходимо для хорошей поддержки языка в IDE.

Поэтому в самой знакомой мне IDE парсеры пишутся обычно руками, рекурсивным спуском, скучно, занудно и однообразно (чего стоит один практически идентичный код, раскопированный для каждой бинарной операции). Восстановление после ошибок тогда получается достаточно несложно.

Так вот, не рекурсивным спуском единым жив человек. Уже почти устоялось API для другого метода, придуманного когда-то Праттом, примененного Крокфордом, до боли напоминающего вдохновленный языком Forth метод Тузова, наконец, дошедшего до меня и творчески переработанного. Про это API и расскажу.

Итак, по старинной фортовской привычке мы начинаем с объявления данных кодом. Данные у нас - это поток токенов от лексера, вот они-то и будут кодом. У каждого токена будет свой собственный маленький кодик, вот такого вида:


interface TokenParser {
  boolean parseToken(PrattBuilder builder);
}

ПраттБилдер (лучшие варианты названия принимаются) - это интерфейс к системе, он - наше все, двигает лексер, хранит промежуточные состояния, строит дерево и запускает токенпарсеры. Построен он поверх идейского PsiBuilder'а и обладает сходной приятной функциональностью по прозванию маркеры. Маркер - это очень легкая штука, позволяющая запомнить место в потоке токенов. Потом к этому месту можно будет откатиться, либо создать новый узел дерева с началом в нем.

С каждым токеном ассоциируется еще и число, приоритет. Самый главный метод праттбилдера носит скромное название parse, принимает одно число, и идет последовательно, каждый раз для текущего лексерного токена вызывая его токенпарсеры, пока приоритеты токенов строго больше числа-параметра либо пока какой-нибудь токенпарсер не вернет false. Сами токенпарсеры внутри себя могут делать что угодно, сдвигать лексер насколько угодно (в том числе путем вызова билдеровского parse рекурсивно), но не в их власти откатиться раньше токена, для которого их вызвали. Билдер всегда помнит место, в котором вызвали его parse, и умеет создавать узлы дерева, начинающиеся там.

Это была фортовская часть. Теперь к Пратту. Самый частый вариант токенпарсера, DefaultTokenParser, реализован заветами этого товарища, и умеет различать два случая вызова себя и делегировать всю работу одному из двух подпарсеров - соответствующих своих аргументов конструктора. Первый случай, "свежачок", когда токен встретился сразу же в билдеровском parse, слева от него ничего еще нет, у Пратта это зовется nud, у меня пока - PrefixParser (better name needed!!!). Второй случай, когда слева уже что-то есть, и там даже построился узел дерева с определенным типом, у Пратта - led, у меня пока - ProceedingParser (oh name...). Первое используется, когда по токену можно понять, какую конструкцию он начинает (унарная префиксная операция), второй - какую конструкцию он продолжает (инфиксная бинарная операция или много чего еще) или закачивает (унарная постфиксная операция).

У токена нечасто определены оба компонента, самый очевидный пример токена, умеющего и начинать и продолжать - "+" или "-". Если nud или led нужен, но отсутствует, генерируется ошибка. Токен, который мы анализировали, нам уже неинтересен, мы с него все, что хотели, получили, поэтому перед вызовом подпарсеров лексер сдвигается вперед. Оба подпарсера возвращают тип узла, который надо построить от стартового маркера до текущего места. Обычно алгоритм разбора токенпарсера по умолчанию получаются следующим: вызываем nud первого токена, а потом, пока не посинеем, вызываем led у текущих токенов.

Пример простого нуда на псевдоязыке:


unary(IElementType type, int priority) = fun (builder) ->
     builder.parse(priority);
     return type

И еще проще:


tokenElement(IElementType type) = fun (builder) ->
     return type

Пример простого леда (бинарная левоассоциативная операция) почти дословно такой же:


binary(IElementType type, int priority) = fun (leftType, builder) ->
     builder.parse(priority);
     return type

Постфиксная операция еще проще:


postfix(IElementType type) = fun (leftType, builder) ->
     return type

Распарсить простые выражения можно теперь с помощью весьма лаконичного кода:


NUMBER: priority=10, nud=tokenElement(NUMBER_LITERAL)
IDENTIFIER: priority=10, nud=tokenElement(QNAME)

"(": priority=10, nud=fun (builder: PrattBuilder) -> 
    builder.parse(0);
    builder.assertToken(")");
    return PARENTHESES

".": priority=9, led=binary(QNAME, 9)
"!": priority = 8, nud=unary(NEGATION, 8)

"+","-": priority=8, 
      nud=unary(ARITHMETIC_UNARY, 8), 
      led=binary(ARITHMETIC_BINARY, 6)
"*","/": priority=7, led=binary(ARITHMETIC_BINARY, 7)

Парсер, например, для сокращенного определения класса в Java лучше выражается через просто токенпарсеры:


"public", "static", etc.: priority = -100, 
  tokenParser = fun (builder) -> 
    builder.advance(); 
    return true;
"class": priority=-100, 
  tokenParser = fun (builder) ->
    builder.finishPrefix(MODIFIER_LIST);
    builder.advance();
    builder.assertToken(IDENTIFIER);
    if (builder.assertToken(EXTENDS)) {
      builder.advance();
      builder.parse(9);
    }
    builder.assertToken("{");
    builder.parse(-90);
    builder.assertToken("}");
    return JAVA_CLASS

Это уже достаточно напоминает рекурсивный спуск. В принципе, мне сей метод видится как гармоничное сочетание разбора сверху, снизу и стека приоритетов операторов. В каждом месте программист вправе выбрать, что лучше в данном конкретном случае.

В данный момент на праттовской основе (только, увы, в синтаксисе Java) в достаточной степени написан парсер для языка шаблонов FreeMarker, и процесс мне нравится. Парсер получается более прямолинейный, проверок условий в нем намного меньше, чем в обычном рекурсивном спуске. Это можно рассматривать как паттерн Visitor, он же полиморфизм, мы вместо делания многочисленных проверок на тип вызываем абстрактный метод у самого типа.

Из достоинств еще можно упомянуть модулярность. Можно теоретически взять и поменять парсер или приоритет у токена. Что важнее, можно легко начать парсить не сначала, а из любой середины, почти с любого токена, что открывает неплохие возможности для инкрементального репарса.

Главный недостаток - меньшая понятность. Суть рекурсивного спуска понятна сразу, потом только начинаешь увязать в конкретных бесчисленных флажках, циклах, условиях и маркерах, но это уже детали реализации конкретного парсера. Тут флажков, циклов, условий и маркеров поменьше, но зато основная идея ясна не сразу, непрозрачна. И конкретные числа несколько мешают восприятию.

Еще особенность, уж не знаю, достоинство или недостаток. При написании парсера надо думать. Этот парсер не совсем похож на грамматику, и не пишется на автомате, глядя на таковую. Скорее всего, такие парсеры тоже можно генерировать по грамматике, для этого надо только погрузиться в соответствующую науку. Но это уже совсем другая история.

Sepulkarium