Saturday, April 30, 2011

Disambiguation by constructions

A word may have many senses and a good parser must choose the right one. Which is the right one? This is often obvious from the context. When you see an ambiguous word alone (e.g. cooks), you can tell nothing about its meaning. But if you see it in context, you may start guessing its syntactic and/or semantic behavior (our cooks have prepared the dinner vs. Mary cooks fish). Surrounding words greatly help in determining the part of speech, and many disambiguation algorithms take advantage of that.

But who on earth cares about parts of speech? Well, many do. Parsers, for example, both statistical and declarative, use this information to build all kinds of structures. But that's an intermediate thing, used by those parsers for their own convenience. For the ultimate text analysis tasks, parts of speech are not important at all. The meaning is what's important, not them. So why bother at all? I'm currently trying to live without the intermediate part-of-speech level in my parser, and so far it works. How?

Consider cooks again. It can participate in the following constructions:
  1. she cooks
  2. cooks fish
  3. cooks when something happens
  4. cooks well
  5. the cooks
  6. sad cooks
  7. cooks that came from Germany
  8. and so on
Of those, 1-4 are "verbal": they refer to the same meaning of cooks, the process of cooking. In fact, they can all occur together in one sentence: She cooks fish very well when she's not tired. 5-7, on the other hand, are "nominal": they describe the people who cook for a living. They are also mutually compatible, and at the same time totally incompatible with 1-4.

Let's now say there are no nouns, verbs and so on; there are only constructions. Upon encountering cooks, the parser notes all the constructions possible with this word (at least 1-7 from above). It also marks some of them (1-4) as incompatible with others (5-7). Then another word comes, for example, fish. It also generates tons of constructions, some of them mutually incompatible as well (fish, too, can be a noun or a verb). Importantly, one of them is the familiar Transitive (number 2 in the list). It's been suggested by both words, so it clearly wins over the others, which were suggested by only one of the two words.

Now the constructions that are incompatible with this Transitive can be pruned: both the "nominal" ones for cooks and the "verbal" ones for fish. The Transitive is promoted and may now contribute to the meaning of the entire text (e.g. Cook.patient=Fish). Disambiguation complete.
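The voting and pruning steps above can be sketched in a few lines of Python. This is only a toy illustration, not the parser's actual code: the lexicon, the role names (verb, object, noun), the reading labels and the disambiguate function are all assumptions made up for the example. It also ignores word order and scores a construction simply by how many of its distinct roles the words offer to fill.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> list of (construction, role, reading).
# Construction and role names are invented for this sketch.
LEXICON = {
    "cooks": [
        ("Subject+Verb", "verb", "verbal"),
        ("Transitive", "verb", "verbal"),
        ("Determiner+Noun", "noun", "nominal"),
        ("Adjective+Noun", "noun", "nominal"),
    ],
    "fish": [
        ("Transitive", "object", "nominal"),
        ("Determiner+Noun", "noun", "nominal"),
        ("Subject+Verb", "verb", "verbal"),
    ],
}


def disambiguate(words):
    """Pick the best-supported construction and prune what conflicts with it."""
    # For every construction, record which word offers to fill which role.
    fillers = defaultdict(dict)
    for word in words:
        for construction, role, _reading in LEXICON[word]:
            fillers[construction].setdefault(role, word)

    # The winner is the construction with the most distinct roles filled.
    # (Ties are broken arbitrarily here; a real parser would need more care.)
    winner = max(fillers, key=lambda c: len(fillers[c]))

    # Each word keeps only the constructions that share a reading with its
    # role in the winner; words not involved in the winner keep everything.
    survivors = {}
    for word in words:
        readings = {r for c, _role, r in LEXICON[word] if c == winner}
        survivors[word] = [
            c for c, _role, r in LEXICON[word] if not readings or r in readings
        ]
    return winner, survivors


winner, survivors = disambiguate(["cooks", "fish"])
print(winner)
print(survivors)
```

With this lexicon, Transitive is the only construction with two roles filled, so it wins, and cooks loses its "nominal" constructions while fish loses its "verbal" one.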

Positive side: it's very simple, and I don't have to create boring data structures for different parts of speech with all those cases, moods, numbers, genders, etc. Negative side: every word now has to know all the contexts it can occur in. Both adjectives and nouns have to specify that they participate in the Adjective+Noun construction. That's quite unusual in rule-based parsing, where people try to hand-code as little redundancy as possible. Anyway, unusual doesn't mean bad. I'm not very much against redundancy, and I really like the overall simplicity.

Tuesday, April 12, 2011

Parsing and questions

The usual declarative sentences are considered simple by formal semantics. In fact they are not, but anyway they're the simplest there is. They even deceive people into believing that first-order logic is adequate for expressing their meaning. John loves Mary is loves(john, mary). Simple. It gets more interesting with quantifiers, especially when there is more than one (can you spot an ambiguity in Every man loves some woman?), but that's not my point. Remember, this was Kharms' first sentence:
Удивительный случай случился со   мной: я вдруг    забыл, что  идет раньше  - 7 или 8
Amazing      case   happened with me    I suddenly forgot what goes earlier - 7 or  8
An amazing thing happened to me today, I suddenly forgot what comes first - 7 or 8

The part of it that remains uncommented is the very last one, starting with what. It looks suspiciously like a question. I can imagine saying to myself: What comes first, 7 or 8? Damn, I forgot that! Yes, that's definitely a question. So, we have to step outside the comfortable world of declarative sentences and enter the darkness of what the advanced topics of formal semantics are about: interrogatives. Man, they even have a semester seminar on questions, only questions and nothing but questions!

But a quick look at their analyses is sufficient for me to realize that I don't like them. I don't want to implement that for the first 10 years (well, maybe less: they provide some code in Haskell), and then spend the rest of my life analyzing the resulting sets-of-sets-of-possible-worlds kinds of structures just to understand that it only means What comes first? I don't seek an absolute truth. For me, the simpler the structure and the closer it is to the surface, the better, as long as it's still acceptable as a true interlingua.

That said, I don't have much choice on how to represent the question from above. It's a clause containing a verb and a subject, and all these three entities are unusual in their own ways.

The unusual thing about the verb is that it consists of two words: come first. Actually, it's a more generic verb come X, where X can be any scalar value: first, next, previous, last, 42nd, etc.

The subject is also unusual since it's what, a typical wh-word which many questions start with. It also comes with variants at the end of the clause - 7 or 8. I consider this a special construction, characteristic of questions. Those 7 and 8 are just listed in the semantics as the variants slots of the what frame.

Finally, the clause is unusual since it has to mark in some way its questionness. It would also be nice if it could specify which part of the clause is asked about (here it's the subject what). These two things are solved by one means: the situation corresponding to this clause has a questioned attribute pointing to what. Simple.

One more thing: there should be a way of linking the question clause to the verb it depends on, forgot. It would also be nice to distinguish between the different things one can forget: real things (I forgot my cell phone), facts (I forgot that 2x2=4), some values (I forgot the area of Africa) and, finally, the answers to questions (I forgot what comes first). At least two of these variants employ clauses: facts and questions. Luckily, a fact's clause definitely won't have a questioned attribute, while in our case it will definitely be there. So indeed, we can just say that forgot's theme is the whole situation corresponding to the question, and that seems to be sufficient for the current purposes.
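To make the idea concrete, here is a minimal Python sketch of how such frames could be wired together. It is not the parser's real data structure: the Frame class, the clause_kind helper and the attribute names predicate, order and subject are assumptions invented for illustration. Only questioned, variants and theme come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A hypothetical semantic frame: a kind plus named attributes."""
    kind: str
    attrs: dict = field(default_factory=dict)


def clause_kind(situation):
    """A clause is a question iff its situation has a 'questioned' attribute."""
    return "question" if "questioned" in situation.attrs else "fact"


# "what comes first - 7 or 8": the what frame lists its variants,
# and the situation marks that very frame as the questioned part.
what = Frame("what", {"variants": [7, 8]})
comes_first = Frame(
    "situation",
    {"predicate": "come", "order": "first", "subject": what, "questioned": what},
)

# "I forgot what comes first": the theme of forgot is the whole situation.
forgot = Frame("situation", {"predicate": "forget", "theme": comes_first})

# "I forgot that 2x2=4": a fact clause has no questioned attribute.
fact = Frame("situation", {"predicate": "equal", "left": "2x2", "right": 4})

print(clause_kind(forgot.attrs["theme"]))
print(clause_kind(fact))
```

The point of the design is that forgot needs no separate marker for "question" versus "fact": looking at its theme is enough.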

So, now I'm finally ready to present the semantics built for the complete sentence. Well, almost ready. There remains that or in 7 or 8. That's a conjunction, and, as conjunctions are my favorite and a very interesting phenomenon, I'll discuss them later in great detail. So, the interlingual representation for the first Sonnet sentence is this: