Tuesday, October 25, 2011

NLP using graphs

Could natural language parsing be a task on graphs? Maybe, at least partly.

Now, during parsing, every word contributes to some constructions. Some of these contributions are mutually exclusive. For example, in Russian a noun typically can't be nominative and accusative at the same time. So this noun's contributions for nom and acc constructions are incompatible.

Different words also may have contradicting contributions. Two nouns in the same clause can't both be accusative even if their forms are compatible with such an alternative (i.e. they both contribute to acc construction).

So here's the idea. Consider a graph whose vertices are formed by each construction contribution for each word. And where those contributions contradict each other in any way, there's an edge. The task is then to find a maximal subset of vertices which are not connected to each other.

There are, of course, other constraints. The resulting graph should make sense from the semantic viewpoint. The subset should be constructed incrementally and conservatively: if the parser can proceed without reanalysis of the already built structures, it should do so. Finally, this graph task has a very limited scope, e.g. it doesn't apply in different clauses.

That's just an idea. Currently my parser doesn't track individual contributions per construction, it just unifies them eagerly. But the test data suggests that might change.

Saturday, July 16, 2011

Ellipsis

One of the great things about constructions is that you can do cool things with them which are not so easy with other parsing techniques. One example is ellipsis, when you may just omit any number of words if they are the same as in the previous clause:

По мнению одних дальше следовало 7, по мнению других — 8
PREP opinion some next went 7 PREP opinion others — 8
In the opinion of some, a 7 went next; but in opinion of others an 8 did

To translate this correctly, several things should be done. The parser should recognize that "" actually stands for went next. The semantic model should somehow incorporate this as well as the meta-knowledge that this went next is actually not specified explicitly (otherwise the generator would produce but in opinion of others an 8 went next). Finally, the generator should be able to replace the full verb phrase with did.

Parser


There's clearly some parallelism here. After processing in opinion of others the parser should already know that this is similar to in the opinion of some. The same happens with 8, which is similar to 7. Now the parser should fill in the dash gap with the missing information taken from the previous clause. This means it should have access to all actions it performed during processing of went next, and it should repeat them.

So, there are two problems. One is how to find similar fragments given that they consist of different words. The other is how to be aware of the earlier activity of the parser and be able to reproduce it in new contexts. The problems are quite tightly connected: it's the earlier activity where you search for something similar. And no surprise that both problems are solved by one means, which is, keeping the change log.

Every change in the parsing state is kept in a log. Luckily, the construction architecture is very suitable for this. Remember, each word tells the parser which constructions it participates in and which of them are mutually incompatible. It also contributes arguments to the constructions: specifying a noun of a nom(inative) construction is completely different from contributing a head (verb). Based on this information, the parser resolves the possible conflicts, determines the winning constructions and executes the code attached to them which makes the changes to the semantic model.

The log consists of the contributions each word has made. And by allowing constructions to have access to this log, the parser can tackle the two problems posed by ellipsis:

1. Similarity

7 and 8 are different words, but the contributions they make are quite similar: they invoke the same constructions (nom, gen, acc, ... — all cases) but provide different arguments to them: one creates a frame whose type is 7, another — the same with 8. So I infer that two contributions are similar if they contribute to the same construction and supply same kinds of arguments.

2. Repetition

The parser knows it's processing ellipsis. It knows where the ellipsis starts (in opinion of others) and ends (8). It has found the similar fragments in the previous clause (in the opinion of some and 7). Now it just has to look at the contribution history, find everything in between those two similar fragments and insert it between in opinion of others and 8. The copying happens upon encountering 8 after "—", and the own contribution of 8 is nicely integrated into the environment set by the copied history (e.g. it completes the noun of nom construction, whose head was set by copied went).

Intermediate representation


OK, now we have two went next situations, but only one of them was fully expressed in Russian. And we should do the same in translation. So we have to mark the copy in a special way. In the semantic model we have no contributions, we have only frames and assignments to their properties. Hence we should mark the effects of the copied contributions, i.e. the assignments they add.

For this I've introduced a notion of scope into the semantic model. A scope is just a frame that plays a special role during parsing. Which special role — depends on the frame and its attributes. Important is that it's activated at one moment and deactivated at another moment, with arbitrarily many assignments made in between. In the semantic log this is written as [X] ... [/X] where X is the scope frame and ... stands for the assignments.

I plan to use scopes for many more things, mostly of meta-linguistic kind: separating paragraphs, reflecting the markup, etc. Using it with ellipsis is simple. You just create a frame, say its type is ellipsis, activate it, do the whole contribution copying process (which surely adds some assignments) and deactivate it. Thus you get the nice clear separation of what's said and what's inferred:

A.type=OPINION
#1.opinion_of=A
B.type=SOME
A.arg1=B
#1.time=PAST
C.type=COME_SCALARLY
C.order=AFTER
C.arg1=D
D.type=7
D.number=true
#1.but=#2
--
A.type=OPINION
#2.opinion_of=A
B.type=OTHERS
A.arg1=B
C.type=ellipsis
[C]
D.type=COME_SCALARLY
D.order=AFTER
[/C]
D.arg1=E
E.type=8
E.number=true

Generator


Getting the generator to work is not interesting at all. It's just a matter of adding another condition, given that I mock the generation throughout anyway.


After writing all this I've suddenly thought that well, this is great, but possibly unnecessary. Maybe the right way is not to replicate the history but just mark the whole second clause situation as ellipsis, parse only what's in the text, attach it somewhere and assume that the inference engine (mocked) should be responsible for inferring the went next part. But I don't have enough data yet to make a choice.

Sunday, May 8, 2011

Lost in translation simulation

That I don't write about my parser doesn't mean I don't write it. I just can't be bothered to write both. Most of the time I prefer coding. And I can't say there's a lack of interesting problems. For example, here's a part of the 12th sentence:

по мнению одних дальше следовало 7, по мнению других - 8
In the opinion of some, a 7 went next; but in opinion of others an 8 did

There are many interesting things here. Just look at this nice little ellipsis (the dash before 8)! But what worries me the most is English. Or just my English. I've found that I don't know it enough to understand why the translation looks as it looks and how I'm supposed generate this translation.

Take the articles for example: a 7 went next; an 8 did. Why on earth are the numbers indefinite? Are there many sevens and is it just one of them? I don't get it. But it's all over the translation. Only in the first sentence 7 or 8 are without articles, in every other place they're indefinite.

Then, compare In the opinion of some with in opinion of others. The main difference is the article again. Is the opinion of some more definite than the opinion of the others? Does it just sound better that way? Of course, I've hacked all that, but I'm still uncomfortable.

And, by the way, just if anyone cares, here it is: lots of hacks and some ideas.

Saturday, April 30, 2011

Disambiguation by constructions

A word may have many senses and a good parser must choose the right one. Which is the right one? This is often obvious from the context. When you see an ambiguous word alone (e.g. cooks), you can tell nothing about its meaning. But if you see it in context, you may start guessing its syntactic and/or semantic behavior (our cooks have prepared the dinner vs. Mary cooks fish). Surrounding words greatly help in determining the part of speech, and many disambiguation algorithms take advantage of that.

But who on earth cares about the parts of speech? Well, many do, for example, parsers, both statistical and declarative, employ this information for building all kinds of structures. But anyway, that's an intermediate thing used for the own convenience of those parsers. For the ultimate text analysis tasks parts of speech are not important at all. The meaning is what is important, not them. So why bother at all? I'm currently trying to live without the intermediate part-of-speech level in my parser, and so far it works. How?

Consider cooks again. It can participate in the following constructions:
  1. she cooks
  2. cooks fish
  3. cooks when something happens
  4. cooks well
  5. the cooks
  6. sad cooks
  7. cooks that came from Germany
  8. and so on
Of those, 1-4 are "verbal", they refer to the same meaning of cooks, the process of cooking. And in fact they can all occur together in one sentence: She cooks fish very well when she's not tired. 5-7, on the other hand, are "nominal", they describe the people who cook for a living. They are also mutually compatible and at the same time totally incompatible with 1-4.

Let's now say there are no nouns, verbs and so on, there are only constructions. Upon encountering cooks, the parser notes all the constructions possible with this word (at least 1-7 from above). It also marks some of them (1-4) as incompatible with others (5-7). Then another word comes, for example, fish. It also generates tons of constructions, some of them also mutually incompatible (fish can be a noun or a verb as well). Importantly, one of them is the familiar Transitive (number 2 in the list). It's been suggested by both words, and it clearly wins over the others which were suggested only by one of the two words.

Now the constructions which are incompatible with this Transitive can be pruned: both "nominal" for cooks and "verbal" for fish. And the Transitive is promoted and may now contribute to the meaning of the entire text. (e.g. Cook.patient=Fish). Disambiguation complete.

Positive side: it's very simple and I don't have to create boring data structures for different parts of speech with all those cases, inclinations, numbers, genders, etc. Negative side: every word now has to know of all contexts it can occur in. Both adjectives and nouns have to specify that they participate in Adjective+Noun construction. That's quite unusual in rule-based parsing where people try to hand-code as little redundancy as possible. Anyway, unusual doesn't mean bad, I'm not very much against redundancy, and I really like the overall simplicity.

Tuesday, April 12, 2011

Parsing and questions

The usual declarative sentences are considered simple by formal semantics. In fact they are not, but anyway they're the simplest what there is. They even deceive people to believe that the first-order logic is adequate for expressing their meaning. John loves Mary is loves(john, mary). Simple. It gets more interesting with quantifiers, especially when there is more than one (can you spot an ambiguity in Every man loves some woman?), but that's not my point. Remember, this was the first Kharms' sentence:
Удивительный случай случился со   мной: я вдруг    забыл, что  идет раньше - 7 или 8
Amazing case happened with me I suddenly forgot what goes earlier 7 or 8
An amazing thing happened to me today, I suddenly forgot what comes first - 7 or 8

And the part of it remaining uncommented is the very last one, starting with what. It looks suspiciously like a question. I can imagine saying to myself What comes first 7 or 8? Damn, I forgot that! Yes, that's definitely a question. So, we have to step outside the comfortable world of declarative sentences and enter the darkness of what the advanced topics of formal semantics are about: interrogatives. Man, they even have a semester seminar on questions, only questions and nothing but questions!

But a quick look at their analyses is sufficient for me to realize that I don't like them. I don't want to implement that for the first 10 years (well, maybe less: they provide some code in Haskell), and then spend the rest of my life analyzing the resulting sets-of-sets-of-possible-worlds kinds of structures just to understand that it only means What comes first?. I don't seek an absolute truth, for me the simpler the structure, the closer it is to the surface, the better. Of course, if it still is acceptable as a true interlingua.

That said, I don't have much choice on how to represent the question from above. It's a clause containing a verb and a subject, and all these three entities are unusual in their own ways.

The unusual thing about verb is that it consists of two words - come first. Actually, it's a more generic verb come X, where X can be of any scalar value: first, next, previous, last, 42th, etc.

The subject is also unusual since it's what, a typical wh-word which many questions start with. It also comes with variants at the end of the clause - 7 or 8. I consider this a special construction, characteristic of questions. Those 7 and 8 are just listed in the semantics as the variants slots of the what frame.

Finally, the clause is unusual since it has to mark in some way its questionness. It would also be nice if it could specify which part of the clause is asked about (here it's the subject what). These two things are solved by one means: the situation corresponding to this clause has a questioned attribute pointing to what. Simple.

Finally, there should be a way of linking the question clause to the verb it depends on: forgot. It would be also nice to distinguish between the different things one can forget: real things (I forgot my cell phone), facts (I forgot that 2x2=4), some values (I forgot the area of Africa) and, finally, the answers to the questions (I forgot what comes first). At least two of these variants employ clauses: facts and questions. Luckily, a fact's clause definitely won't have questioned attribute, while in our case it will definitely be there. So indeed, we can just say that forgot's theme is the whole situation corresponding to the question and seems to be sufficient for the current purposes.

So, now I'm finally ready to present the semantics built for the complete sentence. Well, almost ready. There remains that or in 7 or 8. That's a conjunct, and, as conjuncts are my favorite and very interesting phenomena, I'll discuss them later in great detail. So, the interlingual representation for the first Sonnet sentence is this:

A.property=AMAZING
A.type=THING
A.given=false
B.type=HAPPEN
#1.time=PAST
B.arg1=A
B.experiencer=C
C.type=ME
#1.elaboration=#2
--
A.type=ME
B.manner=SUDDENLY
B.type=FORGET
#2.time=PAST
B.arg1=A
B.theme=#3
--
#3.questioned=A
#3.time=PRESENT
B.type=COME_SCALARLY
B.order=EARLIER
B.arg1=A
C.type=7
C.number=true
A.variants=D
D.member=C
D.conj=or
E.type=8
E.number=true
D.member=E

Wednesday, March 30, 2011

New model

What happens when you hear a sentence? Nobody knows for sure, but there are some guesses on the market. Many people think you build a logical formula, and do something with it (perhaps, store it and use for inference). However, I find huge logical expressions terribly hard to analyze, and as I surely need some analysis to translate correctly, this idea doesn't suit me.

So I prefer to think in terms of objects and relations between them. Luckily, I don't have to invent anything, since some psychologists have similar ideas. They argue that during sentence comprehension people don't operate with abstract logical symbols, but mentally simulate the input, and this is what we call meaning. You hear The eagle in the sky, you imagine an eagle in the blue sky with few clouds and airplanes, and it's natural that the eagle in this picture has wide-spread wings, and not, say, folded.

Many experiments seem to support this theory. But actually that doesn't matter: I don't care that much what really happens in the brain. What I care about are ideas that may be useful for parsing. And I find the idea of mental simulations quite useful. The words don't have to determine the meaning, they act like cues which invoke the memories of previous situations where you met them. These memories are put together and a new situation is simulated. The situation may involve objects or actions that are not mentioned at all, but they'll be highly accessible in the following discourse, without a complex logical inference. The important things that the hearer learned through simulation are stored in the memory for future retrieval, also as parts of other simulations.

This is a completely informal theory. That's great because I may treat it in any way useful to me. In particular, I assume the simulation engine to be a black box that communicates with the parser and generator via frames. Frames are just objects, with attributes and values. Values may be strings or other frames. The parser creates cues like Frame A has attribute 'actor' pointing to frame B or Frame C is described as 'man'. And the notation is:

A.actor=B
C.type=man

The parser gives such cues to the simulator as soon as it encounters new words. It also receives feedback which may help choosing between several competitive parses. But it doesn't know what the simulator does internally, it only sees those frames and attributes. That allows me to mock the simulator in any way I want. In the end, my main object of interest is parsing, so I leave a well-behaving simulator to someone else.

The target language generator has access to both the sequence of the cues fed by the parser to the simulator, and to the simulator itself. The cue sequence is needed to provide as close translation as possible, so that what was said in the source will be more or less the same as what was generated. The simulator is needed to ensure that the information inferred from the cues in both languages is similar as well. The simulation results may also be used in cases when the generator needs something not explicitly specified in the source but obligatory in the result, for example, pronoun gender, when translating from Finnish to Russian.

So far, the model is quite simple. A discourse is a sequence of (numbered, #1,#2,...) situations. Each situation has a sequence of constraints on frames (referred to by variables: A,B,C,...) and their attribute values. Situations are also variables, and also have properties that may be assigned.

For example, let's consider the first part of Kharms' Sonnet:

An amazing thing happened to me today, I suddenly forgot....

It's currently analyzed as:

----- Situation #1 -----
A.property=AMAZING
A.type=THING
A.given=false
B.type=HAPPEN
#1.time=PAST
B.arg1=A
B.experiencer=C
C.type=ME
#1.elaboration=#2
----- Situation #2 -----
A.type=ME
B.manner=SUDDENLY
B.type=FORGET
#2.time=PAST
B.arg1=A
B.theme=#3
----- Situation #3 -----
...

Wednesday, February 9, 2011

Declarative programming

If you have a problem and think "I'll describe everything as data in a nice declarative way (maybe even XML)", now you have two problems.

Seriously, unless you can prove your software will never be extended, don't program declaratively. (I'm speaking mostly to myself here but if this helps anyone else, you're welcome). The reason is obvious: declarative description is very pretty and concise as long as it's simple. But usually it becomes extremely complex to change things a bit if that bit isn't foreseen from the beginning. And the real world is always like that. I've written quite a few frameworks and declarative parsers and there hasn't been any one where I didn't regret that when adding new functionality.

Simple things should be simple, complex things should be possible (c). With declarative code they usually end up being impossible. Even if they're possible, the efforts you spend are huge, and it looks like a hack. Just look at Ant.

Alternatives? Write code that does something, not the one that creates some structures which are then stored, sorted, converted into other structures, stored and sorted again and then finally some completely different code does something based on these structures. Just do it as straightforwardly as possible. Execute the code instead of creating data objects. Code invocation may in fact look quite DSLish (even in Java), so the not-so-big-if-any lengthiness will definitely pay off in the long run when you have to change the foundations. And it's muuuch easier to debug this way.

The moral: when writing data instead of code, think again. And again. And again.