Tuesday, April 12, 2011

Parsing and questions

The usual declarative sentences are considered simple by formal semantics. In fact they are not, but anyway they're the simplest there are. They even deceive people into believing that first-order logic is adequate for expressing their meaning. John loves Mary is loves(john, mary). Simple. It gets more interesting with quantifiers, especially when there is more than one (can you spot an ambiguity in Every man loves some woman?), but that's not my point. Remember, this was the first sentence of Kharms' Sonnet:
Удивительный случай случился со мной: я вдруг забыл, что идет раньше - 7 или 8
Amazing case happened with me I suddenly forgot what goes earlier 7 or 8
An amazing thing happened to me today, I suddenly forgot what comes first - 7 or 8

And the part of it that remains uncommented is the very last one, starting with what. It looks suspiciously like a question. I can imagine saying to myself What comes first, 7 or 8? Damn, I forgot that! Yes, that's definitely a question. So, we have to step outside the comfortable world of declarative sentences and enter the darkness that the advanced topics of formal semantics deal with: interrogatives. Man, they even have a semester-long seminar on questions, only questions and nothing but questions!

But a quick look at their analyses is sufficient for me to realize that I don't like them. I don't want to spend the first 10 years implementing that (well, maybe less: they provide some code in Haskell), and then the rest of my life analyzing the resulting sets-of-sets-of-possible-worlds kinds of structures just to understand that they only mean What comes first?. I don't seek an absolute truth; for me, the simpler the structure and the closer it is to the surface, the better. Of course, only as long as it's still acceptable as a true interlingua.

That said, I don't have much choice in how to represent the question above. It's a clause containing a verb and a subject, and all three of these entities are unusual in their own ways.

The unusual thing about the verb is that it consists of two words - come first. Actually, it's a more generic verb come X, where X can be any scalar value: first, next, previous, last, 42nd, etc.

The subject is also unusual since it's what, a typical wh-word that many questions start with. It also comes with variants at the end of the clause - 7 or 8. I consider this a special construction, characteristic of questions. Those 7 and 8 are just listed in the semantics under the variants slot of the what frame.

The clause itself is unusual since it has to somehow mark its questionness. It would also be nice if it could specify which part of the clause is asked about (here it's the subject what). Both things are solved by one means: the situation corresponding to this clause has a questioned attribute pointing to what. Simple.

Finally, there should be a way of linking the question clause to the verb it depends on: forgot. It would also be nice to distinguish between the different things one can forget: real things (I forgot my cell phone), facts (I forgot that 2x2=4), some values (I forgot the area of Africa) and, finally, answers to questions (I forgot what comes first). At least two of these variants employ clauses: facts and questions. Luckily, a fact's clause definitely won't have a questioned attribute, while in our case it will definitely be there. So we can just say that forgot's theme is the whole situation corresponding to the question, and that seems to be sufficient for the current purposes.
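
To illustrate (this is just a sketch with made-up names, not actual parser code): a consumer of these frames, say the generator or the handler of forget, can tell a question clause from a fact clause by a single check for the questioned attribute.

# A minimal sketch: only the 'questioned' check comes from this post,
# the function and the dict-based frames are my own illustration.
def forgotten_kind(theme):
    """What kind of forgetting is this, judging by the theme frame?"""
    if 'questioned' in theme:
        return 'question'      # I forgot what comes first
    return 'fact'              # I forgot that 2x2=4

question_clause = {'questioned': 'A', 'time': 'PRESENT'}   # roughly situation #3 below
fact_clause = {'time': 'PRESENT'}

print(forgotten_kind(question_clause))   # question
print(forgotten_kind(fact_clause))       # fact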

So, now I'm finally ready to present the semantics built for the complete sentence. Well, almost ready. There remains that or in 7 or 8. That's a conjunction, and, as conjunctions are my favorite and very interesting phenomena, I'll discuss them later in great detail. So, the interlingual representation for the first Sonnet sentence is this:

A.property=AMAZING
A.type=THING
A.given=false
B.type=HAPPEN
#1.time=PAST
B.arg1=A
B.experiencer=C
C.type=ME
#1.elaboration=#2
--
A.type=ME
B.manner=SUDDENLY
B.type=FORGET
#2.time=PAST
B.arg1=A
B.theme=#3
--
#3.questioned=A
#3.time=PRESENT
B.type=COME_SCALARLY
B.order=EARLIER
B.arg1=A
C.type=7
C.number=true
A.variants=D
D.member=C
D.conj=or
E.type=8
E.number=true
D.member=E

Wednesday, March 30, 2011

New model

What happens when you hear a sentence? Nobody knows for sure, but there are some guesses on the market. Many people think you build a logical formula and do something with it (perhaps store it and use it for inference). However, I find huge logical expressions terribly hard to analyze, and as I surely need some analysis to translate correctly, this idea doesn't suit me.

So I prefer to think in terms of objects and relations between them. Luckily, I don't have to invent anything, since some psychologists have similar ideas. They argue that during sentence comprehension people don't operate with abstract logical symbols, but mentally simulate the input, and this is what we call meaning. You hear The eagle in the sky, you imagine an eagle in the blue sky with a few clouds and airplanes, and it's natural that the eagle in this picture has spread wings, not, say, folded ones.

Many experiments seem to support this theory. But actually that doesn't matter: I don't care that much what really happens in the brain. What I care about are ideas that may be useful for parsing. And I find the idea of mental simulations quite useful. The words don't have to determine the meaning; they act like cues which invoke the memories of previous situations where you met them. These memories are put together and a new situation is simulated. The situation may involve objects or actions that are not mentioned at all, but they'll be highly accessible in the following discourse, without any complex logical inference. The important things that the hearer learned through simulation are stored in memory for future retrieval, also as parts of other simulations.

This is a completely informal theory. That's great because I may treat it in any way useful to me. In particular, I assume the simulation engine to be a black box that communicates with the parser and generator via frames. Frames are just objects, with attributes and values. Values may be strings or other frames. The parser creates cues like Frame A has attribute 'actor' pointing to frame B or Frame C is described as 'man'. And the notation is:

A.actor=B
C.type=man

The parser gives such cues to the simulator as soon as it encounters new words. It also receives feedback which may help it choose between several competing parses. But it doesn't know what the simulator does internally; it only sees those frames and attributes. That allows me to mock the simulator in any way I want. In the end, my main object of interest is parsing, so I leave a well-behaving simulator to someone else.

The target language generator has access both to the sequence of cues fed by the parser to the simulator, and to the simulator itself. The cue sequence is needed to provide as close a translation as possible, so that what was said in the source will be more or less the same as what is generated. The simulator is needed to ensure that the information inferred from the cues in both languages is similar as well. The simulation results may also be used in cases when the generator needs something not explicitly specified in the source but obligatory in the result, for example, pronoun gender when translating from Finnish to Russian.

So far, the model is quite simple. A discourse is a sequence of (numbered: #1, #2, ...) situations. Each situation has a sequence of constraints on frames (referred to by variables: A, B, C, ...) and their attribute values. Situations are variables too, and may also have attributes assigned to them.
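
To make this concrete, here is a minimal sketch with invented class names (Frame, Discourse); the real point is only the notation, so any implementation along these lines would do. It roughly reproduces the cues of the example below.

# A sketch, not the actual code: frames are bags of attributes whose values are
# strings or other frames; a discourse is an ordered list of situation frames.
class Frame:
    def __init__(self, name):
        self.name = name              # A, B, C, ... or #1, #2, ... for situations
        self.attrs = {}               # attribute -> string or Frame

    def set(self, attr, value):
        self.attrs[attr] = value
        return self

class Discourse:
    def __init__(self):
        self.situations = []          # ordered: #1, #2, ...

    def new_situation(self):
        s = Frame('#%d' % (len(self.situations) + 1))
        self.situations.append(s)
        return s

# Feeding the cues for 'An amazing thing happened to me':
d = Discourse()
s1 = d.new_situation()
a, b, c = Frame('A'), Frame('B'), Frame('C')
a.set('property', 'AMAZING').set('type', 'THING').set('given', 'false')
b.set('type', 'HAPPEN').set('arg1', a).set('experiencer', c)
c.set('type', 'ME')
s1.set('time', 'PAST')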

For example, let's consider the first part of Kharms' Sonnet:

An amazing thing happened to me today, I suddenly forgot....

It's currently analyzed as:

----- Situation #1 -----
A.property=AMAZING
A.type=THING
A.given=false
B.type=HAPPEN
#1.time=PAST
B.arg1=A
B.experiencer=C
C.type=ME
#1.elaboration=#2
----- Situation #2 -----
A.type=ME
B.manner=SUDDENLY
B.type=FORGET
#2.time=PAST
B.arg1=A
B.theme=#3
----- Situation #3 -----
...

Wednesday, February 9, 2011

Declarative programming

If you have a problem and think "I'll describe everything as data in a nice declarative way (maybe even XML)", now you have two problems.

Seriously, unless you can prove your software will never be extended, don't program declaratively. (I'm speaking mostly to myself here, but if this helps anyone else, you're welcome.) The reason is obvious: a declarative description is very pretty and concise as long as it's simple. But it usually becomes extremely hard to change things a bit if that bit wasn't foreseen from the beginning. And the real world is always like that. I've written quite a few frameworks and declarative parsers, and there hasn't been a single one where I didn't regret the declarativeness when adding new functionality.

Simple things should be simple, complex things should be possible (c). With declarative code the complex things usually end up being impossible. Even if they're possible, the effort you spend is huge, and the result looks like a hack. Just look at Ant.

Alternatives? Write code that does something, not code that creates some structures which are then stored, sorted, converted into other structures, stored and sorted again, until finally some completely different code does something based on these structures. Just do it as straightforwardly as possible. Execute code instead of creating data objects. Code invocation may in fact look quite DSLish (even in Java), so the not-so-big-if-any extra verbosity will definitely pay off in the long run when you have to change the foundations. And it's muuuch easier to debug this way.
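
As a toy illustration of the contrast (entirely my own, in Python for brevity, though the same holds for Java):

# Declarative: the behaviour lives in data that some other engine must interpret.
PIPELINE = [
    {'step': 'tokenize', 'options': {'lowercase': True}},
    {'step': 'tag'},
]

# Direct: the behaviour is the code itself, and it still reads DSL-ish;
# an unforeseen case is just one more ordinary function, not a schema change.
def guess_tag(word):
    return 'NUM' if word.isdigit() else 'WORD'

def process(text):
    words = text.lower().split()                  # tokenize
    return [(w, guess_tag(w)) for w in words]     # tag

print(process("7 comes before 8"))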

The moral: when writing data instead of code, think again. And again. And again.

Sunday, October 17, 2010

Translating Kharms

My approach to NLU is to take some real sample of Natural Language and make the computer Understand it. Since I don't want the result to be controversial and debatable, I take something that everyone would agree upon, a 'gold standard'. In my case, the program should translate a text in the very way that some human has already translated it. Since they are short, I took one of Daniil Kharms's short stories translated into English. The program parses the text, computes its semantic representation, translates it into target-language concepts and generates the target text based solely on that transfer result.

The problems begin straight away. Here's the first sentence:

Удивительный случай случился со мной: я вдруг забыл, что идет раньше - 7 или 8
Amazing case happened with me I suddenly forgot what goes earlier 7 or 8
An amazing thing happened to me today, I suddenly forgot what comes first - 7 or 8

Problem: Everyone translating Russian into English immediately stumbles over articles. Russian doesn't have them, so they have to be added. One can note that they somehow relate to the information structure: new material is usually (but not always) indefinite, given material - definite.

Solution: For anaphora resolution purposes, a smart parser should keep a record of discourse referents anyway, and the generator can mark the newly introduced ones as indefinite. For a start that'll do, until we find a counterexample. So, the case=thing is a new discourse referent, and we add An before it.
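
A minimal sketch of that article decision, assuming the referent ends up as a frame with a given attribute as in the representations elsewhere on this blog; the helper and its name are hypothetical:

# A sketch, not the actual generator: pick an English article from discourse status.
def article_for(frame, noun_phrase):
    if frame.get('given') == 'false':                          # newly introduced referent
        return 'an' if noun_phrase[0].lower() in 'aeiou' else 'a'
    return 'the'                                               # already given referent

thing = {'type': 'THING', 'property': 'AMAZING', 'given': 'false'}
print(article_for(thing, 'amazing thing'), 'amazing thing happened to me')   # an ...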

Problem: The today. This is absolutely crazy, but human-made translations abound with such additions, and my purpose is to imitate them. In the Russian text, we are only told that the amazing thing happened, but in English it happened today. Not surprisingly, another translation of the same story doesn't have it. But anyway, it's in the 'gold standard', so the program should somehow be able to generate it.

Solution: Of course, the translator may have added this at random, but I always first try to find a motivation in the source text before I surrender. Here, my explanation is that the Russian story has an anecdotal flavor, and this today adds the same flavor to the English text, which (perhaps) would lack it otherwise. Therefore, the Russian parser has to specify that flavor explicitly in the semantics so that the generator can use this information later.

Problem: The colon, which separates the two clauses of the sentence. Nothing's wrong with it except that it's translated as a comma. So I had to find a deep meaning for it which can allegedly be expressed with a comma in English.

Solution: The second clause seems to elaborate on something from the first clause. I'd say it's the semantically empty thing about which the reader is now told more. So, apparently the colon actually expresses an elaboration dependency between thing and the second clause, and what we should preserve in the deep semantics is this dependency, not the colon itself.

Problem: Due to the parsing algorithm, once the verb (happened) has come after the subject (thing), the latter is deactivated: it's no longer available for forming syntactic relations with the words coming after the verb. This seems reasonable because discontinuous subjects are not very frequent in Russian and are always marked (e.g. by a special intonation). But nevertheless, in this case the subject should form the elaboration relation with what follows the colon.

Solution: Look at it from the semantic point of view. We're told that an amazing thing happened. Naturally, we're interested in what kind of thing it was. So, the verb's semantic handler should be taught to understand such luring constructions and to wait for a colon followed by a clause, which is one of the ways of expressing elaboration in Russian.
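
A sketch of that idea only, with invented names (nothing here is the actual handler): the handler of happened leaves a note that an elaboration may follow, and a colon followed by a clause consumes it, producing an elaboration link between the two situations.

# Hypothetical handlers: 'pending' holds expectations left by earlier words.
def handle_happen(situation, pending):
    pending['expecting_elaboration'] = situation

def handle_colon_clause(new_situation, pending):
    target = pending.pop('expecting_elaboration', None)
    if target is not None:
        target['elaboration'] = new_situation

pending = {}
s1, s2 = {'id': '#1'}, {'id': '#2'}
handle_happen(s1, pending)
handle_colon_clause(s2, pending)
print(s1['elaboration']['id'])    # #2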

Let's deal with the rest of this sentence's problems later.

Wednesday, August 18, 2010

Placeholders aka left-corner parsing

From the parsing perspective, some words have much more far-reaching consequences than others. Take English articles, for example. They don't just denote definiteness or indefiniteness, they also give a broad hint that a noun is coming. So, having encountered an article, a smart parser may immediately create a placeholder node for the noun (and also specify its definiteness). When the noun comes, instead of creating its own structure it'll just fit into the existing placeholder.

Once a noun placeholder is constructed, the parser may immediately attach it where it's needed. If there is an ambiguous verb or preposition, attaching that noun as an argument can help choose the right alternative and thus reduce working memory consumption.
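
A small sketch of the placeholder mechanics, with invented names (this doesn't claim to be the actual parser): the article creates an empty noun node that is attachable right away, and the noun later just fills it.

# A sketch: 'stack' stands in for whatever structure keeps active nodes.
def on_article(word, stack):
    placeholder = {'cat': 'NP', 'definite': word == 'the', 'head': None}
    stack.append(placeholder)            # already available for early attachment
    return placeholder

def on_noun(word, stack):
    for node in reversed(stack):
        if node.get('cat') == 'NP' and node['head'] is None:
            node['head'] = word          # fit into the existing placeholder
            return node
    node = {'cat': 'NP', 'definite': None, 'head': word}
    stack.append(node)                   # no placeholder: build the NP from scratch
    return node

stack = []
on_article('the', stack)
print(on_noun('eagle', stack))           # {'cat': 'NP', 'definite': True, 'head': 'eagle'}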

Hawkins argues that such placeholders play an important role in syntax. He shows for many languages that their word order is in many cases optimized for such placeholders to be constructed and integrated as early as possible. That helps minimize the number of nodes awaiting integration at any moment during parsing.

In German and Russian, the NP placeholder creators (determiners and adjectives, respectively) also carry case markers. Therefore the case of the placeholder is already known, which also aids faster verb and preposition disambiguation (e.g. in+Dative denotes location, while in+Accusative denotes direction in German).

Verb placeholders also seem to exist. For example, in Japanese, where the verb comes at the very end of the sentence, there is no sign that a sequence of argument noun phrases causes any processing difficulties. I conclude that these noun phrases get integrated into some structure as they appear. When the verb comes, it just quickly iterates through this structure and assigns the accumulated noun phrases their semantic roles. The rich case marking of Japanese helps the arguments to be stored efficiently and not to interfere with each other while awaiting the verb. I think that the structure I'm talking about can safely be called a verbal (or clausal) placeholder.

It gets more interesting in other languages. On the surface, German is a verb-second language. A nice processing explanation would be that there is no verb placeholder, and the first constituent hangs active in memory until a verb actually appears. So the verb should appear immediately to reduce the memory load. In most relative clauses, on the other hand, the verb comes at the end, and no processing problems arise even if there are lots of NP/PP dependents before it. Incidentally, all these clauses have to be introduced by some complementizer. When there is no complementizer, the verb comes second just as if it were a main clause. All this makes me think that the complementizer just creates the verb placeholder which enables the verb to be postponed.

In English, the verb never comes too late, so the need for a placeholder is doubtful. Russian is difficult in this respect, because normally the verb is close to the beginning, but it can easily move to the end for focalization purposes. I'm not aware of any experimental data, but I personally don't feel too overburdened by multiple intermediate NPs and PPs in this case. So I have no idea when a verbal placeholder is created in Russian: always, never, only when there are too many subsequent NPs and PPs, or only in some specific context (e.g. one that forces focalization).

Monday, July 26, 2010

Frames, constructs and the log

Some time has passed. My ideas of the internal representation of natural language have changed. Here are the new ones.

Every word in the input is taken as a whole and all the information about it is stored in a single frame. The frame may have different aspects: morphological, syntactic, semantic, discourse, etc. The aspect names are called roles, the objects corresponding to these roles are called constructs.

Lexical ambiguities are handled frame-internally. If a word is syntactically ambiguous, its frame has several possible syntactic constructs; a polysemous word may have several semantic constructs. The frame should eventually choose one construct for each role, and the chosen constructs must not contradict each other. Here's what a frame for the famously ambiguous bank could look like:
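
A rough sketch of what such a frame might contain (my own guess at a possible structure): competing constructs are kept per role until one is chosen, and the chosen ones must be mutually consistent.

# A hypothetical structure, only for illustration of the idea above.
bank_frame = {
    'word': 'bank',
    'constructs': {
        'syntactic': ['noun'],
        'semantic': ['FINANCIAL_INSTITUTION', 'RIVER_BANK'],
    },
    'chosen': {},                         # filled as the ambiguity gets resolved
}

def choose(frame, role, construct):
    assert construct in frame['constructs'][role]
    frame['chosen'][role] = construct

choose(bank_frame, 'semantic', 'RIVER_BANK')
print(bank_frame['chosen'])               # {'semantic': 'RIVER_BANK'}
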
Once frames with embedded constructs appear in the model, they remain activated for some time and may perform various actions. They may add constructs to their own or another frame. They may establish links to constructs of other roles in the same frame. Or they may create connections to other frames.

Establishing a connection between frames means creating a new frame whose child frames are the ones being connected. The new frame may also have several aspects. For example, in the usual John loves Mary, two extra frames are created to link the predicate with the subject and the object respectively. Each of these frames hosts two constructs: one representing the syntactic relation, the other the semantic one (in experiencer, state and theme terms).
As I've already stated, not only is the final structure important, but also the sequence in which it was constructed. For this, a simple program-like log can be used:

frame: syn=noun, sem=JOHN
frame: syn=verb;transitive, sem=love
frame ^2 ^1: syn=subject+predicate, sem=experiencer+state
frame: syn=noun, sem=MARY
frame ^3 ^1: syn=predicate+object, sem=state+object

Each line here talks about some new frame. The frame's children are referred to as ^x, where x is how many lines back the line introducing that frame was: ^1 means the previous line, ^2 the one before it. When a frame is created, it has no aspects; their subsequent assignment is reflected in the log.
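
A small sketch of reading such a log and resolving the ^x back-references; the format is as described above, the reading code itself is just my illustration.

# A sketch: parse log lines into frame dicts, resolving ^x to earlier frames.
def read_log(lines):
    frames = []
    for line in lines:
        head, _, aspects = line.partition(':')
        refs = head.split()[1:]                                # e.g. ['^2', '^1']
        children = [frames[-int(r[1:])] for r in refs]         # ^1 = previous line, etc.
        frames.append({'children': children, 'aspects': aspects.strip()})
    return frames

log = [
    "frame: syn=noun, sem=JOHN",
    "frame: syn=verb;transitive, sem=love",
    "frame ^2 ^1: syn=subject+predicate, sem=experiencer+state",
]
frames = read_log(log)
print(frames[2]['children'][0]['aspects'])    # syn=noun, sem=JOHN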

This log reflects the dynamics of the parsing process and can thus be used in applications where the order of operations is important, like machine translation.

Friday, May 21, 2010

The horrors of the Russian language

Two quite simple phenomena, typical in everyday speech:

Он сам пришел (he himself came)
Мы все пошли гулять (we all went to walk)

Everything's fine until you try to understand the functions of сам (himself) and все (all). First, their meaning is quite hard to describe. As for the second one, I still have no idea what все adds to this sentence and how it differs from Мы пошли гулять (we went to walk). I suspect there must be a similar difference in the English translation, but I can't state it clearly.

Furthermore, the syntactic behavior is completely weird. It looks as if все were an optional syntactic modifier of мы. But if we change the case, we get a very strange construction:

У нас (у) самих ничего нет (we ourselves don't have anything).
До нас (до) всех дошло известие (to us (to) all has come the news).

This optional preposition doubling makes no sense to me, and I don't know how to describe it in either phrase structure or dependency grammar rules. What modifies what in this case?