Sunday, October 17, 2010

Translating Kharms

My approach to NLU is to take a real sample of Natural Language and make a computer Understand it. Since I don't want the result to be controversial or debatable, I take something that everyone would agree upon, a 'gold standard'. In my case the program should translate a text exactly the way some human has already translated it. Because they are short, I took one of Daniil Kharms's short stories that has been translated into English. The program parses the text, computes its semantic representation, translates that into target-language concepts and generates the target text based solely on that transfer result.

The problems begin straight away. Here's the first sentence:

Удивительный случай случился со мной: я вдруг забыл, что идет раньше - 7 или 8
Amazing case happened with me I suddenly forgot what goes earlier 7 or 8
An amazing thing happened to me today, I suddenly forgot what comes first - 7 or 8

Problem: Everyone translating from Russian into English immediately stumbles upon articles. Russian doesn't have them, so they have to be added. One can note that they somehow relate to information structure: new material is usually (but not always) indefinite, given material definite.

Solution: For anaphora resolution, a smart parser has to keep a record of discourse referents anyway, so the generator can mark the newly introduced ones as indefinite. For a start that'll do, until we find a counterexample. So, the case=thing is a new discourse referent, and we add An before it.
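To make the idea concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions, not the actual program: the function name choose_article, the referent id "thing1" and the vowel-letter test for a/an are all made up for the example.

```python
def choose_article(referent_id, noun_phrase, seen):
    """Prefix noun_phrase with an article based on discourse status.

    A referent that has not been mentioned before gets 'a'/'an' and is
    recorded in `seen`; a referent that is already known gets 'the'.
    The a/an choice is a naive first-letter check (an assumption that
    is good enough for this sentence, not for English in general).
    """
    if referent_id in seen:
        article = "the"
    else:
        seen.add(referent_id)
        article = "an" if noun_phrase[0].lower() in "aeiou" else "a"
    return f"{article} {noun_phrase}"


seen = set()  # discourse referents introduced so far
first_mention = choose_article("thing1", "amazing thing", seen)
second_mention = choose_article("thing1", "amazing thing", seen)
print(first_mention)
print(second_mention)
```

The same set of referents that the parser would need for anaphora resolution is reused here, which is why the rule comes almost for free.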

Problem: This is absolutely crazy, but human-made translations abound with such additions, and my purpose is to imitate them. In the Russian text we are only told that the amazing thing happened, but in English it happened today. Not surprisingly, another translation of the same story doesn't have it. But anyway, it's in the 'gold standard', so the program should somehow be able to generate it.

Solution: Of course, the translator may have added this at random, but I always try to find something in the source text first before I give up. Here, my solution is that the Russian story actually has an anecdotal flavor. This today adds the same flavor, since (perhaps) without it the English counterpart wouldn't have it. Therefore, the Russian parser has to specify the flavor explicitly in the semantics so that the generator can use this information later.
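A rough sketch of how such a flavor mark could travel from the parser to the generator. Everything here is hypothetical: the Event class, its field names and the "anecdotal" value are my own placeholders, and appending today is just the one realization that the gold standard suggests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """A toy semantic frame; the field names are illustrative only."""
    predicate: str
    subject: str
    experiencer: str
    flavor: Optional[str] = None  # set by the source-language parser


def generate_english(event):
    """Render the frame as an English clause.

    If the parser marked the event as anecdotal, the generator
    expresses that flavor by adding 'today'.
    """
    clause = f"{event.subject} {event.predicate} to {event.experiencer}"
    if event.flavor == "anecdotal":
        clause += " today"
    return clause[0].upper() + clause[1:]


plain = generate_english(Event("happened", "an amazing thing", "me"))
anecdotal = generate_english(
    Event("happened", "an amazing thing", "me", flavor="anecdotal")
)
print(plain)
print(anecdotal)
```

The point of the design is that the generator never looks at the Russian words; it only sees the flavor recorded in the semantics.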

Problem: The colon, which separates the two clauses of the sentence. Nothing's wrong with it except that it's translated as a comma. So I had to find a deep meaning for it, one that should supposedly be expressed by a comma in English.

Solution: The second clause seems to elaborate on something from the first clause. I'd say it's the semantically empty thing, about which the reader is now told more. So, apparently the colon actually expresses an elaboration dependency between thing and the second clause, and what we should preserve in the deep semantics is this dependency, not the colon itself.
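As a sketch of what "preserve the dependency, not the colon" could mean in practice: the semantics only stores a relation name, and each language decides how to punctuate it. The table and the render function are assumptions made for illustration; the comma for English is simply what this one gold-standard sentence shows.

```python
# How each language punctuates an elaboration relation between two clauses.
# Only these two entries are backed by the example sentence.
ELABORATION_SEPARATOR = {
    "ru": ": ",
    "en": ", ",
}


def render(first_clause, second_clause, relation, lang):
    """Join two clauses according to the semantic relation between them."""
    if relation != "elaboration":
        raise ValueError(f"unsupported relation: {relation}")
    return first_clause + ELABORATION_SEPARATOR[lang] + second_clause


english = render(
    "An amazing thing happened to me today",
    "I suddenly forgot what comes first - 7 or 8",
    "elaboration",
    "en",
)
print(english)
```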

Problem: Due to the parsing algorithm, once the verb (happened) has come after the subject (thing), the latter is deactivated: it's not available for forming syntactic relations with the words coming after the verb. This seems reasonable because discontinuous subjects are not very frequent in Russian, and always marked (e.g. by a special intonation). But nevertheless in this case the subject should form the elaboration relation with what follows the colon.

Solution: Look at it from the semantic point of view. We're told that an amazing thing happened. Naturally, we're interested in what kind of thing it was. So, the verb's semantic handler should be taught to understand such luring constructions and wait for a colon followed by a clause, which is one of the ways of expressing elaboration in Russian.
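Here is one way the "wait for a colon" behaviour might look, as a deliberately tiny sketch. The class ToyParser, its method names and the subject_is_vague flag are all hypothetical; the real parser works differently, and how it decides that a subject is semantically empty is left out entirely.

```python
class ToyParser:
    """Collects relations and remembers a referent awaiting elaboration."""

    def __init__(self):
        self.relations = []               # (relation, head, dependent) triples
        self.pending_elaboration = None   # referent the reader wants to hear more about

    def on_verb(self, verb, subject, subject_is_vague):
        """Attach the subject; the subject is then closed syntactically.

        If the subject is semantically empty ('thing'), the verb's handler
        keeps it alive on the semantic side as something to be elaborated.
        """
        self.relations.append(("subject", verb, subject))
        if subject_is_vague:
            self.pending_elaboration = subject

    def on_colon_clause(self, clause):
        """Handle a colon followed by a clause; return True if it was consumed."""
        if self.pending_elaboration is None:
            return False
        self.relations.append(("elaboration", self.pending_elaboration, clause))
        self.pending_elaboration = None
        return True


parser = ToyParser()
parser.on_verb("happened", "thing", subject_is_vague=True)
consumed = parser.on_colon_clause("I suddenly forgot what comes first")
print(parser.relations)
```

This keeps the syntactic rule intact (the subject stays deactivated after the verb) and moves the link to the second clause into the semantic handler.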

Let's deal with the rest of this sentence's problems later.