Sunday, October 17, 2010

Translating Kharms

My approach to NLU is to take some real sample of Natural Language and make a computer Understand it. Since I don't want the result to be controversial and debatable, I take something that everyone would agree upon, a 'gold standard'. In my case the program should translate a text in exactly the way some human has already done. Since they're short, I took one of Daniil Kharms's short stories along with its English translation. The program parses the text, computes its semantic representation, translates it into target-language concepts and generates the target text based solely on that transfer result.

The problems begin straight away. Here's the first sentence:

Удивительный случай случился со мной: я вдруг забыл, что идет раньше - 7 или 8
Amazing case happened with me I suddenly forgot what goes earlier 7 or 8
An amazing thing happened to me today, I suddenly forgot what comes first - 7 or 8

Problem: Everyone translating Russian into English immediately stumbles upon articles. Russian doesn't have them, so they have to be added. One can note that they somehow relate to information structure: new material is usually (but not always) indefinite, given material definite.

Solution: For anaphora-resolution purposes, a smart parser should keep a record of discourse referents anyway, and the generator can mark the newly introduced ones as indefinite. For a start that'll do, until we find a counterexample. So the case=thing is a new discourse referent, and we add An before it.
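A minimal sketch of this heuristic, assuming the parser exposes its referent record to the generator (the class and method names here are mine, invented for illustration):

# Toy sketch: the generator consults the parser's record of discourse
# referents and marks newly introduced ones as indefinite.
class DiscourseTracker:
    def __init__(self):
        self.known = set()              # referents already introduced

    def article_for(self, referent):
        """Indefinite for new referents, definite for given ones."""
        if referent in self.known:
            return "the"
        self.known.add(referent)
        return "a"                      # a/an allomorphy would be decided later

tracker = DiscourseTracker()
print(tracker.article_for("thing"))     # 'a'  -- first mention
print(tracker.article_for("thing"))     # 'the' -- given referent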

Problem: This is absolutely crazy, but human-made translations abound with such additions, and my purpose is to imitate them. In the Russian text we are only told that the amazing thing happened; in English it happened today. Not surprisingly, another translation of the same story doesn't have it. But it's in the 'gold standard', so the program should somehow be able to generate it.

Solution: Of course, the translator may have added this at random, but I always try to find something in the source text before I surrender. Here, my take is that the Russian story has an anecdotal flavor, and this today adds the same flavor, which (perhaps) the English counterpart would otherwise lack. Therefore the Russian parser has to specify this flavor explicitly in the semantics so that the generator can use it later.

Problem: The colon separating the two clauses of the sentence. Nothing's wrong with it except that it's translated as a comma. So I had to find a deep meaning for it which could plausibly be expressed with a comma in English.

Solution: The second clause seems to elaborate on something from the first one. I'd say it's the semantically empty thing about which the reader is now told more. So the colon apparently expresses an elaboration dependency between thing and the second clause, and what we should preserve in the deep semantics is this dependency, not the colon itself.

Problem: Due to the parsing algorithm, once the verb (happened) has come after the subject (thing), the latter is deactivated: it's no longer available for forming syntactic relations with the words coming after the verb. This seems reasonable, because discontinuous subjects are not very frequent in Russian and are always marked (e.g. by a special intonation). But in this case the subject nevertheless has to form the elaboration relation with what follows the colon.

Solution: Look at it from the semantic point of view. We're told that an amazing thing happened, so naturally we're interested in what kind of thing it was. The verb's semantic handler should therefore be taught to recognize such luring constructions and wait for a colon followed by a clause, which is one of the ways of expressing elaboration in Russian.
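A toy version of such a handler (the function and relation names are invented for illustration, not taken from any real grammar):

# The handler keeps the deactivated subject around and, if a colon plus a
# clause follows, links that clause to the subject with an 'elaborates'
# relation instead of discarding the subject.
def handle_happened(subject, rest_of_sentence):
    relations = [("happened", subject)]
    if rest_of_sentence and rest_of_sentence[0] == ":":
        elaborating_clause = tuple(rest_of_sentence[1:])
        relations.append(("elaborates", subject, elaborating_clause))
    return relations

print(handle_happened("thing",
                      [":", "I", "suddenly", "forgot", "what", "comes", "first"]))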

Let's deal with the rest of this sentence's problems later.

Wednesday, August 18, 2010

Placeholders aka left-corner parsing

From the parsing perspective, some words have far more far-reaching consequences than others. Take English articles, for example. They don't just mark definiteness or indefiniteness; they also give a broad hint that a noun is coming. So, having encountered an article, a smart parser may immediately create a placeholder node for the noun (and also specify its definiteness). When the noun comes, instead of creating its own structure it just fits into the existing placeholder.
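Here's a small sketch of the idea, not of any real parser (class and function names are mine):

# An article creates a noun placeholder with known definiteness; the noun
# later fills it instead of building its own node.
class NounPlaceholder:
    def __init__(self, definite):
        self.definite = definite
        self.noun = None                # not seen yet

ARTICLES = {"a": False, "an": False, "the": True}

def parse(tokens):
    nodes = []
    for token in tokens:
        if token in ARTICLES:
            nodes.append(NounPlaceholder(ARTICLES[token]))   # placeholder appears early
        elif nodes and isinstance(nodes[-1], NounPlaceholder) and nodes[-1].noun is None:
            nodes[-1].noun = token      # the noun just fits into the placeholder
        else:
            nodes.append(token)
    return nodes

np = parse(["the", "horse"])[0]
print(np.definite, np.noun)             # True horse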

Once a noun placeholder is constructed, the parser may immediately attach it where it's needed. If there is an ambiguous verb or preposition, attaching that noun as an argument can help choose the right alternative and thus reduce working-memory consumption.

Hawkins argues that such placeholders play an important role in syntax. He shows for many languages that their word order is in many cases optimized so that such placeholders can be constructed and integrated as early as possible. That helps minimize the number of nodes awaiting integration at any moment during parsing.

In German and Russian, the NP placeholder creators (determiners and adjectives, respectively) also carry case markers. The case is therefore known for the placeholders, which further aids fast verb disambiguation (e.g. in+Dative denotes location, while in+Accusative denotes direction in German).

Verb placeholders also seem to exist. For example, in Japanese, where the verb comes at the very end of the sentence, there is no sign that a sequence of argument noun phrases causes any processing difficulty. I conclude that these noun phrases get integrated into some structure as they appear. When the verb comes, it just quickly iterates through this structure and assigns the accumulated noun phrases their semantic roles. The rich case marking of Japanese helps the arguments to be stored efficiently and not to interfere with each other while awaiting the verb. I think the structure I'm talking about can safely be called a verbal (or clausal) placeholder.
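A rough model of such a clausal placeholder (my own invention, with a toy two-argument example): case-marked NPs are parked as they arrive, and the verb assigns roles to all of them in one pass at the end.

clause = {}                                        # the clausal placeholder

def add_np(noun, case):
    clause[case] = noun                            # case marking keeps NPs apart

def finish_with_verb(verb, case_to_role):
    roles = {case_to_role[case]: np for case, np in clause.items()}
    roles["predicate"] = verb
    return roles

add_np("Taroo", "NOM")
add_np("hon", "ACC")                               # 'book'
print(finish_with_verb("yomu", {"NOM": "agent", "ACC": "theme"}))   # 'read'
# {'agent': 'Taroo', 'theme': 'hon', 'predicate': 'yomu'}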

It gets more interesting in other languages. On the surface, German is a verb-second language. A nice processing explanation would be that there is no verb placeholder, so the first constituent hangs active in memory until a verb actually appears; hence the verb should appear immediately to reduce memory usage. In most relative clauses, on the other hand, the verb comes at the end, and no processing problems arise even if there are lots of NP/PP dependents before it. Incidentally, all these clauses have to be introduced by some complementizer. When there is no complementizer, the verb comes second, just as if it were a main clause. All this makes me think that the complementizer simply creates the verb placeholder, which allows the verb to be postponed.

In English, the verb never comes too late, so the need for a placeholder is doubtful. Russian is difficult in this respect, because normally the verb is close to the beginning, but it can easily move to the end for focalization purposes. I'm not aware of any experimental data, but I personally don't feel too overburdened by multiple intermediate NPs and PPs in this case. So I have no idea when a verbal placeholder is created in Russian: always, never, only in the case of too many subsequent NPs and PPs, or only in some specific context (e.g. one that forces focalization).

Monday, July 26, 2010

Frames, constructs and the log

Some time has passed. My ideas of the internal representation of natural language have changed. Here are the new ones.

Every word in the input is taken as a whole, and all the information about it is stored in a single frame. The frame may have different aspects: morphological, syntactic, semantic, discourse, etc. The aspect names are called roles; the objects corresponding to these roles are called constructs.

Lexical ambiguities are handled frame-internally. If a word is syntactically ambiguous, its frame has several possible syntactic constructs; a polysemous word has several semantic constructs. The frame should eventually choose one construct for each role, and the chosen constructs must not contradict each other. Here's how a frame for the famously ambiguous bank could look:
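One possible encoding, with a construct inventory that is my own guess: each role maps to a set of candidate constructs, and the frame must commit to one per role without the choices contradicting each other.

bank_frame = {
    "syntactic": {"noun", "verb"},                        # 'a bank' vs 'to bank on'
    "semantic":  {"financial-institution", "river-bank", "rely-on"},
}

# Pairs of constructs known to be compatible; a real frame would derive this.
compatible = {("noun", "financial-institution"),
              ("noun", "river-bank"),
              ("verb", "rely-on")}

def choose(frame, syntactic, semantic):
    assert syntactic in frame["syntactic"] and semantic in frame["semantic"]
    assert (syntactic, semantic) in compatible, "contradictory constructs"
    return {"syntactic": syntactic, "semantic": semantic}

print(choose(bank_frame, "noun", "river-bank"))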
Once frames with embedded constructs appear in the model, they remain activated for some time and may perform various actions. They may add constructs to their own or another frame. They may establish links to constructs of other roles in the same frame. Or they may create connections to other frames.

Establishing a connection between frames means creating a new frame whose child frames are the ones being connected. The new frame may also have several aspects. For example, in the usual John loves Mary, two extra frames are created to link the predicate with the subject and the object respectively. Each of these frames hosts two constructs: one representing the syntactic relation, the other the semantic one (in experiencer, state and theme terms):
As I've already stated, not only the final structure is important, but also the sequence in which it was constructed. For this, a simple program-like log can be used:

frame: syn=noun, sem=JOHN
frame: syn=verb;transitive, sem=love
frame ^2 ^1: syn=subject+predicate, sem=experiencer+state
frame: syn=noun, sem=MARY
frame ^3 ^1: syn=predicate+object, sem=state+object

Each line here describes a new frame. A frame's children are referred to as ^x, where x is how many lines back the child frame was introduced: ^1 means the previous line, ^2 the one before it. When a frame is created, it has no aspects; their subsequent assignments are reflected in the log.

This log reflects the dynamics of the parsing process and can thus be used in the applications where the order of operations is important, like machine translation.
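For concreteness, here's a tiny interpreter of such a log, using my own encoding of the lines above as dicts; the ^x references become offsets counted back from the current line.

def replay(log):
    frames = []
    for line in log:
        children = [frames[len(frames) - x] for x in line.get("children", [])]
        frames.append({"aspects": line["aspects"], "children": children})
    return frames

log = [
    {"aspects": {"syn": "noun", "sem": "JOHN"}},
    {"aspects": {"syn": "verb;transitive", "sem": "love"}},
    {"aspects": {"syn": "subject+predicate", "sem": "experiencer+state"},
     "children": [2, 1]},
    {"aspects": {"syn": "noun", "sem": "MARY"}},
    {"aspects": {"syn": "predicate+object", "sem": "state+object"},
     "children": [3, 1]},
]
frames = replay(log)
print(frames[2]["children"][0]["aspects"]["sem"])   # JOHN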

Friday, May 21, 2010

The horrors of the Russian language

Two quite simple phenomena, typical in everyday speech:

Он сам пришел (he himself came)
Мы все пошли гулять (we all went to walk)

Everything's fine until you try to understand the functions of сам and все. First, their meaning is quite hard to describe. As for the second one, I still have no idea what все adds to this sentence and how it differs from Мы пошли гулять. I suspect that there must be the same difference in the English translation, but I can't state it clearly.

Furthermore, the syntactic behavior is completely weird. It looks as if все were an optional syntactic modifier of мы. But if we change the case, we get a very strange construction:

У нас (у) самих ничего нет (we ourselves don't have anything).
До нас (до) всех дошло известие (to us (to) all has come the news).

This optional preposition doubling makes no sense to me, and I don't know how to describe it in either phrase structure or dependency grammar rules. What modifies what in this case?

Saturday, February 13, 2010

Charts and clouds

As Doug Arnold pointed out, attacking trees and looking for a replacement isn't new. It has already been done, so I didn't actually invent anything revolutionary. The construction cloud is actually a well-known structure called a chart, as in chart parsing. The chart is designed to keep all possible parse variants in a compact way to ease backtracking and reparsing. And my cloud does exactly this, in a very similar fashion.

One difference is what kind of information is stored. In the classical approach, the chart is a bunch of records like "there may be an NP constituent from word 3 to word 7". No internal structure of these constituents is stored. If someone needs the complete tree, it can be computed on demand, since all of the daughter phrases are also present in the chart. And you're lucky if there's only one variant of such a subtree.

A cloud is a collection of records like "constructions X, Y and Z may form a construction T". Since constructions don't tend to form deep hierarchies, each of them stores its children directly, so we don't have to compute the subtree on demand. But a single construction rarely covers the whole sentence; the task is therefore to find a subset of all constructions that covers the input and whose members don't contradict each other.
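A sketch of such records and of the covering check, with made-up field names; the word indices refer to I see a little silhouetto:

cloud = [
    {"forms": "subject",    "children": [0, 1]},   # I + see
    {"forms": "object",     "children": [1, 4]},   # see + silhouetto
    {"forms": "determiner", "children": [2, 4]},   # a + silhouetto
]

def covers(records, sentence_length):
    words = {w for r in records for w in r["children"]}
    return words == set(range(sentence_length))

print(covers(cloud, 5))   # False: word 3 ('little') isn't covered yet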

Another difference is that a chart is usually tightly coupled to its parser. A typical chart parser computes all the results in parallel, storing every possible intermediate constituent in the chart. I, on the other hand, am into single-variant serial parsing with error recovery. Of course, nothing prevents us from implementing that kind of parser with a phrase-structure-based chart: just drive one variant instead of all. One of the reasons I prefer dependency-like constructions over constituents is that free-word-order languages appear to be easier to describe in terms of constructions.

Wednesday, February 10, 2010

But why cloud?

Once upon a time I was having problems with trees. I decided I prefer a construction cloud instead. Here's why.

Like the syntactic groups grammar, the construction cloud accommodates both phrase structure and dependency grammars. A construction relating two words can be thought of as a dependency relation between them, with the rare twist that this 'dependency' can itself participate in another construction, as with Prepositional and Genitive from the example.

The construction cloud also doesn't object to discontinuous constructions. If the parser managed to build a construction from the text, it's acceptable.

Unlike phrase structure, dependency and categorial grammars, it allows a construction to have more than one parent. This permits overlapping of various kinds, including, for example, separate constructions for syntactic agreement, semantic relatedness and prosodic cues to information structure.

From the programmer's perspective, it's much easier to maintain the list of attachable constructions during parsing. They are no longer buried deep inside a tree but stored together in a flat list. Being immutable, they are easy to cache for future reparsing. Rollback in case of reanalysis is also simple: just look at the constructions ending before the reparse point to restore the context. Finally, nobody prevents you from keeping constructions from several parse variants in the same cloud simultaneously.
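A small illustration of the rollback trick, with hypothetical fields: each construction records the input position where it ends, so restoring the state before a reparse point is just a filter over the flat list.

from collections import namedtuple

Construction = namedtuple("Construction", "name end")   # 'end' = last word index

cloud = [Construction("noun-phrase", 1),
         Construction("subject-verb", 2),
         Construction("prep-phrase", 5)]

def rollback(cloud, reparse_point):
    """Keep only constructions finished before the reparse point."""
    return [c for c in cloud if c.end < reparse_point]

print(rollback(cloud, 3))   # the prep-phrase is dropped, the rest is reused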

Isn't it convincing?

Tuesday, February 9, 2010

Cloud of constructions

What I propose to use as the parser representation instead of trees is a collection of constructions. I call it a cloud.

Base constructions are words. When several constructions (e.g. words) tend to go together and comprise something meaningful in the semantics, they form a composite construction and become its arguments, or children. Composite constructions may also be children of even higher-order constructions, although not very often.

Consider I see a little silhouetto of a man as an example. Each word is a construction, so it's already in the cloud. I+see form a construction whose meaning is to specify see's subject as I. A similar thing happens with see+silhouetto, though this time it's an object. Both a and little provide additional information about silhouetto: it's indefinite and, well, little. Again, indefiniteness is added to man by the a+man construction. of+man form a prepositional construction, which then itself forms a genitive construction with silhouetto. The result can be nicely visualized:

Isn't this looking like a strangely-colored cloud?
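The same constructions can be spelled out as plain data (my own encoding; in real code the two a's would be kept apart by their word positions):

cloud = [
    ("subject",       "I", "see"),
    ("object",        "see", "silhouetto"),
    ("determiner",    "a", "silhouetto"),
    ("attribute",     "little", "silhouetto"),
    ("determiner",    "a", "man"),
    ("prepositional", "of", "man"),
    ("genitive",      "silhouetto", "prepositional"),   # a composite as a child
]

higher_order = [c for c in cloud if c[2] == "prepositional"]
print(higher_order)   # the genitive construction built on top of another one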

Of course, composite constructions don't need to have exactly 2 arguments. For example, the more X the more Y construction has 6 arguments, 4 of which are constant words. X is Y to Z (as in John is hard to convince) has 5 arguments with 3 slots (actually 4: is isn't fixed either, being replaceable with was).

And yes, clouds are better than trees.

What else is wrong with the trees?

I've already explained why I don't like using trees as an intermediate parser representation. Here's why I don't want to use them as the final parser representation either, i.e. for capturing the syntactic structure of a text.

1. Double attachment

I wrote a letter to my friend. Did I write something (a letter) to my friend, or was it the letter to my friend that I wrote? I have no idea; both alternatives seem plausible to me. So why should I make an unmotivated choice if I can have 'to my friend' attached to both 'wrote' and 'letter'? But the result won't be a tree.

2. Overlapping

German zum means zu (preposition) + dem (dative definite article). If the tree has separate slots for prepositions and articles (e.g. PP and DP heads), this word would go to both, which is not tree-like.

Welsh has single-conjunct subject-verb agreement. For example, in the Welsh translation of the (very "natural") I and my brother saw ourselves, saw agrees with the singular I, whilst ourselves agrees with the plural compound I and my brother. If we think that agreeing items should be directly connected in the syntactic structure (not necessarily true, but it would be nice), then saw should be directly connected both with I (agreement) and with I and my brother (subject).

However, Doctor, I'll stay here. IMHO any text should be analyzed as is. In particular, punctuation should be processed in the same way as words, since it's part of the text and is subject to the grammatical rules of the language called Written English. But however at the beginning of a clause requires a comma after it, and the title of the Doctor being addressed also requires surrounding commas. So, where does the comma between however and Doctor belong? If you ask me, it belongs in both places, and this again isn't tree-like.

3. Discontinuous constituents

Actually, there's nothing wrong with them from a programmer's perspective. But both phrase-structure and dependency grammars prohibit discontinuous constituents: for theoretical reasons, they don't allow the lines connecting their nodes to cross.

So, for me trees are also not very convincing as the 'ideal' syntactic representation. But is there an alternative? Yes, and not just one. Attribute-value matrices, for example. Or constructions. But that's another story.

Monday, February 8, 2010

Trees in NL parser

Imagine you're a parser analyzing the famous The horse raced past the barn fell. Additionally, you're operating in an incremental left-to-right fashion, considering a single active parse variant at each moment. Regardless of the theory, you'll probably combine the horse and past the barn into some kind of units, say NP1 and PP2. Before you see the last word, you'll probably consider NP1 to be the subject of the finite verb raced, and PP2 its adjunct. But then fell comes. At this moment you realize that the single active parse variant you have is wrong, return to raced and reanalyze it as a passive participle (The horse [that was] raced past the barn fell). This is all well known and boring.

What's not boring is the implications for a computational parser operating in a similar way. First, it should somehow be able to restore its state right after NP1 and reparse from there, choosing the passive-participle alternative of raced. Second, it would be nice not to reparse PP2 again: this word sequence will produce the same result anyway and still function as an adjunct, though now in the relative clause.

Since we're going to store the parsing state and analyze it in various ways (e.g. find slots to fill), we'd better store it as data. What kind of data? The standard answer is an (incomplete) tree. I'll show that's not convenient.

Let's consider mutable trees first. Nodes can be added to or removed from such trees at any position at any time. Therefore, to store anything we first need to copy it completely and put it aside, so that subsequent changes in the active tree aren't reflected in the copy. This applies both to the parser state after NP1 and to the I'd-like-it-cached PP2. But we don't want to copy the parser state at every moment just in case we have to backtrack there in the future: we'd run out of memory very soon.

An alternative I see is to copy nothing, but to log thoroughly every change made to the tree in an undoable manner. When a need for backtracking arises, we can just walk back through the log, undoing every single change, and restart. This solves the state-saving problem but doesn't offer caching for what's been undone: we'd have to reparse PP2 anew.

So mutable trees don't appear to be useful for our particular parser. Let's look at immutable trees. Their nodes are created once and fixed forever: no nodes can be added or removed. The only way to attach a constituent to a tree is to produce a new tree with the constituent already in the right place. The parts of the old tree not directly dominating the attachment place can be reused in the new tree; everything else has to be recreated. Luckily, the reused part is usually much bigger, so not much memory is consumed. We can now easily save the current parsing state at any moment without fear. And we can also cache PP2 before the reparse, since no one can hurt it anymore: it's immutable.
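A toy immutable tree along these lines (my own minimal version, not the parser's actual data structure), showing that attaching a constituent reuses the untouched subtrees:

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)      # fixed forever

def attach(node, path, new_child):
    """Return a new tree with new_child added under the node reached by path."""
    if not path:
        return Node(node.label, node.children + (new_child,))
    i = path[0]
    rebuilt = list(node.children)
    rebuilt[i] = attach(node.children[i], path[1:], new_child)
    return Node(node.label, rebuilt)         # siblings off the path are shared

np = Node("NP", [Node("the"), Node("horse")])
s = Node("S", [np])
s2 = attach(s, [], Node("VP", [Node("fell")]))
print(s2.children[0] is np)                  # True: the old NP is reused, not copied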

But all this is not for free. You have to keep track of all the places in the partial tree where an attachment is possible. For example, in I gave the ... there are two slots waiting to be filled: the noun after the and the second argument of gave. They have to be represented in the tree somehow, and at least one of them won't be on the top level.

The slots to be filled may also not be marked explicitly. In I saw a girl with a telescope, the PP with a telescope may be attached either to the noun girl or to the verb saw. Since it's not required by either of them, the parser has to walk down the whole tree, find all the constituents touching the right boundary, and then try to combine them with the PP. Moreover, psycholinguists observe that such an adjunct PP is most often attached to the lowest suitable constituent. Probably it's easier that way for humans, but it would be a very resource-consuming operation in a computational parser.
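Here's roughly what that operation amounts to, sketched over tuples of (label, children) with invented helper names: collect everything on the right edge, then pick the lowest candidate that accepts a PP.

def right_edge(node):
    label, children = node
    yield node
    if children:
        yield from right_edge(children[-1])          # only the rightmost child touches the edge

def pp_attachment_site(tree, accepts_pp):
    candidates = [n for n in right_edge(tree) if accepts_pp(n)]
    return candidates[-1] if candidates else None    # the lowest suitable constituent

tree = ("S", [("NP", [("I", [])]),
              ("VP", [("saw", []), ("NP", [("a", []), ("girl", [])])])])
site = pp_attachment_site(tree, lambda n: n[0] in ("VP", "NP"))
print(site[0])   # 'NP' -- the lower attachment, as humans tend to prefer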

Finally, the attachment place may not even be located on the right boundary at all. This is the infrequent but still important case of discontinuous constituents, usually occurring in free-word-order languages like Russian, Latin or Warlpiri. English also has such examples:

The question has been asked, if there's life on Mars.
"Tomorrow," he said, "I'll return".
Yesterday, however, they left.

In the first example, we should attach the subordinate clause if there's life on Mars to the NP the question, which doesn't touch the right boundary at all. The tree we have at that moment is something like

[S: [NP: the question] [VP: has been asked]]

and represents a complete grammatical sentence. So, do we have to search the whole tree to find the constituent to attach to? Nonsense. But I have no idea what a reasonable attachment strategy would look like that uses immutable trees and covers the presented data.

Of course, all these problems are solvable. It's just that I don't want to implement those solutions, given their clumsiness, complexity and likely bug-riddenness. I'd rather use another intermediate representation for the serial parser. Which one? That's another story.

Sunday, January 17, 2010

Operational Text=>Meaning model

The model has several levels, though they are almost completely different from Melchuk's:

Level 1: text/phonology.
The parser receives this as input and then produces

Level 2: constructions.
The contiguous text is separated into constructions of all sizes, from morphological to complex syntactic ones. Basic constructions are (mostly) word stems; complex constructions combine simpler ones. The structure is most probably a tree, though not necessarily. Each construction may carry information structure. Constructions are ordered and considered executable in that order. Being executed, they give us

Level 3: semantics: a program.
In some general-purpose (programming) language with well-defined semantics and a well-defined model. The model probably consists of local variables (at most 4-7), a salient context and general memory, which is organized as a graph of frames and represents the listener's knowledge of the world. The language is imperative, object-oriented and almost lacks control flow, being very linear. The basic instructions are frame creation and assignments to local variables or frame slots (a toy sketch at the end of this post shows what such a program might look like). If we execute this semantics program, we'll get

Level 4: change in the frame structure in the memory.
Which means a change in the listener's knowledge of the world. Actually, the change may be greater than we'd expect from looking at the semantics program: each of its simple instructions may cause significant changes that are not directly encoded in the program. This is what we call pragmatics.

We're done! Though it's worth noting that the frames we've built may well encode a sequence of actions (e.g. a cooking recipe), which can also be executed via interpretation and result in the listener's real actions in the real world.
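To make level 3 a bit more concrete, here's a guess at what such a semantics program could look like when written out in an ordinary imperative language; the frame slots and helper names are invented for illustration.

memory = []                       # the graph of frames (level 4 lives here)

def new_frame(**slots):
    frame = dict(slots)
    memory.append(frame)
    return frame

# "John loves Mary" as a linear sequence of instructions:
# only frame creation and slot assignments, no control flow.
A = new_frame(type="person", name="John")      # local variable A
B = new_frame(type="state", predicate="love")
B["experiencer"] = A
C = new_frame(type="person", name="Mary")
B["theme"] = C

print(memory)    # executing the program has changed the listener's "knowledge"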