
Tuesday, October 9, 2012

Parser rewrite completed

In June I started rewriting the parser because there were too many hacks in there. I've just finished: all the tests that were passing before that decision now pass again. Hurray!

It took about 4 months. I hoped it would be faster. Perhaps it could have been, if I hadn't been procrastinating so much. Now there are a lot fewer hacks, since there's much stronger support for syntactic hierarchy. Hopefully, adding each new test sentence won't take so long now. Although fixing the last several tests really took several weeks, which isn't very promising :)

The plan now is to try to speed things up a little by not trying all the semantic alternatives after each and every word. And then to add new sentences. I have quite a few of them queued up for parsing and translation.

Tuesday, July 31, 2012

The comma dilemma

There's a sentence to translate, and there's a comma missing in it. This doesn't prevent me from understanding the sentence, but my parser fails. In another version of the same text found online the comma is present. It's only missing in the version I found long ago, which I'm trying to translate literally.

So here's the dilemma:

  • Add the comma according to the Russian rules and parse the correct text, or
  • Remember that I'm trying to simulate humans, for whom extra commas are not a big deal, and dive right away into the wonderful world of parsing noisy input without a proper understanding of how to deal with correct input.


No idea.

UPD. In the end I corrected the commas in the original sentence, and added failing tests, with all kinds of comma omissions, to fix later.

Friday, July 20, 2012

Visibility ambiguity

I'm now dealing with a quite interesting kind of ambiguity which I call visibility ambiguity. Consider the following three sentences (I might have put too many commas; imagine it's Russian where they're obligatory):
  1. But there, thinking about her words, we got sad.
  2. But there, thinking about her words, which were so cruel, we got sad.
  3. But there, thinking about her words, intonation and gestures, we got sad.
The sentences are the same until the second comma after which they diverge dramatically. To process the next word the parser should determine which previous state it can attach to. In 1, we is attached to there and the top-level sentence is continued. In 2, which relates to words plus comma, a relative clause begins. In 3, there's a conjunction: words gets merged with intonation and re-attached together to about and her.

Obviously, all three attachment sites should be visible to the parser: there, words and comma. And all three continuation possibilities should be explored.

Which is where it gets complicated, because these continuation routes are incompatible with each other. Once you've attached a word following route 1, you cannot attach the next one following route 2. The parser should regard such attachments as mutually incompatible. There are at least three diverging routes after every comma, each defined by its own set of constructions. And all constructions from different sets should be marked as contradictory, pair-wise. Which is quite a few pairs.
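To make this concrete, here's a minimal sketch (with made-up route and construction names, not the parser's real data) of what marking those pairwise contradictions amounts to:

from itertools import combinations, product

# Hypothetical routes after the comma; each is a set of construction contributions.
routes = {
    "top-level continues": {"sentence.subject", "sentence.head"},
    "relative clause":     {"relative.which", "relative.head"},
    "conjunction":         {"conj.member2", "conj.wrapping"},
}

contradictions = set()
for (_, route_a), (_, route_b) in combinations(routes.items(), 2):
    for a, b in product(route_a, route_b):
        contradictions.add(frozenset((a, b)))

# Every pair of contributions from different routes is incompatible:
# three routes of two contributions each already give 3 * 2 * 2 = 12 pairs.
print(len(contradictions))  # 12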

So it's not only that the number of possible variants will be big. It's also that the number of incompatibility relations to track will be much bigger, something pretty exponential. Given that I normally log all the visible attachment sites after each word during parsing, the log will be hard to analyze.

And what troubles me: people don't do this, they don't track all the possibilities. Instead, they quite quickly choose a variant which suits best, purge everything else from the memory and pursue only this parsing route. If it goes wrong eventually, they perform some kind of reanalysis. But the heuristics are tuned well enough and normally such a need never arises.

Right now my parser maintains all the structures needed for all possible analyses, and switching between them is very simple. But that just doesn't scale well enough to support visibility ambiguities. So it seems that the time has come to teach my parser the human strategy: choosing one route and then reparsing when anything doesn't fit well.

Monday, June 18, 2012

7 stages of exploratory programming


  1. One has an idea on natural language understanding, starts implementing it. Cool, it works!
  2. One encounters some sentences that are not easy to deal with. It's not clear how to handle them the right way, beautifully, so one does it somehow, by dirty hacking. Hopefully, it'll become clearer later when new data arrives.
  3. As more and more sentences are encountered, some hacks are corrected but others are added. Hacks proliferate and get tightly intertwined.
  4. It becomes harder to parse new sentences. Hacks start getting in the way. Fortunately, one starts seeing how to avoid them.
  5. Correcting the design appears to be non-trivial because hacks depend on each other. So to fix one you should first correct another one, which requires a third one, and so on. There's a whole dependency graph of hacks!
  6. The test suite doesn't seem a friend anymore. It keeps failing, revealing more and more ways hacks depend on each other. No new sentences get parsed anymore.
  7. After half a year of pure refactoring, the morale is low and still no light at the end of the tunnel. Yet another cyclic hack dependency is found. Tired, one decides to apply all the desired design changes at once and fix the tests one by one, almost as if writing the parser afresh. Goto 1.

Saturday, November 12, 2011

Test-driven parsing

I use Test-Driven Development for my natural language parser, in a very orthodox way. And now that I do it, I see that what I had before is just test-first development, not quite test-driven.

I try to do something new in my parser every day. But very often in the evening I see that I won't manage to get the tests to pass today. But, if I insert a little hack here and there, then maybe... And, if mocking really helps, I do it, but not without consequences: I add another test that doesn't pass because of the mocking. Usually I just take the sentence and add a small variation to it and, of course, to its translation (e.g. I can change a tense, a noun's person, the word order). After that I may go to sleep untroubled: the new test belongs to the next day, and I have plenty of time to think about how to parse in a more generic way. If this time is not sufficient, I can mock further and create one more test. After a series of those variation tests, I get a satisfactory algorithm and quite a coverage. That's how it goes.

This mocking-all-the-time approach seems to be quite a fruitful way to get things done. Now I really understand those agile guys saying "keep it simple and never change anything without a test". I can never think of all the possibilities, and there's no right way to do things yet. If I implement a nice beautiful logic based on just one sentence, then the next sentence will inevitably prove this code wrong, or too inflexible, or just plain inconvenient. But when I mock things several times, the mocking logic tends to get more and more generic and finally doesn't look like mocking at all, while still being flexible and convenient. Many mocks have already been transformed into serious logic; many others still await it.

By the way, it's not about unit-testing at all. Of the 175 tests I have right now, only 4 are unit tests. They test the longest common subsequence algorithm that I need for ellipsis and had to implement myself. There are also 17 parsing tests which check the internal semantic representation produced after parsing. They are useful for keeping an eye on the semantics, but their expected output also tends to change in insignificant ways quite often, and then it's just tedious to update the tests. So I prefer to write translation tests, because their expectations are a golden standard: I could completely change the architecture of the parser without ever touching them.
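For illustration only, a translation test in this style might look roughly like the following; the translate function is a placeholder, and the sentence pair is taken from the Kharms example discussed below:

import unittest

def translate(russian_text):
    # Placeholder for the real parser + generator pipeline.
    raise NotImplementedError

class TranslationTest(unittest.TestCase):
    def test_first_clause(self):
        # The expectation is the golden standard: it never has to change,
        # no matter how the parser internals are rewritten.
        self.assertEqual(translate("Удивительный случай случился со мной"),
                         "An amazing thing happened to me today")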

Tuesday, November 8, 2011

NLP using graphs, actually


The reason for the complete rewrite was simple:

Это  отвлекло   нас от   нашего спора.
That distracted us  from our    argument.

Every word makes a contribution to the parsing state by saying that it participates in a number of constructions, and provides some attributes for each of them. So a contribution consists of pairs of constructions and attribute sets. I'll call such pairs mites.
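As a data structure, a mite can be sketched very simply (the field names are mine, not the parser's):

from dataclasses import dataclass

@dataclass(frozen=True)
class Mite:
    """One word's contribution to one construction: the construction's name
    plus the attributes this word supplies for it."""
    construction: str          # e.g. "nom" or "acc"
    attributes: tuple          # e.g. (("noun", "A"),) or (("head", "B"),)

# "Это" can be the noun of either nom or acc; these two mites contradict each other.
eto_nom  = Mite("nom", (("noun", "A"),))
eto_acc  = Mite("acc", (("noun", "A"),))

# "отвлекло" supplies a head to both nom and acc; these two don't contradict.
verb_nom = Mite("nom", (("head", "B"),))
verb_acc = Mite("acc", (("head", "B"),))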

That (это) can add noun attributes to the nom(inative) and acc(usative) constructions, and these mites contradict each other. Distracted (отвлекло) is a transitive verb, therefore it defines a head for the nom and acc constructions. But these two mites don't contradict each other; they're both very welcome.

Earlier, I unified mites as soon as they came, and I only kept the results of the unification, not the mites themselves. After the word that, there were nom and acc constructions with noun defined, and nom was (randomly) chosen as active. Active constructions had higher priority, and therefore the verb's nom mite was integrated first. Two nom mites were unified, and the nom construction with both head and noun defined linked the values of these attributes semantically. That's all great. But what to do with acc?

You can't just unify it as well, because there's already a nom construction and it contradicts acc in its first mite. So you have to drop one of the mites completely and leave the other. In this sentence, it made sense to drop the acc.noun mite (the accusative alternative of that). The acc.head mite would survive and unify nicely with the real object that comes next, us. But that was a hack, and it didn't work when the first это was in fact accusative. A better way would be to preserve both acc mites but mark only the second one as active. That clearly required per-mite active status, not per-construction.

So now I don't remove any mites automatically when processing word contributions; they all survive. What matters is whether they are active or not. Active mites are guaranteed to be non-contradicting. Only active mites belonging to the same construction are unified and passed to that construction, which may then make some semantic changes based on that information.

When another word comes, the parser solves a constraint satisfaction task on graphs. Some of the new mites are chosen to be active, the old mites may change their active status to accommodate that. The reanalysis tries to 
  • minimize the number of active constructions
  • maximize the number of mites in those constructions (more mites mean more information)
  • prefer more recent mites
  • prefer semantically more plausible variants
Surprisingly, it works, though twice as slowly as before. And the algorithm now seems so clean and logical that I can't find anything to remove! But I definitely should optimize, as I get bored waiting for 10 long seconds until all 171 tests pass.
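Just to fix ideas, those preferences could be encoded as a lexicographic ranking key over candidate sets of active mites; everything below (names, the plausibility score) is invented for illustration:

def solution_key(active_mites, word_index, plausibility):
    """Rank one candidate set of active mites; smaller keys are better.

    `active_mites` is a set of (construction, mite) pairs assumed to be
    mutually consistent, `word_index` maps a mite to the position of the word
    that produced it, and `plausibility` is a semantic score (higher is better).
    The four components mirror the preferences listed above.
    """
    constructions = {construction for construction, _ in active_mites}
    recency = sum(word_index(mite) for _, mite in active_mites)
    return (len(constructions),            # fewer active constructions
            -len(active_mites),            # more mites = more information
            -recency,                      # more recent mites
            -plausibility(active_mites))   # more plausible semantics

# best = min(candidates, key=lambda s: solution_key(s, word_index, plausibility))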

Monday, October 31, 2011

Great was their and my amazement!


Let's start with a simple noun phrase: my amazement. My parser analyzes it as follows. The word my creates a frame with type ME and contributes the possessor attribute to the possessive construction. The word amazement creates a frame with type AMAZE and contributes the head attribute to the same possessive construction. Now that both the head and possessor attributes are defined, the construction links them with an arg1 attribute:

A.type=ME
B.type=AMAZE
B.arg1=A


It's quite similar with their amazement:



A.type=THEY
B.type=AMAZE
B.arg1=A




Now let's parse their and my amazement. The semantics I want to get is this:

A.type=THEY
B.type=ME
C.member1=A
C.conj=and
C.member2=B
D.type=AMAZE
D.arg1=C

I want their, my and amazement to preserve their functionality. This means that after their and my there should be a possessive construction whose possessor points to the conjunction frame C. So it's the word and that should do this; there are no other candidates.

Currently and puts the parser into a special state. In this state, every subsequent contribution is intercepted and analyzed to see whether it can be a conjunct. A contribution can be the right operand of coordination if there was a similar contribution to the left of it (similar in the same sense as with ellipsis). When a similar contribution is found, the parser takes both the left and right contributions and merges them. In this particular case, the left contribution had possessive.possessor=A, the right one had possessive.possessor=B. While merging, a new frame C is introduced, a composite over A and B. Now the possessor attribute of the merged possessive construction points to C instead of A or B. The updated contribution is applied to the parsing state, so that when amazement comes, the result will be precisely as wanted. That's it.
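Here's a toy sketch of that merge step, with the frame bookkeeping reduced to plain dictionaries (all names are mine):

def merge_contributions(left, right, composite_frame):
    """Merge two similar contributions into one coordinated contribution.

    `left` and `right` map construction names to attribute dicts, e.g.
    {"possessive": {"possessor": "A"}}. Wherever both sides set the same
    attribute of the same construction to different frames, the merged
    contribution points to `composite_frame` instead.
    """
    merged = {}
    for construction in left.keys() & right.keys():
        attrs = {}
        for name in left[construction].keys() & right[construction].keys():
            l_val, r_val = left[construction][name], right[construction][name]
            attrs[name] = composite_frame if l_val != r_val else l_val
        merged[construction] = attrs
    return merged

left  = {"possessive": {"possessor": "A"}}   # from "their"
right = {"possessive": {"possessor": "B"}}   # from "my"
print(merge_contributions(left, right, "C")) # {'possessive': {'possessor': 'C'}}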

This is the core idea. In my parser, conjuncts don't connect constituents or any static structure. They connect contributions, i.e. the changes made to the parsing state. Merging two changes results in one merged change. It's quite powerful and allows for some interesting things like non-constituent coordination. But that's another story.

Tuesday, October 25, 2011

NLP using graphs

Could natural language parsing be a task on graphs? Maybe, at least partly.

Now, during parsing, every word contributes to some constructions. Some of these contributions are mutually exclusive. For example, in Russian a noun typically can't be nominative and accusative at the same time. So this noun's contributions for nom and acc constructions are incompatible.

Different words also may have contradicting contributions. Two nouns in the same clause can't both be accusative even if their forms are compatible with such an alternative (i.e. they both contribute to acc construction).

So here's the idea. Consider a graph whose vertices are formed by each construction contribution for each word. And where those contributions contradict each other in any way, there's an edge. The task is then to find a maximal subset of vertices which are not connected to each other.
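A greedy sketch of that task, just to make the idea concrete (an approximation, not necessarily what the parser should do):

def greedy_independent_set(vertices, conflicts):
    """Pick a maximal set of contributions no two of which contradict each other.

    `vertices` is an iterable of contribution ids; `conflicts` is a set of
    frozensets naming the contradicting pairs (the graph's edges).
    """
    chosen = []
    for v in vertices:  # the order could encode preferences, e.g. recency
        if all(frozenset((v, u)) not in conflicts for u in chosen):
            chosen.append(v)
    return chosen

vertices = ["noun1.nom", "noun1.acc", "noun2.acc", "verb.head"]
conflicts = {frozenset(("noun1.nom", "noun1.acc")),
             frozenset(("noun1.acc", "noun2.acc"))}
print(greedy_independent_set(vertices, conflicts))
# ['noun1.nom', 'noun2.acc', 'verb.head']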

There are, of course, other constraints. The resulting structure should make sense from the semantic viewpoint. The subset should be constructed incrementally and conservatively: if the parser can proceed without reanalysis of the already built structures, it should do so. Finally, this graph task has a very limited scope, e.g. it doesn't apply across different clauses.

That's just an idea. Currently my parser doesn't track individual contributions per construction, it just unifies them eagerly. But the test data suggests that might change.

Saturday, July 16, 2011

Ellipsis

One of the great things about constructions is that you can do cool things with them which are not so easy with other parsing techniques. One example is ellipsis, when you may just omit any number of words if they are the same as in the previous clause:

По мнению одних дальше следовало 7, по мнению других — 8
PREP opinion some next went 7 PREP opinion others — 8
In the opinion of some, a 7 went next; but in opinion of others an 8 did

To translate this correctly, several things should be done. The parser should recognize that the dash actually stands for went next. The semantic model should somehow incorporate this, as well as the meta-knowledge that this went next is actually not specified explicitly (otherwise the generator would produce but in opinion of others an 8 went next). Finally, the generator should be able to replace the full verb phrase with did.

Parser


There's clearly some parallelism here. After processing in opinion of others the parser should already know that this is similar to in the opinion of some. The same happens with 8, which is similar to 7. Now the parser should fill in the dash gap with the missing information taken from the previous clause. This means it should have access to all actions it performed during processing of went next, and it should repeat them.

So, there are two problems. One is how to find similar fragments given that they consist of different words. The other is how to be aware of the earlier activity of the parser and be able to reproduce it in new contexts. The problems are quite tightly connected: it's the earlier activity where you search for something similar. And no surprise that both problems are solved by the same means: keeping a change log.

Every change in the parsing state is kept in a log. Luckily, the construction architecture is very suitable for this. Remember, each word tells the parser which constructions it participates in and which of them are mutually incompatible. It also contributes arguments to the constructions: specifying a noun of a nom(inative) construction is completely different from contributing a head (verb). Based on this information, the parser resolves the possible conflicts, determines the winning constructions and executes the code attached to them which makes the changes to the semantic model.

The log consists of the contributions each word has made. And by allowing constructions to have access to this log, the parser can tackle the two problems posed by ellipsis:

1. Similarity

7 and 8 are different words, but the contributions they make are quite similar: they invoke the same constructions (nom, gen, acc, ..., all the cases) but provide different arguments to them: one creates a frame whose type is 7, the other does the same with 8. So I infer that two contributions are similar if they contribute to the same constructions and supply the same kinds of arguments.
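In code, that similarity check could be as small as this sketch, where a contribution is just a list of (construction, argument kinds) pairs:

def similar(contribution_a, contribution_b):
    """Two contributions are similar if they invoke the same constructions
    and supply the same kinds of arguments to them (the values may differ)."""
    def shape(contribution):
        return {(construction, frozenset(kinds))
                for construction, kinds in contribution}
    return shape(contribution_a) == shape(contribution_b)

seven = [("nom", ["noun"]), ("gen", ["noun"]), ("acc", ["noun"])]
eight = [("nom", ["noun"]), ("gen", ["noun"]), ("acc", ["noun"])]
print(similar(seven, eight))  # True: same constructions, same argument kinds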

2. Repetition

The parser knows it's processing ellipsis. It knows where the ellipsis starts (in opinion of others) and ends (8). It has found the similar fragments in the previous clause (in the opinion of some and 7). Now it just has to look at the contribution history, find everything in between those two similar fragments and insert it between in opinion of others and 8. The copying happens upon encountering 8 after the dash, and 8's own contribution is nicely integrated into the environment set by the copied history (e.g. it fills the noun of the nom construction, whose head was set by the copied went).
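The copying step itself is then mostly slicing over the contribution log; a hypothetical sketch:

def contributions_to_copy(log, similar_to_start, similar_to_end):
    """Pick what the previous clause did between its two fragments that are
    similar to the ellipsis boundaries.

    `log` is the chronological list of contributions; `similar_to_start` is
    the index of the contribution similar to the ellipsis start ("in the
    opinion of some" ~ "in opinion of others") and `similar_to_end` the index
    of the one similar to the ellipsis end ("7" ~ "8"). Everything strictly
    in between ("went next") gets re-applied in the current clause, inside
    an ellipsis scope.
    """
    return log[similar_to_start + 1:similar_to_end]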

Intermediate representation


OK, now we have two went next situations, but only one of them was fully expressed in Russian. And we should do the same in translation. So we have to mark the copy in a special way. In the semantic model we have no contributions, we have only frames and assignments to their properties. Hence we should mark the effects of the copied contributions, i.e. the assignments they add.

For this I've introduced a notion of scope into the semantic model. A scope is just a frame that plays a special role during parsing. Which special role depends on the frame and its attributes. What's important is that it's activated at one moment and deactivated at another, with arbitrarily many assignments made in between. In the semantic log this is written as [X] ... [/X] where X is the scope frame and ... stands for the assignments.

I plan to use scopes for many more things, mostly of a meta-linguistic kind: separating paragraphs, reflecting the markup, etc. Using them with ellipsis is simple. You just create a frame, say its type is ellipsis, activate it, do the whole contribution-copying process (which surely adds some assignments) and deactivate it. Thus you get a nice clear separation of what's said and what's inferred:

A.type=OPINION
#1.opinion_of=A
B.type=SOME
A.arg1=B
#1.time=PAST
C.type=COME_SCALARLY
C.order=AFTER
C.arg1=D
D.type=7
D.number=true
#1.but=#2
--
A.type=OPINION
#2.opinion_of=A
B.type=OTHERS
A.arg1=B
C.type=ellipsis
[C]
D.type=COME_SCALARLY
D.order=AFTER
[/C]
D.arg1=E
E.type=8
E.number=true

Generator


Getting the generator to work is not interesting at all. It's just a matter of adding another condition, given that I mock the generation throughout anyway.


After writing all this I've suddenly thought that well, this is great, but possibly unnecessary. Maybe the right way is not to replicate the history but just mark the whole second clause situation as ellipsis, parse only what's in the text, attach it somewhere and assume that the inference engine (mocked) should be responsible for inferring the went next part. But I don't have enough data yet to make a choice.

Sunday, May 8, 2011

Lost in translation simulation

The fact that I don't write about my parser doesn't mean I'm not writing it. I just can't be bothered to do both, and most of the time I prefer coding. And I can't say there's a lack of interesting problems. For example, here's a part of the 12th sentence:

по мнению одних дальше следовало 7, по мнению других - 8
In the opinion of some, a 7 went next; but in opinion of others an 8 did

There are many interesting things here. Just look at this nice little ellipsis (the dash before 8)! But what worries me the most is English. Or just my English. I've found that I don't know it well enough to understand why the translation looks the way it does and how I'm supposed to generate it.

Take the articles, for example: a 7 went next; an 8 did. Why on earth are the numbers indefinite? Are there many sevens and is it just one of them? I don't get it. But it's all over the translation. Only in the first sentence are 7 and 8 without articles; in every other place they're indefinite.

Then, compare In the opinion of some with in opinion of others. The main difference is the article again. Is the opinion of some more definite than the opinion of the others? Does it just sound better that way? Of course, I've hacked all that, but I'm still uncomfortable.

And, by the way, just if anyone cares, here it is: lots of hacks and some ideas.

Saturday, April 30, 2011

Disambiguation by constructions

A word may have many senses and a good parser must choose the right one. Which is the right one? This is often obvious from the context. When you see an ambiguous word alone (e.g. cooks), you can tell nothing about its meaning. But if you see it in context, you may start guessing its syntactic and/or semantic behavior (our cooks have prepared the dinner vs. Mary cooks fish). Surrounding words greatly help in determining the part of speech, and many disambiguation algorithms take advantage of that.

But who on earth cares about the parts of speech? Well, many do: parsers, both statistical and declarative, employ this information for building all kinds of structures. But anyway, that's an intermediate thing used for those parsers' own convenience. For the ultimate text analysis tasks, parts of speech are not important at all. The meaning is what's important, not them. So why bother at all? I'm currently trying to live without the intermediate part-of-speech level in my parser, and so far it works. How?

Consider cooks again. It can participate in the following constructions:
  1. she cooks
  2. cooks fish
  3. cooks when something happens
  4. cooks well
  5. the cooks
  6. sad cooks
  7. cooks that came from Germany
  8. and so on
Of those, 1-4 are "verbal": they refer to the same meaning of cooks, the process of cooking. And in fact they can all occur together in one sentence: She cooks fish very well when she's not tired. 5-7, on the other hand, are "nominal": they describe the people who cook for a living. They are also mutually compatible and at the same time totally incompatible with 1-4.

Let's now say there are no nouns, verbs and so on; there are only constructions. Upon encountering cooks, the parser notes all the constructions possible with this word (at least 1-7 from above). It also marks some of them (1-4) as incompatible with the others (5-7). Then another word comes, for example fish. It also generates tons of constructions, some of them also mutually incompatible (fish can be a noun or a verb as well). Importantly, one of them is the familiar Transitive (number 2 in the list). It's been suggested by both words, and it clearly wins over the others, which were suggested by only one of the two words.

Now the constructions which are incompatible with this Transitive can be pruned: both "nominal" for cooks and "verbal" for fish. And the Transitive is promoted and may now contribute to the meaning of the entire text (e.g. Cook.patient=Fish). Disambiguation complete.
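A toy version of that pruning, with construction names and incompatibility groups simplified to the bare minimum:

# Each word proposes constructions, split into mutually incompatible groups.
cooks = {"verbal":  {"she_cooks", "transitive", "cooks_well"},
         "nominal": {"the_cooks", "sad_cooks", "relative_clause"}}
fish  = {"nominal": {"the_fish", "transitive"},   # fish as the cooked object
         "verbal":  {"fish_for_compliments"}}     # fish as a verb

# A construction suggested by both words wins over the rest...
shared = set.union(*cooks.values()) & set.union(*fish.values())
winner = shared.pop()                             # 'transitive'

# ...and the groups incompatible with it are pruned for each word.
surviving_cooks = next(g for g in cooks.values() if winner in g)
surviving_fish  = next(g for g in fish.values() if winner in g)
print(winner, surviving_cooks, surviving_fish)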

Positive side: it's very simple, and I don't have to create boring data structures for different parts of speech with all those cases, declensions, numbers, genders, etc. Negative side: every word now has to know all the contexts it can occur in. Both adjectives and nouns have to specify that they participate in the Adjective+Noun construction. That's quite unusual in rule-based parsing, where people try to hand-code as little redundancy as possible. Anyway, unusual doesn't mean bad, I'm not very much against redundancy, and I really like the overall simplicity.

Wednesday, March 30, 2011

New model

What happens when you hear a sentence? Nobody knows for sure, but there are some guesses on the market. Many people think you build a logical formula and do something with it (perhaps store it and use it for inference). However, I find huge logical expressions terribly hard to analyze, and as I surely need some analysis to translate correctly, this idea doesn't suit me.

So I prefer to think in terms of objects and relations between them. Luckily, I don't have to invent anything, since some psychologists have similar ideas. They argue that during sentence comprehension people don't operate with abstract logical symbols, but mentally simulate the input, and this is what we call meaning. You hear The eagle in the sky, you imagine an eagle in the blue sky with a few clouds and airplanes, and it's natural that the eagle in this picture has wide-spread wings rather than, say, folded ones.

Many experiments seem to support this theory. But actually that doesn't matter: I don't care that much what really happens in the brain. What I care about are ideas that may be useful for parsing. And I find the idea of mental simulations quite useful. The words don't have to determine the meaning; they act like cues which invoke the memories of previous situations where you met them. These memories are put together and a new situation is simulated. The situation may involve objects or actions that are not mentioned at all, but they'll be highly accessible in the following discourse, without any complex logical inference. The important things that the hearer learned through simulation are stored in memory for future retrieval, also as parts of other simulations.

This is a completely informal theory. That's great because I may treat it in any way useful to me. In particular, I assume the simulation engine to be a black box that communicates with the parser and generator via frames. Frames are just objects, with attributes and values. Values may be strings or other frames. The parser creates cues like Frame A has attribute 'actor' pointing to frame B or Frame C is described as 'man'. And the notation is:

A.actor=B
C.type=man

The parser gives such cues to the simulator as soon as it encounters new words. It also receives feedback which may help in choosing between several competing parses. But it doesn't know what the simulator does internally; it only sees those frames and attributes. That allows me to mock the simulator in any way I want. In the end, my main object of interest is parsing, so I leave a well-behaving simulator to someone else.
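Frames and cues are simple enough to sketch directly; this is just my reading of the notation above, with the simulator trivially mocked:

class Frame:
    """An object with attributes; attribute values are strings or other frames."""
    def __init__(self, name):
        self.name = name
        self.attributes = {}

class MockSimulator:
    """Stands in for the black-box simulation engine: it just records the cues."""
    def __init__(self):
        self.cues = []
    def cue(self, frame, attribute, value):
        frame.attributes[attribute] = value
        self.cues.append((frame.name, attribute, value))

simulator = MockSimulator()
a, c = Frame("A"), Frame("C")
simulator.cue(a, "actor", Frame("B"))   # A.actor=B
simulator.cue(c, "type", "man")         # C.type=man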

The target language generator has access both to the sequence of cues fed by the parser to the simulator, and to the simulator itself. The cue sequence is needed to provide as close a translation as possible, so that what was said in the source will be more or less the same as what is generated. The simulator is needed to ensure that the information inferred from the cues in both languages is similar as well. The simulation results may also be used in cases when the generator needs something not explicitly specified in the source but obligatory in the result, for example pronoun gender, when translating from Finnish to Russian.

So far, the model is quite simple. A discourse is a sequence of (numbered, #1,#2,...) situations. Each situation has a sequence of constraints on frames (referred to by variables: A,B,C,...) and their attribute values. Situations are also variables, and also have properties that may be assigned.

For example, let's consider the first part of Kharms' Sonnet:

An amazing thing happened to me today, I suddenly forgot....

It's currently analyzed as:

----- Situation #1 -----
A.property=AMAZING
A.type=THING
A.given=false
B.type=HAPPEN
#1.time=PAST
B.arg1=A
B.experiencer=C
C.type=ME
#1.elaboration=#2
----- Situation #2 -----
A.type=ME
B.manner=SUDDENLY
B.type=FORGET
#2.time=PAST
B.arg1=A
B.theme=#3
----- Situation #3 -----
...

Sunday, October 17, 2010

Translating Kharms

My approach to NLU is to take some real sample of Natural Language and make the computer Understand it. Since I don't want the result to be controversial and debatable, I take something that everyone would agree upon, a 'golden standard'. In my case the program should translate a text in the very way that some human has already done it. Because they are short, I took one of Daniil Kharms's short stories translated into English. The program parses the text, computes its semantic representation, translates it into the target language concepts and generates the target text based solely on that transfer result.

The problems begin straight away. Here's the first sentence:

Удивительный случай случился со мной: я вдруг забыл, что идет раньше - 7 или 8
Amazing case happened with me I suddenly forgot what goes earlier 7 or 8
An amazing thing happened to me today, I suddenly forgot what comes first - 7 or 8

Problem: Everyone translating Russian into English immediately stumbles upon articles. Russian doesn't have them, so they have to be added. One can note that they somehow relate to the information structure: the new material is usually (but not always) indefinite, the given one - definite.

Solution: For anaphora resolution purposes, a smart parser should keep a record of discourse referents anyway, and the generator could mark the newly introduced ones as indefinite. For a start that'll do, until we find a counterexample. So, the case=thing is a new discourse referent, and we add An before it.
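A minimal sketch of that rule; the referent registry is invented for illustration:

def article_for(referent, seen_referents):
    """First mention is indefinite, later mentions are definite: the
    'until we find a counterexample' version of the rule."""
    if referent in seen_referents:
        return "the"
    seen_referents.add(referent)
    return "a"   # "a" vs "an" is left to the surface generator

seen = set()
print(article_for("thing", seen))   # 'a'   -> "An amazing thing happened..."
print(article_for("thing", seen))   # 'the' on any later mention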

Problem: This is absolutely crazy, but human-made translations abound with such additions, and my purpose is to imitate them. In the Russian text, we are only told that the amazing thing happened, but in English it was today. Not surprisingly, another translation of the same story doesn't have it. But anyway, it's in the 'gold standard', so the program should somehow be able to generate it.

Solution: Of course, the translator may have added this at random, but I always first try to find something in the source text before I surrender. Here, my solution is that the Russian story actually has an anecdotal flavor. And this today adds the same flavor, since (perhaps) without it the English counterpart doesn't have it. Therefore, the Russian parser has to specify the flavor explicitly in the semantics so that the generator can use this information later.

Problem: The colon, which separates two clauses of the sentence. Nothing's wrong with it except that it's translated as a comma. So I had to find a deep meaning for it which allegedly should be expressed using a comma in English.

Solution: The second clause seems to elaborate on something from the first clause. I'd say it's the semantically empty thing about which the reader now is told more. So, apparently the colon actually expresses an elaboration dependency between thing and the second clause, and what we should preserve in the deep semantics is this dependency, not the colon itself.

Problem: Due to the parsing algorithm, once the verb (happened) has come after the subject (thing), the latter is deactivated: it's not available for forming syntactic relations with the words coming after the verb. This seems reasonable because discontinuous subjects are not very frequent in Russian and are always marked (e.g. by a special intonation). But nevertheless, in this case the subject should form the elaboration relation with what follows the colon.

Solution: Look at it from the semantic point of view. We're told that an amazing thing happened. Naturally, we're interested what kind of thing it was. So, the verb's semantic handler should be taught to understand such luring constructions and wait for a colon followed by a clause, which is one of the ways of expressing elaboration in Russian.

Let's deal with the rest of this sentence's problems later.

Wednesday, August 18, 2010

Placeholders aka left-corner parsing

From the parsing perspective, some words have much more far-reaching consequences than others. Take English articles, for example. They don't only denote definiteness or indefiniteness, they also give a broad hint that a noun is coming. So, having encountered an article, a smart parser may immediately create a placeholder node for a noun (and also specify its definiteness). When the noun comes, instead of creating its own structure it'll just fit into the existing placeholder.

Once a noun placeholder is constructed, a parser may immediately attach it where it's needed. If there is an ambiguous verb or preposition, attaching that noun as an argument can help choose the right alternative and thus reduce working memory consumption.
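A tiny sketch of the idea (the class and its fields are hypothetical):

class NounPlaceholder:
    """Created by an article: it already carries definiteness and can be
    attached as an argument before the noun itself arrives."""
    def __init__(self, definite):
        self.definite = definite
        self.noun = None          # not known yet

    def fill(self, noun):
        self.noun = noun          # the noun fits into the existing node
        return self

# "the" creates a definite placeholder that a verb could already attach;
# "barn" later just fills it in.
np = NounPlaceholder(definite=True)
np.fill("barn")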

Hawkins argues that such placeholders play an important role in syntax. He shows for many languages that their word order is in many cases optimized for such placeholders to be constructed and integrated as early as possible. That helps minimize the number of nodes awaiting integration at any moment during parsing.

In German and Russian, the NP placeholder creators (determiners and adjectives respectively) also carry case markers. Therefore case is known for the placeholders, which also aids faster verb disambiguation (e.g. in+Dative denotes location, while in+Accusative denotes direction in German).

Verb placeholders also seem to exist. For example, in Japanese, where the verb comes at the very end of the sentence, there is no sign that a sequence of argument noun phrases causes any processing difficulties. I conclude that these noun phrases get integrated into some structure as they appear. When a verb comes, it just quickly iterates through this structure and assigns the accumulated noun phrases their semantic roles. The rich case marking of Japanese helps the arguments to be stored efficiently and not to interfere with each other while awaiting the verb. I think that the structure I'm talking about can safely be called a verbal (or clausal) placeholder.

It gets more interesting in other languages. On the surface, German is a verb-second language. A nice processing explanation would be that there is no verb placeholder, and the first constituent hangs active in memory until a verb actually appears. So the verb should appear immediately to reduce the memory usage. In most relative clauses, on the other hand, the verb comes at the end, and no processing problems arise even if there are lots of NP/PP dependents before it. Incidentally, all these clauses have to be introduced by some complementizer. When there is no complementizer, the verb comes second just as if it were a main clause. All this makes me think that the complementizer just creates the verb placeholder which enables the verb to be postponed.

In English, the verb never comes too late, so the need for a placeholder is doubtful. Russian is difficult in this respect, because normally the verb is close to the beginning but it can easily move to the end for focalization purposes. I'm not aware of any experimental data, but I personally don't feel too much overburdened by multiple intermediate NPs and PPs in this case. So I have no idea about when a verbal placeholder is created in Russian: always, never, only in case of too many subsequent NPs and PPs, or only in some specific context (e.g. the one that forces focalization).

Tuesday, February 9, 2010

Cloud of constructions

What I propose to use as the parser representation instead of trees is a collection of constructions. I call it a cloud.

Base constructions are words. When several constructions (e.g. words) tend to go together and comprise something meaningful in the semantics, they form a composite construction and become its arguments, or children. Composite constructions may also be children of even higher-order constructions, although not very often.

Consider I see a little silhouetto of a man as an example. Each word is a construction, so it's already in the cloud. I+see form a construction whose meaning is to specify see's subject as I. Similar things happen with see+silhouetto, though this time it's an object. Both a and little provide us with additional information about silhouetto: it's indeterminate and, well, little. Again, indeterminacy is added to man in the a+man construction. The preposition of and man form a prepositional construction, which then itself forms a genitive construction with silhouetto. The result can be nicely visualized:

Doesn't this look like a strangely-colored cloud?

Of course, composite constructions don't need to have exactly 2 arguments. For example, the more X the more Y construction has 6 arguments, 4 of which are constant words. X is Y to Z (as in John is hard to convince) has 5 arguments with 3 slots (actually 4: is isn't fixed either, it's replaceable with was).
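A cloud could be represented as flatly as this (a sketch, not the parser's actual representation; the second a is written a2 only to keep the two tokens distinct):

# Words are base constructions; composite constructions list their children,
# which may themselves be composite. Nothing forces the whole thing to be a tree.
words = {"I", "see", "a", "little", "silhouetto", "of", "a2", "man"}
of_man = ("prepositional", "of", "man")
cloud = words | {
    ("subject", "I", "see"),
    ("object", "see", "silhouetto"),
    ("determiner", "a", "silhouetto"),
    ("attribute", "little", "silhouetto"),
    ("determiner", "a2", "man"),
    of_man,
    ("genitive", "silhouetto", of_man),
}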

And yes, clouds are better than trees.

Monday, February 8, 2010

Trees in NL parser

Imagine you're a parser analyzing the famous The horse raced past the barn fell. Additionally, you're operating in an incremental left-to-right fashion, considering a single active parse variant at each moment. Regardless of the theory, you'll probably combine the horse and past the barn into some kind of units, say NP1 and PP2. Before you see the last word, you'll probably consider that NP1 is the subject of the finite verb raced, and PP2 is its adjunct. But then fell comes. At this moment, you'll understand that the single active parse variant you have is wrong, return to raced and reanalyze it as a passive participle (The horse [that was] raced past the barn fell). This is all well known and boring.

What's not boring is the implications for a computational parser operating in a similar way. First, it should be able somehow to restore its state right after NP1 and reparse from there, choosing the passive participle alternative of raced. Second, it would be nice not to reparse the PP2 again. This word sequence will anyway produce the same result and function as an adjunct, though now in the relative clause.

Since we're going to store the parsing state and analyze it in various ways (e.g. find slots to fill), we'd better store it as data. What kind of data? The standard answer is, an (incomplete) tree. I'll show that's not convenient.

Let's consider mutable trees first. Nodes can be added to or removed from such trees at any position at any time. Therefore, to store anything we first need to copy it completely and put it aside, so that subsequent changes in the active tree aren't reflected in the copy. This applies both to the parser state after NP1 and to the I'd-like-it-cached PP2. But we don't want to copy the parser state at each moment just in case we'll have to backtrack here in the future: we'd run out of memory very soon.

An alternative approach I see is not to copy anything, but to log thoroughly every change you make to the tree in an undoable manner. When a need for backtracking arises, we could just go back through the log undoing every single change and restart. This solves the state-saving problem, but doesn't offer caching for what's undone. We'll have to reparse PP2 anew.

So, mutable trees don't appear to be useful for our particular parser. Let's look at immutable trees. Their nodes are created once and fixed forever. No nodes can be added or removed. The only way to attach a constituent to a tree is to produce a new tree with the constituent already in the right place. The parts of the old tree not directly dominating the attachment place can be reused in the new tree; everything else has to be recreated. Luckily, the reused part is usually much bigger, so not very much memory is consumed. So, we could easily save the current parsing state at each moment without fear. And we can also cache PP2 before reparsing, since no one can hurt it anymore: it's immutable.
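For concreteness, here's a sketch of such an immutable attach operation (deliberately simplified; the bookkeeping of attachment sites is exactly the problem discussed below):

from typing import NamedTuple, Tuple

class Node(NamedTuple):
    label: str
    children: Tuple["Node", ...] = ()

def attach(root: Node, path: Tuple[int, ...], child: Node) -> Node:
    """Return a new tree with `child` appended under the node at `path`.
    Only the spine from the root to the attachment site is recreated;
    every other subtree is shared with the old tree."""
    if not path:
        return Node(root.label, root.children + (child,))
    i = path[0]
    new_children = (root.children[:i]
                    + (attach(root.children[i], path[1:], child),)
                    + root.children[i + 1:])
    return Node(root.label, new_children)

s = Node("S", (Node("NP1"), Node("VP", (Node("V"),))))
s2 = attach(s, (1,), Node("PP2"))
assert s.children[0] is s2.children[0]   # NP1 is reused, not copied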

But all this is not for free. You should keep track of all the places in the partial tree where it's possible to attach. For example, in 'I gave the ...' there are 2 slots waiting to be filled: the noun after the and the second argument of gave. They should somehow be represented in the tree, and at least one of them won't be on the top level.

The slots to be filled may also not be marked explicitly. In I saw a girl with a telescope the PP with a telescope may be attached either to the noun girl or to the verb saw. Since it's not required by either of them, the parser should walk down the whole tree, find all the constituents touching the right side, and then try to combine them with the PP. Moreover, psycholinguists observe that most often such an adjunct PP is attached to the downmost suitable constituent. Probably it's easier that way for humans, but this would be a very resource-consuming operation in a computational parser.

Finally, the attachment place may not even be located on the right boundary at all. That's the infrequent but still important case of discontinuous constituents, usually occurring in free word order languages like Russian, Latin or Warlpiri. English also has such examples:

The question has been asked, if there's life on Mars.
"Tomorrow," he said, "I'll return".
Yesterday, however, they left.

In the first example, we should attach the subordinate clause if there's life on Mars to the NP the question, which doesn't touch the right boundary at all. The tree we have at that moment is something like

[S: [NP: the question] [VP: has been asked]]

and represents a complete grammatical sentence. So, do we have to search the whole tree to find the constituent to attach to? Nonsense. But I have no idea what a reasonable attachment strategy that uses immutable trees and covers the presented data should look like.

Of course, all these problems are solvable. It's just me that doesn't want to implement those solutions due to their clumsiness, complexity and possible bug-riddenness. I'd better just use another intermediate parsing representation for the serial parser. Which one? That's another story.

Sunday, January 17, 2010

Operational Text=>Meaning model

The model has several levels, though they are almost completely different from Melchuk's:

Level 1: text/phonology.
The parser receives it as an input and then produces

Level 2: constructions.
The contiguous text is separated into constructions of all sizes, from morphological to complex syntactical ones. Basic constructions are (mostly) word stems, complex constructions combine the simpler ones. The structure is most probably a tree though not necessarily. Each construction may have information structure. Constructions are ordered and considered to be executable in that order. Being executed, they give us

Level 3: semantics: a program.
In some general-purpose (programming) language with well-defined semantics and model. The model probably consists of local variables (max. 4-7), salient context and general memory, which is organized as a graph of frames and represents the listener's knowledge of the world. The language is imperative, object-oriented and almost lacks any control flow, being very linear. The basic instructions are frame creation and assignments to local variables or frame slots. If we execute this semantics program, we'll get

Level 4: change in the frame structure in the memory.
Which means a change in the listener's knowledge of the world. Actually, the change may be greater than we would expect from looking at the semantics program. Each of its simple instructions may cause significant changes that are not directly encoded in the program. This is what we call pragmatics.

We're done! Though it's worth noting that the built frames might well encode a sequence of actions (e.g. a cooking recipe), which can also be executed via interpretation and result in the listener's real actions in the real world.

Thursday, October 22, 2009

Shallow vs. structural analysis

Let's look at a very simple sentence, namely 'John has two sisters'. I'm now interested in its semantics, or, more precisely, its computer representation. The truth condition is actually very simple: it says that the number of those who happen to be sisters of John equals 2:

|{ x | SISTER(x, JOHN) }|=2

(let the uppercase letters denote some semantic meaning here).

A question arises: how can we assemble this semantics from the meanings of the sentence components? The constituent structure for this sentence would be:

[S [NP John] [VP has [QP two sisters]]]

The dependency structure:

John <- has -> two -> sisters

The beloved one, applicative structure:

(has John (two sisters))

Lexical Functional Grammar-style:

[ PRED 'has'
  SUBJ [ PRED 'John' ]
  OBJ  [ PRED 'sisters'
         SPEC [ NUM 2 ] ] ]

In any of these variants has has two arguments: John and the combined two sisters. So it appears that we should combine the word meanings in this order, getting something like f(HAS, JOHN, g(2, SISTER)). And this formula should somehow be equivalent to |{x | SISTER(x, JOHN)}|=2. The question is, what are f and g? I see no direct structural answer. The best variant I've come up with is that we should change the structure, replacing it with another one which contains only one predicate:

HAS_N_SISTERS(Who,N)

which would translate to

|{ x | SISTER(x, Who) }|=N

This can be generalized a bit (take sibling instead of sister), but not further. A similar sentence 'John has two dogs' would have a different semantics, e.g. |{x | DOG(x) & BELONGS(x, John)}|=2. A two-place 'sister'-like 'dog' predicate would be funny.

So it seems that all the structures I know of are of no use with this sentence. That's one of the reasons I prefer shallow parsing based on patterns with wildcards: it appears to map better onto semantics. And a probable sad consequence is that the applicative structure, though being so beautiful, will remain unapplied.
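To show what I mean by patterns with wildcards, here's a hypothetical rule for exactly this kind of sentence; the predicate name follows the post, the rest of the machinery is invented:

import re

# One shallow pattern with wildcards, mapped straight to the semantics.
PATTERN = re.compile(r"^(?P<who>\w+) has (?P<n>\w+) (?P<rel>sisters|brothers|siblings)$")

NUMBERS = {"two": 2, "three": 3}

def parse_has_n_relatives(sentence):
    m = PATTERN.match(sentence)
    if not m:
        return None
    # HAS_N_SISTERS(Who, N), i.e. |{ x | SISTER(x, Who) }| = N
    return ("HAS_N_" + m.group("rel").upper(), m.group("who").upper(),
            NUMBERS[m.group("n")])

print(parse_has_n_relatives("John has two sisters"))
# ('HAS_N_SISTERS', 'JOHN', 2)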

Tuesday, October 20, 2009

Linear Unit Grammar

Predominant natural language syntax theories usually deal with one sentence, which is carefully selected to be well-formed, so-called 'grammatical'. But live language seems to be different, more complex, especially the spoken kind. It contains lots of repetitions, false starts, hesitations, reformulations, speaker changes and so on.

I therefore like meeting theories that aim to describe spoken discourse 'as is' instead of labelling it as incorrect. One of those theories is Linear Unit Grammar by John Sinclair and Anna Mauranen.

The main notion here is the 'chunk'. It's a fuzzy pre-theoretical concept, with no precise definition. Basically it's a linear fragment of the input signal (text, speech, etc.) which people tend to comprehend all at once. It may be unfinished. Usually it's formed of closely connected words, like a noun phrase with adjectives (a small dog) or a verb with its obligatory arguments (to love Mary). Relative clauses, of course, form separate chunks. Moreover, auxiliary words (like 'that' in 'the dog that loves Mary') are separated from everything else and go into single-word chunks. The chunks are very small; their size rarely exceeds 5 words.

Here's a sample chunking made by me from a spoken English corpus. The analyzed fragment is:

so an example of this s- s- second rule i mean the second rule probably is the easiest one to look at cuz you have the, the f- the the four-six-seven type of relationship

I divide it according to my intuition only. Whenever I have doubts, I put a chunk boundary. And so the result will be:

1. so
2. an example of this
3. s- s-
4. second rule
5. i mean
6. the second rule
7. probably
8. is
9. the easiest one
10. to look at
11. cuz
12. you have the
13. the f-
14. the
15. the four-six-seven type
16. of relationship

Sinclair & Mauranen classify the chunks into organizational fragments (1,5,11) and message fragments (the others). These groups are also divided into subgroups according to organizational function or message completeness. There's a non-deterministic algorithm that translates any kind of text into a well-formed one. In this example it would be something like 'an example of this second rule is probably the easiest one to look at cuz you have the four-six-seven type of relationship'.

That's a bit surprising! How can anyone claim that 'grammatical sentences are unnatural, let's analyze the real ones' and then analyze spoken discourse by first making it grammatical? The answer is that the authors in fact don't aim to compete with the major syntactic theories; they strive to co-exist with them, at least in the beginning. The described algorithm may be just a first step in complex language analysis. The authors also suggest chunking could help in second language teaching and learning.

What I personally like about Linear Unit Grammar is precisely the chunking. It's so simple! And, in contrast to Gasparov's approach, where the text is divided into communicative fragments, the LUG chunks are contiguous and non-overlapping. Therefore LUG chunking can be done by simple regular expressions, or Markov processes. A great part of the syntax lies inside chunks, so there's no need to analyze it the same way as the 'bigger' structures like clause subordination. NLTK seems to provide chunking out of the box, so I guess I gotta try it.
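As a first experiment, NLTK's regular-expression chunker could be tried like this (a rough approximation only: the tag pattern finds NP-like chunks rather than genuine LUG chunks, and it assumes NLTK's tokenizer and tagger models are installed):

import nltk

sentence = "the second rule probably is the easiest one to look at"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# NP-like chunks defined by a simple tag pattern; everything else stays outside.
grammar = "CHUNK: {<DT>?<JJ.*>*<CD>?<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "CHUNK"):
    print(" ".join(word for word, tag in subtree.leaves()))
# roughly: "the second rule", "the easiest one"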