Sepulkarium: constructor

Showing posts with label constructor. Show all posts

Tuesday, October 9, 2012

Parser rewrite completed

In June I started rewriting the parser because there were too many hacks in there. I've just finished: all the tests now pass, that were passing before that decision. Hurray!

It took about 4 months. I hoped it would be faster. Perhaps it could be, if I wasn't procrastinating so much. Now there's a lot less hacks as there's a much stronger support for syntactic hierarchy. Hopefully, adding each new test sentences won't take so long now. Although fixing the last several tests really took several weeks which isn't very promising :)

The plan now is to try to speedup things a little by not trying all the semantic alternatives after each and every word. And then to add new sentences. I have quite a few of them enqueued for parsing and translation.

Tuesday, July 31, 2012

The comma dilemma

There's a sentence to translate, and there's a comma missing there. This doesn't prevent me from understanding the sentence, but my parser fails. In another version of the same text found online the comma is present. It's only missing in the one I've found long ago which I'm trying to translate literally.

So here's the dilemma:

Add the comma according to the Russian rules and parse the correct text, or
Remember that I'm trying to simulate humans for whom extra commas are not a big deal, and dive right away into the wonderful world of parsing noisy input without proper understanding how to deal with correct one.

No idea.

UPD. In the end I corrected the commas in the original sentence, and added failing tests to fix later with all kinds of comma omissions.

Friday, July 20, 2012

Visibility ambiguity

I'm now dealing with a quite interesting kind of ambiguity which I call visibility ambiguity. Consider the following three sentences (I might have put too many commas; imagine it's Russian where they're obligatory):

But there, thinking about her words, we got sad.
But there, thinking about her words, which were so cruel, we got sad.
But there, thinking about her words, intonation and gestures, we got sad.

The sentences are the same until the second comma after which they diverge dramatically. To process the next word the parser should determine which previous state it can attach to. In 1, we is attached to there and the top-level sentence is continued. In 2, which relates to words plus comma, a relative clause begins. In 3, there's a conjunction: words gets merged with intonation and re-attached together to about and her.

Obviously, all three attachment sites should be visible to the parser: there, words and comma. And all three continuation possibilities should be explored.

Which is where it gets complicated, because these attachment sites are also incompatible with each other. Once you've attached a word following route 1, you cannot attach the next one to route 2. The parser should regard such attachments as mutually incompatible. There are at least three diverging routes after every comma, each defined by its own set of constructions. And all constructions from different sets should be marked as contradictory, pair-wise. Which is quite a few pairs.

So it's not only that the number of possible variants will be big. It's also that the number of incompatibility relations to track will be much bigger, something pretty exponential. Given that I normally log all the visible attachment sites after each word during parsing, the log will be hard to analyze.

And what troubles me: people don't do this, they don't track all the possibilities. Instead, they quite quickly choose a variant which suits best, purge everything else from the memory and pursue only this parsing route. If it goes wrong eventually, they perform some kind of reanalysis. But the heuristics are tuned well enough and normally such a need never arises.

Right now my parser maintains all the structures needed for all possible analyses, and switching between them is very simple. But that just doesn't scale well enough to support visibility ambiguities. So it seems that the time has come to teach my parser the human strategy: choosing one route and then reparsing when anything doesn't fit well.

Monday, June 18, 2012

7 stages of exploratory programming

One has an idea on natural language understanding, starts implementing it. Cool, it works!
One encounters some sentences not easy to deal with. It's not clear how to do it the right way, beautifully, so one does it somehow — by dirty hacking. Hopefully, later it'll become more clear when new data arrives.
As more and more sentences sentences are encountered, some hacks are corrected but others are added. Hacks proliferate and get tightly intertwined.
It becomes harder to parse new sentences. Hacks start getting in the way. Fortunately, one starts seeing how to avoid them.
Correcting the design appears to be non-trivial because hacks depend on each other. So to fix one you should first correct another one, which requires third one, and so on. There's a whole dependency graph of hacks!
The test suite doesn't seem a friend anymore. It keeps failing, revealing more and more ways hacks depend on each other. No new sentences get parsed anymore.
After half a year of pure refactoring, the morale is low and still no light at the end of the tunnel. Yet another cyclic hack dependency is found. Tired, one decides to apply all the desired design changes at once and fix the tests one by one, almost as if writing the parser afresh. Goto 1.

Saturday, November 12, 2011

Test-driven parsing

I use Test-Driven Development for my natural language parser, in a very orthodox way. And now that I do it, I see that what I had before is just test-first development, not quite test-driven.

I try to do something new in my parser every day. But very often in the evening I see that I won't manage to get the tests pass today. But, if I insert a little hack here and there, then maybe... And, if mocking really helps, I do it, but not without consequences. I add another test, that doesn't pass because of the mocking. Usually I just take the sentence and add a small variation to it and, of course, to its translation (e.g. I can change a tense, a noun person, the word order). After that I may go to sleep untroubled: the new test belongs to the next day, and I have plenty of time to think how to parse in a more generic way. If this time is not sufficient, I can mock further and create one more test. After a series of those variation tests, I get a satisfactory algorithm and quite a coverage. That's how it goes.

This mocking-all-the-time approach seems to be quite a fruitful way to get the things done. Now I really understand those agile guys saying "keep it simple and never change anything without a test". I never can think of all the possibilities, and there's no right way to do things yet. If I implement a nice beautiful logic based on just one sentence, then the next sentence will inevitably prove this code wrong, or too inflexible, or just plain inconvenient. But when I mock things several times, the mocking logic tends to get more and more generic and finally doesn't look like mocking at all, while still being flexible and convenient. Many mocks have already been transformed into serious logic, many others still wait for it.

By the way, it's not about unit-testing at all. Of 175 tests I have right now only 4 are unit tests. They test the longest common subsequence algorithm that I need for ellipsis and which I had to implement myself. There are also 17 parsing tests which check the internal semantic representation produced after parsing. They are useful to keep an eye on the semantics, but their expected output also tends to change insignificantly quite often. Then it's just tedious to update the tests. So I prefer to write translation tests, because their expectations are a golden standard. I could completely change the architecture of the parser without ever touching them.

Tuesday, November 8, 2011

NLP using graphs, actually

The reason for complete rewrite was simple:

Это отвлекло нас от нашего спора.

That distracted us from our argument.

Every word makes a contribution to the parsing state by saying that it participates in a number of constructions, and provides some attributes for each of them. So a contribution consists of pairs of constructions and attribute sets. I'll call such pairs mites.

That (это) is capable to add noun attributes to nom(inative) and acc(usative) constructions, and these mites contradict each other. Distracted (отвлекло) is a transitive verb, therefore it defines head for nom and acc constructions. But these two mites don't contradict each other, they're both very welcome.

Earlier, I unified mites as soon as they came, and I only kept the results of the unification, not the mites themselves. After the word that, there were nom and acc constructions with noun defined, and nom was (randomly) chosen as active. Active constructions had higher priority, and therefore the verb's nom mite was integrated first. Two nom mites were unified, and the nom construction with both head and noun defined linked the values of these attributes semantically. That's all great. But what to do with acc?

You can't just unify it as well, because there's already a nom construction and it contradicts acc in its first mite. So you have to drop either of the mites completely and leave the other. In this sentence, it made sense to drop the acc.noun mite — the accusative alternative of that. The acc.head mite would survive and unify nicely with the real object that comes next — us. But that's a hack and that didn't work when the first это was in fact accusative. A more right way would be to preserve both acc mites but only mark the second one as active. That clearly required per-mite active status, not per-construction.

So now I don't remove any mites automatically when processing word contributions, they all survive. The matter is, are they active or not. Active mites are guaranteed to be non-contradicting. Only active mites about the same construction are unified and passed to this construction, it then may make some semantic changes based on that information.

When another word comes, the parser solves a constraint satisfaction task on graphs. Some of the new mites are chosen to be active, the old mites may change their active status to accommodate that. The reanalysis tries to

minimize the number of active constructions
to maximize the number of mites in those constructions (more mites mean more information)
prefer more recent mites
to prefer semantically more plausible variants

Surprisingly, it works, though twice as slow as before. And the algorithm now seems so clean and logical that I can't find anything to remove! But I definitely should optimize, as I just get bored waiting for long 10 seconds until 171 tests pass.

Saturday, February 13, 2010

Charts and clouds

As Doug Arnold pointed out, attacking trees and looking for a replacement isn't new. It has already been done, so I didn't actually invent anything revolutionary. The construction cloud is actually a well known structure called chart, as in chart parsing. The chart is designed for keeping all possible parse variants in a compact way to ease backtracking and reparse. And my cloud does exactly this and in a very similar fashion.

One difference is what kind of information is stored in the chart. In the classical approach, it's a bunch of records like "there may be an NP constituent from word 3 to word 7". No internal structure of these constituents is stored. If someone needs the complete tree, it can be computed on demand, since all of the daughter phrases are also present in the chart. And you're lucky if there's only one variant of such subtree.

Cloud is a collection of records like "constructions X, Y and Z may form a construction T". Since constructions don't tend to have a deep hierarchy, each of them stores its children. We don't have to compute the subtree on demand. But the subtree rarely covers the whole sentence. Therefore the task is to find a subset of all constructions such that it covers the input and its constructions don't contradict each other.

Another difference is that usually chart is tightly connected with the parser. A typical chart parser computes all the results in parallel, storing every possible intermediate constituent in the chart. On the other hand, I'm into single-variant serial parsing with error recovery. Of course, nothing prevents from implementing that kind of parser with phrase-structure-based chart: just drive one variant instead of all. One of the reasons I prefer dependency-like constructions over constituents is that free word order languages appear to be easier to describe in terms of constructions.

Wednesday, February 10, 2010

But why cloud?

Once upon a time I was having problems with trees. I decided I prefer construction cloud instead of them. Here's why.

Like the syntactic groups grammar, construction cloud accommodates phrase structure and dependency grammars. A construction relating 2 words can be thought of as a dependency relation between them, with the rare exception that this 'dependency' can participate in another construction, as with Prepositional and Genitive from the example.

Construction cloud also doesn't object to discontinuous constructions. If the parser managed to build it from the text, then it's acceptable.

Unlike the phrase structure, dependency and categorial grammars, it's possible for a construction to have more than one parent. This allows overlapping of various kinds, including, for example, having separate constructions for syntactic agreement, semantic relatedness and prosodical cues to information structure.

From the programmer's perspective, it's much easier to maintain the list of attachable constructions during the parsing process. They are not buried deep inside the tree anymore, but stored together in a flat list. Being immutable, they are easy to cache for future reparse. Rollback in case of reanalysis is also simple: just look at the constructions ending before the reparse point to restore the context. Finally, nobody prevents you from having constructions from several parse variants in the same cloud simultaneously.

Isn't it convincing?

Tuesday, February 9, 2010

Cloud of constructions

What I propose to use as the parser representation instead of trees is a collection of constructions. I call it a cloud.

Base constructions are words. When several constructions (e.g. words) tend to go together and comprise something meaningful in the semantics, they form a composite construction and become its arguments, or children. Composite constructions may also be children of even higher-order constructions, although not very often.

Consider I see a little silhouetto of a man as an example. Each word is a construction, so it's already in the cloud. I+see form a construction whose meaning is to specify see's subject as I. Similar things happen with see+silhouetto, though this time it's an object. Both a and little provide us with additional information about silhouetto: it's indeterminate and, well, little. Again, indeterminacy is added to man in a+man construction. The preposition of+man form a prepositional construction, which then itself forms a genitive construction with silhouetto. The result can be nicely visualized:

Isn't this looking like a strangely-colored cloud?

Of course, the composite constructions don't need to have exactly 2 arguments. For example, the more X the more Y construction has 6 arguments, 4 of which are constant words. X is Y to Z (as in John is hard to convince) has 5 arguments with 3 slots (actually 4: is also isn't fixed, it's replaceable with was).

And yes, clouds are better than trees.

Sepulkarium