Tuesday, July 31, 2012

The comma dilemma

There's a sentence to translate, and there's a comma missing there. This doesn't prevent me from understanding the sentence, but my parser fails. In another version of the same text found online the comma is present. It's only missing in the one I've found long ago which I'm trying to translate literally.

So here's the dilemma:

  • Add the comma according to the Russian rules and parse the correct text,   or
  • Remember that I'm trying to simulate humans for whom extra commas are not a big deal, and dive right away into the wonderful world of parsing noisy input without proper understanding how to deal with correct one.

No idea.

UPD. In the end I corrected the commas in the original sentence, and added failing tests to fix later with all kinds of comma omissions.

Sunday, July 22, 2012

Immutable data structures in OOP (Groovy)

In my parser, everything except some minor auxiliary things is immutable, and I cannot express how valuable it is. It helps debugging, testing, backtracking in the algorithm. But there are certain issues with immutability in Groovy (which I use for its syntactic sugar) and other object-oriented languages.


One very often has to copy an object with a subset of its field values replaced, leaving the others as they are. Functional languages normally have special syntax for that; mainstream object-oriented languages normally don't, including Groovy. So I had to resort to an updating function like copyWith variant for Scala. Manually and without any reflection so far. When adding a field, I also have to update the default constructor, the all-fields constructor and the cloning function. Luckily that's not very often, otherwise I'll probably create an AST-transforming annotation that would do this automatically for me.


Groovy has no built-in persistent collections: immutable but cheap to create an updated copy of. I've found that Java ecosystem is not rich in these kinds of libraries at all, and that's a pity. Scala and Clojure have them, but tuned to those languages and I couldn't be bothered to use them so far. There are also pcollections and Functional Java libraries which don't have everything I need (e.g. a persistent version of LinkedHashMap). So I have either to create something myself, or use the standard Java collections very carefully, or just give up and rewrite everything in another language. So far I'm combining the  first and the second variants.

State flow

I've found is that writing complex logic in instance methods may be quite error-prone. I have ParsingState class with apply method which is invoked when processing each word. It takes some possibly contradicting language constructions that a word may be part of, and decides which to activate, i.e. which parsing alternative to choose. This stage involves trying all the alternatives and evaluating their semantic plausibility. Something like this:

Whenever you change anything (obtain a changed copy of ParsingState), you must reassign it to state variable. Not a big deal, but a bit tiresome. Worse is this: everything except the first line has to be prefixed with state, because you need to operate on the latest possible version of the parsing state. And it's so damn easy to forget this qualifier! The IDE will autocomplete it, the compiler will eat it and most probably the problem won't be noticed for some time.

After several such bugs I'm very inclined to wrap all complex logic in static methods where you can't have unqualified references. There's normally only one possible state qualifier then, and if you have some special case, you add another variable to keep an earlier version of state, and to mix them is not so easy. At the same time, I like calling instance methods. It's very convenient: there's a namespace clearly defined by the qualifier type, it contains only what you need. As a result, I create an instance method which immediately calls some private static one where all the logic lies.


So, here's my wishlist for an ideal language:
  1. Support for creating immutable structures easily. I find Scala way to be the most concise. Groovy offers @TupleConstructor annotation which is nice but not a part of the language.
  2. Support for updating immutable structures so that adding a new field requires changing just one place.
  3. Persistent collections with a predictable iteration order. An ability to create them using concise list/map literals would be a plus.
  4. A way of expressing complex state transition logic concisely, clearly and not so error-prone. Like State monad, only simpler, something for human beings. Maybe just a syntactic sugar for var=f(var).  Computing other values alongside state change as well: var,sideResult=f(var).

Friday, July 20, 2012

Visibility ambiguity

I'm now dealing with a quite interesting kind of ambiguity which I call visibility ambiguity. Consider the following three sentences (I might have put too many commas; imagine it's Russian where they're obligatory):
  1. But there, thinking about her words, we got sad.
  2. But there, thinking about her words, which were so cruel, we got sad.
  3. But there, thinking about her words, intonation and gestures, we got sad.
The sentences are the same until the second comma after which they diverge dramatically. To process the next word the parser should determine which previous state it can attach to. In 1, we is attached to there and the top-level sentence is continued. In 2, which relates to words plus comma, a relative clause begins. In 3, there's a conjunction: words gets merged with intonation and re-attached together to about and her.

Obviously, all three attachment sites should be visible to the parser: there, words and comma. And all three continuation possibilities should be explored.

Which is where it gets complicated, because these attachment sites are also incompatible with each other. Once you've attached a word following route 1, you cannot attach the next one to route 2. The parser should regard such attachments as mutually incompatible. There are at least three diverging routes after every comma, each defined by its own set of constructions. And all constructions from different sets should be marked as contradictory, pair-wise. Which is quite a few pairs.

So it's not only that the number of possible variants will be big. It's also that the number of incompatibility relations to track will be much bigger, something pretty exponential. Given that I normally log all the visible attachment sites after each word during parsing, the log will be hard to analyze.

And what troubles me: people don't do this, they don't track all the possibilities. Instead, they quite quickly choose a variant which suits best, purge everything else from the memory and pursue only this parsing route. If it goes wrong eventually, they perform some kind of reanalysis. But the heuristics are tuned well enough and normally such a need never arises.

Right now my parser maintains all the structures needed for all possible analyses, and switching between them is very simple. But that just doesn't scale well enough to support visibility ambiguities. So it seems that the time has come to teach my parser the human strategy: choosing one route and then reparsing when anything doesn't fit well.