Tuesday, October 9, 2012

Parser rewrite completed

In June I started rewriting the parser because there were too many hacks in there. I've just finished: all the tests now pass, that were passing before that decision. Hurray!

It took about 4 months. I hoped it would be faster. Perhaps it could be, if I wasn't procrastinating so much. Now there's a lot less hacks as there's a much stronger support for syntactic hierarchy. Hopefully, adding each new test sentences won't take so long now. Although fixing the last several tests really took several weeks which isn't very promising :)

The plan now is to try to speedup things a little by not trying all the semantic alternatives after each and every word. And then to add new sentences. I have quite a few of them enqueued for parsing and translation.

Friday, September 7, 2012

How I do natural language parsing without combinatorial explosion (almost)

TL;DR. One big parser is made of many little composable parsers, one for each language construct. That's great but has some issues with syntactic hierarchy.


How does one parse a highly ambiguous natural language text? Simply. One just runs several parsers in parallel and chooses the most semantically plausible analysis.

Many words have several alternatives, and avoiding exponential analysis forking is important. The parsers that run in parallel are not for the whole text, they're highly specialized. Each of them only parses a specific construction. When an adverb looks for a head verb, it doesn't care that the noun it encounters instead can be either nominative or accusative. It even doesn't care of the verb's gender, number, finiteness, whatever. Those things are managed by other constructions.

Forking still happens, but in manageable quantities. Example: Russian, Mother loves daughter. Both mother and daughter can be nominative or accusative, so this sentence is globally ambiguous (although it has a preferred Subject-Verb-Object reading). So there are two constructions, nom and acc which both have two noun-verb pairings: mother loves and loves daughter. Needless to say, these variants are mutually exclusive.

Basic parsing algorithm

Parsing works as follows. Each word specifies the constructions in which it can participate, and which role it can play in those constructions. Formally, it contributes a number of mites — pieces of constructions with some attributes attached. Mites can be marked as contradictory, e.g. mother can participate as a noun in nom or acc constructions, but not in both.

Parsing state also consists of mites. When applying a contribution, the new mites are added to the state. Besides, the mites that were already there, now have an opportunity to analyze the whole contribution and generate yet more mites to add. The newly added mites form the new parsing state which is ready for the new word to come.

The generation of new mites based on the contribution is called enrichment. Most often it's just a unification. Example:
  • you have mother which defines nom and acc mites with noun attribute (contradicting each other)
  • you meet loves which defines nom and acc mites with head attribute (not contradicting)
  • then you can just unify the mites with the same construction and get nom and acc mites with both noun and head defined (also contradicting).
Sometimes word order is important, so you only unify mites that come in a specific order. For example, prepositions can only come before their dependent noun, so some conditional logic is needed for enrichment in this case. When parsing sequences (A and B), yet more complex enrichments are used. They merge whole series of mites that both A and B contribute.


Each construction has a meaning function associated with it. That means, each mite can contribute semantic relations. Example: nom construction links its head and noun with arg1 predicate. Of course, this can only happen when both head and noun are defined, i.e. both the verb and the noun have occurred.

The current meaning of the whole parsing state is produced by choosing a subset of compatible mites and combining their meanings. This choosing is a complex process with several (contradictory) guidelines:
  • maintain status quo: once a mite is chosen, keep it
  • if a chosen mite gets unified, try to choose the unification result
  • for every word contribution, the order is important: first mites listed have higher priority (that's how we get SVO reading in mother loves daughternom is just preferred over acc)
  • prefer complete mites over incomplete; nom with both head and noun defined is definitely better than a mite with only one attribute.
It can still happen that a chosen subset makes no sense semantically, while another one would be more plausible. To account for this, on each iteration the parser also checks some alternatives and chooses the more plausible ones. That's a bit inefficient place and I'm thinking of ways to make it more smart.


I've tried to live with all this above but without any syntactic hierarchy, and that proved to be very inconvenient. Now I have some kind of hierarchy.

Every parsing state has a frontier. That's mites that appeared most recently: either generated directly by the previous word or from enrichment process based on this generation. These mites will always have a possibility of enriching new word contributions.

Some of the frontier mites have a special structural capability. They can expose an earlier parsing state so that mites from its frontier will also be called for enrichment. 

Example: after mother loves daughter there is a frontier having the mites related to daughter. That's fine if the next word is that starting a relative clause describing the daughter. But if the next word is strongly, then it should be linked to the verb loves which is not in the frontier. That's why one of the frontier constructions, namely the unified acc having both head and noun, will expose the earlier parsing state with loves in the frontier. Correct enrichment and unification of adverb construction will then be possible.

The topmost element in the hierarchy (usually verb) exposes an empty parsing state. So do mites that want to disallow free unification and control everything: commas, conjuncts, prepositions.

Dealing with visibility ambiguity

There may be several mites in the frontier exposing different earlier parsing states. That results in a visibility ambiguity. Not that I have a great recipe for this, but I have a way to deal with such circumstances.

If different mites expose different states, those mites should be mutually exclusive. So once you have chosen some non-contradicting mites for semantic evaluation, the previous state is defined unambigously.

Given the "status quo" guideline in the choosing algorithm, this results in structurally serial parsing. Once you've chosen one structural analysis of many, you're likely to stay on this parsing route. Luckily, so far I haven't encountered too many structural fork places. Just the comma.

But still, there are moments when you've chosen a wrong route and it needs to be corrected. This implies the parser should be able to do two things:
  1. One should detect that the current route is wrong. It's not simple, because a seemingly wrong analysis may be easily become correct once the next word arrives. Right now, I have a dumb ad-hoc code that analyses all the mites, unified or not. It detects things like "we're parsing a comma-separated sequence, but we haven't yet met a second member, but here's a participle that's usually surrounded by commas. Maybe we should have chosen the participle-parsing route".
  2. After we've realized our route is wrong, we should switch to a correct one as quickly as possible. Right now I use the way that was the easiest to implement, but absolutely psycholinguiscally implausible. I just go back in parsing state, try to choose mites with different structural properties and reparse everything after it. So far it works but I'm desperately thinking of something more efficient.
That's how my parser currently works, generally.

Monday, September 3, 2012

Groovy highlighting, continued

I was wrong. I thought that ignoring method argument types was enough to avoid complex data flow with cyclic dependencies. I absolutely forgot about method call qualifier. It can be complex and contain many local variable references which may easily lead to cyclic dependencies. Too many method calls have qualifiers, and ignoring all of them would render the argument-based type inference feature.

I tried to decrease the performance slowdown by caching the things that are invoked during resolve but don't depend on other local variables and thus don't lead to cyclic dependencies and are actually cached. For example, the list of all non-local declarations visible from a particular reference in code. That gave some speedup, but not enough for me.

So I remembered that once there was a completely different type inference algorithm. It was control-flow-based, it was smart, it inferred the local variable types for the whole method at once. It walked the control flow graph and incrementally refined the variable types. I had abandoned it because it had three issues:

  1. Each iteration used the previously computed types which were stored in a global thread-local variable. Not beautiful.
  2. The results based on these partially-computed types were cached elsewhere and used by other highlighting threads which led to spurious errors in the editor.
  3. Sometimes whole-method inference is not needed and slow, e.g. when you're searching for usages and want to resolve just one reference as quickly as possible.
Now I had to give a second thought about abandoning that algorithm. OK, it had all those problems, but at least it was fast. So why not try to solve those problems another way?
  1. Non-beautiful global thread-local state. That's easy, you just have to pass this state around to all the methods that deal with type inference and resolve. Easy, yes, but quite tiresome. After about two hours of non-intellectual parameter propagation, when my changelist exceeded 50 files, I decided I can live with this global thread-local state and reverted everything.
  2. The problem of caching. It's actually easy as well. There's not many places whose caching leads to multi-threading problems, so it's easy to introduce yet another indirection level there and cache those results in the same global thread-local state during type inference.
  3. Finally, the "find usages will be slow again" problem. One can analyze the variable dependency graph on each other and infer the types for a strongly-connected subcomponent of that graph. So far I haven't done this though.
But I have done everything else! I've restored the algorithm and adapted it to the new feature set, which was tricky. Now Groovy type inference in IntelliJ IDEA doesn't introduce any cyclic dependencies caused by local variables and is not exponential. Seems fast enough on my cases, but of course, there's always something to improve in that area. In fact, for some code it's a bit slower now, I have to investigate that.

Thursday, August 30, 2012

Recursion, caching and Groovy highlighting

It's hard to write anything (e.g. my parser) when it takes 20 seconds for IntelliJ IDEA to highlight a not very large Groovy file. And after you rename anything, you wait the same amount of time while it's optimizing imports. Today I finally decided to investigate the issue. Most of the highlighting time is spent in reference resolution: an IDE must know what every identifier refers to.


As it happens, nowadays most performance problems in reference resolution are caused by cyclic dependencies. That means, resolving one reference requires resolving another one, and that one in turn requires the first one. As we use data flow to determine the variable types in Groovy, this is easy to encounter when you have a loop:

def a = new A()
def b = new B()
while (...) {
  a = b.nextA()
  b = a.nextB()

In a = b.nextA() one should first determine the type of b, then find a nextA method on that type and finally use its return type for a. But we should take into account that this might be not a first iteration of the loop, therefore b might be assigned in the very bottom of the loop body, so its type should be taken from there. And there it depends on the type of a defined in the same assignment that we are looking at.

That's just impossible to figure out. So if we discover such a cyclic dependency, we give up on inferring this particular expression type and use other assignments available. In this case a would just have the type A even if b.nextB() actually returns some AImpl.

Technically, it's just a stack overflow prevention. We maintain the list of expressions that we are currently calculating the type of. If we're asked for a type of such an expression, we just return null. The caller is ready for that.


Things get more complicated because we also cache the type of each expression. The problem is, we shouldn't cache null returned by stack overflow prevention functionality. Moreover, we shouldn't cache anything that depends on that null. Because in the end the type of a will not be null, it'll be A. And anything that depends on it will have a normal type based on A. If we cached an incomplete type, another highlighting thread would come and use it, and highlight a wrong error.

That's why endless recursion prevention and caching should know about each other. In IDEA, we have RecursionManager class which does precisely that.

As a result, if we have lots of cyclic dependencies, we don't cache lots of things. In fact, RecursionManager tries to memoize some calculation results that don't directly lie on a cyclic dependency. And this memoization is what makes this class really complex. It speeds up things quite a bit, but still, the best solution is not to create cyclic dependencies unless one really needs to.

Back to Groovy

So I went commenting out various parts of my Groovy code and putting a breakpoint on return null in RecursionManager. And here's what I found.

There were a couple of plain bugs which also led to cyclic dependencies: incorrect control flow for for-in loops, and too wide search scope when resolving a super class name.

Some Groovy AST transformations (e.g. @TupleConstructor) add constructors with parameters based on the class properties. Property list retrieval requires traversing all class methods and fields. Constructors are also methods in IDEA's terminology, so the transformation handler depends on itself. Fixed in the handler by carefully querying only the physical fields and methods, without running any extenders.

Finally, a new feature in IDEA 12. Each time a variable is passed to a method, IDEA checks the corresponding formal parameter type and narrows the inferred variable type accordingly if needed. Example:

def s = ...
if ("abc".contains(s)) ...
// now IDEA knows that s is a String

Unfortunately to resolve a method one should know the argument types. Moreover, after resolving one should also map arguments to parameters which is quite non-trivial in Groovy given all those default parameter values, named arguments. One should choose a correct overload. If a method is parameterized, then one argument type may be inferred from another. And we are right now calculating and refining the type of one of the arguments. Another cyclic dependency, quite a nasty one.

The best solution I have invented so far is to restrict this feature. Only allow cases when we can unambiguously resolve the method and map arguments to parameters without knowing argument types. This rules out methods with overloads, with default arguments, with generic parameters. Most of the methods I've ever encountered don't have it all anyway.

That's how I spent about 7 hours today. With all those changes (mostly with the last one) it now takes 3 seconds to highlight a file (instead of 20). Still not ideal, but bearable. And I can finally relax, learn some new things about Model Thinking and sleep.

UPD. Continued.

Monday, August 27, 2012

Suddenly filler-gap dependencies

Theoreticians say: in sentences like I know who you saw the deep structure is in fact I know [you saw who]. And in the surface structure there's an invisible gap after saw which is filled by who. Many psycholinguistic studies also seem to confirm that: upon seeing who people start to wait for a right place for it and only settle down after finding it.

Due to Russian free word order, I've had a luxury to ignore this complex thing for a while and treat wh-words as normal verbal arguments, like pronouns. But then two surprises have come.

One surprise was that implementing filler-gap dependencies was the easiest way to resolve a nasty ambiguity. Russian has a word что which can be either a complementizer (я знаю, что ты видел его; I know that you saw him) or a wh-word (я знаю, что ты видел; I know what you saw). The first one is higher in structure than the verb, the second one is lower. This made my parser suffer, it still doesn't like visibility ambiguities very well. Now что is no longer a verbal argument directly, it's a filler and is also higher in the hierarchy, just like in many syntactic theories.

Another surprise was that all this was actually very easy to add in the current parser architecture (given that there's no pied-piping yet). The filler is just a special construction which listens for what the incoming words contribute. If a contribution looks as a right head for the filler's grammatical functions, the contribution is enriched accordingly.

Example: Russian wh-word что can be in nominative or accusative case. For normal nouns that would mean it should generate nom and acc construction mites with noun attribute defined, pointing to a frame with some special wh semantic type. In the filler-gap approach it generates a filler construction instead which then sits and waits until it sees a contribution with nom or acc mites with head attribute defined. E.g. saw as a verb can be a head to both nominative and accusative arguments. The filler construction then adds a nom/acc mite having both head and noun attributes, where the noun points to a frame with wh type, and the head comes from the verb.

So how my parser works in this aspect now is quite similar to human sentence processing: a wh-word creates an active filler that finds a gap when there comes a verb with suitable argument requirements.

Tuesday, July 31, 2012

The comma dilemma

There's a sentence to translate, and there's a comma missing there. This doesn't prevent me from understanding the sentence, but my parser fails. In another version of the same text found online the comma is present. It's only missing in the one I've found long ago which I'm trying to translate literally.

So here's the dilemma:

  • Add the comma according to the Russian rules and parse the correct text,   or
  • Remember that I'm trying to simulate humans for whom extra commas are not a big deal, and dive right away into the wonderful world of parsing noisy input without proper understanding how to deal with correct one.

No idea.

UPD. In the end I corrected the commas in the original sentence, and added failing tests to fix later with all kinds of comma omissions.

Sunday, July 22, 2012

Immutable data structures in OOP (Groovy)

In my parser, everything except some minor auxiliary things is immutable, and I cannot express how valuable it is. It helps debugging, testing, backtracking in the algorithm. But there are certain issues with immutability in Groovy (which I use for its syntactic sugar) and other object-oriented languages.


One very often has to copy an object with a subset of its field values replaced, leaving the others as they are. Functional languages normally have special syntax for that; mainstream object-oriented languages normally don't, including Groovy. So I had to resort to an updating function like copyWith variant for Scala. Manually and without any reflection so far. When adding a field, I also have to update the default constructor, the all-fields constructor and the cloning function. Luckily that's not very often, otherwise I'll probably create an AST-transforming annotation that would do this automatically for me.


Groovy has no built-in persistent collections: immutable but cheap to create an updated copy of. I've found that Java ecosystem is not rich in these kinds of libraries at all, and that's a pity. Scala and Clojure have them, but tuned to those languages and I couldn't be bothered to use them so far. There are also pcollections and Functional Java libraries which don't have everything I need (e.g. a persistent version of LinkedHashMap). So I have either to create something myself, or use the standard Java collections very carefully, or just give up and rewrite everything in another language. So far I'm combining the  first and the second variants.

State flow

I've found is that writing complex logic in instance methods may be quite error-prone. I have ParsingState class with apply method which is invoked when processing each word. It takes some possibly contradicting language constructions that a word may be part of, and decides which to activate, i.e. which parsing alternative to choose. This stage involves trying all the alternatives and evaluating their semantic plausibility. Something like this:

Whenever you change anything (obtain a changed copy of ParsingState), you must reassign it to state variable. Not a big deal, but a bit tiresome. Worse is this: everything except the first line has to be prefixed with state, because you need to operate on the latest possible version of the parsing state. And it's so damn easy to forget this qualifier! The IDE will autocomplete it, the compiler will eat it and most probably the problem won't be noticed for some time.

After several such bugs I'm very inclined to wrap all complex logic in static methods where you can't have unqualified references. There's normally only one possible state qualifier then, and if you have some special case, you add another variable to keep an earlier version of state, and to mix them is not so easy. At the same time, I like calling instance methods. It's very convenient: there's a namespace clearly defined by the qualifier type, it contains only what you need. As a result, I create an instance method which immediately calls some private static one where all the logic lies.


So, here's my wishlist for an ideal language:
  1. Support for creating immutable structures easily. I find Scala way to be the most concise. Groovy offers @TupleConstructor annotation which is nice but not a part of the language.
  2. Support for updating immutable structures so that adding a new field requires changing just one place.
  3. Persistent collections with a predictable iteration order. An ability to create them using concise list/map literals would be a plus.
  4. A way of expressing complex state transition logic concisely, clearly and not so error-prone. Like State monad, only simpler, something for human beings. Maybe just a syntactic sugar for var=f(var).  Computing other values alongside state change as well: var,sideResult=f(var).

Friday, July 20, 2012

Visibility ambiguity

I'm now dealing with a quite interesting kind of ambiguity which I call visibility ambiguity. Consider the following three sentences (I might have put too many commas; imagine it's Russian where they're obligatory):
  1. But there, thinking about her words, we got sad.
  2. But there, thinking about her words, which were so cruel, we got sad.
  3. But there, thinking about her words, intonation and gestures, we got sad.
The sentences are the same until the second comma after which they diverge dramatically. To process the next word the parser should determine which previous state it can attach to. In 1, we is attached to there and the top-level sentence is continued. In 2, which relates to words plus comma, a relative clause begins. In 3, there's a conjunction: words gets merged with intonation and re-attached together to about and her.

Obviously, all three attachment sites should be visible to the parser: there, words and comma. And all three continuation possibilities should be explored.

Which is where it gets complicated, because these attachment sites are also incompatible with each other. Once you've attached a word following route 1, you cannot attach the next one to route 2. The parser should regard such attachments as mutually incompatible. There are at least three diverging routes after every comma, each defined by its own set of constructions. And all constructions from different sets should be marked as contradictory, pair-wise. Which is quite a few pairs.

So it's not only that the number of possible variants will be big. It's also that the number of incompatibility relations to track will be much bigger, something pretty exponential. Given that I normally log all the visible attachment sites after each word during parsing, the log will be hard to analyze.

And what troubles me: people don't do this, they don't track all the possibilities. Instead, they quite quickly choose a variant which suits best, purge everything else from the memory and pursue only this parsing route. If it goes wrong eventually, they perform some kind of reanalysis. But the heuristics are tuned well enough and normally such a need never arises.

Right now my parser maintains all the structures needed for all possible analyses, and switching between them is very simple. But that just doesn't scale well enough to support visibility ambiguities. So it seems that the time has come to teach my parser the human strategy: choosing one route and then reparsing when anything doesn't fit well.

Monday, June 18, 2012

7 stages of exploratory programming

  1. One has an idea on natural language understanding, starts implementing it. Cool, it works!
  2. One encounters some sentences not easy to deal with. It's not clear how to do it the right way, beautifully, so one does it somehow  by dirty hacking. Hopefully, later it'll become more clear when new data arrives.
  3. As more and more sentences sentences are encountered, some hacks are corrected but others are added. Hacks proliferate and get tightly intertwined.
  4. It becomes harder to parse new sentences. Hacks start getting in the way. Fortunately, one starts seeing how to avoid them.
  5. Correcting the design appears to be non-trivial because hacks depend on each other. So to fix one you should first correct another one, which requires third one, and so on. There's a whole dependency graph of hacks!
  6. The test suite doesn't seem a friend anymore. It keeps failing, revealing more and more ways hacks depend on each other. No new sentences get parsed anymore.
  7. After half a year of pure refactoring, the morale is low and still no light at the end of the tunnel. Yet another cyclic hack dependency is found. Tired, one decides to apply all the desired design changes at once and fix the tests one by one, almost as if writing the parser afresh. Goto 1.