Friday, September 7, 2012

How I do natural language parsing without combinatorial explosion (almost)

TL;DR. One big parser is made of many little composable parsers, one for each language construct. That's great, but it raises some issues with syntactic hierarchy.

Overview

How does one parse a highly ambiguous natural language text? Simple: one just runs several parsers in parallel and chooses the most semantically plausible analysis.

Many words admit several alternative analyses, so avoiding exponential forking is important. The parsers that run in parallel don't parse the whole text; they're highly specialized, and each of them parses only a specific construction. When an adverb looks for a head verb, it doesn't care that the noun it encounters instead can be either nominative or accusative. It doesn't even care about the verb's gender, number, finiteness, whatever. Those things are managed by other constructions.

Forking still happens, but in manageable quantities. Example: Russian, Mother loves daughter. Both mother and daughter can be nominative or accusative, so this sentence is globally ambiguous (although it has a preferred Subject-Verb-Object reading). So there are two constructions, nom and acc, each allowing two noun-verb pairings: mother loves and loves daughter. Needless to say, these variants are mutually exclusive.

Basic parsing algorithm

Parsing works as follows. Each word specifies the constructions in which it can participate, and which role it can play in those constructions. Formally, it contributes a number of mites — pieces of constructions with some attributes attached. Mites can be marked as contradictory, e.g. mother can participate as a noun in nom or acc constructions, but not in both.
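
To make this concrete, here is a minimal sketch in Kotlin of what a mite and a word's contribution might look like. The names (Mite, exclusionGroup, contribution) are hypothetical illustrations, not my actual code:

    // A mite: a piece of a construction with attribute values attached.
    // Mites sharing an exclusion group are mutually exclusive.
    data class Mite(
        val construction: String,               // e.g. "nom", "acc"
        val attributes: Map<String, String>,    // e.g. {"noun": "mother"}
        val exclusionGroup: Int? = null
    )

    // Each word contributes the mites it can participate in: "mother" can be
    // the noun of either nom or acc, but not both; "loves" can head both.
    fun contribution(word: String, group: Int): List<Mite> = when (word) {
        "mother", "daughter" -> listOf(
            Mite("nom", mapOf("noun" to word), group),
            Mite("acc", mapOf("noun" to word), group)
        )
        "loves" -> listOf(
            Mite("nom", mapOf("head" to word)),
            Mite("acc", mapOf("head" to word))
        )
        else -> emptyList()
    }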

The parsing state also consists of mites. When a word's contribution is applied, the new mites are added to the state. In addition, the mites that were already there get an opportunity to analyze the whole contribution and generate yet more mites to add. The newly added mites form the new parsing state, which is ready for the next word to come.

The generation of new mites based on a contribution is called enrichment. Most often it's just unification. Example:
  • you have mother, which defines nom and acc mites with a noun attribute (contradicting each other)
  • you meet loves, which defines nom and acc mites with a head attribute (not contradicting)
  • then you can just unify the mites of the same construction and get nom and acc mites with both noun and head defined (also contradicting each other).
Sometimes word order is important, so you only unify mites that come in a specific order. For example, prepositions can only come before their dependent noun, so enrichment needs some conditional logic in that case. When parsing sequences (A and B), yet more complex enrichments are used: they merge whole series of mites that both A and B contribute.
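
Sticking with the hypothetical Kotlin sketch from above, unification merges two mites of the same construction whose attributes don't clash; the order-sensitive cases would additionally check the mites' positions before unifying:

    // Unify two mites of the same construction if their attributes agree;
    // returns null when unification is impossible.
    fun unify(a: Mite, b: Mite): Mite? {
        if (a.construction != b.construction) return null
        val clash = (a.attributes.keys intersect b.attributes.keys)
            .any { a.attributes[it] != b.attributes[it] }
        if (clash) return null
        return Mite(a.construction, a.attributes + b.attributes,
                    a.exclusionGroup ?: b.exclusionGroup)
    }

For mother + loves this yields nom(noun=mother, head=loves) and acc(noun=mother, head=loves), each still mutually exclusive with the other.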

Semantics

Each construction has a meaning function associated with it, which means each mite can contribute semantic relations. Example: the nom construction links its head and noun with the arg1 predicate. Of course, this can only happen when both head and noun are defined, i.e. both the verb and the noun have occurred.
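
As a sketch, such a meaning function could look like this (hypothetical names again; arg1 is the predicate from the example):

    // A semantic relation produced by a construction's meaning function.
    data class Relation(val head: String, val predicate: String, val arg: String)

    // nom links its head and noun with arg1, but only once the mite is
    // complete; incomplete mites contribute nothing yet.
    fun meaning(m: Mite): Relation? {
        val head = m.attributes["head"]
        val noun = m.attributes["noun"]
        return if (m.construction == "nom" && head != null && noun != null)
            Relation(head, "arg1", noun)
        else null
    }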

The current meaning of the whole parsing state is produced by choosing a subset of compatible mites and combining their meanings. This choice is a complex process with several (contradictory) guidelines:
  • maintain status quo: once a mite is chosen, keep it
  • if a chosen mite gets unified, try to choose the unification result
  • for every word contribution, the order is important: the mites listed first have higher priority (that's how we get the SVO reading in mother loves daughter: nom is simply preferred over acc)
  • prefer complete mites over incomplete ones; a nom with both head and noun defined is definitely better than a mite with only one attribute.
It can still happen that a chosen subset makes no sense semantically while another one would be more plausible. To account for this, on each iteration the parser also checks some alternatives and chooses the more plausible ones. That part is a bit inefficient, and I'm thinking of ways to make it smarter.
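
Ignoring the plausibility check and the unification-tracking guideline, a toy version of that choice is a greedy walk: previously chosen mites first (status quo), then the rest in contribution order, preferring more complete mites and skipping whatever contradicts the picks so far. A hypothetical sketch:

    // Greedy choice of a compatible subset. The sort is stable, so mites with
    // equally many attributes keep their contribution order (nom before acc).
    fun choose(previous: List<Mite>, all: List<Mite>): List<Mite> {
        val candidates =
            previous.sortedByDescending { it.attributes.size } +
            (all - previous.toSet()).sortedByDescending { it.attributes.size }
        val chosen = mutableListOf<Mite>()
        val takenGroups = mutableSetOf<Int>()
        for (m in candidates) {
            val g = m.exclusionGroup
            if (g == null || g !in takenGroups) {
                chosen += m
                if (g != null) takenGroups += g
            }
        }
        return chosen
    }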

Hierarchy

I've tried to live with all of the above but without any syntactic hierarchy, and that proved to be very inconvenient. Now I have some kind of hierarchy.

Every parsing state has a frontier: the mites that appeared most recently, either contributed directly by the previous word or generated by enrichment based on that contribution. These mites always have a chance to enrich new word contributions.

Some of the frontier mites have a special structural capability: they can expose an earlier parsing state, so that mites from its frontier will also be called for enrichment.

Example: after mother loves daughter, the frontier contains the mites related to daughter. That's fine if the next word is that, starting a relative clause describing the daughter. But if the next word is strongly, it should be linked to the verb loves, which is not in the frontier. That's why one of the frontier constructions, namely the unified acc having both head and noun, will expose the earlier parsing state with loves in its frontier. Correct enrichment and unification of the adverb construction will then be possible.

The topmost element in the hierarchy (usually a verb) exposes an empty parsing state. So do mites that want to disallow free unification and control everything: commas, conjuncts, prepositions.
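
In code, the frontier and the exposure links might look roughly like this (hypothetical sketch; the real visibility rules also depend on which mites are currently chosen):

    // A parsing state: its frontier mites, plus, for some of them, a link to
    // an earlier state whose frontier also becomes visible for enrichment.
    class ParsingState(
        val frontier: List<Mite>,
        val exposed: Map<Mite, ParsingState> = emptyMap()
    ) {
        // Mites that may enrich the next word's contribution: the frontier
        // itself plus the frontiers of the exposed earlier states.
        fun visibleMites(): List<Mite> =
            frontier + exposed.values.flatMap { it.visibleMites() }
    }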

Dealing with visibility ambiguity

There may be several mites in the frontier exposing different earlier parsing states, which results in a visibility ambiguity. Not that I have a great recipe for this, but I do have a way to deal with such circumstances.

If different mites expose different states, those mites should be mutually exclusive. So once you have chosen some non-contradicting mites for semantic evaluation, the previous state is defined unambiguously.
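
Given the chosen subset, determining the previous state then becomes trivial (sketch, reusing the hypothetical ParsingState above):

    // With mutually exclusive exposing mites, a consistent chosen subset
    // points to at most one earlier state.
    fun previousState(chosen: List<Mite>, state: ParsingState): ParsingState? =
        chosen.mapNotNull { state.exposed[it] }.singleOrNull()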

Given the "status quo" guideline in the choosing algorithm, this results in structurally serial parsing: once you've chosen one structural analysis of many, you're likely to stay on that parsing route. Luckily, so far I haven't encountered too many structural fork points. Just the comma.

But still, there are moments when you've chosen a wrong route and it needs to be corrected. This implies the parser should be able to do two things:
  1. Detect that the current route is wrong. That's not simple, because a seemingly wrong analysis may easily become correct once the next word arrives. Right now I have some dumb ad-hoc code that analyzes all the mites, unified or not. It detects things like "we're parsing a comma-separated sequence, but we haven't yet met a second member, yet here's a participle that's usually surrounded by commas. Maybe we should have chosen the participle-parsing route".
  2. After realizing the route is wrong, switch to a correct one as quickly as possible. Right now I use the approach that was the easiest to implement, but absolutely psycholinguistically implausible: I just go back to an earlier parsing state, try to choose mites with different structural properties, and reparse everything after that point (see the sketch below). So far it works, but I'm desperately thinking of something more efficient.
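
Here is that fallback as a hypothetical sketch; alternativeAt and apply stand in for the real choosing and contribution machinery:

    // Go back through the saved states, force a different structural choice
    // somewhere, and reparse the remaining words from that point on.
    fun backtrack(
        words: List<String>,
        states: List<ParsingState>,            // states[i] = state after words[i]
        alternativeAt: (ParsingState) -> ParsingState?,
        apply: (ParsingState, String) -> ParsingState
    ): ParsingState? {
        for (back in states.indices.reversed()) {
            var state = alternativeAt(states[back]) ?: continue
            for (i in back + 1 until words.size) state = apply(state, words[i])
            return state   // take the first alternative route that exists
        }
        return null
    }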
That's how my parser currently works, generally.

Monday, September 3, 2012

Groovy highlighting, continued

I was wrong. I thought that ignoring method argument types was enough to avoid complex data flow with cyclic dependencies, but I completely forgot about the method call qualifier. It can be complex and contain many local variable references, which may easily lead to cyclic dependencies. Too many method calls have qualifiers, and ignoring all of them would render the argument-based type inference feature useless.

I tried to reduce the slowdown by caching the things that are invoked during resolve but don't depend on other local variables, and thus don't lead to cyclic dependencies and can safely be cached. For example, the list of all non-local declarations visible from a particular reference in the code. That gave some speedup, but not enough for me.
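
The shape of that caching, as a hypothetical sketch (not the actual IDEA code): only computations whose results can't depend on the local variable types being inferred go into the cache:

    // Cache per-reference results that don't depend on local variable types,
    // e.g. the non-local declarations visible from a given reference.
    class ResolveCache<K : Any, V : Any> {
        private val cache = HashMap<K, V>()
        fun cached(key: K, compute: (K) -> V): V =
            cache.getOrPut(key) { compute(key) }
    }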

So I remembered that once there was a completely different type inference algorithm. It was control-flow-based, it was smart, and it inferred the local variable types for the whole method at once: it walked the control flow graph and incrementally refined the variable types (a sketch of this kind of algorithm follows the list below). I had abandoned it because it had three issues:

  1. Each iteration used the previously computed types, which were stored in a global thread-local variable. Not beautiful.
  2. The results based on these partially computed types were cached elsewhere and used by other highlighting threads, which led to spurious errors in the editor.
  3. Sometimes whole-method inference is unnecessary and slow, e.g. when you're searching for usages and want to resolve just one reference as quickly as possible.
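
For reference, this kind of algorithm can be sketched as a standard worklist pass over the control flow graph, joining variable types where paths merge until a fixed point is reached (hypothetical Kotlin; join would compute the least upper bound of two types, and node ids are list indices):

    // Flow-based type inference sketch: propagate variable -> type maps along
    // CFG edges, refining and joining them until nothing changes.
    data class Node(val assigns: Pair<String, String>?, val next: List<Int>)

    fun inferTypes(cfg: List<Node>, join: (String, String) -> String): List<Map<String, String>> {
        val before = List(cfg.size) { mutableMapOf<String, String>() }
        val worklist = ArrayDeque(cfg.indices.toList())
        while (worklist.isNotEmpty()) {
            val id = worklist.removeFirst()
            val after = HashMap(before[id])
            cfg[id].assigns?.let { (v, t) -> after[v] = t }   // an assignment refines a type
            for (succ in cfg[id].next) {
                var changed = false
                for ((v, t) in after) {
                    val merged = before[succ][v]?.let { join(it, t) } ?: t
                    if (before[succ][v] != merged) { before[succ][v] = merged; changed = true }
                }
                if (changed) worklist += succ                 // reprocess affected successors
            }
        }
        return before
    }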
Now I had to give a second thought to abandoning that algorithm. OK, it had all those problems, but at least it was fast. So why not try to solve those problems another way?
  1. Non-beautiful global thread-local state. That's easy: you just have to pass this state around to all the methods that deal with type inference and resolve. Easy, yes, but quite tiresome. After about two hours of non-intellectual parameter propagation, when my changelist exceeded 50 files, I decided I could live with the global thread-local state (see the sketch after this list) and reverted everything.
  2. The caching problem. It's actually easy as well: there aren't many places whose caching leads to multi-threading problems, so it's easy to introduce yet another indirection level there and cache those results in the same global thread-local state during type inference.
  3. Finally, the "find usages will be slow again" problem. One can analyze how the variables depend on each other and infer the types only for a strongly connected component of that dependency graph. So far I haven't done this, though.
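
For points 1 and 2, the thread-local pattern is roughly this (hypothetical sketch):

    // Inference-local state: partial types plus the caches that would
    // otherwise poison shared ones, scoped to one whole-method inference run.
    object InferenceContext {
        private val state = ThreadLocal<MutableMap<String, Any>?>()
        fun <T> runWithState(compute: () -> T): T {
            state.set(mutableMapOf())
            try { return compute() } finally { state.remove() }
        }
        fun current(): MutableMap<String, Any>? = state.get()
    }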
But I have done everything else! I've restored the algorithm and adapted it to the new feature set, which was tricky. Now Groovy type inference in IntelliJ IDEA doesn't introduce any cyclic dependencies caused by local variables and is not exponential. It seems fast enough on my test cases, but of course there's always something to improve in that area. In fact, for some code it's a bit slower now; I have to investigate that.