Thursday, August 30, 2012

Recursion, caching and Groovy highlighting

It's hard to write anything (e.g. my parser) when it takes 20 seconds for IntelliJ IDEA to highlight a Groovy file that isn't even very large. And after you rename anything, you wait the same amount of time while it optimizes imports. Today I finally decided to investigate the issue. Most of the highlighting time is spent in reference resolution: an IDE must know what every identifier refers to.

Recursion

As it happens, most performance problems in reference resolution nowadays are caused by cyclic dependencies: resolving one reference requires resolving another one, which in turn requires the first. Since we use data flow to determine variable types in Groovy, this is easy to run into when you have a loop:

def a = new A()
def b = new B()
while (...) {
  a = b.nextA()
  ...
  b = a.nextB()
}

In a = b.nextA() one should first determine the type of b, then find a nextA method on that type, and finally use its return type for a. But we have to take into account that this might not be the first iteration of the loop, so b might have been assigned at the very bottom of the loop body, and its type should be taken from there. And there it depends on the type of a defined in the very assignment we are looking at.

That's just impossible to figure out. So when we discover such a cyclic dependency, we give up on inferring that particular expression's type and use the other available assignments instead. In this case a would just have the type A, even if b.nextA() actually returns some AImpl.

Technically, it's just stack overflow prevention. We maintain a list of the expressions whose types we are currently calculating. If we're asked for the type of such an expression, we just return null. The caller is prepared for that.
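
Here is a minimal sketch of that idea (the class and method names are made up; this is not the actual IDEA code): a per-thread set of expressions whose types are currently being computed, with null returned on re-entry.

class TypeInferrer {
  private final ThreadLocal<Set<Object>> inProgress = ThreadLocal.withInitial { [] as Set }

  String typeOf(Object expression, Closure<String> doInfer) {
    Set<Object> active = inProgress.get()
    if (!active.add(expression)) {
      return null                 // cyclic dependency detected: the caller must handle null
    }
    try {
      return doInfer(expression)  // the actual type calculation, which may call typeOf again
    } finally {
      active.remove(expression)
    }
  }
}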

Caching

Things get more complicated because we also cache the type of each expression. The problem is, we shouldn't cache the null returned by the stack-overflow prevention, and moreover, we shouldn't cache anything that depends on that null. In the end the type of a will not be null, it'll be A, and anything that depends on it will have a normal type based on A. If we cached an incomplete type, another highlighting thread would come along, use it, and highlight a spurious error.

That's why endless-recursion prevention and caching have to know about each other. In IDEA, we have the RecursionManager class, which does precisely that.

As a result, if we have lots of cyclic dependencies, we don't get to cache much. In fact, RecursionManager tries to memoize some calculation results that don't directly lie on a cyclic dependency, and this memoization is what makes the class really complex. It speeds things up quite a bit, but still, the best solution is not to create cyclic dependencies unless one really needs to.
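
An illustrative, deliberately conservative sketch of how caching and recursion prevention can cooperate (again with made-up names, and much simpler than the real RecursionManager, which does memoize results off the cycle): a type is cached only if no cycle was hit anywhere below it during the computation.

import java.util.concurrent.ConcurrentHashMap

class CachingTypeInferrer {
  private final Map<Object, String> cache = new ConcurrentHashMap<Object, String>()
  private final ThreadLocal<Set<Object>> inProgress = ThreadLocal.withInitial { [] as Set }
  private final ThreadLocal<Boolean> cycleHit = ThreadLocal.withInitial { false }

  String typeOf(Object expr, Closure<String> doInfer) {
    String cached = cache[expr]
    if (cached != null) return cached

    Set<Object> active = inProgress.get()
    if (!active.add(expr)) {
      cycleHit.set(true)          // taint every computation currently on the stack
      return null
    }
    boolean outerTaint = cycleHit.get()
    cycleHit.set(false)
    try {
      String result = doInfer(expr)
      if (result != null && !cycleHit.get()) {
        cache[expr] = result      // no cycle was involved, so the result is complete and cacheable
      }
      return result
    } finally {
      cycleHit.set(cycleHit.get() || outerTaint)  // propagate the taint to the caller
      active.remove(expr)
    }
  }
}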

Back to Groovy

So I went about commenting out various parts of my Groovy code and putting a breakpoint on the return null in RecursionManager. And here's what I found.

There were a couple of plain bugs which also led to cyclic dependencies: incorrect control flow for for-in loops, and too wide a search scope when resolving a super class name.

Some Groovy AST transformations (e.g. @TupleConstructor) add constructors whose parameters are based on the class properties. Retrieving the property list requires traversing all of the class's methods and fields. Constructors are also methods in IDEA's terminology, so the transformation handler ended up depending on itself. I fixed this in the handler by carefully querying only the physical fields and methods, without running any extenders.
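
For reference, here is what @TupleConstructor does at the language level (the Person class is just an example); the IDE has to model the generated constructor when listing the class's methods:

import groovy.transform.TupleConstructor

@TupleConstructor
class Person {
  String name
  int age
}

// the transformation generates a constructor whose parameters follow the declared properties
assert new Person('Anna', 30).name == 'Anna'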

Finally, there's a new feature in IDEA 12. Each time a variable is passed to a method, IDEA checks the corresponding formal parameter type and narrows the inferred variable type accordingly if needed. Example:

def s = ...
if ("abc".contains(s)) ...
// now IDEA knows that s is a String

Unfortunately, to resolve a method one needs to know the argument types. Moreover, after resolving it one also has to map arguments to parameters, which is quite non-trivial in Groovy given all those default parameter values and named arguments. One has to choose the correct overload. If a method is generic, one argument's type may be inferred from another's. And right now we are in the middle of calculating and refining the type of one of those arguments. Another cyclic dependency, and quite a nasty one.
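
To illustrate what the resolver is up against, here's a quick reminder of two of those Groovy features (the names below are made up):

String greet(String name, String greeting = 'Hello') {   // default parameter value
  "$greeting, $name"
}
assert greet('Anna') == 'Hello, Anna'
assert greet('Anna', 'Hi') == 'Hi, Anna'

class Point {
  int x, y
}
assert new Point(x: 1, y: 2).y == 2   // named arguments via the map constructor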

The best solution I have come up with so far is to restrict the feature: only allow the cases where we can unambiguously resolve the method and map arguments to parameters without knowing the argument types. This rules out methods with overloads, default arguments or generic parameters. Most of the methods I've ever encountered don't have any of those anyway.
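
An illustrative sketch of that restriction, with made-up method names: check has a single non-generic, non-overloaded signature, so the argument can be mapped to the parameter without knowing its type; log is overloaded, so no narrowing would happen from its call.

Object loadValue() { 'abc' }    // statically, the result is just an Object

void check(String s) { }
void log(String s) { }
void log(Object o) { }

def x = loadValue()
check(x)   // after this call, x can be assumed to be a String
log(x)     // picking the overload would itself require x's type, so no narrowing here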

That's how I spent about 7 hours today. With all those changes (mostly the last one) it now takes 3 seconds to highlight a file instead of 20. Still not ideal, but bearable. And I can finally relax, learn some new things about Model Thinking, and sleep.

UPD. Continued.

Monday, August 27, 2012

Suddenly filler-gap dependencies

Theoreticians say that in sentences like I know who you saw, the deep structure is in fact I know [you saw who], and in the surface structure there's an invisible gap after saw, which is filled by who. Many psycholinguistic studies seem to confirm this: upon seeing who, people start waiting for the right place for it and only settle down after finding it.

Due to Russian's free word order, I've had the luxury of ignoring this complication for a while and treating wh-words as normal verbal arguments, like pronouns. But then came two surprises.

One surprise was that implementing filler-gap dependencies turned out to be the easiest way to resolve a nasty ambiguity. Russian has a word что which can be either a complementizer (я знаю, что ты видел его; I know that you saw him) or a wh-word (я знаю, что ты видел; I know what you saw). The first is higher in the structure than the verb, the second is lower. This made my parser suffer; it still doesn't handle visibility ambiguities very well. Now что is no longer a verbal argument directly: it's a filler, and it also sits higher in the hierarchy, just as in many syntactic theories.

Another surprise was that all this was actually very easy to add to the current parser architecture (given that there's no pied-piping yet). The filler is just a special construction which listens to what the incoming words contribute. If a contribution looks like a suitable head for the filler's grammatical functions, the contribution is enriched accordingly.

Example: the Russian wh-word что can be in the nominative or accusative case. For a normal noun that would mean generating nom and acc construction mites with the noun attribute defined, pointing to a frame with a special wh semantic type. In the filler-gap approach it instead generates a filler construction, which then sits and waits until it sees a contribution with nom or acc mites that have the head attribute defined. E.g. saw, as a verb, can be a head to both nominative and accusative arguments. The filler construction then adds a nom/acc mite having both the head and noun attributes, where noun points to a frame with the wh type and head comes from the verb.
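
A purely illustrative sketch of that behaviour (Mite, WhFiller and the attribute names here are made-up stand-ins, not the parser's real API): the filler watches incoming contributions and reacts to nom/acc mites that already have a head.

class Mite {
  String cxt        // construction name, e.g. 'nom' or 'acc'
  Map attrs = [:]   // e.g. [head: verbFrame] or [head: verbFrame, noun: whFrame]
}

class WhFiller {
  def whFrame = [type: 'wh']           // the frame for the wh-word itself
  Set<String> cases = ['nom', 'acc']   // что is ambiguous between the two

  // called for each incoming word's contribution; returns the enriched mites
  List<Mite> react(List<Mite> contribution) {
    contribution.findAll { it.cxt in cases && it.attrs['head'] != null }
                .collect { new Mite(cxt: it.cxt, attrs: [head: it.attrs['head'], noun: whFrame]) }
  }
}

// e.g. a verb like 'saw' contributes nom and acc mites with its own frame as head
def saw = [new Mite(cxt: 'nom', attrs: [head: 'sawFrame']),
           new Mite(cxt: 'acc', attrs: [head: 'sawFrame'])]
assert new WhFiller().react(saw).size() == 2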

So in this respect my parser now works quite similarly to human sentence processing: a wh-word creates an active filler, which finds its gap when a verb with suitable argument requirements comes along.