Saturday, September 21, 2013

Data flow analysis

I do many things in IntelliJ IDEA. One of the most interesting of them is data flow inspection, also known to the world under a somewhat strange name Constant Conditions & Exceptions. It analyzes your Java code, checks how you assign variables and what you write in if conditions about them, and issues warnings like Condition is always true or This method call can produce NullPointerException. The analysis it performs is quite advanced. Sometimes, it requires some thinking to understand that a warning is actually correct. Users even complain it's too intelligent (e.g. 1, 2, 3)!

I didn't maintain this inspection code from the very beginning, only in the last several years. During this time, I did several improvements, not counting numerous bug fixes. I like this inspection. I like even its bugs, because it's fun fixing them. So if you see a bug, please report it, I'll be grateful for this new fun!

Closures

I've supported anonymous classes, whose analysis should be aware of what happens outside the class. For example, o cannot be null inside Runnable and no warnings should be reported:

o = getSomeObject();
if (o != null) {
    new Runnable() {
      public void run() {
        o.doSomething();
      }
    };
}

Repeated invocations

The inspection used to warn on the following code:

@Nullable File getFile() {...}
...
if (getFile() != null) return getFile().getName();

It's perfectly correct, but quite inconvenient. Correct because getFile is marked as @Nullable, it's a method and its result may change between invocations. Even if it was a mutable field named file, it could be reassigned from another thread. But it's an inconvenient warning because most of the code out there isn't subject to these conditions. And I don't feel like making users change their perfectly valid code by extracting a variable out of getFile() just to make a tool happy.

So now IDEA produces no possible NPE warning on such calls. What's complicated about this fix is that it also doesn't presuppose getFile() being not-null. So if you write code targeted to mutable environment, and check the null-ness of the file again in case it's changed, this second if (getFile() != null) check is not reported as an always true condition. What happens inside is that after any method call (getFile in this case) IDEA marks this expression as having an unknown value, and doesn't track any possible constant conditions involving it.

Tracking other objects

I've also supported data flow on final fields of other instances, not only this. And on non-final fields. And on getter method calls on other objects. And on expression chains comprised of all of the above. The main catch is to wisely forget the facts you know about all those expressions when it's time (e.g. on qualifier reassignment).

Contracts

More recently, I've added declarative method contract support to the inspection. It was quite easy. What was hard, was not to generate any excessive warnings because of a contract. If you call a method transform(o) where transform has a contract of null->null, then IDEA analyzes the method call as o == null ? null : transform(o). And whenever IDEA sees that you compare a variable to null, it unsurprisingly starts suspecting that it might actually be null! So here's the discrepancy: with such a contract, IDEA actually sees the comparison with null, but the user doesn't. It's just a normal method call, but its argument mysteriously is considered nullable after the call. As I've said, it was hard to do and took me more than a week of thinking, but I did it finally!

It remains to add checking that the contract corresponds to what the method is actually doing, and to infer contracts automatically. But it's quite non-trivial, and I'm still trying to understand how it could be implemented.

Value is always constant

if (var==null) { ... someMethod(var) ... }

Quite probably, passing to a method a variable which is known to be null, is a bug and the user intended something else. Even if it's not, writing null instead of var may make the code clearer. So the inspection reports such uses of variables with values known to be null, true or false and suggests to replace them with constants.

Too complex?

Finally, the thing I did yesterday and I'm really proud of. There's a problem: IDEA marks some methods as being too complex to analyze. This warning irritates people and not everyone understands it means you don't get any NPE warnings inside that method. The warning appears in two cases:
  • when it just takes too long to analyze the method
  • when the number of possible variable states is too large
Having a profiler, the first cause is relatively easy to fix (and is actually quite rare). The second one is harder.

So, there's too many data flow states. Each state is basically a set of assertions about variables and their values. IDEA tracks these states for each point in your code, and each point may correspond to many states. For example, a nullable variable is represented by two states: in one of them it's null, in the other it's not-null. If you call a method on this variable, both states are considered, and if in any of them the variable is null (which is true) the NPE warning is issued.

For simplicity let's assume that all assertions are either in form variable==value or variable!=value.

So, how to get many-many different states? It's very easy. You may have, say, 10 variables. Just compare each of them with some constant value, and here you are:

//1 state here
if (var1 == 0) { print("var0"); }
//2 states here: var1==0 and var1!=0
if (var2 == 0) { ... }
//4 states here:
//var1==0 && var2==0, var1==0 && var2!=0
//var1!=0 && var2==0, var1!=0 && var2!=0
...
if (var10 == 0) {...}
//1024 states here, too complex :(

But it need not be so sad! It's not too complex for people to understand, why should it be so for the IDE? The 2 states after the first if statement actually are complementary, they only differ in their assertions about var1. In fact, they can safely be replaced with just one state with no assertions about var1.

And that's the basic idea. At each point where different control flow branches join, all the states are gathered and checked if they can be merged. This has reduced the number of states dramatically, at least for the samples of "too complex" methods in our code base that I've looked at. Hence, many methods are not too complex anymore, users are not annoyed and get useful highlighting instead. Profit!

One issue is that determining which states can be merged is quite computationally expensive at the moment, and it doesn't merge all the states it could. But I'm working on it.