Spreadsheet Dependency graph


In spreadsheets such as Excel, we can write simple formulas like the following:

 

[Image 1: Spreadsheet Dependency graph]

 

Here C4 = A3 + A5, so if A3 = 5 and A5 = 7, then C4 = 12. If we change A3 to 6 and press Enter, C4 automatically changes to 13. So behind the scenes there is a dependency graph among the cells. We can display the graph explicitly via the menu Tools | Formula Auditing | Trace Precedents.

 

To go a little further, suppose another cell, E6, depends on C4 and C8, where C4 in turn depends on A3 and A5. If we change the value of C8, then E6 is recalculated, but C4 is not. In other words, automatic recalculation happens only when necessary (the intermediate values are cached somewhere).
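As a rough illustration of the recalculation behavior described above, here is a minimal Java sketch of cells with dirty flags. The Cell class, its evalCount counter, and the formula E6 = C4 + C8 are assumptions made for the example; the real spreadsheet only tells us that E6 depends on C4 and C8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

// Hypothetical cell: either a plain input value or a formula over other cells.
class Cell {
    private final IntSupplier formula;                 // null for input cells
    private final List<Cell> dependents = new ArrayList<>();
    private int value;
    private boolean dirty;
    int evalCount = 0;                                 // how often the formula actually ran

    // Input cell holding a constant.
    Cell(int value) {
        this.formula = null;
        this.value = value;
        this.dirty = false;
    }

    // Formula cell; 'precedents' are the cells the formula reads.
    Cell(IntSupplier formula, Cell... precedents) {
        this.formula = formula;
        this.dirty = true;
        for (Cell p : precedents) {
            p.dependents.add(this);
        }
    }

    // Changing an input marks everything downstream as dirty.
    void set(int newValue) {
        value = newValue;
        markDependentsDirty();
    }

    private void markDependentsDirty() {
        for (Cell d : dependents) {
            d.dirty = true;
            d.markDependentsDirty();
        }
    }

    // Recompute only if something upstream changed since the last read.
    int get() {
        if (formula != null && dirty) {
            value = formula.getAsInt();
            evalCount++;
            dirty = false;
        }
        return value;
    }
}
```

With a3 = 5, a5 = 7, c4 = a3 + a5 and e6 = c4 + c8, changing c8 and reading e6 again re-runs only the formula of e6; the cached value of c4 is reused.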

 

[Image 2: Spreadsheet Dependency graph]

 

Now I want to replicate this logic in Java (or C#, Python, etc.). I have been thinking about this for the last couple of years and am still hunting for a good solution. The motivation is that I need to run hundreds of millions of complicated calculations every day on thousands of machines, and about 20% to 30% of those calculations are duplicates.

 

It is not difficult to replicate the spreadsheet logic itself, because spreadsheet formulas are fairly limited. Doing the same for a general-purpose programming language like Java, C#, or Python is more complex because of their much richer grammars.

 

One way to do this is the following:

1. Add a setter and getter for each relevant field, then build the dependency graph inside the getters (making sure the getter is called everywhere the field is read). Maintain cached values and dirty flags in the setters.

2. Cache the return values of relevant methods, and build the dependency graph between methods and variables. The dirty flags need to be maintained here as well.
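To give a feel for the coding overhead of this manual approach, here is a hand-wired sketch for a single cached method. The class ManualA and its runs counter are made up for the example, and the dependency of calc1() on i and j is hard-coded in the setters rather than discovered in the getters, which keeps the sketch short.

```java
// Hypothetical hand-written version: every field needs a getter and a setter,
// and every cached method needs its own cache slot and dirty flag.
class ManualA {
    private int i = 5;
    private int j = 7;

    private int cachedCalc1;            // cached result of calc1()
    private boolean calc1Dirty = true;  // true until calc1() has run once
    int runs = 0;                       // counts real evaluations, for demonstration

    int getI() { return i; }
    int getJ() { return j; }

    // Each setter must know which cached results depend on its field.
    void setI(int v) { i = v; calc1Dirty = true; }
    void setJ(int v) { j = v; calc1Dirty = true; }

    int calc1() {
        if (calc1Dirty) {
            runs++;
            cachedCalc1 = getI() + getJ();
            calc1Dirty = false;
        }
        return cachedCalc1;
    }
}
```

Even for two fields and one method, most of the class is bookkeeping, and forgetting one dirty flag in one setter silently produces stale results.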

 

A similar approach is used in the Glazed Lists project: http://publicobject.com/glazedlists/.

 

However, I am not satisfied with this approach: the coding overhead is far too high. I want a simpler solution. I have now found a better, though not complete, solution using annotations and AspectJ. Since the coding is pretty simple, I am going to skip it. Instead, I'll show the behavior and the design logic.

 

Now let's try to mimic the spreadsheet logic. Suppose we have a simple class:

 

 

public class A
{
    @Depend int i=5;
    @Depend int j=7;

    @Depend public int calc1()
    {
        System.out.println("run calc1() ...");
        return i + j;
    }
}

 

Here we want to monitor the fields i and j and the method calc1(). My test case looks something like this:

 

 

public void testCalc1()
{
    A a = new A();
    a.calc1();
    a.calc1();
}

 

There should be only one printout from the method (meaning it is calculated only once).

 

I like this design because it demands minimal effort from developers: just one extra annotation. This is the most attractive feature I can dream of. I can't think of any simpler solution (if you can, let me know).
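The declaration of @Depend itself is not shown in this post. One plausible declaration, assuming runtime retention and field/method targets (both are guesses, not the actual code), could look like the sketch below. The DependScanner helper is purely illustrative and only shows that the marker can be discovered on members.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Assumed shape of the marker annotation: no attributes, usable on fields and methods.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.METHOD})
@interface Depend {}

// Same shape as the class in the post, renamed to avoid confusion with it.
class AnnotatedA {
    @Depend int i = 5;
    @Depend int j = 7;

    @Depend public int calc1() {
        return i + j;
    }
}

// Hypothetical helper: counts the members of a class that carry @Depend.
class DependScanner {
    static int countTracked(Class<?> type) {
        int n = 0;
        for (Field f : type.getDeclaredFields()) {
            if (f.isAnnotationPresent(Depend.class)) n++;
        }
        for (Method m : type.getDeclaredMethods()) {
            if (m.isAnnotationPresent(Depend.class)) n++;
        }
        return n;
    }
}
```

On its own the annotation does nothing; it only marks the members that the interception layer should watch.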

 

I tried to use annotations alone, without AspectJ, but it didn't work out. The main reason is that we need object identifiers, not just class identifiers. For example, if we have two instances a and b of the same class A, then a.i and b.i are different variables, so when we build the dependency graph we have to create separate nodes for them.

I use the object's toString() method as its identifier (so don't override it :-)). This works only within a single JVM.
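An alternative to toString() that does not depend on what the class overrides is to key graph nodes on reference identity. This is a different technique from the one described above, sketched here under the assumption that a node is identified by its owning object plus a member name; NodeKey is a made-up class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical graph-node key: (owning object, member name).
// The owner is compared by reference, so a.i and b.i get distinct nodes
// even when a and b are equal according to equals() or toString().
final class NodeKey {
    private final Object owner;
    private final String member;

    NodeKey(Object owner, String member) {
        this.owner = owner;
        this.member = member;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NodeKey)) return false;
        NodeKey k = (NodeKey) o;
        return owner == k.owner && member.equals(k.member);
    }

    @Override
    public int hashCode() {
        // identityHashCode ignores any hashCode() override on the owner.
        return 31 * System.identityHashCode(owner) + member.hashCode();
    }
}
```

Like the toString() approach, this only works within a single JVM, and the key holds a strong reference to the owner, so long-lived graphs would need some cleanup strategy.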

 

Using AspectJ's field-get pointcut, we can intercept all reads of annotated variables and thus build the dependency graph between the relevant variables and methods.

 

Using AspectJ's field-set pointcut, we can intercept all variable assignments and set the dirty flags for all parents in the dependency tree. There is one exception: the field-set pointcut cannot intercept array element assignments such as x[3] = 10. For more information, check the AspectJ documentation.

 

Once we have the dependency graph, we can add another aspect to intercept calls to annotated methods. If the cached value is null or the dirty flag is true, we run the method and save the result; otherwise we return the cached value.
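The interplay of the three interceptions can be simulated in plain Java by calling the hooks by hand. This is not AspectJ code and not the actual implementation: Tracker, onGet, onSet and onCall are hypothetical stand-ins for the field-get, field-set and method-call advice, and nodes are keyed by simple names rather than per-object identifiers to keep the sketch short.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

class Tracker {
    // method name -> cached result (absent means never computed, or dirty)
    private final Map<String, Object> cache = new HashMap<>();
    // node name -> methods that read it directly
    private final Map<String, Set<String>> readers = new HashMap<>();
    // methods currently executing, innermost on top
    private final Deque<String> running = new ArrayDeque<>();

    // Stand-in for field-get advice: the running method depends on this node.
    void onGet(String node) {
        if (!running.isEmpty()) {
            readers.computeIfAbsent(node, k -> new HashSet<>()).add(running.peek());
        }
    }

    // Stand-in for field-set advice: invalidate all transitive readers.
    void onSet(String node) {
        Set<String> direct = readers.remove(node);
        if (direct == null) return;
        for (String method : direct) {
            cache.remove(method);
            onSet(method);   // methods that called this method are dirty too
        }
    }

    // Stand-in for method-call advice: run the body only on a cache miss.
    @SuppressWarnings("unchecked")
    <T> T onCall(String method, Supplier<T> body) {
        onGet(method);       // a calling method, if any, depends on this one
        if (!cache.containsKey(method)) {
            running.push(method);
            try {
                cache.put(method, body.get());
            } finally {
                running.pop();
            }
        }
        return (T) cache.get(method);
    }
}

// The class from the post, with the hooks written out where weaving would put them.
class TrackedA {
    final Tracker t = new Tracker();
    private int i = 5;
    private int j = 7;
    int runs = 0;            // counts real evaluations of calc1()

    int getI() { t.onGet("i"); return i; }
    int getJ() { t.onGet("j"); return j; }
    void setI(int v) { i = v; t.onSet("i"); }

    int calc1() {
        return t.<Integer>onCall("calc1", () -> {
            runs++;
            return getI() + getJ();
        });
    }
}
```

Calling calc1() twice runs the body once; after setI(6) the cached entry is dropped, the next call recomputes it, and the edges from i and j to calc1 are recorded again during that run.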

 

While this works well for new classes, it would be better if we could handle Collection classes as well. So we add another aspect that intercepts all content-changing methods of Collection/Map classes, such as add/put/set (only when the field is annotated).
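Without AspectJ, the same effect for a list can be approximated by a wrapper that reports every mutation. This swaps the aspect for an explicit wrapper class, so it is only an analogy; NotifyingList and its onChange callback are made up for the example, and the callback is where a dirty flag would be set.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

// Hypothetical list wrapper: reads pass through, mutations trigger a callback.
class NotifyingList<E> extends AbstractList<E> {
    private final List<E> delegate = new ArrayList<>();
    private final Runnable onChange;   // e.g. marks dependent methods dirty

    NotifyingList(Runnable onChange) {
        this.onChange = onChange;
    }

    @Override
    public E get(int index) {
        return delegate.get(index);
    }

    @Override
    public int size() {
        return delegate.size();
    }

    // AbstractList routes add(E) through add(int, E), so one override covers both.
    @Override
    public void add(int index, E element) {
        delegate.add(index, element);
        onChange.run();
    }

    @Override
    public E set(int index, E element) {
        E old = delegate.set(index, element);
        onChange.run();
        return old;
    }

    @Override
    public E remove(int index) {
        E old = delegate.remove(index);
        onChange.run();
        return old;
    }
}
```

The drawback compared with an aspect is that callers must construct the wrapper explicitly, which is exactly the kind of extra coding the annotation approach tries to avoid.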

 

There are several implicit design decisions:

1. Since we use AspectJ field-get interception, we intercept every read of an annotated variable. Although AspectJ is pretty fast, the overhead is sometimes still unacceptable. In that case, e.g., when a variable is read many times, we can copy it into a new local variable. As long as the new variable is not annotated, the interception is not triggered.

2. As mentioned above, we can't intercept array element assignments; this is a limitation of AspectJ. But we do implement the aspect for Collections/Maps, so use Collections/Maps instead of arrays.

3. Since we rely on AspectJ interception, we ignore control flow, i.e., if-else branches and loops. This tradeoff can be worked around: if the control flow is complex, we should break the method into several sub-methods to reduce the complexity; otherwise, the harm is small enough to ignore.

4. Notice that the annotated method in the example above takes no parameters. All dependencies are class-level fields. Technically, we could take parameters into account, but I feel this would do more harm than good. Of course, my current approach makes the class stateful (not thread-safe).
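The local-variable workaround from point 1 can be illustrated with a simulated interception counter. HotLoop and readFactor() are hypothetical; readFactor() stands in for a woven field read, so the counter shows how many interceptions each variant would trigger.

```java
class HotLoop {
    private int factor = 3;      // imagine this field carries @Depend
    int interceptions = 0;       // counts simulated field-get interceptions

    // Stand-in for a field read that goes through the field-get advice.
    private int readFactor() {
        interceptions++;
        return factor;
    }

    // Reads the tracked field on every iteration: n interceptions.
    int sumDirect(int n) {
        int total = 0;
        for (int k = 0; k < n; k++) {
            total += readFactor() * k;
        }
        return total;
    }

    // Copies the field into an untracked local first: one interception.
    int sumViaLocal(int n) {
        int f = readFactor();
        int total = 0;
        for (int k = 0; k < n; k++) {
            total += f * k;
        }
        return total;
    }
}
```

Both variants return the same result, and the single read in sumViaLocal() is still enough to record the dependency on the field.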

 

If we google the phrase "program dependency graph", we find tons of references using different approaches. One is to build the dependency graph from source code. Although JDK 6 has compiler APIs, it is still a tough job. A lot of research papers use the abc compiler: http://abc.comlab.ox.ac.uk/introduction.

 

 

Another interesting (and useful) feature we can extend from spreadsheets is the following. In my environment, I need to do a lot of computing in a certain pattern: given a set of inputs, compute a value; then tweak a field in the input and compute again; then tweak it back, tweak another field, and so on. Sometimes we forget to tweak the field back before moving on to the next evaluation, and disasters happen. For example,

class A
{
    @Depend int i = 5;
    @Depend int j = 11;

    @Depend public int prod()
    {
        return i * 18 + j;
    }

    @Tweak @Depend public int mytweak()
    {
        i = 6;
        j = 29;
        return prod();
    }
}

 

In the method prod() we want i and j to have the values 5 and 11 (as initially assigned). In the method mytweak() we change the values of i and j, and we want this change to be local to mytweak() and not spill over into prod().

 

To be precise, here is a test case:

public void testPrimitiveTweak()
{
    A a = new A();
    int r = a.mytweak();
    assertTrue(r == 137);

    r = a.prod();
    assertTrue(r == 101);
}

So whatever change we make in mytweak() does not show up in prod(); i.e., the changes are "erased" and the original values are restored.
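The restore behavior can be sketched in plain Java as a save-and-restore wrapper around the tweaking method. This is only a guess at what a @Tweak interception might do; TweakA and withRestore() are made up, and a real aspect would have to discover the modified fields instead of hard-coding i and j.

```java
import java.util.function.IntSupplier;

class TweakA {
    int i = 5;
    int j = 11;

    int prod() {
        return i * 18 + j;
    }

    // Hypothetical stand-in for the @Tweak advice: snapshot the fields,
    // run the body, and restore the snapshot even if the body throws.
    private int withRestore(IntSupplier body) {
        int savedI = i;
        int savedJ = j;
        try {
            return body.getAsInt();
        } finally {
            i = savedI;
            j = savedJ;
        }
    }

    int mytweak() {
        return withRestore(() -> {
            i = 6;
            j = 29;
            return prod();   // 6 * 18 + 29 = 137
        });
    }
}
```

After mytweak() returns 137, a call to prod() sees i = 5 and j = 11 again and returns 101, matching the test case above. If prod() were cached as well, the restore step would also have to mark it dirty.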

 
