Kirk Pepperdine's attendence of AMD's performance talk at JavaOne produced a cascade of fascinating memories about Cray optimizations. Here, Kirk relates some of the most interesting optimizations that helped make Cray's superfast - and how that relates to your Java programs.Published July 2007, Author Kirk Pepperdine
Traditionally JavaONE has offered more performance related talks than any other Java conference. This year was no exception, with so many performance related talks it was impossible to attend all of them. One of the more interesting performance related sessions was put on by AMD's Azeem Jiva. The timeless theme of the talk was: make sure your programs are good to your hardware and your hardware will be good to you. I say timeless because as I watched Azeem stroll through the demos, my mind was deluged with memories of my days programming on Cray supercomptuers.
The Cray series of super-computers were an engineering marvel when they were in their prime. The brilliance in the machine architecture wasn't only about speed. The scalar processors were not much faster than what would be found on any other server. Cray's brilliance was about the balance within the machine. As they saw it, there is no point in having a superfast CPU if it was only going to be starved for work. So, much of the extreme engineering that took place was in making sure that the CPUs were never hung on wait conditions.
One of my long time recomendations for Windows users to eliminate virtual memory from their machines (don't do this unless you've got plenty of real RAM) is based on the lack of virtual memory on Cray systems. In a time when memory was both in short supply and expensive, Cray recognized that getting data from a disk created huge wait conditions. So they eliminated virtual memory. To help with the I/O they introduced the use of solid state memory devices and multiple separate channels to move data from one place to another.
Most of these optimizations were performed under the hood and, aside from a few rule of thumbs such as don't do I/O and process in the same loop, one's coding style had little effect on performance. That said, there were other optimizations that could be obliterated if the developer ignored or didn't understand how the underlying hardware was architected and functioned. Out of the many optimizations that a developers coding style had the direct ability to affect, I'd like to mention three. These were: instruction buffer faults; striding through memory; and the ability to utilize the vector processors.
Though rare at the time, some form of the technologies found in Cray's vector processors are now commonplace in modern day processors. For example, pipelining intermediate results through various stages of computation so that the processor can work on multiple pieces of data at the same time is quite common. Things like path prediction are much more advanced now then they were at when I programmed Crays. Back in the late 80s, early 90s, it was fairly easy (and it still is) to obfuscate what you may want to do next. In the worst case, Cray would run your code in scalar mode instead of being able to utilize the much faster and more effecient vector processors. The most common way to obfuscate was to put branch statements in a for
loop (vector processors worked best with large for
loops). In order to get code to vectorize, one would often separate the data based on the condition in the branch prior to entering the processing loop. Each dataset would then be run through it's own separate loop with the branch removed.
Cray's instruction buffer was big enough to hold 40 instructions. The system would load the next 40 instructions to be executed and when they were exhausted it would load the next 40. It did have the ability to do a predictive pre-fetch but in general, fetching the next set of instructions would most likely be a hold condition (CPU goes hungry). This is yet another case where a developer's coding style could have adverse effects on performance. Of course code that randomly jumped to instructions not in the buffer would have the biggest impact on performance, but there were more subtle conditions than that. Again loops become important. Loops that were larger than 40 instructions and those that spilled over an instruction buffer boundary would result in some (sometimes significant) performance degradation. The obvious solution for the former problem was to write very small tight loops even if that meant looping twice over the same dataset. Crays were very well tuned for doing this, so quite often several single passes worked much better than a single "do all" pass over the dataset.
In retrospect the latter problem should have been handled automatically by having the optimizer align loops on instruction buffer boundries. Cray's solution at the time was to introduce a pragma statement. The pragma told the compiler/linker to align the code following the statement on an instruction buffer boundry. The programmers role in all of this, other than recognizing where to put the pragma's, is to ensure that loops do not span more than 40 instructions. Done right, a couple of short loops will outperform a single loop that does everything.
The most interesting optimization was to support the feeding of the vector processor from main memory. The vector processor was capable of both accepting a single piece of data and returning a result all in the same clock tick. The electronic reality is that memory, once strobed to be read, requires some time before it can be read again. Cray was always careful to make sure that the bank cool off time was 4 clock cycles. They were also careful to arrange memory into 4 different banks and, rather than have contiguous memory in the same bank, adjacent memory locations will arranged in different memory banks. The consequence of this design is that one bank of memory will always be ready to be read.
The developers responsibility in this case is to ensure that any strides through memory hit the cold bank on every clock tick. To do this, you may have to adjust the data structures being used. Again coding style counts. So by now you may be asking, what does all of this have to do with Java? The answer is: more that one would think there to be.
Right now you may be wondering why on earth anyone would be interested in hardware level counters when they are looking at Java. After all, Java runs in an abstraction commonly known as the Java Virtual Machine which places some distance between our code and the hardware. Aside from taking care with our choice of algorithm, what could you possibly do aside from implementing some dangerous premature optimizations that would affect how our code utilizes the underlying hardware? Surpisingly there are some easy changes you can make to your coding style that should help you to better utilize your hardware. More surpisingly, these style optimizations have been with us for longer than Java has.
The style optimization pointed to by Azeem was in respect to striding through a doubly indexed array.
The example presented looked something like
public void transform( int[][] matrix) { int j = 0; int k = 0; for ( ; k < matrix[ j].length; k++) { for ( ; j < matrix.length; j++) { do stuff } }
According to the JLS, arrays are evaluated from left to right. So we can write int[3][] matrix and follow that up with matrix[0] = new int[ 3];. This implies that it is the right most index that will point to a single dimensional array whose elements will be held in a contiguous block of memory. So the above code "jumps" through memory creating a situation that thrashes the CPU's onboard cache. Of course the fix is to reverse the for loops so that the code is running through memory in a more predictable manner. Now this example is a toy so the problem is quite obvious. The question is: do you have some obfsucated code lurking in your application that is doing the same thing?
Another important feature the analyzer was able to detect was lock contention. Lock contention can have some pretty devastating effects on your application's ability to perform. Aside from starving threads from obtaining the CPU, lock contention puts pressure on the operating system. Even more interesting was watching Azeem using the Analyzer to point out how disruptive it was to the processor as well. What I got from this demo is that just as the Cray processors worked best when our code worked in a predictable manner, so too do our modern processors. And, there is nothing quite as disruptive to a processor as having to execute code to acquire a lock. This isn't to say that we shouldn't when we need to, but it does suggest that something that I've known to be true in the past is still true today even in Java, namely that your coding style can have positive effects on performance.
Azeem's talk at JavaOne was TS-9363, "Java Platform Performance on Multicore: Better Performance or Bigger Headache?" Related tools are Intel's VTune and AMD's Analyzer.