Between SISD and SIMD: a bridge not too far

GPGPU, since nVidia brought it to the mainstream in 2007, has been continuously changing the paradigm of software/hardware computing. It has already proven to be a very competitive candidate in supercomputing compared to conventional CPU processors. Why? Because of its excellent cost/performance ratio: invest your precious $1K in a GPGPU card and you will likely get the performance of a $3K-4K high-end CPU-only machine. Good bargain.

So parallelism is the way to go. The more parallel, the better. AMD and nVidia are having a rocket performance race. The world is changing. But before we praise the new hardware god, there is a big problem -- software is still having a hard time shifting to parallel. The algorithms and optimizations we use today were mostly crafted and tuned for single-threaded environments. To exploit this new parallel computing environment, people need to either: 1. develop new algorithms for the new machines, or 2. port the old algorithms over.

Okay, enough chit-chat about the parallelism revolution. Today let's take a look at an interesting paper that deals with the second case. Oh, by the way, it is one of the two winners of the CGO 2013 best paper award:

Bin Ren, Gagan Agrawal, Jim Larus, Todd Mytkowicz, Tomi Poutanen (Microsoft), and Wolfram Schulte for “SIMD Parallelization of Applications that Traverse Irregular Data Structures”

www.cse.ohio-state.edu/~ren/papers/CGO13_BinRen.pdf



To those who are not familiar with them: vector instructions issue the same scalar operation on a whole group of data elements (a vector) in parallel. Suppose we want to add array A to array B and store the results into array C. Usually we write:

for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];

Without optimization, each element addition will roughly be translated into:

__asm {
    MOV EAX, [header_A + i * 4];    // load a[i]
    ADD EAX, [header_B + i * 4];    // add b[i]
    MOV [header_C + i * 4], EAX;    // store the sum into c[i]
}



A vectorizing compiler will first unroll it into:

for (int i = 0; i < n; i += 4)   // assuming n is a multiple of 4 for simplicity
{
    c[i+0] = a[i+0] + b[i+0];
    c[i+1] = a[i+1] + b[i+1];
    c[i+2] = a[i+2] + b[i+2];
    c[i+3] = a[i+3] + b[i+3];
}

This eliminates 3/4 of the branches in the loop. The compiler then transforms it into the following (or, often, idiom recognition does both steps at once):

for (int i = 0; i < n/4; ++i)
__asm {
    MOVUPS XMM0, [header_A + i * 16];  // load 128 bits (4 floats) of A
    MOVUPS XMM1, [header_B + i * 16];  // load 128 bits of B
    ADDPS  XMM0, XMM1;                 // add 4 elements at once
    MOVUPS [header_C + i * 16], XMM0;  // store 128 bits into C
}



These instructions have already been exploited extensively at a low level -- in compiler loop vectorization (GCC supports a whole bunch of them; see reference * below), video compression, graphics rendering, signal processing, and so on. Unlike conventional scalar instructions, vector instructions only work with dedicated vector registers (XMM0 and XMM1 in the example above), and they require the vector operands to be contiguous (MOVUPS loads 128 bits of data at a time). To exploit SIMD instructions, programmers have to make an application's data layout contiguous.
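
For completeness, here is roughly how a programmer would write the same loop by hand with SSE intrinsics instead of inline assembly -- a minimal sketch of my own (the function name is made up, and it assumes n is a multiple of 4; a scalar tail loop would handle any remainder):

#include <xmmintrin.h>   // SSE intrinsics

// Same idea as the assembly above, expressed through compiler intrinsics.
void add_arrays(const float* a, const float* b, float* c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // load 4 floats of A
        __m128 vb = _mm_loadu_ps(b + i);            // load 4 floats of B
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   // add and store 4 results
    }
}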

But what about data structures with pointers? An abstract data structure such as a linked list usually has no data contiguity, so it is not an obvious candidate for SIMD acceleration. That is exactly the problem this paper tackles -- vectorizing the execution of operations on pointer-based, irregular data structures.

A vector operation requires not only a conforming, contiguous data layout, but also the absence of control-flow branches. Consider this code:

for (int i = 0; i < n; ++i)
{
    if (a[i] > b[i])  c[i] = a[i] - b[i];
    else              c[i] = a[i];        // i.e. a[i] + 0
}

A branch in the loop body tells the compiler the answer is NO, this loop cannot be vectorized (but, heh, actually we can still do it manually). What should a single vector instruction do if a[0] > b[0] but a[1] <= b[1]? A vector instruction is an atomic operation, and within a single thread there is no way to execute two different paths in one instruction. Too bad -- it seems there is no way to generate SIMD instructions for this loop.



Let's see how nVidia's CUDA deals with this branching problem. The programming paradigm of nVidia's GPGPU, SIMT (single instruction, multiple threads), is a slightly different flavor of SIMD in that it is scalable. Here is how the branching array-addition example works there: for an array of length n, the GPU launches n concurrent threads, and each thread is given its own unique id ranging from 0 to n-1. With that id, a thread can compute the addresses of its own a[id], b[id] and c[id]; each thread is responsible for a single element, and the n threads accomplish the task collectively.
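
A minimal CUDA sketch of that indexing scheme (the kernel name, element type and launch configuration are mine, not from the paper):

// Each thread handles exactly one element; its unique id selects the addresses.
__global__ void add_with_branch(const int* a, const int* b, int* c, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        if (a[id] > b[id]) c[id] = a[id] - b[id];   // threads that take different
        else               c[id] = a[id];           // paths here get serialized
    }
}

// launched as, e.g.: add_with_branch<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);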

Looks amazing, but the GPGPU hardware is still a SIMD machine. Given a group of threads, when a branch is encountered the group's execution is serialized: first every thread evaluates a[id] > b[id]; then the threads whose condition is true execute their path while the rest idle; then the roles flip. GPGPUs are designed this way for performance reasons -- a SIMD architecture means fewer transistors per lane and more parallelism.

So, in short, SIMD hates control-flow dependences. The solution stated in the paper: turn control dependences into data dependences. Looking back at the last example, the difficulty is that the two branch paths compute different things: one path computes c[i] = a[i] - b[i], the other computes c[i] = a[i]. If we want to vectorize the loop, we have to unify the two paths into a single expression, such as:

c[i] = a[i] + b[i] * ((a[i] > b[i]) ? -1 : 0)

And the branching part, (a[i] > b[i]) ? -1 : 0, can be evaluated with a vector compare (_mm_cmpgt_epi32 produces exactly such a -1/0 mask in each lane). Voila -- the branch is gone, and we can go on to generate a SIMD version of the loop!
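
Here is a hand-rolled sketch of that branch-free loop using SSE2 intrinsics -- my own toy code, for 32-bit integer arrays; it uses an AND with the compare mask instead of a multiply, and again assumes n is a multiple of 4:

#include <emmintrin.h>   // SSE2 intrinsics

// Branch-free version of:  c[i] = (a[i] > b[i]) ? a[i] - b[i] : a[i];
// _mm_cmpgt_epi32 yields an all-ones mask (-1) in each lane where a > b and 0
// elsewhere, so "a - (b AND mask)" is a - b in the taken lanes and plain a otherwise.
void sub_if_greater(const int* a, const int* b, int* c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i va   = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb   = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i mask = _mm_cmpgt_epi32(va, vb);                 // -1 where a > b
        __m128i vc   = _mm_sub_epi32(va, _mm_and_si128(vb, mask));
        _mm_storeu_si128((__m128i*)(c + i), vc);
    }
}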

Of course I made up this silly example myself, but it at least conveys the idea: unify the interface to eliminate branches. In the paper, the authors present two applications, random forests (traversing binary decision trees) and NFA traversal, both of which involve decision making (jump to the left or the right child, evaluate a non-leaf or a leaf node, change the automaton's state). I found it amusing that the SIMD NFA looks like a flattened GPGPU implementation, but still, the idea is brilliant. The rest of the paper is not hard to follow; I recommend that anyone interested in codegen take a look.
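
To give a flavor of how the same trick applies to tree traversal, here is a toy sketch of my own -- the struct, the names and the lockstep scheme are mine, and the paper's actual code-generation strategy is considerably more sophisticated:

// Turning the "go left or right?" branch of a decision-tree walk into a data
// dependency: the comparison result indexes a child array instead of branching.
struct Node {
    float threshold;
    int   child[2];    // child[0] = left, child[1] = right; -1 marks "no child"
};

// Walk `width` trees in lockstep over the same input x. Every walker executes
// the identical instruction sequence each step, which is what makes the inner
// loop a candidate for SIMD (or SIMT) execution.
void traverse_lockstep(const Node* nodes, const int* roots,
                       int width, float x, int* result)
{
    for (int w = 0; w < width; ++w) result[w] = roots[w];
    bool active = true;
    while (active) {
        active = false;
        for (int w = 0; w < width; ++w) {           // the vectorizable body
            int cur = result[w];
            int dir = nodes[cur].threshold < x;     // 0 or 1, no branch
            int nxt = nodes[cur].child[dir];
            result[w] = (nxt >= 0) ? nxt : cur;     // stay put once at a leaf
            active |= (nxt >= 0);
        }
    }
}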



* Auto-vectorization in GCC: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
