C++ AMP Accelerated Massive Parallelism

Microsoft's C++ AMP Unveiled

http://drdobbs.com/windows/231600761

Over the past few years, some developers have started to take advantage of the power of GPU hardware in their apps. In other words, from their CPU code, they have been offloading parts of their app that are compute intensive to the GPU, and enjoying overall performance increases in their solution.

The code parts that are offloadable to an accelerator such as the GPU are data parallel algorithms operating over large (often multi-dimensional) arrays of data. So if you have code that fits that pattern, it is in your interest to explore taking advantage of the GPU from your app.

As you start on such a journey, you will face the same issue that so many before you have faced: While your app is written in a mainstream programming language such as C++, most of the options for targeting a data parallel accelerator involve learning a new niche syntax, obtaining a new set of development tools (including a separate compiler), trying to figure out which hardware you can target and which is out of reach for your chosen option, and perhaps drawing a deployment matrix of what needs to ship to the customer's machine for your solution.

Microsoft is aiming to significantly lower the barrier to entry by providing a mainstream C++ option that we are calling "C++ Accelerated Massive Parallelism" or "C++ AMP" for short.

C++ AMP introduces a key new language feature to C++ and a minimal STL-like library that enables you to very easily work with large multidimensional arrays to express your data parallel algorithms in a manner that exposes massive parallelism on an accelerator, such as the GPU.

We announced this technology at the AMD Fusion Developer Summit in June 2011. At the same time, we announced our intent to make the specification open, and we are working with other compiler vendors so they can support it in their compilers (on any platform).

Microsoft's implementation of C++ AMP is part of the next version of the Visual C++ compiler and the next version of Visual Studio. A Developer Preview of that release should be available publicly at the time you are reading this article.

Microsoft's implementation targets Windows by building on top of the ubiquitous and reliable Direct3D platform, and that means that in addition to the performance and productivity advantages of C++ AMP, you will benefit from hardware portability across all major hardware vendors. The core API surface area is general and Direct3D-neutral, such that one could think of Direct3D as an implementation detail; in future releases, we could offer additional implementations targeting other kinds of hardware and topologies (e.g., cloud), while preserving your investments in learning our data parallel API.

So what does the C++ AMP API look like? Before delving into a complete example, the next section covers the basics of the structure of C++ AMP code; the API itself lives in the amp.h header, inside the concurrency namespace.

C++ AMP API in a Nutshell

You can see an example of a matrix multiplication in C++ AMP in this blog post. Figure 1 shows a simple array addition example demonstrating some core concepts.

Figure 1: Simple array addition using C++ AMP.
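
The figure itself is not reproduced here, so the listing below is a minimal sketch of that kind of array addition, written against the same constructs used throughout this article (array_view, grid, index, parallel_for_each, and restrict(direct3d)); the function and variable names are illustrative, not taken from the original figure.

#include <amp.h>
using namespace concurrency;

void add_arrays(int n, const float* va, const float* vb, float* vsum)
{
    // Wrap the existing CPU buffers so the runtime can copy them on demand.
    array_view<const float> a(n, va);
    array_view<const float> b(n, vb);
    array_view<float>       sum(n, vsum);

    // One logical GPU thread per element; idx identifies the element each thread handles.
    parallel_for_each(
        grid<1>(extent<1>(n)),
        [=](index<1> idx) restrict(direct3d)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );
}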

To use large datasets with C++ AMP, you can either copy them to an array or wrap them with an array_view. The array class is a container of data of a given element type and rank N, residing on a specific accelerator; you must explicitly copy data to and from it. The array_view class is a wrapper over existing data, and any copying required for computations is done implicitly, on demand.
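
The snippet below sketches the distinction. The constructor and copy overloads are those of the C++ AMP library as we know it; treat the exact signatures as assumptions, since the Developer Preview surface may differ in detail.

#include <amp.h>
#include <vector>
using namespace concurrency;

void array_vs_array_view_demo()
{
    std::vector<float> data(1024, 1.0f);

    // array: the data resides on an accelerator and must be copied explicitly.
    array<float, 1> arr(1024, data.begin(), data.end());   // explicit copy in
    // ... run a parallel_for_each that reads and writes arr ...
    copy(arr, data.begin());                                // explicit copy back out

    // array_view: wraps the existing CPU buffer; the runtime copies it to and
    // from the accelerator implicitly, only when a computation demands it.
    array_view<float> view(1024, data);
    // ... run a parallel_for_each that reads and writes view ...
}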

The entry point to an algorithm implemented in C++ AMP is one of the overloads of parallel_for_each. parallel_for_each invocations are translated by our compiler into GPU code and GPU runtime calls (through Direct3D).

The first argument to parallel_for_each is a grid object. The grid class lets you define an N-dimensional space. The second argument to the parallel_for_each call is a lambda, whose parameter list consists of an index object (if you are not familiar with lambdas in C++, visit this page for further details). The index class lets you define an N-dimensional point.
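
For example, a rank-2 computation over an M-by-N matrix might be launched as below. This is a sketch that assumes the rank-2 forms of grid, extent, index, and array_view follow the same pattern as the rank-1 forms used elsewhere in this article; the function is illustrative, not part of the library.

// Double every element of an M x N matrix held in row-major order in 'data'.
void scale_matrix(int M, int N, float* data)
{
    array_view<float, 2> m(M, N, data);

    parallel_for_each(
        grid<2>(extent<2>(M, N)),              // an M x N index space
        [=](index<2> idx) restrict(direct3d)
        {
            // idx[0] is the row, idx[1] is the column of this thread.
            m[idx] *= 2.0f;
        }
    );
}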

The lambda you write and pass to the parallel_for_each is called by the C++ AMP runtime once per thread, passing in the thread ID as an index, which you can use to index into your array or array_view objects. The variables for those objects are not explicit in any signature; instead, you capture them into the lambda as needed — one of the beauties of a lambda-based design.

Note that the lambda (and any other function that it calls) must be annotated with the new restrict modifier, indicating that the function should be compiled for the target named by the restrict specifier (in our case, any Direct3D device). This new language feature, whose usage is very simple, will be covered in more depth in a separate article.
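
As a small illustration, the hypothetical helper below carries the annotation so that it can be called from lambdas like the ones shown in this article; calling an unannotated function from GPU code is rejected at compile time.

// A hypothetical helper called from GPU code; it must carry the same restrict
// annotation as its callers so it, too, is compiled for the Direct3D target.
float average_of_two(float a, float b) restrict(direct3d)
{
    return (a + b) * 0.5f;
}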

There are other classes as part of the C++ AMP API — for example, accelerator and accelerator_view — that let you check the capabilities of accelerators and query information about them, and hence, let you choose which one you want your algorithm to execute on.
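
For example, one might enumerate the accelerators and pick the one with the most dedicated memory. The sketch below uses member names (get_all, is_emulated, dedicated_memory, default_view) as they appear in the released C++ AMP library; treat them as assumptions here, since the Developer Preview surface may differ in detail.

#include <amp.h>
#include <vector>
using namespace concurrency;

// Pick a view on the non-emulated accelerator with the most dedicated memory,
// falling back to the default accelerator if nothing better is found.
accelerator_view pick_accelerator_view()
{
    std::vector<accelerator> all = accelerator::get_all();
    accelerator chosen;                       // default-constructs the default accelerator
    for (size_t i = 0; i < all.size(); ++i)
    {
        if (!all[i].is_emulated && all[i].dedicated_memory > chosen.dedicated_memory)
            chosen = all[i];
    }
    return chosen.default_view;
}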

Finally, there is a tiled overload of parallel_for_each that accepts a tiled_grid, and whose lambda takes a tiled_index; within the lambda body, you can use the new tile_static storage class and the tile_barrier class. This variant of parallel_for_each lets you take advantage of the programmable cache on the GPU (also known as shared memory, and known in C++ AMP as tile_static memory) for maximum performance benefits; an example of it is shown in the next part of the article.

Calculating a Moving Average with C++ AMP


A common problem in finance and science is that of calculating a moving average over a time series. For example, for a given company stock, financial Web sites provide in addition to a normal chart tracking the stock's price as a function of time, a smoother curve, which tracks the stock's average price over the last 50 or 200 days. That curve is the simple moving average of the stock's price.

We start with a serial implementation, which calculates the simple moving average directly based on its definition (in our presentation of the problem, we will assume that we don't need to calculate the values of the moving average for points in the range [0…window-2]).
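
Written out, the quantity computed for each point i at or beyond window-1 is simply the mean of the window most recent samples:

    moving_average[i] = (series[i-(window-1)] + series[i-(window-2)] + ... + series[i]) / window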

#include <amp.h>            // needed for the C++ AMP versions that follow
using namespace concurrency;

// Serial simple moving average, straight from the definition.
void sma(int series_length, const float *series, float *moving_average, int window)
{
    for (int i = window-1; i < series_length; i++)
    {
        float sum = 0.0f;
        for (int j = i-(window-1); j <= i; j++)
            sum += series[j];
        moving_average[i] = sum/window;
    }
}

Simple C++ AMP Version

Our first C++ AMP version of this algorithm is produced by parallelizing the serial algorithm. We note that each iteration of the loop over i is independent and, thus, could safely be parallelized. Compare the serial version with the parallel version below:

// Parallel C++ AMP version: one GPU thread per output element.
void sma_amp(int series_length, const float *series, float *moving_average, int window)
{
    array_view<const float>  arr_in(series_length, series);
    array_view<float>        arr_out(series_length, moving_average);

    parallel_for_each(
        grid<1>(extent<1>(series_length-(window-1))),
        [=](index<1> idx) restrict(direct3d)
        {
            const int i = idx[0] + (window-1);

            float sum = 0.0f;
            for (int j = i-(window-1); j <= i; j++)
                sum += arr_in[j];

            arr_out[i] = sum/window;
        }
    );
}

We arrived at this parallel version by taking the serial algorithm and wrapping the float* buffers, series and moving_average, with array_view classes. Then we transformed the loop over i into a parallel_for_each call, where the loop bounds have been replaced by an instance of class grid. The body of the lambda is then almost identical to the serial loop.
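
To try the two versions side by side, a small driver along the following lines will do. This is a sketch: it assumes the sma and sma_amp listings above live in the same source file, and the synthetic data, sizes, and tolerance are arbitrary.

#include <cmath>
#include <iostream>
#include <vector>

int main()
{
    const int n = 1 << 20;        // one million samples
    const int window = 50;

    std::vector<float> series(n), avg_cpu(n, 0.0f), avg_gpu(n, 0.0f);
    for (int i = 0; i < n; i++)
        series[i] = 100.0f + std::sin(i * 0.01f);   // synthetic "price" data

    sma(n, series.data(), avg_cpu.data(), window);       // serial version
    sma_amp(n, series.data(), avg_gpu.data(), window);   // C++ AMP version

    // The array_view wrapping moving_average inside sma_amp copies the results
    // back to avg_gpu implicitly, at the latest when the view is destroyed as
    // sma_amp returns, so the comparison below sees the GPU results.
    int mismatches = 0;
    for (int i = window - 1; i < n; i++)
        if (std::fabs(avg_cpu[i] - avg_gpu[i]) > 1e-3f)
            mismatches++;

    std::cout << "mismatches: " << mismatches << std::endl;
    return 0;
}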

When we try the parallel algorithm on a machine with a decent DX11 card, we find it is significantly faster than the serial version. For example, on our hardware, for certain data sizes, we saw the GPU perform the calculation 150 times faster than the serial CPU implementation. In fact, the parallel algorithm is also faster than more sophisticated serial algorithms. The reason for this speedup is that the GPU has a high degree of parallelism and a wide pipe to the memory subsystem to keep its many cores fed.

In order to reduce the number of RAM accesses the parallel algorithm performs, we would like to share input data between threads, rather than letting each thread independently load the same data from global memory. It is possible to achieve this goal using "thread tiling" and by utilizing "tile static memory." A tile is a unit of sharing of memory and also a unit of synchronization between threads. Threads in a tile can synchronize using tile barriers, and they can also employ atomic operations on tile static memory.

Our strategy for tiling is to load tile_size elements of input at a time. This is done in tandem by all the threads in a tile, each one loading one element of the input. Having completed this collaborative operation, each thread figures out whether any of the data just loaded needs to be factored into its own moving-average calculation. The code follows below:

void sma_amp_tiled(int series_length, const float *series, float *moving_average, int window)
{
    array_view<const float>       arr_in(series_length, series);
    array_view<writeonly<float>>  arr_out(series_length, moving_average);   // only ever written on the accelerator

    // Pad the total number of threads up to a multiple of the tile size.
    static const int tile_size = 512;
    int number_of_groups = (series_length + tile_size - 1) / tile_size;
    int total_threads = number_of_groups * tile_size;

    parallel_for_each(
        grid<1>(extent<1>(total_threads)).tile<tile_size>(),
        [=](tiled_index<tile_size> tiled_idx) restrict(direct3d) {
            int gid = tiled_idx.global[0];   // this thread's position in the whole computation
            int lid = tiled_idx.local[0];    // this thread's position within its tile

            // The range of output elements covered by this tile...
            int tile_min_thread = tiled_idx.tile_origin[0];
            int tile_max_thread = tile_min_thread + tile_size - 1;

            // ...and the range of input elements the whole tile must read.
            int tile_min_ref = tile_min_thread - (window - 1) < 0 ? 0 : tile_min_thread - (window - 1);
            int tile_max_ref = tile_max_thread >= arr_in.extent[0] ? arr_in.extent[0] - 1 : tile_max_thread;

            // The range of input elements this particular thread averages over.
            int my_min_ref = gid - (window - 1) < tile_min_ref ? tile_min_ref : gid - (window - 1);
            int my_max_ref = gid > tile_max_ref ? tile_max_ref : gid;

            float sum = 0.0f;
            for (int i = tile_min_ref; i <= tile_max_ref; i += tile_size) {
                // Collaboratively load one tile_size chunk of input into tile_static memory.
                tile_static float tile[tile_size];

                if (i + lid < series_length)     // the last chunk may be partial
                    tile[lid] = arr_in[i + lid];
                tiled_idx.barrier.wait();

                // Accumulate the part of this chunk that falls inside this thread's window.
                int ith_min_ref = i < my_min_ref ? my_min_ref : i;
                int ith_max_ref = i + (tile_size - 1) > my_max_ref ? my_max_ref : i + (tile_size - 1);

                for (int j = ith_min_ref; j <= ith_max_ref; j++)
                    sum += tile[j - i];
                tiled_idx.barrier.wait();
            }

            // Padded threads and the first window-1 elements produce no output.
            if (gid >= (window - 1) && gid < series_length)
                arr_out[gid] = sum / window;
        });
}

Explaining the algorithm in minute detail falls outside the scope of this short introduction to C++ AMP, but let's focus on a few common patterns that appear in the code. The first few statements of the function take the number of elements in the problem and launch a tiled computation over that index range: we select a tile size, pad the number of threads up to a multiple of the tile size, and then launch the kernel over a tiled_grid (the .tile<tile_size>() call on the grid). The parameter to the lambda is correspondingly a tiled_index, which provides information about the location of the thread within its local tile and within the global computation.

The block of index math that follows (tile_min_thread through my_max_ref) figures out which piece of input data the thread tile will load collaboratively, and which indices within that range each specific thread is actually interested in.

The loop over i then iterates over the tile's index set and loads memory from the array_view into the variable tile, which is declared with the tile_static storage class; this tells the system to make tile a variable that is accessible by all threads within a given tile. After storing the values it read into tile, and before going on to read the values produced by other threads, each thread must reach a barrier and wait for its peers, which the code expresses as tiled_idx.barrier.wait().

Finally, the results are stored by the guarded write at the end of the lambda. We must check whether the current thread corresponds to an element of the output: because we padded the grid over which the parallel_for_each was launched, some threads may not correspond to any output element.

The tiled version is naturally much more complicated than the simple parallel version, but the rewards are significant, too; for many inputs we have tested, it is twice as fast.

Additional optimizations are possible, both in the high-level algorithm and tactically in the way we code using C++ AMP. These optimizations will be covered in future articles. For many problems, however, the speedup offered by a simple parallel algorithm is by itself compelling enough (as is the case here). So the advice as always will be: Invest in optimization only as much as your requirements call for.

And Now What?

This article has attempted to give you a taste of C++ AMP, but there is a long list of features that we simply did not have space to cover; we will discuss them in a future article on Dr. Dobb's. That list includes our math and atomics libraries, the Direct3D-interop interface and HLSL-intrinsic support (for game developers), a CPU fallback solution for when there is no capable accelerator on the system, upcoming Visual Studio support for debugging and profiling and all the other parts of the market-leading IDE, and more.

We hope that you have found this short introduction to C++ AMP compelling, and we look forward to hearing how massive parallelism is going to transform your problem domain! Microsoft is hard at work bringing heterogeneous computing to the mainstream through a technology that offers performance, productivity, and portability. We are doing so with a future-proof design that abstracts away from a hardware landscape that is still very much in motion.


