Multi-GPU Computing with CUDA

Multi-GPU control with CUDA alone

1. The CUDA API provides cudaSetDevice() to switch the active GPU; for example, cudaSetDevice(1) selects GPU 1. With CUDA alone, a single CPU thread controls one GPU at a time, GPU 0 by default. The communication topology is shown in the figure below (left):
[Figure 1: communication topology — left: CUDA-only, one CPU thread driving both GPUs; right: OpenMP, one CPU thread per GPU]
GPU 0 can both read values from GPU 1 and transfer values to GPU 1; GPU 1 can read values from the host but cannot transfer values back to the host; GPU 1 can transfer values to GPU 0 but cannot read values from GPU 0.
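As a minimal sketch of device switching (error checking omitted; the buffer names d0 and d1 are illustrative, not from the samples):

#include <cuda_runtime.h>

int main() {
    float *d0, *d1;
    cudaSetDevice(0);                          // GPU 0 is the default device
    cudaMalloc(&d0, 1024 * sizeof(float));     // allocation lands on GPU 0
    cudaSetDevice(1);                          // switch this CPU thread to GPU 1
    cudaMalloc(&d1, 1024 * sizeof(float));     // allocation lands on GPU 1
    // kernel launches here run on GPU 1 until cudaSetDevice() is called again
    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}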

2. Consider peer-to-peer memory access. On my machine cudaDeviceCanAccessPeer() does not return 1, i.e. this strategy is unavailable; the official explanation is as follows:

On Linux only, CUDA and the display driver does not support IOMMU-enabled bare-metal PCIe peer to peer memory copy. However, CUDA and the display driver does support IOMMU via VM pass through. As a consequence, users on Linux, when running on a native bare metal system, should disable the IOMMU. The IOMMU should be enabled and the VFIO driver be used as a PCIe pass through for virtual machines.
On Windows the above limitation does not exist.

Reference example: /usr/local/cuda/samples/0_Simple/simpleP2P
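A minimal sketch of the capability check and, when it succeeds, enabling peer access in both directions (variable names are mine; simpleP2P does essentially the same):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 map GPU 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // can GPU 1 map GPU 0's memory?
    printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);
    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // flags argument must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }
    return 0;
}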

3. Consider Peer-to-Peer Memory Copy and DeviceToDevice copies. Because of the master/slave relationship between the GPUs, the correspondences shown in the figure below must match up.
[Figure 2: required correspondences for peer-to-peer / device-to-device copies]
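A sketch of both copy styles under these constraints (buffer names and sizes are illustrative):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1024 * sizeof(float);
    float *d0, *d1;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Explicit peer copy: dst, dstDevice, src, srcDevice, byte count.
    // This works even when direct P2P is unavailable (the runtime then
    // stages the transfer through host memory).
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    // With peer access enabled, a plain DeviceToDevice copy also works:
    // cudaMemcpy(d1, d0, bytes, cudaMemcpyDeviceToDevice);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}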

OpenMP+Multi-GPU

Reference example: /usr/local/cuda/samples/0_Simple/cudaOpenMP — it uses the OpenMP API to write an application for multiple GPUs.
1. Each GPU is controlled by its own CPU thread, so all GPUs have equal status. The communication topology is shown in the figure above (right). The main benefits are more flexible task assignment and fewer unnecessary synchronizations; see the combined sketch after the syntax block below.

2. Syntax of the parallel region:

#pragma omp parallel num_threads(2)
{
    int cpuid = omp_get_thread_num();   // OpenMP thread index: 0 or 1
    if (cpuid == 0) {
        cudaSetDevice(0);
        // ... kernel launches for GPU 0 ...
    }
    if (cpuid == 1) {
        cudaSetDevice(1);
        // ... kernel launches for GPU 1 ...
    }
}
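Putting the pieces together, a self-contained sketch that gives each of two GPUs its own slice of work (the scale kernel and the sizes are illustrative assumptions, not taken from the sample):

#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Illustrative kernel: doubles every element in place.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;                       // per-GPU element count (assumed)
    #pragma omp parallel num_threads(2)
    {
        int cpuid = omp_get_thread_num();
        cudaSetDevice(cpuid);                    // bind this CPU thread to GPU cpuid
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();                 // wait for this GPU's kernel
        cudaFree(d);
    }
    printf("done\n");
    return 0;
}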

3. Compilation notes:
/usr/local/cuda/bin/nvcc -ccbin g++ -I/usr/local/cuda/samples/common/inc -m64 -Xcompiler -fopenmp -gencode arch=compute_70,code=sm_70 -o cudaOpenMP.o -c cudaOpenMP.cu
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_70,code=sm_70 -o cudaOpenMP cudaOpenMP.o -lgomp
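Here -Xcompiler -fopenmp forwards the OpenMP flag to the host compiler (g++) and -lgomp links GNU's OpenMP runtime; the -gencode arch=compute_70,code=sm_70 pair targets compute capability 7.0 (Volta) and should be adjusted to match your GPU.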

Timing function

double start_t = omp_get_wtime();
// ... code block being timed ...
double end_t = omp_get_wtime();
printf("runtime = %f(s)\n", end_t - start_t);
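Note that kernel launches are asynchronous with respect to the CPU, so call cudaDeviceSynchronize() on each GPU before reading end_t; otherwise the timer captures only the launch overhead rather than the kernels' execution time.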
