
1. Inclusive and exclusive scans



/*
 * x: input array
 * y: output array
 */
void sequential_scan(float* x, float* y, int Max_i) {
	y[0] = x[0];
	for (int i = 1; i < Max_i; i++) {
		y[i] = y[i - 1] + x[i];  // running sum up to and including x[i]
	}
}





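Suppose the input array is [3, 1, 7, 0, 4, 1, 6, 3]. The inclusive scan produces [3, 4, 11, 11, 15, 16, 22, 25], and the exclusive scan produces [0, 3, 4, 11, 11, 15, 16, 22], which is easy to verify by hand: the exclusive result is the inclusive result shifted right by one position, with 0 in the first slot.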
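The two scan variants can be checked with a small host-side C++ sketch. The function names `inclusive_scan_seq` and `exclusive_scan_seq` are made up for this illustration and are not part of CUDA or Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive scan: y[i] = x[0] + ... + x[i].
std::vector<float> inclusive_scan_seq(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        running += x[i];   // add first, then store
        y[i] = running;
    }
    return y;
}

// Exclusive scan: y[i] = x[0] + ... + x[i-1], with y[0] = 0.
std::vector<float> exclusive_scan_seq(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = running;    // store first, then add
        running += x[i];
    }
    return y;
}
```

The only difference between the two loops is whether the running sum is stored before or after the current element is added.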


2. Simple parallel scan




(2) "inclusive scan" marks the inclusive-scan part, and "exclusive scan" marks the exclusive-scan part.
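As a rough illustration, here is a CPU simulation of a simple parallel inclusive scan. It assumes the scheme meant here is the usual stride-doubling one (often called Hillis-Steele or Kogge-Stone). The function name is hypothetical. Each iteration of the inner loop stands in for one GPU thread, and the two buffers play the role of the synchronization between steps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of a stride-doubling inclusive scan.
// In step k, "thread" i adds the element stride = 2^k positions to its left.
std::vector<float> simple_parallel_inclusive_scan(const std::vector<float>& x) {
    std::vector<float> cur = x;
    std::vector<float> next(x.size());
    for (std::size_t stride = 1; stride < x.size(); stride *= 2) {
        // Every i would run concurrently on a GPU; reading from cur and
        // writing to next avoids read-after-write hazards within a step.
        for (std::size_t i = 0; i < x.size(); ++i) {
            next[i] = (i >= stride) ? cur[i] + cur[i - stride] : cur[i];
        }
        cur.swap(next);
    }
    return cur;
}
```

This version performs on the order of n log n additions in total, more than the n - 1 additions of the sequential loop, which is the usual motivation for a work-efficient variant.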


3. Work-efficient parallel scan
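
The following is a minimal CPU-side sketch of one common work-efficient formulation, the up-sweep/down-sweep scheme usually attributed to Blelloch. Whether this heading refers to exactly that scheme is an assumption. The function name is hypothetical, and the sketch only handles power-of-two lengths.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work-efficient exclusive scan (up-sweep then down-sweep), simulated on the CPU.
// Assumption: data.size() is a power of two.
std::vector<float> work_efficient_exclusive_scan(std::vector<float> data) {
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    // Up-sweep (reduction): build partial sums in place, tree level by level.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }

    // Clear the root, then push prefixes back down the tree.
    data[n - 1] = 0.0f;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            float left = data[i - stride];
            data[i - stride] = data[i];  // left child receives the parent's prefix
            data[i] += left;             // right child receives prefix + left subtree sum
        }
    }
    return data;
}
```

Both phases together perform roughly 2n additions, which is why this family of algorithms is described as work-efficient.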


4. Parallel scan for arbitrary input lengths


5. Interoperability between Thrust and CUDA

Explanation: Interoperability between Thrust and CUDA supports an iterative development strategy: prototype the parallel application quickly with Thrust, identify the bottlenecks, then implement the specific algorithms in CUDA C and optimize them where necessary. When a Thrust function is called, it inspects the type of the iterator to determine whether to use a host or a device implementation. This process is known as static dispatching since the host/device dispatch is resolved at compile time. Note that this implies that there is no runtime overhead to the dispatch process.
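
The idea of static dispatching can be illustrated in plain C++ with tag types. This is only a sketch of the general technique, not Thrust's actual implementation: `host_tag`, `device_tag`, `host_iter`, `device_iter`, `fill_impl`, and `dispatch_fill` are all hypothetical names invented for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical tags standing in for the host and device back ends.
struct host_tag {};
struct device_tag {};

// Hypothetical iterator types; each one advertises its back end as a nested type.
template <typename T> struct host_iter   { typedef host_tag   system; T* p; };
template <typename T> struct device_iter { typedef device_tag system; T* p; };

// One overload per back end; the strings stand in for real implementations.
inline std::string fill_impl(host_tag)   { return "host"; }
inline std::string fill_impl(device_tag) { return "device"; }

// The overload is selected from the iterator's type alone, so the choice is
// made by the compiler and no runtime branch is involved.
template <typename Iterator>
std::string dispatch_fill(Iterator) {
    return fill_impl(typename Iterator::system());
}
```

Calling `dispatch_fill(device_iter<int>())` picks the `device_tag` overload during overload resolution, which is the sense in which the dispatch carries no runtime cost.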


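size_t N = 1024;
thrust::device_vector<int> d_vec(N);
// obtain a raw pointer to the device vector's memory
int* raw_ptr = thrust::raw_pointer_cast(&d_vec[0]);
cudaMemset(raw_ptr, 0, N * sizeof(int));
// grid_size and block_size are placeholders for the launch configuration
my_kernel<<<grid_size, block_size>>>(N, raw_ptr);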

Note: raw_pointer_cast() converts a device address into a raw C pointer. The raw pointer can then be passed to CUDA C API functions or used as an argument to a CUDA C kernel.


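size_t N = 1024;
int* raw_ptr;
cudaMalloc(&raw_ptr, N * sizeof(int));
// wrap the raw device pointer so Thrust dispatches to the device implementation
thrust::device_ptr<int> dev_ptr = thrust::device_pointer_cast(raw_ptr);
thrust::sort(dev_ptr, dev_ptr + N);
dev_ptr[0] = 1;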



6. Mapping between GPU, SM, SP and Grid, Block, Thread

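Explanation: The GPU's work distribution unit assigns a Grid to the GPU chip and uses a round-robin policy to assign Blocks to SMs. Whether a Block can be assigned depends on the amount of shared memory each Block uses, the number of registers each Block uses, and some other limits. The thread scheduler inside an SM then subdivides its assigned Blocks, organizing their threads into warps, and each Thread of a Block is issued to an SP. One SM can process several Blocks at the same time. For example, with 16 SMs, 64 Blocks, and up to 3 Blocks resident per SM, the device starts by processing 48 Blocks concurrently while the remaining 16 Blocks wait for an SM. In this simplified model an SM executes one warp of one Block at a time, but when the running warp has to wait (for a global memory access, for example), the SM switches to another warp and continues computing.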
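
The arithmetic in the example above can be written out as a tiny C++ helper. The struct and function names are hypothetical, and the model ignores the shared-memory and register limits mentioned in the text.

```cpp
#include <algorithm>
#include <cassert>

// How many blocks start immediately and how many must wait for a free slot.
struct BlockSchedule {
    int resident;  // blocks running at launch
    int waiting;   // blocks queued until an SM frees a slot
};

BlockSchedule initial_schedule(int num_sms, int blocks_per_sm, int total_blocks) {
    assert(num_sms > 0 && blocks_per_sm > 0 && total_blocks >= 0);
    const int capacity = num_sms * blocks_per_sm;            // e.g. 16 * 3 = 48
    const int resident = std::min(capacity, total_blocks);
    BlockSchedule s = { resident, total_blocks - resident }; // e.g. 64 - 48 = 16
    return s;
}
```

With 16 SMs, 3 blocks per SM, and 64 blocks, `initial_schedule(16, 3, 64)` gives 48 resident and 16 waiting blocks, matching the numbers in the paragraph.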


7. Pinned memory

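Explanation: malloc() allocates pageable host memory, whereas cudaHostAlloc() allocates page-locked host memory, also called pinned memory. Its key property is that the operating system will not page this memory out to disk, so the buffer stays resident in physical memory and is never relocated.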


8. Installing CUDA 7.5 and cuDNN 5.0



(2) Copy the contents of the cuda/include, cuda/lib, and cuda/bin directories, respectively, into C:\Program Files\NVIDIA GPU Computing 


Note: the cuDNN 5.0 build for CUDA 8.0 is not the same as the cuDNN 5.0 build for CUDA 7.5.


9. NVIDIA Deep Learning SDK


(1) Deep Learning Primitives (cuDNN): High-performance building blocks for deep neural network applications including convolutions, activation functions, and tensor transformations.

(2) Deep Learning Inference Engine (TensorRT): High-performance deep learning inference runtime for production deployment.

(3) Deep Learning for Video Analytics (DeepStream SDK): High-level C++ API and runtime for GPU-accelerated transcoding and deep learning inference.

(4) Linear Algebra (cuBLAS): GPU-accelerated BLAS functionality that delivers 6x to 17x faster performance than CPU-only BLAS libraries.

(5) Sparse Matrix Operations (cuSPARSE): GPU-accelerated linear algebra subroutines for sparse matrices that deliver up to 8x faster performance than CPU BLAS (MKL), ideal for applications such as natural language 

(6) Multi-GPU Communication (NCCL): Collective communication routines, such as all-gather, reduce, and broadcast that accelerate multi-GPU deep learning training on up to eight GPUs.

(7) NVIDIA DIGITS: Interactively manage data and train deep learning models for image classification, object detection, and image segmentation without the need to write code.

Note: Fast Fourier Transforms (cuFFT); Dense and Sparse Direct Solvers (cuSOLVER); Random Number Generation (cuRAND); Image & Video Processing Primitives (NPP); NVIDIA Graph Analytics Library (nvGRAPH); Templated Parallel Algorithms & Data Structures (Thrust); CUDA Math Library.


10. istream_iterator and ostream_iterator


(1) template <class T, class charT = char, class traits = char_traits<charT> > class ostream_iterator;

#include <iostream>     // std::cout
#include <iterator>     // std::ostream_iterator
#include <vector>       // std::vector
#include <algorithm>    // std::copy

int main () {
  std::vector<int> myvector;
  for (int i=1; i<10; ++i) myvector.push_back(i*10);

  // every element written through out_it is followed by ", "
  std::ostream_iterator<int> out_it (std::cout, ", ");
  std::copy ( myvector.begin(), myvector.end(), out_it );
  return 0;
}

(2) template <class T, class charT = char, class traits = char_traits<charT>, class Distance = ptrdiff_t> class istream_iterator;


#include <iostream>     // std::cin, std::cout
#include <iterator>     // std::istream_iterator
using namespace std;

int main() {
	double value1 = 0.0, value2 = 0.0;
	std::cout << "Please, insert two values: ";

	std::istream_iterator<double> eos;             // end-of-stream iterator
	std::istream_iterator<double> iit(std::cin);   // stdin iterator

	if (iit != eos) {
		// eos marks end-of-stream and must never be dereferenced
		cout << *iit << endl;
		cout << "test1" << endl;
		value1 = *iit;
		++iit;                                     // read the next value from stdin
	}

	if (iit != eos) {
		cout << *iit << endl;
		cout << "test2" << endl;
		value2 = *iit;
	}

	std::cout << value1 << "*" << value2 << "=" << (value1*value2) << '\n';

	return 0;
}


11. __host__ __device__ int foo(int a){}  

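Explanation: __host__ int foo(int a){} declares a function that runs on, and is called from, the CPU. __device__ int foo(int a){} declares a function that runs on, and is called from, the GPU. The __host__ and __device__ qualifiers can be combined: __host__ __device__ int foo(int a){} is compiled into two versions, one callable from the CPU and one callable from the GPU.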


12. SAXPY



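Explanation: SAXPY (Single-precision A·X Plus Y, sometimes expanded as Scalar Alpha X Plus Y) is a function in the Basic Linear Algebra Subprograms (BLAS) package and a common operation on parallel vector processors. SAXPY combines a scalar multiplication with a vector addition: y = ax + y, where a is a scalar and x and y are vectors.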

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/functional.h>

struct saxpy_functor
{
	const float a;
	saxpy_functor(float _a) : a(_a) {}
	__host__ __device__ float operator()(const float& x, const float& y) const
	{
		return a * x + y;
	}
};

// Fused version: a single transform computes a * x + y per element.
void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{ // Y <- A * X + Y
	thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));
}

// Unfused version: needs a temporary vector and three separate passes.
void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
	thrust::device_vector<float> temp(X.size());
	// temp <- A
	thrust::fill(temp.begin(), temp.end(), A);
	// temp <- A * X
	thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>());
	// Y <- A * X + Y
	thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>());
}
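
To check without a GPU that the fused and unfused formulations compute the same result, the same structure can be mirrored on the host with the standard library. This is only an analogy under that assumption: `saxpy_host_functor`, `saxpy_fused_host`, and `saxpy_unfused_host` are hypothetical names, and `std::transform` on `std::vector` stands in for `thrust::transform` on `thrust::device_vector`.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Host-side counterpart of the functor: computes a * x + y for one element pair.
struct saxpy_host_functor {
    float a;
    explicit saxpy_host_functor(float a_) : a(a_) {}
    float operator()(float x, float y) const { return a * x + y; }
};

// Fused: one pass over X and Y.
void saxpy_fused_host(float A, const std::vector<float>& X, std::vector<float>& Y) {
    assert(X.size() == Y.size());
    std::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_host_functor(A));
}

// Unfused: fill a temporary, multiply, then add (three passes).
void saxpy_unfused_host(float A, const std::vector<float>& X, std::vector<float>& Y) {
    assert(X.size() == Y.size());
    std::vector<float> temp(X.size(), A);                         // temp <- A
    std::transform(X.begin(), X.end(), temp.begin(), temp.begin(),
                   std::multiplies<float>());                     // temp <- A * X
    std::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(),
                   std::plus<float>());                           // Y <- A * X + Y
}
```

The fused form avoids the temporary vector and the extra passes over memory, which is the usual reason to prefer a custom functor for a bandwidth-bound operation such as SAXPY.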



13. Transformations in Thrust











14. Reductions in Thrust












15. Initializing thrust::device_vector


float x[4] = { 1.0, 2.0, 3.0, 4.0 };
// copy the host range [x, x + 4) into a newly allocated device vector
thrust::device_vector<float> d_x(x, x + 4);
for (size_t i = 0; i < d_x.size(); i++)
	cout << d_x[i] << endl;


16. template <typename T> struct thrust::plus< T >


int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());

float norm = std::sqrt(thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op));
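
For comparison, the same two patterns (a plain sum, and a Euclidean norm computed as a transform followed by a reduction) can be sketched on the host with `std::accumulate`. This assumes that `unary_op` squares its argument, `init` is 0, and `binary_op` is addition, which the lines above do not state. The function names `sum_host` and `norm_host` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

// Host analogue of reduce(..., 0, plus<int>()).
int sum_host(const std::vector<int>& d) {
    return std::accumulate(d.begin(), d.end(), 0, std::plus<int>());
}

// Host analogue of sqrt(transform_reduce(..., square, 0, plus)):
// the lambda squares each element (transform) and adds it to the accumulator (reduce).
float norm_host(const std::vector<float>& x) {
    float sum_sq = std::accumulate(x.begin(), x.end(), 0.0f,
        [](float acc, float v) { return acc + v * v; });
    return std::sqrt(sum_sq);
}
```

For the vector {1, 2, 3, 4} the sum of squares is 30, so `norm_host` returns the square root of 30.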



