Tensor Core Programming

Table of contents

    • Background
    • Demo
    • Summary

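Background

Tensor Cores here are the dedicated matrix multiply-accumulate units in NVIDIA GPUs, available since the Volta architecture. CUDA exposes them to C++ through the warp-level WMMA API in the nvcuda::wmma namespace, which the demo below uses.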

Demo

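#include <mma.h>
using namespace nvcuda;

#define WARP_SIZE 32

// Computes C = A * B with one warp per M_TILE x N_TILE output tile.
// A (M x K), B (K x N) and C (M x N) are row-major. M, N and K are assumed
// to be multiples of M_TILE, N_TILE and K_TILE respectively.
template <typename TIN,  // input element type
          typename TOUT, // output (accumulator) element type
          int M_TILE,    // supported shapes: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-type-sizes
          int N_TILE,
          int K_TILE>
__global__ void wmma_kernel(TIN *a,  // input matrix A
                            TIN *b,  // input matrix B
                            TOUT *c, // output matrix C
                            int M,
                            int N,
                            int K)
{
    int ndim = N / N_TILE; // number of tiles along N
    int kdim = K / K_TILE; // number of tiles along K

    // Flat warp id: warp_0, warp_1, ..., warp_n
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    // 2D tile coordinates owned by this warp
    int nidx = idx % ndim;
    int midx = idx / ndim;

    // Surplus warps have no tile to compute. The test is uniform across a
    // warp, so the *_sync calls below are never reached by a partial warp.
    if (midx >= M / M_TILE) return;

    // Declare the fragments
    wmma::fragment<wmma::matrix_a, M_TILE, N_TILE, K_TILE, TIN, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, M_TILE, N_TILE, K_TILE, TIN, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M_TILE, N_TILE, K_TILE, TOUT> c_frag;

    // Initialize the output to zero
    wmma::fill_fragment(c_frag, 0.0f);

    // Top-left element of this warp's output tile
    TOUT *c_unique = c + midx * M_TILE * N + nidx * N_TILE;

    for (int kidx = 0; kidx < kdim; kidx++) {
        // Top-left elements of the current A and B tiles
        TIN *a_unique = a + midx * M_TILE * K + kidx * K_TILE;
        TIN *b_unique = b + kidx * K_TILE * N + nidx * N_TILE;

        // Load the inputs; K and N are the leading dimensions of A and B
        wmma::load_matrix_sync(a_frag, a_unique, K);
        wmma::load_matrix_sync(b_frag, b_unique, N);

        // Perform the matrix multiplication: c_frag += a_frag * b_frag
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Store the output; N is the leading dimension of C
    wmma::store_matrix_sync(c_unique, c_frag, N, wmma::mem_row_major);
}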
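
The warp-to-tile mapping in the kernel above can be checked on the CPU. The following is a minimal host-side sketch, not part of the original demo: `blocks_needed` and `tiled_matmul_reference` are hypothetical helper names introduced here for illustration. It assumes row-major matrices whose dimensions are multiples of the tile sizes, and a block size that is a multiple of 32. The first helper computes how many thread blocks give every output tile its own warp; the second replays the kernel's `midx`/`nidx`/`kidx` offset arithmetic with plain loops, so its result can be compared against the GPU output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of thread blocks needed so that each
// m_tile x n_tile output tile is assigned exactly one warp of 32 threads.
int blocks_needed(int M, int N, int m_tile, int n_tile, int threads_per_block) {
    const int warp_size = 32;
    assert(threads_per_block > 0 && threads_per_block % warp_size == 0);
    int warps = (M / m_tile) * (N / n_tile);
    int warps_per_block = threads_per_block / warp_size;
    return (warps + warps_per_block - 1) / warps_per_block; // round up
}

// Hypothetical CPU reference that mirrors the kernel's tiling: every "warp"
// index owns one output tile and accumulates over K in k_tile-sized steps.
// A is M x K, B is K x N, C is M x N, all row-major.
std::vector<float> tiled_matmul_reference(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          int M, int N, int K,
                                          int m_tile, int n_tile, int k_tile) {
    std::vector<float> c(static_cast<std::size_t>(M) * N, 0.0f);
    int ndim = N / n_tile;           // tiles along N
    int kdim = K / k_tile;           // tiles along K
    int warps = (M / m_tile) * ndim; // one warp per output tile
    for (int idx = 0; idx < warps; ++idx) {
        int nidx = idx % ndim;       // tile column
        int midx = idx / ndim;       // tile row
        float* c_tile = c.data() + midx * m_tile * N + nidx * n_tile;
        for (int kidx = 0; kidx < kdim; ++kidx) {
            const float* a_tile = a.data() + midx * m_tile * K + kidx * k_tile;
            const float* b_tile = b.data() + kidx * k_tile * N + nidx * n_tile;
            // Stands in for load_matrix_sync + mma_sync on one tile triple.
            for (int i = 0; i < m_tile; ++i)
                for (int j = 0; j < n_tile; ++j)
                    for (int k = 0; k < k_tile; ++k)
                        c_tile[i * N + j] += a_tile[i * K + k] * b_tile[k * N + j];
        }
    }
    return c;
}
```

With 16x16x16 tiles and 128 threads per block, a 64x64 output has 16 tiles, so `blocks_needed(64, 64, 16, 16, 128)` yields 4 blocks. The kernel would then be launched with that many blocks of 128 threads each.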

Summary
