Many building blocks of 3D networks, especially point-based ones, have no official implementation in PyTorch, so we have to write them ourselves, for example the FPS (farthest point sampling), grouping and query functions in PointNet++. I had only ever used them, and my modifications stayed at the Python level. This time, let's dig into how to define a custom function and register it with PyTorch so that it can be called like any other op.
There is in fact very detailed official documentation on how to write such a function and plug it into PyTorch. Writing the function itself also requires CUDA programming knowledge, so for now I will only cover the surrounding plumbing and assume the function is already written. The official tutorial explains this well; here I will walk through it using PointNet++. If you want the full details, read the official tutorial first:
https://pytorch.org/tutorials/advanced/cpp_extension.html?highlight=pybind11_module
The official documentation clearly describes two ways to expose a custom CUDA function to PyTorch: one builds it ahead of time into a Python package, the other compiles and loads it on the fly while the program runs.
Personally I find the ahead-of-time build nicer: it produces a Python package that is easy to reuse from other projects.
First, let's look at how the PyTorch-facing interface is set up, assuming the functions themselves are already written.
The PointNet++ version used here is the one at this link:
https://github.com/sshaoshuai/Pointnet2.PyTorch/tree/5a4416f51ceaeba242828cabf39133433336850d
Assume the functions we want to expose are already implemented; in this example they are the xxx.cpp, xxx.cu and xxx.h files under pointnet2/src.
How do we hook them into PyTorch's interface? That is what pointnet2/setup.py does:
# these two imports are boilerplate; setuptools is what turns our custom ops into an installable package
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    # the package name is pointnet2
    name='pointnet2',
    ext_modules=[
        # the extension module is named pointnet2_cuda, i.e. you will import pointnet2_cuda
        # list the xxx.cpp / xxx.cu sources (and their xxx.h headers) that belong to this module
        CUDAExtension('pointnet2_cuda', [
            'src/pointnet2_api.cpp',
            'src/ball_query.cpp',
            'src/ball_query_gpu.cu',
            'src/group_points.cpp',
            'src/group_points_gpu.cu',
            'src/interpolate.cpp',
            'src/interpolate_gpu.cu',
            'src/sampling.cpp',
            'src/sampling_gpu.cu',
        ],
        # the settings below can stay as they are
        extra_compile_args={'cxx': ['-g'],
                            'nvcc': ['-O2']})
    ],
    cmdclass={'build_ext': BuildExtension}
)
Before these functions can be used, we have to run
python setup.py install
which compiles our functions and installs them together as a package. That raises the next question: the package is installed, but through what interface do we call the functions?
That part is defined in pointnet2/src/pointnet2_api.cpp:
#include <torch/serialize/tensor.h>
#include <torch/extension.h>
// include the headers of all the functions we wrote
#include "ball_query_gpu.h"
#include "group_points_gpu.h"
#include "sampling_gpu.h"
#include "interpolate_gpu.h"
// PYBIND11_MODULE is already pulled in by torch/extension.h
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    // the name used to call it from Python: ball_query_wrapper
    // the C++ function it binds to: ball_query_wrapper_fast
    // the string shown by help() in Python: "ball_query_wrapper_fast"
    m.def("ball_query_wrapper", &ball_query_wrapper_fast, "ball_query_wrapper_fast");
    m.def("group_points_wrapper", &group_points_wrapper_fast, "group_points_wrapper_fast");
    m.def("group_points_grad_wrapper", &group_points_grad_wrapper_fast, "group_points_grad_wrapper_fast");
    m.def("gather_points_wrapper", &gather_points_wrapper_fast, "gather_points_wrapper_fast");
    m.def("gather_points_grad_wrapper", &gather_points_grad_wrapper_fast, "gather_points_grad_wrapper_fast");
    m.def("furthest_point_sampling_wrapper", &furthest_point_sampling_wrapper, "furthest_point_sampling_wrapper");
    m.def("three_nn_wrapper", &three_nn_wrapper_fast, "three_nn_wrapper_fast");
    m.def("three_interpolate_wrapper", &three_interpolate_wrapper_fast, "three_interpolate_wrapper_fast");
    m.def("three_interpolate_grad_wrapper", &three_interpolate_grad_wrapper_fast, "three_interpolate_grad_wrapper_fast");
}
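Once the package is installed, a quick sanity check from Python (a minimal sketch, assuming the build above succeeded) is to import the compiled module and look at what got bound:
import pointnet2_cuda
# the names registered in pointnet2_api.cpp show up as module attributes
print([name for name in dir(pointnet2_cuda) if not name.startswith('_')])
help(pointnet2_cuda.ball_query_wrapper)  # prints the docstring passed as the third argument of m.def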
That completes the interface that PyTorch will call. So how is it actually used?
That happens in pointnet2/pointnet2_utils.py; take the gather operation as an example:
import torch
from torch.autograd import Variable
from torch.autograd import Function
import torch.nn as nn
from typing import Tuple

# import the package we compiled above
import pointnet2_cuda as pointnet2


# a custom op is defined by subclassing torch.autograd.Function
class GatherOperation(Function):

    # forward pass; anything saved on ctx is handed to backward later
    @staticmethod
    def forward(ctx, features: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
        """
        :param ctx:
        :param features: (B, C, N)
        :param idx: (B, npoint) index tensor of the features to gather
        :return:
            output: (B, C, npoint)
        """
        assert features.is_contiguous()
        assert idx.is_contiguous()

        B, npoint = idx.size()
        _, C, N = features.size()
        output = torch.cuda.FloatTensor(B, C, npoint)

        # call our compiled CUDA op to do the actual work
        pointnet2.gather_points_wrapper(B, C, N, npoint, features, idx, output)

        # stash whatever backward will need on ctx
        ctx.for_backwards = (idx, C, N)
        return output

    # backward pass: the first argument is ctx, followed by one gradient per output of forward
    @staticmethod
    def backward(ctx, grad_out):
        # retrieve what forward saved on ctx
        idx, C, N = ctx.for_backwards
        B, npoint = idx.size()

        grad_features = Variable(torch.cuda.FloatTensor(B, C, N).zero_())
        grad_out_data = grad_out.data.contiguous()
        pointnet2.gather_points_grad_wrapper(B, C, N, npoint, grad_out_data, idx, grad_features.data)

        # backward must return one value per input of forward (excluding ctx); idx gets no gradient, hence None
        return grad_features, None


# a custom Function is invoked via outputs = xxx.apply(inputs); binding apply to a name here
# lets callers simply write outputs = gather_operation(inputs)
gather_operation = GatherOperation.apply
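A quick usage sketch (the shapes follow the docstring above; the concrete sizes are made up for illustration):
features = torch.rand(2, 64, 1024).cuda()            # (B, C, N)
idx = torch.randint(0, 1024, (2, 128)).int().cuda()  # (B, npoint), int32 indices
gathered = gather_operation(features, idx)           # (B, C, npoint) = (2, 64, 128)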
Now for the second approach, loading the extension at runtime, with PVCNN's code as the example. PVCNN's xxx.cpp, xxx.cu and xxx.h files all live in the modules/functional/src folder.
Following the same order as for the compiled approach: first, how do the xxx.cpp and xxx.cu files become known to PyTorch? That happens in modules/functional/backend.py:
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))
_backend = load(name='_pvcnn_backend',
                extra_cflags=['-O3', '-std=c++17'],
                sources=[os.path.join(_src_path, 'src', f) for f in [
                    'ball_query/ball_query.cpp',
                    'ball_query/ball_query.cu',
                    'grouping/grouping.cpp',
                    'grouping/grouping.cu',
                    'interpolate/neighbor_interpolate.cpp',
                    'interpolate/neighbor_interpolate.cu',
                    'interpolate/trilinear_devox.cpp',
                    'interpolate/trilinear_devox.cu',
                    'sampling/sampling.cpp',
                    'sampling/sampling.cu',
                    'voxelization/vox.cpp',
                    'voxelization/vox.cu',
                    'bindings.cpp',
                ]]
                )

__all__ = ['_backend']
This is essentially boilerplate; only name and sources need to be adapted to your own project.
The next question: PyTorch now knows about these sources, but what interface do we call, i.e. which Python name maps to which C++ function? This is defined exactly as in the compiled approach, in modules/functional/src/bindings.cpp:
#include <pybind11/pybind11.h>
#include "ball_query/ball_query.hpp"
#include "grouping/grouping.hpp"
#include "interpolate/neighbor_interpolate.hpp"
#include "interpolate/trilinear_devox.hpp"
#include "sampling/sampling.hpp"
#include "voxelization/vox.hpp"
PYBIND11_MODULE(_pvcnn_backend, m) {
    m.def("gather_features_forward", &gather_features_forward,
          "Gather Centers' Features forward (CUDA)");
    m.def("gather_features_backward", &gather_features_backward,
          "Gather Centers' Features backward (CUDA)");
    m.def("furthest_point_sampling", &furthest_point_sampling_forward,
          "Furthest Point Sampling (CUDA)");
    m.def("ball_query", &ball_query_forward, "Ball Query (CUDA)");
    m.def("grouping_forward", &grouping_forward,
          "Grouping Features forward (CUDA)");
    m.def("grouping_backward", &grouping_backward,
          "Grouping Features backward (CUDA)");
    m.def("three_nearest_neighbors_interpolate_forward",
          &three_nearest_neighbors_interpolate_forward,
          "3 Nearest Neighbors Interpolate forward (CUDA)");
    m.def("three_nearest_neighbors_interpolate_backward",
          &three_nearest_neighbors_interpolate_backward,
          "3 Nearest Neighbors Interpolate backward (CUDA)");
    m.def("trilinear_devoxelize_forward", &trilinear_devoxelize_forward,
          "Trilinear Devoxelization forward (CUDA)");
    m.def("trilinear_devoxelize_backward", &trilinear_devoxelize_backward,
          "Trilinear Devoxelization backward (CUDA)");
    m.def("avg_voxelize_forward", &avg_voxelize_forward,
          "Voxelization forward with average pooling (CUDA)");
    m.def("avg_voxelize_backward", &avg_voxelize_backward,
          "Voxelization backward (CUDA)");
}
The next question: now that we know how the functions in the xxx.cpp files are reached from Python, how do we wrap them into a PyTorch Function? Again this is done just like in the compiled approach. Take modules/functional/grouping.py as an example:
from torch.autograd import Function

from modules.functional.backend import _backend

__all__ = ['grouping']


class Grouping(Function):

    @staticmethod
    def forward(ctx, features, indices):
        """
        :param ctx:
        :param features: features of points, FloatTensor[B, C, N]
        :param indices: neighbor indices of centers, IntTensor[B, M, U], M is #centers, U is #neighbors
        :return:
            grouped_features: grouped features, FloatTensor[B, C, M, U]
        """
        features = features.contiguous()
        indices = indices.contiguous()
        ctx.save_for_backward(indices)
        ctx.num_points = features.size(-1)
        return _backend.grouping_forward(features, indices)

    @staticmethod
    def backward(ctx, grad_output):
        indices, = ctx.saved_tensors
        grad_features = _backend.grouping_backward(grad_output.contiguous(), indices, ctx.num_points)
        return grad_features, None


grouping = Grouping.apply
With that, it is used exactly like any other torch autograd Function.
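A usage sketch (the shapes follow the docstring; the concrete sizes are made up for illustration):
import torch
from modules.functional.grouping import grouping

features = torch.rand(2, 32, 1024).cuda()                    # FloatTensor[B, C, N]
indices = torch.randint(0, 1024, (2, 256, 16)).int().cuda()  # IntTensor[B, M, U]
grouped = grouping(features, indices)                        # FloatTensor[B, C, M, U] = (2, 32, 256, 16)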
Below are a few passages from the official PyTorch tutorial that I think are particularly worth highlighting.
A wonderful fact about PyTorch’s ATen backend is that it abstracts the computing device you are running on. This means the same code we wrote for CPU can also run on GPU, and individual operations will correspondingly dispatch to GPU-optimized implementations. For certain operations like matrix multiply (like mm or addmm), this is a big win. Let’s take a look at how much performance we gain from running our C++ code with CUDA tensors. No changes to our implementation are required, we simply need to put our tensors in GPU memory from Python, with either adding device=cuda_device argument at creation time or using .to(cuda_device) after creation:
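For instance, the two options mentioned above look like this (a trivial sketch):
import torch

cuda_device = torch.device('cuda')
x = torch.randn(4, 4, device=cuda_device)  # allocate directly on the GPU
y = torch.randn(4, 4).to(cuda_device)      # or move an existing tensor afterwards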
The general strategy for writing a CUDA extension is to first write a C++ file which defines the functions that will be called from Python, and binds those functions to Python with pybind11. Furthermore, this file will also declare functions that are defined in CUDA (.cu) files. The C++ functions will then do some checks and ultimately forward its calls to the CUDA functions. In the CUDA files, we write our actual CUDA kernels. The cpp_extension package will then take care of compiling the C++ sources with a C++ compiler like gcc and the CUDA sources with NVIDIA’s nvcc compiler. This ensures that each compiler takes care of files it knows best to compile. Ultimately, they will be linked into one shared library that is available to us from Python code.
Everything above dealt with registering an already-written function with PyTorch. Now let's look at how the CUDA side itself is written. This part requires CUDA programming knowledge; I won't go deep into CUDA concepts or principles, but only make clear what each piece is doing, which parts can stay as they are, and which parts have to change with your own task.
I'll start with PointNet++ again, then use Faster-RCNN as a cross-check, and finally look at PVCNN.
PointNet++ is the following version:
https://github.com/sshaoshuai/Pointnet2.PyTorch/tree/5a4416f51ceaeba242828cabf39133433336850d
Take the simplest case, farthest point sampling: it just returns indices, involves a single function, and needs no backward pass for gradients. The FPS code lives in pointnet2/src/sampling.cpp, pointnet2/src/sampling_gpu.h and pointnet2/src/sampling_gpu.cu.
First, sampling.cpp. This is the host-side function and the entry point that Python sees:
#include <torch/serialize/tensor.h>
#include <ATen/cuda/CUDAContext.h>
#include <vector>
#include <THC/THC.h>
// include the GPU-side declarations
#include "sampling_gpu.h"

extern THCState *state;

int furthest_point_sampling_wrapper(int b, int n, int m,
    at::Tensor points_tensor, at::Tensor temp_tensor, at::Tensor idx_tensor) {
    /*
    Inputs:
        b: batch size
        n: number of points in the original point cloud
        m: number of points to sample
        points_tensor: the original point cloud, of size b*n*3
        temp_tensor: a scratch buffer, of size b*n
        idx_tensor: the output, storing the sampled indices, of size b*m
    points_tensor, temp_tensor and idx_tensor are all tensors living on the GPU
    */
    const float *points = points_tensor.data<float>();
    float *temp = temp_tensor.data<float>();
    int *idx = idx_tensor.data<int>();

    // get the current CUDA stream; this line can be left unchanged
    cudaStream_t stream = THCState_getCurrentStream(state);
    // call the launcher defined in sampling_gpu.cu
    furthest_point_sampling_kernel_launcher(b, n, m, points, temp, idx, stream);
    return 1;
}
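For completeness, the Python side that feeds this wrapper looks roughly like the following (a sketch modeled on the repo's FurthestPointSampling Function; note that the caller allocates the output index tensor and pre-fills the scratch buffer with a large value):
import torch
from torch.autograd import Function
import pointnet2_cuda as pointnet2

class FurthestPointSampling(Function):
    @staticmethod
    def forward(ctx, xyz: torch.Tensor, npoint: int) -> torch.Tensor:
        # xyz: (B, N, 3) point coordinates; returns (B, npoint) int indices
        assert xyz.is_contiguous()
        B, N, _ = xyz.size()
        output = torch.cuda.IntTensor(B, npoint)         # idx_tensor, filled by the kernel
        temp = torch.cuda.FloatTensor(B, N).fill_(1e10)  # temp_tensor, running minimum distances
        pointnet2.furthest_point_sampling_wrapper(B, N, npoint, xyz, temp, output)
        return output

    @staticmethod
    def backward(ctx, grad_out):
        # sampling is not differentiable, so no gradients are returned
        return None, None

furthest_point_sample = FurthestPointSampling.apply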
Next, let's see what sampling_gpu.cu contains:
__device__ void __update(float *__restrict__ dists, int *__restrict__ dists_i, int idx1, int idx2){
    const float v1 = dists[idx1], v2 = dists[idx2];
    const int i1 = dists_i[idx1], i2 = dists_i[idx2];
    dists[idx1] = max(v1, v2);
    dists_i[idx1] = v2 > v1 ? i2 : i1;
}

template <unsigned int block_size>
__global__ void furthest_point_sampling_kernel(int b, int n, int m,
    const float *__restrict__ dataset, float *__restrict__ temp, int *__restrict__ idxs) {
    // dataset: (B, N, 3)
    // tmp: (B, N)
    // output:
    //      idx: (B, M)

    if (m <= 0) return;
    __shared__ float dists[block_size];
    __shared__ int dists_i[block_size];

    int batch_index = blockIdx.x;
    dataset += batch_index * n * 3;
    temp += batch_index * n;
    idxs += batch_index * m;

    int tid = threadIdx.x;
    const int stride = block_size;

    int old = 0;
    if (threadIdx.x == 0)
        idxs[0] = old;

    __syncthreads();
    for (int j = 1; j < m; j++) {
        int besti = 0;
        float best = -1;
        float x1 = dataset[old * 3 + 0];
        float y1 = dataset[old * 3 + 1];
        float z1 = dataset[old * 3 + 2];
        for (int k = tid; k < n; k += stride) {
            float x2, y2, z2;
            x2 = dataset[k * 3 + 0];
            y2 = dataset[k * 3 + 1];
            z2 = dataset[k * 3 + 2];
            // float mag = (x2 * x2) + (y2 * y2) + (z2 * z2);
            // if (mag <= 1e-3)
            //     continue;

            float d = (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1) + (z2 - z1) * (z2 - z1);
            float d2 = min(d, temp[k]);
            temp[k] = d2;
            besti = d2 > best ? k : besti;
            best = d2 > best ? d2 : best;
        }
        dists[tid] = best;
        dists_i[tid] = besti;
        __syncthreads();

        if (block_size >= 1024) {
            if (tid < 512) {
                __update(dists, dists_i, tid, tid + 512);
            }
            __syncthreads();
        }
        if (block_size >= 512) {
            if (tid < 256) {
                __update(dists, dists_i, tid, tid + 256);
            }
            __syncthreads();
        }
        if (block_size >= 256) {
            if (tid < 128) {
                __update(dists, dists_i, tid, tid + 128);
            }
            __syncthreads();
        }
        if (block_size >= 128) {
            if (tid < 64) {
                __update(dists, dists_i, tid, tid + 64);
            }
            __syncthreads();
        }
        if (block_size >= 64) {
            if (tid < 32) {
                __update(dists, dists_i, tid, tid + 32);
            }
            __syncthreads();
        }
        if (block_size >= 32) {
            if (tid < 16) {
                __update(dists, dists_i, tid, tid + 16);
            }
            __syncthreads();
        }
        if (block_size >= 16) {
            if (tid < 8) {
                __update(dists, dists_i, tid, tid + 8);
            }
            __syncthreads();
        }
        if (block_size >= 8) {
            if (tid < 4) {
                __update(dists, dists_i, tid, tid + 4);
            }
            __syncthreads();
        }
        if (block_size >= 4) {
            if (tid < 2) {
                __update(dists, dists_i, tid, tid + 2);
            }
            __syncthreads();
        }
        if (block_size >= 2) {
            if (tid < 1) {
                __update(dists, dists_i, tid, tid + 1);
            }
            __syncthreads();
        }

        old = dists_i[0];
        if (tid == 0)
            idxs[j] = old;
    }
}
void furthest_point_sampling_kernel_launcher(int b, int n, int m,
    const float *dataset, float *temp, int *idxs, cudaStream_t stream) {
    // dataset: (B, N, 3)
    // tmp: (B, N)
    // output:
    //      idx: (B, M)

    cudaError_t err;
    unsigned int n_threads = opt_n_threads(n);

    switch (n_threads) {
        case 1024:
            furthest_point_sampling_kernel<1024><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 512:
            furthest_point_sampling_kernel<512><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 256:
            furthest_point_sampling_kernel<256><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 128:
            furthest_point_sampling_kernel<128><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 64:
            furthest_point_sampling_kernel<64><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 32:
            furthest_point_sampling_kernel<32><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 16:
            furthest_point_sampling_kernel<16><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 8:
            furthest_point_sampling_kernel<8><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 4:
            furthest_point_sampling_kernel<4><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 2:
            furthest_point_sampling_kernel<2><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 1:
            furthest_point_sampling_kernel<1><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        default:
            furthest_point_sampling_kernel<512><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs);
    }

    err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
}
I have to say, that is long. I won't walk through what every part does: if you know CUDA it is reasonably easy to follow, and if you don't, a short explanation here wouldn't do it justice. The parts I consider most important are, first, the kernel launch statement:
kernel<<<num_block, num_thread, 0, stream>>>(a, b, c)
The last argument inside the <<<>>> is the stream obtained in the cpp wrapper. Second, the <1024>, <512>, ..., <4> written in front of the <<<>>> are not part of the launch syntax; they are ordinary C++ template arguments that instantiate the kernel's block_size template parameter, so the shared-memory arrays inside the kernel can be sized at compile time, which is also why the launcher needs one case per thread count.
To sum up, here is the skeleton to follow when writing such a function yourself.
First, write xxx.cpp:
// the usual torch / THC headers, as in sampling.cpp above
#include <THC/THC.h>
extern THCState *state;
#include "xxx.h"

int xxx(inputs...) {
    ...
    cudaStream_t stream = THCState_getCurrentStream(state);
    xxx_launcher(inputs..., stream);
    return 1;
}
Then declare, in xxx.h, the wrapper and the launcher that cross the cpp/cu boundary:
int xxx(inputs...);
void xxx_launcher(inputs..., cudaStream_t stream);
Finally, define xxx.cu:
#include "xxx.h"

__global__ void xxx_kernel(kernel_inputs...) {
    ...
}

void xxx_launcher(inputs..., cudaStream_t stream) {
    ...
    xxx_kernel<<<n_blocks, n_threads, 0, stream>>>(kernel_inputs...);
    ...
}
To be honest, the includes are the part I am least sure about; when in doubt, including a few more headers than strictly necessary does no harm.
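If you just want to try the whole pipeline end to end, the same official tutorial also offers load_inline, which takes the C++ declarations and the CUDA source as Python strings and builds them on the fly. A minimal sketch (the add_one extension and its kernel are made up for illustration and are not part of any of the repos above):
import torch
from torch.utils.cpp_extension import load_inline

cpp_src = "torch::Tensor add_one(torch::Tensor x);"

cuda_src = """
#include <torch/extension.h>

__global__ void add_one_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

torch::Tensor add_one(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_one_kernel<<<blocks, threads>>>(x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

ext = load_inline(name='add_one_ext', cpp_sources=cpp_src, cuda_sources=cuda_src, functions=['add_one'])
print(ext.add_one(torch.arange(4, dtype=torch.float32, device='cuda')))  # expected: tensor([1., 2., 3., 4.]) on the GPU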
That wraps up PointNet++. Next, let's check the pattern summarized above against the Faster-RCNN code.
Take the version at the following link as the example:
https://github.com/jwyang/faster-rcnn.pytorch
Look at roi_crop: in lib/model/roi_crop/src, roi_crop_cuda.c contains the following:
#include <TH/TH.h>
#include <THC/THC.h>
#include <math.h>
#include "roi_crop_cuda_kernel.h"
#define real float
extern THCState *state;
int BilinearSamplerBHWD_updateOutput_cuda(THCudaTensor *inputImages, THCudaTensor *grids, THCudaTensor *output){
    int success = 0;
    success = BilinearSamplerBHWD_updateOutput_cuda_kernel(THCudaTensor_size(state, output, 1),
            THCudaTensor_size(state, output, 3),
            THCudaTensor_size(state, output, 2),
            THCudaTensor_size(state, output, 0),
            THCudaTensor_size(state, inputImages, 1),
            THCudaTensor_size(state, inputImages, 2),
            THCudaTensor_size(state, inputImages, 3),
            THCudaTensor_size(state, inputImages, 0),
            THCudaTensor_data(state, inputImages),
            THCudaTensor_stride(state, inputImages, 0),
            THCudaTensor_stride(state, inputImages, 1),
            THCudaTensor_stride(state, inputImages, 2),
            THCudaTensor_stride(state, inputImages, 3),
            THCudaTensor_data(state, grids),
            THCudaTensor_stride(state, grids, 0),
            THCudaTensor_stride(state, grids, 3),
            THCudaTensor_stride(state, grids, 1),
            THCudaTensor_stride(state, grids, 2),
            THCudaTensor_data(state, output),
            THCudaTensor_stride(state, output, 0),
            THCudaTensor_stride(state, output, 1),
            THCudaTensor_stride(state, output, 2),
            THCudaTensor_stride(state, output, 3),
            THCState_getCurrentStream(state));

    // check for errors
    if (!success) {
        THError("aborting");
    }
    return 1;
}
As the code above shows, the same state and CUDA stream pattern appears here as well.
And in roi_crop_cuda_kernel.cu we likewise find:
bilinearSamplingFromGrid<<<(output_size + kThreadsPerBlock - 1) / kThreadsPerBlock, kThreadsPerBlock, 0, stream>>>(output_size,...);
that is, the same <<<blocks, threads, 0, stream>>> launch form, with the stream passed as the last argument.
Finally, PVCNN. I won't paste its code here; it follows the same xxx.cpp + xxx.cu layout, with a launcher function in the .cu that is called from the .cpp and a kernel function that does the computation on the GPU. The one difference worth noting is that, as PVCNN shows, passing an explicit stream is not mandatory.