Tensorrt版本:TensorRT-7.2.3.4.Windows10.x86_64.cuda-11.1.cudnn8.1
GPU GTX1650 笔记本显卡
具体代码转到:Github
https://github.com/wz940216/Win10_TensorRT_Pytorch_ONNX.
部署成功之后别忘了点个star啊。
参考链接:https://arleyzhang.github.io/articles/7f4b25ce/
测试模型
torchvision
中的resnet18
,输入[1,3,224,224]
输出FC层换成[1,5]
,五个结果值便于直观对比输出结果的一致性。
主要代码
torch.onnx.export(model, input, ONNX_FILE_PATH,input_names=["input"], output_names=["output"], export_params=True)
pytorch转onnx
cd torch_to_onnx
python torch_to_onnx.py
TensorRT中的模式:
INT8 和 fp16模式
INT8推理仅在具有6.1或7.x计算能力的GPU上可用,并支持在诸如ResNet-50、VGG19和MobileNet等NX模型上进行图像分类。
DLA模式
DLA是NVIDIA推出的用于专做视觉的部件,一般用于开发板 Jetson AGX Xavier ,Xavier板子上有两个DLA,定位是专做常用计算(Conv+激活函数+Pooling+Normalization+Reshape),然后复杂的计算交给Volta GPU做。DLA功耗很低,性能很好。参考https://zhuanlan.zhihu.com/p/71984335
首推Release版本,可直接用,Debug版本配置有些问题。
属性->调试->命令参数->–fp16(根据需求选择–int8模式还是–fp16)
属性->VC++目录->包含目录->
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\include;
D:\TensorRT-7.2.3.4\include;
D:\TensorRT-7.2.3.4\samples\common;
D:\TensorRT-7.2.3.4\samples\common\windows;
D:\opencv\build\include;
D:\opencv\build\include\opencv;
D:\opencv\build\include\opencv2;$(IncludePath)
属性->VC++目录->库目录->
D:\opencv\build\x64\vc15\lib;
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\lib\x64;$(LibraryPath)
属性->C/C+±>附加包含目录->
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\include;%(AdditionalIncludeDirectories)
属性->链接器->输入->(自行删减)
opencv_world342.lib;OpenCL.lib;cudnn_adv_infer64_8.lib;cudnn_ops_train64_8.lib;nppicc.lib;nvinfer.lib;cublas.lib;cudnn_adv_train.lib;cufft.lib;nppidei.lib;nvinfer_plugin.lib;cublasLt.lib;cudnn_adv_train64_8.lib;cufftw.lib;nppif.lib;nvjpeg.lib;cuda.lib;cudnn_cnn_infer.lib;curand.lib;nppig.lib;nvml.lib;cudadevrt.lib;cudnn_cnn_infer64_8.lib;cusolver.lib;nppim.lib;nvonnxparser.lib;cudart.lib;cudnn_cnn_train.lib;cusolverMg.lib;nppist.lib;nvparsers.lib;cudart_static.lib;cudnn_cnn_train64_8.lib;cusparse.lib;nppisu.lib;nvptxcompiler_static.lib;cudnn.lib;cudnn_ops_infer.lib;myelin64_1.lib;nppitc.lib;nvrtc.lib;cudnn64_8.lib;cudnn_ops_infer64_8.lib;nppc.lib;npps.lib;cudnn_adv_infer.lib;cudnn_ops_train.lib;nppial.lib;nvblas.lib;%(AdditionalDependencies)
Pytorch 输出
module output:
tensor([[ 4.7960, -1.9805, 7.9566, 2.4818, -13.3275]], device='cuda:0', grad_fn=
onnx output:
output [[ 4.796035 -1.9805057 7.9566107 2.4817739 -13.327486 ]]
TensorRT输出 --fp16
名称 | 值 | 类型 | |
---|---|---|---|
◢ | output,6 | 0x00000275858da9d0 {4.53114319, -1.66334152, 7.31781292, 2.51477814, -12.7687111, 1.401e-45#DEN} | float[6] |
[0] | 4.53114319 | float | |
[1] | -1.66334152 | float | |
[2] | 7.31781292 | float | |
[3] | 2.51477814 | float | |
[4] | -12.7687111 | float |
TensorRT输出 --int8
这里输出一个warning 提示未使用int8校准
[03/11/2021-15:16:58] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
原因是未指定每一个tensor的量化范围,需要传入--ranges=per_tensor_dynamic_range_file.txt
类似于
gpu_0/data_0: 1.00024
gpu_0/conv1_1: 5.43116
gpu_0/res_conv1_bn_1: 8.69736
gpu_0/res_conv1_bn_2: 8.69736
gpu_0/pool1_1: 8.69736
gpu_0/res2_0_branch2a_1: 12.819
gpu_0/res2_0_branch2a_bn_1: 5.47741
gpu_0/res2_0_branch2a_bn_2: 5.58704
gpu_0/res2_0_branch2b_1: 5.27718
gpu_0/res2_0_branch2b_bn_1: 5.08003
gpu_0/res2_0_branch2b_bn_2: 5.08003
gpu_0/res2_0_branch2c_1: 2.33625
gpu_0/res2_0_branch2c_bn_1: 3.17859
gpu_0/res2_0_branch1_1: 6.10492
未使用校准的int8量化精度损失很大
名称 | 值 | 类型 | |
---|---|---|---|
◢ | output,6 | 0x00000229086eee00 {21.9405861, -2.17396450, 15.4984961, -12.8306799, -22.7700291, 5.86610616e-24} | float[6] |
[0] | 21.9405861 | float | |
[1] | -2.17396450 | float | |
[2] | 15.4984961 | float | |
[3] | -12.8306799 | float | |
[4] | -22.7700291 | float |
pytorch
torch.cuda.synchronize()
start = time.time()
for i in range(1000):
output = model(input)
torch.cuda.synchronize()
print("pytorch time used ",(time.time()-start)/1000)
time used 4.887054920196534 ms
TensorRT
CSpendTime time;
time.Start();
for (int i = 0; i < 1000; i++)
{
bool status = context->executeV2(buffers.getDeviceBindings().data());
}
double dTime = time.End();
printf("time used %.8f\n", dTime/1000);
TensorRT int8
第一次 time used 0.99424440 ms
第二次 time used 0.98159920 ms
TensorRT fp16
第一次 time used 0.97576060 ms
第二次 time used 0.97672040 ms
加速了五倍左右。TensorRT还是很好用的,希望对大家有帮助。
最后 int8模式比fp16还慢一点,很可能是我没有做精度校准,GPU1650不支持int8量化。有问题欢迎留言讨论。