Because of their computational complexity and parameter redundancy, deep learning models are hard to deploy in certain scenarios and on certain devices; model compression, inference optimization/acceleration, and heterogeneous computing are needed to break through these bottlenecks.
TensorRT is NVIDIA's optimization and acceleration toolkit for deep learning inference. Its working principle is illustrated in the figure below; see [3] [4] for details:
TensorRT can optimize and rebuild deep learning models trained with different frameworks:
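For reference, the sketch below shows one common way to hand a model to TensorRT: export it to ONNX first and let the ONNX parser rebuild the network before the builder optimizes it. This is only a minimal sketch; the API shown is the TensorRT 7.x-era Python binding (newer than the 4.x/5.x releases benchmarked below, and the interface differs slightly across versions), and `resnet101.onnx` is a placeholder path:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, fp16=False, int8=False):
    """Parse an ONNX model and build an optimized TensorRT engine."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30           # 1 GB of builder scratch space
    if fp16 and builder.platform_has_fast_fp16:   # pays off on Tensor Core GPUs such as the 2080ti
        config.set_flag(trt.BuilderFlag.FP16)
    if int8 and builder.platform_has_fast_int8:
        config.set_flag(trt.BuilderFlag.INT8)
        # config.int8_calibrator = ...            # an IInt8EntropyCalibrator2 fed with real input batches
    return builder.build_engine(network, config)

engine = build_engine("resnet101.onnx", fp16=True)   # placeholder ONNX file
```

The fp16/int8 builder flags correspond to the precision columns in the benchmark tables below.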
On a 1080ti, with TensorRT 4.0.1, the optimization/acceleration results for Resnet101-v2 are as follows:
| Network | Precision | Framework / GPU: 1080ti (Pascal) | Avg. Time (Batch=8, unit: ms) | Top1 Val. Acc. (ImageNet-1k) |
| --- | --- | --- | --- | --- |
| Resnet101 | fp32 | TensorFlow | 36.7 | 0.7612 |
| Resnet101 | fp32 | MXnet | 25.8 | 0.7612 |
| Resnet101 | fp32 | TRT4.0.1 | 19.3 | 0.7612 |
| Resnet101 | int8 | TRT4.0.1 | 9 | 0.7574 |
On 1080ti and 2080ti, with TensorRT 5.1.5, the float16 acceleration results for Resnet101-v1d are as follows (the 2080ti has Tensor Cores, so its float16 speedup is much more pronounced):
| Network | Platform | Precision | Batch=8 | Batch=4 | Batch=2 | Batch=1 |
| --- | --- | --- | --- | --- | --- | --- |
| Resnet101-v1d | 1080ti | float32 | 19.4 ms | 12.4 ms | 8.4 ms | 7.4 ms |
| Resnet101-v1d | 1080ti | float16 | 28.2 ms | 16.9 ms | 10.9 ms | 8.1 ms |
| Resnet101-v1d | 1080ti | int8 | 8.1 ms | 6.7 ms | 4.6 ms | 4 ms |
| Resnet101-v1d | 2080ti | float32 | 16.6 ms | 10.8 ms | 8.0 ms | 7.2 ms |
| Resnet101-v1d | 2080ti | float16 | 14.6 ms | 9.6 ms | 5.5 ms | 4.3 ms |
| Resnet101-v1d | 2080ti | int8 | 7.2 ms | 3.8 ms | 3.0 ms | 2.6 ms |
Because of their sparsity or tendency to over-fit, deep learning models can be pruned into structurally leaner networks; pruning methods fall into structured pruning and unstructured pruning:
Taking Channel Pruning as an example, the regular, structured pruning operation is shown in the figure below; the pruned model remains compatible with existing, mature deep learning frameworks:
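To illustrate what a structured (channel-level) prune does to a layer, here is a minimal PyTorch sketch that keeps the output filters with the largest L1 norm and rebuilds a thinner Conv2d. The magnitude criterion and the function name are purely illustrative; the LASSO-based channel selection of [7] and the BN-scaling-factor criterion of Network Slimming [10] are more elaborate, but all of them produce the same kind of regular, framework-compatible layer:

```python
import torch
import torch.nn as nn

def prune_conv_out_channels(conv: nn.Conv2d, keep_ratio: float = 0.5):
    """Keep the output channels whose filters have the largest L1 norm."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per output filter
    keep_idx = importance.argsort(descending=True)[:n_keep]
    keep_idx, _ = keep_idx.sort()                               # preserve original channel order

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep_idx])
    # keep_idx must also be used to shrink the *input* channels of the next layer
    # (and the matching BatchNorm), which is omitted here for brevity.
    return pruned, keep_idx

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
thin_conv, kept = prune_conv_out_channels(conv, keep_ratio=0.5)
print(thin_conv)   # Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
```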
Model quantization means that weights or activation outputs are clustered onto a set of discrete, reduced-precision values; it usually relies on support from a specific algorithm library or hardware platform:
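A minimal sketch of the idea, using per-tensor symmetric linear quantization to int8 (the simplest possible scheme, standing in for the calibration-based methods that libraries such as TensorRT [21] actually use):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric linear quantization of a float32 tensor to int8.

    Maps the range [-max|w|, +max|w|] onto [-127, 127] with a single
    per-tensor scale; real libraries typically use per-channel scales
    and calibration data to pick better ranges.
    """
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)  # avoid division by zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max abs quantization error:", np.abs(w - w_hat).max())
```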
If model compression causes a noticeable loss of inference accuracy, the loss can be recovered by fine-tuning, combined with suitable training tricks such as Label Smoothing, Mix-up, Knowledge Distillation, and Focal Loss. Moreover, compression and acceleration strategies can be applied jointly to obtain more aggressive compression and speedup ratios. For example, combining Network Slimming with TensorRT int8 optimization on a 1080ti (Pascal): with Resnet101-v1d compressed by 1.4x (Size = 170MB -> 121MB, FLOPS = 16.14G -> 11.01G), inference after TensorRT int8 quantization takes only 7.4ms (Batch size = 8).
Discussions related to Knowledge Distillation can be found at:
https://blog.csdn.net/nature553863/article/details/80568658
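For completeness, here is a minimal PyTorch sketch of the classic soft-target distillation loss that can be added to the fine-tuning loop of a compressed student model; the temperature `T` and weight `alpha` are illustrative hyper-parameters, not values from this article:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.7):
    """Weighted sum of the KL divergence to the temperature-softened teacher
    distribution and the ordinary cross-entropy to the hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # T^2 keeps gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage inside the fine-tuning loop of a pruned/quantized student model:
# loss = distillation_loss(student(x), teacher(x).detach(), y)
```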
[1] https://arxiv.org/abs/1801.02108, Github: https://github.com/uber/sbnet
[2] https://basicmi.github.io/Deep-Learning-Processor-List/
[3] https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
[4] https://devblogs.nvidia.com/int8-inference-autonomous-vehicles-tensorrt/
[5] https://arxiv.org/abs/1510.00149
[6] https://arxiv.org/abs/1802.06367, https://ai.intel.com/winograd-2/, Github: https://github.com/xingyul/Sparse-Winograd-CNN
[7] https://arxiv.org/abs/1707.06168, Github: https://github.com/yihui-he/channel-pruning
[8] https://arxiv.org/abs/1707.06342
[9] https://arxiv.org/abs/1810.11809, Github: https://github.com/Tencent/PocketFlow
[10] https://arxiv.org/abs/1708.06519, Github: https://github.com/foolwood/pytorch-slimming
[11] https://arxiv.org/abs/1611.06440, Github: https://github.com/jacobgil/pytorch-pruning
[12] http://xuanyidong.com/publication/ijcai-2018-sfp/
[13] https://arxiv.org/abs/1603.05279, Github: https://github.com/ayush29feb/Sketch-A-XNORNet, https://github.com/jiecaoyu/XNOR-Net-PyTorch
[14] https://arxiv.org/abs/1711.11294, Github: https://github.com/layog/Accurate-Binary-Convolution-Network
[15] https://arxiv.org/abs/1708.08687
[16] https://arxiv.org/abs/1808.00278, Github: https://github.com/liuzechun/Bi-Real-net
[17] https://arxiv.org/abs/1605.04711
[18] https://arxiv.org/abs/1612.01064, Github: https://github.com/czhu95/ternarynet
[19] http://phwl.org/papers/syq_cvpr18.pdf, Github: https://github.com/julianfaraone/SYQ
[20] https://arxiv.org/abs/1712.05877
[21] http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
[22] https://arxiv.org/abs/1702.03044
[23] https://papers.nips.cc/paper/6390-cnnpack-packing-convolutional-neural-networks-in-the-frequency-domain
[24] https://blog.csdn.net/nature553863/article/details/97631176
[25] https://blog.csdn.net/nature553863/article/details/96857133
[26] https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer
[27] https://github.com/onnx/onnx-tensorrt
[28] https://blog.csdn.net/nature553863/article/details/97760040