Project page: NVIDIA TensorRT


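TensorRT (formerly GIE, the GPU Inference Engine) is a C++ library for inference on the Jetson TX1 and on discrete NVIDIA GPUs such as the Tesla P100, K80, M4 and Titan X. (Of these, the P100 and the 2016 Titan X are Pascal parts; the K80 is Kepler and the M4 is Maxwell.) It supports fp16, i.e. half-precision arithmetic. By trading precision for speed, it accelerates inference noticeably with no obvious loss of accuracy, often roughly doubling performance, and it can also consume Caffe models. There is still very little written about TensorRT online, so I will try to cover some of it here and add more when I have time.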
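To get a feel for what the precision trade looks like, here is a minimal sketch that uses only the Python standard library, whose struct module can pack IEEE 754 half-precision values with the 'e' format code. It is not TensorRT code; the helper name to_fp16 and the sample values are made up for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (fp16) value."""
    # 'e' is the struct format code for a 2-byte half-precision float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Illustrative values only, not weights from a real model.
samples = [0.1, 1.0, 3.14159, 1000.123]

for value in samples:
    rounded = to_fp16(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value:>10.5f} -> {rounded:<14.10f} relative error {rel_err:.2e}")
```

fp16 keeps an 11-bit significand, so for values inside its normal range the relative rounding error stays at or below 2^-11 (about 4.9e-4). Values above 65504 do not fit at all, and struct.pack raises OverflowError for them.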


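TensorRT is currently built with gcc 4.8 and is independent of any deep learning framework. For Caffe, TensorRT converts the Caffe model and then runs it on its own, without Caffe. The tool that parses Caffe models is called NvCaffeParser; it takes the prototxt file and the caffemodel weights and converts them into a new model that supports half precision.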

TensorRT currently supports most of the commonly used Caffe layers, including:

  • Convolution, with or without bias. Currently only 2D convolutions (i.e. 4D input and output tensors) are supported. Note: the operation this layer performs is actually a correlation, which matters if you are formatting weights to import via GIE’s API rather than the Caffe parser library.
  • Activation: ReLU, tanh and sigmoid.
  • Pooling: max and average.
  • Scale: per-tensor, per-channel or per-weight affine transformation and exponentiation by constant values. Batch Normalization can be implemented using the Scale layer.
  • ElementWise: sum, product or max of two tensors.
  • LRN (local response normalization): cross-channel only.
  • Fully-connected, with or without bias.
  • SoftMax: cross-channel only.
  • Deconvolution, with and without bias.
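The note on the Convolution layer is easy to miss, so here is a small pure-Python sketch of the difference. Correlation slides the kernel as stored, while a true convolution flips it along both spatial axes first. The function names and the tiny 3x3 input are illustrative assumptions, and nothing here calls TensorRT.

```python
def correlate2d(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel without flipping it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def flip(kernel):
    """Reverse the kernel along both spatial axes."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """True 2D convolution, expressed as correlation with a flipped kernel."""
    return correlate2d(image, flip(kernel))


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]  # deliberately asymmetric under a flip

print("correlation:", correlate2d(image, kernel))
print("convolution:", convolve2d(image, kernel))
```

If weights come from a tool that defines the layer as a true convolution, flipping each kernel once before import gives the same result under a correlation-based layer.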


Caffe features that are not currently supported include:

  • Deconvolution groups
  • PReLU
  • Scale, other than per-channel scaling
  • EltWise with more than two inputs
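The Scale entry says Batch Normalization can be implemented using the Scale layer, and per-channel scaling is the form that is supported. The sketch below shows the arithmetic behind that: at inference time, batch norm with fixed statistics collapses into one multiply and one add per channel. The function names, the eps default and the statistics are assumptions chosen for illustration, not values read from any Caffe model.

```python
import math


def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Return per-channel (scale, shift) so that scale*x + shift == BN(x)."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Reference inference-time batch norm, one value per channel."""
    return [
        g * (xi - m) / math.sqrt(v + eps) + b
        for xi, g, b, m, v in zip(x, gamma, beta, mean, var)
    ]


# Made-up statistics for a 3-channel layer.
gamma = [1.0, 0.5, 2.0]
beta = [0.0, 0.1, -0.3]
mean = [0.2, -1.0, 4.0]
var = [1.0, 0.25, 9.0]

scale, shift = fold_batchnorm(gamma, beta, mean, var)

x = [0.7, -0.4, 5.5]
folded = [s * xi + t for s, xi, t in zip(scale, x, shift)]
reference = batchnorm(x, gamma, beta, mean, var)
print("folded:   ", folded)
print("reference:", reference)
```

The two printed lists should agree up to floating-point rounding, since scale = gamma / sqrt(var + eps) and shift = beta - mean * scale are just the batch-norm formula rearranged.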


Using TensorRT involves two phases, build and execution:

  • In the build phase, the toolkit takes a network definition, performs optimizations, and generates the inference engine.
  • In the execution phase, the engine runs inference tasks using input and output buffers on the GPU.
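The build/execution split can be pictured with a toy analogy. This is not TensorRT's API and not the optimizations it actually performs; the Affine and Engine classes and the build function are hypothetical names invented for this sketch. The "network" is a chain of affine layers, the build step fuses them once, and the execute step only runs the fused result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affine:
    """One layer of the toy network: y = scale * x + shift."""
    scale: float
    shift: float


@dataclass(frozen=True)
class Engine:
    """Result of the toy build phase: a single fused affine op."""
    scale: float
    shift: float

    def execute(self, inputs):
        # Execution phase: no graph walking, just the pre-built op.
        return [self.scale * x + self.shift for x in inputs]


def build(network):
    """Toy build phase: fuse a chain of affine layers into one."""
    scale, shift = 1.0, 0.0
    for layer in network:
        # layer(prev(x)) = layer.scale * (scale * x + shift) + layer.shift
        scale, shift = layer.scale * scale, layer.scale * shift + layer.shift
    return Engine(scale, shift)


network = [Affine(2.0, 1.0), Affine(0.5, -3.0), Affine(4.0, 0.0)]
engine = build(network)                  # done once, ahead of time
print(engine.execute([0.0, 1.0, 2.0]))   # done for every batch
```

The point of the split is that the expensive work happens once in build, so the per-batch path stays as short as possible.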

Reference: Production Deep Learning with NVIDIA GPU Inference Engine




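First, a Jetson TX1 gets TensorRT support automatically through a full install of JetPack 2.3.1 (see my earlier tutorial). After flashing, the TX1 already has the C++ runtime libraries that TensorRT needs, so once you know the API you can write a C++ program to use it.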

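If you have no TX1 but do have a Pascal GPU (such as a TITAN X), you can still try TensorRT by applying for test access on the NVIDIA TensorRT website. You need to describe your research purpose in detail, and approval usually comes after one or two rounds of email. I have been granted test access to TensorRT 1.0 and 2.0, and will test TensorRT on a TITAN X when I get the chance.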


Here I will use the Jetson TX1 as the example and look at how the bundled samples run. The samples live in /usr/src/gie_samples/samples.

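In that folder, data holds the model description files and weights for LeNet and GoogLeNet, giexec holds the source for TensorRT's general-purpose command-line tool, and the remaining folders hold the source for network-specific samples. The Makefile drives the build: open a terminal in /usr/src/gie_samples/samples and run sudo make to compile everything. The resulting executables are placed in the bin folder. Among them, giexec accepts the following parameters: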

cd /usr/src/gie_samples/samples

Mandatory params:
  --model=<file>       Caffe model file
  --deploy=<file>      Caffe deploy file
  --output=<name>      Output blob name (can be specified multiple times)
Optional params:
  --batch=N            Set batch size (default = 1)
  --device=N           Set cuda device to N (default = 0)
  --iterations=N       Run N iterations (default = 10)
  --avgRuns=N          Set avgRuns to N - perf is measured as an average of avgRuns (default=10)
  --workspace=N        Set workspace size in megabytes (default = 16)
  --half2              Run in paired fp16 mode - default = false
  --verbose            Use verbose logging - default = false
  --hostTime           Measure host time rather than GPU time - default = false
  --engine=<file>      Generate a serialized GIE engine
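When running several configurations, it can be convenient to assemble the giexec command line in a script. The sketch below builds an argument list from the flags in the usage text above; the helper name giexec_command and its keyword arguments are my own invention, and only the flags themselves come from giexec.

```python
import shlex


def giexec_command(model, deploy, outputs, batch=1, half2=False,
                   binary="./bin/giexec"):
    """Assemble a giexec argument list from the flags in its usage text."""
    args = [binary, f"--model={model}", f"--deploy={deploy}"]
    # --output may be given several times, once per output blob.
    args += [f"--output={name}" for name in outputs]
    if half2:
        args.append("--half2")
    if batch != 1:  # 1 is the documented default
        args.append(f"--batch={batch}")
    return args


data = "/usr/src/gie_samples/samples/data/samples/mnist"
cmd = giexec_command(
    f"{data}/mnist.caffemodel",
    f"{data}/mnist.prototxt",
    ["prob"],
    batch=12,
    half2=True,
)
print(shlex.join(cmd))
```

On a TX1 with the samples built, the list could be handed to subprocess.run(cmd, check=True) from /usr/src/gie_samples/samples; on any other machine the script only prints the command.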



Running the MNIST model in half-precision mode (--half2) with a batch size of 12:

cd /usr/src/gie_samples/samples
./bin/giexec --model=/usr/src/gie_samples/samples/data/samples/mnist/mnist.caffemodel --deploy=/usr/src/gie_samples/samples/data/samples/mnist/mnist.prototxt --output=prob --half2=true --batch=12


model: /usr/src/gie_samples/samples/data/samples/mnist/mnist.caffemodel
deploy: /usr/src/gie_samples/samples/data/samples/mnist/mnist.prototxt
output: prob
batch: 12
Average over 10 runs is 1.1353 ms.
Average over 10 runs is 1.1563 ms.
Average over 10 runs is 1.25929 ms.
Average over 10 runs is 1.28759 ms.
Average over 10 runs is 1.16477 ms.
Average over 10 runs is 1.17869 ms.
Average over 10 runs is 1.27358 ms.
Average over 10 runs is 1.14625 ms.
Average over 10 runs is 1.15732 ms.
Average over 10 runs is 1.18451 ms.


The same run without --half2, i.e. in the default full-precision mode:

cd /usr/src/gie_samples/samples
./bin/giexec --model=/usr/src/gie_samples/samples/data/samples/mnist/mnist.caffemodel --deploy=/usr/src/gie_samples/samples/data/samples/mnist/mnist.prototxt --output=prob --batch=12


model: /usr/src/gie_samples/samples/data/samples/mnist/mnist.caffemodel
deploy: /usr/src/gie_samples/samples/data/samples/mnist/mnist.prototxt
output: prob
batch: 12
Average over 10 runs is 1.68441 ms.
Average over 10 runs is 1.72155 ms.
Average over 10 runs is 1.70917 ms.
Average over 10 runs is 1.60902 ms.
Average over 10 runs is 1.67406 ms.
Average over 10 runs is 1.59474 ms.
Average over 10 runs is 1.73178 ms.
Average over 10 runs is 1.56204 ms.
Average over 10 runs is 1.68201 ms.
Average over 10 runs is 1.62693 ms.
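To turn the two logs into a single comparison, the sketch below parses the "Average over 10 runs" lines, averages them, and derives throughput and speedup. The timings are copied from the output above; the function name parse_ms and the variable names are assumptions made for this example.

```python
import re
from statistics import mean

FP16_LOG = """\
Average over 10 runs is 1.1353 ms.
Average over 10 runs is 1.1563 ms.
Average over 10 runs is 1.25929 ms.
Average over 10 runs is 1.28759 ms.
Average over 10 runs is 1.16477 ms.
Average over 10 runs is 1.17869 ms.
Average over 10 runs is 1.27358 ms.
Average over 10 runs is 1.14625 ms.
Average over 10 runs is 1.15732 ms.
Average over 10 runs is 1.18451 ms.
"""

FP32_LOG = """\
Average over 10 runs is 1.68441 ms.
Average over 10 runs is 1.72155 ms.
Average over 10 runs is 1.70917 ms.
Average over 10 runs is 1.60902 ms.
Average over 10 runs is 1.67406 ms.
Average over 10 runs is 1.59474 ms.
Average over 10 runs is 1.73178 ms.
Average over 10 runs is 1.56204 ms.
Average over 10 runs is 1.68201 ms.
Average over 10 runs is 1.62693 ms.
"""

BATCH = 12  # matches --batch=12 in both runs


def parse_ms(log):
    """Extract the per-line average latencies (in ms) from giexec output."""
    pattern = r"Average over \d+ runs is ([\d.]+) ms"
    return [float(value) for value in re.findall(pattern, log)]


fp16_ms = mean(parse_ms(FP16_LOG))
fp32_ms = mean(parse_ms(FP32_LOG))

for name, ms in (("fp16", fp16_ms), ("fp32", fp32_ms)):
    # One batch of BATCH images takes `ms` milliseconds.
    print(f"{name}: {ms:.3f} ms per batch, {BATCH / ms * 1000:.0f} images/s")
print(f"speedup from --half2: {fp32_ms / fp16_ms:.2f}x")
```

On these numbers the mean is about 1.19 ms per batch with --half2 and about 1.66 ms without, a speedup of roughly 1.39x. For this small MNIST network the gain is clear but below the doubling mentioned earlier.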


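If you have a Jetson TX1, open /usr/share/doc/gie/doc/API/index.html to read the official API documentation. I have uploaded it to CSDN together with the sample source code for anyone interested. (The API documentation does not seem to render well away from the TX1.)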



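NVIDIA has, however, started to take object detection seriously. From my email exchanges with NVIDIA engineers I learned that future versions of TensorRT will support Faster R-CNN and SSD, both already in development. Once that lands, I expect object detection on the Jetson TX1 at more than 10 fps to be within reach. The relevant part of their reply: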

You mentioned the SSD Single Shot Detector. We are working on SSD right now. I think you’ll find there are some layers needed for SSD that aren’t supported in the versions of TensorRT available through these early release programs. We are adding a custom layer capability to the next version and providing support using that mechanism for Faster R-CNN and SSD.
