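In inference mode, the default YOLOv5 detect layer outputs a tensor of shape [batches, 25200, 85]. When the model is deployed on Nvidia Triton, an output tensor this large makes output processing slower and causes requests to pile up in the queue. The effect is worse when the client and the Triton Server are not on the same machine and shared memory cannot be used: the data also has to be sent to the client over the network, which takes even longer and hurts the performance of the inference service.
Related code link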
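To make the size of the default output concrete, the following sketch recomputes the 25200 rows and the resulting payload per image. It assumes a 640x640 input, the usual three detect heads with strides 8, 16 and 32, three anchors per cell, 80 classes, and an FP32 output tensor; none of these constants are read from the model itself.

```python
# Sketch: size of the default YOLOv5 detect output for one 640x640 image.
STRIDES = (8, 16, 32)      # assumption: three detect heads
ANCHORS_PER_CELL = 3       # assumption: three anchors per grid cell
NUM_CLASSES = 80           # assumption: COCO classes, so 85 = 4 + 1 + 80
BYTES_PER_FP32 = 4         # assumption: the output tensor is FP32


def num_predictions(img_size: int = 640) -> int:
    """Total number of predicted boxes across all detect heads."""
    return sum(ANCHORS_PER_CELL * (img_size // s) ** 2 for s in STRIDES)


def output_bytes(batch: int, img_size: int = 640) -> int:
    """Bytes occupied by a [batch, num_predictions, 85] FP32 tensor."""
    values_per_row = 4 + 1 + NUM_CLASSES
    return batch * num_predictions(img_size) * values_per_row * BYTES_PER_FP32


print(num_predictions())   # 25200
print(output_bytes(1))     # 8568000
```

Under these assumptions every request returns about 8.6 MB per image, which is what has to be copied out of the GPU and sent over the network.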
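The model is converted to a TensorRT engine and deployed on Triton Inference Server with a single instance group (count 1) of kind GPU. Performance is then measured from another machine with the perf_analyzer tool provided by Triton.
Convert ONNX to a TensorRT engine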
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov5s.onnx \
--minShapes=images:1x3x640x640 \
--optShapes=images:8x3x640x640 \
--maxShapes=images:32x3x640x640 \
--workspace=4096 \
--saveEngine=yolov5s.engine \
--shapes=images:1x3x640x640 \
--verbose \
--fp16 \
> result-FP16.txt
Deploy on Triton Inference Server
Upload the model to the model repository path configured for the Triton server, and write the model serving configuration.
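As an illustration of the serving configuration, here is a minimal sketch that lays out a model repository entry and writes a `config.pbtxt` for the setup described above (TensorRT plan, one GPU instance). The model name `yolov5`, the output tensor name `output`, and the default output dims `[25200, 85]` are assumptions for this example; `max_batch_size` of 32 follows the `--maxShapes` value used with trtexec. The engine file itself would be copied into the version directory as `model.plan`.

```python
import tempfile
from pathlib import Path

# Hypothetical config template; tensor and model names are assumptions.
CONFIG_TEMPLATE = """name: "{name}"
platform: "tensorrt_plan"
max_batch_size: {max_batch}
input [
  {{ name: "images", data_type: TYPE_FP32, dims: [ 3, 640, 640 ] }}
]
output [
  {{ name: "{output_name}", data_type: TYPE_FP32, dims: [ {rows}, {cols} ] }}
]
instance_group [
  {{ count: 1, kind: KIND_GPU }}
]
"""


def write_model_repo(root, name="yolov5", output_name="output",
                     max_batch=32, out_dims=(25200, 85)):
    """Create <root>/<name>/1/ and write <root>/<name>/config.pbtxt."""
    version_dir = Path(root) / name / "1"   # engine goes here as model.plan
    version_dir.mkdir(parents=True, exist_ok=True)
    config_path = version_dir.parent / "config.pbtxt"
    config_path.write_text(CONFIG_TEMPLATE.format(
        name=name, max_batch=max_batch, output_name=output_name,
        rows=out_dims[0], cols=out_dims[1]))
    return config_path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    config_text = write_model_repo(tmp).read_text()
print(config_text)
```

For a model with a reduced detect output, `out_dims` would be changed to match the new output shape.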
python --input_images <image_path> --output_file <real_data>.json
perf_analyzer -m <triton_model_name> -b 1 --input-data <real_data>.json --concurrency-range 1:10 --measurement-interval 10000 -u <triton server endpoint> -i gRPC -f <triton_model_name>.csv
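The file passed to `--input-data` is a JSON document with a top-level `data` list, where each entry maps an input name to its flattened `content` and its `shape`. The sketch below builds such a document from already decoded images. The helper names are hypothetical, and the preprocessing (scaling to 0..1, HWC to CHW, no letterboxing or channel swap) is an assumption that has to match whatever the model was trained with.

```python
import json


def hwc_uint8_to_chw_float(image):
    """image: nested list [H][W][3] of 0..255 ints -> flat CHW list in 0..1."""
    h, w = len(image), len(image[0])
    return [image[y][x][c] / 255.0
            for c in range(3) for y in range(h) for x in range(w)]


def build_input_data(images, input_name="images"):
    """Build the perf_analyzer real-data document for a list of images."""
    entries = []
    for img in images:
        h, w = len(img), len(img[0])
        entries.append({
            input_name: {
                "content": hwc_uint8_to_chw_float(img),
                "shape": [3, h, w],   # per-request shape, without the batch dim
            }
        })
    return {"data": entries}


# Tiny 1x2 "image" just to show the structure; real inputs would be 640x640.
tiny = [[[0, 128, 255], [255, 255, 255]]]
doc = build_input_data([tiny])
print(json.dumps(doc))
```

Writing `doc` with `json.dump` to `<real_data>.json` gives a file in the shape perf_analyzer expects.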
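The table below shows the performance test results for a YOLOv5 TensorRT engine with the default detect layer deployed on Triton. With the default detect layer, a large share of the time is spent on queue backlog (Server Queue) and on processing the output data (Server Compute Output), and throughput does not even reach 1 infer/sec.
Except for throughput, all metrics are in microseconds (us). Client Send and Client Recv are the times gRPC spends serializing and deserializing data, respectively.

Concurrency | Inferences/Second | Client Send | Network+Server Send/Recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | p90 latency |
--- | --- | --- | --- | --- | --- | --- | --- | --- |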
1 | 0.7 | 1683 | 1517232 | 466 | 8003 | 4412 | 9311 | 1592936 |
2 | 0.8 | 1464 | 1514475 | 393 | 10659 | 4616 | 956736 | 2583025 |
3 | 0.7 | 2613 | 1485868 | 1013992 | 7370 | 4396 | 1268070 | 3879331 |
4 | 0.7 | 2268 | 1463386 | 2230040 | 9933 | 5734 | 1250245 | 4983687 |
5 | 0.6 | 2064 | 1540583 | 3512025 | 11057 | 4843 | 1226058 | 6512305 |
6 | 0.6 | 2819 | 1573869 | 4802885 | 10134 | 4320 | 1234644 | 7888080 |
7 | 0.5 | 1664 | 1507386 | 6007235 | 11197 | 4899 | 1244482 | 8854777 |
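One fix, therefore, is to slim down the output: coarsely filter the bboxes by conf before they are passed to NMS and, following how tensorrtx handles the detect layer, change the output to a tensor of shape [batches, num_bboxes, 6], where num_bboxes=1000 and 6 = [cx,cy,w,h,conf,cls_id], with conf = obj_conf * cls_prob.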
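The reduction from 85 values per row to 6 can be illustrated in plain Python for a single image. This is only a reference sketch of the idea (keep the rows with the highest objectness, then collapse the class scores into one confidence and one class id); the function name and the toy rows with two classes are made up for the example.

```python
import heapq


def reduce_predictions(rows, k=1000):
    """rows: list of [cx, cy, w, h, obj_conf, cls_prob_0, ...] for one image.

    Returns up to k rows of [cx, cy, w, h, conf, cls_id], ordered by
    descending objectness, with conf = obj_conf * best class probability.
    """
    top = heapq.nlargest(k, rows, key=lambda r: r[4])
    reduced = []
    for r in top:
        cls_probs = r[5:]
        cls_id = max(range(len(cls_probs)), key=cls_probs.__getitem__)
        conf = r[4] * cls_probs[cls_id]
        reduced.append([r[0], r[1], r[2], r[3], conf, float(cls_id)])
    return reduced


# Toy input: three boxes, two classes.
rows = [
    [10, 10, 4, 4, 0.9, 0.1, 0.8],
    [20, 20, 6, 6, 0.2, 0.7, 0.3],
    [30, 30, 8, 8, 0.6, 0.5, 0.4],
]
reduced = reduce_predictions(rows, k=2)
print(reduced)
```

With k=2 the low-objectness second row is dropped and each remaining row shrinks to six values.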
git clone -b v6.1
# Modified Detect.forward in models/yolo.py (yolov5 v6.1)
def forward(self, x):
    z = []  # inference output
    for i in range(self.nl):
        x[i] = self.m[i](x[i])  # conv
        bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
        x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

        if not self.training:  # inference
            if self.onnx_dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

            y = x[i].sigmoid()
            if self.inplace:
                y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
                y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
            else:  # for YOLOv5 on AWS Inferentia
                xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
                wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                y = torch.cat((xy, wh, y[..., 4:]), -1)
            z.append(y.view(bs, -1, self.no))

    # z is empty in training mode, so return before torch.cat(z, 1) is reached
    if self.training:
        return x

    # custom output >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    # [bs, 25200, 85]
    origin_output = torch.cat(z, 1)
    output_bboxes_nums = 1000
    # operator argsort to ONNX opset version 12 is not supported.
    # top_conf_index = origin_output[..., 4].argsort(descending=True)[:,:output_bboxes_nums]
    # [bs, 1000]: indices of the 1000 rows with the highest obj_conf
    top_conf_index = origin_output[..., 4].topk(k=output_bboxes_nums)[1]
    # torch.Size([bs, 1000, 85])
    filter_output = origin_output.gather(1, top_conf_index.unsqueeze(-1).expand(-1, -1, 85))
    filter_output[..., 5:] *= filter_output[..., 4].unsqueeze(-1)  # conf = obj_conf * cls_conf
    bboxes = filter_output[..., :4]
    conf, cls_id = filter_output[..., 5:].max(2, keepdim=True)
    # [bs, 1000, 6]
    filter_output = torch.cat((bboxes, conf, cls_id.float()), 2)

    return filter_output
    # custom output >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    # return x if self.training else (torch.cat(z, 1), x)
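Because the filtering happens before NMS, the client still has to run NMS on the returned [1000, 6] rows. A minimal pure-Python sketch of that step follows: a confidence threshold, then greedy class-wise NMS. The function names and the thresholds 0.25 and 0.45 are assumptions chosen for illustration, not values taken from this project.

```python
def xywh_to_xyxy(det):
    """Convert [cx, cy, w, h, ...] to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = det[:4]
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets, conf_thres=0.25, iou_thres=0.45):
    """dets: rows of [cx, cy, w, h, conf, cls_id] for one image."""
    candidates = sorted((d for d in dets if d[4] >= conf_thres),
                        key=lambda d: d[4], reverse=True)
    kept = []
    for d in candidates:
        box = xywh_to_xyxy(d)
        # Suppress only against already kept boxes of the same class.
        if all(k[5] != d[5] or iou(box, xywh_to_xyxy(k)) <= iou_thres
               for k in kept):
            kept.append(d)
    return kept


dets = [
    [50, 50, 20, 20, 0.9, 0.0],
    [52, 50, 20, 20, 0.8, 0.0],    # overlaps the first box, same class
    [52, 50, 20, 20, 0.7, 1.0],    # same place, different class
    [200, 200, 10, 10, 0.1, 0.0],  # below the confidence threshold
]
kept = nms(dets)
print(kept)
```

In the toy input the second box is suppressed by the first, the third survives because its class differs, and the fourth is dropped by the confidence threshold.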
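When running onnx simplify, the input_shapes argument below must be commented out; otherwise the exported ONNX model still has a static shape.
model_onnx, check = onnxsim.simplify(
    model_onnx,
    dynamic_input_shape=dynamic,
    # must be commented out
    # input_shapes={'images': list(im.shape)} if dynamic else None
)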
Then run: python export.py --weight yolov5s.pt --dynamic --simplify --include onnx
batch size = 1
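Throughput improves by more than 25x at concurrency 3 and above (about 17x at concurrency 1), and Server Queue and Server Compute Output drop from as much as several seconds to a few milliseconds or less.

Concurrency | Inferences/Second | Client Send | Network+Server Send/Recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | p90 latency |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |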
1 | 11.9 | 1245 | 69472 | 286 | 7359 | 5022 | 340 | 3 | 93457 |
2 | 19.2 | 1376 | 89804 | 341 | 7538 | 4997 | 161 | 3 | 118114 |
3 | 20.2 | 1406 | 131265 | 1500 | 8240 | 4881 | 500 | 3 | 171370 |
4 | 20 | 1382 | 180621 | 2769 | 9051 | 5184 | 496 | 3 | 235043 |
5 | 20.5 | 1362 | 226046 | 2404 | 8112 | 5068 | 622 | 3 | 286810 |
6 | 20.8 | 1487 | 271714 | 2034 | 8331 | 5076 | 506 | 3 | 406248 |
7 | 20.1 | 1535 | 328144 | 2626 | 8444 | 5122 | 405 | 3 | 430850 |
8 | 19.9 | 1512 | 384690 | 3511 | 8168 | 5018 | 581 | 5 | 465658 |
9 | 20.2 | 1433 | 420893 | 3499 | 9034 | 5180 | 389 | 3 | 522285 |
10 | 20.5 | 1476 | 469029 | 3369 | 8280 | 5165 | 442 | 3 | 622745 |
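The throughput gain can be recomputed directly from the Inferences/Second columns of the two batch size = 1 tables. The short script below does that for the concurrency levels present in both tables; the dictionaries simply copy the reported numbers.

```python
# Inferences/Second copied from the two tables (concurrency 1..7).
baseline = {1: 0.7, 2: 0.8, 3: 0.7, 4: 0.7, 5: 0.6, 6: 0.6, 7: 0.5}
optimized = {1: 11.9, 2: 19.2, 3: 20.2, 4: 20, 5: 20.5, 6: 20.8, 7: 20.1}

# Ratio of optimized to baseline throughput, rounded to one decimal.
speedup = {c: round(optimized[c] / baseline[c], 1) for c in baseline}

for concurrency, ratio in speedup.items():
    print(f"concurrency {concurrency}: {ratio}x")
```

The ratio grows with concurrency because the baseline throughput falls as the queue backs up, while the optimized model levels off at about 20 infer/sec.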
batch size = 8
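Compared with batch size = 1, the reported Server Compute Input, Server Compute Infer, and Server Compute Output times are all lower, while Network+Server Send/Recv grows and throughput stays at about 20 infer/sec.

Concurrency | Inferences/Second | Client Send | Network+Server Send/Recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | p90 latency |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |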
1 | 15.2 | 11202 | 527075 | 360 | 5386 | 2488 | 43 | 5 | 570189 |
2 | 18.4 | 10424 | 829927 | 124 | 5780 | 2491 | 33 | 4 | 901743 |
3 | 20 | 10203 | 1178111 | 2290 | 5640 | 2570 | 20 | 4 | 1267145 |
4 | 20 | 10097 | 1595614 | 4843 | 5998 | 2454 | 104 | 5 | 1716309 |
5 | 19.2 | 9117 | 1971608 | 2397 | 5376 | 2480 | 203 | 4 | 2518530 |
6 | 20 | 8728 | 2338066 | 2914 | 6304 | 2496 | 96 | 4 | 2706257 |
7 | 20 | 14785 | 2708292 | 6581 | 5556 | 2489 | 160 | 5 | 3170047 |
8 | 20 | 13035 | 3052707 | 5067 | 6353 | 2492 | 62 | 4 | 3235293 |
9 | 17.6 | 10870 | 3535601 | 7037 | 6307 | 2480 | 136 | 5 | 3856391 |
10 | 18.4 | 9357 | 3953830 | 8044 | 5629 | 2520 | 64 | 3 | 4531638 |