Triton Inference Server

Reference links:
GitHub address
Model Analyzer installation
YOLOv4 performance analysis example
Chinese blog introduction
A classic explanation of server latency, concurrency, concurrency level, and throughput
Python client examples
Tools for model repository management and performance testing
1、Performance monitoring and optimization
The Model Analyzer section helps you understand a model's GPU memory usage, so you can decide how to run multiple models on a single GPU.
It reports analysis results such as: Concurrency: 1, throughput: 62.6 infer/sec, latency 21371 usec
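A minimal sketch of collecting those numbers with Model Analyzer, assuming a model named yolov4 in a local repository (the flag names follow recent model-analyzer releases and may differ in older ones):

# Profile GPU memory and compute utilization for one model
model-analyzer profile --model-repository /path/to/model_repository --profile-models yolov4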
2、Enable the Dynamic Batcher
It merges concurrent requests into a batch before running inference.
Stop Triton, add dynamic_batching { } to the model's configuration file, then restart Triton, as sketched below.
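A minimal sketch of that edit, assuming the model lives at model_repository/yolov4 (the path, preferred batch sizes, and queue delay are illustrative):

# Append a dynamic batcher section to the model's config.pbtxt, then restart Triton
cat >> model_repository/yolov4/config.pbtxt <<'EOF'
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
EOF

An empty dynamic_batching { } block is enough to enable the feature; the two inner fields only tune how Triton waits to form larger batches.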
3、In general, the benefit of the dynamic batcher and multiple model instances is model specific, so you should experiment with perf_analyzer to determine the settings that best satisfy your throughput and latency requirements; a matching instance_group sketch follows.
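For the multiple-instances half of that experiment, a hedged sketch of the corresponding config change (instance count and GPU index are illustrative):

# Run two execution instances of the model on GPU 0
cat >> model_repository/yolov4/config.pbtxt <<'EOF'
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
EOF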
4、perf_analyzer -m inception_graphdef --concurrency-range 1:4 -f perf.csv
This writes the measurement data to a CSV file; a wider sweep is sketched below.
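The concurrency range also accepts a step value, so a wider sweep into the same CSV could look like this (the percentile choice is illustrative):

# Sweep concurrency 1..16 in steps of 2 and report p95 latency instead of the average
perf_analyzer -m inception_graphdef --concurrency-range 1:16:2 --percentile=95 -f perf.csv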
5、model_analyzer in detail
6、It can also generate plots and similar reports, as sketched below.
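A hedged sketch of producing those plots with Model Analyzer's analyze/report steps (subcommands and flags vary noticeably between model-analyzer versions, and newer releases generate the summary from profile alone; the config name below is a placeholder):

# Summarize profiling results and emit detailed per-config reports with plots
model-analyzer analyze --analysis-models yolov4 --export-path ./results
model-analyzer report --report-model-configs <model-config-name> --export-path ./results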
7、Deploying on Kubernetes (k8s)
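A hedged sketch of one common route, using the example Helm charts shipped in the server repository (the exact chart directory and values file vary by release and platform, so treat the paths as placeholders):

# Deploy Triton on Kubernetes with an example Helm chart from the server repo
git clone https://github.com/triton-inference-server/server.git
cd server/deploy                          # pick the chart matching your platform
helm install triton-example ./<chart-directory>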
8、Performance Analyzer.
Model Analyzer uses the Performance Analyzer to measure a model's GPU memory and compute utilization.
By default perf_analyzer sends input tensor data and receives output tensor data over the network. You can instead instruct perf_analyzer to use system shared memory or CUDA shared memory to communicate tensor data. By using these options you can model the performance that you can achieve by using shared memory in your application. Use --shared-memory=system to use system (CPU) shared memory or --shared-memory=cuda to use CUDA shared memory.
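For example, repeating the item-4 measurement over CUDA shared memory could look like this (the model name is carried over from item 4; --output-shared-memory-size may also be needed if the model produces large outputs):

# Exchange input/output tensors through CUDA shared memory instead of the network
perf_analyzer -m inception_graphdef --concurrency-range 1:4 --shared-memory=cuda

Note that shared memory only applies when perf_analyzer runs on the same machine as the Triton server.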
