After a long wait, the much-anticipated TensorRT-LLM has finally arrived, released at version 0.5.0.
GitHub:
https://github.com/NVIDIA/TensorRT-LLM/tree/main
TensorRT-LLM can be seen as a combination of TensorRT and FasterTransformer, built specifically to accelerate large-model inference.
On top of the attention optimizations, softmax optimizations, and kernel fusion that FasterTransformer applies to Transformer models, it adds a long list of optimization features for large-model inference:
Multi-head Attention (MHA)
Multi-query Attention (MQA)
Group-query Attention (GQA)
In-flight Batching
Paged KV Cache for the Attention
Tensor Parallelism
Pipeline Parallelism
INT4/INT8 Weight-Only Quantization (W4A16 & W8A16)
SmoothQuant
GPTQ
AWQ
FP8
Greedy-search
Beam-search
RoPE
It also ships with example code for many open-source large models, including:
Baichuan
Bert
Blip2
BLOOM
ChatGLM-6B
ChatGLM2-6B
Falcon
GPT
GPT-J
GPT-Nemo
GPT-NeoX
LLaMA
LLaMA-v2
MPT
OPT
SantaCoder
StarCoder
Usage keeps TensorRT's familiar two-stage workflow of build + run, sketched below:
build: serialize the model files into a TensorRT engine file according to the configured parameters
run: load the engine file, feed in the data, and run inference
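Concretely, the two stages map onto each example's build.py and run.py scripts (a minimal sketch using the bert example covered later; flags differ per model, see each example's README):
# Stage 1 (build): serialize the model into a TensorRT engine file
python3 build.py --dtype=float16 --log_level=verbose
# Stage 2 (run): load the engine and run inference
python3 run.py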
At the moment, TensorRT-LLM has to be built from source. To make setting up the build environment easier, the official repo provides a Docker-based build.
Detailed instructions:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation.md
Build env
After downloading the source code, run the build command from the repository root:
docker build --pull \
--target devel \
--file docker/Dockerfile.multi \
--tag tensorrt_llm/devel:latest \
.
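Once the image is built, start a container from it and work inside (a minimal sketch; the --gpus flag requires the NVIDIA Container Toolkit, and the mount point /code/tensorrt_llm is an assumption, adjust it to your setup):
docker run --rm -it --gpus all \
  -v $(pwd):/code/tensorrt_llm \
  -w /code/tensorrt_llm \
  tensorrt_llm/devel:latest bash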
Build TensorRT-LLM
# To build the TensorRT-LLM code.
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
# Deploy TensorRT-LLM in your environment.
pip install ./build/tensorrt_llm*.whl
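A quick sanity check that the wheel installed correctly (assuming the package exposes the usual __version__ attribute):
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"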
Alternatively, you can complete the build in a local (non-Docker) environment; see:
http://t.csdnimg.cn/IrB2e
There are plenty of reference examples under ./examples.
Here we pick the bert example for a quick test, with the number of layers set to 12.
build
Convert the Hugging Face-format model into a TensorRT engine file:
python3 build.py --dtype=float16 --log_level=verbose
run
python3 run.py
Performance comparison
PyTorch
| max_seqlen | batch_size | inference time (ms) | GPU mem (MB) |
|---|---|---|---|
| 32 | 1 | 9.119 | 631.0 |
| 64 | 1 | 9.177 | 653.0 |
| 128 | 1 | 9.069 | 701.0 |
| 256 | 1 | 9.193 | 701.0 |
| 32 | 4 | 9.673 | 705.0 |
| 64 | 4 | 9.708 | 721.0 |
| 128 | 4 | 9.751 | 821.0 |
| 256 | 4 | 14.154 | 1321.0 |
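For context, this is roughly how such a PyTorch baseline can be measured (a minimal sketch, assuming a 12-layer bert-base-uncased from Hugging Face, FP16 weights, random token IDs, and CUDA-event timing; not the exact script behind the numbers above):
import torch
from transformers import BertModel

# Assumed baseline: 12-layer bert-base-uncased in FP16, random token IDs as input
model = BertModel.from_pretrained("bert-base-uncased").half().cuda().eval()
input_ids = torch.randint(0, 30522, (4, 128), device="cuda")  # batch_size=4, max_seqlen=128

with torch.inference_mode():
    for _ in range(10):  # warm-up iterations
        model(input_ids)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()

print("avg inference time (ms):", start.elapsed_time(end) / 100)
print("gpu mem (MB):", torch.cuda.max_memory_allocated() / 2**20)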
TensorRT-LLM
| max_seqlen | batch_size | inference time (ms) | GPU mem (MB) |
|---|---|---|---|
| 32 | 1 | 2.317 | 843.0 |
| 64 | 1 | 2.404 | 843.0 |
| 128 | 1 | 2.868 | 843.0 |
| 256 | 1 | 3.564 | 843.0 |
| 32 | 4 | 2.788 | 843.0 |
| 64 | 4 | 3.372 | 843.0 |
| 128 | 4 | 5.252 | 843.0 |
| 256 | 4 | 10.54 | 863.0 |
GPT2
git clone https://huggingface.co/gpt2-medium
The model is saved under /root/autodl-tmp/models/gpt/.
The steps follow the README in the /root/autodl-tmp/files/TensorRT-LLM/examples/gpt directory.
(trt-llm) root@xxx:~/autodl-tmp/files/TensorRT-LLM/examples/gpt# python3 hf_gpt_convert.py -i gpt2 -o ./gpt2-trt --tensor-parallelism 1 --storage-type float16
(trt-llm) root@xxx:~/autodl-tmp/files/TensorRT-LLM/examples/gpt# python3 build.py --model_dir=./gpt2-trt/1-gpu --use_gpt_attention_plugin --remove_input_padding
(trt-llm) root@xxx:~/autodl-tmp/files/TensorRT-LLM/examples/gpt# python3 run.py --max_output_len=16
Input: "Born in north-east France, Soyer trained as a"
Output: " chef and a cook at the local restaurant, La Boulangerie. He"