Author: Oldpan | Editor: Oldpan Blog
TensorRT-9.0 and TensorRT-LLM are about to be released, so this post collects the available information first; I will dig into them later. Note that these are two different things: TensorRT-LLM is descended from FasterTransformer, it is the large-language-model flavor of TensorRT, and it depends on TensorRT 9.0 to run.
TensorRT-LLM further enhances FasterTransformer and turns it into a productized solution. With TensorRT-LLM, AI developers can implement deep learning inference applications more easily and boost performance through optimized LLMs. TensorRT-LLM keeps the core functionality of FasterTransformer and adds an open-source, modular Python API to support new architectures and enhancements, improving ease of use and extensibility. With this newly released open-source code, AI inference developers can now deploy production-grade applications, lower costs, reduce complexity, and improve the overall user experience.
TensorRT-LLM currently has no ONNX parser, so the ONNX workflow is not available; models must be built by hand.
Building the network by hand is the mainstream approach for large models today anyway; exporting to ONNX would require splitting the model into many parts before it could be converted.
Key features include:
- KV cache (perhaps learning from vLLM's paged approach? a conceptual sketch follows this list)
- Highly optimized self-attention (extreme performance tuning)
- Server-side optimizations (supports in-flight batching, similar to continuous batching)
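The KV cache is the central data structure of autoregressive inference: the keys and values of already-processed tokens are stored so that each new token only computes its own K/V and attends over the cache instead of recomputing everything. Below is a minimal pure-NumPy sketch of the idea; it is not TensorRT-LLM code, and the shapes and the head_dim name are illustrative only (the paged variant additionally stores the cache in fixed-size blocks).

# Conceptual KV-cache sketch (NumPy only, not the TensorRT-LLM API).
import numpy as np

head_dim = 64                      # illustrative head size
k_cache = np.zeros((0, head_dim))  # grows by one row per generated token
v_cache = np.zeros((0, head_dim))

def attend(q, k_new, v_new):
    """Append the new token's K/V to the cache, then attend over the full cache."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k_new])       # [seq_len, head_dim]
    v_cache = np.vstack([v_cache, v_new])
    scores = q @ k_cache.T / np.sqrt(head_dim)  # [1, seq_len]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ v_cache                      # [1, head_dim]

# Each decoding step only computes K/V for the newest token:
out = attend(np.random.randn(1, head_dim),
             np.random.randn(1, head_dim),
             np.random.randn(1, head_dim))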
Although the name mentions LLM (Large Language Model), TensorRT-LLM can in fact be used to build arbitrary AI models, in both single-GPU and multi-GPU setups.
TensorRT-LLM wraps TensorRT, the optimized kernels from FasterTransformer, pre- and post-processing, and multi-GPU/multi-node communication in a single Python API for defining, optimizing, and executing LLMs in production inference.
TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines, plus a backend for integrating with the NVIDIA Triton inference server.
The architecture of TensorRT-LLM's Python API is designed to look similar to the PyTorch API. It provides users with a functional module containing functions such as einsum, softmax, matmul, or view.
The layers module bundles useful building blocks for assembling LLMs, such as an Attention block, an MLP, or an entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the models module.
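To give a feel for this PyTorch-like, functional style, here is a hypothetical sketch that composes a tiny feed-forward block out of the matmul and relu primitives mentioned in this article. It is an illustration only; the calls build graph nodes rather than executing eagerly, and anything beyond the two-argument matmul(a, b) and relu(x) usage shown later in this post is an assumption.

# Hypothetical sketch of the functional style (illustration only):
from tensorrt_llm.functional import matmul, relu

def tiny_ffn(x, w1, w2):
    # Each call inserts the corresponding TensorRT layer into the network being built.
    return matmul(relu(matmul(x, w1)), w2)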
TensorRT-LLM provides users with predefined models that can easily be modified and extended. The current release supports BERT, GPT, NVIDIA GPT-2B, GPT-J, LLaMA, OPT, SantaCoder, and StarCoder.
To maximize performance and reduce the memory footprint, TensorRT-LLM allows models to be executed with different quantization modes (see examples/gpt for concrete examples). TensorRT-LLM supports INT4 and INT8 weights (with FP16 activations, i.e. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.
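As a rough illustration of what INT8 weight-only quantization means (weights stored in INT8 and rescaled on the fly while activations stay in floating point), here is a small NumPy sketch. It only demonstrates the arithmetic with illustrative per-channel scales; it is not TensorRT-LLM's kernels or API.

# Conceptual INT8 weight-only quantization (NumPy only, per-output-channel scales).
import numpy as np

w = np.random.randn(512, 512).astype(np.float32)        # original FP32 weights
scale = np.abs(w).max(axis=0) / 127.0                    # one scale per output channel
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

x = np.random.randn(4, 512).astype(np.float32)           # activations stay floating point
y = (x @ w_int8.astype(np.float32)) * scale               # dequantize by rescaling the output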
Here is a more detailed look at the architecture; I did not get to everything, so reading the original documentation is even better:
TensorRT-LLM is a toolkit for assembling optimized solutions that perform LLM inference. It offers a Python API to define models and compile efficient TensorRT engines for NVIDIA GPUs. It also contains Python and C++ components to build runtimes that execute those engines, as well as a backend for the Triton inference server to easily create web-based services for LLMs. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI).
Once the model definition and weights are available, users must recreate the model with TensorRT-LLM's Python API so that TensorRT can compile it into an efficient engine. For ease of use, TensorRT-LLM already supports a number of standard models.
Beyond the Python API for describing models, TensorRT-LLM also provides components to create a runtime that executes those efficient TensorRT engines. The runtime components offer beam search, along with extensive sampling functionalities such as top-K and top-P sampling.
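Top-K and top-P (nucleus) sampling are standard decoding techniques: the next token is drawn only from the K most likely tokens and/or from the smallest set of tokens whose cumulative probability exceeds P. The following NumPy sketch shows the idea; it is not the TensorRT-LLM runtime API, and the vocabulary size and default cutoffs are illustrative.

# Conceptual top-K / top-P sampling over a vector of logits (NumPy only).
import numpy as np

def sample(logits, top_k=50, top_p=0.9):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens sorted by decreasing probability
    order = order[:top_k]                            # keep the top-K candidates
    cum = np.cumsum(probs[order])
    keep = order[:np.searchsorted(cum, top_p) + 1]   # smallest prefix with cumulative prob >= top_p
    p = probs[keep] / probs[keep].sum()
    return np.random.choice(keep, p=p)

next_token = sample(np.random.randn(32000))          # 32000 = illustrative vocabulary size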
TensorRT-LLM also includes Python and C++ backends for the NVIDIA Triton inference server to assemble online LLM serving solutions.
As mentioned above, TensorRT-LLM has a Python API that can be used to define large language models. This API is built on top of the powerful TensorRT Python API and creates graph representations of deep neural networks in TensorRT. In other words, you build the model by hand.
In TensorRT-LLM, the tensorrt_llm.Builder class contains a tensorrt.Builder object. That instance is used in the tensorrt_llm.Builder.create_network method to create an instance of the tensorrt.INetworkDefinition class. The INetworkDefinition object can then be populated using the free functions defined in tensorrt_llm.functional.
One simple example of such a free function is tensorrt_llm.activation, which inserts a tensorrt.IActivationLayer node into the model's graph:
# In tensorrt_llm.functional:
def activation(input: Tensor, act_type: trt.ActivationType) -> Tensor:
    layer = default_trtnet().add_activation(input.trt_tensor, act_type)   # default_trtnet() -> INetworkDefinition
    return _create_tensor(layer.get_output(0), layer)
To make it even easier for users, a few of the most standard activation functions found in LLMs are derived from that function and can be used directly:
# In tensorrt_llm.functional:
relu = partial(activation, act_type=trt.ActivationType.RELU)
sigmoid = partial(activation, act_type=trt.ActivationType.SIGMOID)
Specialized activation functions can be used to assemble more advanced functions, such as the silu activation:
# In tensorrt_llm.functional:
def silu(input: Tensor) -> Tensor:
    return input * sigmoid(input)
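The same composition pattern extends to the gated activations used in many LLM MLP blocks; for instance, a SwiGLU-style gate can be written with the silu defined above. This is an illustration of the style only, not necessarily a function that ships in tensorrt_llm.functional:

# Illustration only, composed in the same style as silu above:
def swiglu_like(gate: Tensor, up: Tensor) -> Tensor:
    return silu(gate) * up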
When the TensorRT-LLM's Python API is utilized, a graph of the network is assembled. The graph can later be traversed or transformed using the graph traversal API exposed by the tensorrt.ILayer class. That graph will also be optimized by TensorRT during the compilation of the engine, as explained in the next section.
In short: assemble the network with the Python API, then have TensorRT parse it and build the engine.
Once populated, the instance of the tensorrt.INetworkDefinition can be compiled into an efficient engine by the tensorrt.Builder. In TensorRT-LLM, it is done through the build_engine member function of the tensorrt_llm.Builder class, which calls the build_serialized_network method of the tensorrt.Builder object. That call, if everything works as expected, produces an instance of the tensorrt.IHostMemory class. That object is an optimized TensorRT engine that can be stored as a binary file.
TensorRT engines embed the network weights, which must be known for compilation. For that reason, the weights must be bound to parameters in the model definition before calling tensorrt_llm.Builder.build_engine. It leads to code like:
# The Linear operator exposes two parameters (see tensorrt_llm/layers/linear.py):
class Linear(Module):
    def __init__(self, ...):
        self.weight = Parameter(shape=(self.out_features, self.in_features), dtype=dtype)
        self.bias = Parameter(shape=(self.out_features, ), dtype=dtype)
# The parameters are bound to the weights before compiling the model. See examples/gpt/weight.py:
tensorrt_llm_gpt.layers[i].mlp.fc.weight.value = fromfile(...)
tensorrt_llm_gpt.layers[i].mlp.fc.bias.value = fromfile(...)
Note that TensorRT can also refit engines to update the weights after compilation. This feature is available to TensorRT-LLM users through the refit_engine method in the tensorrt_llm.Builder class.
One of the key steps performed by TensorRT when it compiles the network graph is the fusion of operations. Fusion is a well-known technique to improve the efficiency when executing LLMs. It helps reduce the amount of data transferred between the memory (DRAM) and the compute cores (CUDA cores as well as Tensor Cores located on the Streaming Multiprocessors of a GPU). It also removes kernel launch overhead (each time a kernel is launched on the GPU, there is a small additional CPU cost that is called the launch overhead). A classical example is the fusion of the activation function with the matrix multiplication (matmul) that usually precedes it in the network.
In TensorRT-LLM, when defining the model, such a sequence can be written as:
c = tensorrt_llm.functional.matmul(a, b)
c = tensorrt_llm.functional.relu(c)
During inference, if the above sequence is executed without fusion, the c tensor has to be written to global memory at the end of the matmul, read from that same memory in relu, and written again after relu. If no other operation uses the intermediate values between matmul and relu, it is suboptimal. That is why, during compilation, TensorRT will identify that pattern and automatically produce a GPU kernel that applies relu at the end of matmul without an intermediate step through global memory. With that optimization, the c tensor is written only once (after relu) instead of twice, and is not read between the two operations.
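To put a rough number on the saving, here is a back-of-the-envelope calculation, assuming illustrative shapes and FP16 storage: fusing relu into the matmul removes one full write and one full read of the intermediate c tensor.

# Rough estimate of the global-memory traffic saved by fusing matmul + relu
# (illustrative shapes, FP16 = 2 bytes per element).
batch, m, n = 8, 4096, 4096
bytes_per_elem = 2
c_bytes = batch * m * n * bytes_per_elem

saved = 2 * c_bytes   # one extra write after matmul + one extra read in relu
print(f"global-memory traffic saved by fusion: {saved / 1e6:.0f} MB per invocation")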
The process of identifying the sequences of operations that can be fused is called pattern-matching. TensorRT has a powerful pattern-matching algorithm that can identify a lot of possible fusions. All the identified patterns are converted into more efficient kernels by an advanced kernel compiler.
The number of possible fusions is almost infinite and some useful fusions involve very advanced modifications of the graph. A well-known example is the Flash-Attention technique to optimize the Multihead-Attention block found in many LLMs. Flash-Attention requires modifications to the arithmetic performed in the sequence BMM-Softmax-BMM (where BMM stands for Batched Matrix-Matrix product) and the interleaving of the for-loops of the two batched matrix products. That's non-trivial and not necessarily something you can expect a compiler to "discover" on its own (or it might require the support for a polyhedral model).
As a result, even if TensorRT has a powerful pattern-matching algorithm and supports a lot of possible fusions, there is always the risk that it cannot identify uncommon and/or very advanced patterns. To overcome that inevitable limitation, TensorRT offers a powerful mechanism known as plugins.
The plugins are nodes inserted in the network graph definition that map to user-defined GPU kernels. TensorRT-LLM uses a number of such plugins. They can be found in the cpp/tensorrt_llm/plugins directory.
Plugins are written in C++ and follow a well-defined interface described in the Extending TensorRT with Custom Layers section of the TensorRT Developer Guide.
When executed within a TensorRT engine, plugins trigger the execution of their encapsulated GPU kernels. A fairly simple example of plugins is the QuantizeTensorPlugin that triggers a CUDA kernel in the QuantizeTensorPlugin::enqueue member function:
// In cpp/tensorrt_llm/plugins/quantizeTensorPlugin/quantizeTensorPlugin.cpp:
int QuantizeTensorPlugin::enqueue(...) {
    if (inputDesc[0].type == DataType::kFLOAT) {
        invokeQuantization<float>(...);
    } else {
        invokeQuantization<half>(...);
    }
    return 0;
}

// In cpp/tensorrt_llm/kernels/quantization.cu:
template <typename T>
void invokeQuantization(...) {
    // The standard <<< >>> construct to launch CUDA kernels
    quantizedKernel<<<grid, block, 0, stream>>>(...);
}
For more details on how TensorRT-LLM implements the GPT Attention operator, see the Multihead and Multiquery Attention document.
TensorRT-LLM includes an API to implement Python and C++ runtimes. The role of the runtime components is to load the TensorRT engines and drive their execution. Typically, for an auto-regressive model like GPT, the runtime is in charge of loading the engine that implements both the processing of the input sequence as well as the body of the generation loop. See the GPT C++ Runtime document for details on the C++ Runtime.
As with FasterTransformer before it, the C++ runtime contains the entire generation loop, not just the model.
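Conceptually, the generation loop the runtime drives looks like the greedy-decoding sketch below: process the input sequence once, then repeatedly feed back the last generated token until an end-of-sequence token or a length limit is reached. This is plain Python over a stand-in model function, not the actual TensorRT-LLM runtime API, and the constants are illustrative.

# Conceptual auto-regressive generation loop (not the TensorRT-LLM runtime API).
import numpy as np

EOS, MAX_NEW_TOKENS, VOCAB = 2, 32, 32000

def model(tokens):
    # Stand-in for one engine execution: returns logits for the next token.
    rng = np.random.default_rng(len(tokens))
    return rng.standard_normal(VOCAB)

def generate(prompt_tokens):
    tokens = list(prompt_tokens)
    for _ in range(MAX_NEW_TOKENS):
        next_token = int(np.argmax(model(tokens)))   # greedy decoding
        tokens.append(next_token)
        if next_token == EOS:                        # stop at end-of-sequence
            break
    return tokens

print(generate([1, 5, 42]))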
Even if TensorRT is designed for single-GPU systems, TensorRT-LLM adds the support for systems with multiple GPUs and nodes. It is enabled using TensorRT plugins that wrap communication primitives from the NCCL library.
The communication plugins can be found in cpp/tensorrt_llm/plugins/ncclPlugin and the multi-GPU functions are exposed in the TensorRT-LLM Python API as:
# In tensorrt_llm/functional.py:
# Collectives.
def allreduce(tensor: Tensor, group: List[int]) -> Tensor
def allgather(tensor: Tensor, group: List[int]) -> Tensor
# Point-to-point communication primitives.
def send(tensor: Tensor, tgt: int) -> Tensor
def recv(tensor: Tensor, src: int) -> Tensor
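A typical use of allreduce in multi-GPU LLM inference is tensor parallelism: each GPU multiplies the activations by its shard of a weight matrix and the partial results are summed across ranks. The NumPy sketch below simulates that arithmetic in a single process (ranks are just list entries); it only shows why the collective is needed, not how the NCCL plugin is invoked.

# Why allreduce is needed for a row-parallel linear layer (single-process NumPy simulation).
import numpy as np

world_size, d_in, d_out = 2, 512, 512
x = np.random.randn(4, d_in)
w = np.random.randn(d_in, d_out)

# Each "rank" holds a slice of the inner dimension and computes a partial product.
partials = [x[:, r::world_size] @ w[r::world_size, :] for r in range(world_size)]

# allreduce(sum) across the ranks reconstructs the full matmul result.
y = sum(partials)
assert np.allclose(y, x @ w)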
TensorRT-LLM supports in-flight batching of requests (also known as continuous batching or iteration-level batching) for higher serving throughput.
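The idea behind in-flight (continuous, iteration-level) batching is that the batch is rebuilt at every generation step: finished sequences leave immediately and queued requests join as soon as a slot frees up, instead of waiting for the whole batch to drain. The small scheduling loop below is an illustration of that policy in plain Python, not TensorRT-LLM's batch manager.

# Conceptual in-flight (continuous) batching loop (illustrative only).
from collections import deque
import random

queue = deque(range(10))      # ten pending request ids
active = {}                   # request id -> tokens still to generate
MAX_BATCH = 4

while queue or active:
    # Admit new requests whenever a batch slot is free (no waiting for the batch to drain).
    while queue and len(active) < MAX_BATCH:
        active[queue.popleft()] = random.randint(1, 5)
    # One generation step for every active request.
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:          # finished sequences leave the batch immediately
            print(f"request {req} done")
            del active[req]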
https://developer.nvidia.com/tensorrt-llm-early-access
https://www.bilibili.com/video/BV1h44y1c72B/?spm_id_from=333.788&vd_source=eec038509607175d58cdfe2e824e8ba2