zenRRan

A Good Companion for Large Models: A Brief Look at the FasterTransformer Inference Acceleration Engine

Source: 吃果冻不吐果冻皮

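In recent months, following the phenomenal success of ChatGPT, large models have sprung up one after another. Model inference is the last mile that takes an abstract algorithmic model to a concrete production workload.

This stage still has several widely recognized pain points and demands, for example:

The user experience of any online product degrades as service response time grows, so how can request latency be squeezed to the minimum for complex models?
Model inference is usually a resident service that holds its resources permanently, so how can single-machine performance be raised to increase QPS while substantially cutting resource cost?
Device-edge-cloud deployment is the inevitable direction for model serving, so how can an offline-trained model be slimmed down and reshaped so that it deploys quickly on more devices?

As a result, inference acceleration has become an important research area in AI.

This article introduces the basic usage of FasterTransformer, an inference acceleration engine for large models.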

image.png

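Introduction to FasterTransformer

NVIDIA FasterTransformer (FT) is an acceleration engine for inference of Transformer-based neural networks. It contains highly optimized implementations of the Transformer block, covering both the encoder and the decoder parts. With it you can run inference for encoder-decoder models (such as T5), encoder-only models (such as BERT), and decoder-only models (such as GPT).

FT is written in C++/CUDA and relies on the highly optimized cuBLAS, cuBLASLt, and cuSPARSELt libraries, which enables fast Transformer inference on GPUs.

Compared with other compilers such as NVIDIA TensorRT, FT's most distinctive feature is its support for distributed inference of large Transformer models.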

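The figure below shows how tensor parallelism (TP) and pipeline parallelism (PP) split a Transformer-based neural network across multiple GPUs and nodes.

Tensor parallelism splits each tensor into multiple chunks, and each chunk can be placed on a separate GPU. During computation each chunk is processed independently and in parallel on its own GPU, and the final tensor is obtained by combining the results from all GPUs.
Pipeline parallelism splits the model depth-wise, placing different complete layers on different GPUs or nodes.

Under the hood, inter-node and intra-node communication relies on libraries such as MPI, NVIDIA NCCL, and Gloo. With FasterTransformer you can therefore run large Transformers with tensor parallelism across multiple GPUs to reduce compute latency. TP and PP can also be combined to run large Transformer models with billions or even trillions of parameters in multi-GPU, multi-node environments.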

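Besides deployment with the native C++ backend, FasterTransformer also integrates with TensorFlow (through a TensorFlow op), PyTorch (through a PyTorch op), and Triton as backend frameworks. Currently the TensorFlow op supports only a single GPU, while the PyTorch op and the Triton backend both support multiple GPUs and multiple nodes.

FT currently supports models including Megatron-LM GPT-3, GPT-J, BERT, ViT, Swin Transformer, Longformer, T5, and XLNet. The latest support matrix is available in the FasterTransformer repository on GitHub.

FT works on GPUs with compute capability >= 7.0, such as the V100, A10, and A100.

The figure below compares inference speedup for the GPT-J 6B model: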

image.png

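Optimization Techniques in FasterTransformer

Compared with general-purpose deep learning training frameworks, FT gives you a faster inference pipeline, with lower latency and higher throughput for Transformer-based neural networks. Some of the optimizations FT applies to GPT-3 and other large Transformer models are:

Layer fusion

This is a set of preprocessing-stage techniques that combine multiple neural network layers into a single one computed by a single kernel. It reduces data transfer and increases math density, which speeds up computation at inference time. For example, all operations in the multi-head attention block can be merged into one kernel.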

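Inference optimization for autoregressive models (activation caching)

To avoid recomputing the keys and values of earlier tokens every time the Transformer generates a new token, FT allocates a buffer and stores them at each step.

This costs some extra memory, but it saves the cost of recomputation. The process is shown in the figure below. The same caching mechanism is used in multiple parts of the network.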

image.png
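
The caching idea can be illustrated with a minimal pure-Python sketch. It is not FT's kernel: the 2-dimensional projection weights, the single attention head, and all function names are hypothetical and chosen only for illustration. It decodes three tokens twice, once recomputing every key and value at each step and once appending them to a buffer, and the two paths produce the same outputs.

```python
import math

def project(x, w):
    """Toy linear projection: y[j] = sum_i x[i] * w[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

# Hypothetical weights, for illustration only.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[1.0, 2.0], [0.0, 1.0]]

def decode_without_cache(tokens):
    """Recompute K and V for all previous tokens at every step."""
    outputs = []
    for t in range(len(tokens)):
        keys = [project(x, W_K) for x in tokens[: t + 1]]
        values = [project(x, W_V) for x in tokens[: t + 1]]
        outputs.append(attend(project(tokens[t], W_Q), keys, values))
    return outputs

def decode_with_cache(tokens):
    """Project each token once and append its K and V to a buffer."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(project(x, W_K))
        v_cache.append(project(x, W_V))
        outputs.append(attend(project(x, W_Q), k_cache, v_cache))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
no_cache = decode_without_cache(tokens)
cached = decode_with_cache(tokens)
```

Without the buffer, the number of key/value projections grows quadratically with the sequence length; with it, each token is projected once, at the price of keeping the buffer in memory.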

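Memory optimization

Unlike traditional models such as BERT, large Transformer models can have up to trillions of parameters and occupy hundreds of GB of storage. Even stored in half precision, GPT-3 175B needs 350 GB. It is therefore necessary to reduce memory usage elsewhere.

For example, FasterTransformer reuses the memory buffers for activations and outputs across decoder layers. Because GPT-3 has 96 layers, only 1/96 of the activation memory is needed.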
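
The two numbers above follow from simple arithmetic, sketched here under the assumption that 1 GB means 10^9 bytes and that every layer's activations have the same size. The function names are illustrative only.

```python
def weight_bytes(num_params, bytes_per_param):
    """Storage needed for the weights alone."""
    return num_params * bytes_per_param

GPT3_PARAMS = 175 * 10**9

fp16_gb = weight_bytes(GPT3_PARAMS, 2) / 10**9  # half precision: 2 bytes per parameter
fp32_gb = weight_bytes(GPT3_PARAMS, 4) / 10**9  # single precision: 4 bytes per parameter

def activation_buffer_fraction(num_layers):
    """Share of activation memory needed when all layers reuse one buffer."""
    return 1 / num_layers

fraction = activation_buffer_fraction(96)
```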

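Inter-node and intra-node communication with MPI and NCCL, with support for model parallelism

FasterTransformer provides both tensor parallelism and pipeline parallelism. For tensor parallelism it follows the Megatron approach: in both the self-attention block and the feed-forward block, FT splits the weights of the first matrix multiplication by row and the weights of the second matrix multiplication by column. With this optimization, FT reduces the number of reduction operations to two per Transformer block.

For pipeline parallelism, FasterTransformer splits the whole batch of requests into multiple micro-batches to hide the communication bubble, and it automatically adjusts the micro-batch size for different cases.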
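
The following pure-Python sketch shows why a pair of matrix multiplications needs only one reduction when split this way. It covers the feed-forward half only and makes several assumptions that are not from FT: the toy weights are made up, ReLU stands in for GeLU (any element-wise activation behaves the same here), and the list addition at the end stands in for the all-reduce. With the convention Y = X @ W used below, the split runs along W1's output dimension and W2's input dimension; whether that is called "row" or "column" depends on how the weights are stored.

```python
def matmul(a, b):
    """Plain (n x m) @ (m x p) product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

X = [[1.0, -2.0], [0.5, 3.0]]                             # (batch=2, hidden=2)
W1 = [[1.0, -1.0, 0.5, 2.0], [0.0, 1.0, -1.0, 1.0]]       # (hidden=2, inter=4)
W2 = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]   # (inter=4, hidden=2)

# Single-device reference result.
reference = matmul(relu(matmul(X, W1)), W2)

def shard_ffn(x, w1, w2, rank, world):
    """Partial feed-forward output computed by one tensor-parallel rank."""
    step = len(w1[0]) // world
    lo, hi = rank * step, (rank + 1) * step
    w1_shard = [row[lo:hi] for row in w1]  # slice of W1's intermediate dimension
    w2_shard = w2[lo:hi]                   # matching slice of W2's input dimension
    # No communication is needed between the two matmuls on a rank.
    return matmul(relu(matmul(x, w1_shard)), w2_shard)

partials = [shard_ffn(X, W1, W2, rank, 2) for rank in range(2)]
combined = add(partials[0], partials[1])   # stands in for the all-reduce
```

Applying the same pattern to the attention block gives one more reduction, which is consistent with the two reductions per Transformer block mentioned above.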

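MatMul kernel autotuning (GEMM autotuning)

Matrix multiplication is the dominant and heaviest operation in Transformer-based neural networks. FT uses functionality from the cuBLAS and CUTLASS libraries to execute it. It is important to know that, at the hardware level, a MatMul can be executed in dozens of different ways using different low-level algorithms.

The GemmBatchedEx function implements the MatMul operation and takes a cublasGemmAlgo_t as an input parameter. This parameter lets you choose among the low-level algorithms.

FasterTransformer uses this parameter to benchmark all low-level algorithms in real time and picks the best one for the model's parameters and your input data (size of the attention layers, number of attention heads, hidden size). In addition, FT uses hardware-accelerated low-level functions such as __expf and __shfl_xor_sync in some parts of the network.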
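
As a loose CPU-side analogy of this "benchmark every algorithm, keep the fastest" idea, the sketch below times three equivalent pure-Python matmul variants on the actual input shapes and selects the quickest. It does not call cuBLAS; the candidate functions and the `pick_fastest` helper are hypothetical stand-ins for the algorithms that cublasGemmAlgo_t selects.

```python
import time

def matmul_ijk(a, b):
    """Textbook form: dot product of row i and column j."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def matmul_ikj(a, b):
    """Loop-reordered variant that streams over rows of b."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_out = out[i]
        for k in range(m):
            aik, row_b = a[i][k], b[k]
            for j in range(p):
                row_out[j] += aik * row_b[j]
    return out

def matmul_transposed(a, b):
    """Transpose b once, then take row-by-row dot products."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

CANDIDATES = {"ijk": matmul_ijk, "ikj": matmul_ikj, "transposed": matmul_transposed}

def pick_fastest(a, b, repeats=3):
    """Time every candidate on the real shapes and return the fastest name."""
    timings = {}
    for name, fn in CANDIDATES.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(a, b)
        timings[name] = (time.perf_counter() - start) / repeats
    return min(timings, key=timings.get), timings

size = 24
a = [[float((i + j) % 5) for j in range(size)] for i in range(size)]
b = [[float((i * j) % 3) for j in range(size)] for i in range(size)]
best, timings = pick_fastest(a, b)
```

Which candidate wins depends on the machine and the shapes, which is exactly why the choice is measured rather than fixed in advance.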

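Low-precision inference

FT's kernels support inference with low-precision input data in fp16 and int8. Both speed up inference because less data is transferred and less memory is required. In addition, int8 and fp16 computation can run on specialized hardware such as Tensor Cores (available on all GPU architectures from Volta onward) and the Transformer Engine in Hopper GPUs.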
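
To make the memory side of this concrete, here is a minimal sketch of symmetric int8 quantization with a single shared scale. This is a generic textbook scheme used only for illustration; it is an assumption, not a description of how FT's int8 kernels quantize weights or activations.

```python
# Bytes needed per value for each data type.
BYTES = {"fp32": 4, "fp16": 2, "int8": 1}

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the int8 codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]  # made-up values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each value now takes 1 byte instead of 4 (fp32) or 2 (fp16), and the reconstruction error stays within half a quantization step.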

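Beyond these, FT offers a fast C++ beam search implementation and an all-reduce optimized for tensor-parallel size 8, where the model's weights are partitioned across eight GPUs.

With this overview of FasterTransformer in place, the rest of the article demonstrates FasterTransformer on the BLOOM model, using PyTorch as the backend.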

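Introduction to FasterTransformer GPT

The demonstration below uses BLOOM, a GPT variant that uses ALiBi (for adding positional information), so this section first gives a brief overview of GPT in FasterTransformer. GPT is a variant of the decoding model: it has no encoder module and uses GeLU as its activation.

FasterTransformer GPT Workflow

The figure below shows the FasterTransformer GPT workflow. Unlike BERT (encoder-only) and encoder-decoder architectures, GPT takes some input ids as context and generates the corresponding output ids as its response. The main bottleneck in this workflow is GptDecoderLayer (the Transformer block), because its time grows linearly with the number of layers. In GPT-3, GptDecoderLayer accounts for about 95% of total time.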

image.png

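FasterTransformer splits the whole workflow into two parts.

The first part computes the k/v cache of the context (the input ids).
The second part generates the output ids autoregressively.

The operations in the two parts are similar, but the tensor shapes in SelfAttention differ. FT therefore uses two different implementations for the two cases, as shown in the figure below.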

image.png

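In DecoderSelfAttention the query sequence length is always 1, so FT handles it with a custom fused masked multi-head attention kernel. In ContextSelfAttention, by contrast, the query sequence length is the maximum input length, so FT uses cuBLAS to take advantage of Tensor Cores.

The following examples show how to run multi-GPU and multi-node GPT models.

examples/cpp/multi_gpu_gpt_example.cc: uses MPI to organize all GPUs.
examples/cpp/multi_gpu_gpt_triton_example.cc: uses threads within a node and MPI across nodes. This example also shows how to run a GPT model with the Triton backend API of FasterTransformer.
examples/pytorch/gpt/multi_gpu_gpt_example.py: similar to examples/cpp/multi_gpu_gpt_example.cc, but wraps the FasterTransformer instance in a PyTorch op.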

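In summary, the workflow for running a GPT model is:

1. Initialize NCCL communication through MPI or threads, and set the tensor-parallel and pipeline-parallel ranks.
2. Load the weights according to the tensor-parallel rank, the pipeline-parallel rank, and the other model hyperparameters.
3. Create a ParallelGpt instance from the tensor-parallel rank, the pipeline-parallel rank, and the other model hyperparameters.
4. Receive a request from the client and convert it into ParallelGpt's input tensor format.
5. Run forward.
6. Convert ParallelGpt's output tensors into the client response and return it.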
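
The first step, deriving the two parallel ranks from a global process rank, can be sketched as follows. The layout is an assumption made for illustration (consecutive global ranks share a tensor-parallel group, and layers are divided evenly across pipeline stages); the helper names are hypothetical and not part of the FasterTransformer API.

```python
def parallel_ranks(world_rank, tensor_para_size, pipeline_para_size):
    """Return (tensor_para_rank, pipeline_para_rank) for a global rank.

    Assumed layout: consecutive global ranks form one tensor-parallel group.
    """
    world_size = tensor_para_size * pipeline_para_size
    if not 0 <= world_rank < world_size:
        raise ValueError(f"rank {world_rank} outside world of size {world_size}")
    return world_rank % tensor_para_size, world_rank // tensor_para_size

def layers_for_stage(num_layers, pipeline_para_rank, pipeline_para_size):
    """Layers owned by one pipeline stage, assuming an even split."""
    if num_layers % pipeline_para_size != 0:
        raise ValueError("this sketch assumes the layers divide evenly")
    per_stage = num_layers // pipeline_para_size
    start = pipeline_para_rank * per_stage
    return range(start, start + per_stage)

# Example: 8 processes arranged as 4-way tensor parallel x 2-way pipeline parallel.
tp_rank, pp_rank = parallel_ranks(5, tensor_para_size=4, pipeline_para_size=2)
stage_layers = layers_for_stage(24, pp_rank, 2)
```

Under this layout, global rank 5 is tensor-parallel rank 1 in pipeline stage 1 and would own the second half of a 24-layer model.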

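The C++ example code skips steps 4 and 6 and loads the request from examples/cpp/multi_gpu_gpt/start_ids.csv. In the PyTorch example code the request comes from the PyTorch side. The Triton example code contains the complete flow from step 1 to step 6.

The source code is in src/fastertransformer/models/multi_gpu_gpt/ParallelGpt.cc. The GPT constructor parameters include head_num, num_layer, tensor_para, and pipeline_para; the inputs include input_ids, input_lengths, and output_seq_len; the outputs include output_ids (containing input_ids and the generated ids), sequence_length, output_log_probs, cum_log_probs, and context_embeddings.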

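FasterTransformer GPT Optimizations

Kernel optimization: many kernels are based on the already highly optimized kernels of the decoder and decoding modules. To avoid recomputing earlier keys and values, FT allocates a buffer and stores them at each step. This costs some extra memory but saves the cost of recomputation.
Memory optimization: unlike traditional models such as BERT, GPT-3 has 175 billion parameters and needs 350 GB even when stored in half precision, so memory usage elsewhere must be reduced. FasterTransformer reuses memory buffers across decoder layers. Because GPT-3 has 96 layers, only 1/96 of the memory is needed.
Model parallelism: for GPT models, FasterTransformer provides both tensor parallelism and pipeline parallelism. For tensor parallelism it follows the Megatron approach: in both the self-attention block and the feed-forward block, the weights of the first matrix multiplication are split by row and those of the second by column. With this optimization, each Transformer block needs only 2 reduction operations; the workflow is shown in the figure below. For pipeline parallelism, FasterTransformer splits the whole batch of requests into multiple micro-batches to hide the communication bubble, and it automatically adjusts the micro-batch size for different cases. Users can adjust the degree of model parallelism by editing the gpt_config.ini file. We recommend tensor parallelism within a node and pipeline parallelism across nodes, because tensor parallelism requires more NCCL communication.
Multiple frameworks: besides the C++ source code, FasterTransformer provides a TensorFlow op, a PyTorch op, and a Triton backend. Currently the TensorFlow op supports only a single GPU, while the PyTorch op and the Triton backend support multiple GPUs and multiple nodes. FasterTransformer also provides a tool that splits a Megatron model and converts it into FasterTransformer binary files, so that FasterTransformer can load the binaries directly and no extra model-splitting work is needed for model parallelism.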

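FasterTransformer GPT Inference Options

FasterTransformer GPT also provides environment variables for tuning specific use cases.

Name	Description	Default	Accepted values
`FMHA_ENABLE`	Enable the fused multi-head attention kernels (fp16 accumulation)	disabled	`ON` = enable fmha, otherwise disabled
`CONTEXT_ATTENTION_BMM1_HALF_ACCUM`	Use fp16 accumulation for the qk gemm; only affects the unfused multi-head attention kernels	fp32 accumulation	`ON` = fp16 accumulation, otherwise fp32 accumulation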
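
Because these are ordinary environment variables, they have to be set before the process that loads FasterTransformer starts. A minimal sketch of building such an environment in Python follows; the `ft_env` helper is hypothetical, and only the two variable names and the value `ON` come from the table above.

```python
import os

def ft_env(fmha=False, bmm1_half_accum=False, base=None):
    """Return a copy of the environment with the requested FT switches set to ON."""
    env = dict(os.environ if base is None else base)
    if fmha:
        env["FMHA_ENABLE"] = "ON"
    if bmm1_half_accum:
        env["CONTEXT_ATTENTION_BMM1_HALF_ACCUM"] = "ON"
    return env

# Start from an empty base so the example result is easy to inspect.
env = ft_env(fmha=True, base={})
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching an inference script; leaving a variable unset keeps its default.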

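Environment Setup

Basic Environment Configuration

First make sure you have the following:

NVIDIA Docker and NGC containers
An NVIDIA Pascal, Volta, Turing, or Ampere GPU

Version requirements for the basic components:

CMake: 3.13 or later
CUDA: 11.0 or later
NCCL: 2.10 or later
Python: 3.8.13
PyTorch: 1.13.0

These components are readily available in NVIDIA's official TensorFlow and PyTorch Docker images.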

Building FasterTransformer

We recommend NVIDIA's official images, such as nvcr.io/nvidia/tensorflow:22.09-tf1-py3 or nvcr.io/nvidia/pytorch:22.09-py3. The official PyTorch images also work.

First, pull the PyTorch image.

docker pull nvcr.io/nvidia/pytorch:22.09-py3

Once the image is downloaded, create a container in which to compile and build FasterTransformer.

nvidia-docker run -dti --name bloom_faster_transformer \
--restart=always --gpus all --network=host \
--shm-size 5g \
-v /home/gdong/workspace/code:/workspace/code \
-v /home/gdong/workspace/data:/workspace/data \
-v /home/gdong/workspace/model:/workspace/model \
-v /home/gdong/workspace/output:/workspace/output \
-w /workspace \
nvcr.io/nvidia/pytorch:22.09-py3 \
bash

Enter the container.

docker exec -it bloom_faster_transformer bash

Download the FasterTransformer source code.

cd code
git clone https://github.com/NVIDIA/FasterTransformer.git
cd FasterTransformer/
git submodule init && git submodule update

Create and enter the build directory.

mkdir -p build
cd build

Then run cmake to generate the Makefile.

cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..

Notes:

First: the xx in -DSM=xx is the GPU's compute capability. The table below lists the compute capability of common GPUs.

GPU	Compute capability
P40	60
P4	61
V100	70
T4	75
A100	80
A30	80
A10	86

By default, -DSM is set to 70, 75, 80, and 86. Setting more -DSM values increases compile time, so we recommend setting -DSM only for the device you actually use.
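
A small sketch of deriving the -DSM value from a device's compute capability and assembling the cmake invocation used in this article. The `cmake_args` helper is hypothetical; the flags are the ones shown in the command above. On a machine with PyTorch and a GPU, the `(major, minor)` pair can be obtained from `torch.cuda.get_device_capability()`, which is not called here so that the sketch runs without a GPU.

```python
def cmake_args(major, minor):
    """Build the cmake command for a single compute capability, e.g. (8, 0) -> -DSM=80."""
    return [
        "cmake",
        f"-DSM={major}{minor}",
        "-DCMAKE_BUILD_TYPE=Release",
        "-DBUILD_PYT=ON",
        "-DBUILD_MULTI_GPU=ON",
        "..",
    ]

# Compute capability 8.0 corresponds to the A100 row of the table.
command = " ".join(cmake_args(8, 0))
```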

Second: this article uses PyTorch as the backend, so the command adds -DBUILD_PYT=ON. This builds the TorchScript custom classes, so make sure your PyTorch version is >= 1.5.0.

Output:

-- The CXX compiler identification is GNU 9.4.0
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found suitable version "11.8", minimum required is "10.2") 
CUDA_VERSION 11.8 is greater or equal than 11.0, enable -DENABLE_BF16 flag
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so  
-- Add DBUILD_CUTLASS_MOE, requires CUTLASS. Increases compilation time
-- Add DBUILD_CUTLASS_MIXED_GEMM, requires CUTLASS. Increases compilation time
-- Running submodule update to fetch cutlass
-- Add DBUILD_MULTI_GPU, requires MPI and NCCL
-- Found MPI_CXX: /opt/hpcx/ompi/lib/libmpi.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Found NCCL: /usr/include  
-- Determining NCCL version from /usr/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - not found
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so.2.15.1)
-- NVTX is enabled.
-- Assign GPU architecture (sm=80)
-- Use WMMA
CMAKE_CUDA_FLAGS_RELEASE: -O3 -DNDEBUG -Xcompiler -O3 -DCUDA_PTX_FP8_F2FP_ENABLED --use_fast_math
-- COMMON_HEADER_DIRS: /workspace/code/FasterTransformer;/usr/local/cuda/include;/workspace/code/FasterTransformer/3rdparty/cutlass/include;/workspace/code/FasterTransformer/src/fastertransformer/cutlass_extensions/include;/workspace/code/FasterTransformer/3rdparty/trt_fp8_fmha/src;/workspace/code/FasterTransformer/3rdparty/trt_fp8_fmha/generated
-- Found CUDA: /usr/local/cuda (found version "11.8") 
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.8
-- Found cuDNN: v8.6.0  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is 672ee683
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch.so  
-- USE_CXX11_ABI=True
-- The C compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python: /opt/conda/bin/python3.8 (found version "3.8.13") found components: Interpreter 
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/code/FasterTransformer/build

Next, build with make, using 12 parallel jobs to speed up compilation:

make -j12

Output:

[  0%] Building CXX object src/fastertransformer/kernels/cutlass_kernels/CMakeFiles/cutlass_preprocessors.dir/cutlass_preprocessors.cc.o
[  0%] Building CXX object src/fastertransformer/utils/CMakeFiles/nvtx_utils.dir/nvtx_utils.cc.o
[  0%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/layernorm_kernels.dir/layernorm_kernels.cu.o
[  0%] Building CXX object src/fastertransformer/utils/CMakeFiles/cuda_utils.dir/cuda_utils.cc.o
[  0%] Building CXX object src/fastertransformer/utils/CMakeFiles/logger.dir/logger.cc.o
[  1%] Building CXX object 3rdparty/common/CMakeFiles/cuda_driver_wrapper.dir/cudaDriverWrapper.cpp.o
[  1%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/custom_ar_kernels.dir/custom_ar_kernels.cu.o
[  1%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/add_residual_kernels.dir/add_residual_kernels.cu.o
[  1%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/activation_kernels.dir/activation_kernels.cu.o
[  1%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/transpose_int8_kernels.dir/transpose_int8_kernels.cu.o
[  2%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/unfused_attention_kernels.dir/unfused_attention_kernels.cu.o
[  2%] Building CUDA object src/fastertransformer/kernels/CMakeFiles/bert_preprocess_kernels.dir/bert_preprocess_kernels.cu.o
[  2%] Linking CUDA device code CMakeFiles/cuda_driver_wrapper.dir/cmake_device_link.o
[  2%] Linking CXX static library ../../lib/libcuda_driver_wrapper.a
[  2%] Built target cuda_driver_wrapper
...
[100%] Linking CXX executable ../../../bin/gptneox_example
[100%] Built target gptj_triton_example
[100%] Building CXX object examples/cpp/multi_gpu_gpt/CMakeFiles/multi_gpu_gpt_triton_example.dir/multi_gpu_gpt_triton_example.cc.o
[100%] Built target gptj_example
[100%] Building CXX object examples/cpp/multi_gpu_gpt/CMakeFiles/multi_gpu_gpt_interactive_example.dir/multi_gpu_gpt_interactive_example.cc.o
[100%] Built target gptneox_example
[100%] Linking CXX executable ../../../bin/multi_gpu_gpt_example
[100%] Linking CXX executable ../../../bin/gptneox_triton_example
[100%] Built target multi_gpu_gpt_example
[100%] Built target gptneox_triton_example
[100%] Linking CXX executable ../../../bin/multi_gpu_gpt_triton_example
[100%] Linking CXX static library ../../../../lib/libth_t5.a
[100%] Built target th_t5
[100%] Built target multi_gpu_gpt_triton_example
[100%] Linking CXX executable ../../../bin/multi_gpu_gpt_async_example
[100%] Linking CXX executable ../../../bin/multi_gpu_gpt_interactive_example
[100%] Built target multi_gpu_gpt_async_example
[100%] Linking CXX static library ../../../../lib/libth_parallel_gpt.a
[100%] Built target th_parallel_gpt
[100%] Linking CXX shared library ../../../lib/libth_transformer.so
[100%] Built target multi_gpu_gpt_interactive_example
[100%] Built target th_transformer

FasterTransformer is now built.

Installing Dependencies

Install the packages required for model inference.

cd /workspace/code/FasterTransformer
pip install -r examples/pytorch/gpt/requirement.txt -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn

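Preparing the Data and Model

Model

This article uses the BLOOM model for the demonstration. BLOOM needs no learned positional encoding and can generate sequences longer than those used in training. Its structure is also similar to OpenAI GPT, so FT provides BLOOM as a variant through the GPT class, as it does for OPT. You can convert a pretrained Huggingface BLOOM model to the FasterTransformer file format with examples/pytorch/gpt/utils/huggingface_bloom_convert.py.

We use bloomz-560m as the base model. It was obtained by multitask fine-tuning bloom-560m on the xP3 dataset.

Download the model: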

cd /workspace/model
git lfs clone https://huggingface.co/bigscience/bloomz-560m

Model files:

> ls -al bloomz-560m
total 2198796
drwxr-xr-x 4 root root       4096 Apr 25 16:50 .
drwxr-xr-x 4 root root       4096 Apr 26 07:06 ..
drwxr-xr-x 9 root root       4096 Apr 25 16:53 .git
-rw-r--r-- 1 root root       1489 Apr 25 16:50 .gitattributes
-rw-r--r-- 1 root root      24778 Apr 25 16:50 README.md
-rw-r--r-- 1 root root        715 Apr 25 16:50 config.json
drwxr-xr-x 4 root root       4096 Apr 25 16:50 logs
-rw-r--r-- 1 root root 1118459450 Apr 25 16:53 model.safetensors
-rw-r--r-- 1 root root 1118530423 Apr 25 16:53 pytorch_model.bin
-rw-r--r-- 1 root root         85 Apr 25 16:50 special_tokens_map.json
-rw-r--r-- 1 root root   14500438 Apr 25 16:50 tokenizer.json
-rw-r--r-- 1 root root        222 Apr 25 16:50 tokenizer_config.json

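Dataset

This article uses the LAMBADA dataset, a benchmark used in NLP (natural language processing). It consists of English passages, and the model is asked to predict the final word of each passage, a form of language modeling. The passages are relatively long and semantically rich, which makes LAMBADA a good test set for evaluating language models.

Download the LAMBADA test set.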

cd /workspace/data
wget -c https://github.com/cybertronai/bflm/raw/master/lambada_test.jsonl

The data format is as follows:

{"text": "In my palm is a clear stone, and inside it is a small ivory statuette. A guardian angel.\n\n"Figured if you're going to be out at night getting hit by cars, you might as well have some backup."\n\nI look at him, feeling stunned. Like this is some sort of sign. But as I stare at Harlin, his mouth curved in a confident grin, I don't care about signs"}
{"text": "Give me a minute to change and I'll meet you at the docks." She'd forced those words through her teeth.\n\n"No need to change. We won't be that long."\n\nShane gripped her arm and started leading her to the dock.\n\n"I can make it there on my own, Shane"}
...
{"text": ""Only one source I know of that would be likely to cough up enough money to finance a phony sleep research facility and pay people big bucks to solve crimes in their dreams," Farrell concluded dryly.\n\n"What can I say?" Ellis unfolded his arms and widened his hands. "Your tax dollars at work."\n\nBefore Farrell could respond, Leila's voice rose from inside the house.\n\n"No insurance?" she wailed. "What do you mean you don't have any insurance"}
{"text": "Helen's heart broke a little in the face of Miss Mabel's selfless courage. She thought that because she was old, her life was of less value than the others'. For all Helen knew, Miss Mabel had a lot more years to live than she did. "Not going to happen," replied Helen"}
{"text": "Preston had been the last person to wear those chains, and I knew what I'd see and feel if they were slipped onto my skin-the Reaper's unending hatred of me. I'd felt enough of that emotion already in the amphitheater. I didn't want to feel anymore.\n\n"Don't put those on me," I whispered. "Please."\n\nSergei looked at me, surprised by my low, raspy please, but he put down the chains"}
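
Each line is one JSON object with a single text field. A minimal sketch of turning a line into a (context, target) pair follows, assuming the target is the last whitespace-separated word of the passage. How bloom_lambada.py actually tokenizes and scores the target is not shown in this article, so the split below is an illustration only, and the sample line is shortened from the data above.

```python
import json

def split_context_target(line):
    """Split one JSONL record into the context and its final word."""
    text = json.loads(line)["text"]
    context, _, target = text.rpartition(" ")
    return context, target

# Build a well-formed JSONL line (json.dumps escapes the inner quotes).
sample_line = json.dumps({"text": 'For all Helen knew. "Not going to happen," replied Helen'})
context, target = split_context_target(sample_line)
```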

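Model Format Conversion

To avoid the extra work of splitting a model for model parallelism, FasterTransformer provides a tool that splits models from different formats and converts them into the FasterTransformer binary format, which FasterTransformer can then load directly.

Convert the Huggingface Transformers weights to the FasterTransformer format.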

cd /workspace/code/FasterTransformer

python examples/pytorch/gpt/utils/huggingface_bloom_convert.py \
    --input-dir /workspace/model/bloomz-560m \
    --output-dir /workspace/model/bloomz-560m-convert \
    --data-type fp16 \
    -tp 1 -v

Conversion output:

python examples/pytorch/gpt/utils/huggingface_bloom_convert.py \
>     --input-dir /workspace/model/bloomz-560m \
>     --output-dir /workspace/model/bloomz-560m-convert \
>     --data-type fp16 \
>     -tp 1 -v

======================= Arguments =======================
 - input_dir...........: /workspace/model/bloomz-560m
 - output_dir..........: /workspace/model/bloomz-560m-convert
 - tensor_para_size....: 1
 - data_type...........: fp16
 - processes...........: 1
 - verbose.............: True
 - by_shard............: False
=========================================================
loading from pytorch bin format
model file num: 1
 - model.wte.......................................: shape (250880, 1024)     | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.wte.bin
 - model.pre_decoder_layernorm.weight..............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.pre_decoder_layernorm.weight.bin
 - model.pre_decoder_layernorm.bias................: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.pre_decoder_layernorm.bias.bin
 - model.layers.0.input_layernorm.weight...........: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.input_layernorm.weight.bin
 - model.layers.0.input_layernorm.bias.............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.input_layernorm.bias.bin
 - model.layers.0.attention.query_key_value.weight.: shape (1024, 3, 1024)  s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.attention.query_key_value.weight.0.bin (0/1)
 - model.layers.0.attention.query_key_value.bias...: shape (3, 1024)        s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.attention.query_key_value.bias.0.bin (0/1)
 - model.layers.0.attention.dense.weight...........: shape (1024, 1024)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.attention.dense.weight.0.bin (0/1)
 - model.layers.0.attention.dense.bias.............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.attention.dense.bias.bin
 - model.layers.0.post_attention_layernorm.weight..: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.post_attention_layernorm.weight.bin
 - model.layers.0.post_attention_layernorm.bias....: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.post_attention_layernorm.bias.bin
 - model.layers.0.mlp.dense_h_to_4h.weight.........: shape (1024, 4096)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.mlp.dense_h_to_4h.weight.0.bin (0/1)
 - model.layers.0.mlp.dense_h_to_4h.bias...........: shape (4096,)          s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.0.mlp.dense_h_to_4h.bias.0.bin (0/1)
...
rs.22.mlp.dense_4h_to_h.bias.bin
 - model.layers.23.input_layernorm.weight..........: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.input_layernorm.weight.bin
 - model.layers.23.input_layernorm.bias............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.input_layernorm.bias.bin
 - model.layers.23.attention.query_key_value.weight: shape (1024, 3, 1024)  s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.attention.query_key_value.weight.0.bin (0/1)
 - model.layers.23.attention.query_key_value.bias..: shape (3, 1024)        s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.attention.query_key_value.bias.0.bin (0/1)
 - model.layers.23.attention.dense.weight..........: shape (1024, 1024)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.attention.dense.weight.0.bin (0/1)
 - model.layers.23.attention.dense.bias............: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.attention.dense.bias.bin
 - model.layers.23.post_attention_layernorm.weight.: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.post_attention_layernorm.weight.bin
 - model.layers.23.post_attention_layernorm.bias...: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.post_attention_layernorm.bias.bin
 - model.layers.23.mlp.dense_h_to_4h.weight........: shape (1024, 4096)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.mlp.dense_h_to_4h.weight.0.bin (0/1)
 - model.layers.23.mlp.dense_h_to_4h.bias..........: shape (4096,)          s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.mlp.dense_h_to_4h.bias.0.bin (0/1)
 - model.layers.23.mlp.dense_4h_to_h.weight........: shape (4096, 1024)     s | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.mlp.dense_4h_to_h.weight.0.bin (0/1)
 - model.layers.23.mlp.dense_4h_to_h.bias..........: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.layers.23.mlp.dense_4h_to_h.bias.bin
 - model.final_layernorm.weight....................: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.final_layernorm.weight.bin
 - model.final_layernorm.bias......................: shape (1024,)            | saved at /workspace/model/bloomz-560m-convert/1-gpu/model.final_layernorm.bias.bin
Checkpoint conversion (HF >> FT) has done (elapsed time: 17.07 sec)

The files after conversion to the FasterTransformer format look like this:

> tree bloomz-560m-convert/
bloomz-560m-convert/
└── 1-gpu
    ├── config.ini
    ├── model.final_layernorm.bias.bin
    ├── model.final_layernorm.weight.bin
    ├── model.layers.0.attention.dense.bias.bin
    ├── model.layers.0.attention.dense.weight.0.bin
    ├── model.layers.0.attention.query_key_value.bias.0.bin
    ├── model.layers.0.attention.query_key_value.weight.0.bin
    ├── model.layers.0.input_layernorm.bias.bin
    ├── model.layers.0.input_layernorm.weight.bin
    ├── model.layers.0.mlp.dense_4h_to_h.bias.bin
    ├── model.layers.0.mlp.dense_4h_to_h.weight.0.bin
    ├── model.layers.0.mlp.dense_h_to_4h.bias.0.bin
    ├── model.layers.0.mlp.dense_h_to_4h.weight.0.bin
    ├── model.layers.0.post_attention_layernorm.bias.bin
    ├── model.layers.0.post_attention_layernorm.weight.bin
    ├── model.layers.1.attention.dense.bias.bin
    ...
    ├── model.layers.8.post_attention_layernorm.weight.bin
    ├── model.layers.9.attention.dense.bias.bin
    ├── model.layers.9.attention.dense.weight.0.bin
    ├── model.layers.9.attention.query_key_value.bias.0.bin
    ├── model.layers.9.attention.query_key_value.weight.0.bin
    ├── model.layers.9.input_layernorm.bias.bin
    ├── model.layers.9.input_layernorm.weight.bin
    ├── model.layers.9.mlp.dense_4h_to_h.bias.bin
    ├── model.layers.9.mlp.dense_4h_to_h.weight.0.bin
    ├── model.layers.9.mlp.dense_h_to_4h.bias.0.bin
    ├── model.layers.9.mlp.dense_h_to_4h.weight.0.bin
    ├── model.layers.9.post_attention_layernorm.bias.bin
    ├── model.layers.9.post_attention_layernorm.weight.bin
    ├── model.pre_decoder_layernorm.bias.bin
    ├── model.pre_decoder_layernorm.weight.bin
    └── model.wte.bin
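
A quick way to sanity-check a converted file is to compare its size with the shape printed in the conversion log. The sketch below assumes that each .bin file is a raw, headerless, little-endian array in the data type passed to the converter; that layout is an assumption and should be confirmed against the converter script. The helper names are hypothetical, and the demo writes its own small temporary file rather than touching real weights.

```python
import math
import os
import struct
import tempfile

BYTES_PER_VALUE = {"fp32": 4, "fp16": 2}
STRUCT_CODE = {"fp32": "f", "fp16": "e"}

def expected_bytes(shape, data_type):
    """File size implied by a tensor shape and data type."""
    return math.prod(shape) * BYTES_PER_VALUE[data_type]

def read_weight(path, shape, data_type):
    """Read a raw weight file as a flat list after checking its size."""
    size = os.path.getsize(path)
    if size != expected_bytes(shape, data_type):
        raise ValueError(f"{path}: {size} bytes, expected {expected_bytes(shape, data_type)}")
    count = math.prod(shape)
    with open(path, "rb") as f:
        return list(struct.unpack(f"<{count}{STRUCT_CODE[data_type]}", f.read()))

# Size that model.wte.bin should have for shape (250880, 1024) in fp16.
wte_bytes = expected_bytes((250880, 1024), "fp16")

# Round trip on a tiny stand-in file (values exactly representable in fp16).
values = [1.0, 2.0, 0.5, -1.0, 0.25, 4.0]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.weight.bin")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<6e", *values))
    demo = read_weight(demo_path, (2, 3), "fp16")
```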

Model Benchmarking

The following uses the official example to compare the response time of Huggingface Transformers and FasterTransformer.

Huggingface Transformers Benchmark

Command:

# Run HF benchmark
CUDA_VISIBLE_DEVICES=1 python examples/pytorch/gpt/bloom_lambada.py \
    --tokenizer-path /workspace/model/bloomz-560m \
    --dataset-path /workspace/data/lambada_test.jsonl \
    --lib-path build/lib/libth_transformer.so \
    --test-hf \
    --show-progress

Output:

python examples/pytorch/gpt/bloom_lambada.py \
>     --tokenizer-path /workspace/model/bloomz-560m \
>     --dataset-path /workspace/data/lambada_test.jsonl \
>     --lib-path bulid/lib/libth_transformer.so \
>     --test-hf \
>     --show-progress

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 1
 - pipeline_para_size.......: 1
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 8
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: None
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: bulid/lib/libth_transformer.so
 - test_hf..................: True
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 645/645 [02:33<00:00,  4.21it/s]
Accuracy: 39.4722% (2034/5153) (elapsed time: 146.7230 sec)

FasterTransformer Benchmark

Command:

# Run FT benchmark
python examples/pytorch/gpt/bloom_lambada.py \
    --checkpoint-path /workspace/model/bloomz-560m-convert/1-gpu \
    --tokenizer-path /workspace/model/bloomz-560m \
    --dataset-path /workspace/data/lambada_test.jsonl \
    --lib-path build/lib/libth_transformer.so \
    --show-progress

Note: you can also add --data-type fp16 to load the model in half precision and reduce GPU memory usage.

Output:

python examples/pytorch/gpt/bloom_lambada.py \
>     --checkpoint-path /workspace/model/bloomz-560m-convert/1-gpu \
>     --tokenizer-path /workspace/model/bloomz-560m \
>     --dataset-path /workspace/data/lambada_test.jsonl \
>     --lib-path build/lib/libth_transformer.so \
>     --show-progress

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 1
 - pipeline_para_size.......: 1
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 8
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/1-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 1
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 1
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][INFO] Device NVIDIA A800 80GB PCIe
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 645/645 [00:18<00:00, 34.58it/s]
Accuracy: 39.4722% (2034/5153) (elapsed time: 13.0032 sec)

Comparing Huggingface Transformers and FasterTransformer

HF:    Accuracy: 39.4722% (2034/5153) (elapsed time: 146.7230 sec)
FT:    Accuracy: 39.4722% (2034/5153) (elapsed time: 13.0032 sec)

Accuracy is identical, but FasterTransformer finishes in 13.0 seconds versus 146.7 seconds for Huggingface Transformers, roughly 11 times faster in this run.
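
The speedup and throughput implied by the two log lines can be worked out directly. The numbers are copied from the logs above; they come from a single run each, so they indicate the order of magnitude rather than a rigorous benchmark.

```python
hf_seconds = 146.7230   # Huggingface Transformers elapsed time
ft_seconds = 13.0032    # FasterTransformer elapsed time
samples = 5153          # LAMBADA test examples evaluated in each run

speedup = hf_seconds / ft_seconds
hf_throughput = samples / hf_seconds   # examples per second
ft_throughput = samples / ft_seconds

print(f"speedup: {speedup:.2f}x")
print(f"HF: {hf_throughput:.1f} examples/s, FT: {ft_throughput:.1f} examples/s")
```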

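Model-Parallel Inference (Multi-GPU)

For large models such as GPT-3 (175B) and OPT-175B, a single GPU cannot hold the whole model, so inference must run in a distributed (model-parallel) fashion. The two approaches, tensor parallelism and pipeline parallelism, were described above and are not repeated here.

Tensor Parallelism

Model Conversion

To split the model across multiple GPUs with tensor parallelism (TP), convert it as shown below, here for inference on 2 GPUs.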

python examples/pytorch/gpt/utils/huggingface_bloom_convert.py \
--input-dir /workspace/model/bloomz-560m \
--output-dir /workspace/model/bloomz-560m-convert \
--data-type fp16 \
-tp 2 -v

The files after conversion to the FasterTransformer format with tensor-parallel size 2 look like this:

tree /workspace/model/bloomz-560m-convert/2-gpu
/workspace/model/bloomz-560m-convert/2-gpu
├── config.ini
├── model.final_layernorm.bias.bin
├── model.final_layernorm.weight.bin
├── model.layers.0.attention.dense.bias.bin
├── model.layers.0.attention.dense.weight.0.bin
├── model.layers.0.attention.dense.weight.1.bin
├── model.layers.0.attention.query_key_value.bias.0.bin
├── model.layers.0.attention.query_key_value.bias.1.bin
├── model.layers.0.attention.query_key_value.weight.0.bin
├── model.layers.0.attention.query_key_value.weight.1.bin
├── model.layers.0.input_layernorm.bias.bin
├── model.layers.0.input_layernorm.weight.bin
├── model.layers.0.mlp.dense_4h_to_h.bias.bin
├── model.layers.0.mlp.dense_4h_to_h.weight.0.bin
├── model.layers.0.mlp.dense_4h_to_h.weight.1.bin
├── model.layers.0.mlp.dense_h_to_4h.bias.0.bin
├── model.layers.0.mlp.dense_h_to_4h.bias.1.bin
├── model.layers.0.mlp.dense_h_to_4h.weight.0.bin
├── model.layers.0.mlp.dense_h_to_4h.weight.1.bin
├── model.layers.0.post_attention_layernorm.bias.bin
├── model.layers.0.post_attention_layernorm.weight.bin
...
├── model.layers.9.attention.dense.bias.bin
├── model.layers.9.attention.dense.weight.0.bin
├── model.layers.9.attention.dense.weight.1.bin
├── model.layers.9.attention.query_key_value.bias.0.bin
├── model.layers.9.attention.query_key_value.bias.1.bin
├── model.layers.9.attention.query_key_value.weight.0.bin
├── model.layers.9.attention.query_key_value.weight.1.bin
├── model.layers.9.input_layernorm.bias.bin
├── model.layers.9.input_layernorm.weight.bin
├── model.layers.9.mlp.dense_4h_to_h.bias.bin
├── model.layers.9.mlp.dense_4h_to_h.weight.0.bin
├── model.layers.9.mlp.dense_4h_to_h.weight.1.bin
├── model.layers.9.mlp.dense_h_to_4h.bias.0.bin
├── model.layers.9.mlp.dense_h_to_4h.bias.1.bin
├── model.layers.9.mlp.dense_h_to_4h.weight.0.bin
├── model.layers.9.mlp.dense_h_to_4h.weight.1.bin
├── model.layers.9.post_attention_layernorm.bias.bin
├── model.layers.9.post_attention_layernorm.weight.bin
├── model.pre_decoder_layernorm.bias.bin
├── model.pre_decoder_layernorm.weight.bin
└── model.wte.bin

0 directories, 438 files
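The listing also shows which tensors are sharded: names ending in `.0.bin` and `.1.bin` hold one slice per tensor-parallel rank, while the layernorm weights and the two unsplit biases are stored once. As a sanity check, the sketch below rebuilds the expected file names from the pattern visible above and the 24 layers reported in the run log. The function name and the two lists are my own reading of the listing, not part of the converter script.

```python
# Per-layer tensors that appear with a rank suffix (.0.bin, .1.bin) in the listing.
SHARDED = [
    "attention.dense.weight",
    "attention.query_key_value.bias",
    "attention.query_key_value.weight",
    "mlp.dense_4h_to_h.weight",
    "mlp.dense_h_to_4h.bias",
    "mlp.dense_h_to_4h.weight",
]

# Per-layer tensors that appear once, without a rank suffix.
REPLICATED = [
    "attention.dense.bias",
    "input_layernorm.bias",
    "input_layernorm.weight",
    "mlp.dense_4h_to_h.bias",
    "post_attention_layernorm.bias",
    "post_attention_layernorm.weight",
]

# Files that are not tied to a particular layer.
GLOBAL_FILES = [
    "config.ini",
    "model.final_layernorm.bias.bin",
    "model.final_layernorm.weight.bin",
    "model.pre_decoder_layernorm.bias.bin",
    "model.pre_decoder_layernorm.weight.bin",
    "model.wte.bin",
]


def expected_files(num_layers, tp):
    """Return the file names implied by the listing for the given sizes."""
    files = list(GLOBAL_FILES)
    for layer in range(num_layers):
        prefix = f"model.layers.{layer}."
        for name in SHARDED:
            files.extend(f"{prefix}{name}.{rank}.bin" for rank in range(tp))
        files.extend(f"{prefix}{name}.bin" for name in REPLICATED)
    return files


files = expected_files(num_layers=24, tp=2)
print(len(files))
```

Each layer contributes 6 x 2 sharded files plus 6 replicated ones, so 24 layers give 432 files, and the 6 global files bring the total to the 438 reported by `tree`.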

Tensor-parallel inference

Command:

mpirun -n 2 --allow-run-as-root python examples/pytorch/gpt/bloom_lambada.py \
    --checkpoint-path /workspace/model/bloomz-560m-convert/2-gpu \
    --tokenizer-path /workspace/model/bloomz-560m \
    --dataset-path /workspace/data/lambada_test.jsonl \
    --lib-path build/lib/libth_transformer.so \
    --tensor-para-size 2 \
    --pipeline-para-size 1 \
    --show-progress

Run log:

mpirun -n 2 --allow-run-as-root python examples/pytorch/gpt/bloom_lambada.py \
>     --checkpoint-path /workspace/model/bloomz-560m-convert/2-gpu \
>     --tokenizer-path /workspace/model/bloomz-560m \
>     --dataset-path /workspace/data/lambada_test.jsonl \
>     --lib-path build/lib/libth_transformer.so \
>     --tensor-para-size 2 \
>     --pipeline-para-size 1 \
>     --show-progress

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 2
 - pipeline_para_size.......: 1
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 8
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/2-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 2
 - pipeline_para_size.......: 1
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 8
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/2-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 2
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 1
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 2
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 1
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
world_size: 2
world_size: 2
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=2, nccl_comm=0x5556305627d0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5556305d5d20]
[FT][INFO] Device NVIDIA A800 80GB PCIe
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=1, world_size=2, nccl_comm=0x55b9600a9ca0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x55b96011cff0]
[FT][INFO] Device NVIDIA A800 80GB PCIe
/workspace/code/FasterTransformer/examples/pytorch/gpt/utils/gpt.py:221: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert(self.pre_embed_idx < self.post_embed_idx, "Pre decoder embedding index should be lower than post decoder embedding index.")
  0%|          | 0/645 [00:00

 
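Pipeline parallelism

Model conversion

If you use only pipeline parallelism and no tensor parallelism, set tp to 1. If you need tensor and pipeline parallelism together, set tp to the tensor-parallel degree. See the model conversion section above for the exact command.

Pipeline-parallel inference

Command:

CUDA_VISIBLE_DEVICES=1,2 mpirun -n 2 --allow-run-as-root python examples/pytorch/gpt/bloom_lambada.py \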
    --checkpoint-path /workspace/model/bloomz-560m-convert/1-gpu \
    --tokenizer-path /workspace/model/bloomz-560m \
    --dataset-path /workspace/data/lambada_test.jsonl \
    --lib-path build/lib/libth_transformer.so \
    --tensor-para-size 1 \
    --pipeline-para-size 2 \
    --batch-size 1 \
    --show-progress 
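The two launch commands in this article follow the same pattern: the number of MPI processes equals the tensor-parallel size times the pipeline-parallel size, and the checkpoint directory is named after the tensor-parallel size (`2-gpu`, `1-gpu`). The helper below is a small sketch that assembles such a command. The function name and its defaults are hypothetical, the paths are the ones used in this article, and the directory naming is inferred from the two examples shown here rather than from the converter's documentation.

```python
import shlex


def build_launch_command(tp, pp, batch_size=None,
                         convert_dir="/workspace/model/bloomz-560m-convert",
                         tokenizer_dir="/workspace/model/bloomz-560m",
                         dataset="/workspace/data/lambada_test.jsonl"):
    """Assemble the mpirun command line for a given TP/PP combination."""
    world_size = tp * pp  # one process (and one GPU) per TP x PP slot
    cmd = [
        "mpirun", "-n", str(world_size), "--allow-run-as-root",
        "python", "examples/pytorch/gpt/bloom_lambada.py",
        "--checkpoint-path", f"{convert_dir}/{tp}-gpu",  # assumed naming: <tp>-gpu
        "--tokenizer-path", tokenizer_dir,
        "--dataset-path", dataset,
        "--lib-path", "build/lib/libth_transformer.so",
        "--tensor-para-size", str(tp),
        "--pipeline-para-size", str(pp),
    ]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    cmd.append("--show-progress")
    return cmd


# Reproduce the pipeline-parallel command above (without CUDA_VISIBLE_DEVICES).
command = build_launch_command(tp=1, pp=2, batch_size=1)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` directly or printed with `shlex.join` for copy and paste.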
Run log:

CUDA_VISIBLE_DEVICES=1,2 mpirun -n 2 --allow-run-as-root python examples/pytorch/gpt/bloom_lambada.py \
>     --checkpoint-path /workspace/model/bloomz-560m-convert/1-gpu \
>     --tokenizer-path /workspace/model/bloomz-560m \
>     --dataset-path /workspace/data/lambada_test.jsonl \
>     --lib-path build/lib/libth_transformer.so \
>     --tensor-para-size 1 \
>     --pipeline-para-size 2 \
>     --batch-size 1 \
>     --show-progress

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 1
 - pipeline_para_size.......: 2
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 1
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/1-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================

=================== Arguments ===================
 - num_heads................: None
 - size_per_head............: None
 - inter_size...............: None
 - num_layers...............: None
 - vocab_size...............: None
 - tensor_para_size.........: 1
 - pipeline_para_size.......: 2
 - remove_padding...........: True
 - shared_contexts_ratio....: 1.0
 - batch_size...............: 1
 - output_length............: 32
 - beam_width...............: 1
 - top_k....................: 1
 - top_p....................: 1.0
 - temperature..............: 1.0
 - len_penalty..............: 0.0
 - beam_search_diversity_rate: 0.0
 - start_id.................: 0
 - end_id...................: 2
 - repetition_penalty.......: 1.0
 - random_seed..............: None
 - return_cum_log_probs.....: 0
 - checkpoint_path..........: /workspace/model/bloomz-560m-convert/1-gpu
 - dataset_path.............: /workspace/data/lambada_test.jsonl
 - output_path..............: None
 - tokenizer_path...........: /workspace/model/bloomz-560m
 - lib_path.................: build/lib/libth_transformer.so
 - test_hf..................: False
 - acc_threshold............: None
 - show_progress............: True
 - inference_data_type......: None
 - weights_data_type........: None
 - int8_mode................: 0
=================================================
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 1
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 2
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
[FT][INFO] Load BLOOM model
 - head_num.................: 16
 - size_per_head............: 64
 - layer_num................: 24
 - tensor_para_size.........: 1
 - vocab_size...............: 250880
 - start_id.................: 1
 - end_id...................: 2
 - weights_data_type........: fp16
 - layernorm_eps............: 1e-05
 - inference_data_type......: fp16
 - lib_path.................: build/lib/libth_transformer.so
 - pipeline_para_size.......: 2
 - shared_contexts_ratio....: 1.0
 - int8_mode................: 0
world_size: 2
world_size: 2
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5557a53dc1b0] pipeline_para=NcclParam[rank=0, world_size=2, nccl_comm=0x5557a5444df0]
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=0, world_size=1, nccl_comm=0x560cf3452820] pipeline_para=NcclParam[rank=1, world_size=2, nccl_comm=0x560cf34bb190]
[FT][INFO] Device NVIDIA A800 80GB PCIe
[FT][INFO] Device NVIDIA A800 80GB PCIe
100%|██████████| 5153/5153 [01:51<00:00, 46.12it/s] current process id: 47861   Accuracy: 39.4527% (2033/5153) (elapsed time: 102.1145 sec)
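current process id: 47862   Accuracy: 39.4527% (2033/5153) (elapsed time: 102.3391 sec)

Comparing single GPU, pipeline parallelism, and tensor parallelism

The runs below are a simple test of single-GPU, tensor-parallel, and pipeline-parallel inference at a batch size (BZ) of 1. Treat the numbers as a rough reference only: other training jobs were running on the machine during the test and may have affected the results.

TP=1, PP=1, BZ=1:

Cumulative response time: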
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5153/5153 [02:21<00:00, 36.31it/s]
current process id: 47645   Accuracy: 39.4527% (2033/5153) (elapsed time: 132.2274 sec)

GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8356      C   python                           1740MiB |
+-----------------------------------------------------------------------------+

TP=2, PP=1, BZ=1:

Cumulative response time:
100%|██████████| 5153/5153 [00:35<00:00, 144.80it/s]current process id: 49111   Accuracy: 39.4916% (2035/5153) (elapsed time: 26.1384 sec)
current process id: 49112   Accuracy: 39.4916% (2035/5153) (elapsed time: 26.5110 sec)


GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     41339      C   python                           1692MiB |
|    2   N/A  N/A     41340      C   python                           1692MiB |
+-----------------------------------------------------------------------------+

TP=1, PP=2, BZ=1:

Cumulative response time:
100%|██████████| 5153/5153 [00:33<00:00, 153.92it/s]current process id: 48755   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.1695 sec)
current process id: 48754   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.4391 sec)


GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      4001      C   python                           1952MiB |
|    2   N/A  N/A      4002      C   python                           1952MiB |
+-----------------------------------------------------------------------------+

TP=1, PP=3, BZ=1:

Cumulative response time:
100%|██████████| 5153/5153 [00:33<00:00, 152.46it/s]current process id: 48220   Accuracy: 0.0000% (0/5153) (elapsed time: 24.9212 sec)
100%|██████████| 5153/5153 [00:33<00:00, 153.63it/s]current process id: 48219   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.9767 sec)
current process id: 48221   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.3489 sec)

GPU memory usage:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     57588      C   python                           1420MiB |
|    1   N/A  N/A     57589      C   python                           1468MiB |
|    2   N/A  N/A     57590      C   python                           1468MiB |
+-----------------------------------------------------------------------------+ 
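To compare the four configurations side by side, the sketch below pulls the elapsed time out of each result line with a regular expression and reports the slowest rank per configuration relative to the single-GPU run. The log lines are copied from the outputs above, while the regex and the helper name are my own. Only the elapsed times are used, since one of the three TP=1, PP=3 ranks printed 0/5153 for accuracy. Given the interference from other jobs mentioned earlier, read the ratios as indicative at best.

```python
import re

# Matches e.g. "Accuracy: 39.4527% (2033/5153) (elapsed time: 132.2274 sec)"
RESULT_RE = re.compile(
    r"Accuracy: ([\d.]+)% \((\d+)/(\d+)\) \(elapsed time: ([\d.]+) sec\)"
)

# One result line per rank, copied from the runs above.
logs = {
    "TP=1, PP=1": [
        "current process id: 47645   Accuracy: 39.4527% (2033/5153) (elapsed time: 132.2274 sec)",
    ],
    "TP=2, PP=1": [
        "current process id: 49111   Accuracy: 39.4916% (2035/5153) (elapsed time: 26.1384 sec)",
        "current process id: 49112   Accuracy: 39.4916% (2035/5153) (elapsed time: 26.5110 sec)",
    ],
    "TP=1, PP=2": [
        "current process id: 48755   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.1695 sec)",
        "current process id: 48754   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.4391 sec)",
    ],
    "TP=1, PP=3": [
        "current process id: 48220   Accuracy: 0.0000% (0/5153) (elapsed time: 24.9212 sec)",
        "current process id: 48219   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.9767 sec)",
        "current process id: 48221   Accuracy: 39.4527% (2033/5153) (elapsed time: 24.3489 sec)",
    ],
}


def slowest_rank_seconds(lines):
    """Return the largest elapsed time among the ranks of one run."""
    times = []
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            times.append(float(match.group(4)))
    return max(times)


baseline = slowest_rank_seconds(logs["TP=1, PP=1"])
for name, lines in logs.items():
    seconds = slowest_rank_seconds(lines)
    print(f"{name}: {seconds:.2f} s, {baseline / seconds:.2f}x vs single GPU")
```

Taking the slowest rank is a conservative choice, because the run is only finished once every process has returned.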
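Conclusion

This article briefly introduced the basic concepts behind FasterTransformer and showed how to use it for single-GPU and distributed model inference. Hopefully it helps you get up to speed with FasterTransformer quickly.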