MindIE Service Overview & Quick Start

Table of Contents

  • MindIE Service Overview
    • What It Is
    • Architecture
  • Installing and Deploying MindIE Service
    • Installation
    • Deployment
  • MindIE Service Quick Start
    • API Calls
    • Accuracy Testing
    • Performance Testing
    • Stopping the Service

MindIE Service Overview

What It Is

MindIE Service is an inference serving framework for general model scenarios. Through an open, extensible serving platform architecture, it provides inference-as-a-service capabilities, supports the request interfaces of mainstream inference frameworks, and meets the high-performance inference needs of large language models.

MindIE Service consists of MindIE Service Tools, MindIE Client, MindIE MS (MindIE Management Service), and MindIE Server. On one hand, it integrates with the Ascend inference acceleration engine to boost large-model performance on Ascend hardware; on the other, by connecting to the existing ecosystem of mainstream inference frameworks, it gradually draws users toward the fully native serving framework through high performance and ease of use.

Architecture

The overall architecture is shown below:
(Figure: MindIE Service overall architecture)

  • MindIE Service Tools: the Ascend inference serving toolkit. Its main features are large-model inference performance testing, accuracy testing, and visualization, and it supports improving throughput through configuration.
  • MindIE Client: the Ascend inference serving client. Paired with MindIE Server, it provides complete serving capabilities, including the communication protocol for MindIE Server and the request and response interfaces exposed to user applications.
  • MindIE MS: service policy management, providing service operations capabilities. Its main features include model management at the Pod level and at the instance level within a Pod, simplified deployment with service-quality monitoring, model updates, failure rescheduling, and auto-scaling with load balancing, which improve both service quality and inference hardware utilization.
  • MindIE Server: the inference server. It provides model inference as a service and supports deploying a RESTful service from the command line.
    • EndPoint: simple API interfaces. EndPoint offers inference service developers easy-to-use APIs that encapsulate the serving protocol and interfaces, and it supports the request interfaces of the mainstream Triton/OpenAI/TGI/vLLM frameworks.
    • GMIS: the model inference scheduler, providing multi-instance scheduling. GMIS supports workflow-based extension of the inference pipeline; driven by workflows, it implements an extensible architecture from inference task scheduling to task execution, accommodating a variety of inference methods.
    • BackendManager: the model execution backend, managing the Ascend backend and custom backends. It presents a unified abstract interface across inference engines and models, making extension easy and reducing the changes required when engines or models evolve.
  • MindIE Backends: supports the Ascend MindIE LLM backend.

Installing and Deploying MindIE Service

Installation

MindIE Service cannot be installed standalone; the full MindIE package must be installed. For instructions, see the Ascend community documentation: MindIE Installation Guide.

If the following message is displayed after installation, the software was installed successfully:

xxx install success

Here xxx is the actual name of the installed package.

Deployment

  • Single-node inference mode: reference link
  • Multi-node inference mode: reference link

MindIE Service Quick Start

API Calls

MindIE Service provides interfaces compatible with third-party frameworks such as TGI 0.9.4, vLLM 0.2.6, OpenAI, and Triton, as well as a native inference interface and health-check interfaces. For the full list, see: interface list

Examples for the main interfaces follow:

  • TGI 0.9.4-compatible interface
    Single-modal text inference: POST https://{ip}:{port}/generate
    Request body:
{
    "inputs": "My name is Olivier and I",
    "parameters": {
        "decoder_input_details": true,
        "details": true,
        "do_sample": true,
        "max_new_tokens": 20,
        "repetition_penalty": 1.03,
        "return_full_text": false,
        "seed": null,
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "truncate": null,
        "typical_p": 0.5,
        "watermark": false,
        "stop": null,
        "adapter_id": "None"
    }
}

Response body:

{
    "details": {
        "finish_reason": "length",
        "generated_tokens": 1,
        "prefill": [{
            "id": 0,
            "logprob":null,
            "special": null,
            "text": "test"
        }],
        "prompt_tokens": 74,
        "seed": 42,
        "tokens": [{
            "id": 0,
            "logprob": null,
            "special": null,
            "text": "test"
        }]
    },
    "generated_text": "am a Frenchman living in the UK. I have been working as an IT consultant for "
}
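The TGI-compatible request above can be issued from Python. The sketch below is a minimal client under stated assumptions: the server address is a placeholder, and certificate verification is disabled only because test deployments often use self-signed certificates. `build_tgi_payload` and `tgi_generate` are illustrative names, not MindIE APIs.

```python
import json
import ssl
import urllib.request

def build_tgi_payload(prompt, max_new_tokens=20, temperature=0.5):
    # Mirrors a subset of the TGI 0.9.4-style request body shown above.
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "details": True,
        },
    }

def tgi_generate(base_url, prompt):
    # base_url is a placeholder such as "https://{ip}:{port}".
    req = urllib.request.Request(
        base_url + "/generate",
        data=json.dumps(build_tgi_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Do NOT disable certificate verification in production.
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.loads(resp.read())["generated_text"]
```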
  • vLLM 0.2.6-compatible interface
    Single-modal text inference: POST https://{ip}:{port}/generate
    Request body:
{
    "prompt": "My name is Olivier and I",
    "max_tokens": 20,
    "repetition_penalty": 1.03,
    "presence_penalty": 1.2,
    "frequency_penalty": 1.2,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 10,
    "seed": null,
    "stream": false,
    "stop": null,
    "stop_token_ids": null,
    "model": "None",
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false
}

Response body:

{"text":["My name is Olivier and I am a Frenchman living in the UK. I am a keen photographer and"]}
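The vLLM-compatible response wraps completions in a "text" list, one entry per returned sequence. A small helper to pull out the first completion (`first_completion` is an illustrative name, not a MindIE API):

```python
def first_completion(response):
    # vLLM-style responses look like {"text": ["..."]}; return the first
    # sequence, or None when the list is missing or empty.
    texts = response.get("text", [])
    return texts[0] if texts else None

# Shaped like the sample response body above:
sample = {"text": ["My name is Olivier and I am a keen photographer"]}
```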
  • OpenAI-compatible interface
    Text inference: POST https://{ip}:{port}/v1/chat/completions
    Request body:
{
    "model": "gpt-3.5-turbo",
    "messages": [{
        "role": "user",
        "content": "You are a helpful assistant."
    }],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 0,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 20
}

Response body:

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "gpt-3.5-turbo-0613",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nHello there, how may I assist you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21
    }
}
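Assuming the OpenAI-style chat.completion shape shown above, the assistant reply and token usage can be extracted as follows (`extract_reply` is an illustrative helper, not part of MindIE):

```python
def extract_reply(response):
    # Returns (content, finish_reason, total_tokens) from an
    # OpenAI-style chat.completion response body.
    choice = response["choices"][0]
    return (choice["message"]["content"],
            choice["finish_reason"],
            response["usage"]["total_tokens"])

# Shaped like the sample response body above:
sample = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant",
                             "content": "Hello there, how may I assist you today?"},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21},
}
content, reason, total = extract_reply(sample)
```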
  • Triton-compatible interface
    Single-modal text inference: POST https://{ip}:{port}/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate
    Request body:
{
    "id":"a123",
    "text_input": "My name is Olivier and I",
    "parameters": {
        "details": true,
        "do_sample": true,
        "max_new_tokens":20,
        "repetition_penalty": 1.1,
        "seed": 123,
        "temperature": 1,
        "top_k": 10,
        "top_p": 0.99,
        "batch_size":100,
        "typical_p": 0.5,
        "watermark": false,
        "perf_stat": false,
        "priority": 5,
        "timeout": 10
    }
}

Response body:

{
    "id": "a123",
    "model_name": "llama_65b",
    "model_version": null,
    "text_output": "am living in South of France.\nI have been addicted to Jurassic Park since very young. I played some video game versions but especially the great first pinball model from William which reminds me a lot of JPOG1 by song (deluxe). Unfortunately, it stopped working and has been unprofitable for a long time before being exchanged for another game. Fortunately there was the computer version. Nevertheless, it came out only on PC in 2003 when mine was too weak... It's just been a couple of months that the game came out on Mac (a whole 15 years late) with the Version 0.91JAMS ! I know this may be a little antique with the realistic animations and versions today, but the memories are very deep-seated . So thank you all rebuilders for keeping alive wonderful games like this one.\nSince then, I try to keep me updated about this game and test if possible later Alpha. Thank you so much for your work!",
    "details": {
        "finish_reason": "eos_token",
        "generated_tokens": 221,
        "first_token_cost": null,
        "decode_cost": null
    }
}
  • Native interface
    Single-modal text inference: POST https://{ip}:{port}/infer
    Request body:
{
    "inputs": "My name is Olivier and I",
    "stream": false,
    "parameters": {
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "max_new_tokens": 20,
        "do_sample": true,
        "seed": null,
        "repetition_penalty": 1.03,
        "details": true,
        "typical_p": 0.5,
        "watermark": false,
        "priority": 5,
        "timeout": 10
    }
}

Response body:

{
    "generated_text": "am a french native speaker. I am looking for a job in the hospitality industry. I",
    "details": {
        "finish_reason": "length",
        "generated_tokens": 20,
        "seed": 846930886
    }
}
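Since most /infer parameters stay the same across calls, a small payload builder keeps client code tidy. A sketch whose defaults mirror the native request body above (`build_infer_payload` is an illustrative name, not a MindIE API):

```python
def build_infer_payload(prompt, stream=False, **overrides):
    # Defaults mirror the native /infer request body shown above;
    # keyword overrides replace individual entries in "parameters".
    params = {
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "max_new_tokens": 20,
        "do_sample": True,
        "repetition_penalty": 1.03,
        "details": True,
    }
    params.update(overrides)
    return {"inputs": prompt, "stream": stream, "parameters": params}
```

For example, `build_infer_payload("My name is Olivier and I", max_new_tokens=64)` produces the body above with a larger output budget; POST it as JSON to `https://{ip}:{port}/infer`.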

Accuracy Testing

Use the MindIE Benchmark tool for accuracy testing. For a more detailed introduction to MindIE Benchmark, see: link

Example:

benchmark \
--DatasetPath "/{dataset_path}/GSM8K" \
--DatasetType "gsm8k" \
--ModelName "baichuan2_13b" \
--ModelPath "/{model_path}/baichuan2-13b" \
--TestType client \
--Http https://{ipAddress}:{port} \
--ManagementHttp https://{managementIpAddress}:{managementPort} \
--MaxOutputLen 512 \
--TestAccuracy True

The --TestAccuracy True flag is the switch that enables accuracy testing.

Performance Testing

Use the MindIE Benchmark tool for performance testing. For a more detailed introduction to MindIE Benchmark, see: link

Example:

benchmark \
--DatasetPath "/{dataset_path}/GSM8K" \
--DatasetType "gsm8k" \
--ModelName "baichuan2_13b" \
--ModelPath "/{model_path}/baichuan2-13b" \
--TestType client \
--Http https://{ipAddress}:{port} \
--ManagementHttp https://{managementIpAddress}:{managementPort} \
--MaxOutputLen 512

Stopping the Service

  • Method 1: stop the process with the kill command
# List all processes related to mindieservice
ps -ef | grep mindieservice

# Stop the main process with kill
kill {PID of the mindieservice_daemon main process}
  • Method 2: stop the processes with the pkill command
pkill -9 mindieservice