https://kserve.github.io/website/0.10/modelserving/v1beta1/triton/torchscript/
While Python is a suitable and preferred language for many scenarios requiring dynamism and ease of iteration, there are equally many situations where precisely these properties of Python are unfavorable. One environment in which the latter often applies is production, the land of low latencies and strict deployment requirements. For production scenarios, C++ is very often the language of choice. The following example outlines the path PyTorch provides to go from an existing Python model to a serialized representation that can be loaded and executed purely from C++ (such as the Triton Inference Server), with no dependency on Python.
1. Ensure you have KServe installed.
2. Skip tag resolution for nvcr.io, which requires auth to resolve the Triton Inference Server image digest:
kubectl patch cm config-deployment --patch '{"data":{"registriesSkippingTagResolving":"nvcr.io"}}' -n knative-serving
3. Increase the progress deadline, since pulling the Triton image and the big BERT model may take longer than the default 120-second timeout; this setting requires Knative 0.15.0+:
kubectl patch cm config-deployment --patch '{"data":{"progressDeadline": "600s"}}' -n knative-serving
A PyTorch model's journey from Python to C++ is enabled by TorchScript, a representation of a PyTorch model that can be understood, compiled, and serialized by the TorchScript compiler. If you are starting out from an existing PyTorch model written in the vanilla eager API, you must first convert your model to TorchScript.
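The tracing step below assumes a trained CIFAR-10 classifier bound to the name net. The full example trains such a model first; as a stand-in, a minimal sketch of a compatible (untrained) network could look like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for the trained CIFAR-10 model referenced as `net`
# in the tracing snippet below (3x32x32 input, 10 class logits).
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

net = Net()
net.eval()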
Convert the model above via tracing and serialize the script module to a file:
import torch
# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
example = torch.rand(1, 3, 32, 32)
traced_script_module = torch.jit.trace(net, example)
traced_script_module.save("model.pt")
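As a quick sanity check, the saved file can be loaded back without the original Python class definition, which is essentially what Triton's libtorch backend does at serving time:

import torch

# Load the serialized ScriptModule and run a dummy forward pass.
loaded = torch.jit.load("model.pt")
example = torch.rand(1, 3, 32, 32)
print(loaded(example).shape)  # expected: torch.Size([1, 10])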
Once the model is exported as a TorchScript model file, the next step is to upload the model to a GCS bucket. Triton supports loading multiple models, so it expects a model repository in the bucket that follows the required layout.
<model-repository-path>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    ...
For example, in your model repository bucket gs://kfserving-examples/models/torchscript, the layout can be:
torchscript/
  cifar10/
    config.pbtxt
    1/
      model.pt
config.pbtxt defines a model configuration that provides the required and optional information about the model. A minimal model configuration must specify name, platform, max_batch_size, input, and output. Since there are no input and output names in a TorchScript model, the name attribute of the inputs and outputs in the configuration must follow a specific naming convention, i.e. "<name>__<index>", where <name> can be any string and <index> refers to the position of the corresponding input/output. This means if there are two inputs and two outputs, they must be named INPUT__0, INPUT__1 and OUTPUT__0, OUTPUT__1, such that INPUT__0 refers to the first input, INPUT__1 refers to the second input, and so on.
name: "cifar"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [3,32,32]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [10]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
To schedule the model on a GPU, change the instance_group to the GPU kind:
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
Create the InferenceService YAML with the model repository URI specified above.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torchscript-cifar10
spec:
  predictor:
    triton:
      storageUri: gs://kfserving-examples/models/torchscript
      runtimeVersion: 20.10-py3
      env:
        - name: OMP_NUM_THREADS
          value: "1"
Warning
Setting the OMP_NUM_THREADS env variable is critical for performance: OMP_NUM_THREADS is commonly used in numpy, PyTorch, and TensorFlow to perform multi-threaded linear algebra. We want one thread per worker instead of many threads per worker to avoid contention.
kubectl apply -f torchscript.yaml
Expected output
$ inferenceservice.serving.kserve.io/torchscript-cifar10 created
The first step is to determine the ingress IP and port and set INGRESS_HOST and INGRESS_PORT.
The latest Triton Inference Server has switched to use the KServe Predict V2 protocol, so the input request needs to follow the V2 schema with the specified data type and shape.
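For reference, a minimal sketch of how such a V2 request body could be generated in Python (a random tensor stands in here; the input.json downloaded below contains a real CIFAR-10 image):

import json
import numpy as np

# Build a KServe V2 REST request matching the config.pbtxt above:
# one FP32 input named INPUT__0 with shape [1, 3, 32, 32].
image = np.random.rand(1, 3, 32, 32).astype(np.float32)
request = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, 3, 32, 32],
            "datatype": "FP32",
            "data": image.flatten().tolist(),
        }
    ]
}
with open("input.json", "w") as f:
    json.dump(request, f)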
# download the input file
curl -O https://raw.githubusercontent.com/kserve/kserve/master/docs/samples/v1beta1/triton/torchscript/input.json
MODEL_NAME=cifar10
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchscript-cifar10 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer -d $INPUT_PATH
Expected output
* Connected to torchscript-cifar.default.svc.cluster.local (10.51.242.87) port 80 (#0)
> POST /v2/models/cifar10/infer HTTP/1.1
> Host: torchscript-cifar.default.svc.cluster.local
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 110765
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< content-length: 315
< content-type: application/json
< date: Sun, 11 Oct 2020 21:26:51 GMT
< x-envoy-upstream-service-time: 8
< server: istio-envoy
<
* Connection #0 to host torchscript-cifar.default.svc.cluster.local left intact
{"model_name":"cifar10","model_version":"1","outputs":[{"name":"OUTPUT__0","datatype":"FP32","shape":[1,10],"data":[-2.0964810848236086,-0.13700756430625916,-0.5095657706260681,2.795621395111084,-0.5605481863021851,1.9934231042861939,1.1288187503814698,-1.4043136835098267,0.6004879474639893,-2.1237082481384279]}]}
Run a performance test; the QPS --rate can be changed in perf.yaml.
kubectl create -f perf.yaml
Requests [total, rate, throughput] 6000, 100.02, 100.01
Duration [total, attack, wait] 59.995s, 59.99s, 4.961ms
Latencies [min, mean, 50, 90, 95, 99, max] 4.222ms, 5.7ms, 5.548ms, 6.384ms, 6.743ms, 9.286ms, 25.85ms
Bytes In [total, mean] 1890000, 315.00
Bytes Out [total, mean] 665874000, 110979.00
Success [ratio] 100.00%
Status Codes [code:count] 200:6000
Error Set:
Create the InferenceService YAML and expose the gRPC port. Currently only one port can be exposed, either HTTP or gRPC, and the HTTP port is exposed by default.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torchscript-cifar10
spec:
  predictor:
    triton:
      storageUri: gs://kfserving-examples/models/torchscript
      runtimeVersion: 20.10-py3
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
      env:
        - name: OMP_NUM_THREADS
          value: "1"
Apply the gRPC InferenceService YAML; once the InferenceService is ready you can call the model with the tritonclient python library.
kubectl apply -f torchscript_grpc.yaml
After the gRPC InferenceService is ready, grpcurl can be used to send gRPC requests to the InferenceService.
# download the proto file
curl -O https://raw.githubusercontent.com/kserve/kserve/master/docs/predict-api/v2/grpc_predict_v2.proto
# download the input json file
curl -O https://raw.githubusercontent.com/kserve/website/main/docs/modelserving/v1beta1/triton/torchscript/input-grpc.json
INPUT_PATH=input-grpc.json
PROTO_FILE=grpc_predict_v2.proto
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchscript-cifar10 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
The gRPC APIs follow the KServe Predict V2 protocol.
For example, the ServerReady API can be used to check whether the server is ready:
grpcurl \
-plaintext \
-proto ${PROTO_FILE} \
-authority ${SERVICE_HOSTNAME}" \
${INGRESS_HOST}:${INGRESS_PORT} \
inference.GRPCInferenceService.ServerReady
Expected output
{
"ready": true
}
The ModelInfer API takes input following the ModelInferRequest schema defined in the grpc_predict_v2.proto file. Note that the input file is different from the one used in the previous curl example.
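As a rough sketch, a ModelInferRequest body for grpcurl could be assembled like this (assuming the tensor values are sent through the proto's fp32_contents field; the downloaded input-grpc.json may encode the data differently, e.g. via raw_input_contents):

import json
import numpy as np

# Hypothetical generator for a grpcurl-compatible ModelInferRequest JSON body.
image = np.random.rand(1, 3, 32, 32).astype(np.float32)
request = {
    "model_name": "cifar10",
    "inputs": [
        {
            "name": "INPUT__0",
            "datatype": "FP32",
            "shape": [1, 3, 32, 32],
            "contents": {"fp32_contents": image.flatten().tolist()},
        }
    ],
}
with open("input-grpc.json", "w") as f:
    json.dump(request, f)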
grpcurl \
-vv \
-plaintext \
-proto ${PROTO_FILE} \
-H "Host: ${SERVICE_HOSTNAME}" \
-d @ \
${INGRESS_HOST}:${INGRESS_PORT} \
inference.GRPCInferenceService.ModelInfer \
<<< $(cat "$INPUT_PATH")
Expected output
Resolved method descriptor:
// The ModelInfer API performs inference using the specified model. Errors are
// indicated by the google.rpc.Status returned for the request. The OK code
// indicates success and other codes indicate failure.
rpc ModelInfer ( .inference.ModelInferRequest ) returns ( .inference.ModelInferResponse );
Request metadata to send:
host: torchscript-cifar10.default.example.com
Response headers received:
accept-encoding: identity,gzip
content-type: application/grpc
date: Fri, 12 Aug 2022 01:49:53 GMT
grpc-accept-encoding: identity,deflate,gzip
server: istio-envoy
x-envoy-upstream-service-time: 16
Response contents:
{
"modelName": "cifar10",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT__0",
"datatype": "FP32",
"shape": [
"1",
"10"
]
}
],
"rawOutputContents": [
"wCwGwOJLDL7icgK/dusyQAqAD799KP8/In2QP4zAs7+WuRk/2OoHwA=="
]
}
Response trailers received:
(empty)
Sent 1 request and received 1 response
The output tensor contents are encoded in the rawOutputContents field. They can be base64 decoded and loaded into a NumPy array with the given data type and shape.
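For example, the response above can be decoded like this:

import base64
import numpy as np

# rawOutputContents holds the raw little-endian FP32 bytes of OUTPUT__0,
# base64 encoded; reshape to the reported shape [1, 10].
raw = "wCwGwOJLDL7icgK/dusyQAqAD799KP8/In2QP4zAs7+WuRk/2OoHwA=="
logits = np.frombuffer(base64.b64decode(raw), dtype=np.float32).reshape(1, 10)
print(logits)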
Alternatively, Triton also provides a Python client library with many examples showing how to interact with the KServe V2 gRPC protocol.
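A rough sketch of such a client call against this model (assuming tritonclient[grpc] is installed and the client points directly at the predictor, e.g. via port-forwarding; going through the ingress additionally needs the host/authority override shown in the grpcurl example):

import numpy as np
import tritonclient.grpc as grpcclient

# Send the same CIFAR-10 inference request through the Triton gRPC client.
client = grpcclient.InferenceServerClient(url="localhost:8001")  # hypothetical endpoint
image = np.random.rand(1, 3, 32, 32).astype(np.float32)
infer_input = grpcclient.InferInput("INPUT__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)
result = client.infer(model_name="cifar10", inputs=[infer_input])
print(result.as_numpy("OUTPUT__0"))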
The Triton Inference Server expects tensors as input data, so a preprocessing step is often needed before making the prediction call when the user sends a request in a raw input format. For user-implemented pre/post-processing code, a transformer component can be specified on the InferenceService spec. The user is responsible for creating a Python class extending the KServe Model base class which implements the preprocess handler to convert the raw input format into the tensor format expected by the V2 prediction protocol, and the postprocess handler to convert the raw prediction response into a more user-friendly response.
image_transformer_v2.py
import kserve
from typing import Dict
from PIL import Image
import torchvision.transforms as transforms
import logging
import io
import numpy as np
import base64
logging.basicConfig(level=kserve.constants.KSERVE_LOGLEVEL)
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

def image_transform(instance):
    byte_array = base64.b64decode(instance['image_bytes']['b64'])
    image = Image.open(io.BytesIO(byte_array))
    a = np.asarray(image)
    im = Image.fromarray(a)
    res = transform(im)
    logging.info(res)
    return res.tolist()

class ImageTransformerV2(kserve.Model):
    def __init__(self, name: str, predictor_host: str, protocol: str):
        super().__init__(name)
        self.predictor_host = predictor_host
        self.protocol = protocol

    def preprocess(self, inputs: Dict) -> Dict:
        return {
            'inputs': [
                {
                    'name': 'INPUT__0',
                    'shape': [1, 3, 32, 32],
                    'datatype': "FP32",
                    'data': [image_transform(instance) for instance in inputs['instances']]
                }
            ]
        }

    def postprocess(self, results: Dict) -> Dict:
        return {output["name"]: np.array(output["data"]).reshape(output["shape"]).tolist()
                for output in results["outputs"]}
Please find the code example and Dockerfile.
docker build -t $DOCKER_USER/image-transformer-v2:latest -f transformer.Dockerfile . --rm
Create the InferenceService using the YAML below, which adds the image transformer component with the Docker image built above.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-transfomer
spec:
  predictor:
    triton:
      storageUri: gs://kfserving-examples/models/torchscript
      runtimeVersion: 20.10-py3
      env:
        - name: OMP_NUM_THREADS
          value: "1"
  transformer:
    containers:
      - image: kfserving/image-transformer-v2:latest
        name: kserve-container
        command:
          - "python"
          - "-m"
          - "image_transformer_v2"
        args:
          - --model_name
          - cifar10
          - --protocol
          - v2
kubectl apply -f torch_transformer.yaml
Expected output
$ inferenceservice.serving.kserve.io/torch-transfomer created
The transformer does not enforce a specific schema like the predictor does, but it is generally recommended to send the payload as a list of objects (dicts): "instances": <value>|<(nested)list>|<list-of-objects>
{
  "instances": [
    {
      "image_bytes": { "b64": "aW1hZ2UgYnl0ZXM=" },
      "caption": "seaside"
    },
    {
      "image_bytes": { "b64": "YXdlc29tZSBpbWFnZSBieXRlcw==" },
      "caption": "mountains"
    }
  ]
}
# download the input file
curl -O https://raw.githubusercontent.com/kserve/kserve/master/docs/samples/v1beta1/triton/torchscript/image.json
SERVICE_NAME=torch-transfomer
MODEL_NAME=cifar10
INPUT_PATH=@./image.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $SERVICE_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d $INPUT_PATH
Expected output
> POST /v1/models/cifar10:predict HTTP/1.1
> Host: torch-transformer.kserve-triton.example.com
> User-Agent: curl/7.68.0
> Accept: */*
> Content-Length: 3400
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 219
< content-type: application/json; charset=UTF-8
< date: Sat, 19 Mar 2022 12:15:54 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 41
<
{"OUTPUT__0": [[-2.0964810848236084, -0.137007474899292, -0.5095658302307129, 2.795621395111084, -0.560547947883606, 1.9934231042861938, 1.1288189888000488, - 4043136835098267, 0.600488007068634, -2.1237082481384277]]}%
Bidirectional Encoder Representations from Transformers (BERT) is a method of pre-training language representations which obtains state-of-the-art results on a wide range of natural language processing (NLP) tasks.
This example demonstrates:
Inference on a fine-tuned BERT model for tasks such as question answering.
Here we use a BERT model fine-tuned on the SQuAD 2.0 dataset, which contains 100,000+ question-answer pairs on 500+ articles combined with over 50,000 new, unanswerable questions.
1. Your cluster's Istio Ingress gateway must be network-accessible.
2. Skip tag resolution for nvcr.io, which requires auth to resolve the Triton Inference Server image digest:
kubectl patch cm config-deployment --patch '{"data":{"registriesSkippingTagResolving":"nvcr.io"}}' -n knative-serving
3. Increase the progress deadline, since pulling the Triton image and the big BERT model may take longer than the default 120-second timeout; this setting requires Knative 0.15.0+:
kubectl patch cm config-deployment --patch '{"data":{"progressDeadline": "600s"}}' -n knative-serving
from typing import Dict

import kserve
import numpy as np
import tritonclient.http as httpclient

# Helper modules shipped with this example: BERT WordPiece tokenization and
# SQuAD-style feature extraction / answer post-processing.
import data_processing
import tokenization

class BertTransformer(kserve.Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.short_paragraph_text = "The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975."
        self.predictor_host = predictor_host
        self.tokenizer = tokenization.FullTokenizer(vocab_file="/mnt/models/vocab.txt", do_lower_case=True)
        self.model_name = "bert_tf_v2_large_fp16_128_v2"
        self.triton_client = None

    def preprocess(self, inputs: Dict) -> Dict:
        # Tokenize the paragraph and the incoming question into BERT features.
        self.doc_tokens = data_processing.convert_doc_tokens(self.short_paragraph_text)
        self.features = data_processing.convert_examples_to_features(self.doc_tokens, inputs["instances"][0], self.tokenizer, 128, 128, 64)
        return self.features

    def predict(self, features: Dict) -> Dict:
        if not self.triton_client:
            self.triton_client = httpclient.InferenceServerClient(
                url=self.predictor_host, verbose=True)

        unique_ids = np.zeros([1, 1], dtype=np.int32)
        segment_ids = features["segment_ids"].reshape(1, 128)
        input_ids = features["input_ids"].reshape(1, 128)
        input_mask = features["input_mask"].reshape(1, 128)

        inputs = []
        inputs.append(httpclient.InferInput('unique_ids', [1, 1], "INT32"))
        inputs.append(httpclient.InferInput('segment_ids', [1, 128], "INT32"))
        inputs.append(httpclient.InferInput('input_ids', [1, 128], "INT32"))
        inputs.append(httpclient.InferInput('input_mask', [1, 128], "INT32"))
        inputs[0].set_data_from_numpy(unique_ids)
        inputs[1].set_data_from_numpy(segment_ids)
        inputs[2].set_data_from_numpy(input_ids)
        inputs[3].set_data_from_numpy(input_mask)

        outputs = []
        outputs.append(httpclient.InferRequestedOutput('start_logits', binary_data=False))
        outputs.append(httpclient.InferRequestedOutput('end_logits', binary_data=False))
        result = self.triton_client.infer(self.model_name, inputs, outputs=outputs)
        return result.get_response()

    def postprocess(self, result: Dict) -> Dict:
        end_logits = result['outputs'][0]['data']
        start_logits = result['outputs'][1]['data']
        n_best_size = 20

        # The maximum length of an answer that can be generated. This is needed
        # because the start and end predictions are not conditioned on one another.
        max_answer_length = 30

        (prediction, nbest_json, scores_diff_json) = \
            data_processing.get_predictions(self.doc_tokens, self.features, start_logits, end_logits, n_best_size, max_answer_length)
        return {"predictions": prediction, "prob": nbest_json[0]['probability'] * 100.0}
Please find the code example here.
Build the KServe Transformer image with the code above:
cd bert_tokenizer_v2
docker build -t $USER/bert_transformer-v2:latest . --rm
Alternatively, you can use the pre-built image kfserving/bert-transformer-v2:latest.
Add the custom KServe Transformer image and the Triton Predictor to the InferenceService spec:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "bert-v2"
spec:
transformer:
containers:
- name: kserve-container
image: kfserving/bert-transformer-v2:latest
command:
- "python"
- "-m"
- "bert_transformer_v2"
env:
- name: STORAGE_URI
value: "gs://kfserving-examples/models/triton/bert-transformer"
predictor:
triton:
runtimeVersion: 20.10-py3
resources:
limits:
cpu: "1"
memory: 8Gi
requests:
cpu: "1"
memory: 8Gi
storageUri: "gs://kfserving-examples/models/triton/bert"
Apply the InferenceService YAML.
kubectl apply -f bert_v1beta1.yaml
Expected output
$ inferenceservice.serving.kserve.io/bert-v2 created
kubectl get inferenceservice bert-v2
NAME URL READY AGE
bert-v2 http://bert-v2.default.35.229.120.99.xip.io True 71s
You should see that both the transformer and the predictor are created and in ready state.
kubectl get revision -l serving.kserve.io/inferenceservice=bert-v2
NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON
bert-v2-predictor-default-plhgs bert-v2-predictor-default bert-v2-predictor-default-plhgs 1 True
bert-v2-transformer-default-sd6nc bert-v2-transformer-default bert-v2-transformer-default-sd6nc 1 True
The first step is to determine the ingress IP and port and set INGRESS_HOST and INGRESS_PORT.
Send a question request with the following input. The transformer expects a list of instances or inputs to be sent; preprocessing then converts the input into the tensors expected by the Triton Inference Server.
{
  "instances": [
    "What President is credited with the original notion of putting Americans in space?"
  ]
}
MODEL_NAME=bert-v2
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservices bert-v2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected output
{"predictions": "John F. Kennedy", "prob": 77.91848979818604}
The AMD Inference Server is an easy-to-use inference solution specially designed for AMD CPUs, GPUs, and FPGAs. It can be deployed as a standalone executable, on a Kubernetes cluster with KServe, or used to create custom applications by linking to its C++ API. This example demonstrates how to deploy a TensorFlow GraphDef model on KServe with the AMD Inference Server to run inference on AMD EPYC CPUs.
This example was tested on an Ubuntu 18.04 host using a Bash shell.
These instructions assume:
Refer to the installation instructions for these tools to install them if needed.
This example uses the AMD ZenDNN backend to run inference on TensorFlow models on AMD EPYC CPUs.
To build a Docker image for the AMD Inference Server that uses this backend, download the TF_v2.9_ZenDNN_v3.3_C++_API.zip package from ZenDNN. You must agree to the EULA to download this package. You need a modern version of Docker (at least 18.09) to build this image.
# clone the inference server repository
git clone https://github.com/Xilinx/inference-server.git
# place the downloaded ZenDNN zip in the repository
mv TF_v2.9_ZenDNN_v3.3_C++_API.zip ./inference-server/
# build the image
cd inference-server
./amdinfer dockerize --production --tfzendnn=./TF_v2.9_ZenDNN_v3.3_C++_API.zip
This builds an image on your host named <username>/amdinfer:latest. To use it with KServe, you need to upload this image to a Docker registry server, such as one running on your local machine, and update the YAML files in this example to use this image.
More documentation on building a ZenDNN image for KServe is available: ZenDNN + AMD Inference Server and KServe + AMD Inference Server.
In this example, you will use the MNIST TensorFlow model. The AMD Inference Server also supports PyTorch, ONNX, and Vitis AI models with the appropriate Docker images. To prepare new models, look at the KServe + AMD Inference Server documentation for more information about the expected model format.
The AMD Inference Server can be used in single model serving mode in KServe. The snippets below use the environment variables INGRESS_HOST and INGRESS_PORT to make requests to the cluster. Find the ingress host and port for making requests to your cluster and set these values appropriately.
To use the AMD Inference Server with KServe, add it as a serving runtime. A ClusterServingRuntime configuration file is included in this example. To apply it:
# update the kserve-amdserver.yaml to use the right image
# if you have a different image name, you'll need to edit it manually
sed -i "s//$(whoami)\/amdinfer:latest/" kserve-amdserver.yaml
kubectl apply -f kserve-amdserver.yaml
Once the AMD Inference Server is added as a serving runtime, you can start a service that uses it.
# download the inference service file and input data
curl -O https://raw.githubusercontent.com/kserve/website/master/docs/modelserving/v1beta1/amd/single_model.yaml
curl -O https://raw.githubusercontent.com/kserve/website/master/docs/modelserving/v1beta1/amd/input.json
# create the inference service
kubectl apply -f single_model.yaml
# wait for service to be ready
kubectl wait --for=condition=ready isvc -l app=example-amdserver-runtime-isvc
export SERVICE_HOSTNAME=$(kubectl get inferenceservice example-amdserver-runtime-isvc -o jsonpath='{.status.url}' | cut -d "/" -f 3)
Once the service is ready, you can make requests to it. Assuming INGRESS_HOST, INGRESS_PORT, and SERVICE_HOSTNAME have been defined as above, the following command runs an inference over REST against the example MNIST model.
export MODEL_NAME=mnist
export INPUT_DATA=@./input.json
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer -d ${INPUT_DATA}
This shows the response from the server in KServe's V2 API format. For this example, it will be similar to:
Expected output
{
"id":"",
"model_name":"TFModel",
"outputs":
[
{
"data": [
0.11987821012735367,
0.18648317456245422,
-0.83796119689941406,
-0.088459312915802002,
0.030454874038696289,
0.074872657656669617,
-1.1334009170532227,
-0.046301722526550293,
-0.31683838367462158,
0.32014602422714233
],
"datatype":"FP32",
"name":"input-0",
"parameters":{},
"shape":[10]
}
]
}
For MNIST, the data indicates the likely classification of the input image, which is the digit 9. In this response, the index with the highest value is the last one, indicating that the image was correctly classified as a nine.
When the out-of-the-box serving runtimes do not fit your needs, you can choose to build your own model server using the KServe ModelServer API and deploy it on KServe as a custom serving runtime.
1. Install the pack CLI to build your custom model server image.
The KServe.Model base class mainly defines three handlers: preprocess, predict, and postprocess. These handlers are executed in sequence: the output of preprocess is passed to predict as input, the predict handler executes the inference for your model, and the postprocess handler turns the raw prediction result into a user-friendly inference response. There is an additional load handler for writing custom code to load your model into memory from the local file system or a remote model store. A general good practice is to call the load handler in the model server class's __init__ function, so your model is loaded on startup and ready to serve prediction requests.
import argparse
import base64
import io
from typing import Dict, Union

import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

from kserve import Model, ModelServer
from kserve.utils.utils import generate_uuid

class AlexNetModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.load()

    def load(self):
        self.model = models.alexnet(pretrained=True)
        self.model.eval()
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Decode the base64 image from the "instances" payload.
        img_data = payload["instances"][0]["image"]["b64"]
        raw_img_data = base64.b64decode(img_data)
        input_image = Image.open(io.BytesIO(raw_img_data))
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        input_tensor = preprocess(input_image).unsqueeze(0)
        output = self.model(input_tensor)
        torch.nn.functional.softmax(output, dim=1)
        values, top_5 = torch.topk(output, 5)
        result = values.flatten().tolist()
        response_id = generate_uuid()
        return {"predictions": result}

if __name__ == "__main__":
    model = AlexNetModel("custom-model")
    ModelServer().start([model])
Buildpacks allow you to turn your inference code into images that can be deployed on KServe without needing to define a Dockerfile. Buildpacks automatically detect the Python application, install the dependencies from the requirements.txt file, and look at the Procfile to determine how to start the model server. Here we show how to build the serving image manually with pack; you can also choose to use kpack to run the image build on-cluster and continuously build/deploy new versions from your source git repository.
You can use the pack CLI to build and push your custom model server image:
pack build --builder=heroku/buildpacks:20 ${DOCKER_USER}/custom-model:v1
docker push ${DOCKER_USER}/custom-model:v1
Note: If your buildpack command fails, make sure you have a runtime.txt file with the correct Python version specified. See the custom model server runtime.txt file as an example.
Launch the Docker image built in the last step with buildpack:
docker run -ePORT=8080 -p8080:8080 ${DOCKER_USER}/custom-model:v1
Send a test inference request locally with input.json:
curl localhost:8080/v1/models/custom-model:predict -d @./input.json
Expected output
{"predictions": [[14.861763000488281, 13.94291877746582, 13.924378395080566, 12.182709693908691, 12.00634765625]]}
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ${DOCKER_USER}/custom-model:v1
In the custom.yaml file, edit the container image and replace ${DOCKER_USER} with your Docker Hub username.
Arguments
You can supply additional command arguments on the container spec to configure the model server.
Environment Variables
You can supply additional environment variables on the container spec.
Apply the YAML to deploy the InferenceService on KServe:
kubectl apply -f custom.yaml
Expected output
$ inferenceservice.serving.kserve.io/custom-model created
The first step is to determine the ingress IP and port and set INGRESS_HOST and INGRESS_PORT.
MODEL_NAME=custom-model
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d $INPUT_PATH
Expected output
* Trying 169.47.250.204...
* TCP_NODELAY set
* Connected to 169.47.250.204 (169.47.250.204) port 80 (#0)
> POST /v1/models/custom-model:predict HTTP/1.1
> Host: custom-model.default.example.com
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 105339
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< content-length: 232
< content-type: text/html; charset=UTF-8
< date: Wed, 26 Feb 2020 15:19:15 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 213
<
* Connection #0 to host 169.47.250.204 left intact
{"predictions": [[14.861762046813965, 13.942917823791504, 13.9243803024292, 12.182711601257324, 12.00634765625]]}
kubectl delete -f custom.yaml
KServe gRPC ServingRuntimes enable a high-performance inference data plane that implements the Open (V2) Inference Protocol:
Compared to REST, it has limited browser support and the messages are not human-readable, which requires additional debugging tools.
For the Open (V2) Inference Protocol, KServe provides the InferRequest and InferResponse API objects for the predict, preprocess, and postprocess handlers to abstract the implementation details of REST/gRPC decoding and encoding.
model_grpc.py
import io
from typing import Dict

import torch
from kserve import InferRequest, InferResponse, InferOutput, Model, ModelServer
from kserve.utils.utils import generate_uuid
from PIL import Image
from torchvision import models, transforms

# This custom predictor example implements the custom model following the KServe
# v2 inference gRPC protocol. The input can be raw image bytes or an image tensor
# which is pre-processed by the transformer and then passed to the predictor;
# the output is the prediction response.
class AlexNetModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.model = None
        self.ready = False
        self.load()

    def load(self):
        self.model = models.alexnet(pretrained=True)
        self.model.eval()
        self.ready = True

    def predict(self, payload: InferRequest, headers: Dict[str, str] = None) -> InferResponse:
        req = payload.inputs[0]
        input_image = Image.open(io.BytesIO(req.data[0]))
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        input_tensor = preprocess(input_image)
        input_tensor = input_tensor.unsqueeze(0)
        output = self.model(input_tensor)
        torch.nn.functional.softmax(output, dim=1)
        values, top_5 = torch.topk(output, 5)
        result = values.flatten().tolist()
        response_id = generate_uuid()
        infer_output = InferOutput(name="output-0", shape=list(values.shape), datatype="FP32", data=result)
        infer_response = InferResponse(model_name=self.name, infer_outputs=[infer_output], response_id=response_id)
        return infer_response

if __name__ == "__main__":
    model = AlexNetModel("custom-model")
    model.load()
    ModelServer().start([model])
Similar to building the REST custom image, you can also use the pack CLI to build and push the custom gRPC model server image:
pack build --builder=heroku/buildpacks:20 ${DOCKER_USER}/custom-model-grpc:v1
docker push ${DOCKER_USER}/custom-model-grpc:v1
Note: If your buildpack command fails, make sure you have a runtime.txt file with the correct Python version specified. See the custom model server runtime.txt file as an example.
Launch the Docker image built in the last step with buildpack:
docker run -ePORT=8081 -p8081:8081 ${DOCKER_USER}/custom-model-grpc:v1
Send a test inference request locally using the inference server client grpc_test_client.py:
from kserve import InferRequest, InferInput, InferenceServerClient
import json
import base64
import os
client = InferenceServerClient(url=os.environ.get("INGRESS_HOST", "localhost")+":"+os.environ.get("INGRESS_PORT", "8081"),
                               channel_args=(('grpc.ssl_target_name_override', os.environ.get("SERVICE_HOSTNAME", "")),))
json_file = open("./input.json")
data = json.load(json_file)
infer_input = InferInput(name="input-0", shape=[1], datatype="BYTES", data=[base64.b64decode(data["instances"][0]["image"]["b64"])])
request = InferRequest(infer_inputs=[infer_input], model_name="custom-model")
res = client.infer(infer_request=request)
print(res)
python grpc_test_client.py
Expected output
id: "df27b8a5-f13e-4c7a-af61-20bdb55b6523"
outputs {
name: "output-0"
datatype: "FP32"
shape: 1
shape: 5
contents {
fp32_contents: 14.9756203
fp32_contents: 14.036808
fp32_contents: 13.9660349
fp32_contents: 12.2522783
fp32_contents: 12.0862684
}
}
model_name: "custom-model"
id: "df27b8a5-f13e-4c7a-af61-20bdb55b6523"
outputs {
name: "output-0"
datatype: "FP32"
shape: 1
shape: 5
contents {
fp32_contents: 14.9756203
fp32_contents: 14.036808
fp32_contents: 13.9660349
fp32_contents: 12.2522783
fp32_contents: 12.0862684
}
}
Create the InferenceService YAML and expose the gRPC port by specifying it in the ports section. Currently only one port can be exposed, and the HTTP port is exposed by default.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model-grpc
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ${DOCKER_USER}/custom-model-grpc:v1
        ports:
          - name: h2c
            containerPort: 8081
            protocol: TCP
In the custom_grpc.yaml file, edit the container image and replace ${DOCKER_USER} with your Docker Hub username.
Arguments
You can supply additional command arguments on the container spec to configure the model server.
Apply the YAML to deploy the InferenceService on KServe:
kubectl apply -f custom_grpc.yaml
Expected output
$ inferenceservice.serving.kserve.io/custom-model-grpc created
The first step is to determine the ingress IP and port and set INGRESS_HOST and INGRESS_PORT.
MODEL_NAME=custom-model
SERVICE_HOSTNAME=$(kubectl get inferenceservice custom-model-grpc -o jsonpath='{.status.url}' | cut -d "/" -f 3)
Send an inference request to the gRPC service using the inference server client grpc_test_client.py:
python grpc_test_client.py
Expected output
id: "df27b8a5-f13e-4c7a-af61-20bdb55b6523"
outputs {
name: "output-0"
datatype: "FP32"
shape: 1
shape: 5
contents {
fp32_contents: 14.9756203
fp32_contents: 14.036808
fp32_contents: 13.9660349
fp32_contents: 12.2522783
fp32_contents: 12.0862684
}
}
model_name: "custom-model"
id: "df27b8a5-f13e-4c7a-af61-20bdb55b6523"
outputs {
name: "output-0"
datatype: "FP32"
shape: 1
shape: 5
contents {
fp32_contents: 14.9756203
fp32_contents: 14.036808
fp32_contents: 13.9660349
fp32_contents: 12.2522783
fp32_contents: 12.0862684
}
}
By default the model is loaded in the same process as the HTTP or gRPC server and inference runs in that process as well. If you host multiple models, inference can only run for one model at a time, which limits concurrency when the models share the container. KServe integrates with RayServe, which provides a programmable API to deploy models as separate Python workers, so inference can run in parallel when serving multiple custom models.
import kserve
from typing import Dict
from ray import serve
@serve.deployment(name="custom-model", num_replicas=2)
class AlexNetModel(kserve.Model):
    def __init__(self):
        self.name = "custom-model"
        super().__init__(self.name)
        self.load()

    def load(self):
        ...

    def predict(self, request: Dict) -> Dict:
        ...

if __name__ == "__main__":
    kserve.ModelServer().start({"custom-model": AlexNetModel})
Fractional GPU example
# In addition to the imports above, this variant also requires `import ray`.
@serve.deployment(name="custom-model", num_replicas=2, ray_actor_options={"num_cpus": 1, "num_gpus": 0.5})
class AlexNetModel(kserve.Model):
    def __init__(self):
        self.name = "custom-model"
        super().__init__(self.name)
        self.load()

    def load(self):
        ...

    def predict(self, request: Dict) -> Dict:
        ...

if __name__ == "__main__":
    ray.init(num_cpus=2, num_gpus=1)
    kserve.ModelServer().start({"custom-model": AlexNetModel})
See here for more details about fractional CPUs and GPUs in Ray.
The full code example can be found here.
Modify the Procfile to web: python -m model_remote and then run the pack command above; this builds the serving image which launches each model as a separate Python worker, with the web server routing requests to the model workers.