Export and run other machine learning models

This tutorial series covers the primary use cases of txtai, an AI-powered semantic search platform. Each chapter in the series has associated code and can also be run in Colab.

txtai primarily supports Hugging Face Transformers and ONNX models. This lets txtai hook into the rich model framework ecosystem available in Python, export this functionality to other languages (JavaScript, Java, Go, Rust) via the API, and even export and natively load models with ONNX.

What about other machine learning frameworks? Say we have a highly tuned, pre-existing TF-IDF + Logistic Regression model. Can that model be exported to ONNX and used in txtai for labeling and similarity queries? Or how about a simple PyTorch text classifier? Yes, both can!

With the onnxmltools library, traditional models from scikit-learn, XGBoost and others can be exported to ONNX and loaded into txtai. Additionally, Hugging Face's Trainer module can train generic PyTorch modules. This article walks through examples of all of these.
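
For reference, below is a rough sketch of what a direct scikit-learn to ONNX export looks like with skl2onnx, the scikit-learn ONNX converter (requires skl2onnx to be installed). The pipeline, input name, tensor shape and file name here are illustrative assumptions; txtai's MLOnnx pipeline, used later in this article, handles this step for you.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

# Fit a tiny illustrative pipeline (placeholder training data)
skpipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression())
])
skpipeline.fit(["i am happy", "i am sad"], [1, 0])

# Convert to ONNX - input is a single string column, shape may need adjusting
onnx_model = convert_sklearn(
    skpipeline,
    initial_types=[("input_ids", StringTensorType([None, 1]))]
)

# Serialize the ONNX model to disk
with open("model.onnx", "wb") as output:
    output.write(onnx_model.SerializeToString())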

Install dependencies

Install txtai and all dependencies. Since this article works with ONNX exports, we also need to install the pipeline extras package.

pip install txtai[pipeline,similarity] datasets

Train a TF-IDF + Logistic Regression model

For this example, we load the emotion dataset from Hugging Face datasets and build a TF-IDF + Logistic Regression model with scikit-learn.

The emotion dataset has the following labels:

  • sadness (0)
  • joy (1)
  • love (2)
  • anger (3)
  • fear (4)
  • surprise (5)

from datasets import load_dataset

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

ds = load_dataset("emotion")

# Train the model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter=250))
])

pipeline.fit(ds["train"]["text"], ds["train"]["label"])

# Determine accuracy on validation set
results = pipeline.predict(ds["validation"]["text"])
labels = ds["validation"]["label"]

results = [results[x] == label for x, label in enumerate(labels)]
print("Accuracy =", sum(results) / len(ds["validation"]))
Accuracy = 0.8595

86% accuracy - not bad! As much as we'd all like to believe deep learning and advanced methods are always the answer, good ole TF-IDF + Logistic Regression still holds its own and runs much faster. If that level of accuracy works for your use case, there's no reason to over-complicate things.

Export and load with txtai

The next section exports this model to ONNX and shows how the model can be used for similarity queries.

from txtai.pipeline import Labels, MLOnnx, Similarity

def tokenize(inputs, **kwargs):
    # Pass raw text through as single-element "input_ids" lists,
    # which is the input format the scikit-learn ONNX model expects
    if isinstance(inputs, str):
        inputs = [inputs]

    return {"input_ids": [[x] for x in inputs]}

def query(model, tokenizer, multilabel=False):
    # Load models into similarity pipeline
    similarity = Similarity((model, tokenizer), dynamic=False)

    # Add labels to model
    similarity.pipeline.model.config.id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
    similarity.pipeline.model.config.label2id = dict((v, k) for k, v in similarity.pipeline.model.config.id2label.items())

    inputs = ["that caught me off guard", "I didn t see that coming", "i feel bad", "What a wonderful goal!"]
    scores = similarity("joy", inputs, multilabel)
    for uid, score in scores[:5]:
        print(inputs[uid], score)

# Export to ONNX
onnx = MLOnnx()
model = onnx(pipeline)

# Create labels pipeline using scikit-learn ONNX model
sklabels = Labels((model, tokenize), dynamic=False)

# Add labels to model
sklabels.pipeline.model.config.id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
sklabels.pipeline.model.config.label2id = dict((v, k) for k, v in sklabels.pipeline.model.config.id2label.items())

# Run test query using model
query(model, tokenize, None)
What a wonderful goal! 0.909473717212677
I didn't see that coming 0.47113093733787537
that caught me off guard 0.42067453265190125
i feel bad 0.019547615200281143

txtai can use a standard text classification model for similarity queries, where the labels act as a fixed list of queries. The output above shows the top results for the query "joy".
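
The same scikit-learn ONNX model can also be used as a plain classifier over its fixed label set via the sklabels pipeline created above. A minimal sketch (the input sentences are made-up examples; each result is a list of (label id, score) tuples sorted by score):

# Classify new text against the fixed emotion labels
print(sklabels(["i am overjoyed to see this", "that news makes me so sad"]))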

Train a PyTorch model

The next section defines a simple PyTorch text classifier. The transformers library has a trainer package that supports training PyTorch models, provided some standard conventions/naming are followed.

# Set predictable seeds
import os
import random
import torch

import numpy as np

from torch import nn
from torch.nn import CrossEntropyLoss
from transformers import AutoConfig, AutoTokenizer

from txtai.models import Registry
from txtai.pipeline import HFTrainer

from transformers.modeling_outputs import SequenceClassifierOutput

def seed(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

class Simple(nn.Module):
    def __init__(self, vocab, dimensions, labels):
        super().__init__()

        # A transformers config is needed for this model to work with the Trainer and pipelines
        self.config = AutoConfig.from_pretrained("bert-base-uncased")
        self.labels = labels

        self.embedding = nn.EmbeddingBag(vocab, dimensions)
        self.classifier = nn.Linear(dimensions, labels)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.classifier.weight.data.uniform_(-initrange, initrange)
        self.classifier.bias.data.zero_()

    def forward(self, input_ids=None, labels=None, **kwargs):
        embeddings = self.embedding(input_ids)
        logits = self.classifier(embeddings)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.labels), labels.view(-1))

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
        )

# Set seed for reproducibility
seed()

# Define model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = Simple(tokenizer.vocab_size, 128, len(ds["train"].unique("label")))

# Train model
train = HFTrainer()
model, tokenizer = train((model, tokenizer), ds["train"], per_device_train_batch_size=8, learning_rate=1e-3, num_train_epochs=15, logging_steps=10000)

# Register custom model to fully support pipelines
Registry.register(model)

# Create labels pipeline using PyTorch model
thlabels = Labels((model, tokenizer), dynamic=False)

# Determine accuracy on validation set
results = [row["label"] == thlabels(row["text"])[0][0] for row in ds["validation"]]
print("Accuracy = ", sum(results) / len(ds["validation"]))
Accuracy =  0.883

88% accuracy this time. Very good for such a simple network, and something that could certainly be improved further.

Let's run the similarity queries again using this model.

query(model, tokenizer)
What a wonderful goal! 1.0
that caught me off guard 0.9998751878738403
I didn't see that coming 0.7328283190727234
i feel bad 5.2972134609891875e-19

Same result order as the scikit-learn model, with some variation in the scores, even though this is a completely different model.

Pooled embeddings

The PyTorch model above consists of an embedding layer and a linear classifier. What if we took that embedding layer and used it for similarity queries? Let's give it a try.

from txtai.embeddings import Embeddings

class SimpleEmbeddings(nn.Module):
    def __init__(self, embeddings):
        super().__init__()

        self.embeddings = embeddings

    def forward(self, input_ids=None, **kwargs):
        return (self.embeddings(input_ids),)

embeddings = Embeddings({"method": "pooling", "path": SimpleEmbeddings(model.embedding), "tokenizer": "bert-base-uncased"})
print(embeddings.similarity("mad", ["Glad you found it", "Happy to see you", "I'm angry"]))
[(2, 0.8323876857757568), (1, -0.11010512709617615), (0, -0.16152513027191162)]

It looks like the embeddings have captured some knowledge. Given the training dataset, could these embeddings be good enough to build a semantic search index, especially for emotion-oriented data? Possibly. They would certainly run faster than a standard transformer model (see below).
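
As a rough sketch of that idea (the example sentences are made up; with this configuration, index takes (id, text, tags) tuples and search returns (id, score) tuples):

# Index a few example sentences with the pooled embeddings model
data = ["i am feeling quite cheerful today",
        "this makes me so angry",
        "i was scared walking home alone"]

embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Best match for the query, returned as (id, score)
print(embeddings.search("happy", 1))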

Train a transformer model and compare accuracy/speed

train = HFTrainer()
model, tokenizer = train("microsoft/xtremedistil-l6-h384-uncased", ds["train"], logging_steps=2000)

tflabels = Labels((model, tokenizer), dynamic=False)

# Determine accuracy on validation set
results = [row["label"] == tflabels(row["text"])[0][0] for row in ds["validation"]]
print("Accuracy = ", sum(results) / len(ds["validation"]))
Accuracy =  0.93

As expected, accuracy is better. The model above is a distilled model; even higher accuracy can be had with a model like "roberta-base", at the cost of increased training/inference time.
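
For example, swapping in a larger base model is a one-line change to the training call above (a sketch only; expect noticeably longer training and inference times):

# Trade speed for accuracy: same training call, larger base model
rbmodel, rbtokenizer = train("roberta-base", ds["train"], logging_steps=2000)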

Speaking of speed, let's compare the runtime of these models.

import time

# Test inputs
inputs = ds["test"]["text"]
print("Testing speed of %d items" % len(inputs))

start = time.time()
r1 = sklabels(inputs, multilabel=None)
print("TF-IDF + Logistic Regression time =", time.time() - start)

start = time.time()
r2 = thlabels(inputs)
print("PyTorch time =", time.time() - start)

start = time.time()
r3 = tflabels(inputs)
print("Transformers time =", time.time() - start, "\n")

# Compare model results
for x in range(5):
  print("index: %d" % x)
  print(r1[x][0])
  print(r2[x][0])
  print(r3[x][0], "\n")
Testing speed of 2000 items
TF-IDF + Logistic Regression time = 1.116208791732788
PyTorch time = 2.2385385036468506
Transformers time = 15.705108880996704 

index: 0
(0, 0.7258279323577881)
(0, 1.0)
(0, 0.998250424861908) 

index: 1
(0, 0.854256272315979)
(0, 1.0)
(0, 0.9981004595756531) 

index: 2
(0, 0.6306578516960144)
(0, 0.9999700784683228)
(0, 0.9981676340103149) 

index: 3
(1, 0.554378092288971)
(1, 0.9998960494995117)
(1, 0.9985388517379761) 

index: 4
(0, 0.8961835503578186)
(0, 1.0)
(0, 0.9981957077980042) 

Wrapping up

This article showed how frameworks outside of Hugging Face Transformers and ONNX can be used as models in txtai.

As shown in the previous section, TF-IDF + Logistic Regression runs roughly 14 times faster than the distilled transformer model, and the simple PyTorch network roughly 7 times faster, based on the timings above. Depending on your accuracy requirements, it may make sense to use a simpler model to get better runtime performance.

References

https://dev.to/neuml/tutorial-series-on-txtai-ibg
