This tutorial series covers the main use cases of txtai, an AI-powered semantic search platform. Each chapter in the series has associated code, which can also be run in Colab.
Colab notebook link
txtai primarily has support for Hugging Face Transformers and ONNX models. This enables txtai to hook into the rich model framework ecosystem available in Python, export this functionality to other languages (JavaScript, Java, Go, Rust) via the API, and even export and natively load models with ONNX.
What about other machine learning frameworks? Say we have an existing, well-tuned TF-IDF + Logistic Regression model. Can that model be exported to ONNX and used in txtai for labeling and similarity queries? Or how about a simple PyTorch text classifier? Yes, both can!
With the onnxmltools library, traditional models from scikit-learn, XGBoost and others can be exported to ONNX and loaded into txtai. Additionally, Hugging Face's Trainer module can train generic PyTorch modules. This notebook walks through examples of each.
Install txtai and all dependencies. Since this article uses ONNX exports, we also need to install the pipeline extras package.
pip install txtai[pipeline,similarity] datasets
For this example, we'll load the emotion dataset from Hugging Face datasets and build a TF-IDF + Logistic Regression model with scikit-learn.
The emotion dataset has the following labels: sadness, joy, love, anger, fear and surprise.
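If you want to double-check those label names programmatically, the datasets library exposes them through the dataset's ClassLabel feature. A quick sketch, assuming the emotion dataset keeps its standard schema:

from datasets import load_dataset

# Label ids map to names via the ClassLabel feature
names = load_dataset("emotion")["train"].features["label"].names
print(names)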
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
ds = load_dataset("emotion")
# Train the model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter=250))
])
pipeline.fit(ds["train"]["text"], ds["train"]["label"])
# Determine accuracy on validation set
results = pipeline.predict(ds["validation"]["text"])
labels = ds["validation"]["label"]
results = [results[x] == label for x, label in enumerate(labels)]
print("Accuracy =", sum(results) / len(ds["validation"]))
Accuracy = 0.8595
86% accuracy - not bad! As much as we're all enamored with deep learning and advanced methods, good ole TF-IDF + Logistic Regression still performs well and runs much faster. If that level of accuracy works for your task, there is no reason to overcomplicate things.
The next section exports this model to ONNX and shows how it can be used for similarity queries.
from txtai.pipeline import Labels, MLOnnx, Similarity
def tokenize(inputs, **kwargs):
    if isinstance(inputs, str):
        inputs = [inputs]

    return {"input_ids": [[x] for x in inputs]}

def query(model, tokenizer, multilabel=False):
    # Load models into similarity pipeline
    similarity = Similarity((model, tokenizer), dynamic=False)

    # Add labels to model
    similarity.pipeline.model.config.id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
    similarity.pipeline.model.config.label2id = dict((v, k) for k, v in similarity.pipeline.model.config.id2label.items())

    inputs = ["that caught me off guard", "I didn't see that coming", "i feel bad", "What a wonderful goal!"]

    scores = similarity("joy", inputs, multilabel)
    for uid, score in scores[:5]:
        print(inputs[uid], score)
# Export to ONNX
onnx = MLOnnx()
model = onnx(pipeline)
# Create labels pipeline using scikit-learn ONNX model
sklabels = Labels((model, tokenize), dynamic=False)
# Add labels to model
sklabels.pipeline.model.config.id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
sklabels.pipeline.model.config.label2id = dict((v, k) for k, v in sklabels.pipeline.model.config.id2label.items())
# Run test query using model
query(model, tokenize, None)
What a wonderful goal! 0.909473717212677
I didn't see that coming 0.47113093733787537
that caught me off guard 0.42067453265190125
i feel bad 0.019547615200281143
txtai can use standard text classification models for similarity queries, where the labels are a fixed list of queries. The output above shows the top results for the query "joy".
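The same exported ONNX model also works as a plain classifier through the Labels pipeline created above (sklabels). A minimal usage sketch - the input sentence is just an illustrative example, not from the dataset:

# Classify a single input with the ONNX-backed Labels pipeline defined earlier
# Returns a list of (label id, score) tuples sorted by score
print(sklabels("i am feeling quite cheerful today"))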
The next section defines a simple PyTorch text classifier. The transformers library has a trainer package that supports training PyTorch models, assuming some standard conventions/naming are used.
# Set predictable seeds
import os
import random
import torch
import numpy as np

from torch import nn
from torch.nn import CrossEntropyLoss

from transformers import AutoConfig, AutoTokenizer

from txtai.models import Registry
from txtai.pipeline import HFTrainer

from transformers.modeling_outputs import SequenceClassifierOutput

def seed(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

class Simple(nn.Module):
    def __init__(self, vocab, dimensions, labels):
        super().__init__()

        self.config = AutoConfig.from_pretrained("bert-base-uncased")

        self.labels = labels

        self.embedding = nn.EmbeddingBag(vocab, dimensions)
        self.classifier = nn.Linear(dimensions, labels)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.classifier.weight.data.uniform_(-initrange, initrange)
        self.classifier.bias.data.zero_()

    def forward(self, input_ids=None, labels=None, **kwargs):
        embeddings = self.embedding(input_ids)
        logits = self.classifier(embeddings)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.labels), labels.view(-1))

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
        )
# Set seed for reproducibility
seed()
# Define model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = Simple(tokenizer.vocab_size, 128, len(ds["train"].unique("label")))
# Train model
train = HFTrainer()
model, tokenizer = train((model, tokenizer), ds["train"], per_device_train_batch_size=8, learning_rate=1e-3, num_train_epochs=15, logging_steps=10000)
# Register custom model to fully support pipelines
Registry.register(model)
# Create labels pipeline using PyTorch model
thlabels = Labels((model, tokenizer), dynamic=False)
# Determine accuracy on validation set
results = [row["label"] == thlabels(row["text"])[0][0] for row in ds["validation"]]
print("Accuracy = ", sum(results) / len(ds["validation"]))
Accuracy = 0.883
88% accuracy this time. Quite good for such a simple network, and something that could certainly be improved further.
Let's run a similarity query with this model once again.
query(model, tokenizer)
What a wonderful goal! 1.0
that caught me off guard 0.9998751878738403
I didn't see that coming 0.7328283190727234
i feel bad 5.2972134609891875e-19
Nearly the same result order as the scikit-learn model (the two middle results are swapped) with variations in the scores, which is to be expected given this is an entirely different model.
The PyTorch model above consists of an embedding layer and a linear classifier. What if we took that embedding layer and used it for similarity queries? Let's give it a try.
from txtai.embeddings import Embeddings
class SimpleEmbeddings(nn.Module):
    def __init__(self, embeddings):
        super().__init__()

        self.embeddings = embeddings

    def forward(self, input_ids=None, **kwargs):
        return (self.embeddings(input_ids),)
embeddings = Embeddings({"method": "pooling", "path": SimpleEmbeddings(model.embedding), "tokenizer": "bert-base-uncased"})
print(embeddings.similarity("mad", ["Glad you found it", "Happy to see you", "I'm angry"]))
[(2, 0.8323876857757568), (1, -0.11010512709617615), (0, -0.16152513027191162)]
It looks like the embeddings have stored some knowledge. Given the training dataset, would these embeddings be good enough to build a semantic search index, especially for emotion-driven data? Possibly. They would certainly run faster than a standard transformer model (see below).
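To test that idea, a small index can be built over a handful of sentences and queried. A minimal sketch using the pooled embedding layer defined above - the documents here are made-up examples, not taken from the dataset:

# Index a few example sentences and run a semantic search against them
data = ["Glad you found it", "Happy to see you", "I'm angry", "What a wonderful goal!"]
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
print(embeddings.search("joy", 3))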
train = HFTrainer()
model, tokenizer = train("microsoft/xtremedistil-l6-h384-uncased", ds["train"], logging_steps=2000)
tflabels = Labels((model, tokenizer), dynamic=False)
# Determine accuracy on validation set
results = [row["label"] == tflabels(row["text"])[0][0] for row in ds["validation"]]
print("Accuracy = ", sum(results) / len(ds["validation"]))
Accuracy = 0.93
As expected, the accuracy is better. The model above is a distilled model; a model like "roberta-base" would score even higher accuracy, with the tradeoff of increased training/inference time.
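For reference, swapping in a larger backbone only means changing the model path passed to the trainer. A sketch that is not run here, so the accuracy gain and runtime cost remain assumptions:

# Same training call with a larger model; expect higher accuracy but longer training/inference
rbmodel, rbtokenizer = train("roberta-base", ds["train"], logging_steps=2000)
rblabels = Labels((rbmodel, rbtokenizer), dynamic=False)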
Speaking of speed, let's compare how fast these models run.
import time
# Test inputs
inputs = ds["test"]["text"]
print("Testing speed of %d items" % len(inputs))
start = time.time()
r1 = sklabels(inputs, multilabel=None)
print("TF-IDF + Logistic Regression time =", time.time() - start)
start = time.time()
r2 = thlabels(inputs)
print("PyTorch time =", time.time() - start)
start = time.time()
r3 = tflabels(inputs)
print("Transformers time =", time.time() - start, "\n")
# Compare model results
for x in range(5):
    print("index: %d" % x)
    print(r1[x][0])
    print(r2[x][0])
    print(r3[x][0], "\n")
Testing speed of 2000 items
TF-IDF + Logistic Regression time = 1.116208791732788
PyTorch time = 2.2385385036468506
Transformers time = 15.705108880996704
index: 0
(0, 0.7258279323577881)
(0, 1.0)
(0, 0.998250424861908)
index: 1
(0, 0.854256272315979)
(0, 1.0)
(0, 0.9981004595756531)
index: 2
(0, 0.6306578516960144)
(0, 0.9999700784683228)
(0, 0.9981676340103149)
index: 3
(1, 0.554378092288971)
(1, 0.9998960494995117)
(1, 0.9985388517379761)
index: 4
(0, 0.8961835503578186)
(0, 1.0)
(0, 0.9981957077980042)
This notebook showed how frameworks outside of Transformers and ONNX can be used as models in txtai.
As the section above shows, TF-IDF + Logistic Regression was roughly 14 times faster than the distilled transformer model in this run, and the simple PyTorch network was about 7 times faster. Depending on your accuracy requirements, it may make sense to use a simpler model to get better runtime performance.
https://dev.to/neuml/tutorial-series-on-txtai-ibg