Contents
Dataset
Zero-Shot
Few-Shot
Dynamic Few-Shot
Summary
The Text REtrieval Conference (TREC) Question Classification dataset contains about 5,500 labeled questions in the training set and another 500 questions in the test set.
The dataset has 6 coarse class labels and 50 fine class labels. The average sentence length is 10 words and the vocabulary size is 8,700. The 6 coarse class labels are ABBR, ENTY, DESC, HUM, LOC, NUM.
The dataset was collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and 500 questions from TREC 10 used as the test set. The questions were labeled manually.
The dataset is hosted on HuggingFace at https://huggingface.co/datasets/trec and can be loaded with the datasets module as follows:
import openai
from datasets import load_dataset
from sklearn.metrics import classification_report
dataset = load_dataset("trec")
dataset
The output is:
DatasetDict({
    train: Dataset({
        features: ['text', 'coarse_label', 'fine_label'],
        num_rows: 5452
    })
    test: Dataset({
        features: ['text', 'coarse_label', 'fine_label'],
        num_rows: 500
    })
})
The first record of the test set is:
{'text': 'How far is it from Denver to Aspen ?',
'coarse_label': 5,
'fine_label': 40}
Preprocess the data with the following code:
# name of the text and label column
label_type = 'coarse_label'
text_key = "text"
# create mapping of ids2class and class2id
id2class = dict((i, label) for i, label in enumerate(dataset['train'].features[label_type].names))
class2id = dict((label, i) for i, label in enumerate(dataset['train'].features[label_type].names))
# create a dictionary with classes as key and containing all the training examples within that class
class2TrainDataset = dict((label, []) for label in dataset['train'].features[label_type].names)
for example in dataset['train']:
    label = id2class[example[label_type]]
    class2TrainDataset[label].append(example[text_key])
Here, id2class and class2id map label ids to class names and class names to ids respectively, and class2TrainDataset holds the training examples belonging to each class.
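As a quick sanity check, these mappings can be probed directly (an illustrative sketch; the expected values assume the label order [ABBR, ENTY, DESC, HUM, LOC, NUM] listed above):
# Illustrative sanity check of the mappings (label order assumed as above)
print(id2class[5])                     # expected: 'NUM'
print(class2id['LOC'])                 # expected: 4
print(len(class2TrainDataset['NUM']))  # number of NUM questions in the training set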
Build the Zero-Shot prompt with the following code:
# a prompt for asking LLM to perform a task
task_prompt = "As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.\nFollowing are the semantic classes: ["
task_prompt += ", ".join([label for label in class2TrainDataset]) + "]"
# a prompt for asking LLM to generate the output for current task
query_prompt = "\nClassify the following question into one of the above classes. Please answer in a single word.\nquestion: "
answer_prompt = "\noutput: "
The Zero-Shot prompt for the first record of the test set is then:
zeroshot_prompt = task_prompt + query_prompt + dataset['test'][0][text_key] + answer_prompt
>>> zeroshot_prompt
As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:
Call the OpenAI model to generate the answer. The calling function is as follows:
openai.api_key = "sk-xxx"
model_name = "gpt-3.5-turbo-instruct"

import tiktoken

enc = tiktoken.encoding_for_model(model_name)
log_bias_dict = {}
for label in dataset['train'].features["coarse_label"].names:
    for token_id in enc.encode(label):
        log_bias_dict[token_id] = 5

# Text completion using GPT
def trim_text(text):
    return text.strip().strip('\n').strip('\\n')

def generate_using_gpt(prompt):
    generated_sentence = ""
    try:
        # Create a completion for the provided prompt and parameters
        response = openai.Completion.create(
            model=model_name,
            prompt=prompt,
            max_tokens=3,
            temperature=0,
            top_p=1,
            stop=None,
            frequency_penalty=0,
            presence_penalty=0.0,
            logit_bias=log_bias_dict
        )
        choices = response.get("choices", "")
        if len(choices) == 0 or "text" not in choices[0]:
            print("Text not generated properly")
            return generated_sentence
        generated_sentence = choices[0]['text'].strip()
    except openai.error.APIError as e:
        # Handle API error here, e.g. retry or log
        print(f"OpenAI API returned an API Error: {e}")
    except openai.error.AuthenticationError as e:
        # Handle authentication error here, e.g. invalid API key
        print(f"OpenAI API returned an Authentication Error: {e}")
    except openai.error.APIConnectionError as e:
        # Handle connection error here
        print(f"Failed to connect to OpenAI API: {e}")
    except openai.error.InvalidRequestError as e:
        # Handle invalid request error here
        print(f"Invalid Request Error: {e}")
    except openai.error.RateLimitError as e:
        # Handle rate limit error
        print(f"OpenAI API request exceeded rate limit: {e}")
    except openai.error.ServiceUnavailableError as e:
        # Handle service unavailable error
        print(f"Service Unavailable: {e}")
    except openai.error.Timeout as e:
        # Handle request timeout
        print(f"Request timed out: {e}")
    return generated_sentence
The model used is gpt-3.5-turbo-instruct with max_tokens set to 3. To bias the output tokens toward the coarse class labels of the dataset, tiktoken is used to obtain the token IDs of those labels, and logit_bias is applied to boost those token IDs.
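To see exactly which token IDs receive the bias, you can print the encoding of each label (an illustrative sketch; the concrete IDs depend on the tokenizer, and a multi-token label contributes several IDs):
# Illustrative: list the token IDs boosted for each coarse label
for label in dataset['train'].features["coarse_label"].names:
    print(label, enc.encode(label))
print(log_bias_dict)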
Test it on the first record of the test set:
>>> generate_using_gpt(zeroshot_prompt)
'LOC'
Run the Zero-Shot prompt over the full test set with the following code:
# prompt without any examples from the training dataset
labels = []
predictions = []
for example in dataset['test']:
    zeroshot_prompt = task_prompt + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(zeroshot_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])
report = classification_report(labels, predictions, digits=4)
The evaluation results are:
precision recall f1-score support
0 0.6364 0.7778 0.7000 9
1 0.4432 0.4149 0.4286 94
2 0.7154 0.6377 0.6743 138
3 0.9455 0.8000 0.8667 65
4 0.8222 0.9136 0.8655 81
5 0.8195 0.9646 0.8862 113
accuracy 0.7380 500
macro avg 0.7304 0.7514 0.7369 500
weighted avg 0.7336 0.7380 0.7324 500
The weighted-avg F1 score is 0.7324.
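If you prefer to read this number programmatically rather than from the printed report, classification_report also supports output_dict=True (a minimal sketch):
# Minimal sketch: read the weighted-avg F1 as a float
report_dict = classification_report(labels, predictions, output_dict=True)
print(report_dict['weighted avg']['f1-score'])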
Next, we strengthen the prompt with Few-Shot examples, i.e. In-Context Learning (ICL), by taking the first sample of each class from the train set. The code is as follows:
# function to select a few examples from each of the classes in the training dataset
def generateFewshotPrompt(class2TrainDataset, N=3):
    fewshot_prompt = "\nFollowing are some examples."
    for label in class2TrainDataset:
        for example in class2TrainDataset[label][:N]:
            fewshot_prompt += "\nquestion: " + example
            fewshot_prompt += "\noutput: " + label
    return fewshot_prompt
# prompt with one example in each of the classes
fewshot_examples = generateFewshotPrompt(class2TrainDataset, N=1)
fewshot_prompt = task_prompt + fewshot_examples + query_prompt + dataset['test'][0][text_key] + answer_prompt
The Few-Shot prompt for the first record of the test set is as follows:
>>> fewshot_prompt
As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Following are some examples.
question: What is the full form of .com ?
output: ABBR
question: What films featured the character Popeye Doyle ?
output: ENTY
question: How did serfdom develop in and then leave Russia ?
output: DESC
question: What contemptible scoundrel stole the cork from my lunch ?
output: HUM
question: What sprawling U.S. state boasts the most airports ?
output: LOC
question: When was Ozzy Osbourne born ?
output: NUM
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:
Evaluate the full test set with the Few-Shot prompt, using the following code:
# prompt is created by adding one example in each of the classes
labels = []
predictions = []
for example in dataset['test']:
    fewshot_prompt = task_prompt + fewshot_examples + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(fewshot_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])
report = classification_report(labels, predictions, digits=4)
The evaluation results are:
precision recall f1-score support
0 0.8182 1.0000 0.9000 9
1 0.5217 0.5106 0.5161 94
2 0.7727 0.7391 0.7556 138
3 1.0000 0.8462 0.9167 65
4 0.8021 0.9506 0.8701 81
5 0.9474 0.9558 0.9515 113
accuracy 0.7980 500
macro avg 0.8103 0.8337 0.8183 500
weighted avg 0.8001 0.7980 0.7969 500
This time the weighted-avg F1 score is 0.7969.
The Few-Shot prompt already performs much better than the Zero-Shot prompt. Is there still room for improvement?
We can try to select the Few-Shot examples so that they are as similar as possible to the sample being evaluated. This is the idea behind Dynamic Few-Shot: for each test sample, select the k examples (k=1 in this post) with the highest semantic similarity to it from each class of the training set.
Measuring semantic similarity between texts requires a base model, usually a text embedding model. This post uses all-mpnet-base-v2, with sentence_transformers performing the embedding. The code is as follows:
from sentence_transformers import SentenceTransformer, util
import numpy as np
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
# loading Sentence Transformer based model
model = SentenceTransformer('all-mpnet-base-v2', device=device)
# extract embeddings for a set of examples
def ExtractEmbeddings(examples):
    embedding_ls = []
    for example in examples:
        embedding = model.encode(example)
        embedding_ls.append(embedding)
    return embedding_ls

# extract embeddings for all the training examples
class2TrainDatasetWithEmbedding = {}
for label in class2TrainDataset:
    embeddings = ExtractEmbeddings(class2TrainDataset[label])
    class2TrainDatasetWithEmbedding[label] = [class2TrainDataset[label], embeddings]
In the code above, we load the all-mpnet-base-v2 model with sentence_transformers and embed the training examples of each class, keeping the resulting vectors in memory.
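As a side note, SentenceTransformer.encode also accepts a list of sentences, so the per-class embeddings could be computed in one batched call instead of a Python loop (an equivalent sketch, not what the code above does):
# Equivalent sketch: batch-encode each class's examples in a single call
class2TrainDatasetWithEmbedding = {
    label: [examples, model.encode(examples)]
    for label, examples in class2TrainDataset.items()
}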
Next, for each test sample, select the single most semantically similar example from each class to form the Dynamic Few-Shot prompt. The code is as follows:
# extract similar queries for a given input text from each of the classes
def getSimilarExamples(input_text, dataset, dataset_embedding):
    input_embedding = model.encode(input_text)
    sim_score = util.dot_score(input_embedding, dataset_embedding)[0]
    topN_ids = np.argsort(-sim_score)
    return [dataset[i] for i in topN_ids]

def getClasswiseSimilarExamples(input_text, class2TrainDatasetWithEmbedding):
    classwiseSimilarExamples = {}
    for label in class2TrainDatasetWithEmbedding:
        similarExamples = getSimilarExamples(input_text, class2TrainDatasetWithEmbedding[label][0], class2TrainDatasetWithEmbedding[label][1])
        classwiseSimilarExamples[label] = similarExamples
    return classwiseSimilarExamples

# generate a prompt with similar examples in each of the classes
def generateDynamicPrompt(input_text, class2TrainDatasetWithEmbedding, N=3):
    classwiseSimilarExamples = getClasswiseSimilarExamples(input_text, class2TrainDatasetWithEmbedding)
    dynamic_prompt = "\nFollowing are some examples."
    for label in classwiseSimilarExamples:
        for example in classwiseSimilarExamples[label][:N]:
            dynamic_prompt += "\nquestion: " + example
            dynamic_prompt += "\noutput: " + label
    return dynamic_prompt
# dynamic prompt with one similar example in each of the classes
fewshot_examples = generateDynamicPrompt(dataset['test'][0][text_key], class2TrainDatasetWithEmbedding, N=1)
dynamic_prompt = task_prompt + fewshot_examples + query_prompt + dataset['test'][0][text_key] + answer_prompt
The Dynamic Few-Shot prompt for the first sample of the test set is now:
>>> dynamic_prompt
As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Following are some examples.
question: What do the letters D.C. stand for in Washington , D.C. ?
output: ABBR
question: What race is 1 , 137 miles long ?
output: ENTY
question: Why is the mile 528 feet ?
output: DESC
question: Who lives at 39 Stone Canyon Way ?
output: HUM
question: What Colorado city owns its own glacier ?
output: LOC
question: How high is the city of Denver ?
output: NUM
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:
The examples in this Dynamic Few-Shot prompt are clearly more relevant to the test question than those in the plain Few-Shot prompt.
Now evaluate the full test set again with the following code:
labels = []
predictions = []
for example in dataset['test']:
    fewshot_examples = generateDynamicPrompt(example[text_key], class2TrainDatasetWithEmbedding, N=1)
    dynamic_prompt = task_prompt + fewshot_examples + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(dynamic_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])
report = classification_report(labels, predictions, digits=4)
The evaluation results are:
precision recall f1-score support
0 1.0000 0.7778 0.8750 9
1 0.7083 0.7234 0.7158 94
2 0.8615 0.8116 0.8358 138
3 0.9508 0.8923 0.9206 65
4 0.8824 0.9259 0.9036 81
5 0.8926 0.9558 0.9231 113
accuracy 0.8560 500
macro avg 0.8826 0.8478 0.8623 500
weighted avg 0.8572 0.8560 0.8557 500
The final weighted-avg F1 score is 0.8557.
To summarize: using gpt-3.5-turbo-instruct on the TREC test set, we evaluated the Zero-Shot, Few-Shot, and Dynamic Few-Shot settings and obtained the following metrics:
| prompt           | weighted avg F1 |
|------------------|-----------------|
| Zero-Shot        | 0.7324          |
| Few-Shot         | 0.7969          |
| Dynamic Few-Shot | 0.8557          |
Clearly, the Dynamic Few-Shot prompt performs best, beating the Zero-Shot prompt by more than 12 percentage points, and this is without any fine-tuning of the model!
Dynamic Few-Shot prompting is well worth trying in day-to-day work.