STI Competition Task 2: Answer Verification Baseline and Approach Notes

Full code: https://aistudio.baidu.com/aistudio/projectdetail/5194830

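Subtask 2: Answer Verification

Task Overview

The answer extraction in Subtask 1 relies mainly on the semantic relevance between an answer span and the search query, but it cannot guarantee that the span itself is correct or reliable. An answer verification step is therefore needed after extraction: from the multiple extracted spans, it selects the high-confidence answer that is most widely endorsed for final display. Given a search question q and its document set D, Subtask 2 asks systems to cluster the documents by the consistency of the answer viewpoints they contain and to return, for each query, the set of documents holding the most widely accepted answer, so that the final answer of the deep question answering system can be trusted.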

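Task Definition

Given a set of search questions Q and, for each question q, the set of retrieved web documents Dq, participating systems must cluster the answers contained in those documents. Answers in the same cluster should share a viewpoint and be semantically similar; answers in different clusters should differ in viewpoint or contradict each other. The recognition score of a cluster is defined as the number of distinct documents its answers come from (so an answer repeated within one document is counted only once). Systems must return the document set corresponding to the answer cluster with the highest recognition score.

Participants are encouraged to cast this task as answer-level semantic inference (other reasonable approaches are also allowed). Given a search query and two answer spans ai and aj: if the query-relevant meaning of ai and aj is the same, or one entails the other, then ai and aj are answer-consistent and should be placed in the same answer set; if their query-relevant meaning is unrelated or contradictory, then ai and aj are answer-inconsistent and should be placed in different answer sets.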

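Dataset

The data provided supports semantic inference between different answers to the same query. The training and validation sets both contain the question set Q, the web document set D, the answer set A, and the semantic consistency relation (support, neutral, oppose) between answer pairs under the same query; see the data sample below for the exact format. To mimic a real deep question answering scenario at test time, the test set does not provide the answer set A: participants must first extract answers with the system built for Subtask 1 and then compute semantic consistency. The final system should return the document set corresponding to the answer set with the highest recognition score.

The training set has consistency labels for about 200,000 answer pairs, and the validation set for about 10,000. For both, the most recognized answer set and its document set can be derived from the consistency labels. The test set provides only queries and document sets. The main characteristics of the data are:

Documents and answers are generally long and contain a lot of distracting information, which makes semantic computation difficult
The consistency relations inside an answer set can be complex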

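Data Sample

Question q: Is it OK to have a cold drink occasionally while trying to conceive?

Document d1: Can you eat cold food while trying to conceive? The hot summer leaves many people feeling stifled..., let's take a look below! Can you eat cold food while trying to conceive? In traditional Chinese medicine, the female constitution is considered yin, so women should not overindulge in cold things. Eating too much cold or raw food depletes yang energy, lets cold pathogens develop internally, and harms the uterus. In addition, a cold uterus is a sign of kidney-yang deficiency and does not directly cause infertility. However, a cold uterus can lead to gynecological conditions, so it should still be guarded against. Women who are trying to conceive are therefore best advised not to eat cold food. What are some recipes for trying to conceive ...
Answer a1: In traditional Chinese medicine, the female constitution is considered yin, so women should not overindulge in cold things. Eating too much cold or raw food depletes yang energy, lets cold pathogens develop internally, and harms the uterus. In addition, a cold uterus is a sign of kidney-yang deficiency and does not directly cause infertility. However, a cold uterus can lead to gynecological conditions, so it should still be guarded against. Women who are trying to conceive are therefore best advised not to eat cold food.

Document d2: Condition analysis: cold drinks are generally not advised while trying to conceive, to avoid affecting the baby's health. The patient is in the preparation stage for pregnancy; both male and female patients should avoid overly spicy or irritating food, and frozen food and cold drinks are not recommended. ...
Answer a2: Cold drinks are generally not advised while trying to conceive, to avoid affecting the baby's health.

Document d3: Can you drink ice water while trying to conceive? Can you drink ice water while trying to conceive: this should not have any effect. Online consultation...
Answer a3: This should not have any effect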

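Answer pair (a1, a2) consistency: support; answer pair (a1, a3) consistency: oppose; answer pair (a2, a3) consistency: oppose
Answer clustering result: {a1, a2}, recognition = 2; {a3}, recognition = 1. The most recognized answer set is {a1, a2}, and its document set is {d1, d2}.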
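
As a rough illustration of how pairwise labels turn into the final document set, the sketch below merges answers linked by a "support" label with a small union-find and then counts distinct documents per cluster. The function name `most_recognized_docs` and the input layout are assumptions made for this example; they are not part of the competition toolkit, and merging transitively is a simplification.

```python
from collections import defaultdict


def most_recognized_docs(answer_to_doc, pair_labels):
    """Return (doc_ids, recognition) of the most recognized answer cluster.

    answer_to_doc: dict mapping answer id -> doc id
    pair_labels:   dict mapping (answer_i, answer_j) -> "support" / "neutral" / "oppose"
    """
    parent = {a: a for a in answer_to_doc}

    def find(a):
        # Follow parent links to the cluster root, compressing the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Only "support" pairs are merged; neutral and opposing pairs stay apart.
    for (a_i, a_j), label in pair_labels.items():
        if label == "support":
            parent[find(a_i)] = find(a_j)

    # Recognition = number of distinct documents contributing to a cluster.
    cluster_docs = defaultdict(set)
    for answer, doc in answer_to_doc.items():
        cluster_docs[find(answer)].add(doc)

    best = max(cluster_docs.values(), key=len)
    return sorted(best), len(best)


# The sample above: a1 and a2 support each other, a3 opposes both.
answer_to_doc = {"a1": "d1", "a2": "d2", "a3": "d3"}
pair_labels = {
    ("a1", "a2"): "support",
    ("a1", "a3"): "oppose",
    ("a2", "a3"): "oppose",
}
docs, recognition = most_recognized_docs(answer_to_doc, pair_labels)
print(docs, recognition)
```

Because documents are collected in a set, an answer mentioned several times in the same document still counts once toward recognition.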

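Data Description

Files whose names start with train/dev/test are the training, development, and test data respectively.

1. xxx_query_doc.json holds the provided queries and their docs
One query per line, stored as JSON. The top level has two fields, query and docs. docs is a list in which each item is one doc with four fields: title, url, doc_text, and doc_id. doc_text is the body text of the doc and doc_id is its identifier.

2. xxx_answer_nli_data.tsv holds the answer-relation annotations
Each line has six tab-separated columns: query, url1, answer1, url2, answer2, label
A label of 1 means the two answers support each other and are semantically consistent; 0 means the relation between them is neutral or opposing
This data can be used to train a model that judges answer relations, in support of the final task
The test set does not include this data

3. xxx_label.tsv holds the annotations for the answer verification task
One query per line, two tab-separated columns: the first is the query, the second is the document set of the most recognized answer, given as doc_ids joined with ASCII commas
The test set does not include this data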
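
To make the three layouts concrete, here is a minimal sketch that parses one illustrative record of each kind using only the standard library. All values (q1, d1, u1 and so on) are made up for the example; only the field and column names come from the description above.

```python
import csv
import io
import json

# One illustrative line of xxx_query_doc.json (values are made up).
query_doc_line = json.dumps({
    "query": "q1",
    "docs": [
        {"title": "t1", "url": "u1", "doc_text": "text 1", "doc_id": "d1"},
        {"title": "t2", "url": "u2", "doc_text": "text 2", "doc_id": "d2"},
    ],
})
record = json.loads(query_doc_line)
doc_ids = [doc["doc_id"] for doc in record["docs"]]

# One illustrative row of xxx_answer_nli_data.tsv: six tab-separated columns.
nli_tsv = "q1\tu1\tanswer one\tu2\tanswer two\t1\n"
nli_columns = ["query", "url1", "answer1", "url2", "answer2", "label"]
# QUOTE_NONE keeps quote characters inside answers from being treated as delimiters.
nli_rows = [
    dict(zip(nli_columns, row))
    for row in csv.reader(io.StringIO(nli_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
]

# One illustrative row of xxx_label.tsv: the query, then comma-joined doc_ids.
label_tsv = "q1\td1,d2\n"
labels = {}
for line in label_tsv.splitlines():
    query, joined_ids = line.split("\t")
    labels[query] = joined_ids.split(",")

print(doc_ids, nli_rows[0]["label"], labels["q1"])
```

Note that the label column is read as a string here and would need an `int(...)` conversion before training.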

Data Loading and Analysis

import pandas as pd

train_query_doc = pd.read_json('data_task2/train_query_doc.json', lines=True)

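Annotations for the Answer Verification Task

# Annotations for the answer verification task
# header=None assumes the file has no header row (one query per line), like the NLI file
train_label = pd.read_table('data_task2/train_label.tsv', header=None)
train_label.columns = ['query', 'doc_ids']
train_label.shape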

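Approach 1: Clustering Similar Documents with the Unsupervised SinglePass Algorithm

Cluster the documents retrieved for each query directly, and keep the documents that are highly similar to one another.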
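
Before the gensim-based implementation, the core idea of single-pass clustering can be sketched in a few lines of plain Python: each vector joins the most similar existing cluster if the similarity exceeds a threshold `theta`, otherwise it opens a new cluster. This is a simplified sketch with hand-written sparse vectors as dicts; the helper names (`cosine`, `mean_vector`, `single_pass`) are chosen for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Centroid of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def single_pass(vectors, theta):
    """Assign each vector to the most similar cluster, or open a new one."""
    members, centroids = [], []
    for idx, vec in enumerate(vectors):
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if best >= 0 and sims[best] > theta:
            members[best].append(idx)
            # Recompute the centroid from all members of the cluster.
            centroids[best] = mean_vector([vectors[i] for i in members[best]])
        else:
            members.append([idx])
            centroids.append(vec)
    return members


docs = [
    {"cold": 1.0, "drink": 1.0},
    {"cold": 1.0, "drink": 0.8},
    {"recipe": 1.0},
]
clusters = single_pass(docs, theta=0.6)
print(clusters)
```

The result depends on the order in which documents arrive, which is a known property of single-pass clustering: an early document fixes the initial centroid of its cluster.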

import numpy as np
from gensim import corpora, models, matutils


class SingelPassClusterTfidf():
    '''
        1. Compute cosine similarity between TF-IDF vectors
    '''
    def tfidf_vec(self, corpus, pivot=10, slope=0.25):
        dictionary = corpora.Dictionary(corpus)  # build the token-to-id mapping
        self.dict_size = len(dictionary)
        # print('dictionary size:{}'.format(len(dictionary)))
        corpus = [dictionary.doc2bow(text) for text in corpus]  # bag-of-words vector per document
        tfidf = models.TfidfModel(corpus, pivot=pivot, slope=slope)
        corpus_tfidf = tfidf[corpus]
        return corpus_tfidf

    def get_max_similarity(self, cluster_cores, vector):
        max_value = 0
        max_index = -1
        for k, core in cluster_cores.items():
            similarity = matutils.cossim(vector, core)
            if similarity > max_value:
                max_value = similarity
                max_index = k
        return max_index, max_value

    def single_pass(self, corpus_vec, corpus, theta):
        clusters = {}
        cluster_cores = {}
        cluster_text = {}
        num_topic = 0
        cnt = 0
        for vector, text in zip(corpus_vec, corpus):
            if num_topic == 0:
                clusters.setdefault(num_topic, []).append(vector)
                cluster_cores[num_topic] = vector
                cluster_text.setdefault(num_topic, []).append(text)
                num_topic += 1
            else:
                max_index, max_value = self.get_max_similarity(cluster_cores, vector)
                if max_value > theta:
                    clusters[max_index].append(vector)
                    text_matrix = matutils.corpus2dense(clusters[max_index], num_terms=self.dict_size,
                                                        num_docs=len(clusters[max_index])).T  # sparse to dense
                    core = np.mean(text_matrix, axis=0)  # update the cluster centroid
                    core = matutils.any2sparse(core)  # convert the dense centroid back to a sparse vector
                    cluster_cores[max_index] = core
                    cluster_text[max_index].append(text)
                else:  # start a new cluster
                    clusters.setdefault(num_topic, []).append(vector)
                    cluster_cores[num_topic] = vector
                    cluster_text.setdefault(num_topic, []).append(text)
                    num_topic += 1
            cnt += 1
            if cnt % 100 == 0:
                print('processing {}...'.format(cnt))
        return clusters, cluster_text

    def fit_transform(self, corpus, raw_data, theta=0.6):
        tfidf_vec = self.tfidf_vec(corpus)  # tfidf_vec holds sparse vectors
        clusters, cluster_text = self.single_pass(tfidf_vec, raw_data, theta)
        return clusters, cluster_text
    

class ClusterTfidf:
    def __init__(self):
        self.clustor = SingelPassClusterTfidf()
        return

    # Main clustering entry point
    def cluster(self, corpus, text2index, theta=0.6):
        clusters, cluster_text = self.clustor.fit_transform(corpus, text2index, theta)
        return clusters, cluster_text
import jieba
import json
import collections

handler_tfidf = ClusterTfidf()

class SinglePassCluster(object):
    # Initialization
    def __init__(self):
        pass

    # Read non-empty lines from a file
    def load_data(self, filepath):
        datas = []
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                datas.append(line)
        return datas

    # Load the docs and tokenize title + body with jieba
    def load_docs(self, docs):
        corpus = [list(jieba.cut(s['title']+s['doc_text'])) for s in docs]
        doc_ids = [s['doc_id'] for s in docs]
        
        index2corpus = dict()
        
        for index, line in zip(doc_ids,docs):
            index2corpus[index] = line
        text2index = list(index2corpus.keys())
        # print('docs total size:{}'.format(len(text2index)))
        return text2index, index2corpus, corpus

    # Save the clustering result
    def save_cluster(self, method, index2corpus, cluster_text, cluster_path):
        clusterTopic_list = sorted(cluster_text.items(), key=lambda x: len(x[1]), reverse=True)
        # print(clusterTopic_list)
        with open(cluster_path + '/cluster_%s.json' % method, 'w+', encoding='utf-8') as save_obj:
            for k in clusterTopic_list:
                data = dict()
                data["cluster_id"] = k[0]
                data["cluster_nums"] = len(k[1])
                data["cluster_docs"] = [{"doc_id": index, "doc_content": index2corpus.get(value)} for index, value in
                                        enumerate(k[1], start=1)]
                json_obj = json.dumps(data, ensure_ascii=False)
                save_obj.write(json_obj)
                save_obj.write('\n')

    # Main entry point for running the clustering
    def cluster(self, docs, method="tfidf", theta=0.6):
        # docs = self.load_data(self.train_corpus_filepath)
        text2index, index2corpus, corpus = self.load_docs(docs)
        # print("loaded %s samples...." % len(docs))
        if method != "tfidf":
            # The doc2vec handler (handler_docvec) is not defined in this post,
            # so only the TF-IDF variant can be run here.
            raise NotImplementedError("only method='tfidf' is available")
        clusters, cluster_text = handler_tfidf.cluster(corpus, text2index, theta)
        # self.save_cluster(method, index2corpus, cluster_text, cluster_path)
        return clusters, cluster_text

Submitting Predictions

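# Assumes the test file follows the xxx_query_doc.json naming described above
test_query_doc = pd.read_json('data_task2/test_query_doc.json', lines=True)

handler = SinglePassCluster()
method = "tfidf"
theta = 0.4
with open('subtask2_test_pred.txt', 'w', encoding='utf-8') as submit_file:
    for idx, row in test_query_doc.iterrows():
        clusters, cluster_text = handler.cluster(row['docs'], method=method, theta=theta)
        # Output format, one query per line:
        # first query     d1,d2,d4,d8
        # second query    d0,d1,d3
        similar_docs = []
        for key, value in cluster_text.items():
            # Keep every document that landed in a cluster with at least two members
            if len(value) > 1:
                similar_docs.extend(value)
        submit_file.write(row['query'] + '\t' + ','.join(map(str, similar_docs)) + '\n')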
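
Since the training and development sets come with xxx_label.tsv, the same prediction format can be checked locally before submitting. The official metric is not stated in this post, so the sketch below assumes a per-query set-level F1 averaged over queries; treat both the metric and the helper names (`set_f1`, `read_doc_sets`) as assumptions.

```python
def set_f1(predicted, gold):
    """F1 between two collections of doc_ids, treated as sets."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def read_doc_sets(lines):
    """Parse 'query<TAB>id1,id2,...' lines into {query: [doc_ids]}."""
    result = {}
    for line in lines:
        query, _, joined = line.rstrip("\n").partition("\t")
        result[query] = [d for d in joined.split(",") if d]
    return result


# Made-up gold labels and predictions in the submission format.
gold = read_doc_sets(["q1\td1,d2\n", "q2\td0,d1,d3\n"])
pred = read_doc_sets(["q1\td1,d2,d4\n", "q2\td0\n"])
scores = {q: set_f1(pred.get(q, []), gold[q]) for q in gold}
mean_f1 = sum(scores.values()) / len(scores)
print(scores, round(mean_f1, 3))
```

A check like this is mainly useful for tuning `theta` on the development set rather than on the leaderboard.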

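Approach 2: Semantic Inference over the Answers Extracted in Task 1

The idea is as follows. First, train an answer consistency model on the answer-relation annotations (xxx_answer_nli_data.tsv). Next, run the Task 1 system on each query and its docs to extract answers. Finally, compare the answers of the documents that have one, and group together the documents whose answer similarity exceeds a threshold.

Time was limited and training is currently slow; the submission code will be updated once scores are available.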
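
The grouping step of this pipeline can be sketched independently of the model. Below, the highest-scoring answer serves as the reference and a document is kept when its answer is judged consistent with it. `toy_consistency` is a stand-in (token overlap) for the trained model, and the dict layout of `candidates` is an assumption for this example; only the "NoAnswer" marker is taken from the prediction code later in this post.

```python
def select_consistent_docs(candidates, consistency_prob, threshold=0.5):
    """Keep docs whose answer is consistent with the top-scoring answer.

    candidates:       list of dicts with keys "doc_id", "answer", "score"
    consistency_prob: callable (answer_a, answer_b) -> probability in [0, 1]
    """
    # Documents with no extracted answer cannot vote.
    answered = [c for c in candidates if c["answer"] != "NoAnswer"]
    if not answered:
        return []
    ranked = sorted(answered, key=lambda c: c["score"], reverse=True)
    reference = ranked[0]
    selected = [reference["doc_id"]]
    for cand in ranked[1:]:
        if consistency_prob(reference["answer"], cand["answer"]) > threshold:
            selected.append(cand["doc_id"])
    return selected


def toy_consistency(a, b):
    """Stand-in for the trained model: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


candidates = [
    {"doc_id": "d1", "answer": "avoid cold drinks", "score": 0.9},
    {"doc_id": "d2", "answer": "avoid cold drinks mostly", "score": 0.7},
    {"doc_id": "d3", "answer": "no effect at all", "score": 0.6},
    {"doc_id": "d4", "answer": "NoAnswer", "score": 0.1},
]
selected = select_consistent_docs(candidates, toy_consistency)
print(selected)
```

Anchoring on a single reference answer is cheap (one model call per candidate) but inherits any mistake in the top extraction; full pairwise clustering would be more robust at quadratic cost.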

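Answer Semantic Inference Model

Treat this directly as a binary classification task. Because the NLI training set is fairly large, the model can be trained on a sample of it.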
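
The post does not say how the sample is drawn. One simple option, shown here as an assumption, is to draw the same number of rows per label with a fixed seed so that the subset is balanced and reproducible. The helper `sample_per_label` and the toy rows are illustrative only.

```python
import random
from collections import defaultdict


def sample_per_label(rows, n_per_label, seed=71):
    """Draw up to n_per_label rows for each label value, reproducibly."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        sampled.extend(rng.sample(group, min(n_per_label, len(group))))
    # Shuffle so the labels are interleaved in the training order.
    rng.shuffle(sampled)
    return sampled


# Toy stand-in for the NLI rows: 8 negatives and 4 positives.
rows = [{"id": i, "label": 0} for i in range(8)] + [{"id": i, "label": 1} for i in range(8, 12)]
subset = sample_per_label(rows, n_per_label=3)
counts = {label: sum(r["label"] == label for r in subset) for label in (0, 1)}
print(counts)
```

With a pandas DataFrame, grouping by the label column and sampling each group would achieve the same effect.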

Building the Training Set

def concat_text(row):
    # Join the two answers into one input string for the classifier
    return str(row['answer1']) + '[SEP]' + str(row['answer2'])


train = pd.read_table('data_task2/train_answer_nli_data.tsv', header=None)
train.columns = ['query', 'url1', 'answer1', 'url2', 'answer2', 'label']

train.fillna('', inplace=True)
train['text'] = train.apply(concat_text, axis=1)

Model Training

import os
import random
from functools import partial

import numpy as np
import paddle
import paddle as P
import paddle.nn as nn
import paddle.nn.functional as F
import paddlenlp as ppnlp  # PaddleNLP, Paddle's counterpart to Hugging Face transformers
import pandas as pd
from paddle.io import Dataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.datasets import MapDataset
from paddlenlp.transformers import LinearDecayWithWarmup
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight
from tqdm import tqdm
 
 
# =============================== Initialization ========================
class Config:
    text_col = 'text'
    target_col = 'label'
    # Maximum sequence length
    max_len = 256  # measured by len(text) or tokenizer length; 256 covers 95% of samples
    # Batch size
    batch_size = 32
    target_size = 2
    seed = 71
    n_fold = 5
    # Peak learning rate during training
    learning_rate = 5e-5
    # Number of training epochs
    epochs = 3  # 3
    # Fraction of steps used for learning-rate warmup
    warmup_proportion = 0.1
    # Weight decay coefficient, a regularizer that helps avoid overfitting
    weight_decay = 0.01
    model_name = "ernie-gram-zh"
    print_freq = 100
 
 
def seed_torch(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    paddle.seed(seed)  # also seed Paddle so weight init and dropout are reproducible
 

 
CFG = Config()
seed_torch(seed=CFG.seed)
 
 
# y = train[CFG.target_col]
# class_weight = 'balanced'
# classes = train[CFG.target_col].unique()  # 标签类别
# weight = compute_class_weight(class_weight=class_weight,classes= classes, y=y)
 
 

 
# CV split: 5-fold StratifiedKFold (stratified sampling)
folds = train.copy()
Fold = StratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=CFG.seed)
for n, (train_index, val_index) in enumerate(Fold.split(folds, folds[CFG.target_col])):
    folds.loc[val_index, 'fold'] = int(n)
folds['fold'] = folds['fold'].astype(int)
 
 
# ====================================== Dataset and conversion functions ==============================
class CustomDataset(Dataset):
    def __init__(self, df):
        self.data = df.values.tolist()
        self.texts = df[CFG.text_col]
        self.labels = df[CFG.target_col]
 
    def __len__(self):
        return len(self.texts)
 
    def __getitem__(self, idx):
        """
        Return the example at position idx.
        :param idx: integer position in the dataset
        :return: dict with 'text' and 'label'
        """
        text = str(self.texts[idx])
        label = self.labels[idx]
        example = {'text': text, 'label': label}
 
        return example
 
 
def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    """
    Build BERT-style model inputs
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
 
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
 
 
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)
 
    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
 
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
 
 
# tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(CFG.model_name)
tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained(CFG.model_name)
 
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=CFG.max_len)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]
# ====================================== Training, validation, and prediction functions ==============================
 
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    """
    Evaluate the model on a validation data loader.
    """
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()
    return accu
 

 
def train():
    # ====================================  Cross-validation training ==========================
    for fold in range(CFG.n_fold):
        print(f"===============training fold_nth:{fold + 1}======================")
        trn_idx = folds[folds['fold'] != fold].index
        val_idx = folds[folds['fold'] == fold].index
 
        train_folds = folds.loc[trn_idx].reset_index(drop=True)
        valid_folds = folds.loc[val_idx].reset_index(drop=True)
 
        train_dataset = CustomDataset(train_folds)
        train_ds = MapDataset(train_dataset)
 
        dev_dataset = CustomDataset(valid_folds)
        dev_ds = MapDataset(dev_dataset)
 
        train_data_loader = create_dataloader(
            train_ds,
            mode='train',
            batch_size=CFG.batch_size,
            batchify_fn=batchify_fn,
            trans_fn=trans_func)
        dev_data_loader = create_dataloader(
            dev_ds,
            mode='dev',
            batch_size=CFG.batch_size,
            batchify_fn=batchify_fn,
            trans_fn=trans_func)
 
        model = ppnlp.transformers.ErnieGramForSequenceClassification.from_pretrained(CFG.model_name,
                                                                                      num_classes=CFG.target_size)
 
        num_training_steps = len(train_data_loader) * CFG.epochs
        lr_scheduler = LinearDecayWithWarmup(CFG.learning_rate, num_training_steps, CFG.warmup_proportion)
        optimizer = paddle.optimizer.AdamW(
            learning_rate=lr_scheduler,
            parameters=model.parameters(),
            weight_decay=CFG.weight_decay,
            apply_decay_param_fun=lambda x: x in [
                p.name for n, p in model.named_parameters()
                if not any(nd in n for nd in ["bias", "norm"])
            ])
 
        criterion = paddle.nn.loss.CrossEntropyLoss()
        metric = paddle.metric.Accuracy()
 
        global_step = 0
        best_val_acc = 0
        for epoch in range(1, CFG.epochs + 1):
            for step, batch in enumerate(train_data_loader, start=1):
                input_ids, segment_ids, labels = batch
                logits = model(input_ids, segment_ids)
                # probs_ = paddle.to_tensor(logits, dtype="float64")
                loss = criterion(logits, labels)
                probs = F.softmax(logits, axis=1)
                correct = metric.compute(probs, labels)
                metric.update(correct)
                acc = metric.accumulate()
 
                global_step += 1
                if global_step % CFG.print_freq == 0:
                    print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (
                        global_step, epoch, step, loss, acc))
                loss.backward()
                optimizer.step()
                lr_scheduler.step()
                optimizer.clear_grad()
            acc = evaluate(model, criterion, metric, dev_data_loader)
            if acc > best_val_acc:
                best_val_acc = acc
                P.save(model.state_dict(), f'{CFG.model_name}_fold{fold}.bin')
            print('Best Val acc %.5f' % best_val_acc)
        del model
        break  # train only the first fold; the prediction code loads the fold-0 checkpoint
 
if __name__ == '__main__':
    train()
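
The learning-rate schedule used above (`LinearDecayWithWarmup` with `warmup_proportion = 0.1`) ramps the rate up linearly and then decays it linearly to zero. The function below is a plain-Python approximation of that shape for intuition; it is not PaddleNLP's implementation, and boundary details such as off-by-one handling may differ.

```python
def linear_warmup_decay(step, base_lr, total_steps, warmup_proportion):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Warmup phase: grow from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = 1000
lrs = [linear_warmup_decay(s, 5e-5, total, 0.1) for s in (0, 50, 100, 550, 1000)]
print(lrs)
```

With 1000 total steps the peak of 5e-5 is reached at step 100, and the rate is back to half the peak at step 550.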

Submitting Predictions

# `predict` and `qa_task2` are not defined in this post. From the way they are used below,
# qa_task2 is assumed to be the Task 1 extraction output as a DataFrame with columns
# query, doc_id, answer and score, and predict(model, texts, tokenizer, batch_size) is
# assumed to return an array of shape (len(texts), 2) holding class probabilities.
model = ppnlp.transformers.ErnieGramForSequenceClassification.from_pretrained(CFG.model_name, num_classes=2)
model.load_dict(P.load('ernie-gram-zh_fold0.bin'))


submit_file = open('subtask2_test_pred.txt', 'w', encoding='utf-8')
querys = qa_task2['query'].unique()
for query in querys:
    group = qa_task2[qa_task2['query'] == query]
    group = group[group['answer'] != 'NoAnswer'].reset_index(drop=True)
    group = group.sort_values(by=['query', 'score'], ascending=False).reset_index(drop=True)
    group['doc_id'] = group['doc_id'].apply(lambda x: x.split('_')[-1])

    similar_docs = []
    if len(group) > 0:
        # Use the highest-scoring answer as the reference answer
        top_text = group['answer'][0]
        similar_docs.append(group['doc_id'][0])

        texts = [{'text': top_text + '[SEP]' + text} for text in group['answer'][1:]]
        candidate_docs = list(group['doc_id'][1:])

        if texts:
            pred = predict(model, texts, tokenizer, 16)
            preds = list(pred[:, 1])
            for doc_id, prob in zip(candidate_docs, preds):
                # Keep documents whose answer is judged consistent with the reference
                if prob > 0.2:
                    similar_docs.append(doc_id)
    submit_file.write(query + '\t' + ','.join(similar_docs) + '\n')
submit_file.close()
