Measuring BLEU scores for a Chinese chatbot trained on the 小黄鸭 dataset (xiaohuangji50w)

Chatbot source code: GitHub link

Test approach

BLEU is computed with the sentence_bleu function from the nltk library.

from nltk.translate.bleu_score import sentence_bleu

1. First, extract a question and its reference answer from the dataset

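with open('train_data/xiaohuangji50w_nofenci.conv', 'r', encoding='utf-8') as f:  # assumes the file is UTF-8
    # Read one record from the dataset
    f.readline()  # skip the 'E' separator line, which carries no content
    question = f.readline()  # read the question line
    question = question[2:]  # strip the two-character prefix
    _answer = f.readline()  # read the reference answer line
    _answer = _answer[2:]  # strip the prefix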
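
The snippet above reads only the first record. As a minimal sketch, assuming every record is exactly three lines (an `E` separator line, then a question line and an answer line that each carry a two-character prefix, which is what the slicing above implies), a small generator can yield every pair. The name `iter_qa_pairs` is hypothetical and is not part of the chatbot repository.

```python
from typing import Iterable, Iterator, Tuple


def iter_qa_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (question, reference_answer) pairs from .conv-style lines.

    Assumes each record is three lines: 'E', 'M <question>', 'M <answer>'.
    """
    it = iter(lines)
    for separator in it:
        if not separator.startswith('E'):
            continue  # skip anything that is not a record separator
        question = next(it, None)
        answer = next(it, None)
        if question is None or answer is None:
            break  # truncated record at the end of the input
        # Drop the two-character prefix and the trailing newline
        yield question[2:].strip(), answer[2:].strip()


# In-memory sample standing in for the real file
sample = ['E\n', 'M 你好\n', 'M 你好呀\n', 'E\n', 'M 在吗\n', 'M 在的\n']
pairs = list(iter_qa_pairs(sample))
print(pairs)
```

Because an open file is itself an iterable of lines, the same function could be called as `iter_qa_pairs(f)` inside the `with open(...)` block.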

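2. Then send the question to the chatbot to get its answer (question_fenci is the segmented question produced in step 3.1; execute is the module from the chatbot source code)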

answer = execute.predict(question_fenci)

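3. Once we have the chatbot's answer, compute BLEU between the reference answer and the chatbot's answer. The steps are as follows:

3.1 Before scoring, segment the reference answer and the chatbot's answer into words
We use the jieba library for word segmentation.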

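import jieba
# Word segmentation: join the tokens with spaces so that split() can recover them later
question_fenci = ' '.join(jieba.cut(question))
_answer_fenci = ' '.join(jieba.cut(_answer))
answer_fenci = ' '.join(jieba.cut(answer))  # the chatbot's answer from step 2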

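3.2 Build the reference and candidate arguments for sentence_bleu
reference holds the gold answers. It is a list that may contain several reference answers, and each reference answer is a sub-list of tokens obtained by calling split() on the segmented text.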

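# Example of the shape of reference: a list of token lists
reference = [['this', 'is', 'a', 'duck']]
# In this test there is a single reference answer per question
reference = [_answer_fenci.split()]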

candidate is the flat list of tokens obtained by segmenting the chatbot's answer and calling split() on the result.

candidate = answer_fenci.split()
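
Because the lines come from readline(), the segmented strings still end with a newline. The quick check below uses hand-written strings that stand in for the jieba output (an assumption; the real segmentation may differ) and suggests that split() with no arguments already discards that whitespace, so no extra cleanup is needed before building reference and candidate.

```python
# Hand-written stand-ins for ' '.join(jieba.cut(...)) output; real segmentation may differ
_answer_fenci = '我 是 小 黄鸭 \n'
answer_fenci = '我 是 机器人 \n'

# split() with no arguments splits on any run of whitespace and drops empty strings
reference = [_answer_fenci.split()]  # list of token lists, one per reference answer
candidate = answer_fenci.split()     # flat token list

print(reference)
print(candidate)
```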

3.3 Now we can compute the BLEU scores

score1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
score2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))
score3 = sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0))
score4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))

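weights sets the relative weight of the 1-gram, 2-gram, 3-gram, and 4-gram precisions; by default each gets 1/4. That completes the BLEU calculation.
For how BLEU itself is computed, this blogger's post explains it clearly: BLEU算法 (The BLEU Algorithm)
When calling sentence_bleu you may see the following warning: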
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
warnings.warn(_msg)
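This happens when the chatbot's answer shares no 2-gram with the reference. The 2-gram precision is then zero, so any score that gives 2-grams a non-zero weight comes out as 0 or as a very small value. As the warning suggests, you can use a lower n-gram order or pass a SmoothingFunction().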
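
To make the warning concrete, here is a simplified pure-Python sketch of the standard BLEU definition (clipped n-gram precision, weighted geometric mean, brevity penalty). It is not NLTK's implementation, and `simple_bleu` is a hypothetical name. In particular, it returns exactly 0.0 when a weighted n-gram order has no overlap, whereas the value NLTK returns in that case may depend on the installed version.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) found in tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(references, candidate, n):
    """Clipped n-gram precision, returned as (matches, total)."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, cnt in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    # Each candidate n-gram counts at most as often as it appears in a reference
    matches = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
    return matches, sum(cand.values())


def simple_bleu(references, candidate, weights):
    """Unsmoothed sentence-level BLEU (simplified sketch)."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n, w in enumerate(weights, start=1):
        if w == 0:
            continue
        matches, total = modified_precision(references, candidate, n)
        if matches == 0 or total == 0:
            return 0.0  # no overlap at this order: the geometric mean collapses
        log_sum += w * math.log(matches / total)
    # Brevity penalty against the reference closest in length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)


reference = [['我', '是', '小', '黄鸭']]
swapped = ['是', '我']  # both words match, but the only 2-gram does not

print(simple_bleu(reference, swapped, (1, 0, 0, 0)))      # positive: unigrams overlap
print(simple_bleu(reference, swapped, (0.5, 0.5, 0, 0)))  # 0.0: no 2-gram overlap
```

Short chatbot replies hit this case often, which is why the 3-gram and 4-gram scores in this kind of test tend to sit near zero without smoothing.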

Full code

Finally, here is the complete code:

# Compute BLEU over the first 100, 1k, or 10k Q&A pairs
import execute
import jieba
from nltk.translate.bleu_score import sentence_bleu
import time

count = 0
question = ''  # question
_answer = ''  # reference answer
answer = ''  # the chatbot's answer

reference = []  # BLEU references
candidate = []  # tokens returned by the chatbot

# Running totals for the cumulative 1-gram to 4-gram scores
score_total1 = 0
score_total2 = 0
score_total3 = 0
score_total4 = 0
i = 0
with open('train_data/xiaohuangji50w_nofenci.conv', 'r', encoding='utf-8') as f:  # assumes the file is UTF-8
    # Uncomment to skip the first 1000 Q&A pairs and start testing from pair 1001
    # while i < 1000:
    #     f.readline()
    #     f.readline()
    #     f.readline()
    #     i += 1
    #     print('i: ' + str(i))
    start_time = time.time()
    
    # Change this condition to choose the number of samples to test
    while count < 1000:
        # Read one record from the dataset
        f.readline()  # skip the 'E' separator line
        question = f.readline()  # read the question line
        question = question[2:]  # strip the two-character prefix
        _answer = f.readline()  # read the reference answer line
        _answer = _answer[2:]  # strip the prefix

        # Word segmentation
        question_fenci = ' '.join(jieba.cut(question))
        _answer_fenci = ' '.join(jieba.cut(_answer))

        # Ask the chatbot
        # execute comes from the Chinese chatbot source code on GitHub mentioned above
        answer = execute.predict(question_fenci)

        # Segment the chatbot's answer
        answer_fenci = ' '.join(jieba.cut(answer))
        print('---------------separator-----------------')
        print('question_fenci: ' + str(question_fenci))
        print('_answer_fenci: ' + str(_answer_fenci))
        print('answer_fenci: ' + str(answer_fenci))


        # Compute BLEU
        reference.append(_answer_fenci.split())
        candidate = answer_fenci.split()
        score1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
        score2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))
        score3 = sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0))
        score4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
        reference.clear()
        print('Cumulative 1-gram: %f' % score1)
        print('Cumulative 2-gram: %f' % score2)
        print('Cumulative 3-gram: %f' % score3)
        print('Cumulative 4-gram: %f' % score4)
        score_total1 += score1
        score_total2 += score2
        score_total3 += score3
        score_total4 += score4
        count += 1
        print('count:' + str(count) + ' 1-gram score: ' + str(score1))
        print('count:' + str(count) + ' 2-gram score: ' + str(score2))
        print('count:' + str(count) + ' 3-gram score: ' + str(score3))
        print('count:' + str(count) + ' 4-gram score: ' + str(score4))
        print('---------------separator-----------------')

print('Final results')
print('Elapsed time (s): ' + str(time.time() - start_time))
print('count: ' + str(count))

print('score_total1: ' + str(score_total1))
print('BLEU 1-gram: ' + str(score_total1 / count))
print('---------------separator-----------------')
print('score_total2: ' + str(score_total2))
print('BLEU 2-gram: ' + str(score_total2 / count))
print('---------------separator-----------------')
print('score_total3: ' + str(score_total3))
print('BLEU 3-gram: ' + str(score_total3 / count))
print('---------------separator-----------------')
print('score_total4: ' + str(score_total4))
print('BLEU 4-gram: ' + str(score_total4 / count))
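
The script above ties file reading, the chatbot call, and scoring into one loop. One possible restructuring, shown here only as a sketch, passes those pieces in as functions so the averaging logic can be tried with stubs. `average_scores` and all of its parameters are hypothetical names, and `itertools.islice` replaces the commented-out skip loop and the `count < 1000` condition.

```python
from itertools import islice
from typing import Callable, Iterable, List, Sequence, Tuple

ScoreFn = Callable[[List[List[str]], List[str]], float]


def average_scores(
    pairs: Iterable[Tuple[str, str]],
    predict: Callable[[str], str],
    tokenize: Callable[[str], List[str]],
    score_fns: Sequence[ScoreFn],
    start: int = 0,
    limit: int = 1000,
) -> List[float]:
    """Average each scoring function over `limit` pairs, skipping the first `start`."""
    totals = [0.0] * len(score_fns)
    count = 0
    for question, ref_answer in islice(pairs, start, start + limit):
        # As in the script, the chatbot receives the space-joined segmented question
        answer = predict(' '.join(tokenize(question)))
        reference = [tokenize(ref_answer)]
        candidate = tokenize(answer)
        for i, fn in enumerate(score_fns):
            totals[i] += fn(reference, candidate)
        count += 1
    return [t / count if count else 0.0 for t in totals]


# Stubs for illustration only: an echo "bot" and a unigram-overlap score
def echo(question: str) -> str:
    return question


def overlap(references: List[List[str]], candidate: List[str]) -> float:
    return sum(tok in references[0] for tok in candidate) / len(candidate)


pairs = [('a b', 'a b'), ('c d', 'x y'), ('e f', 'e z')]
result = average_scores(pairs, echo, str.split, [overlap], start=1, limit=2)
print(result)
```

In the real test, `predict` would be `execute.predict`, `tokenize` would wrap the jieba segmentation, and each entry of `score_fns` could be `functools.partial(sentence_bleu, weights=...)` with one of the four weight tuples.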




