A course project forced me to learn a bit about BERT, and then it hit me: isn't this exactly the tool for doing my English cloze-test homework? Hence the following code:
First we import the libraries we'll need. pytorch_pretrained_bert is one I'm using for the first time; you need to pip install or conda install it:
import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
import re
from random import randint
Then load the BERT model and its vocabulary:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # load the vocabulary
bert = BertForMaskedLM.from_pretrained('bert-base-uncased')  # load the model
bert.eval()
bert.to('cuda:0')  # move to the GPU; drop this line if you have no GPU
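As a quick sanity check that everything loaded correctly (a minimal sketch of my own, not part of the homework code), you can ask BERT to fill in a single blank:

# sanity check (illustrative): predict one masked word
sanity_tokens = tokenizer.tokenize('[CLS] the man went to the [MASK] . [SEP]')
sanity_ids = torch.tensor([tokenizer.convert_tokens_to_ids(sanity_tokens)]).to('cuda:0')
with torch.no_grad():
    sanity_scores = bert(sanity_ids)  # shape: (1, seq_len, vocab_size)
mask_pos = sanity_tokens.index('[MASK]')
best_id = torch.argmax(sanity_scores[0, mask_pos]).item()
print(tokenizer.convert_ids_to_tokens([best_id]))  # a plausible word such as 'store'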
Next I load the question (option) text, stored in a txt file:
#extract the options
choices = ques_proce('question.txt')  # parse the question (option) file; the function is defined below
choices_idx = []
for choice in choices:  # convert each option word to its vocabulary id
    choice_idx = tokenizer.convert_tokens_to_ids(choice)
    choices_idx.append(choice_idx)
The function that parses the question file is:
def ques_proce(file):
    f = open(file, 'r', encoding='gb18030', errors='ignore')
    buffer = f.readline()
    choices = []
    while buffer != '':
        parts = buffer.split()  # e.g. ['1.', 'A.', 'cried', 'B.', 'talked', ...]
        one_que = [parts[idx] for idx in [2, 4, 6, 8]]  # keep only the four option words
        choices.append(one_que)
        buffer = f.readline()
    return choices
question.txt looks roughly like this. Since I'm not very good at string handling, the file has to follow a fixed format, e.g. the question number, the option letters, and the option words must all be separated by spaces:
1. A. cried B. talked C. shouted D. laughed
2. A. spoke B. told C. shouted D. asked
3. A. her B. him C. it D. them
4. A. brought B. took C. put D. got
5. A. only B. ago C. later D. before
6. A. hurt B. well C. healthy D. bad
7. A. on B. in C. out D. off
8. A. other B. one C. another D. others
9. A. much B. very C. still D. also
10. A. kept B. pulled C. done D. thrown
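Given the file above, ques_proce should yield one four-word list per question, for example:

choices = ques_proce('question.txt')
print(choices[0])  # ['cried', 'talked', 'shouted', 'laughed']
print(choices[2])  # ['her', 'him', 'it', 'them']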
Build a prediction matrix that records, for every question, the probability of each of its 4 options:
#build the prediction-probability matrix
ans_prob = []
for i in range(len(choices)):
    ans_prob.append([0.0, 0.0, 0.0, 0.0])
Now for the important part. First, the text-processing function (it may look complicated). The main idea: for every sentence that contains a blank ([MASK], i.e. a question), randomly pick another sentence from the passage and feed the pair to BERT as one input. The code is complicated because the randomly chosen sentence may itself contain blanks, so we first call sen2maskIdx to record which question(s) each sentence corresponds to (one sentence may contain several blanks):
def sen2maskIdx(sen_list):
    now_mask = 0
    result = []
    for idx in range(len(sen_list)):
        sen = sen_list[idx]
        mask_num = sen.count('[MASK]')
        maskIdx = [i + now_mask for i in range(mask_num)]
        now_mask += mask_num
        result.append(maskIdx)
    return result
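sen2maskIdx numbers the [MASK]s globally in reading order, so each sentence's entry tells you which questions it holds. A small example:

sens = ['She [MASK] in class', ' her teacher asked kindly', ' as nice as [MASK] and [MASK]']
print(sen2maskIdx(sens))  # [[0], [], [1, 2]]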
def pass_proce(file, per_times):  # per_times: how many partner sentences each masked sentence is paired with
    f = open(file, 'r', encoding='gb18030', errors='ignore')  # read and clean up the text
    buffer = f.read()
    buffer = re.sub('\n', ' ', buffer)
    buffer = re.sub(r'\(\d{1,2}\)_{1,9}', '[MASK]', buffer)  # turn '(1)____'-style blanks into [MASK]
    sen_list = buffer.split('.')
    for_all_mask = []
    sen2maskidx = sen2maskIdx(sen_list)  # map each sentence to its question indices
    for sen_idx, sen in enumerate(sen_list):
        if '[MASK]' in sen:  # only process the sentences that contain a blank
            for_this_mask = []
            for i in range(per_times):
                ano_idx = randint(0, len(sen_list) - 1)
                while ano_idx == sen_idx:  # redraw until we get a different sentence
                    ano_idx = randint(0, len(sen_list) - 1)
                # keep the two sentences in passage order; the spaces around
                # [CLS]/[SEP] matter, otherwise the tokenizer splits them apart
                if sen_idx > ano_idx:
                    temp_sen = '[CLS] ' + sen_list[ano_idx] + ' [SEP] ' + sen + ' [SEP]'
                    mask_idx = sen2maskidx[ano_idx] + sen2maskidx[sen_idx]
                else:
                    temp_sen = '[CLS] ' + sen + ' [SEP] ' + sen_list[ano_idx] + ' [SEP]'
                    mask_idx = sen2maskidx[sen_idx] + sen2maskidx[ano_idx]
                for_this_mask.append((temp_sen, mask_idx))
            for_all_mask.append(for_this_mask)
    return for_all_mask
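The return value is nested: one list per masked sentence, each holding per_times (input_string, question_indices) tuples. Schematically (the exact pairing is random, so this is only illustrative):

# for_all_mask[k]    -> all per_times combinations built for the k-th masked sentence
# for_all_mask[k][t] -> (input_string, question_indices), e.g.
#   ('[CLS] <random partner sentence> [SEP] <masked sentence with two blanks> [SEP]', [3, 4])
# where [3, 4] says the two [MASK]s in this input belong to questions 4 and 5 (0-based)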
The test passage is below, a real middle-school exam passage found online. Again because of my limited string-handling skills, the text itself has to follow a certain format, e.g. the question numbers and underscores must be separated from other letters by a space, and the same goes for punctuation:
Helen was seven years old. One day one of her teeth began to hurt. She (1)____ in class at school , and her teacher (2)______ kindly, "What's the matter, Helen?"
"One of my teeth hurts, "answered Helen.
"Tell your mother about (3)_____ , " said the teacher, "and then go to see the dentist."
That afternoon Helen told her mother about her tooth, and her mother (4)_____ her to the dentist's a few hours (5)______. The dentist looked at the tooth and then said to Helen. "It's very (6)______. I'm going to pull it (7)_____ , and then you're going to get a new tooth. It will be as nice as (8)______ next year." Then he did it with no trouble.
The next day Helen's teacher asked her about the tooth. She said to her, "Does it (9)______ hurt, Helen?"
"I don't know. You'd better ask the dentist, "Helen answered.
"Why?" the teacher asked.
"Because the dentist has (10)______ it, " Helen answered.
The prediction code is below. It walks through the many sentence combinations returned by pass_proce; for each combination it predicts, at every [MASK] position, the score of each of the 4 options and adds it into the probability matrix built above. (Why add? Each masked sentence appears in many combinations and is therefore predicted many times, so summing effectively pools all those predictions into a single one.)
#text processing
text = pass_proce("passage.txt", 10)  # build the sentence combinations for the passage
for mask_sen in text:
    for per_sen in mask_sen:
        tokenized_text = tokenizer.tokenize(per_sen[0])
        broke_point = tokenized_text.index('[SEP]')
        segments_ids = [0] * (broke_point + 1) + [1] * (len(tokenized_text) - broke_point - 1)
        que_idxs = per_sen[1]
        ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenized_text)])
        segments_tensors = torch.tensor([segments_ids])
        ids = ids.to('cuda:0')
        segments_tensors = segments_tensors.to('cuda:0')
        #locate the [MASK] positions
        mask_num = tokenized_text.count('[MASK]')
        mask_idxs = [idx for idx in range(len(tokenized_text)) if tokenized_text[idx] == '[MASK]']
        #predict the answers
        with torch.no_grad():
            result = bert(ids, segments_tensors)
        for i in range(mask_num):
            mask_idx = mask_idxs[i]
            this_ans_prob = [result[0][mask_idx][choice_idx].item() for choice_idx in choices_idx[que_idxs[i]]]
            ans_prob[que_idxs[i]] = [ans_prob[que_idxs[i]][j] + this_ans_prob[j] for j in range(4)]
This gives the prediction result (the probability matrix). We can normalize it, although for this task normalization changes nothing (the argmax is unaffected by scaling); if I have the energy later, other ways of combining the scores, e.g. weighting the pairings differently, might be worth trying:
#normalize (10 is the per_times used above)
for i in range(len(choices)):
    for j in range(4):
        ans_prob[i][j] /= 10
Simply take the option with the highest score as the prediction:
#compute the predicted answers
print(ans_prob)
ans_pred = []
for per_que in ans_prob:
    ans = ['A', 'B', 'C', 'D'][int(np.argmax(per_que))]
    ans_pred.append(ans)
print(ans_pred)
Then, to measure the accuracy, load the text file with the correct answers and compare:
def ans_proce(file):
    f = open(file, 'r', encoding='gb18030', errors='ignore')
    buffer = f.readline()
    answers = []
    while buffer != '':
        buffer = buffer.strip()
        answers.append(buffer)
        buffer = f.readline()
    # print(answers)
    return answers
#load the correct answers
ans_correct = ans_proce('answer.txt')
#compute the accuracy
correct = 0.0
for i in range(len(choices)):
    if ans_pred[i] == ans_correct[i]:
        correct += 1
print("the correct rate is :" + str(correct / len(choices) * 100.0) + "%")
The correct answers for the passage above are:
A
D
C
B
C
D
C
D
C
A
BERT's answers and its accuracy:
['A', 'D', 'C', 'B', 'C', 'D', 'C', 'B', 'C', 'C']
the correct rate is :80.0%
Some closing thoughts (rambling):
(1) Without modification or transfer learning, BERT can't handle questions whose options span two words, e.g. choosing among 'others', 'the other', 'another' and 'other': for 'the other' there is no way for BERT to predict two words for a single blank. You might think: just use two [MASK]s. But then the score BERT gives 'the other' would be the product of the probabilities of 'the' and 'other' at the two blanks, and since each factor is below 1 the product is systematically far smaller than a typical single-word probability, so the model would almost never choose 'the other'. That can't be right. (A concrete sketch of this issue follows after these notes.)
(2) I only tried one passage, so this accuracy figure really isn't reliable. I should crawl more real exam questions to test on, or take arbitrary passages and have the code mask random words itself and then check its own predictions (similar to BERT's self-supervised pre-training). Later I may also try other evaluation metrics (ROC curves and the like).
(3) My first attempt was actually much simpler: feed the whole passage in at once and predict the probabilities of the options for all 10 blanks, but the accuracy was only 60%. My guess is that with that much input text the important information gets diluted, which is why I switched to the two-sentence-combination input described above. A key parameter of this approach is per_times (how many partner sentences each masked sentence is combined with); when it is well matched to the total passage length, the accuracy is noticeably higher.
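To make point (1) concrete, here is a minimal sketch (my own illustration; option_prob is a hypothetical helper, not part of the code above) of why a two-[MASK] encoding punishes multi-word options:

import torch.nn.functional as F

def option_prob(tokens, option_words):
    # hypothetical helper: probability of filling the [MASK]s in `tokens`
    # with `option_words`, one word per mask, multiplying the per-mask probabilities
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)]).to('cuda:0')
    with torch.no_grad():
        scores = bert(ids)  # shape: (1, seq_len, vocab_size)
    probs = F.softmax(scores[0], dim=-1)
    mask_positions = [i for i, t in enumerate(tokens) if t == '[MASK]']
    p = 1.0
    for pos, word in zip(mask_positions, option_words):
        p *= probs[pos, tokenizer.convert_tokens_to_ids([word])[0]].item()
    return p

one = tokenizer.tokenize('[CLS] it will be as nice as [MASK] next year . [SEP]')
two = tokenizer.tokenize('[CLS] it will be as nice as [MASK] [MASK] next year . [SEP]')
print(option_prob(one, ['another']))       # one factor below 1
print(option_prob(two, ['the', 'other']))  # product of two factors below 1: typically far smaller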
I'm very much a beginner, please don't flame me; this is my first attempt at posting on CSDN.
Contact:
QQ:839963220
Original article; if you repost it, please keep the original URL, author information, etc.