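Ever since Google released BERT in 2018, I have been using BERT models as the base for training,
which regularly means annotating corpus data.
When the dataset is very large, tens of thousands of annotated lines leave you dizzy, so checking for mistakes quickly becomes a problem of its own.
I therefore set up 3 rules as basic checks; additions are welcome.
Prerequisites:
A label file that lists every label
A corpus that has already been annotated
1. Label correctness:
First, make sure that the label after each character has not been corrupted by a typo or a copy-paste slip.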
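# rule 1: check correctness. Every corpus line must be "one character + space + its label",
# and sentences are separated from each other by one extra newline \n (a blank line).
# Nothing may be missing and no label may be wrong.
# label.txt lists the labels, one per line, with no trailing \n after the last line
with open("label.txt", 'r', encoding='utf-8') as fd:
    content = fd.read()
    label = content.split("\n")
    print(label)
with open("corpus_2.txt", 'r', encoding='utf-8') as f4:
    content = f4.read()
    word_vector = content.split("\n")
    # word_vector = [i for i in word_vector if(len(str(i)) != 0)]
    print(word_vector)
num = 0
label_ok = True
for word in word_vector:
    num = num + 1
    if word != "":
        chinese = word.split(" ")
        # a leading space or a missing label means the line is malformed
        if word[0] == " " or len(chinese) < 2:
            print(num)
            print("wrong")
            label_ok = False
            continue
        if chinese[1] not in label:
            print(num)
            print(chinese[1])
            print("found a mislabeled tag")
            label_ok = False
            break
if label_ok:
    print("label correctness check passed")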
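The same check can also be wrapped in a function that collects every bad line instead of stopping at the first one. This is only a minimal sketch: the name `check_labels` and the sample data are hypothetical, and it assumes the same format of one character, one space and one tag per line.

```python
def check_labels(lines, labels):
    """Return (line_number, message) pairs for every malformed corpus line.

    Hypothetical helper: `lines` is the corpus split into lines,
    `labels` is the collection of allowed tags.
    """
    allowed = set(labels)
    errors = []
    for num, line in enumerate(lines, start=1):
        if line == "":
            continue  # a blank line separates two sentences
        parts = line.split(" ")
        if len(parts) != 2 or parts[0] == "":
            errors.append((num, "expected 'char tag'"))
        elif parts[1] not in allowed:
            errors.append((num, "unknown tag: " + parts[1]))
    return errors


# Made-up sample data, for illustration only
sample_labels = ["O", "B-LOC", "I-LOC"]
sample_lines = ["北 B-LOC", "京 I-LOC", "", "好 0", " O", "的 O"]
sample_errors = check_labels(sample_lines, sample_labels)
print(sample_errors)
```

Here line 4 uses the digit 0 instead of the letter O and line 5 starts with a space, so both are reported while the remaining lines pass.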
2. Rule 2: the IBO (BIO) tagging scheme, B-X, I-X, ...: a tagged span never starts with I-X, and within one span every tag must keep the same suffix.
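# rule 2: test the IBO tagging scheme
# my annotated corpus has about 43k lines
logo = []
# collect the tag of every line; "" marks the blank line between sentences
for word in word_vector:
    if word != "":
        chinese = word.split(" ")
        logo.append(chinese[1])
    else:
        logo.append(word)
# Scan from the last line backwards: every I-X must trace back to a B-X
# through an unbroken run of I-X tags.
ibo_ok = True
for i in range(len(logo) - 1, -1, -1):
    lab = logo[i]
    if len(lab) > 0 and lab[0] == "I":
        label_word = lab[2:]
        str_test = 'B-' + label_word
        str_test_2 = 'I-' + label_word
        temp = i
        # walk back over the I-X run without running past the first line
        while temp >= 0 and logo[temp] == str_test_2:
            temp = temp - 1
        if temp < 0 or logo[temp] != str_test:
            print(i + 1)  # 1-based line number of the offending tag
            print("wrong")
            ibo_ok = False
            break
if ibo_ok:
    print("IBO rule check passed")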
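The rule can also be checked in a single forward pass by comparing each I-X tag with the tag on the previous line. The sketch below is an alternative formulation, not the script above: `check_bio` and the tag sequences are hypothetical, and it assumes an empty string marks the blank line between sentences. Unlike the backward trace, it reports only the first offending tag of a broken run.

```python
def check_bio(tags):
    """Return the 1-based positions of I-X tags that do not continue an entity.

    Hypothetical helper: `tags` holds one tag per corpus line, with ""
    standing for the blank line between sentences.
    """
    bad = []
    prev = ""  # tag of the previous line; "" at the start of a sentence
    for pos, tag in enumerate(tags, start=1):
        if tag.startswith("I-"):
            suffix = tag[2:]
            # An I-X tag is valid only directly after B-X or I-X
            if prev not in ("B-" + suffix, "I-" + suffix):
                bad.append(pos)
        prev = tag
    return bad


# Made-up tag sequences, for illustration only
good_tags = ["B-LOC", "I-LOC", "O", "", "B-PER", "I-PER", "I-PER"]
bad_tags = ["O", "I-LOC", "", "B-PER", "I-LOC", "I-LOC"]
print(check_bio(good_tags))  # []
print(check_bio(bad_tags))   # [2, 5]
```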
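3. Rule 3: once the checks have passed and we generate train.txt, dev.txt, test.txt and so on, the data should be shuffled with whole sentences as the unit (crawled data tends to have sentences from the same domain clustered together).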
# rule 3: shuffle by sentence, then split into files
import random

with open("corpus_2.txt", 'r', encoding='utf-8') as fa:
    content = fa.read()
    # for w in range(0, len(content)):
    #     if content[w] == "\n":
    #         num = num + 1
    #     if content[w] == "\n" and content[w+1] == "\n" and content[w+2] == "\n":
    #         print(num)
    #         break
    # if "\n\n\n" in content:
    #     print("aaaaa")
    sentence_vector = content.split("\n\n")
    print(sentence_vector)
    random.shuffle(sentence_vector)
    print(sentence_vector)
with open("corpus_1.txt", 'w', encoding='utf-8') as fw:
    for sentence in sentence_vector:
        fw.write(sentence)
        fw.write("\n\n")
# After splitting by ratio, make sure dev.txt, train.txt, etc. each end with two \n, which the model expects as training input
# 43000 * 8/10
print(len(sentence_vector))
with open("train.txt", 'w', encoding='utf-8') as ftr:
    for ss in range(0, 1250):
        # write the ss-th shuffled sentence, not the leftover loop variable
        ftr.write(sentence_vector[ss])
        ftr.write("\n\n")
with open("dev.txt", 'w', encoding='utf-8') as ftr:
    for ss in range(1250, 1850):
        ftr.write(sentence_vector[ss])
        ftr.write("\n\n")
with open("test.txt", 'w', encoding='utf-8') as ftr:
    for ss in range(1850, len(sentence_vector)):
        ftr.write(sentence_vector[ss])
        ftr.write("\n\n")
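The fixed boundaries 1250 and 1850 only fit one particular corpus. As a rough sketch, the cut points can be derived from the number of sentences instead. The helpers `split_sentences` and `write_sentences` are hypothetical, and the 8:1:1 default is an assumption based on the "8/10" comment, not a confirmed split.

```python
import random


def split_sentences(sentences, train_ratio=0.8, dev_ratio=0.1, seed=None):
    """Shuffle a copy of `sentences` and cut it into train/dev/test lists.

    Hypothetical helper; the default ratios are an assumption.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    train_end = int(len(shuffled) * train_ratio)
    dev_end = train_end + int(len(shuffled) * dev_ratio)
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


def write_sentences(path, sentences):
    """Write each sentence followed by a blank line."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence)
            f.write("\n\n")


# Made-up sentences, for illustration only
demo = ["s%d O" % i for i in range(10)]
train, dev, test = split_sentences(demo, seed=0)
print(len(train), len(dev), len(test))  # 8 1 1
```

Passing a fixed `seed` makes the split reproducible, which helps when the same train/dev/test files need to be regenerated later.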
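Here I ran into a problem: the shuffled corpus_1.txt had a different length from the original corpus_2.txt. I checked the labels and the character counts of the entities one after another and found nothing wrong. In the end the cause turned out to be that I split on "\n\n" while the last sentence of corpus_2.txt ended with a single \n. As a result, one sentence in the newly generated file carries its own \n, which makes the new corpus file one line longer. So before splitting, always think about the newline after the last sentence.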
Take the following as an example:
import random

a = "a\n\nb\n\nO"
b = a.split("\n\n")
random.shuffle(b)
print(b)
a = "a\n\nb\n\nO\n"
b = a.split("\n\n")
random.shuffle(b)
print(b)
a = "a\n\nb\n\nO\n\n"
b = a.split("\n\n")
random.shuffle(b)
print(b)
result (the order inside each list varies from run to run because of the shuffle):
['a', 'b', 'O']
['a', 'O\n', 'b']
['O', 'a', '', 'b']
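One way to make the split insensitive to the final newline is to strip leading and trailing newlines first and drop empty chunks. This is a minimal sketch under that assumption; `split_corpus` is a hypothetical helper, not part of the original script.

```python
def split_corpus(content):
    """Split corpus text into sentences, ignoring stray blank lines.

    Hypothetical helper: newlines at both ends are stripped first,
    and empty chunks are dropped.
    """
    chunks = content.strip("\n").split("\n\n")
    return [chunk.strip("\n") for chunk in chunks if chunk.strip("\n") != ""]


# The three inputs from the example above now give the same result
for text in ("a\n\nb\n\nO", "a\n\nb\n\nO\n", "a\n\nb\n\nO\n\n"):
    print(split_corpus(text))  # ['a', 'b', 'O'] each time
```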
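These are all the checks I have come up with so far for speeding up verification. If I have missed anything, reminders and suggestions are welcome.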