[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (Part 1) — Understanding a Single Instance
[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (Part 2) — Counting Relation Classes and Instances
[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (Part 3) — Plotting the Relation Distribution
I recently got my hands on a relation extraction dataset, nyt-wiki, and analyzed its relation distribution, overlap, and so on. This series shares the analysis approach and the code. Let's start with a single raw line from train.txt:
['/m/0124gn1g', '/m/02lx2r', 'trick', 'album', 'instance_of', 'utah saints pulls off a similar trick on its hit single, "something good," the opening track on its eponymous debut album (london/plg 828 374-2; cd and cassette).', '###END###']
As you can see, a single instance is made up of:
[head entity id, tail entity id, head entity, tail entity, relation name, sentence, end marker]
seven parts in total, of which we only need the first six.
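As a quick sanity check, here is a minimal sketch (assuming the raw file is named train.txt and is tab-separated, as in the conversion code later in this post) that reads the first line and prints each field next to its meaning:

field_names = ["head entity id", "tail entity id", "head entity",
               "tail entity", "relation name", "sentence", "end marker"]

with open("train.txt", 'r', encoding='utf-8') as f:
    # take the first raw line and split it on tabs into its seven parts
    fields = f.readline().strip().split('\t')

for name, value in zip(field_names, fields):
    print(f"{name}: {value}")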
It is hard to remember which list index corresponds to which field, so we convert each instance from a list into a dict, which makes the data much easier to access. The keys of the dict are as follows:
{"text": , "relation": , "h": {"id": , "name": , "pos": }, "t": {"id": , "name": , "pos": }}
This way we can fetch a field directly with instance["text"], which is very convenient. After converting the list to a dict, one instance looks like this:
{
    "text": "utah saints pulls off a similar trick on its hit single, \"something good,\" the opening track on its eponymous debut album (london/plg 828 374-2; cd and cassette).",
    "relation": "instance_of",
    "h": {
        "id": "/m/0124gn1g",
        "name": "trick",
        "pos": [
            32,
            37
        ]
    },
    "t": {
        "id": "/m/02lx2r",
        "name": "album",
        "pos": [
            116,
            121
        ]
    }
}
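The pos field of each entity stores [start, end) character offsets into text, so slicing the sentence with them should give back the entity name. A small sketch checking this on the instance above:

instance = {
    "text": ('utah saints pulls off a similar trick on its hit single, '
             '"something good," the opening track on its eponymous debut '
             'album (london/plg 828 374-2; cd and cassette).'),
    "relation": "instance_of",
    "h": {"id": "/m/0124gn1g", "name": "trick", "pos": [32, 37]},
    "t": {"id": "/m/02lx2r", "name": "album", "pos": [116, 121]},
}

for entity in (instance["h"], instance["t"]):
    start, end = entity["pos"]
    # the slice of the sentence should match the entity name exactly
    assert instance["text"][start:end] == entity["name"]
    print(entity["name"], instance["text"][start:end])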
NOTE: the head entity h and the tail entity t are each stored as a dict with three fields: {id, name, pos}. A dict can be converted to and from JSON, which makes saving and loading the data clean and consistent. The conversion code is as follows:
import json

train_rel_fre_dict = {}  # relation-frequency dict (not used in this conversion step)

with open("nytwiki_train.txt", 'w', encoding='utf-8') as f_op:
    with open("train.txt", 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            # Each raw line is tab-separated:
            # [head id, tail id, head entity, tail entity, relation, sentence, end marker]
            line = line.strip().split('\t')
            # Locate the head (tail) entity in the sentence as [start, end) character offsets
            pos1 = [line[5].index(line[2]), line[5].index(line[2]) + len(line[2])]
            pos2 = [line[5].index(line[3]), line[5].index(line[3]) + len(line[3])]
            train_data = {}
            train_data['text'] = line[5]
            train_data['relation'] = line[4]
            temp1 = {}
            temp1['id'] = line[0]
            temp1['name'] = line[2]
            temp1['pos'] = pos1
            train_data['h'] = temp1
            temp2 = {}
            temp2['id'] = line[1]
            temp2['name'] = line[3]
            temp2['pos'] = pos2
            train_data['t'] = temp2
            # json.dump writes the dict straight to the file object
            # (json.loads parses a string, json.load parses a file when reading back)
            json.dump(train_data, f_op)
            f_op.write('\n')
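
To read the converted file back, parse it line by line with json.loads (a minimal sketch, assuming nytwiki_train.txt was produced by the code above):

import json

train_instances = []
with open("nytwiki_train.txt", 'r', encoding='utf-8') as f:
    for line in f:
        # each line holds one JSON-encoded instance
        train_instances.append(json.loads(line))

print(len(train_instances), "instances loaded")
print(train_instances[0]["relation"], train_instances[0]["h"]["name"])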