Introduction to the BIO format
AGAC Track official website
Bio-NLP course link
1. Using the json module (reference)
import json
data=[{"a":1,"b":2,"c":3,"d":4,"e":5}]
json.dumps(data)
print(json.dumps({'a': 'Runoob', 'b': 7}, sort_keys=True, indent=4, separators=(',', ':')))
###
jsonData = '{"a":1,"b":2,"c":3,"d":4,"e":5}'
text=json.loads(jsonData)
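1. json.dumps() and json.loads() convert between Python objects and JSON text (JSON itself is just a string).
(1) json.dumps() encodes a Python object, such as a list or a dict, as a JSON-formatted string.
(2) json.loads() parses a JSON-formatted string back into the corresponding Python object; a JSON object becomes a dict.
2. json.dump() and json.load() do the same conversions against file objects and are mainly used to write and read JSON files.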
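The difference between the string functions and the file functions can be sketched as follows. This is a minimal example; the record contents and the temporary file name are made up for illustration.

```python
import json
import os
import tempfile

# Illustrative record; the values are invented for this example.
record = {"text": "BRCA1 mutation",
          "denotations": [{"obj": "Gene", "span": {"begin": 0, "end": 5}}]}

# dumps/loads: Python object <-> JSON string
encoded = json.dumps(record, sort_keys=True)
decoded = json.loads(encoded)

# dump/load: Python object <-> JSON file
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(type(encoded).__name__, decoded == record, loaded == record)
```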
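2. Using the spaCy module (reference 1, reference 2)
Installing third-party modules in PyCharm
Adding conda to the environment variables on Windows
On Linux, set up the environment as follows:
- pip install spacy
- python -m spacy download en
Note: the en shortcut works on spaCy 2.x only; spaCy 3 and later require the full package name, for example en_core_web_sm.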
Some spaCy usage is shown below; see reference 1 for more.
text = "The sequel, Yes, Prime Minister, ran from 1986 to 1988. In total there were 38 episodes, of which all but one lasted half an hour. Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker. Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013."
text
import spacy
nlp = spacy.load("en")
doc = nlp(text)
doc
# Have spaCy list every token that occurs in this passage
for token in doc:
    print('"' + token.text + '"')
# Show several attributes of the first 10 tokens, tab-separated
for token in doc[:10]:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))
# List the named entities that were recognized
for ent in doc.ents:
    print(ent.text, ent.label_)
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)
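3. The glob module (reference): similar to Linux find -name
glob.glob
- Returns a list of all matching file paths. Its main argument, pathname, defines the matching rule and may be an absolute or a relative path.
glob.iglob
- Returns an iterator that yields the matching paths one at a time. The difference from glob.glob() is that glob.glob collects all matching paths at once, whereas glob.iglob produces one path per step. This is somewhat like DataSet versus DataReader for database access in .NET.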
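A small sketch of the difference between the two functions. The temporary directory and file names are created only for this example.

```python
import glob
import os
import tempfile

# Build a throwaway directory with two .json files and one .txt file.
workdir = tempfile.mkdtemp()
for name in ("a.json", "b.json", "notes.txt"):
    open(os.path.join(workdir, name), "w").close()

pattern = os.path.join(workdir, "*.json")

# glob.glob returns the complete list of matches (order is not guaranteed, so sort it).
all_at_once = sorted(glob.glob(pattern))

# glob.iglob returns an iterator; each next() call yields one matching path.
lazy = glob.iglob(pattern)
first = next(lazy)

print([os.path.basename(p) for p in all_at_once])
print(os.path.basename(first))
```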
4. Training set and test set, split 7:3
import random
import os
import shutil

def random_copyfile(srcPath, dstPath, lastpath, numfiles):
    """Copy numfiles randomly chosen files to dstPath and the rest to lastpath."""
    name_list = [os.path.join(srcPath, name) for name in os.listdir(srcPath)]
    random_name_list = random.sample(name_list, numfiles)
    chosen = set(random_name_list)
    last = [item for item in name_list if item not in chosen]
    # Make sure both target directories exist before copying
    os.makedirs(dstPath, exist_ok=True)
    os.makedirs(lastpath, exist_ok=True)
    for oldname in random_name_list:
        shutil.copyfile(oldname, os.path.join(dstPath, os.path.basename(oldname)))
    for file in last:
        shutil.copyfile(file, os.path.join(lastpath, os.path.basename(file)))

srcPath = '/home/kcao/test/tmp/AGAC_training'
dstPath = '/home/kcao/test/tmp/kcao_train_data'
lastpath = '/home/kcao/test/tmp/kcao_test_data'
# 175 training files, which is 70% if srcPath holds 250 files
random_copyfile(srcPath, dstPath, lastpath, 175)
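The call above passes the number of training files by hand. A minimal sketch of deriving the split from the 7:3 ratio instead is shown here; split_names and its seed parameter are hypothetical helpers introduced for illustration, not part of the original script.

```python
import random

def split_names(names, train_ratio=0.7, seed=0):
    """Shuffle names reproducibly and cut them into (train, test) by ratio."""
    rng = random.Random(seed)          # fixed seed so the split can be repeated
    shuffled = sorted(names)           # sort first so the input order does not matter
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# Made-up file names standing in for the contents of the source directory.
names = ["doc{}.json".format(i) for i in range(10)]
train, test = split_names(names)
print(len(train), len(test))
```

The two returned lists could then be copied with shutil.copyfile in the same way as in random_copyfile.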
5. Converting JSON to BIO format for use with a CRF
Convert the training JSON files and the test JSON files to BIO format separately.
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 10 09:35:15 2019
@author: wyx
"""
import json
from glob import glob
import spacy
# On spaCy 3 and later the 'en' shortcut is gone; load 'en_core_web_sm' instead
nlp = spacy.load('en')

def json2bio(fpath, output, splitby='s'):
    '''
    Read a JSON file and append BIO lines (token, pmid, label) to output.
    splitby = 's' ---- blank line after each sentence
    splitby = 'a' ---- blank line after each abstract
    '''
    with open(fpath) as f:
        # Assumes the file name ends with an 8-character PMID plus '.json'
        pmid = fpath[-13:-5]
        annotations = json.load(f)
        text = annotations['text'].replace('\n', ' ')
        all_words = text.split(' ')
        all_words2 = [token for token in nlp(text)]
        all_label = ['O'] * len(all_words)
        for i in annotations['denotations']:
            b_location = i['span']['begin']
            e_location = i['span']['end']
            label = i['obj']
            # Word index = number of spaces before the character offset
            B_wordloc = text.count(' ', 0, b_location)
            I_wordloc = text.count(' ', 0, e_location)
            all_label[B_wordloc] = 'B-' + label
            if B_wordloc != I_wordloc:
                for word in range(B_wordloc + 1, I_wordloc + 1):
                    all_label[word] = 'I-' + label
        # all_words / all_label now hold the space-split words and their labels
        for w, _ in enumerate(all_words):
            all_words[w] = nlp(all_words[w])
        # Tokenize each word again and spread its label over the sub-tokens
        labelchange = []
        for i, _ in enumerate(all_words):
            token = [token for token in all_words[i]]
            if len(token) == 1:
                labelchange.append(all_label[i])
            else:
                if all_label[i] == 'O':
                    labelchange.extend(['O'] * len(token))
                if all_label[i] != 'O':
                    labelchange.append(all_label[i])
                    # Trailing '.' or ',' is not part of the entity
                    if str(token[-1]) == '.' or str(token[-1]) == ',':
                        labelchange.extend(['I-' + all_label[i][2:]] * (len(token) - 2))
                        labelchange.append('O')
                    else:
                        labelchange.extend(['I-' + all_label[i][2:]] * (len(token) - 1))
    # Write to file
    with open(output, 'a', encoding='utf-8') as f:
        # Blank line after each sentence
        if splitby == 's':
            for j, _ in enumerate(all_words2):
                # A '.' ends the sentence unless the previous token is 'p'
                if str(all_words2[j]) == '.' and str(all_words2[j-1]) != 'p':
                    line = str(all_words2[j]) + '\t' + pmid + '\t' + labelchange[j] + '\n'
                    f.write(line + '\n')
                else:
                    line = str(all_words2[j]) + '\t' + pmid + '\t' + labelchange[j] + '\n'
                    f.write(line)
        # Blank line after each abstract
        if splitby == 'a':
            for j, _ in enumerate(all_words2):
                line = str(all_words2[j]) + '\t' + pmid + '\t' + labelchange[j] + '\n'
                f.write(line)
            f.write('\n')

if __name__ == "__main__":
    fpathlist = glob('/home/kcao/test/tmp/kcao_train_data/*.json')
    output = "/home/kcao/test/tmp/kcao_train_data/train.tab"
    for i in fpathlist:
        json2bio(i, output, 's')
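The core step of json2bio, mapping character spans to word-level B/I/O labels by counting the spaces before each offset, can be sketched without spaCy as follows. The sample sentence, offsets, and the labels Var and Gene are made up for illustration; the field names text, denotations, span, begin, end and obj are the ones the script reads. The helper name spans_to_bio is hypothetical.

```python
def spans_to_bio(text, denotations):
    """Label space-separated words with B-/I-/O tags from character spans."""
    words = text.split(' ')
    labels = ['O'] * len(words)
    for d in denotations:
        begin, end = d['span']['begin'], d['span']['end']
        # The index of a word equals the number of spaces before its offset.
        first = text.count(' ', 0, begin)
        last = text.count(' ', 0, end)
        labels[first] = 'B-' + d['obj']
        for k in range(first + 1, last + 1):
            labels[k] = 'I-' + d['obj']
    return list(zip(words, labels))

# Invented example: end offsets are exclusive.
example_text = "loss of function mutation in BRCA1 gene"
example_denotations = [
    {"obj": "Var", "span": {"begin": 0, "end": 25}},
    {"obj": "Gene", "span": {"begin": 29, "end": 34}},
]

bio = spans_to_bio(example_text, example_denotations)
for word, label in bio:
    print(word + '\t' + label)
```

This sketch stops at whitespace tokens; the full script additionally re-tokenizes each word with spaCy so that attached punctuation gets its own line.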
6. Training with Wapiti
The pat (pattern) file contains:
*
U:tok:1:-1:%X[-1,0]
U:tok:1:+0:%X[0,0]
U:tok:1:+1:%X[1,0]
U:tok:2:-1:%X[-1,0]/%X[0,0]
U:tok:2:+0:%X[0,0]/%X[1,0]
U:tok:3:-2:%X[-2,0]/%X[-1,0]/%X[0,0]
U:tok:3:-1:%X[-1,0]/%X[0,0]/%X[1,0]
U:tok:3:+0:%X[0,0]/%X[1,0]/%X[2,0]
U:pre:1:+0:4:%M[0,0,"^.?.?.?.?"]
U:suf:1:+0:4:%M[0,0,".?.?.?.?$"]
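The token n-gram lines in this pattern file follow a regular naming scheme: U:tok:N:OFFSET: followed by N consecutive %X[row,0] references joined with "/". A small sketch that generates such lines is given here; ngram_pattern is a hypothetical helper that only formats strings and does not call Wapiti.

```python
def ngram_pattern(n, offset):
    """Build one 'U:tok' pattern line covering n tokens starting at a relative offset."""
    refs = "/".join("%X[{},0]".format(offset + k) for k in range(n))
    # {:+d} always prints the sign, matching names such as ':+0:' and ':-1:'
    return "U:tok:{}:{:+d}:{}".format(n, offset, refs)

# Reproduce some of the lines listed above.
lines = [ngram_pattern(1, -1), ngram_pattern(2, 0), ngram_pattern(3, -2)]
for line in lines:
    print(line)
```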
test-wapiti.sh is as follows:
#!/bin/bash
traininput_dir="kcao_train_data"
testinput_dir="kcao_test_data"
output_dir="kcao_output"
pattern_file="pat/Tok321dis.pat"
training_options=' -a sgd-l1 -t 3 -i 10 '
debug=0
verbose=0
patname=$(basename $pattern_file .pat)
corpus_name=$(basename $traininput_dir)
echo "================ Training $corpus_name (this may take some time) ================" 1>&2
# training: create a MODEL based on PATTERNS and TRAINING-CORPUS
# wapiti train -p PATTERNS TRAINING-CORPUS MODEL
echo "wapiti train $training_options -p $pattern_file <(cat $traininput_dir/*.tab) $output_dir/$patname-train-$corpus_name.mod" 1>&2
wapiti train $training_options -p $pattern_file <(cat $traininput_dir/*.tab) $output_dir/$patname-train-$corpus_name.mod
# wapiti train -a bcd -t 2 -i 5 -p t.pat train-bio.tab t-train-bio.mod
#
# Note: The default algorithm, l-bfgs, stops early and does not succeed in annotating any token (all O)
# sgd-l1 works; bcd works
wapiti dump $output_dir/$patname-train-$corpus_name.mod $output_dir/$patname-train-$corpus_name.txt
echo "================ Inference $corpus_name ================" 1>&2
# inference (labeling): apply the MODEL to label the TEST-CORPUS, put results in TEST-RESULTS
# wapiti label -m MODEL TEST-CORPUS TEST-RESULTS
# -c : check (= evaluate)
# <(COMMAND ARGUMENTS ...) : runs COMMAND on ARGUMENTS ... and provides the results as if in a file
echo "wapiti label -c -m $output_dir/$patname-train-$corpus_name.mod <(cat $testinput_dir/*) $output_dir/$patname-train-test-$corpus_name.tab" 1>&2
wapiti label -c -m $output_dir/$patname-train-$corpus_name.mod <(cat $testinput_dir/*) $output_dir/$patname-train-test-$corpus_name.tab
# wapiti label -c -m t-train-bio.mod test-bio.tab t-train-test-bio.tab
#echo "================ Evaluation with conlleval.pl $corpus_name ================" 1>&2
echo "Finished!"
# evaluate the resulting entities
# $'\t' is a way to obtain a tabulation in bash
#echo "$BINDIR/conlleval.pl -d $'\t' < $output_dir/$patname-train-test-$corpus_name-$3.tab | tee $output_dir/$patname-train-test-$corpus_name-$3.eval" 1>&2
perl conlleval.pl -d $'\t' < $output_dir/$patname-train-test-$corpus_name.tab | tee -a $output_dir/$patname-train-test-$corpus_name.eval
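The final command scores the labelled output with conlleval.pl, which reports precision, recall and FB1 at the entity level: a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. A simplified sketch of that calculation is given here. It is not a reimplementation of conlleval.pl; in particular it ignores an I- tag that does not continue an open entity, and the function names and sample labels are invented.

```python
def extract_entities(labels):
    """Return the set of (start, end, type) chunks in a BIO label sequence."""
    entities, start, etype = set(), None, None
    # A trailing 'O' closes any entity still open at the end of the sequence.
    for i, lab in enumerate(list(labels) + ['O']):
        continues = lab.startswith('I-') and etype == lab[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if lab.startswith('B-'):
            start, etype = i, lab[2:]
    return entities

def precision_recall_f1(gold, pred):
    """Entity-level precision, recall and F1 for two aligned label sequences."""
    g, p = extract_entities(gold), extract_entities(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented example: the second entity is predicted with a wrong right boundary.
gold = ['B-Gene', 'O', 'B-Var', 'I-Var', 'O']
pred = ['B-Gene', 'O', 'B-Var', 'O', 'O']
scores = precision_recall_f1(gold, pred)
print(scores)
```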
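The reported FB1 is only 12.73, which is still too low; the features in the tab file probably need to be changed.
1. Modify the pat file, extending the token n-grams up to U:tok:4:-3:%X[-3,0]/%X[-2,0]/%X[-1,0]/%X[0,0]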