Bio-NLP CRF Mini-Assignment

Introduction to the BIO format
AGAC Track official site
Bio-NLP course link

Data download

1. The json module (reference)

import json

# encode a Python object as a JSON string
data = [{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}]
json.dumps(data)
# pretty-print with sorted keys, 4-space indent and custom separators
print(json.dumps({'a': 'Runoob', 'b': 7}, sort_keys=True, indent=4, separators=(',', ':')))

# decode a JSON string back into a Python object
jsonData = '{"a":1,"b":2,"c":3,"d":4,"e":5}'
text = json.loads(jsonData)

1. json.dumps() and json.loads() convert between Python objects and JSON-formatted strings (you can think of JSON here simply as a string):
  (1) json.dumps() encodes a Python object such as a list or dict into a JSON string (roughly: dict in, string out).
  (2) json.loads() decodes JSON-formatted data into the corresponding Python object (roughly: string in, dict out).

2. json.dump() and json.load() are the file-oriented counterparts, used to write and read JSON files.
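
A minimal sketch of the file-oriented pair (the file name data.json is just an illustration):

import json

record = {'pmid': '12345678', 'label': 'O'}

# json.dump() serializes a Python object directly into a file
with open('data.json', 'w') as f:
    json.dump(record, f)

# json.load() parses the file back into a Python object
with open('data.json') as f:
    restored = json.load(f)
print(restored == record)   # True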

2. The spaCy module (reference 1, reference 2)

Installing third-party packages in PyCharm
Adding conda to the PATH on Windows

On Linux:

  • pip install spacy
  • python -m spacy download en

(The en shortcut is the spaCy 2.x model name; newer releases ship it as en_core_web_sm.)
[screenshots: downloading the en model and the resulting output; the first 10 tokens]

Some spaCy usage is shown below; see reference 1 for more.

text = "The sequel, Yes, Prime Minister, ran from 1986 to 1988. In total there were 38 episodes, of which all but one lasted half an hour. Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker. Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013."
text
import spacy
nlp = spacy.load("en")
doc = nlp(text)
doc
# have spaCy list every token appearing in this passage
for token in doc:
    print('"' + token.text + '"')

for token in doc[:10]:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))
    
# print the named entities spaCy recognizes in the text
for ent in doc.ents:
    print(ent.text, ent.label_)


from spacy import displacy
# render the entity annotations inline (in a Jupyter notebook)
displacy.render(doc, style='ent', jupyter=True)
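
Step 5 below emits one blank line per sentence, so spaCy's own sentence segmentation is worth knowing too. A minimal sketch, reusing the doc object above:

# doc.sents yields one Span per sentence detected by the parser
for sent in doc.sents:
    print(sent.text)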

3. The glob module (reference); similar to Linux find -name

glob.glob
  • Returns a list of all matching file paths. Its single argument, pathname, defines the matching rule and may be an absolute or a relative path.
glob.iglob
  • Returns an iterator from which matching file paths can be fetched one by one. The difference from glob.glob: glob.glob collects all matching paths at once, while glob.iglob yields a single match at a time, much like DataSet versus DataReader in .NET database access.


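A minimal sketch of both functions, reusing the AGAC_training path from step 4 below:

from glob import glob, iglob

# glob.glob: all matching paths at once, as a list
json_files = glob('/home/kcao/test/tmp/AGAC_training/*.json')
print(len(json_files))

# glob.iglob: an iterator that yields one matching path at a time
for path in iglob('/home/kcao/test/tmp/AGAC_training/*.json'):
    print(path)
    break   # stop after the first match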

4. Split into training and test sets, 70:30

import random
import os
import shutil

def random_copyfile(srcPath, dstPath, lastpath, numfiles):
    # list every file in srcPath and sample numfiles of them at random for training
    name_list = list(os.path.join(srcPath, name) for name in os.listdir(srcPath))
    random_name_list = list(random.sample(name_list, numfiles))
    # everything that was not sampled becomes the test set
    last = [item for item in name_list if item not in random_name_list]
    # create both target directories if they do not exist
    for d in (dstPath, lastpath):
        if not os.path.exists(d):
            os.mkdir(d)
    for oldname in random_name_list:
        shutil.copyfile(oldname, oldname.replace(srcPath, dstPath))
    for file in last:
        shutil.copyfile(file, file.replace(srcPath, lastpath))

srcPath = '/home/kcao/test/tmp/AGAC_training'
dstPath = '/home/kcao/test/tmp/kcao_train_data'
lastpath = '/home/kcao/test/tmp/kcao_test_data'
# 175 files go to training (~70%), the rest to testing
random_copyfile(srcPath, dstPath, lastpath, 175)
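
Note: random.sample draws a different split on every run; call random.seed() with a fixed value before random_copyfile if the split needs to be reproducible.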

5. Convert the JSON files to BIO format for the CRF

Convert the training JSON and the test JSON into BIO format separately.
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 10 09:35:15 2019

@author: wyx
"""

import json
from glob import glob
import spacy

nlp = spacy.load('en')

def json2bio(fpath, output, splitby='s'):
    '''
    Read one annotated JSON file and append BIO lines (token pmid label).
    splitby = 's' ---- blank line after each sentence
    splitby = 'a' ---- blank line after each abstract
    '''
    with open(fpath) as f:
        pmid = fpath[-13:-5]    # 8-digit PMID taken from the file name
        annotations = json.load(f)
        text = annotations['text'].replace('\n', ' ')
        all_words = text.split(' ')                    # words split on spaces
        all_words2 = [token for token in nlp(text)]    # spaCy tokens of the full text
        all_label = ['O']*len(all_words)
        # project every annotated span onto the space-separated words
        for i in annotations['denotations']:
            b_location = i['span']['begin']
            e_location = i['span']['end']
            label = i['obj']
            # word index = number of spaces before the character offset
            B_wordloc = text.count(' ', 0, b_location)
            I_wordloc = text.count(' ', 0, e_location)
            all_label[B_wordloc] = 'B-'+label
            if B_wordloc != I_wordloc:
                for word in range(B_wordloc+1, I_wordloc+1):
                    all_label[word] = 'I-'+label
        # we now have the space-split word list and its label list;
        # re-tokenize each word with spaCy
        for w, _ in enumerate(all_words):
            all_words[w] = nlp(all_words[w])
        # expand the labels so they line up with the spaCy tokens
        labelchange = []
        for i, _ in enumerate(all_words):
            token = [token for token in all_words[i]]
            if len(token) == 1:
                labelchange.append(all_label[i])
            else:
                if all_label[i] == 'O':
                    labelchange.extend(['O']*len(token))
                if all_label[i] != 'O':
                    labelchange.append(all_label[i])
                    # a trailing '.' or ',' split off by spaCy stays outside the entity
                    if str(token[-1]) == '.' or str(token[-1]) == ',':
                        labelchange.extend(['I-'+all_label[i][2:]]*(len(token)-2))
                        labelchange.append('O')
                    else:
                        labelchange.extend(['I-'+all_label[i][2:]]*(len(token)-1))

        # write the output
        with open(output, 'a', encoding='utf-8') as f:
            # blank line after each sentence
            if splitby == 's':
                for j, _ in enumerate(all_words2):
                    if str(all_words2[j]) == '.' and str(all_words2[j-1]) != 'p':
                        line = str(all_words2[j])+'\t'+pmid+'\t'+labelchange[j]+'\n'
                        f.write(line+'\n')
                    else:
                        line = str(all_words2[j])+'\t'+pmid+'\t'+labelchange[j]+'\n'
                        f.write(line)
            # blank line after each abstract
            if splitby == 'a':
                for j, _ in enumerate(all_words2):
                    line = str(all_words2[j])+'\t'+pmid+'\t'+labelchange[j]+'\n'
                    f.write(line)
                f.write('\n')


if __name__ == "__main__":
    fpathlist = glob('/home/kcao/test/tmp/kcao_train_data/*.json')
    output = "/home/kcao/test/tmp/kcao_train_data/train.tab"
    for i in fpathlist:
        json2bio(i,output,'s')
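
Each line of the resulting train.tab is tab-separated: token, PMID, BIO label, with a blank line marking a sentence boundary. A hypothetical fragment (tokens, PMID and labels invented for illustration):

This	12345678	O
mutation	12345678	B-Var
impairs	12345678	O
function	12345678	O
.	12345678	O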

6. Training with Wapiti

The pattern file (.pat) content is as follows:
*

U:tok:1:-1:%X[-1,0]
U:tok:1:+0:%X[0,0]
U:tok:1:+1:%X[1,0]

U:tok:2:-1:%X[-1,0]/%X[0,0]
U:tok:2:+0:%X[0,0]/%X[1,0]

U:tok:3:-2:%X[-2,0]/%X[-1,0]/%X[0,0]
U:tok:3:-1:%X[-1,0]/%X[0,0]/%X[1,0]
U:tok:3:+0:%X[0,0]/%X[1,0]/%X[2,0]


U:pre:1:+0:4:%M[0,0,"^.?.?.?.?"]

U:suf:1:+0:4:%M[0,0,".?.?.?.?$"]
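
A brief note on the template syntax, as I understand the Wapiti manual: lines beginning with U define unigram feature templates (the bare * generates both label-unigram and label-bigram features); %X[r,c] inserts the observation at relative row offset r and column c, so %X[-1,0] is the previous token and %X[0,0] the current one; / joins several observations into one compound feature; and %M[r,c,"regex"] inserts the substring matched by the regular expression, which is how the pre/suf templates capture prefixes and suffixes of up to four characters.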
test-wapiti.sh is as follows:
#!/bin/bash
traininput_dir="kcao_train_data"
testinput_dir="kcao_test_data"
output_dir="kcao_output"
pattern_file="pat/Tok321dis.pat"
training_options=' -a sgd-l1 -t 3 -i 10 '
debug=0
verbose=0
patname=$(basename $pattern_file .pat)
corpus_name=$(basename $traininput_dir)

echo "================ Training $corpus_name (this may take some time) ================" 1>&2
# training: create a MODEL based on PATTERNS and TRAINING-CORPUS
# wapiti train -p PATTERNS TRAINING-CORPUS MODEL
echo "wapiti train $training_options -p $pattern_file <(cat $1) $output_dir/$patname-train-$corpus_name-$3.mod" 1>&2

wapiti train $training_options -p $pattern_file <(cat $traininput_dir/*.tab) $output_dir/$patname-train-$corpus_name.mod
# wapiti train -a bcd -t 2 -i 5 -p t.pat train-bio.tab t-train-bio.mod
#
# Note: The default algorithm, l-bfgs, stops early and does not succeed in annotating any token (all O)
# sgd-l1 works; bcd works

wapiti dump $output_dir/$patname-train-$corpus_name.mod $output_dir/$patname-train-$corpus_name.txt

echo "================ Inference $corpus_name ================" 1>&2
# inference (labeling): apply the MODEL to label the TEST-CORPUS, put results in TEST-RESULTS
# wapiti label -m MODEL TEST-CORPUS TEST-RESULTS
# -c : check (= evaluate)
# <(COMMAND ARGUMENTS ...) : runs COMMAND on ARGUMENTS ... and provides the results as if in a file
echo "wapiti label -c -m $output_dir/$patname-train-$corpus_name-$3.mod <(cat $1) $output_dir/$patname-train-test-$corpus_name-$3.tab" 1>&2
wapiti label -c -m $output_dir/$patname-train-$corpus_name.mod <(cat $testinput_dir/*) $output_dir/$patname-train-test-$corpus_name.tab
# wapiti label -c -m t-train-bio.mod test-bio.tab t-train-test-bio.tab
#echo "================ Evaluation with conlleval.pl $corpus_name ================" 1>&2
echo "Finished!"
# evaluate the resulting entities
# $'\t' is a way to obtain a tabulation in bash
#echo "$BINDIR/conlleval.pl -d $'\t' < $output_dir/$patname-train-test-$corpus_name-$3.tab | tee $output_dir/$patname-train-test-$corpus_name-$3.eval" 1>&2
perl conlleval.pl -d $'\t' < $output_dir/$patname-train-test-$corpus_name.tab | tee -a $output_dir/$patname-train-test-$corpus_name.eval
Result statistics:
The reported FB1 is only 12.73, which is far too low; the features in the .tab file probably need to be revised.

1. Extend the .pat file with 4-gram templates, e.g. U:tok:4:-3:%X[-3,0]/%X[-2,0]/%X[-1,0]/%X[0,0] (see the sketch below)
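
One possible 4-gram block, mirroring the offsets used above for the 3-gram templates (an illustration, not a tested solution):

U:tok:4:-3:%X[-3,0]/%X[-2,0]/%X[-1,0]/%X[0,0]
U:tok:4:-2:%X[-2,0]/%X[-1,0]/%X[0,0]/%X[1,0]
U:tok:4:-1:%X[-1,0]/%X[0,0]/%X[1,0]/%X[2,0]
U:tok:4:+0:%X[0,0]/%X[1,0]/%X[2,0]/%X[3,0]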


