CRF++

https://blog.csdn.net/lilong117194/article/details/83106711   ---- Named entity recognition: place-name recognition with CRF++ (a very detailed article)

http://www.hankcs.com/nlp/the-crf-model-format-description.html  ----- Description of the CRF++ model format

https://taku910.github.io/crfpp/   --- Official documentation

https://www.jianshu.com/p/0c99ea1c730c   ---- CRF experiment

 

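As for the B feature functions (here meaning specifically the simple form f(s', s)): during the backward pass of Viterbi decoding, once the previous label is fixed, it can be plugged into the current B feature function to compute a score for each output label. Summing those scores and sorting again then gives the result.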
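To make the role of f(s', s) concrete, here is a minimal Viterbi sketch in plain Python. It does not read a CRF++ model: the names `unigram_scores` and `transition` and all the numbers are made up for illustration, standing in for the summed U-feature weights at each position and the B-feature weights between adjacent labels.

```python
def viterbi(unigram_scores, transition, labels):
    """Return (best_path, best_score) for a toy linear-chain model.

    unigram_scores: list of dicts, label -> score at that position
                    (hypothetical stand-in for summed U-feature weights).
    transition:     dict, (prev_label, label) -> score
                    (hypothetical stand-in for B-feature weights f(s', s)).
    """
    if not unigram_scores:
        return [], 0.0

    # best[t][s] is the score of the best path that ends in label s at position t.
    best = [{s: unigram_scores[0].get(s, 0.0) for s in labels}]
    back = []  # back[t - 1][s] is the previous label on that best path

    for t in range(1, len(unigram_scores)):
        scores, pointers = {}, {}
        for s in labels:
            # Plug every candidate previous label into f(prev, s) and keep the best one.
            prev = max(labels, key=lambda p: best[-1][p] + transition.get((p, s), 0.0))
            scores[s] = (best[-1][prev]
                         + transition.get((prev, s), 0.0)
                         + unigram_scores[t].get(s, 0.0))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Backward pass: start from the best final label and follow the pointers.
    last = max(labels, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


labels = ["B", "E"]
unigram_scores = [{"B": 1.0, "E": 0.5}, {"B": 0.6, "E": 0.5}]
transition = {("B", "E"): 1.0, ("B", "B"): -1.0, ("E", "B"): 0.5, ("E", "E"): -1.0}
path, score = viterbi(unigram_scores, transition, labels)
```

With these toy numbers the unigram scores alone would pick B at both positions, but the transition scores make B followed by E the best path.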

 

 

 

# Written for Python 2, where this script passes byte strings to the CRF++
# Python binding (CRFPP); under Python 3 the encode/decode calls may need to go.
import codecs
import sys

import CRFPP

def crf_segmenter(input_file, output_file, tagger):
	# Read and write UTF-8; the with-block closes both files even on error.
	with codecs.open(input_file, 'r', 'utf-8') as input_data, \
			codecs.open(output_file, 'w', 'utf-8') as output_data:
		for line in input_data:
			tagger.clear()
			for word in line.strip():
				word = word.strip()
				if word:
					# One row per character, with multiple feature columns.
					# The extra columns must match the layout the model was trained on.
					tagger.add((word + "\to\tB").encode('utf-8'))
			tagger.parse()
			size = tagger.size()
			for i in range(size):
				# Column 0 is the character itself. Looping over every column
				# would also write the extra feature columns to the output.
				char = tagger.x(i, 0).decode('utf-8')
				tag = tagger.y2(i)
				if tag == 'B':
					# Beginning of a word
					output_data.write(' ' + char)
				elif tag == 'M':
					# Middle of a word
					output_data.write(char)
				elif tag == 'E':
					# End of a word
					output_data.write(char + ' ')
				else:
					# Any other tag: treat the character as a word on its own
					output_data.write(' ' + char + ' ')
			output_data.write('\n')

if __name__ == '__main__':
	if len(sys.argv) != 4:
		print("Usage: python " + sys.argv[0] + " model input output")
		sys.exit(-1)
	crf_model = sys.argv[1]
	input_file = sys.argv[2]
	output_file = sys.argv[3]
	tagger = CRFPP.Tagger("-m " + crf_model)
	crf_segmenter(input_file, output_file, tagger)
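The tag-to-output step of the script can be tried out without CRF++ installed. The sketch below is not part of the original script: `tags_to_words` is a hypothetical helper, and it assumes the usual B/M/E tags plus a single-character tag such as S, which is what the script's `else` branch appears to handle. It collects the characters into a list of words instead of writing spaces directly.

```python
def tags_to_words(chars, tags):
    """Group characters into words according to B/M/E/S-style tags."""
    words = []
    current = ""
    for char, tag in zip(chars, tags):
        if tag == "B":
            # A new word starts; flush any unfinished word first.
            if current:
                words.append(current)
            current = char
        elif tag == "M":
            current += char
        elif tag == "E":
            words.append(current + char)
            current = ""
        else:
            # "S" or any unexpected tag: the character is a word on its own.
            if current:
                words.append(current)
                current = ""
            words.append(char)
    # Flush a word that was never closed by an E tag.
    if current:
        words.append(current)
    return words


segmented = " ".join(tags_to_words("我爱北京", ["S", "S", "B", "E"]))
```

Joining the list with single spaces avoids the doubled spaces that appear when an E tag is immediately followed by a B tag and each writes its own space.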

 

 

 
