Training Your Own Model with Stanford NER

Stanford NER

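Stanford NER (Stanford Named Entity Recognizer) is an open-source named entity recognition library from Stanford University, implemented in Java. It recognizes entities such as person names, locations, and organization names in text, using a CRF (conditional random field) classifier.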

Named entity recognition with Stanford NER

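This procedure follows the official documentation.
1. Download the package stanford-ner-2015-12-09.zip.
2. Unzip stanford-ner-2015-12-09.zip into a directory, for example stanford-ner.
3. Change into that directory: cd stanford-ner
4. On Linux/macOS, run the following command to tag the file sample.txt. It uses the English model bundled with Stanford NER, which recognizes person, location, and organization names.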

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt

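5. The command produces the output below. Each token is followed by its label: O means the token is not part of any entity, while PERSON and ORGANIZATION mark person names and organization names.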

The/O fate/O of/O Lehman/ORGANIZATION Brothers/ORGANIZATION ,/O the/O beleaguered/O investment/O bank/O ,/O hung/O in/O the/O balance/O on/O Sunday/O as/O Federal/ORGANIZATION Reserve/ORGANIZATION officials/O and/O the/O leaders/O of/O major/O financial/O institutions/O continued/O to/O gather/O in/O emergency/O meetings/O trying/O to/O complete/O a/O plan/O to/O rescue/O the/O stricken/O bank/O ./O 
Several/O possible/O plans/O emerged/O from/O the/O talks/O ,/O held/O at/O the/O Federal/ORGANIZATION Reserve/ORGANIZATION Bank/ORGANIZATION of/ORGANIZATION New/ORGANIZATION York/ORGANIZATION and/O led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION ,/O and/O Treasury/ORGANIZATION Secretary/O Henry/PERSON M./PERSON Paulson/PERSON Jr./PERSON ./O 
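
The inline word/TAG output above is easy to post-process. The following is a minimal Python sketch (not part of Stanford NER; the function name extract_entities is made up for this illustration) that groups consecutive tokens sharing the same non-O label into one entity.

```python
from itertools import groupby


def extract_entities(tagged_line, background="O"):
    """Group consecutive tokens that share a non-background tag."""
    pairs = []
    for item in tagged_line.split():
        # Split on the LAST slash so tokens that contain "/" stay intact.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))

    entities = []
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag != background:
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


line = (
    "led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O "
    "president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION"
)
print(extract_entities(line))
```

For this fragment the sketch yields "Timothy R. Geithner" as a PERSON and "New York Fed" as an ORGANIZATION. Because the output carries no begin/inside markers, two different entities of the same type that happen to be adjacent would be merged into one.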

Training your own model with Stanford NER

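This procedure follows the official documentation.
1. Prepare the training data. Each line has two tab-separated columns: the first is the token and the second is its label. Blank lines separate "documents"; a "document" can be a sentence or a paragraph, and should not be too long, otherwise training uses a lot of memory. Taking Indonesian as the example, convert the following sentence into training data: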

Pengamat politik dari Universitas Gadjah Mada, Arie Sudjito, menilai, keinginan Ketua Umum Partai Golkar Aburizal Bakrie untuk maju kembali sebagai ketua umum merupakan pemaksaan kehendak.

Save the training data to jane-austen-emma-ch1.tsv:

Pengamat	O
politik	O
dari	O
Universitas	ORGANIZATION
Gadjah	ORGANIZATION
Mada	ORGANIZATION
,	O
Arie	PERSON
Sudjito	PERSON
,	O
menilai	O
,	O
keinginan	O
Ketua	O
Umum	O
Partai	ORGANIZATION
Golkar	ORGANIZATION
Aburizal	PERSON
Bakrie	PERSON
untuk	O
maju	O
kembali	O
sebagai	O
ketua	O
umum	O
merupakan	O
pemaksaan	O
kehendak	O
.	O
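
Formatting slips in a hand-built training file (a space instead of a tab, or a word in the label column) are easy to miss. Here is a small Python sketch of a sanity check that could be run before training. The function name check_training_tsv and the allowed label set are assumptions for this example; Stanford NER itself accepts whatever labels appear in the file.

```python
def check_training_tsv(text, allowed_labels):
    """Return (line_number, message) pairs for lines that look malformed."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # a blank line is a document boundary, not an error
        columns = line.split("\t")
        if len(columns) != 2:
            problems.append(
                (lineno, f"expected 2 tab-separated columns, got {len(columns)}")
            )
        elif columns[1] not in allowed_labels:
            problems.append((lineno, f"unknown label {columns[1]!r}"))
    return problems


# Deliberately malformed sample: line 1 has a word in the label column,
# line 3 uses a space instead of a tab.
sample = "Ketua\tUmum\nPartai\tORGANIZATION\nGolkar ORGANIZATION\n\nuntuk\tO\n"
for lineno, message in check_training_tsv(sample, {"O", "PERSON", "ORGANIZATION"}):
    print(f"line {lineno}: {message}")
```

To check a real file, read it with open("jane-austen-emma-ch1.tsv", encoding="utf-8").read() and pass the text to the same function.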

2. Write the properties file and save it as austen.prop:

# location of the training file
trainFile = jane-austen-emma-ch1.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1

# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only 
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
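
Before starting a training run, it can help to confirm that the properties file names the training file, the output model, and the column mapping. The Python sketch below is only an illustration: it handles the simple key = value form used above, not the full Java .properties syntax, and the REQUIRED tuple is an assumption about which keys this walkthrough depends on.

```python
def read_props(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the FIRST "=" so values such as "word=0,answer=1" survive.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Assumption: the keys this walkthrough relies on.
REQUIRED = ("trainFile", "serializeTo", "map")


def missing_keys(props):
    """List required keys that are absent from the parsed properties."""
    return [key for key in REQUIRED if key not in props]


example = """
# location of the training file
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
"""
props = read_props(example)
print(props["map"])
print(missing_keys(props))
```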

3. Run the following command to train the model; it produces the NER model ner-model.ser.gz:

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

4. Create the test data and save it as test.txt:

Hal ini, kata Arie, berpotensi menimbukan perpecahan di kalangan kader Golkar di daerah.

5. Run the following command to test the model:

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz  -textFile test.txt

6. Output:

Hal/O ini/O ,/O kata/O Arie/PERSON ,/O berpotensi/O menimbukan/O perpecahan/O di/O kalangan/O kader/O Golkar/ORGANIZATION di/O daerah/O ./O
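
The train and test commands above can also be driven from a script. The Python sketch below builds the same java command lines; the helper names ner_command, train, and tag are made up for this example. It uses os.pathsep for the classpath separator, since the ":" in the commands above is specific to Linux/macOS, and it assumes the script is run from the stanford-ner directory with java on the PATH.

```python
import os
import subprocess

MAIN_CLASS = "edu.stanford.nlp.ie.crf.CRFClassifier"


def ner_command(*args, heap="600m"):
    """Build the java command line used in this walkthrough."""
    # Java expands the classpath wildcards itself, so no shell is needed.
    classpath = os.pathsep.join(["*", "lib/*"])
    return ["java", f"-mx{heap}", "-cp", classpath, MAIN_CLASS, *args]


def train(prop_file):
    """Train a model as described by the properties file."""
    subprocess.run(ner_command("-prop", prop_file), check=True)


def tag(model, text_file):
    """Tag a text file and return whatever the tool writes to stdout."""
    result = subprocess.run(
        ner_command("-loadClassifier", model, "-textFile", text_file),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Show the training command without running it.
print(" ".join(ner_command("-prop", "austen.prop")))
```

Calling train("austen.prop") and then tag("ner-model.ser.gz", "test.txt") would reproduce steps 3 and 5.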

Notes

This article only covers using Stanford NER from the command line. For using Stanford NER from code, see the Stanford NER documentation.
