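Stanford NER (Stanford Named Entity Recognizer) is an open-source named entity recognition library from Stanford University. It is implemented in Java and identifies entities such as person names, locations, and organization names in text, using a CRF (conditional random field) classifier.
The steps below follow the official documentation and run the English model that ships with the library.
1. Download stanford-ner-2015-12-09.zip.
2. Unzip stanford-ner-2015-12-09.zip into a directory, for example stanford-ner.
3. Change into that directory: cd stanford-ner
4. On Linux or macOS, run the following command to test named entity recognition on sample.txt. It uses the English model bundled with Stanford NER, which recognizes person, location, and organization names.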
java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt
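5. The command prints the result below. Each token is followed by its label: O (the letter O, not the digit zero) marks a token that is not part of any entity, while PERSON and ORGANIZATION mark person names and organization names.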
The/O fate/O of/O Lehman/ORGANIZATION Brothers/ORGANIZATION ,/O the/O beleaguered/O investment/O bank/O ,/O hung/O in/O the/O balance/O on/O Sunday/O as/O Federal/ORGANIZATION Reserve/ORGANIZATION officials/O and/O the/O leaders/O of/O major/O financial/O institutions/O continued/O to/O gather/O in/O emergency/O meetings/O trying/O to/O complete/O a/O plan/O to/O rescue/O the/O stricken/O bank/O ./O
Several/O possible/O plans/O emerged/O from/O the/O talks/O ,/O held/O at/O the/O Federal/ORGANIZATION Reserve/ORGANIZATION Bank/ORGANIZATION of/ORGANIZATION New/ORGANIZATION York/ORGANIZATION and/O led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION ,/O and/O Treasury/ORGANIZATION Secretary/O Henry/PERSON M./PERSON Paulson/PERSON Jr./PERSON ./O
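The token/LABEL output above is easy to post-process. The following is a minimal Python sketch, not part of Stanford NER, that groups adjacent tokens sharing the same label into entity strings. The function names `parse_tagged` and `extract_entities` are hypothetical helpers introduced for illustration.

```python
from itertools import groupby


def parse_tagged(line):
    """Split 'token/LABEL' items into (token, label) pairs.

    rpartition splits on the LAST slash, so a token such as '1/2'
    keeps its own slash.
    """
    pairs = []
    for item in line.split():
        token, _, label = item.rpartition("/")
        pairs.append((token, label))
    return pairs


def extract_entities(pairs, outside="O"):
    """Merge runs of adjacent tokens that carry the same non-O label."""
    entities = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label != outside:
            text = " ".join(token for token, _ in run)
            entities.append((text, label))
    return entities


if __name__ == "__main__":
    sample = (
        "led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O "
        "president/O of/O the/O New/ORGANIZATION York/ORGANIZATION "
        "Fed/ORGANIZATION"
    )
    print(extract_entities(parse_tagged(sample)))
    # [('Timothy R. Geithner', 'PERSON'), ('New York Fed', 'ORGANIZATION')]
```

Because this output format has no begin/inside markers, two different entities of the same type that sit directly next to each other are merged into one span by this sketch.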
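The steps below follow the official documentation and train a custom model.
1. Prepare the training data. Each line has two tab-separated columns: the first is the token and the second is its label. Blank lines separate "documents", where a document can be a sentence or a paragraph. Keep documents short, because long ones consume a lot of memory. Taking Indonesian as the example, convert the following sentence into training data: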
Pengamat politik dari Universitas Gadjah Mada, Arie Sudjito, menilai, keinginan Ketua Umum Partai Golkar Aburizal Bakrie untuk maju kembali sebagai ketua umum merupakan pemaksaan kehendak.
Save the training data to jane-austen-emma-ch1.tsv:
Pengamat	O
politik	O
dari	O
Universitas	ORGANIZATION
Gadjah	ORGANIZATION
Mada	ORGANIZATION
,	O
Arie	PERSON
Sudjito	PERSON
,	O
menilai	O
,	O
keinginan	O
Ketua	O
Umum	O
Partai	ORGANIZATION
Golkar	ORGANIZATION
Aburizal	PERSON
Bakrie	PERSON
untuk	O
maju	O
kembali	O
sebagai	O
ketua	O
umum	O
merupakan	O
pemaksaan	O
kehendak	O
.	O
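A row that lacks its label column, or that uses spaces instead of a tab, is easy to overlook in a hand-written training file. The sketch below is a small Python check, independent of Stanford NER, that reports every non-blank row that is not exactly token, tab, label. The function name `find_bad_rows` is a hypothetical helper, and the sample text contains a deliberately malformed row.

```python
def find_bad_rows(tsv_text):
    """Return (line_number, line) for rows that are not 'token<TAB>label'."""
    bad = []
    for number, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue  # a blank line is a document separator, not an error
        columns = line.split("\t")
        if len(columns) != 2 or not all(col.strip() for col in columns):
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # The fourth line is deliberately malformed: no tab, no label.
    sample = (
        "Universitas\tORGANIZATION\n"
        "Gadjah\tORGANIZATION\n"
        "\n"
        "Ketua Umum\n"
        "Partai\tORGANIZATION\n"
    )
    for number, line in find_bad_rows(sample):
        print(f"line {number}: expected token<TAB>label, got {line!r}")
```

To check a real file, read it with `open("jane-austen-emma-ch1.tsv", encoding="utf-8").read()` and pass the text to `find_bad_rows`.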
2. Create the properties file and save it as austen.prop:
# location of the training file
trainFile = jane-austen-emma-ch1.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
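Before training, it can help to confirm that the properties file contains the keys this walkthrough relies on. The sketch below is a simplified Python reader that handles only `key = value` lines and `#` comments, not the full Java properties syntax. The names `load_props`, `parse_column_map`, `missing_keys`, and `EXPECTED_KEYS` are hypothetical, and the key list reflects this tutorial rather than any official list of required settings.

```python
# Keys used by this walkthrough (an assumption, not an official list).
EXPECTED_KEYS = ("trainFile", "serializeTo", "map")


def load_props(text):
    """Read 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the FIRST '=' only, so 'map = word=0,answer=1' survives.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def parse_column_map(value):
    """Turn 'word=0,answer=1' into {'word': 0, 'answer': 1}."""
    columns = {}
    for item in value.split(","):
        name, _, index = item.partition("=")
        columns[name.strip()] = int(index)
    return columns


def missing_keys(props):
    """List the expected keys that the properties file does not define."""
    return [key for key in EXPECTED_KEYS if key not in props]


if __name__ == "__main__":
    sample = (
        "# location of the training file\n"
        "trainFile = jane-austen-emma-ch1.tsv\n"
        "serializeTo = ner-model.ser.gz\n"
        "map = word=0,answer=1\n"
    )
    props = load_props(sample)
    print(missing_keys(props))
    print(parse_column_map(props["map"]))
```

The parsed column map should agree with the training file layout: the token in column 0 and the label in column 1.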
3. Run the following command to train the model. It produces the NER model ner-model.ser.gz.
java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
4. Create the test data and save it as test.txt:
Hal ini, kata Arie, berpotensi menimbukan perpecahan di kalangan kader Golkar di daerah.
5. Run the following command to test the model:
java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt
6. The result:
Hal/O ini/O ,/O kata/O Arie/PERSON ,/O berpotensi/O menimbukan/O perpecahan/O di/O kalangan/O kader/O Golkar/ORGANIZATION di/O daerah/O ./O
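The training and tagging commands above can also be driven from a script. The sketch below is a Python wrapper, assuming that `java` is on the PATH and that the script runs from the stanford-ner directory. The helper names (`ner_command`, `train_command`, `tag_command`, `run`) are hypothetical. It builds the classpath with `os.pathsep`, because the `:` separator in the commands above is specific to Linux and macOS, while Windows uses `;`.

```python
import os
import subprocess

MAIN_CLASS = "edu.stanford.nlp.ie.crf.CRFClassifier"


def ner_command(*args, heap="600m", pathsep=os.pathsep):
    """Build the java command line as an argument list.

    With an argument list and no shell, the '*' classpath wildcards
    reach java unexpanded, like the quoted "*:lib/*" in the shell.
    """
    classpath = pathsep.join(["*", "lib/*"])
    return ["java", f"-mx{heap}", "-cp", classpath, MAIN_CLASS, *args]


def train_command(prop_file):
    """Command that trains a model from a properties file."""
    return ner_command("-prop", prop_file)


def tag_command(model_file, text_file):
    """Command that tags a text file with a serialized model."""
    return ner_command("-loadClassifier", model_file, "-textFile", text_file)


def run(command):
    """Run a command, raise on a non-zero exit code, return its stdout."""
    result = subprocess.run(
        command, check=True, capture_output=True, text=True
    )
    return result.stdout
```

For example, `run(train_command("austen.prop"))` followed by `print(run(tag_command("ner-model.ser.gz", "test.txt")))` would mirror steps 3 and 5, assuming the tagged text is written to standard output.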
This article covers only the command-line usage of Stanford NER. For using Stanford NER from code, refer to the Stanford NER documentation.