Extracting person names with named-entity recognition (NER) in NLP using Python

I have a sentence, and I need to identify just the person names in it:

For example:

sentence = "Larry Page is an American business magnate and computer scientist who is the co-founder of Google, alongside Sergey Brin"

I used the code below to run NER.

from nltk import word_tokenize, pos_tag, ne_chunk

print(ne_chunk(pos_tag(word_tokenize(sentence))))

The output I get is:

(S
  (PERSON Larry/NNP)
  (ORGANIZATION Page/NNP)
  is/VBZ
  an/DT
  (GPE American/JJ)
  business/NN
  magnate/NN
  and/CC
  computer/NN
  scientist/NN
  who/WP
  is/VBZ
  the/DT
  co-founder/NN
  of/IN
  (GPE Google/NNP)
  ,/,
  alongside/RB
  (PERSON Sergey/NNP Brin/NNP))

I want to extract all the person names, e.g.

Larry Page

Sergey Brin
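Before reaching for Stanford NER, it is worth noting what plain nltk already gives you here. The sketch below walks a hypothetical stand-in for the chunk tree printed above (named-entity chunks as (label, tokens) pairs, ordinary words as (word, pos) pairs); with a real nltk.Tree you would instead test isinstance(subtree, Tree) and subtree.label() == "PERSON" and join the words of subtree.leaves(). Because ne_chunk tagged Page as ORGANIZATION, "Larry Page" does not come out whole, which motivates switching taggers:

```python
# Hypothetical stand-in for the chunk tree printed above: named-entity
# chunks are (label, tokens) pairs, plain words are (word, pos) pairs.
chunked = [
    ("PERSON", ["Larry"]),
    ("ORGANIZATION", ["Page"]),
    ("is", "VBZ"), ("an", "DT"),
    ("GPE", ["American"]),
    ("business", "NN"), ("magnate", "NN"), ("and", "CC"),
    ("computer", "NN"), ("scientist", "NN"), ("who", "WP"),
    ("is", "VBZ"), ("the", "DT"), ("co-founder", "NN"), ("of", "IN"),
    ("GPE", ["Google"]),
    (",", ","), ("alongside", "RB"),
    ("PERSON", ["Sergey", "Brin"]),
]

# Keep only chunks labelled PERSON and join their tokens into one name.
persons = [" ".join(tokens) for label, tokens in chunked
           if isinstance(tokens, list) and label == "PERSON"]
print(persons)  # ['Larry', 'Sergey Brin']
```

Note the first result is only 'Larry', not 'Larry Page', because the mislabelled ORGANIZATION chunk breaks the entity in two.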

To achieve this, I looked at this link and tried the following.

from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz','/usr/share/stanford-ner/stanford-ner.jar')

But I keep getting this error:

LookupError: Could not find stanford-ner.jar jar file at /usr/share/stanford-ner/stanford-ner.jar

Where can I download this file?

As mentioned above, the result I expect, as a list or a dictionary, is:

Larry Page

Sergey Brin

Best answer

Please read this carefully and understand the solution; don't just copy and paste.

TL;DR

In the terminal:

pip install -U nltk

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31

# Note: `ner` added to -preload so the NER annotator is warmed up before use.
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
  -preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
  -status_port 9000 -port 9000 -timeout 15000

In Python:

from nltk.tag.stanford import CoreNLPNERTagger

def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []

    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk:  # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []

    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)

    return continuous_chunk

stner = CoreNLPNERTagger()
tagged_sent = stner.tag('Rami Eid is studying at Stony Brook University in NY'.split())

named_entities = get_continuous_chunks(tagged_sent)
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1])
                          for ne in named_entities]
print(named_entities_str_tag)

[OUT]:

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
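To finish answering the question, the person names alone can then be filtered out of these (entity, tag) pairs with a plain list comprehension:

```python
# (entity, tag) pairs as produced by get_continuous_chunks above.
named_entities_str_tag = [('Rami Eid', 'PERSON'),
                          ('Stony Brook University', 'ORGANIZATION'),
                          ('NY', 'LOCATION')]

# Keep only the entities tagged PERSON.
person_names = [name for name, tag in named_entities_str_tag if tag == 'PERSON']
print(person_names)  # ['Rami Eid']
```

Run against the question's own sentence, the same filter would yield ['Larry Page', 'Sergey Brin'], provided the tagger labels both names as PERSON.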
