Is this possible: to get (similar to) Stanford Named Entity Recognizer functionality using just NLTK?
Is there any example?
In particular, I am interested in extraction LOCATION part of text. For example, from text
The meeting will be held at 22 West Westin st., South Carolina, 12345
on Nov.-18
ideally I would like to get something like
(S
22/LOCATION
(LOCATION West/LOCATION Westin/LOCATION)
st./LOCATION
,/,
(South/LOCATION Carolina/LOCATION)
,/,
12345/LOCATION
.....
or simply
22 West Westin st., South Carolina, 12345
Instead, I am only able to get
(S
The/DT
meeting/NN
will/MD
be/VB
held/VBN
at/IN
22/CD
(LOCATION West/NNP Westin/NNP)
st./NNP
,/,
(GPE South/NNP Carolina/NNP)
,/,
12345/CD
on/IN
Nov.-18/-NONE-)
Note that if I enter my text into http://nlp.stanford.edu:8080/ner/process I get results far from perfect (street number and zip code are still missing) but at least "st." is a part of LOCATION and South Carolina is a LOCATION and not some "GPE / NNP" : ?
What I am doing wrong please? how can I fix it to use NLTK for extracting location piece from some text please?
Many thanks in advance!
解决方案
nltk DOES have an interface for Stanford NER, check nltk.tag.stanford.NERTagger.
from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
'/usr/share/stanford-ner/stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
output:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
However every time you call tag, nltk simply writes the target sentence into a file and runs Stanford NER command line tool to parse that file and finally parses the output back to python. Therefore the overhead of loading classifiers (around 1 min for me every time) is unavoidable.
If that's a problem, use Pyner.
First run Stanford NER as a server
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9191
then go to pyner folder
import ner
tagger = ner.SocketNER(host='localhost', port=9191)
tagger.get_entities("University of California is located in California, United States")
# {'LOCATION': ['California', 'United States'],
# 'ORGANIZATION': ['University of California']}
tagger.json_entities("Alice went to the Museum of Natural History.")
#'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Hope this helps.