Original: https://pypi.python.org/pypi/corenlp-python
I translated this to pass the time. The text follows.
corenlp-python 3.4.1-1
A Stanford Core NLP wrapper
# A Python wrapper for the Java Stanford Core NLP tools
---------------------------
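This is a fork of Dustin Smith's [stanford-corenlp-python](https://github.com/dasmith/stanford-corenlp-python), a Python interface to [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml). It can be used either as a Python package or run as a JSON-RPC server.
## Updates from the original wrapper
* Supports Stanford CoreNLP v3.x.x (compatible with recent versions)
* Fixed many bugs and improved performance
* Adjusted parameters to avoid timeouts under high load
* Uses jsonrpclib for stability and performance
* Batch parser for long text, with support for sentiment analysis
* Python 3 compatibility (thanks to Valentin Lorentz)
* [Packaging](https://pypi.python.org/pypi/corenlp-python)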
## Requirements
* [pexpect](http://www.noah.org/wiki/pexpect)
* [jsonrpclib](https://github.com/joshmarshall/jsonrpclib) (optional)
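## Download and Usage
To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the zip file containing Stanford's CoreNLP package. By default, `corenlp.py` looks for the Stanford CoreNLP folder as a subdirectory of the directory the script is run from.
In other words:
sudo pip install pexpect unidecode jsonrpclib  # jsonrpclib is optional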
git clone https://bitbucket.org/torotoki/corenlp-python.git
cd corenlp-python
# assuming version 3.4.1 of Stanford CoreNLP
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip
unzip stanford-corenlp-full-2014-08-27.zip
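Because `corenlp.py` expects the unpacked CoreNLP folder to sit below the working directory, it can help to confirm that the folder is really there before launching anything. The following is a minimal sketch of such a check. The helper names `find_corenlp_dir` and `has_corenlp_jars` are hypothetical and not part of corenlp-python, and the file-name patterns are assumptions based on the archive name used above.

```python
import glob
import os


def find_corenlp_dir(base_dir="."):
    """Return the latest unpacked CoreNLP directory under base_dir, or None.

    Hypothetical helper: it assumes the directory keeps the archive's name,
    for example stanford-corenlp-full-2014-08-27.
    """
    pattern = os.path.join(base_dir, "stanford-corenlp-full-*")
    # Keep directories only; the downloaded .zip file matches the same pattern.
    candidates = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    # The names end in a YYYY-MM-DD date, so the last entry sorts as the newest.
    return candidates[-1] if candidates else None


def has_corenlp_jars(corenlp_dir):
    """Return True if the directory holds at least one stanford-corenlp jar."""
    return bool(glob.glob(os.path.join(corenlp_dir, "stanford-corenlp-*.jar")))
```

If `find_corenlp_dir()` returns `None`, the archive has probably not been unpacked in the current directory yet.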
Then launch the server:
python corenlp/corenlp.py
Optionally, you can specify a host or port:
python corenlp/corenlp.py -H 0.0.0.0 -p 3456
That will run a public JSON-RPC server on port 3456.
You can also specify the Stanford CoreNLP directory:
python corenlp/corenlp.py -S stanford-corenlp-full-2014-08-27/
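The examples below assume the server is running on port 8080 and that the CoreNLP directory is `stanford-corenlp-full-2014-08-27/` in the current directory. The wrapper supports recent versions around 3.4.1 that share the same output format.
The code in `client.py` shows an example parse: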
import jsonrpclib
from simplejson import loads
server = jsonrpclib.Server("http://localhost:8080")
result = loads(server.parse("Hello world. It is so beautiful"))
print "Result", result
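For readers who want to see what travels over the wire, here is a rough sketch of the request and response handling using only the standard library. It assumes the server speaks JSON-RPC 2.0 (the envelope fields below follow that specification and are not taken from the corenlp-python source); the function names are hypothetical. The double decoding mirrors the `loads(server.parse(...))` call in `client.py`, where the result arrives as a JSON string.

```python
import json


def build_parse_request(text, request_id=1):
    """Build a JSON-RPC 2.0 request body for the server's `parse` method.

    Assumption: the envelope follows the JSON-RPC 2.0 specification.
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "parse",
        "params": [text],
        "id": request_id,
    })


def decode_parse_response(body):
    """Extract the parse dictionary from a JSON-RPC response body."""
    envelope = json.loads(body)
    if envelope.get("error"):
        raise RuntimeError("server returned an error: %r" % (envelope["error"],))
    # The parse itself is a JSON string inside the envelope, so decode it again.
    return json.loads(envelope["result"])
```

In practice `jsonrpclib` handles the envelope for you; the sketch is only meant to make the two layers of JSON visible.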
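The decoded result is a dictionary containing the keys `sentences` and (when applicable) `coref`. The key `sentences` holds a list with one dictionary per sentence; each dictionary contains `parsetree`, `text`, `tuples` with the dependencies, and `words`, a list with part-of-speech, NER, and similar information for every token: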
{u'sentences': [{u'parsetree': u'(ROOT (S (VP (NP (INTJ (UH Hello)) (NP (NN world)))) (. !)))',
u'text': u'Hello world!',
u'tuples': [[u'dep', u'world', u'Hello'],
[u'root', u'ROOT', u'world']],
u'words': [[u'Hello',
{u'CharacterOffsetBegin': u'0',
u'CharacterOffsetEnd': u'5',
u'Lemma': u'hello',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'UH'}],
[u'world',
{u'CharacterOffsetBegin': u'6',
u'CharacterOffsetEnd': u'11',
u'Lemma': u'world',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'NN'}],
[u'!',
{u'CharacterOffsetBegin': u'11',
u'CharacterOffsetEnd': u'12',
u'Lemma': u'!',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'.'}]]},
{u'parsetree': u'(ROOT (S (NP (PRP It)) (VP (VBZ is) (ADJP (RB so) (JJ beautiful))) (. .)))',
u'text': u'It is so beautiful.',
u'tuples': [[u'nsubj', u'beautiful', u'It'],
[u'cop', u'beautiful', u'is'],
[u'advmod', u'beautiful', u'so'],
[u'root', u'ROOT', u'beautiful']],
u'words': [[u'It',
{u'CharacterOffsetBegin': u'14',
u'CharacterOffsetEnd': u'16',
u'Lemma': u'it',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'PRP'}],
[u'is',
{u'CharacterOffsetBegin': u'17',
u'CharacterOffsetEnd': u'19',
u'Lemma': u'be',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'VBZ'}],
[u'so',
{u'CharacterOffsetBegin': u'20',
u'CharacterOffsetEnd': u'22',
u'Lemma': u'so',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'RB'}],
[u'beautiful',
{u'CharacterOffsetBegin': u'23',
u'CharacterOffsetEnd': u'32',
u'Lemma': u'beautiful',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'JJ'}],
[u'.',
{u'CharacterOffsetBegin': u'32',
u'CharacterOffsetEnd': u'33',
u'Lemma': u'.',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'.'}]]}],
u'coref': [[[[u'It', 1, 0, 0, 1], [u'Hello world', 0, 1, 0, 2]]]]}
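A short sketch of how this structure might be traversed is given below. The two generator functions are illustrative helpers, not part of corenlp-python, and `sample` is a trimmed copy of the first sentence from the output above. Note that the character offsets are strings in the output, so they would need `int()` before any arithmetic.

```python
def tokens(sentence):
    """Yield (word, part_of_speech, lemma) for each entry in `words`."""
    for word, attrs in sentence["words"]:
        yield word, attrs["PartOfSpeech"], attrs["Lemma"]


def dependencies(sentence):
    """Yield (relation, head, dependent) for each entry in `tuples`."""
    for relation, head, dependent in sentence["tuples"]:
        yield relation, head, dependent


# Trimmed copy of the first sentence shown in the output above.
sample = {
    "sentences": [{
        "text": "Hello world!",
        "tuples": [["dep", "world", "Hello"], ["root", "ROOT", "world"]],
        "words": [
            ["Hello", {"Lemma": "hello", "NamedEntityTag": "O", "PartOfSpeech": "UH"}],
            ["world", {"Lemma": "world", "NamedEntityTag": "O", "PartOfSpeech": "NN"}],
            ["!", {"Lemma": "!", "NamedEntityTag": "O", "PartOfSpeech": "."}],
        ],
    }]
}

for sentence in sample["sentences"]:
    print(list(tokens(sentence)))
    print(list(dependencies(sentence)))
```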
To use the wrapper without JSON-RPC, load the module directly:
from corenlp import StanfordCoreNLP
corenlp_dir = "stanford-corenlp-full-2014-08-27/"
corenlp = StanfordCoreNLP(corenlp_dir)  # wait a few minutes...
corenlp.raw_parse("Parse it")
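If you need to parse long texts (more than 30-50 sentences), you must use the `batch_parse` function. It reads text files from an input directory and returns a generator of dictionaries, one holding the parse result for each file: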
from corenlp import batch_parse
corenlp_dir = "stanford-corenlp-full-2014-08-27/"
raw_text_directory = "sample_raw_text/"
parsed = batch_parse(raw_text_directory, corenlp_dir)  # returns a generator object
print(list(parsed))  #=> [{'coref': ..., 'sentences': ..., 'file_name': 'new_sample.txt'}]
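Since `batch_parse` works on a directory of text files rather than on strings, long texts held in memory have to be written to disk first. The following is one possible way to prepare that directory; `write_batch_input` and the `doc_NNNN.txt` naming scheme are hypothetical choices, not something the package prescribes.

```python
import os


def write_batch_input(texts, directory):
    """Write each text to its own .txt file in `directory`; return the file names."""
    if not os.path.isdir(directory):
        os.makedirs(directory)
    names = []
    for index, text in enumerate(texts):
        # Hypothetical naming scheme; any unique file name would do.
        name = "doc_%04d.txt" % index
        with open(os.path.join(directory, name), "w") as handle:
            handle.write(text)
        names.append(name)
    return names


# With CoreNLP unpacked, the directory can then be handed to batch_parse:
#
#     from corenlp import batch_parse
#     for document in batch_parse("sample_raw_text/", "stanford-corenlp-full-2014-08-27/"):
#         print(document["file_name"], len(document["sentences"]))
```

Iterating over the generator, as in the commented usage, avoids holding every parsed file in memory at once.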
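`batch_parse` uses the XML output feature of Stanford CoreNLP, and the `raw_output` option gives you access to all of that information. If it is true, CoreNLP's XML is returned as a dictionary without converting the format: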
parsed = batch_parse(raw_text_directory, corenlp_dir, raw_output=True)
(Note: this function now requires xmltodict, which you can install with `sudo pip install xmltodict`.)
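As a rough illustration of working with the raw dictionary, the sketch below reads per-sentence sentiment labels. Everything about the layout is an assumption: it presumes that CoreNLP's XML nests as root > document > sentences > sentence, that each sentence carries a `sentiment` attribute, and that xmltodict exposes attributes with an `@` prefix. The `example` dictionary is hand-written for the demonstration and is not real CoreNLP output; check the keys against what `raw_output=True` actually returns on your installation.

```python
def as_list(value):
    """Normalize to a list: xmltodict gives a dict for one child, a list for several."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def sentence_sentiments(raw_document):
    """Return (sentence id, sentiment label) pairs from an xmltodict-style dict.

    ASSUMPTION: the nesting root > document > sentences > sentence and the
    '@id' / '@sentiment' keys are guesses about the XML layout.
    """
    sentences = raw_document["root"]["document"]["sentences"]["sentence"]
    return [(s.get("@id"), s.get("@sentiment")) for s in as_list(sentences)]


# Hand-written stand-in for illustration only, not real CoreNLP output.
example = {
    "root": {"document": {"sentences": {
        "sentence": {"@id": "1", "@sentiment": "Positive"},
    }}}
}
print(sentence_sentiments(example))
```

The `as_list` step matters because a file with a single sentence would otherwise yield a dictionary where the code expects a list.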
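### Notes
* The JSON-RPC server [halts on large text](https://bitbucket.org/torotoki/corenlp-python/issue/7/server-halts-on-large-text). This may be due to a limit on stdout; use the batch parser or [another wrapper](https://github.com/brendano/stanford_corenlp_pywrapper) instead.
* The JSON-RPC server doesn't support the sentiment analysis tools, because the original CoreNLP tools don't yet output sentiment results to stdout. (The batch parser's output does include sentiment results, retrieved from the original CoreNLP tools' XML output.)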
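## License
corenlp-python is licensed under the GNU General Public License (v2 or later). Note that this is the full GPL, which allows many free uses, but not its use in distributed proprietary software.
## Developer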
* Hiroyoshi Komatsu [[email protected]]
* Johannes Castner [[email protected]]
For download links and MD5 checksums, see the original page.
The original text is attached below:
corenlp-python 3.4.1-1
A Stanford Core NLP wrapper
# A Python wrapper for the Java Stanford Core NLP tools
---------------------------
This is a fork of Dustin Smith's [stanford-corenlp-python](https://github.com/dasmith/stanford-corenlp-python), a Python interface to [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml). It can be used either as a Python package or run as a JSON-RPC server.
## Updates from the original wrapper
* Supports Stanford CoreNLP v3.x.x (compatible with recent versions)
* Fixed many bugs & improved performance
* Adjusted parameters to avoid timeouts under high load
* Using jsonrpclib for stability and performance
* Batch parser for long text which supports sentiment analysis
* Python 3 compatibility (thanks to Valentin Lorentz)
* [Packaging](https://pypi.python.org/pypi/corenlp-python)
## Requirements
* [pexpect](http://www.noah.org/wiki/pexpect)
* [jsonrpclib](https://github.com/joshmarshall/jsonrpclib) (optional)
## Download and Usage
To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the zip file containing Stanford's CoreNLP package. By default, `corenlp.py` looks for the Stanford Core NLP folder as a subdirectory of where the script is being run.
In other words:
sudo pip install pexpect unidecode jsonrpclib # jsonrpclib is optional
git clone https://bitbucket.org/torotoki/corenlp-python.git
cd corenlp-python
# assuming the version 3.4.1 of Stanford CoreNLP
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip
unzip stanford-corenlp-full-2014-08-27.zip
Then, to launch a server:
python corenlp/corenlp.py
Optionally, you can specify a host or port:
python corenlp/corenlp.py -H 0.0.0.0 -p 3456
That will run a public JSON-RPC server on port 3456.
You can also specify the Stanford CoreNLP directory:
python corenlp/corenlp.py -S stanford-corenlp-full-2014-08-27/
The examples below assume the server is running on port 8080 and that the CoreNLP directory is `stanford-corenlp-full-2014-08-27/` in the current directory. The wrapper supports recent versions around 3.4.1 that share the same output format.
The code in `client.py` shows an example parse:
import jsonrpclib
from simplejson import loads
server = jsonrpclib.Server("http://localhost:8080")
result = loads(server.parse("Hello world. It is so beautiful"))
print "Result", result
That returns a dictionary containing the keys `sentences` and (when applicable) `coref`. The key `sentences` holds a list with one dictionary per sentence; each dictionary contains `parsetree`, `text`, `tuples` with the dependencies, and `words`, a list with part-of-speech, NER, and similar information for every token:
{u'sentences': [{u'parsetree': u'(ROOT (S (VP (NP (INTJ (UH Hello)) (NP (NN world)))) (. !)))',
u'text': u'Hello world!',
u'tuples': [[u'dep', u'world', u'Hello'],
[u'root', u'ROOT', u'world']],
u'words': [[u'Hello',
{u'CharacterOffsetBegin': u'0',
u'CharacterOffsetEnd': u'5',
u'Lemma': u'hello',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'UH'}],
[u'world',
{u'CharacterOffsetBegin': u'6',
u'CharacterOffsetEnd': u'11',
u'Lemma': u'world',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'NN'}],
[u'!',
{u'CharacterOffsetBegin': u'11',
u'CharacterOffsetEnd': u'12',
u'Lemma': u'!',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'.'}]]},
{u'parsetree': u'(ROOT (S (NP (PRP It)) (VP (VBZ is) (ADJP (RB so) (JJ beautiful))) (. .)))',
u'text': u'It is so beautiful.',
u'tuples': [[u'nsubj', u'beautiful', u'It'],
[u'cop', u'beautiful', u'is'],
[u'advmod', u'beautiful', u'so'],
[u'root', u'ROOT', u'beautiful']],
u'words': [[u'It',
{u'CharacterOffsetBegin': u'14',
u'CharacterOffsetEnd': u'16',
u'Lemma': u'it',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'PRP'}],
[u'is',
{u'CharacterOffsetBegin': u'17',
u'CharacterOffsetEnd': u'19',
u'Lemma': u'be',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'VBZ'}],
[u'so',
{u'CharacterOffsetBegin': u'20',
u'CharacterOffsetEnd': u'22',
u'Lemma': u'so',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'RB'}],
[u'beautiful',
{u'CharacterOffsetBegin': u'23',
u'CharacterOffsetEnd': u'32',
u'Lemma': u'beautiful',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'JJ'}],
[u'.',
{u'CharacterOffsetBegin': u'32',
u'CharacterOffsetEnd': u'33',
u'Lemma': u'.',
u'NamedEntityTag': u'O',
u'PartOfSpeech': u'.'}]]}],
u'coref': [[[[u'It', 1, 0, 0, 1], [u'Hello world', 0, 1, 0, 2]]]]}
To use the wrapper without JSON-RPC, load the module directly:
from corenlp import StanfordCoreNLP
corenlp_dir = "stanford-corenlp-full-2014-08-27/"
corenlp = StanfordCoreNLP(corenlp_dir) # wait a few minutes...
corenlp.raw_parse("Parse it")
If you need to parse long texts (more than 30-50 sentences), you must use the `batch_parse` function. It reads text files from an input directory and returns a generator of dictionaries, one holding the parse result for each file:
from corenlp import batch_parse
corenlp_dir = "stanford-corenlp-full-2014-08-27/"
raw_text_directory = "sample_raw_text/"
parsed = batch_parse(raw_text_directory, corenlp_dir) # It returns a generator object
print(list(parsed))  #=> [{'coref': ..., 'sentences': ..., 'file_name': 'new_sample.txt'}]
The function uses the XML output feature of Stanford CoreNLP, and the `raw_output` option gives you access to all of that information. If it is true, CoreNLP's XML is returned as a dictionary without converting the format:
parsed = batch_parse(raw_text_directory, corenlp_dir, raw_output=True)
(Note: this function now requires xmltodict, which you can install with `sudo pip install xmltodict`.)
### Note
* The JSON-RPC server [halts on large text](https://bitbucket.org/torotoki/corenlp-python/issue/7/server-halts-on-large-text). This may be due to a limit on stdout; use the batch parser or [another wrapper](https://github.com/brendano/stanford_corenlp_pywrapper) instead.
* The JSON-RPC server doesn't support the sentiment analysis tools, because the original CoreNLP tools don't yet output sentiment results to stdout. (The batch parser's output does include sentiment results, retrieved from the original CoreNLP tools' XML output.)
## License
corenlp-python is licensed under the GNU General Public License (v2 or later). Note that this is the /full/ GPL, which allows many free uses, but not its use in distributed proprietary software.
## Developer
* Hiroyoshi Komatsu [[email protected]]
* Johannes Castner [[email protected]]