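Preface: for a project I have been busy with lately, I wanted to try the Stanford Parser: parse sentences into syntactic parse trees, analyze the subtrees, and combine them with a tree kernel for training. I downloaded the Stanford Parser, but getting it to work was a pain. There is a pile of documentation, yet no quick, convenient overview of the whole thing.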
Stanford Parser home page:
Stanford Parser download:
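    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update
    $ sudo apt-get install oracle-java8-installer
    $ java -version

Only once Java 8 is in place can you go on to build and use the Stanford Parser on Ubuntu. According to the README, you run the lexparser.sh script with a file name as its argument, and that is all. testsent.txt contains 5 English sentences.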
From the README:

    On a Unix system you should be able to parse the English test file with the following command:

        ./ data/testsent.txt

    This uses the PCFG parser, which is quick to load and run, and quite accurate. [Notes: it takes a few seconds to load the parser data before parsing begins; continued parsing is quicker. To use the lexicalized parser, replace englishPCFG.ser.gz with englishFactored.ser.gz in the script and use the flag -mx600m to give more memory to java.]

In a terminal, from the folder that contains lexparser.sh, run

    ./ data/testsent.txt

which gives the following result (partial):

    Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
    Parsing file: data/testsent.txt
    Parsing [sent. 1 len. 21]: Scores of properties are under extreme fire threat as a huge blaze continues to advance through Sydney 's north-western suburbs .
    (ROOT (S (NP (NP (NNS Scores)) (PP (IN of) (NP (NNS properties)))) (VP (VBP are) (PP (IN under) (NP (JJ extreme) (NN fire) (NN threat))) (SBAR (IN as) (S (NP (DT a) (JJ huge) (NN blaze)) (VP (VBZ continues) (S (VP (TO to) (VP (VB advance) (PP (IN through) (NP (NP (NNP Sydney) (POS 's)) (JJ north-western) (NNS suburbs)))))))))) (. .)))

    nsubj(threat-8, Scores-1)
    case(properties-3, of-2)
    nmod:of(Scores-1, properties-3)
    cop(threat-8, are-4)
    case(threat-8, under-5)
    amod(threat-8, extreme-6)
    compound(threat-8, fire-7)
    root(ROOT-0, threat-8)
    mark(continues-13, as-9)
    det(blaze-12, a-10)
    amod(blaze-12, huge-11)
    nsubj(continues-13, blaze-12)
    nsubj(advance-15, blaze-12)
    advcl(threat-8, continues-13)
    mark(advance-15, to-14)
    xcomp(continues-13, advance-15)
    case(suburbs-20, through-16)
    nmod:poss(suburbs-20, Sydney-17)
    case(Sydney-17, 's-18)
    amod(suburbs-20, north-western-19)
    nmod:through(advance-15, suburbs-20)

As you can see, the Stanford Parser handles English well, and it gives two kinds of analysis: a phrase-structure tree and a list of typed dependencies. Other English data parses just as well. But did you think that was the end of the story? Too young, too simple.
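Since the goal stated in the preface is to analyze subtrees, here is a minimal sketch of reading the bracketed tree from the output above into a nested structure and walking its subtrees. It uses only the Python 3 standard library; the function names `parse_bracketed`, `subtrees` and `leaves` are my own, not part of the Stanford Parser or NLTK, and the sketch assumes every token is separated by whitespace or parentheses (literal brackets in the text appear as -LRB- / -RRB- in Penn-style output).

```python
def parse_bracketed(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    # Pad parentheses with spaces so a plain split() yields the tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def subtrees(node):
    """Yield every subtree, root first, in pre-order."""
    yield node
    for child in node[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def leaves(node):
    """Return the words covered by a subtree, left to right."""
    words = []
    for child in node[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


if __name__ == "__main__":
    # The subject noun phrase of the first test sentence.
    np = parse_bracketed("(NP (NP (NNS Scores)) (PP (IN of) (NP (NNS properties))))")
    print([t[0] for t in subtrees(np)])  # ['NP', 'NP', 'NNS', 'PP', 'IN', 'NP', 'NNS']
    print(leaves(np))                    # ['Scores', 'of', 'properties']
```

Plain tuples keep the sketch dependency-free; with NLTK installed, its own Tree class would be the more natural container.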
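A colleague saw me wrestling with the Stanford Parser, mentioned that NLTK already ships an interface to it, and demonstrated on the spot how to use it from NLTK. Good grief: the tool was right beside me and I did not know how to use it, nor that NLTK could do this at all. The results do come back only as a list, though: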
    In [8]: from nltk.parse import stanford

    In [9]: stanford.StanfordParser?
    Type:        type
    String form: <class 'nltk.parse.stanford.StanfordParser'>
    File:        /home/shifeng/anaconda/lib/python2.7/site-packages/nltk/parse/
    Init definition: stanford.StanfordParser(self, path_to_jar=None, path_to_models_jar=None, model_path=u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', encoding=u'UTF-8', verbose=False, java_options=u'-mx1000m')
    Docstring:
    Interface to the Stanford Parser

    >>> parser=StanfordParser(
    ...     model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"
    ... )
    >>> parser.raw_parse_sents((
    ...     "the quick brown fox jumps over the lazy dog",
    ...     "the quick grey wolf jumps over the lazy fox"
    ... ))
    [Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])]),
     Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['grey']), Tree('NN', ['wolf'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['fox'])])])])])])]

I did exactly the same thing, and it simply would not work. My colleague said it was because the jar files had not been downloaded and tried to fetch them through nltk.download, but that download did not succeed. I had been watching from the side, rather dazed, and said I had already downloaded them from the web. Following a blog post I found online, this is how to parse sentences with NLTK together with stanford-parser.jar:

    In [12]: import os

    In [13]: os.environ["STANFORD_PARSER"] = "stanford-parser.jar"

    In [14]: os.environ["STANFORD_MODELS"] = "stanford-parser-3.5.2-models.jar"

    In [15]: parser = stanford.StanfordParser(model_path=u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')

    In [16]: sentences = parser.raw_parse_sents(("the quick brown fox jumps over the lazy dog","the quick grey wolf jumps over the lazy fox"))

    In [17]: sentences
    Out[17]:
    [Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])]),
     Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['grey']), Tree('NN', ['wolf'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['fox'])])])])])])]

    In [18]: sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))

    In [19]: sentences
    Out[19]:
    [Tree('ROOT', [Tree('S', [Tree('INTJ', [Tree('UH', ['Hello'])]), Tree(',', [',']), Tree('NP', [Tree('PRP$', ['My']), Tree('NN', ['name'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Melroy'])])]), Tree('.', ['.'])])]),
     Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('PRP$', ['your']), Tree('NN', ['name'])])]), Tree('.', ['?'])])])]
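The session above sets STANFORD_PARSER and STANFORD_MODELS to bare file names, which only works when IPython is started in the folder holding the jars. As a rough sketch (Python 3, standard library only), the helper below looks the jars up in a given folder and stores absolute paths instead. The name `configure_stanford_env` is hypothetical, and the file name pattern `stanford-parser-*-models.jar` is an assumption generalized from the `stanford-parser-3.5.2-models.jar` used above.

```python
import glob
import os


def configure_stanford_env(parser_dir):
    """Set STANFORD_PARSER and STANFORD_MODELS to jars found in parser_dir.

    Returns (parser_jar, models_jar) as absolute paths and raises
    FileNotFoundError if either jar is missing.
    """
    parser_dir = os.path.abspath(parser_dir)
    parser_jar = os.path.join(parser_dir, "stanford-parser.jar")
    if not os.path.isfile(parser_jar):
        raise FileNotFoundError(parser_jar)

    # Assumed naming scheme, e.g. stanford-parser-3.5.2-models.jar.
    pattern = os.path.join(parser_dir, "stanford-parser-*-models.jar")
    candidates = sorted(glob.glob(pattern))
    if not candidates:
        raise FileNotFoundError(pattern)
    models_jar = candidates[-1]  # last in sorted order if several are present

    os.environ["STANFORD_PARSER"] = parser_jar
    os.environ["STANFORD_MODELS"] = models_jar
    return parser_jar, models_jar


# Hypothetical usage; the folder is a placeholder for wherever the
# parser was unpacked:
#
#   configure_stanford_env("/path/to/stanford-parser-full-2015-04-20")
#   from nltk.parse import stanford
#   parser = stanford.StanfordParser(
#       model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
```

Failing early with the missing path is meant to make the "it just would not work" situation easier to diagnose than an error raised later from inside NLTK.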
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import edu.stanford.nlp.ling.Word;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.trees.Tree;

    public class Parser {
        public static void main(String[] args) throws IOException {
            // Alternative model: the factored Chinese grammar.
            // String grammar = "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz";
            String grammar = "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz";
            String[] options = {};
            LexicalizedParser lp = LexicalizedParser.loadModel(grammar, options);

            // Parse one Chinese sentence that is already segmented into
            // space-separated words, then print it in Penn Treebank style.
            String line = "我 的 名字 叫 小明 ?";
            Tree parse = lp.parse(line);
            parse.pennPrint();

            // Call the parser's command-line entry point on a UTF-8 file and ask
            // for both the Penn tree and the collapsed typed dependencies.
            String[] arg2 = {"-encoding", "utf-8",
                             "-outputFormat", "penn,typedDependenciesCollapsed",
                             "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz",
                             "/home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt"};
            LexicalizedParser.main(arg2);
        }
    }

Output (the first tree is interleaved with the second model's loading message, presumably because the tree is printed to stdout while the log lines go to stderr):

    Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
    Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ... done [0.8 sec].
    (ROOT
    Loading parser from serialized file edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz ... (IP (NP (DNP (NP (PN 我)) (DEG 的)) (NP (NN 名字))) (VP (VV 叫) (NP (NN 小明))) (PU ?)))
    done [4.1 sec].
    Parsing file: /home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt
    Parsing [sent. 1 len. 8]: 俄国 希望 伊朗 没有 制造 核武器 计划 。
    (ROOT (IP (NP (NR 俄国)) (VP (VV 希望) (IP (NP (NR 伊朗)) (VP (ADVP (AD 没有)) (VP (VV 制造) (NP (NN 核武器) (NN 计划)))))) (PU 。)))

    nsubj(希望-2, 俄国-1)
    root(ROOT-0, 希望-2)
    nsubj(制造-5, 伊朗-3)
    neg(制造-5, 没有-4)
    ccomp(希望-2, 制造-5)
    nn(计划-7, 核武器-6)
    dobj(制造-5, 计划-7)

    Parsed file: /home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt [1 sentences].
    Parsed 8 words in 1 sentences (30.42 wds/sec; 3.80 sents/sec).
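The typed dependency lines in the output above, such as nsubj(希望-2, 俄国-1), follow a regular relation(governor-index, dependent-index) shape, so they are easy to load into a program for further analysis. The following is a minimal sketch in Python 3 using a regular expression. `Dependency` and `parse_dependency` are names of my own choosing, and the pattern is only inferred from the lines shown in this post; it does not try to cover other output variants.

```python
import re
from collections import namedtuple

Dependency = namedtuple(
    "Dependency", "relation governor gov_index dependent dep_index")

# relation(governor-N, dependent-M); words may themselves contain hyphens
# (e.g. north-western-19), so the index is taken from the last hyphen.
_DEP_RE = re.compile(r"^(\S+?)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line):
    """Turn one typed dependency line into a Dependency tuple."""
    match = _DEP_RE.match(line.strip())
    if match is None:
        raise ValueError("not a typed dependency line: %r" % line)
    relation, governor, gov_index, dependent, dep_index = match.groups()
    return Dependency(relation, governor, int(gov_index),
                      dependent, int(dep_index))


if __name__ == "__main__":
    lines = [
        "nsubj(希望-2, 俄国-1)",
        "root(ROOT-0, 希望-2)",
        "ccomp(希望-2, 制造-5)",
        "dobj(制造-5, 计划-7)",
    ]
    deps = [parse_dependency(line) for line in lines]
    # The word attached to the artificial ROOT node is the head of the sentence.
    head = next(d.dependent for d in deps if d.relation == "root")
    print(head)  # 希望
```

Keeping the token index next to each word matters because the same word can occur more than once in a sentence.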