Pitfalls I Hit with StanfordCoreNLP (Python, Java)

Python -- stanfordcorenlp

Stanford CoreNLP is a toolkit for NLP. It is written in Java, but an interface for Python is now available as well. A while ago I tried to use it from Python.
First, bring in the stanfordcorenlp package
by importing it in your Python file:

from stanfordcorenlp import StanfordCoreNLP

stanfordcorenlp exposes only one class, StanfordCoreNLP.
Getting a StanfordCoreNLP object:
The StanfordCoreNLP constructor takes a path argument pointing to the folder that contains the CoreNLP jar files. That folder can be downloaded from: https://stanfordnlp.github.io/CoreNLP/download.html
I used stanford-corenlp-full-2016-10-31.

nlp = StanfordCoreNLP(path)  # path is the path to the stanford-corenlp-full-2016-10-31 folder

Usage
Using it is very straightforward:

from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(path)
sentence = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor ."
print(nlp.dependency_parse(sentence))
nlp.close()

But running it directly fails:

$ python WordFormation.py
Traceback (most recent call last):
  File "WordFormation.py", line 1, in 
    from stanfordcorenlp import StanfordCoreNLP
ModuleNotFoundError: No module named 'stanfordcorenlp'

or:

PermissionError: [Errno 1] Operation not permitted

So I ran it with root privileges, and the dependency parse came out fine:

$ sudo python WordFormation.py
[('ROOT', 0, 3), ('nsubj', 3, 1), ('aux', 3, 2), ('det', 5, 4), ('dobj', 3, 5), ('case', 9, 6), ('advmod', 8, 7), ('nummod', 9, 8), ('nmod', 5, 9), ('advmod', 3, 10), ('cc', 3, 11), ('nsubj', 14, 12), ('advmod', 14, 13), ('conj', 3, 14), ('advmod', 14, 15), ('case', 18, 16), ('det', 18, 17), ('nmod', 14, 18), ('case', 23, 19), ('det', 23, 20), ('amod', 23, 21), ('compound', 23, 22), ('nmod', 18, 23), ('case', 26, 24), ('det', 26, 25), ('nmod', 23, 26), ('punct', 3, 27)]

The numbers are word positions, counted from 1: for example, ('nsubj', 3, 1) means word 3 ("had") governs word 1 ("i"). The 0 in ('ROOT', 0, 3) does not refer to any word in the sentence; it stands for the artificial root node.

The StanfordCoreNLP class offers other features as well, such as part-of-speech tagging, and they can all be used in the same way.

However, I could not find a method on the StanfordCoreNLP class that gives more detailed dependency labels: for a nominal modifier, for example, I could only get the plain nmod, not the form that includes the preposition, such as nmod:for.

Since I could not find a suitable way to do this, I decided to try Java instead.

Java -- stanford-corenlp

In Java the code is a bit more verbose.

First, pull in the required jars.
Since my project is a Maven project,
I added the following to pom.xml:

    
<properties>
    <corenlp.version>3.9.2</corenlp.version>
</properties>

<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
    </dependency>

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models</classifier>
    </dependency>
</dependencies>

Now run the following:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class StanfordEnglishNlpExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // configure the pipeline
        props.put("annotators", "tokenize,ssplit,pos,parse,depparse");
        props.put("tokenize.options", "ptb3Escaping=false");
        props.put("parse.maxlen", "10000");
        props.put("depparse.extradependencies", "SUBJ_ONLY");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // build the StanfordCoreNLP pipeline

        String str = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .";
        Annotation document = new Annotation(str);
        pipeline.annotate(document);
        CoreMap sentence = document.get(CoreAnnotations.SentencesAnnotation.class).get(0);
        SemanticGraph dependency_graph = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class); // the dependency graph
        System.out.println("\n\nDependency Graph: " + dependency_graph.toString(SemanticGraph.OutputFormat.LIST)); // print the relations directly
    }
}

The output:

Dependency Graph: root(ROOT-0, had-3)
nsubj(had-3, i-1)
aux(had-3, 've-2)
det(player-5, the-4)
dobj(had-3, player-5)
case(years-9, for-6)
advmod(2-8, about-7)
nummod(years-9, 2-8)
nmod:for(player-5, years-9)
advmod(had-3, now-10)
cc(had-3, and-11)
nsubj(performs-14, it-12)
advmod(performs-14, still-13)
conj:and(had-3, performs-14)
advmod(performs-14, nicely-15)
case(exception-18, with-16)
det(exception-18, the-17)
nmod:with(performs-14, exception-18)
case(sound-23, of-19)
det(sound-23, an-20)
amod(sound-23, occasional-21)
compound(sound-23, wwhhhrrr-22)
nmod:of(exception-18, sound-23)
case(motor-26, from-24)
det(motor-26, the-25)
nmod:from(sound-23, motor-26)
punct(had-3, .-27)

Here the output does contain relations like nmod:with and nmod:of.

There is still a problem, though.
For the SemanticGraph object dependency_graph in the code above, if you ask it for its list of dependency edges:

List<SemanticGraphEdge> list = dependency_graph.edgeListSorted();

you will find that the root relation is not in this list.
If you want the root relation, you have to fetch the roots from dependency_graph separately, and then it no longer sits in a nice order alongside the other relations.
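If printing the root ahead of the sorted edges is acceptable (that is how the LIST output above is arranged anyway), one workaround is to emit the roots from getRoots() and then walk edgeListSorted(). A minimal sketch, reusing dependency_graph from the pipeline example; the class and method names here are my own, not part of CoreNLP:

import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;

public class DependencyListWithRoot {
    // Print the root relation(s) first, then every edge in sorted order,
    // mirroring the LIST format shown earlier.
    public static void printWithRoot(SemanticGraph graph) {
        for (IndexedWord root : graph.getRoots()) {
            System.out.println("root(ROOT-0, " + root.word() + "-" + root.index() + ")");
        }
        for (SemanticGraphEdge edge : graph.edgeListSorted()) {
            System.out.println(edge.getRelation() + "("
                    + edge.getGovernor().word() + "-" + edge.getGovernor().index() + ", "
                    + edge.getDependent().word() + "-" + edge.getDependent().index() + ")");
        }
    }
}

Calling printWithRoot(dependency_graph) right after the pipeline example prints the root line followed by all the edges, though the ordering of edgeListSorted() may differ slightly from the LIST output.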

So I obtained it in a different way.
To make the job a bit more complete, I also do the part-of-speech tagging myself here.
To use the tagger you first need the english-left3words-distsim.tagger file. It ships with stanford-corenlp-full-2016-10-31 and can be used directly, but if the tagger file and the jars you depend on come from different versions, you are likely to run into errors.
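If you do have a tagger file whose version matches your jars, loading it from disk is a one-liner; a sketch, with the path below as a placeholder:

// requires: import edu.stanford.nlp.tagger.maxent.MaxentTagger;
MaxentTagger tagger = new MaxentTagger("/path/to/english-left3words-distsim.tagger"); // placeholder path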

In fact, the tagger file is already inside the stanford-corenlp models jar we pulled in as a dependency; it just takes a little work to read it out of the jar.

URL url = new URL("jar:file:" + path +
        "!/edu/stanford/nlp/models/pos-tagger/english-left3words/" +
        "english-left3words-distsim.tagger");
// path is the filesystem path of the models jar; the part after "!" is the tagger file's location inside the jar
JarURLConnection jarURLConnection = (JarURLConnection) url.openConnection();

The MaxentTagger constructor accepts either a file path or an InputStream, so we can feed it the stream from the jar connection:

MaxentTagger tagger = new MaxentTagger(jarURLConnection.getInputStream());

With that, the tagger object is created successfully. The full code (path is again the path to the models jar):

    public static void main(String[] args) throws java.net.MalformedURLException, IOException {

        // path is the filesystem path of the stanford-corenlp models jar;
        // the part after "!" is the tagger file's location inside that jar
        URL url = new URL("jar:file:" + path +
                "!/edu/stanford/nlp/models/pos-tagger/english-left3words/" +
                "english-left3words-distsim.tagger");
        JarURLConnection jarURLConnection = (JarURLConnection) url.openConnection();

        MaxentTagger tagger = new MaxentTagger(jarURLConnection.getInputStream()); // POS tagger read from the jar
        DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL); // dependency parser

        String review = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .";
        String result = "[";
        DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(review)); // splits the text into sentences
        for (List<HasWord> sentence : tokenizer) {
            List<TaggedWord> tagged = tagger.tagSentence(sentence); // tag each word in the sentence
            GrammaticalStructure gs = parser.predict(tagged);
            Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed(); // dependencies with the preposition folded into the relation name
            for (TypedDependency td : tdl) {
                result = result.concat(td.reln() + "(" + td.gov() + ", " + td.dep() + "),");
            }
        }
        System.out.println(result.substring(0, result.length() - 1) + "]");
    }

The result:

[nsubj(had/VBD, i/FW),aux(had/VBD, 've/VBP),root(ROOT, had/VBD),det(player/NN, the/DT),dobj(had/VBD, player/NN),case(years/NNS, for/IN),advmod(2/CD, about/IN),nummod(years/NNS, 2/CD),nmod:for(player/NN, years/NNS),advmod(had/VBD, now/RB),cc(had/VBD, and/CC),nsubj(performs/VBZ, it/PRP),advmod(performs/VBZ, still/RB),conj:and(had/VBD, performs/VBZ),advmod(performs/VBZ, nicely/RB),case(exception/NN, with/IN),det(exception/NN, the/DT),nmod:with(performs/VBZ, exception/NN),case(sound/NN, of/IN),det(sound/NN, an/DT),amod(sound/NN, occasional/JJ),compound(sound/NN, wwhhhrrr/NN),nmod:of(exception/NN, sound/NN),case(motor/NN, from/IN),det(motor/NN, the/DT),nmod:from(sound/NN, motor/NN),punct(had/VBD, ./.)]
