I've recently been looking into lemmatizing English words, and I'd heard that CoreNLP is a good tool for this, so I gave it a try.
CoreNLP is a natural language processing toolkit developed at Stanford. It is simple to use and quite powerful, offering named entity recognition, part-of-speech tagging, lemmatization, parse-tree construction, coreference resolution, and more.
CoreNLP is written in Java and needs JDK 1.8 to run; 1.7 does not seem to be supported. This is worth keeping in mind.
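If you'd rather fail fast than hit a cryptic error on an older JVM, a minimal startup guard is sketched below. The class name JavaVersionCheck is purely illustrative, and the lexicographic string comparison is a simplification that is good enough to tell 1.7 from 1.8:

public class JavaVersionCheck {
    public static void main(String[] args) {
        // e.g. "1.7" or "1.8"
        String version = System.getProperty("java.specification.version");
        if (version.compareTo("1.8") < 0) {
            throw new IllegalStateException(
                    "CoreNLP 3.6.0 requires Java 1.8, but this JVM is " + version);
        }
        System.out.println("Java " + version + " is fine.");
    }
}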
The official CoreNLP documentation is sparse, but the bundled example files are enough to figure out how to use it. After poking around for a bit, I found it quite comfortable to work with.
Environment:
Windows 7, 64-bit
JDK 1.8
Jar packages to add to the project:
Note: I only tested English here, which is why the stanford-corenlp-3.6.0-models.jar file is used. For Chinese you need to download the corresponding Chinese models jar from the official site and add it to the project.
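For reference, here is a minimal sketch of how a Chinese pipeline can be constructed once that jar is on the classpath. It assumes the Chinese models jar ships a properties file named StanfordCoreNLP-chinese.properties; the file name, the ChineseDemo class, and the sample text are assumptions for illustration, not something from this article's code:

package com.luchi.corenlp;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ChineseDemo {
    public static void main(String[] args) {
        // Load the pipeline from a properties file name instead of a
        // Properties object; the Chinese models jar bundles
        // StanfordCoreNLP-chinese.properties with Chinese defaults.
        StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        Annotation document = new Annotation("我爱北京。"); // sample Chinese text
        pipeline.annotate(document);
        System.out.println("Annotated: " + document.toString());
    }
}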
For the English case, it's easiest to just look at the code:
package com.luchi.corenlp;

import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.hcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.hcoref.data.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class TestNLP {

    public void test() {
        // Build a StanfordCoreNLP pipeline, configuring which annotators to run:
        // lemma is lemmatization, ner is named entity recognition, etc.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // The text to process
        String text = "judy has been to china . she likes people there . and she went to Beijing "; // Add your text here!

        // Create an empty Annotation object for the text
        Annotation document = new Annotation(text);

        // Run all annotators on the text
        pipeline.annotate(document);

        // Retrieve the processing results
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Traverse the words in the current sentence;
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // The token itself (i.e. the word after tokenization)
                String word = token.get(TextAnnotation.class);
                System.out.println(word);
                // Part-of-speech tag
                String pos = token.get(PartOfSpeechAnnotation.class);
                System.out.println(pos);
                // Named entity label
                String ne = token.get(NamedEntityTagAnnotation.class);
                System.out.println(ne);
                // Lemma
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println(lemma);
            }

            // The parse tree of the sentence
            Tree tree = sentence.get(TreeAnnotation.class);
            tree.pennPrint();

            // The dependency graph of the sentence
            SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
        }

        // Coreference chains:
        // each chain stores the set of mentions that refer to the same entity;
        // both sentence and token offsets start at 1
        Map<Integer, CorefChain> corefChains = document.get(CorefChainAnnotation.class);
        if (corefChains == null) {
            return;
        }
        for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
            System.out.println("Chain " + entry.getKey() + " ");
            for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                // We need to subtract one since the indices count from 1 but the Lists start from 0
                List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                // We subtract two for end: one for 0-based indexing, and one because we want
                // the last token of the mention, not the one following it
                System.out.println("  " + m + ", i.e., 0-based character offsets ["
                        + tokens.get(m.startIndex - 1).beginPosition() + ", "
                        + tokens.get(m.endIndex - 2).endPosition() + ")");
            }
        }
    }

    public static void main(String[] args) {
        TestNLP nlp = new TestNLP();
        nlp.test();
    }
}
The comments explain each step, so we can look directly at the output to see what the code does:
Per-token results:
From the original sentence:
"judy" is tagged NN (a noun) for part of speech, gets the named entity label O (not an entity), and its lemma is recognized as "Judy".
Notice that "has" has been correctly lemmatized to "have".
And "Beijing" gets the named entity label "LOCATION", which means the place name was recognized.
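Printing the four annotations on separate lines, as the loop above does, makes the console output hard to scan. A small variant of the same token loop (same annotation classes, only the formatting changes) puts word, POS tag, NER label, and lemma side by side:

// Drop-in replacement for the inner token loop in TestNLP:
for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
    String word  = token.get(TextAnnotation.class);           // surface form
    String pos   = token.get(PartOfSpeechAnnotation.class);   // e.g. NN
    String ne    = token.get(NamedEntityTagAnnotation.class); // e.g. LOCATION or O
    String lemma = token.get(LemmaAnnotation.class);          // e.g. have for has
    System.out.printf("%-10s %-6s %-10s %s%n", word, pos, ne, lemma);
}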
Next, let's look at the parse tree (for the first sentence):
Finally, the coreference results:
Each chain contains the mentions that refer to the same entity. In chain 1, for example, the two occurrences of "she" sit in different sentences, yet both refer to "judy" in the first sentence; this matches the meaning of the original text, so the resolution is correct. The offsets give each mention's position within its sentence.
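If all you need from a chain is the entity the pronouns point back to, CorefChain also exposes a representative mention. Here is a short sketch built on the corefChains map from the code above; for chain 1 this should print the "judy" mention:

for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
    // getMentionsInTextualOrder() lists every mention; the representative
    // mention is the one CoreNLP picks as the most descriptive of the chain.
    CorefChain.CorefMention rep = entry.getValue().getRepresentativeMention();
    System.out.println("Chain " + entry.getKey() + " -> " + rep.mentionSpan);
}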
Of course, I only need CoreNLP's lemmatization, so the code above just needs a little trimming to do lemmatization alone. The test code:
package com.luchi.corenlp;

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class Lemma {

    // Lemmatize the input string and return the space-joined lemmas
    public String stemmed(String inputStr) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation(inputStr);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        String outputStr = "";
        for (CoreMap sentence : sentences) {
            // Traverse the words in the current sentence;
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                String lemma = token.get(LemmaAnnotation.class);
                outputStr += lemma + " ";
            }
        }
        return outputStr;
    }

    public static void main(String[] args) {
        Lemma lemma = new Lemma();
        String input = "jack had been to china three months ago. he likes china very much, and he is falling in love with this country";
        String output = lemma.stemmed(input);
        System.out.print("Original  : ");
        System.out.println(input);
        System.out.print("Lemmatized: ");
        System.out.println(output);
    }
}
The output is:
The results are quite accurate.
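One caveat with the Lemma class above: it builds a new StanfordCoreNLP, and therefore reloads every model, on each call, and it requests annotators (ner, parse, dcoref) that lemmatization does not need; tokenize, ssplit, pos, and lemma are sufficient. Below is a sketch of a leaner variant that constructs the pipeline once and reuses it across calls; CachedLemma is just an illustrative name:

package com.luchi.corenlp;

import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class CachedLemma {

    // Model loading is the expensive part, so build the pipeline once
    // and share it across calls.
    private static final StanfordCoreNLP PIPELINE;
    static {
        Properties props = new Properties();
        // Only the annotators lemmatization actually depends on
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        PIPELINE = new StanfordCoreNLP(props);
    }

    public String stemmed(String inputStr) {
        Annotation document = new Annotation(inputStr);
        PIPELINE.annotate(document);
        StringBuilder out = new StringBuilder();
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                out.append(token.get(LemmaAnnotation.class)).append(' ');
            }
        }
        return out.toString().trim();
    }
}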