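For a recent project I needed to parse the sentences in a given text: use the POS tags and constituent information to find the dependencies between words and phrases, then build a parse tree for each sentence from those dependencies. This calls for the Stanford Parser and the shallow parser (chunker) in OpenNLP. Both are implemented in Java, are called through an API, and output a syntactic analysis of the input sentence. This post summarizes what each parser does and how to call it from a Java program.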
1 Shallow Parser
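The shallow parser identifies the phrases (chunks) in a sentence, including noun phrases (NP), verb phrases (VP), adjective phrases (ADJP), adverb phrases (ADVP), and so on. A sample program follows.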
    package edu.pku.yangliu.nlp.pdt;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.StringReader;
    import java.util.HashMap;

    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;
    import opennlp.tools.cmdline.PerformanceMonitor;
    import opennlp.tools.cmdline.postag.POSModelLoader;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.InvalidFormatException;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    /**
     * A shallow parser based on OpenNLP.
     * @author yangliu
     * @blog http://blog.csdn.net/yangliuy
     */
    public class ShallowParser {

        private static ShallowParser instance = null;
        private static POSModel model;
        private static ChunkerModel cModel;

        // Singleton pattern: the models are loaded once, on first use
        public static ShallowParser getInstance() throws InvalidFormatException, IOException {
            if (ShallowParser.instance == null) {
                POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
                InputStream is = new FileInputStream("en-chunker.bin");
                try {
                    ChunkerModel cModel = new ChunkerModel(is);
                    ShallowParser.instance = new ShallowParser(model, cModel);
                } finally {
                    is.close(); // release the model file once loading has finished or failed
                }
            }
            return ShallowParser.instance;
        }

        public ShallowParser(POSModel model, ChunkerModel cModel) {
            ShallowParser.model = model;
            ShallowParser.cModel = cModel;
        }

        /**
         * Chunks a sentence and returns a map from word index (1-based) to phrase label.
         * Note: punctuation such as "," "(" ")" must be separated from the neighbouring
         * words by spaces, because the input is tokenized on whitespace.
         * @param input The input sentence
         * @return HashMap<Integer,String> mapping word index to phrase label
         */
        public HashMap<Integer,String> chunk(String input) throws IOException {
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);
            ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(input));
            perfMon.start();
            String line;
            String[] whitespaceTokenizerLine = null;
            String[] tags = null;
            // POS-tag every line; only the last line read is chunked below,
            // so pass one sentence per call
            while ((line = lineStream.read()) != null) {
                whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
                tags = tagger.tag(whitespaceTokenizerLine);
                POSSample posTags = new POSSample(whitespaceTokenizerLine, tags);
                System.out.println(posTags.toString());
                perfMon.incrementCounter();
            }
            perfMon.stopAndPrintFinalResult();

            // chunker
            ChunkerME chunkerME = new ChunkerME(cModel);
            String[] result = chunkerME.chunk(whitespaceTokenizerLine, tags);
            HashMap<Integer,String> phraseLablesMap = new HashMap<Integer, String>();
            Integer wordCount = 1;
            Integer phLableCount = 0;
            for (String phLable : result) {
                // "O" (outside any chunk, e.g. the final period) has no "-TYPE" suffix;
                // append one so that the split below works
                if (phLable.equals("O")) phLable += "-Punctuation";
                // A "B-" prefix marks the first word of a new phrase
                if (phLable.split("-")[0].equals("B")) phLableCount++;
                // Phrase type plus running phrase number, e.g. "NP4"
                phLable = phLable.split("-")[1] + phLableCount;
                //if(phLable.equals("ADJP")) phLable = "NP"; //Notice: ADJP included in NP
                //if(phLable.equals("ADVP")) phLable = "VP"; //Notice: ADVP included in VP
                System.out.println(wordCount + ":" + phLable);
                phraseLablesMap.put(wordCount, phLable);
                wordCount++;
            }
            //Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
            //for (Span phLable : span)
            //    System.out.println(phLable.toString());
            return phraseLablesMap;
        }

        /**
         * Just for testing.
         */
        public static void main(String[] args) throws IOException {
            // Note: punctuation must be separated from words by spaces
            String input = "We really enjoyed using the Canon PowerShot SD500 .";
            //String input = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
            ShallowParser swParser = ShallowParser.getInstance();
            swParser.chunk(input);
        }
    }

Make sure the paths to the POS model and the chunker model are configured correctly. Both model files can be downloaded from the OpenNLP website.
Output
    Loading POS Tagger model ... done (1.563s)
    Average: 9.3 sent/s
    Total: 1 sent
    Runtime: 0.107s
    We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
    1:NP1
    2:ADVP2
    3:VP3
    4:VP3
    5:NP4
    6:NP4
    7:NP4
    8:NP4
    9:Punctuation4
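The numbered labels in this output can be regrouped into whole phrases. The following is a minimal sketch and not part of the original program: it assumes a map shaped like the one returned by chunk() (1-based word index to a label such as NP4) together with the whitespace-split tokens of the same sentence. The class name PhraseGrouper and the method groupPhrases are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PhraseGrouper {

    /**
     * Groups tokens by phrase label, keeping sentence order.
     * Hypothetical helper; word indices are 1-based, as in chunk().
     */
    static Map<String, List<String>> groupPhrases(String[] tokens, Map<Integer, String> labels) {
        Map<String, List<String>> phrases = new LinkedHashMap<>();
        // A TreeMap copy guarantees that words are visited in index order
        for (Map.Entry<Integer, String> e : new TreeMap<>(labels).entrySet()) {
            String word = tokens[e.getKey() - 1];
            phrases.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(word);
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] tokens = "We really enjoyed using the Canon PowerShot SD500 .".split(" ");
        // Labels copied from the output listed above
        String[] labelSeq = {"NP1", "ADVP2", "VP3", "VP3", "NP4", "NP4", "NP4", "NP4", "Punctuation4"};
        Map<Integer, String> labels = new TreeMap<>();
        for (int i = 0; i < labelSeq.length; i++) {
            labels.put(i + 1, labelSeq[i]);
        }
        groupPhrases(tokens, labels).forEach((label, words) ->
                System.out.println(label + ": " + String.join(" ", words)));
    }
}
```

For the sample sentence this groups "enjoyed using" under VP3 and "the Canon PowerShot SD500" under NP4. A LinkedHashMap is used so that phrases come out in the order they first appear in the sentence.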
2 Stanford Parser
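The Stanford Parser finds the dependency relations between the words of a sentence and outputs them in the Stanford Dependencies format, either as a directed graph or as a tree. Sample code follows.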
    package edu.pku.yangliu.nlp.pdt;

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.List;

    import opennlp.tools.util.InvalidFormatException;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.objectbank.TokenizerFactory;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.DocumentPreprocessor;
    import edu.stanford.nlp.process.PTBTokenizer;
    import edu.stanford.nlp.trees.GrammaticalStructure;
    import edu.stanford.nlp.trees.GrammaticalStructureFactory;
    import edu.stanford.nlp.trees.PennTreebankLanguagePack;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.trees.TreebankLanguagePack;
    import edu.stanford.nlp.trees.TypedDependency;

    /**
     * Parses sentences with the Stanford Parser.
     * @author yangliu
     * @blog http://blog.csdn.net/yangliuy
     */
    public class StanfordParser {

        private static StanfordParser instance = null;
        private static LexicalizedParser lp;

        // Singleton pattern: the grammar is loaded once, on first use
        public static StanfordParser getInstance() {
            if (StanfordParser.instance == null) {
                LexicalizedParser lp = LexicalizedParser.loadModel(
                        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                        "-retainTmpSubcategories");
                StanfordParser.instance = new StanfordParser(lp);
            }
            return StanfordParser.instance;
        }

        public StanfordParser(LexicalizedParser lp) {
            StanfordParser.lp = lp;
        }

        /**
         * Parses every sentence in a file and prints its tree and dependencies.
         * @param SentFilename The input file
         */
        public void DPFromFile(String SentFilename) {
            TreebankLanguagePack tlp = new PennTreebankLanguagePack();
            GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
            for (List<HasWord> sentence : new DocumentPreprocessor(SentFilename)) {
                Tree parse = lp.apply(sentence);
                parse.pennPrint();
                System.out.println();
                GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
                List<TypedDependency> tdl =
                        (List<TypedDependency>) gs.typedDependenciesCollapsedTree();
                System.out.println(tdl);
                System.out.println();
            }
        }

        /**
         * Parses a single sentence given as a String.
         * @param sent The input sentence
         * @return List<TypedDependency> The list of typed dependencies
         */
        public List<TypedDependency> DPFromString(String sent) {
            TokenizerFactory<CoreLabel> tokenizerFactory =
                    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
            List<CoreLabel> rawWords =
                    tokenizerFactory.getTokenizer(new StringReader(sent)).tokenize();
            Tree parse = lp.apply(rawWords);
            TreebankLanguagePack tlp = new PennTreebankLanguagePack();
            GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
            GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
            // Use the collapsed-tree variant so that dependencies which do not
            // preserve the tree structure are omitted
            return (List<TypedDependency>) gs.typedDependenciesCollapsedTree();
        }

        /**
         * Just for testing.
         * @param args
         * @throws IOException
         * @throws InvalidFormatException
         */
        public static void main(String[] args) throws InvalidFormatException, IOException {
            // Note: punctuation must be separated from words by spaces
            String sent = "We really enjoyed using the Canon PowerShot SD500 .";
            //String sent = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
            //String sent = "It has an exterior design that combines form and function more elegantly than any point-and-shoot we've ever tested . ";
            //String sent = "A Digic II-powered image-processing system enables the SD500 to snap a limitless stream of 7-megapixel photos at a respectable clip , its start-up time is tops in its class , and it delivers decent photos when compared to its competition . ";
            //String sent = "I've had it for about a month and it is simply the best point-and-shoot your money can buy . ";
            StanfordParser sdPaser = StanfordParser.getInstance();
            List<TypedDependency> tdl = sdPaser.DPFromString(sent);
            for (TypedDependency oneTdl : tdl) {
                System.out.println(oneTdl);
            }
            ShallowParser swParser = ShallowParser.getInstance();
            HashMap<Integer,String> phraseLablesMap = swParser.chunk(sent);
            // WDTree and WDTreeNode belong to the same project; their source is not listed in this post
            WDTree wdtree = new WDTree();
            WDTreeNode root = wdtree.bulidWDTreeFromList(tdl, phraseLablesMap);
            wdtree.printWDTree(root);
        }
    }
Output

    Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.1 sec].
    nsubj(enjoyed-3, We-1)
    advmod(enjoyed-3, really-2)
    root(ROOT-0, enjoyed-3)
    xcomp(enjoyed-3, using-4)
    det(SD500-8, the-5)
    nn(SD500-8, Canon-6)
    nn(SD500-8, PowerShot-7)
    dobj(using-4, SD500-8)
    Loading POS Tagger model ... done (1.492s)
    We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
    Average: 200.0 sent/s
    Total: 1 sent
    Runtime: 0.0050s
    1:NP1
    2:ADVP2
    3:VP3
    4:VP3
    5:NP4
    6:NP4
    7:NP4
    8:NP4
    9:Punctuation4
    children of ROOT-0_ (phLable:null):
    enjoyed-3_ rel:root phLable:VP3
    children of enjoyed-3_ (phLable:VP3):
    We-1_ rel:nsubj phLable:NP1
    really-2_ rel:advmod phLable:ADVP2
    using-4_ rel:xcomp phLable:VP3
    children of using-4_ (phLable:VP3):
    SD500-8_ rel:dobj phLable:NP4
    children of SD500-8_ (phLable:NP4):
    the-5_ rel:det phLable:NP4
    Canon-6_ rel:nn phLable:NP4
    PowerShot-7_ rel:nn phLable:NP4
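The WDTree and WDTreeNode classes used in main are not listed in this post. As a rough illustration of the idea only, the sketch below parses dependencies in their printed form rel(governor-index, dependent-index), attaches each dependent to its governor, and annotates every node with its phrase label from the chunker. DepTreeSketch, Node, build and print are hypothetical names, and the real WDTree may well work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DepTreeSketch {

    /** One word of the sentence; index 0 is the artificial ROOT node. */
    static final class Node {
        final String word;     // word plus position, e.g. "enjoyed-3"
        final String rel;      // relation to the parent (null for ROOT)
        final String phLabel;  // phrase label from the chunker (null if unknown)
        final List<Node> children = new ArrayList<>();

        Node(String word, String rel, String phLabel) {
            this.word = word;
            this.rel = rel;
            this.phLabel = phLabel;
        }
    }

    // Matches the printed form "rel(governor-3, dependent-1)"
    private static final Pattern DEP =
            Pattern.compile("(\\w+)\\((\\S+)-(\\d+), (\\S+)-(\\d+)\\)");

    /** Builds the tree and returns the ROOT node. Assumes each word has exactly one governor. */
    static Node build(List<String> deps, Map<Integer, String> phraseLabels) {
        Map<Integer, Node> nodes = new HashMap<>();
        nodes.put(0, new Node("ROOT-0", null, null));
        List<int[]> edges = new ArrayList<>();
        // First pass: create one node per dependent and remember the edge
        for (String dep : deps) {
            Matcher m = DEP.matcher(dep);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unrecognized dependency: " + dep);
            }
            int gov = Integer.parseInt(m.group(3));
            int child = Integer.parseInt(m.group(5));
            nodes.put(child, new Node(m.group(4) + "-" + child, m.group(1), phraseLabels.get(child)));
            edges.add(new int[] {gov, child});
        }
        // Second pass: attach dependents, now that every governor node exists
        for (int[] e : edges) {
            Node parent = nodes.get(e[0]);
            if (parent == null) {
                throw new IllegalArgumentException("Governor " + e[0] + " has no node");
            }
            parent.children.add(nodes.get(e[1]));
        }
        return nodes.get(0);
    }

    /** Appends the children of every non-leaf node, depth first. */
    static void print(Node node, StringBuilder out) {
        if (node.children.isEmpty()) {
            return;
        }
        out.append("children of ").append(node.word)
           .append(" (label:").append(node.phLabel).append("):\n");
        for (Node c : node.children) {
            out.append("  ").append(c.word).append(" rel:").append(c.rel)
               .append(" label:").append(c.phLabel).append('\n');
        }
        for (Node c : node.children) {
            print(c, out);
        }
    }

    public static void main(String[] args) {
        // Dependencies and labels copied from the output listed above
        List<String> deps = Arrays.asList(
                "nsubj(enjoyed-3, We-1)", "advmod(enjoyed-3, really-2)",
                "root(ROOT-0, enjoyed-3)", "xcomp(enjoyed-3, using-4)",
                "det(SD500-8, the-5)", "nn(SD500-8, Canon-6)",
                "nn(SD500-8, PowerShot-7)", "dobj(using-4, SD500-8)");
        String[] labelSeq = {"NP1", "ADVP2", "VP3", "VP3", "NP4", "NP4", "NP4", "NP4"};
        Map<Integer, String> labels = new HashMap<>();
        for (int i = 0; i < labelSeq.length; i++) {
            labels.put(i + 1, labelSeq[i]);
        }
        StringBuilder out = new StringBuilder();
        print(build(deps, labels), out);
        System.out.print(out);
    }
}
```

The two passes matter because a governor can appear before it has been seen as a dependent: here SD500-8 governs three words before the dobj line introduces it. Working from strings keeps the sketch free of library dependencies; code with access to the TypedDependency objects would read the governor and dependent from them directly.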