Building Sentence Parse Trees with the Stanford Parser and the OpenNLP Shallow Parser

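For a recent project I needed to parse the sentences in a given text, use POS tags and constituent information to find the dependencies between words and phrases, and then build a parse tree for each sentence from those dependencies. This requires the Stanford Parser and the shallow parser in OpenNLP. Both parsers are implemented in Java, are called through an API, and can output a syntactic parse for a sentence. Below I summarize what each parser does and how to call it from a Java program.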

1 Shallow Parser

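The shallow parser's main job is to find the phrases in a sentence, including noun phrases (NP), verb phrases (VP), adjective phrases (ADJP), adverb phrases (ADVP), and so on. A sample program follows: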

package edu.pku.yangliu.nlp.pdt;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.HashMap;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

/** A shallow parser based on OpenNLP
 * @author yangliu
 * @blog http://blog.csdn.net/yangliuy
 * @mail [email protected]
 */

public class ShallowParser {
	
	private static ShallowParser instance = null ;
	private static POSModel model;
	private static ChunkerModel cModel ;
	
	//Singleton pattern
	public static ShallowParser getInstance() throws InvalidFormatException, IOException{
		if(ShallowParser.instance == null){
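			// Load both models once; the paths are relative to the working directory.
			POSModel posModel = new POSModelLoader().load(new File("en-pos-maxent.bin"));
			InputStream is = new FileInputStream("en-chunker.bin");
			ChunkerModel chunkerModel;
			try {
				chunkerModel = new ChunkerModel(is);
			} finally {
				is.close(); // release the file handle once the model is in memory
			}
			ShallowParser.instance = new ShallowParser(posModel, chunkerModel);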
		}
		return ShallowParser.instance;
	}
	
	public ShallowParser(POSModel model, ChunkerModel cModel){
		ShallowParser.model = model;
		ShallowParser.cModel = cModel;
		
	}
	
	 /** Chunks a sentence and returns a map from 1-based word position to
	  *  phrase label (for example 1 -> "NP1").
	  *  Note: every token must be separated by spaces, including punctuation
	  *  such as "," "(" ")" and ".", because a whitespace tokenizer is used.
	  * @param input the input sentence
	  * @return map from word position to phrase label
	  */
	 public HashMap<Integer, String> chunk(String input) throws IOException { 	
			PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
			POSTaggerME tagger = new POSTaggerME(model);
			ObjectStream<String> lineStream = new PlainTextByLineStream(
					new StringReader(input));
			perfMon.start();
			String line;
			String whitespaceTokenizerLine[] = null; 
			String[] tags = null;
			while ((line = lineStream.read()) != null) {
				whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE
						.tokenize(line);
				tags = tagger.tag(whitespaceTokenizerLine);	 
				POSSample posTags = new POSSample(whitespaceTokenizerLine, tags);
				System.out.println(posTags.toString());
				perfMon.incrementCounter();
			}
			perfMon.stopAndPrintFinalResult();
	 
			// chunker
			ChunkerME chunkerME = new ChunkerME(cModel);
			String result[] = chunkerME.chunk(whitespaceTokenizerLine, tags);
			
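			// Map each 1-based word position to its phrase label, e.g. 1 -> "NP1".
			HashMap<Integer, String> phraseLablesMap = new HashMap<Integer, String>();
			int wordCount = 1;
			int phLableCount = 0;
			for (String phLable : result){
				// "O" marks a token outside any chunk (here, the final punctuation).
				if(phLable.equals("O")) phLable += "-Punctuation";
				// A "B-" tag opens a new phrase, so advance the phrase counter.
				if(phLable.split("-")[0].equals("B")) phLableCount++;
				// Keep the phrase type and append the counter, e.g. "B-NP" -> "NP1".
				phLable = phLable.split("-")[1] + phLableCount;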
				//if(phLable.equals("ADJP")) phLable = "NP"; //Notice: ADJP included in NP
				//if(phLable.equals("ADVP")) phLable = "VP"; //Notice: ADVP included in VP
				System.out.println(wordCount + ":" + phLable);
				phraseLablesMap.put(wordCount, phLable);
				wordCount++;
			}
				
			//Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
			//for (Span phLable : span)
				//System.out.println(phLable.toString());
			return phraseLablesMap;
		}
	 
	 /** Just for testing: chunks a sample sentence and prints the phrase labels. */
	 public static void main(String[] args) throws IOException {
		 // Note: every token, including punctuation such as "," "(" ")" and ".", must be surrounded by spaces.
		 String input = "We really enjoyed using the Canon PowerShot SD500 .";
		 //String input = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
		 ShallowParser swParser = ShallowParser.getInstance();
		 swParser.chunk(input);
	 }
	     
}
Make sure the paths to the POS model and the chunker model are configured correctly. Both model files can be downloaded from the OpenNLP website.
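
The listing above tokenizes with WhitespaceTokenizer, so punctuation has to be separated by spaces before the sentence is passed to chunk. As a minimal sketch of that preprocessing step, assuming only a handful of punctuation marks need handling, the hypothetical helper below (PunctuationSpacer is my own name, not part of OpenNLP) pads them with spaces using only the JDK:

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of OpenNLP): pads punctuation with spaces. */
class PunctuationSpacer {

    // Characters to isolate; this list is an illustrative assumption.
    private static final Pattern PUNCT = Pattern.compile("([,;:!?()\"])");

    // A period is split off only at the very end of the sentence,
    // so tokens such as "3.5" inside the sentence stay intact.
    private static final Pattern FINAL_PERIOD = Pattern.compile("\\.\\s*$");

    static String space(String sentence) {
        String s = PUNCT.matcher(sentence).replaceAll(" $1 ");
        s = FINAL_PERIOD.matcher(s).replaceAll(" .");
        // Collapse the runs of whitespace introduced above.
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(space("Bell, based in Los Angeles, makes products."));
    }
}
```

This is a rough approximation only; a trained tokenizer would handle abbreviations and contractions far better than these two regular expressions.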

Output:

Loading POS Tagger model ... done (1.563s)


Average: 9.3 sent/s 
Total: 1 sent
Runtime: 0.107s
We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
1:NP1
2:ADVP2
3:VP3
4:VP3
5:NP4
6:NP4
7:NP4
8:NP4
9:Punctuation4

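As the output shows, the shallow parser first prints the POS tags and then finds two noun phrases (NP1 and NP4), one verb phrase (VP3), and one adverb phrase (ADVP2) in the sentence. The final period lies outside any phrase, so the program labels it Punctuation4.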
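
To make the B/I/O labelling concrete, here is a small self-contained sketch that groups chunk tags back into whole phrases. ChunkGrouper is a hypothetical helper of mine, not an OpenNLP class, and the tag array is inferred from the labels printed above rather than copied from actual chunker output:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): groups BIO chunk tags into phrases. */
class ChunkGrouper {

    static List<String> group(String[] tokens, String[] tags) {
        List<String> phrases = new ArrayList<String>();
        StringBuilder current = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            if (tag.startsWith("B-")) {
                // "B-XX" opens a new phrase of type XX.
                if (current != null) phrases.add(current.toString());
                current = new StringBuilder(tag.substring(2)).append(": ").append(tokens[i]);
            } else if (tag.startsWith("I-") && current != null) {
                // "I-XX" continues the phrase that is currently open.
                current.append(' ').append(tokens[i]);
            } else {
                // "O" (or a stray "I-") lies outside any phrase; close the open one.
                if (current != null) phrases.add(current.toString());
                current = null;
            }
        }
        if (current != null) phrases.add(current.toString());
        return phrases;
    }

    public static void main(String[] args) {
        String[] tokens = "We really enjoyed using the Canon PowerShot SD500 .".split(" ");
        // Tags inferred from the printed labels (NP1, ADVP2, VP3, VP3, NP4 x4, Punctuation4).
        String[] tags = {"B-NP", "B-ADVP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O"};
        for (String phrase : group(tokens, tags)) {
            System.out.println(phrase);
        }
    }
}
```

The commented-out chunkAsSpans call in the listing above appears to be the library's own route to the same grouping; the sketch only shows the idea without depending on the model files.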

2 Stanford Parser

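The Stanford Parser finds the dependency relations between the words in a sentence and outputs them in the Stanford Dependencies format, as a directed graph or as a tree, among other forms. Sample code follows: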

package edu.pku.yangliu.nlp.pdt;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.List;

import opennlp.tools.util.InvalidFormatException;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

/** Parses sentences with the Stanford Parser
 * @author yangliu
 * @blog http://blog.csdn.net/yangliuy
 * @mail [email protected]
 */

public class StanfordParser {
	private static StanfordParser instance = null ;
	private static LexicalizedParser lp;
	
	//Singleton pattern
	public static StanfordParser getInstance(){
		if(StanfordParser.instance == null){
			LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz","-retainTmpSubcategories");
			StanfordParser.instance = new StanfordParser(lp);
		}
		return StanfordParser.instance;
	}
	
	public StanfordParser(LexicalizedParser lp){
		StanfordParser.lp = lp;
	}
	 /** Parses every sentence in a file and prints its parse tree and typed dependencies.
	 * @param SentFilename path of the input file
	 */
	  public void DPFromFile(String SentFilename) {
		    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
		    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
		    
		    for (List<HasWord> sentence : new DocumentPreprocessor(SentFilename)) {
		      Tree parse = lp.apply(sentence);
		      parse.pennPrint();
		      System.out.println();
		      
		      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
		      List<TypedDependency> tdl = (List<TypedDependency>) gs.typedDependenciesCollapsedTree();
		      System.out.println(tdl);
		      System.out.println();
		    }
	  }

	 /** Parses a single sentence given as a String.
	 * @param sent the input sentence
	 * @return the list of typed dependencies
	 */
	  public List<TypedDependency> DPFromString(String sent) {
		    TokenizerFactory<CoreLabel> tokenizerFactory = 
		      PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
		    List<CoreLabel> rawWords = 
		      tokenizerFactory.getTokenizer(new StringReader(sent)).tokenize();
		    Tree parse = lp.apply(rawWords);
	
		    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
		    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
		    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
		    // Use the collapsed dependencies that preserve a tree structure;
		    // dependencies that would break the tree are omitted.
		   return (List<TypedDependency>) gs.typedDependenciesCollapsedTree();   
	  }
}
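
Both classes above create their singleton lazily inside getInstance, which is fine for a single-threaded test but could load the model twice if two threads call it at the same moment. As a sketch of one common alternative, the initialization-on-demand holder idiom, the example below uses a hypothetical LazyParser class with a placeholder string standing in for the real model; none of these names come from OpenNLP or the Stanford Parser:

```java
/** Hypothetical stand-in for a parser wrapper whose model is expensive to load. */
class LazyParser {

    // Counts constructor calls so the demo can show the model is created once.
    static int loadCount = 0;

    private final String model;

    private LazyParser() {
        loadCount++;
        // A real class would load the serialized model here.
        this.model = "model-placeholder";
    }

    // The JVM initializes Holder on first access and guarantees that this
    // happens once, even when several threads call getInstance() together.
    private static final class Holder {
        static final LazyParser INSTANCE = new LazyParser();
    }

    static LazyParser getInstance() {
        return Holder.INSTANCE;
    }

    String model() {
        return model;
    }

    public static void main(String[] args) {
        LazyParser a = LazyParser.getInstance();
        LazyParser b = LazyParser.getInstance();
        System.out.println((a == b) + " " + loadCount);
    }
}
```

With this shape the model becomes an instance field rather than a static one, which also removes the need for a public constructor that assigns to static state.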

The main function is as follows:

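/** Just for testing: prints the typed dependencies, the phrase labels,
	 * and the resulting parse tree for a sample sentence.
	 * @param args unused
	 * @throws IOException if a model file cannot be read
	 * @throws InvalidFormatException if a model file is malformed
	 */
	public static void main(String[] args) throws InvalidFormatException, IOException {
		// Note: every token, including punctuation such as "," "(" ")" and ".", must be surrounded by spaces.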
		 String sent = "We really enjoyed using the Canon PowerShot SD500 .";
		 //String sent = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
		 //String sent = "It has an exterior design that combines form and function more elegantly than any point-and-shoot we've ever tested . "; 
		 //String sent = "A Digic II-powered image-processing system enables the SD500 to snap a limitless stream of 7-megapixel photos at a respectable clip , its start-up time is tops in its class , and it delivers decent photos when compared to its competition . "; 
		 //String sent = "I've had it for about a month and it is simply the best point-and-shoot your money can buy . "; 
		 
		 StanfordParser sdPaser = StanfordParser.getInstance();
		 
		 List<TypedDependency> tdl = sdPaser.DPFromString(sent);
		 for(TypedDependency oneTdl : tdl){
		    	System.out.println(oneTdl);
		  } 
		 
		  ShallowParser swParser = ShallowParser.getInstance();
		  HashMap<Integer, String> phraseLablesMap = swParser.chunk(sent);
		  WDTree wdtree = new WDTree();
		  WDTreeNode root = wdtree.bulidWDTreeFromList(tdl, phraseLablesMap);
		  wdtree.printWDTree(root);
	}

The output below shows the dependencies between words, the POS tags, the phrase labels, and the sentence's parse tree:

Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.1 sec].
nsubj(enjoyed-3, We-1)
advmod(enjoyed-3, really-2)
root(ROOT-0, enjoyed-3)
xcomp(enjoyed-3, using-4)
det(SD500-8, the-5)
nn(SD500-8, Canon-6)
nn(SD500-8, PowerShot-7)
dobj(using-4, SD500-8)
Loading POS Tagger model ... done (1.492s)
We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.


Average: 200.0 sent/s 
Total: 1 sent
Runtime: 0.0050s
1:NP1
2:ADVP2
3:VP3
4:VP3
5:NP4
6:NP4
7:NP4
8:NP4
9:Punctuation4


children of ROOT-0_ (phLable:null):
enjoyed-3_  rel:root phLable:VP3   


children of enjoyed-3_ (phLable:VP3):
We-1_  rel:nsubj phLable:NP1   really-2_  rel:advmod phLable:ADVP2   using-4_  rel:xcomp phLable:VP3   


children of using-4_ (phLable:VP3):
SD500-8_  rel:dobj phLable:NP4   


children of SD500-8_ (phLable:NP4):
the-5_  rel:det phLable:NP4   Canon-6_  rel:nn phLable:NP4   PowerShot-7_  rel:nn phLable:NP4   
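
The WDTree and WDTreeNode classes called from main are not shown in this post. As a rough, hypothetical sketch of how such a tree could be assembled (this is not the WDTree implementation), the code below parses dependency strings in the printed format, attaches the phrase label of each dependent, and groups dependents under their governor:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not the WDTree class used above): governor -> children map. */
class DependencyTreeSketch {

    // Matches lines such as "nsubj(enjoyed-3, We-1)".
    private static final Pattern DEP =
            Pattern.compile("(\\w+)\\((\\S+)-(\\d+), (\\S+)-(\\d+)\\)");

    /** Maps each governor (e.g. "enjoyed-3") to its dependents with relation and phrase label. */
    static Map<String, List<String>> build(List<String> deps, Map<Integer, String> phraseLabels) {
        Map<String, List<String>> children = new LinkedHashMap<String, List<String>>();
        for (String dep : deps) {
            Matcher m = DEP.matcher(dep);
            if (!m.matches()) continue; // skip anything that is not a dependency line
            String governor = m.group(2) + "-" + m.group(3);
            int dependentIndex = Integer.parseInt(m.group(5));
            String child = m.group(4) + "-" + dependentIndex
                    + " rel:" + m.group(1)
                    + " phrase:" + phraseLabels.get(dependentIndex);
            List<String> list = children.get(governor);
            if (list == null) {
                list = new ArrayList<String>();
                children.put(governor, list);
            }
            list.add(child);
        }
        return children;
    }

    public static void main(String[] args) {
        List<String> deps = Arrays.asList(
                "nsubj(enjoyed-3, We-1)", "advmod(enjoyed-3, really-2)",
                "root(ROOT-0, enjoyed-3)", "xcomp(enjoyed-3, using-4)",
                "det(SD500-8, the-5)", "nn(SD500-8, Canon-6)",
                "nn(SD500-8, PowerShot-7)", "dobj(using-4, SD500-8)");
        String[] labels = {"NP1", "ADVP2", "VP3", "VP3", "NP4", "NP4", "NP4", "NP4", "Punctuation4"};
        Map<Integer, String> phraseLabels = new HashMap<Integer, String>();
        for (int i = 0; i < labels.length; i++) {
            phraseLabels.put(i + 1, labels[i]); // word positions are 1-based
        }
        for (Map.Entry<String, List<String>> e : build(deps, phraseLabels).entrySet()) {
            System.out.println("children of " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

Parsing the printed strings keeps the sketch free of library dependencies; real code would more likely read the governor, dependent, and relation directly from each TypedDependency object.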

