ictclas4j for lucene analyzer,

版权信息 : 可以任意转载, 转载时请务必以超链接形式标明文章原文出处 , 即下面的声明.

 

原文出处:http://blog.chenlb.com/2009/01/ictclas4j-for-lucene-analyzer.html

在 lucene 的中文分词域里,有好几个分词选择,有:je、paoding、IK。最近想把 ictclas 拿来做 lucene 的中文分词。网上看了下资料,觉得 ictclas4j 是比较好的选择,作者博客相关文章:http://blog.csdn.net/sinboy/category/207165.aspx  。ictclas4j 目前是0.9.1版,项目地址:http://code.google.com/p/ictclas4j/  ,下载地址:http://ictclas4j.googlecode.com/files/ictclas4j_0.9.1.rar  。

 

下载 ictclas4j 看了下源码,正找示例,org.ictclas4j.run.SegMain 可以运行。分词的核心逻辑在org.ictclas4j.segment.Segment 的 split(String src) 方法中。运行 SegMain 的结果是一串字符串(带有词性标注),细看了 Segment 与 org.ictclas4j.bean.SegResult 没看到一个个分好的词。这样就比较难以扩展成为 lucene 的分词器。555,接下还是 hack 一下。

 

hack 的突破口的它的最终结果,在 SegResult 类里的 finalResult 字段记录。 在Segment.split(String src) 生成。慢慢看代码找到 outputResult(ArrayList<SegNode> wrList) 方法把一个个分好的词拼凑成 string。我们可以修改这个方法把一个个分好的词收集起来。下面是 hack 的过程。

1、修改 Segment:
1)把原来的outputResult(ArrayList<SegNode> wrList) 复制为 outputResult(ArrayList<SegNode> wrList, ArrayList<String> words) 方法,并添加收集词的内容,最后为:

    	// 根据分词路径生成分词结果
    	private String outputResult(ArrayList<SegNode> wrList, ArrayList<String> words) {
    		String result = null;
    		String temp=null;
    		char[] pos = new char[2];
    		if (wrList != null &amp;amp;&amp;amp; wrList.size() > 0) {
    			result = "";
    			for (int i = 0; i < wrList.size(); i++) {
    				SegNode sn = wrList.get(i);
    				if (sn.getPos() != POSTag.SEN_BEGIN &amp;amp;&amp;amp; sn.getPos() != POSTag.SEN_END) {
    					int tag = Math.abs(sn.getPos());
    					pos[0] = (char) (tag / 256);
    					pos[1] = (char) (tag % 256);
    					temp=""+pos[0];
    					if(pos[1]>0)
    						temp+=""+pos[1];
    					result += sn.getSrcWord() + "/" + temp + " ";
    					if(words != null) {	//chenlb add
    						words.add(sn.getSrcWord());
    					}
    				}
    			}
    		}
    
    		return result;
    	}
    

    2)原来的outputResult(ArrayList<SegNode> wrList) 改为:

    	//chenlb move to outputResult(ArrayList<SegNode> wrList, ArrayList<String> words)
    	private String outputResult(ArrayList<SegNode> wrList) {
    		return outputResult(wrList, null);
    	}
    

    3)修改调用outputResult(ArrayList<SegNode> wrList)的地方(注意不是所有的调用),大概在 Segment 的126行 String optResult = outputResult(optSegPath); 改为 String optResult = outputResult(optSegPath, words); 当然还要定义ArrayList<String> words了,最终 Segment.split(String src) 如下:

      	public SegResult split(String src) {
      		SegResult sr = new SegResult(src);// 分词结果
      		String finalResult = null;
      
      		if (src != null) {
      			finalResult = "";
      			int index = 0;
      			String midResult = null;
      			sr.setRawContent(src);
      			SentenceSeg ss = new SentenceSeg(src);
      			ArrayList<Sentence> sens = ss.getSens();
      
      			ArrayList<String> words = new ArrayList<String>();	//chenlb add
      
      			for (Sentence sen : sens) {
      				logger.debug(sen);
      				long start=System.currentTimeMillis();
      				MidResult mr = new MidResult();
      				mr.setIndex(index++);
      				mr.setSource(sen.getContent());
      				if (sen.isSeg()) {
      
      					// 原子分词
      					AtomSeg as = new AtomSeg(sen.getContent());
      					ArrayList<Atom> atoms = as.getAtoms();
      					mr.setAtoms(atoms);
      					System.err.println("[atom time]:"+(System.currentTimeMillis()-start));
      					start=System.currentTimeMillis();
      
      					// 生成分词图表,先进行初步分词,然后进行优化,最后进行词性标记
      					SegGraph segGraph = GraphGenerate.generate(atoms, coreDict);
      					mr.setSegGraph(segGraph.getSnList());
      					// 生成二叉分词图表
      					SegGraph biSegGraph = GraphGenerate.biGenerate(segGraph, coreDict, bigramDict);
      					mr.setBiSegGraph(biSegGraph.getSnList());
      					System.err.println("[graph time]:"+(System.currentTimeMillis()-start));
      					start=System.currentTimeMillis();
      
      					// 求N最短路径
      					NShortPath nsp = new NShortPath(biSegGraph, segPathCount);
      					ArrayList<ArrayList<Integer>> bipath = nsp.getPaths();
      					mr.setBipath(bipath);
      					System.err.println("[NSP time]:"+(System.currentTimeMillis()-start));
      					start=System.currentTimeMillis();
      
      					for (ArrayList<Integer> onePath : bipath) {
      						// 得到初次分词路径
      						ArrayList<SegNode> segPath = getSegPath(segGraph, onePath);
      						ArrayList<SegNode> firstPath = AdjustSeg.firstAdjust(segPath);
      						String firstResult = outputResult(firstPath);
      						mr.addFirstResult(firstResult);
      						System.err.println("[first time]:"+(System.currentTimeMillis()-start));
      						start=System.currentTimeMillis();
      
      						// 处理未登陆词,进对初次分词结果进行优化
      						SegGraph optSegGraph = new SegGraph(firstPath);
      						ArrayList<SegNode> sns = clone(firstPath);
      						personTagger.recognition(optSegGraph, sns);
      						transPersonTagger.recognition(optSegGraph, sns);
      						placeTagger.recognition(optSegGraph, sns);
      						mr.setOptSegGraph(optSegGraph.getSnList());
      						System.err.println("[unknown time]:"+(System.currentTimeMillis()-start));
      						start=System.currentTimeMillis();
      
      						// 根据优化后的结果,重新进行生成二叉分词图表
      						SegGraph optBiSegGraph = GraphGenerate.biGenerate(optSegGraph, coreDict, bigramDict);
      						mr.setOptBiSegGraph(optBiSegGraph.getSnList());
      
      						// 重新求取N-最短路径
      						NShortPath optNsp = new NShortPath(optBiSegGraph, segPathCount);
      						ArrayList<ArrayList<Integer>> optBipath = optNsp.getPaths();
      						mr.setOptBipath(optBipath);
      
      						// 生成优化后的分词结果,并对结果进行词性标记和最后的优化调整处理
      						ArrayList<SegNode> adjResult = null;
      						for (ArrayList<Integer> optOnePath : optBipath) {
      							ArrayList<SegNode> optSegPath = getSegPath(optSegGraph, optOnePath);
      							lexTagger.recognition(optSegPath);
      							String optResult = outputResult(optSegPath, words);	//chenlb changed
      							mr.addOptResult(optResult);
      							adjResult = AdjustSeg.finaAdjust(optSegPath, personTagger, placeTagger);
      							String adjrs = outputResult(adjResult);
      							System.err.println("[last time]:"+(System.currentTimeMillis()-start));
      							start=System.currentTimeMillis();
      							if (midResult == null)
      								midResult = adjrs;
      							break;
      						}
      					}
      					sr.addMidResult(mr);
      				} else {
      					midResult = sen.getContent();
      					words.add(midResult);	//chenlb add
      				}
      				finalResult += midResult;
      				midResult = null;
      			}
      
      			sr.setWords(words);	//chenlb add
      
      			sr.setFinalResult(finalResult);
      			DebugUtil.output2html(sr);
      			logger.info(finalResult);
      		}
      
      		return sr;
      	}
      

      4)Segment中的构造方法,词典路径分隔可以改为"/"

      5)同时修改了一个漏词的 bug,请看:ictclas4j的一个bug

      2、修改 SegResult:
      添加以下内容:

        private ArrayList<String> words;	//记录分词后的词结果,chenlb add
        	/**
        	 * 添加词条。
        	 * @param word null 不添加
        	 * @author chenlb 2009-1-21 下午05:01:25
        	 */
        	public void addWord(String word) {
        		if(words == null) {
        			words = new ArrayList<String>();
        		}
        		if(word != null) {
        			words.add(word);
        		}
        	}
        
        	public ArrayList<String> getWords() {
        		return words;
        	}
        
        	public void setWords(ArrayList<String> words) {
        		this.words = words;
        	}
        

        下面是创建 ictclas4j 的 lucene analyzer
        1、新建一个ICTCLAS4jTokenizer类:

          package com.chenlb.analysis.ictclas4j;
          
          import java.io.IOException;
          import java.io.Reader;
          import java.util.ArrayList;
          
          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.Tokenizer;
          import org.ictclas4j.bean.SegResult;
          import org.ictclas4j.segment.Segment;
          
          /**
           * ictclas4j 切词
           *
           * @author chenlb 2009-1-23 上午11:39:10
           */
          public class ICTCLAS4jTokenizer extends Tokenizer {
          
          	private static Segment segment;
          
          	private StringBuilder sb = new StringBuilder();
          
          	private ArrayList<String> words;
          
          	private int startOffest = 0;
          	private int length = 0;
          	private int wordIdx = 0;
          
          	public ICTCLAS4jTokenizer() {
          		words = new ArrayList<String>();
          	}
          
          	public ICTCLAS4jTokenizer(Reader input) {
          		super(input);
          		char[] buf = new char[8192];
          		int d = -1;
          		try {
          			while((d=input.read(buf)) != -1) {
          				sb.append(buf, 0, d);
          			}
          		} catch (IOException e) {
          			e.printStackTrace();
          		}
          		SegResult sr = seg().split(sb.toString());	//分词
          		words = sr.getWords();
          	}
          
          	public Token next(Token reusableToken) throws IOException {
          		assert reusableToken != null;
          
          		length = 0;
          		Token token = null;
          		if(wordIdx < words.size()) {
          			String word = words.get(wordIdx);
          			length = word.length();
          			token = reusableToken.reinit(word, startOffest, startOffest+length);
          			wordIdx++;
          			startOffest += length;
          
          		}
          
          		return token;
          	}
          
          	private static Segment seg() {
          		if(segment == null) {
          			segment = new Segment(1);
          		}
          		return segment;
          	}
          }
          

          2、新建一个ICTCLAS4jFilter类:

            package com.chenlb.analysis.ictclas4j;
            
            import org.apache.lucene.analysis.Token;
            import org.apache.lucene.analysis.TokenFilter;
            import org.apache.lucene.analysis.TokenStream;
            
            /**
             * 标点符等, 过虑.
             *
             * @author chenlb 2009-1-23 下午03:06:00
             */
            public class ICTCLAS4jFilter extends TokenFilter {
            
            	protected ICTCLAS4jFilter(TokenStream input) {
            		super(input);
            	}
            
                public final Token next(final Token reusableToken) throws java.io.IOException {
                    assert reusableToken != null;
            
                    for (Token nextToken = input.next(reusableToken); nextToken != null; nextToken = input.next(reusableToken)) {
                        String text = nextToken.term();
            
                            switch (Character.getType(text.charAt(0))) {
            
                            case Character.LOWERCASE_LETTER:
                            case Character.UPPERCASE_LETTER:
            
                                // English word/token should larger than 1 character.
                                if (text.length()>1) {
                                    return nextToken;
                                }
                                break;
                            case Character.DECIMAL_DIGIT_NUMBER:
                            case Character.OTHER_LETTER:
            
                                // One Chinese character as one Chinese word.
                                // Chinese word extraction to be added later here.
            
                                return nextToken;
                            }
            
                    }
                    return null;
                }
            }
            

            3、新建一个ICTCLAS4jAnalyzer类:

              package com.chenlb.analysis.ictclas4j;
              
              import java.io.Reader;
              
              import org.apache.lucene.analysis.Analyzer;
              import org.apache.lucene.analysis.LowerCaseFilter;
              import org.apache.lucene.analysis.StopFilter;
              import org.apache.lucene.analysis.TokenStream;
              
              /**
               * ictclas4j 的 lucene 分析器
               *
               * @author chenlb 2009-1-23 上午11:39:39
               */
              public class ICTCLAS4jAnalyzer extends Analyzer {
              
              	private static final long serialVersionUID = 1L;
              
              	// 可以自定义添加更多的过虑的词(高频无多太用处的词)
              	private static final String[] STOP_WORDS = {
              		"and", "are", "as", "at", "be", "but", "by",
              	    "for", "if", "in", "into", "is", "it",
              	    "no", "not", "of", "on", "or", "such",
              	    "that", "the", "their", "then", "there", "these",
              	    "they", "this", "to", "was", "will", "with",
              	    "的"
              	};
              
              	public TokenStream tokenStream(String fieldName, Reader reader) {
              		TokenStream result = new ICTCLAS4jTokenizer(reader);
              		result = new ICTCLAS4jFilter(new StopFilter(new LowerCaseFilter(result), STOP_WORDS));
              		return result;
              	}
              
              }
              

              下面来测试下分词效果:
              文本内容:

              京华时报1月23日报道 昨天,受一股来自中西伯利亚的强冷空气影响,本市出现大风降温天气,白天最高气温只有零下7摄氏度,同时伴有6到7级的偏北风。

              原分词结果:

              京华/nz 时/ng 报/v 1月/t 23日/t 报道/v  昨天/t ,/w 受/v 一/m 股/q 来自/v 中/f 西伯利亚/ns 的/u 强/a 冷空气/n 影响/vn ,/w 本市/r 出现/v 大风/n 降温/vn 天气/n ,/w 白天/t 最高/a 气温/n 只/d 有/v 零下/s 7/m 摄氏度/q ,/w 同时/c 伴/v 有/v 6/m 到/v 7/m 级/q 的/u 偏/a 北风/n 。/w 

              analyzer:

              [京华] [时] [报] [1月] [23日] [报道] [昨天] [受] [一] [股] [来自] [中] [西伯利亚] [强] [冷空气] [影响] [本市] [出现] [大风] [降温] [天气] [白天] [最高] [气温] [只] [有] [零下] [7] [摄氏度] [同时] [伴] [有] [6] [到] [7] [级] [偏] [北风]

              我改过的源码可以下载:ictclas4j-091-for-lucene-src

              依赖的jar:commons-lang-2.1.jar,log4j-1.2.12.jar,lucene-core-2.4.jar

              你可能感兴趣的:(Lucene)