Using word 1.2 for Chinese Word Segmentation in Lucene 6.1.0

Why word 1.2?

The latest release of the word segmenter is 1.3, but 1.3 exhibits some bugs, including java.lang.OutOfMemoryError, so the more stable 1.2 release is used instead.

In Lucene 6.1.0, subclassing Analyzer (that is, building your own Analyzer) requires implementing createComponents(String fieldName). word 1.2 does not implement this method (it targets Lucene 4.x), so running its ChineseWordAnalyzer fails with:

Exception in thread "main" java.lang.AbstractMethodError: org.apache.lucene.analysis.Analyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:140)
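The root cause is an API change: through Lucene 4.x the abstract factory method on Analyzer took a Reader, and Lucene 5.0 removed that parameter, so word 1.2 overrides a method that Lucene 6.1.0 never calls. A sketch of the two signatures (abbreviated from the respective Analyzer sources):

// Lucene 4.x, the variant that word 1.2's ChineseWordAnalyzer overrides:
protected abstract TokenStreamComponents createComponents(String fieldName, Reader reader);

// Lucene 5.0 and later (including 6.1.0), the Reader parameter is gone:
protected abstract TokenStreamComponents createComponents(String fieldName);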

So ChineseWordAnalyzer needs a few modifications.

Implementing the createComponents(String fieldName) method

Create an Analyzer subclass, MyWordAnalyzer, adapted from ChineseWordAnalyzer:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;

public class MyWordAnalyzer extends Analyzer {

    private Segmentation segmentation;

    public MyWordAnalyzer() {
        // Default to word's bidirectional maximum matching algorithm.
        segmentation = SegmentationFactory.getSegmentation(
                SegmentationAlgorithm.BidirectionalMaximumMatching);
    }

    public MyWordAnalyzer(Segmentation segmentation) {
        this.segmentation = segmentation;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new MyWordTokenizer(segmentation);
        return new TokenStreamComponents(tokenizer);
    }
}
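The one-argument constructor makes the algorithm pluggable. A minimal usage sketch (ReverseMaximumMatching is assumed here to be one of the constants of word's SegmentationAlgorithm enum; swap in whichever algorithm you need):

// Hypothetical usage: pick reverse maximum matching instead of the default.
Analyzer analyzer = new MyWordAnalyzer(
        SegmentationFactory.getSegmentation(SegmentationAlgorithm.ReverseMaximumMatching));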

As shown above, the segmentation field determines which algorithm is used; the default is bidirectional maximum matching. The next piece is MyWordTokenizer, likewise modeled on ChineseWordTokenizer:

import java.io.BufferedReader;
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.LinkedTransferQueue;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apdplat.word.recognition.StopWord;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;
import org.apdplat.word.segmentation.Word;

public class MyWordTokenizer extends Tokenizer {

    private final CharTermAttribute charTermAttribute
            = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttribute
            = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute positionIncrementAttribute
            = addAttribute(PositionIncrementAttribute.class);

    private Segmentation segmentation;
    private BufferedReader reader;
    private final Queue<Word> words = new LinkedTransferQueue<>();
    private int startOffset = 0;

    public MyWordTokenizer() {
        segmentation = SegmentationFactory.getSegmentation(
                SegmentationAlgorithm.BidirectionalMaximumMatching);
    }

    public MyWordTokenizer(Segmentation segmentation) {
        this.segmentation = segmentation;
    }

    private Word getWord() throws IOException {
        Word word = words.poll();
        if (word == null) {
            // Queue is empty: read the whole input and segment it line by line.
            String line;
            while ((line = reader.readLine()) != null) {
                words.addAll(segmentation.seg(line));
            }
            startOffset = 0;
            word = words.poll();
        }
        return word;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // input is only valid after reset(), which is why the reader is
        // created here rather than in the constructor.
        reader = new BufferedReader(input);
        clearAttributes();
        Word word = getWord();
        if (word != null) {
            int positionIncrement = 1;
            // Skip stop words, counting them in the position increment.
            while (StopWord.is(word.getText())) {
                positionIncrement++;
                startOffset += word.getText().length();
                word = getWord();
                if (word == null) {
                    return false;
                }
            }
            charTermAttribute.setEmpty().append(word.getText());
            offsetAttribute.setOffset(startOffset,
                    startOffset + word.getText().length());
            positionIncrementAttribute.setPositionIncrement(positionIncrement);
            startOffset += word.getText().length();
            return true;
        }
        return false;
    }
}

incrementToken() is the method that must be implemented: returning true means there are more tokens to consume, and false means parsing is finished. The first line of incrementToken() assigns input to reader. input is Tokenizer's Reader field; Tokenizer also holds a second Reader, inputPending. From the Tokenizer source:

public abstract class Tokenizer extends TokenStream {  
  /** The text source for this Tokenizer. */
  protected Reader input = ILLEGAL_STATE_READER;
  
  /** Pending reader: not actually assigned to input until reset() */
  private Reader inputPending = ILLEGAL_STATE_READER;

input holds the text to be analyzed, but the text is first stored in inputPending and is only assigned to input once reset() is called.
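The pending reader is populated by setReader(); abbreviated from the Lucene 6.1.0 source (argument and state checks omitted here), it boils down to:

public final void setReader(Reader input) {
    // ... null and illegal-state checks ...
    this.inputPending = input;
}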
reset() is then defined as follows:

 @Override
  public void reset() throws IOException {
    super.reset();
    input = inputPending;
    inputPending = ILLEGAL_STATE_READER;
  }

Before reset() is called, input contains no text to analyze, which is why input is assigned to reader (a BufferedReader) only after reset(); in this tokenizer that happens at the top of incrementToken().
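Putting the pieces together, the lifecycle looks like this (a sketch; the exception message is paraphrased from Lucene's contract-violation error):

TokenStream ts = new MyWordAnalyzer().tokenStream("text", "乒乓球拍卖完了");
// Calling ts.incrementToken() at this point, before reset(), would read from
// ILLEGAL_STATE_READER and throw an IllegalStateException
// ("TokenStream contract violation: reset()/close() call missing ...").
ts.reset();                          // moves inputPending into input
while (ts.incrementToken()) {
    // consume one token per iteration
}
ts.end();
ts.close();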
  
With these changes in place, the segmentation algorithms shipped with word 1.2 can be used from Lucene 6.1.0.

The test class, MyWordAnalyzerTest:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class MyWordAnalyzerTest {

    public static void main(String[] args) throws IOException {
        // "The ping-pong paddles are sold out", a classic segmentation-ambiguity example.
        String text = "乒乓球拍卖完了";
        Analyzer analyzer = new MyWordAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("text", text);
        // The token text
        CharTermAttribute charTermAttribute
                = tokenStream.getAttribute(CharTermAttribute.class);
        // The token's start/end offsets in the original text
        OffsetAttribute offsetAttribute
                = tokenStream.getAttribute(OffsetAttribute.class);
        // The position increment relative to the previous token
        PositionIncrementAttribute positionIncrementAttribute
                = tokenStream.getAttribute(PositionIncrementAttribute.class);
        // Prepare to consume
        tokenStream.reset();
        // Consume
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString() + " "
                    + "(" + offsetAttribute.startOffset() + " - "
                    + offsetAttribute.endOffset() + ") "
                    + positionIncrementAttribute.getPositionIncrement());
        }
        // Done consuming
        tokenStream.end();
        tokenStream.close();
    }
}

The output:

[Figure: segmentation result using word 1.2]

Because incrementToken() drops stop words, "了" does not appear in the output. The result also shows that word splits the sentence into "乒乓球拍" (ping-pong paddle) and "卖完" (sold out). Compare this with SmartChineseAnalyzer:

[Figure: segmentation result using SmartChineseAnalyzer]
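For reference, a sketch of how the SmartChineseAnalyzer side of the comparison can be reproduced (assuming the lucene-analyzers-smartcn 6.1.0 artifact is on the classpath; SmartChineseAnalyzer lives in org.apache.lucene.analysis.cn.smart):

Analyzer smart = new SmartChineseAnalyzer();
TokenStream ts = smart.tokenStream("text", "乒乓球拍卖完了");
CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.print(term.toString() + " | ");
}
ts.end();
ts.close();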

All in all, word's segmentation quality holds up well.
