Lucene (4): Lucene Analyzers

1. The Analysis (Tokenization) Flow

[Figure 1: the analysis flow (Reader → Tokenizer → TokenFilter → TokenStream)]

Reader: converts the input string into a character stream that can be read.

Tokenizer: receives the character stream from the Reader and splits it into tokens (lexical units).

Some Tokenizer implementations: CharTokenizer, KeywordTokenizer, StandardTokenizer, etc.

TokenFilter: applies various kinds of filtering to the tokens (for example lower-casing or stop-word removal).

Some TokenFilter implementations:

[Figure 2: TokenFilter implementation classes]

TokenStream: the stream produced once the analyzer has finished processing. It stores all of the token information, which can be read back efficiently through its attributes.
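To make the Reader → Tokenizer → TokenFilter → TokenStream chain concrete, here is a minimal custom Analyzer sketch, assuming the Lucene 3.5 API used throughout this article; the class name MyChainAnalyzer and the particular Tokenizer/TokenFilter choices are illustrative, not from the original post:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.Version;

//Illustrative custom Analyzer: chains one Tokenizer with two TokenFilters
public class MyChainAnalyzer extends Analyzer {

	@Override
	public TokenStream tokenStream(String fieldName, Reader reader) {
		//Tokenizer: splits the character stream into tokens at non-letter characters
		TokenStream stream = new LetterTokenizer(Version.LUCENE_35, reader);
		//TokenFilter 1: lower-cases each token
		stream = new LowerCaseFilter(Version.LUCENE_35, stream);
		//TokenFilter 2: drops common English stop words
		stream = new StopFilter(Version.LUCENE_35, stream,
					StopAnalyzer.ENGLISH_STOP_WORDS_SET);
		return stream;
	}
}

The TokenStream returned here can be inspected with the same attribute-reading code shown in the next section.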



2. Example

package org.lucene.util;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;


public class AnalyzerUtils {

	public static void displayAllTokenInfo(String str, Analyzer a) {
		try {
			TokenStream stream = a.tokenStream("content", new StringReader(str));
			//position increment: the distance between this token and the previous one
			PositionIncrementAttribute pia =
						stream.addAttribute(PositionIncrementAttribute.class);
			//start and end character offsets of each token
			OffsetAttribute oa =
						stream.addAttribute(OffsetAttribute.class);
			//the text of each token
			CharTermAttribute cta =
						stream.addAttribute(CharTermAttribute.class);
			//the token type assigned by the tokenizer
			TypeAttribute ta =
						stream.addAttribute(TypeAttribute.class);
			//reset the stream before consuming it (part of the TokenStream contract)
			stream.reset();
			while (stream.incrementToken()) {
				System.out.print(pia.getPositionIncrement() + ":");
				System.out.print(cta + "[" + oa.startOffset() + "-" + oa.endOffset() + "]-->" + ta.type() + "\n");
			}
			stream.end();
			stream.close();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
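Each line printed by displayAllTokenInfo has the form positionIncrement:term[startOffset-endOffset]-->type. As a rough illustration (not captured output), WhitespaceAnalyzer on "how are you thank you" should print lines like 1:how[0-3]-->word, one per token.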


Test class:

package org.lucene.test;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.lucene.util.AnalyzerUtils;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class TestAnalyzer {

	@Test
	public void test() {
		//English analyzers
		Analyzer a1 = new StandardAnalyzer(Version.LUCENE_35);
		Analyzer a2 = new StopAnalyzer(Version.LUCENE_35);
		Analyzer a3 = new SimpleAnalyzer(Version.LUCENE_35);
		Analyzer a4 = new WhitespaceAnalyzer(Version.LUCENE_35);
		String txt = "how are you thank you";
		
		//MMSeg Chinese analyzer; the argument is the path to its dictionary directory
		Analyzer a5 = new MMSegAnalyzer(new File("D:\\tools\\javaTools\\lucene\\mmseg4j-1.8.5\\data"));
		String txt1 = "我来自中国云南昭通昭阳区师专";
		
		AnalyzerUtils.displayAllTokenInfo(txt, a1);
		System.out.println("------------------------------");
		AnalyzerUtils.displayAllTokenInfo(txt, a2);
		System.out.println("------------------------------");
		AnalyzerUtils.displayAllTokenInfo(txt, a3);
		System.out.println("------------------------------");
		AnalyzerUtils.displayAllTokenInfo(txt, a4);
		
		System.out.println("------------------------------");
		AnalyzerUtils.displayAllTokenInfo(txt1, a5);
		
	}
}

Applying the MMSeg Chinese analyzer in Lucene:

[Figure 3: MMSeg Chinese analysis output in Lucene]
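Beyond inspecting tokens, the analyzer is normally plugged into indexing. Here is a rough sketch of how MMSegAnalyzer is used at index time, assuming Lucene 3.5 and mmseg4j 1.8.5; the demo class name and the in-memory RAMDirectory are illustrative, not from the original post:

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class MMSegIndexDemo {

	public static void main(String[] args) throws Exception {
		//MMSeg Chinese analyzer; point the argument at your mmseg4j dictionary directory
		MMSegAnalyzer analyzer = new MMSegAnalyzer(
					new File("D:\\tools\\javaTools\\lucene\\mmseg4j-1.8.5\\data"));

		//in-memory index used only for this demonstration
		Directory dir = new RAMDirectory();
		IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
		IndexWriter writer = new IndexWriter(dir, config);

		//the Chinese text in the "content" field is tokenized by MMSegAnalyzer at index time
		Document doc = new Document();
		doc.add(new Field("content", "我来自中国云南昭通昭阳区师专",
					Field.Store.YES, Field.Index.ANALYZED));
		writer.addDocument(doc);
		writer.close();
	}
}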
