Lucene 3.0.2: Adding a Custom Analyzer

An Analyzer has two jobs: it tokenizes document text so the extracted terms can be indexed, and it tokenizes query strings so they match those terms. Below is the simplest possible custom analyzer: if a token is "afei" it becomes a term, otherwise it is dropped. As a tokenizer it is admittedly pointless, since it can only ever emit one distinct term, but it shows all the moving parts.

Analyzer: when you construct an IndexWriter you pass in an Analyzer, and Lucene calls its tokenStream() method to obtain the corresponding TokenStream. Here that returns our own AfeiCiGenFilter, which is itself a TokenStream.


package org.apache.lucene.analysis;

import java.io.Reader;

public class AfeiCiGenAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // wrap a LowerCaseTokenizer so the filter only ever sees
        // lower-cased, letter-only tokens
        return new AfeiCiGenFilter(new LowerCaseTokenizer(reader));
    }
}


Filter:

Lucene repeatedly calls incrementToken() to ask whether there is another token; when there is, the token is placed in termAtt, the TermAttribute that was registered when the filter was constructed. That is all it takes for the simplest of filters. Because the TokenStream we hand to AfeiCiGenFilter is a LowerCaseTokenizer, the tokens it receives are already lower-cased English words. The way Lucene chains analyzers, tokenizers, and filters together is a genuinely elegant design, well worth studying.

package org.apache.lucene.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AfeiCiGenFilter extends TokenFilter {

    private final TermAttribute termAtt;

    protected AfeiCiGenFilter(TokenStream input) {
        super(input);
        termAtt = addAttribute(TermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // keep pulling tokens from the wrapped stream until we find "afei"
        while (input.incrementToken()) {
            char[] buffer = termAtt.termBuffer();
            // check the length first: an unchecked buffer[3] access would
            // throw on tokens shorter than four characters, and without the
            // length test any token merely *starting* with "afei" would match
            if (termAtt.termLength() == 4
                    && buffer[0] == 'a' && buffer[1] == 'f'
                    && buffer[2] == 'e' && buffer[3] == 'i') {
                return true;
            }
        }
        return false;
    }
}
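To see the filter in action, here is a minimal test sketch (the class name AfeiFilterTest and the sample sentence are mine, not part of the original code):

package org.apache.lucene.analysis;

import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AfeiFilterTest {
    public static void main(String[] args) throws Exception {
        // run the full analyzer chain over a sample sentence
        TokenStream stream = new AfeiCiGenAnalyzer().tokenStream(
                "contents", new StringReader("Hello Afei, AFEI and friends"));
        TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            // LowerCaseTokenizer has already lower-cased everything,
            // so both "Afei" and "AFEI" come through as "afei"
            System.out.println(termAtt.term());
        }
    }
}

This should print "afei" twice; every other word is swallowed by the filter.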


To use it, all we have to do is pass our Analyzer when constructing the IndexWriter.
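For example, a minimal indexing sketch against the Lucene 3.0 API (the index directory "index", the field name, and the sample text are made up for illustration):

package org.apache.lucene.demo;

import java.io.File;

import org.apache.lucene.analysis.AfeiCiGenAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AfeiIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("index"));
        // pass our custom Analyzer to the IndexWriter; it will be used
        // to tokenize every ANALYZED field we add
        IndexWriter writer = new IndexWriter(dir, new AfeiCiGenAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("contents", "Afei met AFEI and afei",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc); // only the "afei" tokens end up in the index
        writer.close();
    }
}

On the search side, the same analyzer would normally be passed to QueryParser so that queries get tokenized the same way as the indexed text.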


Attached below is a piece of code for experimenting with Analyzers and TokenStreams; I have found it quite useful.

package org.apache.lucene.demo;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class AnalyseDemo {

    private static final String[] examples = {
            "The quick brown fox jumped over the lazy dogs",
            "XY&Z Corporation - [email protected]",
            "最后我们记住的不是敌人 的残暴,而是 朋友的 冷漠" };

    private static final Analyzer[] analyzers = new Analyzer[] {
            new WhitespaceAnalyzer(),
            new SimpleAnalyzer(),
            new StopAnalyzer(Version.LUCENE_29),
            new StandardAnalyzer(Version.LUCENE_29) };

    // raw tokenizers, all run over the Chinese example string
    private static final TokenStream[] tokenStreams = new TokenStream[] {
            new LetterTokenizer(new StringReader(examples[2])),
            new WhitespaceTokenizer(new StringReader(examples[2])),
            new LowerCaseTokenizer(new StringReader(examples[2])) };

    public static void main(String[] args) {
        // run every analyzer over every example string
        for (String example : examples) {
            analyze(example);
        }
        // then run the raw tokenizers directly
        for (TokenStream stream : tokenStreams) {
            System.out.println("name: " + stream.getClass().getName());
            TermAttribute termAttribute = stream.addAttribute(TermAttribute.class);
            try {
                while (stream.incrementToken()) {
                    System.out.println(termAttribute.term());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static void analyze(String text) {
        System.out.println("analyzing : " + text);
        for (Analyzer analyzer : analyzers) {
            String name = analyzer.getClass().getName();
            System.out.println("full name: " + name);
            name = name.substring(name.lastIndexOf('.') + 1);
            System.out.println("name: " + name);
            AnalyzerUtils.displayTokens(analyzer, text);
        }
    }
}

class AnalyzerUtils {

    public static void displayTokens(Analyzer analyzer, String text) {
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
        try {
            while (stream.incrementToken()) {
                System.out.println(termAtt.term());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
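Running this makes the differences obvious. For the first example, WhitespaceAnalyzer keeps every word exactly as typed, SimpleAnalyzer lower-cases and splits on non-letters, and StopAnalyzer/StandardAnalyzer additionally drop stop words such as "the". On the Chinese example none of these analyzers performs real word segmentation: the whitespace and letter tokenizers just split at the spaces and punctuation, and StandardAnalyzer should fall back to emitting one token per CJK character, which is exactly why knowing how to write a custom analyzer matters.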
