public abstract class Analyzer {
  public abstract TokenStream tokenStream(String fieldName, Reader reader);

  /**
   * @param fieldName Field name being indexed.
   * @return position increment gap, added to the next token emitted from {@link #tokenStream(String,Reader)}
   */
  public int getPositionIncrementGap(String fieldName) {
    return 0;
  }
}
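getPositionIncrementGap lets a subclass insert a position gap between the values of a multi-valued field, so phrase queries do not match across value boundaries. Below is a minimal sketch, assuming the classic (pre-2.9) Lucene API and its WhitespaceTokenizer; the class name GapAnalyzer is made up for illustration:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Hypothetical subclass: overrides getPositionIncrementGap so that
// successive values of the same field sit 100 positions apart.
public class GapAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new WhitespaceTokenizer(reader);
  }

  public int getPositionIncrementGap(String fieldName) {
    return 100;
  }
}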
String content = "...";
StringReader reader = new StringReader(content);
Analyzer analyzer = new ....();
TokenStream ts = analyzer.tokenStream("", reader);
// start tokenizing
Token t = null;
while ((t = ts.next()) != null) {
  System.out.println(t.termText());
}
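For reference, here is a complete, runnable variant of the loop above. It assumes the classic pre-2.9 Lucene API (TokenStream.next() returning a Token) and picks SimpleAnalyzer as the concrete analyzer; the class name TokenDemo is arbitrary:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDemo {
  public static void main(String[] args) throws IOException {
    // SimpleAnalyzer lowercases and splits on non-letter characters.
    Analyzer analyzer = new SimpleAnalyzer();
    TokenStream ts = analyzer.tokenStream("", new StringReader("The quick brown Fox"));
    Token t;
    while ((t = ts.next()) != null) {
      System.out.println(t.termText()); // prints: the, quick, brown, fox
    }
    ts.close();
  }
}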
An analyzer is made of two kinds of components: a tokenizer (Tokenizer) and filters (TokenFilter). Both extend TokenStream. An analyzer is usually built from one tokenizer and several filters.
public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input;

  /** Construct a tokenizer with null input. */
  protected Tokenizer() {}

  /** Construct a token stream processing the given input. */
  protected Tokenizer(Reader input) {
    this.input = input;
  }

  /** By default, closes the input Reader. */
  public void close() throws IOException {
    input.close();
  }
}
public abstract class TokenFilter extends TokenStream {
  /** The source of tokens for this filter. */
  protected TokenStream input;

  /** Construct a token stream filtering the given input. */
  protected TokenFilter(TokenStream input) {
    this.input = input;
  }

  /** Close the input TokenStream. */
  public void close() throws IOException {
    input.close();
  }
}
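To make this wrapping pattern concrete, here is a hedged sketch of a custom filter against the same classic API: a hypothetical MinLengthFilter that pulls tokens from the stream it wraps and drops those shorter than a minimum length:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class MinLengthFilter extends TokenFilter {
  private final int minLength;

  public MinLengthFilter(TokenStream input, int minLength) {
    super(input);
    this.minLength = minLength;
  }

  // Ask the wrapped stream for tokens until one is long enough.
  public Token next() throws IOException {
    Token t;
    while ((t = input.next()) != null) {
      if (t.termText().length() >= minLength) {
        return t;
      }
    }
    return null; // wrapped stream is exhausted
  }
}

It would be chained exactly like the filters in the StandardAnalyzer example below, e.g. result = new MinLengthFilter(result, 3);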
StandardAnalyzer's tokenStream method uses a StandardTokenizer to split the text and then chains three filters:
public TokenStream tokenStream(String fieldName, Reader reader) {
  TokenStream result = new StandardTokenizer(reader);
  result = new StandardFilter(result);
  result = new LowerCaseFilter(result);
  result = new StopFilter(result, stopSet);
  return result;
}
stopSet is specified when the StandardAnalyzer is constructed; with the no-argument constructor, the default stop words from StopAnalyzer.ENGLISH_STOP_WORDS are used.
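For illustration, assuming the StandardAnalyzer(String[] stopWords) constructor of the classic releases, a custom stop list can be supplied like this (the class name CustomStops is arbitrary):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class CustomStops {
  public static void main(String[] args) {
    // With an argument: the given words become the stop set.
    Analyzer custom = new StandardAnalyzer(new String[] { "the", "a", "an" });
    // Without arguments: StopAnalyzer.ENGLISH_STOP_WORDS is used.
    Analyzer standard = new StandardAnalyzer();
    System.out.println(custom + " / " + standard);
  }
}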