Lucene分析器的實現

 public abstract class Analyzer {
    public abstract TokenStream tokenStream(String fieldName, Reader reader);
   *
   * @param fieldName Field name being indexed.
   * @return position increment gap, added to the next token emitted from {@link #tokenStream(String,Reader)}
   */
  public int getPositionIncrementGap(String fieldName)
  {
    return 0;
  }
}
String content = "...";
StringReader reader = new StringReader(content);

Analyzer analyzer = new ....();
TokenStream ts = analyzer.tokenStream("",reader);
//開始分詞
Token t = null;
while ((t = ts.next()) != null){
      System.out.println(t.termText());
}

 

分析器由兩部分組成。一部分是分詞器,被稱Tokenizer, 另一部分是過濾器,TokenFilter. 它們都繼承自TokenStream。一個分析器往由一個分詞器和多個過濾器組成。

public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input;

  /** Construct a tokenizer with null input. */
  protected Tokenizer() {}

  /** Construct a token stream processing the given input. */
  protected Tokenizer(Reader input) {
    this.input = input;
  }

  /** By default, closes the input Reader. */
  public void close() throws IOException {
    input.close();
  }
}

 

public abstract class TokenFilter extends TokenStream {
  /** The source of tokens for this filter. */
  protected TokenStream input;

  /** Construct a token stream filtering the given input. */
  protected TokenFilter(TokenStream input) {
    this.input = input;
  }

  /** Close the input TokenStream. */
  public void close() throws IOException {
    input.close();
  }

}

 

StandardAnalyer的tokenStream方法,除了使用StatandTokenizer進行分詞外,還使用了3個Filtter:

  • StandardFilter  標準過濾器,主要對切分出來的省略語(如He's的's),和以"."號分隔的縮略語進行處理。
  • LowerCaseFilter 大小寫轉換器,將大寫轉為小寫
  • StopFilter 忽略詞過濾器,構造其實例時,需傳入一個忽略詞集合。

 

 public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    return result;
  }

 
stopSet在構造StandardAnalyer時指定,無構造參加時,使用默認的StopAnalyzer.ENGLISH_STOP_WORDS提供的過濾詞。

  

你可能感兴趣的:(Lucene)