Lucene classes related to tokenization (analysis)

TokenStream

org.apache.lucene.analysis.TokenStream

An abstract class. A TokenStream enumerates a sequence of tokens, either from the fields of a Document or from query text.


TokenStream org.apache.lucene.analysis.Analyzer.tokenStream(String fieldName, Reader reader)
Creates a TokenStream that tokenizes all the text in the provided Reader, using this Analyzer.


void org.apache.lucene.analysis.TokenStream.reset() throws IOException
Resets this stream to the beginning, i.e. moves the TokenStream's cursor back to its initial position.


boolean org.apache.lucene.analysis.TokenStream.incrementToken() throws IOException
Consumers (e.g., IndexWriter) call this method to advance the stream to the next token. It returns false once the stream is exhausted.

org.apache.lucene.analysis.tokenattributes.CharTermAttribute
The term text of a token.


<CharTermAttribute> CharTermAttribute org.apache.lucene.util.AttributeSource.getAttribute(Class<CharTermAttribute> attClass)
Returns the instance of the given Attribute class contained in this AttributeSource. The caller must pass in a Class<? extends Attribute> value.
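The consumer workflow described above (fetch the term attribute, reset the stream, then loop on incrementToken) can be sketched in plain Java. The classes below are hypothetical stand-ins written for this note, not Lucene's own classes; they only mirror the shape of the contract so that the example runs without the Lucene jar. With real Lucene, the corresponding calls would be Analyzer.tokenStream(...), getAttribute(CharTermAttribute.class), reset() and incrementToken().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for CharTermAttribute: holds the current token's text. */
final class TermText {
    private String value = "";
    void set(String v) { value = v; }
    @Override public String toString() { return value; }
}

/** Hypothetical stand-in for a TokenStream that yields whitespace-separated words. */
final class WhitespaceStreamSketch {
    private final String[] words;
    private final TermText term = new TermText();
    private int next = 0;

    WhitespaceStreamSketch(String text) {
        String trimmed = text.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Plays the role of getAttribute(CharTermAttribute.class). */
    TermText getTermAttribute() { return term; }

    /** Moves the cursor back to the first token. */
    void reset() { next = 0; }

    /** Advances to the next token; returns false when no tokens are left. */
    boolean incrementToken() {
        if (next >= words.length) return false;
        term.set(words[next++]);
        return true;
    }
}

final class ConsumeLoopSketch {
    /** Collects every token of the stream, following the consumer workflow. */
    static List<String> consume(WhitespaceStreamSketch stream) {
        TermText term = stream.getTermAttribute(); // fetched once, before the loop
        List<String> out = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            out.add(term.toString()); // the same object now holds the new token
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(consume(new WhitespaceStreamSketch("Hello Lucene token stream")));
    }
}
```

The attribute object is obtained once and re-read after every advance, because the stream overwrites it in place rather than handing out a new object per token.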


Tokenizer

org.apache.lucene.analysis.Tokenizer
A Tokenizer is a TokenStream whose input is a Reader.

TokenFilter

org.apache.lucene.analysis.TokenFilter
A TokenFilter is a TokenStream whose input is another TokenStream. It is used to filter or transform the tokens of the wrapped stream.

org.apache.lucene.analysis.LowerCaseFilter
Normalizes token text to lower case.

org.apache.lucene.analysis.StopFilter
Removes stop words from a token stream.
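Since a TokenFilter is a TokenStream that wraps another TokenStream, filters can be stacked on top of a Tokenizer. The following is a minimal plain-Java sketch of that chaining idea. All class names are hypothetical and deliberately simplified: each stage keeps its own term field, whereas this sketch does not attempt to reproduce how Lucene exposes token text through attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical, simplified stand-in for TokenStream (not Lucene's class). */
abstract class SketchStream {
    protected String term = "";
    /** Advances to the next token; returns false at the end of the stream. */
    abstract boolean incrementToken();
    String term() { return term; }
}

/** Source of the chain, like a Tokenizer: splits raw text on whitespace. */
final class SketchWhitespaceTokenizer extends SketchStream {
    private final String[] words;
    private int next = 0;

    SketchWhitespaceTokenizer(String text) {
        String trimmed = text.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override boolean incrementToken() {
        if (next >= words.length) return false;
        term = words[next++];
        return true;
    }
}

/** Like LowerCaseFilter: rewrites each token of the wrapped stream. */
final class SketchLowerCaseFilter extends SketchStream {
    private final SketchStream input;
    SketchLowerCaseFilter(SketchStream input) { this.input = input; }

    @Override boolean incrementToken() {
        if (!input.incrementToken()) return false;
        term = input.term().toLowerCase(Locale.ROOT);
        return true;
    }
}

/** Like StopFilter: skips tokens of the wrapped stream that are stop words. */
final class SketchStopFilter extends SketchStream {
    private final SketchStream input;
    private final Set<String> stopWords;

    SketchStopFilter(SketchStream input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    @Override boolean incrementToken() {
        while (input.incrementToken()) {
            if (!stopWords.contains(input.term())) {
                term = input.term();
                return true;
            }
        }
        return false;
    }
}

final class FilterChainSketch {
    /** Builds tokenizer -> lower-case filter -> stop filter and drains it. */
    static List<String> analyze(String text, Set<String> stopWords) {
        SketchStream chain = new SketchStopFilter(
                new SketchLowerCaseFilter(new SketchWhitespaceTokenizer(text)), stopWords);
        List<String> out = new ArrayList<>();
        while (chain.incrementToken()) out.add(chain.term());
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(analyze("The Quick brown fox AND the dog", stop));
    }
}
```

The order of the stages matters in this sketch: lower-casing runs before stop-word removal, so a lower-case stop list also catches "The" and "AND".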

Analyzer

org.apache.lucene.analysis.KeywordAnalyzer
"Tokenizes" the entire stream as a single token. This is useful for data such as zip codes, ids, and some product names.

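To illustrate why treating the whole input as one token helps for values such as postal codes, here is a small plain-Java sketch. Both helper methods are hypothetical and do not call Lucene; they only contrast "whole input as one token" with whitespace splitting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeywordSketch {
    /** Hypothetical helper: the whole input becomes exactly one token, unchanged. */
    static List<String> keywordTokens(String text) {
        List<String> out = new ArrayList<>();
        out.add(text);
        return out;
    }

    /** For contrast: splitting on whitespace breaks the value into pieces. */
    static List<String> whitespaceTokens(String text) {
        return new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        String postcode = "SW1A 1AA";
        System.out.println(keywordTokens(postcode));    // one token containing the space
        System.out.println(whitespaceTokens(postcode)); // two separate tokens
    }
}
```

With the single-token form, an exact-match lookup on the full value stays possible, while the split form would also match on either half alone.
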
org.apache.lucene.analysis.ReusableAnalyzerBase
A convenience subclass of Analyzer that makes it easy to implement TokenStream reuse.
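The idea behind TokenStream reuse can be illustrated with a small plain-Java sketch. It assumes that reuse means keeping one stream object per thread and pointing it at new input, instead of allocating a new stream for every call. The class and method names are hypothetical and are not ReusableAnalyzerBase's API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stream that can be pointed at new input and consumed again. */
final class ResettableWordStream {
    private String[] words = new String[0];
    private int next = 0;
    private String term = "";

    /** Replaces the input and rewinds the cursor. */
    void setInput(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        next = 0;
    }

    boolean incrementToken() {
        if (next >= words.length) return false;
        term = words[next++];
        return true;
    }

    String term() { return term; }
}

/** Hypothetical analyzer that caches one stream per thread and reuses it. */
final class ReusingAnalyzerSketch {
    private final ThreadLocal<ResettableWordStream> cached = new ThreadLocal<>();

    ResettableWordStream tokenStream(String text) {
        ResettableWordStream stream = cached.get();
        if (stream == null) {          // first call on this thread: create once
            stream = new ResettableWordStream();
            cached.set(stream);
        }
        stream.setInput(text);         // later calls: only swap the input
        return stream;
    }

    /** Reads all remaining tokens from the stream. */
    static List<String> drain(ResettableWordStream stream) {
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) out.add(stream.term());
        return out;
    }

    public static void main(String[] args) {
        ReusingAnalyzerSketch analyzer = new ReusingAnalyzerSketch();
        System.out.println(drain(analyzer.tokenStream("first document")));
        System.out.println(drain(analyzer.tokenStream("second document")));
    }
}
```

A consequence of this design is that a stream returned by an earlier call must be fully consumed before the next call on the same thread, because both calls hand back the same object.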

