lucene

Analyzer,文本分析的过程,实质上是将输入文本转化为文本特征向量的过程。
Analyzer包含两个核心组件,Tokenizer以及TokenFilter。两者的区别在于,前者在字符级别处理流,而后者则在词语级别处理流。Tokenizer是Analyzer的第一步,其构造函数接收一个Reader作为参数,而TokenFilter则是一个类似拦截器的东东,其参数可以是TokenStream、Tokenizer,甚至是另一个TokenFilter。整个Lucene Analyzer的过程如下图所示:



1.Analyzer类:

Analyzer类是一个抽象类,是所有分析器的基类。为了定义分析器的具体工作
其子类必须通过自己的createComponents(String, Reader)方法来定义TokenStreamComponents。
在调用方法tokenStream(String, Reader)的时候,TokenStreamComponents会被重复使用。


2.TokenStreamComponents
Normal 0 7.8 磅 0 2 false false false EN-US ZH-CN X-NONE</w:LidThemeComplexScript. <style. /* Style. Definitions */ table.MsoNormalTable {mso-style-name:普通表格; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-parent:""; mso-padding-alt:0cm 5.4pt 0cm 5.4pt; mso-para-margin:0cm; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.5pt; mso-bidi-font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi; mso-font-kerning:1.0pt;} </style. 该类封装了一个分词流的外部组件,简单封装输入Tokenizer和输出TokenStream

3.ReuseStrategy
Normal 0 7.8 磅 0 2 false false false EN-US ZH-CN X-NONE</w:LidThemeComplexScript. <style. /* Style. Definitions */ table.MsoNormalTable {mso-style-name:普通表格; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-parent:""; mso-padding-alt:0cm 5.4pt 0cm 5.4pt; mso-para-margin:0cm; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.5pt; mso-bidi-font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi; mso-font-kerning:1.0pt;} </style. 定义Analyzer每次调用tokenStream(String, java.io.Reader)时,TokenStreamComponents的重用策略。
getReusableComponents(String fieldName)根据字段名获取可重用的TokenStreamComponents组建。
private CloseableThreadLocal<Object> storedValue = new CloseableThreadLocal<Object>();
用于存储已有的TokenStreamComponents。
4.GlobalReuseStrategy
Normal 0 7.8 磅 0 2 false false false EN-US ZH-CN X-NONE</w:LidThemeComplexScript. <style. /* Style. Definitions */ table.MsoNormalTable {mso-style-name:普通表格; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-parent:""; mso-padding-alt:0cm 5.4pt 0cm 5.4pt; mso-para-margin:0cm; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.5pt; mso-bidi-font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi; mso-font-kerning:1.0pt;} </style. 公共重用策略,所有的字段共用同一个TokenStreamComponents。

5.PerFieldReuseStrategy

为每一个field维持一个TokenStreamComponent Map。
key=field   value=TokenStreamComponent
6.AnalyzerWrapper
使用PerFieldReuseStrategy作为重用策略的Analyzer封装类

你可能感兴趣的:(Lucene)