Standard Analyzer
- The default analyzer, suitable for most languages.
- Splits text into tokens as defined by the Unicode Text Segmentation algorithm.
- Removes most punctuation, lowercases the tokens, and supports removing stop words.

The standard analyzer consists of:
- Tokenizer
  - Standard Tokenizer
- Token Filters
  - Standard Token Filter
  - Lower Case Token Filter
  - Stop Token Filter (disabled by default)
Standard Analyzer Example
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ].
Standard Analyzer Configuration
Parameter | Description |
---|---|
max_token_length | The maximum token length; a token longer than this is split at max_token_length intervals. Defaults to 255. |
stopwords | A pre-defined stop word list (such as _english_) or an array of stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
Example
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]: jumped is split into jumpe and d because it exceeds max_token_length (5), and the is dropped as an English stop word.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_array_analyzer": {
          "type": "standard",
          "stopwords": ["the", "2", "quick", "brown", "foxes", "jumped", "over", "dog's", "bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_array_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ lazy ], since every other token appears in the stop word array.
Simple Analyzer
- The simple analyzer splits text on any non-letter character and lowercases the tokens.

The simple analyzer consists of:
- Tokenizer
  - Lower Case Tokenizer
Simple Analyzer Example
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ].
Whitespace Analyzer
- The whitespace analyzer splits text whenever it encounters a whitespace character.

The whitespace analyzer consists of:
- Tokenizer
  - Whitespace Tokenizer
Whitespace Analyzer Example
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ].
Stop Analyzer
- Similar to the simple analyzer, but also supports removing stop words. Uses the _english_ stop word list by default.

The stop analyzer consists of:
- Tokenizer
  - Lower Case Tokenizer
- Token Filters
  - Stop Token Filter
Stop Analyzer Example
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ].
Stop Analyzer Configuration
Parameter | Description |
---|---|
stopwords | A pre-defined stop word list (such as _english_) or an array of stop words. Defaults to _english_. |
stopwords_path | The path to a file containing stop words, relative to the Elasticsearch config directory. A file-based sketch follows the example below. |
Example
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "2", "quick", "brown", "foxes", "jumped", "over", "dog", "s", "bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ lazy ].
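The stopwords_path option is not exercised above. A minimal sketch, assuming a file stopwords/my_stopwords.txt placed under the Elasticsearch config directory (the file name is made up for illustration; the file lists one stop word per line):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "stopwords/my_stopwords.txt"
        }
      }
    }
  }
}

Analyzing text with my_file_stop_analyzer would then drop every token that appears in that file.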
Keyword Analyzer
- Does not split the text at all: the entire input is emitted as a single token.

The keyword analyzer consists of:
- Tokenizer
  - Keyword Tokenizer
Keyword Analyzer Example
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ].
Pattern Analyzer
- Splits text according to a regular expression that matches the token separators; the default pattern is \W+ (any sequence of non-word characters).

The pattern analyzer consists of:
- Tokenizer
  - Pattern Tokenizer
- Token Filters
  - Lower Case Token Filter
  - Stop Token Filter (disabled by default)
Pattern Analyzer Example
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ].
Pattern Analyzer Configuration
Parameter | Description |
---|---|
pattern | A Java regular expression. Defaults to \W+. |
flags | Java regular expression flags; separate multiple flags with \|, for example "CASE_INSENSITIVE\|COMMENTS". A sketch using flags follows the example below. |
lowercase | Whether to lowercase the tokens. Defaults to true. |
stopwords | A pre-defined stop word list (such as _english_) or an array of stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
Example
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_pattern_analyzer",
  "text": "John_Smith@foo-bar.com"
}
Produces [ john, smith, foo, bar, com ].
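The flags parameter is not exercised above. A minimal sketch, with made-up index and analyzer names: the COMMENTS flag lets the pattern contain whitespace and an inline # comment, so the effective pattern is still \W+.

PUT my_flags_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_flags_analyzer": {
          "type": "pattern",
          "pattern": "\\W+   # split on runs of non-word characters",
          "flags": "CASE_INSENSITIVE|COMMENTS",
          "lowercase": true
        }
      }
    }
  }
}

POST my_flags_index/_analyze
{
  "analyzer": "my_flags_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

This should produce the same tokens as the default pattern analyzer example above, since the flags only change how the pattern is parsed, not what it matches here.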
Fingerprint Analyzer
- Lowercases the tokens, normalizes them by removing extended characters (ASCII folding), sorts and deduplicates them, and concatenates them into a single token.
- Stop words can also be configured.

The fingerprint analyzer consists of:
- Tokenizer
  - Standard Tokenizer
- Token Filters (applied in order)
  - Lower Case Token Filter
  - ASCII Folding Token Filter
  - Stop Token Filter (disabled by default)
  - Fingerprint Token Filter
Fingerprint Analyzer Example
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
Produces [ and consistent godel is said sentence this yes ].
Fingerprint Analyzer Configuration
Parameter | Description |
---|---|
separator | The character used to concatenate the terms. Defaults to a space. |
max_output_size | The maximum length of the output token; if the concatenated fingerprint exceeds it, the whole token is discarded (not just the excess part). Defaults to 255. |
stopwords | A pre-defined stop word list (such as _english_) or an array of stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
Example
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
Produces [ consistent godel said sentence yes ].
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_",
          "separator": "-"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
Produces [ consistent-godel-said-sentence-yes ].
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "max_output_size": 30
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
Produces nothing: the concatenated fingerprint exceeds max_output_size (30), so the entire token is discarded.
Additional Notes
- Whitespace splits on whitespace, while Standard extracts words. For example, Whitespace leaves Brown-Foxes unchanged, whereas Standard splits it into brown and foxes.
- Simple splits on every non-letter character, while Standard does not always. For example, Simple splits dog's into dog and s, whereas Standard keeps dog's.
- In short: Whitespace splits on whitespace, Simple splits on non-letters, and Standard splits out words (possessive forms included). A side-by-side sketch follows this list.
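A minimal side-by-side sketch of the three analyzers on the same phrase; the annotated outputs simply restate the behaviour shown in the examples above:

# Whitespace: [ Brown-Foxes, jumped, over, the, dog's, bone ]
POST _analyze
{ "analyzer": "whitespace", "text": "Brown-Foxes jumped over the dog's bone" }

# Simple: [ brown, foxes, jumped, over, the, dog, s, bone ]
POST _analyze
{ "analyzer": "simple", "text": "Brown-Foxes jumped over the dog's bone" }

# Standard: [ brown, foxes, jumped, over, the, dog's, bone ]
POST _analyze
{ "analyzer": "standard", "text": "Brown-Foxes jumped over the dog's bone" }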
Custom Analyzer
A custom analyzer is built from:
- zero or more character filters
- one tokenizer
- zero or more token filters
Custom Analyzer Configuration
Parameter | Description |
---|---|
tokenizer | A built-in or custom tokenizer. |
char_filter | An optional array of built-in or custom character filters. |
filter | An optional array of built-in or custom token filters. |
position_increment_gap | When a field value is an array with multiple values, an artificial gap is added to the token positions between values so that phrase queries cannot match across values. Defaults to 100. For example, [ "John Abraham", "Lincoln Smith" ] ends up with positions 1, 2, 103, 104 after analysis, which prevents cross-value matches. See the position_increment_gap section of the Mapping article for details; a brief sketch follows this table. |
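A minimal sketch of position_increment_gap, assuming Elasticsearch 7+ typeless mapping syntax; the index and field names are made up for illustration, and the parameter is set here on the field mapping, where it is most commonly used:

# Field mapping with the default gap of 100 between array values
PUT my_gap_index
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 100
      }
    }
  }
}

# Index a document whose field value is an array
PUT my_gap_index/_doc/1
{
  "names": [ "John Abraham", "Lincoln Smith" ]
}

# Phrase query spanning the two values; should NOT match because of the gap
GET my_gap_index/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}

Setting "position_increment_gap": 0 on the field would remove the gap and allow the phrase to match across the two values.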
Example 1:
- Character Filter
  - HTML Strip Character Filter
- Tokenizer
  - Standard Tokenizer
- Token Filters
  - Lowercase Token Filter
  - ASCII-Folding Token Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
Produces [ is, this, deja, vu ].
Example 2:
- Character Filter
  - Mapping Character Filter: replaces :) with _happy_ and :( with _sad_
- Tokenizer
  - Pattern Tokenizer
- Token Filters
  - Lowercase Token Filter
  - Stop Token Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
Produces [ i'm, _happy_, person, you ].
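To use a custom analyzer at index time it must be referenced from a field mapping. A rough sketch, assuming Elasticsearch 7+ typeless mapping syntax and a made-up field name content, that attaches my_custom_analyzer from the example above:

PUT my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}

Text indexed into content would then pass through the emoticons character filter, the punctuation tokenizer, and the lowercase and english_stop token filters.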