Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis.html
Analyzer (analyzer)
An analyzer is made up of the following properties:
type: the analyzer type, e.g. custom
char_filter: zero or more character filters
tokenizer: exactly one tokenizer
filter: zero or more token filters, applied in order
Character filters
Character filters (also called preprocessing filters) preprocess the character stream before it is passed to the tokenizer.
There are three character filters:
1. html_strip: HTML tag character filter
Features:
a. Strips HTML tags from the raw text.
Optional settings:
escaped_tags: an array of HTML tags that should not be stripped from the raw text.
example:
GET _analyze
{
"tokenizer": "keyword",
"char_filter": [ "html_strip" ],
"text": "<p>I&apos;m so <b>happy</b>!</p>"
}
{
"tokens": [
{
"token": """
I'm so happy!
""",
"start_offset": 0,
"end_offset": 32,
"type": "word",
"position": 0
}
]
}
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": ["my_char_filter"]
}
},
"char_filter": {
"my_char_filter": {
"type": "html_strip",
"escaped_tags": ["b"] // tags that should not be stripped from the raw text
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "<p>I&apos;m so <b>happy</b>!</p>"
}
{
"tokens": [
{
"token": """
I'm so <b>happy</b>!
""",
"start_offset": 0,
"end_offset": 32,
"type": "word",
"position": 0
}
]
}
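The effect of html_strip can be approximated in a few lines of Python (a rough sketch, not the real Lucene filter — the real one also inserts newlines around block-level tags such as `<p>`):

```python
import re
from html import unescape

def html_strip(text):
    # drop anything that looks like an HTML tag, then decode entities
    no_tags = re.sub(r"<[^>]+>", "", text)
    return unescape(no_tags)

print(html_strip("<p>I&apos;m so <b>happy</b>!</p>"))  # I'm so happy!
```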
2. mapping: mapping character filter
Features:
a. The mapping char filter accepts an array of key/value pairs. Whenever it encounters a string identical to a key, it replaces it with the value associated with that key.
b. Matching is greedy; the longest matching pattern wins.
c. Replacement with an empty string is allowed.
Optional settings:
mappings: an array of key => value mappings
mappings_path: a path to a file containing the key => value mappings
example:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"& => and",
"$ => ¥"
]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "My license plate is $203 & $110"
}
{
"tokens": [
{
"token": "My license plate is ¥203 and ¥110",
"start_offset": 0,
"end_offset": 31,
"type": "word",
"position": 0
}
]
}
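The greedy, longest-match-wins behavior (feature b) can be sketched in Python. This is a simplified model of the filter, not the real implementation; mapping_filter is an illustrative helper:

```python
def mapping_filter(text, mappings):
    # try the longest keys first so the longest match wins (greedy matching)
    keys = sorted(mappings, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(mappings[k])
                i += len(k)
                break
        else:  # no key matched at position i: keep the character as-is
            out.append(text[i])
            i += 1
    return "".join(out)

print(mapping_filter("My license plate is $203 & $110", {"&": "and", "$": "¥"}))
# My license plate is ¥203 and ¥110
```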
3. pattern_replace: regex replacement character filter
Features:
a. Matches with a regular expression and replaces the matched characters with the specified replacement string.
b. The replacement string can refer to capture groups in the regex.
Optional settings:
pattern: a Java regular expression. Required.
replacement: the replacement string, which may reference capture groups using the $1..$9 syntax.
flags: Java regex flags, pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".
example:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "My credit card is 123-456-789"
}
{
"tokens": [
{
"token": "My credit card is 123_456_789",
"start_offset": 0,
"end_offset": 29,
"type": "word",
"position": 0
}
]
}
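Java and Python share this regex syntax (a capture group plus a lookahead), so the pattern above can be checked locally with Python's re module:

```python
import re

# (\d+)-(?=\d): capture the digits, consume the "-", require a digit to follow
print(re.sub(r"(\d+)-(?=\d)", r"\1_", "My credit card is 123-456-789"))
# My credit card is 123_456_789
```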
Tokenizers
- Standard Tokenizer
Features:
The standard tokenizer works well for most European languages and supports Unicode.
Optional settings:
max_token_length: the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
ex:
POST _analyze
{
"tokenizer": "standard",
"text": "The 2 QUICK Brown-Foxes of dog's bone."
}
Result:
[The, 2, QUICK, Brown, Foxes, of, dog's, bone]
ex:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
}
}
}
}
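The effect of max_token_length can be modeled as chopping any over-long token into max_token_length-sized pieces. This is a sketch of the behavior; split_long is a hypothetical helper, not an Elasticsearch API:

```python
def split_long(tokens, max_len=5):
    out = []
    for t in tokens:
        # a token longer than max_len is continued in max_len-sized chunks
        out.extend(t[i:i + max_len] for i in range(0, len(t), max_len))
    return out

print(split_long(["The", "2", "QUICK", "jumped"]))
# ['The', '2', 'QUICK', 'jumpe', 'd']
```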
- Letter Tokenizer
Features:
Splits whenever it encounters a character that is not a letter.
Optional settings:
None.
ex:
POST _analyze
{
"tokenizer": "letter",
"text": "The 2 QUICK Brown-Foxes of dog's bone."
}
Result:
[The, QUICK, Brown, Foxes, of, dog, s, bone]
- Lowercase Tokenizer
Features:
Equivalent to the Letter Tokenizer combined with the lowercase token filter.
- Whitespace Tokenizer
Features:
Splits whenever it encounters a whitespace character.
Optional settings:
None.
- UAX URL Email Tokenizer
Features:
Like the standard tokenizer, but recognizes URLs and email addresses as single tokens.
Optional settings:
max_token_length: defaults to 255
ex:
POST _analyze
{
"tokenizer": "uax_url_email",
"text": "Email me at [email protected] http://www.baidu.com"
}
Result:
[Email, me, at, [email protected], http://www.baidu.com]
- Classic Tokenizer
Features:
A tokenizer built for English. It handles English acronyms, company names, email addresses, and most internet domain names well, but it does not work well for languages other than English.
Optional settings:
max_token_length: defaults to 255
- Thai Tokenizer
Features:
A tokenizer dedicated to Thai text.
- NGram Tokenizer
Features:
An n-gram is like a sliding window that moves across the word, emitting continuous character sequences of the specified lengths. N-grams are useful for querying languages that are written without spaces (e.g. German, Chinese).
Optional settings:
min_gram: the minimum n-gram length
max_gram: the maximum n-gram length
token_chars: the character classes kept in tokens; Elasticsearch splits on characters that do not belong to these classes. Defaults to [] (keep all characters).
Available token_chars values:
letter — for example a, b, ï or 京
digit — for example 3 or 7
whitespace — for example " " or "\n"
punctuation — for example ! or "
symbol — for example $ or √
ex:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter":["lowercase"]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "2 2311 Quick Foxes."
}
Result:
[231, 2311, 311, qui, quic, quick, uic, uick, ick, fox, foxe, foxes, oxe, oxes, xes]
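The sliding-window behavior behind that result can be reproduced with a short Python sketch (ngrams is an illustrative helper, not an Elasticsearch API):

```python
def ngrams(token, min_gram, max_gram):
    # every substring whose length lies in [min_gram, max_gram]
    return [token[i:i + n]
            for i in range(len(token))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(token)]

print(ngrams("quick", 3, 10))
# ['qui', 'quic', 'quick', 'uic', 'uick', 'ick']
```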
- Edge NGram Tokenizer
Features:
The edge_ngram tokenizer differs from the ngram tokenizer in that its n-grams are anchored at the start of the token. ngram suits match-anywhere search suggestions, while edge_ngram suits prefix autocomplete.
For example:
POST _analyze
{
"tokenizer": "ngram",
"text": "a Quick Foxes."
}
ngram test result:
["a", "a ", " ", " Q", "Q", "Qu", "u", "ui", "i", "ic", "c", "ck", "k", "k ", " ", " F", "F", "Fo", "o", "ox", "x", "xe", "e", "es", "s", "s.", "."]
POST _analyze
{
"tokenizer": "edge_ngram",
"text": "a Quick Foxes."
}
edge_ngram test result:
["a", "a "]
From the test results above:
By default, both ngram and edge_ngram treat "a Quick Foxes." as one whole string.
The default min_gram and max_gram for both are 1 and 2.
ngram slides a fixed-size window across the text (commonly used for match-anywhere search suggestions).
edge_ngram keeps the start position fixed and grows the window from min_gram to max_gram (commonly used for word autocomplete).
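The anchored-at-the-start behavior is easy to model in Python (edge_ngrams is an illustrative helper using the default min/max of 1 and 2):

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    # the window is anchored at the start and grows from min_gram to max_gram
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("a Quick Foxes."))  # ['a', 'a ']
```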
- Keyword Tokenizer
Features:
Emits the entire input as a single token.
Optional settings:
buffer_size: defaults to 256
- Pattern Tokenizer
Features:
Splits text with a regular expression.
Optional settings:
pattern: a Java regular expression; defaults to \W+.
flags: Java regex flags, pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".
group: which capture group to extract as tokens; defaults to -1 (split).
group = -1: the regex matches act as separators, and the text between them becomes the tokens.
group = 0: each full regex match becomes a token.
group = 1, 2, 3, ...: the text captured by the corresponding () group of each match becomes a token.
ex:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": "\"(.*)\"",
"flags": "",
"group": -1
}
}
}
}
}
Note: this matches strings wrapped in double quotes. Note the difference between "\"(.*)\"" and "\".*\"": both match a double-quoted string, but the former has a capture group and the latter does not.
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "comma,\"separated\",values"
}
Tokenization results:
With the default group = -1, the regex matches are used as separators: ["comma", "values"]
With group = 0, the full matched string becomes the token: ["\"separated\""]
With group = 1, the string captured by the first () becomes the token: ["separated"]
With group = 2, the string captured by the second () would become the token, but since this regex has only one (), an exception is thrown.
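The group semantics mirror ordinary Java/Python regex groups, so they can be verified locally with Python's re module:

```python
import re

text = 'comma,"separated",values'
m = re.search(r'"(.*)"', text)

print(m.group(0))  # "separated"  -- group 0 is the whole match, quotes included
print(m.group(1))  # separated    -- group 1 is the first () capture

try:
    m.group(2)     # the pattern has only one (), so group 2 does not exist
except IndexError:
    print("no group 2")
```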
- Path Hierarchy Tokenizer
Features:
The path_hierarchy tokenizer takes a hierarchical value such as a filesystem path, splits on the path separator, and emits a term for each component in the tree.
Optional settings:
delimiter: the character used to split the path; defaults to /.
replacement: an optional replacement character for the delimiter; defaults to the same value as delimiter.
buffer_size: the maximum split-path length; defaults to 1024.
reverse: whether to emit the tokens in reverse order; defaults to false.
skip: the number of initial tokens to skip; defaults to 0.
ex:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "path_hierarchy",
"delimiter": "-",
"replacement":"/",
"reverse": false,
"skip": 0
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "one-two-three-four"
}
{
"tokens": [
{
"token": "one",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "one/two",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "one/two/three",
"start_offset": 0,
"end_offset": 13,
"type": "word",
"position": 0
},
{
"token": "one/two/three/four",
"start_offset": 0,
"end_offset": 18,
"type": "word",
"position": 0
}
]
}
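The token stream above is just the sequence of ancestor paths, which a few lines of Python can reproduce (path_hierarchy here is an illustrative sketch of the tokenizer's output, using the delimiter/replacement settings from the example):

```python
def path_hierarchy(text, delimiter="-", replacement="/"):
    parts = text.split(delimiter)
    # one term per ancestor: one, one/two, one/two/three, ...
    return [replacement.join(parts[:i + 1]) for i in range(len(parts))]

print(path_hierarchy("one-two-three-four"))
# ['one', 'one/two', 'one/two/three', 'one/two/three/four']
```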
I. The 8 built-in analyzers
- standard analyzer: the default analyzer. It provides grammar-based tokenization (using the Unicode Text Segmentation algorithm, as specified in Unicode® Standard Annex #29). It works well for most languages, but poorly for Chinese.
POST _analyze
{
"analyzer":"standard",
"text":"Geneva K. Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "k",
"start_offset": 7,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "risk",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "issues",
"start_offset": 15,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
}
]
}
- simple analyzer: splits the text on any character that is not a letter, and lowercases all tokens.
POST _analyze
{
"analyzer":"simple",
"text":"Geneva K. Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "k",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "risk",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 2
},
{
"token": "issues",
"start_offset": 15,
"end_offset": 21,
"type": "word",
"position": 3
}
]
}
- whitespace analyzer: splits the text whenever it encounters whitespace.
POST _analyze
{
"analyzer":"whitespace",
"text":"Geneva K. Risk-Issues "
}
{
"tokens": [
{
"token": "Geneva",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "K.",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "Risk-Issues",
"start_offset": 10,
"end_offset": 21,
"type": "word",
"position": 2
}
]
}
- stop analyzer: the same as the simple analyzer, but with support for removing stop words. It defaults to the _english_ stop words.
POST _analyze
{
"analyzer":"stop",
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "k",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "risk",
"start_offset": 16,
"end_offset": 20,
"type": "word",
"position": 4
},
{
"token": "issues",
"start_offset": 21,
"end_offset": 27,
"type": "word",
"position": 5
}
]
}
- keyword analyzer: a no-op analyzer that returns the entire input string as a single token, i.e. no tokenization at all.
POST _analyze
{
"analyzer":"keyword",
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "Geneva K.of Risk-Issues ",
"start_offset": 0,
"end_offset": 24,
"type": "word",
"position": 0
}
]
}
- pattern analyzer: splits the text with a regular expression. The regex should match token separators, not the tokens themselves. Defaults to \W+ (all non-word characters).
POST _analyze
{
"analyzer":"pattern",
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "k",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "of",
"start_offset": 9,
"end_offset": 11,
"type": "word",
"position": 2
},
{
"token": "risk",
"start_offset": 12,
"end_offset": 16,
"type": "word",
"position": 3
},
{
"token": "issues",
"start_offset": 17,
"end_offset": 23,
"type": "word",
"position": 4
}
]
}
- language analyzers: a set of analyzers aimed at specific languages: arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, thai, turkish.
POST _analyze
{
"analyzer":"english", ## or e.g. "french"
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "k.of",
"start_offset": 7,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "risk",
"start_offset": 12,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "issu",
"start_offset": 17,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 3
}
]
}
- fingerprint analyzer: lowercases the text, deduplicates and sorts the tokens, then concatenates them back into a single token.
POST _analyze
{
"analyzer":"fingerprint",
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "geneva issues k.of risk",
"start_offset": 0,
"end_offset": 24,
"type": "fingerprint",
"position": 0
}
]
}
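The fingerprint pipeline (lowercase → tokenize → deduplicate → sort → join) is easy to sketch in Python; the tokenizing regex below is only a rough stand-in for the standard tokenizer:

```python
import re

def fingerprint(text):
    # lowercase, tokenize (roughly), deduplicate, sort, re-join with spaces
    tokens = re.findall(r"[a-z0-9.']+", text.lower())
    return " ".join(sorted(set(tokens)))

print(fingerprint("Geneva K.of Risk-Issues "))  # geneva issues k.of risk
```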
II. Testing custom analyzers
Using the _analyze API
The _analyze API verifies an analyzer's behavior and can explain the analysis process.
text: the text to analyze
explain: explain the analysis process
char_filter: character filters
tokenizer: tokenizer
filter: token filters
GET _analyze
{
"char_filter": ["html_strip"],
"tokenizer": "standard",
"filter": ["lowercase"],
"text": "No dreams, why bother Beijing !",
"explain": true
}
{
"detail": {
"custom_analyzer": true,
"charfilters": [
{
"name": "html_strip",
"filtered_text": [
"""
No dreams, why bother Beijing !
"""
]
}
],
"tokenizer": {
"name": "standard",
"tokens": [
{
"token": "No",
"start_offset": 7,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0,
"bytes": "[4e 6f]",
"positionLength": 1
},
{
"token": "dreams",
"start_offset": 13,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 1,
"bytes": "[64 72 65 61 6d 73]",
"positionLength": 1
},
{
"token": "why",
"start_offset": 25,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 2,
"bytes": "[77 68 79]",
"positionLength": 1
},
{
"token": "bother",
"start_offset": 29,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3,
"bytes": "[62 6f 74 68 65 72]",
"positionLength": 1
},
{
"token": "Beijing",
"start_offset": 39,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 4,
"bytes": "[42 65 69 6a 69 6e 67]",
"positionLength": 1
}
]
},
"tokenfilters": [
{
"name": "lowercase",
"tokens": [
{
"token": "no",
"start_offset": 7,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0,
"bytes": "[6e 6f]",
"positionLength": 1
},
{
"token": "dreams",
"start_offset": 13,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 1,
"bytes": "[64 72 65 61 6d 73]",
"positionLength": 1
},
{
"token": "why",
"start_offset": 25,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 2,
"bytes": "[77 68 79]",
"positionLength": 1
},
{
"token": "bother",
"start_offset": 29,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3,
"bytes": "[62 6f 74 68 65 72]",
"positionLength": 1
},
{
"token": "beijing",
"start_offset": 39,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 4,
"bytes": "[62 65 69 6a 69 6e 67]",
"positionLength": 1
}
]
}
]
}
}
Normalizer
A keyword field supports only exact matching, and that matching is case-sensitive. What if you want exact matching on a keyword field to be case-insensitive? The normalizer solves this problem.
A normalizer has the same structure as an analyzer minus the tokenizer:
type: custom
char_filter: zero or more character filters, applied in order
filter: zero or more token filters, applied in order
Here is an example borrowed from the official docs:
PUT index
{
"settings": {
"analysis": {
"char_filter": {
"quote": {
"type": "mapping",
"mappings": [
"« => \"",
"» => \""
]
}
},
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": ["quote"],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"type": {
"properties": {
"foo": {
"type": "keyword", // a normalizer can only be used on keyword fields
"normalizer": "my_normalizer"
}
}
}
}
}
PUT testlog/wd_doc/1
{
"title": "Quick Frox"
}
GET testlog/wd_doc/_search
{
"query": {
"match": {
"title": {
"query": "quick Frox" // case-insensitive: the document matches regardless of case
}
}
}
}
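The my_normalizer chain (the quote-mapping char filter, then the lowercase and asciifolding filters) can be approximated in Python; the NFKD decomposition below is only a rough stand-in for Lucene's asciifolding:

```python
import unicodedata

def normalize(value):
    # mapping char filter: « and » become a plain double quote
    value = value.replace("«", '"').replace("»", '"')
    # lowercase token filter
    value = value.lower()
    # asciifolding (approximate): decompose, then drop combining marks
    value = unicodedata.normalize("NFKD", value)
    return "".join(c for c in value if not unicodedata.combining(c))

print(normalize("«Façade»"))  # "facade" (with the mapped quote characters)
```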
The 12 built-in tokenizers
- Standard Tokenizer
- Letter Tokenizer
- Lowercase Tokenizer
- Whitespace Tokenizer
- UAX URL Email Tokenizer
- Classic Tokenizer
- Thai Tokenizer
- NGram Tokenizer
- Edge NGram Tokenizer
- Keyword Tokenizer
- Pattern Tokenizer
- Path Hierarchy Tokenizer