The standard token filter currently does nothing (it is a no-op).
2.ASCII Folding Token Filter
A token filter of type asciifolding converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist.
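For example (a minimal sketch, not from the original article — the sample text is my own), folding accented characters down to plain ASCII:
GET _analyze
{
"tokenizer": "standard",
"filter": ["asciifolding"],
"text": "açaí à la carte"
}
# Expected tokens: acai, a, la, carte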
3.Length Token Filter
The length token filter removes tokens that are too long or too short.
min defines the minimum length
max defines the maximum length
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "length", "min":1, "max":3 }],
"text" : "this is a test"
}
Output:
"tokens": [
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "",
"position": 2
}
]
4.Lowercase Token Filter
A token filter of type lowercase normalizes token text to lowercase.
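For example (a minimal sketch, not from the original article):
GET _analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text": "THE Quick FoX"
}
# Expected tokens: the, quick, fox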
5. Uppercase Token Filter
A token filter of type uppercase normalizes token text to uppercase.
6.Porter Stem Token Filter
A token filter of type porter_stem transforms the token stream according to the Porter stemming algorithm.
Note: the input must already be lowercase for it to work correctly.
In effect, it returns the stem of an English word — everything it does revolves around extracting word stems.
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": ["porter_stem"],
"text" : ["I readed books", "eys"]
}
Output:
"tokens": [
{
"token": "I",
"start_offset": 0,
"end_offset": 1,
"type": "",
"position": 0
},
{
"token": "read",
"start_offset": 2,
"end_offset": 8,
"type": "",
"position": 1
},
{
"token": "book",
"start_offset": 9,
"end_offset": 14,
"type": "",
"position": 2
},
{
"token": "ei",
"start_offset": 15,
"end_offset": 18,
"type": "",
"position": 3
}
]
7.Shingle Token Filter
A token filter of type shingle builds combinations of adjacent tokens and emits each combination as a single token.
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "shingle", "output_unigrams": "false"}],
"text" : ["this is a test"]
}
Output:
# Note: output_unigrams defaults to true; if it is true, the original single-word tokens are added to the output as well
# this is a test becomes this is, is a, a test — adjacent pairs; the number of tokens combined into a shingle is configurable
[
{
"token": "this is",
"start_offset": 0,
"end_offset": 7,
"type": "shingle",
"position": 0
},
{
"token": "is a",
"start_offset": 5,
"end_offset": 9,
"type": "shingle",
"position": 1
},
{
"token": "a test",
"start_offset": 8,
"end_offset": 14,
"type": "shingle",
"position": 2
}
]
8.Stop Token Filter
The stop token filter removes the words listed in stopwords from the token stream.
stopwords — a list of stop words; defaults to the _english_ stop words.
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "stop", "stopwords": ["this", "a"]}],
"text" : ["this is a test"]
}
Output:
# The stop words this and a are filtered out
"tokens": [
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "",
"position": 1
},
{
"token": "test",
"start_offset": 10,
"end_offset": 14,
"type": "",
"position": 3
}
]
9.Word Delimiter Token Filter
The word_delimiter token filter splits words into subwords and performs optional transformations on the subword groups. Words are split into subwords according to the following rules:
Split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" → "Wi", "Fi"
Split on case transitions: "PowerShot" → "Power", "Shot"
Split on letter-digit transitions: "SD500" → "SD", "500"
Leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" → "hello", "there", "dude"
Trailing "'s" is removed from each subword: "O'Neil's" → "O", "Neil"
Parameters include:
generate_word_parts
If true, causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Default true.
generate_number_parts
If true, causes number subwords to be generated: "500-42" ⇒ "500" "42". Default true.
catenate_words
If true, causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Default false.
catenate_numbers
If true, causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Default false.
catenate_all
If true, causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Default false.
split_on_case_change
If true, causes "PowerShot" to be two tokens ("Power-Shot" is treated as two parts). Default true.
preserve_original
If true, includes the original word alongside the subwords: "500-42" ⇒ "500-42" "500" "42". Default false.
split_on_numerics
If true, causes "j2se" to be three tokens: "j" "2" "se". Default true.
stem_english_possessive
If true, causes trailing "'s" to be removed from each subword: "O'Neil's" ⇒ "O", "Neil". Default true.
Advanced settings:
protected_words
A list of words protected from being delimited. Either an array, or set protected_words_path to point to a file (one word per line) that contains the protected words. If given, the path is resolved relative to the config/ location.
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "word_delimiter"}],
"text" : ["PowerShot", "219-230"]
}
Output:
# PowerShot and 219-230 are split into Power, Shot, 219, 230
"tokens": [
{
"token": "Power",
"start_offset": 0,
"end_offset": 5,
"type": "",
"position": 0
},
{
"token": "Shot",
"start_offset": 5,
"end_offset": 9,
"type": "",
"position": 1
},
{
"token": "219",
"start_offset": 10,
"end_offset": 13,
"type": "",
"position": 2
},
{
"token": "230",
"start_offset": 14,
"end_offset": 17,
"type": "",
"position": 3
}
]
10.Word Delimiter Graph Token Filter
Skipped.
11.Stemmer Token Filter
The stemmer token filter provides access to almost all of the available stemmers through a single generic interface.
Usage:
PUT /my_index
{
"settings": {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "light_german"
}
}
}
}
}
The language/name parameter controls the stemmer, with the following languages available (in the official documentation, the preferred filter for each language is marked in bold):
Arabic, Armenian, Basque, Bengali, Brazilian Portuguese, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Kurdish (Sorani), Latvian, Lithuanian, Norwegian (Bokmål), Norwegian (Nynorsk), Portuguese, Romanian, Russian, Spanish, Swedish, Turkish
For the concrete stemmer names available for each language, see the official Elasticsearch stemmer token filter documentation.
12.Stemmer Override Token Filter
Overrides stemming algorithms with a custom mapping, so that the mapped terms are protected from being modified by stemmers. Must be placed before any stemming filters.
Settings:
rules — a list of mapping rules
rules_path — a path to a file containing the mapping rules
Each rule uses => to separate the term from its override.
For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-override-tokenfilter.html
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "stemmer_override", "rules":["this=>These"]}, "lowercase"],
"text" : ["this IS A TEST"]
}
Output:
# this is first replaced with These, and then the lowercase filter is applied
"tokens": [
{
"token": "these",
"start_offset": 0,
"end_offset": 4,
"type": "",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "",
"position": 2
},
{
"token": "test",
"start_offset": 10,
"end_offset": 14,
"type": "",
"position": 3
}
]
13.Keyword Marker Token Filter
type: keyword_marker
Protects words from being modified by stemmers. Must be placed before any stemming filters.
keywords — a list of keywords
keywords_path — a path to a keyword list file
ignore_case — set to true to lowercase all words first. Defaults to false.
You can configure it like:
PUT /keyword_marker_example
{
"settings": {
"analysis": {
"analyzer": {
"protect_cats": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "protect_cats", "porter_stem"]
},
"normal": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "porter_stem"]
}
},
"filter": {
"protect_cats": {
"type": "keyword_marker",
"keywords": ["cats"]
}
}
}
}
}
And test it with:
POST /keyword_marker_example/_analyze
{
"analyzer" : "protect_cats",
"text" : "I like cats"
}
And it’d respond:
# Without the protection, cats would be stemmed to cat by porter_stem
{
"tokens": [
{
"token": "i",
"start_offset": 0,
"end_offset": 1,
"type": "",
"position": 0
},
{
"token": "like",
"start_offset": 2,
"end_offset": 6,
"type": "",
"position": 1
},
{
"token": "cats",
"start_offset": 7,
"end_offset": 11,
"type": "",
"position": 2
}
]
}
14.Keyword Repeat Token Filter
The keyword_repeat token filter emits each incoming token twice, once as a keyword and once as a non-keyword. Simply put, the same token is shown twice: once as processed by the downstream token filters and once as it was originally input, so adding a unique filter lets you keep both the modified result and the unmodified one.
Official example:
Here is an example that uses the keyword_repeat token filter to keep both the stemmed and the unstemmed tokens:
PUT /keyword_repeat_example
{
"settings": {
"analysis": {
"analyzer": {
"stemmed_and_unstemmed": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "keyword_repeat", "porter_stem", "unique_stem"]
}
},
"filter": {
"unique_stem": {
"type": "unique",
"only_on_same_position": true
}
}
}
}
}
And you can test it with:
POST /keyword_repeat_example/_analyze
{
"analyzer" : "stemmed_and_unstemmed",
"text" : "I like cats"
}
And it’d respond:
{
"tokens": [
{
"token": "i",
"start_offset": 0,
"end_offset": 1,
"type": "",
"position": 0
},
{
"token": "like",
"start_offset": 2,
"end_offset": 6,
"type": "",
"position": 1
},
{
"token": "cats",
"start_offset": 7,
"end_offset": 11,
"type": "",
"position": 2
},
{
"token": "cat",
"start_offset": 7,
"end_offset": 11,
"type": "",
"position": 2
}
]
}
15.Synonym Token Filter
Handles synonyms during the analysis process.
For details, see:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-synonym-tokenfilter.html
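A minimal configuration sketch (not from the original article; the index name synonym_example and filter name my_synonym_filter are made up for illustration), using the inline synonyms syntax where "quick, fast" declares equivalent terms and "jumps => leaps" declares a one-way mapping:
PUT /synonym_example
{
"settings": {
"analysis": {
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": ["lowercase", "my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": ["quick, fast", "jumps => leaps"]
}
}
}
}
}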
16.Reverse Token Filter
The reverse token filter simply reverses each token.
Input:
GET _analyze
{
"tokenizer": "standard",
"filter": ["reverse"],
"text": ["hello world"]
}
Output:
# hello becomes olleh, and world becomes dlrow
"tokens": [
{
"token": "olleh",
"start_offset": 0,
"end_offset": 5,
"type": "",
"position": 0
},
{
"token": "dlrow",
"start_offset": 6,
"end_offset": 11,
"type": "",
"position": 1
}
]
17.Truncate Token Filter
The truncate token filter trims tokens down to a specified length. In short, you give it a length; if a token is longer than length, the part beyond length is cut off.
Input:
GET _analyze
{
"tokenizer": "standard",
"filter": [{"type":"truncate", "length":3}],
"text": ["this is a test"]
}
Output:
# this is truncated to thi, and test is truncated to tes
"tokens": [
{
"token": "thi",
"start_offset": 0,
"end_offset": 4,
"type": "",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "",
"position": 2
},
{
"token": "tes",
"start_offset": 10,
"end_offset": 14,
"type": "",
"position": 3
}
]
18.Unique Token Filter
The unique token filter ensures that identical tokens are emitted only once. If only_on_same_position is set to true, it only removes duplicate tokens that occur at the same position.
Input:
GET _analyze
{
"tokenizer": "standard",
"filter": ["unique"],
"text": ["this is a test test test"]
}
Output:
"tokens": [
{
"token": "this",
"start_offset": 0,
"end_offset": 4,
"type": "",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "",
"position": 2
},
{
"token": "test",
"start_offset": 10,
"end_offset": 14,
"type": "",
"position": 3
}
]
19.Pattern Capture Token Filter
Uses regular expressions to emit the groups captured from each token; each regular expression can match multiple times, and overlapping matches are allowed.
If preserve_original is set to true, the original token is emitted as well.
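A minimal sketch (not from the original article; the sample pattern and text are my own), pulling the digits out of a token while keeping the original:
GET _analyze
{
"tokenizer": "standard",
"filter": [{"type": "pattern_capture", "preserve_original": true, "patterns": ["(\\d+)"]}],
"text": "abc123"
}
# Expected tokens: abc123 (the original) and 123 (the captured group)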
20.Pattern Replace Token Filter
Uses a regular expression to match and replace text within tokens: the pattern parameter defines the regular expression to match, and the replacement parameter supplies the replacement string, which may reference the original match.
For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern_replace-tokenfilter.html
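A minimal sketch (not from the original article; the sample pattern and text are my own), stripping the digits out of each token:
GET _analyze
{
"tokenizer": "standard",
"filter": [{"type": "pattern_replace", "pattern": "\\d+", "replacement": ""}],
"text": "foo123bar"
}
# Expected token: foobar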
21.Trim Token Filter
The trim token filter removes the whitespace surrounding tokens.
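A minimal sketch (not from the original article): the keyword tokenizer keeps the surrounding whitespace, so the effect of trim is visible:
GET _analyze
{
"tokenizer": "keyword",
"filter": ["trim"],
"text": " fox "
}
# Expected token: fox (instead of " fox ")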
22.Limit Token Count Token Filter
The limit token filter limits the number of tokens that are indexed per document or field.
max_token_count — the maximum number of tokens that will be indexed.
consume_all_tokens — if true, the filter exhausts the token stream even after max_token_count has been reached. Defaults to false.
Input:
GET _analyze
{
"tokenizer": "standard",
"filter": [{"type": "limit", "max_token_count":2}],
"text": ["this is a test"]
}
Output:
# Four tokens would normally be produced, but since the limit is 2, only this and is are output
"tokens": [
{
"token": "this",
"start_offset": 0,
"end_offset": 4,
"type": "",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "",
"position": 1
}
]
23.Common Grams Token Filter
For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-common-grams-tokenfilter.html
# From my own testing of common_grams: roughly, when a token matches one of the common_words, a new token is generated that joins it to the preceding token with '_'; the same happens with the following token.
Input:
GET _analyze
{
"tokenizer": "standard",
"filter": [{"type": "common_grams", "common_words": ["a", "an", "the"]}],
"text": ["a test"]
}
Output:
# a is one of the common_words, so the combined token a_test is emitted as an additional token
"tokens": [
{
"token": "a",
"start_offset": 0,
"end_offset": 1,
"type": "",
"position": 0
},
{
"token": "a_test",
"start_offset": 0,
"end_offset": 6,
"type": "gram",
"position": 0,
"positionLength": 2
},
{
"token": "test",
"start_offset": 2,
"end_offset": 6,
"type": "",
"position": 1
}
]
24.Normalization Token Filter
Several token filters are available that try to normalize the special characters of certain languages.
See: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-normalization-tokenfilter.html
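A minimal sketch (not from the original article; it assumes german_normalization folds umlauts the way Lucene's GermanNormalizationFilter does, e.g. ü → u):
GET _analyze
{
"tokenizer": "standard",
"filter": ["german_normalization"],
"text": "über München"
}
# Expected tokens: something like uber, Munchen (the filter does not lowercase)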
25.Delimited Payload Token Filter
The delimited_payload_filter splits a token into a token and a payload (think of the payload as extra data attached to the token).
Roughly: you configure a delimiter, which defaults to '|'; the part before the '|' becomes the token and the part after it becomes the payload.
Parameters:
delimiter — the character used as the delimiter. Defaults to '|'.
encoding — the type of the payload: int for integer, float for float, and identity for characters. Defaults to float.
Input:
GET _analyze
{
"tokenizer": "standard",
"filter": ["delimited_payload_filter"],
"text": ["the|1 quick|2 fox|3"]
}
Output:
# Note that the standard tokenizer has already split on '|' here, so the numbers simply appear as separate tokens, and the default _analyze output does not show payloads
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "",
"position": 0
},
{
"token": "1",
"start_offset": 4,
"end_offset": 5,
"type": "",
"position": 1
},
{
"token": "quick",
"start_offset": 6,
"end_offset": 11,
"type": "",
"position": 2
},
{
"token": "2",
"start_offset": 12,
"end_offset": 13,
"type": "",
"position": 3
},
{
"token": "fox",
"start_offset": 14,
"end_offset": 17,
"type": "",
"position": 4
},
{
"token": "3",
"start_offset": 18,
"end_offset": 19,
"type": "",
"position": 5
}
]
26.Keep Words Token Filter
The keep token filter keeps only the tokens listed in keep_words.
Input:
GET _analyze
{
"tokenizer": "standard",
"filter": [{"type":"keep", "keep_words":["this", "test"]}],
"text": ["this is a test"]
}
Output:
# is and a are filtered out here because they are not listed in keep_words
"tokens": [
{
"token": "this",
"start_offset": 0,
"end_offset": 4,
"type": "",
"position": 0
},
{
"token": "test",
"start_offset": 10,
"end_offset": 14,
"type": "",
"position": 3
}
]
27.Keep Types Token Filter
keep_types is similar to keep, but it keeps only tokens whose type is listed in types.
You can set it up like:
PUT /keep_types_example
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "extract_numbers"]
}
},
"filter" : {
"extract_numbers" : {
"type" : "keep_types",
"types" : [ "" ]
}
}
}
}
}
And test it like:
POST /keep_types_example/_analyze
{
"analyzer" : "my_analyzer",
"text" : "this is just 1 a test"
}
And it’d respond:
{
"tokens": [
{
"token": "1",
"start_offset": 13,
"end_offset": 14,
"type": "",
"position": 3
}
]
}
Only the token 1 is kept, because its type <NUM> is listed in types.
28.Apostrophe Token Filter
The apostrophe token filter removes all characters after an apostrophe, including the apostrophe itself.
(Which exact symbol counts as an apostrophe was unclear to me; in my testing the part after a single quote seemed to get filtered out, but note that the standard analyzer may already strip the single quote during tokenization, which can make the filter's effect hard to observe.)
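A quick sketch (not from the original article; Turkish suffixes are the typical use case, and the standard tokenizer should keep an apostrophe between letters, so the effect is visible here):
GET _analyze
{
"tokenizer": "standard",
"filter": ["apostrophe"],
"text": "Istanbul'a veya Istanbul'dan"
}
# Expected tokens: Istanbul, veya, Istanbul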
29.Decimal Digit Token Filter
decimal_digit converts Unicode decimal digits into the ASCII digits 0-9.
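A minimal sketch (not from the original article; the text uses Arabic-Indic digits):
GET _analyze
{
"tokenizer": "standard",
"filter": ["decimal_digit"],
"text": "٢٠١٩"
}
# Expected token: 2019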
30.Fingerprint Token Filter
The fingerprint filter takes all of the tokens that would otherwise be emitted, sorts them in ascending order, removes duplicates, and concatenates them into a single output token; see the example.
Input:
GET _analyze
{
"tokenizer": "standard",
"filter": ["fingerprint"],
"text": ["b a f e c f"]
}
Output:
"tokens": [
{
"token": "a b c e f",
"start_offset": 0,
"end_offset": 11,
"type": "fingerprint",
"position": 0
}
]
31.Minhash Token Filter
I did not fully work out what this does... roughly, it hashes each token, distributes the resulting hashes into different buckets, and then returns those hashes as tokens.
See: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-minhash-tokenfilter.html