Elasticsearch Analysis Token Filters: What They Do, with Examples

1.Standard Token Filter

The standard token filter currently does nothing;


2.ASCII Folding Token Filter

The asciifolding token filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if any exist.
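
A quick test in the same style as the rest of this post (the sample text is my own, so treat the expected tokens as approximate):

GET _analyze
{
  "tokenizer" : "standard",
  "filter": ["asciifolding"],
  "text" : ["açaí à la carte"]
}

# Expected output tokens: acai, a, la, carte (the accented characters are folded to their ASCII equivalents)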


3.Length Token Filter

The length filter removes tokens that are too long or too short;

min defines the minimum length

max defines the maximum length

Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "length", "min":1, "max":3 }],  
  "text" : "this is a test"
}

Output:

"tokens": [
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "",
      "position": 2
    }
  ]


4.Lowercase Token Filter

The lowercase token filter normalizes token text to lowercase.
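
A minimal check in the style of the other examples (the sample text is my own):

GET _analyze
{
  "tokenizer" : "standard",
  "filter": ["lowercase"],
  "text" : ["THIS Is A Test"]
}

# Expected output tokens: this, is, a, test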


5. Uppercase Token Filter

The uppercase token filter normalizes token text to uppercase;
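
The mirror image of the lowercase example above (again, my own sample text):

GET _analyze
{
  "tokenizer" : "standard",
  "filter": ["uppercase"],
  "text" : ["this is a test"]
}

# Expected output tokens: THIS, IS, A, TEST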


6.Porter Stem Token Filter

The porter_stem token filter transforms the token stream according to the Porter stemming algorithm.

Note: the input must already be lowercase for it to work correctly.

What it does: it returns the stem of an English word.

  1. Handles plurals, as well as words ending in -ed and -ing
  2. Plurals become singular
  3. If a word contains a vowel and ends in y, the y is changed to i (based on my testing, this seems to be the case)

 ...and so on, all centered on extracting the stem.

Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": ["porter_stem"],  
  "text" : ["I readed books", "eys"]
}

Output:

"tokens": [
    {
      "token": "I",
      "start_offset": 0,
      "end_offset": 1,
      "type": "",
      "position": 0
    },
    {
      "token": "read",
      "start_offset": 2,
      "end_offset": 8,
      "type": "",
      "position": 1
    },
    {
      "token": "book",
      "start_offset": 9,
      "end_offset": 14,
      "type": "",
      "position": 2
    },
    {
      "token": "ei",
      "start_offset": 15,
      "end_offset": 18,
      "type": "",
      "position": 3
    }
  ]


7.Shingle Token Filter

The shingle token filter creates combinations of tokens and emits them as single tokens.

Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "shingle", "output_unigrams": "false"}],  
  "text" : ["this is a test"]
}

Output:

Note: if output_unigrams is set to true (and it defaults to true), the original input tokens are added to the output as well.

# this is a test becomes this is, is a, a test (pairwise combinations; the shingle size can be customized)

[
    {
      "token": "this is",
      "start_offset": 0,
      "end_offset": 7,
      "type": "shingle",
      "position": 0
    },
    {
      "token": "is a",
      "start_offset": 5,
      "end_offset": 9,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "a test",
      "start_offset": 8,
      "end_offset": 14,
      "type": "shingle",
      "position": 2
    }
  ]


8.Stop Token Filter

The stop token filter removes the words listed in stopwords from the token stream.

stopwords: a list of stop words, defaults to the _english_ stop words

Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "stop", "stopwords": ["this", "a"]}],  
  "text" : ["this is a test"]
}

Output:

# The stop words this and a are filtered out;
"tokens": [
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "",
      "position": 1
    },
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "",
      "position": 3
    }
  ]


9.Word Delimiter Token Filter

The word_delimiter token filter splits words into subwords and performs optional transformations on the subword groups. Words are split into subwords according to the following rules:

Split on intra-word delimiters (by default, all non-alphanumeric characters):

"Wi-Fi" → "Wi", "Fi"

Split on case transitions: "PowerShot" → "Power", "Shot"

Split on letter-number transitions: "SD500" → "SD", "500"

Leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" → "hello", "there", "dude"

Trailing "'s" is removed from each subword: "O'Neil's" → "O", "Neil"

Parameters include:

generate_word_parts

If true, causes word parts to be generated: "PowerShot" → "Power" "Shot". Defaults to true.

generate_number_parts

If true, causes number subwords to be generated: "500-42" → "500" "42". Defaults to true.

catenate_words

If true, causes maximal runs of word parts to be catenated: "wi-fi" → "wifi". Defaults to false.

catenate_numbers

If true, causes maximal runs of number parts to be catenated: "500-42" → "50042". Defaults to false.

catenate_all

If true, causes all subword parts to be catenated: "wi-fi-4000" → "wifi4000". Defaults to false.

split_on_case_change

If true, "PowerShot" becomes two tokens ("Power-Shot" is treated as two parts). Defaults to true.

preserve_original

If true, the original word is kept alongside the subwords: "500-42" → "500-42", "500", "42". Defaults to false.

split_on_numerics

If true, "j2se" becomes three tokens: "j" "2" "se". Defaults to true.

stem_english_possessive

If true, the trailing "'s" of each subword is removed: "O'Neil's" → "O", "Neil". Defaults to true.

Advanced settings:

protected_words

A list of words protected from being delimited. Either an array, or set protected_words_path to a file containing the protected words (one per line). If present, the path is resolved relative to the config/ location.


Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "word_delimiter"}],  
  "text" : ["PowerShot", "219-230"]
}

Output:

# PowerShot and 219-230 are split into Power, Shot, 219, 230
"tokens": [
    {
      "token": "Power",
      "start_offset": 0,
      "end_offset": 5,
      "type": "",
      "position": 0
    },
    {
      "token": "Shot",
      "start_offset": 5,
      "end_offset": 9,
      "type": "",
      "position": 1
    },
    {
      "token": "219",
      "start_offset": 10,
      "end_offset": 13,
      "type": "",
      "position": 2
    },
    {
      "token": "230",
      "start_offset": 14,
      "end_offset": 17,
      "type": "",
      "position": 3
    }
  ]


10.Word Delimiter Graph Token Filter

Skipped.


11.Stemmer Token Filter

The stemmer token filter provides access to almost all of the available stemming token filters through a single generic interface.
Usage:
PUT /my_index
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "my_stemmer"]
                }
            },
            "filter" : {
                "my_stemmer" : {
                    "type" : "stemmer",
                    "name" : "light_german"
                }
            }
        }
    }
}
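
Once the index above exists, the analyzer can be exercised with the _analyze API; this request is my own addition, and the exact output depends on the light_german stemming rules:

POST /my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "Autos"
}

# The token is lowercased and then stemmed by the light_german stemmer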

The language/name parameter controls the stemmer, with the following available values:

Arabic: arabic
Armenian: armenian
Basque: basque
Bengali: bengali, light_bengali
Brazilian Portuguese: brazilian
Bulgarian: bulgarian
Catalan: catalan
Czech: czech
Danish: danish
Dutch: dutch, dutch_kp
English: english, light_english, minimal_english, possessive_english, porter2, lovins
Finnish: finnish, light_finnish
French: french, light_french, minimal_french
Galician: galician, minimal_galician (plural step only)
German: german, german2, light_german, minimal_german
Greek: greek
Hindi: hindi
Hungarian: hungarian, light_hungarian
Indonesian: indonesian
Irish: irish
Italian: italian, light_italian
Kurdish (Sorani): sorani
Latvian: latvian
Lithuanian: lithuanian
Norwegian (Bokmål): norwegian, light_norwegian, minimal_norwegian
Norwegian (Nynorsk): light_nynorsk, minimal_nynorsk
Portuguese: portuguese, light_portuguese, minimal_portuguese, portuguese_rslp
Romanian: romanian
Russian: russian, light_russian
Spanish: spanish, light_spanish
Swedish: swedish, light_swedish
Turkish: turkish



12.Stemmer Override Token Filter

The stemmer_override filter overrides stemming algorithms with custom mappings, so the mapped terms are protected from being modified by stemmers. It must be placed before any stemming filters.

Setting / Description

rules         A list of mapping rules

rules_path    A path to a file containing the mapping rules

Rules are separated by =>

For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-override-tokenfilter.html

Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "stemmer_override", "rules":["this=>These"]}, "lowercase"],  
  "text" : ["this IS A TEST"]
}

Output:

# this is first replaced by These, then the lowercase filter is applied;
"tokens": [
    {
      "token": "these",
      "start_offset": 0,
      "end_offset": 4,
      "type": "",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "",
      "position": 2
    },
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "",
      "position": 3
    }
  ]


13.Keyword Marker Token Filter

type: keyword_marker

Protects words from being modified by stemmers. Must be placed before any stemming filters.

keywords: a list of keywords

keywords_path: a path to a keyword list file

ignore_case: set to true to lowercase the words first; defaults to false

You can configure it like:

PUT /keyword_marker_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "protect_cats": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "protect_cats", "porter_stem"]
        },
        "normal": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        }
      },
      "filter": {
        "protect_cats": {
          "type": "keyword_marker",
          "keywords": ["cats"]
        }
      }
    }
  }
}

And test it with:

POST /keyword_marker_example/_analyze
{
  "analyzer" : "protect_cats",
  "text" : "I like cats"
}

And it'd respond:

# If cats were not protected here, it would be stemmed to cat by the porter_stem filter;

{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "",
      "position": 0
    },
    {
      "token": "like",
      "start_offset": 2,
      "end_offset": 6,
      "type": "",
      "position": 1
    },
    {
      "token": "cats",
      "start_offset": 7,
      "end_offset": 11,
      "type": "",
      "position": 2
    }
  ]
}


14.Keyword Repeat Token Filter

The keyword_repeat token filter emits each incoming token twice, once as a keyword and once as a non-keyword. Put simply, the same token appears twice: once as the result of the downstream token filters, and once as the original input token. Combined with the unique filter, both the stemmed and the unstemmed results can be kept.

Official example:

Here is an example that uses the keyword_repeat token filter to keep both the processed and the unprocessed form of each token:

PUT /keyword_repeat_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stemmed_and_unstemmed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "keyword_repeat", "porter_stem", "unique_stem"]
        }
      },
      "filter": {
        "unique_stem": {
          "type": "unique",
          "only_on_same_position": true
        }
      }
    }
  }
}

And you can test it with:

POST /keyword_repeat_example/_analyze
{
  "analyzer" : "stemmed_and_unstemmed",
  "text" : "I like cats"
}

And it'd respond:

{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "",
      "position": 0
    },
    {
      "token": "like",
      "start_offset": 2,
      "end_offset": 6,
      "type": "",
      "position": 1
    },
    {
      "token": "cats",
      "start_offset": 7,
      "end_offset": 11,
      "type": "",
      "position": 2
    },
    {
      "token": "cat",
      "start_offset": 7,
      "end_offset": 11,
      "type": "",
      "position": 2
    }
  ]
}


15.Synonym Token Filter

The synonym token filter handles synonyms during the analysis process.

For usage details, see:

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-synonym-tokenfilter.html
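
For reference, a minimal settings sketch in the spirit of the linked page (the index name synonym_example and the filter name my_synonym_filter are my own, not from the official docs):

PUT /synonym_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonym_filter"]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": ["i-pod, i pod => ipod", "universe, cosmos"]
        }
      }
    }
  }
}

With this analyzer, "i-pod" and "i pod" should be rewritten to ipod at analysis time, and universe and cosmos should be treated as equivalent.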


16.Reverse Token Filter

The reverse token filter simply reverses each token.

Input:

GET _analyze
{
    "tokenizer": "standard",
    "filter": ["reverse"],
    "text": ["hello world"]
}

Output:

# hello becomes olleh, and world becomes dlrow

"tokens": [
    {
      "token": "olleh",
      "start_offset": 0,
      "end_offset": 5,
      "type": "",
      "position": 0
    },
    {
      "token": "dlrow",
      "start_offset": 6,
      "end_offset": 11,
      "type": "",
      "position": 1
    }
  ]


17.Truncate Token Filter

The truncate token filter shortens tokens to a fixed length: given a length parameter, any token longer than length has the excess characters cut off.

Input:

GET _analyze
{
    "tokenizer": "standard",
    "filter": [{"type":"truncate", "length":3}],
    "text": ["this is a test"]
}

Output:

# this is truncated to thi, and test to tes;
"tokens": [
    {
      "token": "thi",
      "start_offset": 0,
      "end_offset": 4,
      "type": "",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "",
      "position": 2
    },
    {
      "token": "tes",
      "start_offset": 10,
      "end_offset": 14,
      "type": "",
      "position": 3
    }
  ]


18.Unique Token Filter

The unique token filter ensures that identical tokens appear only once; if only_on_same_position is set to true, it only removes duplicate tokens at the same position.

Input:

GET _analyze
{
    "tokenizer": "standard",
    "filter": ["unique"],
    "text": ["this is a test test test"]
}

Output:

"tokens": [
    {
      "token": "this",
      "start_offset": 0,
      "end_offset": 4,
      "type": "",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "",
      "position": 2
    },
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "",
      "position": 3
    }
  ]


19.Pattern Capture Token Filter

The pattern_capture token filter matches tokens against regular expressions; each regular expression can match multiple times, and overlapping matches are allowed.

If preserve_original is set to true, the original token is emitted as well. See the sketch below.
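
The section above has no sample request, so here is a small test of my own (the keyword tokenizer keeps the whole input as one token):

GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "pattern_capture", "preserve_original": true, "patterns": ["(\\d+)"]}],
  "text" : ["foo123bar456"]
}

# Expected output tokens: foo123bar456 (the preserved original), 123, 456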


20.Pattern Replace Token Filter

The pattern_replace token filter uses a regular expression to match and replace parts of a token: the pattern parameter defines the regular expression to match, and the replacement parameter provides the replacement string, which can refer back to the original token (for example via capture groups).

For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern_replace-tokenfilter.html
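
A small test of my own along the same lines (treat the expected output as approximate):

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "pattern_replace", "pattern": "(\\d+)", "replacement": "NUM"}],
  "text" : ["order 66 confirmed"]
}

# Expected output tokens: order, NUM, confirmed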


21.Trim Token Filter

The trim token filter removes whitespace surrounding a token.
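
Because the standard tokenizer already discards surrounding whitespace, the effect is easiest to see with the keyword tokenizer (my own test):

GET _analyze
{
  "tokenizer" : "keyword",
  "filter": ["trim"],
  "text" : ["  hello world  "]
}

# Expected: a single token "hello world" with the leading and trailing spaces removed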


22.Limit Token Count Token Filter

The limit token filter limits the number of tokens that are indexed per document or per field.

max_token_count: the maximum number of tokens to index

consume_all_tokens: if true, all tokens are still consumed even after the maximum count has been exceeded; defaults to false

Input:

GET _analyze
{
    "tokenizer": "standard",
    "filter": [{"type": "limit", "max_token_count":2}],
    "text": ["this is a test"]
}

Output:

# Four tokens would normally be produced, but because the maximum count is 2, only this and is are output;
"tokens": [
    {
      "token": "this",
      "start_offset": 0,
      "end_offset": 4,
      "type": "",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "",
      "position": 1
    }
  ]


23.Common Grams Token Filter

For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-common-grams-tokenfilter.html

# From my own testing of common_grams: roughly, when a token matches one of the common_words, a new token is generated that joins it to the preceding token with '_', and likewise with the token that follows.

Input:

GET _analyze
{
    "tokenizer": "standard",
    "filter": [{"type": "common_grams", "common_words": ["a", "an", "the"]}],
    "text": ["a test"]
}

Output:

# a is one of the common_words, so the combined token a_test is also output
"tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "",
      "position": 0
    },
    {
      "token": "a_test",
      "start_offset": 0,
      "end_offset": 6,
      "type": "gram",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "test",
      "start_offset": 2,
      "end_offset": 6,
      "type": "",
      "position": 1
    }
  ]


24.Normalization Token Filter

Several token filters are available that normalize the special characters of certain languages.

See: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-normalization-tokenfilter.html
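
As a rough illustration, german_normalization is one of the filters listed on that page (this request is my own test, and the expected result is approximate):

GET _analyze
{
  "tokenizer" : "standard",
  "filter": ["german_normalization"],
  "text" : ["Füße"]
}

# Expected output token: something like Fusse (ü folded to u, ß folded to ss)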


25.Delimited Payload Token Filter

The delimited_payload_filter splits each token into a token and a payload.

Roughly: a delimiter is configured (default '|'); the part before the '|' becomes the token, and the part after it becomes the payload.

Parameters:

delimiter: the character that separates token and payload, defaults to '|'

encoding: the type of the payload: int for integer, float for float, and identity for characters; defaults to float.

Input:

GET _analyze
{
    "tokenizer": "standard",
    "filter": ["delimited_payload_filter"],
    "text": ["the|1 quick|2 fox|3"]
}

Output:

"tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "",
      "position": 0
    },
    {
      "token": "1",
      "start_offset": 4,
      "end_offset": 5,
      "type": "",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "",
      "position": 2
    },
    {
      "token": "2",
      "start_offset": 12,
      "end_offset": 13,
      "type": "",
      "position": 3
    },
    {
      "token": "fox",
      "start_offset": 14,
      "end_offset": 17,
      "type": "",
      "position": 4
    },
    {
      "token": "3",
      "start_offset": 18,
      "end_offset": 19,
      "type": "",
      "position": 5
    }
  ]


26.Keep Words Token Filter

The keep token filter keeps only the tokens listed in keep_words.

Input:

GET _analyze
{
    "tokenizer": "standard",
    "filter": [{"type":"keep", "keep_words":["this", "test"]}],
    "text": ["this is a test"]
}

Output:

# is and a are filtered out here because they are not listed in keep_words
"tokens": [
    {
      "token": "this",
      "start_offset": 0,
      "end_offset": 4,
      "type": "",
      "position": 0
    },
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "",
      "position": 3
    }
  ]


27.Keep Types Token Filter

keep_types is similar to the keep filter, but it keeps only tokens whose type is listed in types.

For example:

You can set it up like:

PUT /keep_types_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "extract_numbers"]
                }
            },
            "filter" : {
                "extract_numbers" : {
                    "type" : "keep_types",
                    "types" : [ "" ]
                }
            }
        }
    }
}

And test it like:

POST /keep_types_example/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "this is just 1 a test"
}

And it'd respond:

{
  "tokens": [
    {
      "token": "1",
      "start_offset": 13,
      "end_offset": 14,
      "type": "",
      "position": 3
    }
  ]
}

Only tokens with type <NUM> appear in the output;


28.Apostrophe Token Filter

The apostrophe token filter strips everything after an apostrophe, including the apostrophe itself.

(I'm not entirely sure which character counts as the apostrophe; in my test the part after a single quote was filtered out, but note that the standard tokenizer already strips the single quote up front, so the effect of the apostrophe filter is hard to observe with it.)
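To make the effect visible despite the tokenizer issue mentioned above, a whitespace tokenizer can be used instead (my own test):

GET _analyze
{
  "tokenizer" : "whitespace",
  "filter": ["apostrophe"],
  "text" : ["O'Neil's dog"]
}

# Expected output tokens: O, dog (everything from the apostrophe onward is stripped from each token)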


29.Decimal Digit Token Filter

The decimal_digit filter converts Unicode digits to 0-9.
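
For example, Arabic-Indic digits should be converted (my own test; the keyword tokenizer keeps the input as a single token):

GET _analyze
{
  "tokenizer" : "keyword",
  "filter": ["decimal_digit"],
  "text" : ["١٩٥٢"]
}

# Expected output token: 1952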


30.Fingerprint Token Filter

The fingerprint token filter sorts all tokens that would otherwise be output in ascending order, removes duplicates, and emits them as a single token; see the example.

Input:

GET _analyze
{
    "tokenizer": "standard",
    "filter": ["fingerprint"],
    "text": ["b a f e c f"]
}

Output:

"tokens": [
    {
      "token": "a b c e f",
      "start_offset": 0,
      "end_offset": 11,
      "type": "fingerprint",
      "position": 0
    }
  ]


31.Minhash Token Filter

I did not fully work out what this one is for. Roughly, each token is hashed, the hashes are distributed into different buckets, and those hashes are returned as tokens.

See: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-minhash-tokenfilter.html




