"settings":{
"analysis": { # 自定义分词
"filter": {
"自定义过滤器": {
"type": "edge_ngram", # 过滤器类型
"min_gram": "1", # 最小边界
"max_gram": "6" # 最大边界
}
}, # 过滤器
"char_filter": {}, # 字符过滤器
"tokenizer": {}, # 分词
"analyzer": {
"自定义分词器名称": {
"type": "custom",
"tokenizer": "上述自定义分词名称或自带分词",
"filter": [
"上述自定义过滤器名称或自带过滤器"
],
"char_filter": [
"上述自定义字符过滤器名称或自带字符过滤器"
]
}
} # 分词器
}
}
Testing analyzer output:
1. Test an analyzer against a specific index:
POST /discovery-user/_analyze
{
"analyzer": "analyzer_ngram",
"text":"i like cats"
}
2. Test a built-in analyzer without targeting an index:
POST _analyze
{
"analyzer": "standard", # english,ik_max_word,ik_smart
"text":"i like cats"
}
Definition: a character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, it can strip HTML elements or convert digits such as 0123 into the words 零一二三.
An analyzer may have zero or more character filters, which are applied in order.
Built-in character filters in ES 7:
html_strip: removes HTML elements
The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp;.
mapping: replaces characters according to a mapping
The mapping character filter replaces any occurrences of the specified strings with the specified replacements.
pattern_replace: replaces characters matching a regular expression
The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement.
html_strip accepts an escaped_tags parameter:
"char_filter": {
"my_char_filter": {
"type": "html_strip",
"escaped_tags": ["b"]
}
}
escaped_tags: An array of HTML tags which should not be stripped from the original text.
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "I'm so happy!
"
}
I'm so happy! # 忽略了b标签
The mapping character filter accepts a map of keys and values. Whenever it encounters a string of characters that is the same as a key, it replaces them with the value associated with that key.
Replacements are allowed to be the empty string.
The mapping character filter accepts the following parameters (one of the two is required):
mappings:
An array of mappings, with each element having the form key => value.
mappings_path:
A path, either absolute or relative to the config directory, to a UTF-8 encoded text mappings file containing a key => value mapping per line.
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"一 => 0",
"二 => 1",
"# => ", # 映射值可以为空
"一二三 => 老虎" # 映射可以多个字符
]
}
}
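To see a mapping in action without creating an index, the _analyze API also accepts inline char_filter definitions. A minimal sketch (the keyword tokenizer is chosen here only so the whole string stays one token; the mappings are illustrative):
POST _analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [ "一 => 0", "二 => 1" ]
}
],
"text": "一二"
}
Expected single term: [ 01 ]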
The pattern_replace character filter uses a regular expression to match characters which should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.
Beware of Pathological Regular Expressions
Inefficient (pathological) regular expressions can cause a StackOverflowError; ES 7 regular expressions follow Java's Pattern syntax.
The pattern_replace character filter accepts the following parameters:
pattern:
A Java regular expression. Required.
replacement:
The replacement string, which can reference capture groups using the $1..$9 syntax.
flags:
Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS".
123-456-789 → 123_456_789:
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
}
Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting.
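The same pattern_replace filter can be tried through _analyze with an inline definition (a sketch; standard tokenizer assumed). The digits should come out joined by underscores as a single token:
POST _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
],
"text": "My credit card is 123-456-789"
}
[ My, credit, card, is, 123_456_789 ]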
A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like the from the token stream, and a synonym token filter introduces synonyms into the token stream.
Token filters are not allowed to change the position or character offsets of each token.
An analyzer may have zero or more token filters, which are applied in order.
The asciifolding token filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
It accepts a preserve_original setting, which defaults to false; if true, the original token is kept in addition to the folded token.
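For example, folding accented characters to plain ASCII (a minimal _analyze sketch using the built-in asciifolding filter):
POST _analyze
{
"tokenizer": "standard",
"filter": [ "asciifolding" ],
"text": "açaí à la carte"
}
[ acai, a, la, carte ]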
The length token filter removes tokens that are shorter or longer than the configured bounds.
Settings:
min
The minimum number. Defaults to 0.
max
The maximum number. Defaults to Integer.MAX_VALUE, which is 2^31-1 or 2147483647
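A minimal sketch of the length filter via _analyze (inline definition; max 4 keeps only the short tokens):
POST _analyze
{
"tokenizer": "whitespace",
"filter": [
{ "type": "length", "min": 0, "max": 4 }
],
"text": "the quick brown fox jumps over the lazy dog"
}
[ the, fox, over, the, lazy, dog ]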
The lowercase token filter normalizes token text to lowercase; the language parameter selects a language other than English (e.g. greek, irish, turkish).
The uppercase token filter normalizes token text to uppercase.
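A sketch of registering a language-specific lowercase filter in the index settings (the filter and analyzer names here are only illustrative):
"settings": {
"analysis": {
"filter": {
"greek_lowercase": {
"type": "lowercase",
"language": "greek"
}
},
"analyzer": {
"greek_lowercase_example": {
"tokenizer": "standard",
"filter": [ "greek_lowercase" ]
}
}
}
}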
The ngram token filter splits each token into n-grams; for example, it can provide n-gram handling of embedded English words inside a Chinese analyzer.
Settings:
min_gram
Defaults to 1.
max_gram
Defaults to 2.
The index-level setting index.max_ngram_diff controls the maximum allowed difference between max_gram and min_gram.
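With the default min_gram 1 / max_gram 2, the ngram filter splits each token into 1- and 2-character grams, for example:
POST _analyze
{
"tokenizer": "standard",
"filter": [ "ngram" ],
"text": "Quick fox"
}
[ Q, Qu, u, ui, i, ic, c, ck, k, f, fo, o, ox, x ]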
The edge_ngram token filter emits n-grams anchored to the start of each token: 123 yields 1, 12, 123 but never 2 or 23.
Settings:
min_gram
Defaults to 1.
max_gram
Defaults to 2.
side
deprecated. Either front or back. Defaults to front.
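The 1 / 12 / 123 example above can be reproduced with an inline edge_ngram filter (a sketch; the keyword tokenizer keeps 123 as a single token first):
POST _analyze
{
"tokenizer": "keyword",
"filter": [
{ "type": "edge_ngram", "min_gram": 1, "max_gram": 3 }
],
"text": "123"
}
[ 1, 12, 123 ]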
The decimal_digit token filter converts Unicode digits to 0-9; for example, the Arabic-Indic digit ٢ becomes 2.
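A minimal _analyze sketch of decimal_digit (the sample text mixes Devanagari digits with Latin words; chosen only for illustration):
POST _analyze
{
"tokenizer": "whitespace",
"filter": [ "decimal_digit" ],
"text": "१-one two-२ ३"
}
[ 1-one, two-2, 3 ]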
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word which the term represents.
An analyzer must have exactly one tokenizer.
Testing a tokenizer:
POST _analyze
{
"tokenizer": "tokenzer名称",
"text": "分词文本:The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The following tokenizers are usually used for tokenizing full text into individual words.
Configuration:
max_token_length
The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255
Tokens longer than this are split at max_token_length intervals; e.g. with max_token_length 3, abcd becomes abc, d.
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
}
}
}
}
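Assuming the settings above were created in an index named my_index, analyzing the sample sentence should split every token longer than 5 characters (jumped becomes jumpe, d):
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]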
These tokenizers break up text or words into small fragments.
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
Example output:
POST _analyze
{
"tokenizer": "ngram",
"text": "Quick Fox"
}
The above sentence would produce the following terms:
[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
With the default settings, the ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2.
Configuration
min_gram
Minimum length of characters in a gram. Defaults to 1.
max_gram
Maximum length of characters in a gram. Defaults to 2.
token_chars
Character classes that should be included in a token. Elasticsearch will split on characters that don't belong to the classes specified. Defaults to [] (keep all characters).
Character classes may be any of the following:
letter — for example a, b, ï or 京
digit — for example 3 or 7
whitespace — for example " " or "\n"
punctuation — for example ! or "
symbol — for example $ or √
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "2 Quick Foxes."
}
Only letter and digit characters are kept; the tokenizer splits on everything else:
[ Qui, uic, ick, Fox, oxe, xes ]
The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.
Edge n-grams are anchored to the beginning of the word: abc can produce ab and abc, but never bc.
Its parameters are the same as for the ngram tokenizer.
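With its defaults (min_gram 1, max_gram 2), the edge_ngram tokenizer emits only the first one and two characters of each input, for example:
POST _analyze
{
"tokenizer": "edge_ngram",
"text": "Quick Fox"
}
[ Q, Qu ]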
The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text.
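For instance, the path_hierarchy tokenizer (one of these structured-text tokenizers) emits every ancestor path of its input:
POST _analyze
{
"tokenizer": "path_hierarchy",
"text": "/one/two/three"
}
[ /one, /one/two, /one/two/three ]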
built-in analyzers vs. custom analyzers:
If you do not find an analyzer suitable for your needs, you can create a custom analyzer which combines the appropriate character filters, tokenizer, and token filters.
The built-in analyzers can be used directly without any configuration; some of them, however, support configuration options to alter their behaviour.
Example: the standard analyzer with the stopwords parameter:
"analysis": {
"analyzer": {
"自定义分词器名称": {
"type": "standard",
"stopwords": "_english_" # 支持英语停用词 即分词忽略the a等
}
}
}
The standard analyzer is the default analyzer, used if none is specified.
Example output:
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
standard analyzer parameters:
max_token_length
The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
e.g. with max_token_length 5, jumped is split into jumpe, d.
stopwords
A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_
For example _english_ or a custom array such as ["a", "the"]; defaults to _none_.
stopwords_path
The path to a file containing stop words.
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5, # token最长为5
"stopwords": "_english_" # 忽略英语停用词
}
}
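Assuming this analyzer was registered in an index called my_index, the sample sentence should come out with stop words removed and jumped split at 5 characters:
POST my_index/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]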
Definition
The standard analyzer consists of:
- Tokenizer: Standard Tokenizer
- Token Filters: Lower Case Token Filter, Stop Token Filter (disabled by default)
If you need to customize the standard analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer (type: custom) and modify it, usually by adding token filters:
"analysis": {
"analyzer": {
"rebuilt_standard": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase"
]
}
}
}
A rebuilt standard analyzer cannot take parameters such as max_token_length or stopwords; the equivalent behaviour has to be added through token filters (e.g. the lowercase filter above, or a stop filter).
The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.
Example output:
POST _analyze
{
"analyzer": "simple",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
The simple analyzer is not configurable.
Definition: the simple analyzer consists of the Lower Case Tokenizer only.
The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.
Example output:
POST _analyze
{
"analyzer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
The whitespace analyzer is not configurable.
Definition: the whitespace analyzer consists of the Whitespace Tokenizer only.
The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to using the _english_ stop words.
Example output:
POST _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
Configuration:
stopwords
A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _english_.
stopwords_path
The path to a file containing stop words. This path is relative to the Elasticsearch config directory.
"analyzer": {
"my_stop_analyzer": {
"type": "stop",
"stopwords": ["the", "over"]
}
}
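With this my_stop_analyzer registered in an index (called my_index here only for illustration), the sample sentence should lose only "the" and "over":
POST my_index/_analyze
{
"analyzer": "my_stop_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]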
Definition: rebuilding the stop analyzer as a custom analyzer:
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
},
"analyzer": {
"rebuilt_stop": {
"tokenizer": "lowercase",
"filter": [
"english_stop"
]
}
}
}
}
The keyword analyzer is a "noop" analyzer which returns the entire input string as a single token (input equals output).
Example output:
POST _analyze
{
"analyzer": "keyword",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following single term:
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
The keyword analyzer is not configurable.
Definition: the keyword analyzer consists of the Keyword Tokenizer only.
The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens themselves. It defaults to \W+, i.e. all non-word characters.
Example output:
POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Configuration (optional parameters):
pattern
A Java regular expression. Defaults to \W+.
flags
Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS".
lowercase
Should terms be lowercased or not. Defaults to true.
stopwords
A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
stopwords_path
The path to a file containing stop words.
"analyzer": {
"my_email_analyzer": {
"type": "pattern",
"pattern": "\\W|_",
"lowercase": true
}
}
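With this my_email_analyzer registered in an index (my_index here for illustration), an email address splits on non-word characters and underscores:
POST my_index/_analyze
{
"analyzer": "my_email_analyzer",
"text": "John_Smith@foo-bar.com"
}
[ john, smith, foo, bar, com ]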
Example: CamelCase tokenization
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "camel",
"text": "MooseX::FTPClass2_beta"
}
[ moose, x, ftp, class, 2, beta ]
The regular expression above, broken down:
([^\p{L}\d]+) # swallow non letters and numbers,
| (?<=\D)(?=\d) # or non-number followed by number,
| (?<=\d)(?=\D) # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]]) # or lower case
(?=\p{Lu}) # followed by upper case,
| (?<=\p{Lu}) # or upper case
(?=\p{Lu} # followed by upper case
[\p{L}&&[^\p{Lu}]] # then lower case
)
Definition: rebuilding the pattern analyzer as a custom analyzer:
{
"settings": {
"analysis": {
"tokenizer": {
"split_on_non_word": {
"type": "pattern",
"pattern": "\\W+"
}
},
"analyzer": {
"rebuilt_pattern": {
"tokenizer": "split_on_non_word",
"filter": [
"lowercase"
]
}
}
}
}
}
Language analyzers
Configurable parameters:
stopwords
stem_exclusion: The stem_exclusion parameter allows you to specify an array of lowercase words that should not be stemmed. Internally, this functionality is implemented by adding the keyword_marker token filter with the keywords set to the value of the stem_exclusion parameter.
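A minimal sketch of passing stem_exclusion to a language analyzer (the analyzer name and word list are only illustrative):
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "english",
"stem_exclusion": [ "organization", "organizations" ]
}
}
}
}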
english analyzer
The english analyzer could be reimplemented as a custom analyzer as follows:
PUT /english_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"rebuilt_english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
fingerprint analyzer
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
Example output:
POST _analyze
{
"analyzer": "fingerprint",
"text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence would produce the following single term:
[ and consistent godel is said sentence this yes ]
Configuration:
separator
The character to use to concatenate the terms. Defaults to a space.
max_output_size
The maximum token size to emit. Defaults to 255. Tokens larger than this size will be discarded.
stopwords
stopwords_path
{
"settings": {
"analysis": {
"analyzer": {
"my_fingerprint_analyzer": {
"type": "fingerprint",
"stopwords": "_english_",
"max_output_size": 222,
"separator": ","
}
}
}
}
}
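With this configuration the earlier sample sentence should come out as a single comma-joined token, stop words removed (a sketch; the index name my_index and the exact normalization are assumptions):
POST my_index/_analyze
{
"analyzer": "my_fingerprint_analyzer",
"text": "Yes yes, Gödel said this sentence is consistent and."
}
[ consistent,godel,said,sentence,yes ]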
Definition
When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of character filters, a tokenizer, and token filters.
Configuration:
tokenizer
A built-in or customised tokenizer. (Required)
char_filter
An optional array of built-in or customised character filters.
filter
An optional array of built-in or customised token filters.
position_increment_gap
When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next value to ensure that a phrase query doesn’t match two terms from different array elements. Defaults to 100
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom", # 自定义的analyzer其type固定custom
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
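Assuming this custom analyzer is created in an index named my_index, HTML is stripped, tokens are lowercased, and accents are folded to ASCII:
POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "Is this <b>déjà vu</b>?"
}
[ is, this, deja, vu ]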