Elasticsearch 7 Analyzers (Built-in and Custom)

Contents

  • Elasticsearch 7 Analyzers (Built-in and Custom)
    • analysis
      • Overview
      • char_filter
        • html_strip
        • mapping
        • pattern_replace
      • filter
        • asciifolding
        • length
        • lowercase
        • uppercase
        • ngram
        • edge_ngram
        • decimal_digit
      • tokenizer
        • Word Oriented Tokenizers
          • Standard tokenizer
        • Partial Word Tokenizers
          • NGram Tokenizer
          • Edge NGram Tokenizer
        • Structured Text Tokenizers
      • analyzer
        • standard /Standard Tokenizer;Lower Case Token Filter,Stop Token Filter
        • simple /Lower Case Tokenizer
        • whitespace /Whitespace Tokenizer
        • stop /Lower Case Tokenizer;Stop Token Filter
        • keyword /Keyword Tokenizer
        • pattern /Pattern Tokenizer;Lower Case Token Filter,Stop Token Filter
        • Language Analyzers
        • fingerprint /Standard Tokenizer;Lower Case Token Filter,ASCII Folding Token Filter,Stop Token Filter,Fingerprint Token Filter
        • Custom analyzer

analysis

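Overview

"settings": {
    "analysis": {  # custom analysis components
      "filter": {  # token filters
        "my_custom_filter": {
            "type": "edge_ngram",  # filter type
            "min_gram": 1,  # minimum gram length
            "max_gram": 6   # maximum gram length
        }
      },
      "char_filter": {},  # character filters
      "tokenizer": {},    # tokenizers
      "analyzer": {  # analyzers
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "name of a custom tokenizer defined above, or a built-in tokenizer",
          "filter": [
            "name of a custom token filter defined above, or a built-in token filter"
          ],
          "char_filter": [
            "name of a custom character filter defined above, or a built-in character filter"
          ]
        }
      }
    }
}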

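Testing an analyzer:

1. Test an analyzer defined on a specific index
POST /discovery-user/_analyze
{
  "analyzer": "analyzer_ngram",
  "text": "i like cats"
}
2. Test an analyzer that is available to every index
POST _analyze
{
  "analyzer": "standard",  # or english, ik_max_word, ik_smart (the ik_* analyzers come from the IK plugin)
  "text": "i like cats"
}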

char_filter

Definition: a character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, a character filter can strip HTML elements or convert the digits 0123 to 零一二三.

An analyzer may have zero or more character filters, which are applied in order.

Built-in character filters in Elasticsearch 7:

  • HTML Strip Character Filter:html_strip
    The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp;.
  • Mapping Character Filter:mapping
    The mapping character filter replaces any occurrences of the specified strings with the specified replacements.
  • Pattern Replace Character Filter:pattern_replace
    The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement.

html_strip

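The html_strip character filter accepts the escaped_tags parameter:

"char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
}
escaped_tags: An array of HTML tags which should not be stripped from the original text.

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
Assuming my_analyzer combines my_char_filter with the keyword tokenizer, the <p> tags are stripped and the <b> tag is kept:

[ \nI'm so <b>happy</b>!\n ]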

mapping

The mapping character filter accepts a map of keys and values. Whenever it encounters a string of characters that is the same as a key, it replaces them with the value associated with that key.
Replacements are allowed to be the empty string.

The mapping character filter accepts the following parameters; one of the two must be provided:
mappings

An array of mappings, with each element having the form key => value.

mappings_path

A path, either absolute or relative to the config directory, to a UTF-8 encoded text mappings file containing a key => value mapping per line.

"char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "一 => 0",
            "二 => 1",
            "# => ",  # the replacement may be empty
            "一二三 => 老虎"  # a key may span several characters
          ]
        }
}
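
To make the replacement rules concrete, here is a minimal pure-Python sketch that emulates a mapping character filter outside Elasticsearch. The helper names parse_rules and apply_mapping are hypothetical, and the sketch assumes that the longest key matching at a given position wins and that whitespace around keys and values is trimmed.

```python
def parse_rules(rules):
    """Turn ["key => value", ...] into a dict, trimming whitespace on both sides."""
    table = {}
    for rule in rules:
        key, _, value = rule.partition("=>")
        table[key.strip()] = value.strip()
    return table


def apply_mapping(text, table):
    """Replace keys with their values, preferring the longest key at each position."""
    keys = sorted(table, key=len, reverse=True)  # try the longest key first
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:  # no key matches here: copy the character unchanged
            out.append(text[i])
            i += 1
    return "".join(out)


table = parse_rules(["一 => 0", "二 => 1", "# => ", "一二三 => 老虎"])
print(apply_mapping("一二三#一二", table))  # 老虎01
```

The leading 一二三 is replaced as a whole, the # disappears, and the remaining 一 and 二 are mapped one character at a time.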

pattern_replace

The pattern_replace character filter uses a regular expression to match characters which should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.

Beware of pathological regular expressions
The filter uses Java regular expressions (java.util.regex.Pattern). A badly written expression can run very slowly or even throw a StackOverflowError.

The pattern_replace character filter accepts the following parameters:
pattern

A Java regular expression. Required.

replacement

The replacement string, which can reference capture groups using the $1..$9 syntax.

flags

Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

Example: rewrite 123-456-789 as 123_456_789:
"char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
}

Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting.
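
The 123-456-789 example can be tried without a cluster. The sketch below uses Python's re module as a stand-in for Java's Pattern; the function name pattern_replace is hypothetical, and the only translation it performs is rewriting Java's $1 group references as Python's \1. It is an approximation for simple patterns, not a faithful port of the Java regex engine.

```python
import re


def pattern_replace(text, pattern, replacement):
    """Apply a regex replacement written with Java-style $1..$9 group references."""
    # Java writes group references as $1; Python's re module writes them as \1.
    python_replacement = re.sub(r"\$(\d)", r"\\\1", replacement)
    return re.sub(pattern, python_replacement, text)


# A digit run followed by "-" and another digit keeps the digits and swaps "-" for "_".
print(pattern_replace("123-456-789", r"(\d+)-(?=\d)", "$1_"))  # 123_456_789
```

The lookahead (?=\d) checks that a digit follows the dash without consuming it, so a trailing dash is left alone.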

filter

A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like "the" from the token stream, and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.

asciifolding

A token filter of type asciifolding converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the “Basic Latin” Unicode block) into their ASCII equivalents, if one exists.

It accepts a preserve_original setting, which defaults to false. If set to true, the filter keeps the original token in addition to emitting the folded token.

length

Removes tokens that are too short or too long.

Setting  Description
min      The minimum token length. Defaults to 0.
max      The maximum token length. Defaults to Integer.MAX_VALUE, which is 2^31-1 or 2147483647.

lowercase

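Normalizes token text to lower case.
The optional language parameter selects lowercasing rules for a language other than English, for example greek, irish, or turkish.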

uppercase

Normalizes token text to upper case.

ngram

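The ngram token filter splits each token into n-grams. One use is to add partial matching on English words when the main analyzer is a Chinese one.

Setting   Description
min_gram  The minimum gram length. Defaults to 1.
max_gram  The maximum gram length. Defaults to 2.

The index-level setting index.max_ngram_diff controls the maximum allowed difference between max_gram and min_gram.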
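
As a rough illustration of what the filter emits, here is a pure-Python sketch. The function ngram_filter is hypothetical; the emission order (all sizes for one start offset before moving to the next) and the default limit of 1 for the max_gram/min_gram difference are assumptions of this sketch.

```python
def ngram_filter(tokens, min_gram=1, max_gram=2, max_ngram_diff=1):
    """Emit every n-gram of each token with a length between min_gram and max_gram."""
    if max_gram - min_gram > max_ngram_diff:
        # Mirrors the role of index.max_ngram_diff: reject overly wide ranges.
        raise ValueError("max_gram - min_gram exceeds max_ngram_diff")
    grams = []
    for token in tokens:
        for start in range(len(token)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(token):
                    grams.append(token[start:start + size])
    return grams


print(ngram_filter(["fox"]))  # ['f', 'fo', 'o', 'ox', 'x']
```

Widening the range, for example min_gram 1 and max_gram 4, raises an error here unless max_ngram_diff is raised to match.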

edge_ngram

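The edge_ngram token filter only emits n-grams anchored to the start of a token. With max_gram set to 3, for example, 123 becomes 1, 12, 123; 2 and 23 are never produced.

Setting   Description
min_gram  The minimum gram length. Defaults to 1.
max_gram  The maximum gram length. Defaults to 2.
side      Deprecated. Either front or back. Defaults to front.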
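
The front-anchored behaviour is easy to emulate. The following sketch (the function edge_ngram_filter is a hypothetical name, and only the default front side is modelled) produces the prefixes of each token:

```python
def edge_ngram_filter(tokens, min_gram=1, max_gram=2):
    """Emit the prefixes of each token whose length is between min_gram and max_gram."""
    grams = []
    for token in tokens:
        longest = min(max_gram, len(token))
        for size in range(min_gram, longest + 1):
            grams.append(token[:size])
    return grams


print(edge_ngram_filter(["123"], min_gram=1, max_gram=3))  # ['1', '12', '123']
print(edge_ngram_filter(["123"]))                          # ['1', '12']
```

With the default max_gram of 2 the full token 123 is not emitted, which is why prefix-search setups usually raise max_gram.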

decimal_digit

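The decimal_digit token filter converts decimal digits from any script (Unicode general category Nd) to the ASCII digits 0-9.
For example, the Arabic-Indic digit ٢ (\u0662) becomes 2 (\u0032).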
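
Python's unicodedata module exposes the same digit values, so the conversion can be sketched in a few lines. The function name decimal_digit is hypothetical and the sketch only approximates the filter: it maps every character for which str.isdecimal() is true to its ASCII digit.

```python
import unicodedata


def decimal_digit(token):
    """Replace every Unicode decimal digit in the token with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in token
    )


# Arabic-Indic digits one, two, three, written as escapes.
print(decimal_digit("\u0661\u0662\u0663"))  # 123
```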

tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.

The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word which the term represents.

An analyzer must have exactly one tokenizer.

Testing a tokenizer:

POST _analyze
{
  "tokenizer": "tokenizer_name",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into individual words.

Standard tokenizer

Configuration:
max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255. For example, with max_token_length set to 3, abcd is split into abc and d.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
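
The splitting rule for over-long tokens can be sketched in Python. The function split_long_token is a hypothetical helper that models only the "split at max_token_length intervals" step, not the word-boundary detection of the standard tokenizer.

```python
def split_long_token(token, max_token_length=255):
    """Cut a token into consecutive pieces of at most max_token_length characters."""
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


print(split_long_token("jumped", 5))  # ['jumpe', 'd']
print(split_long_token("abcd", 3))    # ['abc', 'd']
```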

Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word matching.

NGram Tokenizer

The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.

Example output:

POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}
The above sentence would produce the following terms:

[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]

With the default settings, the ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2.

Configuration
min_gram

Minimum length of characters in a gram. Defaults to 1.

max_gram

Maximum length of characters in a gram. Defaults to 2.

token_chars

Character classes that should be included in a token. Elasticsearch will split on characters that don't belong to the classes specified. Defaults to [] (keep all characters).

Character classes may be any of the following:
letter —  for example a, b, ï or 京
digit —  for example 3 or 7
whitespace —  for example " " or "\n"
punctuation — for example ! or "
symbol —  for example $ or √
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
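Tokens contain only letters and digits; every other character acts as a separator. The single-character word 2 produces nothing because it is shorter than min_gram: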
[ Qui, uic, ick, Fox, oxe, xes ]
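
Both results shown in this section can be reproduced with a small emulation. In the sketch below, ngram_tokenizer and char_class are hypothetical names, only the letter, digit, and whitespace classes are modelled (punctuation and symbol are lumped together as "other"), and the emission order is an assumption chosen to match the outputs quoted above.

```python
def char_class(ch):
    """Classify a character roughly the way token_chars does."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "whitespace"
    return "other"  # punctuation and symbol are not distinguished in this sketch


def ngram_tokenizer(text, min_gram=1, max_gram=2, token_chars=()):
    """Split text on characters outside token_chars, then emit n-grams per word."""
    if not token_chars:
        words = [text]  # empty list: keep all characters, text is one word
    else:
        words, current = [], []
        for ch in text:
            if char_class(ch) in token_chars:
                current.append(ch)
            elif current:
                words.append("".join(current))
                current = []
        if current:
            words.append("".join(current))
    grams = []
    for word in words:
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    grams.append(word[start:start + size])
    return grams


print(ngram_tokenizer("2 Quick Foxes.", 3, 3, ("letter", "digit")))
# ['Qui', 'uic', 'ick', 'Fox', 'oxe', 'xes']
```

Calling ngram_tokenizer("Quick Fox") with the defaults keeps the space inside the grams, which is why "k " and " F" appear in the default output.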

Edge NGram Tokenizer

The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word. For example, with min_gram 2 and max_gram 3, abc yields ab and abc, but never bc.
It accepts the same parameters as the ngram tokenizer.

Structured Text Tokenizers

The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text.

analyzer

Built-in analyzers:

  • Standard Analyzer:standard
    The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
  • Simple Analyzer:simple
    The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
  • Whitespace Analyzer:whitespace
    The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.
  • Stop Analyzer:stop
    The stop analyzer is like the simple analyzer, but also supports removal of stop words.
  • Keyword Analyzer:keyword
    The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.
  • Pattern Analyzer:pattern
    The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words.
  • Language Analyzers:english, french, and others
    Elasticsearch provides many language-specific analyzers like english or french.
  • Fingerprint Analyzer:fingerprint
    The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.

Custom analyzers:
If you do not find an analyzer suitable for your needs, you can create a custom analyzer which combines the appropriate character filters, tokenizer, and token filters.

The built-in analyzers can be used directly without any configuration.
Some of them, however, support configuration options to alter their behaviour.

Example: the standard analyzer configured with the stopwords parameter
"analysis": {
      "analyzer": {
        "my_standard_analyzer": {
          "type":      "standard",
          "stopwords": "_english_"  # remove English stop words such as "the" and "a"
        }
      }
}

standard /Standard Tokenizer;Lower Case Token Filter,Stop Token Filter

The standard analyzer is the default analyzer which is used if none is specified.
Example output:

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

Configuration:
max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255. For example, with max_token_length set to 5, jumped is split into jumpe and d.

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words, such as ["a", "the"]. Defaults to _none_.

stopwords_path

The path to a file containing stop words.
"analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,  # split tokens longer than 5 characters
          "stopwords": "_english_"  # remove English stop words
        }
}

Definition
By default, the standard analyzer consists of:

  • Tokenizer
    • Standard Tokenizer
  • Token Filters
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

If you need to customize the standard analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer (type custom) and modify it, usually by adding token filters.

 "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"       
          ]
        }
      }
 }
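 A custom analyzer does not accept analyzer-level parameters such as max_token_length or stopwords. Configure them on the components instead: set max_token_length on a custom standard tokenizer, and handle stop words by adding a Stop Token Filter after the Lower Case Token Filter.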

simple /Lower Case Tokenizer

The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.
Example output:

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

The simple analyzer is not configurable.

Definition:

  • Tokenizer
    • Lower Case Tokenizer

whitespace /Whitespace Tokenizer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.
Example output:

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

The whitespace analyzer is not configurable.

Definition

  • Tokenizer
    • Whitespace Tokenizer

stop /Lower Case Tokenizer;Stop Token Filter

The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to using the _english_ stop words.
Example output:

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

Configuration:
stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _english_.

stopwords_path

The path to a file containing stop words. This path is relative to the Elasticsearch config directory.
"analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
 }

definition:

  • Tokenizer
    • Lower Case Tokenizer
  • Token filters
    • Stop Token Filter
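
The two stages of this definition can be imitated in Python to see where each output term comes from. The function stop_analyzer is a hypothetical name, and STOP_WORDS is a small illustrative subset rather than the full _english_ list.

```python
import re

# Illustrative subset only; the real _english_ list is longer.
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def stop_analyzer(text, stopwords=STOP_WORDS):
    """Split on non-letters, lowercase the terms, then drop stop words."""
    # Lower Case Tokenizer stage: runs of letters, lowercased.
    terms = [term.lower() for term in re.findall(r"[^\W\d_]+", text)]
    # Stop Token Filter stage.
    return [term for term in terms if term not in stopwords]


sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(stop_analyzer(sentence))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
print(stop_analyzer(sentence, stopwords={"the", "over"}))
# ['quick', 'brown', 'foxes', 'jumped', 'lazy', 'dog', 's', 'bone']
```

The digit 2 vanishes because it is not a letter, and dog's splits into dog and s at the apostrophe.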

Rebuilding the stop analyzer as a custom analyzer:

"settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop"          
          ]
        }
      }
    }
  }

keyword /Keyword Tokenizer

The keyword analyzer is a “noop” analyzer which returns the entire input string as a single token.
Example output:

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following single term:

[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

The keyword analyzer is not configurable.

Definition

  • Tokenizer
    • Keyword Tokenizer

pattern /Pattern Tokenizer;Lower Case Token Filter,Stop Token Filter

The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens themselves. The regular expression defaults to \W+ (all non-word characters).
Example output:

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Configuration:
pattern

A Java regular expression, defaults to \W+.

flags

Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

lowercase

Should terms be lowercased or not. Defaults to true.

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.

stopwords_path

The path to a file containing stop words.

Example: split email addresses on non-word characters and underscores
"analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
 }

Example: CamelCase tokenizer

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}

[ moose, x, ftp, class, 2, beta ]

The regular expression above, written out with comments:

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )

definition

  • Tokenizer
    • Pattern Tokenizer
  • Token Filters
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

Rebuilding the pattern analyzer as a custom analyzer:

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type":       "pattern",
          "pattern":    "\\W+" 
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}

Language Analyzers

Analyzers for specific languages.

Configuration:
stopwords: the stop words to remove.
stem_exclusion: The stem_exclusion parameter allows you to specify an array of lowercase words that should not be stemmed. Internally, this functionality is implemented by adding the keyword_marker token filter with the keywords set to the value of the stem_exclusion parameter.

english analyzer
The english analyzer could be reimplemented as a custom analyzer as follows:

PUT /english_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"] 
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

fingerprint /Standard Tokenizer;Lower Case Token Filter,ASCII Folding Token Filter,Stop Token Filter,Fingerprint Token Filter

Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
Example output:

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence would produce the following single term:

[ and consistent godel is said sentence this yes ]

configuration
separator

The character to use to concatenate the terms. Defaults to a space.

max_output_size

The maximum token size to emit. Defaults to 255. Tokens larger than this size will be discarded.

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.

stopwords_path

The path to a file containing stop words.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_",
          "max_output_size": 222,
          "separator": ","
        }
      }
    }
  }
}

Definition

  • Tokenizer
    • Standard Tokenizer
  • Token Filters (in order)
    • Lower Case Token Filter
    • ASCII Folding Token Filter
    • Stop Token Filter (disabled by default)
    • Fingerprint Token Filter
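
The filter chain in this definition can be approximated in Python to see how the single output term is built. The names fingerprint and ascii_fold are hypothetical; \w+ stands in for the Standard Tokenizer, and NFKD decomposition stands in for ASCII folding, which covers accented Latin letters but not every character the real filter handles.

```python
import re
import unicodedata


def ascii_fold(token):
    """Approximate ASCII folding by dropping combining marks after decomposition."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text, separator=" ", max_output_size=255, stopwords=()):
    """Lowercase, fold, drop stop words, then sort, deduplicate, and join the tokens."""
    tokens = [ascii_fold(token.lower()) for token in re.findall(r"\w+", text)]
    tokens = [token for token in tokens if token not in stopwords]
    joined = separator.join(sorted(set(tokens)))
    # An over-long fingerprint is discarded rather than truncated.
    return joined if len(joined) <= max_output_size else None


text = "Yes yes, Gödel said this sentence is consistent and."
print(fingerprint(text))
# and consistent godel is said sentence this yes
```

Passing stopwords={"and", "is", "this"} and separator="," yields consistent,godel,said,sentence,yes.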

Custom analyzer

When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of:

  • zero or more character filters
  • a tokenizer
  • zero or more token filters.

Configuration:
tokenizer

A built-in or customised tokenizer. (Required)

char_filter

An optional array of built-in or customised character filters.

filter

An optional array of built-in or customised token filters.

position_increment_gap

When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next value to ensure that a phrase query doesn't match two terms from different array elements. Defaults to 100.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom",  # the type of a custom analyzer is always custom
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
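
To show how the three stages chain together, here is a rough pure-Python imitation of my_custom_analyzer. Every function name is hypothetical and each stage is a simplification: tags are replaced by a space with a regular expression, \w+ stands in for the standard tokenizer, and NFKD decomposition stands in for asciifolding.

```python
import html
import re
import unicodedata


def html_strip(text):
    """Character filter stage: drop tags, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def standard_tokenize(text):
    """Tokenizer stage: runs of word characters (a crude stand-in)."""
    return re.findall(r"\w+", text)


def ascii_fold(token):
    """Token filter stage: strip accents from decomposed characters."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def my_custom_analyzer(text):
    """Run char filters, then the tokenizer, then token filters in order."""
    tokens = standard_tokenize(html_strip(text))
    return [ascii_fold(token.lower()) for token in tokens]


print(my_custom_analyzer("<p>Is this <b>déjà vu</b>?</p>"))
# ['is', 'this', 'deja', 'vu']
```

The order matters: the character filter sees raw text, the tokenizer sees filtered text, and the token filters see one token at a time.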
