The Elasticsearch analysis process

We often run into the question of why a given document was not matched by a search. In many cases this comes down to problems in the mapping definition or in the analysis configuration.

1: The analysis pipeline

The overall flow is roughly: raw text → character filters (pre-processing) → tokenizer (splits the text into tokens) → token filters (re-process the tokens).



  1. The text first passes through the character filters. A character filter pre-processes the raw character stream, for example replacing every "&" with "and", or stripping out "?" characters.
  2. The output then enters the tokenizer, the most important stage. The tokenizer splits the stream into tokens: given "tom is a good doctor .", and assuming a character filter has already removed the trailing ".", the tokenizer produces "tom", "is", "a", "good", "doctor".
  3. Finally the token stream passes through the token filters, which rework the tokens the tokenizer produced, for example splitting "tom" further into "t", "o", "m" (as an ngram filter would). Whatever comes out of this last stage is the final token set; a runnable sketch of the whole chain follows this list.
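
The whole chain can be exercised directly with the _analyze API. The sketch below uses the JSON body form available in Elasticsearch 5.x and later (older versions expose the same pieces as query parameters) and wires together a mapping character filter, the standard tokenizer, and the lowercase token filter:

```
POST _analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ "& => and" ]
    }
  ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "Tom & Jerry"
}
```

The response lists the tokens "tom", "and", "jerry": the character filter rewrote "&" before tokenization, the tokenizer split the text into words, and the token filter lowercased each token.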

2: Custom analyzers

In short, a custom analyzer combines exactly one tokenizer with zero or more token filters and zero or more character filters. A custom analyzer's name must not start with "_".

The following are settings that can be set for a custom analyzer type:

| Setting                | Description                                 |
| ----------------------|:--------------------------------------------|
| tokenizer              | The name of a built-in or registered tokenizer. |
| filter                 | A list of built-in or registered token filters. |
| char_filter            | A list of built-in or registered character filters. |
| position_increment_gap | The extra position gap inserted between the values of a multi-valued field, so that phrase queries do not match across separate values. Defaults to 100. |

A sample custom analyzer definition, placed under the index settings (old YAML configuration style):

```yaml
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myTokenFilter2]
                char_filter : [my_html]
                position_increment_gap : 256
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
```
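
The YAML form above dates from the configuration-file era. Here is a sketch of the same analysis chain declared as JSON in an index-creation request, the usual approach in current Elasticsearch versions. The index name my_index is made up, and two settings are omitted because they no longer live here: position_increment_gap moved to the field mapping, and read_ahead is no longer supported by html_strip.

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer2": {
          "type": "custom",
          "tokenizer": "myTokenizer1",
          "filter": ["myTokenFilter1", "myTokenFilter2"],
          "char_filter": ["my_html"]
        }
      },
      "tokenizer": {
        "myTokenizer1": { "type": "standard", "max_token_length": 900 }
      },
      "filter": {
        "myTokenFilter1": { "type": "stop", "stopwords": ["stop1", "stop2", "stop3", "stop4"] },
        "myTokenFilter2": { "type": "length", "min": 0, "max": 2000 }
      },
      "char_filter": {
        "my_html": { "type": "html_strip", "escaped_tags": ["xxx", "yyy"] }
      }
    }
  }
}
```

You can then verify the chain with GET /my_index/_analyze and "analyzer": "myAnalyzer2" in the request body.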

3: Some of the built-in analyzers in ES

| analyzer              | logical name  | description                               |
| ----------------------|:-------------:| :-----------------------------------------|
| standard analyzer     | standard      | standard tokenizer, standard filter, lower case filter, stop filter |
| simple analyzer       | simple        | lower case tokenizer                      |
| stop analyzer         | stop          | lower case tokenizer, stop filter         |
| keyword analyzer      | keyword       | No tokenization; the whole input becomes a single token (not_analyzed) |
| pattern analyzer      | pattern       | Splits on a regular expression; matches \W+ by default |
| language analyzers    | [lang](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html) | Language-specific analyzers |
| snowball analyzer     | snowball      | standard tokenizer, standard filter, lower case filter, stop filter, snowball filter |
| custom analyzer       | custom        | One tokenizer, zero or more token filters, zero or more char filters |
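
The practical difference between these analyzers shows up immediately in _analyze output. A quick sketch (JSON body form, ES 5.x+) comparing the standard and keyword analyzers on the same text:

```
POST _analyze
{ "analyzer": "standard", "text": "Good Doctor" }

POST _analyze
{ "analyzer": "keyword", "text": "Good Doctor" }
```

The first call returns the two lowercased tokens "good" and "doctor"; the second returns the single unmodified token "Good Doctor", which is why keyword-analyzed fields only match exact values.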

tokenizer: the built-in tokenizers in ES.
| tokenizer             | logical name  | description                           |
| ----------------------|:-------------:| :-------------------------------------|
| standard tokenizer    | standard      |                                       |
| edge ngram tokenizer  | edgeNGram     |                                       |
| keyword tokenizer     | keyword       | Emits the whole input as a single token |
| letter tokenizer      | letter        | Splits on non-letter characters       |
| lowercase tokenizer   | lowercase     | letter tokenizer, lower case filter   |
| ngram tokenizer       | nGram         |                                       |
| whitespace tokenizer  | whitespace    | Splits on whitespace                  |
| pattern tokenizer     | pattern       | Splits using a configurable regular expression |
| uax email url tokenizer | uax_url_email | Does not split URLs and email addresses |
| path hierarchy tokenizer | path_hierarchy | Handles path-like strings such as `/path/to/something` |
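
The path_hierarchy tokenizer is a good illustration of a tokenizer doing more than splitting on separators: it emits one token per path prefix, which is what makes filtering on whole directory subtrees work. A sketch:

```
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/path/to/something"
}
```

With the default "/" delimiter this returns the tokens "/path", "/path/to", and "/path/to/something".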

token filter: the built-in token filters in ES.
| token filter          | logical name  | description                           |
| ----------------------|:-------------:| :-------------------------------------|
| standard filter       | standard      |                                       |
| ascii folding filter  | asciifolding  |                                       |
| length filter         | length        | Removes tokens that are too long or too short |
| lowercase filter      | lowercase     | Lowercases tokens                     |
| ngram filter          | nGram         |                                       |
| edge ngram filter     | edgeNGram     |                                       |
| porter stem filter    | porterStem    | Porter stemming algorithm             |
| shingle filter        | shingle       | Builds word n-grams (shingles) from the token stream |
| stop filter           | stop          | Removes stop words                    |
| word delimiter filter | word_delimiter | Splits a token into sub-word tokens  |
| stemmer token filter  | stemmer       |                                        |
| stemmer override filter | stemmer_override |                                   |
| keyword marker filter | keyword_marker |                                       |
| keyword repeat filter | keyword_repeat |                                       |
| kstem filter          | kstem         |                                        |
| snowball filter       | snowball      |                                        |
| phonetic filter       | phonetic      | [plugin](https://github.com/elasticsearch/elasticsearch-analysis-phonetic) |
| synonym filter        | synonym       | Handles synonyms                      |
| compound word filter  | dictionary_decompounder, hyphenation_decompounder | Decomposes compound words |
| reverse filter        | reverse       | Reverses each token                   |
| elision filter        | elision       | Removes elisions (e.g. French l', d') |
| truncate filter       | truncate      | Truncates tokens to a fixed length    |
| unique filter         | unique        |                                        |
| pattern capture filter | pattern_capture |                                     |
| pattern replace filter | pattern_replace | Replaces token text via a regular expression |
| trim filter           | trim          | Trims surrounding whitespace          |
| limit token count filter | limit      | Limits the number of tokens           |
| hunspell filter       | hunspell      | Stemming based on Hunspell dictionaries |
| common grams filter   | common_grams  |                                        |
| normalization filter  | arabic_normalization, persian_normalization |        |
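
As a concrete example of a token filter reworking an already-tokenized stream, the shingle filter with default settings emits the original unigrams plus two-word shingles. A sketch:

```
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "shingle" ],
  "text": "good doctor tom"
}
```

The output stream is "good", "good doctor", "doctor", "doctor tom", "tom".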

character filter: the built-in character filters in ES.
| character filter          | logical name  | description               |
| --------------------------|:-------------:| :-------------------------|
| mapping char filter       | mapping       | Replaces characters according to a configured mapping |
| html strip char filter    | html_strip    | Strips HTML elements      |
| pattern replace char filter | pattern_replace | Transforms characters using a regular expression |

4: The ES _analyze API

You can check how a registered analyzer tokenizes a piece of text by calling the _analyze endpoint, here in the pre-5.x query-parameter form:

http://localhost:11200/search-product/_analyze?analyzer=keywordLowercase&text=笔记本
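
Assuming keywordLowercase is a custom analyzer (keyword tokenizer plus lowercase filter) registered on the search-product index, the response would have the following shape; the offsets and position shown are illustrative:

```
{
  "tokens": [
    {
      "token": "笔记本",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    }
  ]
}
```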
