Elasticsearch Analyzers

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis.html

An analyzer (analyzer) has the following properties:

Analyzer type (type): custom
Character filters (char_filter): zero or more
Tokenizer (tokenizer): exactly one
Token filters (filter): zero or more, applied in order
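
As a minimal sketch of how these pieces fit together (the index name and analyzer name below are made up; html_strip, standard, and lowercase are built-in components described later in this article), a custom analyzer could be defined like this:

PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}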

Character filters

A character filter, also called a preprocessing filter, preprocesses the character stream before it is passed to the tokenizer.
There are three character filters:

1. html_strip: HTML tag character filter

Features:

a. Strips HTML tags from the original text.

Optional configuration:

escaped_tags: an array of HTML tags that should not be stripped from the original text.

example:

GET _analyze
{
  "tokenizer":      "keyword", 
  "char_filter":  [ "html_strip" ],
  "text": "

I'm so happy!

" }
{
  "tokens": [
    {
      "token": """

I'm so happy!

""",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]  // 不从原始文本中过滤标签
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer":      "my_analyzer", 
  "text": "

I'm so happy!

" }
{
  "tokens": [
    {
      "token": """

I'm so <b>happy</b>!

""",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}

2. mapping: mapping character filter

Features:

a. The mapping character filter accepts an array of key-value pairs. Whenever it encounters a string equal to a key, it replaces it with the value associated with that key.
b. Matching is greedy; the longest matching key wins.
c. Replacing with an empty string is allowed.

Optional configuration:

mappings: an array of key => value mappings
mappings_path: a path to a file containing an array of key => value mappings (see the sketch after the example below)

example:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "$ => ¥"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is $203 & $110"
}
{
  "tokens": [
    {
      "token": "My license plate is ¥203 and ¥110",
      "start_offset": 0,
      "end_offset": 31,
      "type": "word",
      "position": 0
    }
  ]
}
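
If the mapping list is long, it can be loaded from a file instead of being listed inline. A minimal sketch, assuming a file at a hypothetical path (relative to the Elasticsearch config directory) containing one key => value mapping per line:

PUT my_index2
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings_path": "analysis/my_mappings.txt"
        }
      }
    }
  }
}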

3. pattern_replace: regex-replace character filter

Features:

a. Matches with a regular expression and replaces the matched characters with the specified string.
b. The replacement string can reference capture groups of the regular expression.

Optional configuration:

pattern: a Java regular expression. Required.
replacement: the replacement string, which can reference capture groups using the $1..$9 syntax.
flags: Java regular expression flags; flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

example:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
{
  "tokens": [
    {
      "token": "My credit card is 123_456_789",
      "start_offset": 0,
      "end_offset": 29,
      "type": "word",
      "position": 0
    }
  ]
}

Tokenizers

  • Standard Tokenizer

Features:

The standard tokenizer works well for most European languages and supports Unicode text segmentation.

Optional configuration:

max_token_length: the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255. (A demonstration follows the configuration example below.)

ex:

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes of dog's bone."
}

Result:

[The, 2, QUICK, Brown, Foxes, of, dog's, bone]

ex:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
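
To see max_token_length in action, one can analyze a sentence containing a word longer than 5 characters with the analyzer defined above. The expected (not verified here) behavior is that the long word is split at 5-character intervals:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped"
}

Expected result:

[The, 2, QUICK, Brown, Foxes, jumpe, d]
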
  • Letter Tokenizer

Features:

Splits the text whenever it encounters a character that is not a letter.

Optional configuration:

Not configurable.

ex:

POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes of dog's bone."
}

Result:

[The, QUICK, Brown, Foxes, of, dog, s, bone]

  • Lowercase Tokenizer

Features:

Equivalent to combining the letter tokenizer with the lowercase token filter; a quick check is sketched below.
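
A quick check with the _analyze API; the expected output is the letter tokenizer's result, lowercased:

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes of dog's bone."
}

Expected result:

[the, quick, brown, foxes, of, dog, s, bone]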

  • Whitespace Tokenizer

Features:

Splits the text whenever it encounters a whitespace character.

Optional configuration:

Not configurable. (A quick check is sketched below.)
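
For illustration, a quick _analyze request; since the tokenizer splits only on whitespace, punctuation and hyphens are expected to stay attached to their tokens:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes of dog's bone."
}

Expected result:

[The, 2, QUICK, Brown-Foxes, of, dog's, bone.]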

  • UAX URL Email Tokenizer

Features:

Like the standard tokenizer, but it additionally recognizes URLs and email addresses as single tokens.

Optional configuration:

max_token_length: defaults to 255

ex:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at [email protected] http://www.baidu.com"
}

Result:

[Email, me, at, [email protected], http://www.baidu.com]

  • Classic Tokenizer

Features:

A tokenizer built for English. It handles English acronyms, company names, email addresses, and most internet hostnames well, but it does not work well for languages other than English.

Optional configuration:

max_token_length: defaults to 255 (an example request is sketched below)
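
A quick sketch with the _analyze API; the expected output is close to the standard tokenizer's, with the hyphenated word split and the possessive form kept intact:

POST _analyze
{
  "tokenizer": "classic",
  "text": "The 2 QUICK Brown-Foxes of dog's bone."
}

Expected result:

[The, 2, QUICK, Brown, Foxes, of, dog's, bone]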

  • Thai Tokenizer

Features:

A tokenizer for Thai text; other text is treated much like the standard tokenizer treats it.

  • NGram Tokenizer

Features:

An n-gram acts like a sliding window that moves across the word, emitting continuous character sequences of the specified lengths. N-grams are useful for querying languages that are written without spaces between words (e.g. German compounds, Chinese).

Optional configuration:

min_gram: minimum length of the generated grams
max_gram: maximum length of the generated grams
token_chars: character classes that are kept in tokens; Elasticsearch splits the text on characters that do not belong to the configured classes. Defaults to [] (keep all characters).

Allowed token_chars values:

letter — for example a, b, ï or 京
digit — for example 3 or 7
whitespace — for example " " or "\n"
punctuation — for example ! or "
symbol — for example $ or √

ex:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter":["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 2311 Quick Foxes."
}

Tokenization result:

[231, 2311, 311, qui, quic, quick, uic, uick, ick, fox, foxe, foxes, oxe, oxes, xes]

  • Edge NGram Tokenizer
    Features:

The edge_ngram tokenizer differs from the ngram tokenizer in that its grams are always anchored to the start of the token: ngram is suited to search-as-you-type matching anywhere inside a word, while edge_ngram is suited to prefix autocomplete.

For example:

POST _analyze
{
  "tokenizer": "ngram",
  "text": "a Quick Foxes."
}

ngram test result:

["a", "a ", " ", " Q", "Q", "Qu", "u", "ui", "i", "ic", "c", "ck", "k", "k ", " ", " F", "F", "Fo", "o", "ox", "x", "xe", "e", "es", "s", "s.", ".",]

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "a Quick Foxes."
}

edge_ngram test result:

["a", "a "]

The test results above show that:

By default, both ngram and edge_ngram treat "a Quick Foxes." as a single token (token_chars defaults to keeping all characters).
By default, both ngram and edge_ngram use a minimum gram length of 1 and a maximum of 2.
ngram slides a fixed-size window across the token (commonly used for search-as-you-type matching).
edge_ngram keeps the start position fixed and grows the window from the minimum to the maximum length (commonly used for prefix autocomplete); a configuration sketch for this use case follows below.
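
As a sketch of the autocomplete use case (the index, analyzer, and tokenizer names as well as the gram lengths below are illustrative choices, not values taken from this article), an edge_ngram tokenizer is usually configured with explicit gram lengths and token_chars, then combined with a lowercase filter:

PUT autocomplete_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}

POST autocomplete_index/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "Quick"
}

Expected result:

[qu, qui, quic, quick]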

  • Keyword Tokenizer

Features:

The keyword tokenizer emits the entire input as a single token.

Optional configuration:

buffer_size: defaults to 256 (a quick check is sketched below)
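
A quick check; the whole input is expected to come back as a single token, and pairing the keyword tokenizer with a lowercase filter is a common way to get a case-insensitive, otherwise untokenized value:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "New York"
}

Expected result:

[new york]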

  • Pattern Tokenizer
    Features:

Splits text using a regular expression.

Optional configuration:

pattern: a Java regular expression; defaults to \W+
flags: Java regular expression flags; flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS"
group: which capture group to extract as tokens; defaults to -1 (split).

group defaults to -1, which means the text is split on whatever the regular expression matches.
group=0 keeps each full match of the regular expression as a token.
group=1,2,3... keeps the text matched by the corresponding () capture group as a token.

ex:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"(.*)\"",   
          "flags": "",
          "group": -1
        }
      }
    }
  }
}

Note: the pattern matches strings wrapped in double quotes. Note the difference between "\"(.*)\"" and "\".*\"": both match a double-quoted string, but the former contains a capture group while the latter does not.

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,\"separated\",values"
}

Tokenization test results:

With the default group of -1, the text matched by the regex is used as a delimiter, giving: ["comma,", ",values"]
With group=0, each full regex match becomes a token, giving the matched string: ["\"separated\""]
With group=1, the text matched by the first () capture group becomes the token, giving: ["separated"]
With group=2, the text matched by the second () capture group would be used, but because this regex has only one capture group, an exception is thrown.

  • Path Hierarchy Tokenizer
    Features:

The path_hierarchy tokenizer takes a hierarchical value such as a filesystem path, splits on the path separator, and emits a term for each component in the tree.

Optional configuration:

delimiter: the path separator to split on; defaults to /.
replacement: an optional replacement character to use for the delimiter in the emitted tokens; defaults to the delimiter.
buffer_size: the buffer size used while splitting paths; defaults to 1024.
reverse: if true, the tokenization is reversed; defaults to false.
skip: the number of initial tokens to skip; defaults to 0.

ex:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement":"/",
          "reverse": false,
          "skip": 0
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four"
}
{
  "tokens": [
    {
      "token": "one",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "one/two",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "one/two/three",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "one/two/three/four",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    }
  ]
}

I. The 8 built-in analyzers:

  • standard analyzer: the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode® Standard Annex #29) and works well for most languages, but it handles Chinese poorly.
POST _analyze
{
  "analyzer":"standard",
  "text":"Geneva K. Risk-Issues "
}
{
  "tokens": [
    {
      "token": "geneva",
      "start_offset": 0,
      "end_offset": 6,
      "type": "",
      "position": 0
    },
    {
      "token": "k",
      "start_offset": 7,
      "end_offset": 8,
      "type": "",
      "position": 1
    },
    {
      "token": "risk",
      "start_offset": 10,
      "end_offset": 14,
      "type": "",
      "position": 2
    },
    {
      "token": "issues",
      "start_offset": 15,
      "end_offset": 21,
      "type": "",
      "position": 3
    }
  ]
}
  • simple analyzer: provides letter-based tokenization: it splits whenever it encounters a character that is not a letter, and lowercases all letters.
POST _analyze
{
  "analyzer":"simple",
  "text":"Geneva K. Risk-Issues "
}
{
  "tokens": [
    {
      "token": "geneva",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "k",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "risk",
      "start_offset": 10,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "issues",
      "start_offset": 15,
      "end_offset": 21,
      "type": "word",
      "position": 3
    }
  ]
}
  • whitespace analyzer: provides whitespace-based tokenization: it splits whenever it encounters whitespace.
POST _analyze
{
  "analyzer":"whitespace",
  "text":"Geneva K. Risk-Issues "
}
{
  "tokens": [
    {
      "token": "Geneva",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "K.",
      "start_offset": 7,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "Risk-Issues",
      "start_offset": 10,
      "end_offset": 21,
      "type": "word",
      "position": 2
    }
  ]
}
  • stop analyzer: the same as the simple analyzer, but it also removes stop words. It uses the _english_ stop words by default. (A sketch with a custom stop word list follows after the example output.)
POST _analyze
{
  "analyzer":"stop",
  "text":"Geneva K.of Risk-Issues "
}
{
  "tokens": [
    {
      "token": "geneva",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "k",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "risk",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 4
    },
    {
      "token": "issues",
      "start_offset": 21,
      "end_offset": 27,
      "type": "word",
      "position": 5
    }
  ]
}
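The stop word list can be customized by configuring a stop analyzer of your own. A minimal sketch (the index and analyzer names are made up, and the stop word list is arbitrary):

PUT my_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["of", "the", "and"]
        }
      }
    }
  }
}

POST my_stop_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "Geneva K.of Risk-Issues"
}

Expected result (the configured stop word "of" is removed):

[geneva, k, risk, issues]
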
  • keyword analyzer: a no-op analyzer that returns the entire input string as a single token, i.e. it does not tokenize at all.
POST _analyze
{
  "analyzer":"keyword",
  "text":"Geneva K.of Risk-Issues "
}
{
  "tokens": [
    {
      "token": "Geneva K.of Risk-Issues ",
      "start_offset": 0,
      "end_offset": 24,
      "type": "word",
      "position": 0
    }
  ]
}
  • pattern analyzer: tokenizes text with a regular expression. The regular expression should match the token separators, not the tokens themselves. It defaults to \W+ (all non-word characters). (A custom configuration sketch follows after the example output.)
POST _analyze
{
  "analyzer":"pattern",
  "text":"Geneva K.of Risk-Issues "
}
{
  "tokens": [
    {
      "token": "geneva",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "k",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "of",
      "start_offset": 9,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "risk",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "issues",
      "start_offset": 17,
      "end_offset": 23,
      "type": "word",
      "position": 4
    }
  ]
}
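The pattern analyzer can also be configured with a custom separator pattern. A minimal sketch (the index and analyzer names are made up) that splits comma-separated values and lowercases them:

PUT my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}

POST my_pattern_index/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "Geneva,Risk,Issues"
}

Expected result:

[geneva, risk, issues]
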
  • language analyzers: a set of analyzers aimed at specific languages. The following languages are supported:

    arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

POST _analyze
{
  "analyzer":"english",  ## french(法语)
  "text":"Geneva K.of Risk-Issues "
}
{
  "tokens": [
    {
      "token": "geneva",
      "start_offset": 0,
      "end_offset": 6,
      "type": "",
      "position": 0
    },
    {
      "token": "k.of",
      "start_offset": 7,
      "end_offset": 11,
      "type": "",
      "position": 1
    },
    {
      "token": "risk",
      "start_offset": 12,
      "end_offset": 16,
      "type": "",
      "position": 2
    },
    {
      "token": "isue",
      "start_offset": 17,
      "end_offset": 23,
      "type": "",
      "position": 3
    }
  ]
}
  • fingerprint analyzer: lowercases and sorts the tokens, removes duplicates, and concatenates them back into a single token.
POST _analyze
{
  "analyzer":"fingerprint",
  "text":"Geneva K.of Risk-Issues "
}
{
  "tokens": [
    {
      "token": "geneva issues k.of risk",
      "start_offset": 0,
      "end_offset": 24,
      "type": "fingerprint",
      "position": 0
    }
  ]
}

II. Testing custom analyzers

Using the _analyze API
The _analyze API verifies the output of an analyzer and can explain the analysis process.

text: the text to analyze
explain: explain the analysis process
char_filter: character filters
tokenizer: tokenizer
filter: token filters

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "

No dreams, why bother Beijing !

", "explain": true }
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [
      {
        "name": "html_strip",
        "filtered_text": [
          """

No dreams, why bother Beijing !

"""
        ]
      }
    ],
    "tokenizer": {
      "name": "standard",
      "tokens": [
        {
          "token": "No",
          "start_offset": 7,
          "end_offset": 9,
          "type": "",
          "position": 0,
          "bytes": "[4e 6f]",
          "positionLength": 1
        },
        {
          "token": "dreams",
          "start_offset": 13,
          "end_offset": 23,
          "type": "",
          "position": 1,
          "bytes": "[64 72 65 61 6d 73]",
          "positionLength": 1
        },
        {
          "token": "why",
          "start_offset": 25,
          "end_offset": 28,
          "type": "",
          "position": 2,
          "bytes": "[77 68 79]",
          "positionLength": 1
        },
        {
          "token": "bother",
          "start_offset": 29,
          "end_offset": 35,
          "type": "",
          "position": 3,
          "bytes": "[62 6f 74 68 65 72]",
          "positionLength": 1
        },
        {
          "token": "Beijing",
          "start_offset": 39,
          "end_offset": 50,
          "type": "",
          "position": 4,
          "bytes": "[42 65 69 6a 69 6e 67]",
          "positionLength": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "lowercase",
        "tokens": [
          {
            "token": "no",
            "start_offset": 7,
            "end_offset": 9,
            "type": "",
            "position": 0,
            "bytes": "[6e 6f]",
            "positionLength": 1
          },
          {
            "token": "dreams",
            "start_offset": 13,
            "end_offset": 23,
            "type": "",
            "position": 1,
            "bytes": "[64 72 65 61 6d 73]",
            "positionLength": 1
          },
          {
            "token": "why",
            "start_offset": 25,
            "end_offset": 28,
            "type": "",
            "position": 2,
            "bytes": "[77 68 79]",
            "positionLength": 1
          },
          {
            "token": "bother",
            "start_offset": 29,
            "end_offset": 35,
            "type": "",
            "position": 3,
            "bytes": "[62 6f 74 68 65 72]",
            "positionLength": 1
          },
          {
            "token": "beijing",
            "start_offset": 39,
            "end_offset": 50,
            "type": "",
            "position": 4,
            "bytes": "[62 65 69 6a 69 6e 67]",
            "positionLength": 1
          }
        ]
      }
    ]
  }
}

Normalizer

Fields of type keyword support only exact matching, and that matching is case-sensitive. Sometimes we want exact matching on a keyword field to be case-insensitive; the normalizer solves this problem.
A normalizer is structured like an analyzer but without the tokenizer property (only filters that operate character by character, such as lowercase and asciifolding, are allowed):

Analyzer type (type): custom
Character filters (char_filter): zero or more, applied in order
Token filters (filter): zero or more, applied in order

Here is an example borrowed from the official documentation:

PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quote": {
          "type": "mapping",
          "mappings": [
            "« => \"",
            "» => \""
          ]
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["quote"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",  // normalizer只能用在keyword类型的字段
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}
PUT index/type/1
{
  "foo": "Quick Frox" 
}

GET index/type/_search
{
  "query": {
    "match": {
      "foo": {
        "query": "quick Frox"   // case-insensitive: the document matches regardless of the query's case
      }
    }
  }
}

The 12 built-in tokenizers

  • Standard tokenizer
  • Letter tokenizer
  • Lowercase tokenizer
  • Whitespace tokenizer
  • UAX URL email tokenizer
  • Classic tokenizer
  • Thai tokenizer
  • NGram tokenizer
  • Edge NGram tokenizer
  • Keyword tokenizer
  • Pattern tokenizer
  • Path hierarchy tokenizer
