An analyzer (分词器) is a component of Elasticsearch that, roughly speaking, splits a piece of text into multiple terms according to certain rules and normalizes those terms. ES runs every `text` field through its analyzer and organizes the resulting terms into an inverted index, which is exactly what makes ES queries so fast.
ES ships with a number of built-in analyzers; their characteristics and behavior are summarized below:
Analyzer | Behavior |
---|---|
Standard | ES's default analyzer; splits on word boundaries and lowercases |
Simple | Splits on non-letter characters, discards the non-letters, and lowercases |
Stop | Like Simple, but also removes stop words such as "the", "a", "is" |
Whitespace | Splits on whitespace only |
Language | A family of analyzers for 30+ common languages |
Pattern | Splits on a regular expression, `\W+` (non-word characters) by default |
Keyword | No tokenization; the whole input is emitted as a single term |
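Any built-in analyzer can be tried out directly with the `_analyze` API, which returns the terms an analyzer would produce for a given text. A quick sketch (the sample text is arbitrary):

```json
POST /_analyze
{
  "analyzer": "standard",
  "text": "The Quick brown FOX"
}
```

The standard analyzer lowercases and splits on word boundaries, so the response contains the terms `the`, `quick`, `brown`, and `fox`.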
These analyzers handle words and letters, so for that purpose the coverage is already quite complete. For Chinese, however, meaning usually comes from multi-character words, so the analyzers above clearly fall short. Several Chinese analyzers exist to fill the gap, such as IK, jieba, and THULAC, with the IK analyzer being the most widely used.
Besides Chinese characters, we also use pinyin all the time: input methods and Baidu's search box both support pinyin-based suggestions. So if the data lives in ES, how do we search it by pinyin? This is where the pinyin analyzer plugin helps, and it is written by the same developer as the IK analyzer.
With so many analyzers on offer, how do we choose the right one for a given business requirement? A practical approach is to compare their actual tokenization output, so let's look at what each analyzer produces.
Take "text": "白兔万岁A*" as an example. The three outputs below come from the standard analyzer, the IK (ik_max_word) analyzer, and the pinyin analyzer, respectively:
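These outputs can be reproduced with the `_analyze` API; assuming the IK and pinyin plugins are installed, the requests would look like:

```json
POST /_analyze
{ "analyzer": "standard", "text": "白兔万岁A*" }

POST /_analyze
{ "analyzer": "ik_max_word", "text": "白兔万岁A*" }

POST /_analyze
{ "analyzer": "pinyin", "text": "白兔万岁A*" }
```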
Standard analyzer:
{
"tokens": [
{
"token": "白",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "兔",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "万",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "岁",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "a",
"start_offset": 4,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 4
}
]
}
IK (ik_max_word) analyzer:
{
"tokens": [
{
"token": "白兔",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "万岁",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "万",
"start_offset": 2,
"end_offset": 3,
"type": "TYPE_CNUM",
"position": 2
},
{
"token": "岁",
"start_offset": 3,
"end_offset": 4,
"type": "COUNT",
"position": 3
}
]
}
Pinyin analyzer:
{
"tokens": [
{
"token": "bai",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "btwsa",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "tu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "wan",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "sui",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 3
},
{
"token": "a",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 4
}
]
}
Different analyzers clearly produce different tokens. Now suppose we need to build a fuzzy-search feature that queries the same ES field in multiple ways, like a Baidu-style search box: we want to match on whole words, on single characters, and on pinyin. Obviously, a single analyzer can hardly satisfy such a requirement.
In that case, we can define multiple sub-fields (multi-fields) on the same indexed field, each paired with a different analyzer. For example, we can give a field two analyzers:
Mapping:
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "ik_max_word",
      "fields": {
        "PY": {
          "type": "text",
          "analyzer": "pinyin"
        }
      }
    }
  }
}
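With a multi-field mapping like this, one query can search both the original field and its pinyin sub-field at once. A minimal sketch using `multi_match` (the index name `my_index` is a placeholder; the field names follow the mapping above):

```json
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "baitu",
      "fields": ["name", "name.PY"]
    }
  }
}
```

A pinyin input such as "baitu" can match through `name.PY`, while a Chinese input matches through `name`.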
If you want even more freedom with ES's analysis features, there is another door to open: custom analyzers. As the name suggests, a custom analyzer combines tokenizers and filters, plus their settings, into an analyzer tailored to your needs. For example, if we want both word-based suggestions and the convenience of pinyin matching, why not define our own analyzer that combines IK with pinyin:
{
  "analysis": {
    "analyzer": {
      "my_max_analyzer": {
        "tokenizer": "ik_max_word",
        "filter": "py"
      },
      "my_smart_analyzer": {
        "tokenizer": "ik_smart",
        "filter": "py"
      }
    },
    "filter": {
      "py": {
        "type": "pinyin",
        "first_letter": "prefix",
        "keep_separate_first_letter": true,
        "keep_full_pinyin": true,
        "keep_joined_full_pinyin": true,
        "keep_original": true,
        "limit_first_letter_length": 16,
        "lowercase": true,
        "remove_duplicated_term": true
      }
    }
  }
}
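To wire this up, the `analysis` block goes under the index `settings`, and the mapping then references the custom analyzer by name. A hedged sketch (the index name `my_index` is a placeholder, and using `ik_smart` as the `search_analyzer` is an illustrative choice; a non-pinyin search analyzer is a common way to avoid spurious single-letter matches at query time):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_max_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "first_letter": "prefix",
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_max_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```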
With this analyzer, "白兔万岁A*" is tokenized as follows:
{
"tokens": [
{
"token": "b",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "bai",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "白兔",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "baitu",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "bt",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "t",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "tu",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "w",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "wan",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "s",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "sui",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "万岁",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "wansui",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "ws",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "w",
"start_offset": 2,
"end_offset": 3,
"type": "TYPE_CNUM",
"position": 4
},
{
"token": "wan",
"start_offset": 2,
"end_offset": 3,
"type": "TYPE_CNUM",
"position": 4
},
{
"token": "万",
"start_offset": 2,
"end_offset": 3,
"type": "TYPE_CNUM",
"position": 4
},
{
"token": "s",
"start_offset": 3,
"end_offset": 4,
"type": "COUNT",
"position": 5
},
{
"token": "sui",
"start_offset": 3,
"end_offset": 4,
"type": "COUNT",
"position": 5
},
{
"token": "岁",
"start_offset": 3,
"end_offset": 4,
"type": "COUNT",
"position": 5
}
]
}
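Because the custom analyzer indexes first letters, joined full pinyin, per-character pinyin, and the original Chinese terms side by side, a single match query on the field can be hit by any of those forms. For example (hypothetical index and field names):

```json
GET /my_index/_search
{
  "query": {
    "match": {
      "name": "wansui"
    }
  }
}
```

The joined-pinyin token `wansui` was indexed for 万岁, as the output above shows, so this query matches documents containing 万岁.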