Installation and Configuration

Download

https://github.com/medcl/elasticsearch-analysis-ik/releases

Note: the IK analyzer plugin version must match the Elasticsearch version exactly.
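One way to catch a version mismatch before installing is to compare the version embedded in the plugin file name with the Elasticsearch version. This is a minimal sketch that assumes the release file follows the elasticsearch-analysis-ik-<version>.zip naming used in this article; the helper names are illustrative and not part of any library.

```python
import re

# Assumed naming scheme, taken from the file name used in this article.
_PLUGIN_NAME = re.compile(r"^elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip$")


def plugin_version(zip_name: str) -> str:
    """Return the version embedded in an IK plugin zip file name."""
    match = _PLUGIN_NAME.match(zip_name)
    if match is None:
        raise ValueError(f"unexpected plugin file name: {zip_name!r}")
    return match.group(1)


def versions_match(zip_name: str, es_version: str) -> bool:
    """True when the plugin zip targets exactly the given Elasticsearch version."""
    return plugin_version(zip_name) == es_version


if __name__ == "__main__":
    print(versions_match("elasticsearch-analysis-ik-7.10.2.zip", "7.10.2"))
```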

Installation

Copy the downloaded package elasticsearch-analysis-ik-7.10.2.zip into the plugins folder under the Elasticsearch root directory and unzip it there. When extraction finishes, delete the zip file, rename the extracted plugin folder to ik, and restart Elasticsearch.
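The manual steps above can also be scripted. The sketch below extracts the archive straight into plugins/ik, which gives the same end state as unzipping and renaming. It assumes the plugin files sit at the top level of the archive and that the zip is kept outside the plugins folder, so there is nothing to delete afterwards. The function name install_ik is hypothetical.

```python
import zipfile
from pathlib import Path


def install_ik(zip_path: Path, es_home: Path) -> Path:
    """Extract the plugin zip into <es_home>/plugins/ik and return that folder."""
    target = es_home / "plugins" / "ik"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

Elasticsearch still has to be restarted after the files are in place.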

Features

The IK analyzer provides two analysis modes:

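Analyzer name	Description
ik_smart	Performs the coarsest-grained segmentation. For example, “中华人民共和国国歌” is split into “中华人民共和国” and “国歌”. Suited to phrase queries.
ik_max_word	Performs the finest-grained segmentation. For example, “中华人民共和国国歌” is split into “中华人民共和国”, “中华人民”, “中华”, “华人”, “人民共和国”, “人民”, “人”, “民”, “共和国”, “共和”, “和”, “国”, “国歌”, exhausting the possible combinations. Suited to term queries.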
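To make the difference between the two granularities concrete, here is a toy illustration. It is not IK's actual algorithm: it uses a tiny hand-picked dictionary, greedy longest matching for the coarse mode, and plain enumeration of dictionary hits for the fine mode. The names WORDS, coarse, and fine are made up for this sketch.

```python
from typing import List, Set

# Toy dictionary for illustration only; IK ships a far larger one.
WORDS: Set[str] = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def coarse(text: str, words: Set[str]) -> List[str]:
    """Greedy longest match from the left: one token per position."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), i, -1):
            if text[i:end] in words or end == i + 1:
                tokens.append(text[i:end])
                i = end
                break
    return tokens


def fine(text: str, words: Set[str]) -> List[str]:
    """Every dictionary word found at every start position, longest first."""
    return [
        text[i:end]
        for i in range(len(text))
        for end in range(len(text), i, -1)
        if text[i:end] in words
    ]


if __name__ == "__main__":
    print(coarse("中华人民共和国国歌", WORDS))
    print(fine("中华人民共和国国歌", WORDS))
```

The coarse pass yields few, long tokens, while the fine pass yields overlapping tokens of every length the dictionary knows, which is the trade-off the table describes.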

Example:

GET _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国"
}

Result:

{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
}
]
}

GET _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国"
}

Result:

{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "人民共和国",
"start_offset" : 2,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "共和国",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "国",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 8
}
]
}
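When comparing analyzers, the token strings are usually all that matters. The following sketch builds the _analyze request body used above and pulls the token strings out of a response that has already been parsed into a dictionary. The helper names are illustrative, and the sample is an abbreviated copy of the ik_max_word response shown above.

```python
from typing import Any, Dict, List


def analyze_body(analyzer: str, text: str) -> Dict[str, str]:
    """Request body for GET _analyze, as used in the examples above."""
    return {"analyzer": analyzer, "text": text}


def token_texts(response: Dict[str, Any]) -> List[str]:
    """Return just the token strings from an _analyze response, in order."""
    return [entry["token"] for entry in response["tokens"]]


# Abbreviated copy (three of the nine tokens) of the response shown above.
sample = {
    "tokens": [
        {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7,
         "type": "CN_WORD", "position": 0},
        {"token": "中华人民", "start_offset": 0, "end_offset": 4,
         "type": "CN_WORD", "position": 1},
        {"token": "国", "start_offset": 6, "end_offset": 7,
         "type": "CN_CHAR", "position": 8},
    ]
}

if __name__ == "__main__":
    print(analyze_body("ik_max_word", "中华人民共和国"))
    print(token_texts(sample))
```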

Customizing the Analyzer

Before configuring a custom dictionary, consider the following example:

GET _analyze
{
"analyzer": "ik_smart",
"text": ["十三届全国人大三次会议表决通过了“民法典”，自2021年1月1日起施行。"]
}

Result:

{
"tokens" : [
{
"token" : "十",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "三届",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "全国人大",
"start_offset" : 3,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "三次",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "会议",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "表决",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "通过了",
"start_offset" : 13,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "民法典",
"start_offset" : 17,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "自",
"start_offset" : 22,
"end_offset" : 23,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "2021年",
"start_offset" : 23,
"end_offset" : 28,
"type" : "TYPE_CQUAN",
"position" : 9
},
{
"token" : "1月",
"start_offset" : 28,
"end_offset" : 30,
"type" : "TYPE_CQUAN",
"position" : 10
},
{
"token" : "1日",
"start_offset" : 30,
"end_offset" : 32,
"type" : "TYPE_CQUAN",
"position" : 11
},
{
"token" : "起",
"start_offset" : 32,
"end_offset" : 33,
"type" : "CN_CHAR",
"position" : 12
},
{
"token" : "施行",
"start_offset" : 33,
"end_offset" : 35,
"type" : "CN_WORD",
"position" : 13
}
]
}

Creating a Custom Dictionary

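In the config directory of the installed IK plugin, create a folder named custom (D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config\custom). Inside custom, create mydic.dic (the custom word dictionary) and ext_stopwork.dic (the stop-word dictionary).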

Add the following entry to mydic.dic:

十三届全国人大

Add the following entries to ext_stopwork.dic:

自
起

Configuring the Custom Dictionary

Register the two new files in IKAnalyzer.cfg.xml, located in D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config. The relevant content is:


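<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <!-- custom extension dictionary -->
    <entry key="ext_dict">custom/mydic.dic</entry>
    <!-- custom extension stop-word dictionary -->
    <entry key="ext_stopwords">custom/ext_stopwork.dic</entry>
</properties>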
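The dictionary files and the configuration can also be generated by a script, which avoids stray whitespace and encoding mistakes. This is a sketch under two assumptions: that IKAnalyzer.cfg.xml uses the Java properties XML layout with ext_dict and ext_stopwords entries, and that dictionary files are UTF-8 text with one word per line. The function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable
from xml.sax.saxutils import escape

# Assumed layout of IKAnalyzer.cfg.xml (Java properties XML format).
CONFIG_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">{ext_dict}</entry>
    <entry key="ext_stopwords">{ext_stopwords}</entry>
</properties>
"""


def write_dictionary(path: Path, words: Iterable[str]) -> None:
    """Write one word per line, UTF-8 encoded, with surrounding whitespace removed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [word.strip() for word in words if word.strip()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def write_config(config_dir: Path, ext_dict: str, ext_stopwords: str) -> Path:
    """Write IKAnalyzer.cfg.xml pointing at the two dictionary files."""
    cfg = config_dir / "IKAnalyzer.cfg.xml"
    content = CONFIG_TEMPLATE.format(
        ext_dict=escape(ext_dict),
        ext_stopwords=escape(ext_stopwords),
    )
    cfg.write_text(content, encoding="utf-8")
    return cfg
```

The dictionary paths passed to write_config are relative to the config directory, matching the custom/... paths used in this article.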

Restart the Elasticsearch service and run the earlier example again:

GET _analyze
{
"analyzer": "ik_smart",
"text": ["十三届全国人大三次会议表决通过了“民法典”，自2021年1月1日起施行。"]
}

Result:

{
"tokens" : [
{
"token" : "十三届全国人大",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "三次",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "会议",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "表决",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "通过了",
"start_offset" : 13,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "民法典",
"start_offset" : 17,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "2021年",
"start_offset" : 23,
"end_offset" : 28,
"type" : "TYPE_CQUAN",
"position" : 6
},
{
"token" : "1月",
"start_offset" : 28,
"end_offset" : 30,
"type" : "TYPE_CQUAN",
"position" : 7
},
{
"token" : "1日",
"start_offset" : 30,
"end_offset" : 32,
"type" : "TYPE_CQUAN",
"position" : 8
},
{
"token" : "施行",
"start_offset" : 33,
"end_offset" : 35,
"type" : "CN_WORD",
"position" : 9
}
]
}
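The effect of the custom dictionary and stop-word list is easiest to see by comparing the two token lists. The sketch below uses the token strings from the two results in this section; token_diff is an illustrative helper, not part of any client library.

```python
from typing import Dict, List


def token_diff(before: List[str], after: List[str]) -> Dict[str, List[str]]:
    """Tokens that disappeared and tokens that appeared, in original order."""
    return {
        "removed": [t for t in before if t not in after],
        "added": [t for t in after if t not in before],
    }


# Token strings copied from the two ik_smart results in this section.
before = ["十", "三届", "全国人大", "三次", "会议", "表决", "通过了",
          "民法典", "自", "2021年", "1月", "1日", "起", "施行"]
after = ["十三届全国人大", "三次", "会议", "表决", "通过了",
         "民法典", "2021年", "1月", "1日", "施行"]

if __name__ == "__main__":
    print(token_diff(before, after))
```

The three fragments of 十三届全国人大 collapse into the single dictionary word, and the stop words 自 and 起 no longer appear.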

Building an Index with IK

Create an index that uses the IK analyzer:

PUT news
 {
 "mappings": {
 "properties": {
 "title": {
 "type": "text",
 "analyzer": "ik_max_word",
 "search_analyzer": "ik_smart"
 },
 "content": {
 "type": "text",
 "analyzer": "ik_max_word",
 "search_analyzer": "ik_smart"
 }
 }
 }
 }

View the index mapping:

GET news/_mapping

Result:

 {
 "news" : {
 "mappings" : {
 "properties" : {
 "content" : {
 "type" : "text",
 "analyzer" : "ik_max_word",
 "search_analyzer" : "ik_smart"
 },
 "title" : {
 "type" : "text",
 "analyzer" : "ik_max_word",
 "search_analyzer" : "ik_smart"
 }
 }
 }
 }
 }

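Note that in the field mappings, title and content use ik_max_word as the analyzer (applied at index time). Splitting text as finely as possible when building the inverted index lets the index satisfy as many searches as possible. The search_analyzer (applied at query time) is ik_smart, so query text is split coarsely, which keeps matches precise.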
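When several text fields share this pairing, the mapping body can be generated rather than written by hand. The sketch below builds the same body as the PUT news request above; the function names are made up for illustration, and sending the request is left to whichever client is in use.

```python
import json
from typing import Any, Dict, Iterable


def ik_text_field() -> Dict[str, str]:
    """Text field: fine-grained at index time, coarse-grained at query time."""
    return {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
    }


def ik_index_body(fields: Iterable[str]) -> Dict[str, Any]:
    """Index-creation body that applies the IK pairing to every named field."""
    return {"mappings": {"properties": {name: ik_text_field() for name in fields}}}


if __name__ == "__main__":
    print(json.dumps(ik_index_body(["title", "content"]), indent=2))
```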

Creating Test Data

POST news/_bulk
 {"index": {}}
 {"title": "柳岩为何40岁也无人敢取？", "content": "娱乐圈里的女星那么多，但要说到性感女星就一定要提到柳岩，毕竟像刘岩这样有料又有身材的女星，参加活动还是很吃香的。"}
 {"index": {}}
 {"title": "刘德华首当音乐老师", "content": "刘德华表示，希望自己首度塑造的音乐老师形象能够得到大家的认可，尤其希望能的到全国老师，家长和同学们的认可，“如果真的有机会做老师，我也想做音乐老师，因为我觉得音乐课很重要，音乐的力量是可以改变人生的！”"}
 {"index": {}}
 {"title": "奥巴马怒怼特朗普抗疫不力", "content": "奥巴马现身费城的竞选集会并发表讲话，他对特朗普四年的执政工作进行了猛烈攻击，谴责特朗普政府抗疫不力，搞砸美国经济。"}
 {"index": {}}
 {"title": "韩星柳真怀孕4个月喜迎二胎", "content": "韩星柳真怀孕4个月喜迎二胎，柳真为什么选择奇太映女儿为啥姓金？说起韩星柳真有些人可能不认识，不过只要追过S.E.S组合的网友应该都知道她，她曾经在韩国也有“国民妖精”之称，据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"}
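The bulk body above is newline-delimited JSON: each document is preceded by an action line, and the body has to end with a newline. A minimal sketch of producing that format from a list of documents follows; bulk_body is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Iterable, List


def bulk_body(docs: Iterable[Dict[str, Any]]) -> str:
    """Serialize documents as NDJSON: an action line, then the source line."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        # ensure_ascii=False keeps the Chinese text readable in the payload.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_body([{"title": "刘德华首当音乐老师"}]), end="")
```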

Test ex1

GET news/_search
 {
 "query": {
 "match": {
 "title": "刘德华"
 }
 }
 }

Result:

 {
 "took" : 26,
 "timed_out" : false,
 "_shards" : {
 "total" : 1,
 "successful" : 1,
 "skipped" : 0,
 "failed" : 0
 },
 "hits" : {
 "total" : {
 "value" : 1,
 "relation" : "eq"
 },
 "max_score" : 1.547678,
 "hits" : [
 {
 "_index" : "news",
 "_type" : "_doc",
 "_id" : "l_JQC3gB8u3smGzBUQjj",
 "_score" : 1.547678,
 "_source" : {
 "title" : "刘德华首当音乐老师",
 "content" : "刘德华表示，希望自己首度塑造的音乐老师形象能够得到大家的认可，尤其希望能的到全国老师，家长和同学们的认可，“如果真的有机会做老师，我也想做音乐老师，因为我觉得音乐课很重要，音乐的力量是可以改变人生的！”"
 }
 }
 ]
 }
 }

Test ex2

GET news/_search
{
"query": {
"match": {
"title": "柳岩"
}
}
}

Result:

{
"took" : 11,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.875202,
"hits" : [
{
"_index" : "news",
"_type" : "_doc",
"_id" : "lvJQC3gB8u3smGzBUQjj",
"_score" : 1.875202,
"_source" : {
"title" : "柳岩为何40岁也无人敢取？",
"content" : "娱乐圈里的女星那么多，但要说到性感女星就一定要提到柳岩，毕竟像刘岩这样有料又有身材的女星，参加活动还是很吃香的。"
}
},
{
"_index" : "news",
"_type" : "_doc",
"_id" : "mfJQC3gB8u3smGzBUQjj",
"_score" : 0.6017173,
"_source" : {
"title" : "韩星柳真怀孕4个月喜迎二胎",
"content" : "韩星柳真怀孕4个月喜迎二胎，柳真为什么选择奇太映女儿为啥姓金？说起韩星柳真有些人可能不认识，不过只要追过S.E.S组合的网友应该都知道她，她曾经在韩国也有“国民妖精”之称，据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"
}
}
]
}
}

Analysis of the Test ex2 result: searching for “柳岩” returned two hits, one for 柳岩 and one for 柳真. Analyzing the query text shows why:
GET _analyze
 {
 "analyzer": "ik_smart",
 "text": ["柳岩"]
 }

Result:

 {
 "tokens" : [
 {
 "token" : "柳",
 "start_offset" : 0,
 "end_offset" : 1,
 "type" : "CN_CHAR",
 "position" : 0
 },
 {
 "token" : "岩",
 "start_offset" : 1,
 "end_offset" : 2,
 "type" : "CN_CHAR",
 "position" : 1
 }
 ]
 }
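The analyzer split 柳岩 into the two single characters “柳” and “岩” and searched for each. Because the title of the 柳真 document also contains “柳”, both documents were returned.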
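The behaviour can be reproduced with a toy model. A match query by default returns documents containing any of the query tokens, so a single shared character is enough for a hit. The token lists below are hand-written for illustration and are not real IK output, and match_any is a made-up helper.

```python
from typing import Dict, Iterable, List

# Hand-written title tokens for illustration; not real IK output.
TITLE_TOKENS: Dict[str, List[str]] = {
    "柳岩为何40岁也无人敢取？": ["柳", "岩", "为何", "40", "岁", "也", "无人", "敢", "取"],
    "刘德华首当音乐老师": ["刘德华", "首", "当", "音乐", "老师"],
    "韩星柳真怀孕4个月喜迎二胎": ["韩", "星", "柳", "真", "怀孕", "4", "个", "月", "喜迎", "二胎"],
}


def match_any(query_tokens: Iterable[str], index: Dict[str, List[str]]) -> List[str]:
    """Titles sharing at least one token with the query (OR semantics)."""
    wanted = set(query_tokens)
    return [title for title, tokens in index.items() if wanted & set(tokens)]


if __name__ == "__main__":
    # The query 柳岩 was analyzed into the two tokens 柳 and 岩.
    print(match_any(["柳", "岩"], TITLE_TOKENS))
```

With the query reduced to 柳 and 岩, both the 柳岩 and 柳真 titles match, which mirrors the two hits in Test ex2.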
