Elasticsearch already ships with a set of built-in analyzers.
These default analyzers do not handle Chinese well, which is why this guide uses the Elasticsearch-Analysis-IK Chinese analyzer plugin. Official documentation: https://github.com/medcl/elasticsearch-analysis-ik
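To see why, you can run the built-in standard analyzer on a Chinese phrase: it falls back to splitting the text into individual characters instead of words. A minimal check with curl, assuming a local node listening on port 9200:
curl -X GET "http://localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{"analyzer": "standard", "text": "中华人民共和国"}'
# The standard analyzer returns one token per character: 中 / 华 / 人 / 民 / 共 / 和 / 国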
1. Download and install
Upload the plugin archive to the server that runs your Elasticsearch instance, for example:
# Upload the archive to a staging directory, then unzip it into the plugins/ik directory of your Elasticsearch installation
unzip elasticsearch-analysis-ik-7.4.2.zip -d /var/temp/elasticsearch-7.4.2/plugins/ik
# Once it is unzipped, restart Elasticsearch
jps # First check whether Elasticsearch is running
# e.g. 7456 Jps
# e.g. 24299 -- process information unavailable means it is not running
# Start Elasticsearch as a non-root user
./elasticsearch -d # Start in the background
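Once the node is back up, you can confirm that the plugin was loaded; a quick check, assuming the node listens on localhost:9200:
# List installed plugins; analysis-ik should appear in the component column
curl "http://localhost:9200/_cat/plugins?v"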
2. Test the Chinese analyzer
Submit a GET request with an application/json request body.
In the URL, index_text is the index name and _analyze is the analyze endpoint.
In the request body, "text" is the content to analyze and "tokenizer" selects the tokenizer; IK provides two: ik_max_word and ik_smart.
GET "http://localhost:9200/index_text/_analyze"
{
  "text": "中华人民共和国",
  "tokenizer": "ik_max_word"
}
# Result with ik_max_word
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "中华人民",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "华人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民共和国",
"start_offset": 2,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "共和国",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
},
{
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 8
}
]
}
# Result with ik_smart
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
}
]
}
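For reference, the same analyze call can be made from the command line; a minimal sketch with curl, assuming a local node on port 9200 and that the index_text index already exists:
curl -X GET "http://localhost:9200/index_text/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{"tokenizer": "ik_max_word", "text": "中华人民共和国"}'
# Swap ik_max_word for ik_smart to compare the two tokenizers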
What is the difference between ik_max_word and ik_smart? As the two responses above show, ik_max_word splits the text into the finest-grained set of words (中华人民共和国 yields nine tokens, including overlapping ones), while ik_smart produces the coarsest split (here a single token). A common practice is to index with ik_max_word and analyze search queries with ik_smart, as sketched below.
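A sketch of how the two are commonly combined in an index mapping; the index name demo_index and the field content are hypothetical, and this assumes the IK plugin is installed on every node of the cluster:
# Index documents with the fine-grained ik_max_word analyzer,
# and analyze search queries with the coarser ik_smart analyzer
curl -X PUT "http://localhost:9200/demo_index" \
  -H 'Content-Type: application/json' \
  -d '{
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }'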
3. IK custom Chinese dictionary
The demo above tokenizes Chinese correctly, but it cannot pick out special vocabulary, for example:
# Request
GET "http://localhost:9200/index_text/_analyze"
{
"tokenizer":"ik_max_word",
"text":"骚年家的熊孩子,好调皮"
}
# Response
{
"tokens": [
{
"token": "骚",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "年",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
},
{
"token": "家",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 2
},
{
"token": "的",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 3
},
{
"token": "熊",
"start_offset": 4,
"end_offset": 5,
"type": "CN_CHAR",
"position": 4
},
{
"token": "孩子",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 5
},
{
"token": "好",
"start_offset": 8,
"end_offset": 9,
"type": "CN_CHAR",
"position": 6
},
{
"token": "调皮",
"start_offset": 9,
"end_offset": 11,
"type": "CN_WORD",
"position": 7
}
]
}
The tokenizer does not recognize internet slang: in the sentence 骚年家的熊孩子,好调皮 above, 骚年 and 熊孩子 are not treated as words. To handle such terms we configure a custom Chinese dictionary. Edit IKAnalyzer.cfg.xml under plugins/ik/config/ and register an extension dictionary:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionary here -->
    <entry key="ext_dict">my.dic</entry>
</properties>
After adding the my.dic entry, save and quit. Then create the my.dic dictionary file we just referenced in the same directory as IKAnalyzer.cfg.xml, i.e. under plugins/ik/config/:
vi my.dic
# Add the words you want, one per line
# Here we add 花菇凉, 骚年, 熊孩子
花菇凉
骚年
熊孩子
# Save the file, then restart Elasticsearch
Now run the analyze request again: the terms we added, 骚年 and 熊孩子, are recognized as words.
# Request
GET "http://localhost:9200/index_text/_analyze"
{
"tokenizer":"ik_max_word",
"text":"骚年家的熊孩子,好调皮"
}
# Response
{
"tokens": [
{
"token": "骚年",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "家",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 1
},
{
"token": "的",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 2
},
{
"token": "熊孩子",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 3
},
{
"token": "孩子",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "好",
"start_offset": 8,
"end_offset": 9,
"type": "CN_CHAR",
"position": 5
},
{
"token": "调皮",
"start_offset": 9,
"end_offset": 11,
"type": "CN_WORD",
"position": 6
}
]
}