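You can refer to my blog post:
http://blog.csdn.net/qq_30581017/article/details/79533240
or download the matching version of IK from GitHub:
https://github.com/medcl/elasticsearch-analysis-ik
If you folks already have IK installed and have confirmed that analysis works in ES, you can add your own hot-word dictionary and stopword dictionary.
① cd into the config/analysis-ik directory of your elasticsearch-6.2.2 install. You will see a lot of .dic files there (dic = dictionary), since IK segments text based on dictionaries.
② Create a directory named custom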
mkdir custom
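③ Inside custom, create a txt or dic file, or download a dictionary (I copied IK's dic files and added my own phrases for testing). I added the word "逼格" to the extension dictionary ext.txt and the word "java" to the extension stopword file english_stopword.txt.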
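The dictionary files are plain text with one term per line, and IK expects them to be UTF-8 encoded. Here is a minimal sketch of creating the two files from step ③ with a script instead of by hand. The `write_dictionary` helper and the temporary directory are illustrative assumptions; on a real node the target would be the custom directory created in step ②.

```python
import tempfile
from pathlib import Path


def write_dictionary(path, words):
    """Write one term per line as UTF-8, skipping blank entries and duplicates."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Hypothetical target: point this at config/analysis-ik/custom on a real node.
custom_dir = Path(tempfile.mkdtemp()) / "custom"
ext_words = write_dictionary(custom_dir / "ext.txt", ["逼格", "逼格", " "])
stop_words = write_dictionary(custom_dir / "english_stopword.txt", ["java"])
```

Deduplicating and stripping whitespace is only a convenience here; the part that matters is the one-term-per-line, UTF-8 layout.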
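④ Point IK at the custom dictionaries in its config file (IKAnalyzer.cfg.xml, in the same analysis-ik directory). The paths are relative to that directory.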
<properties>
<comment>IK Analyzer 扩展配置</comment>
<entry key="ext_dict">custom/ext.txt</entry>
<entry key="ext_stopwords">custom/english_stopword.txt</entry>
</properties>
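A small check before going further can catch a malformed file or a mistyped dictionary path. This is a rough sketch, assuming the config keeps the `properties`/`entry` layout shown above and that dictionary paths are resolved relative to the directory holding the config file. `check_ik_config` is a hypothetical helper, not part of IK, and the demo runs in a throwaway directory.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

LOCAL_KEYS = ("ext_dict", "ext_stopwords")


def check_ik_config(config_path):
    """Return {key: [dictionary paths that do not exist]} for local entries."""
    config_path = Path(config_path)
    # ET.parse raises xml.etree.ElementTree.ParseError on malformed XML.
    root = ET.parse(str(config_path)).getroot()
    missing = {}
    for entry in root.findall("entry"):
        key = entry.get("key")
        value = (entry.text or "").strip()
        if key not in LOCAL_KEYS or not value:
            continue
        # Several files may be listed in one entry, separated by semicolons.
        paths = [p.strip() for p in value.split(";") if p.strip()]
        missing[key] = [p for p in paths if not (config_path.parent / p).is_file()]
    return missing


# Demo directory standing in for config/analysis-ik (an assumption).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "custom").mkdir()
(demo_dir / "custom" / "ext.txt").write_text("逼格\n", encoding="utf-8")
(demo_dir / "IKAnalyzer.cfg.xml").write_text(
    "<properties>\n"
    '<entry key="ext_dict">custom/ext.txt</entry>\n'
    '<entry key="ext_stopwords">custom/english_stopword.txt</entry>\n'
    "</properties>\n",
    encoding="utf-8",
)
result = check_ik_config(demo_dir / "IKAnalyzer.cfg.xml")
print(result)
```

In the demo only ext.txt was created, so the stopword file is reported as missing while the `ext_dict` list comes back empty.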
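⑤ Restart Elasticsearch so that IK loads the new dictionaries (local dictionary files are read at startup)
⑥ Test
I have Kibana, the graphical tool for ES, installed, so I use it to create the index, create the mapping, and check the analysis.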
PUT testindex

POST testindex/test/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}

POST testindex/_analyze
{
  "field": "content",
  "text": ["java 是个 逼格高技术"]
}
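Outside Kibana, the same analysis check can be sent over plain HTTP. Below is a sketch using only the Python standard library, assuming Elasticsearch is reachable at `http://localhost:9200` (an assumption, adjust it to your node). `build_analyze_request` is an illustrative helper name.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, field, text):
    """Build a POST <index>/_analyze request that uses the field's analyzer."""
    body = json.dumps({"field": field, "text": [text]}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_analyze",
        data=body.encode("utf-8"),
        # ES 6.x rejects request bodies sent without an explicit content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request(
    "http://localhost:9200", "testindex", "content", "java 是个 逼格高技术"
)

# Sending it needs a running node, so it is left commented out here:
# with urllib.request.urlopen(request) as response:
#     tokens = json.load(response)["tokens"]
```

Building the request separately from sending it keeps the URL and body easy to inspect before anything touches the cluster.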
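The analysis result is below. You can see that java was dropped as a stopword and 逼格 came out as a single token thanks to the extension dictionary. Not bad, hehe.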
{
  "tokens": [
    {
      "token": "是",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "个",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "逼格",
      "start_offset": 8,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "高技术",
      "start_offset": 10,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "高技",
      "start_offset": 10,
      "end_offset": 12,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "技术",
      "start_offset": 11,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 5
    }
  ]
}
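To make this eyeball check repeatable, the response can also be inspected in code. Here is a minimal sketch over the response above, reduced to the fields it needs. `token_texts` is a hypothetical helper.

```python
def token_texts(analyze_response):
    """Return the token strings from an _analyze response, in position order."""
    tokens = sorted(analyze_response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in tokens]


# The response from the test above, reduced to the fields used here.
response = {
    "tokens": [
        {"token": "是", "position": 0},
        {"token": "个", "position": 1},
        {"token": "逼格", "position": 2},
        {"token": "高技术", "position": 3},
        {"token": "高技", "position": 4},
        {"token": "技术", "position": 5},
    ]
}

texts = token_texts(response)
stopword_removed = "java" not in texts  # the stopword dictionary took effect
custom_word_kept = "逼格" in texts  # the extension dictionary took effect
print(texts, stopword_removed, custom_word_kept)
```

Both flags come out true for this response, which matches what the result shows: no java token, and 逼格 kept as one word.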
Attached are my extension dictionary and extension stopword dictionary txt files, for testing only.
Link: https://pan.baidu.com/s/1GRlDa45BOEQuOi7M2pRWrQ Password: mrpw
Link: https://pan.baidu.com/s/1PUltSp-8jvxWwEu2GtiJ2g Password: pwm9