Elasticsearch ships with many built-in analyzers, such as standard (the standard analyzer), english (English analysis), and chinese (Chinese analysis). The standard analyzer naively splits text into individual words (for Chinese, individual characters), so it is broadly applicable but has low precision. The english analyzer is smarter about English: it handles singular/plural forms and letter case, and filters stopwords (such as "the"). The chinese analyzer performs poorly. This post covers three things: installing the IK Chinese analysis plugin, comparing the output of the different analyzers, and arriving at a good configuration.
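As a quick illustration of the difference, the _analyze API shows an analyzer's output directly (this post uses the old ?analyzer= query-parameter form throughout; on ES 6+ you would pass the analyzer and text in a JSON body instead):

POST http://localhost:9200/_analyze?analyzer=english
The quick foxes

The english analyzer should return roughly the tokens "quick" and "fox": the stopword "the" is dropped and "foxes" is stemmed.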
The IK Analysis plugin integrates the Lucene IK analyzer (http://code.google.com/p/ik-analyzer/) into Elasticsearch and supports custom dictionaries.
It provides two analyzers, ik_smart and ik_max_word, and two tokenizers with the same names. ik_max_word splits the text into the finest-grained set of words, exhausting the possible combinations, while ik_smart produces the coarsest-grained split. For example, analyzing with ik_max_word:
POST http://192.168.159.159:9200/index1/_analyze?analyzer=ik_max_word
联想召回笔记本电源线
{
  "tokens": [
    {
      "token": "联想",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "召回",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "笔记本",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "电源线",
      "start_offset": 7,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
For comparison, the built-in chinese and standard analyzers produce:
{
  "tokens": [
    {
      "token": "联",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "想",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "召",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "回",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "笔",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "记",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "本",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "电",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "源",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "线",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    }
  ]
}
As you can see, the built-in analyzers split the text into isolated characters, which is not very useful in practice, so the IK analyzer is clearly the better choice. Next, let's install and use it.
1. Download or compile
Option 1 - download a prebuilt package from https://github.com/medcl/elasticsearch-analysis-ik/releases
Unzip the plugin into the folder your-es-root/plugins/
Option 2 - install with elasticsearch-plugin (for versions > v5.5.1):
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.0.0/elasticsearch-analysis-ik-6.0.0.zip
Restart Elasticsearch.
Note: pick the plugin release that matches your Elasticsearch version.
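To confirm the plugin was loaded, you can list the installed plugins; analysis-ik should appear in the output:

./bin/elasticsearch-plugin list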
Next, create a test index whose title field carries cn and en sub-fields:

PUT http://localhost:9200/index1
{
  "settings": {
    "refresh_interval": "5s",
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "resource": {
      "dynamic": false,
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_max_word"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

Note that title itself keeps the standard analyzer, while the sub-fields use ik_max_word and english; the comparisons below rely on this.
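You can check that the analyzers were applied with the mapping API:

GET http://localhost:9200/index1/_mapping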
Then bulk-index a few test documents:

POST http://localhost:9200/_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星驰最新电影,最好,新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }
Note that every line of the bulk body, including the last one, must end with a newline character, otherwise the request fails with an error.
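If you send this from a shell, curl's --data-binary flag preserves those newlines (plain -d strips them), and the NDJSON content type is required on ES 6+; the filename bulk.ndjson is just an example:

curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.ndjson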
Now run a multi_match query across the three fields:

POST http://localhost:9200/index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "最新",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}
We vary the query text and the fields list to compare the analyzers. The most_fields type adds up the scores from each matching field, so documents that match in several sub-fields rank higher.
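For example, case 1) below is the same query with the fields list narrowed to the IK-analyzed sub-field:

POST http://localhost:9200/index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "最新",
      "fields": [ "title.cn" ]
    }
  }
}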
1) Searching "最新" with fields limited to title.cn (only the hits section is shown):
"hits": [
{
"_index": "index1",
"_type": "resource",
"_id": "1",
"_score": 1.0537746,
"_source": {
"title": "周星驰最新电影"
}
},
{
"_index": "index1",
"_type": "resource",
"_id": "3",
"_score": 0.9057159,
"_source": {
"title": "周星驰最新电影,最好,新电影"
}
},
{
"_index": "index1",
"_type": "resource",
"_id": "4",
"_score": 0.5319481,
"_source": {
"title": "最最最最好的新新新新电影"
}
},
{
"_index": "index1",
"_type": "resource",
"_id": "2",
"_score": 0.33246756,
"_source": {
"title": "周星驰最好看的新电影"
}
}
]
2) Searching "最新" again, with fields limited to title and title.en (only the hits section is shown):
"hits": [
{
"_index": "index1",
"_type": "resource",
"_id": "4",
"_score": 1,
"_source": {
"title": "最最最最好的新新新新电影"
}
},
{
"_index": "index1",
"_type": "resource",
"_id": "1",
"_score": 0.75,
"_source": {
"title": "周星驰最新电影"
}
},
{
"_index": "index1",
"_type": "resource",
"_id": "3",
"_score": 0.70710677,
"_source": {
"title": "周星驰最新电影,最好,新电影"
}
},
{
"_index": "index1",
"_type": "resource",
"_id": "2",
"_score": 0.625,
"_source": {
"title": "周星驰最好看的新电影"
}
}
]
Conclusion: without IK Chinese analysis, "最新" is treated as two independent characters, so search precision is low.

3) Searching "fox" with fields limited to title and title.cn returns no hits, because under those two analyzers "fox" and "foxes" are different tokens. Searching "fox" again with fields limited to title.en returns:
"hits": [
{
"_index": "index1",
"_type": "resource",
"_id": "5",
"_score": 0.9581454,
"_source": {
"title": "I'm not happy about the foxes"
}
}
]
Conclusion: the chinese and standard analyzers do nothing with English word forms (singular/plural and so on), so recall is low.
The index created at the beginning is in fact already a good configuration: adding the cn and en sub-fields under title gives decent results for Chinese, English, and whatever other languages come along. As noted above, title uses the standard analyzer, title.cn uses the IK analyzer, and title.en uses the built-in English analyzer, and each search covers all three fields at once.
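If ik_max_word yields too many overlapping tokens for your data, the coarser ik_smart analyzer is worth comparing; the request form is identical, only the analyzer name changes:

POST http://192.168.159.159:9200/index1/_analyze?analyzer=ik_smart
联想召回笔记本电源线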
Next, let's see how IK handles a word that is missing from its dictionary:

POST http://192.168.159.159:9200/index1/_analyze?analyzer=ik_max_word
成龙原名陈港生

Response:
{
  "tokens" : [ {
    "token" : "成龙",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "原名",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "陈",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "港",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "生",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "CN_CHAR",
    "position" : 4
  } ]
}
IK's main dictionary does not contain the word "陈港生", so it was split apart.
To teach IK new words, edit config/IKAnalyzer.cfg.xml under the plugin directory:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionaries, one word per line -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- local extension stopword dictionary -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- remote extension dictionary -->
    <entry key="remote_ext_dict">http://192.168.1.136/hotWords.php</entry>
</properties>
Here I use the remote extension dictionary, because another program can update it without restarting ES, which is very convenient; the local mydict.dic dictionary works just as well: one word per line, add entries yourself. IK polls the remote URL and uses the Last-Modified and ETag response headers to decide whether the word list has changed, which is why the PHP endpoint below sets them:
<?php
// Hot-word list: one entry per line, served as plain UTF-8 text.
$s = <<<'EOF'
陈港生
元楼
蓝瘦
EOF;
// IK reloads the dictionary when Last-Modified or ETag changes; sending
// time() here forces a reload on every poll, which is fine for a demo,
// but in production you would only bump it when the list really changes.
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);
header('ETag: "5816f349-19"');
echo $s;
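You can verify the headers the endpoint sends with curl (the URL is the author's LAN address; substitute your own):

curl -i http://192.168.1.136/hotWords.php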
Test again, and the IK analyzer now matches "陈港生" as a single word:
...
}, {
"token" : "陈港生",
"start_offset" : 5,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 2
}, {
...