As a search engine, ES ships with efficient and full-featured search capabilities; in this installment we look at them in more detail.
Out of the box, ES cannot segment Chinese text: Chinese input gets split into individual characters rather than the words we actually want.
Fortunately, ES supports plugins, and installing the IK analyzer solves this problem.
With Docker, we first need a shell inside the container (skip this step if you are not using Docker):
docker exec -it elasticsearch bash
Go to ES's bin directory, which contains the plugin installer:
cd /usr/share/elasticsearch/bin
Next, download and install the plugin; make sure the version matches your ES version:
elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/7.8.1/elasticsearch-analysis-ik-7.8.1.zip
Restart the container to complete the installation:
exit # leave the container
docker restart elasticsearch
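Before moving on, it is worth confirming that the plugin was actually picked up; listing the installed plugins from the host should show analysis-ik (assuming the container is still named elasticsearch):
docker exec elasticsearch elasticsearch-plugin list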
As before, we use Kibana for testing.
Without specifying an analyzer:
GET _analyze
{
"text": "学习?学个屁"
}
Response body:
{
"tokens" : [
{
"token" : "学",
"start_offset" : 0,
"end_offset" : 1,
"type" : "" ,
"position" : 0
},
{
"token" : "习",
"start_offset" : 1,
"end_offset" : 2,
"type" : "" ,
"position" : 1
},
{
"token" : "学",
"start_offset" : 3,
"end_offset" : 4,
"type" : "" ,
"position" : 2
},
{
"token" : "个",
"start_offset" : 4,
"end_offset" : 5,
"type" : "" ,
"position" : 3
},
{
"token" : "屁",
"start_offset" : 5,
"end_offset" : 6,
"type" : "" ,
"position" : 4
}
]
}
You can see that every character is split off on its own.
With the ik_smart analyzer:
GET _analyze
{
"analyzer": "ik_smart",
"text": "学习?学个屁"
}
Response body:
{
"tokens" : [
{
"token" : "学习",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "学",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "个",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "屁",
"start_offset" : 5,
"end_offset" : 6,
"type" : "CN_CHAR",
"position" : 3
}
]
}
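Besides ik_smart, the IK plugin also registers a finer-grained analyzer named ik_max_word, which produces more (possibly overlapping) terms from the same text. You can compare the two simply by swapping the analyzer name:
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "学习?学个屁"
}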
学习 is now treated as a word instead of single characters, but 学个屁 is still split into single characters. What if we want 学个屁 to be recognized as a word? That is where custom dictionaries come in.
Get back into the container; if you have already mounted the config directory out to the host as a volume, you can skip this (I forgot to):
docker exec -it elasticsearch bash
The ES Docker image is based on CentOS, so install vim first:
yum install vim
Go to the dictionary directory:
cd /usr/share/elasticsearch/config/analysis-ik
Use vim to create a file my_word.dic and put the words we need into it.
Take a look at the file contents:
[root@3fb842497984 analysis-ik]# cat my_word.dic
学个屁
Register it in the configuration: open IKAnalyzer.cfg.xml and add the custom dictionary file to the ext_dict entry:
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <entry key="ext_dict">my_word.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
Exit and restart the container:
exit
docker restart elasticsearch
Let's test the effect again.
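The request is the same ik_smart _analyze call as before:
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "学习?学个屁"
}
Response body: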
{
"tokens" : [
{
"token" : "学习",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "学个屁",
"start_offset" : 3,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 1
}
]
}
The test succeeds.
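In practice you normally don't call _analyze by hand; instead the IK analyzer is set on a text field in the index mapping. A minimal sketch, using a hypothetical articles index and the combination commonly recommended for IK (ik_max_word at index time, ik_smart at search time):
PUT articles
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}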
Before testing, a quick introduction to two ES field types and two query types.
Field types:
keyword: stored as a single term, not analyzed when indexed
text: analyzed (split into terms) when indexed
Query types:
match: full-text query, the search terms are analyzed
term: exact query, the search terms are not analyzed
Create the index, with name as keyword and desc as text:
PUT user
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"age": {
"type": "integer"
},
"desc": {
"type": "text"
}
}
}
}
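If you want to double-check the field types, the mapping can be read back:
GET user/_mapping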
Insert three documents:
PUT user/_doc/1
{
"name": "user koorye1",
"age": 19,
"desc": "i love python"
}
PUT user/_doc/2
{
"name": "user koorye2",
"age": 20,
"desc": "i love java"
}
PUT user/_doc/3
{
"name": "user koorye3",
"age": 21,
"desc": "i love c"
}
Now let's test with both query types:
GET user/_search
{
"query": {
"match": {
"name": "user"
}
}
}
The result set is empty. Why? Because name is a keyword field, its value is stored as one whole term, so we cannot match on only part of it.
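By contrast, a match query that supplies the complete keyword value should find the document, since the whole query string then equals the stored term:
GET user/_search
{
  "query": {
    "match": {
      "name": "user koorye1"
    }
  }
}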
GET user/_search
{
"query": {
"term": {
"name": "user koorye1"
}
}
}
A term query does not analyze the search text, and name is a keyword field that is not analyzed either, so only an exact, complete match succeeds. This time we get one result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.9808291,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.9808291,
"_source" : {
"name" : "user koorye1",
"age" : 19,
"desc" : "i love python"
}
}
]
}
}
GET user/_search
{
"query": {
"match": {
"desc": "i love"
}
}
}
Because desc is a text field, it was analyzed at index time, so the index effectively contains the individual terms i / love / python / java / c. A match query analyzes the search text as well, giving i / love. A match only needs one term in common to hit, so all three documents match and 3 results are returned:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.26706278,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.26706278,
"_source" : {
"name" : "user koorye1",
"age" : 19,
"desc" : "i love python"
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.26706278,
"_source" : {
"name" : "user koorye2",
"age" : 20,
"desc" : "i love java"
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.26706278,
"_source" : {
"name" : "user koorye3",
"age" : 21,
"desc" : "i love c"
}
}
]
}
}
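To see exactly which terms were indexed for desc, the _analyze API can also be run against the field itself; with the standard analyzer that text fields use by default, this should yield the terms i, love and python:
GET user/_analyze
{
  "field": "desc",
  "text": "i love python"
}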
GET user/_search
{
"query": {
"term": {
"desc": "i love python"
}
}
}
The text is exactly the same as the stored value, yet the result is empty. Surprising? The reason is that desc is a text field, so its value was split into separate terms when indexed: the index only contains i, love and python individually, never the whole string i love python. A term query looks up the search text as a single, unanalyzed term, so even an identical string finds nothing.
Conversely, a term query on a single term does return a result:
GET user/_search
{
"query": {
"term": {
"desc": "python"
}
}
}
Response body:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.9808291,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.9808291,
"_source" : {
"name" : "user koorye1",
"age" : 19,
"desc" : "i love python"
}
}
]
}
}
To return only some of the fields, list them in _source:
GET user/_search
{
"query": {
"match": {
"desc": "i love"
}
},
"_source": ["name", "age"]
}
Response body:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.497693,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.497693,
"_source" : {
"name" : "user koorye1",
"age" : 19
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.497693,
"_source" : {
"name" : "user koorye2",
"age" : 20
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.497693,
"_source" : {
"name" : "user koorye3",
"age" : 21
}
}
]
}
}
The desc field is no longer in the response.
sort specifies the ordering, from the offset of the first result to return, and size the number of results per page:
GET user/_search
{
"query": {
"match": {
"desc": "i love"
}
},
"sort": [
{
"age": {
"order": "desc"
}
}
],
"from": 0,
"size": 2
}
Response body:
{
"took" : 22,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"name" : "user koorye3",
"age" : 21,
"desc" : "i love c"
},
"sort" : [
21
]
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "2",
"_score" : null,
"_source" : {
"name" : "user koorye2",
"age" : 20,
"desc" : "i love java"
},
"sort" : [
20
]
}
]
}
}
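Fetching the next page is just a matter of advancing from; with size 2, the second page starts at offset 2 and returns the one remaining document:
GET user/_search
{
  "query": {
    "match": {
      "desc": "i love"
    }
  },
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ],
  "from": 2,
  "size": 2
}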
ES provides a bool query with several clause types for AND / OR / NOT logic:
AND (must):
GET user/_search
{
"query": {
"bool": {
"must": [{
"match": {
"desc": "love"
}
},{
"match": {
"age": "19"
}
}]
}
}
}
OR (should):
GET user/_search
{
"query": {
"bool": {
"should": [{
"match": {
"desc": "love"
}
},{
"match": {
"age": "19"
}
}]
}
}
}
NOT (must_not):
GET user/_search
{
"query": {
"bool": {
"must_not": [{
"match": {
"age": "19"
}
}]
}
}
}
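The three clause types can be combined in a single bool query. For example, to require love in desc, exclude documents with age 19, and merely boost those mentioning java (when must is present, should clauses only affect scoring), a sketch against the same user index:
GET user/_search
{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "desc": "love"
        }
      }],
      "must_not": [{
        "match": {
          "age": "19"
        }
      }],
      "should": [{
        "match": {
          "desc": "java"
        }
      }]
    }
  }
}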
ES also supports range queries. For example:
GET user/_search
{
"query": {
"bool": {
"filter": [{
"range": {
"age": {
"gte": 20,
"lte": 30
}
}
}]
}
}
}
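A range filter is usually combined with a scoring query; clauses under filter narrow down the results without affecting the relevance score:
GET user/_search
{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "desc": "love"
        }
      }],
      "filter": [{
        "range": {
          "age": {
            "gte": 20,
            "lte": 30
          }
        }
      }]
    }
  }
}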
Use highlight to specify which fields should be highlighted:
GET user/_search
{
"query": {
"match": {
"desc": "love"
}
},
"highlight": {
"fields": {
"desc": {}
}
}
}
Response body:
{
"took" : 115,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.13353139,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.13353139,
"_source" : {
"name" : "user koorye1",
"age" : 19,
"desc" : "i love python"
},
"highlight" : {
"desc" : [
"i love python"
]
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.13353139,
"_source" : {
"name" : "user koorye2",
"age" : 20,
"desc" : "i love java"
},
"highlight" : {
"desc" : [
"i love java"
]
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.13353139,
"_source" : {
"name" : "user koorye3",
"age" : 21,
"desc" : "i love c"
},
"highlight" : {
"desc" : [
"i love c"
]
}
}
]
}
}
We can also customize the tags wrapped around the highlighted terms, for example <b> tags:
GET user/_search
{
"query": {
"match": {
"desc": "love"
}
},
"highlight": {
"fields": {
"desc": {}
},
"pre_tags": "",
"post_tags": ""
}
}
Response body:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.13353139,
"hits" : [
{
"_index" : "user",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.13353139,
"_source" : {
"name" : "user koorye1",
"age" : 19,
"desc" : "i love python"
},
"highlight" : {
"desc" : [
"i love python"
]
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.13353139,
"_source" : {
"name" : "user koorye2",
"age" : 20,
"desc" : "i love java"
},
"highlight" : {
"desc" : [
"i love java"
]
}
},
{
"_index" : "user",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.13353139,
"_source" : {
"name" : "user koorye3",
"age" : 21,
"desc" : "i love c"
},
"highlight" : {
"desc" : [
"i love c"
]
}
}
]
}
}