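Note: Elasticsearch (ES) is built for search and ships with several built-in analyzers, but for Chinese text the IK analyzer is usually the preferred choice. This article walks through how these analyzers are used.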
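Built-in analyzers:
Analyzer | Description |
---|---|
whitespace | Splits on whitespace |
simple | Splits on any non-letter character and lowercases the tokens |
stop | Like simple, but also removes stop words (defaults to `_english_`) |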
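Non-built-in analyzers (IK plugin):
Analyzer | Description |
---|---|
ik_max_word | Fine-grained segmentation |
ik_smart | Coarse-grained (simple) segmentation |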
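The analyzers listed above can all be tried through the `_analyze` endpoint used in the examples that follow. As a minimal sketch, the Python helper below builds such a request with the standard library only. The address `http://localhost:9200` and the names `build_analyze_request` and `analyze` are assumptions for illustration, and the IK analyzers are only available if the IK plugin is installed on the cluster.

```python
import json
import urllib.request

# Assumed address of a local Elasticsearch node; adjust for your cluster.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST _analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer: str, text: str) -> list:
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]
```

Calling `analyze("ik_smart", "茅台酒业")` against a running node would return the token strings from the `tokens` array of the response.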
Example (whitespace):
# Request
POST _analyze
{
"analyzer":"whitespace",
"text": "茅 台,酒 业酱香"
}
# Response
{
"tokens" : [
{
"token" : "茅",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "台,酒",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "业酱香",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 2
}
]
}
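To make the offsets in the response above easier to follow, here is a rough local approximation of whitespace tokenization in Python. It is only a sketch, not Elasticsearch code: the function name `whitespace_tokens` is hypothetical, and Python counts code points whereas Elasticsearch counts UTF-16 code units, which gives the same numbers for this sample text.

```python
import re


def whitespace_tokens(text: str) -> list:
    """Approximate the whitespace analyzer: split on runs of whitespace only."""
    return [
        {
            "token": m.group(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": position,
        }
        for position, m in enumerate(re.finditer(r"\S+", text))
    ]


for tok in whitespace_tokens("茅 台,酒 业酱香"):
    print(tok)
```

Because the comma is not whitespace, "台,酒" stays together as a single token, as in the response.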
Example (simple):
# Request
POST _analyze
{
"analyzer":"simple",
"text": "茅 台,酒 业酱香"
}
# Response
{
"tokens" : [
{
"token" : "茅",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "台",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "酒",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 2
},
{
"token" : "业酱香",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 3
}
]
}
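The difference from the whitespace result is that `simple` breaks at every non-letter character, so the comma now acts as a separator too. A rough Python approximation, assuming that `str.isalpha()` is close enough to the letter test Elasticsearch applies (the function name `simple_tokens` is hypothetical):

```python
from itertools import groupby


def simple_tokens(text: str) -> list:
    """Approximate the simple analyzer: keep runs of letters, lowercased."""
    tokens = []
    # Group consecutive characters by whether they are letters.
    for is_letter, run in groupby(enumerate(text), key=lambda pair: pair[1].isalpha()):
        if not is_letter:
            continue  # non-letter runs are separators and are discarded
        run = list(run)
        start = run[0][0]
        end = run[-1][0] + 1
        tokens.append((text[start:end].lower(), start, end))
    return tokens


print(simple_tokens("茅 台,酒 业酱香"))
```

Chinese characters count as letters, so "业酱香" is kept as one token: no dictionary-based segmentation takes place.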
Example (standard):
# Request
POST _analyze
{
"analyzer":"standard",
"text": "茅 台,酒 业酱香"
}
# Response
{
"tokens" : [
{
"token" : "茅",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "台",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "酒",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "业",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "酱",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "香",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 5
}
]
}
Example (ik_max_word):
# Request
POST _analyze
{
"analyzer":"ik_max_word",
"text": "茅台酒业"
}
# Response
{
"tokens" : [
{
"token" : "茅台酒",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "茅台",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "茅",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "台",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "酒业",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "酒",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "业",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 6
}
]
}
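The shape of this response can be pictured as "emit every dictionary word found in the text, longest first at each start position". The Python sketch below is a toy model of that idea, not the IK plugin's actual algorithm. The dictionary `TOY_DICTIONARY` and the function `all_dictionary_words` are hypothetical and were chosen so that the output lines up with the response above.

```python
# Hypothetical dictionary chosen so the sketch reproduces the response above;
# the real IK plugin ships its own, much larger dictionaries.
TOY_DICTIONARY = {"茅台酒", "茅台", "酒业", "茅", "台", "酒", "业"}


def all_dictionary_words(text: str, dictionary: set) -> list:
    """Return every dictionary word found in text as (token, start, end)."""
    max_len = max(len(word) for word in dictionary)
    found = []
    for start in range(len(text)):
        # Try the longest candidate first so longer words come out first.
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in dictionary:
                found.append((text[start:end], start, end))
    return found


for token in all_dictionary_words("茅台酒业", TOY_DICTIONARY):
    print(token)
```

The overlapping tokens (for example "茅台酒" and "酒业" both covering the character "酒") are what makes fine-grained output useful for recall at index time.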
Example (ik_smart):
# Request
POST _analyze
{
"analyzer":"ik_smart",
"text": "茅台酒业"
}
# Response
{
"tokens" : [
{
"token" : "茅台",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "酒业",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
}
]
}
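Note:
The tokens produced at search time by `search_analyzer` must be among the tokens produced at index time by `analyzer`; otherwise the query terms have nothing in the index to match.
If you set only `analyzer`, the same analyzer is used at search time, so this is not a concern.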
Example:
# Index time: ik_smart splits "茅台酒业" into [茅台, 酒业].
# Search time: whitespace only splits on spaces, so the query "茅台 酒业" finds the document,
# while "茅台酒业" or "茅 台 酒 业" does not.
PUT /test_1/test/_mapping
{
"test": {
"properties": {
"application": {
"type": "keyword",
"ignore_above": 10
},
"contentName": {
"type": "text",
"analyzer": "ik_smart",
"search_analyzer": "whitespace"
},
"id": {
"type": "long"
}
}
}
}
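The effect of this mapping can be checked on paper with a simplified model: a default `match` query (OR between terms) finds the document if at least one search-time token equals an indexed token. The Python sketch below encodes just that rule; `whitespace_query_tokens` and `would_match` are hypothetical helpers, and scoring and everything else Elasticsearch does is ignored.

```python
def whitespace_query_tokens(query: str) -> list:
    """Search-time side: the whitespace analyzer splits on whitespace only."""
    return query.split()


def would_match(indexed_tokens: set, query: str) -> bool:
    """Simplified model of a default match query (OR): one shared token is enough."""
    return any(tok in indexed_tokens for tok in whitespace_query_tokens(query))


# Tokens that ik_smart produced for "茅台酒业" in the ik_smart example earlier.
indexed = {"茅台", "酒业"}

print(would_match(indexed, "茅台 酒业"))    # True
print(would_match(indexed, "茅台酒业"))     # False
print(would_match(indexed, "茅 台 酒 业"))  # False
```

Only the first query yields tokens that exist in the index, which is why the search-time analyzer has to be chosen with the index-time analyzer in mind.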