"content": {
"type": "text",
"analyzer":"ik_max_word", #对内容使用ik分词
"fielddata": true #为了词频统计
}
content:"那我估计他应该喜欢西红柿"
"term":{"content":"估计"}
"term":{"content":"估计他"}
The tokens currently indexed for content are:
{
"tokens": [
{
"token": "那我",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "估计",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "他",
"start_offset": 4,
"end_offset": 5,
"type": "CN_CHAR",
"position": 2
},
{
"token": "应该",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 3
},
{
"token": "喜欢",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 4
},
{
"token": "西红柿",
"start_offset": 9,
"end_offset": 12,
"type": "CN_WORD",
"position": 5
}
]
}
The index contains no token "估计他", so the term query for it finds nothing.
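Why the second term query misses can be illustrated with a small sketch. This is only a simplified model of a term lookup (an exact match against the indexed tokens), not Elasticsearch's actual implementation; the function name `term_matches` is made up for illustration.

```python
# Tokens that ik_max_word produced for the sample document (copied from the output above).
indexed_tokens = ["那我", "估计", "他", "应该", "喜欢", "西红柿"]


def term_matches(term: str, tokens: list[str]) -> bool:
    """Simplified model: a term query hits only if the exact term is an indexed token."""
    return term in tokens


print(term_matches("估计", indexed_tokens))    # True
print(term_matches("估计他", indexed_tokens))  # False
```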
Next I tried adding a keyword sub-field to content, hoping that would do the job:
"content": {
"type": "text",
"analyzer":"ik_max_word",
"fielddata": true,
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"term":{"content.keyword":"估计他"}
"term":{"content.keyword":"那我估计他应该喜欢西红柿"}
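However, a match query hits content that contains "估计 他" (the analyzed tokens rather than the exact string), which still does not meet the requirement:
"match":{"content":"估计他"}
"wildcard" : { "content.keyword" : "*估计他*" }
returns the correct results.
But running every keyword through a wildcard query like this throws away the efficiency of the inverted index,
so the approach I settled on is:
Analyze the keyword before querying. If the analysis result contains a token equal to the whole keyword, use term; otherwise use wildcard.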
Analyzing "西便门" gives the result below. One token equals the whole keyword, so use a term query: term:"西便门"
{
"tokens": [
{
"token": "西便门",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "便门",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
}
]
}
Analyzing "估计他" gives the result below. No token equals the whole keyword "估计他",
so use a wildcard query against the keyword sub-field of content:
"wildcard" : { "content.keyword" : "*估计他*" }
A regexp query also works: "regexp" : { "content.keyword" : ".*估计他.*" }
{
"tokens": [
{
"token": "估计",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "他",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 1
}
]
}
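The decision rule described above can be sketched in Python. This is a minimal sketch, not a definitive implementation: the function name `choose_query` is hypothetical, it assumes the analysis result has already been fetched and has the same shape as the token lists shown above, and it does not escape wildcard metacharacters in the keyword.

```python
from typing import Any


def choose_query(keyword: str, analyze_response: dict[str, Any]) -> dict[str, Any]:
    """Pick term or wildcard depending on whether some token equals the whole keyword."""
    tokens = {t["token"] for t in analyze_response.get("tokens", [])}
    if keyword in tokens:
        # The keyword exists as a single token, so the inverted index can serve it.
        return {"term": {"content": keyword}}
    # No single token covers the keyword: fall back to a substring match on the sub-field.
    # Assumption: the keyword contains no wildcard metacharacters (* or ?).
    return {"wildcard": {"content.keyword": f"*{keyword}*"}}


# Trimmed copies of the two analysis results shown above (only "token" is needed).
xibianmen = {"tokens": [{"token": "西便门"}, {"token": "便门"}]}
gujita = {"tokens": [{"token": "估计"}, {"token": "他"}]}

print(choose_query("西便门", xibianmen))  # {'term': {'content': '西便门'}}
print(choose_query("估计他", gujita))     # {'wildcard': {'content.keyword': '*估计他*'}}
```

Keeping the function free of any network call makes the rule easy to test on its own; the caller would pass in whatever the analysis step returned for the keyword.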