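1. What is proximity matching
Suppose we have two sentences:
java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.
Search for java spark with a match query:
{
"query": {
"match": {
"content": "java spark"
}
}
}
A match query only finds documents that contain java or spark (the default operator is or); it knows nothing about how close the two terms are to each other.
If we want documents in which java and spark are close together to be returned first, they need a higher relevance score, and that is what proximity matching provides.
There are two requirements to implement:
(1) Search for java spark and require the two terms to be adjacent, with no other term in between.
(2) Search for java spark so that the closer the two terms are in a document, the higher the document scores and ranks.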
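To make the notion of "close together" concrete, here is a minimal Python sketch that measures the smallest position gap between java and spark in the two sentences. The `tokenize` and `min_distance` helpers are hypothetical names introduced for illustration, and the regular expression is only a rough stand-in for the standard analyzer, not its actual tokenization rules.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

def min_distance(text, a, b):
    """Smallest position gap between any occurrence of a and any occurrence of b."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    if not pos_a or not pos_b:
        return None  # at least one of the terms is missing
    return min(abs(i - j) for i in pos_a for j in pos_b)

doc1 = ("java is my favourite programming language, and I also think "
        "spark is a very good big data system.")
doc2 = ("java spark are very related, because scala is spark's programming "
        "language and scala is also based on jvm like java.")

print(min_distance(doc1, "java", "spark"))  # 10
print(min_distance(doc2, "java", "spark"))  # 1
```

Both sentences contain both terms, so a plain term match cannot tell them apart on this basis; only the position gap does.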
2. match phrase
Prepare the data:
PUT /test_index/_create/1
{
"content": "java is my favourite programming language, and I also think spark is a very good big data system."
}
PUT /test_index/_create/2
{
"content": "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
}
For requirement (1), java and spark adjacent with no other term in between:
A match query cannot do this, because it returns both documents:
GET /test_index/_search
{
"query": {
"match": {
"content": "java spark"
}
}
}
Result:
{
"took" : 16,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.4255141,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.4255141,
"_source" : {
"content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.37266707,
"_source" : {
"content" : "java is my favourite programming language, and I also think spark is a very good big data system."
}
}
]
}
}
A match_phrase query can, and it returns only document 2:
GET /test_index/_search
{
"query": {
"match_phrase": {
"content": "java spark"
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.35695744,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.35695744,
"_source" : {
"content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
}
}
]
}
}
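3. term position
Suppose we have two documents:
doc1: hello world, java spark
doc2: hi, spark java
When they are indexed, the inverted index records, for every term, the documents that contain it and the position of the term inside each document:
hello   doc1(0)
world   doc1(1)
hi      doc2(0)
java    doc1(2) doc2(2)
spark   doc1(3) doc2(1)
The positions can be checked with the _analyze API: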
GET /_analyze
{
"text": ["hello world, java spark"],
"analyzer": "standard"
}
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "java",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "spark",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
GET /_analyze
{
"text": ["hi, spark java"],
"analyzer": "standard"
}
{
"tokens" : [
{
"token" : "hi",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "spark",
"start_offset" : 4,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "java",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
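The term-to-position table above can be reproduced with a few lines of Python. This is only a sketch of the idea of a positional inverted index: `build_positional_index` is a hypothetical helper, and splitting on `\w+` is an assumption that happens to give the same tokens as the standard analyzer for these two short documents.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, counting positions from 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"\w+", text.lower())
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

docs = {
    "doc1": "hello world, java spark",
    "doc2": "hi, spark java",
}
index = build_positional_index(docs)

for term in ("hello", "world", "hi", "java", "spark"):
    print(term, index[term])
```

The printed postings for java and spark show the same positions as the _analyze output: java at 2 in both documents, spark at 3 in doc1 and at 1 in doc2.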
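4. How match phrase works
match_phrase relies on the term positions stored in the index:
hello world, java spark    doc1
hi, spark java             doc2
hello   doc1(0)
world   doc1(1)
hi      doc2(0)
java    doc1(2) doc2(2)
spark   doc1(3) doc2(1)
A match_phrase query first finds the documents that contain every term of the query, and then checks the term positions inside each of those documents. For java spark, the position of spark must be exactly one greater than the position of java.
doc1 --> contains java and spark --> java is at position 2 and spark at position 3, so spark is exactly one position after java
doc1 matches
doc2 --> contains java and spark --> java is at position 2 and spark at position 1, so spark is one position before java rather than one after
doc2 does not match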
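The two steps described above (intersect the documents, then check the positions) can be sketched in Python as follows. `phrase_match` is a hypothetical helper written for illustration, not Elasticsearch's implementation, and the `index` dictionary is copied by hand from the table in this section.

```python
def phrase_match(index, terms):
    """Return ids of documents in which terms occur at consecutive positions, in order."""
    postings = [index.get(term, {}) for term in terms]

    # Step 1: keep only the documents that contain every term.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)

    # Step 2: term k of the phrase must sit exactly k positions after term 0.
    matches = []
    for doc_id in sorted(candidates):
        for start in postings[0][doc_id]:
            if all(start + offset in postings[offset][doc_id]
                   for offset in range(1, len(terms))):
                matches.append(doc_id)
                break
    return matches

index = {
    "hello": {"doc1": [0]},
    "world": {"doc1": [1]},
    "hi": {"doc2": [0]},
    "java": {"doc1": [2], "doc2": [2]},
    "spark": {"doc1": [3], "doc2": [1]},
}

print(phrase_match(index, ["java", "spark"]))  # ['doc1']
print(phrase_match(index, ["spark", "java"]))  # ['doc2']
```

Reversing the phrase flips the result, which shows that the order of the terms matters as much as their adjacency.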
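5. The slop parameter
Meaning: slop is the number of position moves the terms of the query string are allowed to make in order to match a document. A document matches when the moves it needs do not exceed slop.
An example:
Take the document hello world, java is very good, spark is also very good. and suppose we want match_phrase to find it for java spark. A plain match_phrase query finds nothing. (The output below assumes test_index was deleted first, for example with DELETE /test_index; otherwise _create fails with a version conflict on id 1, and document 2 from section 2 would still match.)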
PUT /test_index/_create/1
{
"content": "hello world, java is very good, spark is also very good."
}
GET /test_index/_search
{
"query": {
"match_phrase": {
"content": "java spark"
}
}
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Check the term positions with the _analyze API:
GET /_analyze
{
"text": ["hello world, java is very good, spark is also very good."],
"analyzer": "standard"
}
Result:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "java",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "is",
"start_offset" : 18,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "very",
"start_offset" : 21,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "good",
"start_offset" : 26,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "spark",
"start_offset" : 32,
"end_offset" : 37,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "is",
"start_offset" : 38,
"end_offset" : 40,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "also",
"start_offset" : 41,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "very",
"start_offset" : 46,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "good",
"start_offset" : 51,
"end_offset" : 55,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
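doc:     java  is     very   good   spark  is
query:   java  spark
move 1:  java  -->    spark
move 2:  java         -->    spark
move 3:  java                -->    spark
java is at position 2 and spark at position 6. The phrase expects spark at position 3, right after java, so spark has to move 3 positions. Setting slop to 3 or more is therefore enough for the document to match: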
GET /test_index/_search
{
"query": {
"match_phrase": {
"content": {
"query": "java spark",
"slop": 3
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.21824157,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.21824157,
"_source" : {
"content" : "hello world, java is very good, spark is also very good."
}
}
]
}
}
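The arithmetic behind the slop value of 3 can be sketched in Python. This is a simplified model, not Elasticsearch's actual matcher: each term's position is shifted back by its offset in the query, and the spread of the shifted positions is taken as the number of moves needed. `required_slop` is a hypothetical helper, and repeated query terms are not handled.

```python
from itertools import product

def required_slop(term_positions):
    """term_positions: one list of positions per query term, in query order.

    Returns the smallest slop that would match in this simplified model,
    or None if some term does not occur in the document.
    """
    if any(not positions for positions in term_positions):
        return None
    best = None
    for combo in product(*term_positions):
        # For an exact phrase, position minus query offset is the same for every term.
        adjusted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(adjusted) - min(adjusted)
        if best is None or spread < best:
            best = spread
    return best

# java at 2, spark at 6: "hello world, java is very good, spark is also very good."
print(required_slop([[2], [6]]))  # 3
# java at 2, spark at 3: "hello world, java spark"
print(required_slop([[2], [3]]))  # 0
# java at 2, spark at 1: "hi, spark java"
print(required_slop([[2], [1]]))  # 2
```

In this model an exact phrase needs no moves, and two adjacent terms in reversed order need 2 moves rather than 1, because one term has to step onto the other's position and then past it.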
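A match_phrase query with slop is a proximity match. Although slop lets the query match many more documents, documents in which the terms are closer together get a higher relevance score and are returned first.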
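The ranking effect can be illustrated with a small Python sketch. The weight 1 / (1 + distance) is an assumption chosen for illustration; the real relevance score also depends on the similarity model and on index statistics, so the numbers below are not Elasticsearch scores. The document ids and distances are hypothetical.

```python
def proximity_weight(distance):
    """Illustrative weight: 1.0 for an exact phrase, shrinking as the terms drift apart."""
    return 1.0 / (1.0 + distance)

def rank(distances, slop):
    """distances: doc id -> moves needed to line the terms up as a phrase.

    Keeps the documents that fit within slop and orders them closest first.
    """
    kept = {doc_id: proximity_weight(moves)
            for doc_id, moves in distances.items()
            if moves <= slop}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)

# Hypothetical documents and the number of moves each one needs.
distances = {"doc_a": 3, "doc_b": 0, "doc_c": 12}

for doc_id, weight in rank(distances, slop=5):
    print(doc_id, round(weight, 2))
# doc_b 1.0
# doc_a 0.25
```

doc_c needs more moves than slop allows, so it is dropped; of the remaining two, the exact phrase in doc_b ranks above doc_a.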