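Elasticsearch in Action (12): Search Suggestions with match_phrase_prefix and Fuzzy Queries for Misspelled Words

Elasticsearch in Action: Search Suggestions with match_phrase_prefix

Contents

      • Elasticsearch in Action: Search Suggestions with match_phrase_prefix
        • 1. Search suggestion scenarios
          • 1.1 Preparing the data
        • 2. Implementing search suggestions
          • 2.1 match_phrase_prefix
        • 2 Optimizing search suggestions
        • 3 Fuzzy search for misspelled words: fuzzy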

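1. Search suggestion scenarios

Scenario:

Suppose we want to find documents containing a certain word, but we have forgotten how to spell it and only remember the first few letters. Neither match nor match_phrase fits here, because the input is not a complete word.

  • There is an input box: I type the character a and a list of suggestions appears, the way Baidu suggests "a站", "a4纸尺寸", "apple" and so on. How is that implemented?
  • The indexed word is hello, but I misspell it as hallo. Can the document still be found? (This applies to English text; it is not suited to Chinese.)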
1.1 Preparing the data
POST /testcopy/_bulk
{"index":{"_id": 1}}
{"empId" : "111","name" : "员工1","age" : 20,"sex" : "男","mobile" : "19000001111","salary":1333,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"光谷大道","address":"湖北省武汉市洪山区光谷大厦","content" : "i like to write best elasticsearch article"}
{"index":{"_id": 2}}
{"empId" : "222","name" : "员工2","age" : 25,"sex" : "男","mobile" : "19000002222","salary":15963,"deptName" : "销售部","provice" : "湖北省","city":"武汉","area":"江汉区","address" : "湖北省武汉市江汉路","content" : "i think java is the best programming language"}
{"index":{"_id": 3}}
{ "empId" : "333","name" : "员工3","age" : 30,"sex" : "男","mobile" : "19000003333","salary":20000,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"经济技术开发区","address" : "湖北省武汉市经济开发区","content" : "i am only an elasticsearch beginner"}
{"index":{"_id": 4}}
{"empId" : "444","name" : "员工4","age" : 20,"sex" : "女","mobile" : "19000004444","salary":5600,"deptName" : "销售部","provice" : "湖北省","city":"武汉","area":"沌口开发区","address" : "湖北省武汉市沌口开发区","content" : "elasticsearch and hadoop are all very good solution, i am a beginner"}
{"index":{"_id": 5}}
{ "empId" : "555","name" : "员工5","age" : 20,"sex" : "男","mobile" : "19000005555","salary":9665,"deptName" : "测试部","provice" : "湖北省","city":"高新开发区","area":"武汉","address" : "湖北省武汉市东湖隧道","content" : "spark is best big data solution based on scala ,an programming language similar to java"}
{"index":{"_id": 6}}
{"empId" : "666","name" : "员工6","age" : 30,"sex" : "女","mobile" : "19000006666","salary":30000,"deptName" : "技术部","provice" : "武汉市","city":"湖北省","area":"江汉区","address" : "湖北省武汉市江汉路","content" : "i like java developer"}
{"index":{"_id": 7}}
{"empId" : "777","name" : "员工7","age" : 60,"sex" : "女","mobile" : "19000007777","salary":52130,"deptName" : "测试部","provice" : "湖北省","city":"黄冈市","area":"边城区","address" : "湖北省黄冈市边城区","content" : "i like elasticsearch developer"}
{"index":{"_id": 8}}
{"empId" : "888","name" : "员工8","age" : 19,"sex" : "女","mobile" : "19000008888","salary":60000,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"汉阳区","address" : "湖北省武汉市江汉大学","content" : "i like spark language"}
{"index":{"_id": 9}}
{"empId" : "999","name" : "员工9","age" : 40,"sex" : "男","mobile" : "19000009999","salary":23000,"deptName" : "销售部","provice" : "河南省","city":"郑州市","area":"二七区","address" : "河南省郑州市郑州大学","content" : "i like java developer"}
{"index":{"_id": 10}}
{"empId" : "101010","name" : "张湖北","age" : 35,"sex" : "男","mobile" : "19000001010","salary":18000,"deptName" : "测试部","provice" : "湖北省","city":"武汉","area":"高新开发区","address" : "湖北省武汉市东湖高新","content" : "i like java developer i also like  elasticsearch"}
{"index":{"_id": 11}}
{"empId" : "111111","name" : "王河南","age" : 61,"sex" : "男","mobile" : "19000001011","salary":10000,"deptName" : "销售部","provice" : "河南省","city":"开封市","area":"金明区","address" : "河南省开封市河南大学","content" : "i am not like  java "}
{"index":{"_id": 12}}
{"empId" : "121212","name" : "张大学","age" : 26,"sex" : "女","mobile" : "19000001012","salary":1321,"deptName" : "测试部","provice" : "河南省","city":"开封市","area":"金明区","address" : "河南省开封市河南大学","content" : "i am java developer  thing java is good"}
{"index":{"_id": 13}}
{"empId" : "131313","name" : "李江汉","age" : 36,"sex" : "男","mobile" : "19000001013","salary":1125,"deptName" : "销售部","provice" : "河南省","city":"郑州市","area":"二七区","address" : "河南省郑州市二七区","content" : "i like java and java is very best i like it do you like java "}
{"index":{"_id": 14}}
{"empId" : "141414","name" : "王技术","age" : 45,"sex" : "女","mobile" : "19000001014","salary":6222,"deptName" : "测试部","provice" : "河南省","city":"郑州市","area":"金水区","address" : "河南省郑州市金水区","content" : "i like c++"}
{"index":{"_id": 15}}
{"empId" : "151515","name" : "张测试","age" : 18,"sex" : "男","mobile" : "19000001015","salary":20000,"deptName" : "技术部","provice" : "河南省","city":"郑州市","area":"高新开发区","address" : "河南省郑州高新开发区","content" : "i think spark is good"}

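2. Implementing search suggestions

2.1 match_phrase_prefix

Take search suggestions: the index contains the term elasticsearch, and when I type elasticse I want the related docs suggested. match_phrase_prefix can do this. It works much like match_phrase:

  1. First, the complete terms are matched as in a match query (java).
  2. Then, within the number of position moves allowed by slop, the last term is matched as a prefix (d).
  3. max_expansions limits how many terms the prefix may expand to (default 50); once that number is reached, no further terms are considered.

The limitation of this syntax is that only the last term is treated as a prefix.
Performance is poor, because the last term has to be expanded by scanning the inverted index for terms that start with the prefix.
Because of this inefficiency, always set max_expansions if you must use this query.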
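
To see why max_expansions matters, the prefix step can be modelled in a few lines of Python. This is a simplified sketch, not Lucene's actual implementation: it assumes the term dictionary is a sorted list and that expansion simply takes the first max_expansions terms that start with the prefix. The function name expand_prefix and the term list are hypothetical; the terms are taken from the content field of the sample data.

```python
def expand_prefix(term_dictionary, prefix, max_expansions):
    """Return at most max_expansions terms that start with prefix (simplified model)."""
    candidates = [t for t in sorted(set(term_dictionary)) if t.startswith(prefix)]
    return candidates[:max_expansions]


# A few terms from the sample content field; three of them start with "d".
terms = ["java", "developer", "data", "do", "elasticsearch", "spark"]

print(expand_prefix(terms, "d", 3))  # all three "d" terms fit within the limit
print(expand_prefix(terms, "d", 2))  # "do" is cut off by the limit
```

With a limit that is too small, the term the user is actually typing may never be considered, so the value is a trade-off between speed and recall.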

# Search for docs matching the phrase prefix "java d"
GET /testcopy/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": {
        "query": "java d",
        "slop": 2,
        "max_expansions": 3
      }
    }
  }
}

The query returns the docs whose content matches the phrase prefix java d, such as those containing java developer.

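2 Optimizing search suggestions

Since match_phrase_prefix is inefficient, how should prefix matching be implemented? Elasticsearch provides tokenizers that move the work to index time, so that a prefix search becomes an ordinary term lookup in the inverted index, which is faster. edge_ngram and ngram are two tokenizers built into Elasticsearch, and they are usually configured when defining the index settings and mappings: after setting the gram lengths (min_gram and max_gram), assign the tokenizer to a custom analyzer.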
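
As a rough illustration of that setup, the sketch below builds the JSON body for creating an index whose content field is analyzed with an edge_ngram tokenizer. The analyzer and tokenizer names (autocomplete, autocomplete_tokenizer) and the gram lengths are assumptions chosen for the example, not values from the original setup. The printed body would be sent with a PUT request when creating a new index.

```python
import json

# Hypothetical names and gram lengths, chosen only for illustration.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "autocomplete_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "autocomplete_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "autocomplete",     # index time: store the prefixes
                "search_analyzer": "standard",  # search time: keep the input whole
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Under these assumptions, a plain match query for elasticse should find docs containing elasticsearch, because the prefix elasticse was already stored as a term at index time. The separate search_analyzer keeps the user's input from being split into prefixes as well.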

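But what exactly is the difference between these two tokenizers?
Here we tokenize the same input, 字符串, with both.

Result of the edge_ngram tokenizer on "字符串" (the output corresponds to min_gram 1 and a max_gram of at least 3):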

{
  "tokens": [
    { "token": "字", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "字符", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "字符串", "start_offset": 0, "end_offset": 3, "type": "word", "position": 2 }
  ]
}

Result of the ngram tokenizer on "字符串":

{
  "tokens": [
    { "token": "字", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "字符", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "字符串", "start_offset": 0, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "符", "start_offset": 1, "end_offset": 2, "type": "word", "position": 3 },
    { "token": "符串", "start_offset": 1, "end_offset": 3, "type": "word", "position": 4 },
    { "token": "串", "start_offset": 2, "end_offset": 3, "type": "word", "position": 5 }
  ]
}

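Put simply:

  • The edge_ngram tokenizer starts only at the first character and emits grams of increasing length, up to the configured maximum or the end of the text.
  • The ngram tokenizer does the same from every character position, not just the first.

If a match must start at the beginning of the word, edge_ngram is the natural choice. If any substring of the text should match, ngram is more suitable.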
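
The difference can be reproduced with a short Python sketch. It is only a model of the two tokenizers under the assumption that the whole input is a single word (it ignores token_chars and other options), and the function names edge_ngrams and ngrams are made up for the example.

```python
def edge_ngrams(text, min_gram, max_gram):
    """Grams anchored at the first character only."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def ngrams(text, min_gram, max_gram):
    """Grams starting at every character position."""
    grams = []
    for start in range(len(text)):
        longest = min(max_gram, len(text) - start)
        for n in range(min_gram, longest + 1):
            grams.append(text[start:start + n])
    return grams


print(edge_ngrams("字符串", 1, 3))
print(ngrams("字符串", 1, 3))
```

The two printed lists contain the same tokens, in the same order, as the two analysis results shown earlier. The ngram list grows much faster with text length, which is the index-size cost of matching arbitrary substrings.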

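3 Fuzzy search for misspelled words: fuzzy

The indexed word is hello, but I misspell it as hallo. Can the document still be found? (This works for English; it is not suited to Chinese.)
The fuzzy query solves this kind of misspelling problem.
Its fuzziness parameter is the number of single-character edits allowed between the search keyword and an indexed term, which is how a misspelling gets corrected. Elasticsearch accepts 0, 1, 2 or AUTO.
hello misspelled as hallo: 1 letter has to be corrected
elasticsearch misspelled as elasticsearxx: 2 letters have to be corrected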
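
The two counts above can be checked with a small edit-distance function. This is a minimal sketch of plain Levenshtein distance (insertions, deletions and substitutions); the function name edit_distance is hypothetical. By default Elasticsearch also counts a swap of two adjacent characters as a single edit, which this sketch does not model.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("hello", "hallo"))                  # 1
print(edit_distance("elasticsearch", "elasticsearxx"))  # 2
```

A fuzzy query matches an indexed term only when this distance is at most the configured fuzziness.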

# fuzzy query for a misspelled word: elasticsearxx has 2 wrong letters
GET /testcopy/_search
{
  "query": {
    "fuzzy": {
      "content": {
        "value": "elasticsearxx",
        "fuzziness": 1
      }
    }
  }
}

The query returns no hits: fuzziness 1 allows one character edit, but elasticsearxx is two edits away from elasticsearch, which exceeds the setting.

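Now change the parameter to fuzziness: 2, which allows two character edits. Elasticsearch supports edit distances of 0, 1 and 2 (or AUTO), so 2 is the maximum.

GET /testcopy/_search
{
  "query": {
    "fuzzy": {
      "content": {
        "value": "elasticsearxx",
        "fuzziness": 2
      }
    }
  }
}

The query now returns results: docs containing elasticsearch match despite the misspelling.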


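At this point we have covered three things: suggestions based on a keyword prefix, using ngram tokenizers to optimize prefix matching, and making a search still succeed when a word is misspelled.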
