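1. References

Geek Time course "Elasticsearch Core Technology and Practice" by Ruan Yiming
ElasticSearch7+Spark: building a high-relevance search service & a personalized recommendation system
Official documentation: full-text-queries
Elasticsearch Query DSL: full-text queries (Full text queries), part 1
Elasticsearch Query DSL: full-text queries (Full text queries), part 2
Elasticsearch custom analysis, starting from one question
Notes on a small match_phrase pitfall in Elasticsearch
Official documentation: slop
Elasticsearch - phrase matching (match_phrase) and the slop parameter
Online nickname generator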

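2. Getting started

Full-text queries

The full-text queries are: match, match phrase, match bool prefix, match phrase prefix, multi match, query string, and simple query string.
Note that all of them analyze (tokenize) the input text before searching.

Let's create some data to use later.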

# Create the index (the ik_max_word analyzer requires the IK analysis plugin)
PUT /actors
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik_max_word", 
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

# Add data [the names are randomly generated; any resemblance to real people is purely coincidental]
PUT /_bulk
{"index": {"_index": "actors", "_id": 1}}
{"name": "陈宛曼"}
{"index": {"_index": "actors", "_id": 2}}
{"name": "陈浩南"}
{"index": {"_index": "actors", "_id": 3}}
{"name": "司徒浩南"}
{"index": {"_index": "actors", "_id": 4}}
{"name": "易白竹"}
{"index": {"_index": "actors", "_id": 5}}
{"name": "单于冰岚"}
{"index": {"_index": "actors", "_id": 6}}
{"name": "宾三春"}
{"index": {"_index": "actors", "_id": 7}}
{"name": "慕容雅云"}
{"index": {"_index": "actors", "_id": 8}}
{"name": "陈子晋"}
{"index": {"_index": "actors", "_id": 9}}
{"name": "党绮梅"}
{"index": {"_index": "actors", "_id": 10}}
{"name": "慕容鸿晖"}

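The analyze process

Before getting into full-text queries, let's look at the analysis (tokenization) step that happens before a search. This step is very important.

We'll use the following sentence as an example:

Water dropping day by day wears the hardest rock away
[the Chinese idiom 水滴石穿: dripping water wears through stone]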

# Index a new document
PUT /mine/_doc/1
{
  "name": "Water dropping day by day wears & the hardest rock away"
}

# Run a simple query
GET /mine/_search
{
  "query": {
    "match": {
      "name": "drop"
    }
  }
}

Query result

{
  "took" : 645,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

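Does this result look a bit strange? In a sense drop does appear in the document: to us, drop and dropping are just different forms of the same word, so why is nothing found? We don't know which tokens the search engine produced, so let's use the _analyze API to inspect them.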

# Use the _analyze API to view the tokens
GET /mine/_analyze
{
  "field": "name",
  "text": "Water dropping day by day wears & the hardest rock away"
}

Here the _analyze API is applied to the name field with the input text (Water ...), and the resulting tokens are:

{
  "tokens" : [
    {
      "token" : "water",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dropping",
      "start_offset" : 6,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "day",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "by",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 22,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "wears",
      "start_offset" : 26,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "the",
      "start_offset" : 32,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "hardest",
      "start_offset" : 36,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "rock",
      "start_offset" : 44,
      "end_offset" : 48,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "away",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "<ALPHANUM>",
      "position" : 9
    }
  ]
}

Now let's see how the search keyword drop is analyzed

GET /mine/_analyze
{
  "field": "name",
  "text": "drop"
}

The result is below: for the input drop, the analyzer produces the single token drop

{
  "tokens" : [
    {
      "token" : "drop",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

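The document was tokenized to dropping, while the input was tokenized to drop. The two look similar to us, but the search engine needs the terms to match exactly, so searching for drop returns nothing.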

Next, let's look at the analyze process

character filters -> tokenizer -> token filters (token transformation)

[Figure: the analyze process]
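
The three stages can be mimicked in a few lines of Python. This is only a toy sketch: the function names, the handling of & and the two-word stop list are made up for illustration and are not how Elasticsearch implements its analyzers. No stemming is done, so dropping stays dropping.

```python
import re

# Hypothetical, tiny stop list for illustration only
STOP_WORDS = {"by", "the"}


def char_filter(text):
    """Character filter stage: rewrite raw characters before tokenizing."""
    return text.replace("&", " ")


def tokenize(text):
    """Tokenizer stage: split the text into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Token filter stage: lowercase each token and drop stop words."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text):
    """Run the three stages in order."""
    return token_filter(tokenize(char_filter(text)))


print(analyze("Water dropping day by day wears & the hardest rock away"))
```

A match query runs the same kind of pipeline on the search text, and a document is found only when the resulting tokens are equal.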

So how do we make drop find the document? We need to specify an analyzer

# Delete the previous index first
DELETE /mine

# Create the index
PUT /mine 
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

# Add the document
PUT /mine/_doc/1
{
  "name": "Water dropping day by day wears & the hardest rock away"
}

Let's check the tokens with the _analyze API

GET /mine/_analyze
{
  "field": "name",
  "text": "Water dropping day by day wears & the hardest rock away"
}

Analysis result

{
  "tokens" : [
    {
      "token" : "water",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "drop",
      "start_offset" : 6,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "dai",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "dai",
      "start_offset" : 22,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "wear",
      "start_offset" : 26,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "hardest",
      "start_offset" : 38,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "rock",
      "start_offset" : 46,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "awai",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 9
    }
  ]
}

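The result is somewhat different. Let's summarize where the two analyzers differ

1. dropping becomes drop
2. by is gone, and so is the [the english analyzer removes stop words such as by and the in the token filter stage]
3. day becomes dai [in the token filter stage the english analyzer also stems tokens, for example reducing a plural to its stem (days -> day -> dai)]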

Now if we run the query again, the document is found

GET /mine/_search
{
  "query": {
    "match": {
      "name": "drop"
    }
  }
}

Query result

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "mine",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "Water dropping day by day wears & the hardest rock away"
        }
      }
    ]
  }
}

Next, let's look at the full-text queries

1. match

Run the following query and look at the result

GET /actors/_search
{
  "query": {
    "match": {
      "name": "慕容浩南"
    }
  }
}

The result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 3.002836,
    "hits" : [
      {
        "_index" : "actors",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 3.002836,
        "_source" : {
          "name" : "陈浩南"
        }
      },
      {
        "_index" : "actors",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 3.002836,
        "_source" : {
          "name" : "司徒浩南"
        }
      },
      {
        "_index" : "actors",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 1.501418,
        "_source" : {
          "name" : "慕容雅云"
        }
      },
      {
        "_index" : "actors",
        "_type" : "_doc",
        "_id" : "10",
        "_score" : 1.501418,
        "_source" : {
          "name" : "慕容鸿晖"
        }
      }
    ]
  }
}

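The result confirms what we said earlier: the input is analyzed, here into three tokens (慕容, 浩, 南) [you can check this with the _analyze API], and the hits are the documents that contain 慕容 or 浩 or 南. But what if I want to search for exactly 慕容浩南?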

GET /actors/_search
{
  "query": {
    "match": {
      "name": {
        "query": "慕容浩南",
        "operator": "and"
      }
    }
  }
}

The search result is below. Unfortunately, there is no such person in our index

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

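So which parameters can be set? [This list is incomplete: I only wrote down the ones I verified; see the official documentation for the rest]

operator: OR or AND. It controls how the tokens produced by analyzing the query string are combined: with OR, one matching token is enough; with AND, all tokens must match. The default is OR
minimum_should_match: the minimum number of tokens that must match. It takes effect when the operator is OR: a document is a hit only if at least minimum_should_match of the tokens match.
For example: we search for "慕容浩南", which is analyzed into "慕容", "浩" and "南". With minimum_should_match set to 2, a document matches as long as its name field contains at least 2 of these tokens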

GET /actors/_search
{
  "query": {
    "match": {
      "name": {
        "query": "慕容浩南",
        "operator": "or",
        "minimum_should_match": "2"
      }
    }
  }
}

The result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 3.002836,
    "hits" : [
      {
        "_index" : "actors",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 3.002836,
        "_source" : {
          "name" : "陈浩南"
        }
      },
      {
        "_index" : "actors",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 3.002836,
        "_source" : {
          "name" : "司徒浩南"
        }
      }
    ]
  }
}

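Type	Example	Meaning
Integer	2	At least 2 tokens must match; a document that matches fewer than 2 is not returned.
Negative integer	-1	The number of tokens that are allowed not to match [so the required number is (total - 1)]
Percentage	75%	The percentage of the total number of tokens that must match.
Combination	2<90%	If the query has 2 tokens or fewer (the integer before <), all of them must match; if it has more than 2, 90% of them must match.
Multiple combinations	2<-25% 9<-3	Several conditional expressions separated by spaces. Here: with 2 tokens or fewer, all must match; with 3 to 9 tokens, all but 25% (note the minus sign) must match; with more than 9 tokens, all but 3 must match.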
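
The rules in the table can be sketched as a small Python function. This is a minimal sketch under my reading of the documented rules (percentages are rounded down); the function name min_should_match is hypothetical, and Elasticsearch's real implementation may differ in edge cases.

```python
def min_should_match(token_count, spec):
    """Return how many of token_count tokens must match under spec."""
    spec = spec.strip()
    if "<" in spec:
        required = token_count  # all tokens, unless a condition applies
        for condition in spec.split():
            bound, sub_spec = condition.split("<", 1)
            if token_count <= int(bound):
                return required
            required = min_should_match(token_count, sub_spec)
        return required
    if spec.endswith("%"):
        # int() truncates toward zero
        amount = int(token_count * int(spec[:-1]) / 100)
    else:
        amount = int(spec)
    # A negative amount is the number of tokens allowed to be missing
    required = token_count + amount if amount < 0 else amount
    return max(required, 0)


# "慕容浩南" is analyzed into 3 tokens
for spec in ["2", "-1", "75%", "2<90%", "2<-25% 9<-3"]:
    print(spec, "->", min_should_match(3, spec))
```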

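analyzer: the analyzer used to analyze the query text
lenient: whether to ignore exceptions caused by data type mismatches; the default is false. For example, querying a numeric field with a text query string throws an error by default.
fuzziness: enables fuzzy matching
zero_terms_query: if the analyzer removes stop words, the query string may be left with no tokens at all after analysis. By default no documents match in that case; to change this, set zero_terms_query to all. The default is none.
auto_generate_synonyms_phrase_query: if true, phrase queries are automatically created for synonyms. The default is true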

2. match phrase

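For a document to be returned by match_phrase, it must satisfy both of the following conditions:

1) all of the tokens produced from the query appear in the field;
2) the tokens appear in the field in the same order as in the query.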

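For a document to match the phrase "quick brown fox", all of the following must be true:

quick, brown and fox must all appear in the field.

The position of brown must be 1 greater than the position of quick.

The position of fox must be 2 greater than the position of quick.

If any of these conditions is not met, the document does not match.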
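
The position conditions above can be sketched in Python. This is a simplified model, assuming the field has already been analyzed into a list of tokens and that a token's list index is its position; the function name phrase_matches is hypothetical.

```python
def phrase_matches(query_tokens, doc_tokens):
    """True if query_tokens occur in doc_tokens consecutively and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[start:start + n] == query_tokens
        for start in range(len(doc_tokens) - n + 1)
    )


query = ["quick", "brown", "fox"]
print(phrase_matches(query, "the quick brown fox".split()))
print(phrase_matches(query, "fox is quick and brown".split()))
```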

An example

# Search for "慕容雅云"; note that the analyzer splits "慕容雅云" into "慕容", "雅" and "云"
GET /actors/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "慕容雅云"
      }
    }
  }
}

The result

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 5.539568,
    "hits" : [
      {
        "_index" : "actors",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 5.539568,
        "_source" : {
          "name" : "慕容雅云"
        }
      }
    ]
  }
}

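There is only one hit, which shows that after analysis only documents containing all of 慕容, 雅 and 云 are selected

Now let's look at another one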

GET /actors/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "雅云慕容"
      }
    }
  }
}

The result is below: nothing at all. So besides containing 慕容, 雅 and 云, the document must also have them in the same order

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

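But suppose I know that this person's name contains 慕容, 雅 and 云, but not whether the surname is 慕容 or 雅云. What then?

We can use the slop parameter. slop tells match_phrase how far apart the terms may be while the document still counts as a match. "How far apart" means: how many times do you need to move a term to make the query and the document match?
Let's look at an example first and then explain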

GET /actors/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "雅云慕容",
        "slop": 3
      }
    }
  }
}

The result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 2.1158507,
    "hits" : [
      {
        "_index" : "actors",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 2.1158507,
        "_source" : {
          "name" : "慕容雅云"
        }
      }
    ]
  }
}

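Now we get a hit, but why? See the official slop documentation listed in the references section, which explains this clearly with a table. Below we do the same for our example.

[Figure: slop]
In brief: we search for "雅云慕容", while the tokens in the document are in the order "慕容", "雅", "云". match_phrase first analyzes the query text, which gives the tokens "雅", "云", "慕容" in that order. Step 1: move "慕容" so that it sits alongside "雅". Step 2: move "雅" so that it sits alongside "云". Step 3: move "云". After these three moves the query lines up with the document, so the slop is 3.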
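
The number of moves can also be estimated in code. The sketch below is a simplification of how sloppy phrase matching measures distance, as I understand it: for each token, take its position in the document minus its position in the query, and use the spread between the largest and smallest of these shifts. It assumes every query token occurs exactly once in the document; the function name required_slop is hypothetical.

```python
def required_slop(query_tokens, doc_tokens):
    """Smallest slop at which the query phrase matches the document tokens.

    Assumes every query token occurs exactly once in doc_tokens.
    """
    # How far each token sits in the document relative to its query position
    shifts = [
        doc_tokens.index(token) - query_position
        for query_position, token in enumerate(query_tokens)
    ]
    return max(shifts) - min(shifts)


print(required_slop(["雅", "云", "慕容"], ["慕容", "雅", "云"]))
print(required_slop(["慕容", "雅", "云"], ["慕容", "雅", "云"]))
```

Under this model the reversed query needs a slop of 3, which agrees with the query above, and the query in the original order needs 0.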

3. match bool prefix

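The match_bool_prefix query analyzes its input and builds a bool query from the resulting terms. Every term except the last is used in a term query; the last term is used in a prefix query.

An example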

GET /actors/_search
{
  "query": {
    "match_bool_prefix": {
      "name": "陈浩南"
    }
  }
}

The example above can be rewritten as the following query

GET /actors/_search
{
  "query": {
    "bool" : {
      "should": [
        { "term": { "name": "陈" }},
        { "term": { "name": "浩" }},
        { "prefix": { "name": "南"}}
      ]
    }
  }
}

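An important difference between match_bool_prefix and match_phrase_prefix is:
match_phrase_prefix matches its terms as a phrase, whereas match_bool_prefix can match its terms in any position. The match_bool_prefix example above can match a field containing "陈浩南", but it can also match "浩南陈". It can also match a field that contains "陈", "浩" and a term starting with "南" in any position. You can of course use minimum_should_match to control how precise the match must be.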
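
The should-clause behaviour can be sketched in Python. This is a simplified model, assuming the field has already been analyzed into a list of tokens and the query is non-empty; the function name bool_prefix_matches is hypothetical, and scoring is ignored.

```python
def bool_prefix_matches(query_tokens, doc_tokens, minimum_should_match=1):
    """Mimic the should-clauses that match_bool_prefix builds from its terms."""
    *terms, last = query_tokens
    # Each term except the last is a hit if it equals some document token
    hits = sum(term in doc_tokens for term in terms)
    # The last term is a hit if it is a prefix of some document token
    hits += any(token.startswith(last) for token in doc_tokens)
    return hits >= minimum_should_match


query = ["陈", "浩", "南"]
print(bool_prefix_matches(query, ["浩", "南", "陈"], minimum_should_match=3))
print(bool_prefix_matches(query, ["司徒", "浩", "南"], minimum_should_match=3))
```

Order plays no role here, which is the difference from match_phrase_prefix: "浩南陈" satisfies all three clauses.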

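4. match phrase prefix

Returns documents that contain the words of the provided text, in the same order as provided. The last term of the provided text is treated as a prefix and matches any word that begins with it.

I find the official explanation easy to understand, so I'll use it directly

# The following search returns documents whose message field contains a phrase starting with quick brown f.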
GET /_search
{
    "query": {
        "match_phrase_prefix" : {
            "message" : {
                "query" : "quick brown f"
            }
        }
    }
}

# This search would match a message value of quick brown fox or two quick brown ferrets, but not fox is quick and brown.
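
The same three message values can be checked with a small Python model. It is a minimal sketch, assuming whitespace-separated tokens and a non-empty query; the function name phrase_prefix_matches is hypothetical.

```python
def phrase_prefix_matches(query_tokens, doc_tokens):
    """True if doc_tokens contain the query as a phrase whose last term is a prefix."""
    *terms, last = query_tokens
    n = len(query_tokens)
    for start in range(len(doc_tokens) - n + 1):
        window = doc_tokens[start:start + n]
        # All terms but the last must match exactly; the last is a prefix
        if window[:-1] == terms and window[-1].startswith(last):
            return True
    return False


query = "quick brown f".split()
for message in ["quick brown fox", "two quick brown ferrets", "fox is quick and brown"]:
    print(message, "->", phrase_prefix_matches(query, message.split()))
```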

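5. multi match

The multi_match query builds on the match query and supports querying multiple fields

Before discussing multi_match, let's import a new batch of data. See: Elasticsearch 7.x in depth: data preparation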

GET /tmdb_movies/_search
{
  "query": {
    "multi_match": {
      "query": "basketball with cartoom aliens",
      "fields": ["title", "overview"]
    }
  }
}

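fields specifies the fields to search
By default the score is the maximum [that is, the larger of the score of the analyzed query against the title field and its score against the overview field]

Boost

If we consider the title field's score more important than the overview field's, we can specify a boost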

GET /tmdb_movies/_search
{
  "query": {
    "multi_match": {
      "query": "basketball with cartoom aliens",
      "fields": ["title^3", "overview"]
    }
  }
}

This multiplies the score on title by 3

tie_breaker

GET /tmdb_movies/_search
{
  "explain": true, 
  "query": {
    "multi_match": {
      "query": "basketball with cartoom aliens",
      "fields": ["title^3", "overview"],
      "tie_breaker": 0.3
    }
  }
}

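What does tie_breaker do? As mentioned above, multi_match takes the maximum score by default, and we assume here that after the 3x boost title scores higher than overview
Let's compare

Assume title scores 0.8, so title^3 scores 2.4
Assume overview scores 0.4
We set tie_breaker to 0.3

Fields	Score without tie_breaker	Score with tie_breaker
["title^3", "overview"]	2.4	2.4 + 0.4 * 0.3

So with tie_breaker, the final score of a document is the highest score among its matching fields plus the scores of the other fields multiplied by tie_breaker. All fields' scores are taken into account instead of simply taking the maximum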
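
The formula above can be written as a short Python function. It only illustrates the arithmetic with the assumed scores from the table (0.8 and 0.4 are made-up numbers, not real Elasticsearch scores); the function name combined_score is hypothetical.

```python
def combined_score(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times the scores of the other fields."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others


title_score = 0.8 * 3   # title^3
overview_score = 0.4
print(combined_score([title_score, overview_score]))
print(combined_score([title_score, overview_score], tie_breaker=0.3))
```

With tie_breaker left at 0 this is the plain maximum; with tie_breaker set to 1.0 the field scores are simply added up.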

type

multi_match supports several types

best_fields: the default scoring mode; the highest field score is used as the document score, the same as dis_max
-- best_fields mode

GET /tmdb_movies/_search
{
  "query": {
    "multi_match": {
      "query": "basketball with cartoom aliens",
      "fields": ["title", "overview"]
    }
  }
}

-- dis_max mode [written differently from multi_match, but it also takes the highest score]

GET /tmdb_movies/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "basketball with cartoom aliens"}},
        {"match": {"overview": "basketball with cartoom aliens"}}
        ]
    }
  }
}

most_fields: the scores of all matching fields are added together

GET /tmdb_movies/_search
{
  "explain": true, 
  "query": {
    "multi_match": {
      "query": "basketball with cartoom aliens",
      "fields": ["title^3", "overview"],
      "type": "most_fields"
    }
  }
}

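cross_fields: computes the total per term: for each search term, the maximum of its scores across the fields is used as that term's score, and the term scores are then added up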

GET /tmdb_movies/_search
{
  "explain": true, 
  "query": {
    "multi_match": {
      "query": "basketball with cartoom aliens",
      "fields": ["title", "overview"],
      "type": "cross_fields"
    }
  }
}

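Let's elaborate on that. We search for basketball with cartoom aliens, which is analyzed into basketball, cartoom and alien [why was with dropped? Because it is a stop word]. The cross_fields type first scores each term against each field, which we can show as a table:

-	basketball	cartoom	alien
title	1.0	2.0	1.5
overview	2.0	3.0	1.0

The table says that basketball scores 1.0 on title and 2.0 on overview, and likewise for the other terms. cross_fields then takes each term's highest score across the fields and adds these up to get the document score. In our example the highest score is 2.0 for basketball, 3.0 for cartoom and 1.5 for alien, so the final document score is 2.0 + 3.0 + 1.5 = 6.5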
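
The calculation can be reproduced in Python. This follows only the simplified model described above, with the made-up scores from the table; Elasticsearch's actual cross_fields scoring is more involved. The function name cross_fields_score is hypothetical.

```python
def cross_fields_score(scores_by_field):
    """Sum, over all terms, of each term's best score across the fields."""
    terms = set()
    for term_scores in scores_by_field.values():
        terms.update(term_scores)
    return sum(
        max(term_scores.get(term, 0.0) for term_scores in scores_by_field.values())
        for term in terms
    )


scores = {
    "title": {"basketball": 1.0, "cartoom": 2.0, "alien": 1.5},
    "overview": {"basketball": 2.0, "cartoom": 3.0, "alien": 1.0},
}
print(cross_fields_score(scores))
```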

6. query_string

# Index two documents
PUT /person/_doc/1
{
  "name": "sun rui kai",
  "know": "java"
}

PUT /person/_doc/2
{
  "name": "gabriella",
  "know": "you and i"
}

Query

# Search the name field; both sun and rui are required
GET /person/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "sun AND rui"
    }
  }
}

# Multiple fields can be specified
GET /person/_search
{
  "query": {
    "query_string": {
      "fields": ["name", "know"],
      "query": "(sun AND rui) OR gabriella"
    }
  }
}

No.	Supported operation (example)	Description
1	a AND b	both a and b are required
2	a NOT b	a is required and b must be absent
3	a OR b	either a or b is enough

7. simple_query_string

As the name suggests, it is similar to query_string, so what exactly is the difference? An example

# Building on the above, add one more document
PUT /person/_doc/3
{
  "name": "sun rui xie",
  "know": "elasticsearch"
}

# Try it with simple_query_string
GET /person/_search
{
  "query": {
    "simple_query_string": {
      "query": "sun AND xie",
      "fields": ["name"]
    }
  }
}

Query result

{
  "took" : 763,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.2990015,
    "hits" : [
      {
        "_index" : "person",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.2990015,
        "_source" : {
          "name" : "sun rui xie",
          "know" : "elasticsearch"
        }
      },
      {
        "_index" : "person",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.4208172,
        "_source" : {
          "name" : "sun rui kai",
          "know" : "java"
        }
      }
    ]
  }
}

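As you can see, with simple_query_string the AND connector no longer works. This is the difference between the two: in simple_query_string, the connectors used by query_string (AND, OR, NOT) are treated as ordinary text to match. So how do we specify the operator?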

GET /person/_search
{
  "query": {
    "simple_query_string": {
      "query": "sun xie",
      "fields": ["name"],
      "default_operator": "AND"
    }
  }
}

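Set default_operator to AND; the default is OR. simple_query_string also has its own operator syntax inside the query text, for example + for AND and | for OR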

Elasticsearch 7.x in depth [4] DSL queries [2]: full-text queries
