Elasticsearch 6.x Study Notes: 16. Full-Text Search

Official Elasticsearch 6.x documentation on full text queries:

The high-level full text queries are usually used for running full text queries on full text fields like the body of an email. They understand how the field being queried is analyzed and will apply each field’s analyzer (or search_analyzer) to the query string before executing.
In short, these queries know how the target field is analyzed, and they run the query string through that field's analyzer (or search_analyzer) before executing.

16.1 match query


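For comparison, the term query does not analyze the query string, so "centos升级" is looked up as one literal term and nothing matches:

GET website/_search
{
  "query": {
    "term": {
      "title": "centos升级"
    }
  }
}

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}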


GET website/_search
{
  "query": {
    "match": {
      ...
    }
  }
}

{
  "took": 36,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "url": "http://url/53868915"
        }
      }
    ]
  }
}

GET website/_search
{
  "query": {
    "match": {
      ...
    }
  }
}

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "url": "http://url/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "url": "http://url/53868915"
        }
      }
    ]
  }
}

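To see why a term lookup and a match query can disagree on the same input, here is a minimal Python sketch. It does not reproduce the responses above: the `analyze` function is a toy stand-in (an assumption) that lowercases text, keeps Latin/digit runs together, and emits each CJK character as its own token, and the two titles are simply copied from the results shown in this section.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption, not the index's real analyzer):
    lowercase, keep Latin/digit runs, emit each CJK character separately."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())

def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def term_query(index, value):
    """Look the value up as one literal term, with no analysis."""
    return index.get(value, set())

def match_query(index, text):
    """Analyze the text first, then match documents containing any term."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits

titles = {"3": "CentOS升级gcc", "6": "CentOS更换国内yum源"}
index = build_index(titles)

term_hits = term_query(index, "centos升级")    # no such single term in the index
match_hits = match_query(index, "centos升级")  # analyzed into centos, 升, 级
```

Under these assumptions the term lookup finds nothing, while the analyzed query matches every title that shares at least one token with it.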
16.2 match_phrase query (phrase query)

Like the match query but used for matching exact phrases or word proximity matches.
It is similar to the match query, but it matches exact phrases, so it is also called a phrase query.


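A document matches only when both of the following hold:

  1. Every term produced by analyzing the query appears in the field.
  2. The terms appear in the field in the same order as in the query.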


(1) Create the test index and add three documents

PUT test

PUT test/hello/1
{ "content":"World Hello"}

PUT test/hello/2
{ "content":"Hello World"}

PUT test/hello/3
{ "content":"I just said hello world"}

(2) Query "hello world" with match_phrase

GET test/_search
{
  "query": {
    "match_phrase": {
      "content": "hello world"
    }
  }
}

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}

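The two phrase conditions (all terms present, in the same order and adjacent) can be sketched in a few lines of Python. This is only an illustration: the `analyze` helper is a toy assumption that lowercases and splits on whitespace, and real Elasticsearch works with token positions stored in the index rather than rescanning the text.

```python
def analyze(text):
    """Toy analyzer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()

def phrase_match(field_text, query_text):
    """Return True if the query tokens occur consecutively, in order."""
    field, query = analyze(field_text), analyze(query_text)
    n = len(query)
    if n == 0:
        return False
    # Slide a window of the query's length across the field tokens.
    return any(field[i:i + n] == query for i in range(len(field) - n + 1))

docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}
phrase_hits = [doc_id for doc_id, text in docs.items()
               if phrase_match(text, "hello world")]
```

With the three test documents, only documents 2 and 3 pass, which agrees with the response above; document 1 contains both terms but in the wrong order.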
16.3 match_phrase_prefix query (prefix query)

The match_phrase_prefix is the same as match_phrase, except that it allows for prefix matches on the last term in the text.

GET test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "hello wor"
    }
  }
}

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}

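The "prefix on the last term" rule can be sketched the same way. The code below is a simplified model under the same toy whitespace analyzer (an assumption): all query tokens except the last must match exactly and adjacently, and the last one only needs to be a prefix of the token in that position.

```python
def analyze(text):
    """Toy analyzer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()

def phrase_prefix_match(field_text, query_text):
    """Phrase match where the final query token is treated as a prefix."""
    field, query = analyze(field_text), analyze(query_text)
    if not query:
        return False
    *head, last = query
    n = len(query)
    for i in range(len(field) - n + 1):
        window = field[i:i + n]
        # Exact match on the leading tokens, prefix match on the last one.
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False

texts = ["World Hello", "Hello World", "I just said hello world"]
prefix_hits = [t for t in texts if phrase_prefix_match(t, "hello wor")]
```

Real Elasticsearch expands the prefix into actual index terms and caps that expansion with the max_expansions parameter (default 50), so results on a large index can differ from this exhaustive check.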
16.4 multi_match query


The multi_match query builds on the match query to allow multi-field queries.

GET website/_search
{
  "query": {
    "multi_match": {
      "query": "centos",
      "fields": ["title", "abstract"]
    }
  }
}

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS更换国内yum源",
          "url": "http://url/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "2",
        "_score": 0.41360322,
        "_source": {
          "title": "watchman源码编译",
          "author": "程裕强",
          "postdate": "2016-12-23",
          "abstract": "CentOS7.x的watchman源码编译",
          "url": "http://url.cn/53844169"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url/53868915"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "7",
        "_score": 0.20055373,
        "_source": {
          "title": "搭建Ember开发环境",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS系统下搭建Ember开发环境",
          "url": "http://url/53947507"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "1",
        "_score": 0.1671281,
        "_source": {
          "title": "Ambari源码编译",
          "author": "程裕强",
          "postdate": "2016-12-21",
          "abstract": "CentOS7.x下的Ambari2.4源码编译",
          "url": "http://url.cn/53788351"
        }
      }
    ]
  }
}

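One way to read "builds on the match query": with the default type (best_fields), the Elasticsearch documentation describes multi_match as generating one match query per field and wrapping them in a dis_max query. The Python sketch below performs that rewrite on a request body. The function name `expand_multi_match` is hypothetical, and the sketch ignores per-field boosts such as `title^5` and all other multi_match options.

```python
def expand_multi_match(multi_match):
    """Rewrite a best_fields multi_match body as a dis_max of match queries.

    Hypothetical helper for illustration; field boosts and other
    multi_match parameters are not handled.
    """
    text = multi_match["query"]
    queries = [{"match": {field: {"query": text}}}
               for field in multi_match["fields"]]
    dis_max = {"queries": queries}
    # tie_breaker, if given, carries over to the dis_max query unchanged.
    if "tie_breaker" in multi_match:
        dis_max["tie_breaker"] = multi_match["tie_breaker"]
    return {"dis_max": dis_max}

expanded = expand_multi_match({"query": "centos", "fields": ["title", "abstract"]})
```

Because dis_max takes the best-scoring clause, a document's score comes mainly from whichever field matches best, rather than from the sum over all fields.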

16.5 common_terms query (common terms query)

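(1) Some words occur very frequently in text yet contribute almost nothing to its meaning, for example a, an, the, and of in English, or 的, 了, 着, 是 and punctuation marks in Chinese. These are stop words. After analysis they are usually filtered out and never indexed. At search time, stop words in the user's query are filtered out as well, because the query string is also analyzed. Excluding stop words speeds up indexing and reduces the size of the index files.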

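(2) Although stop words have little effect on document scoring, they sometimes carry real meaning, and removing them is then clearly wrong. Without stop words, "happy" and "not happy" cannot be told apart, "to be or not to be" cannot be indexed at all, and search precision drops.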

(3) The common_terms query offers a solution. It splits the analyzed query terms into important terms (low frequency terms) and unimportant terms (high frequency terms which would previously have been stopwords). It first searches for documents that match the important terms, then runs a second query for the high-frequency terms, which contribute less to the score; this second query only scores documents already matched by the first.
Terms are allocated to the high or low frequency groups based on the cutoff_frequency, which can be specified as an absolute frequency (>=1) or as a relative frequency (0.0 .. 1.0).

GET website/_search
{
    "query": {
        "common": {
            "title": {
                "query": "to be",
                "cutoff_frequency": 0.0001,
                "low_freq_operator": "and"
            }
        }
    }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.364739,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "9",
        "_score": 2.364739,
        "_source": {
          "title": "to be or not to be",
          "author": "somebody",
          "postdate": "2018-01-03",
          "abstract": "to be or not to be,that is the question",
          "url": "http://url/63991802"
        }
      }
    ]
  }
}

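The grouping by cutoff_frequency can be modeled roughly as follows. This is a simplified sketch, not the actual Lucene implementation: the function name, the document frequencies, and the exact comparison (a term is "high frequency" when its document frequency exceeds the threshold) are assumptions made for illustration.

```python
def split_by_frequency(doc_freqs, num_docs, cutoff_frequency):
    """Split terms into (low_freq, high_freq) groups.

    Simplified model (an assumption): a cutoff >= 1 is an absolute
    document count, a cutoff below 1 is a fraction of num_docs.
    """
    if cutoff_frequency >= 1:
        threshold = cutoff_frequency
    else:
        threshold = cutoff_frequency * num_docs
    low = [term for term, df in doc_freqs.items() if df <= threshold]
    high = [term for term, df in doc_freqs.items() if df > threshold]
    return low, high

# Hypothetical document frequencies in an index of 1000 documents.
freqs = {"the": 900, "quick": 12, "fox": 3}
relative = split_by_frequency(freqs, 1000, 0.1)  # threshold: 10% of documents
absolute = split_by_frequency(freqs, 1000, 5)    # threshold: 5 documents
```

In this model a very small relative cutoff, like the 0.0001 used in the request above, pushes almost every term that occurs at all into the high-frequency group.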
16.6 query_string query


16.7 simple_query_string query


GET website/_search
{
  "query": {
    "simple_query_string": {
      "query": "\"fried eggs\" +(eggplant | potato) -frittata",
      "fields": ["title^5", "abstract"],
      "default_operator": "and"
    }
  }
}

{
  "took": 20,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

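The `fields` entry `"title^5"` in the request above uses the caret notation for a per-field boost: matches in title count five times as much as matches in abstract, which keeps the default boost of 1. As a small illustration, the hypothetical helper below splits such a field spec into a name and a boost; it is a sketch of the notation only, not of how Elasticsearch parses it internally.

```python
def parse_field(spec):
    """Split 'name^boost' into (name, boost); the boost defaults to 1.0.

    Hypothetical helper for illustration only.
    """
    name, sep, boost = spec.partition("^")
    return name, float(boost) if sep else 1.0

boosts = dict(parse_field(spec) for spec in ["title^5", "abstract"])
```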