ES11-Full-Text Search

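The high-level full-text queries are usually used to run full-text searches on full-text fields, such as the body of an email. They understand how the queried field is analyzed, and they apply each field's analyzer (or search_analyzer) to the query string before executing.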

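1.term query

A term query is an exact match: the search term is not analyzed before the search, so it must be exactly one of the tokens that were produced when the document was indexed. (Strictly speaking, term is a term-level query rather than a full-text query; it is covered first here as a contrast to match.)

For example, we can analyze "周五召开董事会会议 审议及批准更新后的一季报" with a specified analyzer.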

GET telegraph/_analyze
{
  "analyzer": "ik_max_word",
  "text": "周五召开董事会会议 审议及批准更新后的一季报"
}

The analysis produces 16 tokens (positions 0 to 15):

{
  "tokens": [
    {
      "token": "周五",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "五",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 1
    },
    {
      "token": "召开",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "董事会",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "董事",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "会会",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "会议",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "审议",
      "start_offset": 10,
      "end_offset": 12,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "及",
      "start_offset": 12,
      "end_offset": 13,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "批准",
      "start_offset": 13,
      "end_offset": 15,
      "type": "CN_WORD",
      "position": 9
    },
    {
      "token": "更新",
      "start_offset": 15,
      "end_offset": 17,
      "type": "CN_WORD",
      "position": 10
    },
    {
      "token": "后",
      "start_offset": 17,
      "end_offset": 18,
      "type": "CN_CHAR",
      "position": 11
    },
    {
      "token": "的",
      "start_offset": 18,
      "end_offset": 19,
      "type": "CN_CHAR",
      "position": 12
    },
    {
      "token": "一季",
      "start_offset": 19,
      "end_offset": 21,
      "type": "CN_WORD",
      "position": 13
    },
    {
      "token": "一",
      "start_offset": 19,
      "end_offset": 20,
      "type": "TYPE_CNUM",
      "position": 14
    },
    {
      "token": "季报",
      "start_offset": 20,
      "end_offset": 22,
      "type": "CN_WORD",
      "position": 15
    }
  ]
}

Now we use term to search for "会议":

GET telegraph/_search
{
  "query": {
    "term": {
      "title": {
        "value": "会议"
      }
    }
  }
}

Because the search term "会议" is one of the tokens, the document is found:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "telegraph",
        "_type": "msg",
        "_id": "AZetp2QBW8hrYY3zGJk7",
        "_score": 0.2876821,
        "_source": {
          "title": "周五召开董事会会议 审议及批准更新后的一季报",
          "content": "以审议及批准更新后的2018年第一季度报告",
          "author": "中兴通讯",
          "pubdate": "2018-07-17T12:33:11"
        }
      }
    ]
  }
}

If we search for "董事会会议" instead:

GET telegraph/_search
{
  "query": {
    "term": {
      "title": {
        "value": "董事会会议"
      }
    }
  }
}

"董事会会议" is a substring of the document text, but it is not one of the tokens, so nothing is found:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
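The two results above can be reproduced with a toy model in Python. This is only a sketch of the idea, not how Elasticsearch is implemented: the token list is copied from the _analyze output earlier in this section, and `term_matches` is a hypothetical helper name.

```python
# Tokens that ik_max_word produced for the title, copied from the _analyze output.
TITLE_TOKENS = [
    "周五", "五", "召开", "董事会", "董事", "会会", "会议", "审议",
    "及", "批准", "更新", "后", "的", "一季", "一", "季报",
]


def term_matches(indexed_tokens, value):
    """Toy term query: the value is NOT analyzed, so it must equal one indexed token."""
    return value in set(indexed_tokens)


print(term_matches(TITLE_TOKENS, "会议"))        # True: "会议" is an indexed token
print(term_matches(TITLE_TOKENS, "董事会会议"))  # False: substring of the text, but not a token
```

The point of the sketch is that a term query compares against the token set, not against the original text.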

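2.match query

A match query first analyzes the search text and then matches each resulting token against the field. So where term is an exact lookup, match is an analyzed, token-based search.

When we search for "河北会议", the search text is first split into "河北" and "会议", and any document containing either of them is returned. The "operator" parameter controls how the tokens are combined: with "operator" set to "and", a document is returned only if its tokens contain both "河北" and "会议". The default "operator" is "or", which means a single matching token is enough.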

GET telegraph/_search
{
  "query": {
    "match": {
      "title": {
        "query": "河北会议"
      }
    }
  }
}

Search results:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.99277425,
    "hits": [
      {
        "_index": "telegraph",
        "_type": "msg",
        "_id": "BJetp2QBW8hrYY3zGJk7",
        "_score": 0.99277425,
        "_source": {
          "title": "河北聚焦十大行业推进国际产能合作",
          "content": "河北省政府近日出台积极参与“一带一路”建设推进国际产能合作实施方案",
          "author": "财联社",
          "pubdate": "2018-07-17T14:14:55"
        }
      },
      {
        "_index": "telegraph",
        "_type": "msg",
        "_id": "AZetp2QBW8hrYY3zGJk7",
        "_score": 0.2876821,
        "_source": {
          "title": "周五召开董事会会议 审议及批准更新后的一季报",
          "content": "以审议及批准更新后的2018年第一季度报告",
          "author": "中兴通讯",
          "pubdate": "2018-07-17T12:33:11"
        }
      }
    ]
  }
}

If we set "operator" to "and":

GET telegraph/_search
{
  "query": {
    "match": {
      "title": {
        "query": "河北会议",
        "operator": "and"
      }
    }
  }
}

No document has a title whose tokens contain both "河北" and "会议", so the result is empty:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
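The difference between "or" and "and" can be sketched in Python. This is a toy model under stated assumptions, not Elasticsearch's implementation: the query tokens are the two that the text above says "河北会议" is split into, the token list for the first title is a hand-written, partial guess (the real ik_max_word output is not shown in this post), and `match_matches` is a hypothetical helper.

```python
# Title tokens per document id.
DOCS = {
    # ASSUMPTION: partial, hand-written tokens for "河北聚焦十大行业推进国际产能合作".
    "BJetp2QBW8hrYY3zGJk7": ["河北", "聚焦", "十大", "行业", "推进", "国际", "产能", "合作"],
    # Copied from the _analyze output for "周五召开董事会会议 审议及批准更新后的一季报".
    "AZetp2QBW8hrYY3zGJk7": [
        "周五", "五", "召开", "董事会", "董事", "会会", "会议", "审议",
        "及", "批准", "更新", "后", "的", "一季", "一", "季报",
    ],
}

QUERY_TOKENS = ["河北", "会议"]  # the analyzed form of "河北会议", as described in the text


def match_matches(indexed_tokens, query_tokens, operator="or"):
    """Toy match query: 'or' needs any query token, 'and' needs all of them."""
    indexed = set(indexed_tokens)
    hits = [token in indexed for token in query_tokens]
    if operator == "or":
        return any(hits)
    if operator == "and":
        return all(hits)
    raise ValueError("operator must be 'or' or 'and'")


def search(operator):
    """Return the ids of all documents whose title matches the query tokens."""
    return [doc_id for doc_id, tokens in DOCS.items()
            if match_matches(tokens, QUERY_TOKENS, operator)]


print(search("or"))   # both documents
print(search("and"))  # []
```

Each title contains only one of the two tokens, which is why "or" returns both documents and "and" returns none.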

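3.match_phrase query

A match_phrase query analyzes the query text (the analyzer can be customized), and a document is returned only if it satisfies all three of the following conditions:

  1. Every token of the analyzed query appears in the field
  2. The tokens appear in the field in the same order as in the query
  3. The tokens are adjacent to each other

Using the same example, a search for "董事会会议" now finds the document. If the tokens appear in a different order, or are not adjacent, the document is not found.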

GET telegraph/_search
{
  "query": {
    "match_phrase": {
      "title":{
        "query": "董事会会议"
      }
    }
  }
}

Search results:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.1507283,
    "hits": [
      {
        "_index": "telegraph",
        "_type": "msg",
        "_id": "AZetp2QBW8hrYY3zGJk7",
        "_score": 1.1507283,
        "_source": {
          "title": "周五召开董事会会议 审议及批准更新后的一季报",
          "content": "以审议及批准更新后的2018年第一季度报告",
          "author": "中兴通讯",
          "pubdate": "2018-07-17T12:33:11"
        }
      }
    ]
  }
}
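The three conditions can be checked against token positions, as in the sketch below. It is a toy model, not Elasticsearch's phrase matching code. The positions are the ones reported by the _analyze output earlier in this post. We assume that the query "董事会会议" is analyzed into the same four tokens, at consecutive positions, as the corresponding part of the title; `build_positions` and `phrase_matches` are hypothetical helpers.

```python
from collections import defaultdict

# The list index equals the "position" reported by _analyze for the title.
TITLE_TOKENS = [
    "周五", "五", "召开", "董事会", "董事", "会会", "会议", "审议",
    "及", "批准", "更新", "后", "的", "一季", "一", "季报",
]


def build_positions(tokens):
    """Map each token to the set of positions where it occurs."""
    positions = defaultdict(set)
    for pos, token in enumerate(tokens):
        positions[token].add(pos)
    return positions


def phrase_matches(positions, query_tokens):
    """Toy match_phrase: query token i must sit at position start + i for some start."""
    if not query_tokens:
        return False
    for start in positions.get(query_tokens[0], ()):
        if all((start + i) in positions.get(token, ())
               for i, token in enumerate(query_tokens)):
            return True
    return False


index = build_positions(TITLE_TOKENS)

# ASSUMPTION: analyzed form of the query "董事会会议".
print(phrase_matches(index, ["董事会", "董事", "会会", "会议"]))  # True
print(phrase_matches(index, ["会议", "董事会"]))                  # False: wrong order
print(phrase_matches(index, ["召开", "会议"]))                    # False: not adjacent
```

Because ik_max_word emits overlapping tokens, "adjacent" here means consecutive token positions, not consecutive characters.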

4.match_phrase_prefix

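match_phrase_prefix is similar to match_phrase, except that the last token of the search text only needs to match as a prefix.

In the example above, the document's tokens include the adjacent pair "召开" and "董事会". With match_phrase_prefix, the search text only needs to contain "召开" followed by a prefix of "董事会" in order to match.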

GET telegraph/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "召开董"
      }
    }
  }
}

Search results:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8630463,
    "hits": [
      {
        "_index": "telegraph",
        "_type": "msg",
        "_id": "AZetp2QBW8hrYY3zGJk7",
        "_score": 0.8630463,
        "_source": {
          "title": "周五召开董事会会议 审议及批准更新后的一季报",
          "content": "以审议及批准更新后的2018年第一季度报告",
          "author": "中兴通讯",
          "pubdate": "2018-07-17T12:33:11"
        }
      }
    ]
  }
}
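The prefix rule can be sketched as a small variation on phrase matching. Again this is a toy model rather than the real implementation (Elasticsearch expands the prefix into actual index terms, limited by max_expansions). We assume that "召开董" is analyzed into the tokens "召开" and "董", and `phrase_prefix_matches` is a hypothetical helper. The list index doubles as the token position, which holds for the _analyze output shown earlier.

```python
# The list index equals the "position" reported by _analyze for the title.
TITLE_TOKENS = [
    "周五", "五", "召开", "董事会", "董事", "会会", "会议", "审议",
    "及", "批准", "更新", "后", "的", "一季", "一", "季报",
]


def phrase_prefix_matches(doc_tokens, query_tokens):
    """Toy match_phrase_prefix: exact match on all but the last token, prefix match on the last."""
    if not query_tokens:
        return False
    *head, last = query_tokens
    width = len(query_tokens)
    for start in range(len(doc_tokens) - width + 1):
        window = doc_tokens[start:start + width]
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False


# ASSUMPTION: analyzed form of the query "召开董".
print(phrase_prefix_matches(TITLE_TOKENS, ["召开", "董"]))  # True: "董" is a prefix of "董事会"
print(phrase_prefix_matches(TITLE_TOKENS, ["召开", "事"]))  # False: "董事会" does not start with "事"
```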

5.multi_match

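When we want to match against several fields, and return a document as soon as any one of those fields contains a query token, we can use multi_match.

We search for "聚焦成交"; a document is returned if either its "title" or its "content" field contains "聚焦" or "成交".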

GET telegraph/_search
{
  "query": {
    "multi_match": {
      "query": "聚焦成交",
      "fields": ["title","content"]
    }
  }
}

Search results:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.0806551,
    "hits": [
      {
        "_index": "telegraph",
        "_type": "msg",
        "_id": "Apetp2QBW8hrYY3zGJk7",
        "_score": 1.0806551,
        "_source": {
          "title": "长生生物再次跌停 三机构抛售近1000万元",
          "content": "长生生物再次一字跌停,报收19.89元,成交1432万元",
          "author": "长生生物",
          "pubdate": "2018-07-17T10:03:11"
        }
      },
      {
        "_index": "telegraph",
        "_type": "msg",
        "_id": "BJetp2QBW8hrYY3zGJk7",
        "_score": 0.99277425,
        "_source": {
          "title": "河北聚焦十大行业推进国际产能合作",
          "content": "河北省政府近日出台积极参与“一带一路”建设推进国际产能合作实施方案",
          "author": "财联社",
          "pubdate": "2018-07-17T14:14:55"
        }
      }
    ]
  }
}
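Why these two documents come back can be sketched with a toy model. It covers only the matching rule, not scoring, and it is not Elasticsearch's implementation. The per-field token lists are hand-written, partial guesses (this post does not show the real ik output for these documents), the query tokens "聚焦" and "成交" are assumed, and `multi_match_matches` is a hypothetical helper.

```python
# ASSUMPTION: partial, hand-written tokens per field; real ik output would contain more.
DOCS = {
    "Apetp2QBW8hrYY3zGJk7": {
        "title": ["长生", "生物", "再次", "跌停", "机构", "抛售"],
        "content": ["长生", "生物", "再次", "跌停", "报收", "成交"],
    },
    "BJetp2QBW8hrYY3zGJk7": {
        "title": ["河北", "聚焦", "行业", "推进", "国际", "产能", "合作"],
        "content": ["河北", "省政府", "出台", "推进", "国际", "产能", "合作"],
    },
}

QUERY_TOKENS = ["聚焦", "成交"]  # ASSUMPTION: analyzed form of "聚焦成交"


def multi_match_matches(doc_fields, query_tokens, fields):
    """Toy multi_match: true if any listed field contains any query token."""
    for field in fields:
        indexed = set(doc_fields.get(field, ()))
        if any(token in indexed for token in query_tokens):
            return True
    return False


hits = [doc_id for doc_id, doc in DOCS.items()
        if multi_match_matches(doc, QUERY_TOKENS, ["title", "content"])]
print(hits)  # both ids: one matches through "content", the other through "title"
```

Restricting the field list to `["title"]` in this sketch would leave only the second document, since "成交" appears only in the first document's content.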

6.common_terms

 

7.query_string

 

8.simple_query_string

 

 

Reposted from: https://my.oschina.net/u/3100849/blog/1858702
