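9.2.1 - Elasticsearch full-text search: the intervals query

1. The intervals query

The intervals query applies matching rules to the terms of a specified field.
Each rule produces minimal intervals that span terms in the text; a parent rule can then combine or filter these intervals.

Example intervals query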

//Request: match documents whose desc field contains "distributed search" with no gap, followed by "analytics engine" or "Elastic Stack"
GET software/_search
{
  "query": {
    "intervals": {
      "desc": {
        "all_of": {
          "ordered": true,
          "intervals": [
            {
              "match": {
                "query": "distributed search",
                "max_gaps": 0,
                "ordered": true
              }
            },
            {
              "any_of": {
                "intervals": [
                  { "match": { "query": "analytics engine" } },
                  { "match": { "query": "Elastic Stack" } }
                ]
              }
            }
          ]
        }
      }
    }
  }
}

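Top-level parameters of the intervals query

No. | Parameter | Description
1 | <field> | (Required, rule object) The field to search. Its value is a rule object that matches documents based on the matching terms, their order, and their proximity.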

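2. Rule keywords of the intervals query

The valid rule keywords are:

No. | Keyword | Description
1 | match | Matches analyzed text
2 | prefix | Matches terms that start with a given string
3 | wildcard | Matches terms using a wildcard pattern
4 | fuzzy | Matches terms similar to a given term, within an edit distance
5 | all_of | Combines several rules, all of which must match
6 | any_of | Matches if any one of its sub-rules matches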
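
2.1 match rule parameters

The match rule matches analyzed text.

Parameters:

No. | Parameter | Description
1 | query | (Required, string) The text to search for.
2 | max_gaps | (Optional, integer) Maximum number of positions between the matching terms. Defaults to -1. If unspecified or set to -1, the gap is unrestricted; if set to 0, the terms must appear next to each other.
3 | ordered | (Optional, Boolean) If true, the matching terms must appear in the specified order. Defaults to false.
4 | analyzer | (Optional, string) Analyzer applied to the query text. Defaults to the analyzer of the top-level <field>.
5 | filter | (Optional, rule object) An interval filter.
6 | use_field | (Optional, string) If specified, intervals are matched from this field instead of the top-level <field>, and the query text is analyzed with this field's search analyzer.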
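
2.2 prefix rule parameters

The prefix rule matches terms that start with the specified string. If the prefix expands to more than 128 terms, Elasticsearch returns an error;
setting the index_prefixes option in the field mapping avoids this limit.

Parameters:

No. | Parameter | Description
1 | prefix | (Required, string) The string that matching terms must start with.
2 | analyzer | (Optional, string) Analyzer used to normalize the prefix. Defaults to the analyzer of the top-level <field>.
3 | use_field | (Optional, string) If specified, intervals are matched from this field instead of the top-level <field>.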
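
As a rough illustration of the index_prefixes option mentioned above, the sketch below builds a mapping and a prefix query as Python dictionaries and prints them as JSON for pasting into the console. It assumes the `software` index and `desc` field used throughout this article; the `min_chars` and `max_chars` values shown are simply the documented defaults, and the prefix `dist` is an arbitrary example.

```python
import json

# Assumed mapping for the article's `software` index: index_prefixes asks
# Elasticsearch to index term prefixes of `desc` (2 to 5 characters here).
mapping = {
    "mappings": {
        "properties": {
            "desc": {
                "type": "text",
                "index_prefixes": {"min_chars": 2, "max_chars": 5},
            }
        }
    }
}

# An intervals query that uses the prefix rule on the same field.
prefix_query = {
    "query": {
        "intervals": {
            "desc": {
                "prefix": {"prefix": "dist"}
            }
        }
    }
}

# Bodies for `PUT software` and `GET software/_search` respectively.
print(json.dumps(mapping, indent=2))
print(json.dumps(prefix_query, indent=2))
```

The mapping would normally be supplied when the index is created, before any documents are indexed.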
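
2.3 wildcard rule parameters

The wildcard rule matches terms using a wildcard pattern. If the pattern expands to more than 128 terms, Elasticsearch returns an error.

Parameters:

No. | Parameter | Description
1 | pattern | (Required, string) The wildcard pattern. Two wildcard operators are supported: ? matches any single character; * matches zero or more characters, including an empty string.
2 | analyzer | (Optional, string) Analyzer used to normalize the pattern. Defaults to the analyzer of the top-level <field>.
3 | use_field | (Optional, string) If specified, intervals are matched from this field instead of the top-level <field>.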
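
2.4 fuzzy rule parameters

The fuzzy rule matches terms that are similar to the given term, that is, terms within the allowed edit distance. If the fuzzy expansion matches more than 128 terms, Elasticsearch returns an error.

Parameters:

No. | Parameter | Description
1 | term | (Required, string) The term to match.
2 | prefix_length | (Optional, integer) Number of leading characters left unchanged when creating expansions. Defaults to 0.
3 | transpositions | (Optional, Boolean) Whether edits include transpositions of two adjacent characters (ab -> ba). Defaults to true.
4 | fuzziness | (Optional, string) Maximum edit distance allowed for a match. Defaults to auto.
5 | analyzer | (Optional, string) Analyzer used to normalize the term. Defaults to the analyzer of the top-level <field>.
6 | use_field | (Optional, string) If specified, intervals are matched from this field instead of the top-level <field>.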
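
The article gives no request for the wildcard and fuzzy rules, so here is a minimal sketch of what such requests could look like, again built as Python dictionaries. The helper `intervals_query` is a hypothetical convenience function, and the pattern `distribut*` and the misspelled term `serch` are made-up inputs chosen to fit the sample documents.

```python
import json


def intervals_query(field, rule):
    """Wrap a single intervals rule in a search request body."""
    return {"query": {"intervals": {field: rule}}}


# wildcard rule: * stands for zero or more characters.
wildcard_query = intervals_query("desc", {"wildcard": {"pattern": "distribut*"}})

# fuzzy rule: keep the first character fixed and allow the default edit distance.
fuzzy_query = intervals_query("desc", {
    "fuzzy": {
        "term": "serch",
        "prefix_length": 1,
        "transpositions": True,
        "fuzziness": "auto",
    }
})

# Each printed body can be sent with `GET software/_search`.
print(json.dumps(wildcard_query, indent=2))
print(json.dumps(fuzzy_query, indent=2))
```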
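
2.5 all_of rule parameters

The all_of rule returns matches that span a combination of other rules.

Parameters:

No. | Parameter | Description
1 | intervals | (Required, array of rule objects) The rules to combine. Every rule must produce a match in a document for the document to match.
2 | max_gaps | (Optional, integer) Maximum number of positions between the matching terms. Defaults to -1. If unspecified or set to -1, the gap is unrestricted; if set to 0, the terms must appear next to each other.
3 | ordered | (Optional, Boolean) If true, the intervals produced by the rules must appear in the specified order. Defaults to false.
4 | filter | (Optional, rule object) An interval filter.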
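
2.6 any_of rule parameters

The any_of rule matches documents in which any of its sub-rules matches.

Parameters:

No. | Parameter | Description
1 | intervals | (Required, array of rule objects) The rules to match; a match on any one of them is sufficient.
2 | filter | (Optional, rule object) An interval filter.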
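
2.7 filter rule parameters

The filter rule returns intervals based on a query.

Parameters:

No. | Parameter | Description
1 | after | (Optional, query object) The query's interval comes after an interval from the filter.
2 | before | (Optional, query object) The query's interval comes before an interval from the filter.
3 | contained_by | (Optional, query object) The query's interval is contained by an interval from the filter.
4 | containing | (Optional, query object) The query's interval contains an interval from the filter.
5 | not_contained_by | (Optional, query object) The query's interval is not contained by an interval from the filter.
6 | not_containing | (Optional, query object) The query's interval does not contain an interval from the filter.
7 | not_overlapping | (Optional, query object) The query's interval does not overlap an interval from the filter.
8 | overlapping | (Optional, query object) The query's interval overlaps an interval from the filter.
9 | script | (Optional, script object) A script that decides which intervals match; it must return a Boolean value.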

//The following query uses a filter rule and imposes two conditions:
//1. In the desc field, the two terms of the query text must be at most 3 positions apart (max_gaps)
//2. The term 'redis' must not appear between the matched terms 'distributed' and 'engine'
POST software/_search
{
  "query": {
    "intervals": {
      "desc": {
        "match": {
          "query": "distributed engine",
          "max_gaps": 3,
          "filter": {
            "not_containing": {
              "match": {
                "query": "redis"
              }
            }
          }
        }
      }
    }
  }
}

//Response (try the query against different documents to compare)
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.19999999,
    "hits" : [
      {
        "_index" : "software",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.19999999,
        "_source" : {
          "title" : "elasticsearch",
          "desc" : "Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack"
        }
      },
      {
        "_index" : "software",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.19999999,
        "_source" : {
          "title" : "elasticsearch",
          "desc" : "distributed search and analytics engine at the heart of the Elastic Stack"
        }
      }
    ]
  }
}


//The interval matching 'distributed engine' must come before 'redis'
GET software/_search
{
  "query": {
    "intervals": {
      "desc": {
        "match": {
          "query": "distributed engine",
          "max_gaps": 3,
          "filter": {
            "before": {
              "match": {
                "query": "redis"
              }
            }
          }
        }
      }
    }
  }
}

//Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.19999999,
    "hits" : [
      {
        "_index" : "software",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.19999999,
        "_source" : {
          "title" : "elasticsearch",
          "desc" : "distributed search redis analytics engine redis"
        }
      }
    ]
  }
}



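//Filter with a script: the interval must start after position 1, end before position 10, and contain exactly 3 gaps
GET software/_search
{
  "query": {
    "intervals": {
      "desc": {
        "match": {
          "query": "distributed engine",
          "filter": {
            "script": {
              "source": "interval.start > 1 && interval.end < 10 && interval.gaps == 3"
            }
          }
        }
      }
    }
  }
}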


//Response
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.19999999,
    "hits" : [
      {
        "_index" : "software",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.19999999,
        "_source" : {
          "title" : "elasticsearch",
          "desc" : "Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack"
        }
      }
    ]
  }
}
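
To see why only document 1 passes the script filter, it can help to work out interval.start, interval.end and interval.gaps by hand. The following is a minimal sketch of that calculation in Python. It is a simplified brute-force model, not Lucene's actual algorithm: it assumes lowercase whitespace tokenization with zero-based positions, and the function name `minimal_intervals` is made up for illustration.

```python
from itertools import product


def minimal_intervals(tokens, terms, ordered=False, max_gaps=-1):
    """Return (start, end, gaps) for each minimal interval covering all terms."""
    # Positions at which each query term occurs.
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    candidates = set()
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # one position cannot serve two terms
        if ordered and list(combo) != sorted(combo):
            continue  # terms are not in the requested order
        candidates.add((min(combo), max(combo)))
    # Minimization: drop any interval that contains another candidate.
    minimal = [
        (s, e) for (s, e) in candidates
        if not any((s2, e2) != (s, e) and s <= s2 and e2 <= e for (s2, e2) in candidates)
    ]
    result = []
    for s, e in sorted(minimal):
        gaps = (e - s + 1) - len(terms)  # positions not taken by a query term
        if max_gaps < 0 or gaps <= max_gaps:  # max_gaps is applied after minimization
            result.append((s, e, gaps))
    return result


doc1 = "Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack"
doc4 = "distributed search and analytics engine at the heart of the Elastic Stack"

for desc in (doc1, doc4):
    tokens = desc.lower().split()
    for start, end, gaps in minimal_intervals(tokens, ["distributed", "engine"]):
        passes = start > 1 and end < 10 and gaps == 3  # same test as the script
        print(start, end, gaps, passes)
# prints "3 7 3 True" for doc1 and "0 4 3 False" for doc4
```

In this model document 4 fails only because its interval starts at position 0, which agrees with the response above.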

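Minimization
The intervals query always minimizes intervals so that queries run in linear time. This can sometimes produce surprising results, particularly when max_gaps or a filter is used. For example, the following query looks for an interval matching 'library API' that is contained by an interval matching 'code':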

GET software/_search
{
  "query": {
    "intervals": {
      "desc": {
        "match": {
          "query": "library API",
          "filter": {
            "contained_by": {
              "match": {
                "query": "code"
              }
            }
          }
        }
      }
    }
  }
}

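This query does not match the phrase "but rather a code library and API that can easily be used": the interval for code covers a single position, so it cannot contain the longer interval that spans library and API. Changing contained_by to after does match, because the 'library API' interval starts after code.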
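
The difference between contained_by and after can be checked on the positions involved. The sketch below is an illustrative model only: `tokenize` is a rough stand-in for the standard analyzer, and `span`, `contained_by` and `after` are hypothetical helpers that treat an interval as a (start, end) pair of zero-based positions.

```python
import re


def tokenize(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def span(tokens, terms):
    """Smallest (start, end) covering the first occurrence of each term."""
    positions = [tokens.index(term) for term in terms]
    return min(positions), max(positions)


def contained_by(interval, reference):
    """True if interval lies entirely inside reference."""
    return reference[0] <= interval[0] and interval[1] <= reference[1]


def after(interval, reference):
    """True if interval starts after reference ends."""
    return interval[0] > reference[1]


desc = "Lucene is not a complete application, but rather a code library and API that can easily be used"
tokens = tokenize(desc)
library_api = span(tokens, ["library", "api"])
code = span(tokens, ["code"])

print(library_api, code)                # (10, 12) (9, 9)
print(contained_by(library_api, code))  # False
print(after(library_api, code))         # True
```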

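Another restriction concerns any_of rules whose sub-rules overlap. If one sub-rule is a strict prefix of another, the longer one can never match, which gives surprising results when combined with max_gaps. Consider the following query: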

GET software/_search
{
  "query": {
    "intervals": {
      "desc": {
        "all_of": {
          "intervals": [
            { "match": { "query": "add" } },
            {
              "any_of": {
                "intervals": [
                  { "match": { "query": "search" } },
                  { "match": { "query": "search capabilities" } }
                ]
              }
            },
            { "match": { "query": "to" } }
          ],
          "max_gaps": 0,
          "ordered": true
        }
      }
    }
  }
}

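By the minimization rule, this query is not expected to match 'add search capabilities to': the any_of rule only produces the interval for search, because the interval for 'search capabilities' starts at the same position but is longer and is therefore minimized away. In that case the query has to be rewritten so that each alternative is spelled out in full: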

GET software/_search
{
  "query": {
    "intervals": {
      "desc": {
        "any_of": {
          "intervals": [
            {
              "match": {
                "query": "add search capabilities to",
                "max_gaps": 0,
                "ordered": true
              }
            },
            {
              "match": {
                "query": "add search to",
                "max_gaps": 0,
                "ordered": true
              }
            }
          ]
        }
      }
    }
  }
}


//In this test, the two queries above returned the same result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.3333333,
    "hits" : [
      {
        "_index" : "software",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 0.3333333,
        "_source" : {
          "title" : "lucene",
          "desc" : "Lucene is not a complete application, but rather a code library and API that can easily be used to add search capabilities to applications"
        }
      }
    ]
  }
}
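
When there are many alternatives, writing out every full phrase by hand becomes tedious. As a rough sketch, the rewritten any_of query could be generated with a small helper such as the one below; `expand_alternatives` and its parameter names are hypothetical and serve only to illustrate the rewrite.

```python
import json


def expand_alternatives(field, before, alternatives, after):
    """Build an any_of query with one fully spelled-out phrase per alternative."""
    rules = [
        {
            "match": {
                "query": " ".join([before, alternative, after]),
                "max_gaps": 0,
                "ordered": True,
            }
        }
        for alternative in alternatives
    ]
    return {"query": {"intervals": {field: {"any_of": {"intervals": rules}}}}}


body = expand_alternatives("desc", "add", ["search capabilities", "search"], "to")

# The printed body can be sent with `GET software/_search`.
print(json.dumps(body, indent=2))
```

Each generated match rule carries its own max_gaps and ordered settings, so no alternative depends on how a nested any_of is minimized.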

The documents indexed for the queries above:

PUT software/_doc/1
{
  "title": "elasticsearch",
  "desc": "Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack"
}

PUT software/_doc/2
{
  "title": "redis",
  "desc": "Redis is an open source, in-memory data structure store, used as a database, cache and message broker"
}

PUT software/_doc/3
{
  "title": "Lucene",
  "desc": "Lucene Core is a Java library providing powerful indexing and search features, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities"
}

PUT software/_doc/4
{
  "title": "elasticsearch",
  "desc": "distributed search and analytics engine at the heart of the Elastic Stack"
}

PUT software/_doc/5
{
  "title": "elasticsearch",
  "desc": "distributed search redis analytics engine redis"
}

PUT software/_doc/6
{
  "title": "lucene",
  "desc": "Lucene is not a complete application, but rather a code library and API that can easily be used to add search capabilities to applications"
}
