ES 7.x: custom analyzers + scroll queries

November is here!

  • Custom analyzer
    PUT user
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "pinyin_analyzer":{
              "tokenizer":"my_pinyin"
            }
          },
          "tokenizer": {
            "my_pinyin":{
              "type":"pinyin",
              "keep_full_pinyin":true,
              "keep_original":true,
              "limit_first_letter_length":16,
              "lowercase":true,
              "remove_duplicated_term":true,
              "keep_separate_first_letter":false
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "name":{
            "type": "keyword",
            "fields": {
              "my_pinyin":{
                "type":"text",
                "analyzer":"pinyin_analyzer"
              }
            }
          }
        }
      }
    }

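    We start by creating an index with the settings above. Under settings we define a custom analyzer named pinyin_analyzer, backed by a custom tokenizer of type pinyin, and set that tokenizer's options; the name field gets a my_pinyin sub-field that uses the analyzer. The option that seems most important is keep_full_pinyin: true, which converts every Chinese character to its full pinyin. For the other options, see the documentation at https://github.com/medcl/elasticsearch-analysis-pinyin. Next, let's run some text through the analyzer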

    {
      "tokens" : [
        {
          "token" : "liu",
          "start_offset" : 0,
          "end_offset" : 0,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "刘德华",
          "start_offset" : 0,
          "end_offset" : 0,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "ldh",
          "start_offset" : 0,
          "end_offset" : 0,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "de",
          "start_offset" : 0,
          "end_offset" : 0,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "hua",
          "start_offset" : 0,
          "end_offset" : 0,
          "type" : "word",
          "position" : 2
        }
      ]
    }

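    As you can see, the pinyin tokenizer has split 刘德华 into fairly detailed tokens: the full pinyin of each character (liu, de, hua), the original text, and the first-letter abbreviation ldh. A term query against the inverted index on any of these tokens finds the document, which is quite handy.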
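    The post does not show the request that produced the token list above, nor the term query it mentions. The sketch below is one plausible reconstruction, assuming the standard _analyze and _search endpoints and the user index, pinyin_analyzer and name.my_pinyin names from the mapping. The helper names analyze_request and pinyin_term_query are made up for illustration; they only build the request, they do not send it.

```python
import json


def analyze_request(index: str, analyzer: str, text: str):
    """Build a POST <index>/_analyze call that runs `text` through `analyzer`."""
    return "POST", f"/{index}/_analyze", {"analyzer": analyzer, "text": text}


def pinyin_term_query(index: str, token: str):
    """Build a term query on the name.my_pinyin sub-field defined in the mapping."""
    body = {"query": {"term": {"name.my_pinyin": {"value": token}}}}
    return "GET", f"/{index}/_search", body


# Print both requests in a form that can be pasted into a console.
for method, path, body in (
    analyze_request("user", "pinyin_analyzer", "刘德华"),
    pinyin_term_query("user", "ldh"),
):
    print(method, path)
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

    Searching for ldh, liu or the original 刘德华 should all hit the same document, because each of them is a token emitted by the analyzer.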

  • Index aliases

    POST _aliases
    {
      "actions": [
        {
          "add": {
            "index": "movies",
            "alias": "myindex2",
            "filter": {
              "range": {
                "year": {
                  "gte": 1
                }
              }
            }
          }
        }
      ]
    }

    When you add an alias to an index you can attach a filter; searches that go through the alias then only see the docs that pass the filter.

  • Compound queries

  1. Multiply the query score by a value derived from a field to boost the weight

    POST movies/_search
    {
      "explain": true, 
      "size": 2, 
      "query": {
        "function_score": {
          "query": {
            "multi_match": {
              "query": "Old",
              "fields": ["title","genre.keyword"]
            }
          },
          "field_value_factor": {
            "field":"year",
            "modifier": "log2p",    // apply a function to the field value: _score * log(2 + factor * year)
            "factor": 0.01          // scale the field value down so the boost grows slowly
          }
        }
      }
    }

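    The query above finds documents whose title contains old or whose genre.keyword matches it, scores them for relevance, multiplies each score by a value derived from the year field (log(2 + 0.01 * year) here, not the raw year), and sorts by the result.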

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 47,
          "relation" : "eq"
        },
        "max_score" : 9.856819,
        "hits" : [
          {
            "_shard" : "[movies][0]",
            "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
            "_index" : "movies",
            "_type" : "_doc",
            "_id" : "72696",
            "_score" : 9.856819,
            "_source" : {
              "year" : 2009,
              "genre" : [
                "Comedy"
              ],
              "@version" : "1",
              "id" : "72696",
              "title" : "Old Dogs"
            },
            "_explanation" : {
              "value" : 9.856819,
              "description" : "function score, product of:",
              "details" : [
                {
                  "value" : 7.3328753,
                  "description" : "max of:",
                  "details" : [
                    {
                      "value" : 7.3328753,
                      "description" : "weight(title:old in 14201) [PerFieldSimilarity], result of:",
                      "details" : [
                        {
                          "value" : 7.3328753,
                          "description" : "score(freq=1.0), product of:",
                          "details" : [
                            {
                              "value" : 2.2,
                              "description" : "boost",
                              "details" : [ ]
                            },
                            {
                              "value" : 6.3534727,
                              "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details" : [
                                {
                                  "value" : 47,
                                  "description" : "n, number of documents containing term",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 27287,
                                  "description" : "N, total number of documents with field",
                                  "details" : [ ]
                                }
                              ]
                            },
                            {
                              "value" : 0.5246147,
                              "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                              "details" : [
                                {
                                  "value" : 1.0,
                                  "description" : "freq, occurrences of term within document",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 1.2,
                                  "description" : "k1, term saturation parameter",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 0.75,
                                  "description" : "b, length normalization parameter",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 2.0,
                                  "description" : "dl, length of field",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 2.9695094,
                                  "description" : "avgdl, average length of field",
                                  "details" : [ ]
                                }
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value" : 1.3441957,
                  "description" : "min of:",
                  "details" : [
                    {
                      "value" : 1.3441957,
                      "description" : "field value function: log2p(doc['year'].value * factor=0.01)",
                      "details" : [ ]
                    },
                    {
                      "value" : 3.4028235E38,
                      "description" : "maxBoost",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          },
          {
            "_shard" : "[movies][0]",
            "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
            "_index" : "movies",
            "_type" : "_doc",
            "_id" : "50259",
            "_score" : 9.852491,
            "_source" : {
              "year" : 2006,
              "genre" : [
                "Drama"
              ],
              "@version" : "1",
              "id" : "50259",
              "title" : "Old Joy"
            },
            "_explanation" : {
              "value" : 9.852491,
              "description" : "function score, product of:",
              "details" : [
                {
                  "value" : 7.3328753,
                  "description" : "max of:",
                  "details" : [
                    {
                      "value" : 7.3328753,
                      "description" : "weight(title:old in 11233) [PerFieldSimilarity], result of:",
                      "details" : [
                        {
                          "value" : 7.3328753,
                          "description" : "score(freq=1.0), product of:",
                          "details" : [
                            {
                              "value" : 2.2,
                              "description" : "boost",
                              "details" : [ ]
                            },
                            {
                              "value" : 6.3534727,
                              "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details" : [
                                {
                                  "value" : 47,
                                  "description" : "n, number of documents containing term",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 27287,
                                  "description" : "N, total number of documents with field",
                                  "details" : [ ]
                                }
                              ]
                            },
                            {
                              "value" : 0.5246147,
                              "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                              "details" : [
                                {
                                  "value" : 1.0,
                                  "description" : "freq, occurrences of term within document",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 1.2,
                                  "description" : "k1, term saturation parameter",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 0.75,
                                  "description" : "b, length normalization parameter",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 2.0,
                                  "description" : "dl, length of field",
                                  "details" : [ ]
                                },
                                {
                                  "value" : 2.9695094,
                                  "description" : "avgdl, average length of field",
                                  "details" : [ ]
                                }
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value" : 1.3436055,
                  "description" : "min of:",
                  "details" : [
                    {
                      "value" : 1.3436055,
                      "description" : "field value function: log2p(doc['year'].value * factor=0.01)",
                      "details" : [ ]
                    },
                    {
                      "value" : 3.4028235E38,
                      "description" : "maxBoost",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          }
        ]
      }
    }
    

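    Looking at the score details, the final score is _score * log(2 + factor * year), where log is base 10. For Old Dogs: 7.3328753 * log(2 + 0.01 * 2009) = 7.3328753 * 1.3441957 ≈ 9.856819.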
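    To check the arithmetic in the explain output, here is a minimal sketch of the log2p modifier in plain Python. The function names are illustrative only; the input numbers are copied from the two hits above.

```python
import math


def log2p_factor(field_value: float, factor: float = 1.0) -> float:
    """field_value_factor with modifier log2p: log10(2 + factor * field_value)."""
    return math.log10(2 + factor * field_value)


def multiplied_score(query_score: float, field_value: float, factor: float) -> float:
    """Default boost_mode (multiply): query score times the function value."""
    return query_score * log2p_factor(field_value, factor)


# Query score 7.3328753 is the BM25 score of both hits in the output above.
old_dogs = multiplied_score(7.3328753, 2009, 0.01)
old_joy = multiplied_score(7.3328753, 2006, 0.01)
print(old_dogs, old_joy)
```

    The two results agree with the _score values 9.856819 and 9.852491 to within float rounding, which confirms that the log is base 10 and that a three-year difference in year barely moves the score once factor is 0.01.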

Update, Nov 4

  • Combining scores: boost_mode
    POST movies/_search
    {
      "explain": true, 
      "size": 2, 
      "query": {
        "function_score": {
          "query": {
            "multi_match": {
              "query": "Old",
              "fields": ["title","genre.keyword"]
            }
          },
          "field_value_factor": {
            "field": "year"
          }, 
          "boost_mode": "sum"
        }
      }
    }
    

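    boost_mode supports the following modes

    • multiply : multiply the value from field_value_factor by the query's relevance score, then sort by the product (this is the default)

    • sum: the relevance score plus the field value factor

    • avg: the average of the relevance score and the field value factor

    • min/max : the smaller/larger of the relevance score and the field value factor becomes the final score

    • replace:  the field value factor replaces the relevance score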

      {
        "took" : 0,
        "timed_out" : false,
        "_shards" : {
          "total" : 1,
          "successful" : 1,
          "skipped" : 0,
          "failed" : 0
        },
        "hits" : {
          "total" : {
            "value" : 47,
            "relation" : "eq"
          },
          "max_score" : 2020.3269,
          "hits" : [
            {
              "_shard" : "[movies][0]",
              "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
              "_index" : "movies",
              "_type" : "_doc",
              "_id" : "114250",
              "_score" : 2020.3269,
              "_source" : {
                "year" : 2014,
                "genre" : [
                  "Comedy",
                  "Drama"
                ],
                "@version" : "1",
                "id" : "114250",
                "title" : "My Old Lady"
              },
              "_explanation" : {
                "value" : 2020.3269,
                "description" : "sum of",
                "details" : [
                  {
                    "value" : 6.3268967,
                    "description" : "max of:",
                    "details" : [
                      {
                        "value" : 6.3268967,
                        "description" : "weight(title:old in 23775) [PerFieldSimilarity], result of:",
                        "details" : [
                          {
                            "value" : 6.3268967,
                            "description" : "score(freq=1.0), product of:",
                            "details" : [
                              {
                                "value" : 2.2,
                                "description" : "boost",
                                "details" : [ ]
                              },
                              {
                                "value" : 6.3534727,
                                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                "details" : [
                                  {
                                    "value" : 47,
                                    "description" : "n, number of documents containing term",
                                    "details" : [ ]
                                  },
                                  {
                                    "value" : 27287,
                                    "description" : "N, total number of documents with field",
                                    "details" : [ ]
                                  }
                                ]
                              },
                              {
                                "value" : 0.4526441,
                                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                "details" : [
                                  {
                                    "value" : 1.0,
                                    "description" : "freq, occurrences of term within document",
                                    "details" : [ ]
                                  },
                                  {
                                    "value" : 1.2,
                                    "description" : "k1, term saturation parameter",
                                    "details" : [ ]
                                  },
                                  {
                                    "value" : 0.75,
                                    "description" : "b, length normalization parameter",
                                    "details" : [ ]
                                  },
                                  {
                                    "value" : 3.0,
                                    "description" : "dl, length of field",
                                    "details" : [ ]
                                  },
                                  {
                                    "value" : 2.9695094,
                                    "description" : "avgdl, average length of field",
                                    "details" : [ ]
                                  }
                                ]
                              }
                            ]
                          }
                        ]
                      }
                    ]
                  },
                  {
                    "value" : 2014.0,
                    "description" : "min of:",
                    "details" : [
                      {
                        "value" : 2014.0,
                        "description" : "field value function: none(doc['year'].value * factor=1.0)",
                        "details" : [ ]
                      },
                      {
                        "value" : 3.4028235E38,
                        "description" : "maxBoost",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            }
          ]
        }
      }
      

      From the explanation, the relevance score is 6.3268967 and the field value factor is 2014, so the total is 2020.3269
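      The modes can be summarized as a small lookup table. This is only a sketch of the arithmetic, not Elasticsearch code; combine is a hypothetical helper name, and the numbers come from the My Old Lady hit above.

```python
def combine(query_score: float, function_value: float, boost_mode: str = "multiply") -> float:
    """Combine the query score with the function value the way boost_mode describes."""
    modes = {
        "multiply": lambda q, f: q * f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "min": min,
        "max": max,
        "replace": lambda q, f: f,
    }
    return modes[boost_mode](query_score, function_value)


# Relevance score and year of "My Old Lady" from the explain output.
for mode in ("multiply", "sum", "avg", "min", "max", "replace"):
    print(mode, combine(6.3268967, 2014.0, mode))
```

      With a raw year as the function value, every mode except min is dominated by the year, which is why the earlier example tamed it with log2p and a small factor.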

  • max_boost : an upper limit on the boost. It caps the value produced by the field value factor, so the function's contribution never exceeds this limit

    POST movies/_search
    {
      "explain": true, 
      "size": 1, 
      "query": {
        "function_score": {
          "query": {
            "multi_match": {
              "query": "Old",
              "fields": ["title","genre.keyword"]
            }
          },
          "field_value_factor": {
            "field": "year"
          }, 
          "boost_mode": "sum",
          "max_boost": 10
        }
      }
    }
    

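    In the query above, the value from field_value_factor is capped at 10 (max_boost). Because boost_mode is sum, the final result is the query's relevance score plus that capped value, so at most the relevance score plus 10.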
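    The cap corresponds to the "min of: field value function, maxBoost" node that appears in the explain output earlier. A minimal sketch of that step, with a hypothetical helper name and the relevance score 6.3268967 borrowed from the My Old Lady hit purely as an example input:

```python
def capped_function_value(function_value: float, max_boost: float = float("inf")) -> float:
    """Apply max_boost: the function value can never exceed the cap."""
    return min(function_value, max_boost)


# year = 2014 would normally contribute 2014; max_boost = 10 limits it to 10.
year_factor = capped_function_value(2014.0, max_boost=10)
total = 6.3268967 + year_factor  # boost_mode "sum"
print(year_factor, total)
```

    Since every movie's year is far above 10, all documents receive the same +10 here, and the ranking falls back to the plain relevance score.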

  • random_score: consistent random scoring

    GET movies/_search
    {
      "explain": true, 
      "size": 1, 
      "query": {
        "function_score": {
          "query": {
            "term": {
              "title": {
                "value": "love"
              }
            }
          },
          "random_score": {
            "seed": 314159265359,
            "field":"_seq_no"
          }
        }
      }
    }

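    From 7.0 on, random_score requires the field parameter when a seed is set, otherwise an error is reported. The consistent random function derives its value from the seed together with the value of the given field (here _seq_no); as long as the seed stays the same, the random result stays the same too.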
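    The idea of "same seed, same order" can be illustrated without Elasticsearch. The sketch below is not the algorithm Elasticsearch uses; it is a stand-in that hashes the seed and a per-document value with SHA-256, and consistent_random is a made-up name.

```python
import hashlib


def consistent_random(seed: int, field_value: int) -> float:
    """Map (seed, per-document value) to a repeatable float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{field_value}".encode()).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


seq_nos = [0, 1, 2, 3, 4]  # stand-ins for each document's _seq_no
first = sorted(seq_nos, key=lambda n: consistent_random(314159265359, n))
second = sorted(seq_nos, key=lambda n: consistent_random(314159265359, n))
print(first == second)  # the shuffle is repeatable for a fixed seed
```

    Changing the seed gives a different but equally repeatable shuffle, which is the property that lets each user see a stable "random" ordering across page loads.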

  • suggest: the suggester module. It breaks the input text into tokens and returns similar terms found in the index's term dictionary

    GET movies/_search
    {
      "size": 1, 
      "query": {
        "term": {
          "title": {
            "value": "lover"
          }
        }
      },
      "suggest": {
        "my_suggest": {
          "text": "lover",
          "term": {
            "field": "title",
            "suggest_mode":"popular"
          }
        }
      }
    }

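    Commonly used suggest_mode values

    • missing :  only suggest for terms that are not in the index; if the term (here lover) already exists, no suggestions are returned

    • popular:  only suggest terms that occur more frequently than the original term

    • always :  suggest regardless of whether the term exists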

      {
        "took" : 3,
        "timed_out" : false,
        "_shards" : {
          "total" : 1,
          "successful" : 1,
          "skipped" : 0,
          "failed" : 0
        },
        "hits" : {
          "total" : {
            "value" : 12,
            "relation" : "eq"
          },
          "max_score" : 8.87367,
          "hits" : [
            {
              "_index" : "movies",
              "_type" : "_doc",
              "_id" : "2586",
              "_score" : 8.87367,
              "_source" : {
                "year" : 1999,
                "genre" : [
                  "Comedy",
                  "Crime",
                  "Thriller"
                ],
                "@version" : "1",
                "id" : "2586",
                "title" : "Goodbye Lover"
              }
            }
          ]
        },
        "suggest" : {
          "my_suggest" : [
            {
              "text" : "lover",
              "offset" : 0,
              "length" : 5,
              "options" : [
                {
                  "text" : "lovers",
                  "score" : 0.8,
                  "freq" : 25
                },
                {
                  "text" : "loved",
                  "score" : 0.8,
                  "freq" : 14
                },
                {
                  "text" : "love",
                  "score" : 0.75,
                  "freq" : 355
                },
                {
                  "text" : "lives",
                  "score" : 0.6,
                  "freq" : 40
                },
                {
                  "text" : "live",
                  "score" : 0.5,
                  "freq" : 72
                }
              ]
            }
          ]
        }
      }
      

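      The suggestions come back in an array under the name you chose (my_suggest), each with a score and a frequency, so you can pick whichever fits your needs.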
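      As an illustration of picking from that array, here is a small sketch that ranks the options either by similarity score or by frequency. pick_suggestions is a hypothetical helper, and sample is a trimmed copy of the suggest section of the response above.

```python
def pick_suggestions(response: dict, name: str, by: str = "score") -> list:
    """Return the suggested texts for each entry of suggester `name`, best first by `by`."""
    picked = []
    for entry in response.get("suggest", {}).get(name, []):
        ranked = sorted(entry["options"], key=lambda o: o[by], reverse=True)
        picked.append([o["text"] for o in ranked])
    return picked


sample = {
    "suggest": {
        "my_suggest": [
            {
                "text": "lover",
                "offset": 0,
                "length": 5,
                "options": [
                    {"text": "lovers", "score": 0.8, "freq": 25},
                    {"text": "loved", "score": 0.8, "freq": 14},
                    {"text": "love", "score": 0.75, "freq": 355},
                    {"text": "lives", "score": 0.6, "freq": 40},
                    {"text": "live", "score": 0.5, "freq": 72},
                ],
            }
        ]
    }
}

print(pick_suggestions(sample, "my_suggest", by="score"))
print(pick_suggestions(sample, "my_suggest", by="freq"))
```

      Ranking by score favors the closest spelling (lovers), while ranking by freq favors the most common term (love); which one is better depends on whether you are correcting typos or steering users toward popular searches.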

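A quick aside on a problem I just ran into: production ES reported an error because a query went past 10,000 results

  • First, the setting involved: index.max_result_window. It can be applied across all indices or set on a specific index, and it defaults to 10,000; a search whose from + size exceeds it is rejected.
  • What triggered the error in production? A script that fetches data in a while loop, 20 documents at a time, with no exit condition. Normally it causes no error, because the data volume is nowhere near Double Eleven levels. During the promotion these past few days the volume kept climbing, until the paging went past the configured limit.
  • How do we fix it? There are a few options. First, since this is a script rather than a real-time front-end query, some latency is acceptable, so we can use the ES scroll API. Scroll is not meant for real-time requests: it issues multiple requests against ES, tracking progress with a scroll_id plus a snapshot of the index, and we specify how long the search context is kept alive between requests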
    curl -XGET 'localhost:9200/index/_search?scroll=1m' -H 'Content-Type: application/json' -d '
    {
        "query": {
            "match_phrase" : {
                "title" : "elasticsearch"
            }
        }
    }'
    

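    Here scroll=1m means the search context is kept alive for at most 1 minute until the next request; after that it expires. Besides the data, the first response also returns a scroll_id to be used in the next request, which looks like this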

    curl -XGET 'localhost:9200/_search/scroll' -H 'Content-Type: application/json' -d '
    {
        "scroll" : "1m",
        "scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
    }'

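    Each scroll request continues from where the previous one stopped. You keep calling it until a page comes back with no hits, or until the context times out. Scrolling has a cost, though: the results are still sorted, and in the worst case that is a global sort.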

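  • Sometimes, for deep pagination, we only want the data and do not care about the order. Older Elasticsearch versions offered the scan search type for this

    GET /old_index/_search?search_type=scan&scroll=1m 
    {
    "query": { "match_all": {}},
    "size": 1000
    }

    Adding search_type=scan disabled sorting and so avoided the global sort. Note that scan was deprecated in 2.1 and removed in 5.0, so this request does not work on 7.x. The alternative, which does work on 7.x, is to sort by _doc: it is the most efficient order for scrolling, but the results come back in index order rather than by relevance, so it suits cases where you simply want every document. For example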

    GET /old_index/_search?scroll=1m 
    {
      "query": { "match_all": {}},
      "size": 1000,
      "sort": [
        "_doc"
      ]
    }

     

  • Another optimization: when using a scroll cursor, clear the scroll as soon as you are done with it, which frees the search context and lightens the load on ES

    DELETE /_search/scroll
    {
        "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAdsMqFmVkZTBJalJWUmp5UmI3V0FYc2lQbVEAAAAAAHbDKRZlZGUwSWpSVlJqeVJiN1dBWHNpUG1RAAAAAABpX2sWclBEekhiRVpSRktHWXFudnVaQ3dIQQAAAAAAaV9qFnJQRHpIYkVaUkZLR1lxbnZ1WkN3SEEAAAAAAGlfaRZyUER6SGJFWlJGS0dZcW52dVpDd0hB"
    }
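    Putting the pieces together, the runaway script could be rewritten roughly as follows. This is a sketch under assumptions: send is a hypothetical transport callable (method, path, body) -> parsed JSON that you would implement with whatever HTTP client you use, and scroll_all is a made-up name. The endpoints and body fields are the ones shown in the requests above.

```python
from typing import Callable, Iterator, Optional

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
Send = Callable[[str, str, dict], dict]


def scroll_all(send: Send, index: str, query: dict,
               page_size: int = 1000, keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit of `query`, paging with the scroll API."""
    scroll_id: Optional[str] = None
    try:
        # First request opens the scroll context; _doc order avoids a costly sort.
        page = send("POST", f"/{index}/_search?scroll={keep_alive}",
                    {"query": query, "size": page_size, "sort": ["_doc"]})
        while True:
            scroll_id = page.get("_scroll_id", scroll_id)
            hits = page["hits"]["hits"]
            if not hits:  # the exit condition the original script lacked
                break
            yield from hits
            page = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
    finally:
        # Clear the scroll so ES can release the search context early.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

    Because the function is a generator, the caller can process hits one at a time with an ordinary for loop. The finally block runs when the loop finishes or when the generator is closed, so the scroll is cleared in both cases.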
    

     

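Back to our ES study. The above was just a short interlude; once the promotion is over, I will make some optimizations for the problem that came up today.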

 
