Elasticsearch: Painless scripting in practice

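In my earlier article “Elastic: A Beginner's Guide” I listed many of my articles on Painless programming; see its “Painless programming” section. For developers who are new to Elasticsearch, Painless can still feel hazy, and it is hard to get a complete picture of where Painless scripting applies. Many of my earlier articles use scripts, so in this article I summarize Painless programming in one place to make it easier to learn. For debugging Painless scripts, see my earlier article “Elasticsearch: Debugging Painless scripts”; I will not repeat that material here.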

 

Preparing the data

Today's exercises use the data below. We can import it with the bulk API:

PUT employee/_bulk?refresh
{"index":{"_id": 1}}
{ "salary" : 5000, "bonus": 500, "@timestamp" : "2021-02-28", "weight": 60, "height": 175, "name" : "Peter", "occupation": "software engineer","hobbies": ["dancing", "badminton"]}
{"index":{"_id": 2}}
{ "salary" : 6000, "bonus": 500, "@timestamp" : "2020-02-01", "weight": 50, "height": 165, "name" : "John", "occupation": "sales", "hobbies":["singing", "volleyball"]}
{"index":{"_id": 3}}
{ "salary" : 7000, "bonus": 600, "@timestamp" : "2019-03-01", "weight": 55, "height": 172, "name" : "mary", "occupation": "manager", "hobbies":["dancing", "tennis"]}
{"index":{"_id": 4}}
{ "salary" : 8000, "bonus": 700, "@timestamp" : "2018-02-28", "weight": 45, "height": 166, "name" : "jerry", "occupation": "sales", "hobbies":["biking", "swimming"]}
{"index":{"_id": 5}}
{ "salary" : 9000, "bonus": 800, "@timestamp" : "2017-02-01", "weight": 60, "height": 170, "name" : "cathy", "occupation": "manager", "hobbies":["climbing", "jigging"]}
{"index":{"_id": 6}}
{ "salary" : 7500, "bonus": 500, "@timestamp" : "2017-03-01", "weight": 40, "height": 158, "name" : "cherry", "occupation": "software engineer", "hobbies":["basketball", "yoga"]}

The request above creates an index named employee that contains 6 documents. The @timestamp field is the date on which the employee joined the company.

 

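Painless programming examples

In this section I use reasonably detailed examples to show where Painless applies.

Updating documents

Suppose we want to raise the salary of the document with id 6 by 100: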

POST employee/_update/6
{
  "script": {
    "lang": "painless", 
    "source": """
      ctx._source.salary += params['increasement']
    """,
    "params": {
      "increasement": 100
    }
  }
}

GET employee/_doc/6

The script above adds 100 to the salary field. Fetching the document with id 6 again returns:

{
  "_index" : "employee",
  "_type" : "_doc",
  "_id" : "6",
  "_version" : 2,
  "_seq_no" : 6,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "salary" : 7600,
    "bonus" : 500,
    "@timestamp" : "2017-03-01",
    "weight" : 40,
    "height" : 158,
    "name" : "cherry",
    "occupation" : "software engineer",
    "hobbies" : [
      "basketball",
      "yoga"
    ]
  }
}

We can even apply operations such as add or remove to a list like hobbies:

POST employee/_update/6
{
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.hobbies.add(params['hobby_new']);
      
      if (ctx._source.hobbies.contains(params['hobby_to_be_removed'])) { 
        ctx._source.hobbies.remove(ctx._source.hobbies.indexOf(params['hobby_to_be_removed'])) 
      }
    """,
    "params": {
      "hobby_new": "football",
      "hobby_to_be_removed": "basketball"
    }
  }
}

GET employee/_doc/6

Here we add a new hobby to the document with id 6 and remove one of its old hobbies. The result is:

{
  "_index" : "employee",
  "_type" : "_doc",
  "_id" : "6",
  "_version" : 3,
  "_seq_no" : 7,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "salary" : 7600,
    "bonus" : 500,
    "@timestamp" : "2017-03-01",
    "weight" : 40,
    "height" : 158,
    "name" : "cherry",
    "occupation" : "software engineer",
    "hobbies" : [
      "yoga",
      "football"
    ]
  }
}

Clearly basketball has been removed and football has been added.
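As a rough mental model, the list handling in that script behaves like the plain Python below. This is only a sketch of the logic, not Elasticsearch code: apply_hobby_update is a hypothetical helper name, and the dictionary stands in for ctx._source.

```python
def apply_hobby_update(source, hobby_new, hobby_to_be_removed):
    """Mimic the update script: append one hobby, then drop another if present."""
    hobbies = source["hobbies"]
    hobbies.append(hobby_new)
    if hobby_to_be_removed in hobbies:
        # Same effect as remove(indexOf(...)): delete the first occurrence.
        hobbies.remove(hobby_to_be_removed)
    return source


doc6 = {"name": "cherry", "hobbies": ["basketball", "yoga"]}
apply_hobby_update(doc6, "football", "basketball")
print(doc6["hobbies"])
```

The Painless version goes through indexOf because a Java List offers both remove(int index) and remove(Object); passing an index makes it explicit which one is meant.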

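When updating documents we can also use _update_by_query. It is suited to updating every document that matches a condition, typically when we do not know the document ids in advance: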

POST employee/_update_by_query
{
  "query": {
    "match": {
      "occupation.keyword": "software engineer"
    }
  },
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.salary += params['increasement']
    """,
    "params": {
      "increasement": 100
    }
  }
}

This raises the salary of everyone whose occupation is software engineer by 100. We can check the result with the following command:

GET employee/_search
{
  "_source": ["salary"], 
  "query": {
    "match": {
      "occupation.keyword": "software engineer"
    }
  }
}

The result is:

    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931471,
        "_source" : {
          "salary" : 5100
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 0.6931471,
        "_source" : {
          "salary" : 7700
        }
      }
    ]

We can see that the salaries of both software engineers went up by 100.
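Conceptually, _update_by_query selects documents with the query and then runs the script once per match. The Python below is a simplified sketch of that idea under that assumption; update_by_query and raise_salary are hypothetical names, and the list is a stand-in for the index.

```python
def update_by_query(docs, matches, script):
    """Run script on every document for which matches(doc) is true; return the count."""
    updated = 0
    for doc in docs:
        if matches(doc):
            script(doc)
            updated += 1
    return updated


def raise_salary(doc, increasement=100):
    # Mirrors: ctx._source.salary += params['increasement']
    doc["salary"] += increasement


employees = [
    {"_id": 1, "occupation": "software engineer", "salary": 5000},
    {"_id": 2, "occupation": "sales", "salary": 6000},
    {"_id": 6, "occupation": "software engineer", "salary": 7600},
]
updated = update_by_query(
    employees, lambda d: d["occupation"] == "software engineer", raise_salary
)
print(updated, [d["salary"] for d in employees])
```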

 

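Using scripts in reindex

We can also use a script in reindex, for example to add or remove a field. Next, we want to add a new field while copying the data into a new index, employee_new. If the salary is 7000 or more, we treat it as high pay and record that in a new field called high_pay: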

POST _reindex
{
  "source": {
    "index": "employee"
  },
  "dest": {
    "index": "employee_new"
  },
  "script": {
    "lang": "painless", 
    "source": """
      if(ctx._source['salary'] >= 7000) {
        ctx._source['high_pay'] = true
      } else {
        ctx._source['high_pay'] = false
      }
    """
  }
}

The script checks whether salary is at least 7000 and sets the new high_pay field to true if so, otherwise to false. After the reindex, the documents in employee_new look like this:

    "hits" : [
      {
        "_index" : "employee_new",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : "2020-02-01",
          "occupation" : "sales",
          "hobbies" : [
            "singing",
            "volleyball"
          ],
          "bonus" : 500,
          "name" : "John",
          "weight" : 50,
          "high_pay" : false,
          "salary" : 6000,
          "height" : 165
        }
      },

We can see the new field called high_pay.
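The per-document transformation performed during that reindex can be sketched in plain Python as follows. It is an illustration only; add_high_pay and HIGH_PAY_THRESHOLD are hypothetical names, and the sample salaries are chosen to show the boundary case.

```python
HIGH_PAY_THRESHOLD = 7000  # same constant the reindex script compares against


def add_high_pay(source):
    """Return a copy of the document with the derived high_pay flag."""
    copy = dict(source)
    copy["high_pay"] = copy["salary"] >= HIGH_PAY_THRESHOLD
    return copy


for salary in (6000, 7000, 8000):
    print(salary, add_high_pay({"salary": salary})["high_pay"])
```

Because the comparison already yields a boolean, the if/else in the Painless script could presumably be collapsed into a single assignment in the same way.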

 

Using scripts in a pipeline

Pipelines are used frequently during data ingestion. A pipeline definition can include a processor called script. Our documents have two fields, salary and bonus, and we would like to add them up at ingest time to produce a total income. Let's first create a pipeline called income:

PUT _ingest/pipeline/income
{
  "description": "sum up salary and bonus to come with the income",
  "processors": [
    {
      "script": {
        "source": """
          ctx['income'] = ctx['salary'] + ctx['bonus']
        """
      }
    }
  ]
}

After creating the pipeline, we can index a document with id 7 through it:

PUT employee/_doc/7?pipeline=income
{
  "salary": 6000,
  "bonus": 500,
  "@timestamp": "2014-05-01",
  "weight": 40,
  "height": 158,
  "name": "cherry",
  "occupation": "software engineer",
  "hobbies": [
    "volleybal"
  ]
}

We check the content of document 7 with the following command:

GET employee/_doc/7

The response is:

{
  "_index" : "employee",
  "_type" : "_doc",
  "_id" : "7",
  "_version" : 1,
  "_seq_no" : 6,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "income" : 6500,
    "occupation" : "software engineer",
    "bonus" : 500,
    "weight" : 40,
    "salary" : 6000,
    "@timestamp" : "2014-05-01",
    "hobbies" : [
      "volleybal"
    ],
    "name" : "cherry",
    "height" : 158
  }
}

The response contains a new field called income whose value is salary + bonus. In practice we can also use this pipeline during reindex:

POST _reindex
{
  "source": {
    "index": "employee"
  },
  "dest": {
    "index": "employee_income",
    "pipeline": "income"
  }
}

The newly created employee_income index contains the income field:

    "hits" : [
      {
        "_index" : "employee_income",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "income" : 5500,
          "occupation" : "software engineer",
          "bonus" : 500,
          "weight" : 60,
          "salary" : 5000,
          "@timestamp" : "2021-02-28",
          "hobbies" : [
            "dancing",
            "badminton"
          ],
          "name" : "Peter",
          "height" : 175
        }
      }

A pipeline can also be used with the bulk API, for example:

PUT employee/_bulk?pipeline=income

This way the pipeline processes the data directly during bulk import.
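To make the data flow concrete, here is a small Python sketch of what happens to each document when an ingest pipeline is attached to an indexing request. It models the idea only; income_processor and bulk_index are hypothetical names, not Elasticsearch APIs.

```python
def income_processor(doc):
    """Mirror of the script processor: ctx['income'] = ctx['salary'] + ctx['bonus']."""
    doc["income"] = doc["salary"] + doc["bonus"]
    return doc


def bulk_index(docs, pipeline=None):
    """Pass every incoming document through the pipeline before storing it."""
    stored = []
    for doc in docs:
        doc = dict(doc)  # work on a copy, leave the caller's data alone
        if pipeline is not None:
            doc = pipeline(doc)
        stored.append(doc)
    return stored


incoming = [{"salary": 5000, "bonus": 500}, {"salary": 6000, "bonus": 500}]
stored = bulk_index(incoming, pipeline=income_processor)
print([d["income"] for d in stored])
```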

 

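Creating script fields

At search time we can use a script to generate the fields we want. In the example below I use a script to compute a person's BMI, that is, weight in kilograms divided by the square of height in meters.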

GET employee/_search
{
  "script_fields": {
    "BMI": {
      "script": {
        "source": """
          double height = (float)doc['height'].value/100.0;
          return doc['weight'].value / (height*height)
        """
      }
    }
  }
}

The search returns the BMI of each employee:

    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "BMI" : [
            19.591836734693878
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "BMI" : [
            18.36547291092746
          ]
        }
      },
    ...
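The arithmetic behind those values can be checked by hand. The Python sketch below repeats the calculation for the first two employees of the sample data; bmi is a hypothetical helper, and the weight is assumed to be in kilograms and the height in centimeters.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight divided by the square of the height in meters."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)


bmi_peter = bmi(60, 175)  # document 1
bmi_john = bmi(50, 165)   # document 2
print(bmi_peter, bmi_john)
```

The (float) cast in the Painless script guards against integer division; since 100.0 is already a floating-point literal, the division should be floating point even without it.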

Based on the definition of BMI, we can find the employees whose BMI falls outside the normal range like this:

GET employee/_search
{
  "script_fields": {
    "BMI": {
      "script": {
        "source": """
          double height = (float)doc['height'].value/100.0;
          return doc['weight'].value / (height*height)
        """
      }
    }
  },
  "query": {
    "script": {
      "script": {
        "source": """
          double height = (float)doc['height'].value/100.0;
          double bmi = doc['weight'].value / (height*height);
          return bmi > 24 || bmi < 18.4
        """        
      }
    }
  }
}

This returns the employees whose BMI is greater than 24 or less than 18.4. The search result is:

    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "BMI" : [
            18.36547291092746
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "fields" : {
          "BMI" : [
            16.330381768036002
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.0,
        "fields" : {
          "BMI" : [
            16.023073225444637
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 1.0,
        "fields" : {
          "BMI" : [
            16.023073225444637
          ]
        }
      }
    ]
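The condition in the script query amounts to "BMI above 24 or below 18.4". The following Python sketch applies the same test to the sample data (weight, height) pairs, including document 7; bmi_out_of_range is a hypothetical helper name.

```python
# (weight in kg, height in cm) per document id, taken from the sample data
employees = {
    1: (60, 175), 2: (50, 165), 3: (55, 172), 4: (45, 166),
    5: (60, 170), 6: (40, 158), 7: (40, 158),
}


def bmi_out_of_range(weight_kg, height_cm, low=18.4, high=24):
    """True when the BMI is above the upper bound or below the lower bound."""
    height_m = height_cm / 100.0
    value = weight_kg / (height_m * height_m)
    return value > high or value < low


flagged = [doc_id for doc_id, (w, h) in employees.items() if bmi_out_of_range(w, h)]
print(flagged)
```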

Searching with scripts

We can also use a script to search documents directly:

GET employee/_search
{
  "query": {
    "script": {
      "script": {
        "lang": "painless", 
        "source": """
          return doc['salary'].value > params['salary']
        """,
        "params": {
          "salary": 8000
        }
      }
    }
  }
}

The result is:

    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "salary" : 9000,
          "bonus" : 800,
          "@timestamp" : "2017-02-01",
          "weight" : 60,
          "height" : 170,
          "name" : "cathy",
          "occupation" : "manager",
          "hobbies" : [
            "climbing",
            "jigging"
          ]
        }
      }
    ]

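By comparison, a script query is relatively inefficient. If the index holds millions of documents or petabytes of data, the script has to be evaluated for every candidate document, which can require an enormous amount of computation. In that case we should consider doing the calculation at ingest time. See my other article “Avoiding unnecessary scripts - scripting”.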

If we do not care about the score, we can also use the script inside a filter:

GET employee/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "script": {
            "script": {
              "lang": "painless",
              "source": """
                return doc['salary'].value > params['salary']
              """,
              "params": {
                "salary": 8000
              }
            }
          }
        }
      ]
    }
  }
}

 

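Customizing relevance with Painless

We can use Painless to customize relevance. For example, we can compute a custom score for each search result and thereby change the order in which documents are returned. In the example below I sort by BMI: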

GET employee/_search
{
  "script_fields": {
    "BMI": {
      "script": {
        "source": """
          double height = (float)doc['height'].value/100.0;
          return doc['weight'].value / (height*height)
        """
      }
    }
  },
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "painless",
          "source": """
            double height = (float)doc['height'].value/100.0;
            return doc['weight'].value / (height*height)          
          """
        }
      }
    }
  }
}

The script inside script_score recomputes the score of each search result. The result is:

    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 20.761246,
        "fields" : {
          "BMI" : [
            20.761245674740486
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 19.591837,
        "fields" : {
          "BMI" : [
            19.591836734693878
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 18.591131,
        "fields" : {
          "BMI" : [
            18.591130340724717
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 18.365473,
        "fields" : {
          "BMI" : [
            18.36547291092746
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 16.330381,
        "fields" : {
          "BMI" : [
            16.330381768036002
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 16.023073,
        "fields" : {
          "BMI" : [
            16.023073225444637
          ]
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 16.023073,
        "fields" : {
          "BMI" : [
            16.023073225444637
          ]
        }
      }
    ]

To reduce computation we could drop script_fields, but in this example it lets us see clearly that the results are ordered by BMI from highest to lowest.
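Using the BMI as the score is, in effect, a descending sort on a computed value. A minimal Python sketch of that ordering over the sample data follows; it ignores how Elasticsearch breaks ties internally, and bmi is a hypothetical helper.

```python
# (weight in kg, height in cm) per document id, taken from the sample data
employees = {
    1: (60, 175), 2: (50, 165), 3: (55, 172), 4: (45, 166),
    5: (60, 170), 6: (40, 158), 7: (40, 158),
}


def bmi(weight_kg, height_cm):
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)


# Highest score first, as in the search response.
ranked = sorted(employees, key=lambda doc_id: bmi(*employees[doc_id]), reverse=True)
print(ranked)
```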

 

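Painless in aggregations

The examples above mainly covered Painless in search, document updates and ingestion. In many real applications we can also use Painless to produce more meaningful aggregations.

Script aggregation

Suppose we want statistics on when employees joined: which year and which month had the most hires. We can use the following aggregation: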

GET employee/_search
{
  "size": 0,
  "aggs": {
    "year": {
      "terms": {
        "script": {
          "source": """
            ZonedDateTime date =  doc['@timestamp'].value; 
            return date.getYear(); 
          """
        }
      }
    },
    "month": {
      "terms": {
        "script": {
          "source": """
            ZonedDateTime date =  doc['@timestamp'].value; 
            return date.getMonthValue(); 
          """
        }
      }
    }
  }
}

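For many developers who are new to Painless this is confusing: what exactly does the script do here? The official documentation does not spell it out. It helps to think of the script as a scripted field. In the year aggregation above, for example, the script computes the year of @timestamp, and that value is then aggregated as if it were a field. The result is: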

  "aggregations" : {
    "month" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "2",
          "doc_count" : 4
        },
        {
          "key" : "3",
          "doc_count" : 2
        },
        {
          "key" : "5",
          "doc_count" : 1
        }
      ]
    },
    "year" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "2017",
          "doc_count" : 2
        },
        {
          "key" : "2014",
          "doc_count" : 1
        },
        {
          "key" : "2018",
          "doc_count" : 1
        },
        {
          "key" : "2019",
          "doc_count" : 1
        },
        {
          "key" : "2020",
          "doc_count" : 1
        },
        {
          "key" : "2021",
          "doc_count" : 1
        }
      ]
    }
  }

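We can see that 2017 had the most hires and that February is the peak hiring month. If we want the month as text instead of a number, we can call getMonth(), which returns a java.time.Month value, instead of getMonthValue():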

GET employee/_search
{
  "size": 0,
  "aggs": {
    "year": {
      "terms": {
        "script": {
          "source": """
            ZonedDateTime date =  doc['@timestamp'].value; 
            return date.getYear(); 
          """
        }
      }
    },
    "month": {
      "terms": {
        "script": {
          "source": """
            ZonedDateTime date =  doc['@timestamp'].value; 
            return date.getMonth(); 
          """
        }
      }
    }
  }
}
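To see what these terms aggregations compute, the same bucketing can be reproduced with a few lines of Python over the seven hire dates in the index (the six bulk documents plus document 7). This is just a sketch of the grouping logic, not of the aggregation framework itself.

```python
from collections import Counter
from datetime import date

hire_dates = [
    "2021-02-28", "2020-02-01", "2019-03-01", "2018-02-28",
    "2017-02-01", "2017-03-01", "2014-05-01",
]
parsed = [date.fromisoformat(d) for d in hire_dates]

# One bucket per derived value, exactly as if year and month were real fields.
by_year = Counter(d.year for d in parsed)
by_month = Counter(d.month for d in parsed)

print(by_year.most_common(1))
print(sorted(by_month.items()))
```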

Besides terms aggregations, we can also use a script to compute the average BMI of all employees in the company:

GET employee/_search
{
  "size": 0,
  "aggs": {
    "avg_bmi": {
      "avg": {
        "script": {
          "source": """
            double height = (float)doc['height'].value/100.0;
            return doc['weight'].value / (height*height)
          """
        }
      }
    }
  }
}

The result of the query is:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_bmi" : {
      "value" : 17.95517341143026
    }
  }
}

If we want the average cost per employee (salary + bonus), we can do this:

GET employee/_search
{
  "size": 0,
  "aggs": {
    "avg_income": {
      "avg": {
        "script": {
          "source": """
            return doc['salary'].value + doc['bonus'].value
          """
        }
      }
    }
  }
}

The result of the query is:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_income" : {
      "value" : 7514.285714285715
    }
  }
}

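We can also find the largest cost. The aggregation below keeps the name avg_income, although it now computes a maximum: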

GET employee/_search
{
  "size": 0,
  "aggs": {
    "avg_income": {
      "max": {
        "script": {
          "source": """
            return doc['salary'].value + doc['bonus'].value
          """
        }
      }
    }
  }
}

The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_income" : {
      "value" : 9800.0
    }
  }
}
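Both metric aggregations can be reproduced offline. The sketch below assumes the originally imported salary and bonus values plus document 7, which is the data the printed results correspond to; the variable names are illustrative only.

```python
# (salary, bonus) for documents 1 to 7, as originally imported
pay = [
    (5000, 500), (6000, 500), (7000, 600), (8000, 700),
    (9000, 800), (7500, 500), (6000, 500),
]

# The script returns salary + bonus per document; avg and max consume that value.
incomes = [salary + bonus for salary, bonus in pay]
avg_income = sum(incomes) / len(incomes)
max_income = max(incomes)

print(avg_income, max_income)
```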

 

Scripted metric aggregation

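A scripted metric aggregation uses scripts to compute and output a metric. Most importantly, it gives us the freedom to define our own aggregations. How do we use it in our context? I describe it in detail in my earlier article “Elasticsearch: Script aggregation (2)”.

Today's exercise uses two examples. First, what is the total cost of all employees? We can use the following aggregation: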

GET employee/_search
{
  "size": 0,
  "aggregations": {
    "latest_value": {
      "scripted_metric": {
        "init_script": "state.incomes = []",
        "map_script": """
          state.incomes.add((double)doc.salary.value + (double)doc.bonus.value)
        """,
        "combine_script": """
          double sum = 0;
          for (income in state.incomes) {
            sum += income
          }
          return sum
        """,
        "reduce_script": """
          double total = 0;   
          for (sum in states) {
            total += sum
          }
          return total
        """
      }
    }
  }
}

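The script above may look messy, and it is genuinely hard to follow for developers who are just starting with Painless.

A scripted metric aggregation uses scripts at 4 stages of its execution:

  • init_script

Executed before any documents are collected. It allows the aggregation to set up any initial state.

In the example above, init_script creates an array named incomes in the state object.

  • map_script

Executed once per collected document. This script is required. If no combine_script is specified, the resulting state needs to be stored in the state object.

In the example above, map_script adds up the salary and bonus of each document and appends the sum to incomes.

  • combine_script

Executed once on each shard after document collection is complete. This script is required. It allows the aggregation to consolidate the state returned from each shard.

In the example above, combine_script iterates over the stored incomes, adds up the income of every employee on the shard, and returns the sum.

  • reduce_script

Executed once on the coordinating node after all shards have returned their results. This script is required. The script has access to a variable named states, which is an array of the combine_script results from each shard.

In the example above, reduce_script iterates over the sums returned by each shard and adds them up; the total cost is returned as the final result of the aggregation: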

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "latest_value" : {
      "value" : 52600.0
    }
  }
}

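The returned value is the company's total monthly cost for all employees. It matches the originally imported salaries and bonuses plus document 7 (5500 + 6500 + 7600 + 8700 + 9800 + 8000 + 6500 = 52600).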
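The four phases described above can be simulated in plain Python to make the data flow visible. This is a simplified model rather than how Elasticsearch is implemented: scripted_metric is a hypothetical driver function, and the two-shard split is invented purely to show that the per-shard sums are merged at the end.

```python
def init_script():
    # state.incomes = []
    return {"incomes": []}


def map_script(state, doc):
    # Runs once per document on the shard that holds it.
    state["incomes"].append(float(doc["salary"]) + float(doc["bonus"]))


def combine_script(state):
    # Runs once per shard: collapse the shard's list into one number.
    return sum(state["incomes"])


def reduce_script(states):
    # Runs once on the coordinating node over all per-shard results.
    return sum(states)


def scripted_metric(shards):
    states = []
    for shard_docs in shards:
        state = init_script()
        for doc in shard_docs:
            map_script(state, doc)
        states.append(combine_script(state))
    return reduce_script(states)


pay = [(5000, 500), (6000, 500), (7000, 600), (8000, 700),
       (9000, 800), (7500, 500), (6000, 500)]
docs = [{"salary": s, "bonus": b} for s, b in pay]
shards = [docs[:4], docs[4:]]  # hypothetical two-shard layout
total = scripted_metric(shards)
print(total)
```

The result does not depend on how the documents are distributed, which is the property that lets the combine step run independently on every shard.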

Next, who is the most recently hired employee? We can find out like this:

GET employee/_search
{
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}

The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "salary" : 5000,
          "bonus" : 500,
          "@timestamp" : "2021-02-28",
          "weight" : 60,
          "height" : 175,
          "name" : "Peter",
          "occupation" : "software engineer",
          "hobbies" : [
            "dancing",
            "badminton"
          ]
        },
        "sort" : [
          1614470400000
        ]
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "salary" : 6000,
          "bonus" : 500,
          "@timestamp" : "2020-02-01",
          "weight" : 50,
          "height" : 165,
          "name" : "John",
          "occupation" : "sales",
          "hobbies" : [
            "singing",
            "volleyball"
          ]
        },
        "sort" : [
          1580515200000
        ]
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "salary" : 7000,
          "bonus" : 600,
          "@timestamp" : "2019-03-01",
          "weight" : 55,
          "height" : 172,
          "name" : "mary",
          "occupation" : "manager",
          "hobbies" : [
            "dancing",
            "tennis"
          ]
        },
        "sort" : [
          1551398400000
        ]
      }

    ...

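We can also get the same answer with a scripted metric aggregation, as shown below, although it is more cumbersome. The approach is useful when we need to pick documents according to custom criteria inside an aggregation, something we would otherwise typically do together with top_hits.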

GET employee/_search
{
  "size": 0,
  "aggs": {
    "latest_doc": {
      "scripted_metric": {
        "init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
        "map_script": """ 
        def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli();
        if (current_date > state.timestamp_latest) {
          state.timestamp_latest = current_date;
          state.last_doc = new HashMap(params['_source']);
        }
      """,
        "combine_script": "return state",
        "reduce_script": """ 
        def last_doc = '';
        def timestamp_latest = 0L;
        for (s in states) {
          if (s.timestamp_latest > (timestamp_latest)) {
            timestamp_latest = s.timestamp_latest; 
            last_doc = s.last_doc;
          }
        }
        return last_doc
      """
      }
    }
  }
}

The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "latest_doc" : {
      "value" : {
        "@timestamp" : "2021-02-28",
        "occupation" : "software engineer",
        "hobbies" : [
          "dancing",
          "badminton"
        ],
        "bonus" : 500,
        "name" : "Peter",
        "weight" : 60,
        "salary" : 5000,
        "height" : 175
      }
    }
  }
}
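The same "keep the best candidate per shard, then pick the best overall" pattern can be sketched in Python. It is a model of the logic only: latest_doc is a hypothetical driver, the shard split is made up, and a day ordinal replaces the epoch milliseconds used by the Painless script.

```python
from datetime import date


def init_script():
    return {"timestamp_latest": 0, "last_doc": ""}


def map_script(state, doc):
    # The Painless script compares epoch milliseconds; a day ordinal is used
    # here instead because it orders these dates the same way.
    current = date.fromisoformat(doc["@timestamp"]).toordinal()
    if current > state["timestamp_latest"]:
        state["timestamp_latest"] = current
        state["last_doc"] = dict(doc)


def reduce_script(states):
    last_doc, timestamp_latest = "", 0
    for s in states:
        if s["timestamp_latest"] > timestamp_latest:
            timestamp_latest = s["timestamp_latest"]
            last_doc = s["last_doc"]
    return last_doc


def latest_doc(shards):
    states = []
    for shard_docs in shards:
        state = init_script()
        for doc in shard_docs:
            map_script(state, doc)
        states.append(state)  # combine_script simply returns the state
    return reduce_script(states)


docs = [
    {"name": "Peter", "@timestamp": "2021-02-28"},
    {"name": "John", "@timestamp": "2020-02-01"},
    {"name": "mary", "@timestamp": "2019-03-01"},
    {"name": "cherry", "@timestamp": "2014-05-01"},
]
newest = latest_doc([docs[:2], docs[2:]])  # hypothetical two-shard split
print(newest["name"])
```

Because the Painless comparison starts from 0, it would never select a document dated before 1970, whose epoch milliseconds are negative; that does not matter for this data set.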

Now consider another case: we want the full content of the document with the highest BMI:

GET employee/_search
{
  "size": 0,
  "aggs": {
    "latest_doc": {
      "scripted_metric": {
        "init_script": "state.max_bmi = 0.0; state.last_doc = ''",
        "map_script": """ 
          double height = (float)doc['height'].value/100.0;
          double bmi = doc['weight'].value / (height*height);
          if (bmi > state.max_bmi) {
            state.max_bmi = bmi;
            state.last_doc = new HashMap(params['_source']);
          }
        """,
        "combine_script": "return state",
        "reduce_script": """ 
          def last_doc = '';
          def max_bmi = 0.0;
          for (s in states) {
            if (s.max_bmi > max_bmi) {
              max_bmi = s.max_bmi; 
              last_doc = s.last_doc;
            }
          }
          return last_doc
        """
      }
    }
  }
}

The result is:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "latest_doc" : {
      "value" : {
        "@timestamp" : "2017-02-01",
        "occupation" : "manager",
        "hobbies" : [
          "climbing",
          "jigging"
        ],
        "bonus" : 800,
        "name" : "cathy",
        "weight" : 60,
        "salary" : 9000,
        "height" : 170
      }
    }
  }
}

This is consistent with the earlier search that scored and sorted by BMI, where the document with id 5 had the highest BMI. That search, however, did not return the document content:

{ "salary" : 9000, "bonus": 800, "@timestamp" : "2017-02-01", "weight": 60, "height": 170, "name" : "cathy", "occupation": "manager", "hobbies":["climbing", "jigging"]}

 

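Scripts in Kibana

In many of my earlier articles I have already used Painless scripts to get the results we want. See the following articles:

  • Kibana: How to create scripted fields in Kibana

  • Kibana: Using scripted fields to improve data observability

  • Kibana: Using script fields to clean data

  • Kibana: Easily creating runtime fields in Lens to analyze data - version 7.13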
     

Scripts in runtime fields

  • Elasticsearch: Using runtime fields to override indexed fields and fix errors - released in 7.11
  • Elasticsearch: Creating a runtime field and using it in Kibana - released in 7.11
  • Elasticsearch: Creating runtime fields dynamically - 7.11 release

 

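Summary

Painless scripts show up in many places across the Elastic Stack. This article covers most of the scenarios we are likely to need, although it is not exhaustive. I hope it helps developers who are learning Elasticsearch.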
