Elasticsearch 7.x In Depth [10]: Aggregation

1. References

Geek Time (极客时间): Elasticsearch核心技术与实战 (Elasticsearch Core Technology and Practice), by 阮一鸣
Elasticsearch Aggregation: a detailed summary (aggregate statistics)
Elasticsearch aggregations: Bucket Aggregations
Elasticsearch aggregations: Metrics Aggregations
Elasticsearch aggregations: Pipeline Aggregations
Official docs: search-aggregations
Geo distance filter
Elasticsearch: an introduction to aggregation
ES aggregation explained
aggregation explained 1 (overview)
aggregation explained 2 (metrics aggregations)
aggregation explained 3 (bucket aggregation)
aggregation explained 4 (pipeline aggregations)
[Elasticsearch] Filtering Queries and Aggregations
Official docs: search-aggregations-bucket
Official docs: search-aggregations-metrics
Official docs: search-aggregations-pipeline
Official docs: search-aggregations-matrix
Using a bucket script aggregation inside filter aggreagtion
Question: with a nested query, how do you aggregate inside the nested documents and then filter?

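2. Getting started

Data preparation:

Aggregation categories

Aggregations provide aggregated data based on a search query. They fall into the following categories:

  • Bucket
    A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation runs, every bucket criterion is evaluated against each document in the context; when a criterion matches, the document is considered to "fall into" that bucket. At the end of the aggregation we have a list of buckets, each with a set of documents that "belong" to it.
  • Metric
    Aggregations that track and compute metrics over a set of documents.
  • Matrix
    A family of aggregations that operate on multiple fields and produce a matrix result from the values extracted from the requested document fields. Unlike Bucket and Metric aggregations, this family does not yet support scripting.
  • Pipeline
    Aggregations that aggregate the output of other aggregations and their associated metrics.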

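Aggregation syntax

"aggregations" : { // keyword (can be shortened to "aggs")
    "<aggregation_name>" : { // user-defined aggregation name
        "<aggregation_type>" : { // aggregation type
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?  // sub-aggregations
    }
    [,"<aggregation_name_2>" : { ... } ]*  // sibling aggregations
}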
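
To make the nesting in the syntax above concrete, here is a minimal Python sketch that builds a request body of that shape. The `agg` helper and the `max_amount` aggregation name are hypothetical and exist only for illustration; the field names come from the examples later in this article.

```python
import json

def agg(agg_type, body, sub_aggs=None, meta=None):
    """Build one aggregation node: {<type>: <body>[, "meta": ...][, "aggs": ...]}."""
    node = {agg_type: body}
    if meta:
        node["meta"] = meta
    if sub_aggs:
        node["aggs"] = sub_aggs  # "aggs" is the short form of "aggregations"
    return node

# A terms aggregation with one metric sub-aggregation (names are illustrative).
request = {
    "size": 0,
    "aggs": {
        "group_by_goodsName": agg(
            "terms",
            {"field": "goodsName.keyword", "size": 10},
            sub_aggs={"max_amount": agg("max", {"field": "amount"})},
        )
    },
}

print(json.dumps(request, indent=2))
```

The printed JSON can be pasted as the body of a GET /aggs_order/_search request.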

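Let's go through them one by one.

Bucket

The Elasticsearch documentation lists many bucket types; only a few are shown here.

  • Terms
  • Range
  • Date Range
  • Histogram
  • Date Histogram
  • ...

Example 1: terms

As an example, let's see how many distinct products appear in the orders.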

GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "group_by_goodsName": {
      "terms": {
        "field": "goodsName.keyword",
        "size": 10
      }
    }
  }
}

The result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_goodsName" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "IPhone 8 Plus",
          "doc_count" : 2
        },
        {
          "key" : "IPhone 9 Plus",
          "doc_count" : 2
        },
        {
          "key" : "IPhone 10 Plus",
          "doc_count" : 1
        }
      ]
    }
  }
}
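
  • Improving terms aggregation performance: set eager_global_ordinals to true in the mapping.
    Consider this setting when a field is aggregated on frequently while new documents are continuously being written.
  • min_doc_count: sets the minimum document count for the aggregation; only terms with at least this many documents are returned.

Meaning of the properties in a terms aggregation response:

Property                      Meaning
doc_count_error_upper_bound   Upper bound on the document count of a term bucket that was left out of the result
sum_other_doc_count           Total document count of all terms that are not in the returned buckets (total minus the returned counts)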
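
As a rough illustration of sum_other_doc_count, the sketch below emulates a terms aggregation on a single shard in plain Python. It is an assumption-laden model, not how Elasticsearch is implemented: `terms_agg` is a hypothetical helper, and the input list simply repeats the product names from the example above.

```python
from collections import Counter

def terms_agg(values, size=10):
    """Emulate a single-shard terms aggregation over a list of field values."""
    counts = Counter(values)
    # Default ordering: doc_count descending, ties broken by key ascending.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = ordered[:size]
    returned = sum(count for _, count in kept)
    return {
        # Documents that belong to terms outside the returned buckets.
        "sum_other_doc_count": len(values) - returned,
        "buckets": [{"key": key, "doc_count": count} for key, count in kept],
    }

goods = [
    "IPhone 8 Plus", "IPhone 9 Plus", "IPhone 8 Plus",
    "IPhone 9 Plus", "IPhone 10 Plus",
]
result = terms_agg(goods, size=2)
print(result)
```

With size set to 2, the single "IPhone 10 Plus" order is not returned as a bucket and is only visible through sum_other_doc_count.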
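
Example 2: sub-aggregations

For each product, fetch the single order with the highest price.

# Group by goodsName.keyword first, then sort each group by amount in descending order and take the first hit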
GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "group_by_goodsName": {
      "terms": {
        "field": "goodsName.keyword",
        "size": 10
      },
      "aggs": {
        "more_amount": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "amount": {
                  "order": "desc"
                }
              }
              ]
          }
        }
      }
    }
  }
}

The response:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_goodsName" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "IPhone 8 Plus",
          "doc_count" : 2,
          "more_amount" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "aggs_order",
                  "_type" : "_doc",
                  "_id" : "HAOY-SKXIS-LIWN",
                  "_score" : null,
                  "_source" : {
                    "platform" : "IOS",
                    "amount" : 1200,
                    "createTime" : "2020-04-15 10:00",
                    "originatorId" : 2,
                    "originatorName" : "李四",
                    "goodsId" : 1,
                    "goodsName" : "IPhone 8 Plus"
                  },
                  "sort" : [
                    1200
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "IPhone 9 Plus",
          "doc_count" : 2,
          "more_amount" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "aggs_order",
                  "_type" : "_doc",
                  "_id" : "USYX_SJJSUL_XUSYA",
                  "_score" : null,
                  "_source" : {
                    "platform" : "PC",
                    "amount" : 500,
                    "createTime" : "2020-01-20 10:00",
                    "originatorId" : 1,
                    "originatorName" : "张三",
                    "goodsId" : 2,
                    "goodsName" : "IPhone 9 Plus"
                  },
                  "sort" : [
                    500
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "IPhone 10 Plus",
          "doc_count" : 1,
          "more_amount" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "aggs_order",
                  "_type" : "_doc",
                  "_id" : "XXSA-KSUWL-USIA",
                  "_score" : null,
                  "_source" : {
                    "platform" : "PC",
                    "createTime" : "2020-01-20 10:00",
                    "originatorId" : 3,
                    "originatorName" : "王五",
                    "goodsId" : 3,
                    "goodsName" : "IPhone 10 Plus"
                  },
                  "sort" : [
                    -9223372036854775808
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

Example 3: range

Group orders by price range. This example shows that each range is half-open: it includes from and excludes to, for example [300, 700).

GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "amount_range": {
      "range": {
        "field": "amount",
        "ranges": [
          {
            "to": 300
          },
          {
            "from": 300,
            "to": 700
          },
          {
            "key": "gt 700",
            "from": 700
          }
        ]
      }
    }
  }
}

The result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "amount_range" : {
      "buckets" : [
        {
          "key" : "*-300.0",
          "to" : 300.0,
          "doc_count" : 1
        },
        {
          "key" : "300.0-700.0",
          "from" : 300.0,
          "to" : 700.0,
          "doc_count" : 2
        },
        {
          "key" : "gt 700",
          "from" : 700.0,
          "doc_count" : 1
        }
      ]
    }
  }
}
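
The half-open behaviour can be checked with a small Python sketch. `range_agg` is a hypothetical helper that mimics the bucketing rule only, and the amounts are the four order amounts that appear in the responses in this article.

```python
def range_agg(values, ranges):
    """Count values per range; each range includes "from" and excludes "to"."""
    buckets = []
    for r in ranges:
        lower, upper = r.get("from"), r.get("to")
        count = sum(
            1
            for v in values
            if (lower is None or v >= lower) and (upper is None or v < upper)
        )
        buckets.append({"from": lower, "to": upper, "doc_count": count})
    return buckets

amounts = [100, 300, 500, 1200]
buckets = range_agg(
    amounts,
    [{"to": 300}, {"from": 300, "to": 700}, {"from": 700}],
)
print(buckets)
```

The order with amount 300 lands in the second bucket, not the first, because the upper bound is exclusive.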
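
Example 4: script

Use a script to compute the age of each order in years (params.now minus the year of createTime), then group the orders by that age.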

GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "group_by_year": {
      "range": {
        "script": {
          "source": """
              JodaCompatibleZonedDateTime dateTime = doc['createTime'].value;
              return params.now - dateTime.getYear();
          """,
          "params": {
            "now": 2020
          }
        },
        "ranges": [
          {
            "to": 1
          },
          {
            "from": 1,
            "to": 3
          },
          {
            "from": 3,
            "to": 5
          },
          {
            "from": 5
          }
        ]
      }
    }
  }
}

Example 5: geo_distance

Draw circles around a given origin and find the documents whose geo coordinates fall inside each distance range.

GET /aggs_hotel/_search
{
  "size": 0, 
  "aggs": {
    "rings_around_amsterdam": {
      "geo_distance": {
        "field": "location",
        "origin": {
          "lon": 109.0000000,
          "lat": 34.0000000
        },
        "ranges": [
          { "to" : 100000 },
          { "from" : 100000, "to" : 300000 },
          { "from" : 300000 }
        ]
      }
    }
  }
}

The result:

{
  "took" : 82,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "rings_around_amsterdam" : {
      "buckets" : [
        {
          "key" : "*-100000.0",
          "from" : 0.0,
          "to" : 100000.0,
          "doc_count" : 6
        },
        {
          "key" : "100000.0-300000.0",
          "from" : 100000.0,
          "to" : 300000.0,
          "doc_count" : 0
        },
        {
          "key" : "300000.0-*",
          "from" : 300000.0,
          "doc_count" : 2
        }
      ]
    }
  }
}

  • The unit parameter sets the distance unit; the default is m.
By default, the distance unit is m (meters) but it can also accept: mi (miles), in (inches), 
yd (yards), km (kilometers), cm (centimeters), mm (millimeters).
  • Setting keyed to true returns the buckets as a hash keyed by bucket key instead of an array.
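
To show what the rings mean numerically, here is a Python sketch that computes a great-circle distance with the haversine formula and maps it to the ring boundaries used above (100 km and 300 km, in meters). This is an approximation for illustration: the Earth radius constant and both helper names are assumptions, and Elasticsearch's own distance calculation may differ slightly.

```python
import math

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def ring_index(distance_m, bounds=(100_000, 300_000)):
    """Index of the ring a distance falls into; upper bounds are exclusive."""
    for i, upper in enumerate(bounds):
        if distance_m < upper:
            return i
    return len(bounds)

# One degree of latitude north of the origin used in the request above.
d = haversine_m(34.0, 109.0, 35.0, 109.0)
print(round(d), ring_index(d))
```

A point one degree of latitude away is a little over 111 km from the origin, so it falls into the second ring.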

Example 6: filter, nested

Look up the hotel "泽兰雅家酒店" and get the price information for member level 001 with a stay of [2020-05-01, 2020-05-03).

# Approach 1: filter directly in the query
GET /aggs_hotel_price/_search
{
  "size": 0, 
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name.keyword": "泽兰雅家酒店"
        }
      }
    }
  },
  "aggs": {
     "prices": {
        "nested": {
          "path": "prices"
        },
        "aggs": {
          "group_by_level": {
            "terms": {
              "field": "prices.level",
              "size": 1,
              "include": "001"
            },
            "aggs": {
              "date_range": {
                "date_range": {
                  "field": "prices.selldate",
                  "ranges": [
                    {
                      "from": "2020-05-01",
                      "to": "2020-05-03"
                    }
                  ]
                },
                "aggs": {
                  "stats": {
                    "stats": {
                      "field": "prices.price"
                    }
                  }
                }
              }
            }
          }
        }
    }
  }
}

# Approach 2: not really recommended, because the name is matched by a filter aggregation after grouping instead of in the query
GET /aggs_hotel_price/_search
{
  "size": 0,
  "aggs": {
    "group_by_name": {
      "filter": {
        "term": {
          "name.keyword": "泽兰雅家酒店"
        }
      },
      "aggs": {
        "prices": {
          "nested": {
            "path": "prices"
          },
          "aggs": {
            "group_by_level": {
              "terms": {
                "field": "prices.level",
                "size": 1,
                "include": "001"
              },
              "aggs": {
                "date_range": {
                  "date_range": {
                    "field": "prices.selldate",
                    "ranges": [
                      {
                        "from": "2020-05-01",
                        "to": "2020-05-03"
                      }
                    ]
                  },
                  "aggs": {
                    "stats": {
                      "stats": {
                        "field": "prices.price"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

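Here is the result of the second approach. Note that the two approaches return differently shaped responses, because the second one adds an extra aggregation level.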

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_name" : {
      "doc_count" : 1,
      "prices" : {
        "doc_count" : 6,
        "group_by_level" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "001",
              "doc_count" : 2,
              "date_range" : {
                "buckets" : [
                  {
                    "key" : "2020-05-01-2020-05-03",
                    "from" : 1.5882912E12,
                    "from_as_string" : "2020-05-01",
                    "to" : 1.588464E12,
                    "to_as_string" : "2020-05-03",
                    "doc_count" : 2,
                    "stats" : {
                      "count" : 2,
                      "min" : 9.0,
                      "max" : 15.0,
                      "avg" : 12.0,
                      "sum" : 24.0
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
  }
}

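Metric

Metric aggregations come in two kinds, single-value and multi-value. Let's look at each in turn.

Single-value (returns a single result)

The Elasticsearch documentation lists many types; only a few are shown here.

  • min: minimum
  • max: maximum
  • avg: average
  • sum: sum
  • cardinality: number of distinct values

Here is an example.

  • Find the smallest payment amount among the orders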
GET /aggs_order/_search
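{
  "size": 0,  // there is no query part and I don't care about the hits, so size is set to 0
  "aggs": { // keyword, cannot be changed
    "min_aggs": { // user-defined name of the aggregation
      "min": { // aggregation type; must be one of the types listed above
        "field": "amount" // field to aggregate on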
      }
    }
  }
}

The response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "min_aggs" : {
      "value" : 100.0 // the minimum value is returned here
    }
  }
}

A request can also contain several aggregations. Here we query the maximum, minimum, and average payment amount of the orders.

GET /aggs_order/_search
{
  "size": 0, 
  "aggs": {
    "min_aggs": {
      "min": {
        "field": "amount"
      }
    },
    "max_aggs": {
      "max": {
        "field": "amount"
      }
    },
    "avg_aggs": {
      "avg": {
        "field": "amount"
      }
    }
  }
}

The result:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_aggs" : { // average
      "value" : 525.0
    },
    "min_aggs" : { // minimum
      "value" : 100.0
    },
    "max_aggs" : { // maximum
      "value" : 1200.0
    }
  }
}

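Multi-value (returns multiple results)

The Elasticsearch documentation lists many types; only a few are shown here.

  • stats
  • extended stats
  • percentiles
  • percentile ranks
  • top hits
  • ...

As an example, let's get summary statistics for the amount field of the orders, such as the maximum and the minimum.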

GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "stats_aggs": {
      "stats": { // use stats, one of the multi-value metric types
        "field": "amount"
      }
    }
  }
}

The response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "stats_aggs" : {
      "count" : 4, // number of documents that have an amount field
      "min" : 100.0, // minimum
      "max" : 1200.0, // maximum
      "avg" : 525.0, // average
      "sum" : 2100.0 // sum
    }
  }
}
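
The numbers in this response can be reproduced by hand. The Python sketch below is a hypothetical `stats_agg` helper that skips documents without a value, which is why count is 4 even though the index holds 5 orders; the amounts are taken from the order documents shown elsewhere in this article.

```python
def stats_agg(values):
    """Emulate a stats aggregation; documents without a value are skipped."""
    present = [v for v in values if v is not None]
    if not present:
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0}
    total = sum(present)
    return {
        "count": len(present),
        "min": min(present),
        "max": max(present),
        "avg": total / len(present),
        "sum": total,
    }

# Four orders with an amount and one order without (None).
amounts = [100, 500, 1200, 300, None]
stats = stats_agg(amounts)
print(stats)
```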

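Pipeline

Pipeline aggregations aggregate the results of other aggregations.

  • Depending on its type, a pipeline aggregation writes its result to a different place (the explanation below is quoted from 冰钟, with thanks).
    1. Sibling: takes the result of a sibling (same-level) aggregation as input and computes a new aggregation result from it. The new result sits at the same level as the sibling aggregation.
      max, min, avg, sum (the *_bucket aggregations)
      stats, extended stats
      percentiles
      ...
    2. Parent: takes the result of its parent aggregation as input. It can compute new buckets, or new aggregation results that are added to the existing buckets.
      derivative
      cumulative sum
      moving function (sliding window)
      ...

A pipeline aggregation must specify buckets_path. Let's look at the syntax of this path.

buckets_path syntax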

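# Aggregation separator ">": expresses the parent/child relationship between aggregations, e.g. "my_bucket>my_stats"
AGG_SEPARATOR       =  `>` ;

# Metric separator ".": selects a metric value, e.g. "my_stats.avg"
# From my own experiments: use > between two bucket aggregations; between a bucket and a metric aggregation either > or . works; between a metric aggregation and its metric use .
METRIC_SEPARATOR    =  `.` ;

# Aggregation name
AGG_NAME            =  <the name of the aggregation> ;

# For a multi-value metric aggregation, the name of the metric to read
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;

# For a multi-bucket aggregation, selects the bucket with the given key
# e.g. sale_type['hat']>sales
MULTIBUCKET_KEY     =  `[<KEY_NAME>]`

# The complete path grammar:
PATH                =  <AGG_NAME><MULTIBUCKET_KEY>? (<AGG_SEPARATOR>, <AGG_NAME> )* ( <METRIC_SEPARATOR>, <METRIC> ) ;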
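
To make the grammar easier to read, here is a small Python sketch that splits a buckets_path into its aggregation steps and optional metric. `parse_buckets_path` is a hypothetical helper written for this article; it assumes bucket keys contain no ">" character and it is not the parser Elasticsearch uses.

```python
import re

# One path step: an aggregation name, optionally followed by [KEY_NAME].
STEP = re.compile(r"^(?P<name>[^\[\]]+)(?:\[(?P<key>.+)\])?$")

def parse_buckets_path(path):
    """Return ([(agg_name, bucket_key_or_None), ...], metric_or_None)."""
    segments = path.split(">")
    metric = None
    last = segments[-1]
    # A trailing ".metric" selects one value of a multi-value metric aggregation.
    if "." in last and not last.endswith("]"):
        segments[-1], metric = last.rsplit(".", 1)
    steps = []
    for segment in segments:
        match = STEP.match(segment)
        if match is None:
            raise ValueError(f"invalid path segment: {segment!r}")
        key = match.group("key")
        if key is not None:
            key = key.strip("'\"")  # drop the quotes around the bucket key
        steps.append((match.group("name"), key))
    return steps, metric

print(parse_buckets_path("my_bucket>my_stats.avg"))
print(parse_buckets_path("sale_type['hat']>sales"))
```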

Example 1: min_bucket

Compute the average order amount per person, then pick the smallest of those averages.

GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "group_by_originatorId": {
      "terms": {
        "field": "originatorName"
      },
      "aggs": {
        "avg_amount": {
          "avg": {
            "field": "amount",
            "missing": 0
          }
        }
      }
    },
      "min_avg_amount": { // user-defined name of the pipeline aggregation
        "min_bucket": { // keyword: the pipeline aggregation type
          "buckets_path": "group_by_originatorId>avg_amount" // path to the metric to aggregate
        }
      }
  }
}

The result: because min_bucket is a sibling pipeline aggregation, its result sits at the same level as the sibling aggregation.

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_originatorId" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "张三",
          "doc_count" : 2,
          "avg_amount" : {
            "value" : 300.0
          }
        },
        {
          "key" : "王五",
          "doc_count" : 2,
          "avg_amount" : {
            "value" : 150.0
          }
        },
        {
          "key" : "李四",
          "doc_count" : 1,
          "avg_amount" : {
            "value" : 1200.0
          }
        }
      ]
    },
    "min_avg_amount" : {
      "value" : 150.0,
      "keys" : [
        "王五"
      ]
    }
  }
}
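
What min_bucket does with these buckets can be sketched in a few lines of Python. The `min_bucket` function below is a hypothetical emulation that operates on the bucket list copied from the response above.

```python
def min_bucket(buckets, metric_name):
    """Return the smallest metric value and the keys of the buckets that hold it."""
    values = [
        (b["key"], b[metric_name]["value"])
        for b in buckets
        if b[metric_name]["value"] is not None
    ]
    lowest = min(value for _, value in values)
    return {"value": lowest, "keys": [key for key, value in values if value == lowest]}

buckets = [
    {"key": "张三", "doc_count": 2, "avg_amount": {"value": 300.0}},
    {"key": "王五", "doc_count": 2, "avg_amount": {"value": 150.0}},
    {"key": "李四", "doc_count": 1, "avg_amount": {"value": 1200.0}},
]
result = min_bucket(buckets, "avg_amount")
print(result)
```

The keys field is a list because several buckets can share the same minimum value.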

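Aggregation scope

By default, an aggregation runs over the result set of the query.
The scope can be changed in the following ways.

post filter

post_filter filters the hits after the aggregations have been computed, so it does not change the aggregation results.

# Bucket by name and compute order-amount statistics per person (shown under aggregations in the response), then keep only 张三's documents (shown under hits in the response)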
GET /aggs_order/_search
{
  "aggs": {
    "group_by_originatorName": {
      "terms": {
        "field": "originatorName"
      },
      "aggs": {
        "stats": {
          "stats": {
            "field": "amount"
          }
        }
      }
    }
  },
  "post_filter": {
    "term": {
      "originatorName": "张三"
    }
  }
}

The response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "aggs_order",
        "_type" : "_doc",
        "_id" : "HASA-XSIAN-SIWU",
        "_score" : 1.0,
        "_source" : {
          "platform" : "Android",
          "amount" : 100,
          "createTime" : "2019-05-20 10:00",
          "originatorId" : 1,
          "originatorName" : "张三",
          "goodsId" : 1,
          "goodsName" : "IPhone 8 Plus"
        }
      },
      {
        "_index" : "aggs_order",
        "_type" : "_doc",
        "_id" : "USYX_SJJSUL_XUSYA",
        "_score" : 1.0,
        "_source" : {
          "platform" : "PC",
          "amount" : 500,
          "createTime" : "2020-01-20 10:00",
          "originatorId" : 1,
          "originatorName" : "张三",
          "goodsId" : 2,
          "goodsName" : "IPhone 9 Plus"
        }
      }
    ]
  },
  "aggregations" : {
    "group_by_originatorName" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "张三",
          "doc_count" : 2,
          "stats" : {
            "count" : 2,
            "min" : 100.0,
            "max" : 500.0,
            "avg" : 300.0,
            "sum" : 600.0
          }
        },
        {
          "key" : "王五",
          "doc_count" : 2,
          "stats" : {
            "count" : 1,
            "min" : 300.0,
            "max" : 300.0,
            "avg" : 300.0,
            "sum" : 300.0
          }
        },
        {
          "key" : "李四",
          "doc_count" : 1,
          "stats" : {
            "count" : 1,
            "min" : 1200.0,
            "max" : 1200.0,
            "avg" : 1200.0,
            "sum" : 1200.0
          }
        }
      ]
    }
  }
}
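
The order of operations can be modelled in plain Python. This is a simplified sketch, not Elasticsearch's implementation: `search` is a hypothetical function, the aggregation is reduced to a per-name document count, and the order list mirrors the five documents used in this article.

```python
ORDERS = [
    {"originatorName": "张三", "amount": 100},
    {"originatorName": "张三", "amount": 500},
    {"originatorName": "李四", "amount": 1200},
    {"originatorName": "王五", "amount": 300},
    {"originatorName": "王五"},  # this order has no amount
]

def search(docs, query=None, post_filter=None):
    """Aggregations see the query result; post_filter only narrows the hits."""
    matched = [d for d in docs if query is None or query(d)]
    counts = {}
    for d in matched:
        name = d["originatorName"]
        counts[name] = counts.get(name, 0) + 1
    hits = [d for d in matched if post_filter is None or post_filter(d)]
    return {"hits": hits, "aggregations": counts}

response = search(ORDERS, post_filter=lambda d: d["originatorName"] == "张三")
print(len(response["hits"]), response["aggregations"])
```

The hits contain only 张三's two orders, while the aggregation still covers all three people.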

global

A global aggregation ignores the restrictions of the query part.

GET /aggs_order/_search
{
  "size": 0, 
  "query": {
    "range": {
      "amount": {
        "gt": 100
      }
    }
  }, 
  "aggs": {
    "group_by_originatorName": {
      "terms": {
        "field": "originatorName"
      },
      "aggs": {
        "stats": {
          "stats": {
            "field": "amount"
          }
        }
      }
    },
    "all": {
      "global": {},
      "aggs": {
        "group_by_originatorName": {
          "terms": {
            "field": "originatorName"
          },
          "aggs": {
            "stats": {
              "stats": {
                "field": "amount"
              }
            }
          }
        }
      }
    }
  }
}

Compare the two parts of the result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "all" : {
      "doc_count" : 5,
      "group_by_originatorName" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "张三",
            "doc_count" : 2,
            "stats" : {
              "count" : 2,
              "min" : 100.0,
              "max" : 500.0,
              "avg" : 300.0,
              "sum" : 600.0
            }
          },
          {
            "key" : "王五",
            "doc_count" : 2,
            "stats" : {
              "count" : 1,
              "min" : 300.0,
              "max" : 300.0,
              "avg" : 300.0,
              "sum" : 300.0
            }
          },
          {
            "key" : "李四",
            "doc_count" : 1,
            "stats" : {
              "count" : 1,
              "min" : 1200.0,
              "max" : 1200.0,
              "avg" : 1200.0,
              "sum" : 1200.0
            }
          }
        ]
      }
    },
    "group_by_originatorName" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "张三",
          "doc_count" : 1,
          "stats" : {
            "count" : 1,
            "min" : 500.0,
            "max" : 500.0,
            "avg" : 500.0,
            "sum" : 500.0
          }
        },
        {
          "key" : "李四",
          "doc_count" : 1,
          "stats" : {
            "count" : 1,
            "min" : 1200.0,
            "max" : 1200.0,
            "avg" : 1200.0,
            "sum" : 1200.0
          }
        },
        {
          "key" : "王五",
          "doc_count" : 1,
          "stats" : {
            "count" : 1,
            "min" : 300.0,
            "max" : 300.0,
            "avg" : 300.0,
            "sum" : 300.0
          }
        }
      ]
    }
  }
}
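
The difference between the two scopes can be reproduced with a short Python model. It is only a sketch under simplifying assumptions: `search_with_global` and `count_by_name` are hypothetical helpers, and the aggregation is reduced to a per-name document count.

```python
ORDERS = [
    {"originatorName": "张三", "amount": 100},
    {"originatorName": "张三", "amount": 500},
    {"originatorName": "李四", "amount": 1200},
    {"originatorName": "王五", "amount": 300},
    {"originatorName": "王五"},  # this order has no amount
]

def count_by_name(docs):
    counts = {}
    for d in docs:
        counts[d["originatorName"]] = counts.get(d["originatorName"], 0) + 1
    return counts

def search_with_global(docs, query):
    """The regular aggregation sees the query result; "all" sees every document."""
    matched = [d for d in docs if query(d)]
    return {
        "group_by_originatorName": count_by_name(matched),
        "all": {
            "doc_count": len(docs),
            "group_by_originatorName": count_by_name(docs),
        },
    }

# Same condition as the range query above: amount greater than 100.
response = search_with_global(ORDERS, lambda d: d.get("amount", 0) > 100)
print(response)
```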

Sorting

Sorting by built-in keys

  • _count
  • _key

Sort by the document count of each bucket and by the bucket key

GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "group_by_originatorName": {
      "terms": {
        "field": "originatorName",
        "order": [
          {"_count": "desc"},
          {"_key": "desc"}
          ]
      }
    }
  }
}

Sorting by a single-value sub-aggregation

Use a sub-aggregation that returns a single value, such as min, max, or avg, as the sort criterion

GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "group_by_originatorName": {
      "terms": {
        "field": "originatorName",
        "order": {
          "avg_amount": "desc"
        }
      },
      "aggs": {
        "avg_amount": {
          "avg": {
            "field": "amount"
          }
        }
      }
    }
  }
}

Sorting by a multi-value sub-aggregation

Use one value of a sub-aggregation that returns several values, such as stats, as the sort criterion

GET /aggs_order/_search
{
  "size": 0,
  "aggs": {
    "group_by_originatorName": {
      "terms": {
        "field": "originatorName",
        "order": {
          "stats_amount.sum": "desc"
        }
      },
      "aggs": {
        "stats_amount": {
          "stats": {
            "field": "amount"
          }
        }
      }
    }
  }
}
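
A multi-criteria order such as _count desc followed by _key desc can be imitated in Python with two stable sorts, applying the secondary criterion first. This is a sketch over the bucket counts from the earlier examples; it compares keys by Unicode code point, which is assumed here to match how the keyword values are ordered.

```python
buckets = [
    {"key": "张三", "doc_count": 2},
    {"key": "王五", "doc_count": 2},
    {"key": "李四", "doc_count": 1},
]

# Python's sort is stable, so sort by the secondary criterion first
# (_key desc) and then by the primary one (_count desc).
ordered = sorted(buckets, key=lambda b: b["key"], reverse=True)
ordered = sorted(ordered, key=lambda b: b["doc_count"], reverse=True)

print([b["key"] for b in ordered])
```

The two buckets with the same doc_count are ordered by key, which is where the second sort criterion takes effect.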

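Exercise

With a nested query, how do you aggregate inside the nested documents and then filter on the aggregated result?

Business scenario: there are 1,000,000 users and 500,000 red-envelope records, and one user can have several red-envelope records. First index the 1,000,000 user documents; each user document stores its current list of red envelopes in a field of type nested.
Each red-envelope record has an amount and an expiry date.
Requirement: count how many users have red envelopes that are still valid and whose accumulated amount reaches a given threshold.

I came across this question while looking for material and tried it myself. Here are my answers.

  • Approach 1
    This approach needs a size on the terms aggregation. With multiple shards the results can be inaccurate, and a large size uses more memory, so use it with care.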
GET /aggs_user_envelope/_search
{
  "size": 0,
  "aggs": {
    "aggs_nested": {
      "nested": {
        "path": "envelope"
      },
      "aggs": {
        "filter_date": {
          "filters": {
            "filters": {
              "range": {
                "range": {
                  "envelope.until": {
                    "gte": "2020-05-30 00:00"
                  }
                }
              }
            }
          },
          "aggs": {
            "group_by_username": {
              "terms": {
                "field": "envelope.userId",
                "size": 10
              },
              "aggs": {
                "sum_of_money": {
                  "sum": {
                    "field": "envelope.money"
                  }
                },
                "filter_money": {
                  "bucket_selector": {
                    "buckets_path": {
                      "money": "sum_of_money"
                    },
                    "script": "params.money >= 50"
                  }
                },
                "sort": {
                  "bucket_sort": {
                    "sort": [
                      {"sum_of_money": {"order": "desc"}}
                      ,{"_count": {"order": "desc"}}
                      ,{"_key": {"order": "desc"}}
                      ]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
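
  • Approach 2
    The official docs recommend paginating with the composite aggregation, which works much like scroll pagination. The composite aggregation has its own restriction: its sources can only be Terms, Histogram, or Date histogram (newer 7.x releases also accept GeoTile grid).

First page: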

GET /aggs_user_envelope/_search
{
  "size": 0,
  "aggs": {
    "nested_wrapper": {
      "nested": {
        "path": "envelope"
      },
      "aggs": {
        "group_by_userName": {
          "composite": {
            "size": 2, 
            "sources": [
              {
                "userName": {
                  "terms": {
                    "field": "envelope.userId",
                    "missing_bucket": true
                  }
                }
              }
            ]
          },
          "aggs": {
            "filter_date": {
              "filter": {
                "range": {
                  "envelope.until": {
                    "gte": "2020-05-30 00:00"
                  }
                }
              },
              "aggs": {
                "sum_of_money": {
                  "sum": {
                    "field": "envelope.money"
                  }
                }
              }
            },
            "filter_money": {
              "bucket_selector": {
                "buckets_path": {
                  "money": "filter_date>sum_of_money"
                },
                "script": "params.money >= 50"
              }
            },
            "sort": {
              "bucket_sort": {
                "sort": [
                  {"filter_date>sum_of_money": {"order": "desc"}}
                  ,{"_count": {"order": "desc"}}
                  ,{"_key": {"order": "desc"}}
                  ]
              }
            }
          }
        }
      }
    }
  }
}

The result of the first page:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "nested_wrapper" : {
      "doc_count" : 9,
      "group_by_userName" : {
        "after_key" : {
          "userName" : "10086" // use this as the starting point of the next page
        },
        "buckets" : [
          {
            "key" : {
              "userName" : "10086"
            },
            "doc_count" : 3,
            "filter_date" : {
              "doc_count" : 3,
              "sum_of_money" : {
                "value" : 50.0
              }
            }
          }
        ]
      }
    }
  }
}

The second page must specify after

GET /aggs_user_envelope/_search
{
  "size": 0,
  "aggs": {
    "nested_wrapper": {
      "nested": {
        "path": "envelope"
      },
      "aggs": {
        "group_by_userName": {
          "composite": {
            "size": 2, 
            "sources": [
              {
                "userName": {
                  "terms": {
                    "field": "envelope.userId",
                    "missing_bucket": true
                  }
                }
              }
            ],
            "after": {"userName" : "10086"} // specify after here
          },
          "aggs": {
            "filter_date": {
              "filter": {
                "range": {
                  "envelope.until": {
                    "gte": "2020-05-30 00:00"
                  }
                }
              },
              "aggs": {
                "sum_of_money": {
                  "sum": {
                    "field": "envelope.money"
                  }
                }
              }
            },
            "filter_money": {
              "bucket_selector": {
                "buckets_path": {
                  "money": "filter_date>sum_of_money"
                },
                "script": "params.money >= 50"
              }
            },
            "sort": {
              "bucket_sort": {
                "sort": [
                  {"filter_date>sum_of_money": {"order": "desc"}}
                  ,{"_count": {"order": "desc"}}
                  ,{"_key": {"order": "desc"}}
                  ]
              }
            }
          }
        }
      }
    }
  }
}
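
The paging loop itself can be sketched in Python. This is an assumption-heavy outline rather than client code: `fetch_page` stands in for whatever function sends the request above with the given after value and returns the composite aggregation part of the response, and `fake_fetch` with its "10088" key is invented purely to exercise the loop. The loop stops when the response no longer carries an after_key instead of when a page has no buckets, because bucket_selector can remove every bucket of a page that is not the last one.

```python
def paginate_composite(fetch_page):
    """Yield all buckets, following after_key until the response has none."""
    after = None
    while True:
        agg = fetch_page(after)
        yield from agg["buckets"]
        after = agg.get("after_key")
        if after is None:
            break

# Hypothetical canned pages keyed by the "after" value they answer.
PAGES = {
    None: {
        "after_key": {"userName": "10086"},
        "buckets": [{"key": {"userName": "10086"}, "doc_count": 3}],
    },
    # A page whose buckets were all removed by bucket_selector.
    "10086": {"after_key": {"userName": "10088"}, "buckets": []},
    # Last page: no after_key, no buckets.
    "10088": {"buckets": []},
}

def fake_fetch(after):
    return PAGES[after["userName"] if after else None]

all_buckets = list(paginate_composite(fake_fetch))
print(all_buckets)
```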

3. All done
