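Aggregation analysis (Aggregation) is the feature that Elasticsearch (es) provides, alongside search, for running statistical analysis over the data stored in es.
It offers several kinds of analysis, such as Bucket, Metric, and Pipeline, which cover most analytical needs. To make them easier to understand, es groups aggregations into the following 4 categories: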
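Bucket: bucketing, similar to GROUP BY in SQL
Metric: metric analysis, such as computing the maximum, minimum, or average
Pipeline: pipeline analysis, which runs a further analysis on the results of a previous aggregation
Matrix: matrix analysis
Metric aggregations fall into two groups, single-value and multi-value: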
Single-value: min, max, avg, sum, cardinality
Multi-value: stats, extended stats, percentile, percentile rank, top hits
Example:
GET myindex/_search
{
"size": 0, # do not return the document list
"aggs": {
"min_age": {
"min": { # the aggregation type keyword
"field": "age"
}
}
}
}
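# Note that size is set to 0 so that only the aggregation results are returned, without the matching hits
# Return several aggregation results in one request (siblings, not sub-aggregations)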
GET myindex/_search
{
"size": 0,
"aggs": {
"min_age": {
"min": {
"field": "age"
}
},
"avg_age":{
"avg":{
"field":"age"
}
},
"max_age":{
"max":{
"field":"age"
}
}
}
}
Cardinality: the cardinality of a set, i.e. the number of distinct values, similar to distinct count in SQL.
Example:
GET /bank/account/_search
{
"size":0,
"aggs":{
"count_of_genders":{
"cardinality": {
"field": "gender.keyword"
}
}
}
}
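As a rough illustration of what the cardinality aggregation counts, the sketch below computes an exact distinct count in plain Python. The sample documents are made up for the example. Elasticsearch itself returns an approximate count (it is based on the HyperLogLog++ algorithm), so on large, high-cardinality fields its result may differ slightly from the exact figure.

```python
# Hypothetical sample documents; only the "gender" field matters here.
docs = [
    {"gender": "M"},
    {"gender": "F"},
    {"gender": "M"},
    {"gender": "F"},
    {"gender": "M"},
]


def cardinality(documents, field):
    """Exact distinct count of `field`, like SQL COUNT(DISTINCT field)."""
    return len({doc[field] for doc in documents if field in doc})


count_of_genders = cardinality(docs, "gender")
print(count_of_genders)
```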
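stats: returns a set of statistics for a numeric field: min, max, avg, sum, and count
extended stats: an extension of stats that adds further statistics such as variance and standard deviation
Example: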
GET bank/account/_search
{
"size": 0,
"aggs": {
"stats_age": {
"stats": {
"field": "age"
}
}
}
}
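To make the two result sets concrete, here is a minimal Python sketch of the values that stats and extended stats report. The sample ages are invented, and the sketch assumes population (not sample) variance and standard deviation; it covers only a subset of what extended stats returns.

```python
import statistics

# Hypothetical sample of the "age" field.
ages = [20, 25, 30, 35, 40]


def stats(values):
    """The five values reported by a stats aggregation."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


def extended_stats(values):
    """stats plus variance and standard deviation (population formulas assumed)."""
    result = stats(values)
    result["variance"] = statistics.pvariance(values)
    result["std_deviation"] = statistics.pstdev(values)
    return result


stats_age = extended_stats(ages)
print(stats_age)
```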
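Percentile: percentile statistics, i.e. the value below which a given percentage of the observations fall.
Percentile Rank: the reverse lookup, i.e. the percentage of observations that fall at or below a given value.
Example:
# Percentile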
GET bank/account/_search
{
"size": 0,
"aggs": {
"per_age": {
"percentiles": {
"field": "age",
"percents": [
1,
5,
25,
50,
75,
95,
99
]
}
}
}
}
# Percentile Rank
GET bank/account/_search
{
"size": 0,
"aggs": {
"per_age": {
"percentile_ranks": {
"field": "age",
"values": [
20,
35,
40
]
}
}
}
}
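The difference between the two requests above is easier to see on a small data set. The sketch below uses the simple nearest-rank definition of a percentile and an "at or below" definition of percentile rank, with invented ages. It is only an illustration of the concepts: Elasticsearch computes both with approximate algorithms, so its numbers will generally not match an exact calculation like this one.

```python
import math

# Hypothetical sample of the "age" field.
ages = [20, 22, 25, 30, 35, 36, 38, 40, 40, 45]


def percentile(values, percent):
    """Nearest-rank percentile: smallest value with at least `percent`% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]


def percentile_rank(values, value):
    """Percentage of observations that are less than or equal to `value`."""
    at_or_below = sum(1 for v in values if v <= value)
    return 100.0 * at_or_below / len(values)


print(percentile(ages, 50))       # value at the 50th percentile
print(percentile_rank(ages, 35))  # share of ages at or below 35
```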
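Top Hits: typically used after bucketing to fetch the top matching documents in each bucket, i.e. the detail records.
For example, group by gender and, as a sub-aggregation, sort the documents in each group by balance.
Example: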
GET bank/account/_search
{
"size": 0,
"aggs": {
"group_by_gender": {
"terms": {
"field": "gender.keyword"
},
"aggs": {
"top_employee": {
"top_hits": {
"size": 2,
"_source": ["gender","balance"],
"sort": [
{
"balance": {
"order": "desc"
}
}
]
}
}
}
}
}
}
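In plain Python terms, the request above groups the documents by gender and then keeps the two highest-balance documents in each group. The following sketch mirrors that logic on a handful of made-up accounts; the helper name `top_hits_by` is hypothetical and not part of any Elasticsearch client.

```python
from collections import defaultdict

# Hypothetical account documents.
accounts = [
    {"gender": "M", "balance": 100},
    {"gender": "F", "balance": 300},
    {"gender": "M", "balance": 250},
    {"gender": "F", "balance": 50},
    {"gender": "M", "balance": 175},
]


def top_hits_by(documents, group_field, sort_field, size):
    """Bucket by `group_field`, then keep the `size` documents with the largest `sort_field`."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc[group_field]].append(doc)
    return {
        key: sorted(docs, key=lambda d: d[sort_field], reverse=True)[:size]
        for key, docs in buckets.items()
    }


top = top_hits_by(accounts, "gender", "balance", size=2)
print(top)
```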
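Bucket: as the name suggests, documents are assigned to different buckets according to a rule so that they can be analyzed by category.
Common Bucket aggregations: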
Terms
Range
Date Range
Histogram
Date Histogram
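Terms: the simplest bucketing strategy, which buckets directly by term. On a text field (which requires fielddata to be enabled on that field) the buckets are built from the analyzed tokens rather than from the whole value.
Example: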
GET /bank/account/_search
{
"size":0,
"aggs": {
"group_by_gender": {
"terms": {
"field": "gender.keyword"
}
}
}
}
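The difference between bucketing a keyword field and a text field can be sketched in a few lines of Python. The `job` field and its values are hypothetical, and lowercasing plus whitespace splitting is only a stand-in for a real Elasticsearch analyzer.

```python
from collections import Counter

# Hypothetical documents with a "job" field.
docs = [
    {"job": "Java Engineer"},
    {"job": "Java Developer"},
    {"job": "Python Engineer"},
]


def terms_keyword(documents, field):
    """Keyword-style bucketing: one bucket per complete field value."""
    return Counter(doc[field] for doc in documents)


def terms_text(documents, field):
    """Text-style bucketing: one bucket per token (naive lowercase/whitespace analysis)."""
    counts = Counter()
    for doc in documents:
        # set() so that a document is counted once per distinct token.
        counts.update(set(doc[field].lower().split()))
    return counts


print(terms_keyword(docs, "job"))
print(terms_text(docs, "job"))
```

This is why the examples in this article aggregate on `gender.keyword` and `state.keyword` instead of the analyzed text fields.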
Range: buckets defined by specifying numeric ranges
Date Range: buckets defined by specifying date ranges
Example:
# Range
GET /bank/account/_search
{
"size": 0,
"aggs": {
"range_age": {
"range": {
"field": "age",
"ranges": [
{
"to": 25
},
{
"from": 25,
"to": 35
},
{
"from": 35
}
]
}
}
}
}
#Date Range:
GET myindex/_search
{
"size": 0,
"aggs": {
"data_range": {
"date_range": {
"field": "", # left empty in the source; set this to a date field in your index
"format": "MM-yyyy",
"ranges": [
{
"from": "now-10d/d",
"to": "now"
}
]
}
}
}
}
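The bucketing rule behind the range request earlier in this section (and, with dates in place of numbers, behind date_range) is that `from` is inclusive and `to` is exclusive, so a value of exactly 25 lands in the 25 to 35 bucket and not in the one below it. A minimal sketch with invented ages:

```python
def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    result = []
    for r in ranges:
        lower = r.get("from", float("-inf"))
        upper = r.get("to", float("inf"))
        count = sum(1 for v in values if lower <= v < upper)
        result.append({"from": r.get("from"), "to": r.get("to"), "doc_count": count})
    return result


# Hypothetical ages and the same three ranges as the request above.
ages = [20, 24, 25, 30, 34, 35, 40]
ranges = [{"to": 25}, {"from": 25, "to": 35}, {"from": 35}]
buckets = range_buckets(ages, ranges)
print(buckets)
```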
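Histogram: a histogram, which splits the data into buckets at a fixed interval
Date Histogram: a histogram over dates, and one of the most commonly used aggregations in time-series analysis
Example:
# Histogram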
GET bank/account/_search
{
"size": 0,
"aggs": {
"hist_age": {
"histogram": {
"field": "age",
"interval": 10,
"extended_bounds":{
"min":10,
"max":50
}
}
}
}
}
#Date Histogram
GET myindex/_search
{
"size": 0,
"aggs": {
"by_year": {
"date_histogram": {
"field": "date",
"interval": "month",
"format": "yyyy-MM-dd"
}
}
}
}
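For the numeric histogram, each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval, and extended_bounds forces buckets to exist across the given range even when they are empty. The sketch below illustrates that rule with invented ages. It is a simplification: it ignores the offset option and does not fill gaps between non-empty buckets unless bounds are given.

```python
import math
from collections import Counter


def histogram(values, interval, min_bound=None, max_bound=None):
    """Bucket key = floor(value / interval) * interval; bounds add empty buckets."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    if min_bound is not None and max_bound is not None:
        # Mimic extended_bounds: make sure every bucket in [min_bound, max_bound] exists.
        key = math.floor(min_bound / interval) * interval
        while key <= max_bound:
            counts.setdefault(key, 0)
            key += interval
    return dict(sorted(counts.items()))


# Hypothetical ages, with the same interval and bounds as the request above.
ages = [21, 23, 28, 36, 39]
hist = histogram(ages, 10, min_bound=10, max_bound=50)
print(hist)
```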
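Bucket aggregations can be analyzed further through sub-aggregations. A sub-aggregation can be either a Bucket or a Metric aggregation, which is what makes aggregation analysis in es so powerful. The two requests below nest first a range aggregation and then a stats aggregation inside a terms aggregation: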
GET /bank/account/_search
{
"size": 0,
"aggs": {
"group_by_gender": {
"terms": {
"field": "gender.keyword"
},
"aggs": {
"range_age": {
"range": {
"field": "age",
"ranges": [
{
"from": 20,
"to": 30
},
{
"from":30,
"to":40
}
]
}
}
}
}
}
}
GET /bank/account/_search
{
"size":0,
"aggs": {
"group_by_gender": {
"terms": {
"field": "gender.keyword"
},
"aggs": {
"stats_age": {
"stats": {
"field": "age"
}
}
}
}
}
}
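Pipeline: runs a further aggregation over the results of another aggregation, and supports chaining.
The output of a Pipeline aggregation is merged into the existing results. Depending on where the output is placed, Pipeline aggregations fall into two groups:
Parent: the result is embedded in the buckets of the existing aggregation
  Derivative
  Moving Average
  Cumulative Sum
Sibling: the result sits at the same level as the existing aggregation
  Max/Min/Avg/Sum Bucket
  Stats/Extended Stats Bucket
  Percentiles Bucket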
Sibling - Min Bucket: finds the bucket with the smallest value among all buckets and returns its key and value
GET /bank/account/_search
{
"size": 0,
"aggs": {
"group_by_stats": {
"terms": {
"field": "state.keyword"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
},
"min_avg_by_balance":{
"min_bucket": {
"buckets_path": "group_by_stats>average_balance"
}
}
}
}
The counterparts max_bucket, avg_bucket, sum_bucket, stats_bucket, and so on are used in the same way.
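Conceptually, a sibling pipeline aggregation such as min_bucket just scans the buckets produced by another aggregation. The sketch below shows the idea on made-up per-state averages; the numbers are invented for illustration and are not query results.

```python
# Hypothetical output of the terms + avg aggregation: one average per state bucket.
buckets = [
    {"key": "TX", "average_balance": 27000.0},
    {"key": "MD", "average_balance": 19500.0},
    {"key": "ID", "average_balance": 24300.0},
]


def min_bucket(bucket_list, metric):
    """Return the smallest metric value and the keys of the bucket(s) holding it."""
    lowest = min(b[metric] for b in bucket_list)
    return {
        "value": lowest,
        "keys": [b["key"] for b in bucket_list if b[metric] == lowest],
    }


min_avg_by_balance = min_bucket(buckets, "average_balance")
print(min_avg_by_balance)
```

Swapping `min` for `max`, or for a sum or mean over the metric values, gives the corresponding idea behind max_bucket, sum_bucket, and avg_bucket.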
Parent - Derivative: computes the derivative of the bucket values
Example:
GET bank/account/_search
{
"size":0,
"aggs": {
"hist_city": {
"histogram": {
"field": "age",
"interval": 5
},
"aggs":{
"avg_balance":{
"avg": {
"field": "balance"
}
},
"derivative_avg_balance":{
"derivative": {
"buckets_path": "avg_balance"
}
}
}
}
}
}
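For evenly spaced histogram buckets, the derivative is simply the difference between each bucket's value and the previous bucket's value, and the first bucket has no derivative because nothing precedes it. A minimal sketch with invented averages:

```python
def derivative(values):
    """First difference between consecutive buckets; the first bucket gets None."""
    if not values:
        return []
    return [None] + [cur - prev for prev, cur in zip(values, values[1:])]


# Hypothetical avg_balance values, one per age bucket.
avg_balance = [100.0, 130.0, 120.0, 150.0]
derivative_avg_balance = derivative(avg_balance)
print(derivative_avg_balance)
```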
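Moving Average (moving_avg) and Cumulative Sum (cumulative_sum) are used in the same way as derivative. Note that moving_avg was deprecated in later Elasticsearch releases in favor of moving_fn.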
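The two calculations can be sketched as follows. The moving average here is a simple trailing window that includes the current bucket, which is an assumption made for the illustration; the exact windowing in Elasticsearch depends on the chosen model and on the version.

```python
from itertools import accumulate


def cumulative_sum(values):
    """Running total across buckets."""
    return list(accumulate(values))


def moving_average(values, window):
    """Simple trailing average over at most `window` buckets, current bucket included."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


# Hypothetical per-bucket metric values.
bucket_values = [10, 20, 30, 40]
print(cumulative_sum(bucket_values))
print(moving_average(bucket_values, window=2))
```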
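By default, an aggregation in es operates on the result set of the query. The scope can be changed in the following ways:
filter: sets a filter condition for a single aggregation, changing its scope without altering the overall query
post_filter: filters the returned documents, but only takes effect after the aggregations have been computed
global: ignores the query's filter conditions and analyzes all documents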
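The three scope options can be sketched as request bodies. They are written here as Python dictionaries that serialize to the JSON body of a `_search` request; the aggregation names and the gender value "F" are assumptions chosen for illustration, while the field names follow the bank examples used elsewhere in this article.

```python
import json

# filter: only "female_only" is restricted; "all_avg_balance" still sees every query hit.
filter_body = {
    "size": 0,
    "aggs": {
        "female_only": {
            "filter": {"term": {"gender.keyword": "F"}},
            "aggs": {"avg_balance": {"avg": {"field": "balance"}}},
        },
        "all_avg_balance": {"avg": {"field": "balance"}},
    },
}

# post_filter: the aggregation covers all genders, but only matching hits are returned.
post_filter_body = {
    "aggs": {"group_by_gender": {"terms": {"field": "gender.keyword"}}},
    "post_filter": {"term": {"gender.keyword": "F"}},
}

# global: "everyone" ignores the query and aggregates over all documents.
global_body = {
    "size": 0,
    "query": {"term": {"gender.keyword": "F"}},
    "aggs": {
        "female_avg_balance": {"avg": {"field": "balance"}},
        "everyone": {
            "global": {},
            "aggs": {"avg_balance": {"avg": {"field": "balance"}}},
        },
    },
}

for body in (filter_body, post_filter_body, global_body):
    print(json.dumps(body, indent=2))
```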
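Buckets can be sorted by the following built-in keys, or by the value of a sub-aggregation:
_count: the document count
_key: the key value
Example: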
GET bank/account/_search
{
"size":0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 5,
"order": {
"_count": "desc"
}
}
}
}
}
GET /bank/account/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 10,
"order": {
"avg_balance": "asc"
}
},
"aggs": {
"avg_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
GET /bank/account/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 10,
"order": {
"stats_balance.avg": "asc"
}
},
"aggs": {
"stats_balance": {
"stats": {
"field": "balance"
}
}
}
}
}
}
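Because the data is spread across shards, the counts returned by terms can be inaccurate: each shard returns only its own top terms, and those partial lists are then merged. There are a few ways to mitigate this; the example here raises shard_size so that each shard returns more candidate terms.
Example: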
GET bank/account/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 2,
"shard_size": 10
}
}
}
}
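The response of a terms aggregation includes the following two statistics:
doc_count_error_upper_bound: the maximum possible document count of a term that was left out of the results
sum_other_doc_count: the total document count of all terms other than those in the returned buckets
Setting show_term_doc_count_error shows the maximum possible counting error for each bucket: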
GET bank/account/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 2,
"show_term_doc_count_error": true
}
}
}
}
In the response, "doc_count_error_upper_bound": 0 means that the count is exact.
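The default shard_size is: shard_size = (size * 1.5) + 10
Raising shard_size lowers doc_count_error_upper_bound and so improves accuracy, but it increases the overall amount of computation and therefore slows down the response.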
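The effect of shard_size can be reproduced with a toy model in which each shard reports only its top shard_size terms and the partial lists are merged. The shard contents below are invented, and truncating the default formula to an integer is an assumption of this sketch.

```python
from collections import Counter


def default_shard_size(size):
    """Default from the text: shard_size = (size * 1.5) + 10 (integer truncation assumed)."""
    return int(size * 1.5 + 10)


def terms_agg(shards, size, shard_size):
    """Merge each shard's top `shard_size` terms, then keep the overall top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical per-shard term counts for an index with two shards.
shards = [
    {"a": 10, "b": 9, "c": 8},
    {"c": 9, "d": 8, "a": 1},
]

# With shard_size=1 the first shard never reports "c", so the merged winner is wrong.
too_small = terms_agg(shards, size=1, shard_size=1)
# With shard_size=3 every term is reported and the true top term is found.
large_enough = terms_agg(shards, size=1, shard_size=3)
print(default_shard_size(2), too_small, large_enough)
```

In this toy data the true top term is "c" with 17 documents, yet the small shard_size run reports "a" with 10, which is the kind of error that doc_count_error_upper_bound is meant to bound.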