Besides search, Elasticsearch can run statistical analysis over the data stored in ES. Aggregations make it easy to compute statistics, analyses, and calculations over that data.
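Basic syntax
An aggregation request is structured like other queries: under aggs, each aggregation is given a name, an aggregation type with its options, and optionally nested aggs of its own.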
The examples below use the following index mapping (for the index aggs_index) and 20 sample documents.
{
"mappings": {
"properties": {
"age": {
"type": "integer"
},
"job": {
"fields": {
"keyword": {
"ignore_above": 50,
"type": "keyword"
}
},
"type": "text"
},
"name": {
"type": "keyword"
},
"salary": {
"type": "integer"
},
"sex": {
"type": "keyword"
}
}
}
}
[
{ "name" : "李四","age":41,"job":"Dev Manager","sex":"male","salary": 50000},
{ "name" : "绯色","age":36,"job":"Java Developer","sex":"female","salary":38000 },
{ "name" : "埃斯基","age":33,"job":"Java Developer","sex":"male","salary":28000},
{ "name" : "张三","age":32,"job":"Manager","sex":"female","salary":35000 },
{ "name" : "王佛为","age":32,"job":"Java Developer","sex":"male","salary":22000 },
{ "name" : "马里奥","age":32,"job":"Javascript Developer","sex":"male","salary": 25000},
{ "name" : "马路","age":31,"job":"UI","sex":"female","salary": 25000},
{ "name" : "李佛尔","age":31,"job":"Java Developer","sex":"male","salary": 32000},
{ "name" : "应善","age":30,"job":"Java Developer","sex":"female","salary":30000 },
{ "name" : "坦克","age":30,"job":"DBA","sex":"male","salary": 30000},
{ "name" : "王五","age":25,"job":"Designer","sex":"male","salary":18000 },
{ "name" : "坤坤","age":26,"job":"Designer","sex":"female","salary": 22000},
{ "name" : "王超","age":25,"job":"UI","sex":"female","salary":18000 },
{ "name" : "李飞","age":27,"job":"UI","sex":"male","salary":20000 },
{ "name" : "万五千","age":27,"job":"Java Developer","sex":"male","salary": 20000},
{ "name" : "李讲萨","age":20,"job":"Java Developer","sex":"male","salary": 9000},
{ "name" : "海坤","age":21,"job":"Javascript Developer","sex":"male","salary": 16000},
{ "name" : "奥特","age":25,"job":"Javascript Developer","sex":"male","salary": 16000},
{ "name" : "图图","age":29,"job":"Javascript Developer","sex":"female","salary": 20000},
{ "name" : "李澎","age":29,"job":"DBA","sex":"female","salary": 20000}
]
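Bucket aggregations assign documents to buckets according to a rule, so that the documents end up classified. Some common bucket aggregations provided by ES:
Terms: the field must be aggregatable. A keyword field works directly; a text field needs fielddata enabled and is then bucketed by its analyzed tokens.
Bucket aggregations are useful in many scenarios, for example:
Terms Aggregation: groups documents by the value of a given field and computes the document count, or other metrics, for each group.
Commonly configured options include field, size, and order.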
Example: count the people aged 33 or older (age >= 33) in each job, sorted by count in ascending order
GET aggs_index/_search
{
"size": 0,
"aggs": {
"cardinate_job": {
"terms": {
"field": "job.keyword",
"order": {
"_count": "asc"
}
}
}
},
"query": {
"range": {
"age": {
"gte": 33
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"cardinate_job" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Dev Manager",
"doc_count" : 1
},
{
"key" : "Java Developer",
"doc_count" : 2
}
]
}
}
}
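Java implementation:
// Imports used by this and the following snippets. client, INDEX_NAME and LOGGER
// are defined elsewhere in the project and are not shown here.
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.Aggregation;
import org.elasticsearch.search.aggregations.AggregationBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.BucketOrder;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.builder.SearchSourceBuilder;

/**
 * Counts the documents with age >= 33, grouped by job.
 */
@RequestMapping(value = "/query_terms", method = RequestMethod.GET)
@ApiOperation(value = "Aggregation - query_terms")
public void qtQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Restrict the documents to age >= 33
    QueryBuilder queryBuilder = QueryBuilders.rangeQuery("age").gte(33);
    // Field to bucket on (the keyword sub-field, not the analyzed text field)
    String jobField = "job.keyword";
    // Aggregation name
    String cardinate_job = "cardinate_job";
    // Sort buckets by document count: true = asc, false = desc
    BucketOrder order = BucketOrder.count(true);
    AggregationBuilder terms = AggregationBuilders.terms(cardinate_job).field(jobField).order(order);
    // size(0): return only the aggregation results, no hits
    builder.size(0).query(queryBuilder).aggregation(terms);
    // Run the search and log each bucket
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Terms cardData = (Terms) map.get(cardinate_job);
    List<? extends Terms.Bucket> buckets = cardData.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        LOGGER.info("key:{}; doc_count:{};", bucket.getKey().toString(), bucket.getDocCount());
    }
}
Output: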
key:Dev Manager; doc_count:1;
key:Java Developer; doc_count:2;
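Notes:
1. The query sets "size": 0, which means only the aggregation results are returned, not the matching documents.
2. The terms aggregation runs on job.keyword rather than on job. In the mapping, job is a text field, and aggregating directly on a text field requires enabling fielddata on it (as shown below). Even then, terms aggregations on job.keyword and on job yield different numbers of buckets: a text field is bucketed by its analyzed tokens, whereas a keyword field is bucketed by each document's whole value.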
PUT /aggs_index/_mapping
{
"properties" : {
"job":{
"type": "text",
"fielddata": true
}
}
}
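The difference between bucketing on job.keyword and on job can be reproduced outside Elasticsearch. The following is a minimal plain-Java sketch, not Elasticsearch code: the class and method names are made up for illustration, and lowercasing plus splitting on whitespace is an assumed simplification of what the standard analyzer does to these job titles.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TermsBucketSketch {
    // job values of the 20 sample documents with their document counts
    static final Map<String, Integer> JOBS = new LinkedHashMap<>();
    static {
        JOBS.put("Dev Manager", 1);
        JOBS.put("Java Developer", 7);
        JOBS.put("Manager", 1);
        JOBS.put("Javascript Developer", 4);
        JOBS.put("UI", 3);
        JOBS.put("DBA", 2);
        JOBS.put("Designer", 2);
    }

    // keyword-style bucketing: one bucket per whole field value
    static Map<String, Integer> byWholeValue() {
        return new TreeMap<>(JOBS);
    }

    // text-style bucketing: one bucket per token (assumed tokenizer: lowercase + whitespace split)
    static Map<String, Integer> byToken() {
        Map<String, Integer> buckets = new TreeMap<>();
        for (Map.Entry<String, Integer> e : JOBS.entrySet()) {
            // a Set, so a document counts at most once per token
            Set<String> tokens = new HashSet<>(Arrays.asList(e.getKey().toLowerCase().split("\\s+")));
            for (String token : tokens) {
                buckets.merge(token, e.getValue(), Integer::sum);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println("whole values: " + byWholeValue());
        System.out.println("tokens:       " + byToken());
    }
}
```

Under these assumptions the whole-value view has 7 buckets while the token view has 8, and a token such as developer collects documents from both Java Developer and Javascript Developer.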
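Range Aggregation: groups documents into the specified ranges and computes statistics for the documents in each range.
Example: count the people in three salary ranges: 0-10000, 10000-20000, and 20000+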
GET aggs_index/_search
{
"size": 0,
"aggs": {
"aggs_salary": {
"range": {
"field": "salary",
"ranges": [
{
"from": 0,
"to": 10000
},{
"from": 10000,
"to": 20000
},{
"from": 20000
}
]
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"aggs_salary" : {
"buckets" : [
{
"key" : "0.0-10000.0",
"from" : 0.0,
"to" : 10000.0,
"doc_count" : 1
},
{
"key" : "10000.0-20000.0",
"from" : 10000.0,
"to" : 20000.0,
"doc_count" : 4
},
{
"key" : "20000.0-*",
"from" : 20000.0,
"doc_count" : 15
}
]
}
}
}
Java implementation:
/**
 * Counts the people in each salary range.
 */
@RequestMapping(value = "/range", method = RequestMethod.GET)
@ApiOperation(value = "Aggregation - range")
public void rangeQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Field to aggregate on
    String salary = "salary";
    // Aggregation name
    String range_salary = "range_salary";
    // Each range includes its "from" value and excludes its "to" value
    AggregationBuilder range = AggregationBuilders.range(range_salary)
            .field(salary)
            .addRange(0, 10000)
            .addRange(10000, 20000)
            .addRange(20000, Double.MAX_VALUE);
    builder.aggregation(range);
    // Run the search and log each bucket
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Range rangeData = (Range) map.get(range_salary);
    List<? extends Range.Bucket> buckets = rangeData.getBuckets();
    for (Range.Bucket bucket : buckets) {
        LOGGER.info("key:{}; doc_count:{};", bucket.getKey().toString(), bucket.getDocCount());
    }
}
Output:
key:0.0-10000.0; doc_count:1;
key:10000.0-20000.0; doc_count:4;
key:20000.0-1.7976931348623157E308; doc_count:15;
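The counts above depend on each range including its from value and excluding its to value, which is why a salary of exactly 20000 lands in the last bucket and not in 10000-20000. Here is a small plain-Java sketch of that rule over the sample salaries; the class and method names are hypothetical and it does not call Elasticsearch.

```java
class RangeBucketSketch {
    // salary values of the 20 sample documents
    static final int[] SALARIES = {
        50000, 38000, 28000, 35000, 22000, 25000, 25000, 32000, 30000, 30000,
        18000, 22000, 18000, 20000, 20000, 9000, 16000, 16000, 20000, 20000};

    // counts the values v with from <= v < to
    static int count(double from, double to) {
        int n = 0;
        for (int s : SALARIES) {
            if (s >= from && s < to) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("0-10000:     " + count(0, 10000));
        System.out.println("10000-20000: " + count(10000, 20000));
        System.out.println("20000-*:     " + count(20000, Double.POSITIVE_INFINITY));
    }
}
```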
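Histogram Aggregation: groups documents into buckets of a fixed interval and computes statistics for the documents in each bucket.
Example: bucket the salaries with an interval of 5000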
GET aggs_index/_search
{
"size": 0,
"aggs": {
"agg_his": {
"histogram": {
"field": "salary",
"interval": 5000
}
}
}
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"agg_his" : {
"buckets" : [
{
"key" : 5000.0,
"doc_count" : 1
},
{
"key" : 10000.0,
"doc_count" : 0
},
{
"key" : 15000.0,
"doc_count" : 4
},
{
"key" : 20000.0,
"doc_count" : 6
},
{
"key" : 25000.0,
"doc_count" : 3
},
{
"key" : 30000.0,
"doc_count" : 3
},
{
"key" : 35000.0,
"doc_count" : 2
},
{
"key" : 40000.0,
"doc_count" : 0
},
{
"key" : 45000.0,
"doc_count" : 0
},
{
"key" : 50000.0,
"doc_count" : 1
}
]
}
}
}
Java implementation:
@RequestMapping(value = "/Histogram", method = RequestMethod.GET)
@ApiOperation(value = "Aggregation - Histogram")
public void histogramQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Field to aggregate on
    String salary = "salary";
    // Aggregation name
    String histogram_salary = "histogram_salary";
    AggregationBuilder his = AggregationBuilders.histogram(histogram_salary).field(salary).interval(5000);
    builder.aggregation(his);
    // Run the search and log each bucket
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Histogram histogramData = (Histogram) map.get(histogram_salary);
    List<? extends Histogram.Bucket> buckets = histogramData.getBuckets();
    for (Histogram.Bucket bucket : buckets) {
        LOGGER.info("key:{}; doc_count:{};", bucket.getKey().toString(), bucket.getDocCount());
    }
}
Output:
key:5000.0; doc_count:1;
key:10000.0; doc_count:0;
key:15000.0; doc_count:4;
key:20000.0; doc_count:6;
key:25000.0; doc_count:3;
key:30000.0; doc_count:3;
key:35000.0; doc_count:2;
key:40000.0; doc_count:0;
key:45000.0; doc_count:0;
key:50000.0; doc_count:1;
That is quite a few buckets, some of them empty. Setting min_doc_count to 1 returns only the buckets that contain at least one document.
GET aggs_index/_search
{
"size": 0,
"aggs": {
"agg_his": {
"histogram": {
"field": "salary",
"interval": 5000,
"min_doc_count": 1
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"agg_his" : {
"buckets" : [
{
"key" : 5000.0,
"doc_count" : 1
},
{
"key" : 15000.0,
"doc_count" : 4
},
{
"key" : 20000.0,
"doc_count" : 6
},
{
"key" : 25000.0,
"doc_count" : 3
},
{
"key" : 30000.0,
"doc_count" : 3
},
{
"key" : 35000.0,
"doc_count" : 2
},
{
"key" : 50000.0,
"doc_count" : 1
}
]
}
}
}
The bucket with key=5000 has a count of 1, meaning one person falls in [5000, 10000). To make the buckets start at 0, you can try setting the min and max of extended_bounds:
GET aggs_index/_search
{
"size": 0,
"aggs": {
"agg_his": {
"histogram": {
"field": "salary",
"interval": 5000,
"extended_bounds": {
"min": 0,
"max": 20000
}
}
}
}
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"agg_his" : {
"buckets" : [
{
"key" : 0.0,
"doc_count" : 0
},
{
"key" : 5000.0,
"doc_count" : 1
},
{
"key" : 10000.0,
"doc_count" : 0
},
{
"key" : 15000.0,
"doc_count" : 4
},
{
"key" : 20000.0,
"doc_count" : 6
},
{
"key" : 25000.0,
"doc_count" : 3
},
{
"key" : 30000.0,
"doc_count" : 3
},
{
"key" : 35000.0,
"doc_count" : 2
},
{
"key" : 40000.0,
"doc_count" : 0
},
{
"key" : 45000.0,
"doc_count" : 0
},
{
"key" : 50000.0,
"doc_count" : 1
}
]
}
}
}
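Notice that the bounds only partly took effect: the buckets now start at 0, but buckets above the max of 20000 are still returned.
extended_bounds: this parameter does not filter documents or buckets. It only forces the histogram to create buckets across the given min..max range, even empty ones. Unlike hard_bounds, it still computes and returns the buckets for data that lies outside that range. To actually limit the range of buckets, use hard_bounds (where your Elasticsearch version supports it) or restrict the documents with a range query.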
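The three histogram responses above follow from one small rule: with an offset of 0, a value's bucket key is the value rounded down to a multiple of the interval. The plain-Java sketch below reproduces the bucket lists under that rule; it is an illustration with hypothetical names, not Elasticsearch's implementation, and it models min_doc_count and extended_bounds only as far as this example needs.

```java
import java.util.TreeMap;

class HistogramSketch {
    // salary values of the 20 sample documents
    static final int[] SALARIES = {
        50000, 38000, 28000, 35000, 22000, 25000, 25000, 32000, 30000, 30000,
        18000, 22000, 18000, 20000, 20000, 9000, 16000, 16000, 20000, 20000};

    // bucket key with offset 0: round the value down to a multiple of the interval
    static double bucketKey(double value, double interval) {
        return Math.floor(value / interval) * interval;
    }

    // histogram without extended bounds: buckets span only the data's own range
    static TreeMap<Double, Integer> histogram(double interval, int minDocCount) {
        return histogram(interval, minDocCount, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY);
    }

    // extMin/extMax mimic extended_bounds: they can add empty buckets but never drop data
    static TreeMap<Double, Integer> histogram(double interval, int minDocCount,
                                              double extMin, double extMax) {
        TreeMap<Double, Integer> buckets = new TreeMap<>();
        for (int s : SALARIES) {
            buckets.merge(bucketKey(s, interval), 1, Integer::sum);
        }
        double lo = Math.min(buckets.firstKey(), bucketKey(extMin, interval));
        double hi = Math.max(buckets.lastKey(), bucketKey(extMax, interval));
        // fill the gaps so that empty buckets show up with a count of 0
        for (double k = lo; k <= hi; k += interval) {
            buckets.putIfAbsent(k, 0);
        }
        // min_doc_count: drop buckets holding fewer documents than requested
        buckets.values().removeIf(count -> count < minDocCount);
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(histogram(5000, 0));
        System.out.println(histogram(5000, 1));
        System.out.println(histogram(5000, 0, 0, 20000));
    }
}
```

With bounds of 0 and 20000 the sketch adds the empty bucket at 0 but keeps everything up to 50000, which matches the behaviour seen in the response.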
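Metrics aggregations compute numeric metrics. They run statistical calculations over a field, such as the average, sum, minimum, maximum, and count, on top of the documents matched by the query, which supports deeper analysis of the data.
Avg / Sum / Min / Max Aggregation
Example: compute the average, total, minimum, and maximum salary for each job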
GET aggs_index/_search
{
"size": 0,
"aggs": {
"term_job": {
"terms": {
"field": "job.keyword"
},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
},
"sum_salary": {
"sum": {
"field": "salary"
}
},
"max_salary": {
"max": {
"field": "salary"
}
},
"min_salary": {
"min": {
"field": "salary"
}
}
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"term_job" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"max_salary" : {
"value" : 38000.0
},
"sum_salary" : {
"value" : 179000.0
},
"min_salary" : {
"value" : 9000.0
},
"avg_salary" : {
"value" : 25571.428571428572
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"max_salary" : {
"value" : 25000.0
},
"sum_salary" : {
"value" : 77000.0
},
"min_salary" : {
"value" : 16000.0
},
"avg_salary" : {
"value" : 19250.0
}
},
{
"key" : "UI",
"doc_count" : 3,
"max_salary" : {
"value" : 25000.0
},
"sum_salary" : {
"value" : 63000.0
},
"min_salary" : {
"value" : 18000.0
},
"avg_salary" : {
"value" : 21000.0
}
},
{
"key" : "DBA",
"doc_count" : 2,
"max_salary" : {
"value" : 30000.0
},
"sum_salary" : {
"value" : 50000.0
},
"min_salary" : {
"value" : 20000.0
},
"avg_salary" : {
"value" : 25000.0
}
},
{
"key" : "Designer",
"doc_count" : 2,
"max_salary" : {
"value" : 22000.0
},
"sum_salary" : {
"value" : 40000.0
},
"min_salary" : {
"value" : 18000.0
},
"avg_salary" : {
"value" : 20000.0
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"max_salary" : {
"value" : 50000.0
},
"sum_salary" : {
"value" : 50000.0
},
"min_salary" : {
"value" : 50000.0
},
"avg_salary" : {
"value" : 50000.0
}
},
{
"key" : "Manager",
"doc_count" : 1,
"max_salary" : {
"value" : 35000.0
},
"sum_salary" : {
"value" : 35000.0
},
"min_salary" : {
"value" : 35000.0
},
"avg_salary" : {
"value" : 35000.0
}
}
]
}
}
}
Java implementation:
@RequestMapping(value = "/termsQuery", method = RequestMethod.GET, produces = "text/html;charset=UTF-8")
@ApiOperation(value = "Aggregation - salary statistics per job")
public void termsQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Field to bucket on
    String job = "job.keyword";
    // Name of the bucket aggregation
    String terms_job = "term_job";
    // Bucket by job
    AggregationBuilder job_terms = AggregationBuilders.terms(terms_job).field(job);
    // Field the metrics are computed on
    String salary = "salary";
    // 1. Average salary, nested inside the job buckets
    String avg_salary = "avg_salary";
    AggregationBuilder avg = AggregationBuilders.avg(avg_salary).field(salary);
    job_terms.subAggregation(avg);
    // 2. Maximum salary
    String max_salary = "max_salary";
    AggregationBuilder max = AggregationBuilders.max(max_salary).field(salary);
    job_terms.subAggregation(max);
    // 3. Minimum salary
    String min_salary = "min_salary";
    AggregationBuilder min = AggregationBuilders.min(min_salary).field(salary);
    job_terms.subAggregation(min);
    // 4. Total salary
    String sum_salary = "sum_salary";
    AggregationBuilder sum = AggregationBuilders.sum(sum_salary).field(salary);
    job_terms.subAggregation(sum);
    // Only aggregation results are needed, so return no hits
    builder.size(0).aggregation(job_terms);
    // Run the search
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    // Print each bucket with its metrics
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Terms termsData = (Terms) map.get(terms_job);
    List<? extends Terms.Bucket> buckets = termsData.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        Map<String, Aggregation> salaryMap = bucket.getAggregations().asMap();
        System.out.println("key:" + bucket.getKey().toString() + "; doc_count:" + bucket.getDocCount() + ";");
        Avg avgSalary = (Avg) salaryMap.get(avg_salary);
        Max maxSalary = (Max) salaryMap.get(max_salary);
        Sum sumSalary = (Sum) salaryMap.get(sum_salary);
        Min minSalary = (Min) salaryMap.get(min_salary);
        System.out.println("Average salary: " + avgSalary.getValue());
        System.out.println("Max salary: " + maxSalary.getValue());
        System.out.println("Min salary: " + minSalary.getValue());
        System.out.println("Total salary: " + sumSalary.getValue() + "\n");
    }
}
Printed output:
key:Java Developer; doc_count:7;
Average salary: 25571.428571428572
Max salary: 38000.0
Min salary: 9000.0
Total salary: 179000.0
key:Javascript Developer; doc_count:4;
Average salary: 19250.0
Max salary: 25000.0
Min salary: 16000.0
Total salary: 77000.0
key:UI; doc_count:3;
Average salary: 21000.0
Max salary: 25000.0
Min salary: 18000.0
Total salary: 63000.0
key:DBA; doc_count:2;
Average salary: 25000.0
Max salary: 30000.0
Min salary: 20000.0
Total salary: 50000.0
key:Designer; doc_count:2;
Average salary: 20000.0
Max salary: 22000.0
Min salary: 18000.0
Total salary: 40000.0
key:Dev Manager; doc_count:1;
Average salary: 50000.0
Max salary: 50000.0
Min salary: 50000.0
Total salary: 50000.0
key:Manager; doc_count:1;
Average salary: 35000.0
Max salary: 35000.0
Min salary: 35000.0
Total salary: 35000.0
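Stats Aggregation: computes the count, average, sum, minimum, and maximum in a single aggregation.
Example: compute the average, total, minimum, and maximum salary for each job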
GET aggs_index/_search
{
"size": 0,
"aggs": {
"term_job": {
"terms": {
"field": "job.keyword"
},
"aggs": {
"stats_salary": {
"stats": {
"field": "salary"
}
}
}
}
}
}
Response:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"term_job" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"stats_salary" : {
"count" : 7,
"min" : 9000.0,
"max" : 38000.0,
"avg" : 25571.428571428572,
"sum" : 179000.0
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"stats_salary" : {
"count" : 4,
"min" : 16000.0,
"max" : 25000.0,
"avg" : 19250.0,
"sum" : 77000.0
}
},
{
"key" : "UI",
"doc_count" : 3,
"stats_salary" : {
"count" : 3,
"min" : 18000.0,
"max" : 25000.0,
"avg" : 21000.0,
"sum" : 63000.0
}
},
{
"key" : "DBA",
"doc_count" : 2,
"stats_salary" : {
"count" : 2,
"min" : 20000.0,
"max" : 30000.0,
"avg" : 25000.0,
"sum" : 50000.0
}
},
{
"key" : "Designer",
"doc_count" : 2,
"stats_salary" : {
"count" : 2,
"min" : 18000.0,
"max" : 22000.0,
"avg" : 20000.0,
"sum" : 40000.0
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"stats_salary" : {
"count" : 1,
"min" : 50000.0,
"max" : 50000.0,
"avg" : 50000.0,
"sum" : 50000.0
}
},
{
"key" : "Manager",
"doc_count" : 1,
"stats_salary" : {
"count" : 1,
"min" : 35000.0,
"max" : 35000.0,
"avg" : 35000.0,
"sum" : 35000.0
}
}
]
}
}
}
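For readers who know the Java standard library, the stats aggregation is close in spirit to DoubleSummaryStatistics, which also yields count, min, max, average, and sum from one pass. The sketch below applies it to the seven Java Developer salaries from the sample data; it is only an analogy with a hypothetical class name and does not involve the Elasticsearch client.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

class StatsSketch {
    // one pass over the values yields count, min, max, average and sum together
    static DoubleSummaryStatistics stats(double... values) {
        return DoubleStream.of(values).summaryStatistics();
    }

    public static void main(String[] args) {
        // salaries of the seven "Java Developer" sample documents
        DoubleSummaryStatistics s = stats(38000, 28000, 22000, 32000, 30000, 20000, 9000);
        System.out.println("count=" + s.getCount() + " min=" + s.getMin() + " max=" + s.getMax()
                + " avg=" + s.getAverage() + " sum=" + s.getSum());
    }
}
```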
Java implementation:
@RequestMapping(value = "/statsQuery", method = RequestMethod.GET)
@ApiOperation(value = "Aggregation - salary statistics per job using stats")
public void statsSalaryQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Field to bucket on
    String job = "job.keyword";
    // Name of the bucket aggregation
    String terms_job = "term_job";
    // Bucket by job
    AggregationBuilder job_terms = AggregationBuilders.terms(terms_job).field(job);
    // Field the metrics are computed on
    String salary = "salary";
    // Name of the stats aggregation (count, min, max, avg, sum in one go)
    String stats_salary = "stats_salary";
    AggregationBuilder stats = AggregationBuilders.stats(stats_salary).field(salary);
    // Nest the stats aggregation inside the job buckets
    job_terms.subAggregation(stats);
    // Only aggregation results are needed, so return no hits
    builder.size(0).aggregation(job_terms);
    // Run the search
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    // Print each bucket with its statistics
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Terms termsData = (Terms) map.get(terms_job);
    List<? extends Terms.Bucket> buckets = termsData.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        Map<String, Aggregation> salaryMap = bucket.getAggregations().asMap();
        System.out.println("key:" + bucket.getKey().toString() + "; doc_count:" + bucket.getDocCount() + ";");
        Stats statsSalary = (Stats) salaryMap.get(stats_salary);
        System.out.println("Average salary: " + statsSalary.getAvg());
        System.out.println("Max salary: " + statsSalary.getMax());
        System.out.println("Min salary: " + statsSalary.getMin());
        System.out.println("Total salary: " + statsSalary.getSum() + "\n");
    }
}
Output:
key:Java Developer; doc_count:7;
Average salary: 25571.428571428572
Max salary: 38000.0
Min salary: 9000.0
Total salary: 179000.0
key:Javascript Developer; doc_count:4;
Average salary: 19250.0
Max salary: 25000.0
Min salary: 16000.0
Total salary: 77000.0
key:UI; doc_count:3;
Average salary: 21000.0
Max salary: 25000.0
Min salary: 18000.0
Total salary: 63000.0
key:DBA; doc_count:2;
Average salary: 25000.0
Max salary: 30000.0
Min salary: 20000.0
Total salary: 50000.0
key:Designer; doc_count:2;
Average salary: 20000.0
Max salary: 22000.0
Min salary: 18000.0
Total salary: 40000.0
key:Dev Manager; doc_count:1;
Average salary: 50000.0
Max salary: 50000.0
Min salary: 50000.0
Total salary: 50000.0
key:Manager; doc_count:1;
Average salary: 35000.0
Max salary: 35000.0
Min salary: 35000.0
Total salary: 35000.0
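Extended Stats Aggregation: builds on the stats aggregation and adds the sum of squares, the variance, and the standard deviation.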
GET aggs_index/_search
{
"size": 0,
"aggs": {
"term_job": {
"terms": {
"field": "job.keyword"
},
"aggs": {
"stats_salary": {
"extended_stats": {
"field": "salary"
}
}
}
}
}
}
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"term_job" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"stats_salary" : {
"count" : 7,
"min" : 9000.0,
"max" : 38000.0,
"avg" : 25571.428571428572,
"sum" : 179000.0,
"sum_of_squares" : 5.117E9,
"variance" : 7.710204081632654E7,
"std_deviation" : 8780.776777502464,
"std_deviation_bounds" : {
"upper" : 43132.982126433504,
"lower" : 8009.875016423644
}
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"stats_salary" : {
"count" : 4,
"min" : 16000.0,
"max" : 25000.0,
"avg" : 19250.0,
"sum" : 77000.0,
"sum_of_squares" : 1.537E9,
"variance" : 1.36875E7,
"std_deviation" : 3699.6621467371856,
"std_deviation_bounds" : {
"upper" : 26649.324293474372,
"lower" : 11850.675706525628
}
}
},
{
"key" : "UI",
"doc_count" : 3,
"stats_salary" : {
"count" : 3,
"min" : 18000.0,
"max" : 25000.0,
"avg" : 21000.0,
"sum" : 63000.0,
"sum_of_squares" : 1.349E9,
"variance" : 8666666.666666666,
"std_deviation" : 2943.920288775949,
"std_deviation_bounds" : {
"upper" : 26887.8405775519,
"lower" : 15112.159422448101
}
}
},
{
"key" : "DBA",
"doc_count" : 2,
"stats_salary" : {
"count" : 2,
"min" : 20000.0,
"max" : 30000.0,
"avg" : 25000.0,
"sum" : 50000.0,
"sum_of_squares" : 1.3E9,
"variance" : 2.5E7,
"std_deviation" : 5000.0,
"std_deviation_bounds" : {
"upper" : 35000.0,
"lower" : 15000.0
}
}
},
{
"key" : "Designer",
"doc_count" : 2,
"stats_salary" : {
"count" : 2,
"min" : 18000.0,
"max" : 22000.0,
"avg" : 20000.0,
"sum" : 40000.0,
"sum_of_squares" : 8.08E8,
"variance" : 4000000.0,
"std_deviation" : 2000.0,
"std_deviation_bounds" : {
"upper" : 24000.0,
"lower" : 16000.0
}
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"stats_salary" : {
"count" : 1,
"min" : 50000.0,
"max" : 50000.0,
"avg" : 50000.0,
"sum" : 50000.0,
"sum_of_squares" : 2.5E9,
"variance" : 0.0,
"std_deviation" : 0.0,
"std_deviation_bounds" : {
"upper" : 50000.0,
"lower" : 50000.0
}
}
},
{
"key" : "Manager",
"doc_count" : 1,
"stats_salary" : {
"count" : 1,
"min" : 35000.0,
"max" : 35000.0,
"avg" : 35000.0,
"sum" : 35000.0,
"sum_of_squares" : 1.225E9,
"variance" : 0.0,
"std_deviation" : 0.0,
"std_deviation_bounds" : {
"upper" : 35000.0,
"lower" : 35000.0
}
}
}
]
}
}
}
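The extra fields in this response can be checked by hand. The numbers are consistent with a population variance (sum of squares divided by the count, minus the squared average) and with std_deviation_bounds placed two standard deviations either side of the average. The plain-Java sketch below recomputes them under those assumptions; the class and method names are hypothetical.

```java
class ExtendedStatsSketch {
    // returns {variance, stdDeviation, lowerBound, upperBound}
    // assumptions: population variance, bounds at avg +/- 2 standard deviations
    static double[] extended(double... values) {
        double sum = 0;
        double sumOfSquares = 0;
        for (double v : values) {
            sum += v;
            sumOfSquares += v * v;
        }
        double avg = sum / values.length;
        double variance = sumOfSquares / values.length - avg * avg;
        double std = Math.sqrt(variance);
        return new double[] {variance, std, avg - 2 * std, avg + 2 * std};
    }

    public static void main(String[] args) {
        // salaries of the seven "Java Developer" sample documents
        double[] r = extended(38000, 28000, 22000, 32000, 30000, 20000, 9000);
        System.out.println("variance=" + r[0] + " std_deviation=" + r[1]
                + " lower=" + r[2] + " upper=" + r[3]);
    }
}
```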
Cardinality Aggregation: counts the distinct values of a field.
Example: count how many distinct salaries each job has
GET aggs_index/_search
{
"size": 0,
"aggs": {
"term_job": {
"terms": {
"field": "job.keyword"
},
"aggs": {
"card_salary": {
"cardinality": {
"field": "salary"
}
}
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"term_job" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"card_salary" : {
"value" : 7
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"card_salary" : {
"value" : 3
}
},
{
"key" : "UI",
"doc_count" : 3,
"card_salary" : {
"value" : 3
}
},
{
"key" : "DBA",
"doc_count" : 2,
"card_salary" : {
"value" : 2
}
},
{
"key" : "Designer",
"doc_count" : 2,
"card_salary" : {
"value" : 2
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"card_salary" : {
"value" : 1
}
},
{
"key" : "Manager",
"doc_count" : 1,
"card_salary" : {
"value" : 1
}
}
]
}
}
}
Java implementation:
@RequestMapping(value = "/cardinality", method = RequestMethod.GET)
@ApiOperation(value = "Aggregation - distinct salaries per job")
public void cardinalityQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Field to bucket on
    String job = "job.keyword";
    // Name of the bucket aggregation
    String terms_job = "term_job";
    // Bucket by job
    AggregationBuilder job_terms = AggregationBuilders.terms(terms_job).field(job);
    // Field the metric is computed on
    String salary = "salary";
    // Name of the cardinality aggregation (distinct salary values)
    String card_salary = "card_salary";
    AggregationBuilder card = AggregationBuilders.cardinality(card_salary).field(salary);
    // Nest the cardinality aggregation inside the job buckets
    job_terms.subAggregation(card);
    // Only aggregation results are needed, so return no hits
    builder.size(0).aggregation(job_terms);
    // Run the search
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    // Print each bucket with its distinct count
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Terms termsData = (Terms) map.get(terms_job);
    List<? extends Terms.Bucket> buckets = termsData.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        Map<String, Aggregation> salaryMap = bucket.getAggregations().asMap();
        System.out.println("key:" + bucket.getKey().toString() + "; doc_count:" + bucket.getDocCount() + ";");
        Cardinality cardSalary = (Cardinality) salaryMap.get(card_salary);
        System.out.println("Distinct salaries: " + cardSalary.getValue() + "\n");
    }
}
Output:
key:Java Developer; doc_count:7;
Distinct salaries: 7
key:Javascript Developer; doc_count:4;
Distinct salaries: 3
key:UI; doc_count:3;
Distinct salaries: 3
key:DBA; doc_count:2;
Distinct salaries: 2
key:Designer; doc_count:2;
Distinct salaries: 2
key:Dev Manager; doc_count:1;
Distinct salaries: 1
key:Manager; doc_count:1;
Distinct salaries: 1
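Value Count Aggregation: counts the values of a field, that is, how many non-null values the aggregated documents contain. Unlike cardinality it does not deduplicate, so a value that occurs several times is counted every time. It is typically used to see how many documents actually carry a value for the field.
Example: count the non-null salary values for each job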
GET aggs_index/_search
{
"size": 0,
"aggs": {
"term_job": {
"terms": {
"field": "job.keyword"
},
"aggs": {
"value_salary": {
"value_count": {
"field": "salary"
}
}
}
}
}
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"term_job" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"value_salary" : {
"value" : 7
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"value_salary" : {
"value" : 4
}
},
{
"key" : "UI",
"doc_count" : 3,
"value_salary" : {
"value" : 3
}
},
{
"key" : "DBA",
"doc_count" : 2,
"value_salary" : {
"value" : 2
}
},
{
"key" : "Designer",
"doc_count" : 2,
"value_salary" : {
"value" : 2
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"value_salary" : {
"value" : 1
}
},
{
"key" : "Manager",
"doc_count" : 1,
"value_salary" : {
"value" : 1
}
}
]
}
}
}
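Comparing this response with the cardinality response shows the difference between the two metrics: for Javascript Developer, value_count is 4 while cardinality is 3, because two employees share the salary 16000. The plain-Java sketch below mirrors both counts exactly; the names are hypothetical, and note that Elasticsearch's cardinality is an approximate count on large data sets, whereas this sketch is exact.

```java
import java.util.HashSet;
import java.util.Set;

class CountSketch {
    // value_count-style: every non-null value counts, duplicates included
    static int valueCount(Integer... values) {
        int n = 0;
        for (Integer v : values) {
            if (v != null) {
                n++;
            }
        }
        return n;
    }

    // cardinality-style: each distinct non-null value counts once (exact here)
    static int cardinality(Integer... values) {
        Set<Integer> distinct = new HashSet<>();
        for (Integer v : values) {
            if (v != null) {
                distinct.add(v);
            }
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        // salaries of the four "Javascript Developer" sample documents
        System.out.println("value_count: " + valueCount(25000, 16000, 16000, 20000));
        System.out.println("cardinality: " + cardinality(25000, 16000, 16000, 20000));
    }
}
```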
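Scripted Metric Aggregation: computes a metric value with custom scripts.
Its syntax is built from four fixed script keys: init_script, map_script, combine_script, and reduce_script.
In short: 1. init_script creates a container (here an array) on each shard --> 2. map_script runs for each matching document on the shard and collects values into it --> 3. combine_script runs once per shard and summarizes that shard's array --> 4. reduce_script merges the results submitted by all shards and returns the final value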
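The four phases can be imitated in plain Java to make the data flow concrete. The sketch below is only an illustration with hypothetical names: it sums the salaries of male employees among the seven Java Developer sample documents, and the split into two shards is an assumption made for the example (the index in this article has a single shard).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ScriptedMetricSketch {
    static final class Doc {
        final String sex;
        final double salary;

        Doc(String sex, double salary) {
            this.sex = sex;
            this.salary = salary;
        }
    }

    static double maleSalarySum(List<List<Doc>> shards) {
        List<Double> states = new ArrayList<>();          // what each shard hands to the coordinator
        for (List<Doc> shard : shards) {
            List<Double> transactions = new ArrayList<>(); // init_script: one container per shard
            for (Doc d : shard) {                          // map_script: once per document
                transactions.add("male".equals(d.sex) ? d.salary : 0);
            }
            double shardTotal = 0;                         // combine_script: once per shard
            for (double t : transactions) {
                shardTotal += t;
            }
            states.add(shardTotal);
        }
        double total = 0;                                  // reduce_script: merge all shard results
        for (double s : states) {
            total += s;
        }
        return total;
    }

    // the seven "Java Developer" documents, split across two hypothetical shards
    static double javaDeveloperExample() {
        List<Doc> shard0 = Arrays.asList(new Doc("female", 38000), new Doc("male", 28000),
                new Doc("male", 22000), new Doc("male", 32000));
        List<Doc> shard1 = Arrays.asList(new Doc("female", 30000), new Doc("male", 20000),
                new Doc("male", 9000));
        return maleSalarySum(Arrays.asList(shard0, shard1));
    }

    public static void main(String[] args) {
        System.out.println(javaDeveloperExample());
    }
}
```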
Example: compute the total salary of the employees with sex=male in each job
GET aggs_index/_search
{
"size": 0,
"aggs": {
"term_job": {
"terms": {
"field": "job.keyword"
},
"aggs": {
"scripted_salary": {
"scripted_metric": {
"init_script": "state.transactions = []",
"map_script": "state.transactions.add(doc.sex.value=='male' ? doc.salary.value : 0)",
"combine_script": "double price = 0; for(a in state.transactions) {price+=a} return price",
"reduce_script": "double allpro = 0; for (t in states) {allpro+=t} return allpro"
}
}
}
}
}
}
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"term_job" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"scripted_salary" : {
"value" : 111000.0
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"scripted_salary" : {
"value" : 57000.0
}
},
{
"key" : "UI",
"doc_count" : 3,
"scripted_salary" : {
"value" : 20000.0
}
},
{
"key" : "DBA",
"doc_count" : 2,
"scripted_salary" : {
"value" : 30000.0
}
},
{
"key" : "Designer",
"doc_count" : 2,
"scripted_salary" : {
"value" : 18000.0
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"scripted_salary" : {
"value" : 50000.0
}
},
{
"key" : "Manager",
"doc_count" : 1,
"scripted_salary" : {
"value" : 0.0
}
}
]
}
}
}
Java implementation:
@RequestMapping(value = "/scripted_metric", method = RequestMethod.GET)
@ApiOperation(value = "Aggregation - scripted_metric")
public void scriptedQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Field to bucket on
    String job = "job.keyword";
    // Name of the bucket aggregation
    String terms_job = "term_job";
    // Bucket by job
    AggregationBuilder job_terms = AggregationBuilders.terms(terms_job).field(job);
    // Name of the scripted metric (total salary of male employees)
    String scripted_salary = "scripted_salary";
    AggregationBuilder scripted = AggregationBuilders.scriptedMetric(scripted_salary)
            .initScript(new Script("state.transactions = []"))
            .mapScript(new Script("state.transactions.add(doc.sex.value=='male' ? doc.salary.value : 0)"))
            .combineScript(new Script("double price = 0; for(a in state.transactions) {price+=a} return price"))
            .reduceScript(new Script("double allpro = 0; for (t in states) {allpro+=t} return allpro"));
    // Nest the scripted metric inside the job buckets
    job_terms.subAggregation(scripted);
    // Only aggregation results are needed, so return no hits
    builder.size(0).aggregation(job_terms);
    // Run the search
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    // Print each bucket with its scripted value
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Terms termsData = (Terms) map.get(terms_job);
    List<? extends Terms.Bucket> buckets = termsData.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        Map<String, Aggregation> salaryMap = bucket.getAggregations().asMap();
        System.out.println("key:" + bucket.getKey().toString() + "; doc_count:" + bucket.getDocCount() + ";");
        ScriptedMetric scrSalary = (ScriptedMetric) salaryMap.get(scripted_salary);
        System.out.println("Total male salary: " + scrSalary.aggregation().toString() + "\n");
    }
}
Output:
key:Java Developer; doc_count:7;
Total male salary: 111000.0
key:Javascript Developer; doc_count:4;
Total male salary: 57000.0
key:UI; doc_count:3;
Total male salary: 20000.0
key:DBA; doc_count:2;
Total male salary: 30000.0
key:Designer; doc_count:2;
Total male salary: 18000.0
key:Dev Manager; doc_count:1;
Total male salary: 50000.0
key:Manager; doc_count:1;
Total male salary: 0.0
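Top Hits Aggregation: returns the top matching documents within each bucket, much like combining GROUP BY with TOP in SQL. You can choose how the documents are sorted and which of their fields are returned. It is useful when you need the actual documents inside each group.
Example: for each job, return the details of the 3 oldest employees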
GET aggs_index/_search
{
"size": 0,
"aggs": {
"job_term": {
"terms": {
"field": "job.keyword"
},
"aggs": {
"age_top": {
"top_hits": {
"size": 3,
"sort": [{
"age": {
"order": "desc"
}
}]
}
}
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"job_term" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"age_top" : {
"hits" : {
"total" : {
"value" : 7,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "-8SKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "female",
"name" : "绯色",
"job" : "Java Developer",
"salary" : 38000,
"age" : 36
},
"sort" : [
36
]
},
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "_8SKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "male",
"name" : "埃斯基",
"job" : "Java Developer",
"salary" : 28000,
"age" : 33
},
"sort" : [
33
]
},
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "-cSKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "male",
"name" : "王佛为",
"job" : "Java Developer",
"salary" : 22000,
"age" : 32
},
"sort" : [
32
]
}
]
}
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"age_top" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "_sSKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "male",
"name" : "马里奥",
"job" : "Javascript Developer",
"salary" : 25000,
"age" : 32
},
"sort" : [
32
]
},
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "AsSKH40BE9ToH2j_hE7R",
"_score" : null,
"_source" : {
"sex" : "female",
"name" : "图图",
"job" : "Javascript Developer",
"salary" : 20000,
"age" : 29
},
"sort" : [
29
]
},
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "AcSKH40BE9ToH2j_hE7R",
"_score" : null,
"_source" : {
"sex" : "male",
"name" : "奥特",
"job" : "Javascript Developer",
"salary" : 16000,
"age" : 25
},
"sort" : [
25
]
}
]
}
}
},
{
"key" : "UI",
"doc_count" : 3,
"age_top" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "9sSKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "female",
"name" : "马路",
"job" : "UI",
"salary" : 25000,
"age" : 31
},
"sort" : [
31
]
},
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "98SKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "male",
"name" : "李飞",
"job" : "UI",
"salary" : 20000,
"age" : 27
},
"sort" : [
27
]
},
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "9cSKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "female",
"name" : "王超",
"job" : "UI",
"salary" : 18000,
"age" : 25
},
"sort" : [
25
]
}
]
}
}
},
{
"key" : "DBA",
"doc_count" : 2,
"age_top" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "A8SKH40BE9ToH2j_hE7R",
"_score" : null,
"_source" : {
"sex" : "male",
"name" : "坦克",
"job" : "DBA",
"salary" : 30000,
"age" : 30
},
"sort" : [
30
]
},
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "BMSKH40BE9ToH2j_hE7R",
"_score" : null,
"_source" : {
"sex" : "female",
"name" : "李澎",
"job" : "DBA",
"salary" : 20000,
"age" : 29
},
"sort" : [
29
]
}
]
}
}
},
{
"key" : "Designer",
"doc_count" : 2,
"age_top" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "9MSKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "female",
"name" : "坤坤",
"job" : "Designer",
"salary" : 22000,
"age" : 26
},
"sort" : [
26
]
},
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "88SKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "male",
"name" : "王五",
"job" : "Designer",
"salary" : 18000,
"age" : 25
},
"sort" : [
25
]
}
]
}
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"age_top" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "8sSKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "male",
"name" : "李四",
"job" : "Dev Manager",
"salary" : 50000,
"age" : 41
},
"sort" : [
41
]
}
]
}
}
},
{
"key" : "Manager",
"doc_count" : 1,
"age_top" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "aggs_index",
"_type" : "_doc",
"_id" : "8cSKH40BE9ToH2j_hE3R",
"_score" : null,
"_source" : {
"sex" : "female",
"name" : "张三",
"job" : "Manager",
"salary" : 35000,
"age" : 32
},
"sort" : [
32
]
}
]
}
}
}
]
}
}
}
Java implementation:
@RequestMapping(value = "/tophits", method = RequestMethod.GET)
@ApiOperation(value = "Aggregation - Tophits")
public void tophitsQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Fields used by the aggregations
    String job = "job.keyword";
    String age = "age";
    // Aggregation names
    String job_term = "job_term";
    String age_top = "age_top";
    // Bucket by job, then keep the 3 oldest employees of each bucket
    AggregationBuilder jobData = AggregationBuilders.terms(job_term).field(job);
    AggregationBuilder ageData = AggregationBuilders.topHits(age_top)
            .size(3)
            .sort(SortBuilders.fieldSort(age)
                    .order(SortOrder.DESC));
    jobData.subAggregation(ageData);
    builder.aggregation(jobData);
    // Run the search and print the result
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Terms termsData = (Terms) map.get(job_term);
    List<? extends Terms.Bucket> buckets = termsData.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        Map<String, Aggregation> pileMap = bucket.getAggregations().asMap();
        System.out.println("key:" + bucket.getKey().toString() + "; doc_count:" + bucket.getDocCount() + ";");
        TopHits ageTopData = (TopHits) pileMap.get(age_top);
        // Print the documents kept by top_hits
        SearchHits hits = ageTopData.getHits();
        for (SearchHit hit : hits.getHits()) {
            System.out.println(hit.getSourceAsMap().toString());
        }
    }
}
The output is as follows:
key:Java Developer; doc_count:7;
{sex=female, name=绯色, job=Java Developer, salary=38000, age=36}
{sex=male, name=埃斯基, job=Java Developer, salary=28000, age=33}
{sex=male, name=王佛为, job=Java Developer, salary=22000, age=32}
key:Javascript Developer; doc_count:4;
{sex=male, name=马里奥, job=Javascript Developer, salary=25000, age=32}
{sex=female, name=图图, job=Javascript Developer, salary=20000, age=29}
{sex=male, name=奥特, job=Javascript Developer, salary=16000, age=25}
key:UI; doc_count:3;
{sex=female, name=马路, job=UI, salary=25000, age=31}
{sex=male, name=李飞, job=UI, salary=20000, age=27}
{sex=female, name=王超, job=UI, salary=18000, age=25}
key:DBA; doc_count:2;
{sex=male, name=坦克, job=DBA, salary=30000, age=30}
{sex=female, name=李澎, job=DBA, salary=20000, age=29}
key:Designer; doc_count:2;
{sex=female, name=坤坤, job=Designer, salary=22000, age=26}
{sex=male, name=王五, job=Designer, salary=18000, age=25}
key:Dev Manager; doc_count:1;
{sex=male, name=李四, job=Dev Manager, salary=50000, age=41}
key:Manager; doc_count:1;
{sex=female, name=张三, job=Manager, salary=35000, age=32}
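This type of aggregation takes the results of other aggregations and aggregates them again. It usually operates on buckets rather than on documents, as a post-processing computation over the buckets that have already been built.
Pipeline Aggregation is a special type of aggregation in Elasticsearch that further processes and computes on the results of other aggregations.
Pipeline Aggregations fall into two main types: Sibling Aggregation and Parent Aggregation.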
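A Sibling Aggregation sits at the same level as the aggregation it reads from, so the two are siblings with no parent-child relationship. Its input is the output of its sibling aggregation, and its result is returned next to that sibling rather than inside its buckets. Commonly used Sibling Aggregations include min_bucket, max_bucket, avg_bucket, sum_bucket, stats_bucket, extended_stats_bucket, and percentiles_bucket.
A Parent Aggregation is nested inside another aggregation as one of its sub-aggregations. Its input is the output of its parent aggregation, and its result is written into the parent's existing buckets. A typical use is to bucket the documents first and then compute a value for each bucket from the neighbouring buckets. Commonly used Parent Aggregations include derivative and cumulative_sum.
Special case: Bucket Script, which uses a custom script to compute a new value from the metrics of each bucket.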
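Summary:
Pipeline Aggregation is a powerful aggregation type in Elasticsearch that runs further aggregations on results that have already been aggregated. A Sibling Aggregation is placed next to the aggregation it reads from, with no parent-child relationship; a Parent Aggregation is nested inside the aggregation it reads from. Together they make it possible to express more complex computations, such as bucketing the data, computing a metric per bucket, and then aggregating those per-bucket metrics.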
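In a Pipeline Aggregation, the buckets_path parameter points to the aggregation or metric whose output the pipeline aggregation reads. This is how a later aggregation uses the result of an earlier one.
A buckets_path is built from aggregation names and two separators, ">" and ".":
The ">" separator joins aggregation names along the path. For example, my_bucket_agg>my_metric refers to the metric named my_metric inside the bucket aggregation named my_bucket_agg.
The "." separator selects one value of a multi-value metric. For example, my_stats.avg refers to the avg value of a stats aggregation named my_stats.
Paths are relative to the position of the pipeline aggregation, so a path cannot point back up the aggregation tree. Besides aggregation names, the special path _count refers to the document count of each bucket, and a single bucket of a multi-bucket aggregation can be selected by its key, for example my_terms['some_key']>my_metric.
The following example illustrates the buckets_path syntax: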
GET aggs_index/_search
{
  "size": 0,
  "aggs": {
    "terms_job": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "my_bucket": {
      "min_bucket": {
        "buckets_path": "terms_job>avg_salary"
      }
    }
  }
}
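In the example above, a bucket aggregation named terms_job is created first. The my_bucket aggregation then uses buckets_path to reference the avg_salary value inside each terms_job bucket, and returns the job whose average salary is the lowest.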
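To make the structure of a path such as terms_job>avg_salary more concrete, here is a minimal Java sketch that splits a simple buckets_path into its aggregation names and an optional metric. BucketsPathSketch and the name salary_stats are hypothetical and exist only for illustration. The sketch assumes that ">" separates aggregation names and "." selects a value of a multi-value metric, and it deliberately ignores bucket keys and special paths such as _count. It is not how Elasticsearch parses paths internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: splits a simple buckets_path into its parts. */
final class BucketsPathSketch {

    /** Aggregation names along the path, outermost first. */
    static List<String> aggNames(String path) {
        List<String> names = new ArrayList<>(Arrays.asList(path.split(">")));
        int last = names.size() - 1;
        String tail = names.get(last);
        int dot = tail.indexOf('.');
        if (dot >= 0) {
            // Drop the ".metric" suffix from the last aggregation name
            names.set(last, tail.substring(0, dot));
        }
        return names;
    }

    /** Metric selected with '.', or null when the path has no metric suffix. */
    static String metric(String path) {
        String tail = path.substring(path.lastIndexOf('>') + 1);
        int dot = tail.indexOf('.');
        return dot >= 0 ? tail.substring(dot + 1) : null;
    }

    public static void main(String[] args) {
        String simple = "terms_job>avg_salary";
        String multiValue = "terms_job>salary_stats.avg"; // salary_stats is a made-up name
        System.out.println(aggNames(simple) + " metric=" + metric(simple));
        System.out.println(aggNames(multiValue) + " metric=" + metric(multiValue));
    }
}
```

For a single-value metric such as avg the metric part is empty, which is why the queries in this article only need the form aggregation>metric.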
Find the job type with the lowest average salary:
GET aggs_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "min_salary_by_job": {
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}
The query returns:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"jobs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"avg_salary" : {
"value" : 25571.428571428572
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"avg_salary" : {
"value" : 19250.0
}
},
{
"key" : "UI",
"doc_count" : 3,
"avg_salary" : {
"value" : 21000.0
}
},
{
"key" : "DBA",
"doc_count" : 2,
"avg_salary" : {
"value" : 25000.0
}
},
{
"key" : "Designer",
"doc_count" : 2,
"avg_salary" : {
"value" : 20000.0
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"avg_salary" : {
"value" : 50000.0
}
},
{
"key" : "Manager",
"doc_count" : 1,
"avg_salary" : {
"value" : 35000.0
}
}
]
},
"min_salary_by_job" : {
"value" : 19250.0,
"keys" : [
"Javascript Developer"
]
}
}
}
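Analysis: the documents are first bucketed by job, the average salary of each job is computed, and min_bucket then picks the bucket with the lowest average salary.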
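The last step of that analysis works on bucket values, not on documents. The following Java sketch illustrates this with the per-job averages copied from the query result above. MinBucketSketch and its methods are hypothetical names, and the code only imitates the idea of min_bucket; it is not the Elasticsearch implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: mimics what min_bucket does with per-bucket values. */
final class MinBucketSketch {

    /** Per-job average salaries, copied from the query result above. */
    static Map<String, Double> sampleBuckets() {
        Map<String, Double> avgByJob = new LinkedHashMap<>();
        avgByJob.put("Java Developer", 25571.428571428572);
        avgByJob.put("Javascript Developer", 19250.0);
        avgByJob.put("UI", 21000.0);
        avgByJob.put("DBA", 25000.0);
        avgByJob.put("Designer", 20000.0);
        avgByJob.put("Dev Manager", 50000.0);
        avgByJob.put("Manager", 35000.0);
        return avgByJob;
    }

    /** Smallest value across all buckets. */
    static double minValue(Map<String, Double> avgByJob) {
        double min = Double.POSITIVE_INFINITY;
        for (double value : avgByJob.values()) {
            min = Math.min(min, value);
        }
        return min;
    }

    /** Keys of every bucket whose value equals the minimum. */
    static List<String> minKeys(Map<String, Double> avgByJob) {
        double min = minValue(avgByJob);
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Double> entry : avgByJob.entrySet()) {
            if (entry.getValue() == min) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Double> buckets = sampleBuckets();
        // prints: value=19250.0, keys=[Javascript Developer]
        System.out.println("value=" + minValue(buckets) + ", keys=" + minKeys(buckets));
    }
}
```

The result has the same shape as the min_salary_by_job entry in the response: one value plus the list of bucket keys that produced it.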
Java implementation:
@RequestMapping(value = "/pipeline_min", method = RequestMethod.GET)
@ApiOperation(value = "min_bucket")
public void minQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Field to bucket on
    String job = "job.keyword";
    // Name of the terms aggregation
    String terms_job = "term_job";
    // Bucket by job
    AggregationBuilder job_terms = AggregationBuilders.terms(terms_job).field(job);
    // Field to aggregate
    String salary = "salary";
    // 1. Name the metric and compute the average salary
    String avg_salary = "avg_salary";
    AggregationBuilder avgAggs = AggregationBuilders.avg(avg_salary).field(salary);
    // 2. Nest the metric inside the terms buckets
    job_terms.subAggregation(avgAggs);
    // Only aggregation results are needed, so return no hits
    builder.size(0).aggregation(job_terms);
    // Name of the pipeline aggregation
    String bucket_name = "min_salary_by_job";
    // buckets_path: terms aggregation name, ">", metric name
    String buckets_path = terms_job + ">" + avg_salary;
    // Add min_bucket as a sibling of the terms aggregation
    builder.aggregation(PipelineAggregatorBuilders.minBucket(bucket_name, buckets_path));
    // Run the search
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    // Print the average salary of each bucket
    Map<String, Aggregation> map = search.getAggregations().asMap();
    Terms termsData = (Terms) map.get(terms_job);
    List<? extends Terms.Bucket> buckets = termsData.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        Map<String, Aggregation> salaryMap = bucket.getAggregations().asMap();
        System.out.println("key:" + bucket.getKey().toString() + "; doc_count:" + bucket.getDocCount() + ";");
        Avg avg = (Avg) salaryMap.get(avg_salary);
        System.out.println("average salary:" + avg.getValue());
    }
    // Print the bucket with the lowest average salary
    BucketMetricValue minData = (BucketMetricValue) map.get(bucket_name);
    System.out.println("job:" + minData.keys()[0] + "; lowest average salary:" + minData.getValueAsString());
}
The output is as follows:
key:Java Developer; doc_count:7;
average salary:25571.428571428572
key:Javascript Developer; doc_count:4;
average salary:19250.0
key:UI; doc_count:3;
average salary:21000.0
key:DBA; doc_count:2;
average salary:25000.0
key:Designer; doc_count:2;
average salary:20000.0
key:Dev Manager; doc_count:1;
average salary:50000.0
key:Manager; doc_count:1;
average salary:35000.0
job:Javascript Developer; lowest average salary:19250.0
Max_bucket, Avg_bucket, and Sum_bucket are used in the same way as Min_bucket, so they are not repeated here.
Stats_bucket is used in the same way as Min_bucket, except that it returns several statistics in a single result:
"min_salary_by_job" : {
"count" : 7,
"min" : 19250.0,
"max" : 50000.0,
"avg" : 27974.48979591837,
"sum" : 195821.42857142858
}
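Extended_stats_bucket is used in the same way as Stats_bucket. On top of the Stats_bucket values (count, minimum, maximum, average, and sum) it returns further statistics such as the sum of squares, the variance, the standard deviation, and the standard deviation bounds: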
"min_salary_by_job" : {
"count" : 7,
"min" : 19250.0,
"max" : 50000.0,
"avg" : 27974.48979591837,
"sum" : 195821.42857142858,
"sum_of_squares" : 6.215460459183674E9,
"variance" : 1.0535084339858396E8,
"std_deviation" : 10264.055894166982,
"std_deviation_bounds" : {
"upper" : 48502.601584252334,
"lower" : 7446.378007584404
}
}
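The statistics in this result can be checked by hand from the seven per-job average salaries returned earlier. The Java sketch below does that arithmetic. ExtendedStatsBucketSketch is a hypothetical name, and two choices are assumptions inferred from the numbers rather than taken from the Elasticsearch source: the variance is the population variance (mean of squares minus square of the mean), and the bounds are the average plus or minus 2 standard deviations.

```java
/** Illustrative only: recomputes the extended statistics over per-bucket values. */
final class ExtendedStatsBucketSketch {

    /** Per-job average salaries, copied from the earlier query result. */
    static final double[] AVG_SALARIES = {
        25571.428571428572, 19250.0, 21000.0, 25000.0, 20000.0, 50000.0, 35000.0
    };

    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double sumOfSquares(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v * v;
        }
        return total;
    }

    static double mean(double[] values) {
        return sum(values) / values.length;
    }

    /** Population variance: mean of the squares minus the square of the mean. */
    static double variance(double[] values) {
        double m = mean(values);
        return sumOfSquares(values) / values.length - m * m;
    }

    static double stdDeviation(double[] values) {
        return Math.sqrt(variance(values));
    }

    /** Upper bound at mean + sigma standard deviations (sigma = 2 is assumed here). */
    static double upperBound(double[] values, double sigma) {
        return mean(values) + sigma * stdDeviation(values);
    }

    /** Lower bound at mean - sigma standard deviations. */
    static double lowerBound(double[] values, double sigma) {
        return mean(values) - sigma * stdDeviation(values);
    }

    public static void main(String[] args) {
        double[] v = AVG_SALARIES;
        System.out.println("count=" + v.length + ", sum=" + sum(v) + ", avg=" + mean(v));
        System.out.println("variance=" + variance(v) + ", std_deviation=" + stdDeviation(v));
        System.out.println("upper=" + upperBound(v, 2.0) + ", lower=" + lowerBound(v, 2.0));
    }
}
```

Up to floating-point rounding, the printed values agree with the avg, variance, std_deviation, and std_deviation_bounds fields shown above.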
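percentiles_bucket is a sibling pipeline aggregation in Elasticsearch that calculates percentiles across all the buckets of a specified metric in a sibling aggregation.
A percentile is a statistic that gives the value below which a given percentage of the data falls. For example, the 50th percentile is the median: half of the data is smaller than it and half is larger.
The input of percentiles_bucket is not a document field but the per-bucket values of a metric, such as the average salary of each job bucket. It collects one value from every bucket of the sibling aggregation and then computes the requested percentiles over that set of values. The sibling aggregation must be a multi-bucket aggregation and the metric must be numeric.
percentiles_bucket accepts the following parameters: buckets_path (required), the path to the metric whose percentiles are computed; gap_policy (optional, default skip), how to handle buckets with missing data; format (optional), the output format of the values; percents (optional, default [1, 5, 25, 50, 75, 95, 99]), the percentiles to compute; keyed (optional, default true), whether the result is returned as a key-value map instead of an array.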
Percentiles of the average salary of each job type:
GET aggs_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "min_salary_by_job": {
      "percentiles_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}
The query returns:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"jobs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java Developer",
"doc_count" : 7,
"avg_salary" : {
"value" : 25571.428571428572
}
},
{
"key" : "Javascript Developer",
"doc_count" : 4,
"avg_salary" : {
"value" : 19250.0
}
},
{
"key" : "UI",
"doc_count" : 3,
"avg_salary" : {
"value" : 21000.0
}
},
{
"key" : "DBA",
"doc_count" : 2,
"avg_salary" : {
"value" : 25000.0
}
},
{
"key" : "Designer",
"doc_count" : 2,
"avg_salary" : {
"value" : 20000.0
}
},
{
"key" : "Dev Manager",
"doc_count" : 1,
"avg_salary" : {
"value" : 50000.0
}
},
{
"key" : "Manager",
"doc_count" : 1,
"avg_salary" : {
"value" : 35000.0
}
}
]
},
"min_salary_by_job" : {
"values" : {
"1.0" : 19250.0,
"5.0" : 19250.0,
"25.0" : 21000.0,
"50.0" : 25000.0,
"75.0" : 35000.0,
"95.0" : 50000.0,
"99.0" : 50000.0
}
}
}
}
Java implementation:
@RequestMapping(value = "/percentiles_bucket", method = RequestMethod.GET)
@ApiOperation(value = "percentiles_bucket")
public void percentilesQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // Field to bucket on
    String job = "job.keyword";
    // Name of the terms aggregation
    String terms_job = "term_job";
    // Bucket by job
    AggregationBuilder job_terms = AggregationBuilders.terms(terms_job).field(job);
    // Field to aggregate
    String salary = "salary";
    // 1. Name the metric and compute the average salary
    String per_salary = "avg_salary";
    AggregationBuilder avgAggs = AggregationBuilders.avg(per_salary).field(salary);
    // 2. Nest the metric inside the terms buckets
    job_terms.subAggregation(avgAggs);
    // Only aggregation results are needed, so return no hits
    builder.size(0).aggregation(job_terms);
    // Name of the pipeline aggregation
    String bucket_name = "per_salary_by_job";
    // buckets_path: terms aggregation name, ">", metric name
    String buckets_path = terms_job + ">" + per_salary;
    // Add percentiles_bucket as a sibling of the terms aggregation
    builder.aggregation(PipelineAggregatorBuilders.percentilesBucket(bucket_name, buckets_path));
    // Run the search
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    // Print each percentile and its value
    Map<String, Aggregation> map = search.getAggregations().asMap();
    ParsedPercentiles perData = (ParsedPercentiles) map.get(bucket_name);
    Iterator<Percentile> it = perData.iterator();
    while (it.hasNext()) {
        Percentile entry = it.next();
        System.out.println("key:" + entry.getPercent() + "; value:" + entry.getValue());
    }
}
The printed output is as follows:
key:1.0; value:19250.0
key:5.0; value:19250.0
key:25.0; value:21000.0
key:50.0; value:25000.0
key:75.0; value:35000.0
key:95.0; value:50000.0
key:99.0; value:50000.0
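Summary: the difference between two percentiles of the same set of observations is called the interpercentile range. It describes the spread of the share of observations that lies between those two percentiles. In the test above, the per-job average salaries give P25 = 21000.0 and P75 = 35000.0, so P75 - P25 = 14000.0. This means that roughly the middle 50% of the per-job average salaries lie between 21000.0 and 35000.0, with a spread of 14000.0. The interpercentile range can therefore serve as a measure of how dispersed the data is.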
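The percentile values printed above can be reproduced from the seven per-job averages with a nearest-rank rule. The Java sketch below sorts the bucket values and picks the element at index round(percent / 100 * (n - 1)) without interpolation. PercentilesBucketSketch is a hypothetical name, and the rounding rule is an assumption chosen because it matches the numbers in this article; it should not be read as a statement about Elasticsearch internals.

```java
import java.util.Arrays;

/** Illustrative only: nearest-rank percentiles over per-bucket values. */
final class PercentilesBucketSketch {

    /** Per-job average salaries, copied from the query result above. */
    static final double[] AVG_SALARIES = {
        25571.428571428572, 19250.0, 21000.0, 25000.0, 20000.0, 50000.0, 35000.0
    };

    /** Picks one of the input values; no interpolation between neighbours. */
    static double percentile(double[] bucketValues, double percent) {
        double[] sorted = bucketValues.clone();
        Arrays.sort(sorted);
        // Assumed rule: round the fractional rank to the nearest index
        int index = (int) Math.round((percent / 100.0) * (sorted.length - 1));
        return sorted[index];
    }

    public static void main(String[] args) {
        double[] percents = {1.0, 5.0, 25.0, 50.0, 75.0, 95.0, 99.0};
        for (double p : percents) {
            System.out.println("key:" + p + "; value:" + percentile(AVG_SALARIES, p));
        }
        // Interpercentile range between P25 and P75; prints P75 - P25 = 14000.0
        double range = percentile(AVG_SALARIES, 75.0) - percentile(AVG_SALARIES, 25.0);
        System.out.println("P75 - P25 = " + range);
    }
}
```

Because the rule always returns one of the input values, P1 and P5 both land on the smallest average and P95 and P99 both land on the largest one, which is what the result above shows.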
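Derivative Aggregation is a parent pipeline aggregation in Elasticsearch that calculates the derivative of a metric, that is, the difference between the metric's values in consecutive buckets. It can be used to analyze time series data, for example to calculate how much a metric changes from one time interval to the next.
Derivative Aggregation takes the following parameters:
buckets_path: the path to the metric whose derivative is computed. This is the name of an aggregation (optionally a longer path built with the ">" and "." separators), not the name of a document field.
gap_policy: how to handle buckets with missing data. skip (the default) ignores such buckets, while insert_zeros treats the missing value as 0.
format: the output format of the derivative value. The default format can be used, or a custom one can be specified.
Note that Derivative Aggregation only works on numeric values, must be nested inside a parent aggregation of type histogram or date_histogram, and its buckets_path must point to a metric that has already been aggregated (for example with sum or avg).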
Derivative of the total salary per salary range, using an interval of 5000:
GET aggs_index/_search
{
  "size": 0,
  "aggs": {
    "agg_his": {
      "histogram": {
        "field": "salary",
        "interval": 5000,
        "min_doc_count": 0
      },
      "aggs": {
        "sum_sa": {
          "sum": {
            "field": "salary"
          }
        },
        "dvt": {
          "derivative": {
            "buckets_path": "sum_sa",
            "gap_policy": "skip"
          }
        }
      }
    }
  }
}
The query returns:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"agg_his" : {
"buckets" : [
{
"key" : 5000.0,
"doc_count" : 1,
"sum_sa" : {
"value" : 9000.0
}
},
{
"key" : 10000.0,
"doc_count" : 0,
"sum_sa" : {
"value" : 0.0
},
"dvt" : {
"value" : null
}
},
{
"key" : 15000.0,
"doc_count" : 4,
"sum_sa" : {
"value" : 68000.0
},
"dvt" : {
"value" : null
}
},
{
"key" : 20000.0,
"doc_count" : 6,
"sum_sa" : {
"value" : 124000.0
},
"dvt" : {
"value" : 56000.0
}
},
{
"key" : 25000.0,
"doc_count" : 3,
"sum_sa" : {
"value" : 78000.0
},
"dvt" : {
"value" : -46000.0
}
},
{
"key" : 30000.0,
"doc_count" : 3,
"sum_sa" : {
"value" : 92000.0
},
"dvt" : {
"value" : 14000.0
}
},
{
"key" : 35000.0,
"doc_count" : 2,
"sum_sa" : {
"value" : 73000.0
},
"dvt" : {
"value" : -19000.0
}
},
{
"key" : 40000.0,
"doc_count" : 0,
"sum_sa" : {
"value" : 0.0
},
"dvt" : {
"value" : null
}
},
{
"key" : 45000.0,
"doc_count" : 0,
"sum_sa" : {
"value" : 0.0
},
"dvt" : {
"value" : null
}
},
{
"key" : 50000.0,
"doc_count" : 1,
"sum_sa" : {
"value" : 50000.0
},
"dvt" : {
"value" : null
}
}
]
}
}
}
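The pattern of values and nulls in this result can be reproduced with a short Java sketch. It treats every empty bucket (doc_count = 0) as a gap and computes each derivative as the current sum minus the previous one, so a derivative is missing whenever either side is a gap. DerivativeSketch is a hypothetical name, NaN stands in for the null values of the JSON, and the gap handling is an assumption inferred from the output above rather than a description of the Elasticsearch implementation.

```java
import java.util.Arrays;

/** Illustrative only: first-order differences between consecutive buckets. */
final class DerivativeSketch {

    /** doc_count of each 5000-wide salary bucket, copied from the result above. */
    static final long[] DOC_COUNTS = {1, 0, 4, 6, 3, 3, 2, 0, 0, 1};

    /** sum_sa of each bucket, copied from the result above. */
    static final double[] SUMS = {
        9000.0, 0.0, 68000.0, 124000.0, 78000.0, 92000.0, 73000.0, 0.0, 0.0, 50000.0
    };

    /** Returns NaN where no derivative is available (first bucket or next to a gap). */
    static double[] derivative(long[] docCounts, double[] sums) {
        double[] out = new double[sums.length];
        double previous = Double.NaN; // nothing precedes the first bucket
        for (int i = 0; i < sums.length; i++) {
            // Assumed "skip" behaviour: an empty bucket contributes no value
            double current = docCounts[i] == 0 ? Double.NaN : sums[i];
            out[i] = current - previous; // NaN if either side is missing
            previous = current;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: [NaN, NaN, NaN, 56000.0, -46000.0, 14000.0, -19000.0, NaN, NaN, NaN]
        System.out.println(Arrays.toString(derivative(DOC_COUNTS, SUMS)));
    }
}
```

For example, the bucket with key 20000 gets 124000.0 - 68000.0 = 56000.0, while the bucket with key 15000 has no derivative because the bucket before it is empty.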
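In Elasticsearch, cumulative_sum is a parent pipeline aggregation that calculates the cumulative sum of a metric across the buckets of a histogram. For each bucket it adds the bucket's metric value to the running total of the preceding buckets and stores the result in that bucket.
Note: cumulative_sum must be nested inside a histogram or date_histogram aggregation, the metric referenced by buckets_path must be numeric, and the enclosing histogram must have min_doc_count set to 0 (the default for histogram aggregations).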
Bucket by age with an interval of 10 and compute the cumulative sum of the average salary:
GET aggs_index/_search
{
  "size": 0,
  "aggs": {
    "his_age": {
      "histogram": {
        "field": "age",
        "interval": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "cus": {
          "cumulative_sum": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}
The query returns:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"his_age" : {
"buckets" : [
{
"key" : 20.0,
"doc_count" : 10,
"avg_salary" : {
"value" : 17900.0
},
"cus" : {
"value" : 17900.0
}
},
{
"key" : 30.0,
"doc_count" : 9,
"avg_salary" : {
"value" : 29444.444444444445
},
"cus" : {
"value" : 47344.444444444445
}
},
{
"key" : 40.0,
"doc_count" : 1,
"avg_salary" : {
"value" : 50000.0
},
"cus" : {
"value" : 97344.44444444444
}
}
]
}
}
}
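Explanation: a cumulative sum adds each bucket's value to the running total of all preceding buckets and stores the result in that bucket. In the example above, the first bucket (key=20) has an average of 17900.0, so its cumulative value is 0 + 17900.0 = 17900.0. The second bucket (key=30) has an average of 29444.444444444445, so its cumulative value is 17900.0 (the previous cumulative value) + 29444.444444444445 (this bucket's average) = 47344.444444444445. The remaining buckets are accumulated in the same way.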
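The running total described here is simple enough to write out directly. The Java sketch below applies it to the three avg_salary values from the result above; CumulativeSumSketch is a hypothetical name used only for this illustration.

```java
import java.util.Arrays;

/** Illustrative only: running total over per-bucket values. */
final class CumulativeSumSketch {

    /** avg_salary of the age buckets 20, 30 and 40, copied from the result above. */
    static final double[] AVG_SALARIES = {17900.0, 29444.444444444445, 50000.0};

    /** out[i] is the sum of bucketValues[0..i]. */
    static double[] cumulativeSum(double[] bucketValues) {
        double[] out = new double[bucketValues.length];
        double running = 0.0;
        for (int i = 0; i < bucketValues.length; i++) {
            running += bucketValues[i]; // add this bucket to the total so far
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(cumulativeSum(AVG_SALARIES)));
    }
}
```

The three values it prints correspond to the cus fields of the three buckets: about 17900.0, 47344.44, and 97344.44.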
Java implementation:
@RequestMapping(value = "/cumulative_sum", method = RequestMethod.GET)
@ApiOperation(value = "Cumulative_sum")
public void cumulativeQuery() throws Exception {
    // Build the search request body
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // 1. Field names and aggregation names
    String age = "age";               // age field
    String salary = "salary";         // salary field
    String his_age = "his_age";       // name of the histogram aggregation
    String avg_salary = "avg_salary"; // name of the average salary metric
    String bucket_name = "cus";       // name of the cumulative sum
    double interval = 10;             // histogram interval
    HistogramAggregationBuilder histogramAgg = AggregationBuilders.histogram(his_age) // bucket by age
            .field(age)
            .interval(interval)
            .subAggregation(
                    // 2. Average salary of each bucket
                    AggregationBuilders.avg(avg_salary).field(salary)
            )
            .subAggregation(
                    // 3. Cumulative sum of the average salary
                    PipelineAggregatorBuilders.cumulativeSum(bucket_name, avg_salary)
            );
    // 4. Only aggregation results are needed, so return no hits
    builder.size(0).aggregation(histogramAgg);
    // 5. Run the search
    SearchResponse search = client.aggregationSearch(builder, INDEX_NAME);
    // Print each bucket with its average and cumulative value
    ParsedHistogram hisData = (ParsedHistogram) search.getAggregations().asMap().get(his_age);
    for (Histogram.Bucket his : hisData.getBuckets()) {
        Map<String, Aggregation> aggs = his.getAggregations().asMap();
        System.out.println("histogram:key=" + his.getKeyAsString()
                + ";doc_count=" + his.getDocCount()
                + ";avg_value=" + ((ParsedAvg) aggs.get(avg_salary)).getValue()
                + ";cus_value=" + ((ParsedSimpleValue) aggs.get(bucket_name)).value());
    }
}
The printed output is as follows:
histogram:key=20.0;doc_count=10;avg_value=17900.0;cus_value=17900.0
histogram:key=30.0;doc_count=9;avg_value=29444.444444444445;cus_value=47344.444444444445
histogram:key=40.0;doc_count=1;avg_value=50000.0;cus_value=97344.44444444444