Elasticsearch (6): Aggregation Analysis in Elasticsearch

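We have finally reached the last business requirement: letting managers run analytics over the employee directory. Elasticsearch has a feature called aggregations that lets you generate sophisticated analytics over your data. Aggregations are similar to GROUP BY in SQL, but considerably more powerful.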

Basic aggregation

For example, let's find the most popular interests among the employees:
GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}
Ignore the syntax for now and look straight at the result:
{
  "hits": { ... },
  "aggregations": {
    "all_interests": {
      "buckets": [
        {
          "key": "music",
          "doc_count": 2
        },
        {
          "key": "forestry",
          "doc_count": 1
        },
        {
          "key": "sports",
          "doc_count": 1
        }
      ]
    }
  }
}
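We can see that two employees are interested in music, one in forestry, and one in sports. These aggregations are not precalculated; they are generated on the fly from the documents that match the current query. If we want to know the most popular interests among employees named Smith, we can combine the aggregation with an appropriate query, as shown later in this post.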
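To make the bucket counts concrete, here is a minimal in-memory sketch of what a terms aggregation computes. It does not use Elasticsearch at all; the class `TermsAggSketch`, the helper `sampleDocs()`, and the three sample employees are assumptions chosen to reproduce the counts above, and the ordering (highest count first, ties broken by key) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermsAggSketch {

    /** Hypothetical sample data: one list of interests per employee document. */
    static List<List<String>> sampleDocs() {
        return Arrays.asList(
                Arrays.asList("sports", "music"),   // John Smith
                Arrays.asList("music"),             // Jane Smith
                Arrays.asList("forestry"));         // Douglas Fir
    }

    /** Counts how many documents contain each term, like the "buckets" of a terms aggregation. */
    static Map<String, Long> termsAggregation(List<List<String>> docs) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> interests : docs) {
            // A document is counted once per distinct term it contains.
            for (String term : new HashSet<>(interests)) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        // Assumed ordering: doc_count descending, then key ascending.
        List<Map.Entry<String, Long>> buckets = new ArrayList<>(counts.entrySet());
        buckets.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Long> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Long> bucket : buckets) {
            ordered.put(bucket.getKey(), bucket.getValue());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Prints one "key=doc_count" line per bucket.
        termsAggregation(sampleDocs())
                .forEach((key, docCount) -> System.out.println(key + "=" + docCount));
    }
}
```

Nothing is stored ahead of time: the counts are rebuilt from whatever list of documents is passed in, which mirrors the "generated on the fly" behaviour described above.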

Client program demo

Add a method:

// Imports needed at the top of the class file:
import java.util.Map;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.Aggregation;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

/**
 * Find the most popular interests among employees using a terms aggregation.
 * @param client a connected Elasticsearch client
 */
private static void findInterestHobby(Client client) {
    SearchRequestBuilder request = client.prepareSearch("megacorp1")
            .setTypes("employee1")
            .addAggregation(
                    AggregationBuilders.terms("agg1").field("interests")
            );
    SearchResponse response = request.get();
    // Map of aggregation name -> aggregation result
    Map<String, Aggregation> aggs = response.getAggregations().asMap();
    for (Map.Entry<String, Aggregation> entry : aggs.entrySet()) {
        System.out.println("agg name=" + entry.getKey());
        // The only aggregation requested is a terms aggregation
        Terms terms = (Terms) entry.getValue();
        System.out.println("DocCountError=" + terms.getDocCountError());
        System.out.println("SumOfOtherDocCounts=" + terms.getSumOfOtherDocCounts());
        // One bucket per distinct term, with the number of matching documents
        for (Terms.Bucket bucket : terms.getBuckets()) {
            System.out.println(bucket.getKeyAsString() + "=" + bucket.getDocCount());
        }
    }
}

Add the call to the main method:

// 8. Find the most popular interests among employees (aggregation search)
findInterestHobby(client);

Running it produces an error:

Caused by: RemoteTransportException[[111][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: IllegalArgumentException[Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.];

Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default.
...

fielddata

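A quick look at fielddata:
Most fields are indexed by default, which makes them searchable. Sorting, aggregating, and accessing field values in scripts, however, require a different access pattern from search.

Search needs to answer the question "Which documents contain this term?", whereas sorting and aggregations need to answer a different one: "What is the value of this field for this document?"

Most fields can use index-time, on-disk doc_values for this access pattern, but text fields do not support doc_values.
Text fields instead use a query-time, in-memory data structure called fielddata. It is built by reading the entire inverted index of each segment, inverting the term-to-document relationship, and storing the result in memory.

Because this can consume a lot of memory, fielddata is disabled by default; do not enable it unless you really need it.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html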
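The two access patterns can be illustrated with a toy model. The sketch below is only an analogy, not how Lucene or Elasticsearch store data internally: `FielddataSketch`, `sampleDocs()`, and the document ids are hypothetical. It builds a term-to-documents map (what search needs) and then "uninverts" it into a document-to-terms map (what aggregations need).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class FielddataSketch {

    /** Hypothetical documents: id -> terms of the "interests" field. */
    static Map<Integer, List<String>> sampleDocs() {
        Map<Integer, List<String>> docs = new TreeMap<>();
        docs.put(1, Arrays.asList("sports", "music"));
        docs.put(2, Arrays.asList("music"));
        docs.put(3, Arrays.asList("forestry"));
        return docs;
    }

    /** Inverted index: term -> ids of the documents containing it (answers "which documents contain this term?"). */
    static Map<String, SortedSet<Integer>> invertedIndex(Map<Integer, List<String>> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    /** Uninverting: walk the whole index to rebuild doc id -> terms (answers "what are this document's values?"). */
    static Map<Integer, SortedSet<String>> uninvert(Map<String, SortedSet<Integer>> index) {
        Map<Integer, SortedSet<String>> byDoc = new TreeMap<>();
        for (Map.Entry<String, SortedSet<Integer>> posting : index.entrySet()) {
            for (Integer docId : posting.getValue()) {
                byDoc.computeIfAbsent(docId, d -> new TreeSet<>()).add(posting.getKey());
            }
        }
        return byDoc;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = invertedIndex(sampleDocs());
        System.out.println("inverted:   " + index);
        System.out.println("uninverted: " + uninvert(index));
    }
}
```

Note that `uninvert` has to visit every posting of every term and keep the result in memory, which is the intuition behind the memory warning in the error message.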

Here we set fielddata to true on the interests field.
To enable fielddata on an existing text field, update its mapping:

[Figure 1: screenshot of the mapping update that enables fielddata on interests]
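The mapping update itself appears only as an image in this post, so its exact content is not known. As a rough sketch, the Java snippet below just assembles a request path and JSON body in the style of the fielddata reference page linked above. The class `FielddataMappingSketch` and its methods are hypothetical, the index and type names (`megacorp1`, `employee1`) are taken from the client code, and the path format with a mapping type assumes an older Elasticsearch version that still uses types.

```java
class FielddataMappingSketch {

    /** Path of the put-mapping request (assumes a version with mapping types). */
    static String mappingPath(String index, String type) {
        return "/" + index + "/_mapping/" + type;
    }

    /** JSON body that re-declares the field as text with fielddata enabled. */
    static String mappingBody(String field) {
        return "{\n"
                + "  \"properties\": {\n"
                + "    \"" + field + "\": {\n"
                + "      \"type\": \"text\",\n"
                + "      \"fielddata\": true\n"
                + "    }\n"
                + "  }\n"
                + "}";
    }

    public static void main(String[] args) {
        // Print the request so it can be pasted into a REST client such as the Head plugin.
        System.out.println("PUT " + mappingPath("megacorp1", "employee1"));
        System.out.println(mappingBody("interests"));
    }
}
```

Printing the request rather than sending it keeps the sketch independent of any client library; the output can be sent with whichever REST tool is at hand.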
Call the method again; the output is:
agg name=agg1
DocCountError=0
SumOfOtherDocCounts=0
music=11
sports=8
forestry=2

Head plugin example

The response is long, so only the final aggregation result is shown; the hits section is omitted (the same applies below).
[Figures 2 and 3: Head plugin screenshots of this aggregation]

Aggregation with a query condition

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}
The all_interests aggregation now covers only the documents that match the query:

"all_interests": {
  "buckets": [
    {
      "key": "music",
      "doc_count": 2
    },
    {
      "key": "sports",
      "doc_count": 1
    }
  ]
}
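The effect of the query on the buckets can be pictured as "filter first, then count". The sketch below is an in-memory approximation under stated assumptions: `FilteredAggSketch`, `Employee`, and `sampleStaff()` are hypothetical, the lowercase-and-split matching is only a crude stand-in for how a match query analyzes text, and the buckets are ordered by key rather than by count.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class FilteredAggSketch {

    /** Hypothetical, minimal employee document. */
    static final class Employee {
        final String lastName;
        final List<String> interests;

        Employee(String lastName, List<String> interests) {
            this.lastName = lastName;
            this.interests = interests;
        }
    }

    static List<Employee> sampleStaff() {
        return Arrays.asList(
                new Employee("Smith", Arrays.asList("sports", "music")),
                new Employee("Smith", Arrays.asList("music")),
                new Employee("Fir", Arrays.asList("forestry")));
    }

    /** Crude stand-in for a match query: lowercase both sides and compare whole tokens. */
    static boolean matchesLastName(Employee e, String queryText) {
        String wanted = queryText.toLowerCase(Locale.ROOT);
        for (String token : e.lastName.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** Counts interests only over the employees that match the query. */
    static Map<String, Long> interestsOfMatching(List<Employee> staff, String lastName) {
        Map<String, Long> counts = new TreeMap<>();
        for (Employee e : staff) {
            if (!matchesLastName(e, lastName)) {
                continue; // documents outside the query never reach the aggregation
            }
            for (String interest : new HashSet<>(e.interests)) {
                counts.merge(interest, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(interestsOfMatching(sampleStaff(), "Smith"));
    }
}
```

With this sample data the forestry bucket disappears, because the only employee interested in forestry does not match the last-name query.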

Client program demo

We add a query condition to the request part of the previous method, just as we learned earlier:

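SearchRequestBuilder request = client.prepareSearch("megacorp1")
        .setTypes("employee1")
        // Restrict the aggregation to documents whose last_name matches "Smith"
        .setQuery(QueryBuilders.matchQuery("last_name", "Smith"))
        .addAggregation(
                AggregationBuilders.terms("agg1").field("interests")
        );

The rest of the method is unchanged.
Output of the call: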
agg name=agg1
DocCountError=0
SumOfOtherDocCounts=0
music=2
sports=1

Head plugin example

[Figures 4 and 5: Head plugin screenshots of the aggregation with a query condition]

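Hierarchical rollups with aggregations

Aggregations also allow hierarchical rollups. For example, let's find the average age of employees who share a particular interest: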
GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" },
      "aggs": {
        "avg_age": {
          "avg": { "field": "age" }
        }
      }
    }
  }
}

The aggregation result is a bit more complicated, but still fairly easy to understand:

"all_interests": {
  "buckets": [
    {
      "key": "music",
      "doc_count": 2,
      "avg_age": {
        "value": 28.5
      }
    },
    {
      "key": "forestry",
      "doc_count": 1,
      "avg_age": {
        "value": 35
      }
    },
    {
      "key": "sports",
      "doc_count": 1,
      "avg_age": {
        "value": 25
      }
    }
  ]
}

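The output is basically an enriched version of the first aggregation. There is still a list of interests with their counts, but each interest now has an additional avg_age attribute: the average age of all employees who share that interest.
Even if the syntax is not yet clear, it is easy to see that fairly complex aggregations and groupings can be expressed with this feature, and that a wide range of data can be extracted this way.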
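As a worked example of the rollup, the sketch below recomputes avg_age per interest in plain Java. It is an in-memory illustration only; `AvgSubAggSketch`, `Employee`, `Bucket`, and `sampleStaff()` are hypothetical names, and the three employees (ages 25, 32, and 35) are sample data chosen to be consistent with the result above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AvgSubAggSketch {

    /** Hypothetical, minimal employee document. */
    static final class Employee {
        final int age;
        final List<String> interests;

        Employee(int age, List<String> interests) {
            this.age = age;
            this.interests = interests;
        }
    }

    /** One bucket of the outer terms aggregation plus the state for the inner avg. */
    static final class Bucket {
        long docCount;
        long ageSum;

        double avgAge() {
            return (double) ageSum / docCount;
        }
    }

    static List<Employee> sampleStaff() {
        return Arrays.asList(
                new Employee(25, Arrays.asList("sports", "music")),
                new Employee(32, Arrays.asList("music")),
                new Employee(35, Arrays.asList("forestry")));
    }

    /** Outer level: group by interest. Inner level: average the age inside each group. */
    static Map<String, Bucket> avgAgeByInterest(List<Employee> staff) {
        Map<String, Bucket> buckets = new TreeMap<>();
        for (Employee e : staff) {
            for (String interest : new HashSet<>(e.interests)) {
                Bucket bucket = buckets.computeIfAbsent(interest, k -> new Bucket());
                bucket.docCount++;
                bucket.ageSum += e.age;
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Bucket> entry : avgAgeByInterest(sampleStaff()).entrySet()) {
            Bucket b = entry.getValue();
            System.out.println(entry.getKey() + ": doc_count=" + b.docCount + ", avg_age=" + b.avgAge());
        }
    }
}
```

For music the two ages are 25 and 32, so the average is (25 + 32) / 2 = 28.5, matching the avg_age value in the music bucket.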

Client program demo

For this part, see https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/_structuring_aggregations.html

Put simply, you can nest another aggregation under an aggregation.

Add a method:

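/**
 * Sub-aggregation: average age per interest.
 * @param client a connected Elasticsearch client
 */
private static void findAvgInterestHobby(Client client) {
    SearchRequestBuilder request = client.prepareSearch("megacorp1")
            .setTypes("employee1")
            .addAggregation(
                    AggregationBuilders.terms("agg1").field("interests")
                            // Nested under each interest bucket: the average of the age field
                            .subAggregation(AggregationBuilders.avg("avg_age").field("age"))
            );

    SearchResponse response = request.execute().actionGet();
    // For convenience the whole response is printed as a string;
    // it could be walked bucket by bucket as in the first example.
    System.out.println(response.toString());
}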

Add the call to the main method:

// 9. Sub-aggregation
findAvgInterestHobby(client);

Output:
{"took":8,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":13,"max_score":1.0,"hits":[{"_index":"megacorp1","_type":"employee1","_id":"5","_score":1.0,"_source":{"first_name":"John","last_name":"Smith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"8","_score":1.0,"_source":{"first_name":"Jane","last_name":"1 Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}},{"_index":"megacorp1","_type":"employee1","_id":"9","_score":1.0,"_source":{"first_name":"John","last_name":"SmithSmithSmith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"10","_score":1.0,"_source":{"first_name":"John","last_name":"冬瓜核桃","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"12","_score":1.0,"_source":{"first_name":"John","last_name":"蜂蜜","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"2","_score":1.0,"_source":{"first_name":"Jane","last_name":"Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}},{"_index":"megacorp1","_type":"employee1","_id":"4","_score":1.0,"_source":{"first_name":"Douglas1","last_name":"Fir","age":35,"about":"I like to build cabinets","interests":["forestry"]}},{"_index":"megacorp1","_type":"employee1","_id":"6","_score":1.0,"_source":{"first_name":"John","last_name":"Smith 1","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"1","_score":1.0,"_source":{"first_name":"John","last_name":"Smith1","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"7","_score":1.0,"_source":{"first_name":"Jane","last_name":"1Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}}]},"aggregations":{"agg1":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"music","doc_count":11,"avg_age":{"value":26.90909090909091}},{"key":"sports","doc_count":8,"avg_age":{"value":25.0}},{"key":"forestry","doc_count":2,"avg_age":{"value":35.0}}]}}}

Head plugin example

[Figures 6 and 7: Head plugin screenshots of the sub-aggregation]
