Elasticsearch: deduplicating query results by a single field

1. Run a deduplicated query with pagination
2. Deduplicate by aifile.oid and sort by create_time

Approach 1: aggregations

DSL:

GET  /aipage/_search
{
  "query": {
    "match": {
      "status": 0
    }
  },
  "sort": [
    {
      "create_time": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "file_oid": {
      "terms": {
        "field": "aifile.oid",
        "size": 10  // number of deduplicated terms (buckets) to return
      },
      "aggs": {
        "rated": {
          "top_hits": {
            "sort": [
              { "create_time": { "order": "desc" } }
            ],
            "size": 10
          }
        }
      }
    }
  },
  "size": 0,
  "from": 0
}

The size parameter inside the terms aggregation controls how many terms (buckets) are returned in the end (the default is 10).

Result:

......

"aggregations": {
        "file_oid": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 105410,
            "buckets": [
                {
                    "key": "3ABE5618-37D3-447E-91B3-EDA1F8AE4156",
                    "doc_count": 745,
                    "rated": {
                        ......
                    }
                },
                {
                    "key": "980AD126-BAD7-45F0-9037-ACEE21DEFCCC",
                    "doc_count": 624,
                    "rated": {
                        ......
                    }
                }
            ]
        }
    }

As you can see, the deduplicated buckets come back (two are shown here); each key is an aifile.oid value, and the rated list underneath it holds the original documents for that oid.

Ordering with order
The order option controls how the returned buckets are sorted; by default, buckets are sorted by doc_count in descending order.

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_count" : "asc" }
            }
        }
    }
}

Buckets can also be sorted alphabetically by term (_term here; in ES 6 and later the key is _key):

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_term" : "asc" }
            }
        }
    }
}

You can, of course, also order the buckets by a single-value metric sub-aggregation:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "avg_height" : "desc" }
            },
            "aggs" : {
                "avg_height" : { "avg" : { "field" : "height" } }
            }
        }
    }
}

Multi-value metric aggregations are supported as well; you just have to specify which of the metric's values to sort by:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "height_stats.avg" : "desc" }
            },
            "aggs" : {
                "height_stats" : { "stats" : { "field" : "height" } }
            }
        }
    }
}

min_doc_count and shard_min_doc_count
The aggregated field may contain many low-frequency terms; when those terms make up a large share of the data, they cause a lot of unnecessary work.
You can therefore set min_doc_count and shard_min_doc_count to require a minimum document count: only terms that reach it are collected and returned (a minimal sketch follows the list below).

As the names suggest:
min_doc_count filters the final result
shard_min_doc_count filters what each shard returns while the aggregation is being computed
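
A minimal sketch, assuming a tags field like the one used in the missing-value example further down; only terms that appear in at least 10 documents are returned:

{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "min_doc_count": 10
            }
        }
    }
}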

Multi-field aggregation

A terms aggregation normally targets a single field: the aggregation keeps its terms in a hash table, and aggregating on several fields at once would mean bucketing every combination of values, which multiplies memory use.

For multiple fields, however, ES offers two options (a script-based sketch follows this list):

1 Use a script to combine the fields
2 Use copy_to to merge the two fields into a new field, then run an ordinary single-field aggregation on that field.
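
A rough sketch of the script approach, reusing the gender and height fields from the earlier examples; the exact script syntax varies by ES version (this form assumes Painless), and both fields need doc values:

{
    "aggs" : {
        "gender_height" : {
            "terms" : {
                "script" : {
                    "lang": "painless",
                    "source": "doc['gender'].value + '_' + doc['height'].value"
                }
            }
        }
    }
}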

collect mode

Sub-aggregations can be computed in two ways:

  • depth_first: sub-aggregations are computed immediately, as documents are collected
  • breadth_first: the buckets of the current aggregation are computed (and pruned) first, and sub-aggregations are then computed only for the surviving buckets

By default ES uses depth-first, but you can switch to breadth-first manually, for example:

{
    "aggs" : {
        "actors" : {
             "terms" : {
                 "field" : "actors",
                 "size" : 10,
                 "collect_mode" : "breadth_first"
             },
            "aggs" : {
                "costars" : {
                     "terms" : {
                         "field" : "actors",
                         "size" : 5
                     }
                 }
            }
         }
    }
}

Missing value

The missing parameter specifies how documents that do not have the field are handled; they are put into the bucket for the given value:

{
    "aggs" : {
        "tags" : {
             "terms" : {
                 "field" : "tags",
                 "missing": "N/A" 
             }
         }
    }
}

Approach 2: field collapsing (collapse)

GET  /aipage/_search
{
  "query": {
    "match": {
      "status": 0
    }
  },
  "sort": [
    {
      "create_time": {
        "order": "desc"
      }
    }
  ],
  "collapse":{
        "field":"aifile.oid"
  },
  "size": 2,
  "from": 0
}

Note: size is 2 here, no longer 0; with size 0 the query would return no hits.
Result:

{
    ......
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": null,
        "hits": [
            {
                ......
                "_source": {
                    "aifile": {
                        "oid": "C545E593-4D24-43C1-929F-49B45BE575A4",
                        ......
                    },
                    ......
                },
                "fields": {
                    "aifile.oid": [
                        "C545E593-4D24-43C1-929F-49B45BE575A4"
                    ]
                },
                "sort": [
                    1569185780000
                ]
            },
            {
                ......
                "_source": {
                    "aifile": {
                        "oid": "F1D1DCD0-37EA-4A33-8A8D-A4E94AE61CBC",
                        ......
                    },
                   ......
                },
                "fields": {
                    "aifile.oid": [
                        "F1D1DCD0-37EA-4A33-8A8D-A4E94AE61CBC"
                    ]
                },
                "sort": [
                    1569184367000
                ]
            }
        ]
    }
}
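
By default, collapse returns only the single top document per aifile.oid. If, like the top_hits sub-aggregation in approach 1, you also want a few documents per group, collapse supports inner_hits; a minimal sketch (the inner_hits name by_oid is just an illustration):

GET  /aipage/_search
{
  "query": {
    "match": {
      "status": 0
    }
  },
  "sort": [
    {
      "create_time": {
        "order": "desc"
      }
    }
  ],
  "collapse": {
    "field": "aifile.oid",
    "inner_hits": {
      "name": "by_oid",
      "size": 3,
      "sort": [
        { "create_time": { "order": "desc" } }
      ]
    }
  },
  "size": 2,
  "from": 0
}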

Approach 2 compared with approach 1:

simpler;
much better performance (no terms/top_hits buckets have to be built).

Java implementation
1) Counting the number of distinct values

import java.net.InetSocketAddress;

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.cardinality.Cardinality;
import org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityBuilder;

public class EsTest {
    public static void main(String[] args) {
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "elasticsearch") // cluster name
                .put("client.transport.ignore_cluster_name", true) // skip cluster-name validation; connect even if the name does not match
                .build();
        TransportClient client = TransportClient.builder().settings(settings).build()
                .addTransportAddress(new InetSocketTransportAddress(new InetSocketAddress("101.10.32.1", 9300)));

        // cardinality aggregation: approximate count of distinct uid values
        CardinalityBuilder cardinalityBuilder = AggregationBuilders.cardinality("uid_aggs").field("uid");

        SearchRequestBuilder request = client.prepareSearch("user_onoffline_log")
                .setTypes("logs")
                .setSearchType(SearchType.QUERY_THEN_FETCH)
                .setQuery(QueryBuilders.boolQuery()
                        .must(QueryBuilders.termQuery("uid", ""))) // filter condition (placeholder value)
                .addAggregation(cardinalityBuilder)
                .setSize(1);

        SearchResponse response = request.execute().actionGet();

        // read the distinct count from the cardinality aggregation
        Cardinality distinctUids = response.getAggregations().get("uid_aggs");
        System.out.println(distinctUids.getValue());
    }
}
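
For reference, the DSL counterpart of the cardinality count above looks roughly like this (a sketch; the aggregation and field names are the ones used in the Java example):

{
  "size": 0,
  "aggs": {
    "uid_aggs": {
      "cardinality": {
        "field": "uid"
      }
    }
  }
}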

2) Returning the deduplicated documents

import java.net.InetSocketAddress;

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.aggregations.AggregationBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.metrics.tophits.TopHits;
import org.elasticsearch.search.sort.SortOrder;

public class EsTest {
    public static void main(String[] args) {
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "elasticsearch") // cluster name
                .put("client.transport.ignore_cluster_name", true) // skip cluster-name validation; connect even if the name does not match
                .build();
        TransportClient client = TransportClient.builder().settings(settings).build()
                .addTransportAddress(new InetSocketTransportAddress(new InetSocketAddress("101.10.32.1", 9300)));

        // terms aggregation on uid with a top_hits sub-aggregation:
        // one bucket per distinct uid, keeping only the latest document by offline_time
        AggregationBuilder aggregationBuilder = AggregationBuilders
                .terms("uid_aggs").field("uid").size(10000)
                .subAggregation(AggregationBuilders.topHits("uid_top")
                        .addSort("offline_time", SortOrder.DESC)
                        .setSize(1));

        SearchRequestBuilder request = client.prepareSearch("user_onoffline_log")
                .setTypes("logs")
                .setSearchType(SearchType.QUERY_THEN_FETCH)
                .setQuery(QueryBuilders.boolQuery()
                        .must(QueryBuilders.termQuery("uid", ""))) // filter condition (placeholder value)
                .addAggregation(aggregationBuilder)
                .setSize(1);

        SearchResponse response = request.execute().actionGet();

        // walk the buckets and print the top hit of each one
        Terms uidTerms = response.getAggregations().get("uid_aggs");
        for (Terms.Bucket entry : uidTerms.getBuckets()) {
            TopHits top = entry.getAggregations().get("uid_top");
            for (SearchHit hit : top.getHits()) {
                System.out.println(hit.getSource());
            }
        }
    }
}
