DISTINCT/GROUP BY
Compare the execution times of deduplication queries written with DISTINCT and with GROUP BY, to measure how the two perform in different scenarios:

表名 Table | Rows | Distinct values in queried column | DISTINCT | GROUP BY |
---|---|---|---|---|
tab_1 | 100000 | 3 | | |
tab_2 | 100000 | 10000 | | |
drop table if exists tab_1;
create table tab_1(
id int,
value int
)
row format delimited
fields terminated by '\t';
drop table if exists tab_2;
create table tab_2 like tab_1;
Test dataset tab_1.txt:
1 1
2 1
3 1
4 3
5 1
...
99997 3
99998 2
99999 3
100000 2
Test dataset tab_2.txt:
1 3715
2 7211
3 4909
4 2913
5 9839
...
99997 2884
99998 698
99999 4839
100000 2101
Load each dataset into its table:
load data local inpath '/tmp/hive/data/tbl/tab_1.txt' overwrite into table tab_1;
load data local inpath '/tmp/hive/data/tbl/tab_2.txt' overwrite into table tab_2;
Disable automatic local mode, so every query runs as a full MapReduce job:
hive> set hive.exec.mode.local.auto = false;
Manually set the number of reducers:
hive> set mapreduce.job.reduces = 3;
Run the test queries and record the execution times:
select distinct(value) from tab_1; -- 31.335s
select value from tab_1 group by value; -- 31.587s
select distinct(value) from tab_2; -- 32.376s
select value from tab_2 group by value; -- 33.834s
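As a sanity check on the distinct-value counts used in the comparison table, the cardinality of each `value` column can be verified directly before timing the queries:

```sql
-- Count distinct values in each table's value column.
select count(distinct value) from tab_1;  -- 3, per the table above
select count(distinct value) from tab_2;  -- 10000, per the table above
```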
0: jdbc:hive2://hadoop101:10000/default (test)> explain select distinct(value) from tab_1;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tab_1 |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: value (type: int) |
| outputColumnNames: value |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: value (type: int) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: int) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: int) |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: int) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
0: jdbc:hive2://hadoop101:10000/default (test)> explain select value from tab_1 group by value;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: tab_1 |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: value (type: int) |
| outputColumnNames: value |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: value (type: int) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: int) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: int) |
| Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator |
| keys: KEY._col0 (type: int) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+
Results:

表名 Table | Rows | Distinct values in queried column | DISTINCT | GROUP BY |
---|---|---|---|---|
tab_1 | 100000 | 3 | 31.335s | 31.587s |
tab_2 | 100000 | 10000 | 32.376s | 33.834s |
In Hive 2.3.0, deduplicating with DISTINCT and deduplicating with GROUP BY produce identical execution plans (both compile to a single MapReduce job with a map-side hash Group By Operator and a reduce-side mergepartial Group By Operator, as shown above), and their execution times are roughly the same, so there is essentially no performance difference between the two.
The experimental procedure and conclusion above are offered for reference only; corrections and suggestions are welcome.
PS: Take care when using aggregate functions in Hive. For a global aggregation (one without a GROUP BY), Hive generally runs the final aggregation in a single reducer, even if the reducer count has been set manually. On a large input, that lone reducer can become a bottleneck or fail outright. In such cases, split the aggregation into two steps with a subquery: in the inner query, group by some column and aggregate within each group, so that multiple reducers share the work; then, in the outer query, aggregate the partial results into the final answer.
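The staged aggregation described above can be sketched in HiveQL as follows (a minimal sketch: the table `big_tab` and column `id` are hypothetical names, not part of this experiment):

```sql
-- Single-reducer global aggregation: all rows funnel through
-- one reducer, regardless of mapreduce.job.reduces.
select count(distinct id) from big_tab;

-- Staged alternative: the inner GROUP BY deduplicates id across
-- multiple reducers; the outer query then only has to count the
-- much smaller deduplicated result.
select count(*)
from (
  select id
  from big_tab
  group by id
) t;
```

The inner query partitions rows by `id`, so the expensive deduplication is spread over all configured reducers; only the cheap final count is serialized.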