hive中distribute by、sort by、cluster by的区别

hive中distribute by、sort by、cluster by的区别
1.distribute by + sort by 组合使用
实验表

hive> select * from test1;
OK
19	1	FLUSH
18	1	CACHE
17	1	PROCEDURE
16	1	ANALYSE
15	1	HELP
14	1	VALUE
13	1	SERIAL
12	1	DEFAULT
11	1	HELP_VERSION
10	1	HELP_DATE
9	0	FLUSH
8	0	CACHE
7	0	PROCEDURE
6	0	ANALYSE
5	0	HELP
4	0	VALUE
3	0	SERIAL
2	0	DEFAULT
1	0	HELP_VERSION
0	0	HELP_DATE
Time taken: 0.123 seconds, Fetched: 20 row(s)
hive> desc test1;
OK
id                  	int                 	                    
type                	int                 	                    
name                	varchar(100)        	                    
Time taken: 0.064 seconds, Fetched: 3 row(s)

执行:

> select * from test1 distribute by type sort by id;
Query ID = root_20200110153747_48771756-ed5a-4e36-ae3c-2611cb64f63c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1578636756028_0021, Tracking URL = http://linux-one:8088/proxy/application_1578636756028_0021/
Kill Command = /usr/java/hadoop/bin/mapred job  -kill job_1578636756028_0021
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-01-10 15:38:03,353 Stage-1 map = 0%,  reduce = 0%
2020-01-10 15:38:11,512 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.62 sec
2020-01-10 15:38:18,649 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.18 sec
MapReduce Total cumulative CPU time: 5 seconds 180 msec
Ended Job = job_1578636756028_0021
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.18 sec   HDFS Read: 11083 HDFS Write: 575 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 180 msec
OK
0	0	HELP_DATE
1	0	HELP_VERSION
2	0	DEFAULT
3	0	SERIAL
4	0	VALUE
5	0	HELP
6	0	ANALYSE
7	0	PROCEDURE
8	0	CACHE
9	0	FLUSH
10	1	HELP_DATE
11	1	HELP_VERSION
12	1	DEFAULT
13	1	SERIAL
14	1	VALUE
15	1	HELP
16	1	ANALYSE
17	1	PROCEDURE
18	1	CACHE
19	1	FLUSH
Time taken: 33.305 seconds, Fetched: 20 row(s)
hive> 

解析:
distribute by type 将map端输出的文件按type值分组,将同一个type值的输出文件输入到同一个reduce上处理,sort by id 为每个reduce的输出文件按ID排序。
2.cluster by
当cluster by后的字段和distribute by 和sort by字段一样时,相当于 distribute by 和sort by 的结合,默认只能是升序
以下两种写法是一样的

select * from test1 cluster by id;
=
select * from test1 distribute by type sort by id asc;

你可能感兴趣的:(hive)