Sorting in Hive (Order by, Sort by, Distribute by, Cluster by)

Contents

Summary:

Order by:

Sort by:

Distribute by:

Cluster by:


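Summary:

① order by performs a total sort. It ends up using a single Reducer that produces one sorted file, so if the input is too large, that one Reducer cannot cope.
② sort by uses multiple Reducers, each sorting its own partition (rows are assigned to partitions randomly). It produces multiple files; each file is sorted internally, but there is no global order.
③ distribute by provides customized partitioning: rows are partitioned on a column of the table, producing at most as many output files as there are Reducers. It does not sort by itself, so it is usually combined with sort by to order the rows inside each partition.
④ cluster by is shorthand for distribute by plus sort by when both use the same column.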

Order by:

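order by performs a global sort and launches only one Reducer to do it. Even if the job is configured with several Reducers up front, Hive still hands the sort to a single Reducer. The Reducer count (mapreduce.job.reduces) defaults to -1, which means Hive picks the number of Reducers by its own rules.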

Query employee information, sorted by salary in ascending order:
select * from emp order by sal;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| emp.empno  | emp.ename  |  emp.job   | emp.mgr  | emp.hiredata  | emp.sal  | emp.comm  | emp.deptno  |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| 7369       | SMITH      | CLERK      | 7902     | 1980-12-17    | 800.0    | NULL      | 20          |
| 7900       | JAMES      | CLERK      | 7698     | 1981-12-3     | 950.0    | NULL      | 30          |
| 7876       | ADAMS      | CLERK      | 7788     | 1987-5-23     | 1100.0   | NULL      | 20          |
| 7521       | WARD       | SALESMAN   | 7698     | 1981-2-22     | 1250.0   | 500.0     | 30          |
| 7654       | MARTIN     | SALESMAN   | 7698     | 1981-9-28     | 1250.0   | 1400.0    | 30          |
| 7934       | MILLER     | CLERK      | 7782     | 1982-1-23     | 1300.0   | NULL      | 10          |
| 7844       | TURNER     | SALESMAN   | 7698     | 1981-9-8      | 1500.0   | 0.0       | 30          |
| 7499       | ALLEN      | SALESMAN   | 7698     | 1981-2-20     | 1600.0   | 300.0     | 30          |
| 7782       | CLARK      | MANAGER    | 7839     | 1981-6-9      | 2450.0   | NULL      | 10          |
| 7698       | BLAKE      | MANAGER    | 7839     | 1981-5-1      | 2850.0   | NULL      | 30          |
| 7566       | JONES      | MANAGER    | 7839     | 1981-4-2      | 2975.0   | NULL      | 20          |
| 7788       | SCOTT      | ANALYST    | 7566     | 1987-4-19     | 3000.0   | NULL      | 20          |
| 7902       | FORD       | ANALYST    | 7566     | 1981-12-3     | 3000.0   | NULL      | 20          |
| 7839       | KING       | PRESIDENT  | NULL     | 1981-11-17    | 5000.0   | NULL      | 10          |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+


Query employee information, sorted by salary in descending order:
select * from emp order by sal desc;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| emp.empno  | emp.ename  |  emp.job   | emp.mgr  | emp.hiredata  | emp.sal  | emp.comm  | emp.deptno  |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| 7839       | KING       | PRESIDENT  | NULL     | 1981-11-17    | 5000.0   | NULL      | 10          |
| 7902       | FORD       | ANALYST    | 7566     | 1981-12-3     | 3000.0   | NULL      | 20          |
| 7788       | SCOTT      | ANALYST    | 7566     | 1987-4-19     | 3000.0   | NULL      | 20          |
| 7566       | JONES      | MANAGER    | 7839     | 1981-4-2      | 2975.0   | NULL      | 20          |
| 7698       | BLAKE      | MANAGER    | 7839     | 1981-5-1      | 2850.0   | NULL      | 30          |
| 7782       | CLARK      | MANAGER    | 7839     | 1981-6-9      | 2450.0   | NULL      | 10          |
| 7499       | ALLEN      | SALESMAN   | 7698     | 1981-2-20     | 1600.0   | 300.0     | 30          |
| 7844       | TURNER     | SALESMAN   | 7698     | 1981-9-8      | 1500.0   | 0.0       | 30          |
| 7934       | MILLER     | CLERK      | 7782     | 1982-1-23     | 1300.0   | NULL      | 10          |
| 7654       | MARTIN     | SALESMAN   | 7698     | 1981-9-28     | 1250.0   | 1400.0    | 30          |
| 7521       | WARD       | SALESMAN   | 7698     | 1981-2-22     | 1250.0   | 500.0     | 30          |
| 7876       | ADAMS      | CLERK      | 7788     | 1987-5-23     | 1100.0   | NULL      | 20          |
| 7900       | JAMES      | CLERK      | 7698     | 1981-12-3     | 950.0    | NULL      | 30          |
| 7369       | SMITH      | CLERK      | 7902     | 1980-12-17    | 800.0    | NULL      | 20          |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+


Sort by department and then by salary, both ascending:
select * from emp order by deptno,sal;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| emp.empno  | emp.ename  |  emp.job   | emp.mgr  | emp.hiredata  | emp.sal  | emp.comm  | emp.deptno  |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| 7934       | MILLER     | CLERK      | 7782     | 1982-1-23     | 1300.0   | NULL      | 10          |
| 7782       | CLARK      | MANAGER    | 7839     | 1981-6-9      | 2450.0   | NULL      | 10          |
| 7839       | KING       | PRESIDENT  | NULL     | 1981-11-17    | 5000.0   | NULL      | 10          |
| 7369       | SMITH      | CLERK      | 7902     | 1980-12-17    | 800.0    | NULL      | 20          |
| 7876       | ADAMS      | CLERK      | 7788     | 1987-5-23     | 1100.0   | NULL      | 20          |
| 7566       | JONES      | MANAGER    | 7839     | 1981-4-2      | 2975.0   | NULL      | 20          |
| 7788       | SCOTT      | ANALYST    | 7566     | 1987-4-19     | 3000.0   | NULL      | 20          |
| 7902       | FORD       | ANALYST    | 7566     | 1981-12-3     | 3000.0   | NULL      | 20          |
| 7900       | JAMES      | CLERK      | 7698     | 1981-12-3     | 950.0    | NULL      | 30          |
| 7654       | MARTIN     | SALESMAN   | 7698     | 1981-9-28     | 1250.0   | 1400.0    | 30          |
| 7521       | WARD       | SALESMAN   | 7698     | 1981-2-22     | 1250.0   | 500.0     | 30          |
| 7844       | TURNER     | SALESMAN   | 7698     | 1981-9-8      | 1500.0   | 0.0       | 30          |
| 7499       | ALLEN      | SALESMAN   | 7698     | 1981-2-20     | 1600.0   | 300.0     | 30          |
| 7698       | BLAKE      | MANAGER    | 7839     | 1981-5-1      | 2850.0   | NULL      | 30          |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+


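Because order by launches only one reducer for the sort, a very large data set may not fit in that reducer's memory. What then? We can use sort by to sort within each reducer's partition instead.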

Sort by:

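On large data sets, order by is very inefficient. In many cases a global order is not actually needed, and sort by can be used instead. sort by produces one sorted file per reducer: each Reducer sorts its own rows, but the overall result set is not globally ordered.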

1. Set the number of reducers to 3, then check the setting:
0: jdbc:hive2://hadoop108:10000> set mapreduce.job.reduces=3;
No rows affected (0.016 seconds)
0: jdbc:hive2://hadoop108:10000> set mapreduce.job.reduces;
+--------------------------+--+
|           set            |
+--------------------------+--+
| mapreduce.job.reduces=3  |
+--------------------------+--+
1 row selected (0.009 seconds)

2. View employee information sorted by department number in ascending order:
select * from emp sort by deptno;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| emp.empno  | emp.ename  |  emp.job   | emp.mgr  | emp.hiredata  | emp.sal  | emp.comm  | emp.deptno  |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| 7782       | CLARK      | MANAGER    | 7839     | 1981-6-9      | 2450.0   | NULL      | 10          |
| 7839       | KING       | PRESIDENT  | NULL     | 1981-11-17    | 5000.0   | NULL      | 10          |
| 7788       | SCOTT      | ANALYST    | 7566     | 1987-4-19     | 3000.0   | NULL      | 20          |
| 7654       | MARTIN     | SALESMAN   | 7698     | 1981-9-28     | 1250.0   | 1400.0    | 30          |
| 7698       | BLAKE      | MANAGER    | 7839     | 1981-5-1      | 2850.0   | NULL      | 30          |
| 7844       | TURNER     | SALESMAN   | 7698     | 1981-9-8      | 1500.0   | 0.0       | 30          |
| 7934       | MILLER     | CLERK      | 7782     | 1982-1-23     | 1300.0   | NULL      | 10          |
| 7876       | ADAMS      | CLERK      | 7788     | 1987-5-23     | 1100.0   | NULL      | 20          |
| 7566       | JONES      | MANAGER    | 7839     | 1981-4-2      | 2975.0   | NULL      | 20          |
| 7900       | JAMES      | CLERK      | 7698     | 1981-12-3     | 950.0    | NULL      | 30          |
| 7521       | WARD       | SALESMAN   | 7698     | 1981-2-22     | 1250.0   | 500.0     | 30          |
| 7499       | ALLEN      | SALESMAN   | 7698     | 1981-2-20     | 1600.0   | 300.0     | 30          |
| 7902       | FORD       | ANALYST    | 7566     | 1981-12-3     | 3000.0   | NULL      | 20          |
| 7369       | SMITH      | CLERK      | 7902     | 1980-12-17    | 800.0    | NULL      | 20          |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
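Three Reducers did the sorting here, and deptno can be seen to run in three ascending stretches. The pattern is not very obvious in this view, though,
so we write the query result to the local disk to inspect it.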


3. View employee information sorted by department number in descending order, writing the output to local files:

insert overwrite local directory '/opt/module/data/export' 
row format delimited
fields terminated by '\t'
select * from emp sort by deptno desc;

[isea@hadoop108 export]$ ll
总用量 12
-rw-r--r--. 1 isea isea 288 12月  2 01:48 000000_0
-rw-r--r--. 1 isea isea 282 12月  2 01:48 000001_0
-rw-r--r--. 1 isea isea  91 12月  2 01:48 000002_0
[isea@hadoop108 export]$ cat 000000_0 
7844	TURNER	SALESMAN	7698	1981-9-8	1500.0	0.0	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.0	\N	30
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.0	1400.0	30
7788	SCOTT	ANALYST	7566	1987-4-19	3000.0	\N	20
7839	KING	PRESIDENT	\N	1981-11-17	5000.0	\N	10
7782	CLARK	MANAGER	7839	1981-6-9	2450.0	\N	10
[isea@hadoop108 export]$ pwd
/opt/module/data/export

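So what rule does Hive use to assign rows to reducers here? With sort by alone, the rows are distributed randomly, which helps avoid data skew.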
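The behavior described above can be imitated in a few lines of Python. This is only a toy model, not Hive's actual implementation: the function name `sort_by`, the fixed random seed, and the use of `random` for the scatter step are all assumptions made for illustration. The rows are the (empno, ename, deptno) columns of the emp table shown earlier.

```python
import random

# (empno, ename, deptno) rows taken from the emp table shown above
EMP = [
    (7369, "SMITH", 20), (7499, "ALLEN", 30), (7521, "WARD", 30),
    (7566, "JONES", 20), (7654, "MARTIN", 30), (7698, "BLAKE", 30),
    (7782, "CLARK", 10), (7788, "SCOTT", 20), (7839, "KING", 10),
    (7844, "TURNER", 30), (7876, "ADAMS", 20), (7900, "JAMES", 30),
    (7902, "FORD", 20), (7934, "MILLER", 10),
]


def sort_by(rows, key, num_reducers, descending=False, seed=0):
    """Toy model of SORT BY: scatter rows at random, then sort each bucket.

    The seed is fixed only so that the sketch is repeatable (an assumption
    for illustration; it has no counterpart in Hive).
    """
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # Each row lands in an arbitrary reducer, regardless of its key.
        buckets[rng.randrange(num_reducers)].append(row)
    # Every reducer sorts only the rows it received.
    return [sorted(b, key=key, reverse=descending) for b in buckets]


if __name__ == "__main__":
    parts = sort_by(EMP, key=lambda r: r[2], num_reducers=3, descending=True)
    for i, bucket in enumerate(parts):
        print(f"reducer {i}: deptno = {[r[2] for r in bucket]}")
```

Each printed list is in descending order on its own, but the same deptno can appear in more than one list, which is why the exported files are sorted internally and unordered overall.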

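As shown, sort by cannot control how rows are partitioned. If we want all rows with the same value of some field to go to the same partition, we need distribute by.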

Distribute by:

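In some cases we need to control which reducer a particular row goes to, usually so that a later aggregation can be performed. The distribute by clause does this. distribute by is similar to a custom partitioner in MR: it partitions the rows, and it is used together with sort by.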

1. Partition by department number first, then sort by employee number in descending order within each partition.

insert overwrite local directory '/opt/module/data/export/distributed_by'
row format delimited
fields terminated by '\t'
select empno,ename,deptno from emp distribute by deptno sort by empno desc;

[isea@hadoop108 distributed_by]$ ll
总用量 12
-rw-r--r--. 1 isea isea 85 12月  2 02:05 000000_0
-rw-r--r--. 1 isea isea 42 12月  2 02:05 000001_0
-rw-r--r--. 1 isea isea 69 12月  2 02:05 000002_0
[isea@hadoop108 distributed_by]$ cat 000000_0 
7900	JAMES	30
7844	TURNER	30
7698	BLAKE	30
7654	MARTIN	30
7521	WARD	30
7499	ALLEN	30
[isea@hadoop108 distributed_by]$ cat 000001_0 
7934	MILLER	10
7839	KING	10
7782	CLARK	10
[isea@hadoop108 distributed_by]$ pwd
/opt/module/data/export/distributed_by

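The partitioning rule of distribute by is: take the hash code of the partitioning field modulo the number of reducers; rows with the same remainder go to the same partition.
Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause: partition first, then sort the rows within each partition.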
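The hash-modulo rule can be checked against the exported files with a small Python sketch. It is a simplified model under one stated assumption: an integer key such as deptno hashes to its own value, which matches the 10, 20, 30 example in this article but is not verified here against Hive's source. The function names are hypothetical.

```python
from collections import defaultdict

# (empno, ename, deptno) rows taken from the emp table shown above
EMP = [
    (7369, "SMITH", 20), (7499, "ALLEN", 30), (7521, "WARD", 30),
    (7566, "JONES", 20), (7654, "MARTIN", 30), (7698, "BLAKE", 30),
    (7782, "CLARK", 10), (7788, "SCOTT", 20), (7839, "KING", 10),
    (7844, "TURNER", 30), (7876, "ADAMS", 20), (7900, "JAMES", 30),
    (7902, "FORD", 20), (7934, "MILLER", 10),
]


def distribute_by(rows, dist_key, num_reducers):
    """Group rows by key % num_reducers.

    Assumption: the integer key is its own hash code.
    """
    parts = defaultdict(list)
    for row in rows:
        parts[dist_key(row) % num_reducers].append(row)
    return parts


def distribute_sort(rows, dist_key, sort_key, num_reducers, descending=False):
    """Partition first, then sort inside each partition (DISTRIBUTE BY + SORT BY)."""
    parts = distribute_by(rows, dist_key, num_reducers)
    return {
        p: sorted(rs, key=sort_key, reverse=descending)
        for p, rs in sorted(parts.items())
    }


if __name__ == "__main__":
    # Mirrors: distribute by deptno sort by empno desc, with 3 reducers
    result = distribute_sort(EMP, lambda r: r[2], lambda r: r[0], 3, descending=True)
    for p, rs in result.items():
        print(f"partition {p}:")
        for empno, ename, deptno in rs:
            print(f"  {empno}\t{ename}\t{deptno}")
```

With 3 reducers, partition 0 holds department 30 and partition 1 holds department 10, each in descending empno order, which agrees with the contents of 000000_0 and 000001_0 above.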

Cluster by:

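When distribute by and sort by use the same field, cluster by can be used instead. cluster by does the work of distribute by and of sort by at the same time. The sort can only be ascending, however; ASC or DESC cannot be specified.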

1. Partition by department first, then sort by department:

set mapreduce.job.reduces=4;
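Hadoop's default partitioning rule is in effect, so the department numbers 10, 20 and 30 taken modulo 3 land in partitions 1, 2 and 0 respectively, one department per partition, and the effect of the sort cannot be seen.
The number of reducers is therefore set to 4, which gives the following partitions:
Partition 0: 20
Partition 1: empty
Partition 2: 30 and 10; both land in the same partition and are sorted by deptno in ascending order
Partition 3: empty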

insert overwrite local directory '/opt/module/data/export/cluster_by'
row format delimited
fields terminated by '\t'
select ename, deptno from emp cluster by deptno;

[isea@hadoop108 cluster_by]$ cat 000000_0 
FORD	20
SCOTT	20
JONES	20
ADAMS	20
SMITH	20
[isea@hadoop108 cluster_by]$ cat 000002_0 
MILLER	10
KING	10
CLARK	10
BLAKE	30
MARTIN	30
JAMES	30
WARD	30
ALLEN	30
TURNER	30
[isea@hadoop108 cluster_by]$ pwd
/opt/module/data/export/cluster_by
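The partition layout for cluster by can be reproduced with a short Python sketch. As before, this is a simplified model that assumes an integer deptno hashes to its own value; `cluster_by` is a hypothetical helper, not a Hive API.

```python
# deptno of each of the 14 rows in the emp table shown above
DEPTNOS = [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10]


def cluster_by(keys, num_reducers):
    """Toy model of CLUSTER BY: partition on the key, then sort ascending on the same key.

    Assumption: the integer key is its own hash code.
    """
    parts = {p: [] for p in range(num_reducers)}
    for k in keys:
        parts[k % num_reducers].append(k)
    # Ascending only, mirroring the restriction that ASC/DESC cannot be chosen.
    return {p: sorted(ks) for p, ks in parts.items()}


if __name__ == "__main__":
    for n in (3, 4):
        print(f"{n} reducers:")
        for p, ks in cluster_by(DEPTNOS, n).items():
            print(f"  partition {p}: {ks}")
```

With 3 reducers every partition contains a single department, so the sort has nothing visible to do. With 4 reducers, departments 10 and 30 share partition 2 and come out as all the 10s followed by all the 30s, which is the order seen in 000002_0.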

