Table of Contents
Summary:
Order by:
Sort by:
Distribute by:
Cluster by:
Summary:
① order by: a full global sort; in the end a single Reducer produces one sorted file, so if the input is very large that one Reducer simply cannot cope.
② sort by: starts multiple Reducers that sort within their partitions (rows are distributed to partitions randomly) and produces multiple files; each file is sorted internally, but there is no global order.
③ distribute by: allows customized partitioning, splitting the rows on a chosen column and then sorting them, producing as many output files as there are Reducers; it is usually combined with sort by.
④ cluster by: a shorthand for the case where the distribute by and sort by columns are the same.
order by: a global sort; Hive starts only one Reducer to produce the total ordering. Even if we set the number of Reducers to several before running the job, Hive still hands the work to a single Reducer. The default Reducer count is -1, which means Hive decides the number of Reducers by its own rules.
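A minimal sketch of this behavior (using the emp table from this article; the comments describe the expected outcome rather than captured output):
set mapreduce.job.reduces=3;     -- ask for 3 reducers
select * from emp order by sal;  -- Hive still performs the global sort with a single reduce task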
Query employee information ordered by salary in ascending order:
select * from emp order by sal;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredata | emp.sal | emp.comm | emp.deptno |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7521 | WARD | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7934 | MILLER | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7788 | SCOTT | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.0 | NULL | 10 |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
Query employee information ordered by salary in descending order:
select * from emp order by sal desc;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredata | emp.sal | emp.comm | emp.deptno |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.0 | NULL | 10 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7788 | SCOTT | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7566 | JONES | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7934 | MILLER | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
Sort by department and salary in ascending order:
select * from emp order by deptno,sal;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredata | emp.sal | emp.comm | emp.deptno |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| 7934 | MILLER | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
| 7782 | CLARK | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.0 | NULL | 10 |
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
| 7876 | ADAMS | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7566 | JONES | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7788 | SCOTT | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
Since order by starts only a single reducer for the sort, what happens if the data set is so large that one reducer cannot hold it in memory? In that case we use sort by to sort within each partition (i.e. within each reducer).
Sort By: for large data sets, order by is very inefficient. In many cases a global order is not needed, and sort by can be used instead. sort by produces one sorted file per reducer: each reducer sorts its own data, but the combined result set is not globally ordered.
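A minimal sketch of the contrast (the sort column here is just an illustration):
set mapreduce.job.reduces=3;    -- three reducers
select * from emp sort by sal;  -- three output parts, each sorted by sal internally, with no global order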
1. Set the number of reducers to 3 and then check the value:
0: jdbc:hive2://hadoop108:10000> set mapreduce.job.reduces=3;
No rows affected (0.016 seconds)
0: jdbc:hive2://hadoop108:10000> set mapreduce.job.reduces;
+--------------------------+--+
| set |
+--------------------------+--+
| mapreduce.job.reduces=3 |
+--------------------------+--+
1 row selected (0.009 seconds)
2. Query employee information sorted by department number in ascending order:
select * from emp sort by deptno;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| emp.empno | emp.ename | emp.job | emp.mgr | emp.hiredata | emp.sal | emp.comm | emp.deptno |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
| 7782 | CLARK | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.0 | NULL | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7934 | MILLER | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
| 7876 | ADAMS | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7566 | JONES | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+--+
Three reducers were used for this sort, and you can see three runs of deptno values in the result. Since this is not easy to see in one combined listing, we write the query result to the local disk so the per-reducer files can be inspected.
3. Query employee information sorted by department number in descending order and write the result to local files:
insert overwrite local directory '/opt/module/data/export'
row format delimited
fields terminated by '\t'
select * from emp sort by deptno desc;
[isea@hadoop108 export]$ ll
total 12
-rw-r--r--. 1 isea isea 288 Dec  2 01:48 000000_0
-rw-r--r--. 1 isea isea 282 Dec  2 01:48 000001_0
-rw-r--r--. 1 isea isea  91 Dec  2 01:48 000002_0
[isea@hadoop108 export]$ cat 000000_0
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 \N 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 \N 20
7839 KING PRESIDENT \N 1981-11-17 5000.0 \N 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 \N 10
[isea@hadoop108 export]$ pwd
/opt/module/data/export
So what distribution rule does Hive use here? With sort by alone the rows are distributed to the reducers randomly, which is done to avoid data skew.
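As a hedged side note (not part of the original walkthrough): a similar random spread can be written out explicitly with the rand() function, for example:
select * from emp distribute by rand() sort by deptno;  -- rows scattered randomly across reducers, then sorted by deptno within each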
As you can see, sort by cannot give us a customized partitioning. If we want rows with the same value of some column to end up in the same partition, we need distribute by.
In some cases we need to control which reducer a particular row goes to, usually so that a subsequent aggregation can be performed. The distribute by clause does exactly this. distribute by works like a custom partitioner in MapReduce: it partitions the rows, and it is normally used together with sort by.
1. First partition by department number, then sort by employee number in descending order:
insert overwrite local directory '/opt/module/data/export/distributed_by'
row format delimited
fields terminated by '\t'
select empno,ename,deptno from emp distribute by deptno sort by empno desc;
[isea@hadoop108 distributed_by]$ ll
total 12
-rw-r--r--. 1 isea isea 85 Dec  2 02:05 000000_0
-rw-r--r--. 1 isea isea 42 Dec  2 02:05 000001_0
-rw-r--r--. 1 isea isea 69 Dec  2 02:05 000002_0
[isea@hadoop108 distributed_by]$ cat 000000_0
7900 JAMES 30
7844 TURNER 30
7698 BLAKE 30
7654 MARTIN 30
7521 WARD 30
7499 ALLEN 30
[isea@hadoop108 distributed_by]$ cat 000001_0
7934 MILLER 10
7839 KING 10
7782 CLARK 10
[isea@hadoop108 distributed_by]$ pwd
/opt/module/data/export/distributed_by
The partitioning rule of distribute by: take the hash code of the distribute-by column, compute it modulo the number of reducers, and rows with the same remainder go to the same partition.
Hive requires the DISTRIBUTE BY clause to be written before SORT BY: partition first, then sort the data inside each partition.
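A quick worked check of that rule against the three files above (a sketch; it assumes Hive's hash of an int column is the value itself):
select deptno, pmod(hash(deptno), 3) as reducer from emp group by deptno;
-- 30 -> 0 (000000_0), 10 -> 1 (000001_0), 20 -> 2 (presumably 000002_0, not shown above)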
When the distribute by and sort by columns are the same, cluster by can be used instead. cluster by combines the functionality of distribute by with that of sort by, but the sort can only be ascending; you cannot specify ASC or DESC.
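In other words, for the query used below (a sketch of the equivalence just stated):
select ename, deptno from emp cluster by deptno;
-- behaves the same as
select ename, deptno from emp distribute by deptno sort by deptno;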
1. First partition by department, then sort by department:
set mapreduce.job.reduces=4;
Since Hadoop's default partitioning rule is used, with 3 reducers the department numbers 10, 20 and 30 taken modulo 3 fall into partitions 1, 2 and 0 respectively, so the effect is not visible. We therefore set the number of reducers to 4, which gives the following layout:
partition 0: 20
partition 1: nothing
partition 2: 30 and 10; these two departments share one partition and are sorted by deptno in ascending order
partition 3: nothing
insert overwrite local directory '/opt/module/data/export/cluster_by'
row format delimited
fields terminated by '\t'
select ename, deptno from emp cluster by deptno;
[isea@hadoop108 cluster_by]$ cat 000000_0
FORD 20
SCOTT 20
JONES 20
ADAMS 20
SMITH 20
[isea@hadoop108 cluster_by]$ cat 000002_0
MILLER 10
KING 10
CLARK 10
BLAKE 30
MARTIN 30
JAMES 30
WARD 30
ALLEN 30
TURNER 30
[isea@hadoop108 cluster_by]$ pwd
/opt/module/data/export/cluster_by
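The same hashing sketch as before (again assuming the hash of an int equals the value) reproduces the layout described above:
select deptno, pmod(hash(deptno), 4) as reducer from emp group by deptno;
-- 20 -> 0 (000000_0), 10 -> 2 and 30 -> 2 (both in 000002_0); partitions 1 and 3 receive nothing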
① order by: a full global sort; in the end a single Reducer produces one sorted file, so if the input is very large that one Reducer simply cannot cope.
② sort by: starts multiple Reducers that sort within their partitions (rows are distributed to partitions randomly) and produces multiple files; each file is sorted internally, but there is no global order.
③ distribute by: allows customized partitioning, splitting the rows on a chosen column and then sorting them, producing as many output files as there are Reducers; it is usually combined with sort by.
④ cluster by: a shorthand for the case where the distribute by and sort by columns are the same.