文中的两张表emp和dept:
emp数据如下:
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7788 SCOTT ANALYST 7566 1987-07-13 3000.0 0.0 20
7839 KING PRESIDENT NULL 1981-11-07 5000.0 0.0 10
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-07-13 1100.0 0.0 20
7900 JAMES CLERK 7698 1981-12-03 950.0 0.0 30
7902 FORD ANALYST 7566 1981-12-03 3000.0 0.0 20
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
dept的数据如下:
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
1、where用法:
hive (db_emp)> select * from emp where sal>3000;
OK
emp.empno emp.enname emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-07 5000.0 0.0 10
Time taken: 0.891 seconds, Fetched: 1 row(s)
2、order用法:查询部门10的所有员工,按照薪资进行降序排列:
select * from emp where deptno='10' order by sal desc;
结果:
hive (db_emp)> select * from emp where deptno='10' order by sal desc;
Query ID = root_20180725102525_ff119157-9589-4736-8780-4310923516fd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1532453698068_0003, Tracking URL = http://master.cdh.com:8088/proxy/application_1532453698068_0003/
Kill Command = /opt/cdh5.14.2/hadoop-2.6.0/bin/hadoop job -kill job_1532453698068_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-07-25 10:25:56,776 Stage-1 map = 0%, reduce = 0%
2018-07-25 10:26:09,404 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.72 sec
2018-07-25 10:26:18,240 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.19 sec
MapReduce Total cumulative CPU time: 5 seconds 190 msec
Ended Job = job_1532453698068_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.19 sec HDFS Read: 10580 HDFS Write: 154 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 190 msec
OK
emp.empno emp.enname emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7839 KING PRESIDENT NULL 1981-11-07 5000.0 0.0 10
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
Time taken: 38.628 seconds, Fetched: 3 row(s)
3、limit用法:
hive (db_emp)> select * from emp limit 5;
OK
emp.empno emp.enname emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
Time taken: 0.162 seconds, Fetched: 5 row(s)
4、distinct用法:去除查询结果中的重复记录
select distinct deptno from emp;
结果:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.51 sec HDFS Read: 8303 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 510 msec
OK
deptno
10
20
30
Time taken: 32.66 seconds, Fetched: 3 row(s
5、between and用法:
hive (db_emp)> select * from emp where sal between 2000 and 3000;
OK
emp.empno emp.enname emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7788 SCOTT ANALYST 7566 1987-07-13 3000.0 0.0 20
7902 FORD ANALYST 7566 1981-12-03 3000.0 0.0 20
Time taken: 0.117 seconds, Fetched: 5 row(s)
6、count用法:
select count(1) from emp;
或者
select count(*) from emp;(效率较低)
7、avg用法:求平均数
select avg(sal) avg_sal from emp;
结果:
Total MapReduce CPU Time Spent: 5 seconds 380 msec
OK
avg_sal
2171.153846153846
Time taken: 31.874 seconds, Fetched: 1 row(s)
8、group by用法:
select deptno avg(sal) from emp group by deptno;
结果:
Total MapReduce CPU Time Spent: 2 seconds 730 msec
OK
deptno _c1
10 2916.6666666666665
20 2518.75
30 1566.6666666666667
Time taken: 28.437 seconds, Fetched: 3 row(s)
9、having用法:跟where相似
select deptno avg(sal) avg_sal from emp group by deptno having avg_sal >2000;
结果:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.41 sec HDFS Read: 9666 HDFS Write: 33 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 410 msec
OK
deptno avg_sal
10 2916.6666666666665
20 2518.75
Time taken: 59.368 seconds, Fetched: 2 row(s)
where和having的区别:
正确理解MySQL中的where和having的区别
10、join的用法:等值join,左join,右join,全join
等值join:
select e.empno, e.enname, e.sal, d.deptno from emp e join dept d on e.deptno=d.deptno;
结果:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.8 sec HDFS Read: 7480 HDFS Write: 285 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
e.empno e.enname e.sal d.deptno
7499 ALLEN 1600.0 30
7521 WARD 1250.0 30
7566 JONES 2975.0 20
7654 MARTIN 1250.0 30
7698 BLAKE 2850.0 30
7782 CLARK 2450.0 10
7788 SCOTT 3000.0 20
7839 KING 5000.0 10
7844 TURNER 1500.0 30
7876 ADAMS 1100.0 20
7900 JAMES 950.0 30
7902 FORD 3000.0 20
7934 MILLER 1300.0 10
Time taken: 27.963 seconds, Fetched: 13 row(s)
左join:
select e.empno, e.enname, e.sal, d.deptno from emp e left join dept d on e.deptno=d.deptno;
由于数据的原因,结果同上。
右join:
select e.empno, e.enname, e.sal, d.deptno from emp e right join dept d on e.deptno=d.deptno;
结果:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.71 sec HDFS Read: 6727 HDFS Write: 297 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
e.empno e.enname e.sal d.deptno
7782 CLARK 2450.0 10
7839 KING 5000.0 10
7934 MILLER 1300.0 10
7566 JONES 2975.0 20
7788 SCOTT 3000.0 20
7876 ADAMS 1100.0 20
7902 FORD 3000.0 20
7499 ALLEN 1600.0 30
7521 WARD 1250.0 30
7654 MARTIN 1250.0 30
7698 BLAKE 2850.0 30
7844 TURNER 1500.0 30
7900 JAMES 950.0 30
NULL NULL NULL 40
Time taken: 31.193 seconds, Fetched: 14 row(s)
全join:
select e.empno, e.enname, e.sal, d.deptno from emp e full join dept d on e.deptno=d.deptno;
由于数据的原因,结果同上。
12、导出数据到本地:
insert overwrite local directory '/opt/data/emp' row format delimited fields terminated by '\t' select * from emp;
结果:
在本地的路径 /opt/data/emp/:
[root@master emp]# more 000000_0
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7788 SCOTT ANALYST 7566 1987-07-13 3000.0 0.0 20
7839 KING PRESIDENT \N 1981-11-07 5000.0 0.0 10
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-07-13 1100.0 0.0 20
7900 JAMES CLERK 7698 1981-12-03 950.0 0.0 30
7902 FORD ANALYST 7566 1981-12-03 3000.0 0.0 20
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
13 、sort by
先设置reduce的个数:
hive (db_emp)> set mapreduce.job.reduces=3;
hive (db_emp)> set mapreduce.job.reduces;
mapreduce.job.reduces=3
执行sort by:
insert overwrite local directory '/opt/data/emp_sort' row format delimited fields terminated by '\t' select * from emp sort by sal;
结果:
在本地路径/opt/data/emp_sort下产生3个文件:
3个文件的内容如下:每个reduce都按照sal进行排序
[root@master data]# cd emp_sort/
[root@master emp_sort]# cat 000000_0
7876 ADAMS CLERK 7788 1987-07-13 1100.0 0.0 20
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
7788 SCOTT ANALYST 7566 1987-07-13 3000.0 0.0 20
7839 KING PRESIDENT \N 1981-11-07 5000.0 0.0 10
[root@master emp_sort]# cat 000001_0
7900 JAMES CLERK 7698 1981-12-03 950.0 0.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7902 FORD ANALYST 7566 1981-12-03 3000.0 0.0 20
[root@master emp_sort]# cat 000002_0
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
14、distribute by
insert overwrite local directory '/opt/data/emp_dist' row format delimited fields terminated by '\t' select * from emp distribute by deptno sort by sal;
查看本地路径:
数据结果如下:按照deptno进行分组后根据sal进行排序
[root@master data]# cd emp_dist/
[root@master emp_dist]# more 000000_0
7900 JAMES CLERK 7698 1981-12-03 950.0 0.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
[root@master emp_dist]# more 000001_0
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7839 KING PRESIDENT \N 1981-11-07 5000.0 0.0 10
[root@master emp_dist]# more 000002_0
7876 ADAMS CLERK 7788 1987-07-13 1100.0 0.0 20
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7902 FORD ANALYST 7566 1981-12-03 3000.0 0.0 20
7788 SCOTT ANALYST 7566 1987-07-13 3000.0 0.0 20
15、cluster by
insert overwrite local directory '/opt/data/emp_clu' row format delimited fields terminated by '\t' select * from emp cluster by sal;