Common HiveQL Operations
Create Table dept (deptno Int, dname String) Row format delimited fields terminated By '\t';
Create Table emp (empno Int, ename String, mgr Int, sal Float, deptno Int) Row format delimited fields terminated By '\t';
Create Table salgrade (grade Int, losal Int, hisal Int) Row format delimited fields terminated By '\t';
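Hive has two kinds of tables: managed tables and external tables. The tables above are managed tables; their data lives under the data warehouse directory and is managed by Hive. An external table's data stays at a location you specify, outside the Hive warehouse, and is only registered in the metastore database.
Create an external table: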
Create External Table ext(Id Int,Name String);
hive>load Data Local inpath '/home/licz/data/dept' overwrite Into Table dept;
Copying data from file:/home/licz/data/dept
Copying file: file:/home/licz/data/dept
Loading data to table default.dept
Deleted hdfs://gc:9000/user/hive/warehouse/dept
Table default.dept stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 50, raw_data_size: 0]
OK
Time taken: 1.547 seconds
hive>select * from dept;
OK
10 ACCOUNTING
20 RESEARCH
30 SALES
40 OPERATIONS
Time taken: 0.184 seconds, Fetched: 4 row(s)
load Data Local inpath '/home/licz/data/emp' overwrite Into Table emp;
load Data Local inpath '/home/licz/data/salgrade' overwrite Into Table salgrade;
If the data to load is already on HDFS, omit the LOCAL keyword.
-- Check where the managed tables are stored
[licz@gc data]$ hadoop dfs -ls /user/hive/warehouse
Found 4 items
drwxr-xr-x - licz supergroup 0 2013-12-16 13:44 /user/hive/warehouse/dept
drwxr-xr-x - licz supergroup 0 2013-12-16 13:42 /user/hive/warehouse/emp
drwxr-xr-x - licz supergroup 0 2013-12-16 13:07 /user/hive/warehouse/ext
drwxr-xr-x - licz supergroup 0 2013-12-16 13:38 /user/hive/warehouse/salgrade
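When loading data, Hive only copies or moves files and does not check them against the table schema, so the following operation also succeeds. This is Hive's "schema on read" approach, which makes loading data fast.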
hive> load Data Local inpath '/home/licz/data/salgrade' overwrite Into Table emp;
Copying data from file:/home/licz/data/salgrade
Copying file: file:/home/licz/data/salgrade
Loading data to table default.emp
Deleted hdfs://gc:9000/user/hive/warehouse/emp
Table default.emp stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 59, raw_data_size: 0]
OK
Time taken: 2.118 seconds
hive> select * from emp;
OK
1 700 1200 NULL NULL
2 1201 1400 NULL NULL
3 1401 2000 NULL NULL
4 2001 3000 NULL NULL
5 3001 9999 NULL NULL
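Although the loaded data does not match the table structure, the load still succeeded. The mismatch only shows up at query time, where the columns that cannot be filled are returned as NULL.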
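The NULL columns in the query result follow from schema on read: the file is interpreted against the table schema only when it is queried. The following is a minimal Python sketch of that idea, not Hive's actual SerDe code. The helper name `read_row`, the casting rules, and the assumption that the salgrade file is tab-delimited are illustrative.

```python
# Illustrative sketch of "schema on read"; not Hive's real implementation.
EMP_SCHEMA = [("empno", int), ("ename", str), ("mgr", int),
              ("sal", float), ("deptno", int)]

def read_row(line, schema, delimiter="\t"):
    """Interpret one raw text line against a schema at query time."""
    fields = line.rstrip("\n").split(delimiter)
    row = []
    for i, (_name, cast) in enumerate(schema):
        if i >= len(fields):
            row.append(None)      # missing field becomes NULL
            continue
        try:
            row.append(cast(fields[i]))
        except ValueError:
            row.append(None)      # value that does not parse becomes NULL
    return row

# Two lines as they might appear in the salgrade file (grade, losal, hisal).
salgrade_lines = ["1\t700\t1200", "2\t1201\t1400"]
rows = [read_row(line, EMP_SCHEMA) for line in salgrade_lines]
for row in rows:
    print(row)
```

Each salgrade line has only three fields, so the sketch fills sal and deptno with None, which mirrors the two trailing NULL columns in the select output.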
-- Create a partitioned table
// Note: a partition column must not duplicate a column already defined in the table
hive> Create Table ptest(ename String) partitioned By (deptno Int) Row format delimited fields terminated By '\t';
OK
Time taken: 0.271 seconds
hive> desc ptest;
OK
ename string None
deptno int None
# Partition Information
# col_name data_type comment
deptno int None
Time taken: 0.231 seconds, Fetched: 7 row(s)
Load data into the partitioned table
hive>load Data Local inpath '/home/licz/data/ptest20' overwrite Into Table ptest Partition(deptno=20);
Copying data from file:/home/licz/data/ptest20
Copying file: file:/home/licz/data/ptest20
Loading data to table default.ptest partition (deptno=20)
Deleted hdfs://gc:9000/user/hive/warehouse/ptest/deptno=20
Partition default.ptest{deptno=20} stats: [num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]
Table default.ptest stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 2.209 seconds
hive> load Data Local inpath '/home/licz/data/ptest30' Into Table ptest Partition(deptno=30);
hive> select * from ptest;
OK
SMITH 20
JONES 20
SCOTT 20
ADAMS 20
FORD 20
ALLEN 30
WARD 30
MARTIN 30
BLAKE 30
TURNER 30
JAMES 30
Time taken: 0.364 seconds, Fetched: 11 row(s)
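Once a partition is created, Hive makes a directory named after the partition under the table's directory, and the partition's data files live inside it.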
hive> dfs -ls /user/hive/warehouse/ptest;
Found 2 items
drwxr-xr-x - licz supergroup 0 2013-12-16 14:17 /user/hive/warehouse/ptest/deptno=20
drwxr-xr-x - licz supergroup 0 2013-12-16 14:33 /user/hive/warehouse/ptest/deptno=30
hive> dfs -ls /user/hive/warehouse/ptest/deptno=20;
Found 1 items
-rw-r--r-- 2 licz supergroup 29 2013-12-16 14:17 /user/hive/warehouse/ptest/deptno=20/ptest20
hive> dfs -cat /user/hive/warehouse/ptest/deptno=20/ptest20;
SMITH
JONES
SCOTT
ADAMS
FORD
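Note that the data file holds only ename; the deptno value shown by select comes from the directory name. A small Python sketch of that layout follows. The function names `partition_dir` and `partition_value` are hypothetical helpers for illustration, not Hive APIs.

```python
# Illustrative sketch; the function names are hypothetical, not Hive APIs.
def partition_dir(table_dir, spec):
    """Build the directory for a partition, e.g. {'deptno': 20}."""
    parts = ["{}={}".format(col, val) for col, val in spec.items()]
    return "/".join([table_dir] + parts)

def partition_value(path, column):
    """Recover a partition column's value from a directory path."""
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == column:
            return value
    return None

path = partition_dir("/user/hive/warehouse/ptest", {"deptno": 20})
print(path)

# The file content holds only ename; deptno is appended from the path.
file_lines = ["SMITH", "JONES"]
rows = [(ename, int(partition_value(path, "deptno"))) for ename in file_lines]
print(rows)
```

This is also why a query that filters on deptno only needs to read the matching directory.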
Query a partition
hive> select ename from ptest where deptno=20;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201312111332_0005, Tracking URL = http://gc:50030/jobdetails.jsp?jobid=job_201312111332_0005
Kill Command = /home/licz/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201312111332_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-16 14:27:58,298 Stage-1 map = 0%, reduce = 0%
……
MapReduce Total cumulative CPU time: 3 seconds 110 msec
Ended Job = job_201312111332_0005
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 3.11 sec HDFS Read: 241 HDFS Write: 29 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 110 msec
OK
SMITH
JONES
SCOTT
ADAMS
FORD
Time taken: 61.54 seconds, Fetched: 5 row(s)
Show partitions
hive> show partitions ptest;
OK
deptno=20
deptno=30
Time taken: 0.221 seconds, Fetched: 2 row(s)
Insert data into a partition
hive> Insert overwrite Table ptest Partition(deptno=20) Select ename From emp Where deptno=20;
Total MapReduce jobs = 3
Launching Job 1 out of 3
……
5 Rows loaded to ptest
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 3.24 sec HDFS Read: 568 HDFS Write: 29 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 240 msec
OK
Time taken: 68.34 seconds
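A table or partition can be further organized into buckets. Bucketing splits the rows by the value of a chosen column, and each bucket corresponds to one reduce task. Before creating buckets, set the hive.enforce.bucketing property to true so that Hive recognizes them.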
hive> Create Table bemp(empno Int,ename String,mgr Int,sal Float,deptno Int)
> clustered By (empno) Into 3 buckets
> Row format delimited fields terminated By '\t';
OK
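Insert data into the bucketed table. The rows are split into three buckets by empno, so the insert runs three reduce tasks and writes three files.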
hive> Insert overwrite Table bemp Select * From emp;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201312111332_0011, Tracking URL = http://gc:50030/jobdetails.jsp?jobid=job_201312111332_0011
Kill Command = /home/licz/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201312111332_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2013-12-17 01:14:04,048 Stage-1 map = 0%, reduce = 0%
2013-12-17 01:14:29,182 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.5 sec
……
Ended Job = job_201312111332_0011
Loading data to table default.bemp
Deleted hdfs://gc:9000/user/hive/warehouse/bemp
Table default.bemp stats: [num_partitions: 0, num_files: 3, num_rows: 0, total_size: 360, raw_data_size: 0]
14 Rows loaded to bemp
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 3 Cumulative CPU: 42.03 sec HDFS Read: 568 HDFS Write: 360 SUCCESS
Total MapReduce CPU Time Spent: 42 seconds 30 msec
OK
Time taken: 116.398 seconds
Look at the table's directory in the warehouse: the three buckets correspond to three files.
hive> dfs -ls /user/hive/warehouse/bemp
> ;
Found 3 items
-rw-r--r-- 2 licz supergroup 177 2013-12-17 01:15 /user/hive/warehouse/bemp/000000_0
-rw-r--r-- 2 licz supergroup 103 2013-12-17 01:15 /user/hive/warehouse/bemp/000001_0
-rw-r--r-- 2 licz supergroup 80 2013-12-17 01:15 /user/hive/warehouse/bemp/000002_0
hive> dfs -ls /user/hive/warehouse/bemp/000000_0;
Found 1 items
-rw-r--r-- 2 licz supergroup 177 2013-12-17 01:15 /user/hive/warehouse/bemp/000000_0
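Hive hashes the value of the bucketing column and takes the remainder of the hash divided by the number of buckets to choose the bucket. This spreads the rows across the buckets, but the buckets do not necessarily hold the same number of records.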
hive> dfs -cat /user/hive/warehouse/bemp/000000_0;
7788 SCOTT 7566 3000.0 20
7839 KING \N 5000.0 10
7521 WARD 7698 1250.0 30
7566 JONES 7839 2975.0 20
7902 FORD 7566 3000.0 20
7698 BLAKE 7839 2850.0 30
7782 CLARK 7839 2450.0 10
hive> dfs -cat /user/hive/warehouse/bemp/000001_0;
7369 SMITH 7902 800.0 20
7654 MARTIN 7698 1250.0 30
7876 ADAMS 7788 1100.0 20
7900 JAMES 7698 950.0 30
hive> dfs -cat /user/hive/warehouse/bemp/000002_0;
7934 MILLER 7782 1300.0 10
7844 TURNER 7698 1500.0 30
7499 ALLEN 7698 1600.0 30
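The contents of the three files can be checked by hand. The sketch below assumes that for an INT column the hash is simply the integer value itself, so the bucket is empno modulo 3; `bucket_for` is an illustrative helper, not a Hive function.

```python
# Sketch of bucket assignment. Assumption: for an INT column the hash is
# the integer value itself, so the bucket is empno % num_buckets.
def bucket_for(empno, num_buckets):
    # Mask to a non-negative value before taking the remainder.
    return (empno & 0x7FFFFFFF) % num_buckets

# The 14 empno values that appear in the bucket files above.
empnos = [7369, 7499, 7521, 7566, 7654, 7698, 7782,
          7788, 7839, 7844, 7876, 7900, 7902, 7934]

buckets = {b: [] for b in range(3)}
for empno in empnos:
    buckets[bucket_for(empno, 3)].append(empno)

for b, members in sorted(buckets.items()):
    print("00000{}_0".format(b), members)
```

Under that assumption the sketch places 7 rows in bucket 0, 4 in bucket 1, and 3 in bucket 2, which matches the rows listed in 000000_0, 000001_0, and 000002_0.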
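Bucketing can make queries more efficient than partitioning alone, and it also makes it easy to sample the whole data set, as in the following sampling query: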
hive> select * from bemp tablesample(bucket 1 out of 3 on empno);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
……
Total MapReduce CPU Time Spent: 6 seconds 110 msec
OK
7788 SCOTT 7566 3000.0 20
7839 KING NULL 5000.0 10
7521 WARD 7698 1250.0 30
7566 JONES 7839 2975.0 20
7902 FORD 7566 3000.0 20
7698 BLAKE 7839 2850.0 30
7782 CLARK 7839 2450.0 10
Time taken: 57.213 seconds, Fetched: 7 row(s)
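The sample returns exactly the rows stored in the first bucket file. A rough Python sketch of what `bucket 1 out of 3 on empno` selects is shown here, assuming the same integer hash as the bucketing itself. The function `tablesample` is a hypothetical stand-in, not Hive code.

```python
# Sketch of TABLESAMPLE(BUCKET x OUT OF y ON col); illustrative only.
# Assumption: the hash of an INT key is the integer value itself.
def tablesample(rows, x, y, key):
    """Keep rows whose hashed key falls in bucket x (1-based) of y."""
    return [r for r in rows if (key(r) & 0x7FFFFFFF) % y == x - 1]

# (empno, ename) pairs taken from the bucket files shown earlier.
emp = [(7369, "SMITH"), (7499, "ALLEN"), (7521, "WARD"), (7566, "JONES"),
       (7654, "MARTIN"), (7698, "BLAKE"), (7782, "CLARK"), (7788, "SCOTT"),
       (7839, "KING"), (7844, "TURNER"), (7876, "ADAMS"), (7900, "JAMES"),
       (7902, "FORD"), (7934, "MILLER")]

sample = tablesample(emp, 1, 3, key=lambda r: r[0])
for row in sample:
    print(row)
```

Because the sample is defined on the column the table is bucketed by, the result corresponds to a single bucket file rather than a filter over all the data.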
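A multi-table insert uses a single statement to insert the same source data into several tables. The source is scanned only once to complete all the inserts, which is efficient.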
hive> Create Table mutil1 As Select deptno,ename From emp;
hive> Create Table mutil2 Like mutil1;
OK
Time taken: 0.285 seconds
hive> From emp
> Insert overwrite Table mutil1 Select deptno,ename
> Insert overwrite Table mutil2 Select deptno,Count(ename) Group By deptno;
hive> select * from mutil1;
OK
20 SMITH
30 ALLEN
30 WARD
20 JONES
30 MARTIN
30 BLAKE
10 CLARK
20 SCOTT
10 KING
30 TURNER
20 ADAMS
30 JAMES
20 FORD
10 MILLER
Time taken: 0.502 seconds, Fetched: 14 row(s)
hive> select * from mutil2;
OK
10 3
20 5
30 6
Time taken: 0.149 seconds, Fetched: 3 row(s)
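The single-scan idea behind the multi-table insert can be sketched in Python as one loop that feeds both targets. This only illustrates the data flow and is not how Hive plans the job; the lists simply reuse the table names from the transcript.

```python
from collections import Counter

# (deptno, ename) rows as returned by "select * from mutil1" above.
emp = [(20, "SMITH"), (30, "ALLEN"), (30, "WARD"), (20, "JONES"),
       (30, "MARTIN"), (30, "BLAKE"), (10, "CLARK"), (20, "SCOTT"),
       (10, "KING"), (30, "TURNER"), (20, "ADAMS"), (30, "JAMES"),
       (20, "FORD"), (10, "MILLER")]

mutil1 = []          # target 1: deptno, ename
counts = Counter()   # target 2: deptno, count(ename)

for deptno, ename in emp:   # a single pass over the source
    mutil1.append((deptno, ename))
    counts[deptno] += 1

mutil2 = sorted(counts.items())
print(mutil2)
```

Both targets are filled from one pass over the source rows, and the per-department counts agree with the contents of mutil2.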
Rename a table and add a column
hive> alter table mutil1 rename to mutil01;
hive> alter table mutil01 add columns(sal int);
OK
Time taken: 0.262 seconds
hive> desc mutil01;
OK
deptno int None
ename string None
sal int None
Time taken: 0.154 seconds, Fetched: 3 row(s)
Much like PL/SQL, isn't it?
hive> drop table mutil01;
To delete only the data in a table while keeping the table itself, simply delete its data files on HDFS, as follows:
hive> select * from mutil2;
OK
10 3
20 5
30 6
hive> dfs -ls /user/hive/warehouse/mutil2;
Found 1 items
-rw-r--r-- 2 licz supergroup 15 2013-12-17 02:34 /user/hive/warehouse/mutil2/000000_0
hive> dfs -cat /user/hive/warehouse/mutil2/000000_0;
103
205
306
(The two columns are separated by Hive's default field delimiter, the non-printing Ctrl-A character, so each row prints as run-together digits such as 103.)
hive> dfs -rmr /user/hive/warehouse/mutil2/*;
Deleted hdfs://gc:9000/user/hive/warehouse/mutil2/000000_0
hive> select * from mutil2;
OK
Time taken: 0.269 seconds
hive>
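Note: for a managed table, DROP removes both the metadata and the data files; for an external table, it removes only the metadata.
Hive's join operations are essentially the same as the inner, left outer, right outer, and full outer joins in PL/SQL.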
hive> select dept.*,emp.* from dept join emp on dept.deptno=emp.deptno;
hive> select dept.*,emp.* from dept left outer join emp on dept.deptno=emp.deptno;
10 ACCOUNTING 7839 KING NULL 5000.0 10
10 ACCOUNTING 7782 CLARK 7839 2450.0 10
10 ACCOUNTING 7934 MILLER 7782 1300.0 10
20 RESEARCH 7369 SMITH 7902 800.0 20
20 RESEARCH 7566 JONES 7839 2975.0 20
20 RESEARCH 7876 ADAMS 7788 1100.0 20
20 RESEARCH 7902 FORD 7566 3000.0 20
20 RESEARCH 7788 SCOTT 7566 3000.0 20
30 SALES 7499 ALLEN 7698 1600.0 30
30 SALES 7844 TURNER 7698 1500.0 30
30 SALES 7900 JAMES 7698 950.0 30
30 SALES 7698 BLAKE 7839 2850.0 30
30 SALES 7654 MARTIN 7698 1250.0 30
30 SALES 7521 WARD 7698 1250.0 30
40 OPERATIONS NULL NULL NULL NULL NULL
Time taken: 118.992 seconds, Fetched: 15 row(s)
hive> select dept.*,emp.* from dept right outer join emp on dept.deptno=emp.deptno;
hive> select dept.*,emp.* from dept full outer join emp on dept.deptno=emp.deptno;
// In the Hive version used here, the OUTER keyword must be written explicitly
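The semi join is specific to Hive. The Hive version used here does not support IN with a subquery; the substitute is LEFT SEMI JOIN. Note that the right-hand table of the join cannot appear in the select list; it may appear only in the ON clause.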
hive> select dept.* from dept left semi join emp on dept.deptno=emp.deptno;
OK
10 ACCOUNTING
20 RESEARCH
30 SALES
Time taken: 119.238 seconds, Fetched: 3 row(s)
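In other words, a left semi join keeps each left-hand row at most once if a matching key exists on the right, like an IN filter. A minimal Python sketch of that semantics follows; `left_semi_join` is an illustrative helper, and the deptno list is taken from the emp rows shown earlier.

```python
# Sketch of LEFT SEMI JOIN semantics; illustrative, not Hive's algorithm.
def left_semi_join(left, right_keys, key):
    """Return left rows whose key appears among right_keys, once each."""
    seen = set(right_keys)
    return [row for row in left if key(row) in seen]

dept = [(10, "ACCOUNTING"), (20, "RESEARCH"),
        (30, "SALES"), (40, "OPERATIONS")]

# deptno of each of the 14 emp rows.
emp_deptnos = [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10]

result = left_semi_join(dept, emp_deptnos, key=lambda d: d[0])
for row in result:
    print(row)
```

Only columns of dept appear in the result, which is why the right-hand table cannot be referenced in the select list, and department 40 is dropped because no employee belongs to it.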
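HiveQL has limited support for subqueries: a subquery may appear only in the FROM clause, and it must be given an alias. The following statement nests a subquery in the FROM clause:
hive> Select Deptno,Max(Num) From (Select Deptno, Count(Empno) Num From Emp Group By Deptno) t Group By Deptno;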
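Hive supports only logical views, not physical (materialized) views. A view you create can be found in the MySQL metastore database, but there is no corresponding directory for it under Hive's data warehouse directory.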
hive> create view v_test as select a.dname,count(b.empno) from dept a join emp b on a.deptno=b.deptno group by a.dname;
OK
mysql> select database();
+------------+
| database() |
+------------+
| hive |
+------------+
1 row in set (0.00 sec)
mysql> select tbl_name from TBLS;
+---------------------+
| tbl_name |
+---------------------+
| access_20120104_log |
| bemp |
| dept |
| emp |
| ext |
| mutil2 |
| ptest |
| salgrade |
| test |
| v_test |
+---------------------+
10 rows in set (0.00 sec)
mysql>