Common HiveQL Operations


 

1.   Create tables

Create Table dept (deptno Int,dname String) Row format delimited fields terminated By '\t';
Create Table emp (empno Int,ename String,mgr Int,sal Float,deptno Int) Row format delimited fields terminated By '\t';
Create Table salgrade (grade Int,losal Int,hisal Int) Row format delimited fields terminated By '\t';

 

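Hive has two kinds of tables: managed tables and external tables. The tables above are managed tables. A managed table's data lives under the data warehouse directory and is managed by Hive. An external table's data stays at a specified location rather than being owned by the Hive warehouse; the table is only registered in the metastore.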

 

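Create an external table (no LOCATION clause is given here, so its directory still ends up under the warehouse path, as the listing in section 2 shows):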

Create External Table ext(Id Int,Name String);

 

2.   Load data

hive>load Data Local inpath '/home/licz/data/dept' overwrite Into Table dept;

Copying data from file:/home/licz/data/dept

Copying file: file:/home/licz/data/dept

Loading data to table default.dept

Deleted hdfs://gc:9000/user/hive/warehouse/dept

Table default.dept stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 50, raw_data_size: 0]

OK

Time taken: 1.547 seconds

hive>select * from dept;

OK

10     ACCOUNTING

20     RESEARCH

30     SALES

40     OPERATIONS

Time taken: 0.184 seconds, Fetched: 4 row(s)

 

load Data Local inpath '/home/licz/data/emp' overwrite Into Table emp;
load Data Local inpath '/home/licz/data/salgrade' overwrite Into Table salgrade;

 

If the data to load is already on HDFS, omit the LOCAL keyword.

 

-- Check where the managed tables are stored

[licz@gc data]$ hadoop dfs -ls /user/hive/warehouse

Found 4 items

drwxr-xr-x  - licz supergroup          0 2013-12-16 13:44 /user/hive/warehouse/dept

drwxr-xr-x  - licz supergroup          0 2013-12-16 13:42 /user/hive/warehouse/emp

drwxr-xr-x  - licz supergroup          0 2013-12-16 13:07 /user/hive/warehouse/ext

drwxr-xr-x  - licz supergroup          0 2013-12-16 13:38 /user/hive/warehouse/salgrade

 

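When loading data, Hive only copies or moves files. It does not check the data against the table schema, so the following operation also succeeds. This is the "schema on read" approach that Hive uses, and it makes loading faster.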

hive> load Data Local inpath '/home/licz/data/salgrade' overwrite Into Table emp;

Copying data from file:/home/licz/data/salgrade

Copying file: file:/home/licz/data/salgrade

Loading data to table default.emp

Deleted hdfs://gc:9000/user/hive/warehouse/emp

Table default.emp stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 59, raw_data_size: 0]

OK

Time taken: 2.118 seconds

hive> select * from emp;

OK

1      700     1200    NULL    NULL

2      1201    1400    NULL    NULL

3      1401    2000    NULL    NULL

4      2001    3000    NULL    NULL

5      3001    9999    NULL    NULL

 

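Although the loaded data does not match the table structure, the load still succeeded. The salgrade file has only three fields per line, so the last two emp columns are read back as NULL.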
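
The behavior above can be imitated with a small Python sketch. It is only an illustration of the "schema on read" idea, not Hive's actual reader code; `cast`, `read_row`, and `emp_types` are hypothetical names introduced here. The file content is left untouched, and the declared column types are applied only when a line is read.

```python
# Illustration of "schema on read": the stored line is never validated,
# the table schema is applied only at read time.
# cast, read_row and emp_types are hypothetical names, not Hive APIs.

def cast(value, type_name):
    """Convert value to type_name, or return None (NULL) if it does not fit."""
    try:
        if type_name == "int":
            return int(value)
        if type_name == "float":
            return float(value)
        return value  # anything else is kept as a string
    except ValueError:
        return None


def read_row(line, column_types, delimiter="\t"):
    """Split one text line and map its fields onto the declared columns."""
    fields = line.rstrip("\n").split(delimiter)
    row = []
    for i, type_name in enumerate(column_types):
        # Declared columns with no matching field in the file become NULL.
        row.append(cast(fields[i], type_name) if i < len(fields) else None)
    return row


# emp is declared as (empno Int, ename String, mgr Int, sal Float, deptno Int)
emp_types = ["int", "string", "int", "float", "int"]

# A salgrade line (grade, losal, hisal) read through the emp schema.
print(read_row("1\t700\t1200", emp_types))  # [1, '700', 1200, None, None]
```

The printed row corresponds to the first line of the `select * from emp` output above: three values from the file followed by two NULLs.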

 

 

3.   Partitions

-- Create a partitioned table

// Note: the table's own columns must not overlap with the partition columns

hive> Create Table ptest(ename String) partitioned By (deptno Int) Row format delimited fields terminated By '\t';

OK

Time taken: 0.271 seconds

hive> desc ptest;

OK

ename                  string                  None               

deptno                 int                     None               

                

# Partition Information         

# col_name             data_type               comment            

                

deptno                 int                     None               

Time taken: 0.231 seconds, Fetched: 7 row(s)

 

Load data into the partitioned table

hive>load Data Local inpath '/home/licz/data/ptest20' overwrite Into Table ptest Partition(deptno=20);

Copying data from file:/home/licz/data/ptest20

Copying file: file:/home/licz/data/ptest20

Loading data to table default.ptest partition (deptno=20)

Deleted hdfs://gc:9000/user/hive/warehouse/ptest/deptno=20

Partition default.ptest{deptno=20} stats: [num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]

Table default.ptest stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]

OK

Time taken: 2.209 seconds

hive> load Data Local inpath '/home/licz/data/ptest30' Into Table ptest Partition(deptno=30);

hive> select * from ptest;

OK

SMITH  20

JONES  20

SCOTT  20

ADAMS  20

FORD   20

ALLEN  30

WARD   30

MARTIN  30

BLAKE  30

TURNER 30

JAMES  30

Time taken: 0.364 seconds, Fetched: 11 row(s)

 

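After a partition is created, a directory named after the partition appears under the table's directory, and the partition's data is stored inside it.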

hive> dfs -ls /user/hive/warehouse/ptest;

Found 2 items

drwxr-xr-x  - licz supergroup          0 2013-12-16 14:17 /user/hive/warehouse/ptest/deptno=20

drwxr-xr-x  - licz supergroup          0 2013-12-16 14:33 /user/hive/warehouse/ptest/deptno=30

hive> dfs -ls /user/hive/warehouse/ptest/deptno=20;

Found 1 items

-rw-r--r--  2 licz supergroup         29 2013-12-16 14:17 /user/hive/warehouse/ptest/deptno=20/ptest20

hive> dfs -cat /user/hive/warehouse/ptest/deptno=20/ptest20;

SMITH

JONES

SCOTT

ADAMS

FORD
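
The listings above suggest that the partition value is kept in the directory name and not in the data file, which holds only ename. A rough Python sketch of that layout follows. `partition_dir` and `read_partition` are hypothetical helpers, and the warehouse path is the one shown in the listings; this is not how Hive is implemented internally.

```python
# Sketch of the partition layout seen above: the partition column lives in
# the directory name, and is attached to each row when the data is read.
# partition_dir and read_partition are hypothetical helpers.

WAREHOUSE = "/user/hive/warehouse"  # path taken from the listings above


def partition_dir(table, partition_spec):
    """Build the directory for a partition, e.g. .../ptest/deptno=20."""
    parts = "/".join(f"{col}={val}" for col, val in partition_spec.items())
    return f"{WAREHOUSE}/{table}/{parts}"


def read_partition(file_lines, partition_spec):
    """Append the partition values (from the path) to every stored row."""
    return [(line, *partition_spec.values()) for line in file_lines]


print(partition_dir("ptest", {"deptno": 20}))
# /user/hive/warehouse/ptest/deptno=20

print(read_partition(["SMITH", "JONES"], {"deptno": 20}))
# [('SMITH', 20), ('JONES', 20)]
```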

 

Query a partition

hive> select ename from ptest where deptno=20;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_201312111332_0005, Tracking URL = http://gc:50030/jobdetails.jsp?jobid=job_201312111332_0005

Kill Command = /home/licz/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201312111332_0005

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2013-12-16 14:27:58,298 Stage-1 map = 0%, reduce = 0%

……

MapReduce Total cumulative CPU time: 3 seconds 110 msec

Ended Job = job_201312111332_0005

MapReduce Jobs Launched:

Job 0: Map: 1  Cumulative CPU: 3.11 sec   HDFS Read: 241 HDFS Write: 29 SUCCESS

Total MapReduce CPU Time Spent: 3 seconds 110 msec

OK

SMITH

JONES

SCOTT

ADAMS

FORD

Time taken: 61.54 seconds, Fetched: 5 row(s)

 

Show partitions

hive> show partitions ptest;

OK

deptno=20

deptno=30

Time taken: 0.221 seconds, Fetched: 2 row(s)

 

Insert data into a partition

hive> Insert overwrite Table ptest Partition(deptno=20) Select ename From emp Where deptno=20;

Total MapReduce jobs = 3

Launching Job 1 out of 3

……

5 Rows loaded to ptest

MapReduce Jobs Launched:

Job 0: Map: 1  Cumulative CPU: 3.24 sec   HDFS Read: 568 HDFS Write: 29 SUCCESS

Total MapReduce CPU Time Spent: 3 seconds 240 msec

OK

Time taken: 68.34 seconds

 

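4.   Buckets

A table or a partition can be organized into buckets. Bucketing splits the rows by the value of a chosen column, and each bucket corresponds to one reduce task. Before creating buckets, set the hive.enforce.bucketing property to true so that Hive honors the bucket definition.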

hive> Create Table bemp(empno Int,ename String,mgr Int,sal Float,deptno Int)

   > clustered By (empno) Into 3 buckets

   > Row format delimited fields terminated By '\t';

OK

 

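Insert data into the bucketed table. The rows are split into three buckets by empno, so the insert runs three reduce tasks and writes three files.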

hive> Insert overwrite Table bemp Select * From emp;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 3

In order to change the average load for a reducer (in bytes):

 set hive.exec.reducers.bytes.per.reducer=

In order to limit the maximum number of reducers:

 set hive.exec.reducers.max=

In order to set a constant number of reducers:

 set mapred.reduce.tasks=

Starting Job = job_201312111332_0011, Tracking URL = http://gc:50030/jobdetails.jsp?jobid=job_201312111332_0011

Kill Command = /home/licz/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201312111332_0011

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3

2013-12-17 01:14:04,048 Stage-1 map = 0%, reduce = 0%

2013-12-17 01:14:29,182 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.5 sec

……

Ended Job = job_201312111332_0011

Loading data to table default.bemp

Deleted hdfs://gc:9000/user/hive/warehouse/bemp

Table default.bemp stats: [num_partitions: 0, num_files: 3, num_rows: 0, total_size: 360, raw_data_size: 0]

14 Rows loaded to bemp

MapReduce Jobs Launched:

Job 0: Map: 1 Reduce: 3   Cumulative CPU: 42.03 sec   HDFS Read: 568 HDFS Write: 360 SUCCESS

Total MapReduce CPU Time Spent: 42 seconds 30 msec

OK

Time taken: 116.398 seconds

 

Look at the table's directory in the warehouse: the three buckets correspond to three files.

hive> dfs -ls /user/hive/warehouse/bemp;

Found 3 items

-rw-r--r--  2 licz supergroup        177 2013-12-17 01:15 /user/hive/warehouse/bemp/000000_0

-rw-r--r--  2 licz supergroup        103 2013-12-17 01:15 /user/hive/warehouse/bemp/000001_0

-rw-r--r--  2 licz supergroup         80 2013-12-17 01:15 /user/hive/warehouse/bemp/000002_0

hive> dfs -ls /user/hive/warehouse/bemp/000000_0;

Found 1 items

-rw-r--r--  2 licz supergroup        177 2013-12-17 01:15 /user/hive/warehouse/bemp/000000_0

 

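Hive hashes the value of the bucketing column and takes the remainder of the hash divided by the number of buckets to choose the bucket. This spreads the rows across the buckets, but the number of records in each bucket is not necessarily equal.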

hive> dfs -cat /user/hive/warehouse/bemp/000000_0;

7788   SCOTT   7566    3000.0  20

7839   KING    \N      5000.0  10

7521   WARD    7698    1250.0  30

7566   JONES   7839    2975.0  20

7902   FORD    7566    3000.0  20

7698   BLAKE   7839    2850.0  30

7782   CLARK   7839    2450.0  10

hive> dfs -cat /user/hive/warehouse/bemp/000001_0;

7369   SMITH   7902    800.0   20

7654   MARTIN  7698    1250.0  30

7876   ADAMS   7788    1100.0  20

7900   JAMES   7698    950.0   30

hive> dfs -cat /user/hive/warehouse/bemp/000002_0;

7934   MILLER  7782    1300.0  10

7844   TURNER  7698    1500.0  30

7499   ALLEN   7698    1600.0  30
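
The hash-and-remainder rule can be checked against the three files above with a short Python sketch. It assumes that the hash of an integer column is the integer itself; the file contents above are consistent with that assumption, but the source does not state it. `bucket_for` is a hypothetical helper, not a Hive function.

```python
# Sketch of the bucket assignment rule: hash(value) mod number_of_buckets.
# Assumption: the hash of an int column is the int itself (consistent with
# the file contents shown above). bucket_for is a hypothetical helper.

def bucket_for(empno, num_buckets):
    # The mask keeps the hash non-negative before taking the remainder.
    return (empno & 0x7FFFFFFF) % num_buckets


# The 14 empno values that appear in the three bucket files above.
empnos = [7369, 7499, 7521, 7566, 7654, 7698, 7782,
          7788, 7839, 7844, 7876, 7900, 7902, 7934]

buckets = {b: [] for b in range(3)}
for empno in empnos:
    buckets[bucket_for(empno, 3)].append(empno)

for b, members in buckets.items():
    print(f"00000{b}_0", members)
# 000000_0 [7521, 7566, 7698, 7782, 7788, 7839, 7902]
# 000001_0 [7369, 7654, 7876, 7900]
# 000002_0 [7499, 7844, 7934]
```

Under that assumption the sketch reproduces the grouping in the files: seven rows in the first bucket, four in the second, and three in the third.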

 

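Bucketing can give more efficient queries than partitioning alone in some cases, and it also makes it easy to sample the full data set, as in the sampling query below.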

hive> select * from bemp tablesample(bucket 1 out of 3 on empno);

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

……

Total MapReduce CPU Time Spent: 6 seconds 110 msec

OK

7788   SCOTT   7566    3000.0  20

7839   KING    NULL    5000.0  10

7521   WARD    7698    1250.0  30

7566   JONES   7839    2975.0  20

7902   FORD    7566    3000.0  20

7698   BLAKE   7839    2850.0  30

7782   CLARK   7839    2450.0  10

Time taken: 57.213 seconds, Fetched: 7 row(s)

 

 

 

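5.   Multi-table insert

A multi-table insert uses a single statement to insert the same source data into several tables. The data only has to be scanned once to complete all of the inserts, which is very efficient.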

hive> Create Table mutil1 As Select deptno,ename From emp;

hive> Create Table mutil2 Like mutil1;

OK

Time taken: 0.285 seconds

 

hive> From emp

   > Insert overwrite Table mutil1 Select deptno,ename

   > Insert overwrite Table mutil2 Select deptno,Count(ename) Group By deptno;

 

hive> select * from mutil1;

OK

20     SMITH

30     ALLEN

30     WARD

20     JONES

30     MARTIN

30     BLAKE

10     CLARK

20     SCOTT

10     KING

30     TURNER

20     ADAMS

30     JAMES

20     FORD

10     MILLER

Time taken: 0.502 seconds, Fetched: 14 row(s)

hive> select * from mutil2;

OK

10     3

20     5

30     6

Time taken: 0.149 seconds, Fetched: 3 row(s)
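
The single-scan idea can be mirrored in a few lines of Python. This is only an analogy under simple assumptions: Hive compiles the statement into MapReduce jobs, whereas the sketch loops over an in-memory list. The `emp` rows are the (deptno, ename) pairs from the mutil1 output above.

```python
# Analogy for a multi-table insert: one pass over the source rows feeds
# two different outputs. Not Hive's execution model, just the idea.
from collections import Counter

# (deptno, ename) pairs as shown in the mutil1 output above.
emp = [
    (20, "SMITH"), (30, "ALLEN"), (30, "WARD"), (20, "JONES"),
    (30, "MARTIN"), (30, "BLAKE"), (10, "CLARK"), (20, "SCOTT"),
    (10, "KING"), (30, "TURNER"), (20, "ADAMS"), (30, "JAMES"),
    (20, "FORD"), (10, "MILLER"),
]

mutil1 = []         # plays the role of: Select deptno, ename
mutil2 = Counter()  # plays the role of: Select deptno, Count(ename) Group By deptno

for deptno, ename in emp:  # the source is read exactly once
    mutil1.append((deptno, ename))
    mutil2[deptno] += 1

print(len(mutil1))             # 14
print(sorted(mutil2.items()))  # [(10, 3), (20, 5), (30, 6)]
```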

 

6.   Alter tables

Rename a table and add a column

hive> alter table mutil1 rename to mutil01;

hive> alter table mutil01 add columns(sal int);

OK

Time taken: 0.262 seconds

hive> desc mutil01;

OK

deptno                 int                     None               

ename                  string                  None               

sal                    int                     None               

Time taken: 0.154 seconds, Fetched: 3 row(s)

 

Just like the SQL you would write in Oracle (PL/SQL), isn't it!

 

7.   Drop tables

 

hive> drop table mutil01;

 

To delete only the data in a table while keeping the table itself, it is enough to delete the data files on HDFS, as follows:

hive> select * from mutil2;

OK

10     3

20     5

30     6

hive> dfs -ls /user/hive/warehouse/mutil2;

Found 1 items

-rw-r--r--  2 licz supergroup         15 2013-12-17 02:34 /user/hive/warehouse/mutil2/000000_0

hive> dfs -cat /user/hive/warehouse/mutil2/000000_0;

103

205

306

hive> dfs -rmr /user/hive/warehouse/mutil2/*;

Deleted hdfs://gc:9000/user/hive/warehouse/mutil2/000000_0

hive> select * from mutil2;

OK

Time taken: 0.269 seconds

hive>

 

Note: for a managed table, drop deletes both the metadata and the data files. For an external table, it deletes only the metadata.

 

8.   Joins

Hive's join operations (inner join, left outer join, right outer join, full outer join) are basically the same as in Oracle SQL (PL/SQL).

hive> select dept.*,emp.* from dept join emp on dept.deptno=emp.deptno;

10     ACCOUNTING      7839    KING    NULL    5000.0  10

10     ACCOUNTING      7782    CLARK   7839    2450.0  10

10     ACCOUNTING      7934    MILLER  7782    1300.0  10

20     RESEARCH        7369    SMITH   7902    800.0   20

20     RESEARCH        7566    JONES   7839    2975.0  20

20     RESEARCH        7876    ADAMS   7788    1100.0  20

20     RESEARCH        7902    FORD    7566    3000.0  20

20     RESEARCH        7788    SCOTT   7566    3000.0  20

30     SALES   7499    ALLEN   7698    1600.0  30

30     SALES   7844    TURNER  7698    1500.0  30

30     SALES   7900    JAMES   7698    950.0   30

30     SALES   7698    BLAKE   7839    2850.0  30

30     SALES   7654    MARTIN  7698    1250.0  30

30     SALES   7521    WARD    7698    1250.0  30

40     OPERATIONS      NULL    NULL    NULL    NULL    NULL

Time taken: 118.992 seconds, Fetched: 15 row(s)

// Note: the deptno 40 row padded with NULLs is what a left outer (or full outer) join returns; a plain inner join would return only the 14 matching rows.

hive> select dept.*,emp.* from dept left outer join emp on dept.deptno=emp.deptno;

hive> select dept.*,emp.* from dept right outer join emp on dept.deptno=emp.deptno;

hive> select dept.*,emp.* from dept full outer join emp on dept.deptno=emp.deptno;

// the OUTER keyword must be written explicitly (in the Hive version used here)

 

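The semi join is specific to Hive. The Hive version used here does not support IN with a subquery, and the alternative is LEFT SEMI JOIN. Note that the joined (right-hand) table cannot appear in the select list; it can only appear in the ON clause.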

hive> select dept.* from dept left semi join emp on dept.deptno=emp.deptno;

OK

10     ACCOUNTING

20     RESEARCH

30     SALES

Time taken: 119.238 seconds, Fetched: 3 row(s)
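
The result above matches the usual meaning of a semi join: each left-hand row is returned at most once if any matching right-hand row exists. A minimal Python sketch of that meaning follows, using the dept rows and the emp deptno values shown earlier. `left_semi_join` is a hypothetical helper and says nothing about how Hive executes the query.

```python
# Sketch of LEFT SEMI JOIN semantics: keep a left row if its key occurs
# on the right, and never duplicate it. left_semi_join is hypothetical.

dept = [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES"), (40, "OPERATIONS")]

# deptno values of the 14 emp rows shown earlier.
emp_deptnos = [20, 30, 30, 20, 30, 30, 10, 20, 10, 30, 20, 30, 20, 10]


def left_semi_join(left_rows, right_keys):
    """Return left rows whose first field appears among right_keys."""
    keys = set(right_keys)  # only existence matters, not how many matches
    return [row for row in left_rows if row[0] in keys]


for row in left_semi_join(dept, emp_deptnos):
    print(row)
# (10, 'ACCOUNTING')
# (20, 'RESEARCH')
# (30, 'SALES')
```

Only columns of dept can be returned, because the right-hand side is reduced to a set of keys. That is the same restriction as keeping emp out of the select list.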

 

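9.   Subqueries

HiveQL has limited support for subqueries: a subquery can appear only in the FROM clause. The statement below nests a subquery in its FROM clause (the subquery must be given an alias, t here).

hive> Select Deptno,Max(Num) From (Select Deptno, Count(Empno) Num From Emp Group By Deptno) t Group By Deptno;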

 

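10. Create views

Hive (at least in the version used here) supports only logical views, not physical (materialized) views. A view that has been created can be found in the MySQL metastore database, but there is no corresponding directory for it under Hive's data warehouse directory.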

hive> create view v_test as select a.dname,count(b.empno) from dept a join emp b on a.deptno=b.deptno group by a.dname;

OK

 

mysql> select database();

+------------+

| database() |

+------------+

| hive      |

+------------+

1 row in set (0.00 sec)

 

mysql> select tbl_name from TBLS;

+---------------------+

| tbl_name           |

+---------------------+

| access_20120104_log |

| bemp               |

| dept               |

| emp                |

| ext                |

| mutil2             |

| ptest              |

| salgrade           |

| test               |

| v_test             |

+---------------------+

10 rows in set (0.00 sec)

 

mysql>



 

 
