大数据开发之Hive篇4-Hive数据操作语言

备注:
Hive 版本 2.1.1

文章目录

  • 一.Hive的DML(数据操作语言)概述
  • 二.Load 命令
    • 2.1 数据准备
    • 2.2 将服务器文件加载到hive表
    • 2.3 将HDFS文件加载到hive表
  • 三.INSERT INTO TABLE FROM Query
    • 3.1 Insert into select 语句
    • 3.2 Insert overwrite select 语句
    • 3.3 multiple inserts
    • 3.4 动态分区插入
  • 四.CTAS
  • 五.Writing data into the filesystem from queries
  • 六.INSERT
  • 七.Update 语句
  • 八.Delete 语句
  • 九.Merge
  • 参考

一.Hive的DML(数据操作语言)概述

Hive的DML语句继承了传统的关系型数据库,在此之上还有自己独特的语句。
数据仓库实施的过程中,经常会使用到DML,用于写业务逻辑。

语句 描述 备注
Load 命令 向表中加载数据 可以是操作系统文件,也可以是HDFS上的文件
INSERT INTO TABLE FROM Query 查询结构插入表或分区 正式环境使用最频繁的语句
CTAS 查询结果直接创建表 省略了建表语句,直接将数据录入自动创建的表
Writing data into the filesystem from queries 查询结果可以插入到文件系统目录中 将hive表的数据加载到文件
INSERT 标准的insert 语句录入数据 insert values 标准的录入数据语句
Update Update语句,更新数据 需要开启事务
Delete Delete语句,删除数据 需要开启事务
Merge Merge语句,存在update,不存在insert 需要开始事务,Hive 2.2版本开始支持

二.Load 命令

语法:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
 
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)

LOAD操作是一个简单的文件系统的move/copy操作,不起MR Job
如果表是分区表,比如指定要加载到的确定的分区(指定所有分区字段的值)
filepath可以是文件,也可以是目录。如果是文件,则将该文件move/copy到相应的路径下,如果是目录,则将该目录下的所有文件move/copy到相应的路径下;如果是目录,不可包含子目录
如果指定LOCAL关键字,则指本地文件copy到相应的路径下;如果不指定LOCAL,则是将默认HDFS上的文件move到相应的路径下
如果指定OVERWRITE关键字,则先先删除掉表或者分区的内容,将filepath下的内容替换;否则,将filepath下的内容添加到表或者分区路径下

2.1 数据准备

人工造数据:
空值 有null、有’’、有直接为空的

[root@hp1 hive]# more test1.txt 
1,'abc','2020-01-01'
2,'def','2020-01-01 23:00:01'
3,'ghi','2020-01-02 19:00:10'
4,'aaa','2020-01-02 19:00:20'
5,,'2020-01-02 19:00:30'
6,'abc',
7,'def','2020-01-03 19:00:30'
8,'abc','2020-02-02 19:00:30'

将test1.txt上传到hdfs

[root@hp1 hive]# hadoop fs -put test1.txt /tmp
[root@hp1 hive]# 
[root@hp1 hive]# hadoop fs -ls /tmp
Found 6 items
d---------   - hdfs   supergroup          0 2020-12-03 18:16 /tmp/.cloudera_health_monitoring_canary_files
drwxrwxrwt   - root   supergroup          0 2020-12-03 11:15 /tmp/db_test3.db
drwxrwxrwt   - root   supergroup          0 2020-12-03 11:16 /tmp/db_test4.db
drwx-wx-wx   - hive   supergroup          0 2020-11-18 17:57 /tmp/hive
drwxrwxrwt   - mapred hadoop              0 2020-11-19 15:36 /tmp/logs
-rw-rw-rw-   3 root   supergroup        166 2020-12-03 18:16 /tmp/test1.txt
[root@hp1 hive]# 

hive表

CREATE TABLE t2(
  user_id int, 
  name varchar(100) COMMENT '用户姓名', 
  loagin_date string);

2.2 将服务器文件加载到hive表

LOAD DATA LOCAL INPATH '/backup/hive/test1.txt' into table t2;

导入数据1
测试记录:
可以看到都是空格

hive> 
    > LOAD DATA LOCAL INPATH '/backup/hive/test1.txt' into table t2; 
Loading data to table test.t2
OK
Time taken: 0.867 seconds
hive> select * from t2;
OK
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
Time taken: 0.446 seconds, Fetched: 9 row(s)

更改表的分隔符

drop table t2;
CREATE TABLE t2(
  user_id int, 
  name varchar(100) COMMENT '用户姓名', 
  loagin_date string) row format delimited fields terminated by ',' stored as textfile;

导入数据2
可以看到数值和字符类型是正常导入了,但是时间类型没有正常导入

hive> 
    > 
    > LOAD DATA LOCAL INPATH '/backup/hive/test1.txt' into table t2; 
Loading data to table test.t2
OK
Time taken: 0.632 seconds
hive> select * from t2;
OK
1       'abc'   '2020-01-01'
2       'def'   '2020-01-01 23:00:01'
3       'ghi'   '2020-01-02 19:00:10'
4       'aaa'   '2020-01-02 19:00:20'
5               '2020-01-02 19:00:30'
6       'abc'
7       'def'   '2020-01-03 19:00:30'
8       'abc'   '2020-02-02 19:00:30'
Time taken: 0.073 seconds, Fetched: 8 row(s)

2.3 将HDFS文件加载到hive表

与操作系统文件相比,就是少了个local关键字,其余相同。

create table t3 like t2;
LOAD DATA INPATH '/tmp/test1.txt' into table t3;

测试记录:

hive> 
    > create table t3 like t2;
OK
Time taken: 0.078 seconds
hive> 
    > LOAD DATA  INPATH '/tmp/test1.txt' into table t3;
Loading data to table test.t3
OK
Time taken: 0.605 seconds
hive> select * from t3;
OK
1       'abc'   '2020-01-01'
2       'def'   '2020-01-01 23:00:01'
3       'ghi'   '2020-01-02 19:00:10'
4       'aaa'   '2020-01-02 19:00:20'
5               '2020-01-02 19:00:30'
6       'abc'
7       'def'   '2020-01-03 19:00:30'
8       'abc'   '2020-02-02 19:00:30'
Time taken: 0.064 seconds, Fetched: 8 row(s)
hive> 

三.INSERT INTO TABLE FROM Query

语法:

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
 
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
 
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

如果表是分区表,比如指定要加载到的确定的分区(指定所有分区字段的值)
如果指定OVERWRITE关键字,则先先删除掉表或者分区的内容,将查询结果的内容替换;否则,将查询结果添加到表或者分区路径下
如果指定OVERWRITE关键字,则先先删除掉表或者分区的内容,将查询结果的内容替换;否则,将查询结果添加到表或者分区路径下

3.1 Insert into select 语句

insert into select 语句同关系型数据的语法,用于从表中加载数据。

create table t4 like t3;
insert into t4 select user_id,name,loagin_date from t3;
select * from t4;

测试记录:

hive> 
    > 
    > create table t4 like t3;
OK
Time taken: 0.091 seconds
hive> insert into t4 select user_id,name,loagin_date from t3;
Query ID = root_20201204105325_8e9447df-993b-4b9c-b347-fd0c60382987
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0021, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0021/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0021
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 10:53:33,921 Stage-1 map = 0%,  reduce = 0%
2020-12-04 10:53:39,158 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1606698967173_0021
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/t4/.hive-staging_hive_2020-12-04_10-53-25_821_7213321429457977661-1/-ext-10000
Loading data to table test.t4
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.78 sec   HDFS Read: 4468 HDFS Write: 269 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
Time taken: 16.067 seconds
hive> 
    > select * from t4;
OK
1       'abc'   '2020-01-01'
2       'def'   '2020-01-01 23:00:01'
3       'ghi'   '2020-01-02 19:00:10'
4       'aaa'   '2020-01-02 19:00:20'
5               '2020-01-02 19:00:30'
6       'abc'
7       'def'   '2020-01-03 19:00:30'
8       'abc'   '2020-02-02 19:00:30'
Time taken: 0.066 seconds, Fetched: 8 row(s)

3.2 Insert overwrite select 语句

Insert overwrite table select 功能是先清空表,然后在将select的语句录入到表
这个语句是hive的原创,数据仓库中全量更新数据的时候经常会使用到。
语法:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

代码:

-- 延续上例 再insert 一次
insert into t4 select user_id,name,loagin_date from t3;
-- 查询数据发现重复了
select * from t4;
-- 改成 insert overwrite table命令
insert overwrite table t4 select user_id,name,loagin_date from t3;
-- 再次验证数据
select * from t4;

测试记录:

hive> 
    > insert into t4 select user_id,name,loagin_date from t3;
Query ID = root_20201204110806_8624193d-6311-44f5-8575-2abeb6c9e102
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0022, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0022/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 11:08:14,896 Stage-1 map = 0%,  reduce = 0%
2020-12-04 11:08:22,132 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.72 sec
MapReduce Total cumulative CPU time: 1 seconds 720 msec
Ended Job = job_1606698967173_0022
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/t4/.hive-staging_hive_2020-12-04_11-08-06_488_865341941305657171-1/-ext-10000
Loading data to table test.t4
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.72 sec   HDFS Read: 4552 HDFS Write: 269 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 720 msec
OK
Time taken: 17.314 seconds
hive> select * from t4;
OK
1       'abc'   '2020-01-01'
2       'def'   '2020-01-01 23:00:01'
3       'ghi'   '2020-01-02 19:00:10'
4       'aaa'   '2020-01-02 19:00:20'
5               '2020-01-02 19:00:30'
6       'abc'
7       'def'   '2020-01-03 19:00:30'
8       'abc'   '2020-02-02 19:00:30'
1       'abc'   '2020-01-01'
2       'def'   '2020-01-01 23:00:01'
3       'ghi'   '2020-01-02 19:00:10'
4       'aaa'   '2020-01-02 19:00:20'
5               '2020-01-02 19:00:30'
6       'abc'
7       'def'   '2020-01-03 19:00:30'
8       'abc'   '2020-02-02 19:00:30'
Time taken: 0.066 seconds, Fetched: 16 row(s)
hive> 
    > insert overwrite table t4 select user_id,name,loagin_date from t3;
Query ID = root_20201204110931_49346b1e-779f-4b4f-830f-7f363b2855c9
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0023, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0023/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0023
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 11:09:38,356 Stage-1 map = 0%,  reduce = 0%
2020-12-04 11:09:43,517 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1606698967173_0023
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/t4/.hive-staging_hive_2020-12-04_11-09-31_276_442497908099823168-1/-ext-10000
Loading data to table test.t4
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.75 sec   HDFS Read: 4557 HDFS Write: 269 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
Time taken: 14.942 seconds
hive> select * from t4;
OK
1       'abc'   '2020-01-01'
2       'def'   '2020-01-01 23:00:01'
3       'ghi'   '2020-01-02 19:00:10'
4       'aaa'   '2020-01-02 19:00:20'
5               '2020-01-02 19:00:30'
6       'abc'
7       'def'   '2020-01-03 19:00:30'
8       'abc'   '2020-02-02 19:00:30'
Time taken: 0.066 seconds, Fetched: 8 row(s)

3.3 multiple inserts

语法:

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

需求:
我要将员工表有奖金和没有奖金的分开2个表存储
一般情况下需要2个insert语句,有了 multiple insert之后,一个语句就可以搞定

代码:

create table emp_bonus like emp;
create table emp_no_bonus like emp;

from emp e
insert into emp_bonus
select empno,ename,job,mgr,hiredate,sal,comm,deptno 
 where comm > 0
insert into emp_no_bonus
select empno,ename,job,mgr,hiredate,sal,comm,deptno 
 where comm = 0 or comm is null;

select * from emp_bonus;
select * from emp_no_bonus;

测试记录:

hive> 
    > create table emp_bonus like emp;
OK
Time taken: 0.078 seconds
hive> create table emp_no_bonus like emp;
OK
Time taken: 0.095 seconds
hive> 
    > from emp e
    > insert into emp_bonus
    > select empno,ename,job,mgr,hiredate,sal,comm,deptno 
    >  where comm > 0
    > insert into emp_no_bonus
    > select empno,ename,job,mgr,hiredate,sal,comm,deptno 
    >  where comm = 0 or comm is null;
Query ID = root_20201204113103_98083a55-2d8a-44d5-85fa-67af5369a21d
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0024, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0024/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0024
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 0
2020-12-04 11:31:11,214 Stage-2 map = 0%,  reduce = 0%
2020-12-04 11:31:17,437 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 6.82 sec
MapReduce Total cumulative CPU time: 6 seconds 820 msec
Ended Job = job_1606698967173_0024
Stage-5 is filtered out by condition resolver.
Stage-4 is selected by condition resolver.
Stage-6 is filtered out by condition resolver.
Stage-11 is filtered out by condition resolver.
Stage-10 is selected by condition resolver.
Stage-12 is filtered out by condition resolver.
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0025, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0025/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0025
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2020-12-04 11:31:29,467 Stage-4 map = 0%,  reduce = 0%
2020-12-04 11:31:35,647 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_1606698967173_0025
Launching Job 5 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0026, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0026/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0026
Hadoop job information for Stage-10: number of mappers: 1; number of reducers: 0
2020-12-04 11:31:47,670 Stage-10 map = 0%,  reduce = 0%
2020-12-04 11:31:53,852 Stage-10 map = 100%,  reduce = 0%, Cumulative CPU 1.57 sec
MapReduce Total cumulative CPU time: 1 seconds 570 msec
Ended Job = job_1606698967173_0026
Loading data to table test.emp_bonus
Loading data to table test.emp_no_bonus
MapReduce Jobs Launched: 
Stage-Stage-2: Map: 2   Cumulative CPU: 6.82 sec   HDFS Read: 15263 HDFS Write: 965 HDFS EC Read: 0 SUCCESS
Stage-Stage-4: Map: 1   Cumulative CPU: 1.53 sec   HDFS Read: 3180 HDFS Write: 157 HDFS EC Read: 0 SUCCESS
Stage-Stage-10: Map: 1   Cumulative CPU: 1.57 sec   HDFS Read: 3568 HDFS Write: 520 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 920 msec
OK
Time taken: 52.03 seconds
hive> select * from emp_bonus;
OK
7521    WARD    SALESMAN        7698    1981-02-22      1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-09-28      1250.0  1400.0  30
7499    ALLEN   SALESMAN        7698    1981-02-20      1600.0  300.0   30
Time taken: 0.081 seconds, Fetched: 3 row(s)
hive> 
    > select * from emp_no_bonus;
OK
7566    JONES   MANAGER 7839    1981-04-02      2975.0  NULL    20
7698    BLAKE   MANAGER 7839    1981-05-01      2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-06-09      2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-06-13      3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-09-08      1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-06-13      1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-03      950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-03      3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-01-23      1300.0  NULL    10
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
Time taken: 0.059 seconds, Fetched: 11 row(s)
hive> 

3.4 动态分区插入

语法:

Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

用户可以不完全指定分区字段的值,而是部分指定。对于固定指定的分区称之为静态分区,相反,对于没有指定的分区称之为动态分区。
动态分区的值取决于select查询中相应的字段值

hive.exec.dynamic.partition.mode=[strict|nonstrict]
strict mode下,必须至少指定一个静态分区; nonstrictmode下,可以全部都是动态分区

四.CTAS

CTAS 省略了建表的语句,直接将查询结果集录入到新表里面,可以用于临时表及备份表。

语法:

CREATE TABLE tablename
[ROW FORMAT row_format] [STORED AS file_format]
AS SELECT … FROM

测试记录:

hive> 
    > 
    > 
    > desc emp_ctas;
FAILED: SemanticException [Error 10001]: Table not found emp_ctas
hive> create table emp_ctas as select * from emp where 1 = 1;
Query ID = root_20201204144641_1e70859c-276e-4ac4-b593-225f3d9e2770
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0027, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0027/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0027
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 14:46:48,605 Stage-1 map = 0%,  reduce = 0%
2020-12-04 14:46:54,786 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.69 sec
MapReduce Total cumulative CPU time: 3 seconds 690 msec
Ended Job = job_1606698967173_0027
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0028, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0028/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0028
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-12-04 14:47:07,574 Stage-3 map = 0%,  reduce = 0%
2020-12-04 14:47:13,760 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.52 sec
MapReduce Total cumulative CPU time: 1 seconds 520 msec
Ended Job = job_1606698967173_0028
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/emp_ctas
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 3.69 sec   HDFS Read: 9949 HDFS Write: 816 HDFS EC Read: 0 SUCCESS
Stage-Stage-3: Map: 1   Cumulative CPU: 1.52 sec   HDFS Read: 3004 HDFS Write: 677 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 210 msec
OK
Time taken: 33.892 seconds
hive> 
    > select * from emp_ctas;
OK
7521    WARD    SALESMAN        7698    1981-02-22      1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02      2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-09-28      1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01      2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-06-09      2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-06-13      3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-09-08      1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-06-13      1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-03      950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-03      3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-01-23      1300.0  NULL    10
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-02-20      1600.0  300.0   30
Time taken: 0.061 seconds, Fetched: 14 row(s)

五.Writing data into the filesystem from queries

将查询的结果集录入到目录中

语法:


Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...
 
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
 
  
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

代码:

insert overwrite local directory '/tmp/emp.txt'
select * from emp;

测试记录:

hive> 
    > insert overwrite local directory '/tmp/emp.txt'
    > select * from emp;
Query ID = root_20201204145220_d4e24136-f156-466c-bf0d-f9e36e9a82fe
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0029, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0029/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0029
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 14:52:26,262 Stage-1 map = 0%,  reduce = 0%
2020-12-04 14:52:33,455 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.42 sec
MapReduce Total cumulative CPU time: 3 seconds 420 msec
Ended Job = job_1606698967173_0029
Moving data to local directory /tmp/emp.txt
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 3.42 sec   HDFS Read: 9863 HDFS Write: 677 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 420 msec
OK
Time taken: 14.343 seconds


[root@hp1 hive]# more /tmp/emp.txt/

*** /tmp/emp.txt/: directory ***

[root@hp1 hive]# cd /tmp/emp.txt/
[root@hp1 emp.txt]# ls
000000_0  000001_0
[root@hp1 emp.txt]# more 000000_0
7521WARDSALESMAN76981981-02-221250.0500.030
7566JONESMANAGER78391981-04-022975.0\N20
7654MARTINSALESMAN76981981-09-281250.01400.030
7698BLAKEMANAGER78391981-05-012850.0\N30
7782CLARKMANAGER78391981-06-092450.0\N10
7788SCOTTANALYST75661987-06-133000.0\N20
7839KINGPRESIDENT\N1981-11-175000.0\N10
7844TURNERSALESMAN76981981-09-081500.00.030
7876ADAMSCLERK77881987-06-131100.0\N20
7900JAMESCLERK76981981-12-03950.0\N30
7902FORDANALYST75661981-12-033000.0\N20
7934MILLERCLERK77821982-01-231300.0\N10
[root@hp1 emp.txt]# more 000001_0
7369SMITHCLERK79021980-12-17800.0\N20
7499ALLENSALESMAN76981981-02-201600.0300.030
[root@hp1 emp.txt]# 

六.INSERT

sql标准的insert语句,使用方法同传统关系型数据库的insert 语句

语法:

Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
  
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal

需求:往 emp_ctas表录入数据

代码:

insert into emp_ctas(empno,ename,deptno) values (1,'test1',10);
insert into emp_ctas(empno,ename,deptno) values (2,'test2',20),(3,'test3',30);

测试记录:

hive> 
    > insert overwrite local directory '/tmp/emp.txt'
    > select * from emp;
Query ID = root_20201204145220_d4e24136-f156-466c-bf0d-f9e36e9a82fe
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0029, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0029/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0029
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-04 14:52:26,262 Stage-1 map = 0%,  reduce = 0%
2020-12-04 14:52:33,455 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.42 sec
MapReduce Total cumulative CPU time: 3 seconds 420 msec
Ended Job = job_1606698967173_0029
Moving data to local directory /tmp/emp.txt
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 3.42 sec   HDFS Read: 9863 HDFS Write: 677 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 420 msec
OK
Time taken: 14.343 seconds
hive> 
    > insert into emp_ctas(empno,ename,deptno) values (1,'test1',10);
Query ID = root_20201204145950_46e74db1-30a2-47ef-bc68-7334c0bfe75d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0030, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0030/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 14:59:57,007 Stage-1 map = 0%,  reduce = 0%
2020-12-04 15:00:05,251 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.97 sec
MapReduce Total cumulative CPU time: 2 seconds 970 msec
Ended Job = job_1606698967173_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/emp_ctas/.hive-staging_hive_2020-12-04_14-59-50_800_1003035375971432143-1/-ext-10000
Loading data to table test.emp_ctas
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.97 sec   HDFS Read: 5352 HDFS Write: 95 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 970 msec
OK
Time taken: 16.171 seconds
hive> insert into emp_ctas(empno,ename,deptno) values (2,'test2',20),(3,'test3',30);
Query ID = root_20201204150012_55a81ff8-bb7f-4e07-b77b-371655c88c3e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0031, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0031/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606698967173_0031
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-04 15:00:18,975 Stage-1 map = 0%,  reduce = 0%
2020-12-04 15:00:25,227 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1606698967173_0031
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/emp_ctas/.hive-staging_hive_2020-12-04_15-00-12_024_6107099678250146899-1/-ext-10000
Loading data to table test.emp_ctas
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.0 sec   HDFS Read: 5366 HDFS Write: 121 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
Time taken: 14.84 seconds
hive> 

七.Update 语句

语法:

Standard Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

默认是没有开启事务的,所以update语句执行会报错,如下:

hive> 
    > update emp_ctas set comm = 100 where ename = 'test1';
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive> 
    > 

后续在事务章节再来讲解ACID事务。

八.Delete 语句

语法:

Standard Syntax:
DELETE FROM tablename [WHERE expression]

后续在事务章节再来讲解ACID事务。

九.Merge

Merge的用法同关系型数据库的merge语句

语法:

Standard Syntax:
MERGE INTO  AS T USING  AS S
ON 
WHEN MATCHED [AND ] THEN UPDATE SET 
WHEN MATCHED [AND ] THEN DELETE
WHEN NOT MATCHED [AND ] THEN INSERT VALUES

hive 2.2版本之后才支持,此处略过。

参考

1.https://cwiki.apache.org/confluence/display/Hive/LanguageManual

你可能感兴趣的:(#,Hive,大数据和数据仓库,hive,数据仓库,DML,加载数据)