1. Loading files into a table
When data is loaded into a table, Hive applies no transformation to it; the LOAD operation simply copies the files (for local sources) or moves them (for HDFS sources) into the storage location of the Hive table.
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE table_name [PARTITION (partitioncol=val,...)]
filepath can be a relative path, an absolute path, or a full URI. The load target can be a table or a partition; if the table is partitioned, a value must be supplied for every partition column. filepath may refer to a single file, in which case Hive moves that file into the table's directory, or to a directory, in which case Hive moves all the files in that directory into the table's directory.
If LOCAL is specified, the LOAD command looks for filepath in the local file system.
If OVERWRITE is specified, the existing contents of the target table are deleted first.
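For instance, a minimal sketch combining the LOCAL and PARTITION variants (the table logs and its partition column dt are hypothetical and assumed to exist):
LOAD DATA LOCAL INPATH '/home/caiyong/logs.txt' OVERWRITE INTO TABLE logs PARTITION (dt='2015-03-11');
Because LOCAL is given, the file is copied up from the local file system rather than moved within HDFS.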
Example:
Create a Hive table that matches the data format:
hive> drop table test_table;
OK
Time taken: 0.173 seconds
hive> create table test_table (name string, id string, ip string)
    > row format delimited
    > fields terminated by '\t';
OK
Time taken: 0.103 seconds
Upload the local file to HDFS:
caiyong@caiyong:/opt/hadoop$ bin/hadoop fs -copyFromLocal /home/caiyong/桌面/hivetestdata /
View the data:
caiyong@caiyong:/opt/hadoop$ bin/hadoop fs -cat /hi*
name1 001 127.0.0.1
name2 002 127.0.0.1
name3 003 127.0.0.1
name4 004 127.0.0.1
name5 005 127.0.0.1
name6 006 192.168.0.1
name7 007 192.168.0.1
name8 008 192.168.0.1
name9 009 192.168.0.4
name10 010 192.168.0.4
Load the data into test_table:
hive> LOAD DATA INPATH '/hivetestdata' OVERWRITE INTO TABLE test_table;
Loading data to table default.test_table
Table default.test_table stats: [numFiles=1, numRows=0, totalSize=211, rawDataSize=0]
OK
Time taken: 0.302 seconds
Run a query to verify:
hive> select * from test_table;
OK
name1 001 127.0.0.1
name2 002 127.0.0.1
name3 003 127.0.0.1
name4 004 127.0.0.1
name5 005 127.0.0.1
name6 006 192.168.0.1
name7 007 192.168.0.1
name8 008 192.168.0.1
name9 009 192.168.0.4
name10 010 192.168.0.4
Time taken: 0.084 seconds, Fetched: 10 row(s)
View the data in the table's warehouse directory (LOAD DATA INPATH moved /hivetestdata here rather than copying it):
caiyong@caiyong:/opt/hadoop$ bin/hadoop fs -cat /user/hive/warehouse/test_table/*
name1 001 127.0.0.1
name2 002 127.0.0.1
name3 003 127.0.0.1
name4 004 127.0.0.1
name5 005 127.0.0.1
name6 006 192.168.0.1
name7 007 192.168.0.1
name8 008 192.168.0.1
name9 009 192.168.0.4
name10 010 192.168.0.4
Run a query from the command line:
caiyong@caiyong:/opt/hive$ bin/hive -e "select count(*) from test_table;"
Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-1.0.0.jar!/hive-log4j.properties
Query ID = caiyong_20150311165252_e32d9590-2ba8-46e3-b753-3e6651fa3226
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201503111440_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201503111440_0001
Kill Command = /opt/hadoop/libexec/../bin/hadoop job -kill job_201503111440_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-03-11 16:52:30,782 Stage-1 map = 0%, reduce = 0%
2015-03-11 16:52:33,831 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.32 sec
2015-03-11 16:52:41,906 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 1.32 sec
2015-03-11 16:52:42,915 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.51 sec
MapReduce Total cumulative CPU time: 3 seconds 510 msec
Ended Job = job_201503111440_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.51 sec HDFS Read: 430 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 510 msec
OK
10
Time taken: 25.566 seconds, Fetched: 1 row(s)
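Besides -e for a single inline query, the hive CLI can also run a script file with -f; a sketch (the script path is illustrative):
bin/hive -f /home/caiyong/count_test_table.hql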
2. Inserting query results into a Hive table
Syntax:
INSERT OVERWRITE TABLE table_name [PARTITION (partitioncol=val,...)] select_statement FROM from_statement
An insert can target either a table or a partition. The OVERWRITE keyword forces the query results to replace the table's existing contents (INSERT INTO, by contrast, appends). The output format and serialization are determined by the table's metadata. Hive also supports multi-table inserts, which reduce the number of data scans: the input is scanned only once, and multiple insert clauses are applied to it.
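A minimal sketch of such a multi-table insert, assuming two pre-created target tables table_a and table_b with compatible schemas (both names are hypothetical):
FROM test_table t
INSERT OVERWRITE TABLE table_a SELECT t.name, t.id WHERE t.ip = '127.0.0.1'
INSERT OVERWRITE TABLE table_b SELECT t.name, t.id WHERE t.ip <> '127.0.0.1';
test_table is scanned once, and each row is routed to every INSERT clause whose WHERE condition it satisfies.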
Example:
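The transcript below assumes that the target table test_table_insert already exists with a schema matching the SELECT; presumably it was created along the lines of:
create table test_table_insert (name string, id string, ip string)
row format delimited
fields terminated by '\t';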
Insert the query results into another table:
hive> INSERT OVERWRITE TABLE test_table_insert select * from test_table where ip = '127.0.0.1';
Query ID = caiyong_20150311190404_e5bfc45e-feb2-4d3a-b3eb-d4205c2c5666
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201503111757_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201503111757_0003
Kill Command = /opt/hadoop/libexec/../bin/hadoop job -kill job_201503111757_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-11 19:04:25,706 Stage-1 map = 0%, reduce = 0%
2015-03-11 19:04:29,742 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.9 sec
2015-03-11 19:04:31,752 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.9 sec
MapReduce Total cumulative CPU time: 1 seconds 900 msec
Ended Job = job_201503111757_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://127.0.0.1:8020/tmp/hive/caiyong/c7c7bca9-486e-47ff-8ac0-211c190c09e8/hive_2015-03-11_19-04-19_263_4334135992121304958-1/-ext-10000
Loading data to table default.test_table_insert
Table default.test_table_insert stats: [numFiles=1, numRows=5, totalSize=100, rawDataSize=95]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.9 sec HDFS Read: 430 HDFS Write: 181 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 900 msec
OK
Time taken: 12.853 seconds
Verify the result:
hive> select * from test_table_insert;
OK
name1 001 127.0.0.1
name2 002 127.0.0.1
name3 003 127.0.0.1
name4 004 127.0.0.1
name5 005 127.0.0.1
Time taken: 0.073 seconds, Fetched: 5 row(s)
3. Writing query results to the file system
Syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory SELECT ... FROM ...
The LOCAL keyword directs the output to the local file system instead of HDFS.
When data is written to the file system it is serialized as text; any column whose type is not primitive is serialized to JSON format.
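For instance, a sketch of exporting to a local directory instead of HDFS (the output path is illustrative):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test_table_export' SELECT * FROM test_table;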
Example:
hive> INSERT OVERWRITE DIRECTORY '/testcopy/' SELECT * FROM test_table WHERE ip = '127.0.0.1';
Query ID = caiyong_20150311191111_0c2541aa-3f3a-4542-bd50-f64da7a97a1d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201503111757_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201503111757_0006
Kill Command = /opt/hadoop/libexec/../bin/hadoop job -kill job_201503111757_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-11 19:12:04,272 Stage-1 map = 0%, reduce = 0%
2015-03-11 19:12:08,286 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.53 sec
2015-03-11 19:12:10,300 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_201503111757_0006
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://127.0.0.1:8020/tmp/hive/caiyong/c7c7bca9-486e-47ff-8ac0-211c190c09e8/hive_2015-03-11_19-11-59_505_7890928127050906953-1/-ext-10000
Moving data to: /testcopy
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.53 sec HDFS Read: 430 HDFS Write: 100 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 530 msec
OK
Time taken: 10.968 seconds
Verify the result:
caiyong@caiyong:/opt/hadoop$ bin/hadoop fs -ls /testcopy/
Found 1 items
-rw-r--r-- 1 caiyong supergroup 100 2015-03-11 19:12 /testcopy/000000_0
caiyong@caiyong:/opt/hadoop$ bin/hadoop fs -cat /testcopy/*
name1001127.0.0.1
name2002127.0.0.1
name3003127.0.0.1
name4004127.0.0.1
name5005127.0.0.1
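The fields above only look concatenated: when writing to a directory, Hive serializes columns with its default delimiter ^A (\001), which is non-printing, so cat shows the values run together. On Hive 0.11.0 and later the delimiter can be specified explicitly; a sketch (the output path is illustrative):
INSERT OVERWRITE DIRECTORY '/testcopy_tab/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM test_table WHERE ip = '127.0.0.1';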