Note: if you are a beginner, you will get more out of this post if you read the previous one first.
Creating Hive tables: http://blog.csdn.net/qq_29622761/article/details/51564680
The main topics are the numbered sections below.
Let's get started!
1. Creating one table from another
Syntax: create table test3 like test2;
What we'll run: create table testtext_c like testtext;
(This approach does not copy any data; it only creates a new table with the same schema.)
First, load some data into the testtext table:
[root@hadoop1 host]# cat testtext
wer 46
wer 89
weree 78
rr 89
hive> load data local inpath '/usr/host/testtext' into table testtext;
Copying data from file:/usr/host/testtext
Copying file: file:/usr/host/testtext
Loading data to table default.testtext
OK
Time taken: 0.294 seconds
hive> select * from testtext;
OK
wer 46
wer 89
weree 78
rr 89
Time taken: 0.186 seconds
hive>
2 Now let's create testtext_c (the LIKE way)
hive> create table testtext_c like testtext;
OK
Time taken: 0.181 seconds
hive> select * from testtext;
OK
wer 46
wer 89
weree 78
rr 89
Time taken: 0.204 seconds
hive> select * from testtext_c;
OK
Time taken: 0.158 seconds
hive>
See? testtext_c really is empty. I wasn't kidding!
3 Hold on, there is one more way (AS)
hive> create table testtext_cc as select name,addr from testtext;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 20:49:59,404 null map = 0%, reduce = 0%
2016-06-01 20:50:20,644 null map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2016-06-01 20:50:21,735 null map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
MapReduce Total cumulative CPU time: 1 seconds 300 msec
Ended Job = job_1464828076391_0004
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 1011778050, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2016-06-01_20-49-43_516_5205177189363939745/-ext-10001
Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/testtext_cc
Table default.testtext_cc stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 48.014 seconds
MapReduce again? Why? create table testtext_c like testtext; never touched MapReduce, so why does this one? The difference is the SELECT: only a bare select * from <table> skips MapReduce; every other SELECT compiles into a MapReduce job, because under the hood Hive executes queries as MapReduce. See my previous post if you don't believe me.
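To see the contrast yourself, here is a minimal sketch against the same testtext table (this is the behavior on the Hive version used here; newer releases can skip MapReduce in a few more cases):
hive> select * from testtext;    -- bare scan, no MapReduce job
hive> select name from testtext; -- projecting a column compiles into a MapReduce job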
Check whether the data came along:
hive> select * from testtext_cc;
OK
wer 46
wer 89
weree 78
rr 89
Time taken: 0.116 seconds
hive>
There it is!
So: create table testtext_cc as select name,addr from testtext;
(This CTAS form runs MapReduce, and it copies the data over as well.)
4 Next, let's compare the different file formats
There are the textfile format, the sequencefile format, the rcfile format, and custom file formats.
hive> create table test_text(name string,val string) stored as textfile;
OK
Time taken: 0.098 seconds
hive> desc formatted test_text;
OK
# col_name data_type comment
name string None
val string None
# Detailed Table Information
Database: default
Owner: root
CreateTime: Wed Jun 01 21:11:15 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/user/hive/warehouse/test_text
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1464840675
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.2 seconds
hive>
Look at the Storage Information section:
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
The input format is TextInputFormat and the output format is HiveIgnoreKeyTextOutputFormat.
hive> create table test_seq(name string,val string) stored as sequencefile;
OK
Time taken: 0.097 seconds
hive> desc formatted test_seq;
hive> create table test_rc(name string,val string) stored as rcfile;
OK
Time taken: 0.126 seconds
hive> desc formatted test_rc;
I won't cover custom file formats here; xielaoshi will come back to them after leveling up a bit.
5. Why partition? A Hive SELECT normally scans the entire table, which burns a lot of time on unnecessary work.
A partitioned table is one whose partition columns are declared with partitioned by when the table is created.
Partition syntax:
create table tablename(name string) partitioned by (key type, ...)
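The payoff is partition pruning: a filter on the partition column lets Hive read only the matching partition directories instead of scanning the whole table. A minimal sketch against the generic syntax above (tablename and the value are placeholders):
select name from tablename where key = 'some_value'; -- only the key=some_value directory is read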
6. Let's create a partitioned table and play with it:
In the previous post we created three tables: testtable, testtext, and xielaoshi. First run show tables to see what exists:
hive> show tables;
OK
testtable
testtext
xielaoshi
Time taken: 0.264 seconds
If you want to drop a table, do it like this:
hive> drop table testtable;
Now create the partitioned table:
hive> create table xielaoshi2(
> name string,
> salary float,
> meinv array<string>,
> haoche map<string,float>,
> haoza struct<street:string,city:string,state:string,zip:int>
> )
> partitioned by (dt string,type string)
> row format delimited
> fields terminated by '\t'
> collection items terminated by ','
> map keys terminated by ':'
> lines terminated by '\n'
> stored as textfile;
OK
Time taken: 0.353 seconds
hive>
Quick tip: you can write the statement in a text editor first and paste it into the hive CLI; it goes much more smoothly that way!
7 What? You don't know what that syntax means? OK, the parts you may not recognize are collection items terminated by ',' and map keys terminated by ':'. Think about it: the elements inside a collection and the key/value pairs inside a map need their own delimiters, so here we use a comma and a colon.
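To see how the delimiters fit together, here is a sketch of what one raw input line for xielaoshi2 could look like (every value is made up, and \t stands for a literal tab):
xiaoxie\t8000.0\tmeinv1,meinv2\tbmw:300000.0,audi:400000.0\tchang'an street,beijing,beijing,100000
Tabs separate the five fields, commas separate the array elements, a colon separates each map key from its value (with commas between map entries), and commas separate the struct members as well.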
Now look at the table description:
hive> desc formatted xielaoshi2;
OK
# col_name data_type comment
name string None
salary float None
meinv array<string> None
haoche map<string,float> None
haoza struct<street:string,city:string,state:string,zip:int> None
# Partition Information
# col_name data_type comment
dt string None
type string None
# Detailed Table Information
Database: default
Owner: root
CreateTime: Wed Jun 01 20:09:05 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/user/hive/warehouse/xielaoshi2
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1464836945
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
colelction.delim ,
field.delim \t
line.delim \n
mapkey.delim :
serialization.format \t
Time taken: 0.194 seconds
hive>
See the extra Partition Information section? There are two partition columns, dt and type.
8 Adding partitions
hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test');
OK
Time taken: 0.188 seconds
hive>
Not satisfied? Let's add some more:
hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test1');
OK
Time taken: 3.986 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test2');
OK
Time taken: 0.327 seconds
hive> show partitions xielaoshi2;
OK
dt=20160518/type=test
dt=20160518/type=test1
dt=20160518/type=test2
Time taken: 0.273 seconds
hive>
What's that? Still not enough? Fine, one more round!
hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test');
OK
Time taken: 0.224 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test1');
OK
Time taken: 0.275 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test2');
OK
Time taken: 0.323 seconds
hive> show partitions xielaoshi2;
OK
dt=20160518/type=test
dt=20160518/type=test1
dt=20160518/type=test2
dt=20160519/type=test
dt=20160519/type=test1
dt=20160519/type=test2
Time taken: 0.308 seconds
hive>
See? Under each dt there are type sub-partitions.
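Each partition is just a directory under the table's HDFS location, and alter table ... add partition creates it. To confirm (a sketch, reusing the warehouse path from the desc formatted output above), you can list it straight from the hive CLI:
hive> dfs -ls /user/hive/warehouse/xielaoshi2/dt=20160518;
You should see one subdirectory per type value, e.g. type=test, type=test1, type=test2.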
9. Dropping partitions
hive> alter table xielaoshi2 drop if exists partition(dt='20160519',type='test2');
Dropping the partition dt=20160519/type=test2
OK
Time taken: 0.541 seconds
hive>
Dropping all sub-partitions under one dt:
hive> alter table xielaoshi2 drop if exists partition(dt='20160519');
Dropping the partition dt=20160519/type=test
Dropping the partition dt=20160519/type=test1
OK
Time taken: 4.24 seconds
hive>
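One more note before we move on: when you load data into a partitioned table, you must name the target partition explicitly. A sketch, assuming a local file in the same five-field format (the path and file are hypothetical):
hive> load data local inpath '/usr/host/xielaoshi.txt' into table xielaoshi2 partition(dt='20160518',type='test');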
10. Bucketing
Bucketing: within each table or partition, Hive can organize the data further into buckets; a bucket is an even finer-grained slice of the data.
How is the split decided?
Hive buckets on one chosen column: it hashes the column value and takes the hash modulo the number of buckets to decide which bucket a row lands in. With 4 buckets, for example, a row whose column value hashes to 10 goes into bucket 10 % 4 = 2.
The benefits: more efficient processing for some queries, and far more efficient sampling (that's the real point!).
Come on then, let's bucket:
hive> create table bucketed_user(
> id string,
> name string
> )
> clustered by(id) sorted by(name) into 4 buckets
> row format delimited fields terminated by '\t' lines terminated by '\n'
> stored as textfile;
OK
Time taken: 0.283 seconds
hive>
Check the description:
hive> desc formatted bucketed_user;
OK
# col_name data_type comment
id string None
name string None
# Detailed Table Information
Database: default
Owner: root
CreateTime: Wed Jun 01 20:31:39 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/user/hive/warehouse/bucketed_user
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1464838299
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: 4
Bucket Columns: [id]
Sort Columns: [Order(col:name, order:1)]
Storage Desc Params:
field.delim \t
line.delim \n
serialization.format \t
Time taken: 0.363 seconds
hive>
See Num Buckets: 4? The table was split into 4 buckets.
hive> select * from bucketed_user;
OK
Time taken: 0.533 seconds
hive>
Nothing there? Of course, we haven't inserted any data yet! Let's fix that by copying the rows from testtext into bucketed_user:
hive>insert overwrite table bucketed_user select name,addr from testtext;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 21:17:07,755 null map = 0%, reduce = 0%
2016-06-01 21:17:22,171 null map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2016-06-01 21:17:23,308 null map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2016-06-01 21:17:24,401 null map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1464828076391_0005
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 180668474, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2016-06-01_21-16-49_815_8186991974761152344/-ext-10000
Loading data to table default.bucketed_user
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/bucketed_user
Table default.bucketed_user stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 37.79 seconds
hive> select * from bucketed_user;
OK
wer 46
wer 89
weree 78
rr 89
Time taken: 0.273 seconds
hive>
Two jobs were launched.
But look at the stats line: num_files: 1, so the data was not actually bucketed! Why not?
You need to run this first: hive> set hive.enforce.bucketing=true;
With that set, Hive forces the number of reducers to match the number of buckets. Now run the insert again:
hive> insert overwrite table bucketed_user select name,addr from testtext;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 4
2016-06-01 21:24:40,053 null map = 0%, reduce = 0%
2016-06-01 21:24:54,729 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:55,909 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:57,256 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:58,531 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:59,631 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:00,930 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:02,208 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:03,485 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:04,781 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:05,983 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:07,272 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:08,697 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:09,782 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:11,017 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:12,292 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:13,606 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:14,870 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:17,433 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:18,929 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:20,801 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:22,429 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:24,508 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:26,192 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:27,256 null map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:31,612 null map = 100%, reduce = 51%, Cumulative CPU 1.21 sec
2016-06-01 21:25:33,544 null map = 100%, reduce = 51%, Cumulative CPU 2.94 sec
2016-06-01 21:25:35,433 null map = 100%, reduce = 94%, Cumulative CPU 4.92 sec
2016-06-01 21:25:39,269 null map = 100%, reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:40,312 null map = 100%, reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:41,730 null map = 100%, reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:42,927 null map = 100%, reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:44,187 null map = 100%, reduce = 100%, Cumulative CPU 6.23 sec
MapReduce Total cumulative CPU time: 6 seconds 230 msec
Ended Job = job_1464828076391_0006
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table default.bucketed_user
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/bucketed_user
Table default.bucketed_user stats: [num_partitions: 0, num_files: 4, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 96.782 seconds
hive>
Look at the line Hadoop job information for null: number of mappers: 1; number of reducers: 4: because we asked for 4 buckets, Hive used 4 reducers, and the stats line now reports num_files: 4, one file per bucket.
Take a look at the data:
hive> select * from bucketed_user;
OK
rr 89
weree 78
wer 89
wer 46
Time taken: 1.112 seconds
hive> select * from testtext where name = 'wer';
OK
wer 46
wer 89
Time taken: 31.796 seconds
hive>
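Since efficient sampling was the whole selling point of bucketing, here is a sketch of sampling a single bucket (standard Hive tablesample syntax; with only 4 rows the sample is tiny, but on a large table this reads roughly a quarter of the data instead of all of it):
hive> select * from bucketed_user tablesample(bucket 1 out of 4 on id);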
O(∩_∩)O That's it for today; time for a break. If you've read this far and want to learn more or talk it over with me, follow my WeChat public account: 五十年后.
Thank you!