黑猴子的家:Hive 存储和压缩结合

1、官网

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

ORC存储方式的压缩

Key Default Notes
orc.compress ZLIB high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size 262,144 number of bytes in each compression chunk
orc.stripe.size 67,108,864 number of bytes in each stripe
orc.row.index.stride 10,000 number of rows between index entries (must be >= 1000)
orc.create.index true whether to create row indexes
orc.bloom.filter.columns "" comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp 0.05 false positive probability for bloom filter (must >0.0 and <1.0)

2、创建一个非压缩的的ORC存储方式

(1)建表语句

create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE");

(2)插入数据

insert into table log_orc_none select * from log_text ;

(3)查看插入后数据

dfs -du -h /user/hive/warehouse/log_orc_none/ ;
7.7 M  /user/hive/warehouse/log_orc_none/000000_0

3、创建一个SNAPPY压缩的ORC存储方式

(1)建表语句

create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");

(2)插入数据

insert into table log_orc_snappy select * from log_text ;

(3)查看插入后数据

dfs -du -h /user/hive/warehouse/log_orc_snappy/ ;
3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0

4、上一节中默认创建的ORC存储方式,导入数据后的大小为

2.8 M /user/hive/warehouse/log_orc/000000_0
比Snappy压缩的还小。原因是orc存储文件默认采用ZLIB压缩。比snappy压缩的小。

5、存储方式和压缩总结:

在实际的项目开发当中,hive表的数据存储格式一般选择:orc或parquet。压缩方式一般选择snappy。

你可能感兴趣的:(黑猴子的家:Hive 存储和压缩结合)