黑猴子的家: A Comparison Experiment of Hive's Mainstream File Storage Formats

The formats are compared from two angles: the compression ratio of the stored files and the query speed.

1. Storage compression ratio test

1) Test data

https://github.com/liufengji/Compression_Format_Data.git

2) TextFile

(a) Create the table, stored as TEXTFILE

create table log_text (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as textfile ;

(b) Load data into the table

hive (default)> load data local inpath '/opt/module/datas/log.data' into table log_text ;

(c) Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_text;
18.1 M  /user/hive/warehouse/log_text/log.data

3) ORC

(a) Create the table, stored as ORC

create table log_orc (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as orc ;

(b) Load data into the table

insert into table log_orc select * from log_text ;
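
The ORC table is filled with insert ... select rather than load data because load data only copies or moves the file into the table directory and does not convert it to the table's format. As a quick check, assuming the tables above already exist, the format Hive recorded for each table can be inspected:

```sql
-- Show table metadata, including InputFormat, OutputFormat and SerDe.
describe formatted log_text;

-- For log_orc the InputFormat should be the ORC class
-- (org.apache.hadoop.hive.ql.io.orc.OrcInputFormat).
describe formatted log_orc;
```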

(c) Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc/ ;
2.8 M  /user/hive/warehouse/log_orc/000000_0
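
Hive's ORC writer compresses with ZLIB by default, so the 2.8 M figure reflects both the columnar encoding and ZLIB compression. A minimal sketch for measuring the format without compression follows; the table name log_orc_none is hypothetical and no result is claimed here:

```sql
-- log_orc_none is a hypothetical table name used only for this sketch.
-- "orc.compress"="NONE" turns off the default ZLIB compression.
create table log_orc_none
stored as orc tblproperties ("orc.compress"="NONE")
as select * from log_text;

-- Then, in the Hive CLI, check the size the same way as above:
-- dfs -du -h /user/hive/warehouse/log_orc_none/ ;
```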

4) Parquet

(a) Create the table, stored as Parquet

create table log_parquet (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as parquet ;

(b) Load data into the table

insert into table log_parquet select * from log_text ;

(c) Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_parquet/ ;
13.1 M  /user/hive/warehouse/log_parquet/000000_0
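
Parquet tables in Hive are typically written uncompressed unless a codec is set, which may be part of why the Parquet file here is much larger than the ORC one. A sketch, assuming a hypothetical table name log_parquet_snappy, of the same data written with Snappy compression for comparison:

```sql
-- log_parquet_snappy is a hypothetical table name used only for this sketch.
-- "parquet.compression" selects the codec used when writing Parquet files.
create table log_parquet_snappy
stored as parquet tblproperties ("parquet.compression"="SNAPPY")
as select * from log_text;

-- Then, in the Hive CLI, check the size the same way as above:
-- dfs -du -h /user/hive/warehouse/log_parquet_snappy/ ;
```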

5) Compression ratio summary
ORC (2.8 M) > Parquet (13.1 M) > TextFile (18.1 M)
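
To put numbers on this ordering, the compression ratio can be taken as the TextFile size divided by each format's size. A small sketch using the sizes reported above (it assumes a Hive version that allows select without a from clause); the results come out at roughly 6.46 for ORC and 1.38 for Parquet:

```sql
-- Compression ratio = TextFile size / format size, in the same units (M).
select round(18.1 / 2.8, 2)  as orc_ratio,
       round(18.1 / 13.1, 2) as parquet_ratio;
```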

2. Query speed test

1) TextFile
hive (default)> select count(*) from log_text;
_c0
100000

Time taken: 20.346 seconds, Fetched: 1 row(s)

2) ORC
hive (default)> select count(*) from log_orc;
_c0
100000

Time taken: 20.174 seconds, Fetched: 1 row(s)

3) Parquet
hive (default)> select count(*) from log_parquet;
_c0
100000

Time taken: 20.149 seconds, Fetched: 1 row(s)
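
4) Query speed summary
The three timings are nearly identical (20.149 s to 20.346 s). By the measured numbers the order is Parquet > ORC > TextFile, but a gap of about 0.2 s is too small to separate the formats on this 100,000-row count(*).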
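
A count(*) over 100,000 rows is likely dominated by job start-up time rather than file reading, which would explain the near-identical timings. As a sketch only, a query that touches a single column gives a columnar format more opportunity to skip data; no timings are claimed for it here:

```sql
-- Aggregate on one column; run the same statement against each table
-- and compare the "Time taken" lines.
select city_id, count(*) as visits from log_text    group by city_id;
select city_id, count(*) as visits from log_orc     group by city_id;
select city_id, count(*) as visits from log_parquet group by city_id;
```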
