Testing Hive File Formats Combined with Compression Codecs

I have recently been running a cluster-wide test of various compression methods; this post reports the results.

Test environment:

Linux master 2.6.18-348.12.1.el5 #1 SMP Wed Jul 10 05:28:41 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

hadoop-1.0.3

hive-0.9.0

Oracle JRockit(R) (build R28.1.5-20-146757-1.6.0_29-20111004-1750-linux-x86_64, compiled mode)

5 datanodes in total

Hive file formats tested:

RCFile

SequenceFile

Compression codecs:

snappy

bz2

(Tests for LZO and Gzip compression will be added later.)

Metrics measured:

1. Compression ratio
2. Amount of data read
3. Hive execution speed

 

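Part 1: bz2 compression

The following compares bz2 compression across the two file formats.

Original size    RCFile compressed size    SequenceFile compressed size
12.89GB          2.29GB                    2.59GB

Compression ratio (space saved relative to the original size):

RCFile compression ratio    SequenceFile compression ratio
82.23%                      79.91%

These results show that the choice of file format has little effect on the compression ratio.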
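The percentages above are consistent with "space saved", that is, 1 minus compressed size over original size. A minimal Python sketch of that arithmetic follows; the function name `space_saved_percent` is made up for illustration, and the sizes are the ones reported in the table.

```python
def space_saved_percent(original_gb: float, compressed_gb: float) -> float:
    """Percentage of the original size that compression eliminated."""
    return round((1 - compressed_gb / original_gb) * 100, 2)


ORIGINAL_GB = 12.89  # uncompressed data size from the table

rcfile_bz2 = space_saved_percent(ORIGINAL_GB, 2.29)   # RCFile + bz2
seqfile_bz2 = space_saved_percent(ORIGINAL_GB, 2.59)  # SequenceFile + bz2

print(f"RCFile: {rcfile_bz2}%  SequenceFile: {seqfile_bz2}%")
```

The two results differ by only a couple of percentage points, which is what the conclusion about file format rests on.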

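Next, let's see how Hive SQL performs on the two bz2-compressed formats.

Running an aggregate query on the RCFile (bz2) table:

Time taken: 68.169 seconds

Running an aggregate query on the SequenceFile (bz2) table:

Time taken: 194.226 seconds

Comparing the two:
1. RCFile read far less data (373.94MB) than SequenceFile (2.59GB).
2. RCFile was much faster (68 seconds versus 194 seconds).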
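To put the gap in proportion, the short Python sketch below turns the reported figures into ratios. It assumes 1GB = 1024MB when converting the SequenceFile read volume. The smaller read volume is plausibly explained by RCFile being a columnar format, so a query that touches only some columns can skip the rest; the post itself does not state the cause.

```python
# Figures reported in the comparison above.
rcfile_seconds = 68.169
seqfile_seconds = 194.226
rcfile_read_mb = 373.94
seqfile_read_mb = 2.59 * 1024  # assumption: 1GB = 1024MB

# How many times faster RCFile ran, and how many times less data it read.
speedup = seqfile_seconds / rcfile_seconds
read_ratio = seqfile_read_mb / rcfile_read_mb

print(f"RCFile ran {speedup:.2f}x faster and read {read_ratio:.1f}x less data")
```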
 

Part 2: snappy compression

 

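For snappy compression I tested only RCFile (SequenceFile is essentially out of scope for my later optimization work).

Original size    bz2 compressed size    snappy compressed size
12.89GB          2.29GB                 4.87GB

Compression ratio:

bz2       snappy
82.23%    62.22%

In terms of saving disk space, bz2 has a clear advantage. (Note: LZO was not tested here, because our HBase tests suggested that LZO would not offer much of a space-saving advantage.)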

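Next, let's compare SQL execution under bz2 and snappy compression. The first log below shows the progress of the bz2 job, and the second shows the progress of the snappy job: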

2013-11-06 18:18:56,840 Stage-1 map = 0%, reduce = 0%
2013-11-06 18:19:24,020 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:25,028 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:26,036 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:27,045 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:28,053 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:29,060 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:30,068 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:31,074 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:32,081 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:33,088 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:34,095 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:35,101 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:36,108 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:37,115 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:38,122 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:39,129 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:40,135 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:41,141 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:42,148 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:43,154 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:44,161 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:45,168 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:46,175 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:47,182 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:48,188 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:49,195 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:50,201 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:51,207 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:52,214 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:53,220 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:54,227 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:55,233 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:56,239 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:57,246 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:58,252 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:59,259 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:00,265 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:01,272 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:02,279 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:03,286 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:04,293 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:05,309 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 405.79 sec
2013-11-06 18:20:06,316 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:07,323 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:08,331 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:09,338 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:10,345 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:11,352 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:12,359 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:13,366 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:14,373 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:15,380 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:16,387 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:17,394 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:18,401 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:19,408 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:20,415 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 1373.04 sec

2013-11-06 19:06:33,666 Stage-1 map = 0%, reduce = 0%
2013-11-06 19:06:43,699 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.65 sec
2013-11-06 19:06:44,704 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.65 sec
2013-11-06 19:06:45,709 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 4.65 sec
2013-11-06 19:06:46,714 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 20.37 sec
2013-11-06 19:06:47,719 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 20.37 sec
2013-11-06 19:06:48,724 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 42.85 sec
2013-11-06 19:06:49,729 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 52.4 sec
2013-11-06 19:06:50,734 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 52.4 sec
2013-11-06 19:06:51,739 Stage-1 map = 49%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:52,744 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:53,749 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:54,754 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:55,759 Stage-1 map = 56%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:56,764 Stage-1 map = 56%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:57,769 Stage-1 map = 56%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:58,774 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:59,779 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:00,784 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:01,789 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:02,794 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:03,799 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:04,804 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 116.32 sec
2013-11-06 19:07:05,809 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 116.32 sec
2013-11-06 19:07:06,814 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 116.32 sec
2013-11-06 19:07:07,820 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:08,825 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:09,831 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:10,836 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:11,841 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:12,846 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:13,851 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 210.72 sec
2013-11-06 19:07:14,857 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 210.72 sec
2013-11-06 19:07:15,863 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 210.72 sec
2013-11-06 19:07:16,868 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 219.67 sec
2013-11-06 19:07:17,873 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 219.67 sec
2013-11-06 19:07:18,878 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 219.67 sec
2013-11-06 19:07:19,884 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 225.03 sec
2013-11-06 19:07:20,889 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 225.03 sec
2013-11-06 19:07:21,894 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 225.03 sec
2013-11-06 19:07:22,900 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:23,905 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:24,920 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:25,925 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:26,930 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:27,935 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:28,940 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:29,946 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:30,959 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:31,964 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:32,970 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:33,975 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:34,981 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:35,987 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:36,993 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:37,999 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec

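The progress logs show that the snappy job advances far faster than the bz2 job: in the excerpt above, the bz2 job had reached only map = 9% after about 84 seconds, whereas the snappy job finished its map phase in about 46 seconds and the whole stage in about 58 seconds.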
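Comparing progress by eye across two long logs is tedious, so here is a small Python sketch that extracts elapsed time and map/reduce percentages from Hive progress lines. The regular expression and the helper name `parse_progress` are illustrative assumptions based on the line format shown in the logs above; the sample lines are copied from those logs.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2013-11-06 19:06:46,714 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 20.37 sec
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"Stage-\d+ map = (\d+)%, reduce = (\d+)%"
)


def parse_progress(lines):
    """Return (elapsed_seconds, map_pct, reduce_pct) tuples, timed from the first matching line."""
    points = []
    start = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a progress line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if start is None:
            start = stamp
        elapsed = (stamp - start).total_seconds()
        points.append((elapsed, int(match.group(2)), int(match.group(3))))
    return points


# First and last lines of the bz2 excerpt.
bz2_points = parse_progress([
    "2013-11-06 18:18:56,840 Stage-1 map = 0%, reduce = 0%",
    "2013-11-06 18:20:20,415 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 1373.04 sec",
])

# Start, end of map phase, and end of reduce phase for the snappy run.
snappy_points = parse_progress([
    "2013-11-06 19:06:33,666 Stage-1 map = 0%, reduce = 0%",
    "2013-11-06 19:07:19,884 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 225.03 sec",
    "2013-11-06 19:07:31,964 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec",
])

for name, points in (("bz2", bz2_points), ("snappy", snappy_points)):
    for elapsed, map_pct, reduce_pct in points:
        print(f"{name}: {elapsed:7.3f}s  map={map_pct}%  reduce={reduce_pct}%")
```

Feeding the full logs through `parse_progress` gives a progress curve for each codec, which makes the difference in map-phase speed easy to plot or tabulate.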

Looking at the MapReduce job status for the SQL run under snappy compression, the total amount of data read was 608.77MB, which is acceptable.

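Summary:

Compression in Hive should be applied flexibly. For source data, use RCFile+bz2 or RCFile+gz, which saves a great deal of disk space. During computation, where execution speed matters more, it is acceptable to spend a little extra disk space, so I recommend RCFile+snappy, which improves overall Hive execution speed.

LZO can also be used during computation, but weighing speed against compression ratio, snappy is the better fit.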
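The post does not show the settings used to produce the compressed tables, so the following is only a sketch under stated assumptions. It is a small Python helper that generates the HiveQL for writing an RCFile table with a chosen codec. `hive.exec.compress.output`, `mapred.output.compression.codec`, and the codec class names are standard Hive and Hadoop 1.x names; the helper `rcfile_load_script` and the table names `raw_logs`, `logs_archive`, and `logs_work` are hypothetical.

```python
# Hadoop codec classes for the codecs discussed in this post.
CODECS = {
    "bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    "gz": "org.apache.hadoop.io.compress.GzipCodec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def rcfile_load_script(source_table: str, target_table: str, codec: str) -> str:
    """Build HiveQL that copies source_table into an RCFile table compressed with codec."""
    return "\n".join([
        "SET hive.exec.compress.output=true;",
        f"SET mapred.output.compression.codec={CODECS[codec]};",
        f"CREATE TABLE {target_table} STORED AS RCFILE AS SELECT * FROM {source_table};",
    ])


# Source data kept long-term: favor disk savings (hypothetical table names).
archive_script = rcfile_load_script("raw_logs", "logs_archive", "bz2")

# Working tables used during computation: favor speed.
work_script = rcfile_load_script("raw_logs", "logs_work", "snappy")

print(archive_script)
print()
print(work_script)
```

The generated statements would be run in a Hive session; whether snappy is available depends on the native libraries installed on the cluster.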

 

 

 

 
