[Hadoop] Using Compression in Hadoop

1. Common compression formats

When choosing a compression format, two factors matter:

Compression ratio: Snappy < LZ4 < LZO < GZIP < BZIP2

BZIP2 compresses data to roughly 30% of its original size, while Snappy/LZ4/LZO compress it to roughly 50%.

(Figure 1: compression ratio comparison of Snappy, LZ4, LZO, GZIP and BZIP2)

Compression/decompression speed: Snappy > LZ4 > LZO > GZIP > BZIP2

As the speed chart shows, the ordering is exactly the reverse of the ratio chart: the better a format compresses, the slower it compresses and decompresses.

(Figure 2: compression/decompression speed comparison of the same formats)

Compression saves disk space, reduces disk I/O, and speeds up network transfer; the price is noticeably higher CPU usage. Since speed and ratio pull in opposite directions, pick the codec for each workload: for historical/archival data, favor a high compression ratio; for jobs where speed matters, favor a faster codec with a lower ratio.

 

1.1 LZO compression

Pros: fast compression/decompression with a reasonable ratio (compresses to under 50% of the original size); splittable, which makes it one of the most popular compression formats in Hadoop; has a Hadoop native library; the lzop command can be installed on Linux, which makes it easy to work with.

Cons: lower compression ratio than gzip; not shipped with Hadoop, so it must be installed separately; LZO files need extra handling in applications (an index must be built to support splitting, and the job must use an LZO-aware InputFormat).

Use case: very large text files that are still above roughly 200 MB after compression; the larger the individual file, the more pronounced LZO's advantage.

 

1.2 GZIP compression

Pros: high compression ratio (compresses to a little over 30% of the original size) with reasonably fast compression/decompression; supported by Hadoop out of the box, so gzip files are processed just like plain text; has a Hadoop native library; most Linux systems ship with the gzip command.

Cons: not splittable.

Use case: files that are within about 130 MB after compression (roughly one HDFS block). For example, compress one day's or one hour's logs into a single gzip file, and MapReduce still gets parallelism across the many files. Hive, streaming, and Java MapReduce programs handle them exactly as they would plain text, so existing programs need no changes.

 

1.3 BZIP2 compression

Pros: splittable; very high compression ratio (roughly 30% of the original size), higher than gzip; supported by Hadoop out of the box; the bzip2 command ships with Linux, so it is easy to use.

Cons: slow compression/decompression; commonly listed as having no Hadoop native library (although the hadoop checknative output later in this post does load a system-native bzip2).

Use cases: when speed is not critical but a high compression ratio is needed, for example as the output format of a MapReduce job; when large output data must be archived to save disk space and will rarely be read afterwards; or when a single very large text file must be compressed, still needs to be splittable, and existing applications must keep working unchanged.

 

1.4 Snappy compression

Pros: very fast compression with a reasonable ratio (roughly 50% of the original size); has a Hadoop native library.

Cons: not splittable; lower compression ratio than gzip; not shipped with Hadoop, so it must be installed; no standard Linux command-line tool.

Use cases: compressing large intermediate map output, i.e. as the map-to-reduce shuffle compression format; or compressing the output of one MapReduce job that serves as the input of another.

 

A comparison of the four compression formats:

Format | Splittable       | Native lib | Ratio       | Speed       | Ships with Hadoop     | Linux command | Changes needed in existing applications
gzip   | No               | Yes        | very high   | fairly fast | yes, works directly   | yes (gzip)    | none, handled like plain text
lzo    | Yes (if indexed) | Yes        | fairly high | very fast   | no, must be installed | yes (lzop)    | build an index and specify the LZO input format
snappy | No               | Yes        | fairly high | very fast   | no, must be installed | no            | none, handled like plain text
bzip2  | Yes              | No         | highest     | slow        | yes, works directly   | yes (bzip2)   | none, handled like plain text
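
To get a quick local feel for the ratio/speed trade-off before involving Hadoop at all, the command-line tools can be compared directly on a sample file (a sketch; sample.log is a placeholder name, and lzop has to be installed separately):

# compress the same file with each tool, keeping the original, then compare sizes
time gzip  -c sample.log > sample.log.gz
time bzip2 -c sample.log > sample.log.bz2
time lzop  -c sample.log > sample.log.lzo
ls -lh sample.log*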

 

2. Applying compression in Hadoop

Hadoop jobs are typically I/O-bound, so compression speeds up I/O and shrinks the data sent over the network, which improves overall performance. Whether a format is splittable must also be taken into account.

  • Hadoop jobs are usually IO bound; compressing data can speed up the IO operations;
  • Compression reduces the size of data transferred across the network;
  • Overall job performance may be increased simply by enabling compression;
  • Splittability must be taken into account.

 

2.1 Splittable compression formats

Of the formats above, only bzip2 and LZO support splitting, and LZO only conditionally: the file must first be indexed (with com.hadoop.compression.lzo.LzoIndexer); see the sketch after the table below.

Compression format | Tool  | Algorithm | File extension | Splittable
gzip               | gzip  | DEFLATE   | .gz            | No
bzip2              | bzip2 | bzip2     | .bz2           | Yes
LZO                | lzop  | LZO       | .lzo           | Yes, if indexed
Snappy             | N/A   | Snappy    | .snappy        | No
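
The typical workflow for making an LZO file splittable looks roughly like this (a sketch: it assumes hadoop-lzo and the lzop tool are installed, and the jar path and file names are placeholders to adapt to your environment):

# compress the raw file locally with lzop, then upload it to HDFS
lzop -o access_log.lzo access_log.txt
hdfs dfs -put access_log.lzo /tmp/lzo/

# build the split index on HDFS; this writes /tmp/lzo/access_log.lzo.index next to
# the data file, which LZO-aware input formats use to split the file across map tasks
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /tmp/lzo/access_log.lzo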

To make this concrete:

  • 1 GB of uncompressed data: 1024 MB / 128 MB = 8 map tasks (assuming the default 128 MB block size)
  • The same 1 GB gzip-compressed: 1 map task, because gzip cannot be split and a single task must process the whole file

In other words, if the compressed format is splittable, the data can be divided across multiple tasks running in parallel, which improves throughput.
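
The 128 MB figure above is the HDFS block size; you can check what your cluster is actually configured with (a quick sketch, run on any node with the client configuration):

# prints the block size in bytes; 134217728 bytes = 128 MB
hdfs getconf -confKey dfs.blocksize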

Check which compression formats the Hadoop installation on a machine supports (only a Hadoop build compiled with native support will load the native libraries):

[hadoop@hadoop01 ~]$ hadoop checknative 
18/08/13 18:43:08 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/08/13 18:43:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
snappy:  true /lib64/libsnappy.so.1
lz4:     true revision:99
bzip2:   true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
[hadoop@hadoop01 ~]$ 

2.2 Commonly used codecs

  • Zlib:   org.apache.hadoop.io.compress.DefaultCodec
  • Gzip:   org.apache.hadoop.io.compress.GzipCodec
  • Bzip2:  org.apache.hadoop.io.compress.BZip2Codec
  • Lzo:    com.hadoop.compression.lzo.LzoCodec
  • Lz4:    org.apache.hadoop.io.compress.Lz4Codec
  • Snappy: org.apache.hadoop.io.compress.SnappyCodec

To use a codec, it only needs to be registered in Hadoop's configuration files (see the io.compression.codecs setting in section 2.4.1.1 below).

 

2.3 Compression in MapReduce

 

(Figure 3: where compression and decompression happen in the MapReduce data flow)

 

Compression can be applied at all three stages of a MapReduce job: the map stage can read compressed input and decompress it on the fly, the intermediate map output can be compressed and then decompressed on the reduce side, and the reduce output can be compressed as it is written out.

 

(Figure 4: suggested codec choices for each MapReduce stage)

Conclusions:

  • Map input: use a splittable format (bzip2, or LZO with an index);
  • Shuffle & sort: use a fast codec (such as Snappy or LZ4) for the intermediate data;
  • Reduce output: use the codec with the best compression ratio to save disk space (if the job's output feeds another job, choose a splittable format); a per-job sketch of these settings follows.
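
As an illustration, the same stage-level choices can be made per job from the command line instead of globally (a sketch reusing the wordcount example jar and paths from later in this post; the property names are the standard Hadoop 2.x ones):

# compress the intermediate map output with Snappy and the final output with BZip2
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
  /tmp/input.txt /tmp/compression-out/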

 

2.4 Compression in practice

2.4.1 Compression in Hadoop MapReduce

2.4.1.1 Edit the Hadoop configuration files

Configure core-site.xml (step ①):

<property>
    <name>io.compression.codecs</name>
    <value>
        org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec
    </value>
</property>
Configure mapred-site.xml (step ③):

<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
The intermediate step (②) is to open the Hadoop documentation and look these properties up in mapred-default.xml.

(Figure 5: compression-related properties in mapred-default.xml from the Hadoop documentation)

After changing the configuration files, restart the HDFS and YARN services; a sketch of the restart commands follows.
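
On a pseudo-distributed setup like the one used here, the restart usually amounts to the following (a sketch; the scripts live in $HADOOP_HOME/sbin, adjust to however your cluster is managed):

# restart HDFS and YARN so the new compression settings are picked up
stop-yarn.sh && stop-dfs.sh
start-dfs.sh && start-yarn.sh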

2.4.1.2 Run a MapReduce job to test

[hadoop@hadoop01 ~]$ cd app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/
[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /tmp/input.txt /tmp/compression-out/
...
[hadoop@hadoop01 mapreduce]$ 

Check the result: the output is compressed as .bz2, consistent with the configuration.

[hadoop@hadoop01 mapreduce]$ hdfs dfs -ls /tmp/compression-out/
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2018-08-13 20:01 /tmp/compression-out/_SUCCESS
-rw-r--r--   1 hadoop supergroup         65 2018-08-13 20:01 /tmp/compression-out/part-r-00000.bz2
[hadoop@hadoop01 mapreduce]$ hdfs dfs -text /tmp/compression-out/part-r-00000.bz2
18/08/13 20:02:53 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/08/13 20:02:53 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
data	1
is	2
sample	1
test	2
this	2
[hadoop@hadoop01 mapreduce]$ 

 

2.4.2 Compression in Hive

Hive's CREATE TABLE syntax includes a STORED AS file_format clause that specifies the table's storage format. Choosing a good storage format, usually together with compression, not only saves storage space but also improves query performance; a compact sketch of the pattern follows, and the rest of this section walks through each format with real data.

For the available file_format values, see: https://blog.csdn.net/wawa8899/article/details/81674817
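
As a compact sketch of the pattern used throughout the rest of this section (the two-column logs_text and logs_orc tables are hypothetical; the actual tests below use a 33-column table):

hive -e "
CREATE TABLE logs_orc (id STRING, msg STRING)
  STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB');
INSERT INTO TABLE logs_orc SELECT id, msg FROM logs_text;
"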

2.4.2.1 No compression

Create an uncompressed table in Hive and load the data into it:

hive> create table test1(
    > c1 string,
    > c2 string,
    > c3 string,
    > c4 string,
    > c5 string,
    > c6 string,
    > c7 string,
    > c8 string,
    > c9 string,
    > c10 string,
    > c11 string,
    > c12 string,
    > c13 string,
    > c14 string,
    > c15 string,
    > c16 string,
    > c17 string,
    > c18 string,
    > c19 string,
    > c20 string,
    > c21 string,
    > c22 string,
    > c23 string,
    > c24 string,
    > c25 string,
    > c26 string,
    > c27 string,
    > c28 string,
    > c29 string,
    > c30 string,
    > c31 string,
    > c32 string,
    > c33 string
    > )
    > row format delimited fields terminated by '||';
OK
Time taken: 0.716 seconds
hive> load data local inpath '/home/hadoop/data/20180813000203.txt' overwrite into table test1;
hive> select count(1) from test1;
OK
76241
Time taken: 20.67 seconds, Fetched: 1 row(s)
hive> 

Now check the file size on HDFS:

[hadoop@hadoop01 data]$ hdfs dfs -du -s -h /user/hive/warehouse/test1
37.4 M  37.4 M  /user/hive/warehouse/test1
[hadoop@hadoop01 data]$ 

2.4.2.2 BZIP2 compression

Create a bzip2-compressed table in Hive and load the data into it (for how Hive handles compression, see the Compression page on the Hive website).

Check Hive's current output-compression setting; by default it is off:

hive> SET hive.exec.compress.output;
hive.exec.compress.output=false
hive> 

Check the codec currently in effect; it shows BZip2Codec because of the mapred-site.xml change made earlier (the stock default is org.apache.hadoop.io.compress.DefaultCodec):

hive> SET mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive> 

Turn on output compression, set the codec to BZip2Codec, and create a new table from the existing one:

hive> SET hive.exec.compress.output=true;
hive> SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
hive> create table test1_bzip2
    > row format delimited fields terminated by '||'
    > as select * from test1;

Check the file size on HDFS: it dropped from the original 37.4 MB to 450.0 KB (bzip2 typically compresses to around 30% of the original; the reduction is this extreme only because my data contains a lot of repetition), and the file stored on HDFS now carries the .bz2 extension.

[hadoop@hadoop01 data]$ hdfs dfs -du -s -h /user/hive/warehouse/test1_bzip2
450.0 K  450.0 K  /user/hive/warehouse/test1_bzip2
[hadoop@hadoop01 data]$ hdfs dfs -ls /user/hive/warehouse/test1_bzip2
Found 1 items
-rwxr-xr-x   1 hadoop supergroup     460749 2018-08-13 20:32 /user/hive/warehouse/test1_bzip2/000000_0.bz2

Important: to avoid affecting other users, do not leave these compression settings enabled globally; switch them back as soon as you are done!

hive> SET hive.exec.compress.output=false;

 

2.4.2.3 SEQUENCEFILE storage

Create another table stored as SEQUENCEFILE. Data cannot be loaded into it directly from a text file, so first create a text-format table, load the file into that, and then insert from it into the SEQUENCEFILE table.

hive> create table test1(
    > c1 string,
    > c2 string,
    > c3 string,
    > c4 string,
    > c5 string,
    > c6 string,
    > c7 string,
    > c8 string,
    > c9 string,
    > c10 string,
    > c11 string,
    > c12 string,
    > c13 string,
    > c14 string,
    > c15 string,
    > c16 string,
    > c17 string,
    > c18 string,
    > c19 string,
    > c20 string,
    > c21 string,
    > c22 string,
    > c23 string,
    > c24 string,
    > c25 string,
    > c26 string,
    > c27 string,
    > c28 string,
    > c29 string,
    > c30 string,
    > c31 string,
    > c32 string,
    > c33 string
    > )
    > row format delimited fields terminated by '||';
OK
Time taken: 0.145 seconds
hive> load data local inpath '/home/hadoop/data/20180813000203.txt' overwrite into table test1;
Loading data to table default.test1
Table default.test1 stats: [numFiles=1, numRows=0, totalSize=39187874, rawDataSize=0]
OK
Time taken: 1.316 seconds
hive>
hive> create table test1_seq(
    > c1 string,
    > c2 string,
    > c3 string,
    > c4 string,
    > c5 string,
    > c6 string,
    > c7 string,
    > c8 string,
    > c9 string,
    > c10 string,
    > c11 string,
    > c12 string,
    > c13 string,
    > c14 string,
    > c15 string,
    > c16 string,
    > c17 string,
    > c18 string,
    > c19 string,
    > c20 string,
    > c21 string,
    > c22 string,
    > c23 string,
    > c24 string,
    > c25 string,
    > c26 string,
    > c27 string,
    > c28 string,
    > c29 string,
    > c30 string,
    > c31 string,
    > c32 string,
    > c33 string
    > )
    > row format delimited fields terminated by '||'
    > stored as SEQUENCEFILE;
OK
Time taken: 0.142 seconds
hive> insert into table test1_seq select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0002, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:31:06,612 Stage-1 map = 0%,  reduce = 0%
2018-08-14 03:31:14,123 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.86 sec
MapReduce Total cumulative CPU time: 1 seconds 860 msec
Ended Job = job_1534181281204_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_seq/.hive-staging_hive_2018-08-14_03-30-59_220_768607288298096260-1/-ext-10000
Loading data to table default.test1_seq
Table default.test1_seq stats: [numFiles=1, numRows=76241, totalSize=12155988, rawDataSize=10972575]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.86 sec   HDFS Read: 39194480 HDFS Write: 12156070 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
Time taken: 17.464 seconds
hive>

Check on HDFS. An uncompressed SEQUENCEFILE is normally somewhat larger than the raw text because of its record headers and sync markers; it comes out smaller here only because my data contains a great deal of repetition, so the numbers are not representative.

hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M  37.4 M  /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_seq;
11.6 M  11.6 M  /user/hive/warehouse/test1_seq
hive>

2.4.2.4 RCFILE storage

RCFile is a hybrid row-columnar format: rows are grouped, and within each row group the data is laid out column by column, which guarantees that all the columns of a row stay in the same HDFS block. Its drawback is that the row groups are too small, so it is rarely used nowadays.

Create an RCFILE table; as with SEQUENCEFILE, its row/column layout means data cannot be loaded directly from a text file and must be inserted from the text-format table:

hive> create table test1_rcfile(
    > c1 string,
    > c2 string,
    > c3 string,
    > c4 string,
    > c5 string,
    > c6 string,
    > c7 string,
    > c8 string,
    > c9 string,
    > c10 string,
    > c11 string,
    > c12 string,
    > c13 string,
    > c14 string,
    > c15 string,
    > c16 string,
    > c17 string,
    > c18 string,
    > c19 string,
    > c20 string,
    > c21 string,
    > c22 string,
    > c23 string,
    > c24 string,
    > c25 string,
    > c26 string,
    > c27 string,
    > c28 string,
    > c29 string,
    > c30 string,
    > c31 string,
    > c32 string,
    > c33 string
    > )
    > row format delimited fields terminated by '||'
    > stored as rcfile;
OK
Time taken: 0.146 seconds
hive> insert into test1_rcfile select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0003, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:46:15,737 Stage-1 map = 0%,  reduce = 0%
2018-08-14 03:46:23,228 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.87 sec
MapReduce Total cumulative CPU time: 1 seconds 870 msec
Ended Job = job_1534181281204_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_rcfile/.hive-staging_hive_2018-08-14_03-46-08_429_3722754602144122242-1/-ext-10000
Loading data to table default.test1_rcfile
Table default.test1_rcfile stats: [numFiles=1, numRows=76241, totalSize=8604253, rawDataSize=8532863]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.87 sec   HDFS Read: 39194682 HDFS Write: 8604337 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 870 msec
OK
Time taken: 17.323 seconds
hive>

Check the size on HDFS. On typical data RCFILE only saves on the order of 10% compared with the raw data, which is why it is not widely used in practice; it does much better here only because my data is highly repetitive, so the result is not representative.

hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M  37.4 M  /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_rcfile;
8.2 M  8.2 M  /user/hive/warehouse/test1_rcfile
hive>

2.4.2.5 ORC storage

hive> create table test1_orc(
    > c1 string,
    > c2 string,
    > c3 string,
    > c4 string,
    > c5 string,
    > c6 string,
    > c7 string,
    > c8 string,
    > c9 string,
    > c10 string,
    > c11 string,
    > c12 string,
    > c13 string,
    > c14 string,
    > c15 string,
    > c16 string,
    > c17 string,
    > c18 string,
    > c19 string,
    > c20 string,
    > c21 string,
    > c22 string,
    > c23 string,
    > c24 string,
    > c25 string,
    > c26 string,
    > c27 string,
    > c28 string,
    > c29 string,
    > c30 string,
    > c31 string,
    > c32 string,
    > c33 string
    > )
    > row format delimited fields terminated by '||'
    > stored as orc;
OK
Time taken: 0.116 seconds
hive> insert into test1_orc select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0004, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0004/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:53:54,382 Stage-1 map = 0%,  reduce = 0%
2018-08-14 03:54:04,091 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.38 sec
MapReduce Total cumulative CPU time: 4 seconds 380 msec
Ended Job = job_1534181281204_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_orc/.hive-staging_hive_2018-08-14_03-53-46_132_1524862905095712631-1/-ext-10000
Loading data to table default.test1_orc
Table default.test1_orc stats: [numFiles=1, numRows=76241, totalSize=678984, rawDataSize=219269116]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.38 sec   HDFS Read: 39194660 HDFS Write: 679067 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 380 msec
OK
Time taken: 20.572 seconds
hive>

Check the size on HDFS: the reduction is dramatic. ORC applies zlib compression by default; you can verify the codec with the orcfiledump sketch after the listing below. (My data is highly repetitive, so the numbers are not representative.)

hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M  37.4 M  /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc;
663.1 K  663.1 K  /user/hive/warehouse/test1_orc
hive>
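
To confirm which codec an ORC file is actually using, recent Hive releases ship an ORC file dump utility (a sketch; the exact data file name under the warehouse directory may differ from the 000000_0 assumed here):

# prints the ORC metadata, including a "Compression: ZLIB" line
hive --orcfiledump /user/hive/warehouse/test1_orc/000000_0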

Create one more table, stored as ORC but with compression disabled:

hive> create table test1_orc_null(
    > c1 string,
    > c2 string,
    > c3 string,
    > c4 string,
    > c5 string,
    > c6 string,
    > c7 string,
    > c8 string,
    > c9 string,
    > c10 string,
    > c11 string,
    > c12 string,
    > c13 string,
    > c14 string,
    > c15 string,
    > c16 string,
    > c17 string,
    > c18 string,
    > c19 string,
    > c20 string,
    > c21 string,
    > c22 string,
    > c23 string,
    > c24 string,
    > c25 string,
    > c26 string,
    > c27 string,
    > c28 string,
    > c29 string,
    > c30 string,
    > c31 string,
    > c32 string,
    > c33 string
    > )
    > row format delimited fields terminated by '||'
    > stored as orc tblproperties ("orc.compress"="NONE");
OK
Time taken: 0.156 seconds
hive> insert into test1_orc_null select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0005, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0005/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:02:40,232 Stage-1 map = 0%,  reduce = 0%
2018-08-14 04:02:49,821 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.48 sec
MapReduce Total cumulative CPU time: 3 seconds 480 msec
Ended Job = job_1534181281204_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_orc_null/.hive-staging_hive_2018-08-14_04-02-32_108_711422127353495082-1/-ext-10000
Loading data to table default.test1_orc_null
Table default.test1_orc_null stats: [numFiles=1, numRows=76241, totalSize=2087911, rawDataSize=219269116]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 3.48 sec   HDFS Read: 39194703 HDFS Write: 2087999 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 480 msec
OK
Time taken: 19.238 seconds
hive>

Check the sizes: the uncompressed ORC table is somewhat larger than the compressed one. (Again, the repetitive data makes the absolute numbers unrepresentative.)

hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M  37.4 M  /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc;
663.1 K  663.1 K  /user/hive/warehouse/test1_orc
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc_null;
2.0 M  2.0 M  /user/hive/warehouse/test1_orc_null
hive>

2.4.2.6 PARQUET storage

hive> create table test1_parquet(
    > c1 string,
    > c2 string,
    > c3 string,
    > c4 string,
    > c5 string,
    > c6 string,
    > c7 string,
    > c8 string,
    > c9 string,
    > c10 string,
    > c11 string,
    > c12 string,
    > c13 string,
    > c14 string,
    > c15 string,
    > c16 string,
    > c17 string,
    > c18 string,
    > c19 string,
    > c20 string,
    > c21 string,
    > c22 string,
    > c23 string,
    > c24 string,
    > c25 string,
    > c26 string,
    > c27 string,
    > c28 string,
    > c29 string,
    > c30 string,
    > c31 string,
    > c32 string,
    > c33 string
    > )
    > row format delimited fields terminated by '||'
    > stored as parquet;
OK
Time taken: 0.137 seconds
hive> insert into test1_parquet select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0006, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0006/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:09:14,081 Stage-1 map = 0%,  reduce = 0%
2018-08-14 04:09:23,547 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.47 sec
MapReduce Total cumulative CPU time: 3 seconds 470 msec
Ended Job = job_1534181281204_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_parquet/.hive-staging_hive_2018-08-14_04-09-06_783_4844806752989603100-1/-ext-10000
Loading data to table default.test1_parquet
Table default.test1_parquet stats: [numFiles=1, numRows=76241, totalSize=2239503, rawDataSize=2515953]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 3.47 sec   HDFS Read: 39194597 HDFS Write: 2239588 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
Time taken: 19.319 seconds
hive>

Check the size: here Parquet does not compress as well as ORC. (Again, not representative because of the repetitive data.)

hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M  37.4 M  /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet;
2.1 M  2.1 M  /user/hive/warehouse/test1_parquet
hive>

Next, store the Parquet data with compression enabled (gzip):

hive> set parquet.compression=GZIP;
hive> create table test1_parquet_gzip
    > row format delimited fields terminated by '||'
    > stored as parquet
    > as select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0007, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0007/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:13:34,200 Stage-1 map = 0%,  reduce = 0%
2018-08-14 04:13:43,790 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.85 sec
MapReduce Total cumulative CPU time: 3 seconds 850 msec
Ended Job = job_1534181281204_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/.hive-staging_hive_2018-08-14_04-13-25_921_7360586646978839920-1/-ext-10001
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_parquet_gzip
Table default.test1_parquet_gzip stats: [numFiles=1, numRows=76241, totalSize=605285, rawDataSize=2515953]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 3.85 sec   HDFS Read: 39193754 HDFS Write: 605375 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 850 msec
OK
Time taken: 20.185 seconds
hive>

Check the size again: it shrank considerably. (Again, not representative because of the repetitive data.)

hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M  37.4 M  /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet;
2.1 M  2.1 M  /user/hive/warehouse/test1_parquet
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet_gzip;
591.1 K  591.1 K  /user/hive/warehouse/test1_parquet_gzip
hive>

Having compared the storage formats and codecs by size, they also differ at query time; compare the HDFS Read counters in the jobs below.

  • The table holds 76,241 rows; a full count(1) on the text table reads 39,196,646 bytes
  • test1 is row-oriented, so the query filtering on c1 still reads 39,197,330 bytes
  • test1_orc is columnar: the same query reads only 296,842 bytes (it reads just the c1 column)
  • test1_parquet is also columnar: it reads 1,249,432 bytes (again only the c1 column; far better than the text table, though not as good as ORC here)
hive> select count(1) from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1534181281204_0011, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0011/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:24:12,595 Stage-1 map = 0%,  reduce = 0%
2018-08-14 04:24:18,976 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.3 sec
2018-08-14 04:24:27,315 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1534181281204_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.68 sec   HDFS Read: 39196646 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
76241
Time taken: 23.739 seconds, Fetched: 1 row(s)
hive> select count(1) from test1 where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1534181281204_0008, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0008/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:19:36,821 Stage-1 map = 0%,  reduce = 0%
2018-08-14 04:19:44,152 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.66 sec
2018-08-14 04:19:51,497 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.94 sec
MapReduce Total cumulative CPU time: 2 seconds 940 msec
Ended Job = job_1534181281204_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.94 sec   HDFS Read: 39197330 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 940 msec
OK
1
Time taken: 23.09 seconds, Fetched: 1 row(s)
hive> select count(1) from test1_rc where c1 = 'WS-C4506-E';
FAILED: SemanticException [Error 10001]: Line 1:21 Table not found 'test1_rc'
hive> select count(1) from test1_orc where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1534181281204_0009, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0009/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:20:22,163 Stage-1 map = 0%,  reduce = 0%
2018-08-14 04:20:29,589 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.95 sec
2018-08-14 04:20:36,951 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.29 sec
MapReduce Total cumulative CPU time: 3 seconds 290 msec
Ended Job = job_1534181281204_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.29 sec   HDFS Read: 296842 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 290 msec
OK
1
Time taken: 22.92 seconds, Fetched: 1 row(s)
hive> select count(1) from test1_parquet where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1534181281204_0010, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0010/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1534181281204_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:20:53,130 Stage-1 map = 0%,  reduce = 0%
2018-08-14 04:21:00,451 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.11 sec
2018-08-14 04:21:07,737 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.53 sec
MapReduce Total cumulative CPU time: 3 seconds 530 msec
Ended Job = job_1534181281204_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.53 sec   HDFS Read: 1249432 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 530 msec
OK
1
Time taken: 22.837 seconds, Fetched: 1 row(s)
hive>

When a query touches more than one column, the amount of data a columnar table has to read grows accordingly.

 
