Two factors matter when choosing a compression format:
Compression ratio: Snappy < LZ4 < LZO < GZIP < BZIP2
(BZIP2 typically compresses data to roughly 30% of its original size; Snappy/LZ4/LZO to roughly 50%.)
Compression/decompression speed: Snappy > LZ4 > LZO > GZIP > BZIP2
The speed ordering is the reverse of the ratio ordering: the higher the compression ratio, the slower the compression and decompression.
Compression saves disk space, reduces disk I/O, and speeds up network transfer, but it costs extra CPU. Since speed and ratio trade off against each other, the codec should be chosen to fit the workload: for historical (cold) data, favor a high compression ratio; for jobs that must run fast, favor a faster codec with a lower ratio.
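A quick way to get a feel for this trade-off is to compress the same file with a fast codec and a high-ratio codec and compare elapsed time and output size. A minimal local sketch (`sample.log` is a placeholder file; the exact numbers depend entirely on the data):

```bash
# Compare speed and compressed size of gzip vs. bzip2 on the same file.
# "-c" writes to stdout, so the original file is left untouched.
time gzip  -c sample.log > sample.log.gz
time bzip2 -c sample.log > sample.log.bz2
ls -lh sample.log sample.log.gz sample.log.bz2
```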
LZO
Pros: fast compression/decompression with a reasonable ratio (compressed data is roughly 50% of the original size); splittable, and one of the most popular compression formats in Hadoop; supported by the Hadoop native library; the lzop command can be installed on Linux, which makes it convenient to use.
Cons: lower compression ratio than gzip; not bundled with Hadoop, so it must be installed separately; LZO files need special handling in applications (an index must be built to support splitting, and the input format must be set to the LZO one).
Use cases: large text files that are still bigger than about 200 MB after compression; the larger the single file, the more LZO's advantages show.
GZIP
Pros: fairly high compression ratio (compressed size a bit above 30% of the original) with reasonably fast compression/decompression; supported by Hadoop out of the box, so applications handle gzip files exactly like plain text; has a Hadoop native library; most Linux systems ship the gzip command, which makes it convenient to use.
Cons: does not support splitting.
Use cases: any file that stays within about 130 MB after compression (roughly one HDFS block) is a good candidate. For example, compress one day's or one hour's logs into a single gzip file each, so that when the MapReduce job runs, concurrency comes from having many gzip files. Hive, streaming, and Java MapReduce programs process them exactly like plain text; no changes to existing programs are needed after switching to gzip.
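A minimal sketch of the per-hour gzip workflow described above (the log file name and HDFS path are hypothetical):

```bash
# Compress one hour's log so the .gz file stays within roughly one HDFS block,
# then upload it; Hive/MapReduce read .gz input without any code changes.
gzip access_2018081300.log                      # produces access_2018081300.log.gz
hdfs dfs -mkdir -p /tmp/logs
hdfs dfs -put access_2018081300.log.gz /tmp/logs/
```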
BZIP2
Pros: splittable; very high compression ratio (around 30% of the original size), higher than gzip; supported by Hadoop itself, though without native support; Linux systems ship the bzip2 command, which makes it convenient to use.
Cons: slow compression/decompression; no native library support.
Use cases: when speed is not critical but a high compression ratio is needed, for example as the output format of MapReduce jobs; when large job output needs to be compressed and archived to save disk space and will rarely be read afterwards; or when a single large text file should be compressed to save space while still supporting splits and remaining compatible with existing applications (no program changes required).
Snappy
Pros: very fast compression with a reasonable ratio (around 50% of the original size); supported by the Hadoop native library.
Cons: not splittable; lower compression ratio than gzip; not bundled with Hadoop, so it must be installed; no corresponding command on Linux.
Use cases: compressing the intermediate data between map and reduce when a job's map output is large, or compressing the output of one MapReduce job that serves as the input of another.
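As an illustration of the map-output use case, Snappy can be enabled for just the intermediate data of a single job via `-D` options; a sketch using the examples jar that appears later in this article (input/output paths are placeholders):

```bash
# Compress only the intermediate map output with Snappy for this one job.
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /tmp/input.txt /tmp/snappy-out/
```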
Comparison of the four compression formats:
Format | Splittable | Native | Ratio | Speed | Bundled with Hadoop | Linux command | Application changes after switching |
---|---|---|---|---|---|---|---|
gzip | No | Yes | Very high | Fairly fast | Yes, works out of the box | Yes | None; handled like plain text |
lzo | Yes | Yes | Fairly high | Very fast | No, must be installed | Yes | Needs an index and an explicit input format |
snappy | No | Yes | Fairly high | Very fast | No, must be installed | No | None; handled like plain text |
bzip2 | Yes | No | Highest | Slow | Yes, works out of the box | Yes | None; handled like plain text |
Hadoop jobs are typically I/O-bound, so compression speeds up I/O and reduces the amount of data sent over the network. For parallelism, the chosen format should also support splitting.
Among the formats above, only bzip2 and LZO support splitting, and LZO only conditionally: the file must first be indexed (com.hadoop.compression.lzo.LzoIndexer creates the index; see the sketch after the table below).
Compression format | Tool | Algorithm | File extension | Splittable |
---|---|---|---|---|
gzip | gzip | DEFLATE | .gz | No |
bzip2 | bzip2 | bzip2 | .bz2 | Yes |
LZO | lzop | LZO | .lzo | Yes if indexed |
Snappy | N/A | Snappy | .snappy | No |
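To make an .lzo file splittable, it has to be indexed first with the LzoIndexer class mentioned above; a sketch (the hadoop-lzo jar path and the file path are placeholders, and the hadoop-lzo package must already be installed):

```bash
# Build an index next to the .lzo file so MapReduce can split it.
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /data/big_file.lzo
# For very large files, hadoop-lzo also ships a MapReduce-based
# com.hadoop.compression.lzo.DistributedLzoIndexer.
```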
Conceptually: if the compressed format is splittable, the file can be divided into multiple splits and processed by several tasks in parallel, which improves job throughput. For example, a 1 GB .bz2 file on HDFS with 128 MB blocks can be read by eight map tasks concurrently, while a 1 GB .gz file has to be processed by a single mapper.
Check which compression formats the Hadoop installation on this machine supports (only a Hadoop build compiled with native libraries supports all of these codecs):
[hadoop@hadoop01 ~]$ hadoop checknative
18/08/13 18:43:08 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/08/13 18:43:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: true /lib64/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
[hadoop@hadoop01 ~]$
To use a codec, it only needs to be declared in Hadoop's configuration files.
Compression can be applied at all three stages of a MapReduce job: compressed map input is decompressed as it is read, the intermediate map output can be compressed and is decompressed before the reduce reads it, and the reduce output can be compressed as it is written out.
In short, enabling it is just a matter of configuration:
2.4.1.1 Modify the Hadoop configuration files
Configure core-site.xml (①):
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
Configure mapred-site.xml (③):
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
For the step in between (②), open the Hadoop website and look up mapred-default.xml to see the available properties and their defaults.
After changing the configuration files, restart the HDFS and YARN services.
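Assuming the standard sbin scripts of this Hadoop distribution, the restart can be done like this (HADOOP_HOME must point at the install directory):

```bash
# Restart YARN and HDFS so the new compression settings take effect.
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
```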
2.4.1.2 Run a MapReduce job to test
[hadoop@hadoop01 ~]$ cd app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/
[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /tmp/input.txt /tmp/compression-out/
...
[hadoop@hadoop01 mapreduce]$
Check the result: the job output is compressed as .bz2, consistent with the configuration.
[hadoop@hadoop01 mapreduce]$ hdfs dfs -ls /tmp/compression-out/
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-08-13 20:01 /tmp/compression-out/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 65 2018-08-13 20:01 /tmp/compression-out/part-r-00000.bz2
[hadoop@hadoop01 mapreduce]$ hdfs dfs -text /tmp/compression-out/part-r-00000.bz2
18/08/13 20:02:53 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/08/13 20:02:53 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
data 1
is 2
sample 1
test 2
this 2
[hadoop@hadoop01 mapreduce]$
Hive's CREATE TABLE statement has a STORED AS file_format clause for specifying the table's storage format. Choosing a good format not only saves storage space in Hive but also improves query performance.
For the available file_format values, see: https://blog.csdn.net/wawa8899/article/details/81674817
2.4.2.1 No compression
Create an uncompressed table in Hive and load the data into it:
hive> create table test1(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||';
OK
Time taken: 0.716 seconds
hive> load data local inpath '/home/hadoop/data/20180813000203.txt' overwrite into table test1;
hive> select count(1) from test1;
OK
76241
Time taken: 20.67 seconds, Fetched: 1 row(s)
hive>
Now check the file size on HDFS:
[hadoop@hadoop01 data]$ hdfs dfs -du -s -h /user/hive/warehouse/test1
37.4 M 37.4 M /user/hive/warehouse/test1
[hadoop@hadoop01 data]$
2.4.2.2 bzip2 compression
Create a bzip2-compressed table in Hive and load the data into it (to see how Hive handles compression, open the Hive website and look at the Compression section).
Check Hive's current compression setting; output compression is off by default:
hive> SET hive.exec.compress.output;
hive.exec.compress.output=false
hive>
Check the current output codec; it shows BZip2Codec here because of the mapred-site.xml setting made earlier (the stock Hadoop default is DefaultCodec):
hive> SET mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive>
Enable output compression, set the codec to bzip2, and create a table:
hive> SET hive.exec.compress.output=true;
hive> SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
hive> create table test1_bzip2
> row format delimited fields terminated by '||'
> as select * from test1;
Check the file size on HDFS: it has shrunk from the original 37.4 MB to 450.0 KB (bzip2 would normally compress to around 30% of the original; my data contains a lot of duplication, which is why the reduction is so extreme), and the files on HDFS are now stored as .bz2:
[hadoop@hadoop01 data]$ hdfs dfs -du -s -h /user/hive/warehouse/test1_bzip2
450.0 K 450.0 K /user/hive/warehouse/test1_bzip2
[hadoop@hadoop01 data]$ hdfs dfs -ls /user/hive/warehouse/test1_bzip2
Found 1 items
-rwxr-xr-x 1 hadoop supergroup 460749 2018-08-13 20:32 /user/hive/warehouse/test1_bzip2/000000_0.bz2
Important: to avoid affecting other users, do not make these compression settings global; keep them session-scoped and change them back as soon as you are done!
hive> SET hive.exec.compress.output=false;
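One convenient way to keep such settings session-scoped is to pass them together with the statement in a single `hive -e` invocation, so nothing leaks into other users' sessions (a sketch; the table name `test1_bzip2_tmp` is hypothetical):

```bash
# The SET statements live only inside this one hive session.
hive -e "SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
CREATE TABLE test1_bzip2_tmp
ROW FORMAT DELIMITED FIELDS TERMINATED BY '||'
AS SELECT * FROM test1;"
```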
2.4.2.3 SEQUENCEFILE storage
Create another table stored as SEQUENCEFILE. Data cannot be loaded into it directly from a text file; instead, first create a text-format table, load the data into that, and then insert from it into the SEQUENCEFILE table.
hive> create table test1(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||';
OK
Time taken: 0.145 seconds
hive> load data local inpath '/home/hadoop/data/20180813000203.txt' overwrite into table test1;
Loading data to table default.test1
Table default.test1 stats: [numFiles=1, numRows=0, totalSize=39187874, rawDataSize=0]
OK
Time taken: 1.316 seconds
hive>
hive> create table test1_seq(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as SEQUENCEFILE;
OK
Time taken: 0.142 seconds
hive> insert into table test1_seq select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0002, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:31:06,612 Stage-1 map = 0%, reduce = 0%
2018-08-14 03:31:14,123 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.86 sec
MapReduce Total cumulative CPU time: 1 seconds 860 msec
Ended Job = job_1534181281204_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_seq/.hive-staging_hive_2018-08-14_03-30-59_220_768607288298096260-1/-ext-10000
Loading data to table default.test1_seq
Table default.test1_seq stats: [numFiles=1, numRows=76241, totalSize=12155988, rawDataSize=10972575]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.86 sec HDFS Read: 39194480 HDFS Write: 12156070 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
Time taken: 17.464 seconds
hive>
Check on HDFS: a SEQUENCEFILE is normally somewhat larger than the raw data because of its record and sync overhead. (My file contains a lot of duplicate data, so the result below comes out smaller and should not be taken as representative.)
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_seq;
11.6 M 11.6 M /user/hive/warehouse/test1_seq
hive>
2.4.2.4 RCFILE storage
RCFile is a hybrid row-columnar format: rows are grouped into row groups, and within each row group the values of a column are stored together, so a column of those rows always stays within one HDFS block. Its drawback is that the row group is quite small (around 4 MB by default), so it is rarely used any more.
Create an RCFILE table. As with SEQUENCEFILE, because the data is reorganized into this row/columnar layout on write, it cannot be loaded directly from a text file; insert it from the text-format table instead:
hive> create table test1_rcfile(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as rcfile;
OK
Time taken: 0.146 seconds
hive> insert into test1_rcfile select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0003, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:46:15,737 Stage-1 map = 0%, reduce = 0%
2018-08-14 03:46:23,228 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
MapReduce Total cumulative CPU time: 1 seconds 870 msec
Ended Job = job_1534181281204_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_rcfile/.hive-staging_hive_2018-08-14_03-46-08_429_3722754602144122242-1/-ext-10000
Loading data to table default.test1_rcfile
Table default.test1_rcfile stats: [numFiles=1, numRows=76241, totalSize=8604253, rawDataSize=8532863]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.87 sec HDFS Read: 39194682 HDFS Write: 8604337 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 870 msec
OK
Time taken: 17.323 seconds
hive>
Check the size on HDFS. In practice RCFILE typically saves only around 10% of space compared with the raw data, which is why it is not widely used either. (My file contains a lot of duplicate data, so the result below is not representative.)
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_rcfile;
8.2 M 8.2 M /user/hive/warehouse/test1_rcfile
hive>
2.4.2.5 ORC storage
hive> create table test1_orc(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as orc;
OK
Time taken: 0.116 seconds
hive> insert into test1_orc select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0004, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0004/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:53:54,382 Stage-1 map = 0%, reduce = 0%
2018-08-14 03:54:04,091 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.38 sec
MapReduce Total cumulative CPU time: 4 seconds 380 msec
Ended Job = job_1534181281204_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_orc/.hive-staging_hive_2018-08-14_03-53-46_132_1524862905095712631-1/-ext-10000
Loading data to table default.test1_orc
Table default.test1_orc stats: [numFiles=1, numRows=76241, totalSize=678984, rawDataSize=219269116]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.38 sec HDFS Read: 39194660 HDFS Write: 679067 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 380 msec
OK
Time taken: 20.572 seconds
hive>
Check the size on HDFS: the reduction is striking. ORC uses ZLIB compression by default. (My file contains a lot of duplicate data, so the result is not representative.)
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc;
663.1 K 663.1 K /user/hive/warehouse/test1_orc
hive>
Create another ORC table, this time without compression:
hive> create table test1_orc_null(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as orc tblproperties ("orc.compress"="NONE");
OK
Time taken: 0.156 seconds
hive> insert into test1_orc_null select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0005, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0005/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:02:40,232 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:02:49,821 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.48 sec
MapReduce Total cumulative CPU time: 3 seconds 480 msec
Ended Job = job_1534181281204_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_orc_null/.hive-staging_hive_2018-08-14_04-02-32_108_711422127353495082-1/-ext-10000
Loading data to table default.test1_orc_null
Table default.test1_orc_null stats: [numFiles=1, numRows=76241, totalSize=2087911, rawDataSize=219269116]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.48 sec HDFS Read: 39194703 HDFS Write: 2087999 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 480 msec
OK
Time taken: 19.238 seconds
hive>
Check the sizes: the uncompressed ORC table is somewhat larger than the compressed one. (My file contains a lot of duplicate data, so the result is not representative.)
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc;
663.1 K 663.1 K /user/hive/warehouse/test1_orc
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc_null;
2.0 M 2.0 M /user/hive/warehouse/test1_orc_null
hive>
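Besides the default ZLIB and NONE, `orc.compress` also accepts SNAPPY, which trades some compression ratio for lower CPU cost. An untested sketch on the same data (the table name `test1_orc_snappy` is hypothetical):

```bash
hive -e "CREATE TABLE test1_orc_snappy
STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY')
AS SELECT * FROM test1;"
# Then compare: hdfs dfs -du -s -h /user/hive/warehouse/test1_orc_snappy
```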
2.4.2.6 PARQUET storage
hive> create table test1_parquet(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as parquet;
OK
Time taken: 0.137 seconds
hive> insert into test1_parquet select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0006, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0006/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:09:14,081 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:09:23,547 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.47 sec
MapReduce Total cumulative CPU time: 3 seconds 470 msec
Ended Job = job_1534181281204_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_parquet/.hive-staging_hive_2018-08-14_04-09-06_783_4844806752989603100-1/-ext-10000
Loading data to table default.test1_parquet
Table default.test1_parquet stats: [numFiles=1, numRows=76241, totalSize=2239503, rawDataSize=2515953]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.47 sec HDFS Read: 39194597 HDFS Write: 2239588 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
Time taken: 19.319 seconds
hive>
Check the size: its compression is not as good as ORC's here. (My file contains a lot of duplicate data, so the result is not representative.)
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet;
2.1 M 2.1 M /user/hive/warehouse/test1_parquet
hive>
Next, store PARQUET with compression enabled:
hive> set parquet.compression=GZIP;
hive> create table test1_parquet_gzip
> row format delimited fields terminated by '||'
> stored as parquet
> as select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0007, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0007/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:13:34,200 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:13:43,790 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.85 sec
MapReduce Total cumulative CPU time: 3 seconds 850 msec
Ended Job = job_1534181281204_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/.hive-staging_hive_2018-08-14_04-13-25_921_7360586646978839920-1/-ext-10001
Moving data to: hdfs://192.168.1.8:9000/user/hive/warehouse/test1_parquet_gzip
Table default.test1_parquet_gzip stats: [numFiles=1, numRows=76241, totalSize=605285, rawDataSize=2515953]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.85 sec HDFS Read: 39193754 HDFS Write: 605375 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 850 msec
OK
Time taken: 20.185 seconds
hive>
Check the size again: it has shrunk considerably. (My file contains a lot of duplicate data, so the result is not representative.)
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet;
2.1 M 2.1 M /user/hive/warehouse/test1_parquet
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet_gzip;
591.1 K 591.1 K /user/hive/warehouse/test1_parquet_gzip
hive>
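`parquet.compression` likewise accepts SNAPPY and UNCOMPRESSED in addition to GZIP. An untested sketch (the table name `test1_parquet_snappy` is hypothetical):

```bash
hive -e "SET parquet.compression=SNAPPY;
CREATE TABLE test1_parquet_snappy
STORED AS PARQUET
AS SELECT * FROM test1;"
# Then compare: hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet_snappy
```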
Beyond storage size, the formats and codecs above also differ on the compute side; compare the HDFS Read counters of the jobs below.
hive> select count(1) from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1534181281204_0011, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0011/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:24:12,595 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:24:18,976 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2018-08-14 04:24:27,315 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1534181281204_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.68 sec HDFS Read: 39196646 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
76241
Time taken: 23.739 seconds, Fetched: 1 row(s)
hive> select count(1) from test1 where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1534181281204_0008, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0008/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:19:36,821 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:19:44,152 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.66 sec
2018-08-14 04:19:51,497 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.94 sec
MapReduce Total cumulative CPU time: 2 seconds 940 msec
Ended Job = job_1534181281204_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.94 sec HDFS Read: 39197330 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 940 msec
OK
1
Time taken: 23.09 seconds, Fetched: 1 row(s)
hive> select count(1) from test1_rc where c1 = 'WS-C4506-E';
FAILED: SemanticException [Error 10001]: Line 1:21 Table not found 'test1_rc'
hive> select count(1) from test1_orc where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1534181281204_0009, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0009/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:20:22,163 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:20:29,589 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.95 sec
2018-08-14 04:20:36,951 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.29 sec
MapReduce Total cumulative CPU time: 3 seconds 290 msec
Ended Job = job_1534181281204_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.29 sec HDFS Read: 296842 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 290 msec
OK
1
Time taken: 22.92 seconds, Fetched: 1 row(s)
hive> select count(1) from test1_parquet where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1534181281204_0010, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0010/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:20:53,130 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:21:00,451 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.11 sec
2018-08-14 04:21:07,737 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.53 sec
MapReduce Total cumulative CPU time: 3 seconds 530 msec
Ended Job = job_1534181281204_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.53 sec HDFS Read: 1249432 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 530 msec
OK
1
Time taken: 22.837 seconds, Fetched: 1 row(s)
hive>
When a query touches more than one column, the amount of data a columnar table has to read grows correspondingly.
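To see that effect, one could compare the HDFS Read counter of the single-column filter above with a query that touches several columns of the same ORC table; a sketch (the extra columns are chosen arbitrarily):

```bash
# Single-column predicate vs. a query that reads several columns;
# compare the "HDFS Read" counters printed at the end of each job.
hive -e "SELECT count(1) FROM test1_orc WHERE c1 = 'WS-C4506-E';"
hive -e "SELECT count(1) FROM test1_orc WHERE c1 = 'WS-C4506-E' AND c10 IS NOT NULL AND c20 IS NOT NULL;"
```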