1. Prepare the test data: create the table page_views in Hive and load the test data into it
create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
load data local inpath "/opt/data/page_views.dat" overwrite into table page_views;
2. Check the size of the uncompressed file
[root@hadoop001 data]# hadoop fs -du -h /user/hive/warehouse/demo.db/page_views/page_views.dat
18.1 M 54.4 M /user/hive/warehouse/demo.db/page_views/page_views.dat
[root@hadoop001 data]# du -h page_views.dat
19M page_views.dat
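The two size columns printed by `hadoop fs -du -h` are the file size and the disk space consumed across all replicas. A minimal Python sketch, using only the line shown above, recovers the replication factor implied by those two numbers. The helper name `parse_du_line` is hypothetical, and the sketch only handles the "M" unit that appears in this article.

```python
def parse_du_line(line):
    """Split one `hadoop fs -du -h` line into (size_mb, consumed_mb, path)."""
    size, size_unit, consumed, consumed_unit, path = line.split()
    # Assumption: both columns are reported in megabytes ("M"), as in this article.
    if size_unit != "M" or consumed_unit != "M":
        raise ValueError("this sketch only handles sizes reported in M")
    return float(size), float(consumed), path


size_mb, consumed_mb, path = parse_du_line(
    "18.1 M 54.4 M /user/hive/warehouse/demo.db/page_views/page_views.dat"
)

# Space consumed divided by file size approximates the replication factor.
replication = round(consumed_mb / size_mb)
print(replication)
```

For the values above the ratio rounds to 3, which matches the default HDFS replication factor.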
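3. Compress the data with different codecs and compare the sizes
Note: run hadoop checknative to check whether the native compression libraries are available. If they are not, Hadoop has to be rebuilt from source with the native library compiled in. For details, see http://blog.csdn.net/qq_26369213/article/details/78925760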
[root@hadoop001 data]# hadoop checknative
18/03/01 00:54:52 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/03/01 00:54:52 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/software/hadoop-2.6.0-cdh5.7.0/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: true /usr/lib64/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
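If this check needs to be scripted, the report can be parsed to confirm that the codecs used in this test are all available. The following is a rough Python sketch that works on the output pasted above; the function name `native_support` is hypothetical, and the report text is copied from this article rather than captured from a live cluster.

```python
import re

# Report lines copied from the `hadoop checknative` output above.
CHECKNATIVE_OUTPUT = """\
hadoop: true /opt/software/hadoop-2.6.0-cdh5.7.0/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: true /usr/lib64/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
"""


def native_support(report):
    """Map each library name to True/False based on the report lines."""
    support = {}
    for line in report.splitlines():
        match = re.match(r"^(\w+):\s+(true|false)\b", line.strip())
        if match:
            support[match.group(1)] = match.group(2) == "true"
    return support


support = native_support(CHECKNATIVE_OUTPUT)

# Codecs this article relies on; any name listed here lacks native support.
missing = [name for name in ("bzip2", "snappy", "lz4") if not support.get(name)]
print(missing)
```

An empty list means all three codecs are backed by a native library on this node.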
Compress with BZip2
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
create table page_views_bzip2
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
as select * from page_views;
Time taken: 21.224 seconds
Compress with Snappy
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
create table page_views_snappy
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
as select * from page_views;
Time taken: 17.899 seconds
Compress with LZ4 (Lz4Codec)
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.Lz4Codec;
create table page_views_lz4
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
as select * from page_views;
Time taken: 17.663 seconds
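The three blocks above differ only in the codec class and the table suffix. As an illustration, the statements can be generated from a small mapping. This is a sketch, not part of the original test: the helper `ctas_statements` is hypothetical, while the property names and codec classes are the ones used above.

```python
# Table suffix -> codec class, exactly as used in the statements above.
CODECS = {
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    "lz4": "org.apache.hadoop.io.compress.Lz4Codec",
}


def ctas_statements(suffix, codec_class, source="page_views"):
    """Return the HiveQL statements that copy `source` into a compressed table."""
    return [
        "SET hive.exec.compress.output=true;",
        f"SET mapreduce.output.fileoutputformat.compress.codec={codec_class};",
        f"create table {source}_{suffix} "
        'ROW FORMAT DELIMITED FIELDS TERMINATED BY "\\t" '
        f"as select * from {source};",
    ]


for suffix, codec_class in CODECS.items():
    print("\n".join(ctas_statements(suffix, codec_class)))
    print()
```

The printed text can be pasted into the hive shell. Adding another codec to the comparison only requires one more entry in `CODECS`.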
4. Compare the size of the source data with the compressed data
hadoop fs -du -h /user/hive/warehouse/demo.db/page_views/page_views.dat /user/hive/warehouse/demo.db/page_views_snappy/000000_0.snappy /user/hive/warehouse/demo.db/page_views_bzip2/000000_0.bz2 /user/hive/warehouse/demo.db/page_views_lz4/000000_0.lz4 | sort -nk1
3.6 M 10.9 M /user/hive/warehouse/demo.db/page_views_bzip2/000000_0.bz2
8.3 M 25.0 M /user/hive/warehouse/demo.db/page_views_lz4/000000_0.lz4
8.4 M 25.2 M /user/hive/warehouse/demo.db/page_views_snappy/000000_0.snappy
18.1 M 54.4 M /user/hive/warehouse/demo.db/page_views/page_views.dat
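To make the comparison easier to read, the sizes in the first column above can be expressed as a percentage of the uncompressed file. A short Python sketch using only the figures from this listing (the helper name `percent_of_original` is hypothetical):

```python
# First-column sizes from the listing above, in MB.
ORIGINAL_MB = 18.1
COMPRESSED_MB = {"bzip2": 3.6, "lz4": 8.3, "snappy": 8.4}


def percent_of_original(compressed_mb, original_mb=ORIGINAL_MB):
    """Compressed size as a percentage of the uncompressed size."""
    return round(100 * compressed_mb / original_mb, 1)


# Print the codecs from smallest to largest output.
for codec, size_mb in sorted(COMPRESSED_MB.items(), key=lambda item: item[1]):
    print(codec, size_mb, "MB,", percent_of_original(size_mb), "% of original")
```

On this data set BZip2 shrinks the file to about 19.9% of its original size, while LZ4 and Snappy land close together at about 45.9% and 46.4%.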
5. Run SQL queries to compare performance
hive> select count(*) from page_views;
Query ID = root_20180301002323_fdf409bc-f1af-4c2c-b26c-755722c31bfd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1519828756164_0009, Tracking URL = http://hadoop001:8088/proxy/application_1519828756164_0009/
Kill Command = /opt/software/hadoop/bin/hadoop job -kill job_1519828756164_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-01 01:28:22,389 Stage-1 map = 0%, reduce = 0%
2018-03-01 01:28:28,791 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.56 sec
2018-03-01 01:28:35,089 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1519828756164_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.78 sec HDFS Read: 19021459 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 780 msec
OK
100000
Time taken: 22.692 seconds, Fetched: 1 row(s)
hive> select count(*) from page_views_lz4;
Query ID = root_20180301002323_fdf409bc-f1af-4c2c-b26c-755722c31bfd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1519828756164_0006, Tracking URL = http://hadoop001:8088/proxy/application_1519828756164_0006/
Kill Command = /opt/software/hadoop/bin/hadoop job -kill job_1519828756164_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-01 01:14:53,317 Stage-1 map = 0%, reduce = 0%
2018-03-01 01:15:01,713 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.15 sec
2018-03-01 01:15:09,076 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.38 sec
MapReduce Total cumulative CPU time: 4 seconds 380 msec
Ended Job = job_1519828756164_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.38 sec HDFS Read: 8753905 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 380 msec
OK
100000
Time taken: 24.707 seconds, Fetched: 1 row(s)
hive> select count(*) from page_views_snappy;
Query ID = root_20180301002323_fdf409bc-f1af-4c2c-b26c-755722c31bfd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1519828756164_0007, Tracking URL = http://hadoop001:8088/proxy/application_1519828756164_0007/
Kill Command = /opt/software/hadoop/bin/hadoop job -kill job_1519828756164_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-01 01:16:11,531 Stage-1 map = 0%, reduce = 0%
2018-03-01 01:16:18,858 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.74 sec
2018-03-01 01:16:26,216 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1519828756164_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.0 sec HDFS Read: 8820268 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
100000
Time taken: 22.719 seconds, Fetched: 1 row(s)
hive> select count(*) from page_views_bzip2;
Query ID = root_20180301002323_fdf409bc-f1af-4c2c-b26c-755722c31bfd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1519828756164_0008, Tracking URL = http://hadoop001:8088/proxy/application_1519828756164_0008/
Kill Command = /opt/software/hadoop/bin/hadoop job -kill job_1519828756164_0008
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2018-03-01 01:21:36,104 Stage-1 map = 0%, reduce = 0%
2018-03-01 01:21:50,147 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 2.95 sec
2018-03-01 01:21:53,385 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 7.77 sec
2018-03-01 01:21:55,517 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 8.63 sec
2018-03-01 01:21:56,549 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.46 sec
2018-03-01 01:22:02,852 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.77 sec
MapReduce Total cumulative CPU time: 11 seconds 770 msec
Ended Job = job_1519828756164_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 11.77 sec HDFS Read: 4106723 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 770 msec
OK
100000
Time taken: 34.986 seconds, Fetched: 1 row(s)
6. Compare the results
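One way to line up the four runs is to extract the mapper count, cumulative CPU time, HDFS bytes read, and elapsed time from the `select count(*)` job logs above. The Python sketch below does this with two regular expressions. The log lines are copied from this article; the names `RUNS` and `parse_run` are hypothetical.

```python
import re

# (job summary line, elapsed-time line) for each table, copied from the logs above.
RUNS = {
    "page_views": (
        "Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.78 sec HDFS Read: 19021459 HDFS Write: 16 SUCCESS",
        "Time taken: 22.692 seconds, Fetched: 1 row(s)",
    ),
    "page_views_lz4": (
        "Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.38 sec HDFS Read: 8753905 HDFS Write: 16 SUCCESS",
        "Time taken: 24.707 seconds, Fetched: 1 row(s)",
    ),
    "page_views_snappy": (
        "Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.0 sec HDFS Read: 8820268 HDFS Write: 16 SUCCESS",
        "Time taken: 22.719 seconds, Fetched: 1 row(s)",
    ),
    "page_views_bzip2": (
        "Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 11.77 sec HDFS Read: 4106723 HDFS Write: 16 SUCCESS",
        "Time taken: 34.986 seconds, Fetched: 1 row(s)",
    ),
}

SUMMARY = re.compile(
    r"Map: (\d+)\s+Reduce: (\d+)\s+Cumulative CPU: ([\d.]+) sec\s+HDFS Read: (\d+)"
)
ELAPSED = re.compile(r"Time taken: ([\d.]+) seconds")


def parse_run(summary_line, time_line):
    """Extract the comparable figures from one job's log lines."""
    mappers, _reducers, cpu_sec, read_bytes = SUMMARY.search(summary_line).groups()
    return {
        "mappers": int(mappers),
        "cpu_sec": float(cpu_sec),
        "hdfs_read_bytes": int(read_bytes),
        "elapsed_sec": float(ELAPSED.search(time_line).group(1)),
    }


results = {table: parse_run(*lines) for table, lines in RUNS.items()}

# One row per table, ordered by elapsed time.
for table, run in sorted(results.items(), key=lambda item: item[1]["elapsed_sec"]):
    print(table, run["mappers"], run["cpu_sec"], run["hdfs_read_bytes"], run["elapsed_sec"])
```

In these runs the BZip2 table read the fewest bytes from HDFS (4106723) but used the most CPU (11.77 sec) and took the longest (34.986 seconds), and it was the only query that ran with 2 mappers. The Snappy table finished in about the same time as the uncompressed table (22.719 vs 22.692 seconds), and LZ4 was slightly slower (24.707 seconds). With a file this small, fixed job startup cost probably accounts for much of the elapsed time, so the differences may not carry over to larger data sets.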