Hive Compression

Compression:
reduces disk storage pressure and disk I/O load
reduces network I/O load
1) First, make sure Hadoop supports compression.
Check which compression algorithms are supported:
$ bin/hadoop checknative
Native library checking:
hadoop:  false 
zlib:    false 
snappy:  false 
lz4:     false 
bzip2:   false 
openssl: false
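
As a side note, checknative also accepts an -a flag, which checks all libraries and exits with a nonzero status if any is unavailable (handy in scripts):
$ bin/hadoop checknative -a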


Snappy strikes a moderate balance between compression ratio and speed.


2) Build Hadoop from source with native Snappy support: mvn package -Pdist,native,docs -DskipTests -Dtar -Drequire.snappy
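
Building the native libraries requires a native toolchain and the Snappy development headers on the build host. A minimal sketch, assuming a CentOS machine (the package names are an assumption and differ across distributions; Hadoop 2.x additionally expects protobuf 2.5.0 on the PATH):

$ sudo yum install -y gcc gcc-c++ make cmake zlib-devel openssl-devel snappy snappy-devel
# JDK, Maven, and protobuf 2.5.0 must also be installed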


3) Replace $HADOOP_HOME/lib/native: upload the precompiled native archive to $HADOOP_HOME and extract it there
$ tar -zxf cdh5.3.6-snappy-lib-natirve.tar.gz
Check again: $ bin/hadoop checknative
Native library checking:
hadoop:  true /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
snappy:  true /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/lib/native/libsnappy.so.1
lz4:     true revision:99
bzip2:   true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so


Start dfs, yarn, and the historyserver, then submit a job:
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar pi 1 2
After it finishes, look up this job's configuration in the web UI:


mapreduce.map.output.compress  false 
mapreduce.map.output.compress.codec  org.apache.hadoop.io.compress.DefaultCodec 
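
The same job configuration can also be fetched from the Job History Server REST API (a sketch; the host and the job id below are placeholders to fill in):
$ curl http://historyserver-host:19888/ws/v1/history/mapreduce/jobs/job_<id>/conf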


4) Configure mapred-site.xml, adding the following below the existing properties:

<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>

<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>


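Restarting the daemons (next step) makes this change global. For a quick one-off test, the same properties can instead be passed per job via generic options (a sketch, assuming the example driver parses them through ToolRunner):
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar pi -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec 1 2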

Stop all processes, restart dfs, yarn, and the historyserver, then submit another job:
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar pi 3 5
After it finishes, check this job's configuration in the web UI again:




mapreduce.map.output.compress  true  (job.xml ⬅ mapred-site.xml)
mapreduce.map.output.compress.codec  org.apache.hadoop.io.compress.SnappyCodec 




Enabling Snappy compression + storing as ORC
Method 1: enable compression in the MapReduce shuffle stage


set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
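
Strictly speaking, the three settings above compress the final job output rather than the shuffle itself. For map-output (shuffle) compression, the properties configured in mapred-site.xml earlier can also be set per session:

set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;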




create table if not exists file_orc_snappy(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by '\t'
stored as ORC
tblproperties("orc.compress"="SNAPPY");

Note: the ORC compression table property is orc.compress, with values NONE, ZLIB, or SNAPPY.


insert into table file_orc_snappy select * from file_text;
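
file_text is the plain-text source table these inserts read from; it is not defined in the original post. A hypothetical definition with the same columns might look like this (the load path is a placeholder):

create table if not exists file_text(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by '\t'
stored as textfile;

load data local inpath '/path/to/logs.txt' into table file_text;  -- hypothetical path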


Method 2: compress the result files written by the reduce output
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
(These are the current names of the deprecated mapred.output.compress* properties used in Method 1.)


create table if not exists file_parquet_snappy(
t_time string,
t_url string,
t_uuid string,
t_refered_url string,
t_ip string,
t_user string,
t_city string
)
row format delimited fields terminated by '\t'
stored as parquet
tblproperties("parquet.compression"="Snappy");


insert into table file_parquet_snappy select * from file_text;        -- appends rows
insert overwrite table file_parquet_snappy select * from file_text;   -- replaces the table's contents
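
To verify what was actually written, inspect the table metadata and its files from the Hive CLI (a sketch; the path assumes the default hive.metastore.warehouse.dir):

desc formatted file_parquet_snappy;
-- check the InputFormat/OutputFormat lines and the parquet.compression table parameter

dfs -ls /user/hive/warehouse/file_parquet_snappy;
-- assumed default warehouse location; adjust to your configuration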
