[Author]: kwu
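This post describes how to configure Snappy compression on a CDH5 cluster. The steps are as follows:
1. The three commonly used compression formats are gzip, LZO, and Snappy. Our comparison gave the following results: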
Algorithm   Compressed/original size   Compression speed   Decompression speed
GZIP        13.4%                      21 MB/s             118 MB/s
LZO         20.5%                      135 MB/s            410 MB/s
Snappy      22.2%                      172 MB/s            409 MB/s
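Snappy gives the best overall trade-off: its compression ratio is slightly worse than LZO's, but it compresses fastest and decompresses at nearly the same speed. We also tried LZO, but it frequently caused a few of our older machines to go down.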
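To make the trade-off in the table more concrete, here is a minimal sketch that estimates the compressed size and the time spent for a given input. It assumes the speeds in the table are measured against the uncompressed data size. The `estimate` helper and the 1024 MB input are illustrative choices, not part of the original benchmark.

```python
# Figures copied from the comparison table above.
CODECS = {
    "GZIP":   {"ratio": 0.134, "compress_mb_s": 21,  "decompress_mb_s": 118},
    "LZO":    {"ratio": 0.205, "compress_mb_s": 135, "decompress_mb_s": 410},
    "Snappy": {"ratio": 0.222, "compress_mb_s": 172, "decompress_mb_s": 409},
}


def estimate(codec, input_mb):
    """Return (compressed size in MB, compress seconds, decompress seconds).

    Assumption: both speeds refer to MB of uncompressed data per second.
    """
    c = CODECS[codec]
    return (
        input_mb * c["ratio"],
        input_mb / c["compress_mb_s"],
        input_mb / c["decompress_mb_s"],
    )


if __name__ == "__main__":
    for name in CODECS:
        # 1024 MB is an arbitrary example size.
        size, t_comp, t_decomp = estimate(name, 1024)
        print(f"{name:7s} size={size:6.1f} MB  "
              f"compress={t_comp:5.1f} s  decompress={t_decomp:4.1f} s")
```

Under these assumptions gzip produces the smallest output but takes several times longer to compress, which is why a fast codec tends to suit intermediate data that is written once and read back soon after.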
2. Configure the compression codecs in the HDFS core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
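A quick way to confirm that the codec list was edited correctly is to parse the file and check that SnappyCodec is registered. The sketch below is only an illustration: it works on an inline sample string with a shortened codec list rather than the real core-site.xml, and the helper names `read_property` and `registered_codecs` are hypothetical.

```python
import xml.etree.ElementTree as ET

SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec"

# Shortened sample; a real core-site.xml wraps its properties in <configuration>.
SAMPLE = """<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>"""


def read_property(xml_text, name):
    """Return the value of the named Hadoop property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


def registered_codecs(xml_text):
    """Split io.compression.codecs into a list of codec class names."""
    value = read_property(xml_text, "io.compression.codecs")
    if not value:
        return []
    return [c.strip() for c in value.split(",") if c.strip()]


if __name__ == "__main__":
    print(SNAPPY in registered_codecs(SAMPLE))
```

To check a real file, read its contents into a string and pass that to `registered_codecs` in place of `SAMPLE`.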
3. Configure the compression settings in the MapReduce mapred-site.xml
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
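The five properties above fall into two groups: three control compression of the final job output, and two control compression of the intermediate map output. As a rough sketch, the same settings can be kept in a dictionary and rendered into Hadoop's property XML. The `to_hadoop_xml` helper is hypothetical and is shown only to make the grouping explicit.

```python
import xml.etree.ElementTree as ET

SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec"

MAPRED_SNAPPY = {
    # Final job output written by FileOutputFormat.
    "mapreduce.output.fileoutputformat.compress": "true",
    "mapreduce.output.fileoutputformat.compress.type": "BLOCK",
    "mapreduce.output.fileoutputformat.compress.codec": SNAPPY,
    # Intermediate map output that is shuffled to the reducers.
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec": SNAPPY,
}


def to_hadoop_xml(props):
    """Render a dict of settings as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_hadoop_xml(MAPRED_SNAPPY))
```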
hive-site.xml
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>hive.auto.convert.join</name>
  <value>false</value>
</property>
<property>
  <name>hive.support.concurrency</name>
  <value>false</value>
</property>
spark-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export SPARK_MASTER_IP=10.130.2.20
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=48
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=37g
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/snappy-java-1.0.4.1.jar
spark-defaults.conf
spark.local.dir /diskb/sparktmp,/diskc/sparktmp,/diskd/sparktmp,/diske/sparktmp,/diskf/sparktmp,/diskg/sparktmp
spark.io.compression.codec snappy
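The two Spark properties above use the whitespace-separated key/value layout of a Spark properties file such as spark-defaults.conf. The following minimal sketch reads that layout and checks the codec setting. It handles only the whitespace separator, and the `parse_spark_properties` helper and the shortened sample are illustrative assumptions.

```python
# Shortened sample: only two of the local directories are listed here.
SAMPLE = """
# scratch directories spread across several disks
spark.local.dir /diskb/sparktmp,/diskc/sparktmp
spark.io.compression.codec snappy
"""


def parse_spark_properties(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


if __name__ == "__main__":
    props = parse_spark_properties(SAMPLE)
    print(props["spark.io.compression.codec"])
    print(props["spark.local.dir"].split(","))
```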
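With the configuration above, MapReduce, Hive, and Spark jobs on the cluster all compress their data with Snappy, which greatly reduces I/O and improves performance.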