Configuring Snappy Compression on a CDH5 Cluster

The steps to configure Snappy compression on a CDH5 cluster are as follows:

1. Choosing a codec: the three common options are gzip, LZO, and Snappy. Our comparison:

Algorithm   Compressed/Original   Compression speed   Decompression speed
GZIP        13.4%                 21 MB/s             118 MB/s
LZO         20.5%                 135 MB/s            410 MB/s
Snappy      22.2%                 172 MB/s            409 MB/s

Snappy offers the best overall balance. We also tried LZO, but it regularly crashed some of our older machines.
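Numbers like those in the table above depend heavily on hardware and data. A minimal sketch of how to measure ratio and throughput yourself, using Python's standard-library gzip as the codec under test (Snappy itself is not in the stdlib and would need a third-party binding such as python-snappy):

```python
import gzip
import time

# Generate some compressible sample data (~1.8 MB of repeated text).
data = b"hadoop snappy compression benchmark " * 50000

start = time.perf_counter()
compressed = gzip.compress(data)  # stand-in for the codec being evaluated
compress_secs = time.perf_counter() - start

start = time.perf_counter()
restored = gzip.decompress(compressed)
decompress_secs = time.perf_counter() - start

# "Compressed/Original", as in the comparison table above.
ratio = len(compressed) / len(data)
assert restored == data

print(f"ratio: {ratio:.1%}")
print(f"compress:   {len(data) / compress_secs / 1e6:.0f} MB/s")
print(f"decompress: {len(data) / decompress_secs / 1e6:.0f} MB/s")
```

Repetitive data like this compresses far better than real logs, so benchmark with a representative sample of your own data before choosing a codec.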


2. Configure the compression properties in HDFS core-site.xml

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
  </property>


3. Configure the compression properties in mapred-site.xml

  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
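A typo in a property name silently disables compression, so it is worth sanity-checking the XML before redeploying. A minimal sketch (the XML is inlined here for illustration; in practice you would read the deployed mapred-site.xml, whose path depends on your CDH layout):

```python
import xml.etree.ElementTree as ET

# Inlined for illustration; normally read from the deployed mapred-site.xml.
mapred_site = """<configuration>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>"""

root = ET.fromstring(mapred_site)
# Build a {property-name: value} map from the <property> elements.
props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

assert props["mapreduce.output.fileoutputformat.compress"] == "true"
assert props["mapreduce.output.fileoutputformat.compress.codec"].endswith("SnappyCodec")
print("mapred-site.xml compression settings look correct")
```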

4. Configure the compression properties in hive-site.xml

<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>hive.auto.convert.join</name>
  <value>false</value>
</property>
<property>
  <name>hive.support.concurrency</name>
  <value>false</value>
</property>
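The same properties can also be toggled per session from the Hive CLI, which is handy for testing the effect on a single query before touching hive-site.xml (a sketch; property names as in the file above):

```sql
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```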

5. Configure Spark's compression settings

In spark-env.sh:

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export SPARK_MASTER_IP=10.130.2.20
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=48
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=37g
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/snappy-java-1.0.4.1.jar

In spark-defaults.conf:

spark.local.dir /diskb/sparktmp,/diskc/sparktmp,/diskd/sparktmp,/diske/sparktmp,/diskf/sparktmp,/diskg/sparktmp
spark.io.compression.codec snappy


Summary:

With the configuration above, MapReduce, Hive, and Spark jobs on the cluster are all compressed with Snappy, which greatly reduces I/O and improves performance.
