Using LZO in Hadoop and Spark

1. Environment Preparation

  • hadoop-2.6.0-cdh5.15.1 with compression support (see the Hadoop installation notes)
  • hadoop-lzo jar (download link: hadoop-lzo jar download page)
  • lzo source package (download link: lzo download page)
  • lzop source package (download link: lzop download page)

Install the build dependencies:

yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool

2. Install and Configure lzo

[hadoop@hadoop000 software]$ wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
[hadoop@hadoop000 software]$ tar -zxvf lzo-2.10.tar.gz -C ../app/
[hadoop@hadoop000 software]$ cd ../app/
[hadoop@hadoop000 app]$ ll
drwxr-xr-x  13 hadoop hadoop 4096 Sep 29 10:07 lzo-2.10
[root@hadoop000 lzo-2.10]# ./configure
[root@hadoop000 lzo-2.10]# make install
make[1]: Entering directory `/home/hadoop/app/lzo-2.10'
 /bin/mkdir -p '/usr/local/lib'
 /bin/sh ./libtool   --mode=install /bin/install -c   src/liblzo2.la '/usr/local/lib'
libtool: install: /bin/install -c src/.libs/liblzo2.lai /usr/local/lib/liblzo2.la
libtool: install: /bin/install -c src/.libs/liblzo2.a /usr/local/lib/liblzo2.a
libtool: install: chmod 644 /usr/local/lib/liblzo2.a
libtool: install: ranlib /usr/local/lib/liblzo2.a
libtool: finish: PATH="/usr/local/openresty/nginx/sbin:/usr/java/jdk1.8.0_45/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin:/sbin" ldconfig -n /usr/local/lib
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the '-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the 'LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the 'LD_RUN_PATH' environment variable
     during linking
   - use the '-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to '/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
 /bin/mkdir -p '/usr/local/share/doc/lzo'
 /bin/install -c -m 644 AUTHORS COPYING NEWS THANKS doc/LZO.FAQ doc/LZO.TXT doc/LZOAPI.TXT '/usr/local/share/doc/lzo'
 /bin/mkdir -p '/usr/local/lib/pkgconfig'
 /bin/install -c -m 644 lzo2.pc '/usr/local/lib/pkgconfig'
 /bin/mkdir -p '/usr/local/include/lzo'
 /bin/install -c -m 644 include/lzo/lzo1.h include/lzo/lzo1a.h include/lzo/lzo1b.h include/lzo/lzo1c.h include/lzo/lzo1f.h include/lzo/lzo1x.h include/lzo/lzo1y.h include/lzo/lzo1z.h include/lzo/lzo2a.h include/lzo/lzo_asm.h include/lzo/lzoconf.h include/lzo/lzodefs.h include/lzo/lzoutil.h '/usr/local/include/lzo'
make[1]: Leaving directory `/home/hadoop/app/lzo-2.10'

3. Install and Configure lzop

[hadoop@hadoop000 software]$ wget http://www.lzop.org/download/lzop-1.04.tar.gz
[hadoop@hadoop000 software]$ tar -zxvf lzop-1.04.tar.gz -C ../app/
[hadoop@hadoop000 app]$ ll
drwxr-xr-x   6 hadoop hadoop 4096 Aug 10  2017 lzop-1.04
[root@hadoop000 ~]# cd /home/hadoop/app/lzop-1.04/
[root@hadoop000 lzop-1.04]# ./configure
[root@hadoop000 lzop-1.04]# make && make install
make[1]: Leaving directory `/home/hadoop/app/lzop-1.04'
make[1]: Entering directory `/home/hadoop/app/lzop-1.04'
 /bin/mkdir -p '/usr/local/bin'
  /bin/install -c src/lzop '/usr/local/bin'
 /bin/mkdir -p '/usr/local/share/doc/lzop'
 /bin/install -c -m 644 AUTHORS COPYING NEWS README THANKS doc/lzop.html doc/lzop.man doc/lzop.ps doc/lzop.tex doc/lzop.txt doc/lzop.pod '/usr/local/share/doc/lzop'
 /bin/mkdir -p '/usr/local/share/man/man1'
 /bin/install -c -m 644 doc/lzop.1 '/usr/local/share/man/man1'
make[1]: Leaving directory `/home/hadoop/app/lzop-1.04'

4. Test lzop

[hadoop@hadoop000 ~]$ ll
-rw-rw-r--.  1 hadoop hadoop 4448 Sep  7 23:36 zookeeper.out
[hadoop@hadoop000 ~]$ lzop zookeeper.out 
[hadoop@hadoop000 ~]$ ll
-rw-rw-r--.  1 hadoop hadoop 4448 Sep  7 23:36 zookeeper.out
-rw-rw-r--   1 hadoop hadoop 1630 Sep  7 23:36 zookeeper.out.lzo
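As an extra sanity check that is independent of the lzop binary, every lzop-compressed file begins with a fixed 9-byte magic number. A minimal Python sketch (the helper names here are illustrative, not part of any library):

```python
# The fixed 9-byte magic number that begins every lzop-compressed file.
LZOP_MAGIC = bytes([0x89, 0x4C, 0x5A, 0x4F, 0x00, 0x0D, 0x0A, 0x1A, 0x0A])

def looks_like_lzop(header: bytes) -> bool:
    """Return True if `header` (the first bytes of a file) carries the lzop magic."""
    return header[:len(LZOP_MAGIC)] == LZOP_MAGIC

def is_lzop_file(path: str) -> bool:
    """Read just enough of the file on disk to check the magic."""
    with open(path, "rb") as f:
        return looks_like_lzop(f.read(len(LZOP_MAGIC)))
```

Running `is_lzop_file("zookeeper.out.lzo")` on the file produced above should return True, while the uncompressed `zookeeper.out` should return False.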

5. Upload the hadoop-lzo jar

[hadoop@hadoop000 common]$ pwd
/home/hadoop/app/hadoop/share/hadoop/common
[hadoop@hadoop000 common]$ ll
-rw-r--r--  1 hadoop hadoop  193831 Sep 29 09:18 hadoop-lzo-0.4.20.jar

6. Build hadoop-lzo (Optional)

[root@hadoop000 tar]# wget https://github.com/twitter/hadoop-lzo/archive/master.zip
[root@hadoop000 tar]# unzip -d /home/hadoop/app/ master.zip 
Edit pom.xml: set the Hadoop version properties

    <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <hadoop.current.version>2.6.0-cdh5.15.1</hadoop.current.version>
      <hadoop.old.version>1.0.4</hadoop.old.version>
    </properties>

and add the Cloudera repository:

    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
As the hadoop-lzo README notes, point the build at the freshly installed lzo headers and library:

[hadoop@hadoop000 hadoop-lzo-master]$ export C_INCLUDE_PATH=/usr/local/include
[hadoop@hadoop000 hadoop-lzo-master]$ export LIBRARY_PATH=/usr/local/lib
[hadoop@hadoop000 hadoop-lzo-master]$  mvn clean package -Dmaven.test.skip=true
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:41 min
[INFO] Finished at: 2019-04-14T12:03:05+00:00
[INFO] Final Memory: 37M/1252M
[INFO] ------------------------------------------------------------------------
Copy the generated native libraries into Hadoop's native lib directory:
[hadoop@hadoop000 hadoop-lzo-master]$ cd target/native/Linux-amd64-64/
[hadoop@hadoop000 Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
[hadoop@hadoop000 ~]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
Add the compiled jar to the Hadoop common directory:
[hadoop@hadoop000 target]$ cp hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/

7. Configure core-site.xml

    
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
           org.apache.hadoop.io.compress.DefaultCodec,
           org.apache.hadoop.io.compress.BZip2Codec,
           org.apache.hadoop.io.compress.SnappyCodec,
           com.hadoop.compression.lzo.LzoCodec,
           com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

8. Configure mapred-site.xml



<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>

<property>
    <name>mapreduce.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>

<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=/usr/local/lib</value>
</property>

9. Test Generating an lzo File

[hadoop@hadoop000 data]$ ll
-rw-r--r--  1 hadoop hadoop 533444411 Apr  1  2015 ratings.csv
[hadoop@hadoop000 data]$ du -sh *
509M    ratings.csv
[hadoop@hadoop000 data]$ lzop ratings.csv 
[hadoop@hadoop000 data]$ du -sh *
509M    ratings.csv
220M    ratings.csv.lzo
[hadoop@hadoop000 data]$ hdfs dfs -put ratings.csv.lzo /ruozedata/input/
[hadoop@hadoop000 hadoop]$ hadoop jar \
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar \
wordcount \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
/ruozedata/input/ratings.csv.lzo \
/ruozedata/output/lzo_02/ 
19/09/29 11:12:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/29 11:12:56 INFO input.FileInputFormat: Total input paths to process : 1
19/09/29 11:12:56 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/09/29 11:12:56 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 52decc77982b58949890770d22720a91adce0c3f]
19/09/29 11:12:57 INFO mapreduce.JobSubmitter: number of splits:1
Only one split: a plain .lzo file without an index is not splittable.

10. Generate the Index File

An LZO index must be built for the .lzo file:

[hadoop@hadoop000 hadoop]$ hdfs dfs -mkdir /ruozedata/index/
[hadoop@hadoop000 hadoop]$ hadoop jar /home/hadoop/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /ruozedata/input/ratings.csv.lzo
[hadoop@hadoop000 hadoop]$ hdfs dfs -ls /ruozedata/input/*
-rw-r--r--   1 hadoop hadoop  533444411 2019-09-29 10:25 /ruozedata/input/ratings.csv
-rw-r--r--   1 hadoop hadoop  230567633 2019-09-29 11:11 /ruozedata/input/ratings.csv.lzo
-rw-r--r--   1 hadoop hadoop      16280 2019-09-29 11:28 /ruozedata/input/ratings.csv.lzo.index
An .index-suffixed file is generated in the same directory.
Re-run the job, this time specifying LzoTextInputFormat:
[hadoop@hadoop000 hadoop]$ hadoop jar \
> share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar \
> wordcount \
> -Dmapreduce.output.fileoutputformat.compress=true \
> -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
> -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
> /ruozedata/input/ratings.csv.lzo \
> /ruozedata/output/lzo_04/
19/09/29 11:40:59 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/29 11:41:00 INFO input.FileInputFormat: Total input paths to process : 1
19/09/29 11:41:00 INFO mapreduce.JobSubmitter: number of splits:2

Note: the .lzo file must live on HDFS.
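The .lzo.index file produced above has a simple layout: hadoop-lzo writes it as a plain sequence of 8-byte big-endian longs, one byte offset per compressed block in the .lzo file. A minimal Python sketch to parse one (a hypothetical helper, assuming that layout):

```python
import struct

def read_lzo_index(data: bytes):
    """Parse the contents of a .lzo.index file: a plain sequence of
    8-byte big-endian longs, one byte offset per compressed block."""
    count = len(data) // 8
    return [struct.unpack_from(">q", data, i * 8)[0] for i in range(count)]

# The 16280-byte index above would therefore describe
# 16280 / 8 = 2035 compressed blocks inside ratings.csv.lzo.
```

These offsets are what let LzoTextInputFormat place split boundaries at compressed-block boundaries instead of treating the whole file as one split.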

11. Using LZO in Spark

[hadoop@hadoop000 conf]$ vi spark-defaults.conf
spark.jars                       /home/hadoop/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object CompressionApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val input = args(0)
    val output = args(1)
    // FileUtils.deleteTarget is a project-local helper that removes the
    // output path if it already exists
    FileUtils.deleteTarget(output, new Configuration())
    // Read the .lzo file with the splittable LzoTextInputFormat
    val rdd = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](input)
    rdd.map(_._2.toString).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      .saveAsTextFile(output, classOf[com.hadoop.compression.lzo.LzopCodec])
    sc.stop()
  }
}
spark-submit \
 --class com.ruozedata.bigdata.spark.homework.CompressionApp \
 --master yarn \
 --deploy-mode client \
 --executor-memory 3G \
 --num-executors 1 \
/home/hadoop/lib/ruozedata-spark-flink-1.0.jar \
/ruozedata/input/ratings.csv.lzo \
/ruozedata/output/lzo_spark/

12. Summary

If the 220 MB .lzo file were splittable it would get 2 splits; unsplittable, it gets only 1. A splittable file larger than one HDFS block is processed by 2 map tasks in parallel, which improves efficiency; an unsplittable file is processed by a single map no matter how large it is, which wastes time. So when using LZO without index files, keep each generated .lzo file within one block size: an un-indexed .lzo is handled by a single map, and an oversized file makes that one map run for a very long time.
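The split arithmetic above can be sketched in a few lines (a hypothetical helper; 128 MB is the block size assumed here):

```python
import math

def lzo_splits(file_size: int, block_size: int = 128 * 1024 * 1024,
               indexed: bool = False) -> int:
    """Estimate input splits for an .lzo file: without an index the whole
    file is one split; with an .lzo.index it splits roughly per HDFS block."""
    if not indexed:
        return 1
    return math.ceil(file_size / block_size)

# The 220 MB ratings.csv.lzo from the test above:
size = 230567633
print(lzo_splits(size))                # no index  -> 1 split
print(lzo_splits(size, indexed=True))  # with index -> 2 splits
```

This matches the job logs above: `number of splits:1` before indexing, `number of splits:2` after.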

Alternatively, generate an accompanying .lzo.index file so that splits are supported. The file size is then no longer a constraint, and files can be made somewhat larger to reduce the total number of files. The index file itself takes little space, but producing it does add some overhead.
