Personal notes: the main steps for enabling LZO compression in Hadoop, the problems that followed, and how I fixed them

For the hadoop-lzo installation guide, see:

https://github.com/twitter/hadoop-lzo

Download and build hadoop-lzo from:

https://github.com/twitter/hadoop-lzo/zipball/master

1. As the project README explains, first install the native lzo library locally, as follows:

Download it from http://www.oberhumer.com/opensource/lzo/#download, unpack it, and compile and install it per its instructions; specifying an install prefix is recommended:

tar -zxvf lzo-2.06.tar.gz -C /opt/tool/
cd /opt/tool/lzo-2.06/
sudo mkdir -p /usr/local/lzo
./configure --enable-shared --prefix=/usr/local/lzo
make && sudo make install

2. With the native library installed, go back and build the hadoop-lzo jar: unpack the downloaded hadoop-lzo-master.zip

unzip hadoop-lzo-master.zip

Then, in the unpacked directory, run:

export CFLAGS=-m64
export CXXFLAGS=-m64
export LIBRARY_PATH=/usr/local/lzo/lib
C_INCLUDE_PATH=/usr/local/lzo/include \
LIBRARY_PATH=/usr/local/lzo/lib \
mvn clean package -Dmaven.test.skip=true
cd target/native/Linux-amd64-64
tar -cBf - -C lib . | tar -xBvf - -C ~
mv ~/libgplcompression* $HADOOP_HOME/lib/native/

3. Copy the Maven-built hadoop-lzo-0.4.20-SNAPSHOT.jar into Hadoop's common directory:

cp hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
4. Configuring intermediate (map output) compression for MapReduce

Add the following to core-site.xml:

<property>
	<name>io.compression.codecs</name>
	<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
	<name>io.compression.codec.lzo.class</name>
	<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
And set the following in mapred-site.xml (these are the old property names; on Hadoop 2.x they are deprecated aliases for mapreduce.map.output.compress and mapreduce.map.output.compress.codec):

<property>
	<name>mapred.compress.map.output</name>
	<value>true</value>
</property>
<property>
	<name>mapred.map.output.compression.codec</name>
	<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
5. Testing

Create an index for an LZO file in Hadoop:

hadoop jar $HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /test/input
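Indexing can also be done from code; here is a minimal sketch, assuming hadoop-lzo's LzoIndexer(Configuration) constructor and index(Path) method (the path is a placeholder). For large inputs, DistributedLzoIndexer in the same jar runs the indexing as a MapReduce job instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import com.hadoop.compression.lzo.LzoIndexer;

public class IndexLzoInput {
    public static void main(String[] args) throws Exception {
        LzoIndexer indexer = new LzoIndexer(new Configuration());
        // Writes a .lzo.index file beside each .lzo file under the path,
        // recording compressed-block offsets so the files become splittable.
        indexer.index(new Path("/test/input"));
    }
}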
Using LZO-compressed input directly in a Hadoop MR job

Use LzoTextInputFormat as the job's input format; with the index in place, the MR job can split the LZO data and process it in parallel. A sketch follows.
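A minimal hedged sketch of such a job driver (identity map/reduce; the class name and paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lzo-input-demo");
        job.setJarByClass(LzoJobDriver.class);
        // LzoTextInputFormat reads .lzo files; when a .lzo.index file exists
        // it generates one split per indexed block range instead of one split
        // per whole file, so large .lzo inputs are processed in parallel.
        job.setInputFormatClass(LzoTextInputFormat.class);
        // Default identity Mapper/Reducer: output records are (offset, line).
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/test/input"));
        FileOutputFormat.setOutputPath(job, new Path("/test/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}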

Creating a Hive table over LZO-compressed files:

create table lzo(id int, name string)
row format delimited fields terminated by '^'
stored as
  inputformat 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
load data local inpath '/home/hadoop/test/hive/lzo.txt.lzo' into table lzo;
select * from lzo;
Follow-on problems:

6. After switching Hive to the Tez engine, MapReduce on Tez threw exceptions at runtime, and Hive on Tez could not even start the Hive CLI without throwing. The cause: the LZO jar was missing from Tez's classpath.

Fix:

Do not use the full tez-0.8.4.tar.gz produced by the build; use tez-0.8.4-minimal.tar.gz from tez-dist/target/ instead. Unpack it locally and set these environment variables:

export TEZ_HOME=/opt/single/tez
export TEZ_CONF_DIR=$TEZ_HOME/conf
export TEZ_JARS=$TEZ_HOME
Then add this to hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$TEZ_CONF_DIR:$TEZ_JARS/*:$TEZ_JARS/lib/*
Upload tez-0.8.4-minimal.tar.gz to the hdfs://hadoop:9000/apps/tez-0.8.4/ directory.
Create a conf directory under $TEZ_HOME and add a tez-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
	<property>
		<name>tez.lib.uris</name>
		<value>hdfs://hadoop:9000/apps/tez-0.8.4/tez-0.8.4-minimal.tar.gz</value>
	</property>
	<property>
		<name>tez.use.cluster.hadoop-libs</name>
		<value>true</value>
	</property>
</configuration>
These are only the Tez settings relevant to the LZO problem; for the remaining Tez configuration, refer to other articles.

7. With Spark deployed on this Hadoop, Spark apps fail to run

The exception looks roughly like this. Note that any job going through TextInputFormat fails, because CompressionCodecFactory tries to instantiate every codec listed in io.compression.codecs, including the LzoCodec that Spark cannot see:

java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
	at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:185)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:198)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	....
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
	... 76 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
	at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
	at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:180)
	at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
	... 81 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
	at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
	... 83 more

Fix:

Add the following to spark-env.sh:

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/single/spark/lib:/usr/local/lzo/lib
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/single/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar
export HADOOP_HOME=/opt/single/hadoop-2.7.2
export HADOOP_CONF_DIR=/opt/single/hadoop-2.7.2/etc/hadoop
Everything else stays unchanged.
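To verify, here is a hedged sketch in Spark's Java API (the app name and path are illustrative) that exercises the same input-format code path that threw ClassNotFoundException above:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoReadCheck {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("lzo-read-check"));
        // newAPIHadoopFile goes through the same CompressionCodecFactory /
        // input-format machinery that failed before the classpath fix.
        JavaRDD<String> lines = sc
            .newAPIHadoopFile("/test/input/lzo.txt.lzo",
                LzoTextInputFormat.class, LongWritable.class, Text.class,
                sc.hadoopConfiguration())
            .map(pair -> pair._2().toString());
        System.out.println("lines: " + lines.count());
        sc.stop();
    }
}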
