Without further ado: when I set up LZO compression support on an Apache Hadoop 2.2.0 test cluster I hit plenty of pitfalls, but it all worked out in the end. Here is the full process, recorded for reference.
Environment:
CentOS 6.4 (64-bit)
Hadoop 2.2.0
Sun JDK 1.7.0_45
hive-0.12.0
Preparation:
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
Here we go.
(1) Install LZO
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
cd lzo-2.06
./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
make && make test && make install
(2) Install lzop
wget http://www.lzop.org/download/lzop-1.03.tar.gz
tar -zxvf lzop-1.03.tar.gz
cd lzop-1.03
./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make && make install
(3) Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop
(4) Test lzop
lzop /home/hadoop/data/access_20131219.log
This produces a compressed file with an .lzo suffix, /home/hadoop/data/access_20131219.log.lzo; if it shows up, the preceding steps were done correctly.
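If you want a stronger sanity check than just seeing the file appear, lzop can also verify the archive and decompress it to stdout (a quick sketch using the same example file):
lzop -t /home/hadoop/data/access_20131219.log.lzo            # test archive integrity
lzop -d -c /home/hadoop/data/access_20131219.log.lzo | head  # decompress to stdout and peek at the first lines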
(5) Install hadoop-lzo
One more prerequisite: Maven plus either SVN or Git must already be set up (I used SVN); I won't cover that here. If you can't get those working, there is little point in going further.
I used https://github.com/twitter/hadoop-lzo.
Check out the code from https://github.com/twitter/hadoop-lzo/trunk with SVN, then change one part of pom.xml.
From:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.1.0-beta</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
to:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.2.0</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
In the checkout directory, build and install:
mvn clean package -Dmaven.test.skip=true
tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /home/hadoop/hadoop-2.2.0/lib/native/
cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/
Next, sync /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar and the /home/hadoop/hadoop-2.2.0/lib/native/ directory to all the other Hadoop nodes. Note: make sure the user that runs Hadoop has execute permission on everything under /home/hadoop/hadoop-2.2.0/lib/native/.
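Before moving on to configuration, it is worth confirming that the freshly built native library really links against liblzo2 (a quick check I would suggest; libgplcompression.so is the library the hadoop-lzo build produces on Linux):
ldd /home/hadoop/hadoop-2.2.0/lib/native/libgplcompression.so | grep lzo
# expect a line like: liblzo2.so.2 => /usr/lib64/liblzo2.so.2
# if it prints "not found", point LD_LIBRARY_PATH at your LZO lib directory, as done in step (6) below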
(6) Configure Hadoop
Append the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
Append the following to $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Append the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml:
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.child.env</name>
  <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>
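Once the configuration files are synced to all nodes and the cluster has been restarted, a quick way to confirm the codec wiring is to let the HDFS client decode the test file from step (4): hadoop fs -text picks a codec by file extension, so readable log lines mean com.hadoop.compression.lzo.LzopCodec loaded correctly (the paths below are illustrative):
hadoop fs -put /home/hadoop/data/access_20131219.log.lzo /tmp/
hadoop fs -text /tmp/access_20131219.log.lzo | head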
(7) Test with Hive
A: First, create the LZO-backed nginx log table:
CREATE TABLE logs_app_nginx (
  ip STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  rt STRING,
  referer STRING,
  agent STRING,
  forwarded STRING
)
PARTITIONED BY (date STRING, host STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

B: Load the data
LOAD DATA LOCAL INPATH '/home/hadoop/data/access_20131230_25.log.lzo' INTO TABLE logs_app_nginx PARTITION(date=20131229, host=25);
The source file /home/hadoop/data/access_20131219.log looks like this:
221.207.93.109 - [23/Dec/2013:23:22:38 +0800] "GET /ClientGetResourceDetail.action?id=318880&token=Ocm HTTP/1.1" 200 199 0.008 "xxx.com" "Android4.1.2/LENOVO/Lenovo A706/ch_lenovo/80" "-"
Running lzop /home/hadoop/data/access_20131219.log is all it takes to produce the LZO-compressed file /home/hadoop/data/access_20131219.log.lzo.
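If you would rather not keep a local .lzo copy around, compressing straight into HDFS also works, because hadoop fs -put accepts "-" for stdin (a sketch; the target path is illustrative):
lzop -c /home/hadoop/data/access_20131219.log | hadoop fs -put - /tmp/access_20131219.log.lzo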
C: Index the LZO files
$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/logs_app_nginx
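The indexer writes a .index file beside every .lzo file under the given path; without an index, each LZO file is handled by a single mapper, because the format is not splittable on its own. For a handful of files you can also use the single-process indexer shipped in the same jar (same jar and path as above, different main class):
$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/logs_app_nginx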
D: Run a MapReduce job through Hive
set hive.exec.reducers.max=10;
set mapred.reduce.tasks=10;
select ip, rt from logs_app_nginx limit 10;
hive> set hive.exec.reducers.max=10;
hive> set mapred.reduce.tasks=10;
hive> select ip, rt from logs_app_nginx limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1388065803340_0009, Tracking URL = http://lrts216:8088/proxy/application_1388065803340_0009/
Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1388065803340_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-27 09:13:39,163 Stage-1 map = 0%, reduce = 0%
2013-12-27 09:13:45,343 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-12-27 09:13:46,369 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1388065803340_0009
MapReduce Jobs Launched:
Job 0: Map: 1  Cumulative CPU: 1.22 sec  HDFS Read: 63570  HDFS Write: 315  SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 220 msec
OK
221.207.93.109 "XXX.com"
Time taken: 17.498 seconds, Fetched: 10 row(s)
To this day I haven't found a single blog post that could be followed straight through; it took me three evenings of fiddling before the test cluster finally ran.
References:
https://github.com/kevinweil/hadoop-lzo
https://github.com/twitter/hadoop-lzo
http://blog.csdn.net/lalaguozhe/article/details/10912527
http://blog.csdn.net/xiaoping8411/article/details/7605039
http://share.blog.51cto.com/278008/549393
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/NativeLibraries.html