Author: esxu
2015.07.09
GlusterFS provides compatibility with Apache Hadoop: through Hadoop's standard file system API it offers a new storage option for Hadoop deployments, and existing MapReduce-based applications can run on GlusterFS without modification.
This walkthrough uses Cloudera's distribution as an example; GlusterFS supports multiple Hadoop versions. Hadoop installation and configuration are not covered here, and a working Hadoop cluster is assumed to already exist.
Master node:
IP: 11.11.11.205
HOSTNAME: jobsub-138-015
Four compute nodes:
IP: 11.11.11.205
HOSTNAME: jobsub-138-015
IP: 11.11.11.207
HOSTNAME: jobsub-138-017
IP: 11.11.11.208
HOSTNAME: jobsub-138-018
IP: 11.11.11.209
HOSTNAME: jobsub-138-019
The goal is to configure the GlusterFS file system to replace HDFS, Hadoop's original distributed storage. The process is as follows.
Three configuration files need to be modified: core-site.xml, yarn-site.xml, and mapred-site.xml.
Edit core-site.xml
<configuration>
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>glusterfs:///</value>
  </property>
  <property>
    <name>fs.glusterfs.mount</name>
    <value>/mnt/glusterfs</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.local.GlusterFs</value>
  </property>
  <property>
    <name>fs.glusterfs.volumes</name>
    <value>volume6</value>
  </property>
  <property>
    <name>fs.glusterfs.volume.fuse.volume6</name>
    <value>/mnt/glusterfs</value>
  </property>
  <property>
    <name>gluster.daemon.user</name>
    <value>hadoop</value>
  </property>
</configuration>
Parameter notes: volume6 is the name of the GlusterFS volume we created, and /mnt/glusterfs is the path where it is mounted on each compute node. Every compute node must use the same mount path, otherwise errors will occur. The entries above are the items that must be added or modified for GlusterFS support; any other non-default settings from the original configuration files can be kept alongside them.
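For completeness, a minimal sketch of creating and mounting such a volume follows. The brick path /data/brick1, the replica count, and the use of the compute nodes themselves as GlusterFS servers are all assumptions; adapt them to your environment.
# Sketch: create the volume once on any GlusterFS server
# (brick path and replica count are placeholders)
gluster volume create volume6 replica 2 \
  jobsub-138-015:/data/brick1/volume6 jobsub-138-017:/data/brick1/volume6 \
  jobsub-138-018:/data/brick1/volume6 jobsub-138-019:/data/brick1/volume6
gluster volume start volume6
# On every compute node, mount the volume at the identical path
mkdir -p /mnt/glusterfs
mount -t glusterfs jobsub-138-015:/volume6 /mnt/glusterfs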
Edit yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///var/lib/hadoop-yarn/cache/${user.name}/nm-local-dir</value>
  </property>
  <property>
    <description>Where to store container logs.</description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:///var/log/hadoop-yarn/containers</value>
  </property>
  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>glusterfs:///home/hadoop/hadoop-logs/hadoop-yarn/apps</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop/*,/usr/lib/hadoop-mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>11.11.11.205</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/tmp/hadoop-yarn/staging</value>
  </property>
  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.group</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>6000</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://11.11.11.205:19888/jobhistory/logs</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>jobsub-138-015</value>
  </property>
</configuration>
On each slave node the following parameters must also be added (do not add them on the master). Be sure to use hostnames here, not IP addresses:
<property>
  <name>yarn.nodemanager.hostname</name>
  <value>$LOCAL_HOSTNAME</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>$MASTER_HOSTNAME</value>
</property>
Edit mapred-site.xml
<configuration>
  <property>
    <name>mapred.system.dir</name>
    <value>glusterfs:///mapred/system</value>
  </property>
  <property>
    <name>mapred.jobtracker.system.dir</name>
    <value>glusterfs:///mapred/system</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.system.dir</name>
    <value>glusterfs:///mapred/system</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop/*,/usr/lib/hadoop-mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/lib/*</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>glusterfs:///mr_history/done</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>glusterfs:///mr_history/tmp</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>glusterfs:///tmp/hadoop-yarn/staging</value>
  </property>
</configuration>
The above is the basic configuration; once these changes are in place, GlusterFS works normally. Additional tuning parameters for MapReduce job execution can be layered on top of this and are not covered here.
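These files must be kept consistent across the cluster (apart from the per-slave yarn-site.xml additions above). A minimal sketch for pushing them out, assuming the conventional etc/hadoop layout under /home/hadoop/hadoop2:
# Push the shared configs to the slaves (directory layout is an assumption)
for host in jobsub-138-017 jobsub-138-018 jobsub-138-019; do
  scp /home/hadoop/hadoop2/etc/hadoop/core-site.xml \
      /home/hadoop/hadoop2/etc/hadoop/mapred-site.xml \
      $host:/home/hadoop/hadoop2/etc/hadoop/
done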
Next, synchronize the clocks on all nodes. Every node in the cluster must be kept strictly in sync, otherwise job submission will fail:
ntpdate ntp_server_ip
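To keep clocks from drifting again after the initial sync, a periodic job helps; the sketch below assumes ntp.example.org stands in for your NTP server.
# One-time sync now (NTP server address is a placeholder)
ntpdate ntp.example.org
# Re-sync every 10 minutes via cron so the clocks stay aligned
echo '*/10 * * * * root /usr/sbin/ntpdate ntp.example.org >/dev/null 2>&1' >> /etc/crontab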
Error reported when the clocks are out of sync:
15/06/29 16:09:39 INFO mapreduce.Job: map 0% reduce 0%
15/06/29 16:09:39 INFO mapreduce.Job: Job job_1435221905667_0006 failed with state FAILED due to: Application application_1435221905667_0006 failed 2 times due to AM Container for appattempt_1435221905667_0006_000002 exited with exitCode: -1000 due to: Resource glusterfs:/tmp/hadoop-yarn/staging/hadoop/.staging/job_1435221905667_0006/job.jar changed on src filesystem (expected 1435565206000, was 1435565211000)
.Failing this attempt.. Failing the application.
15/06/29 16:09:39 INFO mapreduce.Job: Counters: 0
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:838)
at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:443)
at org.apache.hadoop.fs.TestDFSIO.writeTest(TestDFSIO.java:425)
at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:755)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:650)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:118)
at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:126)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
GlusterFS working on Hadoop depends entirely on the glusterfs-hadoop plugin, which provides the org.apache.hadoop.fs.glusterfs.GlusterFileSystem class configured above. The latest version of the plugin can be downloaded from:
http://rhbd.s3.amazonaws.com/maven/index.html
Copy the downloaded jar into the /home/hadoop/hadoop2/share/hadoop/common/lib/ directory.
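The plugin must be present on every node; a loop like the sketch below can distribute it (the jar file name pattern is an assumption and depends on the version downloaded).
# Copy the plugin jar to each node's Hadoop common lib directory
for host in jobsub-138-015 jobsub-138-017 jobsub-138-018 jobsub-138-019; do
  scp glusterfs-hadoop-*.jar $host:/home/hadoop/hadoop2/share/hadoop/common/lib/
done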
Only the ResourceManager and NodeManager services need to be started; the NameNode and DataNode roles no longer exist. On the master node, run:
start-yarn.sh
This starts the services across the entire cluster.
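As a quick sanity check with standard Hadoop and JDK tools, confirm the daemons came up:
# jps should show ResourceManager on the master and NodeManager on each slave
jps
# All four compute nodes should report as RUNNING
yarn node -list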
Now run a TeraSort job as a test:
time hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce1/hadoop-examples-2.5.0-mr1-cdh5.3.2.jar teragen -Dmapreduce.job.maps=180 200000000 /TeraSort/input_files
time hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce1/hadoop-examples-2.5.0-mr1-cdh5.3.2.jar terasort /TeraSort/input_files /TeraSort/output_files
The first command generates the data and the second sorts it, covering both the map and reduce phases. The test passes.
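Since fs.default.name is set to glusterfs:///, the output path resolves to the GlusterFS volume, which the standard file system shell can confirm:
# List the sorted output; the path lives on the GlusterFS volume
hadoop fs -ls /TeraSort/output_files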
If instead we want a job to read its input from HDFS while writing the final results to GlusterFS, this can be achieved as follows:
time hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce1/hadoop-examples-2.5.0-mr1-cdh5.3.2.jar terasort hdfs://11.11.11.205:8020/TeraSort/input_files file:///mnt/glusterfs/TeraSort/output_files
Here the job output goes to what looks like a local path, but the target is not actually a local directory; it is the mounted GlusterFS file system, so the result data ends up on GlusterFS. Note that every compute node must have GlusterFS mounted, with the mount point uniformly set to /mnt/glusterfs: each map task writes its output directly to GlusterFS as it finishes, so if the node a map task lands on does not have /mnt/glusterfs mounted, the output path cannot be found.
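Before submitting such a job, a loop like the sketch below (assuming passwordless ssh between nodes) can verify that every node has the volume mounted at the expected path:
# Check each node for the /mnt/glusterfs mount
for host in jobsub-138-015 jobsub-138-017 jobsub-138-018 jobsub-138-019; do
  ssh $host 'mount | grep -q /mnt/glusterfs && echo "$(hostname): mounted" || echo "$(hostname): NOT mounted"'
done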