
Hadoop On GlusterFS

Author: esxu

2015.07.09

Overview

GlusterFS is compatible with Apache Hadoop: through Hadoop's standard FileSystem API it provides a new storage option for Hadoop deployments, and existing MapReduce-based applications can use GlusterFS seamlessly.

  • Provides both file-based and object-based access within Hadoop
  • Eliminates the centralized metadata node
  • Compatible with existing MapReduce programs; no rewriting required
  • Provides a fault-tolerant file system

Installation and Configuration

The Cloudera distribution is used as the example here; GlusterFS supports several Hadoop versions. Installing and configuring Hadoop itself is not covered: a working Hadoop cluster is assumed to already be in place.

Node layout

Master node:

IP: 11.11.11.205, HOSTNAME: jobsub-138-015

Compute nodes (4):

IP: 11.11.11.205, HOSTNAME: jobsub-138-015

IP: 11.11.11.207, HOSTNAME: jobsub-138-017

IP: 11.11.11.208, HOSTNAME: jobsub-138-018

IP: 11.11.11.209, HOSTNAME: jobsub-138-019

The goal here is to configure the GlusterFS file system as a replacement for HDFS, Hadoop's original distributed storage. The procedure is as follows.
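
As a prerequisite, the GlusterFS client has to be available on every node so the volume can be FUSE-mounted later. A minimal sketch, assuming CentOS/RHEL package names (not stated in the original text):

    # Install the GlusterFS client packages on every node (CentOS/RHEL names assumed)
    yum install -y glusterfs glusterfs-fuse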

Modify the configuration files

Three configuration files need to be modified: core-site.xml, yarn-site.xml and mapred-site.xml.

  • Edit core-site.xml

    <configuration>
      <property>
        <name>fs.glusterfs.impl</name>
        <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>glusterfs:///</value>
      </property>
      <property>
        <name>fs.glusterfs.mount</name>
        <value>/mnt/glusterfs</value>
      </property>
      <property>
        <name>fs.AbstractFileSystem.glusterfs.impl</name>
        <value>org.apache.hadoop.fs.local.GlusterFs</value>
      </property>
      <property>
        <name>fs.glusterfs.volumes</name>
        <value>volume6</value>
      </property>
      <property>
        <name>fs.glusterfs.volume.fuse.volume6</name>
        <value>/mnt/glusterfs</value>
      </property>
      <property>
        <name>gluster.daemon.user</name>
        <value>hadoop</value>
      </property>
    </configuration>
    

Parameter notes: volume6 is the name of the GlusterFS volume we created, and /mnt/glusterfs is the path where the volume is mounted on each compute node. All compute nodes must use exactly the same mount path, otherwise errors will occur. The properties above are the ones that must be added or changed to support GlusterFS; any other properties you had customized in the original configuration can be kept alongside them.
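
For reference, a sketch of how the volume and mount points might be set up; the GlusterFS server hostnames and brick paths (gfs-server1, gfs-server2, /data/brick1) are placeholders, not values from this cluster:

    # On a GlusterFS server: create and start the volume (hostnames/bricks are placeholders)
    gluster volume create volume6 replica 2 gfs-server1:/data/brick1/volume6 gfs-server2:/data/brick1/volume6
    gluster volume start volume6

    # On every compute node: mount the volume at the exact path referenced in core-site.xml
    mkdir -p /mnt/glusterfs
    mount -t glusterfs gfs-server1:/volume6 /mnt/glusterfs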

  • Edit yarn-site.xml

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
      </property>
      <property>
        <description>List of directories to store localized files in.</description>
        <name>yarn.nodemanager.local-dirs</name>
        <value>file:///var/lib/hadoop-yarn/cache/${user.name}/nm-local-dir</value>
      </property>
      <property>
        <description>Where to store container logs.</description>
        <name>yarn.nodemanager.log-dirs</name>
        <value>file:///var/log/hadoop-yarn/containers</value>
      </property>
      <property>
        <description>Where to aggregate logs to.</description>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>glusterfs:///home/hadoop/hadoop-logs/hadoop-yarn/apps</value>
      </property>
      <property>
        <name>yarn.application.classpath</name>
        <value>/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop/*,/usr/lib/hadoop-mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
      </property>
      <property>
        <name>mapreduce.jobtracker.address</name>
        <value>11.11.11.205</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>/tmp/hadoop-yarn/staging</value>
      </property>
      <property>
        <name>yarn.nodemanager.container-executor.class</name>
        <value>org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor</value>
      </property>
      <property>
        <name>yarn.nodemanager.linux-container-executor.group</name>
        <value>hadoop</value>
      </property>
      <property>
        <name>yarn.nodemanager.delete.debug-delay-sec</name>
        <value>6000</value>
      </property>
      <property>
        <name>yarn.log.server.url</name>
        <value>http://11.11.11.205:19888/jobhistory/logs</value>
      </property>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>jobsub-138-015</value>
      </property>
    </configuration>
    

On every slave node the following properties must also be added (they are not needed on the master node). Be sure to use the hostname here, not the IP address:

    <property>
      <name>yarn.nodemanager.hostname</name>
      <value>$LOCAL_HOSTNAME</value>
    </property>

    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>$MASTER_HOSTNAME</value>
    </property>
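
Here $LOCAL_HOSTNAME and $MASTER_HOSTNAME are placeholders. As a sketch (hostnames taken from the node list above), the values each slave should end up with can be confirmed like this:

    # On each slave: yarn.nodemanager.hostname is the node's own hostname,
    # yarn.resourcemanager.hostname is the master's hostname.
    hostname                      # e.g. jobsub-138-017 on that slave
    getent hosts jobsub-138-015   # confirm the master hostname resolves from every slave
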
  • Edit mapred-site.xml

    <configuration>
      <property>
        <name>mapred.system.dir</name>
        <value>glusterfs:///mapred/system</value>
      </property>
      <property>
        <name>mapred.jobtracker.system.dir</name>
        <value>glusterfs:///mapred/system</value>
      </property>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.jobtracker.system.dir</name>
        <value>glusterfs:///mapred/system</value>
      </property>
      <property>
        <name>mapreduce.application.classpath</name>
        <value>/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop/*,/usr/lib/hadoop-mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/lib/*</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>glusterfs:///mr_history/done</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>glusterfs:///mr_history/tmp</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>glusterfs:///tmp/hadoop-yarn/staging</value>
      </property>
    </configuration>
    

This is the basic configuration; once these changes are made, GlusterFS works normally. Further MapReduce tuning parameters can be added on top of this base and are not covered here.
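
The three files must be consistent (apart from the per-slave hostname properties above) on every node. A minimal sketch for pushing them out, assuming the configuration directory is /home/hadoop/hadoop2/etc/hadoop (not stated in the original text):

    # Push the modified configuration files to every slave node (path/hostnames assumed)
    CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
    for node in jobsub-138-017 jobsub-138-018 jobsub-138-019; do
        scp "$CONF_DIR"/core-site.xml "$CONF_DIR"/yarn-site.xml "$CONF_DIR"/mapred-site.xml \
            "$node":"$CONF_DIR"/
    done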

Time synchronization

Synchronize the clock on every node. All nodes in the cluster must be kept strictly in sync, otherwise job submission will fail.

ntpdate ntp_server_ip
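
One way to apply this to the whole cluster at once, as a sketch (ntp_server_ip is a placeholder, hostnames from the node list above):

    # Sync every node's clock against the same NTP server
    for node in jobsub-138-015 jobsub-138-017 jobsub-138-018 jobsub-138-019; do
        ssh "$node" "ntpdate ntp_server_ip"
    done
    # In practice, running ntpd or a periodic ntpdate cron job keeps the clocks from drifting again.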

The error caused by unsynchronized clocks looks like this:

15/06/29 16:09:39 INFO mapreduce.Job:  map 0% reduce 0%
15/06/29 16:09:39 INFO mapreduce.Job: Job job_1435221905667_0006 failed with state FAILED due to: Application application_1435221905667_0006 failed 2 times due to AM Container for appattempt_1435221905667_0006_000002 exited with  exitCode: -1000 due to: Resource glusterfs:/tmp/hadoop-yarn/staging/hadoop/.staging/job_1435221905667_0006/job.jar changed on src filesystem (expected 1435565206000, was 1435565211000
.Failing this attempt.. Failing the application.
15/06/29 16:09:39 INFO mapreduce.Job: Counters: 0
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:838)
at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:443)
at org.apache.hadoop.fs.TestDFSIO.writeTest(TestDFSIO.java:425)
at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:755)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:650)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:118)
at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:126)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

  • Install the plugin

GlusterFS working with Hadoop depends entirely on this plugin. The latest version can be downloaded from:

http://rhbd.s3.amazonaws.com/maven/index.html

Copy the jar into the /home/hadoop/hadoop2/share/hadoop/common/lib/ directory and you are done.
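
The jar has to be present on every node, not only the master. A sketch for distributing it (the jar file name pattern glusterfs-hadoop-*.jar is an assumption; use whatever file you downloaded):

    # Copy the downloaded plugin jar into Hadoop's common lib directory on every node
    LIB_DIR=/home/hadoop/hadoop2/share/hadoop/common/lib
    for node in jobsub-138-015 jobsub-138-017 jobsub-138-018 jobsub-138-019; do
        scp glusterfs-hadoop-*.jar "$node":"$LIB_DIR"/
    done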

  • Start the services

Only the ResourceManager and NodeManager services need to be started; the NameNode and DataNode roles no longer exist. On the master node run:

start-yarn.sh

This starts the services for the whole cluster.
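
A quick sanity check after starting (a sketch, not part of the original procedure): the ResourceManager should be running on the master, a NodeManager on every compute node, and all nodes should register with YARN:

    # On the master jps should list ResourceManager; on each slave, NodeManager
    jps
    # All compute nodes should show up as RUNNING
    yarn node -list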

  • Run a test job

Run a TeraSort job as a test:

time hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce1/hadoop-examples-2.5.0-mr1-cdh5.3.2.jar  teragen -Dmapreduce.job.maps=180 200000000 /TeraSort/input_files

time hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce1/hadoop-examples-2.5.0-mr1-cdh5.3.2.jar  terasort  /TeraSort/input_files  /TeraSort/output_files

First generate the data, then process it; this exercises both the map and the reduce phases. The test passed.
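
Optionally (a sketch, not part of the original test), the sorted output can be checked with the teravalidate example from the same jar and listed through the glusterfs:/// scheme:

    # Validate the TeraSort output and list it via the GlusterFS filesystem plugin
    hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce1/hadoop-examples-2.5.0-mr1-cdh5.3.2.jar \
        teravalidate /TeraSort/output_files /TeraSort/validate
    hadoop fs -ls glusterfs:///TeraSort/output_files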

  • HDFS && GlusterFS

Suppose we want a job to read its input from HDFS while writing the final results to GlusterFS. This can be done as follows:

time hadoop jar /home/hadoop/hadoop2/share/hadoop/mapreduce1/hadoop-examples-2.5.0-mr1-cdh5.3.2.jar terasort   hdfs://11.11.11.205:8020/TeraSort/input_files  file:///mnt/glusterfs/TeraSort/output_files

Here the job output is written to what looks like a local path, but it is not really a local directory: it is the mounted GlusterFS file system, which is how the results end up on GlusterFS. Note that every compute node must have the GlusterFS volume mounted, and the mount point must be the same everywhere (/mnt/glusterfs), because each map task writes its output directly to GlusterFS; if the node a map task lands on does not have /mnt/glusterfs mounted, the output path cannot be found.
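
Before submitting such a job it is worth confirming that every compute node really has the volume mounted at /mnt/glusterfs; a minimal check (hostnames from the node list above):

    # Verify the GlusterFS mount exists on every compute node before submitting the job
    for node in jobsub-138-015 jobsub-138-017 jobsub-138-018 jobsub-138-019; do
        echo -n "$node: "
        ssh "$node" "mountpoint /mnt/glusterfs"
    done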
