Installing and Configuring a Hadoop 2.0.4 Cluster on Ubuntu

Hadoop 2 manages HDFS and YARN separately. This separation makes it easier for HDFS to support HA or Federation and to scale out linearly, which helps keep the HDFS cluster highly available. Seen from another angle, HDFS can now serve as a general-purpose distributed storage system that third-party distributed computing frameworks can conveniently build on, whether that is a YARN-style framework or something else such as Spark. YARN is essentially MapReduce v2: it splits the JobTracker of Hadoop 1.x into two parts, one responsible for resource management (the ResourceManager) and the other responsible for scheduling and monitoring individual jobs (the per-application ApplicationMaster).

Installation and Configuration

1. Directory Structure

After downloading the hadoop-2.0.4 package and extracting it, you will see the following directory structure:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ ls
bin  etc  include  lib  libexec  LICENSE.txt  logs  NOTICE.txt  README.txt  sbin  share
  • The etc directory
The HDFS and YARN configuration files all live under etc/hadoop, and each of the files there can be edited as needed:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ ls etc/hadoop/
capacity-scheduler.xml      hadoop-metrics.properties  httpfs-site.xml             ssl-client.xml.example
configuration.xsl           hadoop-policy.xml          log4j.properties            ssl-server.xml.example
container-executor.cfg      hdfs-site.xml              mapred-env.sh               yarn-env.sh
core-site.xml               httpfs-env.sh              mapred-queues.xml.template  yarn-site.xml
hadoop-env.sh               httpfs-log4j.properties    mapred-site.xml.template
hadoop-metrics2.properties  httpfs-signature.secret    slaves
  • The bin directory
The bin directory contains the tools used to manage HDFS and YARN, as shown below:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ ls bin
container-executor  hadoop  hdfs  mapred  rcc  test-container-executor  yarn
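As a quick sanity check before any configuration, you can run one of these tools directly; for example, bin/hadoop version should simply print the release of the unpacked distribution:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hadoop version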
Next, we configure Hadoop. Note that Hadoop 2.x no longer ships a ready-made mapred-site.xml (only the mapred-site.xml.template seen in the listing above); the cluster-level resource settings that used to live there are now handled by yarn-site.xml.
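If you also intend to run MapReduce jobs on YARN (as we do in the verification step at the end), a common extra step, sketched here rather than taken from the original setup, is to copy the template and switch the MapReduce framework to YARN:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
Then add the following property to the new etc/hadoop/mapred-site.xml:
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>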

2. HDFS Installation and Configuration

Configure etc/hadoop/core-site.xml with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://master:9000/</value>
                <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/tmp/hadoop-${user.name}</value>
        </property>
</configuration>
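Note that fs.defaultFS refers to the master by hostname, so every node must be able to resolve master and the slave hostnames used later. A minimal sketch of the relevant /etc/hosts entries, with placeholder IP addresses that you would replace with your own, might look like:
192.168.1.10    master
192.168.1.11    slave01
192.168.1.12    slave02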
Configure etc/hadoop/hdfs-site.xml with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>/home/shirdrn/storage/hadoop2/hdfs/name</value>
                <description>Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.</description>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>/home/shirdrn/storage/hadoop2/hdfs/data1,/home/shirdrn/storage/hadoop2/hdfs/data2,/home/shirdrn/storage/hadoop2/hdfs/data3</value>
                <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/home/shirdrn/storage/hadoop2/hdfs/tmp/hadoop-${user.name}</value>
                <description>A base for other temporary directories.</description>
        </property>
</configuration>
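Before formatting, it is worth making sure the storage directories above exist on the nodes that will use them (the NameNode directory on master, the DataNode directories on each slave), and listing the slave hostnames in etc/hadoop/slaves so that start-dfs.sh knows where to launch DataNodes. A minimal sketch, assuming the slave01/slave02 hostnames used later in this article:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ mkdir -p /home/shirdrn/storage/hadoop2/hdfs/name
shirdrn@slave01:~/programs$ mkdir -p /home/shirdrn/storage/hadoop2/hdfs/data1 /home/shirdrn/storage/hadoop2/hdfs/data2 /home/shirdrn/storage/hadoop2/hdfs/data3
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ cat etc/hadoop/slaves
slave01
slave02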

3. YARN Installation and Configuration

Configure etc/hadoop/yarn-site.xml with the following content:
<?xml version="1.0"?>

<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
    <description>host is the hostname of the resource manager and
    port is the port on which the NodeManagers contact the Resource Manager.
    </description>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
    <description>host is the hostname of the resourcemanager and port is the port
    on which the Applications in the cluster talk to the Resource Manager.
    </description>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    <description>In case you do not want to use the default scheduler</description>
  </property>

  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
    <description>the host is the hostname of the ResourceManager and the port is the port on
    which the clients can talk to the Resource Manager. </description>
  </property>

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>${hadoop.tmp.dir}/nodemanager/local</value>
    <description>the local directories used by the nodemanager</description>
  </property>

  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:8034</value>
    <description>the nodemanagers bind to this port</description>
  </property> 

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>10240</value>
    <description>the amount of memory available on the NodeManager, in MB</description>
  </property>

  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>${hadoop.tmp.dir}/nodemanager/remote</value>
    <description>directory on hdfs where the application logs are moved to </description>
  </property>

   <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>${hadoop.tmp.dir}/nodemanager/logs</value>
    <description>the directories used by Nodemanagers as log directories</description>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
    <description>shuffle service that needs to be set for Map Reduce to run </description>
  </property>
</configuration>
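The configuration files must be identical on every node. Assuming the install paths shown in the shell prompts later in this article (~/cloud/hadoop2 on master, ~/programs/hadoop2 on the slaves), one straightforward way to push the configuration out is:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ scp etc/hadoop/* shirdrn@slave01:~/programs/hadoop2/hadoop-2.0.4-alpha/etc/hadoop/
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ scp etc/hadoop/* shirdrn@slave02:~/programs/hadoop2/hadoop-2.0.4-alpha/etc/hadoop/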


Starting the Cluster

  • Starting the HDFS cluster
First, format HDFS by running the following command:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hdfs namenode -format
If formatting succeeds, no exceptions will appear in the log and you can go on to start the cluster services.
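As an extra check, the NameNode directory configured in hdfs-site.xml should now contain a current/ subdirectory holding the initial fsimage and a VERSION file:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ ls /home/shirdrn/storage/hadoop2/hdfs/name/current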
Start the HDFS cluster with the following command:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/start-dfs.sh
On the master node you should now see the following processes:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ jps
17238 Jps
16845 NameNode
17128 SecondaryNameNode
And on the slave nodes you should see these processes:
shirdrn@slave01:~/programs$ jps
4865 Jps
4753 DataNode

shirdrn@slave02:~/programs$ jps
4867 DataNode
4971 Jps
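You can also ask the NameNode directly whether both DataNodes have registered and are reporting capacity:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hdfs dfsadmin -report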
  • Starting the YARN cluster
Once the configuration is done, starting the YARN cluster is very easy; it only takes a few scripts.
Start the ResourceManager with the following command:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/yarn-daemon.sh start resourcemanager
You can see that a ResourceManager process has been added:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ jps
16845 NameNode
17128 SecondaryNameNode
17490 Jps
17284 ResourceManager
Then start the NodeManager process on each slave node with the following commands:
shirdrn@slave01:~/programs/hadoop2/hadoop-2.0.4-alpha$ sbin/yarn-daemon.sh start nodemanager
shirdrn@slave02:~/programs/hadoop2/hadoop-2.0.4-alpha$ sbin/yarn-daemon.sh start nodemanager
Running jps again shows that each slave node now has an additional NodeManager process:
shirdrn@slave01:~/programs/hadoop2/hadoop-2.0.4-alpha$ jps
5544 DataNode
5735 NodeManager
5904 Jps

shirdrn@slave02:~/programs/hadoop2/hadoop-2.0.4-alpha$ jps
5544 DataNode
5735 NodeManager
5904 Jps
Alternatively, you can tail the logs of the corresponding daemons to confirm that they started successfully:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$  tail -100f logs/yarn-shirdrn-resourcemanager-master.log

shirdrn@slave01:~/programs/hadoop2/hadoop-2.0.4-alpha$  tail -100f logs/yarn-shirdrn-nodemanager-slave01.log
In addition, the whole Hadoop cluster (both HDFS and YARN) can be started with a single script that brings up all the related processes, as shown below:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
master: starting namenode, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/hadoop-shirdrn-namenode-master.out
slave02: starting datanode, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/hadoop-shirdrn-datanode-slave02.out
slave01: starting datanode, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/hadoop-shirdrn-datanode-slave01.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/hadoop-shirdrn-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/yarn-shirdrn-resourcemanager-master.out
slave01: starting nodemanager, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/yarn-shirdrn-nodemanager-slave01.out
slave02: starting nodemanager, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/yarn-shirdrn-nodemanager-slave02.out
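To shut the cluster down again, the matching stop scripts live under sbin (stop-all.sh is likewise deprecated in favour of the two separate scripts):
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/stop-yarn.sh
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/stop-dfs.sh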

Verifying the Cluster

Finally, verify that the cluster can run jobs by executing one of the examples bundled with Hadoop:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.4-alpha.jar randomwriter out
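When the job finishes, the generated data should show up under the out directory in HDFS, and the running job itself can be watched in the ResourceManager web UI (by default on port 8088 of the master, i.e. http://master:8088/):
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hadoop fs -ls out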


Reference Links
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
