Installing and Configuring a Hadoop 2.0.4 Cluster on Ubuntu

Hadoop 2 manages HDFS and YARN separately. This separation makes it easier for HDFS to support HA or Federation and to scale out linearly, keeping the HDFS cluster highly available. Seen from another angle, HDFS can now serve as a general-purpose distributed storage system for third-party distributed computing frameworks, both frameworks built on YARN and others such as Spark. YARN, historically also referred to as MapReduce V2 (MRv2), splits the JobTracker of Hadoop 1.x into two parts: a ResourceManager responsible for resource management, and a per-application ApplicationMaster responsible for job scheduling and monitoring.

Installation and Configuration

1. Directory Structure

Download the hadoop-2.0.4 package; after unpacking it, you will see the following directory structure:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ ls
bin  etc  include  lib  libexec  LICENSE.txt  logs  NOTICE.txt  README.txt  sbin  share
  • The etc directory
The configuration files for both HDFS and YARN are kept under etc/hadoop, where each file can be edited as needed:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ ls etc/hadoop/
capacity-scheduler.xml      hadoop-metrics.properties  httpfs-site.xml             ssl-client.xml.example
configuration.xsl           hadoop-policy.xml          log4j.properties            ssl-server.xml.example
container-executor.cfg      hdfs-site.xml              mapred-env.sh               yarn-env.sh
core-site.xml               httpfs-env.sh              mapred-queues.xml.template  yarn-site.xml
hadoop-env.sh               httpfs-log4j.properties    mapred-site.xml.template
hadoop-metrics2.properties  httpfs-signature.secret    slaves
  • The bin directory
The bin directory holds the tools for managing HDFS and YARN:
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ ls bin
container-executor  hadoop  hdfs  mapred  rcc  test-container-executor  yarn
Next, configure Hadoop. Note that Hadoop 2.x no longer ships a ready-made mapred-site.xml (only mapred-site.xml.template, as the listing above shows); the cluster-wide resource settings that used to live there are now handled by yarn-site.xml. Job-level MapReduce settings can still be configured by creating the file from the template, as sketched below.
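A hedged aside, not part of the original walkthrough: if you want to state explicitly that MapReduce jobs should run on YARN, the usual approach in 2.x is to copy the shipped template and set mapreduce.framework.name. Treat the snippet below as a sketch under that assumption:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>Submit MapReduce jobs to the YARN runtime instead of the local job runner.</description>
    </property>
</configuration>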

2. HDFS Installation and Configuration

Configure etc/hadoop/core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000/</value>
        <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
        <description>A base for other temporary directories.</description>
    </property>
</configuration>
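A note on the file above: dfs.replication conventionally lives in hdfs-site.xml rather than core-site.xml, though Hadoop merges both resources, so the value still takes effect here. To double-check which value a daemon will actually see, the getconf tool can help (a quick check, assuming 2.0.4 behaves like later 2.x releases in this respect):

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hdfs getconf -confKey fs.defaultFS
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hdfs getconf -confKey dfs.replication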
Configure etc/hadoop/hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/shirdrn/storage/hadoop2/hdfs/name</value>
        <description>Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/shirdrn/storage/hadoop2/hdfs/data1,/home/shirdrn/storage/hadoop2/hdfs/data2,/home/shirdrn/storage/hadoop2/hdfs/data3</value>
        <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/shirdrn/storage/hadoop2/hdfs/tmp/hadoop-${user.name}</value>
        <description>A base for other temporary directories.</description>
    </property>
</configuration>
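The local directories referenced above should exist and be writable by the user running Hadoop before the cluster is formatted and started; the daemons can usually create them on first use, but creating them up front avoids permission surprises. A sketch matching the paths used in this article (run the first command on master, the second on each slave):

shirdrn@master:~$ mkdir -p /home/shirdrn/storage/hadoop2/hdfs/name
shirdrn@slave01:~$ mkdir -p /home/shirdrn/storage/hadoop2/hdfs/{data1,data2,data3}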

3. YARN Installation and Configuration

Configure etc/hadoop/yarn-site.xml:
<?xml version="1.0"?>

<configuration>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
        <description>host is the hostname of the ResourceManager and
        port is the port on which the NodeManagers contact the ResourceManager.
        </description>
    </property>

    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
        <description>host is the hostname of the ResourceManager and port is the port
        on which the Applications in the cluster talk to the ResourceManager.
        </description>
    </property>

    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
        <description>In case you do not want to use the default scheduler</description>
    </property>

    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
        <description>host is the hostname of the ResourceManager and the port is the port on
        which the clients can talk to the ResourceManager.</description>
    </property>

    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>${hadoop.tmp.dir}/nodemanager/local</value>
        <description>the local directories used by the NodeManager</description>
    </property>

    <property>
        <name>yarn.nodemanager.address</name>
        <value>0.0.0.0:8034</value>
        <description>the NodeManagers bind to this port</description>
    </property>

    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>10240</value>
        <description>the amount of memory on the NodeManager in MB</description>
    </property>

    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>${hadoop.tmp.dir}/nodemanager/remote</value>
        <description>directory on HDFS where the application logs are moved to</description>
    </property>

    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>${hadoop.tmp.dir}/nodemanager/logs</value>
        <description>the directories used by NodeManagers as log directories</description>
    </property>

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce.shuffle</value>
        <description>shuffle service that needs to be set for MapReduce to run</description>
    </property>
</configuration>
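One more file matters for the cluster scripts used below: etc/hadoop/slaves lists the hosts on which DataNodes and NodeManagers run, one hostname per line. For the two slaves assumed in this article it would contain:

slave01
slave02

The configuration files must also be identical across all nodes. A hedged sketch of pushing them out with scp, assuming the install paths used elsewhere in this article:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ scp -r etc/hadoop/* shirdrn@slave01:~/programs/hadoop2/hadoop-2.0.4-alpha/etc/hadoop/
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ scp -r etc/hadoop/* shirdrn@slave02:~/programs/hadoop2/hadoop-2.0.4-alpha/etc/hadoop/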


Starting the Cluster

  • Starting the HDFS cluster
First, format HDFS by running:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hdfs namenode -format
If formatting succeeds, the log will contain no exceptions and you can go on to start the cluster services.
Start the HDFS cluster by running:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/start-dfs.sh
On the master node you should now see the following processes:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ jps
17238 Jps
16845 NameNode
17128 SecondaryNameNode
And the following on the slave nodes:

shirdrn@slave01:~/programs$ jps
4865 Jps
4753 DataNode

shirdrn@slave02:~/programs$ jps
4867 DataNode
4971 Jps
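To confirm the DataNodes have actually registered with the NameNode, rather than merely started, the standard dfsadmin report can be used; it should list both slaves as live nodes:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hdfs dfsadmin -report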
  • Starting the YARN cluster
Once configuration is complete, starting the YARN cluster is easy; it only takes a few scripts.
Start the ResourceManager by running:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/yarn-daemon.sh start resourcemanager
An additional ResourceManager process now shows up:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ jps
16845 NameNode
17128 SecondaryNameNode
17490 Jps
17284 ResourceManager
Then start the NodeManager process on each slave node:

shirdrn@slave01:~/programs/hadoop2/hadoop-2.0.4-alpha$ sbin/yarn-daemon.sh start nodemanager
shirdrn@slave02:~/programs/hadoop2/hadoop-2.0.4-alpha$ sbin/yarn-daemon.sh start nodemanager
jps now shows an additional NodeManager process on each slave node:

shirdrn@slave01:~/programs/hadoop2/hadoop-2.0.4-alpha$ jps
5544 DataNode
5735 NodeManager
5904 Jps

shirdrn@slave02:~/programs/hadoop2/hadoop-2.0.4-alpha$ jps
5544 DataNode
5735 NodeManager
5904 Jps
Alternatively, tail the logs of the corresponding processes to confirm they started successfully (the ResourceManager log lives on the master, the NodeManager logs on the slaves):

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ tail -100f /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/yarn-shirdrn-resourcemanager-master.log

shirdrn@slave01:~/programs/hadoop2/hadoop-2.0.4-alpha$ tail -100f /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/yarn-shirdrn-nodemanager-slave01.log
In addition, the entire Hadoop cluster (both HDFS and YARN) can be brought up with a single script that starts all related processes:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
master: starting namenode, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/hadoop-shirdrn-namenode-master.out
slave02: starting datanode, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/hadoop-shirdrn-datanode-slave02.out
slave01: starting datanode, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/hadoop-shirdrn-datanode-slave01.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/hadoop-shirdrn-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/yarn-shirdrn-resourcemanager-master.out
slave01: starting nodemanager, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/yarn-shirdrn-nodemanager-slave01.out
slave02: starting nodemanager, logging to /home/shirdrn/programs/hadoop2/hadoop-2.0.4-alpha/logs/yarn-shirdrn-nodemanager-slave02.out
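To shut the cluster down, the matching stop scripts live next to the start scripts (a stop-all.sh exists as well, but it is deprecated for the same reason as start-all.sh):

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/stop-yarn.sh
shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ sbin/stop-dfs.sh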

Verifying the Cluster

Finally, verify that the cluster can run jobs by executing one of the bundled Hadoop examples:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.4-alpha.jar randomwriter out
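If the job finishes successfully, its output lands in the out directory on HDFS, and its progress can be watched while it runs in the ResourceManager web UI at http://master:8088/ (8088 being the default port for yarn.resourcemanager.webapp.address; adjust if you override it). A quick check of the output:

shirdrn@master:~/cloud/hadoop2/hadoop-2.0.4-alpha$ bin/hadoop fs -ls out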


References
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
