1. In Hadoop 2, the HDFS NameNode can be deployed as a cluster (HA plus federation), which improves the NameNode's horizontal scalability and availability.
2. MapReduce splits the resource management and job life-cycle management (including scheduling and monitoring) that used to live in the JobTracker into two independent components; the resource-management part is now YARN (Yet Another Resource Negotiator).
ZooKeeper must be installed before Hadoop; see the separate guide on installing a ZooKeeper cluster.
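Installing ZooKeeper itself is out of scope here, but as a quick reference, a minimal zoo.cfg sketch for the three ZooKeeper nodes used in this setup (master1, master1ha and master2, see the plan below) might look like the following; the dataDir location is an assumption, adjust it to your environment:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/zookeeper/data    # assumed location
clientPort=2181
server.1=master1:2888:3888
server.2=master1ha:2888:3888
server.3=master2:2888:3888
Each node also needs a matching myid file (containing 1, 2 or 3) inside dataDir.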
My cluster layout is:
master1   | 192.168.56.101 | (active namenode, RM, zk) |
master1ha | 192.168.56.102 | (standby namenode, jn, zk) |
master2   | 192.168.56.103 | (active namenode, jn, RM, zk) |
master2ha | 192.168.56.104 | (standby namenode, jn) |
slave1 | 192.168.56.105 | (datanode,nodemanager) |
slave2 | 192.168.56.106 | (datanode,nodemanager) |
slave3 | 192.168.56.107 | (datanode,nodemanager) |
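All of the configuration below refers to these hostnames, so every node presumably needs matching entries in /etc/hosts (a sketch, using the IPs from the plan above):
192.168.56.101 master1
192.168.56.102 master1ha
192.168.56.103 master2
192.168.56.104 master2ha
192.168.56.105 slave1
192.168.56.106 slave2
192.168.56.107 slave3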
Installing Hadoop 2.6.0
The overall flow is: upload, extract, rename, set environment variables, edit the configuration files, and copy everything to the other machines.
1. Upload
Upload the Hadoop 2 tarball with your file-transfer tool, or via:
su - hadoop
cd /home/hadoop
rz -y

2. Extract (must be done as the hadoop user)
tar -zxvf hadoop-2.6.0.tar.gz

3. Rename
mv hadoop-2.6.0 hadoop

4. Set environment variables
su - root
vi /etc/profile
Add the following:
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
Then:
source /etc/profile
su - hadoop
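A quick optional sanity check that the variables took effect:
echo $HADOOP_HOME     # should print /home/hadoop/hadoop
hadoop version        # should report Hadoop 2.6.0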
5. Edit the configuration files: go to the /home/hadoop/hadoop/etc/hadoop directory
Configuring core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Federation setup: viewfs acts as a unified view over both nameservices -->
  <property><name>fs.defaultFS</name><value>viewfs:///</value></property>
  <!-- Mount point served by the first NameNode cluster -->
  <property><name>fs.viewfs.mounttable.default.link./tmp</name><value>hdfs://hadoop-cluster1/</value></property>
  <!-- Mount point served by the second NameNode cluster -->
  <property><name>fs.viewfs.mounttable.default.link./tmp1</name><value>hdfs://hadoop-cluster2/</value></property>
</configuration>
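With this mount table, client paths are routed by prefix: anything under /tmp goes to hadoop-cluster1 and anything under /tmp1 goes to hadoop-cluster2. A hedged example of what that looks like once the cluster is running (the file name is made up):
hadoop fs -put app.log /tmp/app.log      # stored in hadoop-cluster1
hadoop fs -put app.log /tmp1/app.log     # stored in hadoop-cluster2
hadoop fs -ls /                          # lists the viewfs mount points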
Configuring hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Names of the two NameNode nameservices -->
  <property><name>dfs.nameservices</name><value>hadoop-cluster1,hadoop-cluster2</value></property>
  <!-- The two NameNodes of the first nameservice -->
  <property><name>dfs.ha.namenodes.hadoop-cluster1</name><value>nn1,nn2</value></property>
  <!-- RPC address of the first nameservice's active NameNode -->
  <property><name>dfs.namenode.rpc-address.hadoop-cluster1.nn1</name><value>master1:9000</value></property>
  <!-- RPC address of the first nameservice's standby NameNode -->
  <property><name>dfs.namenode.rpc-address.hadoop-cluster1.nn2</name><value>master1ha:9000</value></property>
  <!-- Web address of the first nameservice's active NameNode -->
  <property><name>dfs.namenode.http-address.hadoop-cluster1.nn1</name><value>master1:50070</value></property>
  <!-- Web address of the first nameservice's standby NameNode -->
  <property><name>dfs.namenode.http-address.hadoop-cluster1.nn2</name><value>master1ha:50070</value></property>
  <!-- Secondary NameNode http addresses for the first nameservice -->
  <property><name>dfs.namenode.secondary.http-address.hadoop-cluster1.nn1</name><value>master1:9001</value></property>
  <property><name>dfs.namenode.secondary.http-address.hadoop-cluster1.nn2</name><value>master1ha:9001</value></property>
  <!-- Failover proxy provider class for the first nameservice -->
  <property><name>dfs.client.failover.proxy.provider.hadoop-cluster1</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
  <!-- The second nameservice is configured the same way as the first -->
  <property><name>dfs.ha.namenodes.hadoop-cluster2</name><value>nn3,nn4</value></property>
  <property><name>dfs.namenode.rpc-address.hadoop-cluster2.nn3</name><value>master2:9000</value></property>
  <property><name>dfs.namenode.rpc-address.hadoop-cluster2.nn4</name><value>master2ha:9000</value></property>
  <property><name>dfs.namenode.http-address.hadoop-cluster2.nn3</name><value>master2:50070</value></property>
  <property><name>dfs.namenode.http-address.hadoop-cluster2.nn4</name><value>master2ha:50070</value></property>
  <property><name>dfs.namenode.secondary.http-address.hadoop-cluster2.nn3</name><value>master2:9001</value></property>
  <property><name>dfs.namenode.secondary.http-address.hadoop-cluster2.nn4</name><value>master2ha:9001</value></property>
  <property><name>dfs.client.failover.proxy.provider.hadoop-cluster2</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
  <!-- Local directory for NameNode metadata -->
  <property><name>dfs.namenode.name.dir</name><value>/home/hadoop/hadoop/namedir</value></property>
  <!-- JournalNode edits location for the first nameservice (same value on the active and the standby) -->
  <property><name>dfs.namenode.shared.edits.dir.hadoop-cluster1.nn1</name><value>qjournal://master1ha:8485;master2:8485;master2ha:8485/cluster1</value></property>
  <property><name>dfs.namenode.shared.edits.dir.hadoop-cluster1.nn2</name><value>qjournal://master1ha:8485;master2:8485;master2ha:8485/cluster1</value></property>
  <!-- JournalNode edits location for the second nameservice (same value on the active and the standby) -->
  <property><name>dfs.namenode.shared.edits.dir.hadoop-cluster2.nn3</name><value>qjournal://master1ha:8485;master2:8485;master2ha:8485/cluster2</value></property>
  <property><name>dfs.namenode.shared.edits.dir.hadoop-cluster2.nn4</name><value>qjournal://master1ha:8485;master2:8485;master2ha:8485/cluster2</value></property>
  <!-- Local directory for DataNode block storage -->
  <property><name>dfs.datanode.data.dir</name><value>/home/hadoop/hadoop/datadir</value></property>
  <!-- ZooKeeper quorum -->
  <property><name>ha.zookeeper.quorum</name><value>master1:2181,master1ha:2181,master2:2181</value></property>
  <!-- Fencing method (SSH) -->
  <property><name>dfs.ha.fencing.methods</name><value>sshfence</value></property>
  <!-- ZooKeeper session timeout -->
  <property><name>ha.zookeeper.session-timeout.ms</name><value>5000</value></property>
  <!-- Enable automatic NameNode failover -->
  <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
  <!-- JournalNode local directory -->
  <property><name>dfs.journalnode.edits.dir</name><value>/home/hadoop/hadoop/jndir</value></property>
  <!-- Replication factor -->
  <property><name>dfs.replication</name><value>3</value></property>
  <!-- Disable HDFS permission checking -->
  <property><name>dfs.permissions.enabled</name><value>false</value></property>
  <!-- Allow web (WebHDFS) access to HDFS -->
  <property><name>dfs.webhdfs.enabled</name><value>true</value></property>
  <!-- Allow append -->
  <property><name>dfs.support.append</name><value>true</value></property>
  <!-- Temporary directory -->
  <property><name>hadoop.tmp.dir</name><value>/home/hadoop/hadoop/tmp</value></property>
  <!-- Hadoop proxy-user settings -->
  <property><name>hadoop.proxyuser.hduser.hosts</name><value>*</value></property>
  <property><name>hadoop.proxyuser.hduser.groups</name><value>*</value></property>
  <!-- SSH private key used for fencing -->
  <property><name>dfs.ha.fencing.ssh.private-key-files</name><value>/home/hadoop/.ssh/id_rsa</value></property>
</configuration>

Configuring mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Runtime framework for MapReduce; must be set to yarn when running on YARN -->
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <!-- IPC address of the MapReduce JobHistory server -->
  <property><name>mapreduce.jobhistory.address</name><value>master1:10020</value></property>
  <!-- Web address of the MapReduce JobHistory server -->
  <property><name>mapreduce.jobhistory.webapp.address</name><value>master1:19888</value></property>
  <!-- Directory where MapReduce stores system data -->
  <property><name>mapred.system.dir</name><value>/home/hadoop/hadoop/hadoopmrsys</value><final>true</final></property>
  <!-- Local directory where MapReduce stores intermediate data -->
  <property><name>mapred.local.dir</name><value>/home/hadoop/hadoop/hadoopmrlocal</value><final>true</final></property>
</configuration>
Configuring slaves
slave1
slave2
slave3
Configuring yarn-site.xml

<?xml version="1.0"?>
<configuration>
  <!-- Retry interval for reconnecting to the RM after losing contact -->
  <property><name>yarn.resourcemanager.connect.retry-interval.ms</name><value>2000</value></property>
  <!-- Enable ResourceManager HA (default is false) -->
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <!-- Logical IDs of the two ResourceManagers -->
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <!-- ZooKeeper quorum -->
  <property><name>ha.zookeeper.quorum</name><value>master2ha:2181,master1ha:2181,master2:2181</value></property>
  <!-- Enable automatic ResourceManager failover -->
  <property><name>yarn.resourcemanager.ha.automatic-failover.enabled</name><value>true</value></property>
  <!-- Hostname of rm1 -->
  <property><name>yarn.resourcemanager.hostname.rm1</name><value>master1</value></property>
  <!-- Hostname of rm2 -->
  <property><name>yarn.resourcemanager.hostname.rm2</name><value>master2</value></property>
  <!-- Set to rm1 on master1 and rm2 on master2. Note: it is common to copy the finished files to the other
       machines, but on the other ResourceManager this value must be changed to rm2. -->
  <property><name>yarn.resourcemanager.ha.id</name><value>rm1</value><description>If we want to launch more than one RM in single node, we need this configuration</description></property>
  <!-- Enable ResourceManager state recovery -->
  <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
  <!-- Class used for RM state storage. The default is
       org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore, backed by a Hadoop
       filesystem; ZKRMStateStore is the ZooKeeper-based implementation. -->
  <property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value></property>
  <!-- host:port list of the ZooKeeper servers used by the RM for state storage, comma separated -->
  <property><name>yarn.resourcemanager.zk-address</name><value>master2ha:2181,master1ha:2181,master2:2181</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>hadoop-cluster1-yarn</value></property>
  <!-- Wait interval before reconnecting to the scheduler -->
  <property><name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name><value>5000</value></property>
  <!-- rm1 addresses -->
  <!-- Application manager interface of rm1 -->
  <property><name>yarn.resourcemanager.address.rm1</name><value>master1:8132</value></property>
  <!-- Scheduler interface of rm1 -->
  <property><name>yarn.resourcemanager.scheduler.address.rm1</name><value>master1:8130</value></property>
  <!-- Web UI of rm1 -->
  <property><name>yarn.resourcemanager.webapp.address.rm1</name><value>master1:8188</value></property>
  <!-- resource-tracker interface of rm1 -->
  <property><name>yarn.resourcemanager.resource-tracker.address.rm1</name><value>master1:8131</value></property>
  <!-- The address of the RM1 admin interface -->
  <property><name>yarn.resourcemanager.admin.address.rm1</name><value>master1:8033</value></property>
  <!-- HA admin interface of rm1 -->
  <property><name>yarn.resourcemanager.ha.admin.address.rm1</name><value>master1:23142</value></property>
  <!-- rm2 addresses, mirroring rm1 -->
  <property><name>yarn.resourcemanager.address.rm2</name><value>master2:8132</value></property>
  <property><name>yarn.resourcemanager.scheduler.address.rm2</name><value>master2:8130</value></property>
  <property><name>yarn.resourcemanager.webapp.address.rm2</name><value>master2:8188</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address.rm2</name><value>master2:8131</value></property>
  <property><name>yarn.resourcemanager.admin.address.rm2</name><value>master2:8033</value></property>
  <property><name>yarn.resourcemanager.ha.admin.address.rm2</name><value>master2:23142</value></property>
  <!-- Auxiliary service name; must be mapreduce_shuffle when running MapReduce -->
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <!-- Handler class for mapreduce_shuffle -->
  <property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <!-- Local directory for NodeManager intermediate files -->
  <property><name>yarn.nodemanager.local-dirs</name><value>/home/hadoop/hadoop/nodemanagerlocal</value></property>
  <!-- Local directory for NodeManager logs -->
  <property><name>yarn.nodemanager.log-dirs</name><value>/home/hadoop/hadoop/nodemanagerlogs</value></property>
  <property><name>mapreduce.shuffle.port</name><value>23080</value></property>
  <!-- Failover proxy provider class -->
  <property><name>yarn.client.failover-proxy-provider</name><value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value></property>
  <property><name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name><value>/yarn-leader-election</value></property>
</configuration>

Editing hadoop-env.sh
# The java implementation to use (point JAVA_HOME at the installed JDK)
export JAVA_HOME=/usr/jdk
# The jsvc implementation to use. Jsvc is required to run secure datanodes.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

Editing yarn-env.sh
# User for YARN daemons
export HADOOP_YARN_USER=${HADOOP_YARN_USER:-yarn}

# resolve links - $0 may be a softlink
export YARN_CONF_DIR="${YARN_CONF_DIR:-$HADOOP_YARN_HOME/conf}"

# some Java parameters (point JAVA_HOME at the installed JDK)
export JAVA_HOME=/usr/jdk
if [ "$JAVA_HOME" != "" ]; then
  #echo "run java in $JAVA_HOME"
  JAVA_HOME=$JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# For setting YARN specific HEAP sizes please use this
# Parameter and set appropriately
# YARN_HEAPSIZE=1000

# check envvars which might override default args
if [ "$YARN_HEAPSIZE" != "" ]; then
  JAVA_HEAP_MAX="-Xmx""$YARN_HEAPSIZE""m"
fi

# Resource Manager specific parameters

# Specify the max Heapsize for the ResourceManager using a numerical value
# in the scale of MB. For example, to specify an jvm option of -Xmx1000m, set
# the value to 1000.
# This value will be overridden by an Xmx setting specified in either YARN_OPTS
# and/or YARN_RESOURCEMANAGER_OPTS.
# If not specified, the default value will be picked from either YARN_HEAPMAX
# or JAVA_HEAP_MAX with YARN_HEAPMAX as the preferred option of the two.
#export YARN_RESOURCEMANAGER_HEAPSIZE=1000

# Specify the JVM options to be used when starting the ResourceManager.
# These options will be appended to the options specified as YARN_OPTS
# and therefore may override any similar flags set in YARN_OPTS
#export YARN_RESOURCEMANAGER_OPTS=

# Node Manager specific parameters

# Specify the max Heapsize for the NodeManager using a numerical value
# in the scale of MB. For example, to specify an jvm option of -Xmx1000m, set
# the value to 1000.
# This value will be overridden by an Xmx setting specified in either YARN_OPTS
# and/or YARN_NODEMANAGER_OPTS.
# If not specified, the default value will be picked from either YARN_HEAPMAX
# or JAVA_HEAP_MAX with YARN_HEAPMAX as the preferred option of the two.
#export YARN_NODEMANAGER_HEAPSIZE=1000

# Specify the JVM options to be used when starting the NodeManager.
# These options will be appended to the options specified as YARN_OPTS
# and therefore may override any similar flags set in YARN_OPTS
#export YARN_NODEMANAGER_OPTS=

# so that filenames w/ spaces are handled correctly in loops below
IFS=

# default log directory & file
if [ "$YARN_LOG_DIR" = "" ]; then
  YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
fi
if [ "$YARN_LOGFILE" = "" ]; then
  YARN_LOGFILE='yarn.log'
fi

# default policy file for service-level authorization
if [ "$YARN_POLICYFILE" = "" ]; then
  YARN_POLICYFILE="hadoop-policy.xml"
fi

# restore ordinary behaviour
unset IFS

YARN_OPTS="$YARN_OPTS -Dhadoop.log.dir=$YARN_LOG_DIR"
YARN_OPTS="$YARN_OPTS -Dyarn.log.dir=$YARN_LOG_DIR"
YARN_OPTS="$YARN_OPTS -Dhadoop.log.file=$YARN_LOGFILE"
YARN_OPTS="$YARN_OPTS -Dyarn.log.file=$YARN_LOGFILE"
YARN_OPTS="$YARN_OPTS -Dyarn.home.dir=$YARN_COMMON_HOME"
YARN_OPTS="$YARN_OPTS -Dyarn.id.str=$YARN_IDENT_STRING"
YARN_OPTS="$YARN_OPTS -Dhadoop.root.logger=${YARN_ROOT_LOGGER:-INFO,console}"
YARN_OPTS="$YARN_OPTS -Dyarn.root.logger=${YARN_ROOT_LOGGER:-INFO,console}"
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  YARN_OPTS="$YARN_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi
YARN_OPTS="$YARN_OPTS -Dyarn.policy.file=$YARN_POLICYFILE"
6. Create the directories (under /home/hadoop/hadoop, matching the paths configured above):
mkdir -m 755 namedir
mkdir -m 755 datadir
mkdir -m 755 tmp
mkdir -m 755 jndir
mkdir -m 755 hadoopmrsys
mkdir -m 755 hadoopmrlocal
mkdir -m 755 nodemanagerlocal
mkdir -m 755 nodemanagerlogs
7. Copy to the other nodes
scp -r /home/hadoop/hadoop hadoop@master1ha:/home/hadoop/hadoop
scp -r /home/hadoop/hadoop hadoop@master2:/home/hadoop/hadoop
scp -r /home/hadoop/hadoop hadoop@master2ha:/home/hadoop/hadoop
scp -r /home/hadoop/hadoop hadoop@slave1:/home/hadoop/hadoop
scp -r /home/hadoop/hadoop hadoop@slave2:/home/hadoop/hadoop
scp -r /home/hadoop/hadoop hadoop@slave3:/home/hadoop/hadoop
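The HADOOP_HOME and PATH settings from step 4 also need to exist on every node. A sketch of one way to push them out from master1 (assuming root SSH access; you can equally just repeat step 4 by hand on each machine):
for h in master1ha master2 master2ha slave1 slave2 slave3; do
  ssh root@$h 'echo "export HADOOP_HOME=/home/hadoop/hadoop" >> /etc/profile'
  ssh root@$h 'echo "export PATH=\$PATH:\$HADOOP_HOME/bin" >> /etc/profile'
done
# then run "source /etc/profile" on each node (or simply log in again)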
8. On master2, edit yarn-site.xml and set yarn.resourcemanager.ha.id to rm2
(rm1 is used on master1; when the finished configuration files are copied to the other ResourceManager machine, this is the one value that must be changed.)
<property><name>yarn.resourcemanager.ha.id</name><value>rm2</value><description>If we want to launch more than one RM in single node, we need this configuration</description></property>
9. Initialization
9.1 Start the ZooKeeper cluster
On the three machines running ZooKeeper (master1, master1ha and master2, per the plan and the ha.zookeeper.quorum setting above), run zkServer.sh start, then run jps and check that a ZooKeeper process is present.
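To confirm the ensemble formed (assuming zkServer.sh is on the PATH):
zkServer.sh status    # one node should report "leader", the others "follower"
jps                   # each ZooKeeper node should list a QuorumPeerMain process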
9.2 Format the ZooKeeper cluster; this creates the znodes that HA needs on the ZooKeeper ensemble
On the active NameNode of the first cluster (master1):
/home/hadoop/hadoop/bin/hdfs zkfc -formatZK
On the active NameNode of the second cluster (master2):
/home/hadoop/hadoop/bin/hdfs zkfc -formatZK
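Optionally verify that the HA znodes were created, using the ZooKeeper CLI (the zkCli.sh path is an assumption; /hadoop-ha is the standard base znode the ZKFC creates):
/home/hadoop/zookeeper/bin/zkCli.sh -server master1:2181
ls /hadoop-ha        # should list hadoop-cluster1 and hadoop-cluster2
quit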
9.3 Start the JournalNode cluster
Start a JournalNode on every machine where it is installed; in this setup that is master1ha, master2 and master2ha.
On master1ha:
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start journalnode
On master2:
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start journalnode
On master2ha:
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start journalnode
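Equivalently, from master1 (a sketch assuming passwordless SSH between the nodes is already set up and jps is on the remote PATH):
for h in master1ha master2 master2ha; do
  ssh $h /home/hadoop/hadoop/sbin/hadoop-daemon.sh start journalnode
  ssh $h jps | grep JournalNode   # confirm the process came up
done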
9.4 Format the NameNode on master1
Run this on master1; the clusterId is the ID shared by the whole cluster:
/home/hadoop/hadoop/bin/hdfs namenode -format -clusterId hellokitty
9.5 Start the NameNode on master1
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start namenode

9.6 On master1ha, sync the NameNode metadata from master1
/home/hadoop/hadoop/bin/hdfs namenode -bootstrapStandby
9.7 Start the NameNode on master1ha
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start namenode

9.8 Make the NameNode on master1 active
On master1:
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start zkfc
On master1ha:
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start zkfc
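At this point the first nameservice should have one active and one standby NameNode. A quick way to check, using the nameservice and NameNode IDs configured above (the NameNode web UIs at master1:50070 and master1ha:50070 show the same state):
/home/hadoop/hadoop/bin/hdfs haadmin -ns hadoop-cluster1 -getServiceState nn1   # expect: active
/home/hadoop/hadoop/bin/hdfs haadmin -ns hadoop-cluster1 -getServiceState nn2   # expect: standby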
9.9 On master2, format and start the NameNode of the second cluster
Format it with the same clusterId used in step 9.4, then start it:
/home/hadoop/hadoop/bin/hdfs namenode -format -clusterId hellokitty
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start namenode
9.10 On master2ha, sync the NameNode metadata from master2
/home/hadoop/hadoop/bin/hdfs namenode -bootstrapStandby
9.11 Start the NameNode on master2ha
/home/hadoop/hadoop/sbin/hadoop-daemon.sh start namenode

9.12 Make the NameNode on master2 active
On master2:
hadoop-daemon.sh start zkfc
On master2ha:
hadoop-daemon.sh start zkfc
9.13 Start the DataNodes
On each of slave1, slave2 and slave3:
hadoop-daemon.sh start datanode

9.14 Start YARN
On master1: start-yarn.sh
On master2: yarn-daemon.sh start resourcemanager
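Finally, a few checks that the whole stack came up (a sketch using standard Hadoop 2.x commands; the expected outputs are what this configuration should produce):
yarn rmadmin -getServiceState rm1                    # expect: active
yarn rmadmin -getServiceState rm2                    # expect: standby
hadoop fs -ls /                                      # should show the /tmp and /tmp1 viewfs mount points
hdfs dfsadmin -fs hdfs://hadoop-cluster1 -report     # DataNodes registered with the first nameservice
The web UIs should also respond at master1:50070 and master2:50070 for HDFS, and at master1:8188 for YARN.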