Hadoop HA Upgrade: 2.2 to 2.6

The official docs [HDFSHighAvailabilityWithQJM.html] and [HdfsRollingUpgrade.html] (note that rolling upgrade is only supported from Hadoop 2.4.0 onwards) are detailed, but neither walks through a complete end-to-end case. This post records the actual steps.

  1. Stop all NameNodes and deploy the new Hadoop version.
  2. Start ALL JournalNodes. Every single one! Upgrading the NameNode also upgrades all the JournalNodes.
  3. Start one NameNode with the -upgrade option. This NameNode goes straight into active state, upgrades its local metadata, and also upgrades the shared edit log (i.e., the JournalNode data).
  4. Start the other NameNode(s) with -bootstrapStandby to sync. Do not use the -upgrade option on them! (I have not tried it, so I cannot say what would happen.)
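Step 2 matters because -upgrade rewrites the shared edit log on every JournalNode. The JournalNodes are listed in the dfs.namenode.shared.edits.dir URI; as a sketch, here is one way to enumerate them from that URI (the value below is an assumption reconstructed from this cluster's logs: a single JN on hadoop-master1:8485, journal id zfcluster; adjust for yours):

```shell
# Enumerate JournalNodes from a qjournal:// URI so you can verify each one
# is running before starting the NameNode with -upgrade.
uri="qjournal://hadoop-master1:8485/zfcluster"
hosts=${uri#qjournal://}   # drop the scheme
hosts=${hosts%/*}          # drop the journal id
printf '%s\n' "$hosts" | tr ';' '\n'   # one host:port per line
```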

Stop the cluster and deploy the new Hadoop version

[hadoop@hadoop-master1 hadoop-2.2.0]$ sbin/stop-dfs.sh
16/01/08 09:10:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [hadoop-master1 hadoop-master2]
hadoop-master2: stopping namenode
hadoop-master1: stopping namenode
hadoop-slaver1: stopping datanode
hadoop-slaver2: stopping datanode
hadoop-slaver3: stopping datanode
Stopping journal nodes [hadoop-master1]
hadoop-master1: stopping journalnode
16/01/08 09:10:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping ZK Failover Controllers on NN hosts [hadoop-master1 hadoop-master2]
hadoop-master1: stopping zkfc
hadoop-master2: stopping zkfc
[hadoop@hadoop-master1 hadoop-2.2.0]$
[hadoop@hadoop-master1 hadoop-2.2.0]$ cd ~/hadoop-2.6.3
[hadoop@hadoop-master1 hadoop-2.6.3]$ ll
total 52
drwxr-xr-x 2 hadoop hadoop  4096 Dec 18 01:52 bin
lrwxrwxrwx 1 hadoop hadoop    32 Jan  8 06:05 etc -> /home/hadoop/hadoop-2.2.0/ha-etc
drwxr-xr-x 2 hadoop hadoop  4096 Dec 18 01:52 include
drwxr-xr-x 3 hadoop hadoop  4096 Dec 18 01:52 lib
drwxr-xr-x 2 hadoop hadoop  4096 Dec 18 01:52 libexec
-rw-r--r-- 1 hadoop hadoop 15429 Dec 18 01:52 LICENSE.txt
drwxrwxr-x 2 hadoop hadoop  4096 Jan  8 03:37 logs
-rw-r--r-- 1 hadoop hadoop   101 Dec 18 01:52 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop  1366 Dec 18 01:52 README.txt
drwxr-xr-x 2 hadoop hadoop  4096 Dec 18 01:52 sbin
drwxr-xr-x 3 hadoop hadoop  4096 Jan  7 08:00 share

# sync the new version to the other nodes
[hadoop@hadoop-master1 ~]$ for h in hadoop-master2 hadoop-slaver1 hadoop-slaver2 hadoop-slaver3 ; do rsync -vaz --delete --exclude=logs ~/hadoop-2.6.3 $h:~/ ; done

Start all JournalNodes

2.6 and 2.2 share a single set of configuration: etc is a symlink to the 2.2 ha-etc config directory.
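This shared-config trick can be reproduced in a scratch directory (a sketch only; the real paths live under /home/hadoop, as the ll listing above shows):

```shell
# The 2.6.3 tree carries no etc/ of its own; it symlinks to the 2.2.0 HA
# config directory, so both versions read the same files.
base=$(mktemp -d)
mkdir -p "$base/hadoop-2.2.0/ha-etc" "$base/hadoop-2.6.3"
ln -s "$base/hadoop-2.2.0/ha-etc" "$base/hadoop-2.6.3/etc"
target=$(readlink "$base/hadoop-2.6.3/etc")
echo "etc -> $target"
rm -rf "$base"
```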

[hadoop@hadoop-master1 hadoop-2.6.3]$ sbin/hadoop-daemons.sh --hostnames "hadoop-master1" --script /home/hadoop/hadoop-2.2.0/bin/hdfs start journalnode
hadoop-master1: starting journalnode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-journalnode-hadoop-master1.out
[hadoop@hadoop-master1 hadoop-2.6.3]$ jps
31047 JournalNode
244 QuorumPeerMain
31097 Jps

Upgrade one NameNode

[hadoop@hadoop-master1 hadoop-2.6.3]$ bin/hdfs namenode -upgrade
...
16/01/08 09:13:54 INFO namenode.NameNode: createNameNode [-upgrade]
...
16/01/08 09:13:57 INFO namenode.FSImage: Starting upgrade of local storage directories.  old LV = -47; old CTime = 0.  new LV = -60; new CTime = 1452244437060
16/01/08 09:13:57 INFO namenode.NNUpgradeUtil: Starting upgrade of storage directory /data/tmp/dfs/name
16/01/08 09:13:57 INFO namenode.FSImageTransactionalStorageInspector: No version file in /data/tmp/dfs/name
16/01/08 09:13:57 INFO namenode.NNUpgradeUtil: Performing upgrade of storage directory /data/tmp/dfs/name
16/01/08 09:13:57 INFO namenode.FSNamesystem: Need to save fs image? false (staleImage=false, haEnabled=true, isRollingUpgrade=false)
...
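The key line is the layout-version change (old LV = -47 to new LV = -60). As a small sketch, the versions can be pulled out of that log line to confirm the migration (the sample line is copied from the -upgrade output above):

```shell
# Extract the old/new metadata layout versions from the FSImage upgrade
# log line; a change here confirms the on-disk format was migrated.
line="Starting upgrade of local storage directories.  old LV = -47; old CTime = 0.  new LV = -60; new CTime = 1452244437060"
old_lv=$(printf '%s\n' "$line" | sed -n 's/.*old LV = \(-[0-9]*\);.*/\1/p')
new_lv=$(printf '%s\n' "$line" | sed -n 's/.*new LV = \(-[0-9]*\);.*/\1/p')
echo "layout version: $old_lv -> $new_lv"
```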

As the official docs say, besides the NameNode's local metadata, the shared edit log is upgraded as well.

The JournalNode log confirms that the JournalNode was indeed upgraded too:

[hadoop@hadoop-master1 hadoop-2.6.3]$ less logs/hadoop-hadoop-journalnode-hadoop-master1.log
...
2016-01-08 09:13:57,070 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Starting upgrade of edits directory /data/journal/zfcluster
2016-01-08 09:13:57,072 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Starting upgrade of storage directory /data/journal/zfcluster
2016-01-08 09:13:57,185 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Starting upgrade of edits directory: .  old LV = -47; old CTime = 0.  new LV = -60; new CTime = 1452244437060
2016-01-08 09:13:57,185 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Performing upgrade of storage directory /data/journal/zfcluster
2016-01-08 09:13:57,222 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Updating lastWriterEpoch from 2 to 3 for client /172.17.0.1
2016-01-08 09:16:57,731 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Updating lastPromisedEpoch from 3 to 4 for client /172.17.0.1
2016-01-08 09:16:57,735 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/data/journal/zfcluster)
...

The NameNode started with -upgrade runs in the foreground; do not kill this process yet. Next, bootstrap the other NameNode to sync it up.

Bootstrap the other NameNode

[hadoop@hadoop-master2 hadoop-2.6.3]$ bin/hdfs namenode -bootstrapStandby
...
=====================================================
About to bootstrap Standby ID nn2 from:
           Nameservice ID: zfcluster
        Other Namenode ID: nn1
  Other NN's HTTP address: http://hadoop-master1:50070
   Other NN's IPC address: hadoop-master1/172.17.0.1:8020
             Namespace ID: 639021326
            Block pool ID: BP-1695500896-172.17.0.1-1452152050513
               Cluster ID: CID-7d5c31d8-5cd4-46c8-8e04-49151578e5bb
           Layout version: -60
       isUpgradeFinalized: false
=====================================================
16/01/08 09:15:19 INFO ha.BootstrapStandby: The active NameNode is in Upgrade. Prepare the upgrade for the standby NameNode as well.
16/01/08 09:15:19 INFO common.Storage: Lock on /data/tmp/dfs/name/in_use.lock acquired by nodename 5008@hadoop-master2
16/01/08 09:15:21 INFO namenode.TransferFsImage: Opening connection to http://hadoop-master1:50070/imagetransfer?getimage=1&txid=1126&storageInfo=-60:639021326:1452244437060:CID-7d5c31d8-5cd4-46c8-8e04-49151578e5bb
16/01/08 09:15:21 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds
16/01/08 09:15:21 INFO namenode.TransferFsImage: Transfer took 0.00s at 0.00 KB/s
16/01/08 09:15:21 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000001126 size 977 bytes.
16/01/08 09:15:21 INFO namenode.NNUpgradeUtil: Performing upgrade of storage directory /data/tmp/dfs/name
...

Restart the cluster

Ctrl+C the -upgrade NameNode on hadoop-master1, then start the whole cluster.

[hadoop@hadoop-master1 hadoop-2.6.3]$ sbin/start-dfs.sh
16/01/08 09:16:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop-master1 hadoop-master2]
hadoop-master1: starting namenode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-namenode-hadoop-master1.out
hadoop-master2: starting namenode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-namenode-hadoop-master2.out
hadoop-slaver3: starting datanode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-datanode-hadoop-slaver3.out
hadoop-slaver2: starting datanode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-datanode-hadoop-slaver2.out
hadoop-slaver1: starting datanode, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-datanode-hadoop-slaver1.out
Starting journal nodes [hadoop-master1]
hadoop-master1: journalnode running as process 31047. Stop it first.
16/01/08 09:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting ZK Failover Controllers on NN hosts [hadoop-master1 hadoop-master2]
hadoop-master2: starting zkfc, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-zkfc-hadoop-master2.out
hadoop-master1: starting zkfc, logging to /home/hadoop/hadoop-2.6.3/logs/hadoop-hadoop-zkfc-hadoop-master1.out
[hadoop@hadoop-master1 hadoop-2.6.3]$ jps
31047 JournalNode
244 QuorumPeerMain
31596 DFSZKFailoverController
31655 Jps
31294 NameNode
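Note that the bootstrapStandby output above reported isUpgradeFinalized: false: the upgrade keeps a rollback copy of the old metadata until it is finalized. A minimal sketch of the check-then-act logic (the sample status value is taken from the output above; on a live cluster the status would come from the NameNode, e.g. its web UI, and bin/hdfs dfsadmin -finalizeUpgrade is the standard finalization command, to be run only after verifying the new version):

```shell
# Decide whether finalization is still pending, based on the
# isUpgradeFinalized flag reported during bootstrapStandby.
is_finalized="false"   # value seen in the bootstrapStandby banner above
if [ "$is_finalized" = "false" ]; then
  action="bin/hdfs dfsadmin -finalizeUpgrade"   # run once the cluster is verified healthy
else
  action="nothing to do"
fi
echo "$action"
```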

Postscript: resetting the JournalNode

After switching back and forth between HA and non-HA setups, the master eventually refused to start in HA mode, and running bootstrapStandby did not help either.

2016-01-08 06:15:36,746 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [172.17.0.1:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 1/1. 1 exceptions thrown:
172.17.0.1:8485: Asked for firstTxId 1022 which is in the middle of file /data/journal/zfcluster/current/edits_0000000000000001021-0000000000000001022
	at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.getRemoteEditLogs(FileJournalManager.java:198)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:640)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:181)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:203)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:17453)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

Stop the cluster and start the JournalNode. Then, on the healthy NameNode, run initializeSharedEdits; finally, re-initialize the broken NameNode:

[hadoop@hadoop-master1 hadoop-2.2.0]$ sbin/hadoop-daemon.sh start journalnode

[hadoop@hadoop-master2 hadoop-2.2.0]$ bin/hdfs namenode -initializeSharedEdits

[hadoop@hadoop-master2 hadoop-2.2.0]$ sbin/hadoop-daemon.sh start namenode

[hadoop@hadoop-master1 hadoop-2.2.0]$ bin/hdfs namenode -bootstrapStandby

[hadoop@hadoop-master1 hadoop-2.2.0]$ sbin/start-dfs.sh

A final thought (caution: untested, just a guess): in the HA upgrade procedure above, if the JournalNodes were not started during -upgrade and that caused problems, resetting the JournalNodes as described here should work too.

References

  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#HDFS_UpgradeFinalizationRollback_with_HA_Enabled
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Upgrade_and_Rollback
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html

–END
