Analysis of the HDFS cluster state changes that can follow a change to a DataNode's IP address

 

    This test examines what happens in Apache Hadoop 2.0 when a DataNode's hostname or IP address is changed.

1 Test environment

Operating system: CentOS 6.2

Hadoop version: Apache Hadoop 2.0.2

Block replication factor: 2

Node deployment:

NodeType    HostName     IP
NameNode    sdc1         10.28.169.121
NameNode    sdc2         10.28.169.122
DataNode2   Tdatanode0   10.28.169.126
DataNode1   sdc2         10.28.169.122
DataNode0   datanode0    10.28.169.225

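2 Test categories

    Because of hardware limits, the environment has only three DataNodes. To simplify testing, DataNode0 and DataNode2 run only the DataNode service, and the hostname and IP changes are applied only to DataNode0. With the Hadoop cluster already running, we test two cases: first, change DataNode0's hostname and observe and analyze how the HDFS cluster state changes; second, change DataNode0's IP address and do the same.

2.1 Changing the DataNode's hostname

    Before changing DataNode0's hostname, first record the IDs in the VERSION files on the NN side and the DN (DataNode0) side for later comparison.

The DN side has two VERSION files:

(1) DN VERSION file 1:

Path: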

/data/hadoop2.0_dn/current/BP-2147169311-10.28.169.122-1355378443940/current

Contents:

namespaceID=1728141100

cTime=0

blockpoolID=BP-2147169311-10.28.169.122-1355378443940

layoutVersion=-40

(2) DN VERSION file 2:

Path:

/data/hadoop2.0_dn/current

Contents:

storageID=DS-382431371-10.28.169.225-50010-1355324528657

clusterID=hadoop2.0

cTime=0

storageType=DATA_NODE

layoutVersion=-40

    The NN side also has two VERSION files:

(1) NN VERSION file 1:

Path:

/data/hadoop2.0_nn_edits/current

Contents:

namespaceID=1728141100

clusterID=hadoop2.0

cTime=0

storageType=NAME_NODE

blockpoolID=BP-2147169311-10.28.169.122-1355378443940

layoutVersion=-40

(2) NN VERSION file 2:

Path:

/data/hadoop2.0_nn_fsimage/current

Contents:

namespaceID=1728141100

clusterID=hadoop2.0

cTime=0

storageType=NAME_NODE

blockpoolID=BP-2147169311-10.28.169.122-1355378443940

layoutVersion=-40
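The before-and-after comparison of VERSION files can be automated. The following is a minimal sketch, assuming the files are plain key=value text as shown above; the function names `parse_version` and `diff_versions` are made up for this illustration and are not part of Hadoop.

```python
def parse_version(text):
    """Parse the key=value lines of a VERSION file into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comment lines and anything without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def diff_versions(before, after):
    """Return {key: (old, new)} for every key whose value differs."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}


# Contents of DN VERSION file 2 as recorded before the change.
DN_VERSION = """\
storageID=DS-382431371-10.28.169.225-50010-1355324528657
clusterID=hadoop2.0
cTime=0
storageType=DATA_NODE
layoutVersion=-40
"""

before = parse_version(DN_VERSION)
# In a real run, re-read the file from disk after the hostname/IP change.
after = parse_version(DN_VERSION)
print(diff_versions(before, after))  # prints {} when nothing changed
```

An empty result means the IDs are identical before and after the change.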

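    Before changing DataNode0's hostname, first look at the DataNodes registered with the NameNode, as shown in the HDFS web UI:

[Figure 1: screenshot]

All three DataNodes are serving normally.

    Now, on DataNode0, change the node's hostname with the hostname command and update the hostname entries in /etc/hosts; the new hostname is ddd. After the change, neither the web UI nor DataNode0's local log shows anything abnormal, and the web UI still lists the node under its original DataNode name:

[Figure 2: screenshot]

    Stop the DataNode service on DataNode0 with hadoop-daemon.sh stop datanode. Then:

(1) If you wait about 630 seconds, the NN judges DataNode0 dead and the number of live DataNodes in the web UI drops to 2.

[Figure 3: screenshot]

    Now restart the DataNode service on DataNode0 with hadoop-daemon.sh start datanode. DataNode0 reports nothing abnormal, and the HDFS web UI changes:

[Figure 4: screenshot]

    The datanode0 node now appears as ddd, and the NN prints the following log. The DN is removed when it is judged dead and added back when its service restarts: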

2012-12-17 17:21:37,697 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/10.28.169.231:50010

2012-12-17 17:21:37,697 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/10.28.169.231:50010

2012-12-17 17:21:37,698 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.28.169.231:50010

2012-12-17 17:21:57,756 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from DatanodeRegistration(10.28.169.231, storageID=DS-382431371-10.28.169.225-50010-1355324528657, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=hadoop2.0;nsid=1728141100;c=0), blocks: 1480, processing time: 15 msecs

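    So after the DataNode service restarts, the DataNode name registered with the NN changes, but the block count and similar data do not.

(2) If the DataNode service is restarted immediately, HDFS does not move any blocks, and the web UI likewise shows datanode0 under the name ddd.

[Figure 5: screenshot]

    In both cases, inspecting the four VERSION files on the NN and DN sides shows that none of them changed: StorageID, BlockPoolID, and ClusterID are all unchanged.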
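The processReport line above registers the node under one address while its storageID still carries another. A small sketch makes that visible by splitting the storageID into fields. The field meanings (random number, IP, port, creation time in milliseconds) are an assumption inferred from the observed value, and `parse_storage_id` is a hypothetical helper, not a Hadoop API.

```python
import re

# Assumed layout: DS-<random>-<ip>-<port>-<creation time in ms>
STORAGE_ID_RE = re.compile(r"^DS-(\d+)-(\d+\.\d+\.\d+\.\d+)-(\d+)-(\d+)$")


def parse_storage_id(storage_id):
    """Split a storageID string into its (assumed) components."""
    match = STORAGE_ID_RE.match(storage_id)
    if match is None:
        raise ValueError("unexpected storageID format: %r" % storage_id)
    rand, ip, port, millis = match.groups()
    return {"random": int(rand), "ip": ip,
            "port": int(port), "created_ms": int(millis)}


# Values taken from the DatanodeRegistration log line above.
registered_ip = "10.28.169.231"
sid = parse_storage_id("DS-382431371-10.28.169.225-50010-1355324528657")
print(sid["ip"], sid["ip"] == registered_ip)  # prints: 10.28.169.225 False
```

The IP embedded in the storageID is simply the address the node had when the ID was first created; it is not rewritten when the node later registers under a different address.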

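2.2 Changing the DataNode's IP address

    Before changing DataNode0's IP address, look at the DN nodes in the web UI:

[Figure 6: screenshot]

    Now change the IP address in /etc/hosts from 10.28.169.225 to 10.28.169.231, and change the IP address in /etc/sysconfig/network-scripts/ifcfg-eth1 to 10.28.169.231 as well. Then restart the network service: service network restart

    From the moment the network service restarts, the NN loses heartbeat contact with DataNode0, and the DN's Last Contact value in the web UI keeps growing:

[Figure 7: screenshot]

    Once Last Contact reaches 630, the NN judges the DN dead (the NN's timeout for declaring a DN dead is 630 seconds):

[Figure 8: screenshot]

    In fact, the DataNode process is still running on the DN at this point: we only changed the DN's IP address and did nothing to the DataNode service. Clients can still browse HDFS normally, but uploading a file raises an exception on the client.

    If the DataNode service is restarted immediately after changing the IP address and restarting the network service, no blocks are moved.

Once the NN has judged the DN dead, it re-replicates the blocks held by the dead DN onto other DataNodes to restore the configured replica count, and the NN prints log lines like the following: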

2012-12-17 15:53:58,336 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/10.28.169.225:50010

2012-12-17 15:54:01,219 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.28.169.122:50010 to replicate blk_970296822908430116_4695 to datanode(s) 10.28.169.126:50010

2012-12-17 15:54:01,220 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.28.169.122:50010 to replicate blk_-9223121983761508339_4691 to datanode(s) 10.28.169.126:50010

2012-12-17 15:54:01,220 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.28.169.122:50010 to replicate blk_-7669541053393372310_4689 to datanode(s) 10.28.169.126:50010

2012-12-17 15:54:01,220 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.28.169.122:50010 to replicate blk_7265027769022057795_4687 to datanode(s) 10.28.169.126:50010
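To see how much re-replication the NameNode scheduled, the "ask ... to replicate" lines can be tallied per source and target. This is a minimal sketch that assumes the log format shown in the lines above; `count_replications` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches: BLOCK* ask <source> to replicate <block> to datanode(s) <targets>
REPLICATE_RE = re.compile(
    r"BLOCK\* ask (\S+) to replicate (blk_-?\d+_\d+) to datanode\(s\) (.+)$")


def count_replications(log_lines):
    """Count scheduled block copies per (source, target) pair."""
    counts = Counter()
    for line in log_lines:
        match = REPLICATE_RE.search(line.strip())
        if match is None:
            continue  # not a replication request
        source, _block, targets = match.groups()
        for target in targets.split():
            counts[(source, target)] += 1
    return counts


sample = [
    "2012-12-17 15:53:58,336 INFO org.apache.hadoop.net.NetworkTopology: "
    "Removing a node: /default-rack/10.28.169.225:50010",
    "2012-12-17 15:54:01,219 INFO org.apache.hadoop.hdfs.StateChange: "
    "BLOCK* ask 10.28.169.122:50010 to replicate "
    "blk_970296822908430116_4695 to datanode(s) 10.28.169.126:50010",
    "2012-12-17 15:54:01,220 INFO org.apache.hadoop.hdfs.StateChange: "
    "BLOCK* ask 10.28.169.122:50010 to replicate "
    "blk_-9223121983761508339_4691 to datanode(s) 10.28.169.126:50010",
]

for (source, target), n in count_replications(sample).items():
    print(source, "->", target, n)
```

With a replication factor of 2 and one of three DataNodes declared dead, every block that lived on the dead node has exactly one surviving copy, so all requests in this test go from 10.28.169.122 to 10.28.169.126.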

The DN (node 126) prints the following log:

2012-12-17 15:54:02,344 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block BP-2147169311-10.28.169.122-1355378443940:blk_970296822908430116_4695 src: /10.28.169.122:54664 dest: /10.28.169.126:50010

2012-12-17 15:54:02,349 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block BP-2147169311-10.28.169.122-1355378443940:blk_-9223121983761508339_4691 src: /10.28.169.122:54663 dest: /10.28.169.126:50010

2012-12-17 15:54:02,354 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block BP-2147169311-10.28.169.122-1355378443940:blk_970296822908430116_4695 src: /10.28.169.122:54664 dest: /10.28.169.126:50010 of size 6323

2012-12-17 15:54:02,356 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block BP-2147169311-10.28.169.122-1355378443940:blk_-9223121983761508339_4691 src: /10.28.169.122:54663 dest: /10.28.169.126:50010 of size 8810

2012-12-17 15:54:05,302 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block BP-2147169311-10.28.169.122-1355378443940:blk_7265027769022057795_4687 src: /10.28.169.122:54665 dest: /10.28.169.126:50010

2012-12-17 15:54:05,303 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block BP-2147169311-10.28.169.122-1355378443940:blk_-7669541053393372310_4689 src: /10.28.169.122:54666 dest: /10.28.169.126:50010

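On the DN whose IP was changed (datanode0), after the network service restarts the DN goes silent for about twenty minutes (cause still to be determined at this point; nothing is logged). It then resumes logging and reports the following error: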

2012-12-16 02:53:54,636 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-2147169311-10.28.169.122-1355378443940:blk_970296822908430116_4695, type=LAST_IN_PIPELINE, downstreams=0:[] terminating

 

--------- gap: no log output ---------

 

2012-12-16 03:17:04,863 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService

java.io.IOException: Failed on local exception: java.io.IOException: Connection timed out; Host Details : local host is: "datanode0/10.28.169.225"; destination host is: "sdc2":9000;

      at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:760)

      at org.apache.hadoop.ipc.Client.call(Client.java:1168)

      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)

      at $Proxy10.sendHeartbeat(Unknown Source)

      at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

      at java.lang.reflect.Method.invoke(Method.java:597)

      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)

      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)

      at $Proxy10.sendHeartbeat(Unknown Source)

      at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)

      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:441)

      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)

      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673)

      at java.lang.Thread.run(Thread.java:662)

Caused by: java.io.IOException: Connection timed out

      at sun.nio.ch.FileDispatcher.read0(Native Method)

      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)

      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)

      at sun.nio.ch.IOUtil.read(IOUtil.java:171)

      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)

      at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)

      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)

      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)

      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)

      at java.io.FilterInputStream.read(FilterInputStream.java:116)

      at java.io.FilterInputStream.read(FilterInputStream.java:116)

      at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:388)

      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

      at java.io.BufferedInputStream.read(BufferedInputStream.java:237)

      at java.io.FilterInputStream.read(FilterInputStream.java:66)

      at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:276)

      at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:760)

      at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:288)

      at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:752)

      at org.apache.hadoop.ipc.protobuf.RpcPayloadHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcPayloadHeaderProtos.java:985)

      at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:886)

      at org.apache.hadoop.ipc.Client$Connection.run(Client.java:817)

2012-12-16 03:17:04,876 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_REGISTER

2012-12-16 03:17:04,881 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-2147169311-10.28.169.122-1355378443940 (storage id DS-382431371-10.28.169.225-50010-1355324528657) service to sdc2/10.28.169.122:9000 beginning handshake with NN

2012-12-16 03:17:05,877 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService

java.io.IOException: Failed on local exception: java.io.IOException: Connection timed out; Host Details : local host is: "datanode0/10.28.169.225"; destination host is: "sdc1":9000;

      at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:760)

      at org.apache.hadoop.ipc.Client.call(Client.java:1168)

      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)

      at $Proxy10.sendHeartbeat(Unknown Source)

      at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

      at java.lang.reflect.Method.invoke(Method.java:597)

      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)

      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)

      at $Proxy10.sendHeartbeat(Unknown Source)

      at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)

      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:441)

      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)

      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673)

      at java.lang.Thread.run(Thread.java:662)

Caused by: java.io.IOException: Connection timed out

      at sun.nio.ch.FileDispatcher.read0(Native Method)

      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)

      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)

      at sun.nio.ch.IOUtil.read(IOUtil.java:171)

      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)

      at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)

      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)

      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)

      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)

      at java.io.FilterInputStream.read(FilterInputStream.java:116)

      at java.io.FilterInputStream.read(FilterInputStream.java:116)

      at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:388)

      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

      at java.io.BufferedInputStream.read(BufferedInputStream.java:237)

      at java.io.FilterInputStream.read(FilterInputStream.java:66)

      at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:276)

      at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:760)

      at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:288)

      at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:752)

      at org.apache.hadoop.ipc.protobuf.RpcPayloadHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcPayloadHeaderProtos.java:985)

      at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:886)

      at org.apache.hadoop.ipc.Client$Connection.run(Client.java:817)

2012-12-16 03:17:14,956 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-2147169311-10.28.169.122-1355378443940 (storage id DS-382431371-10.28.169.225-50010-1355324528657) service to sdc2/10.28.169.122:9000 successfully registered with NN

2012-12-16 03:17:14,957 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Took 10081ms to process 1 commands from NN

2012-12-16 03:17:14,958 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action from standby: DNA_REGISTER

2012-12-16 03:17:14,969 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-2147169311-10.28.169.122-1355378443940 (storage id DS-382431371-10.28.169.225-50010-1355324528657) service to sdc1/10.28.169.121:9000 beginning handshake with NN

2012-12-16 03:17:15,005 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 1480 blocks took 6 msec to generate and 42 msecs for RPC and NN processing

2012-12-16 03:17:15,006 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@189a2557

2012-12-16 03:17:24,983 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-2147169311-10.28.169.122-1355378443940 (storage id DS-382431371-10.28.169.225-50010-1355324528657) service to sdc1/10.28.169.121:9000 successfully registered with NN

2012-12-16 03:17:24,984 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Took 10026ms to process 1 commands from NN

2012-12-16 03:17:25,073 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 1480 blocks took 8 msec to generate and 81 msecs for RPC and NN processing

2012-12-16 03:17:25,073 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: sent block report, processed command:null

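    The DN issues requests to both NNs. Because the IP address has changed, the first call to each NN fails with an IOException wrapped by NetUtils.wrapException (called WrapException in the rest of this article). After that the RPC connection succeeds, and the DN handshakes and registers with the NN, sends a block report, and receives the NN's commands in return.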

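[Figure 9: screenshot]

    As shown, compared with before the DN's IP change, only datanode0's block count is unchanged.

    Restart the DataNode service on datanode0 with hadoop-daemon.sh stop datanode followed by hadoop-daemon.sh start datanode. After the restart, datanode0 prints the following log: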

2012-12-16 03:28:20,845 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@1cecd92c

2012-12-16 03:28:20,847 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-2147169311-10.28.169.122-1355378443940.

2012-12-16 03:28:20,879 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Added bpid=BP-2147169311-10.28.169.122-1355378443940 to blockPoolScannerMap, new size=1

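    After the restart, the DN sends a block report and starts the block scanner, which verifies all blocks in the block pool that were added since the previous start.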

2147169311-10.28.169.122-1355378443940:blk_-2110417446587413716_4193

2012-12-16 03:28:32,628 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-2147169311-10.28.169.122-1355378443940:blk_-4904216641349090176_3919

2012-12-16 03:28:32,629 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-2147169311-10.28.169.122-1355378443940:blk_-2305285786919610835_4205

2012-12-16 03:28:32,630 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-2147169311-10.28.169.122-1355378443940:blk_7458965476558187997_3659

2012-12-16 03:28:32,631 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-2147169311-10.28.169.122-1355378443940:blk_8899992160584874259_3671

2012-12-16 03:28:32,642 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-2147169311-10.28.169.122-1355378443940:blk_-7669541053393372310_4689

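Throughout the period from changing DataNode0's IP address until the block counts become consistent again, the VERSION files on the DN and NN sides do not change: StorageID, ClusterID, and BlockPoolID all stay the same.

3 Results and analysis

    The tests above show:

 (1) Changing the hostname or the IP does not change the DN's StorageID, ClusterID, or BlockPoolID

    ClusterID and BlockPoolID are created when the cluster is formatted and do not change unless the cluster is formatted again. StorageID is created when the DataNode starts, by the createStorageID method in the DataStorage class.

Before creating a StorageID, the DN first reads the local VERSION file (if it exists). If the ID is empty, it creates a new StorageID; otherwise it reuses the existing one. This step is implemented by the setFieldsFromProperties method in the DataStorage class.

Because the DN was running normally before its hostname or IP was changed, the VERSION files already exist, so after the change the DN keeps using its previous ID.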
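The reuse-or-create decision described above can be sketched in a few lines. This is an illustration in Python, not the Hadoop source: `load_or_create_storage_id` is a hypothetical name, and the format of a newly generated ID is an assumption modeled on the storageID observed in this test.

```python
import random
import time


def load_or_create_storage_id(version_props, ip, port, now_ms=None):
    """Reuse the storageID from VERSION if present, else build a new one.

    version_props: dict of key/value pairs read from the VERSION file
                   (empty if the file does not exist).
    """
    existing = version_props.get("storageID", "")
    if existing:
        # The current IP and port are deliberately ignored here.
        return existing
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Assumed layout: DS-<random>-<ip>-<port>-<creation time in ms>
    return "DS-%d-%s-%d-%d" % (random.randint(0, 2**31 - 1), ip, port, now_ms)


# DataNode0 after the IP change: VERSION already holds an ID, so it is kept.
props = {"storageID": "DS-382431371-10.28.169.225-50010-1355324528657"}
print(load_or_create_storage_id(props, "10.28.169.231", 50010))

# A node with no VERSION file gets a fresh ID built from its current address.
print(load_or_create_storage_id({}, "10.28.169.231", 50010, now_ms=0))
```

The first call returns the old ID unchanged even though the node's address is now 10.28.169.231, which matches what the VERSION files showed in both tests.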

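(2) A changed DN hostname takes effect, and appears in the HDFS web UI, only after the DataNode service restarts

    A DN reads its own hostname only at service startup. It then sends the hostname to the NN during the handshake and registration, creating a new DatanodeID; this is implemented in the createBPRegistration method of the DataNode class.

(3) After the DN's IP address changes, a WrapException occurs first; the DN then handshakes and registers with the NN again and receives the NN's commands

    Before the change, the RPC connection to the NameNode was established from ip1. After the change the DN's address is ip2, so the existing RPC connection can no longer return results correctly and a WrapException is thrown. The DN then re-establishes RPC connections with both NNs, after which it receives the NNs' commands and processes them.

(4) Whether blocks are moved depends on whether the DN comes back within the timeout

1) If the IP address is changed and the NN judges the DN dead after the timeout (630 seconds), blocks are re-replicated among the surviving DataNodes whether or not the DataNode service is subsequently restarted;

2) If the hostname or IP address is changed and the DataNode service is restarted within the timeout (630 seconds), no blocks are moved among the surviving DataNodes.

    In Hadoop 2.0, the interval after which the NN considers a DN's heartbeat expired is:

heartbeatExpireInterval = 2 * heartbeatRecheckInterval + 10 * 1000 * heartbeatIntervalSeconds

Here heartbeatRecheckInterval defaults to 5*60*1000 and heartbeatIntervalSeconds defaults to 3, so the timeout is:

10 minutes + 30 seconds = 630 seconds

If the NN and DN still have no heartbeat after the 630-second timeout, the NN declares the DN dead. Every block has a target number of replicas, so when a DN dies some blocks fall below that target. The NN detects this and copies those blocks from live DataNodes, so that no block's replica count is reduced by the loss of one DN.

    If the DN service is restarted within the timeout, the DN reconnects to the NN and remains live, so replica counts do not change and no blocks are copied from one DN to another.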
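The timeout arithmetic above is easy to check in code. The sketch below evaluates the formula with the stated defaults and applies it to a Last Contact value; the function names are hypothetical, and treating "Last Contact greater than the expiry interval" as the dead condition is a simplification of what the NameNode does.

```python
def heartbeat_expire_interval_ms(recheck_interval_ms=5 * 60 * 1000,
                                 heartbeat_interval_s=3):
    """2 * heartbeatRecheckInterval + 10 * 1000 * heartbeatIntervalSeconds."""
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


def is_considered_dead(last_contact_s, expire_ms=None):
    """True once the time since the last heartbeat exceeds the expiry interval."""
    if expire_ms is None:
        expire_ms = heartbeat_expire_interval_ms()
    return last_contact_s * 1000 > expire_ms


print(heartbeat_expire_interval_ms() / 1000)  # 630.0 seconds with the defaults
print(is_considered_dead(120))   # False: a restart now avoids re-replication
print(is_considered_dead(700))   # True: the NN has started copying blocks
```

Both inputs are configurable. To my knowledge the corresponding properties are `dfs.namenode.heartbeat.recheck-interval` and `dfs.heartbeat.interval`, but check the hdfs-default.xml of your Hadoop version before relying on those names.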

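(5) After the IP address changes, it takes about 17 minutes before anything is logged on the DN

    On CentOS Linux, unacknowledged data on an established TCP connection is retransmitted up to 15 times before the kernel gives up; the retry count is set in /proc/sys/net/ipv4/tcp_retries2. Note that /proc/sys/net/ipv4/tcp_fin_timeout (1 minute) governs how long a socket may stay in FIN_WAIT_2, not the timeout of these retransmissions, whose interval grows by exponential backoff. Once the 15 retries are exhausted, the kernel discards the dead TCP connection. Meanwhile, an RPC call blocks while waiting for its reply and logs nothing, and the response timeout is 1 minute. Adding network and operational delays, it is normal for the DN to appear stalled for about 20 minutes with no heartbeat-related log output. Afterwards a new heartbeat connection is established and the DN returns to normal.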
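A rough estimate of how long the kernel keeps retrying can be worked out from the retry count. The sketch below assumes Linux's usual retransmission bounds (a minimum timeout of 200 ms, doubling on each retry, capped at 120 s); those bounds are not taken from this test, and the real figure depends on the measured round-trip time, so treat the result as a lower bound rather than a prediction.

```python
def tcp_retries2_timeout_s(retries=15, rto_min_s=0.2, rto_max_s=120.0):
    """Sum the waits for the first transmission plus `retries` retransmissions.

    Assumes the retransmission timeout starts at rto_min_s, doubles after
    every unacknowledged attempt, and never exceeds rto_max_s.
    """
    total = 0.0
    rto = rto_min_s
    for _ in range(retries + 1):
        total += min(rto, rto_max_s)
        rto *= 2
    return total


seconds = tcp_retries2_timeout_s()
print(round(seconds, 1), "s =", round(seconds / 60, 1), "min")
```

With tcp_retries2 set to 15 this gives roughly 15 minutes, which is the same order as the 17 to 20 minute stall seen on the DN once the blocked RPC wait and the re-registration time are added.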

TCP connections before the IP change:

[root@datanode0 ~]# netstat -anptl | grep 9000

tcp        0    382 10.28.169.231:42116         10.28.169.121:9000          ESTABLISHED 7546/java           

tcp        0    382 10.28.169.231:60975         10.28.169.122:9000          ESTABLISHED 7546/java  

TCP connections after the IP change:

[root@datanode0 ~]# netstat -anptl | grep 9000

tcp        0      0 10.28.169.225:55198         10.28.169.121:9000          ESTABLISHED 7546/java           

tcp        0      0 10.28.169.225:33945         10.28.169.122:9000          ESTABLISHED 7546/java  
