Before append existed, a file became immutable once it was closed, and it could not be read before being closed. After append was introduced, the last block of an unclosed file is also visible to readers, which makes the logic considerably more complex.
The Apache community's JIRA has a detailed design document for HDFS append (https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf). It covers the concepts and logic in more depth; it reads like a bible or C++ Primer and can be consulted like a dictionary. This article instead focuses on the execution flow of append and the logic around it, which should be easier to follow.
1. Concepts: block and replica
First, distinguish two terms: block and replica. On the NN we speak of blocks; on a DN we speak of replicas.
Before append, a replica on a DN had only two states: temporary and finalized. A replica is in the temporary state while it is being created and written; once the client has sent all bytes and asks the DN to close the replica, it becomes finalized. A DN restart deletes any replica still in the temporary state.
After append, the logic became much more complex and more states were added. First, the block and replica states (again: it is called a block on the NN and a replica on each DN):
A block on the NN has the following four states:
static public enum BlockUCState { COMPLETE, UNDER_CONSTRUCTION, UNDER_RECOVERY, COMMITTED; }
Note that NN block state lives only in memory and is never persisted to disk. After an NN restart, the last block of any file that was not closed becomes under_construction, and all other blocks become complete.
1) complete: the block's length and GS (generation stamp) no longer change, and the NN has received at least one DN report of a finalized replica (a DN reports replica state changes to the NN via the blockReceivedAndDeleted RPC). For a complete block, the NN keeps the locations of the finalized replicas in memory. A file can only be closed once all of its blocks are complete.
2) under_construction: when a file is created or appended, the block currently being written is in the under_construction state. Its length and GS are not finalized, but the block is visible to readers. (Exactly how many bytes are visible is learned by asking a DN; the DFSInputStream constructor issues an RPC to find out. It is in fact the number of bytes some DN has ACKed; every other replica has received at least that many bytes, so that length is attainable on any replica.)
3) under_recovery: if the client exits abnormally while a file's last block is under_construction, and the lease then expires past the softLimit, the block must go through the Lease recovery and Block recovery flows described below to release the lease and close the file. A block going through those flows is in the under_recovery state.
4) committed: while writing a file, each time the client requests a new block (the addBlock RPC) or closes the file, it also commits the previous block (moving it from under_construction to committed). At that point the client has sent all of the block's bytes to the DN pipeline and has received the ACKs, but the NN has not yet heard from any DN that a finalized replica exists.
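The four NN-side states above form a small state machine. Here is a minimal, self-contained sketch of the transitions as described in this section; the enum name mirrors HDFS's BlockUCState, but the class and the canTransition() helper are illustrative, not the real NN code:

```java
// Illustrative sketch of the NN-side block state machine described above.
// BlockUCState mirrors the HDFS enum; canTransition() is hypothetical.
public class BlockStateSketch {
    public enum BlockUCState { COMPLETE, UNDER_CONSTRUCTION, UNDER_RECOVERY, COMMITTED }

    // Transitions described in this section:
    //   UNDER_CONSTRUCTION -> COMMITTED       (client commits previous block on addBlock/close)
    //   UNDER_CONSTRUCTION -> UNDER_RECOVERY  (lease expires past softLimit, recovery starts)
    //   COMMITTED          -> COMPLETE        (some DN reports a finalized replica)
    //   UNDER_RECOVERY     -> COMMITTED/COMPLETE (recovery finishes via commitBlockSynchronization)
    public static boolean canTransition(BlockUCState from, BlockUCState to) {
        switch (from) {
            case UNDER_CONSTRUCTION:
                return to == BlockUCState.COMMITTED || to == BlockUCState.UNDER_RECOVERY;
            case COMMITTED:
                return to == BlockUCState.COMPLETE;
            case UNDER_RECOVERY:
                return to == BlockUCState.COMMITTED || to == BlockUCState.COMPLETE;
            default:
                return false; // COMPLETE is terminal
        }
    }

    public static void main(String[] args) {
        assert canTransition(BlockUCState.UNDER_CONSTRUCTION, BlockUCState.COMMITTED);
        assert canTransition(BlockUCState.COMMITTED, BlockUCState.COMPLETE);
        assert !canTransition(BlockUCState.COMPLETE, BlockUCState.UNDER_CONSTRUCTION);
        System.out.println("ok");
    }
}
```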
A replica on a DN has the following five states:
1) Finalized (represented by the FinalizedReplica class)
2) RBW (ReplicaBeingWritten, a subclass of ReplicaInPipeline): a replica just created or appended, sitting in the write pipeline and currently being written. Its bytes are nevertheless visible to readers.
3) RUR (ReplicaUnderRecovery): the state of a replica while Lease recovery and Block recovery run after the lease has expired.
4) RWR (ReplicaWaitingToBeRecovered): if a DN crashes and restarts, all of its RBW replicas become RWR. An RWR replica never appears in a pipeline; it simply waits for Lease recovery.
5) Temporary (ReplicaInPipeline): a replica in transit between DNs (e.g. during cluster rebalancing). Unlike RBW, it is not visible to readers, and a restarting DN deletes Temporary replicas outright.
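These five states are ranked by durability during recovery (Finalized best, Temporary worst), a fact syncBlock() relies on later. A minimal sketch of that ranking, assuming the enum is declared best-to-worst so the ordinal doubles as the rank (the better() helper is illustrative, not an HDFS method):

```java
// Illustrative durability ranking of DN-side replica states, as used later
// in syncBlock(). The enum mirrors HDFS's ReplicaState ordering; better()
// is a hypothetical helper, not part of the real DN code.
public class ReplicaStateSketch {
    // Declared best-to-worst, so ordinal() gives the rank directly.
    public enum ReplicaState { FINALIZED, RBW, RWR, RUR, TEMPORARY }

    // Returns the more durable ("better") of two replica states.
    public static ReplicaState better(ReplicaState a, ReplicaState b) {
        return a.ordinal() <= b.ordinal() ? a : b;
    }

    public static void main(String[] args) {
        assert better(ReplicaState.RBW, ReplicaState.RWR) == ReplicaState.RBW;
        assert better(ReplicaState.RUR, ReplicaState.FINALIZED) == ReplicaState.FINALIZED;
        assert better(ReplicaState.TEMPORARY, ReplicaState.RUR) == ReplicaState.RUR;
        System.out.println("ok");
    }
}
```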
The NN persists a block's blockId, numBytes, and GS to disk, but not its state; a DN, on the other hand, does persist replica state to disk. So after an NN restart only the last block is loaded as under_construction and all others as complete, whereas a restarting DN loads the replica states it persisted.
For the block and replica state transition diagrams, see sections 9.1 and 9.2 of https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf, which cover them in detail.
2. The write/append code flow
A file being appended to may be in one of several states:
1) It was closed normally last time, i.e. DFSOutputStream.close() -> FSNamesystem.completeFile() -> commitOrCompleteLastBlock() plus finalizeINodeFileUnderConstruction() all ran. The file's NN metadata is then a plain INode rather than an INodeFileUnderConstruction, and its last block has certainly been committed or completed. Appending to such a file is the easy case.
2) It was not closed normally last time (e.g. the client crashed), so close() and its follow-up steps never ran. The file's NN metadata is still an INodeFileUnderConstruction, the last block was never committed, and the lock bound to the file (the Lease) was never released.
Note: an HDFS Lease is essentially a write lock; HDFS only locks writes. When a client sends a create() or append() request, the NN locks the file by granting it a lease. The client then renews the lease periodically, while a Lease monitor thread on the NN checks for expiry. A lease has two expiry thresholds: softLimit (60s) and hardLimit (1 hour). The NN's Lease monitor only removes leases past the hardLimit; a lease past the softLimit has expired but is not removed there. Instead, it is handled when the next append (or an explicit recoverLease RPC) checks whether the softLimit has been exceeded.
When append (or an explicit recoverLease) runs: for a file that was closed normally last time, no lease can still exist; for a file that was not closed normally, if its lease has passed the softLimit, Lease recovery must be performed.
For a file that was not closed normally and whose lease on the NN has expired past the softLimit, the client reopening it may be the original client or a new one. Because the file was not closed properly, the last block's three replicas may be in different states, so Block recovery must first bring the replicas to a consistent state before append can proceed. In short: when appending a file, if the lease is found to have expired past the softLimit, we first have to clean up after the previous aborted write and bring the file to a properly closed state. That cleanup is Lease recovery plus Block recovery.
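The softLimit/hardLimit checks just described can be sketched as follows. This is a simplified model, not the real LeaseManager; lastRenewalMs and the helper names are illustrative, and the constants use the default values mentioned above:

```java
// Simplified sketch of the soft/hard lease-expiry checks described above.
// Not the real LeaseManager; names and structure are illustrative.
public class LeaseExpirySketch {
    static final long SOFT_LIMIT_MS = 60_000L;     // 60s: next append/recoverLease may take over
    static final long HARD_LIMIT_MS = 3_600_000L;  // 1 hour: NN's Lease monitor reclaims it

    // Past softLimit: another client's append/recoverLease may trigger Lease recovery.
    public static boolean expiredSoftLimit(long lastRenewalMs, long nowMs) {
        return nowMs - lastRenewalMs > SOFT_LIMIT_MS;
    }

    // Past hardLimit: the NN's Lease monitor thread reclaims the lease itself.
    public static boolean expiredHardLimit(long lastRenewalMs, long nowMs) {
        return nowMs - lastRenewalMs > HARD_LIMIT_MS;
    }

    public static void main(String[] args) {
        long t0 = 0L;
        assert expiredSoftLimit(t0, 61_000L) && !expiredHardLimit(t0, 61_000L);
        assert expiredHardLimit(t0, 3_600_001L);
        assert !expiredSoftLimit(t0, 59_000L); // renewed recently: still held
        System.out.println("ok");
    }
}
```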
First, recall the RPCs involved in the HDFS write path: the client creates a file, then calls addBlock to allocate a block and the DNs that will store it, builds the pipeline, and writes data. Append is similar: the client first sends an append RPC to the NN, then addBlock, then builds the pipeline and writes.
The client calls the DistributedFileSystem.append() interface, which sends an append RPC to the NN.
On the NN, append is handled in FSNamesystem.startFileInternal(), the same entry point as create(), but the internal logic branches from there.
Every append request executes this line:
recoverLeaseInternal(myFile,src,holder,clientMachine,false);
recoverLeaseInternal() performs Lease recovery, and its internal logic only deals with files in the under_construction state. If the file has finished construction (it is a plain INode), the previous close must have followed the proper close path (case 1 above), and no Lease recovery is needed.
3. Lease recovery:
In one sentence, Lease recovery is this: a file that was being written was not closed properly, the lease was never released, and the last block's replicas may disagree (in length and generationStamp); now the NN and DNs must cooperate to complete the normal close flow. The end result of Lease recovery is the same as if the file had been closed normally.
private void recoverLeaseInternal(INode fileInode, String src, String holder, String clientMachine, boolean force) throws IOException
This function performs the lease recovery: it first fetches the file's INodeFileUnderConstruction (at this point the file must be under construction); then it checks that the file's lease is consistent and whether the original holder has kept renewing it; finally, if the lease has passed the softLimit, it calls internalReleaseLease() for the next step.
boolean internalReleaseLease(Lease lease, String src, String recoveryLeaseHolder)
    throws AlreadyBeingCreatedException, IOException, UnresolvedLinkException {
  LOG.info("Recovering lease=" + lease + ", src=" + src);
  assert !isInSafeMode();
  assert hasWriteLock();

  INodeFile iFile = dir.getFileINode(src);
  if (iFile == null) {
    final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on " + src + " file does not exist.";
    NameNode.stateChangeLog.warn(message);
    throw new IOException(message);
  }
  if (!iFile.isUnderConstruction()) {
    final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on " + src + " but file is already closed.";
    NameNode.stateChangeLog.warn(message);
    throw new IOException(message);
  }

  INodeFileUnderConstruction pendingFile = (INodeFileUnderConstruction) iFile;
  int nrBlocks = pendingFile.numBlocks();
  BlockInfo[] blocks = pendingFile.getBlocks();

  int nrCompleteBlocks;
  BlockInfo curBlock = null;
  // First scan the block info the NN holds for this file, looking for any
  // block whose state is not COMPLETE.
  for (nrCompleteBlocks = 0; nrCompleteBlocks < nrBlocks; nrCompleteBlocks++) {
    curBlock = blocks[nrCompleteBlocks];
    if (!curBlock.isComplete())
      break;
    assert blockManager.checkMinReplication(curBlock) :
        "A COMPLETE block is not minimally replicated in " + src;
  }

  // If there are no incomplete blocks associated with this file,
  // then reap lease immediately and close the file.
  if (nrCompleteBlocks == nrBlocks) {
    // All blocks are COMPLETE: release the lease, turn the
    // INodeFileUnderConstruction back into an INode, and close the file.
    finalizeINodeFileUnderConstruction(src, pendingFile);
    NameNode.stateChangeLog.warn("BLOCK*"
        + " internalReleaseLease: All existing blocks are COMPLETE,"
        + " lease removed, file closed.");
    return true; // closed!
  }

  // Only the last and the penultimate blocks may be in non COMPLETE state.
  // If the penultimate block is not COMPLETE, then it must be COMMITTED.
  // Reaching here means some block is not COMPLETE, so the block has to be
  // repaired first, and only then can the file be finalized and closed.
  if (nrCompleteBlocks < nrBlocks - 2 ||
      nrCompleteBlocks == nrBlocks - 2 &&
      curBlock != null &&
      curBlock.getBlockUCState() != BlockUCState.COMMITTED) {
    final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on " + src + " but file is already closed.";
    NameNode.stateChangeLog.warn(message);
    throw new IOException(message);
  }

  // The last block is not COMPLETE, and
  // that the penultimate block if exists is either COMPLETE or COMMITTED
  final BlockInfo lastBlock = pendingFile.getLastBlock();
  BlockUCState lastBlockState = lastBlock.getBlockUCState();
  BlockInfo penultimateBlock = pendingFile.getPenultimateBlock();
  boolean penultimateBlockMinReplication;
  BlockUCState penultimateBlockState;
  if (penultimateBlock == null) {
    penultimateBlockState = BlockUCState.COMPLETE;
    // If penultimate block doesn't exist then its minReplication is met
    penultimateBlockMinReplication = true;
  } else {
    penultimateBlockState = BlockUCState.COMMITTED;
    penultimateBlockMinReplication =
        blockManager.checkMinReplication(penultimateBlock);
  }
  assert penultimateBlockState == BlockUCState.COMPLETE
      || penultimateBlockState == BlockUCState.COMMITTED :
      "Unexpected state of penultimate block in " + src;

  switch (lastBlockState) {
  case COMPLETE:
    assert false : "Already checked that the last block is incomplete";
    break;
  case COMMITTED:
    // Close file if committed blocks are minimally replicated
    if (penultimateBlockMinReplication &&
        blockManager.checkMinReplication(lastBlock)) {
      finalizeINodeFileUnderConstruction(src, pendingFile);
      NameNode.stateChangeLog.warn("BLOCK*"
          + " internalReleaseLease: Committed blocks are minimally replicated,"
          + " lease removed, file closed.");
      return true; // closed!
    }
    // Cannot close file right now, since some blocks
    // are not yet minimally replicated.
    // This may potentially cause infinite loop in lease recovery
    // if there are no valid replicas on data-nodes.
    String message = "DIR* NameSystem.internalReleaseLease: "
        + "Failed to release lease for file " + src
        + ". Committed blocks are waiting to be minimally replicated."
        + " Try again later.";
    NameNode.stateChangeLog.warn(message);
    throw new AlreadyBeingCreatedException(message);
  case UNDER_CONSTRUCTION:
  case UNDER_RECOVERY:
    final BlockInfoUnderConstruction uc = (BlockInfoUnderConstruction) lastBlock;
    // setup the last block locations from the blockManager if not known
    if (uc.getNumExpectedLocations() == 0) {
      uc.setExpectedLocations(blockManager.getNodes(lastBlock));
    }
    // start recovery of the last block for this file
    // Generate a new GS for this block; this GS is the crucial variable
    // throughout the recovery process.
    long blockRecoveryId = nextGenerationStamp();
    // Reassign the lease holder.
    // If the client explicitly called the recoverLease RPC, the new holder is
    // NAMENODE_LEASE_HOLDER, i.e. the NN holds the lease as a proxy.
    // If lease recovery was triggered indirectly through append, the new
    // holder is the calling client.
    lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
    // Initialize Block recovery: pick one of the DNs holding a replica as the
    // primary DN that drives the process (described in detail below).
    uc.initializeBlockRecovery(blockRecoveryId);
    leaseManager.renewLease(lease);
    // Cannot close file right now, since the last block requires recovery.
    // This may potentially cause infinite loop in lease recovery
    // if there are no valid replicas on data-nodes.
    NameNode.stateChangeLog.warn("DIR* NameSystem.internalReleaseLease: "
        + "File " + src + " has not been closed."
        + " Lease recovery is in progress. "
        + "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
    break;
  }
  return false;
}
4. Block recovery:
Block recovery relies on the heartbeat mechanism between the NN and DNs. Every 3s a DN sends a heartbeat to the NN. Besides updating the DN's info and refreshing lastUpdate, the NN piggybacks tasks on the reply: lease recovery, block replication, block invalidation, and updating the balancer bandwidth. All of this happens in DatanodeManager.handleHeartbeat().
Taking lease recovery as the example: each DatanodeDescriptor holds the blocks awaiting recovery in the following structure:
private BlockQueue<BlockInfoUnderConstruction> recoverBlocks = new BlockQueue<BlockInfoUnderConstruction>();
When the NN initializes block recovery it picks a primary DN and adds the block to that DN's queue. Later, when the NN handles a heartbeat RPC from the primary DN, it checks whether that DN's queue holds blocks to recover and, if so, packs them into a recover-block command sent back to the DN.
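This enqueue-then-drain-on-heartbeat pattern can be sketched as follows. This is an illustrative model, not the real DatanodeDescriptor/BlockQueue; the class and method names are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch of the per-DN recovery queue described above: the NN
// enqueues blocks needing recovery on the chosen primary DN's descriptor,
// and drains the queue into a command when that DN's heartbeat arrives.
// Names (PendingRecoveryQueue, drainForHeartbeat) are hypothetical.
public class PendingRecoveryQueue {
    private final Queue<String> recoverBlocks = new ArrayDeque<>();

    // Called when the NN initializes block recovery for this primary DN.
    public void addBlockToRecover(String blockId) {
        recoverBlocks.add(blockId);
    }

    // Called while handling this DN's heartbeat; the drained list becomes
    // the recover-block command in the heartbeat response.
    public List<String> drainForHeartbeat() {
        List<String> cmd = new ArrayList<>();
        String b;
        while ((b = recoverBlocks.poll()) != null) {
            cmd.add(b);
        }
        return cmd;
    }

    public static void main(String[] args) {
        PendingRecoveryQueue q = new PendingRecoveryQueue();
        q.addBlockToRecover("blk_1");
        q.addBlockToRecover("blk_2");
        assert q.drainForHeartbeat().size() == 2;
        assert q.drainForHeartbeat().isEmpty(); // the drain empties the queue
        System.out.println("ok");
    }
}
```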
How does the primary DN handle this command once it arrives? A DN has two components that handle RPC traffic: BPServiceActor handles the RPC conversation with the NN, while client-to-DN and DN-to-DN RPC is handled inside the DataNode main thread.
Inside BPServiceActor.run() the DN loops over connectToNNAndHandshake() and offerService(), sending heartbeats to the NN and receiving its responses. What can a response contain? Here we focus on the Block recovery command: on receiving it, the DN calls DataNode.recoverBlocks(), which starts a dedicated thread that runs DataNode.recoverBlock().
(The following function runs only on the primary DN.)
private void recoverBlock(RecoveringBlock rBlock) throws IOException {
  ExtendedBlock block = rBlock.getBlock();
  String blookPoolId = block.getBlockPoolId();
  DatanodeID[] datanodeids = rBlock.getLocations();
  List<BlockRecord> syncList = new ArrayList<BlockRecord>(datanodeids.length);
  int errorCount = 0;

  // Iterate over every DN that holds a replica of this block.
  for (DatanodeID id : datanodeids) {
    try {
      BPOfferService bpos = blockPoolManager.get(blookPoolId);
      DatanodeRegistration bpReg = bpos.bpRegistration;
      InterDatanodeProtocol datanode = bpReg.equals(id) ?
          this : DataNode.createInterDataNodeProtocolProxy(id, getConf(),
              dnConf.socketTimeout, dnConf.connectToDnViaHostname);
      // Send an initReplicaRecovery RPC to each DN holding a replica, so that
      // every one of them runs its own initReplicaRecovery. That touches the
      // DN's on-disk storage, so it ends up in FsDatasetImpl.initReplicaRecovery(),
      // whose steps are walked through below.
      ReplicaRecoveryInfo info = callInitReplicaRecovery(datanode, rBlock);
      // Decide whether the replica info reported by each DN is valid: a
      // replica whose GS is older than the block's is clearly stale.
      if (info != null &&
          info.getGenerationStamp() >= block.getGenerationStamp() &&
          info.getNumBytes() > 0) {
        // Add the valid replicas to the sync list; syncing means agreeing on
        // one consistent, externally visible length.
        syncList.add(new BlockRecord(id, datanode, info));
      }
    } catch (RecoveryInProgressException ripE) {
      // If any DN throws RecoveryInProgressException, the primary DN aborts
      // the recovery.
      InterDatanodeProtocol.LOG.warn(
          "Recovery for replica " + block + " on data-node " + id
          + " is already in progress. Recovery id = "
          + rBlock.getNewGenerationStamp() + " is aborted.", ripE);
      return;
    } catch (IOException e) {
      ++errorCount;
      InterDatanodeProtocol.LOG.warn(
          "Failed to obtain replica info for block (=" + block
          + ") from datanode (=" + id + ")", e);
    }
  }

  if (errorCount == datanodeids.length) {
    // Every DN threw an exception, so we must abort as well.
    throw new IOException("All datanodes failed: block=" + block
        + ", datanodeids=" + Arrays.asList(datanodeids));
  }
  // Negotiate the replicas into a consistent state and length.
  syncBlock(rBlock, syncList);
}
As noted above, the primary DN sends an initReplicaRecovery RPC to the other DNs holding replicas; each of them then runs Replica recovery and returns the result to the primary DN as the RPC response. Every DN holding a replica executes this function:
static ReplicaRecoveryInfo initReplicaRecovery(String bpid,
ReplicaMap map, Block block, long recoveryId) throws IOException
1) Stop the write: if a replica is being written (RBW) and has an associated writer thread, interrupt that thread and wait for it to finish. Then check that the block file on disk (bytesOnDisk) is consistent with the BR, verify that the crc file is valid, and close both the block file and the crc file. This guarantees that client writes and block recovery cannot run concurrently.
2) If the replica is already in the RUR state, a recovery may already have started for it, so check whether that recovery and this one are the same. The criterion is the recoveryId that the NN sent to the primary DN, which then fanned it out to every replica-holding DN; it is the block's new GS. If this recovery's id is older than the recoveryId recorded in the replica, throw RecoveryInProgressException; otherwise record the new id as the RUR replica's recoveryId.
3) If no recovery is running, move the replica to RUR and set its recoveryId to the new id. Every interaction between the primary DN and the other DNs is tagged with this recoveryId. For concurrent block recoveries, a newer recovery always kills an older one; two recoveries never interleave.
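Steps 2) and 3) boil down to a recoveryId arbitration rule. A minimal sketch, with hypothetical names and the exception simplified to IllegalStateException (HDFS throws RecoveryInProgressException there):

```java
// Minimal sketch of the recoveryId arbitration in steps 2) and 3) above:
// a newer recovery always preempts an older one, while an older or duplicate
// recovery id is rejected. Names are illustrative, not the real FsDatasetImpl.
public class RecoveryIdSketch {
    // currentRecoveryId < 0 means the replica is not yet in the RUR state.
    // Returns the recovery id the replica should record.
    public static long arbitrate(long currentRecoveryId, long newRecoveryId) {
        if (currentRecoveryId >= newRecoveryId) {
            // The in-progress recovery is the same or newer: reject this one.
            throw new IllegalStateException("recovery with id " + newRecoveryId
                + " is not newer than in-progress id " + currentRecoveryId);
        }
        return newRecoveryId; // move/keep the replica in RUR with the new id
    }

    public static void main(String[] args) {
        assert arbitrate(-1L, 100L) == 100L;  // fresh recovery starts
        assert arbitrate(100L, 101L) == 101L; // newer recovery preempts older
        boolean rejected = false;
        try { arbitrate(101L, 100L); } catch (IllegalStateException e) { rejected = true; }
        assert rejected;                      // older recovery is refused
        System.out.println("ok");
    }
}
```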
If nothing goes wrong, each DN runs the steps above and sends a response (an InitReplicaRecoveryResponseProto) back to the primary DN.
What does the primary DN do with the responses from the other DNs? We are back in DataNode.recoverBlock() (see the comments in the code above).
Now let's look at how DataNode.syncBlock() brings replicas of differing states and sizes to a consistent state.
The idea is simple: find the best state held by any replica. What counts as better? From best to worst: Finalized, RBW, RWR, RUR, Temporary; in effect the ranking reflects how durably the replica has been persisted. If the best state present is Finalized, that replica wins, and any replica inconsistent with it is excluded; if the best state is RBW or RWR, the shortest length among those replicas becomes the post-recovery length of all replicas.
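Before reading the full function, the length-selection rule can be condensed into a runnable sketch. The types here are illustrative stand-ins for HDFS's BlockRecord/ReplicaRecoveryInfo, and only the three states this rule distinguishes are modeled:

```java
import java.util.List;

// Simplified model of the length-selection rule in syncBlock(): if the best
// replica state present is FINALIZED, use that finalized length; if it is
// RBW or RWR, use the minimum length among replicas in the best state.
public class SyncLengthSketch {
    public enum State { FINALIZED, RBW, RWR } // declared best-to-worst

    public record Replica(State state, long numBytes) {}

    public static long agreedLength(List<Replica> replicas) {
        State best = State.RWR;
        for (Replica r : replicas) {
            if (r.state().ordinal() < best.ordinal()) best = r.state();
        }
        if (best == State.FINALIZED) {
            // All finalized replicas must agree on one length; take it.
            long len = -1;
            for (Replica r : replicas) {
                if (r.state() == State.FINALIZED) {
                    if (len > 0 && len != r.numBytes())
                        throw new IllegalStateException("inconsistent finalized sizes");
                    len = r.numBytes();
                }
            }
            return len;
        }
        // RBW/RWR: every participating replica can be truncated to the shortest.
        long min = Long.MAX_VALUE;
        for (Replica r : replicas) {
            if (r.state() == best) min = Math.min(min, r.numBytes());
        }
        return min;
    }

    public static void main(String[] args) {
        assert agreedLength(List.of(new Replica(State.RBW, 70),
                                    new Replica(State.RBW, 50),
                                    new Replica(State.RWR, 30))) == 50;
        assert agreedLength(List.of(new Replica(State.FINALIZED, 100),
                                    new Replica(State.RBW, 80))) == 100;
        System.out.println("ok");
    }
}
```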
void syncBlock(RecoveringBlock rBlock, List<BlockRecord> syncList)
    throws IOException {
  ExtendedBlock block = rBlock.getBlock();
  final String bpid = block.getBlockPoolId();
  DatanodeProtocolClientSideTranslatorPB nn =
      getActiveNamenodeForBP(block.getBlockPoolId());
  if (nn == null) {
    throw new IOException(
        "Unable to synchronize block " + rBlock + ", since this DN "
        + " has not acknowledged any NN as active.");
  }
  long recoveryId = rBlock.getNewGenerationStamp();
  if (LOG.isDebugEnabled()) {
    LOG.debug("block=" + block + ", (length=" + block.getNumBytes()
        + "), syncList=" + syncList);
  }

  // syncList.isEmpty() means that all data-nodes do not have the block
  // or their replicas have 0 length.
  // The block can be deleted.
  if (syncList.isEmpty()) {
    nn.commitBlockSynchronization(block, recoveryId, 0,
        true, true, DatanodeID.EMPTY_ARRAY, null);
    return;
  }

  // Calculate the best available replica state.
  ReplicaState bestState = ReplicaState.RWR;
  long finalizedLength = -1;
  for (BlockRecord r : syncList) {
    assert r.rInfo.getNumBytes() > 0 : "zero length replica";
    ReplicaState rState = r.rInfo.getOriginalReplicaState();
    if (rState.getValue() < bestState.getValue())
      bestState = rState;
    if (rState == ReplicaState.FINALIZED) {
      if (finalizedLength > 0 && finalizedLength != r.rInfo.getNumBytes())
        throw new IOException("Inconsistent size of finalized replicas. "
            + "Replica " + r.rInfo + " expected size: " + finalizedLength);
      finalizedLength = r.rInfo.getNumBytes();
    }
  }

  // Calculate list of nodes that will participate in the recovery
  // and the new block size
  List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
  final ExtendedBlock newBlock = new ExtendedBlock(bpid, block.getBlockId(),
      -1, recoveryId);
  switch (bestState) {
  case FINALIZED:
    assert finalizedLength > 0 : "finalizedLength is not positive";
    for (BlockRecord r : syncList) {
      ReplicaState rState = r.rInfo.getOriginalReplicaState();
      if (rState == ReplicaState.FINALIZED ||
          rState == ReplicaState.RBW &&
          r.rInfo.getNumBytes() == finalizedLength)
        participatingList.add(r);
    }
    newBlock.setNumBytes(finalizedLength);
    break;
  case RBW:
  case RWR:
    long minLength = Long.MAX_VALUE;
    for (BlockRecord r : syncList) {
      ReplicaState rState = r.rInfo.getOriginalReplicaState();
      if (rState == bestState) {
        minLength = Math.min(minLength, r.rInfo.getNumBytes());
        participatingList.add(r);
      }
    }
    newBlock.setNumBytes(minLength);
    break;
  case RUR:
  case TEMPORARY:
    assert false : "bad replica state: " + bestState;
  }

  List<DatanodeID> failedList = new ArrayList<DatanodeID>();
  final List<BlockRecord> successList = new ArrayList<BlockRecord>();
  for (BlockRecord r : participatingList) {
    try {
      // Tell each participating DN, via the InterDatanodeProtocol RPC used
      // only during block recovery, to update its replica's length and GS.
      // Each DN updates them and then turns the replica into Finalized.
      r.updateReplicaUnderRecovery(bpid, recoveryId, newBlock.getNumBytes());
      successList.add(r);
    } catch (IOException e) {
      InterDatanodeProtocol.LOG.warn("Failed to updateBlock (newblock="
          + newBlock + ", datanode=" + r.id + ")", e);
      failedList.add(r.id);
    }
  }

  // If any of the data-nodes failed, the recovery fails, because
  // we never know the actual state of the replica on failed data-nodes.
  // The recovery should be started over.
  if (!failedList.isEmpty()) {
    StringBuilder b = new StringBuilder();
    for (DatanodeID id : failedList) {
      b.append("\n  " + id);
    }
    throw new IOException("Cannot recover " + block + ", the following "
        + failedList.size() + " data-nodes failed {" + b + "\n}");
  }

  // Notify the name-node about successfully recovered replicas.
  final DatanodeID[] datanodes = new DatanodeID[successList.size()];
  final String[] storages = new String[datanodes.length];
  for (int i = 0; i < datanodes.length; i++) {
    final BlockRecord r = successList.get(i);
    datanodes[i] = r.id;
    storages[i] = r.storageID;
  }
  // Tell the NN via RPC that the block recovery completed; the NN persists
  // the metadata, commits/completes the block, and closes the file.
  nn.commitBlockSynchronization(block,
      newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
      datanodes, storages);
}
After Lease recovery and Block recovery finish, the previously unclosed file has been closed and is back in a normally closed state. At that point an append is no different from appending to any properly closed file: addBlock to obtain the DNs for the replicas, build the pipeline, and write data to the DNs. See the previous post (HDFS write flow and code analysis) for that part.
This flow once again drives home that in systems-level software, more than 60% of the code exists to handle failure cases. A single unclosed file drags in this much extra code...
References:
https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf
http://blog.csdn.net/chenpingbupt/article/details/7972589