Before append existed, a file became immutable once it was closed, and it could not be read before it was closed. After append was introduced, the last block of an unclosed file is also visible to readers, which makes the logic much more complicated.
The Apache community's JIRA has a detailed design document for HDFS append (https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf). It covers the concepts and logic in greater depth; it reads like a bible or the C++ Primer and can be consulted like a dictionary. This post instead focuses on the execution flow of append and the logic around it, which should be easier to follow.
1. Concepts: block and replica
First, distinguish two terms: block and replica. In the NN we say block; in the DN we say replica.
Before append, a replica on a DN had only two states: temporary and finalized. A replica was temporary while being created and written; once the client had sent all its bytes and asked the DN to close the replica, it became finalized. A DN restart deleted any replicas still in the temporary state.
After append, the logic grew much more complicated and more states appeared. Let's first lay out the block and replica states (again, to tell them apart: in the NN it's a block, on each DN it's a replica).
A block in the NN can be in one of four states:
static public enum BlockUCState {
  COMPLETE,
  UNDER_CONSTRUCTION,
  UNDER_RECOVERY,
  COMMITTED;
}
Note that a block's state in the NN lives only in memory and is never persisted to disk. After an NN restart, the last block of any file that was not closed becomes UNDER_CONSTRUCTION, and all other blocks become COMPLETE.
1) COMPLETE: the block's length and GS (generation stamp) no longer change, and the NN has heard from at least one DN that it holds a finalized replica (a DN reports replica state changes to the NN via the blockReceivedAndDeleted RPC). For a COMPLETE block the NN keeps the locations of the finalized replicas in memory. A file can only be closed when all of its blocks are COMPLETE.
2) UNDER_CONSTRUCTION: when a file is created or appended to, the block currently being written is UNDER_CONSTRUCTION. Neither its length nor its GS is final, but a block in this state is visible to readers. (Exactly how many bytes are visible is learned by asking a DN; the DFSInputStream constructor issues an RPC to find out. It is in fact the number of bytes some DN has ACKed; every replica has received at least as many bytes as any DN has ACKed, so this length is reachable on every replica.)
3) UNDER_RECOVERY: if the client exits abnormally while a file's last block is UNDER_CONSTRUCTION and the lease expires past the softLimit, that block must go through the Lease recovery and Block recovery flows described below in order to release the lease and close the file. A block going through those flows is UNDER_RECOVERY.
4) COMMITTED: while writing a file, each time the client requests a new block (the addBlock RPC) or closes the file, it also commits the previous block (moving it from UNDER_CONSTRUCTION to COMMITTED). At that point the client has sent all of the block's bytes to the DN pipeline and received the ACKs, but the NN has not yet heard from any DN that it holds a finalized replica.
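As a tiny illustration of the COMPLETE rule above, here is a sketch (not HDFS source; the class and method names are made up) of the close-eligibility check: a file may be closed only when every one of its blocks is COMPLETE.

```java
// Sketch only -- not HDFS source. Illustrates the close-eligibility rule:
// a file can be closed only when all of its blocks are COMPLETE.
public class BlockStateSketch {
    public enum BlockUCState { COMPLETE, UNDER_CONSTRUCTION, UNDER_RECOVERY, COMMITTED }

    public static boolean canCloseFile(BlockUCState[] blocks) {
        for (BlockUCState s : blocks) {
            if (s != BlockUCState.COMPLETE) {
                return false;  // some block is still being written, recovered, or only committed
            }
        }
        return true;
    }
}
```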
A replica on a DN can be in one of five states:
1) Finalized (represented by the FinalizedReplica class).
2) RBW (ReplicaBeingWritten, a subclass of ReplicaInPipeline): a replica that was just created or appended to, sitting in the write pipeline and actively being written. Its bytes are nonetheless visible to readers.
3) RUR (ReplicaUnderRecovery): the state a replica is in while Lease and Block recovery run after the lease has expired.
4) RWR (ReplicaWaitingToBeRecovered): if a DN crashes and restarts, all of its RBW replicas become RWR. An RWR replica is no longer part of any pipeline; it simply waits for Lease recovery.
5) Temporary (ReplicaInPipeline): a replica in transit between DNs (e.g. during cluster rebalancing). Unlike RBW it is not visible to readers, and a restarting DN simply deletes any Temporary replicas.
The NN persists a block's blockId, numBytes, and GS to disk, but not its state; a DN, on the other hand, does persist replica state to disk. So after an NN restart only the last block is loaded as UNDER_CONSTRUCTION and everything else as COMPLETE, whereas a restarting DN loads whatever replica states it had persisted.
Sections 9.1 and 9.2 of https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf describe the block and replica state transitions in detail.
2. The write/append code flow
A file being appended to may be in one of several prior states:
1) It was closed cleanly last time: DFSOutputStream.close() -> FSNamesystem.completeFile() -> commitOrCompleteLastBlock() plus finalizeINodeFileUnderConstruction() all ran, so in the NN the file is an INode rather than an INodeUnderConstruction, and the last block has certainly been committed or completed. Appending to such a file is the easy case.
2) It was not closed cleanly (e.g. the client died), so close() and its follow-up work never ran. The file's metadata in the NN is still an INodeUnderConstruction, the last block was never committed, and the lock bound to the file (the Lease) was never released.
A note on leases: an HDFS Lease is essentially a write lock; HDFS only locks writes. When a client sends create() or append(), the NN locks the file for it, i.e. grants a lease. The client then periodically renews the lease, while a Lease monitor thread on the NN checks for expiry. There are two expiry thresholds: softLimit (60 s) and hardLimit (1 hour). The NN's Lease monitor only removes leases past the hardLimit; a lease past the softLimit has expired too, but it is not removed there. It is handled the next time append (or an explicit recoverLease RPC) checks it against the softLimit.
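The two-tier expiry policy can be sketched like this (names and structure are illustrative, not the NN's actual code; only the two limits come from the text above):

```java
// Sketch only -- illustrates the softLimit/hardLimit split described above.
public class LeaseSketch {
    public static final long SOFT_LIMIT_MS = 60_000L;     // 60 s
    public static final long HARD_LIMIT_MS = 3_600_000L;  // 1 hour

    // The NN's background Lease monitor only reclaims leases past the hard limit.
    public static boolean monitorReclaims(long msSinceLastRenew) {
        return msSinceLastRenew > HARD_LIMIT_MS;
    }

    // append()/recoverLease() trigger Lease recovery once the soft limit passes.
    public static boolean appendTriggersRecovery(long msSinceLastRenew) {
        return msSinceLastRenew > SOFT_LIMIT_MS;
    }
}
```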
At append time (or on an explicit recoverLease), a file that was closed cleanly no longer has a lease; for a file that was not closed cleanly, if the lease is past the softLimit, Lease recovery must be run.
For a file that was not closed cleanly and whose lease on the NN has expired past the softLimit, the client reopening it may be the original client or a new one. Because the last close was abnormal, the (typically three) replicas of the last block may be in different states, so Block recovery must first bring them to a consistent state before append can proceed. Put bluntly: if append finds a lease expired past the softLimit, it must first clean up after the previous failed writer and bring the file to a properly closed state. That cleanup is Lease recovery plus Block recovery.
Recall the RPCs in the HDFS write flow: the client first creates a file, then calls addBlock to allocate a block and the DNs to hold it, builds the pipeline, and writes data. Append is similar: the client first sends an append RPC to the NN, then addBlock, then builds the pipeline and writes.
The client calls the DistributedFileSystem.append() interface, which sends the append RPC to the NN.
On the NN, append reaches FSNamesystem.startFileInternal(), the same entry point as create(), but the internal logic then branches.
Every append request executes this line:
recoverLeaseInternal(myFile, src, holder, clientMachine, false);
recoverLeaseInternal() performs Lease recovery, and internally it only handles files that are still under construction. If the file has finished construction (it is a plain INode), the last close must have gone through the proper close flow (case 1 above), so no Lease recovery is needed.
3. Lease recovery
In one sentence, Lease recovery is: a file being written was not closed properly last time, the lease was never released, and the replicas of the last block may disagree (in length and generationStamp); now the NN and DNs cooperate to run the normal file-close flow to completion, so that the end result is the same as if the file had been closed cleanly.
private void recoverLeaseInternal(INode fileInode,
    String src, String holder, String clientMachine, boolean force)
    throws IOException
This function recovers the lease: it first fetches the file's INodeFileUnderConstruction (at this point the file must be under construction); then it checks that the file's lease is consistent and whether the original holder has kept renewing it; finally, if the lease is past the softLimit, it calls internalReleaseLease() for the next step.
boolean internalReleaseLease(Lease lease, String src,
    String recoveryLeaseHolder) throws AlreadyBeingCreatedException,
    IOException, UnresolvedLinkException {
  LOG.info("Recovering lease=" + lease + ", src=" + src);
  assert !isInSafeMode();
  assert hasWriteLock();
  INodeFile iFile = dir.getFileINode(src);
  if (iFile == null) {
    final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on "
        + src + " file does not exist.";
    NameNode.stateChangeLog.warn(message);
    throw new IOException(message);
  }
  if (!iFile.isUnderConstruction()) {
    final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on "
        + src + " but file is already closed.";
    NameNode.stateChangeLog.warn(message);
    throw new IOException(message);
  }
  INodeFileUnderConstruction pendingFile = (INodeFileUnderConstruction) iFile;
  int nrBlocks = pendingFile.numBlocks();
  BlockInfo[] blocks = pendingFile.getBlocks();
  int nrCompleteBlocks;
  BlockInfo curBlock = null;
  // First scan the NN's block list for this file to see whether any block
  // is not COMPLETE.
  for (nrCompleteBlocks = 0; nrCompleteBlocks < nrBlocks; nrCompleteBlocks++) {
    curBlock = blocks[nrCompleteBlocks];
    if (!curBlock.isComplete())
      break;
    assert blockManager.checkMinReplication(curBlock) :
      "A COMPLETE block is not minimally replicated in " + src;
  }
  // If there are no incomplete blocks associated with this file,
  // then reap lease immediately and close the file.
  if (nrCompleteBlocks == nrBlocks) {
    // All blocks are COMPLETE: release the lease, turn the
    // INodeUnderConstruction back into an INode, and close the file.
    finalizeINodeFileUnderConstruction(src, pendingFile);
    NameNode.stateChangeLog.warn("BLOCK*"
        + " internalReleaseLease: All existing blocks are COMPLETE,"
        + " lease removed, file closed.");
    return true;  // closed!
  }
  // Only the last and the penultimate blocks may be in non COMPLETE state.
  // If the penultimate block is not COMPLETE, then it must be COMMITTED.
  // Reaching here means some block is not COMPLETE, so the block must be
  // repaired before the file can be finalized and closed.
  if (nrCompleteBlocks < nrBlocks - 2 ||
      nrCompleteBlocks == nrBlocks - 2 &&
      curBlock != null &&
      curBlock.getBlockUCState() != BlockUCState.COMMITTED) {
    final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on "
        + src + " but file is already closed.";
    NameNode.stateChangeLog.warn(message);
    throw new IOException(message);
  }
  // The last block is not COMPLETE, and
  // that the penultimate block if exists is either COMPLETE or COMMITTED
  final BlockInfo lastBlock = pendingFile.getLastBlock();
  BlockUCState lastBlockState = lastBlock.getBlockUCState();
  BlockInfo penultimateBlock = pendingFile.getPenultimateBlock();
  boolean penultimateBlockMinReplication;
  BlockUCState penultimateBlockState;
  if (penultimateBlock == null) {
    penultimateBlockState = BlockUCState.COMPLETE;
    // If penultimate block doesn't exist then its minReplication is met
    penultimateBlockMinReplication = true;
  } else {
    penultimateBlockState = BlockUCState.COMMITTED;
    penultimateBlockMinReplication =
        blockManager.checkMinReplication(penultimateBlock);
  }
  assert penultimateBlockState == BlockUCState.COMPLETE ||
      penultimateBlockState == BlockUCState.COMMITTED :
    "Unexpected state of penultimate block in " + src;
  switch (lastBlockState) {
  case COMPLETE:
    assert false : "Already checked that the last block is incomplete";
    break;
  case COMMITTED:
    // Close file if committed blocks are minimally replicated
    if (penultimateBlockMinReplication &&
        blockManager.checkMinReplication(lastBlock)) {
      finalizeINodeFileUnderConstruction(src, pendingFile);
      NameNode.stateChangeLog.warn("BLOCK*"
          + " internalReleaseLease: Committed blocks are minimally replicated,"
          + " lease removed, file closed.");
      return true;  // closed!
    }
    // Cannot close file right now, since some blocks
    // are not yet minimally replicated.
    // This may potentially cause infinite loop in lease recovery
    // if there are no valid replicas on data-nodes.
    String message = "DIR* NameSystem.internalReleaseLease: " +
        "Failed to release lease for file " + src +
        ". Committed blocks are waiting to be minimally replicated." +
        " Try again later.";
    NameNode.stateChangeLog.warn(message);
    throw new AlreadyBeingCreatedException(message);
  case UNDER_CONSTRUCTION:
  case UNDER_RECOVERY:
    final BlockInfoUnderConstruction uc = (BlockInfoUnderConstruction) lastBlock;
    // setup the last block locations from the blockManager if not known
    if (uc.getNumExpectedLocations() == 0) {
      uc.setExpectedLocations(blockManager.getNodes(lastBlock));
    }
    // start recovery of the last block for this file
    // Generate a new GS for this block; this GS is a crucial variable
    // throughout the recovery process.
    long blockRecoveryId = nextGenerationStamp();
    // Reassign the lease holder.
    // If the client explicitly called the recoverLease RPC, the new holder is
    // NAMENODE_LEASE_HOLDER, i.e. the NN holds the lease as a proxy.
    // If lease recovery was triggered indirectly through append, the new
    // holder is the calling client.
    lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
    // Initialize Block recovery: pick one DN holding a replica as the
    // primary DN to drive the process (described in detail below).
    uc.initializeBlockRecovery(blockRecoveryId);
    leaseManager.renewLease(lease);
    // Cannot close file right now, since the last block requires recovery.
    // This may potentially cause infinite loop in lease recovery
    // if there are no valid replicas on data-nodes.
    NameNode.stateChangeLog.warn(
        "DIR* NameSystem.internalReleaseLease: " +
        "File " + src + " has not been closed." +
        " Lease recovery is in progress. " +
        "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
    break;
  }
  return false;
}
4. Block recovery
Block recovery relies on the heartbeat mechanism between the NN and DNs. Every 3 s each DN sends a heartbeat to the NN; on receipt, besides updating the DN's info and refreshing lastUpdate, the NN piggybacks work onto the response: lease recovery, block replication, block invalidation, and balancer bandwidth updates. All of this happens in DatanodeManager.handleHeartbeat().
Taking lease recovery as the example, each DatanodeDescriptor holds the blocks waiting to be recovered in this structure:
private BlockQueue recoverBlocks = new BlockQueue();
When the NN initializes block recovery it picks a primary DN and enqueues the block on that DN's queue. Then, while handling a heartbeat RPC from the primary DN, the NN checks that DN's queue for blocks needing recovery and, if any, packs them into a recover-block command in the heartbeat response.
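The enqueue-then-drain-on-heartbeat pattern just described can be sketched as follows (class and method names are illustrative, not the NN's actual code):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch only -- the per-DN recovery queue pattern described above: the NN
// enqueues blocks for the chosen primary DN, and the heartbeat handler drains
// them into the heartbeat response as recover-block commands.
public class RecoverQueueSketch {
    private final Queue<Long> recoverBlocks = new ArrayDeque<Long>();

    // Called when the NN initializes block recovery and picks this DN as primary.
    public void enqueue(long blockId) {
        recoverBlocks.add(blockId);
    }

    // Called while handling this DN's heartbeat: drain pending recovery work.
    public long[] drainForHeartbeat() {
        long[] cmds = new long[recoverBlocks.size()];
        for (int i = 0; i < cmds.length; i++) {
            cmds[i] = recoverBlocks.remove();
        }
        return cmds;
    }
}
```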
How is that command handled once it reaches the primary DN? A DN has two components handling RPC traffic: BPServiceActor handles the RPC conversation with the NN, while client-to-DN and DN-to-DN RPC conversations are handled inside the DataNode main thread.
Inside BPServiceActor.run() the DN loops over connectToNNAndHandshake() and offerService(), sending heartbeats to the NN and receiving its responses. What can those responses contain? Here we care about the Block recovery command: the DN calls DataNode.recoverBlocks(), which spawns a dedicated thread to do the Block recovery; that thread runs DataNode.recoverBlock().
(The following function runs only on the primary DN.)
private void recoverBlock(RecoveringBlock rBlock) throws IOException {
  ExtendedBlock block = rBlock.getBlock();
  String blookPoolId = block.getBlockPoolId();
  DatanodeID[] datanodeids = rBlock.getLocations();
  List<BlockRecord> syncList = new ArrayList<BlockRecord>(datanodeids.length);
  int errorCount = 0;
  // Iterate over every DN holding a replica of this block.
  for (DatanodeID id : datanodeids) {
    try {
      BPOfferService bpos = blockPoolManager.get(blookPoolId);
      DatanodeRegistration bpReg = bpos.bpRegistration;
      InterDatanodeProtocol datanode = bpReg.equals(id) ?
          this : DataNode.createInterDataNodeProtocolProxy(id, getConf(),
              dnConf.socketTimeout, dnConf.connectToDnViaHostname);
      // Send an initReplicaRecovery RPC to each DN holding a replica, so that
      // each runs its own initReplicaRecovery. This touches the DN's on-disk
      // storage, so it ends up in FsDatasetImpl.initReplicaRecovery(),
      // described below.
      ReplicaRecoveryInfo info = callInitReplicaRecovery(datanode, rBlock);
      // Decide whether the replica info reported by each DN is valid: a
      // replica whose GS is older than the block's is clearly stale.
      if (info != null &&
          info.getGenerationStamp() >= block.getGenerationStamp() &&
          info.getNumBytes() > 0) {
        // Add valid replicas to the sync list; syncing means agreeing on a
        // single externally visible length.
        syncList.add(new BlockRecord(id, datanode, info));
      }
    } catch (RecoveryInProgressException ripE) {
      // If any DN throws RecoveryInProgressException, the primary DN aborts.
      InterDatanodeProtocol.LOG.warn(
          "Recovery for replica " + block + " on data-node " + id
          + " is already in progress. Recovery id = "
          + rBlock.getNewGenerationStamp() + " is aborted.", ripE);
      return;
    } catch (IOException e) {
      ++errorCount;
      InterDatanodeProtocol.LOG.warn(
          "Failed to obtain replica info for block (=" + block
          + ") from datanode (=" + id + ")", e);
    }
  }
  if (errorCount == datanodeids.length) {
    // Every DN threw an exception: nothing left to do but abort.
    throw new IOException("All datanodes failed: block=" + block
        + ", datanodeids=" + Arrays.asList(datanodeids));
  }
  // Negotiate a consistent state (length etc.) across the replicas.
  syncBlock(rBlock, syncList);
}
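The replica-validity filter applied in the loop above boils down to two checks. A sketch (not HDFS source; the helper name is made up):

```java
// Sketch only -- the replica-validity test used when building syncList:
// a replica is eligible only if its generation stamp is not older than the
// block's and it actually holds data.
public class ReplicaFilterSketch {
    public static boolean isValidForSync(long replicaGs, long blockGs, long numBytes) {
        return replicaGs >= blockGs && numBytes > 0;
    }
}
```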
As noted above, the primary DN sends the initReplicaRecovery RPC to the other DNs holding replicas; each then runs Replica Recovery locally and returns the result as its RPC response. Every DN holding a replica executes this function:
static ReplicaRecoveryInfo initReplicaRecovery(String bpid,
    ReplicaMap map, Block block, long recoveryId) throws IOException
1) Stop writes: if the replica is in the write state (RBW) and has a writer thread, interrupt that thread and wait for it to finish. Then check that the block file on disk (bytesOnDisk) and the BR agree and that the crc file is valid, and close both the block file and the crc file. This ensures client writes and block recovery cannot run concurrently.
2) If the replica is already RUR, a recovery may already be underway. Check whether it is the same recovery as this one: the criterion is the recoveryId the NN handed to the primary DN and fanned out to every replica-holding DN, i.e. the block's new GS. If this request's id is older than the recoveryId recorded in the replica, throw RecoveryInProgressException; otherwise set the RUR replica's recoveryId to the new id.
3) If no recovery is running, move the replica to RUR and set its recoveryId to the new id. Every interaction between the primary DN and the other DNs is tagged with this recoveryId. For concurrent block recoveries, a newer recovery always kills an older one; two recoveries never interleave.
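The recoveryId comparison in steps 2) and 3) can be sketched like this (illustrative names, not HDFS source; the real code throws RecoveryInProgressException where this sketch returns false):

```java
// Sketch only -- a newer recovery preempts an older one; a stale id is rejected.
public class RecoveryIdSketch {
    private long currentRecoveryId = 0;

    // Returns true if this recovery may proceed (and records its id).
    public boolean tryStartRecovery(long newRecoveryId) {
        if (newRecoveryId <= currentRecoveryId) {
            return false;  // stale or duplicate recovery: reject it
        }
        currentRecoveryId = newRecoveryId;  // newer recovery kills the older one
        return true;
    }
}
```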
If nothing goes wrong, each DN then sends its response (an InitReplicaRecoveryResponseProto) back to the primary DN.
What does the primary DN do with these responses? We are back inside DataNode.recoverBlock() (see the comments in the code above).
Now let's see how DataNode.syncBlock() brings replicas of differing states and sizes to a consistent state.
The idea is simple: find the best state held by any replica. What counts as better? From best to worst: Finalized, RBW, RWR, RUR, Temporary; the ranking reflects how durable the replica is. If the best state present is Finalized, that replica's length wins, and any replica inconsistent with it is excluded; if the best state is RBW or RWR, the shortest replica among those in the best state sets the post-recovery length for everyone.
void syncBlock(RecoveringBlock rBlock,
    List<BlockRecord> syncList) throws IOException {
  ExtendedBlock block = rBlock.getBlock();
  final String bpid = block.getBlockPoolId();
  DatanodeProtocolClientSideTranslatorPB nn =
      getActiveNamenodeForBP(block.getBlockPoolId());
  if (nn == null) {
    throw new IOException(
        "Unable to synchronize block " + rBlock + ", since this DN "
        + " has not acknowledged any NN as active.");
  }
  long recoveryId = rBlock.getNewGenerationStamp();
  if (LOG.isDebugEnabled()) {
    LOG.debug("block=" + block + ", (length=" + block.getNumBytes()
        + "), syncList=" + syncList);
  }
  // syncList.isEmpty() means that all data-nodes do not have the block
  // or their replicas have 0 length.
  // The block can be deleted.
  if (syncList.isEmpty()) {
    nn.commitBlockSynchronization(block, recoveryId, 0,
        true, true, DatanodeID.EMPTY_ARRAY, null);
    return;
  }
  // Calculate the best available replica state.
  ReplicaState bestState = ReplicaState.RWR;
  long finalizedLength = -1;
  for (BlockRecord r : syncList) {
    assert r.rInfo.getNumBytes() > 0 : "zero length replica";
    ReplicaState rState = r.rInfo.getOriginalReplicaState();
    if (rState.getValue() < bestState.getValue())
      bestState = rState;
    if (rState == ReplicaState.FINALIZED) {
      if (finalizedLength > 0 && finalizedLength != r.rInfo.getNumBytes())
        throw new IOException("Inconsistent size of finalized replicas. " +
            "Replica " + r.rInfo + " expected size: " + finalizedLength);
      finalizedLength = r.rInfo.getNumBytes();
    }
  }
  // Calculate list of nodes that will participate in the recovery
  // and the new block size
  List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
  final ExtendedBlock newBlock = new ExtendedBlock(bpid, block.getBlockId(),
      -1, recoveryId);
  switch (bestState) {
  case FINALIZED:
    assert finalizedLength > 0 : "finalizedLength is not positive";
    for (BlockRecord r : syncList) {
      ReplicaState rState = r.rInfo.getOriginalReplicaState();
      if (rState == ReplicaState.FINALIZED ||
          rState == ReplicaState.RBW &&
          r.rInfo.getNumBytes() == finalizedLength)
        participatingList.add(r);
    }
    newBlock.setNumBytes(finalizedLength);
    break;
  case RBW:
  case RWR:
    long minLength = Long.MAX_VALUE;
    for (BlockRecord r : syncList) {
      ReplicaState rState = r.rInfo.getOriginalReplicaState();
      if (rState == bestState) {
        minLength = Math.min(minLength, r.rInfo.getNumBytes());
        participatingList.add(r);
      }
    }
    newBlock.setNumBytes(minLength);
    break;
  case RUR:
  case TEMPORARY:
    assert false : "bad replica state: " + bestState;
  }
  List<DatanodeID> failedList = new ArrayList<DatanodeID>();
  final List<BlockRecord> successList = new ArrayList<BlockRecord>();
  for (BlockRecord r : participatingList) {
    try {
      // Use the InterDatanodeProtocol RPC (dedicated to block recovery) to
      // tell each DN to update its replica's length and GS. Each DN updates
      // both and then turns the replica into Finalized.
      r.updateReplicaUnderRecovery(bpid, recoveryId, newBlock.getNumBytes());
      successList.add(r);
    } catch (IOException e) {
      InterDatanodeProtocol.LOG.warn("Failed to updateBlock (newblock="
          + newBlock + ", datanode=" + r.id + ")", e);
      failedList.add(r.id);
    }
  }
  // If any of the data-nodes failed, the recovery fails, because
  // we never know the actual state of the replica on failed data-nodes.
  // The recovery should be started over.
  if (!failedList.isEmpty()) {
    StringBuilder b = new StringBuilder();
    for (DatanodeID id : failedList) {
      b.append("\n  " + id);
    }
    throw new IOException("Cannot recover " + block + ", the following "
        + failedList.size() + " data-nodes failed {" + b + "\n}");
  }
  // Notify the name-node about successfully recovered replicas.
  final DatanodeID[] datanodes = new DatanodeID[successList.size()];
  final String[] storages = new String[datanodes.length];
  for (int i = 0; i < datanodes.length; i++) {
    final BlockRecord r = successList.get(i);
    datanodes[i] = r.id;
    storages[i] = r.storageID;
  }
  // RPC the NN that block recovery completed successfully; the NN persists
  // the metadata, commits/completes the block, and closes the file.
  nn.commitBlockSynchronization(block,
      newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
      datanodes, storages);
}
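The best-state/agreed-length rule in syncBlock can be distilled into a few lines. This is a sketch only (the real code also excludes RBW replicas whose length differs from the finalized length, tracks the participating DNs, and handles RUR/Temporary as errors):

```java
import java.util.List;

// Sketch only -- distills syncBlock's length decision: if any replica is
// FINALIZED, its length wins; otherwise the shortest replica among those in
// the best (most durable) state wins.
public class SyncLengthSketch {
    public enum ReplicaState { FINALIZED, RBW, RWR }  // ordered best to worst

    public static long agreedLength(List<ReplicaState> states, List<Long> lengths) {
        ReplicaState best = ReplicaState.RWR;
        for (ReplicaState s : states) {
            if (s.ordinal() < best.ordinal()) best = s;  // smaller ordinal = more durable
        }
        long result = Long.MAX_VALUE;
        for (int i = 0; i < states.size(); i++) {
            if (states.get(i) != best) continue;
            if (best == ReplicaState.FINALIZED) {
                result = lengths.get(i);                    // finalized lengths must all agree
            } else {
                result = Math.min(result, lengths.get(i));  // shortest in the best state
            }
        }
        return result;
    }
}
```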
Once Lease recovery and Block recovery have run, the previously unclosed file has been closed and restored to a normally closed state. From here, append is no different from appending to a cleanly closed file: addBlock to get the DNs holding the replicas, build the pipeline, and write data to the DNs, as covered in the previous post (HDFS write flow and code analysis).
This flow drives home, once again, that in systems-level software more than 60% of the code exists to handle failure cases. A single unclosed file costs us this much machinery...
References:
https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf
http://blog.csdn.net/chenpingbupt/article/details/7972589