HDFS Append Internals and Code Analysis (Hadoop 2.0)

Before append existed, a file became immutable once closed, and could not be read before being closed. After append was introduced, the last block of an unclosed file is also visible to readers, which makes the logic considerably more complex.

The Apache community's JIRA has a detailed design document for HDFS append (https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf). It covers the concepts and logic in more depth; that document reads like a bible or a C++ Primer and works well as a reference to look things up in. This article focuses on the execution flow of append and the logic around it, and should be easier to follow.

1. Concepts: block and replica

First, distinguish two terms: block and replica. On the NameNode (NN) we call it a block; on a DataNode (DN) we call it a replica.

Before append, a replica on a DN had only two states: temporary and finalized. A replica is temporary while it is being created and written; once the client has sent all the bytes and asks the DN to close the replica, it becomes finalized. A DN restart deletes replicas in the temporary state.

After append was introduced the logic became much more complex, and more states were added. Let's first pin down the block and replica states (again: it is a block on the NN, a replica on each DN).

A block on the NN can be in one of the following four states:

static public enum BlockUCState {
    COMPLETE,
    UNDER_CONSTRUCTION,
    UNDER_RECOVERY,
    COMMITTED;
  }

Note that a block's state on the NN is kept only in memory and is never persisted to disk. After an NN restart, the last block of any file that was not closed becomes UNDER_CONSTRUCTION, and all other blocks become COMPLETE.

1) complete: the block's length and generation stamp (GS) no longer change, and the NN has received at least one DN's report of a finalized replica (a DN reports replica state changes to the NN via the blockReceivedAndDeleted RPC). For a COMPLETE block the NN keeps the locations of the finalized replicas in memory. A file can only be closed once all of its blocks are COMPLETE.

2) under_construction: when a file is created or appended to, the block currently being written is in the UNDER_CONSTRUCTION state. Its length and GS are not yet final, but the block is visible to readers. (Exactly how many bytes are visible is learned by asking a DN: the DFSInputStream constructor issues an RPC to fetch it; see the sketch after this list. The value is the number of bytes ACKed by some DN, and every replica has received at least as many bytes as any DN has ACKed, so this length is attainable on every replica.)

3) under_recovery: if the client exits abnormally while a file's last block is UNDER_CONSTRUCTION, and the lease then expires past the softLimit, the block must go through the lease recovery and block recovery flows described below to release the lease and close the file. A block going through those flows is in the UNDER_RECOVERY state.

4) committed: while writing a file, each time the client requests a new block (the addBlock RPC) or closes the file, it also commits the previous block (the previous block goes from UNDER_CONSTRUCTION to COMMITTED). At that point the client has sent all of the block's bytes to the DN pipeline and has received the ACKs, but the NN has not yet heard from any DN that a finalized replica exists.
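How does a reader learn the visible length mentioned in 2)? Below is a condensed sketch adapted from DFSInputStream.readBlockLength(); the proxy-creation arguments are abridged (socketTimeout, connectToDnViaHostname and the surrounding retry accounting are simplified), so treat it as illustrative rather than the exact source:

// Condensed sketch: ask the DNs holding the under-construction last block
// how many bytes are visible, i.e. ACKed on that DN.
private long readBlockLength(LocatedBlock locatedblock) throws IOException {
  for (DatanodeInfo datanode : locatedblock.getLocations()) {
    ClientDatanodeProtocol cdp = null;
    try {
      cdp = DFSUtil.createClientDatanodeProtocolProxy(   // args abridged
          datanode, conf, socketTimeout, connectToDnViaHostname, locatedblock);
      final long n = cdp.getReplicaVisibleLength(locatedblock.getBlock());
      if (n >= 0) {
        return n;   // visible length according to this DN
      }
    } catch (IOException ioe) {
      // fall through and try the next DN holding a replica
    } finally {
      if (cdp != null) {
        RPC.stopProxy(cdp);
      }
    }
  }
  throw new IOException("Cannot obtain block length for " + locatedblock);
}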

A replica on a DN can be in one of the following five states:

1) Finalized (represented by the FinalizedReplica class): the replica's data is complete; its visible length equals its bytes on disk and no more bytes will be written to it (a later append or recovery may still bump its GS).

2) RBW (ReplicaBeingWritten, a subclass of ReplicaInPipeline): a replica that was just created or appended to, sits in a write pipeline, and is currently being written. Its bytes are already visible to readers.

3) RUR (ReplicaUnderRecovery): the state of a replica while lease recovery and block recovery run after the lease has expired.

4) RWR (ReplicaWaitingToBeRecovered): if a DN crashes and restarts, all of its RBW replicas become RWR. An RWR replica no longer participates in any pipeline; it simply waits for lease recovery.

5) Temporary (ReplicaInPipeline): when replicas are transferred between DNs (e.g., during cluster rebalancing), the in-flight copy is Temporary. Unlike RBW it is not visible to readers, and a DN deletes Temporary replicas outright on restart. (The enum behind these five states is sketched below.)
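For reference, these five states map onto the ReplicaState enum in HdfsServerConstants, reproduced here slightly abridged. The numeric values matter later: syncBlock(), shown in section 4, compares states via getValue() to pick the "best" one, so a smaller value means a more durable state:

// From HdfsServerConstants (slightly abridged): smaller value = "better",
// i.e. more durable. syncBlock() relies on this ordering via getValue().
static public enum ReplicaState {
  FINALIZED(0),   // finalized replica
  RBW(1),         // replica being written to
  RWR(2),         // replica waiting to be recovered
  RUR(3),         // replica under recovery
  TEMPORARY(4);   // temporary replica, e.g. for inter-DN copies

  private int value;

  private ReplicaState(int v) {
    value = v;
  }

  public int getValue() {
    return value;
  }
}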

On the NN, a block's blockId, numBytes, and GS are persisted to disk, but its state is not; on a DN, by contrast, a replica's state is persisted to disk. So when the NN restarts, only the last block is loaded as UNDER_CONSTRUCTION and all the rest as COMPLETE, whereas a restarted DN reloads the replica states it had persisted.

For the block and replica state transition diagrams, see sections 9.1 and 9.2 of https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf, which cover them in detail.

2. The write/append code flow

A file being appended to may be in one of several states:

1) The file was closed cleanly last time, i.e. DFSOutputStream.close() -> FSNamesystem.completeFile() -> commitOrCompleteLastBlock() ran, followed by finalizeINodeFileUnderConstruction(). The file's NN metadata is then a plain INode rather than an INodeFileUnderConstruction, and the last block is guaranteed to have been committed or completed. Appending to such a file is the easy case.

2) The file was not closed cleanly last time (e.g., the client exited abnormally), so close() and its follow-up steps never ran. The file's NN metadata is still an INodeFileUnderConstruction, the last block was never committed, and the lock bound to the file (the lease) was never released.

Note: a lease in HDFS is essentially a write lock, and HDFS only locks writes. When a client sends a create() or append() request, the NN attaches a lease to the file. The client is then responsible for periodically renewing the lease, while a lease monitor thread on the NN checks for expired leases. A lease has two expiration limits: softLimit (60 s) and hardLimit (1 hour). The NN's lease monitor only reclaims leases past the hardLimit; a lease past the softLimit is also expired, but the monitor leaves it alone. Instead it is dealt with on the next append, or on an explicit recoverLease RPC, both of which check whether the softLimit has passed.

At append time (or on an explicit recoverLease call), a file that was closed cleanly last time no longer has a lease at all; for a file that was not closed cleanly, if its lease has passed the softLimit, lease recovery must be performed.
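As a side note, a client can drive this path explicitly through the public API. A minimal usage sketch follows; the path and the polling interval are illustrative choices, not anything HDFS mandates:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverLeaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(conf);
    Path path = new Path("/data/unclosed-file");   // hypothetical path

    // recoverLease() asks the NN to start lease recovery; it returns true
    // once the file is closed, false while block recovery is still running.
    while (!dfs.recoverLease(path)) {
      Thread.sleep(4000);   // illustrative polling interval
    }
    System.out.println(path + " is now closed.");
  }
}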

For a file that was not closed cleanly, once its NN-side lease passes the softLimit, the client reopening it may be the original client or a new one. Because the file was not closed cleanly, the replicas of its last block (typically three) may be in different states, so block recovery must first bring them to a consistent state before the append can proceed. Put bluntly: when appending to a file, if the lease turns out to be expired past the softLimit, we first have to clean up after the previous writer's aborted operation and bring the file to a properly closed state. That cleanup is lease recovery plus block recovery.

First, recall the RPCs in the HDFS write flow: the client creates the file, calls addBlock to allocate a block and the DNs that will store it, builds the pipeline, and writes data. The append flow is similar: the client first sends the append RPC to the NN, then addBlock, then builds the pipeline, then writes.

The client calls the DistributedFileSystem.append() interface, which sends the append RPC to the NN.
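On the client side the whole flow starts with something like this minimal sketch (the path and payload are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // append() sends the append RPC to the NN; writes on the returned
    // stream then drive addBlock, pipeline setup, and data transfer.
    FSDataOutputStream out = fs.append(new Path("/data/app.log"));
    out.write("one more line\n".getBytes("UTF-8"));
    out.close();   // triggers completeFile() on the NN
  }
}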

On the NN, append is handled by FSNamesystem.startFileInternal(), the same entry point as create(), but the internal logic branches from there.

Every append request executes this line:

recoverLeaseInternal(myFile, src, holder, clientMachine, false);

recoverLeaseInternal() performs the lease recovery. Internally it only acts on files that are under construction: if the file has finished construction (it is a plain INode), the previous close must have followed the proper close path (case 1 above), and no lease recovery is needed.

3. Lease recovery

Lease recovery in one sentence: a file that was being written was not closed properly and its lease was never released, so the last block's replicas may disagree in length and generation stamp; now the NN and DNs cooperate to complete the normal file-close flow, and the end result of lease recovery is identical to a normal close.

private void recoverLeaseInternal(INode fileInode,
      String src, String holder, String clientMachine, boolean force)
      throws IOException

This function performs the lease recovery: it first obtains the file's INodeFileUnderConstruction (at this point the file must be under construction); it then checks that the lease on the file is consistent and whether the original holder has kept renewing it; finally, if the lease has passed the softLimit, it calls internalReleaseLease() for the next step.

boolean internalReleaseLease(Lease lease, String src,
      String recoveryLeaseHolder) throws AlreadyBeingCreatedException,
      IOException, UnresolvedLinkException {
    LOG.info("Recovering lease=" + lease + ", src=" + src);
    assert !isInSafeMode();
    assert hasWriteLock();
    INodeFile iFile = dir.getFileINode(src);
    if (iFile == null) {
      final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on "
        + src + " file does not exist.";
      NameNode.stateChangeLog.warn(message);
      throw new IOException(message);
    }
    if (!iFile.isUnderConstruction()) {
      final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on "
        + src + " but file is already closed.";
      NameNode.stateChangeLog.warn(message);
      throw new IOException(message);
    }

    INodeFileUnderConstruction pendingFile = (INodeFileUnderConstruction) iFile;
    int nrBlocks = pendingFile.numBlocks();
    BlockInfo[] blocks = pendingFile.getBlocks();

    int nrCompleteBlocks;
    BlockInfo curBlock = null;
    // First scan the file's block list kept by the NN, looking for any
    // block whose state is not COMPLETE.
    for (nrCompleteBlocks = 0; nrCompleteBlocks < nrBlocks; nrCompleteBlocks++) {
      curBlock = blocks[nrCompleteBlocks];
      if (!curBlock.isComplete())
        break;
      assert blockManager.checkMinReplication(curBlock) :
              "A COMPLETE block is not minimally replicated in " + src;
    }

    // If there are no incomplete blocks associated with this file,
    // then reap lease immediately and close the file.
    if (nrCompleteBlocks == nrBlocks) {
      // All blocks are COMPLETE: release the lease, turn the file from an
      // INodeFileUnderConstruction into an INode, and close it.
      finalizeINodeFileUnderConstruction(src, pendingFile);
      NameNode.stateChangeLog.warn("BLOCK*"
        + " internalReleaseLease: All existing blocks are COMPLETE,"
        + " lease removed, file closed.");
      return true;  // closed!
    }

    // Only the last and the penultimate blocks may be in non COMPLETE state.
    // If the penultimate block is not COMPLETE, then it must be COMMITTED.
    // Reaching this point means some block is not COMPLETE, so the block
    // has to be repaired before the file can be finalized and closed.
    if (nrCompleteBlocks < nrBlocks - 2 ||
       nrCompleteBlocks == nrBlocks - 2 &&
         curBlock != null &&
         curBlock.getBlockUCState() != BlockUCState.COMMITTED) {
      final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on "
        + src + " but file is already closed.";
      NameNode.stateChangeLog.warn(message);
      throw new IOException(message);
    }

    // The last block is not COMPLETE, and
    // that the penultimate block if exists is either COMPLETE or COMMITTED
    final BlockInfo lastBlock = pendingFile.getLastBlock();
    BlockUCState lastBlockState = lastBlock.getBlockUCState();
    BlockInfo penultimateBlock = pendingFile.getPenultimateBlock();
    boolean penultimateBlockMinReplication;
    BlockUCState penultimateBlockState;
    if (penultimateBlock == null) {
      penultimateBlockState = BlockUCState.COMPLETE;
      // If penultimate block doesn't exist then its minReplication is met
      penultimateBlockMinReplication = true;
    } else {
      penultimateBlockState = BlockUCState.COMMITTED;
      penultimateBlockMinReplication =
        blockManager.checkMinReplication(penultimateBlock);
    }
    assert penultimateBlockState == BlockUCState.COMPLETE ||
           penultimateBlockState == BlockUCState.COMMITTED :
           "Unexpected state of penultimate block in " + src;

    switch (lastBlockState) {
    case COMPLETE:
      assert false : "Already checked that the last block is incomplete";
      break;
    case COMMITTED:
      // Close file if committed blocks are minimally replicated
      if (penultimateBlockMinReplication &&
          blockManager.checkMinReplication(lastBlock)) {
        finalizeINodeFileUnderConstruction(src, pendingFile);
        NameNode.stateChangeLog.warn("BLOCK*"
          + " internalReleaseLease: Committed blocks are minimally replicated,"
          + " lease removed, file closed.");
        return true;  // closed!
      }
      // Cannot close file right now, since some blocks
      // are not yet minimally replicated.
      // This may potentially cause infinite loop in lease recovery
      // if there are no valid replicas on data-nodes.
      String message = "DIR* NameSystem.internalReleaseLease: " +
          "Failed to release lease for file " + src +
          ". Committed blocks are waiting to be minimally replicated." +
          " Try again later.";
      NameNode.stateChangeLog.warn(message);
      throw new AlreadyBeingCreatedException(message);
    case UNDER_CONSTRUCTION:
    case UNDER_RECOVERY:
      final BlockInfoUnderConstruction uc = (BlockInfoUnderConstruction) lastBlock;
      // setup the last block locations from the blockManager if not known
      if (uc.getNumExpectedLocations() == 0) {
        uc.setExpectedLocations(blockManager.getNodes(lastBlock));
      }
      // start recovery of the last block for this file
      // Generate a new GS for this block; this recovery id is a key
      // variable throughout the recovery process.
      long blockRecoveryId = nextGenerationStamp();
      // Reassign the lease holder:
      // - if the client explicitly called the recoverLease RPC, the new
      //   holder is NAMENODE_LEASE_HOLDER, i.e. the NN holds the lease as
      //   a proxy;
      // - if lease recovery was triggered indirectly via append, the new
      //   holder is the calling client.
      lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
      // Initialize block recovery: pick one DN holding a replica as the
      // primary DN that drives the process (described in detail below).
      uc.initializeBlockRecovery(blockRecoveryId);
      leaseManager.renewLease(lease);
      // Cannot close file right now, since the last block requires recovery.
      // This may potentially cause infinite loop in lease recovery
      // if there are no valid replicas on data-nodes.
      NameNode.stateChangeLog.warn(
                "DIR* NameSystem.internalReleaseLease: " +
                "File " + src + " has not been closed." +
                " Lease recovery is in progress. " +
                "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
      break;
    }
    return false;
  }

4. Block recovery

Block recovery relies on the heartbeat mechanism between the NN and DNs. Every 3 seconds a DN sends a heartbeat to the NN. Besides updating the DN's record and refreshing lastUpdate, the NN uses the heartbeat response to hand work back to the DN: lease recovery, block replication, block invalidation, and balancer bandwidth updates. All of this happens in DatanodeManager.handleHeartbeat().

Taking lease recovery as the example, each DatanodeDescriptor holds a queue of blocks awaiting recovery:

private BlockQueue<BlockInfoUnderConstruction> recoverBlocks =
                                new BlockQueue<BlockInfoUnderConstruction>();

When the NN initializes block recovery it chooses a primary DN and enqueues the block in that DN's queue. Later, while handling a heartbeat RPC from the primary DN, the NN checks whether that queue contains blocks to recover and, if so, packages them into a block recovery command sent back with the heartbeat response, roughly as sketched below.
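A simplified sketch of that branch of DatanodeManager.handleHeartbeat() (abridged: the other command types and error handling are omitted; nodeinfo and blockPoolId come from the surrounding method):

// Abridged sketch: drain the DN's recoverBlocks queue and turn it into a
// BlockRecoveryCommand carried back on the heartbeat response.
BlockInfoUnderConstruction[] blocks =
    nodeinfo.getLeaseRecoveryCommand(Integer.MAX_VALUE);
if (blocks != null) {
  BlockRecoveryCommand brCommand = new BlockRecoveryCommand(blocks.length);
  for (BlockInfoUnderConstruction b : blocks) {
    brCommand.add(new RecoveringBlock(
        new ExtendedBlock(blockPoolId, b),
        b.getExpectedLocations(),    // the DNs holding replicas of this block
        b.getBlockRecoveryId()));    // the new GS the NN generated
  }
  return new DatanodeCommand[] { brCommand };
}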

How is this command handled once it reaches the primary DN? A DN has two components handling RPC traffic: BPServiceActor handles the RPC conversation with the NN, while client-to-DN and DN-to-DN RPC conversations are handled inside the main DataNode process.

Inside BPServiceActor.run() the DN loops over connectToNNAndHandshake() and offerService(), sending heartbeats to the NN and receiving its responses. So what can the NN's response contain? Here we only care about the block recovery command. On receiving one, the DN calls DataNode.recoverBlocks(), which starts a dedicated thread to do the recovery; that thread runs DataNode.recoverBlock().
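DataNode.recoverBlocks() itself is short; a rough sketch (logging and the exact parameter list trimmed; threadGroup and LOG are fields of DataNode):

// Sketch of DataNode.recoverBlocks(): run the recovery on its own daemon
// thread so the heartbeat path is not blocked.
public Daemon recoverBlocks(final Collection<RecoveringBlock> blocks) {
  Daemon d = new Daemon(threadGroup, new Runnable() {
    @Override
    public void run() {
      for (RecoveringBlock b : blocks) {
        try {
          recoverBlock(b);   // the primary-DN logic shown below
        } catch (IOException e) {
          LOG.warn("recoverBlocks FAILED: " + b, e);
        }
      }
    }
  });
  d.start();
  return d;
}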

(The function below runs only on the primary DN.)

private void recoverBlock(RecoveringBlock rBlock) throws IOException {
    ExtendedBlock block = rBlock.getBlock();
    String blookPoolId = block.getBlockPoolId();
    DatanodeID[] datanodeids = rBlock.getLocations();
    List<BlockRecord> syncList = new ArrayList<BlockRecord>(datanodeids.length);
    int errorCount = 0;

    // Iterate over every DN holding a replica of this block.
    for (DatanodeID id : datanodeids) {
      try {
        BPOfferService bpos = blockPoolManager.get(blookPoolId);
        DatanodeRegistration bpReg = bpos.bpRegistration;
        InterDatanodeProtocol datanode = bpReg.equals(id) ?
            this : DataNode.createInterDataNodeProtocolProxy(id, getConf(),
                dnConf.socketTimeout, dnConf.connectToDnViaHostname);
        // Send the initReplicaRecovery RPC to each DN holding a replica, so
        // that each runs its own initReplicaRecovery procedure. This touches
        // the DN's underlying storage, so it lands in
        // FsDatasetImpl.initReplicaRecovery(); that function is described
        // below.
        ReplicaRecoveryInfo info = callInitReplicaRecovery(datanode, rBlock);
        // Decide whether the replica reported by the other DN is valid: a
        // replica whose GS is older than the block's clearly is not.
        if (info != null &&
            info.getGenerationStamp() >= block.getGenerationStamp() &&
            info.getNumBytes() > 0) {
          // Add valid replicas to the list to be synchronized. Synchronizing
          // means negotiating one consistent externally visible length.
          syncList.add(new BlockRecord(id, datanode, info));
        }
      } catch (RecoveryInProgressException ripE) {
        // If any DN throws RecoveryInProgressException, the primary DN
        // aborts the recovery.
        InterDatanodeProtocol.LOG.warn(
            "Recovery for replica " + block + " on data-node " + id
            + " is already in progress. Recovery id = "
            + rBlock.getNewGenerationStamp() + " is aborted.", ripE);
        return;
      } catch (IOException e) {
        ++errorCount;
        InterDatanodeProtocol.LOG.warn(
            "Failed to obtain replica info for block (=" + block
            + ") from datanode (=" + id + ")", e);
      }
    }

    if (errorCount == datanodeids.length) {
      // Every DN threw an exception, so abort as well.
      throw new IOException("All datanodes failed: block=" + block
          + ", datanodeids=" + Arrays.asList(datanodeids));
    }
    // Negotiate a consistent state (length etc.) across these replicas.
    syncBlock(rBlock, syncList);
  }

As noted above, the primary DN sends the initReplicaRecovery RPC to the other DNs holding replicas; each of them then performs replica recovery and returns the result as the RPC response. Every DN holding a replica executes this function:

static ReplicaRecoveryInfo initReplicaRecovery(String bpid,
      ReplicaMap map, Block block, long recoveryId) throws IOException

1) Stop the writer: if a replica is in the write state (RBW) and has a writer thread, interrupt that thread and wait for it to exit. Then check that the block file on disk (bytesOnDisk) is consistent with the bytes received, verify that the crc (meta) file is valid, and close both files. This ensures a client write and block recovery can never run concurrently.

2) If the replica is already in RUR, a recovery may already have started, so check whether that recovery and this one are the same attempt. The criterion is the recoveryId that the NN handed to the primary DN, and the primary DN spread to every DN holding a replica, i.e. the block's new GS. If this attempt's id is older than the recoveryId recorded in the replica, throw RecoveryInProgressException; otherwise record the new id as the RUR replica's recoveryId.

3) If no recovery is running, switch the replica to RUR and set its recoveryId to the new id. Every interaction from the primary DN to the other DNs is tagged with this recoveryId. Among concurrent block recoveries, a newer recovery always kills an older one; two recoveries never interleave.

Absent any exception, each DN then sends its response (an InitReplicaRecoveryResponseProto) back to the primary DN.
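Condensed, steps 1) through 3) look roughly like this inside FsDatasetImpl.initReplicaRecovery(). This is a heavily abridged sketch: the consistency checks and exception messages are trimmed, though the helper names (stopWriter(), checkReplicaFiles(), createInfo()) follow the real code:

// Abridged sketch of FsDatasetImpl.initReplicaRecovery():
static ReplicaRecoveryInfo initReplicaRecovery(String bpid,
    ReplicaMap map, Block block, long recoveryId) throws IOException {
  final ReplicaInfo replica = map.get(bpid, block.getBlockId());

  // 1) Stop the writer, if any, and verify the on-disk files.
  if (replica instanceof ReplicaInPipeline) {
    final ReplicaInPipeline rip = (ReplicaInPipeline) replica;
    rip.stopWriter();            // interrupt and join the writer thread
    checkReplicaFiles(rip);      // block file and meta (crc) file
  }

  // Stale replicas and stale recovery attempts are rejected.
  if (replica.getGenerationStamp() < block.getGenerationStamp()
      || replica.getGenerationStamp() >= recoveryId) {
    throw new IOException("replica/recovery id mismatch");   // message trimmed
  }

  // 2) Already RUR: is the running recovery the same attempt as this one?
  final ReplicaUnderRecovery rur;
  if (replica.getState() == ReplicaState.RUR) {
    rur = (ReplicaUnderRecovery) replica;
    if (rur.getRecoveryID() >= recoveryId) {
      throw new RecoveryInProgressException("stale recovery id"); // trimmed
    }
    rur.setRecoveryID(recoveryId);   // a newer recovery kills the older one
  } else {
    // 3) No recovery running: move the replica to RUR with the new id.
    rur = new ReplicaUnderRecovery(replica, recoveryId);
    map.add(bpid, rur);
  }
  return rur.createInfo();           // becomes the RPC response
}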

What does the primary DN do with the responses from the other DNs? We are back in DataNode.recoverBlock() (see the comments in the code above).

Next, let's see how DataNode.syncBlock() brings replicas of differing states and sizes to a consistent state.

The idea is simple: find the best state present among all replicas. What counts as better? From best to worst: Finalized, RBW, RWR, RUR, Temporary; the ranking reflects how durable the replica is. If the best state among the replicas is Finalized, that replica wins, and any replica inconsistent with it is excluded; if the best state is RBW or RWR, the shortest replica length among them becomes the post-recovery length of all replicas.

void syncBlock(RecoveringBlock rBlock,
                         List<BlockRecord> syncList) throws IOException {
    ExtendedBlock block = rBlock.getBlock();
    final String bpid = block.getBlockPoolId();
    DatanodeProtocolClientSideTranslatorPB nn =
      getActiveNamenodeForBP(block.getBlockPoolId());
    if (nn == null) {
      throw new IOException(
          "Unable to synchronize block " + rBlock + ", since this DN "
          + " has not acknowledged any NN as active.");
    }

    long recoveryId = rBlock.getNewGenerationStamp();
    if (LOG.isDebugEnabled()) {
      LOG.debug("block=" + block + ", (length=" + block.getNumBytes()
          + "), syncList=" + syncList);
    }

    // syncList.isEmpty() means that all data-nodes do not have the block
    // or their replicas have 0 length.
    // The block can be deleted.
    if (syncList.isEmpty()) {
      nn.commitBlockSynchronization(block, recoveryId, 0,
          true, true, DatanodeID.EMPTY_ARRAY, null);
      return;
    }

    // Calculate the best available replica state.
    ReplicaState bestState = ReplicaState.RWR;
    long finalizedLength = -1;
    for (BlockRecord r : syncList) {
      assert r.rInfo.getNumBytes() > 0 : "zero length replica";
      ReplicaState rState = r.rInfo.getOriginalReplicaState();
      if (rState.getValue() < bestState.getValue())
        bestState = rState;
      if (rState == ReplicaState.FINALIZED) {
        if (finalizedLength > 0 && finalizedLength != r.rInfo.getNumBytes())
          throw new IOException("Inconsistent size of finalized replicas. " +
              "Replica " + r.rInfo + " expected size: " + finalizedLength);
        finalizedLength = r.rInfo.getNumBytes();
      }
    }

    // Calculate list of nodes that will participate in the recovery
    // and the new block size
    List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
    final ExtendedBlock newBlock = new ExtendedBlock(bpid, block.getBlockId(),
        -1, recoveryId);
    switch (bestState) {
    case FINALIZED:
      assert finalizedLength > 0 : "finalizedLength is not positive";
      for (BlockRecord r : syncList) {
        ReplicaState rState = r.rInfo.getOriginalReplicaState();
        if (rState == ReplicaState.FINALIZED ||
           rState == ReplicaState.RBW &&
                      r.rInfo.getNumBytes() == finalizedLength)
          participatingList.add(r);
      }
      newBlock.setNumBytes(finalizedLength);
      break;
    case RBW:
    case RWR:
      long minLength = Long.MAX_VALUE;
      for (BlockRecord r : syncList) {
        ReplicaState rState = r.rInfo.getOriginalReplicaState();
        if (rState == bestState) {
          minLength = Math.min(minLength, r.rInfo.getNumBytes());
          participatingList.add(r);
        }
      }
      newBlock.setNumBytes(minLength);
      break;
    case RUR:
    case TEMPORARY:
      assert false : "bad replica state: " + bestState;
    }

    List<DatanodeID> failedList = new ArrayList<DatanodeID>();
    final List<BlockRecord> successList = new ArrayList<BlockRecord>();
    for (BlockRecord r : participatingList) {
      try {
        // Send the updateReplicaUnderRecovery RPC (an InterDatanodeProtocol
        // call used only during block recovery) to the other DNs. Each DN
        // updates its replica's length and GS and then finalizes the replica.
        r.updateReplicaUnderRecovery(bpid, recoveryId, newBlock.getNumBytes());
        successList.add(r);
      } catch (IOException e) {
        InterDatanodeProtocol.LOG.warn("Failed to updateBlock (newblock="
            + newBlock + ", datanode=" + r.id + ")", e);
        failedList.add(r.id);
      }
    }

    // If any of the data-nodes failed, the recovery fails, because
    // we never know the actual state of the replica on failed data-nodes.
    // The recovery should be started over.
    if (!failedList.isEmpty()) {
      StringBuilder b = new StringBuilder();
      for (DatanodeID id : failedList) {
        b.append("\n  " + id);
      }
      throw new IOException("Cannot recover " + block + ", the following "
          + failedList.size() + " data-nodes failed {" + b + "\n}");
    }

    // Notify the name-node about successfully recovered replicas.
    final DatanodeID[] datanodes = new DatanodeID[successList.size()];
    final String[] storages = new String[datanodes.length];
    for (int i = 0; i < datanodes.length; i++) {
      final BlockRecord r = successList.get(i);
      datanodes[i] = r.id;
      storages[i] = r.storageID;
    }
    // Tell the NN that block recovery completed; the NN persists the new
    // metadata, commits/completes the block, and closes the file.
    nn.commitBlockSynchronization(block,
        newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
        datanodes, storages);
  }

Once lease recovery and block recovery have run, the previously unclosed file has been closed and restored to the normal closed state. From this point the append proceeds exactly as with a normally closed file: addBlock obtains the DNs for the replicas, the pipeline is built, and data is written to the DNs; see the previous post (HDFS write flow and code analysis) for that part.

This flow drives home, once again, that in systems software more than 60% of the code exists to handle failure cases. A single unclosed file brings this much follow-up code...

References:

https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf

http://blog.csdn.net/chenpingbupt/article/details/7972589
