This post covers:
① How the write pipeline is established, and how downstream nodes send Acks to upstream nodes
② How DFSOutputStream and DataStreamer work
③ How Sender, BlockReceiver and PacketResponder work
As a lead-in, let's start from the very top of the write path:
When we create and write a file with the HDFS API, we first call FileSystem's create method to obtain an FSDataOutputStream, and then simply write data through that stream.
The API looks simple, but a great deal happens behind it: the client creates the file in the NameNode's namespace via RPC and calls addBlock, the NameNode returns the file's LocatedBlock to the client, the client builds an output stream from the LocatedBlock, the DataNodes set up a pipeline, the client sends packets down the pipeline, the Sender ships data to the DataXceiver, and so on.
Let's dive in and explore!
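As a reference point for what follows, here is a minimal client-side example of the API just described. It is only a sketch: the path and the file contents are made up, and fs.defaultFS is assumed to come from a core-site.xml on the classpath.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/tmp/pipeline-demo.txt"))) {
      // Each write() lands in the client-side buffer; the background DataStreamer
      // thread is what actually ships the data to the DataNode pipeline as packets.
      out.write("hello, hdfs write pipeline".getBytes(StandardCharsets.UTF_8));
      out.hflush(); // push buffered packets out so readers can see the data before close()
    }
  }
}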
1. Write Pipeline Overview
We'll skip the step in which the client asks the NameNode via RPC to add a block and the NameNode picks the best list of DataNodes to store it, and jump straight to the point where the DataNode list has been chosen and the client writes data to those DataNodes.
This step requires the client and the DataNode list to establish a pipeline, as shown in the figure below:
Data is packaged as DFSPacket objects; a block may consist of many packets. For the exact packet format, see the comments in DFSPacket.java in the HDFS source.
The client sends a packet to the first DataNode A in the pipeline; A forwards it to B after receiving it, and B forwards it to C. When the most downstream node in the pipeline receives a packet, it sends an Ack for that packet back in the direction opposite to the data flow. The Ack carries the packet's sequence number, which is monotonically increasing on the client side. The figure above takes the perspective of a single packet; for the multi-packet view, the figure below is taken from the Yahoo! Labs HDFS paper:
From a bird's-eye view the write pipeline is easy to understand, but there is a lot of code behind it, involving many thread classes on both the client side and the DataNode side. We'll tackle them one by one in the rest of this post.
2. Obtaining the DFSOutputStream
First, let's look at how DFSClient obtains the write stream, DFSOutputStream:
/**
* Same as {@link #create(String, FsPermission, EnumSet, boolean, short, long,
* Progressable, int, ChecksumOpt, InetSocketAddress[])} with the addition of
* ecPolicyName that is used to specify a specific erasure coding policy
* instead of inheriting any policy from this new file's parent directory.
* This policy will be persisted in HDFS. A value of null means inheriting
* parent groups' whatever policy.
*/
public DFSOutputStream create(String src, FsPermission permission,
EnumSet<CreateFlag> flag, boolean createParent, short replication,
long blockSize, Progressable progress, int buffersize,
ChecksumOpt checksumOpt, InetSocketAddress[] favoredNodes,
String ecPolicyName) throws IOException {
checkOpen();
final FsPermission masked = applyUMask(permission);
LOG.debug("{}: masked={}", src, masked);
final DFSOutputStream result = DFSOutputStream.newStreamForCreate(this,
src, masked, flag, createParent, replication, blockSize, progress,
dfsClientConf.createChecksum(checksumOpt),
getFavoredNodesStr(favoredNodes), ecPolicyName);
beginFileLease(result.getFileId(), result);
return result;
}
A DFSOutputStream is obtained through DFSOutputStream#newStreamForCreate. It takes quite a few parameters; note in particular that this is the DFSClient object and favoredNodes is the list of preferred DataNodes.
newStreamForCreate does two main things:
① It calls create on the NameNode proxy object, which creates the file in the NameNode's namespace and returns the file's metadata as an HdfsFileStatus object.
② It passes the HdfsFileStatus from ① into the DFSOutputStream constructor to build the stream object, and then starts the DataStreamer thread inside it.
static DFSOutputStream newStreamForCreate(DFSClient dfsClient, String src,
FsPermission masked, EnumSet<CreateFlag> flag, boolean createParent,
short replication, long blockSize, Progressable progress,
DataChecksum checksum, String[] favoredNodes, String ecPolicyName)
throws IOException {
try (TraceScope ignored =
dfsClient.newPathTraceScope("newStreamForCreate", src)) {
HdfsFileStatus stat = null;
// Retry the create if we get a RetryStartFileException up to a maximum
// number of times
boolean shouldRetry = true;
int retryCount = CREATE_RETRY_COUNT;
while (shouldRetry) {
shouldRetry = false;
try {
stat = dfsClient.namenode.create(src, masked, dfsClient.clientName,
new EnumSetWritable<>(flag), createParent, replication,
blockSize, SUPPORTED_CRYPTO_VERSIONS, ecPolicyName);
break;
} catch (RemoteException re) {
IOException e = re.unwrapRemoteException(
AccessControlException.class,
DSQuotaExceededException.class,
QuotaByStorageTypeExceededException.class,
FileAlreadyExistsException.class,
FileNotFoundException.class,
ParentNotDirectoryException.class,
NSQuotaExceededException.class,
RetryStartFileException.class,
SafeModeException.class,
UnresolvedPathException.class,
SnapshotAccessControlException.class,
UnknownCryptoProtocolVersionException.class);
if (e instanceof RetryStartFileException) {
if (retryCount > 0) {
shouldRetry = true;
retryCount--;
} else {
throw new IOException("Too many retries because of encryption" +
" zone operations", e);
}
} else {
throw e;
}
}
}
Preconditions.checkNotNull(stat, "HdfsFileStatus should not be null!");
// construct the DFSOutputStream object
final DFSOutputStream out;
if(stat.getErasureCodingPolicy() != null) {
out = new DFSStripedOutputStream(dfsClient, src, stat,
flag, progress, checksum, favoredNodes);
} else {
out = new DFSOutputStream(dfsClient, src, stat,
flag, progress, checksum, favoredNodes, true);
}
// start the DataStreamer thread
out.start();
return out;
}
}
Now let's follow the DFSOutputStream constructor.
It also does two main things:
① It sets DFSOutputStream's member fields, such as the fileId, src, the owning DFSClient object, the packet size, and so on.
② It creates the DataStreamer thread (which the caller then starts).
/** Construct a new output stream for creating a file. */
protected DFSOutputStream(DFSClient dfsClient, String src,
HdfsFileStatus stat, EnumSet<CreateFlag> flag, Progressable progress,
DataChecksum checksum, String[] favoredNodes, boolean createStreamer) {
this(dfsClient, src, flag, progress, stat, checksum);
this.shouldSyncBlock = flag.contains(CreateFlag.SYNC_BLOCK);
computePacketChunkSize(dfsClient.getConf().getWritePacketSize(),
bytesPerChecksum);
if (createStreamer) {
streamer = new DataStreamer(stat, null, dfsClient, src, progress,
checksum, cachingStrategy, byteArrayManager, favoredNodes,
addBlockFlags);
}
}
OK, it may feel like the trail goes cold here. Where should we look next? The DataStreamer thread's run method. To keep the narrative flowing I have reordered some paragraphs; the part below about how the output stream writes data was written after I had read the DataStreamer source.
The DataStreamer thread takes packets from dataQueue, sends each one to the first DataNode in the pipeline, and then moves it from dataQueue to ackQueue. So we need to find out how packets get added to dataQueue in the first place. A global code search for dataQueue.add shows that dataQueue.addLast is the method we want. Following its callers:
You will eventually find that data is wrapped into a DFSPacket and enqueued into dataQueue, waiting for the DataStreamer to send it to a DataNode, only when FSOutputSummer#write is called.
And DFSOutputStream happens to be a subclass of FSOutputSummer that does not override write. In other words, our write calls go through FSOutputSummer's write, which is what wraps the data into DFSPacket objects and enqueues them into dataQueue.
Notes:
① When we use the HDFS API, although the stream object we hold is an FSDataOutputStream, the underlying stream is still a DFSOutputStream. FSDataOutputStream is a wrapper stream; DFSClient#createWrappedOutputStream wraps the DFSOutputStream into an FSDataOutputStream.
② When the stream is constructed, the max packet size is read from configuration. Then, while writing, since the stream is a subclass of FSOutputSummer, it keeps track of how many bytes have been written; once the configured limit is reached, it builds a DFSPacket and enqueues it into dataQueue for the DataStreamer thread to process.
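To make note ② concrete, here is a minimal, self-contained sketch of that chunk-then-packet idea. The class and field names are mine, not HDFS's; the real logic lives in FSOutputSummer#write and DFSOutputStream#writeChunk, and real packets also carry checksums and headers.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

public class ChunkingSketch {
  private final int bytesPerChunk;   // cf. bytesPerChecksum, 512 by default
  private final int chunksPerPacket; // derived from the configured write packet size
  private final byte[] packetBuf;
  private int packetLen = 0;
  private int chunksInPacket = 0;
  // Stands in for the dataQueue of DFSPacket objects consumed by DataStreamer.
  private final Queue<byte[]> dataQueue = new ArrayDeque<>();

  public ChunkingSketch(int bytesPerChunk, int chunksPerPacket) {
    this.bytesPerChunk = bytesPerChunk;
    this.chunksPerPacket = chunksPerPacket;
    this.packetBuf = new byte[bytesPerChunk * chunksPerPacket];
  }

  public void write(byte[] b, int off, int len) {
    for (int i = 0; i < len; i++) {
      packetBuf[packetLen++] = b[off + i];
      if (packetLen % bytesPerChunk == 0) {      // a chunk is full; HDFS would checksum it here
        chunksInPacket++;
        if (chunksInPacket == chunksPerPacket) { // the packet is full; hand it to the sender thread
          dataQueue.add(Arrays.copyOf(packetBuf, packetLen));
          packetLen = 0;
          chunksInPacket = 0;
        }
      }
    }
  }
}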
3. The DataStreamer Class
DataStreamer is the core class on the client side for writing data. It is essentially a Thread. Its main responsibilities are described in detail in its JavaDoc, quoted here:
/*********************************************************************
*
* The DataStreamer class is responsible for sending data packets to the
* datanodes in the pipeline. It retrieves a new blockid and block locations
* from the namenode, and starts streaming packets to the pipeline of
* Datanodes. Every packet has a sequence number associated with
* it. When all the packets for a block are sent out and acks for each
* if them are received, the DataStreamer closes the current block.
*
* The DataStreamer thread picks up packets from the dataQueue, sends it to
* the first datanode in the pipeline and moves it from the dataQueue to the
* ackQueue. The ResponseProcessor receives acks from the datanodes. When an
* successful ack for a packet is received from all datanodes, the
* ResponseProcessor removes the corresponding packet from the ackQueue.
*
* In case of error, all outstanding packets are moved from ackQueue. A new
* pipeline is setup by eliminating the bad datanode from the original
* pipeline. The DataStreamer now starts sending packets from the dataQueue.
*
*********************************************************************/
Translated:
The DataStreamer class is responsible for sending data packets to the DataNodes in the pipeline. It retrieves a new block id and block locations from the NameNode, and starts streaming packets to the pipeline of DataNodes. Every packet has a sequence number associated with it. When all the packets of a block have been sent out and the acks for each of them have been received, the DataStreamer closes the current block.
The DataStreamer thread picks packets from dataQueue, sends each packet to the first DataNode in the pipeline, and moves it from dataQueue to ackQueue. The ResponseProcessor thread receives acks from the DataNodes. When a successful ack for a packet has arrived from all DataNodes, the ResponseProcessor removes the corresponding packet from ackQueue.
In case of error, all outstanding packets are removed from ackQueue. A new pipeline is then set up by eliminating the bad DataNode from the original pipeline, and the DataStreamer resumes sending packets from dataQueue.
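Before diving into the source, note that the JavaDoc above boils down to a two-queue handoff between a sender thread and a responder thread. Below is a stripped-down sketch of that pattern; the names are mine, and the real classes are DataStreamer and its inner ResponseProcessor.

import java.util.LinkedList;

public class TwoQueueSketch {
  private final LinkedList<Long> dataQueue = new LinkedList<>(); // seqnos of packets waiting to be sent
  private final LinkedList<Long> ackQueue  = new LinkedList<>(); // packets sent but not yet acked

  // "DataStreamer" side: send one packet, then move it from dataQueue to ackQueue.
  void sendOnePacket() throws InterruptedException {
    long seqno;
    synchronized (dataQueue) {
      while (dataQueue.isEmpty()) {
        dataQueue.wait();
      }
      seqno = dataQueue.removeFirst();
      ackQueue.addLast(seqno);
    }
    // ... write the packet to the first DataNode's socket here ...
  }

  // "ResponseProcessor" side: a successful ack from all DataNodes retires the packet.
  void onAck(long ackedSeqno) {
    synchronized (dataQueue) { // the real code guards ackQueue with the dataQueue lock as well
      if (!ackQueue.isEmpty() && ackQueue.getFirst() == ackedSeqno) {
        ackQueue.removeFirst();
        dataQueue.notifyAll();
      }
    }
  }

  // Writer side: enqueue a packet for sending and wake the sender up.
  void enqueue(long seqno) {
    synchronized (dataQueue) {
      dataQueue.addLast(seqno);
      dataQueue.notifyAll();
    }
  }
}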
OK, now that the JavaDoc has told us what DataStreamer does, let's look at the flow chart and some key source code.
At a high level, DataStreamer#run executes as shown in the figure below:
Since DataStreamer is a thread class, we just need to read its run method. It is long, roughly 200 lines; you don't need to read all of it. Just pay attention to the questions I raise in the comments below, each of which I will answer afterwards:
/*
* streamer thread is the only thread that opens streams to datanode,
* and closes them. Any error recovery is also done by this thread.
*/
@Override
public void run() {
long lastPacket = Time.monotonicNow();
TraceScope scope = null;
while (!streamerClosed && dfsClient.clientRunning) {
// if the Responder encountered an error, shutdown Responder
if (errorState.hasError()) {
closeResponder();
}
DFSPacket one;
try {
// process datanode IO errors if any
boolean doSleep = processDatanodeOrExternalError();
final int halfSocketTimeout = dfsClient.getConf().getSocketTimeout()/2;
synchronized (dataQueue) {
// wait for a packet to be sent.
long now = Time.monotonicNow();
// ① what is going on with halfSocketTimeout and the wait timeout?
while ((!shouldStop() && dataQueue.size() == 0 &&
(stage != BlockConstructionStage.DATA_STREAMING ||
now - lastPacket < halfSocketTimeout)) || doSleep) {
long timeout = halfSocketTimeout - (now-lastPacket);
timeout = timeout <= 0 ? 1000 : timeout;
timeout = (stage == BlockConstructionStage.DATA_STREAMING)?
timeout : 1000;
try {
dataQueue.wait(timeout);
} catch (InterruptedException e) {
LOG.warn("Caught exception", e);
}
doSleep = false;
now = Time.monotonicNow();
}
if (shouldStop()) {
continue;
}
// get packet to be sent.
if (dataQueue.isEmpty()) {
one = createHeartbeatPacket();
} else {
try {
backOffIfNecessary();
} catch (InterruptedException e) {
LOG.warn("Caught exception", e);
}
one = dataQueue.getFirst(); // regular data packet
SpanId[] parents = one.getTraceParents();
if (parents.length > 0) {
scope = dfsClient.getTracer().
newScope("dataStreamer", parents[0]);
scope.getSpan().setParents(parents);
}
}
}
// get new block from namenode.
if (LOG.isDebugEnabled()) {
LOG.debug("stage=" + stage + ", " + this);
}
// ② how is the pipeline established, and how are the sockets created?
if (stage == BlockConstructionStage.PIPELINE_SETUP_CREATE) {
LOG.debug("Allocating new block: {}", this);
setPipeline(nextBlockOutputStream());
initDataStreaming();
} else if (stage == BlockConstructionStage.PIPELINE_SETUP_APPEND) {
LOG.debug("Append to block {}", block);
setupPipelineForAppendOrRecovery();
if (streamerClosed) {
continue;
}
initDataStreaming();
}
long lastByteOffsetInBlock = one.getLastByteOffsetBlock();
if (lastByteOffsetInBlock > stat.getBlockSize()) {
throw new IOException("BlockSize " + stat.getBlockSize() +
" < lastByteOffsetInBlock, " + this + ", " + one);
}
if (one.isLastPacketInBlock()) {
// wait for all data packets have been successfully acked
synchronized (dataQueue) {
while (!shouldStop() && ackQueue.size() != 0) {
try {
// wait for acks to arrive from datanodes
dataQueue.wait(1000);
} catch (InterruptedException e) {
LOG.warn("Caught exception", e);
}
}
}
if (shouldStop()) {
continue;
}
stage = BlockConstructionStage.PIPELINE_CLOSE;
}
// send the packet
SpanId spanId = SpanId.INVALID;
synchronized (dataQueue) {
// move packet from dataQueue to ackQueue
if (!one.isHeartbeatPacket()) {
if (scope != null) {
spanId = scope.getSpanId();
scope.detach();
one.setTraceScope(scope);
}
scope = null;
dataQueue.removeFirst();
ackQueue.addLast(one);
packetSendTime.put(one.getSeqno(), Time.monotonicNow());
dataQueue.notifyAll();
}
}
LOG.debug("{} sending {}", this, one);
// write out data to remote datanode
// this is the actual write; flush() is then called on the stream
try (TraceScope ignored = dfsClient.getTracer().
newScope("DataStreamer#writeTo", spanId)) {
one.writeTo(blockStream);
blockStream.flush();
} catch (IOException e) {
// HDFS-3398 treat primary DN is down since client is unable to
// write to primary DN. If a failed or restarting node has already
// been recorded by the responder, the following call will have no
// effect. Pipeline recovery can handle only one node error at a
// time. If the primary node fails again during the recovery, it
// will be taken out then.
errorState.markFirstNodeIfNotMarked();
throw e;
}
lastPacket = Time.monotonicNow();
// update bytesSent
long tmpBytesSent = one.getLastByteOffsetBlock();
if (bytesSent < tmpBytesSent) {
bytesSent = tmpBytesSent;
}
if (shouldStop()) {
continue;
}
// ③ if this is the last packet of the block, wait until every packet has been removed from ackQueue (i.e. acked by the DataNodes)
// Is this block full?
if (one.isLastPacketInBlock()) {
// wait for the close packet has been acked
synchronized (dataQueue) {
while (!shouldStop() && ackQueue.size() != 0) {
dataQueue.wait(1000);// wait for acks to arrive from datanodes
}
}
if (shouldStop()) {
continue;
}
endBlock();
}
if (progress != null) { progress.progress(); }
// This is used by unit test to trigger race conditions.
if (artificialSlowdown != 0 && dfsClient.clientRunning) {
Thread.sleep(artificialSlowdown);
}
} catch (Throwable e) {
// Log warning if there was a real error.
if (!errorState.isRestartingNode()) {
// Since their messages are descriptive enough, do not always
// log a verbose stack-trace WARN for quota exceptions.
if (e instanceof QuotaExceededException) {
LOG.debug("DataStreamer Quota Exception", e);
} else {
LOG.warn("DataStreamer Exception", e);
}
}
lastException.set(e);
assert !(e instanceof NullPointerException);
errorState.setInternalError();
if (!errorState.isNodeMarked()) {
// Not a datanode issue
streamerClosed = true;
}
} finally {
if (scope != null) {
scope.close();
scope = null;
}
}
}
closeInternal();
}
OK, we raised three questions above:
① How is the timeout passed to wait determined?
Why do we need to wait at all? Look at the while condition:
while ((!shouldStop() && dataQueue.size() == 0 &&
(stage != BlockConstructionStage.DATA_STREAMING ||
now - lastPacket < halfSocketTimeout)) || doSleep)
If dataQueue's size is 0, there is no data to send, so the sending logic further down is not needed and the thread can wait. As for why halfSocketTimeout is used, I think it is simply a trade-off made by the HDFS developers; you could shrink this value, and the only cost would be sending a few more heartbeat packets. (In fact, the latest HDFS community code has already removed the wait-for-halfSocketTimeout logic here.)
Let's do some simple arithmetic:
long timeout = halfSocketTimeout - (now-lastPacket);
timeout = timeout <= 0 ? 1000 : timeout;
So if timeout > 0, the wait actually ends at:
now + timeout = now + (halfSocketTimeout - (now - lastPacket)) = lastPacket + halfSocketTimeout
In other words, while dataQueue stays empty the thread keeps waiting until halfSocketTimeout has elapsed since the previous packet was sent; only then does it wake up (and, with the queue still empty, send a heartbeat packet).
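The same arithmetic, isolated as a tiny helper for clarity (this just restates the source above):

// Mirrors the wait-timeout computation in DataStreamer#run: while streaming and
// dataQueue is empty, wake up at lastPacket + halfSocketTimeout so a heartbeat
// packet can be sent; outside the DATA_STREAMING stage, poll every second.
static long waitTimeout(long nowMs, long lastPacketMs, long halfSocketTimeoutMs, boolean streaming) {
  long timeout = halfSocketTimeoutMs - (nowMs - lastPacketMs);
  timeout = timeout <= 0 ? 1000 : timeout;
  return streaming ? timeout : 1000;
}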
② How are the DataNode connections and the pipeline built?
This is the crux of the matter. Taking file creation as the example:
There is a detail worth understanding here. Suppose the DataNode list chosen for the block is [A, B, C]. The client first opens a socket connection to A, then A opens a socket connection to B, and B opens one to C. This is implemented as follows:
When the request is serialized, the list of target nodes to forward to the next hop is produced by calling PBHelperClient.convert with the original list and startIndex = 1. Looking at that convert implementation, startIndex = 1 means the new target list is always built starting from the second element. For example: the initial target DataNode list is [A, B, C]; when the client opens the socket to A, the target list it sends is [B, C], so A can connect to B, and likewise B connects to C.
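The startIndex = 1 idea can be illustrated with a few lines of standalone code (purely illustrative; the real serialization happens inside Sender#writeBlock via PBHelperClient.convert):

import java.util.Arrays;
import java.util.List;

public class TargetForwardingSketch {
  // Each hop sends writeBlock to targets.get(0) and forwards only the rest of the list.
  static void connect(String self, List<String> targets) {
    if (targets.isEmpty()) {
      return; // last DataNode in the pipeline, nothing downstream
    }
    String next = targets.get(0);
    List<String> remaining = targets.subList(1, targets.size());
    System.out.println(self + " -> opens socket to " + next + ", forwarding targets " + remaining);
    connect(next, remaining);
  }

  public static void main(String[] args) {
    // Prints: Client -> A with [B, C]; A -> B with [C]; B -> C with []
    connect("Client", Arrays.asList("A", "B", "C"));
  }
}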
OK, at this point we know how the whole pipeline is constructed.
③ How do downstream DataNodes in the pipeline send acks upstream, and what happens to the ackQueue data structure? That is covered in Section 4.
4. BlockReceiver & PacketResponder
As mentioned above, when the DataStreamer thread takes a DFSPacket to send off dataQueue, it also adds the packet to ackQueue, meaning the packet must wait until every DataNode in the pipeline has returned an ack for it. So how does a DataNode return acks to its upstream DataNode and to the client? Let's take a look.
Remember that DataStreamer calls new Sender(xxx).writeBlock? On the DataNode side this request is handled by the writeBlock method of the DataXceiver class, which in turn delegates to BlockReceiver#receiveBlock. receiveBlock starts a PacketResponder thread to respond to the packets it receives.
Take a look at receiveBlock's parameters:
The three streams are, respectively, the output stream for sending data to the downstream DataNode, the input stream for receiving data from the downstream DataNode, and the output stream for replying to the upstream node. Looking at PacketResponder's constructor, two of these streams are passed in as arguments:
type indicates whether this responder is the last node in the pipeline or an intermediate node.
Back in receiveBlock, there is a line like this:
while (receivePacket() >= 0) { /* Receive until the last packet */ }
This is a loop with an empty body: as long as data keeps arriving, receivePacket keeps getting called. receivePacket returns the number of bytes received. Inside, it reads the packet's fields from the input stream and then enqueues the packet, where it waits for the PacketResponder thread to process it in the background, as shown in the figure below:
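Besides the figure, here is a simplified sketch of that receive-and-enqueue loop. The field and method names are mine; the real work (header parsing, checksum verification, mirroring, disk writes) happens in BlockReceiver#receivePacket.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ReceiveLoopSketch {
  // What the responder thread needs per packet: its seqno and whether it closes the block.
  static class PendingAck {
    final long seqno;
    final boolean lastPacketInBlock;
    PendingAck(long seqno, boolean lastPacketInBlock) {
      this.seqno = seqno;
      this.lastPacketInBlock = lastPacketInBlock;
    }
  }

  private final BlockingQueue<PendingAck> ackQueue = new LinkedBlockingQueue<>();

  // Loosely mirrors: while (receivePacket() >= 0) { /* Receive until the last packet */ }
  void receiveBlock(DataInputStream in) throws IOException, InterruptedException {
    while (true) {
      long seqno;
      boolean last;
      try {
        seqno = in.readLong();     // placeholder for parsing the real packet header
        last = in.readBoolean();
      } catch (EOFException e) {
        break;
      }
      // ... verify checksums, mirror the packet downstream, write the data to disk ...
      ackQueue.put(new PendingAck(seqno, last)); // hand the packet off to the PacketResponder thread
      if (last) {
        break;
      }
    }
  }
}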
Next, let's see how the PacketResponder returns a response to the upstream DataNode or client. In the PacketResponder thread's run method, sendAckUpstreamUnprotected is eventually called to send the ack upstream, as shown in the figure below:
OK, with that we also know how a DataNode replies with ack information.
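To tie the pieces together, here is a hedged sketch of the ack relay the PacketResponder performs: the last DataNode in the pipeline originates the ack, and every intermediate node waits for the downstream ack, combines it with its own status, and writes the result upstream. The names are mine; the real code reads and builds PipelineAck protobuf messages and sends them in sendAckUpstream / sendAckUpstreamUnprotected.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class AckRelaySketch {
  private final DataInputStream downstreamIn; // null when this node is the last one in the pipeline
  private final DataOutputStream upstreamOut; // reply stream to the previous DataNode or the client

  AckRelaySketch(DataInputStream downstreamIn, DataOutputStream upstreamOut) {
    this.downstreamIn = downstreamIn;
    this.upstreamOut = upstreamOut;
  }

  // Called once per packet taken from the responder's ack queue.
  void relayAck(long expectedSeqno, boolean localWriteOk) throws IOException {
    boolean downstreamOk = true;
    if (downstreamIn != null) {
      long downstreamSeqno = downstreamIn.readLong(); // placeholder for reading a real PipelineAck
      downstreamOk = downstreamIn.readBoolean() && downstreamSeqno == expectedSeqno;
    }
    // The ack sent upstream reflects this node's own status and everything downstream of it.
    upstreamOut.writeLong(expectedSeqno);
    upstreamOut.writeBoolean(localWriteOk && downstreamOk);
    upstreamOut.flush();
  }
}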