Hadoop Source Code Analysis: The Full HDFS Write Flow --- DataNode Processing

  • Overview
  • Receiving Data
  • BlockReceiver: Receiving Packets
    • receivePacket: Receiving Packet Data
  • PacketResponder: Handling Acks

Overview

In a complex distributed file system like HDFS, every file is made up of multiple blocks, each block has multiple replicas, and those replicas live on different machines. So even setting aside error handling, the write path is the most complex flow in HDFS.

Let's start with the classic data-write flow diagram from Hadoop: The Definitive Guide.

[Figure: the HDFS write pipeline, from Hadoop: The Definitive Guide]

The overall flow, in brief (a minimal client-side sketch follows the list):

  1. The client asks the namenode for a block and builds the output pipeline from the datanode list the namenode returns (in the run method of DFSOutputStream.DataStreamer).
  2. The client writes the data to the first datanode.
  3. After receiving the data, the first datanode forwards it to the downstream datanode and returns an ack to its upstream node.
  4. The datanode writes the data it just received to disk.
  5. The datanode reports the newly received block to the namenode, which updates its block map.
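
To make that flow concrete, here is a minimal client-side sketch. It is a hedged illustration, not code from this article: the file path and payload are invented, and it assumes fs.defaultFS in the loaded Configuration points at your cluster. The DFSOutputStream/DataStreamer machinery described below runs underneath these few calls.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // assumption: fs.defaultFS points at the target cluster, e.g. hdfs://nn:8020
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"))) {
          // DFSOutputStream buffers writes into 512-byte chunks and packets;
          // DataStreamer pushes the packets down the datanode pipeline.
          out.write("hello hdfs".getBytes("UTF-8"));
          out.hsync(); // force the pipeline to persist what was written so far
        }
      }
    }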

Receiving Data

When we walked through the HDFS read path, we met DataXceiverServer, the class that listens for streaming data operations on the datanode; writes go through it as well. When DataXceiverServer accepts an incoming data stream, it creates a DataXceiver object to do the actual write handling.

For more on DataXceiverServer, see http://blog.csdn.net/zhangjun5965/article/details/72586037
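
As a rough sketch of what that looks like (simplified from DataXceiverServer's run method; details vary across Hadoop versions, so treat this as an outline rather than verbatim source), the server blocks on accept and hands each new connection to a fresh DataXceiver daemon thread:

    // simplified outline of DataXceiverServer#run
    while (datanode.shouldRun) {
      Peer peer = peerServer.accept();               // block until a client/datanode connects
      new Daemon(datanode.threadGroup,
          DataXceiver.create(peer, datanode, this))  // one thread per streaming op
          .start();
    }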

As with reads, readOp() first reads the opcode; for a write it is 80. The switch on that opcode eventually lands in DataXceiver's writeBlock method.
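
The dispatch itself lives in the Receiver base class that DataXceiver extends. A simplified sketch (based on 2.7-era code; the surrounding error handling is omitted):

    // simplified from org.apache.hadoop.hdfs.protocol.datatransfer.Receiver:
    // readOp() has already read the opcode byte; WRITE_BLOCK is 80
    protected final void processOp(Op op) throws IOException {
      switch (op) {
      case READ_BLOCK:
        opReadBlock();
        break;
      case WRITE_BLOCK:
        opWriteBlock(in);  // decodes the request proto, calls DataXceiver#writeBlock
        break;
      // ... other ops: TRANSFER_BLOCK, COPY_BLOCK, BLOCK_CHECKSUM, etc.
      default:
        throw new IOException("Unknown op " + op + " in data stream");
      }
    }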

Let's start with writeBlock's javadoc and parameters to get a rough picture of the method:

  /**
   * Write a block to a datanode pipeline.
   * The receiver datanode of this call is the next datanode in the pipeline.
   * The other downstream datanodes are specified by the targets parameter.
   * Note that the receiver {@link DatanodeInfo} is not required in the
   * parameter list since the receiver datanode knows its info.  However, the
   * {@link StorageType} for storing the replica in the receiver datanode is a
   * parameter since the receiver datanode may support multiple storage types.

   * @param blk the block being written.
   * @param storageType for storing the replica in the receiver datanode.
   * @param blockToken security token for accessing the block.
   * @param clientName client's name.
   * @param targets other downstream datanodes in the pipeline.
   * @param targetStorageTypes target {@link StorageType}s corresponding
   *                           to the target datanodes.
   * @param source source datanode.
   * @param stage pipeline stage.
   * @param pipelineSize the size of the pipeline.
   * @param minBytesRcvd minimum number of bytes received.
   * @param maxBytesRcvd maximum number of bytes received.
   * @param latestGenerationStamp the latest generation stamp of the block.
   * @param pinning whether to pin the block, so the Balancer won't move it.
   * @param targetPinnings whether to pin the block on the target datanodes.
   */
  public void writeBlock(final ExtendedBlock blk,
      final StorageType storageType, 
      final Token<BlockTokenIdentifier> blockToken,
      final String clientName,
      final DatanodeInfo[] targets,
      final StorageType[] targetStorageTypes, 
      final DatanodeInfo source,
      final BlockConstructionStage stage,
      final int pipelineSize,
      final long minBytesRcvd,
      final long maxBytesRcvd,
      final long latestGenerationStamp,
      final DataChecksum requestedChecksum,
      final CachingStrategy cachingStrategy,
      final boolean allowLazyPersist,
      final boolean pinning,
      final boolean[] targetPinnings) throws IOException;


The parameters carry the basic information needed for the write: the block to be written, the target datanodes, and so on.

The last two parameters, which pin the block, deserve special mention. They are a feature added in Hadoop 2.7.0 to keep certain blocks from being moved when the balancer runs, because some files should stay put, such as the '/hbase' directory that backs HBase.

For details, see: https://issues.apache.org/jira/browse/HDFS-6133
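
On the client side, pinning is driven by favored nodes: if the client passes favored datanodes when creating the file, and dfs.datanode.block-pinning.enabled is true on the datanodes, the replicas written to those nodes get pinned. A hedged usage sketch (the path and host:port pairs are invented for illustration; dfs is assumed to be an already-opened DistributedFileSystem):

    // assumption: dfs is an open org.apache.hadoop.hdfs.DistributedFileSystem
    InetSocketAddress[] favored = {
        new InetSocketAddress("dn1.example.com", 50010),  // hypothetical datanodes
        new InetSocketAddress("dn2.example.com", 50010)
    };
    FSDataOutputStream out = dfs.create(
        new Path("/hbase/data/t1"),        // e.g. a path HBase owns
        FsPermission.getFileDefault(),
        true,                              // overwrite
        4096,                              // buffer size
        (short) 3,                         // replication
        128 * 1024 * 1024L,                // block size
        null,                              // no progress callback
        favored);                          // favored nodes => pinning on those DNs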

Now let's walk through the body of writeBlock:

    // whether the request was initiated by a datanode (empty client name)
    final boolean isDatanode = clientname.length() == 0;
    // whether the request was initiated by a client
    final boolean isClient = !isDatanode;
    // whether this is a block transfer (replication) operation
    final boolean isTransfer = stage == BlockConstructionStage.TRANSFER_RBW
        || stage == BlockConstructionStage.TRANSFER_FINALIZED;



    // reply to upstream datanode or client
    // output stream to the upstream datanode or client, used for acks
    final DataOutputStream replyOut = getBufferedOutputStream();

Next, a few variables are declared for the socket and streams to the downstream (mirror) node:



    DataOutputStream mirrorOut = null;  // stream to next target (downstream output)
    DataInputStream mirrorIn = null;    // reply from next target (downstream input)
    Socket mirrorSock = null;           // socket to next target
    String mirrorNode = null;           // the name:port of next target
    String firstBadLink = "";           // first datanode that failed in connection setup

Then a BlockReceiver is constructed to receive the block:


        // open a block receiver
        blockReceiver = new BlockReceiver(block, storageType, in,
            peer.getRemoteAddressString(),
            peer.getLocalAddressString(),
            stage, latestGenerationStamp, minBytesRcvd, maxBytesRcvd,
            clientname, srcDataNode, datanode, requestedChecksum,
            cachingStrategy, allowLazyPersist, pinning);

If targets is non-empty, the datanode first connects to the first downstream node so it can forward the received block data along the pipeline.


       // if there is a downstream node
      if (targets.length > 0) {

        InetSocketAddress mirrorTarget = null;
        // Connect to backup machine
        // take the next node from the targets list
        mirrorNode = targets[0].getXferAddr(connectToDnViaHostname);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Connecting to datanode " + mirrorNode);
        }
        mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
        mirrorSock = datanode.newSocket();
        try {
          ....................
          // connect to the next datanode
          NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
          mirrorSock.setSoTimeout(timeoutValue);
          mirrorSock.setSendBufferSize(HdfsConstants.DEFAULT_DATA_SOCKET_SIZE);

          ......................
          // build mirrorOut, the output stream for writing data downstream,
          // and mirrorIn, the input stream for reading acks from downstream
          mirrorOut = new DataOutputStream(new BufferedOutputStream(unbufMirrorOut,
              HdfsConstants.SMALL_BUFFER_SIZE));
          mirrorIn = new DataInputStream(unbufMirrorIn);


          // Just like the client sending its write request to the first datanode,
          // a Sender is constructed to wrap the write parameters; its writeBlock
          // method sends them, preceded by opcode 80, to the next datanode.
          // Do not propagate allowLazyPersist to downstream DataNodes.
          if (targetPinnings != null && targetPinnings.length > 0) {
            new Sender(mirrorOut).writeBlock(originalBlock, targetStorageTypes[0],
              blockToken, clientname, targets, targetStorageTypes, srcDataNode,
              stage, pipelineSize, minBytesRcvd, maxBytesRcvd,
              latestGenerationStamp, requestedChecksum, cachingStrategy,
              false, targetPinnings[0], targetPinnings);
          } else {
            new Sender(mirrorOut).writeBlock(originalBlock, targetStorageTypes[0],
              blockToken, clientname, targets, targetStorageTypes, srcDataNode,
              stage, pipelineSize, minBytesRcvd, maxBytesRcvd,
              latestGenerationStamp, requestedChecksum, cachingStrategy,
              false, false, targetPinnings);
          }

          // flush the stream to the downstream node
          mirrorOut.flush();

          DataNodeFaultInjector.get().writeBlockAfterFlush();

          // read connect ack (only for clients, not for replication req)
          // parse the downstream node's connect status from mirrorIn
          if (isClient) {
            BlockOpResponseProto connectAck =
              BlockOpResponseProto.parseFrom(PBHelper.vintPrefixed(mirrorIn));
            mirrorInStatus = connectAck.getStatus();
            firstBadLink = connectAck.getFirstBadLink();
            if (LOG.isDebugEnabled() || mirrorInStatus != SUCCESS) {
              LOG.info("Datanode " + targets.length +
                       " got response for connect ack " +
                       " from downstream datanode with firstbadlink as " +
                       firstBadLink);
            }
          }

        } catch (IOException e) {
          // on exception, close the streams and socket
          .................
          IOUtils.closeStream(mirrorOut);
          mirrorOut = null;
          IOUtils.closeStream(mirrorIn);
          mirrorIn = null;
          IOUtils.closeSocket(mirrorSock);
          mirrorSock = null;
         .........................
        }



      }
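
One detail hidden in the elided setup code above is worth calling out: the connect and I/O timeouts grow with the number of remaining downstream nodes, so an upstream node always waits longer than its downstream ones and failures get attributed to the right hop. Roughly (constant and field names as in the 2.7-era source; exact values are version-dependent):

    // each remaining hop adds an extension, keeping timeouts strictly
    // ordered along the pipeline (upstream waits longer than downstream)
    int timeoutValue = dnConf.socketTimeout
        + (HdfsServerConstants.READ_TIMEOUT_EXTENSION * targets.length);
    int writeTimeout = dnConf.socketWriteTimeout
        + (HdfsServerConstants.WRITE_TIMEOUT_EXTENSION * targets.length);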

Next comes a check: if the request came from a client and this is not a transfer (replication) of an RBW/finalized block, the datanode forwards a connect ack to its upstream node:


      // send connect-ack to source for clients and not transfer-RBW/Finalized
      if (isClient && !isTransfer) {
        if (LOG.isDebugEnabled() || mirrorInStatus != SUCCESS) {
          LOG.info("Datanode " + targets.length +
                   " forwarding connect ack to upstream firstbadlink is " +
                   firstBadLink);
        }
        BlockOpResponseProto.newBuilder()
          .setStatus(mirrorInStatus)
          .setFirstBadLink(firstBadLink)
          .build()
          .writeDelimitedTo(replyOut);
        replyOut.flush();
      }

Then BlockReceiver's receiveBlock method receives packets in a loop and writes their data to local disk:


      if (blockReceiver != null) {
        String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
        blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
            mirrorAddr, null, targets, false);

        // send close-ack for transfer-RBW/Finalized 
        // for a block transfer, just set the status to SUCCESS and reply to the upstream node
        if (isTransfer) {
          if (LOG.isTraceEnabled()) {
            LOG.trace("TRANSFER: send close-ack");
          }
          writeResponse(SUCCESS, null, replyOut);
        }
      }

Finally, update the block's generation stamp:

      // update its generation stamp
      if (isClient && 
          stage == BlockConstructionStage.PIPELINE_CLOSE_RECOVERY) {
        block.setGenerationStamp(latestGenerationStamp);
        block.setNumBytes(minBytesRcvd);
      }

BlockReceiver: Receiving Packets

On the datanode, the work of receiving a block is delegated to BlockReceiver.receiveBlock.

It first starts a daemon thread, PacketResponder, which handles the acks returned by the downstream datanode and forwards them upstream:

      if (isClient && !isTransfer) {
        responder = new Daemon(datanode.threadGroup, 
            new PacketResponder(replyOut, mirrIn, downstreams));
        responder.start(); // start thread to process responses
      }

Then a while loop keeps receiving packets until the last one arrives:


while (receivePacket() >= 0) { /* Receive until the last packet */ }
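
Each packet that receivePacket persists is also handed to the responder: the receive loop is the producer and PacketResponder is the consumer of a shared ackQueue. A simplified sketch of the enqueue call inside receivePacket (the real code has separate sync and async paths, so treat this as an outline):

    // simplified from BlockReceiver#receivePacket: once the packet's data
    // has been written (and optionally synced), queue it so the responder
    // can ack it after the matching downstream ack arrives
    if (responder != null) {
      ((PacketResponder) responder.getRunnable()).enqueue(
          seqno, lastPacketInBlock, offsetInBlock, Status.SUCCESS);
    }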

receivePacket: Receiving Packet Data


  /**
   * Receives and processes a packet. It can contain many chunks
   * (512-byte chunks of data).
   * Returns the number of data bytes that the packet has.
   */
  private int receivePacket() throws IOException {
          // read the next packet from the input stream
            packetReceiver.receiveNextPacket(in);

            // get the packet header and validate it
            PacketHeader header = packetReceiver.getHeader();  

            ....................

            // grab slices over the received data and its checksums
            ByteBuffer dataBuf = packetReceiver.getDataSlice();
            ByteBuffer checksumBuf = packetReceiver.getChecksumSlice();


              .......

           // Write data to disk.
           // after the verification above, write the data to disk
          long begin = Time.monotonicNow();
          out.write(dataBuf.array(), startByteToDisk, numBytesToDisk);



          /// flush entire packet, sync if requested
          // flush, and sync if requested
          flushOrSync(syncBlock);


          // metrics bookkeeping
          datanode.metrics.incrBytesWritten(len);
          datanode.metrics.incrTotalWriteTime(duration);

          // throttle I/O if a throttler is configured
         if (throttler != null) { // throttle I/O
            throttler.throttle(len);
         }

  }
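
The "verification" mentioned above is chunked checksum verification: the data slice is cut into 512-byte chunks, and each chunk's CRC is compared against the corresponding entry in the checksum slice. A self-contained sketch using the public DataChecksum helper (the CRC32C type and 512-byte chunk size are assumptions; the real values come from the DataChecksum negotiated for the pipeline):

    import java.nio.ByteBuffer;
    import org.apache.hadoop.util.DataChecksum;

    public class ChunkVerifyExample {
      public static void main(String[] args) throws Exception {
        // assumption: CRC32C over 512-byte chunks, a common HDFS configuration
        DataChecksum checksum =
            DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, 512);

        byte[] data = new byte[1024];  // two full chunks of (zeroed) data
        byte[] sums = new byte[2 * checksum.getChecksumSize()];
        checksum.calculateChunkedSums(
            ByteBuffer.wrap(data), ByteBuffer.wrap(sums));

        // throws ChecksumException on mismatch; this mirrors what
        // BlockReceiver#verifyChunks does with the packet's two slices
        checksum.verifyChunkedSums(
            ByteBuffer.wrap(data), ByteBuffer.wrap(sums), "demo-block", 0);
        System.out.println("checksums verified");
      }
    }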

PacketResponder: Handling Acks

PacketResponder receives block acks from the downstream node and forwards them upstream. The overall flow (the full run() method follows the list):

  1. Read an ack from the downstream datanode.
  2. If it is an OOB (out-of-band) message, relay it upstream immediately.
  3. Call waitForAckHead to take the head packet from the ackQueue.
  4. Compare the seqno of the downstream ack with the seqno of the queued packet; if they differ, throw an exception.
  5. If there is a downstream node, compute the total ack time.
  6. If the thread was interrupted, set running = false to end the while loop.
  7. If this is the last packet of the block, call finalizeBlock(startTime) to finalize the block and close the block file.
  8. Merge this node's ack with the downstream ack and send the result upstream.
  9. Remove the corresponding packet from the ackQueue.


    /**
     * Thread to process incoming acks.
     * @see java.lang.Runnable#run()
     */
    @Override
    public void run() {
      boolean lastPacketInBlock = false;
      final long startTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime() : 0;
      while (isRunning() && !lastPacketInBlock) {
        long totalAckTimeNanos = 0;
        boolean isInterrupted = false;
        try {
          Packet pkt = null;
          long expected = -2;
          PipelineAck ack = new PipelineAck();
          long seqno = PipelineAck.UNKOWN_SEQNO;
          long ackRecvNanoTime = 0;
          try {
            // not the last node in the pipeline, and no mirror error
            if (type != PacketResponderType.LAST_IN_PIPELINE && !mirrorError) {
              // read an ack from downstream datanode
              ack.readFields(downstreamIn); // read an ack from the downstream node
              ackRecvNanoTime = System.nanoTime();
              if (LOG.isDebugEnabled()) {
                LOG.debug(myString + " got " + ack);
              }
              // Process an OOB ACK: relay it upstream immediately.
              Status oobStatus = ack.getOOBStatus();
              if (oobStatus != null) {
                LOG.info("Relaying an out of band ack of type " + oobStatus);
                sendAckUpstream(ack, PipelineAck.UNKOWN_SEQNO, 0L, 0L,
                    PipelineAck.combineHeader(datanode.getECN(),
                      Status.SUCCESS));
                continue;
              }
              seqno = ack.getSeqno();
            }
            if (seqno != PipelineAck.UNKOWN_SEQNO
                || type == PacketResponderType.LAST_IN_PIPELINE) {
              // take the head packet from the ackQueue
              pkt = waitForAckHead(seqno);
              if (!isRunning()) {
                break;
              }
              expected = pkt.seqno;
              // if this node has a downstream node and the seqnos don't match, throw
              if (type == PacketResponderType.HAS_DOWNSTREAM_IN_PIPELINE
                  && seqno != expected) {
                throw new IOException(myString + "seqno: expected=" + expected
                    + ", received=" + seqno);
              }
              if (type == PacketResponderType.HAS_DOWNSTREAM_IN_PIPELINE) {
                // The total ack time includes the ack times of downstream
                // nodes.
                // The value is 0 if this responder doesn't have a downstream
                // DN in the pipeline.
                totalAckTimeNanos = ackRecvNanoTime - pkt.ackEnqueueNanoTime;
                // Report the elapsed time from ack send to ack receive minus
                // the downstream ack time.
                long ackTimeNanos = totalAckTimeNanos
                    - ack.getDownstreamAckTimeNanos();
                if (ackTimeNanos < 0) {
                  if (LOG.isDebugEnabled()) {
                    LOG.debug("Calculated invalid ack time: " + ackTimeNanos
                        + "ns.");
                  }
                } else {
                  datanode.metrics.addPacketAckRoundTripTimeNanos(ackTimeNanos);
                }
              }
              lastPacketInBlock = pkt.lastPacketInBlock;
            }
          } catch (InterruptedException ine) {
            isInterrupted = true;
          } catch (IOException ioe) {
            if (Thread.interrupted()) {
              isInterrupted = true;
            } else if (ioe instanceof EOFException && !packetSentInTime()) {
              // The downstream error was caused by upstream including this
              // node not sending packet in time. Let the upstream determine
              // who is at fault.  If the immediate upstream node thinks it
              // has sent a packet in time, this node will be reported as bad.
              // Otherwise, the upstream node will propagate the error up by
              // closing the connection.
              LOG.warn("The downstream error might be due to congestion in " +
                  "upstream including this node. Propagating the error: ",
                  ioe);
              throw ioe;
            } else {
              // continue to run even if can not read from mirror
              // notify client of the error
              // and wait for the client to shut down the pipeline
              mirrorError = true;
              LOG.info(myString, ioe);
            }
          }

          // if the thread was interrupted, running = false ends the while loop
          if (Thread.interrupted() || isInterrupted) {
            /*
             * The receiver thread cancelled this thread. We could also check
             * any other status updates from the receiver thread (e.g. if it is
             * ok to write to replyOut). It is prudent to not send any more
             * status back to the client because this datanode has a problem.
             * The upstream datanode will detect that this datanode is bad, and
             * rightly so.
             *
             * The receiver thread can also interrupt this thread for sending
             * an out-of-band response upstream.
             */
            LOG.info(myString + ": Thread is interrupted.");
            running = false;
            continue;
          }

          if (lastPacketInBlock) {
            // Finalize the block and close the block file
            finalizeBlock(startTime);
          }

          Status myStatus = pkt != null ? pkt.ackStatus : Status.SUCCESS;
          // merge this node's ack with the downstream ack and send it upstream
          sendAckUpstream(ack, expected, totalAckTimeNanos,
            (pkt != null ? pkt.offsetInBlock : 0),
            PipelineAck.combineHeader(datanode.getECN(), myStatus));
          // remove the acked packet from the ackQueue
          if (pkt != null) {
            // remove the packet from the ack queue
            removeAckHead();
          }
        } catch (IOException e) {
          LOG.warn("IOException in BlockReceiver.run(): ", e);
          if (running) {
            datanode.checkDiskErrorAsync();
            LOG.info(myString, e);
            running = false;
            if (!Thread.interrupted()) { // failure not caused by interruption
              receiverThread.interrupt();
            }
          }
        } catch (Throwable e) {
          if (running) {
            LOG.info(myString, e);
            running = false;
            receiverThread.interrupt();
          }
        }
      }
      LOG.info(myString + " terminating");
    }
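
Step 8 of the list above, merging this node's status with the downstream replies, happens inside sendAckUpstream's helper. Roughly (a simplified sketch with names from the 2.7-era code; other versions differ):

    // slot 0 carries this datanode's combined header (ECN + status); the
    // remaining slots copy the downstream datanodes' headers from the ack
    int ackLen = type == PacketResponderType.LAST_IN_PIPELINE
        ? 0 : ack.getNumOfReplies();
    int[] replies = new int[ackLen + 1];
    replies[0] = myHeader;                    // this node's header first
    for (int i = 0; i < ackLen; i++) {
      replies[i + 1] = ack.getHeaderFlag(i);  // then each downstream node
    }
    PipelineAck replyAck = new PipelineAck(expected, replies, totalAckTimeNanos);
    replyAck.write(upstreamOut);              // off to the upstream DN / client
    upstreamOut.flush();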
