Hadoop source code analysis: the complete HDFS write path, client-side processing

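  • Introduction to DFSOutputStream
    • Overview of DFSOutputStream
    • Important variables in DFSOutputStream
    • The data-processing thread class DataStreamer
    • The response-handling class ResponseProcessor
  • Processing flow
    • The client puts data into dataQueue
    • DataStreamer processes the data in dataQueue
      • Handling errors
      • Creating the output stream and sending data
        • Requesting a block from the namenode
        • Connecting to the first datanode
      • Setting up the pipeline
      • Initializing data streaming
      • Sending packets
    • Closing the data stream
    • ResponseProcessor handles the replies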

 

Introduction to DFSOutputStream

Overview of DFSOutputStream

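This section covers the client side of the HDFS write path. The client does its work mainly through a DFSOutputStream object which, as the name suggests, wraps the output stream of the DFS file system. We start with a closer look at the important classes involved and their fields.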

The class Javadoc already states the main job of DFSOutputStream quite clearly, so it is worth reading first.

/****************************************************************
 * DFSOutputStream creates files from a stream of bytes.
 *
 * The client application writes data that is cached internally by
 * this stream. Data is broken up into packets, each packet is
 * typically 64K in size. A packet comprises of chunks. Each chunk
 * is typically 512 bytes and has an associated checksum with it.
 *
 * When a client application fills up the currentPacket, it is
 * enqueued into dataQueue.  The DataStreamer thread picks up
 * packets from the dataQueue, sends it to the first datanode in
 * the pipeline and moves it from the dataQueue to the ackQueue.
 * The ResponseProcessor receives acks from the datanodes. When a
 * successful ack for a packet is received from all datanodes, the
 * ResponseProcessor removes the corresponding packet from the
 * ackQueue.
 *
 * In case of error, all outstanding packets are moved from
 * ackQueue. A new pipeline is setup by eliminating the bad
 * datanode from the original pipeline. The DataStreamer now
 * starts sending packets from the dataQueue.
****************************************************************/
@InterfaceAudience.Private
public class DFSOutputStream extends FSOutputSummer
    implements Syncable, CanSetDropBehind { }
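
The Javadoc's numbers (64K packets, 512-byte chunks, one checksum per chunk) imply a fixed number of chunks per packet. The following is a minimal sketch of that arithmetic, not the Hadoop implementation: the class name PacketLayoutSketch is hypothetical, and the 4-byte checksum and 33-byte header reserve are assumptions chosen for illustration.

```java
// Sketch only: packet/chunk arithmetic implied by the Javadoc above.
final class PacketLayoutSketch {

    /**
     * Number of chunks that fit into one packet.
     * headerLen is an assumed upper bound on the packet header size.
     */
    static int chunksPerPacket(int packetSize, int bytesPerChecksum,
                               int checksumSize, int headerLen) {
        int bodySize = packetSize - headerLen;            // room left for chunks
        int chunkSize = bytesPerChecksum + checksumSize;  // data plus its checksum
        return Math.max(bodySize / chunkSize, 1);         // always at least one chunk
    }

    public static void main(String[] args) {
        // Assumed values: 64K packet, 512-byte chunk, 4-byte checksum, 33-byte header.
        int n = chunksPerPacket(64 * 1024, 512, 4, 33);
        System.out.println("chunks per packet: " + n);
        System.out.println("payload bytes per packet: " + n * 512);
    }
}
```

Under these assumed values a packet carries 126 chunks, so slightly less than 64K of user data travels in each packet.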

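Important variables in DFSOutputStream

The two most important fields are the queues dataQueue and ackQueue. Both follow the classic producer/consumer pattern: for dataQueue the producer is the client and the consumer is DataStreamer; for ackQueue the producer is DataStreamer and the consumer is ResponseProcessor.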

/**
   * dataQueue and ackQueue are two very important fields; both are linked lists of DFSPacket objects.
   * dataQueue holds the packets waiting to be sent: data written by the client is buffered here first.
   * ackQueue holds the packets that have been sent but not yet acknowledged; a packet stays here
   * until the acks from the datanodes arrive.
   */
  // both dataQueue and ackQueue are protected by dataQueue lock
  private final LinkedList<DFSPacket> dataQueue = new LinkedList<DFSPacket>();
  private final LinkedList<DFSPacket> ackQueue = new LinkedList<DFSPacket>();
  private DFSPacket currentPacket = null; // the packet currently being filled
  private DataStreamer streamer;
  private long currentSeqno = 0;
  private long lastQueuedSeqno = -1;
  private long lastAckedSeqno = -1;
  private long bytesCurBlock = 0; // bytes written in current block
  private int packetSize = 0; // write packet size, not including the header.
  private int chunksPerPacket = 0;
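
To make the life cycle of a packet across the two queues concrete, here is a small sketch under simplifying assumptions: packets are represented only by their sequence numbers, and TwoQueueSketch with its method names is hypothetical, not part of Hadoop. It keeps the one detail the source comment stresses, that both queues are guarded by the dataQueue lock.

```java
import java.util.LinkedList;

// Sketch only: a packet (here just its seqno) moves dataQueue -> ackQueue -> done.
final class TwoQueueSketch {
    private final LinkedList<Long> dataQueue = new LinkedList<>(); // waiting to be sent
    private final LinkedList<Long> ackQueue = new LinkedList<>();  // sent, ack pending
    private long lastAckedSeqno = -1;

    /** Client side: append a packet to the tail of dataQueue. */
    void enqueue(long seqno) {
        synchronized (dataQueue) {
            dataQueue.addLast(seqno);
            dataQueue.notifyAll();
        }
    }

    /** Streamer side: move the head of dataQueue to the tail of ackQueue; -1 if empty. */
    long sendNext() {
        synchronized (dataQueue) {
            if (dataQueue.isEmpty()) {
                return -1;
            }
            long seqno = dataQueue.removeFirst();
            ackQueue.addLast(seqno);
            dataQueue.notifyAll();
            return seqno;
        }
    }

    /** Responder side: the head of ackQueue has been acknowledged (queue must not be empty). */
    void acked() {
        synchronized (dataQueue) {
            lastAckedSeqno = ackQueue.removeFirst();
            dataQueue.notifyAll();
        }
    }

    int pendingData() { synchronized (dataQueue) { return dataQueue.size(); } }
    int pendingAcks() { synchronized (dataQueue) { return ackQueue.size(); } }
    long lastAckedSeqno() { synchronized (dataQueue) { return lastAckedSeqno; } }

    public static void main(String[] args) {
        TwoQueueSketch q = new TwoQueueSketch();
        q.enqueue(0);
        q.enqueue(1);
        System.out.println("sent " + q.sendNext() + ", acks pending: " + q.pendingAcks());
        q.acked();
        System.out.println("last acked: " + q.lastAckedSeqno());
    }
}
```

Using one lock for both queues means a packet can never be observed in neither queue or in both, which is what makes the later recovery step safe.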

The data-processing thread class DataStreamer

DataStreamer is the core class for sending the data. Its Javadoc explains:

/**
   * The DataStreamer class is responsible for sending data packets to the
   * datanodes in the pipeline. It retrieves a new blockid and block locations
   * from the namenode, and starts streaming packets to the pipeline of
   * Datanodes. Every packet has a sequence number associated with
   * it. When all the packets for a block are sent out and acks for each
   * of them are received, the DataStreamer closes the current block.
   */
  class DataStreamer extends Daemon {
      
    private volatile boolean streamerClosed = false;
    private volatile ExtendedBlock block; // its length is number of bytes acked
    private Token<BlockTokenIdentifier> accessToken;
    private DataOutputStream blockStream; // output stream used to send data
    private DataInputStream blockReplyStream; // input stream used to receive acks
    private ResponseProcessor response = null;
    private volatile DatanodeInfo[] nodes = null; // list of targets for current block
    private volatile StorageType[] storageTypes = null;
    private volatile String[] storageIDs = null;
      
    ......................  
      
  }

The response-handling class ResponseProcessor

ResponseProcessor is an inner class of DataStreamer (it extends Daemon, not DataStreamer) and handles the acks that come back from the datanodes.

    // Processes responses from the datanodes.  A packet is removed
    // from the ackQueue when its response arrives.
    //
    private class ResponseProcessor extends Daemon {}

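Processing flow

The client puts data into dataQueue

Creating a file returns an FSDataOutputStream. Calling its write method to write data eventually reaches org.apache.hadoop.fs.FSOutputSummer.write(byte[], int, int).

write calls write1() in a loop until len bytes have been written. When the buffered data fills a chunk (the checksum unit, not an HDFS block), the abstract method writeChunk is called to write it; the concrete implementation is the method of the same name in org.apache.hadoop.hdfs.DFSOutputStream.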
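
As an illustration of the chunking step, the sketch below splits a byte array into chunks of at most bytesPerChecksum bytes and computes one checksum per chunk. It is a simplified stand-in, not FSOutputSummer itself: ChunkChecksumSketch is a hypothetical name, and java.util.zip.CRC32 is used only because it ships with the JDK, which may differ from the checksum type a real cluster is configured with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Sketch only: one checksum per chunk, the last chunk may be shorter.
final class ChunkChecksumSketch {

    /** Returns one CRC32 value per chunk of at most bytesPerChecksum bytes. */
    static List<Long> checksumChunks(byte[] data, int bytesPerChecksum) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);   // checksum covers exactly this chunk
            sums.add(crc.getValue());
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];     // 512 + 512 + 276 bytes
        System.out.println("chunks: " + checksumChunks(data, 512).size());
    }
}
```

Each (chunk, checksum) pair corresponds to one writeChunk call, which is why writeChunkImpl rejects a len larger than bytesPerChecksum.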

The actual write happens in writeChunkImpl, shown below:

private synchronized void writeChunkImpl(byte[] b, int offset, int len,
          byte[] checksum, int ckoff, int cklen) throws IOException {
    dfsClient.checkOpen();
    checkClosed();

    if (len > bytesPerChecksum) {
      throw new IOException("writeChunk() buffer size is " + len +
                            " is larger than supported  bytesPerChecksum " +
                            bytesPerChecksum);
    }
    if (cklen != 0 && cklen != getChecksumSize()) {
      throw new IOException("writeChunk() checksum size is supposed to be " +
                            getChecksumSize() + " but found to be " + cklen);
    }

    if (currentPacket == null) {
      currentPacket = createPacket(packetSize, chunksPerPacket, 
          bytesCurBlock, currentSeqno++, false);
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("DFSClient writeChunk allocating new packet seqno=" + 
            currentPacket.getSeqno() +
            ", src=" + src +
            ", packetSize=" + packetSize +
            ", chunksPerPacket=" + chunksPerPacket +
            ", bytesCurBlock=" + bytesCurBlock);
      }
    }

    currentPacket.writeChecksum(checksum, ckoff, cklen);
    currentPacket.writeData(b, offset, len);
    currentPacket.incNumChunks();
    bytesCurBlock += len;

    // If packet is full, enqueue it for transmission
    // When a DFSPacket is full, waitAndQueueCurrentPacket() adds it to dataQueue.
    if (currentPacket.getNumChunks() == currentPacket.getMaxChunks() ||
        bytesCurBlock == blockSize) {
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("DFSClient writeChunk packet full seqno=" +
            currentPacket.getSeqno() +
            ", src=" + src +
            ", bytesCurBlock=" + bytesCurBlock +
            ", blockSize=" + blockSize +
            ", appendChunk=" + appendChunk);
      }
      waitAndQueueCurrentPacket();

      // If the reopened file did not end at chunk boundary and the above
      // write filled up its partial chunk. Tell the summer to generate full 
      // crc chunks from now on.
      if (appendChunk && bytesCurBlock%bytesPerChecksum == 0) {
        appendChunk = false;
        resetChecksumBufSize();
      }

      if (!appendChunk) {
        int psize = Math.min((int)(blockSize-bytesCurBlock), dfsClient.getConf().writePacketSize);
        computePacketChunkSize(psize, bytesPerChecksum);
      }
      //
      // if encountering a block boundary, send an empty packet to 
      // indicate the end of block and reset bytesCurBlock.
      //
      if (bytesCurBlock == blockSize) {
        currentPacket = createPacket(0, 0, bytesCurBlock, currentSeqno++, true);
        currentPacket.setSyncBlock(shouldSyncBlock);
        waitAndQueueCurrentPacket();
        bytesCurBlock = 0;
        lastFlushOffset = 0;
      }
    }
  }

When the packet is full, waitAndQueueCurrentPacket is called to put it into dataQueue. The method first checks the queue sizes: while the combined size of dataQueue and ackQueue exceeds writeMaxPackets (default 80), it waits until there is enough room.

private void waitAndQueueCurrentPacket() throws IOException {
    synchronized (dataQueue) {
      try {
      // If queue is full, then wait till we have enough space
        boolean firstWait = true;
        try {
         // wait while there is not enough room in the queues
          while (!isClosed() && dataQueue.size() + ackQueue.size() >
              dfsClient.getConf().writeMaxPackets) {
                    ..................
            try {
              dataQueue.wait();
            } catch (InterruptedException e) {
                ..............
            }
          }
        } finally {
         ...............
        }
        checkClosed();
        // enqueue the packet
        queueCurrentPacket();
      } catch (ClosedChannelException e) {
      }
    }
  }
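
The wait loop above is ordinary monitor-based back-pressure. Here is a self-contained sketch of the same idea with a single queue; BackPressureSketch and its methods are hypothetical names, and the limit check mirrors the strictly-greater-than comparison in the snippet above rather than any other detail of the real class.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch only: the producer blocks while the queue is over its limit.
final class BackPressureSketch {
    private final LinkedList<Integer> queue = new LinkedList<>();
    private final int maxPackets;

    BackPressureSketch(int maxPackets) {
        this.maxPackets = maxPackets;
    }

    /** Producer: block while the queue holds more than maxPackets entries. */
    void put(int packet) throws InterruptedException {
        synchronized (queue) {
            while (queue.size() > maxPackets) {
                queue.wait();          // released by the consumer's notifyAll()
            }
            queue.addLast(packet);
            queue.notifyAll();         // wake a consumer waiting for data
        }
    }

    /** Consumer: block until a packet is available, then free a slot. */
    int take() throws InterruptedException {
        synchronized (queue) {
            while (queue.isEmpty()) {
                queue.wait();
            }
            int packet = queue.removeFirst();
            queue.notifyAll();         // wake a producer waiting for room
            return packet;
        }
    }

    /** Runs one producer thread against the calling thread as consumer. */
    static List<Integer> demo(int count, int maxPackets) {
        BackPressureSketch q = new BackPressureSketch(maxPackets);
        List<Integer> consumed = new ArrayList<>();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    q.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            for (int i = 0; i < count; i++) {
                consumed.add(q.take());
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(demo(10, 2));
    }
}
```

The condition is re-checked in a while loop because notifyAll() wakes every waiter, and only some of them will find the state they were waiting for.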

Finally it calls queueCurrentPacket, which actually puts the packet into the queue.

private void queueCurrentPacket() {
    synchronized (dataQueue) {
      if (currentPacket == null) return;
      currentPacket.addTraceParent(Trace.currentSpan());
      dataQueue.addLast(currentPacket); // append the packet to the tail of the queue
      lastQueuedSeqno = currentPacket.getSeqno();
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Queued packet " + currentPacket.getSeqno());
      }
      currentPacket = null; // clear the current packet so the next one can be filled
      dataQueue.notifyAll(); // wake all threads waiting on dataQueue
    }
  }

In short, queueCurrentPacket appends the DFSPacket to dataQueue with dataQueue.addLast(currentPacket); and then calls dataQueue.notifyAll(); to wake every thread waiting on dataQueue so that the data gets processed.


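DataStreamer processes the data in dataQueue

The core logic DataStreamer uses to send data lives in its run method.

Handling errors

At the top of each iteration it first checks whether an error has occurred.

The handling is done by the private method processDatanodeError: if an error is detected, every packet in ackQueue is moved back to dataQueue, and a new stream is then created to resend the data.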
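
The requeue step can be sketched in a few lines. This is an illustration of the idea rather than the body of processDatanodeError; RequeueSketch and requeueUnacked are hypothetical names. The point it shows is ordering: unacknowledged packets have to go back to the front of dataQueue so they are resent before anything that was never sent.

```java
import java.util.Arrays;
import java.util.LinkedList;

// Sketch only: put unacknowledged packets back in front of the unsent ones.
final class RequeueSketch {

    /** Moves every packet in ackQueue to the head of dataQueue, preserving order. */
    static <T> void requeueUnacked(LinkedList<T> dataQueue, LinkedList<T> ackQueue) {
        synchronized (dataQueue) {          // same lock guards both queues
            dataQueue.addAll(0, ackQueue);  // insert at index 0, in ackQueue order
            ackQueue.clear();
        }
    }

    public static void main(String[] args) {
        LinkedList<Integer> dataQueue = new LinkedList<>(Arrays.asList(3, 4));
        LinkedList<Integer> ackQueue = new LinkedList<>(Arrays.asList(1, 2));
        requeueUnacked(dataQueue, ackQueue);
        System.out.println("dataQueue=" + dataQueue + " ackQueue=" + ackQueue);
    }
}
```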

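Creating the output stream and sending data

The nextBlockOutputStream() method sets up the output stream to the datanode.

Requesting a block from the namenode

The locateFollowingBlock method requests the block. The relevant call is: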
dfsClient.namenode.addBlock(src, dfsClient.clientName, block, excludedNodes, fileId, favoredNodes);

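dfsClient obtains the namenode proxy and requests a new block through addBlock. While allocating the new block, addBlock also commits the previous one, which is the block argument.
The excludedNodes argument lists the datanodes to exclude when allocating the block,
and the favoredNodes argument lists the datanodes to prefer.

Connecting to the first datanode

A successful request returns a LocatedBlock object containing the information about the target datanodes.

createBlockOutputStream then connects to the first datanode: it creates a DataOutputStream on the connection and builds a Sender, which sends the datanode a write-block request with opcode 80. On the datanode side, the data is received and handled by DataXceiver.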

new Sender(out).writeBlock(blockCopy, nodeStorageTypes[0], accessToken,
      dfsClient.clientName, nodes, nodeStorageTypes, null, bcs, 
      nodes.length, block.getNumBytes(), bytesSent, newGS,
      checksum4WriteBlock, cachingStrategy.get(), isLazyPersistFile,
    (targetPinnings == null ? false :targetPinnings[0]), targetPinnings);

Requesting the block and connecting to the datanode both happen inside a do-while loop, so a failure leads to another attempt; the default retry count is three.
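
The shape of that loop can be sketched as follows. Everything here is a stand-in: BlockSetupRetrySketch is a hypothetical class, the allocate function plays the role of the addBlock call, the connect predicate plays the role of createBlockOutputStream, and the exact loop condition in Hadoop may differ from the one assumed here (one initial attempt plus up to retries further attempts).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch only: allocate a target, try to connect, exclude it and retry on failure.
final class BlockSetupRetrySketch {

    /**
     * allocate: given the excluded nodes, returns the first node of a new pipeline.
     * connect:  returns true if the connection to that node succeeded.
     */
    static String nextBlockTarget(Function<List<String>, String> allocate,
                                  Predicate<String> connect,
                                  int retries) {
        List<String> excluded = new ArrayList<>();
        int remaining = retries;
        boolean success;
        String node;
        do {
            node = allocate.apply(excluded);
            success = connect.test(node);
            if (!success) {
                excluded.add(node);   // do not ask for this node again
            }
        } while (!success && --remaining >= 0);
        if (!success) {
            throw new IllegalStateException("Unable to create new block.");
        }
        return node;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("dn1", "dn2", "dn3");
        String target = nextBlockTarget(
            ex -> all.stream().filter(n -> !ex.contains(n)).findFirst().get(),
            n -> !n.equals("dn1"),    // pretend dn1 is unreachable
            3);
        System.out.println("connected to " + target);
    }
}
```

Feeding the failed node back into the exclusion list is what keeps the retry from being handed the same bad datanode again.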

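Setting up the pipeline

Once nextBlockOutputStream has successfully returned the datanode information, setPipeline records the pipeline. The method is simple: it just assigns the allocated datanodes to the corresponding fields.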

private void setPipeline(LocatedBlock lb) {
      setPipeline(lb.getLocations(), lb.getStorageTypes(), lb.getStorageIDs());
    }
    private void setPipeline(DatanodeInfo[] nodes, StorageType[] storageTypes,
        String[] storageIDs) {
      this.nodes = nodes;
      this.storageTypes = storageTypes;
      this.storageIDs = storageIDs;
    }

Initializing data streaming

initDataStreaming mainly creates a ResponseProcessor for the datanode list, starts it by calling start, and sets the stage to DATA_STREAMING.

/**
     * Initialize for data streaming
     */
    private void initDataStreaming() {
      this.setName("DataStreamer for file " + src +
          " block " + block);
      response = new ResponseProcessor(nodes);
      response.start();
      stage = BlockConstructionStage.DATA_STREAMING;
    }

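Sending packets

With everything in place, the streamer takes a packet from the head of dataQueue, appends it to the tail of ackQueue, wakes every thread waiting on dataQueue, and sends the packet with one.writeTo(blockStream);.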

// send the packet
          Span span = null;
          synchronized (dataQueue) {
            // move packet from dataQueue to ackQueue
            if (!one.isHeartbeatPacket()) {
              span = scope.detach();
              one.setTraceSpan(span);
              dataQueue.removeFirst();
              ackQueue.addLast(one);
              dataQueue.notifyAll();
            }
          }

          if (DFSClient.LOG.isDebugEnabled()) {
            DFSClient.LOG.debug("DataStreamer block " + block +
                " sending packet " + one);
          }

          // write out data to remote datanode
          TraceScope writeScope = Trace.startSpan("writeTo", span);
          try {
            one.writeTo(blockStream);
            blockStream.flush();   
          } catch (IOException e) {
            // HDFS-3398 treat primary DN is down since client is unable to 
            // write to primary DN. If a failed or restarting node has already
            // been recorded by the responder, the following call will have no 
            // effect. Pipeline recovery can handle only one node error at a
            // time. If the primary node fails again during the recovery, it
            // will be taken out then.
            tryMarkPrimaryDatanodeFailed();
            throw e;
          } finally {
            writeScope.close();
          }

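Closing the data stream

When every packet of the current block has been sent and all of their acks have arrived, writing that block is finished and endBlock is called to close the corresponding streams: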

// Is this block full?
          if (one.isLastPacketInBlock()) {
            // wait for the close packet has been acked
            synchronized (dataQueue) {
              while (!streamerClosed && !hasError && 
                  ackQueue.size() != 0 && dfsClient.clientRunning) {
                dataQueue.wait(1000);// wait for acks to arrive from datanodes
              }
            }
            if (streamerClosed || hasError || !dfsClient.clientRunning) {
              continue;
            }

            endBlock();
          }

endBlock closes the responder, closes the stream, clears the pipeline, and sets the stage back to PIPELINE_SETUP_CREATE.

private void endBlock() {
      if(DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Closing old block " + block);
      }
      this.setName("DataStreamer for file " + src);
      closeResponder();
      closeStream();
      setPipeline(null, null, null);
      stage = BlockConstructionStage.PIPELINE_SETUP_CREATE;
    }

ResponseProcessor handles the replies

This part of the logic is comparatively simple.

@Override
      public void run() {

        setName("ResponseProcessor for block " + block);
        PipelineAck ack = new PipelineAck();

        TraceScope scope = NullScope.INSTANCE;
        while (!responderClosed && dfsClient.clientRunning && !isLastPacketInBlock) {
          // process responses from datanodes.
          try {
            // read an ack from the pipeline (the matching packet is the head of ackQueue)
            long begin = Time.monotonicNow();
            ack.readFields(blockReplyStream);
             ..............

                
            // once everything has been handled successfully, remove the packet from ackQueue
            synchronized (dataQueue) {
              scope = Trace.continueSpan(one.getTraceSpan());
              one.setTraceSpan(null);
              lastAckedSeqno = seqno;
              pipelineRecoveryCount = 0;
              ackQueue.removeFirst();
              dataQueue.notifyAll();

              one.releaseBuffer(byteArrayManager);
            }
          } catch (Exception e) {
          // an exception is not handled here right away; it is stored in an AtomicReference for later
            if (!responderClosed) {
              if (e instanceof IOException) {
                setLastException((IOException)e);
              }
                ............
            }
          } finally {
            scope.close();
          }
        }
      }
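
The elided part of run matches each incoming ack against the packet at the head of ackQueue before removing it. A minimal sketch of that check, under the assumption that acks arrive in send order: AckCheckSketch and handleAck are hypothetical names, packets are reduced to their sequence numbers, and the sketch locks on ackQueue for brevity whereas the code above locks on dataQueue.

```java
import java.util.Arrays;
import java.util.LinkedList;

// Sketch only: an ack must match the oldest outstanding packet.
final class AckCheckSketch {

    /** Removes the acknowledged head of ackQueue and returns the new lastAckedSeqno. */
    static long handleAck(LinkedList<Long> ackQueue, long ackedSeqno) {
        synchronized (ackQueue) {
            if (ackQueue.isEmpty()) {
                throw new IllegalStateException("ack " + ackedSeqno + " with nothing outstanding");
            }
            long expected = ackQueue.getFirst();
            if (expected != ackedSeqno) {
                throw new IllegalStateException(
                    "expecting seqno " + expected + " but received " + ackedSeqno);
            }
            ackQueue.removeFirst();
            ackQueue.notifyAll();     // let waiting writers see the freed slot
            return ackedSeqno;
        }
    }

    public static void main(String[] args) {
        LinkedList<Long> ackQueue = new LinkedList<>(Arrays.asList(7L, 8L));
        System.out.println("last acked: " + handleAck(ackQueue, 7L));
        System.out.println("still outstanding: " + ackQueue);
    }
}
```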
