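HDFS Source Code Analysis: Balancer

Overview

A process is started on the machine where the start command is entered. To avoid putting too much load on the namenode, the whole balancing process is controlled by the balancer server rather than by the namenode.

The end result of balancing is that one replica of a block recorded on the namenode is moved from one datanode to another.

PS: replica placement policy

  • The 2nd replica is stored on a rack different from the one holding the 1st replica

  • The 3rd replica is stored on the same rack as the 2nd replica, but on a different node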

[Figure 1: replica placement. Each green area represents one replica.]

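Balance strategy

The datanodes are divided into four categories: over-utilized nodes, under-utilized nodes, nodes whose storage utilization is above the average, and nodes whose storage utilization is below the average. The balancer then checks whether any node is over-utilized or under-utilized (that is, whether the over-utilized list or the under-utilized list contains any machine). If so, it continues; otherwise it exits. When it continues, it walks the over-utilized and under-utilized lists to generate the balance plan. Generating the plan involves the following steps:

a. Choose the source and destination nodes for the data movement, according to the following rules:

   For an under-utilized node, a source is picked at random from the candidates below, in decreasing order of priority:

    • Over-utilized nodes on the same rack

    • Above-average nodes on the same rack

    • Over-utilized nodes on other racks

    • Above-average nodes on other racks

    For an over-utilized node, a target is picked at random from the candidates below, in decreasing order of priority:

    • Under-utilized nodes on the same rack

    • Below-average nodes on the same rack

    • Under-utilized nodes on other racks

    • Below-average nodes on other racks

b. Compute the amount of data to move from each source to each destination (measured in bytes, not in blocks).

If the source is an over-utilized node, compare its allowed capacity deviation with 1 GB: take 1 GB if the deviation is larger, otherwise take the deviation itself. If the source is only above average and does not meet the over-utilized condition, compare the gap between the node's actual utilization and the cluster average with 2 GB: take 2 GB if the gap is larger, otherwise take the gap. The amount for a destination node is computed the same way.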
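
Steps a and b can be sketched in a few lines of Java. This is only an illustration of the description above, not the Balancer's own code: the class, the `Node` type and every method name are hypothetical, the utilization values are plain percentages, and the rule "over/under-utilized means further from the average than the allowed deviation" is an assumption. The 1 GB and 2 GB caps are the figures quoted in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class BalancePlanSketch {
    enum Category { OVER_UTILIZED, ABOVE_AVERAGE, BELOW_AVERAGE, UNDER_UTILIZED }

    static final long GB = 1024L * 1024 * 1024;

    // Hypothetical, minimal view of a datanode.
    static final class Node {
        final String name;
        final String rack;
        final Category category;

        Node(String name, String rack, Category category) {
            this.name = name;
            this.rack = rack;
            this.category = category;
        }
    }

    // Assumption: a node is over/under-utilized when its utilization (percent)
    // differs from the cluster average by more than the allowed deviation.
    static Category classify(double utilization, double average, double threshold) {
        if (utilization > average + threshold) return Category.OVER_UTILIZED;
        if (utilization > average) return Category.ABOVE_AVERAGE;
        if (utilization >= average - threshold) return Category.BELOW_AVERAGE;
        return Category.UNDER_UTILIZED;
    }

    // Step a: pick a target for an over-utilized source. The four tiers are
    // tried in priority order; within a tier the choice is random.
    static Node chooseTarget(Node source, List<Node> candidates, Random random) {
        boolean[] sameRack = {true, true, false, false};
        Category[] wanted = {Category.UNDER_UTILIZED, Category.BELOW_AVERAGE,
                             Category.UNDER_UTILIZED, Category.BELOW_AVERAGE};
        for (int tier = 0; tier < wanted.length; tier++) {
            List<Node> matches = new ArrayList<>();
            for (Node n : candidates) {
                if (n.category == wanted[tier]
                        && n.rack.equals(source.rack) == sameRack[tier]) {
                    matches.add(n);
                }
            }
            if (!matches.isEmpty()) {
                return matches.get(random.nextInt(matches.size()));
            }
        }
        return null; // no suitable target in any tier
    }

    // Step b: cap the bytes to move with the 1 GB / 2 GB limits quoted above.
    static long maxBytesToMove(boolean overUtilized, long allowedDeviationBytes,
                               long gapFromAverageBytes) {
        return overUtilized
                ? Math.min(allowedDeviationBytes, 1 * GB)
                : Math.min(gapFromAverageBytes, 2 * GB);
    }

    public static void main(String[] args) {
        Node source = new Node("dn1", "/rackA", classify(95, 60, 10));
        List<Node> candidates = Arrays.asList(
                new Node("dn2", "/rackB", classify(30, 60, 10)),
                new Node("dn3", "/rackA", classify(55, 60, 10)));
        Node target = chooseTarget(source, candidates, new Random());
        System.out.println(source.name + " -> " + target.name + ", up to "
                + maxBytesToMove(true, 5 * GB, 0) + " bytes");
    }
}
```

In the example, the below-average node on the source's own rack wins over the under-utilized node on another rack, because rack locality is checked before the category on other racks.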

dispatchBlockMoves

private long dispatchBlockMoves() throws InterruptedException {
    final long bytesLastMoved = getBytesMoved();
    final Future<?>[] futures = new Future<?>[sources.size()];
    final Iterator<Source> i = sources.iterator();
    for (int j = 0; j < futures.length; j++) {
      final Source s = i.next();
      futures[j] = dispatchExecutor.submit(new Runnable() {
        @Override
        public void run() {
          // start dispatching blocks for this source
          s.dispatchBlocks();
        }
      });
    }

    // wait for all dispatcher threads to finish
    for (Future<?> future : futures) {
      try {
        future.get();
      } catch (ExecutionException e) {
        LOG.warn("Dispatcher thread failed", e.getCause());
      }
    }

    // wait for all block moving to be done
    waitForMoveCompletion(targets);

    return getBytesMoved() - bytesLastMoved;
  }

// Source#dispatchBlocks, run by each dispatcher thread submitted above
private void dispatchBlocks() {
      this.blocksToReceive = 2 * getScheduledSize();
      int noPendingMoveIteration = 0;
      LOG.info(getScheduledSize() > 0 && !isIterationOver()
              && (!srcBlocks.isEmpty() || blocksToReceive > 0));
      while (getScheduledSize() > 0 && !isIterationOver()
          && (!srcBlocks.isEmpty() || blocksToReceive > 0)) {
        if (LOG.isTraceEnabled()) {
          LOG.trace(this + " blocksToReceive=" + blocksToReceive
              + ", scheduledSize=" + getScheduledSize()
              + ", srcBlocks#=" + srcBlocks.size());
        }
       final PendingMove p = chooseNextMove();
        if (p != null) {
          // Reset no pending move counter
          noPendingMoveIteration=0;
          executePendingMove(p);
          continue;
        }
        // Since we cannot schedule any block to move,
        // remove any moved blocks from the source block list and
        removeMovedBlocks(); // filter already moved blocks
        // check if we should fetch more blocks from the namenode
        if (shouldFetchMoreBlocks()) {
          // fetch new blocks
          try {
            final long received = getBlockList();
            if (received == 0) {
              return;
            }
            blocksToReceive -= received;
            continue;
          } catch (IOException e) {
            LOG.warn("Exception while getting block list", e);
            return;
          }
        } else {
          // source node cannot find a pending block to move, iteration +1
          noPendingMoveIteration++;
          // in case no blocks can be moved for source node's task,
          // jump out of while-loop after 5 iterations.
          if (noPendingMoveIteration >= MAX_NO_PENDING_MOVE_ITERATIONS) {
            LOG.info("Failed to find a pending move "  + noPendingMoveIteration
                + " times.  Skipping " + this);
            resetScheduledSize();
          }
        }

        // Now we can not schedule any block to move and there are
        // no new blocks added to the source block list, so we wait.
        try {
          synchronized (Dispatcher.this) {
            Dispatcher.this.wait(1000); // wait for targets/sources to be idle
          }
        } catch (InterruptedException ignored) {
        }
      }

      if (isIterationOver()) {
        LOG.info("The maximum iteration time (" + MAX_ITERATION_TIME/1000
            + " seconds) has been reached. Stopping " + this);
      }
    }
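  • Choose a suitable block and a proxy, and create a new PendingMove object

  • Add the block to the movedBlocks queue (marking it as already moved)

  • Remove the block from the srcBlocks queue. (When this queue is empty, a batch of blocks is pulled at random from the namenode, at most 2 GB per request, then filtered and stored in srcBlocks. One of the conditions that triggers a pull: fewer than 5 blocks in srcBlocks are still waiting to be moved; this value is fixed and not configurable.) Each pull of new blocks excludes those already in the movedBlocks queue.

  • Execute one move: the block taken from srcBlocks is submitted to a thread pool for migration (pool size: dfs.balancer.moverThreads, default 1000).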

getBlockList

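  • Pull a batch of blocks at random from the namenode (at most 2 GB per request, 20 GB in total), filter them, and store them in srcBlocks. One of the conditions that triggers a pull: fewer than 5 blocks in srcBlocks are still waiting to be moved (a fixed value, not configurable).

  • MAX_BLOCKS_SIZE_TO_FETCH defaults to 2 GB, and blocksToReceive starts at 2 * scheduledSize = 20 GB. In other words, one dispatchBlockMoves round fetches at most 20 GB, which is enforced by the outer while loop condition (!srcBlocks.isEmpty() || blocksToReceive > 0).

  • After the pull, the blocks are filtered once.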

private long getBlockList() throws IOException {
      final long size = Math.min(MAX_BLOCKS_SIZE_TO_FETCH, blocksToReceive);
      final BlocksWithLocations newBlocks = nnc.getBlocks(getDatanodeInfo(), size);
      if (LOG.isTraceEnabled()) {
        LOG.trace("getBlocks(" + getDatanodeInfo() + ", "
            + StringUtils.TraditionalBinaryPrefix.long2String(size, "B", 2)
            + ") returns " + newBlocks.getBlocks().length + " blocks.");
      }

      long bytesReceived = 0;
      for (BlockWithLocations blk : newBlocks.getBlocks()) {
        bytesReceived += blk.getBlock().getNumBytes();
        synchronized (globalBlocks) {
          final DBlock block = globalBlocks.get(blk.getBlock());
          synchronized (block) {
            block.clearLocations();

            // update locations
            final String[] datanodeUuids = blk.getDatanodeUuids();
            final StorageType[] storageTypes = blk.getStorageTypes();
            for (int i = 0; i < datanodeUuids.length; i++) {
              final StorageGroup g = storageGroupMap.get(
                  datanodeUuids[i], storageTypes[i]);
              if (g != null) { // not unknown
                block.addLocation(g);
              }
            }
          }
          if (!srcBlocks.contains(block) && isGoodBlockCandidate(block)) {
            if (LOG.isTraceEnabled()) {
              LOG.trace("Add " + block + " to " + this);
            }
            srcBlocks.add(block);
          }
        }
      }
      return bytesReceived;
    }
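
The fetch budget described above is easy to check with a little arithmetic. The sketch below is a stand-alone model, not Dispatcher code: the class and method names are made up for illustration, the trigger condition is an assumed form of the "fewer than 5 unmoved blocks" rule, and each request is assumed to return exactly the number of bytes asked for (a real namenode may return less).

```java
class FetchBudgetSketch {
    static final long GB = 1L << 30;
    // Per-request cap quoted in the text.
    static final long MAX_BLOCKS_SIZE_TO_FETCH = 2 * GB;
    // "Fewer than 5" trigger quoted in the text; the constant name is chosen for this sketch.
    static final int SOURCE_BLOCKS_MIN_SIZE = 5;

    // Assumed trigger: few unmoved blocks are left and some budget remains.
    static boolean shouldFetchMoreBlocks(int unmovedBlocks, long blocksToReceive) {
        return unmovedBlocks < SOURCE_BLOCKS_MIN_SIZE && blocksToReceive > 0;
    }

    // Counts the requests needed to use up the budget of 2 * scheduledSize,
    // mirroring size = min(MAX_BLOCKS_SIZE_TO_FETCH, blocksToReceive).
    static int fetchRounds(long scheduledSize) {
        long blocksToReceive = 2 * scheduledSize;
        int rounds = 0;
        while (blocksToReceive > 0) {
            long size = Math.min(MAX_BLOCKS_SIZE_TO_FETCH, blocksToReceive);
            blocksToReceive -= size; // assumes the request returned exactly `size` bytes
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        // With scheduledSize = 10 GB the budget is 20 GB, fetched 2 GB at a time.
        System.out.println("rounds = " + fetchRounds(10 * GB));
        System.out.println("fetch with 3 blocks left: " + shouldFetchMoreBlocks(3, 4 * GB));
    }
}
```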

chooseNextMove

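Logically, chooseNextMove picks a block, creates a PendingMove object for it, and marks the block as moved.

  • Check whether the target is idle

  • Pick a suitable block from the srcBlocks queue and find a proxy for it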

private PendingMove chooseNextMove() {
      for (Iterator<Task> i = tasks.iterator(); i.hasNext();) {
        final Task task = i.next();
        final DDatanode target = task.target.getDDatanode();
        final PendingMove pendingBlock = new PendingMove(this, task.target);
        if (target.addPendingBlock(pendingBlock)) {
          // target is not busy, so do a tentative block allocation
          if (pendingBlock.chooseBlockAndProxy()) {
            long blockSize = pendingBlock.block.getNumBytes();
            incScheduledSize(-blockSize);
            task.size -= blockSize;
            if (task.size <= 0) {
              i.remove();
            }
            return pendingBlock;
          } else {
            // cancel the tentative move
            target.removePendingBlock(pendingBlock);
          }
        }
      }
      return null;
    }

Choosing the block and the proxy

        If a block is selected and has been added to the movedBlocks queue, it is removed from the srcBlocks queue.

private boolean chooseBlockAndProxy() {
      // source and target must have the same storage type
      final StorageType t = source.getStorageType();
      // iterate all source's blocks until find a good one
      // walk the srcBlocks queue
      for (Iterator<DBlock> i = source.getBlockIterator(); i.hasNext();) {
        if (markMovedIfGoodBlock(i.next(), t)) {
          i.remove();
          return true;
        }
      }
      return false;
    }

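Choosing a suitable block

        Choosing a block to move must not violate the block placement rules: the move must not lose a block, must not reduce the number of replicas of a block, and must not reduce the number of racks a block is placed on. The selection criteria are:

  • If source and target are on different racks, the target's rack should not already hold a replica of the block to be moved

  • The target must not already hold a replica of the block to be moved

  • The block must not be in the process of being moved, or have been moved already

  • The move must not reduce the number of racks the block is placed on

  • The target datanode must not hold any replica of the block other than the source itself (a move between storages on the same datanode is allowed)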

/**
   * Decide if the block is a good candidate to be moved from source to target.
   * A block is a good candidate if
   * 1. the block is not in the process of being moved/has not been moved;
   * 2. the block does not have a replica on the target;
   * 3. doing the move does not reduce the number of racks that the block has
   */
  private boolean isGoodBlockCandidate(StorageGroup source, StorageGroup target,
      StorageType targetStorageType, DBlock block) {
    if (source.equals(target)) {
      return false;
    }
    if (target.storageType != targetStorageType) {
      return false;
    }
    // check if the block is moved or not
    if (movedBlocks.contains(block.getBlock())) {
      return false;
    }
    final DatanodeInfo targetDatanode = target.getDatanodeInfo();
    if (source.getDatanodeInfo().equals(targetDatanode)) {
      // the block is moved inside same DN
      return true;
    }

    // check if block has replica in target node
    for (StorageGroup blockLocation : block.getLocations()) {
      if (blockLocation.getDatanodeInfo().equals(targetDatanode)) {
        return false;
      }
    }

    if (cluster.isNodeGroupAware()
        && isOnSameNodeGroupWithReplicas(source, target, block)) {
      LOG.info("cluster.isNodeGroupAware()\n" +
              "        && isOnSameNodeGroupWithReplicas(source, target, block)");
      return false;
    }
    if (reduceNumOfRacks(source, target, block)) {
      return false;
    }
    return true;
  }
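
The rack criterion is the least obvious of these checks, so here is a small model of it. It is not the body of reduceNumOfRacks, which is not shown in this article; it simply counts distinct racks before and after a hypothetical move, with racks represented as plain strings and all names invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RackCheckSketch {
    // Returns true if moving one replica from sourceRack to targetRack leaves
    // the block on fewer distinct racks than before. otherReplicaRacks holds
    // the racks of all replicas except the one being moved.
    static boolean reducesRackCount(String sourceRack, String targetRack,
                                    List<String> otherReplicaRacks) {
        Set<String> before = new HashSet<>(otherReplicaRacks);
        before.add(sourceRack);
        Set<String> after = new HashSet<>(otherReplicaRacks);
        after.add(targetRack);
        return after.size() < before.size();
    }

    public static void main(String[] args) {
        // Replicas on /rackA, /rackB, /rackC; moving the /rackA replica to
        // /rackB would leave only two racks, so the move should be rejected.
        System.out.println(reducesRackCount("/rackA", "/rackB",
                Arrays.asList("/rackB", "/rackC")));
    }
}
```

A move inside one rack, or a move onto a rack that holds no replica yet, never lowers the count in this model, which matches the intent of criterion 3 in the Javadoc above.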

Adding the block to the movedBlocks queue

        Adding a block to movedBlocks records that this block has already been moved.

private boolean markMovedIfGoodBlock(DBlock block, StorageType targetStorageType) {
      synchronized (block) {
        synchronized (movedBlocks) {
          if (isGoodBlockCandidate(source, target, targetStorageType, block)) {
            // assign the block to this PendingMove
            this.block = block;
            // choose a proxy for the source
            if (chooseProxySource()) {
              // add to the queue of already-moved blocks
              movedBlocks.put(block);
              if (LOG.isDebugEnabled()) {
                LOG.debug("Decided to move " + this);
              }
              return true;
            }
          }
        }
      }
      return false;
    }

Choosing the proxy

private boolean chooseProxySource() {
      final DatanodeInfo targetDN = target.getDatanodeInfo();
      // if source and target are same nodes then no need of proxy
      if (source.getDatanodeInfo().equals(targetDN) && addTo(source)) {
        return true;
      }
      // if node group is supported, first try add nodes in the same node group
      if (cluster.isNodeGroupAware()) {
        for (StorageGroup loc : block.getLocations()) {
          if (cluster.isOnSameNodeGroup(loc.getDatanodeInfo(), targetDN)
              && addTo(loc)) {
            return true;
          }
        }
      }
      // check if there is replica which is on the same rack with the target
      for (StorageGroup loc : block.getLocations()) {
        if (cluster.isOnSameRack(loc.getDatanodeInfo(), targetDN) && addTo(loc)) {
          return true;
        }
      }
      // find out a non-busy replica
      for (StorageGroup loc : block.getLocations()) {
        if (addTo(loc)) {
          return true;
        }
      }
      return false;
    }
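
The order of preference in chooseProxySource can be restated as a small stand-alone model. Everything here is hypothetical: the `Replica` type, the `busy` flag (standing in for addTo(...) returning false), and the method names. The node-group step is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

class ProxyChooserSketch {
    // Hypothetical view of one replica location.
    static final class Replica {
        final String node;
        final String rack;
        final boolean busy; // stands in for addTo(...) failing on a busy node

        Replica(String node, String rack, boolean busy) {
            this.node = node;
            this.rack = rack;
            this.busy = busy;
        }
    }

    static Replica chooseProxy(Replica source, String targetNode, String targetRack,
                               List<Replica> locations) {
        // 1. Source and target on the same node: the source is its own proxy.
        if (source.node.equals(targetNode) && !source.busy) {
            return source;
        }
        // 2. Prefer a non-busy replica on the target's rack.
        for (Replica r : locations) {
            if (r.rack.equals(targetRack) && !r.busy) {
                return r;
            }
        }
        // 3. Otherwise take any non-busy replica.
        for (Replica r : locations) {
            if (!r.busy) {
                return r;
            }
        }
        return null; // every replica holder is busy
    }

    public static void main(String[] args) {
        Replica source = new Replica("dn1", "/rackA", false);
        Replica other = new Replica("dn2", "/rackB", false);
        Replica proxy = chooseProxy(source, "dn3", "/rackB", Arrays.asList(source, other));
        System.out.println("proxy = " + proxy.node);
    }
}
```

The point of the ordering is that the copy to the target travels the shortest available network path: same node first, then same rack, then anywhere.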

Executing the move

public void executePendingMove(final PendingMove p) {
    // move the block
    moveExecutor.execute(new Runnable() {
      @Override
      public void run() {
        p.dispatch();
      }
    });
  }

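Step 1:

        The balancer opens a socket to the target and sends a replaceBlock request, asking the target to copy a replica of the block from the proxy to its local storage in order to replace the replica on the source.

Step 2:

        The target sends a copyBlock request to the proxy and copies the block replica from the proxy to its local storage. When the copy completes, the target calls notifyNamenodeReceivedBlock, which creates a ReceivedDeletedBlockInfo object and puts it in a queue. On the next heartbeat the target uses this object to tell the namenode to record the new replica on the target in the blocks map and to delete the corresponding replica on the source.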

/** Dispatch the move to the proxy source & wait for the response. */
    private void dispatch() {
      LOG.info("Start moving " + this);

      Socket sock = new Socket();
      DataOutputStream out = null;
      DataInputStream in = null;
      try {
        sock.connect(
            NetUtils.createSocketAddr(target.getDatanodeInfo().
                getXferAddr(Dispatcher.this.connectToDnViaHostname)),
                HdfsServerConstants.READ_TIMEOUT);

        // Set read timeout so that it doesn't hang forever against
        // unresponsive nodes. Datanode normally sends IN_PROGRESS response
        // twice within the client read timeout period (every 30 seconds by
        // default). Here, we make it give up after 5 minutes of no response.
        sock.setSoTimeout(HdfsServerConstants.READ_TIMEOUT * 5);
        sock.setKeepAlive(true);

        OutputStream unbufOut = sock.getOutputStream();
        InputStream unbufIn = sock.getInputStream();
        ExtendedBlock eb = new ExtendedBlock(nnc.getBlockpoolID(),
            block.getBlock());
        final KeyManager km = nnc.getKeyManager(); 
        Token<BlockTokenIdentifier> accessToken = km.getAccessToken(eb);
        IOStreamPair saslStreams = saslClient.socketSend(sock, unbufOut,
            unbufIn, km, accessToken, target.getDatanodeInfo());
        unbufOut = saslStreams.out;
        unbufIn = saslStreams.in;
        out = new DataOutputStream(new BufferedOutputStream(unbufOut,
            HdfsConstants.IO_FILE_BUFFER_SIZE));
        in = new DataInputStream(new BufferedInputStream(unbufIn,
            HdfsConstants.IO_FILE_BUFFER_SIZE));

        sendRequest(out, eb, accessToken);
        receiveResponse(in);
        nnc.getBytesMoved().addAndGet(block.getNumBytes());
        LOG.info("Successfully moved " + this);
      } catch (IOException e) {
        LOG.warn("Failed to move " + this, e);
        target.getDDatanode().setHasFailure();
        // Proxy or target may have some issues, delay before using these nodes
        // further in order to avoid a potential storm of "threads quota
        // exceeded" warnings when the dispatcher gets out of sync with work
        // going on in datanodes.
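        // The move failed, possibly because the proxy or target is too busy
        // (too many concurrent replaceBlock operations), so delay their
        // participation in balancing by 10 s.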
        proxySource.activateDelay(delayAfterErrors);
        target.getDDatanode().activateDelay(delayAfterErrors);
      } finally {
        IOUtils.closeStream(out);
        IOUtils.closeStream(in);
        IOUtils.closeSocket(sock);
        // Whether the move succeeded or failed, remove this block from the pending queues
        proxySource.removePendingBlock(this);
        target.getDDatanode().removePendingBlock(this);

        synchronized (this) {
          reset();
        }
        synchronized (Dispatcher.this) {
          Dispatcher.this.notifyAll();
        }
      }
    }
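
The back-off applied in the catch block can be pictured as a simple time gate. The class below is an illustrative model only, with hypothetical names; the clock value is passed in explicitly so the behaviour is easy to follow, whereas real code would read a monotonic clock.

```java
class DelayGateSketch {
    private long delayUntil = 0; // 0 means no delay is active

    // Called after a failed move: keep this node out of scheduling for deltaMillis.
    void activateDelay(long nowMillis, long deltaMillis) {
        delayUntil = nowMillis + deltaMillis;
    }

    // A node whose delay is still active is skipped when choosing proxies or targets.
    boolean isDelayActive(long nowMillis) {
        if (delayUntil == 0 || nowMillis > delayUntil) {
            delayUntil = 0; // the delay has expired, clear it
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        DelayGateSketch gate = new DelayGateSketch();
        gate.activateDelay(1_000, 10_000); // 10 s delay, as in the comment above
        System.out.println("active at 5 s:  " + gate.isDelayActive(5_000));
        System.out.println("active at 12 s: " + gate.isDelayActive(12_000));
    }
}
```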

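This is where replaceBlock is called. The request goes to the target datanode, which, once the copy has finished, reports to the namenode that the block has moved from datanode1 to datanode2.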

/** Send a block replace request to the output stream */
    private void sendRequest(DataOutputStream out, ExtendedBlock eb,
        Token<BlockTokenIdentifier> accessToken) throws IOException {
      new Sender(out).replaceBlock(eb, target.storageType, accessToken,
          source.getDatanodeInfo().getDatanodeUuid(), proxySource.datanode);
    }
