The net effect of the Balancer is that, in the namenode's records, one replica of a block is moved from one datanode to another datanode.
The second replica is stored on a rack different from the one holding the first replica.
The third replica is stored on the same rack as the second replica, but on a different node.
*(Figure: each green area represents one replica.)
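This placement gives a three-replica block a spread of exactly two racks, and a Balancer move must never shrink that spread. As a minimal illustration (all names here are hypothetical, not Hadoop identifiers), the rack spread of a block is just the number of distinct racks among its replica locations:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/** Illustrative only: count how many distinct racks hold a replica. */
class RackSpread {
  static int countRacks(Map<String, String> replicaNodeToRack) {
    return new HashSet<>(replicaNodeToRack.values()).size();
  }
  public static void main(String[] args) {
    Map<String, String> locations = new HashMap<>();
    locations.put("dn1", "rackA"); // 1st replica
    locations.put("dn2", "rackB"); // 2nd replica: a different rack
    locations.put("dn3", "rackB"); // 3rd replica: same rack as 2nd, different node
    System.out.println(countRacks(locations)); // prints 2
  }
}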
The datanodes are first divided into four groups: over-utilized nodes, under-utilized nodes, nodes whose storage utilization is above the cluster average, and nodes whose utilization is below the average. The Balancer then checks whether any node is over- or under-utilized (i.e., whether the over-utilized or under-utilized list contains any machine); if so it continues, otherwise it exits. If it can continue, it iterates over the over-utilized and under-utilized lists to generate the balancing plan. Generating the plan involves the following steps:
For each under-utilized node, a source is chosen for it at random according to the following conditions, in decreasing order of priority:
an over-utilized node on the same rack
a node on the same rack whose utilization is above average
an over-utilized node on another rack
a node on another rack whose utilization is above average
For each over-utilized node, a target is chosen for it at random according to the following conditions, in decreasing order of priority:
an under-utilized node on the same rack
a node on the same rack whose utilization is below average
an under-utilized node on another rack
a node on another rack whose utilization is below average
If the source node is over-utilized, check whether its allowed capacity deviation exceeds 1 GB; if it does, take 1 GB, otherwise take the deviation itself. If the source is merely above the average utilization without being over-utilized, check whether the difference between the node's actual utilization and the cluster average exceeds 2 GB; if it does, take 2 GB, otherwise take that difference. The destination node is computed the same way. A sketch of the pairing pass described above follows.
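Here is a simplified sketch of the source-selection priority, assuming hypothetical helper names (pickRandom, chooseSource and the candidate lists); the real logic lives in Balancer#chooseStorageGroups and its helpers:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

/** Sketch of the pairing priority; all identifiers here are illustrative. */
class PairingSketch {
  static class Node { final String rack; Node(String rack) { this.rack = rack; } }
  final Random rand = new Random();

  /** Pick a random node matching the predicate, or null if none matches. */
  Node pickRandom(List<Node> nodes, Predicate<Node> p) {
    List<Node> candidates = new ArrayList<>();
    for (Node n : nodes) if (p.test(n)) candidates.add(n);
    return candidates.isEmpty() ? null : candidates.get(rand.nextInt(candidates.size()));
  }

  /** Choose a source for an under-utilized target, in decreasing priority. */
  Node chooseSource(Node target, List<Node> overUtilized, List<Node> aboveAvg) {
    Predicate<Node> sameRack = n -> n.rack.equals(target.rack);
    Node src = pickRandom(overUtilized, sameRack);                      // 1. over-utilized, same rack
    if (src == null) src = pickRandom(aboveAvg, sameRack);              // 2. above average, same rack
    if (src == null) src = pickRandom(overUtilized, sameRack.negate()); // 3. over-utilized, other rack
    if (src == null) src = pickRandom(aboveAvg, sameRack.negate());     // 4. above average, other rack
    return src;
  }
}

Once sources and targets are paired and the sizes fixed, the plan is handed to the Dispatcher; dispatchBlockMoves() drives the per-source dispatchBlocks() loop shown below.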
private long dispatchBlockMoves() throws InterruptedException {
  final long bytesLastMoved = getBytesMoved();
  final Future<?>[] futures = new Future<?>[sources.size()];
  final Iterator<Source> i = sources.iterator();
  // (remainder omitted) each Source's dispatchBlocks() is submitted to the
  // dispatch executor; the method waits on all futures and returns
  // getBytesMoved() - bytesLastMoved, i.e. the bytes moved this iteration
private void dispatchBlocks() {
this.blocksToReceive = 2 * getScheduledSize();
int noPendingMoveIteration = 0;
// (author's trace) print whether the dispatch loop condition below holds
LOG.info(getScheduledSize() > 0 && !isIterationOver()
    && (!srcBlocks.isEmpty() || blocksToReceive > 0));
while (getScheduledSize() > 0 && !isIterationOver()
&& (!srcBlocks.isEmpty() || blocksToReceive > 0)) {
if (LOG.isTraceEnabled()) {
LOG.trace(this + " blocksToReceive=" + blocksToReceive
+ ", scheduledSize=" + getScheduledSize()
+ ", srcBlocks#=" + srcBlocks.size());
}
final PendingMove p = chooseNextMove();
if (p != null) {
// Reset no pending move counter
noPendingMoveIteration=0;
executePendingMove(p);
continue;
}
// Since we cannot schedule any block to move,
// remove any moved blocks from the source block list and
removeMovedBlocks(); // filter already moved blocks
// check if we should fetch more blocks from the namenode
if (shouldFetchMoreBlocks()) {
// fetch new blocks
try {
final long received = getBlockList();
if (received == 0) {
return;
}
blocksToReceive -= received;
continue;
} catch (IOException e) {
LOG.warn("Exception while getting block list", e);
return;
}
} else {
// source node cannot find a pending block to move, iteration +1
noPendingMoveIteration++;
// in case no blocks can be moved for source node's task,
// jump out of while-loop after 5 iterations.
if (noPendingMoveIteration >= MAX_NO_PENDING_MOVE_ITERATIONS) {
LOG.info("Failed to find a pending move " + noPendingMoveIteration
+ " times. Skipping " + this);
resetScheduledSize();
}
}
// Now we can not schedule any block to move and there are
// no new blocks added to the source block list, so we wait.
try {
synchronized (Dispatcher.this) {
Dispatcher.this.wait(1000); // wait for targets/sources to be idle
}
} catch (InterruptedException ignored) {
}
}
if (isIterationOver()) {
LOG.info("The maximum iteration time (" + MAX_ITERATION_TIME/1000
+ " seconds) has been reached. Stopping " + this);
}
}
Pick a suitable block and a proxy, and build a PendingMove object for them.
Add the block to the movedBlocks cache (marking it as already moved).
Remove it from the srcBlocks queue. (When this queue runs empty, a batch of blocks is fetched at random from the namenode, at most 2 GB per fetch, filtered, and saved into the src_block list. One trigger for fetching: fewer than 5 not-yet-moved blocks remain in the src_block list; the 5 is hard-coded and not configurable. Each new fetch excludes blocks already in the movedBlocks cache.)
Execute one move: blocks in the src_block list are submitted for migration to the mover thread pool (pool size dfs.balancer.moverThreads, default 1000); see the sketch below.
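Assuming it mirrors what the branch-2.x Dispatcher constructor does, the mover pool is simply a fixed-size executor sized from that configuration key; roughly:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;

/** Sketch: sizing the mover pool from dfs.balancer.moverThreads
 *  (assumed to match the Dispatcher constructor's wiring). */
class MoverPoolSketch {
  static ExecutorService buildMoverPool(Configuration conf) {
    int moverThreads = conf.getInt("dfs.balancer.moverThreads", 1000);
    return Executors.newFixedThreadPool(moverThreads);
  }
}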
Fetch blocks from the namenode: a random batch is pulled (at most 2 GB per fetch, at most 20 GB cumulatively per iteration), filtered, and saved into the src_block list. One trigger for fetching: fewer than 5 not-yet-moved blocks remain in the src_block list (hard-coded, not configurable).
MAX_BLOCKS_SIZE_TO_FETCH defaults to 2 GB, and blocksToReceive starts at 2 * getScheduledSize() = 20 GB, so a single dispatchBlockMoves() fetches at most 20 GB, bounded by the outer while condition (!srcBlocks.isEmpty() || blocksToReceive > 0).
After each fetch the new blocks are filtered once:
private long getBlockList() throws IOException {
final long size = Math.min(MAX_BLOCKS_SIZE_TO_FETCH, blocksToReceive);
final BlocksWithLocations newBlocks = nnc.getBlocks(getDatanodeInfo(), size);
if (LOG.isTraceEnabled()) {
LOG.trace("getBlocks(" + getDatanodeInfo() + ", "
+ StringUtils.TraditionalBinaryPrefix.long2String(size, "B", 2)
+ ") returns " + newBlocks.getBlocks().length + " blocks.");
}
long bytesReceived = 0;
for (BlockWithLocations blk : newBlocks.getBlocks()) {
bytesReceived += blk.getBlock().getNumBytes();
synchronized (globalBlocks) {
final DBlock block = globalBlocks.get(blk.getBlock());
synchronized (block) {
block.clearLocations();
// update locations
final String[] datanodeUuids = blk.getDatanodeUuids();
final StorageType[] storageTypes = blk.getStorageTypes();
for (int i = 0; i < datanodeUuids.length; i++) {
final StorageGroup g = storageGroupMap.get(
datanodeUuids[i], storageTypes[i]);
if (g != null) { // not unknown
block.addLocation(g);
}
}
}
if (!srcBlocks.contains(block) && isGoodBlockCandidate(block)) {
if (LOG.isTraceEnabled()) {
LOG.trace("Add " + block + " to " + this);
}
srcBlocks.add(block);
}
}
}
return bytesReceived;
}
Logically, chooseNextMove picks a block, builds a PendingMove object for it, and then marks the block as moved. It does two things:
check whether the target is idle (not busy);
pick a suitable block from the srcBlocks queue and find a proxy for it.
private PendingMove chooseNextMove() {
for (Iterator<Task> i = tasks.iterator(); i.hasNext();) {
final Task task = i.next();
final DDatanode target = task.target.getDDatanode();
final PendingMove pendingBlock = new PendingMove(this, task.target);
if (target.addPendingBlock(pendingBlock)) {
// target is not busy, so do a tentative block allocation
if (pendingBlock.chooseBlockAndProxy()) {
long blockSize = pendingBlock.block.getNumBytes();
incScheduledSize(-blockSize);
task.size -= blockSize;
if (task.size <= 0) {
i.remove();
}
return pendingBlock;
} else {
// cancel the tentative move
target.removePendingBlock(pendingBlock);
}
}
}
return null;
}
Choosing a block and a proxy
If a block is chosen and has been added to the movedBlocks cache, it is removed from the srcBlocks queue:
private boolean chooseBlockAndProxy() {
// source and target must have the same storage type
final StorageType t = source.getStorageType();
// iterate all source's blocks until find a good one
// iterate over the srcBlocks queue
for (Iterator<DBlock> i = source.getBlockIterator(); i.hasNext();) {
if (markMovedIfGoodBlock(i.next(), t)) {
i.remove();
return true;
}
}
return false;
}
Choosing a suitable block
When selecting a block to move, the block placement invariants must not be broken: the move must not lose a block, reduce a block's replica count, or reduce the number of racks a block is placed on. The selection rules are:
if source and target are on different racks, the target's rack must not already hold a replica of the block to be moved
the target must not already hold a replica of the block to be moved
the block must not be in the process of being moved, or have already been moved
the move must not reduce the number of racks the block is placed on
the target must not hold any replica other than the one on the source itself
/**
* Decide if the block is a good candidate to be moved from source to target.
* A block is a good candidate if
* 1. the block is not in the process of being moved/has not been moved;
* 2. the block does not have a replica on the target;
* 3. doing the move does not reduce the number of racks that the block has
*/
private boolean isGoodBlockCandidate(StorageGroup source, StorageGroup target,
StorageType targetStorageType, DBlock block) {
if (source.equals(target)) {
return false;
}
if (target.storageType != targetStorageType) {
return false;
}
// check if the block is moved or not
if (movedBlocks.contains(block.getBlock())) {
return false;
}
final DatanodeInfo targetDatanode = target.getDatanodeInfo();
if (source.getDatanodeInfo().equals(targetDatanode)) {
// the block is moved inside same DN
return true;
}
// check if block has replica in target node
for (StorageGroup blockLocation : block.getLocations()) {
if (blockLocation.getDatanodeInfo().equals(targetDatanode)) {
return false;
}
}
if (cluster.isNodeGroupAware()
&& isOnSameNodeGroupWithReplicas(source, target, block)) {
LOG.info("cluster.isNodeGroupAware()\n" +
" && isOnSameNodeGroupWithReplicas(source, target, block)");
return false;
}
if (reduceNumOfRacks(source, target, block)) {
return false;
}
return true;
}
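reduceNumOfRacks is called above but not shown in the snippet; paraphrased from the same Dispatcher class (details may vary across Hadoop versions), it decides whether the move would shrink the block's rack spread:

/** Paraphrase: would moving the block from source to target reduce
 *  the number of racks that the block is placed on? */
private boolean reduceNumOfRacks(StorageGroup source, StorageGroup target,
    DBlock block) {
  final DatanodeInfo sourceDn = source.getDatanodeInfo();
  if (cluster.isOnSameRack(sourceDn, target.getDatanodeInfo())) {
    // source and target are on the same rack: the spread cannot change
    return false;
  }
  boolean notOnSameRack = true;
  synchronized (block) {
    for (StorageGroup loc : block.getLocations()) {
      if (cluster.isOnSameRack(loc.getDatanodeInfo(), target.getDatanodeInfo())) {
        notOnSameRack = false;
        break;
      }
    }
  }
  if (notOnSameRack) {
    // the target rack held no replica before: the spread cannot shrink
    return false;
  }
  for (StorageGroup g : block.getLocations()) {
    if (g != source && cluster.isOnSameRack(g.getDatanodeInfo(), sourceDn)) {
      // another replica stays on the source's rack: the spread is preserved
      return false;
    }
  }
  // the source rack loses its only replica while the target rack already
  // holds one, so the number of racks would decrease
  return true;
}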
Adding the block to the movedBlocks cache
Adding a block to movedBlocks marks it as already moved.
private boolean markMovedIfGoodBlock(DBlock block, StorageType targetStorageType) {
synchronized (block) {
synchronized (movedBlocks) {
if (isGoodBlockCandidate(source, target, targetStorageType, block)) {
// bind the chosen block to this PendingMove
this.block = block;
// pick a proxy for the source replica
if (chooseProxySource()) {
// add the block to the moved-blocks cache
movedBlocks.put(block);
if (LOG.isDebugEnabled()) {
LOG.debug("Decided to move " + this);
}
return true;
}
}
}
}
return false;
}
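Note that movedBlocks is not a permanent set: it is a time-windowed cache (the MovedBlocks class), so a block only counts as "already moved" for a window controlled by dfs.balancer.movedWinWidth (default 5400000 ms, i.e. 1.5 hours), after which it may be selected again. A sketch, assuming it mirrors the Dispatcher constructor:

// Sketch: moved blocks expire after dfs.balancer.movedWinWidth milliseconds
long movedWinWidth = conf.getLong("dfs.balancer.movedWinWidth", 5400000L);
this.movedBlocks = new MovedBlocks<StorageGroup>(movedWinWidth);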
Choosing the proxy
private boolean chooseProxySource() {
final DatanodeInfo targetDN = target.getDatanodeInfo();
// if source and target are same nodes then no need of proxy
if (source.getDatanodeInfo().equals(targetDN) && addTo(source)) {
return true;
}
// if node group is supported, first try add nodes in the same node group
if (cluster.isNodeGroupAware()) {
for (StorageGroup loc : block.getLocations()) {
if (cluster.isOnSameNodeGroup(loc.getDatanodeInfo(), targetDN)
&& addTo(loc)) {
return true;
}
}
}
// check if there is replica which is on the same rack with the target
for (StorageGroup loc : block.getLocations()) {
if (cluster.isOnSameRack(loc.getDatanodeInfo(), targetDN) && addTo(loc)) {
return true;
}
}
// find out a non-busy replica
for (StorageGroup loc : block.getLocations()) {
if (addTo(loc)) {
return true;
}
}
return false;
}
public void executePendingMove(final PendingMove p) {
// move the block
moveExecutor.execute(new Runnable() {
@Override
public void run() {
p.dispatch();
}
});
}
Step 1:
The balancer opens a socket to the target and issues a replaceBlock request, asking the target to copy a replica of the block from the proxy to its local storage and use it to replace the replica on the source.
Step 2:
The target issues a copyBlock request to the proxy and copies the block replica to its local storage. When the copy completes, the target calls notifyNamenodeReceivedBlock to build a ReceivedDeletedBlockInfo object and queue it; with the next heartbeat, this object tells the namenode to add the new replica on the target to the blocks map and to delete the corresponding replica on the source.
/** Dispatch the move to the proxy source & wait for the response. */
private void dispatch() {
LOG.info("Start moving " + this);
Socket sock = new Socket();
DataOutputStream out = null;
DataInputStream in = null;
try {
sock.connect(
NetUtils.createSocketAddr(target.getDatanodeInfo().
getXferAddr(Dispatcher.this.connectToDnViaHostname)),
HdfsServerConstants.READ_TIMEOUT);
// Set read timeout so that it doesn't hang forever against
// unresponsive nodes. Datanode normally sends IN_PROGRESS response
// twice within the client read timeout period (every 30 seconds by
// default). Here, we make it give up after 5 minutes of no response.
sock.setSoTimeout(HdfsServerConstants.READ_TIMEOUT * 5);
sock.setKeepAlive(true);
OutputStream unbufOut = sock.getOutputStream();
InputStream unbufIn = sock.getInputStream();
ExtendedBlock eb = new ExtendedBlock(nnc.getBlockpoolID(),
block.getBlock());
final KeyManager km = nnc.getKeyManager();
Token<BlockTokenIdentifier> accessToken = km.getAccessToken(eb);
IOStreamPair saslStreams = saslClient.socketSend(sock, unbufOut,
unbufIn, km, accessToken, target.getDatanodeInfo());
unbufOut = saslStreams.out;
unbufIn = saslStreams.in;
out = new DataOutputStream(new BufferedOutputStream(unbufOut,
HdfsConstants.IO_FILE_BUFFER_SIZE));
in = new DataInputStream(new BufferedInputStream(unbufIn,
HdfsConstants.IO_FILE_BUFFER_SIZE));
sendRequest(out, eb, accessToken);
receiveResponse(in);
nnc.getBytesMoved().addAndGet(block.getNumBytes());
LOG.info("Successfully moved " + this);
} catch (IOException e) {
LOG.warn("Failed to move " + this, e);
target.getDDatanode().setHasFailure();
// Proxy or target may have some issues, delay before using these nodes
// further in order to avoid a potential storm of "threads quota
// exceeded" warnings when the dispatcher gets out of sync with work
// going on in datanodes.
// The move failed, possibly because the proxy or target is currently too
// busy (handling too many concurrent replaceBlock operations), so delay
// their participation in balancing by delayAfterErrors (10 s by default)
proxySource.activateDelay(delayAfterErrors);
target.getDDatanode().activateDelay(delayAfterErrors);
} finally {
IOUtils.closeStream(out);
IOUtils.closeStream(in);
IOUtils.closeSocket(sock);
// whether the move succeeded or failed, remove this block from the pending queues
proxySource.removePendingBlock(this);
target.getDDatanode().removePendingBlock(this);
synchronized (this) {
reset();
}
synchronized (Dispatcher.this) {
Dispatcher.this.notifyAll();
}
}
}
Here replaceBlock is invoked against the target datanode (not the namenode), asking it to replace the replica on the source with one copied from the proxy; as described in Step 2 above, the namenode only learns of the move through the target's subsequent block report:
/** Send a block replace request to the output stream */
private void sendRequest(DataOutputStream out, ExtendedBlock eb,
Token<BlockTokenIdentifier> accessToken) throws IOException {
new Sender(out).replaceBlock(eb, target.storageType, accessToken,
source.getDatanodeInfo().getDatanodeUuid(), proxySource.datanode);
}
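For completeness, the receiveResponse counterpart called from dispatch() reads the target's reply, tolerating the intermediate IN_PROGRESS keep-alive responses mentioned in the socket-timeout comment above. Roughly, paraphrased from the same class (details vary across Hadoop versions):

/** Receive the block-replace response from the input stream (paraphrase). */
private void receiveResponse(DataInputStream in) throws IOException {
  BlockOpResponseProto response =
      BlockOpResponseProto.parseFrom(vintPrefixed(in));
  while (response.getStatus() == Status.IN_PROGRESS) {
    // read intermediate responses sent while the copy is still running
    response = BlockOpResponseProto.parseFrom(vintPrefixed(in));
  }
  String logInfo = "block move is failed";
  DataTransferProtoUtil.checkBlockOpStatus(response, logInfo);
}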