Hedged reads是HDFS的一个功能,在Hadoop 2.4.0之后引入。一般来说,每个读请求都会由生成的一个线程处理。在Hedged reads 启用后,客户端可以等待一个预配置的时间,如果read没有返回,则客户端会生成第二个读请求,访问同一份数据的另一个block replica之后,其中任意一个read 先返回的话,则另一个read请求则被丢弃。
Hedged reads使用的场景是:解决少概率的slow read(可能由瞬时错误导致,例如磁盘错误或是网络抖动等)。
客户端hedged读两个参数:
private volatile long hedgedReadThresholdMillis; //定义原子性的hedged读阈值时间
private static ThreadPoolExecutor HEDGED_READ_THREAD_POOL; //定义静态线程池
看一下参数相关说明
this.hedgedReadThresholdMillis = conf.getLong(
DFSConfigKeys.DFS_DFSCLIENT_HEDGED_READ_THRESHOLD_MILLIS,
DFSConfigKeys.DEFAULT_DFSCLIENT_HEDGED_READ_THRESHOLD_MILLIS); //默认值是500
int numThreads = conf.getInt(
DFSConfigKeys.DFS_DFSCLIENT_HEDGED_READ_THREADPOOL_SIZE,
DFSConfigKeys.DEFAULT_DFSCLIENT_HEDGED_READ_THREADPOOL_SIZE); //默认值是0
if (numThreads > 0) {
this.initThreadsNumForHedgedReads(numThreads); //如果线程池大小为0,则表示不开启hedged读,大于0时则进行初始化
}
/**
* Create hedged reads thread pool, HEDGED_READ_THREAD_POOL, if
* it does not already exist.
* @param num Number of threads for hedged reads thread pool.
* If zero, skip hedged reads thread pool creation.
*/
private synchronized void initThreadsNumForHedgedReads(int num) {
if (num <= 0 || HEDGED_READ_THREAD_POOL != null) return;
HEDGED_READ_THREAD_POOL = new ThreadPoolExecutor(1, num, 60,
TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
new Daemon.DaemonFactory() {
private final AtomicInteger threadIndex =
new AtomicInteger(0);
@Override
public Thread newThread(Runnable r) {
Thread t = super.newThread(r);
t.setName("hedgedRead-" +
threadIndex.getAndIncrement());
return t;
}
},
new ThreadPoolExecutor.CallerRunsPolicy() {
@Override
public void rejectedExecution(Runnable runnable,
ThreadPoolExecutor e) {
LOG.info("Execution rejected, Executing in current thread");
HEDGED_READ_METRIC.incHedgedReadOpsInCurThread();
// will run in the current thread
super.rejectedExecution(runnable, e);
}
});
HEDGED_READ_THREAD_POOL.allowCoreThreadTimeOut(true);
if (LOG.isDebugEnabled()) {
LOG.debug("Using hedged reads; pool threads=" + num);
}
}
分析一下初始化代码,未实例化过线程池则初始化线程池。
关于ThreadPoolExecutor的参数本文不再论述,可参考其他资料。这里对HEDGED_READ_THREAD_POOL初始化参数做下说明:
DFSClient为hedged提供了一些方法,如:
long getHedgedReadTimeout() {
return this.hedgedReadThresholdMillis;
}
@VisibleForTesting
void setHedgedReadTimeout(long timeoutMillis) {
this.hedgedReadThresholdMillis = timeoutMillis;
}
ThreadPoolExecutor getHedgedReadsThreadPool() {
return HEDGED_READ_THREAD_POOL;
}
boolean isHedgedReadsEnabled() {
return (HEDGED_READ_THREAD_POOL != null) &&
HEDGED_READ_THREAD_POOL.getMaximumPoolSize() > 0;
}
在输入流DFSInputStream的read方法中,会通过dfsClient.isHedgedReadsEnabled()判断是否开启了Hedged Read特性,在其开启的情况下,调用hedgedFetchBlockByteRange()方法进行数据读取操作,如下:
if (dfsClient.isHedgedReadsEnabled()) {
hedgedFetchBlockByteRange(blk, targetStart, targetStart + bytesToRead
- 1, buffer, offset, corruptedBlockMap);
} else {
fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1,
buffer, offset, corruptedBlockMap);
}
开启hedged时走hedgedFetchBlockByteRange逻辑。hedgedFetchBlockByteRange方法的成员变量如下:
ArrayList<Future<ByteBuffer>> futures = new ArrayList<Future<ByteBuffer>>(); //1
CompletionService<ByteBuffer> hedgedService = new ExecutorCompletionService<ByteBuffer>(dfsClient.getHedgedReadsThreadPool()); //2
ArrayList<DatanodeInfo> ignored = new ArrayList<DatanodeInfo>(); //3
ByteBuffer bb = null; //4
int len = (int) (end - start + 1); //4
int hedgedReadId = 0; //4
上述成员变量协助hedged读,hedged读主要流程再一个while(true)循环中,该循环又分为两部分:
1.第一次读取时,ignored和futures为空,从NN获取blocks块的位置(仅仅是第一次的时候需要从NN获取,走RPC调用),chosenNode = chooseDataNode(block, ignored);
通过chosenNode来构造一个获得ByteBuffer的Callable线程,并提交到hedgedService,其返回结果是Future对象,记为firstRequest,并将该对象添加到futures列表中,futures.add(firstRequest)。
然后用非阻塞的poll获取结果future,判断future是否成功,成功即返回;否则在ignored中添加下次需要忽略的本节点,incHedgedReadOps计数并继续,此时开启启动hedged读。
2.第一次结束后在while(true)中进行后续读。第二次(或者大于第二次)读时,进行真正的hedged读(为什么是第二次开始呢?因为第一次相当于一次正常读,如果读到就直接返回了,等待500ms后,直接进行下一次循环),设置boolean refetch = false;即选择dn时不从NameNodeRPC请求,如果选到了,从该dn读并将结果放入futures列表中,如果选不到节点需要将refetch设为true,用于再次获取该块的dn。
以下贴出hedgedRead的关键代码:
/**
* Like {@link #fetchBlockByteRange}except we start up a second, parallel,
* 'hedged' read if the first read is taking longer than configured amount of
* time. We then wait on which ever read returns first.
*/
private void hedgedFetchBlockByteRange(LocatedBlock block, long start,
long end, ByteBuffer buf, CorruptedBlocks corruptedBlocks)
throws IOException {
final DFSClient.Conf conf = dfsClient.getConf();
ArrayList<Future<ByteBuffer>> futures = new ArrayList<>();
CompletionService<ByteBuffer> hedgedService =
new ExecutorCompletionService<>(dfsClient.getHedgedReadsThreadPool());
ArrayList<DatanodeInfo> ignored = new ArrayList<>();
ByteBuffer bb;
int len = (int) (end - start + 1);
int hedgedReadId = 0;
while (true) {
// see HDFS-6591, this metric is used to verify/catch unnecessary loops
hedgedReadOpsLoopNumForTesting++;
DNAddrPair chosenNode = null;
// there is no request already executing.
if (futures.isEmpty()) {
// chooseDataNode is a commitment. If no node, we go to
// the NN to reget block locations. Only go here on first read.
chosenNode = chooseDataNode(block, ignored);
// Latest block, if refreshed internally
block = chosenNode.block;
bb = ByteBuffer.allocate(len);
Callable<ByteBuffer> getFromDataNodeCallable = getFromOneDataNode(
chosenNode, block, start, end, bb,
corruptedBlocks, hedgedReadId++);
Future<ByteBuffer> firstRequest = hedgedService
.submit(getFromDataNodeCallable);
futures.add(firstRequest);
Future<ByteBuffer> future = null;
try {
future = hedgedService.poll(
dfsClient.getHedgedReadTimeout(), TimeUnit.MILLISECONDS);
if (future != null) {
ByteBuffer result = future.get();
result.flip();
buf.put(result);
return;
}
DFSClient.LOG.debug("Waited {}ms to read from {}; spawning hedged "
+ "read", dfsClient.getHedgedReadTimeout(), chosenNode.info);
dfsClient.getHedgedReadMetrics().incHedgedReadOps();
// continue; no need to refresh block locations
} catch (ExecutionException e) {
futures.remove(future);
} catch (InterruptedException e) {
throw new InterruptedIOException(
"Interrupted while waiting for reading task");
}
// Ignore this node on next go around.
// If poll timeout and the request still ongoing, don't consider it
// again. If read data failed, don't consider it either.
ignored.add(chosenNode.info);
} else {
// We are starting up a 'hedged' read. We have a read already
// ongoing. Call getBestNodeDNAddrPair instead of chooseDataNode.
// If no nodes to do hedged reads against, pass.
boolean refetch = false;
try {
chosenNode = chooseDataNode(block, ignored, false);
if (chosenNode != null) {
// Latest block, if refreshed internally
block = chosenNode.block;
bb = ByteBuffer.allocate(len);
Callable<ByteBuffer> getFromDataNodeCallable =
getFromOneDataNode(chosenNode, block, start, end, bb,
corruptedBlocks, hedgedReadId++);
Future<ByteBuffer> oneMoreRequest =
hedgedService.submit(getFromDataNodeCallable);
futures.add(oneMoreRequest);
} else {
refetch = true;
}
} catch (IOException ioe) {
DFSClient.LOG.debug("Failed getting node for hedged read: {}",
ioe.getMessage());
}
// if not succeeded. Submit callables for each datanode in a loop, wait
// for a fixed interval and get the result from the fastest one.
try {
ByteBuffer result = getFirstToComplete(hedgedService, futures);
// cancel the rest.
cancelAll(futures);
dfsClient.getHedgedReadMetrics().incHedgedReadWins();
result.flip();
buf.put(result);
return;
} catch (InterruptedException ie) {
// Ignore and retry
}
if (refetch) {
refetchLocations(block, ignored);
}
// We got here if exception. Ignore this node on next go around IFF
// we found a chosenNode to hedge read against.
if (chosenNode != null && chosenNode.info != null) {
ignored.add(chosenNode.info);
}
}
}
}
getFirstToComplete的方法如下:
private ByteBuffer getFirstToComplete(
CompletionService<ByteBuffer> hedgedService,
ArrayList<Future<ByteBuffer>> futures) throws InterruptedException {
if (futures.isEmpty()) {
throw new InterruptedException("let's retry");
}
Future<ByteBuffer> future = null;
try {
future = hedgedService.take();
ByteBuffer bb = future.get();
futures.remove(future);
return bb;
} catch (ExecutionException | CancellationException e) {
// already logged in the Callable
futures.remove(future);
}
throw new InterruptedException("let's retry");
}
1.hedged读的实现符合典型的异步线程Future设计模式,其任一个线程得到结果则取消其他线程。利用限时等待的非阻塞的poll方法来启动hedged读,利用阻塞的task方法来得到优先结果,利用cancelAll来取消其他线程。这种设计思想值得学习。