当一个应用向YARN集群提交作业后,此作业的多个任务由于负载不均衡、资源分布不均等原因都会导致各个任务运行完成的时间不一致,甚至会出现一个任务明显慢于同一作业的其它任务的情况。如果对这种情况不加优化,最慢的任务最终会拖慢整个作业的整体执行进度。好在mapreduce框架提供了任务推断执行机制,当有必要时就启动一个备份任务。最终会采用备份任务和原任务中率先执行完的结果作为最终结果。
由于具体分析推断执行机制,篇幅很长,所以我会分成几篇内容陆续介绍。
本文在我自己搭建的集群(集群搭建可以参阅《Linux下Hadoop2.6.0集群环境的搭建》一文)上,执行wordcount例子,来验证mapreduce框架的推断机制。我们输入以下命令:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount -D mapreduce.input.fileinputformat.split.maxsize=19 /wordcount/input/ /wordcount/output/result1
我们看到map任务被划分为10:
这一次测试并没有发生推断执行的情况,我们可以再执行一次上面的命令,最后看到的信息如下:
其中看到执行的map数量多了1个,成了11个,而且还出现了Killed map tasks=1的信息,这表示这次执行最终发生了推断执行。其中一个map任务增加了一个备份任务,当备份的map任务和原有的map任务中有一个率先完成了,那么会将另一个慢的map任务杀死。reduce任务也是类似,只不过Hadoop2.6.0的子项目hadoop-mapreduce-examples中自带的wordcount例子中,使用Job.setNumReduceTasks(int)这个API将reduce任务的数量控制为1个。在这里我们看到推断执行有时候会发生,而有时候却不会,这是为什么呢?
在《Hadoop2.6.0运行mapreduce之Uber模式验证》一文中我还简短提到,如果启用了Uber运行模式,推断执行会被取消。关于这些内部的实现原理需要我们从架构设计和源码角度进行剖析,因为我们还需要知道所以然。
在mapreduce中设计了Speculator接口作为推断执行的统一规范,DefaultSpeculator作为一种服务在实现了Speculator的同时继承了AbstractService,DefaultSpeculator是mapreduce的默认实现。DefaultSpeculator负责处理SpeculatorEvent事件,目前主要包括四种事件,分别是:
TaskRuntimeEstimator接口为推断执行提供了计算模型的规范,默认的实现类是LegacyTaskRuntimeEstimator,此外还有ExponentiallySmoothedTaskRuntimeEstimator。这里暂时不对其内容进行深入介绍,在后面会陆续展开。
Speculator的初始化和启动伴随着MRAppMaster的初始化与启动。
接下来我们以Speculator接口的默认实现DefaultSpeculator为例,逐步分析其初始化、启动、推断执行等内容的工作原理。
Speculator是MRAppMaster的子组件、子服务,所以也需要初始化。有经验的Hadoop工程师,想必知道当mapreduce作业提交给ResourceManager后,由RM负责向NodeManger通信启动一个Container用于执行MRAppMaster。启动MRAppMaster实际也是通过调用其main方法,其中会调用MRAppMaster实例的serviceInit方法,其中与Speculator有关的代码实现见代码清单1。
if (conf.getBoolean(MRJobConfig.MAP_SPECULATIVE, false)
|| conf.getBoolean(MRJobConfig.REDUCE_SPECULATIVE, false)) {
//optional service to speculate on task attempts' progress
speculator = createSpeculator(conf, context);
addIfService(speculator);
}
speculatorEventDispatcher = new SpeculatorEventDispatcher(conf);
dispatcher.register(Speculator.EventType.class,
speculatorEventDispatcher);
代码清单1所示代码的执行步骤如下:
createSpeculator方法(见代码清单2)创建的推断服务的实现类默认是DefaultSpeculator,用户也可以通过参数yarn.app.mapreduce.am.job.speculator.class(即MRJobConfig.MR_AM_JOB_SPECULATOR)指定推断服务的实现类。
代码清单2 创建推断器
protected Speculator createSpeculator(Configuration conf,
final AppContext context) {
return callWithJobClassLoader(conf, new Action() {
public Speculator call(Configuration conf) {
Class extends Speculator> speculatorClass;
try {
speculatorClass
// "yarn.mapreduce.job.speculator.class"
= conf.getClass(MRJobConfig.MR_AM_JOB_SPECULATOR,
DefaultSpeculator.class,
Speculator.class);
Constructor extends Speculator> speculatorConstructor
= speculatorClass.getConstructor
(Configuration.class, AppContext.class);
Speculator result = speculatorConstructor.newInstance(conf, context);
return result;
} catch (InstantiationException ex) {
LOG.error("Can't make a speculator -- check "
+ MRJobConfig.MR_AM_JOB_SPECULATOR, ex);
throw new YarnRuntimeException(ex);
} catch (IllegalAccessException ex) {
LOG.error("Can't make a speculator -- check "
+ MRJobConfig.MR_AM_JOB_SPECULATOR, ex);
throw new YarnRuntimeException(ex);
} catch (InvocationTargetException ex) {
LOG.error("Can't make a speculator -- check "
+ MRJobConfig.MR_AM_JOB_SPECULATOR, ex);
throw new YarnRuntimeException(ex);
} catch (NoSuchMethodException ex) {
LOG.error("Can't make a speculator -- check "
+ MRJobConfig.MR_AM_JOB_SPECULATOR, ex);
throw new YarnRuntimeException(ex);
}
}
});
}
根据代码清单2,我们知道createSpeculator方法通过反射调用了DefaultSpeculator的构造器来实例化任务推断器。DefaultSpeculator的构造器如下:
public DefaultSpeculator(Configuration conf, AppContext context) {
this(conf, context, context.getClock());
}
public DefaultSpeculator(Configuration conf, AppContext context, Clock clock) {
this(conf, context, getEstimator(conf, context), clock);
}
static private TaskRuntimeEstimator getEstimator
(Configuration conf, AppContext context) {
TaskRuntimeEstimator estimator;
try {
// "yarn.mapreduce.job.task.runtime.estimator.class"
Class extends TaskRuntimeEstimator> estimatorClass
= conf.getClass(MRJobConfig.MR_AM_TASK_ESTIMATOR,
LegacyTaskRuntimeEstimator.class,
TaskRuntimeEstimator.class);
Constructor extends TaskRuntimeEstimator> estimatorConstructor
= estimatorClass.getConstructor();
estimator = estimatorConstructor.newInstance();
estimator.contextualize(conf, context);
} catch (InstantiationException ex) {
LOG.error("Can't make a speculation runtime estimator", ex);
throw new YarnRuntimeException(ex);
} catch (IllegalAccessException ex) {
LOG.error("Can't make a speculation runtime estimator", ex);
throw new YarnRuntimeException(ex);
} catch (InvocationTargetException ex) {
LOG.error("Can't make a speculation runtime estimator", ex);
throw new YarnRuntimeException(ex);
} catch (NoSuchMethodException ex) {
LOG.error("Can't make a speculation runtime estimator", ex);
throw new YarnRuntimeException(ex);
}
return estimator;
}
根据代码清单3可以看出推断估算器可以通过参数yarn.app.mapreduce.am.job.task.estimator.class(即MRJobConfig.MR_AM_TASK_ESTIMATOR)进行指定,如果没有指定,则默认使用LegacyTaskRuntimeEstimator。实例化LegacyTaskRuntimeEstimator后,还调用其父类StartEndTimesBase的contextualize方法(见代码清单4)进行上下文的初始化,实际就是将当前作业添加到map任务统计列表、reduce任务统计列表,并设置作业与其慢任务阈值(mapreduce.job.speculative.slowtaskthreshold)之间的映射关系。代码清单4 StartEndTimesBase的初始化
@Override
public void contextualize(Configuration conf, AppContext context) {
this.context = context;
Map allJobs = context.getAllJobs();
for (Map.Entry entry : allJobs.entrySet()) {
final Job job = entry.getValue();
mapperStatistics.put(job, new DataStatistics());
reducerStatistics.put(job, new DataStatistics());
slowTaskRelativeTresholds.put
(job, conf.getFloat(MRJobConfig.SPECULATIVE_SLOWTASK_THRESHOLD,1.0f));
}
}
public DefaultSpeculator
(Configuration conf, AppContext context,
TaskRuntimeEstimator estimator, Clock clock) {
super(DefaultSpeculator.class.getName());
this.conf = conf;
this.context = context;
this.estimator = estimator;
this.clock = clock;
this.eventHandler = context.getEventHandler();
}
至此,我们介绍完了DefaultSpeculator的初始化过程。
代码清单5 启动MRAppMaster时涉及Speculator的部分
if (job.isUber()) {
speculatorEventDispatcher.disableSpeculation();
LOG.info("MRAppMaster uberizing job " + job.getID()
+ " in local container (\"uber-AM\") on node "
+ nmHost + ":" + nmPort + ".");
} else {
// send init to speculator only for non-uber jobs.
// This won't yet start as dispatcher isn't started yet.
dispatcher.getEventHandler().handle(
new SpeculatorEvent(job.getID(), clock.getTime()));
LOG.info("MRAppMaster launching normal, non-uberized, multi-container "
+ "job " + job.getID() + ".");
}
分析代码清单5,其中与任务推断有关的逻辑如下:
public SpeculatorEvent(JobId jobID, long timestamp) {
super(Speculator.EventType.JOB_CREATE, timestamp);
this.jobID = jobID;
}
由此可见,在启动MRAppMaster的阶段,创建的SpeculatorEvent事件的类型是Speculator.EventType.JOB_CREATE。SpeculatorEventDispatcher的handle方法会被调度器执行,用以处理SpeculatorEvent事件,其代码实现见代码清单6。
private class SpeculatorEventDispatcher implements
EventHandler {
private final Configuration conf;
private volatile boolean disabled;
public SpeculatorEventDispatcher(Configuration config) {
this.conf = config;
}
@Override
public void handle(final SpeculatorEvent event) {
if (disabled) {
return;
}
TaskId tId = event.getTaskID();
TaskType tType = null;
/* event's TaskId will be null if the event type is JOB_CREATE or
* ATTEMPT_STATUS_UPDATE
*/
if (tId != null) {
tType = tId.getTaskType();
}
boolean shouldMapSpec =
conf.getBoolean(MRJobConfig.MAP_SPECULATIVE, false);
boolean shouldReduceSpec =
conf.getBoolean(MRJobConfig.REDUCE_SPECULATIVE, false);
/* The point of the following is to allow the MAP and REDUCE speculative
* config values to be independent:
* IF spec-exec is turned on for maps AND the task is a map task
* OR IF spec-exec is turned on for reduces AND the task is a reduce task
* THEN call the speculator to handle the event.
*/
if ( (shouldMapSpec && (tType == null || tType == TaskType.MAP))
|| (shouldReduceSpec && (tType == null || tType == TaskType.REDUCE))) {
// Speculator IS enabled, direct the event to there.
callWithJobClassLoader(conf, new Action() {
public Void call(Configuration conf) {
speculator.handle(event);
return null;
}
});
}
}
public void disableSpeculation() {
disabled = true;
}
}
SpeculatorEventDispatcher的实现告诉我们当启用map或者reduce任务推断时,将异步调用Speculator的handle方法处理SpeculatorEvent事件。以默认的DefaultSpeculator的handle方法为例,来看看其实现,代码如下:
@Override
public void handle(SpeculatorEvent event) {
processSpeculatorEvent(event);
}
代码清单7 DefaultSpeculator处理SpeculatorEvent事件的实现
private synchronized void processSpeculatorEvent(SpeculatorEvent event) {
switch (event.getType()) {
case ATTEMPT_STATUS_UPDATE:
statusUpdate(event.getReportedStatus(), event.getTimestamp());
break;
case TASK_CONTAINER_NEED_UPDATE:
{
AtomicInteger need = containerNeed(event.getTaskID());
need.addAndGet(event.containersNeededChange());
break;
}
case ATTEMPT_START:
{
LOG.info("ATTEMPT_START " + event.getTaskID());
estimator.enrollAttempt
(event.getReportedStatus(), event.getTimestamp());
break;
}
case JOB_CREATE:
{
LOG.info("JOB_CREATE " + event.getJobID());
estimator.contextualize(getConfig(), context);
break;
}
}
}
当DefaultSpeculator收到类型为JOB_CREATE的SpeculatorEvent事件时会匹配执行以下代码:
case JOB_CREATE:
{
LOG.info("JOB_CREATE " + event.getJobID());
estimator.contextualize(getConfig(), context);
break;
}
这里实际也调用了StartEndTimesBase的contextualize方法(见代码清单4),不再赘述。
由于DefaultSpeculator也是MRAppMaster的子组件之一,所以在启动MRAppMaster(调用MRAppMaster的serviceStart)的过程中,也会调用DefaultSpeculator的serviceStart方法(见代码清单8)启动DefaultSpeculator。
代码清单8 启动DefaultSpeculator
protected void serviceStart() throws Exception {
Runnable speculationBackgroundCore
= new Runnable() {
@Override
public void run() {
while (!stopped && !Thread.currentThread().isInterrupted()) {
long backgroundRunStartTime = clock.getTime();
try {
int speculations = computeSpeculations();
long mininumRecomp
= speculations > 0 ? SOONEST_RETRY_AFTER_SPECULATE
: SOONEST_RETRY_AFTER_NO_SPECULATE;
long wait = Math.max(mininumRecomp,
clock.getTime() - backgroundRunStartTime);
if (speculations > 0) {
LOG.info("We launched " + speculations
+ " speculations. Sleeping " + wait + " milliseconds.");
}
Object pollResult
= scanControl.poll(wait, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
if (!stopped) {
LOG.error("Background thread returning, interrupted", e);
}
return;
}
}
}
};
speculationBackgroundThread = new Thread
(speculationBackgroundCore, "DefaultSpeculator background processing");
speculationBackgroundThread.start();
super.serviceStart();
}
启动DefaultSpeculator的主要目的是启动一个线程不断推断执行进行估算,步骤如下:
private int computeSpeculations() {
// We'll try to issue one map and one reduce speculation per job per run
return maybeScheduleAMapSpeculation() + maybeScheduleAReduceSpeculation();
}
computeSpeculations方法返回的结果等于maybeScheduleAMapSpeculation方法(用于推断需要重新分配Container的map任务数量)和maybeScheduleAReduceSpeculation方法(用于推断需要重新分配Container的reduce任务数量)返回值的和。maybeScheduleAMapSpeculation的实现如下:
private int maybeScheduleAMapSpeculation() {
return maybeScheduleASpeculation(TaskType.MAP);
}
private int maybeScheduleAReduceSpeculation() {
return maybeScheduleASpeculation(TaskType.REDUCE);
}
private int maybeScheduleASpeculation(TaskType type) {
int successes = 0;
long now = clock.getTime();
ConcurrentMap containerNeeds
= type == TaskType.MAP ? mapContainerNeeds : reduceContainerNeeds;
for (ConcurrentMap.Entry jobEntry : containerNeeds.entrySet()) {
// This race conditon is okay. If we skip a speculation attempt we
// should have tried because the event that lowers the number of
// containers needed to zero hasn't come through, it will next time.
// Also, if we miss the fact that the number of containers needed was
// zero but increased due to a failure it's not too bad to launch one
// container prematurely.
if (jobEntry.getValue().get() > 0) {
continue;
}
int numberSpeculationsAlready = 0;
int numberRunningTasks = 0;
// loop through the tasks of the kind
Job job = context.getJob(jobEntry.getKey());
Map tasks = job.getTasks(type);
int numberAllowedSpeculativeTasks
= (int) Math.max(MINIMUM_ALLOWED_SPECULATIVE_TASKS,
PROPORTION_TOTAL_TASKS_SPECULATABLE * tasks.size());
TaskId bestTaskID = null;
long bestSpeculationValue = -1L;
// this loop is potentially pricey.
// TODO track the tasks that are potentially worth looking at
for (Map.Entry taskEntry : tasks.entrySet()) {
long mySpeculationValue = speculationValue(taskEntry.getKey(), now);
if (mySpeculationValue == ALREADY_SPECULATING) {
++numberSpeculationsAlready;
}
if (mySpeculationValue != NOT_RUNNING) {
++numberRunningTasks;
}
if (mySpeculationValue > bestSpeculationValue) {
bestTaskID = taskEntry.getKey();
bestSpeculationValue = mySpeculationValue;
}
}
numberAllowedSpeculativeTasks
= (int) Math.max(numberAllowedSpeculativeTasks,
PROPORTION_RUNNING_TASKS_SPECULATABLE * numberRunningTasks);
// If we found a speculation target, fire it off
if (bestTaskID != null
&& numberAllowedSpeculativeTasks > numberSpeculationsAlready) {
addSpeculativeAttempt(bestTaskID);
++successes;
}
}
return successes;
}
遍历containerNeeds的执行步骤如下:
//Add attempt to a given Task.
protected void addSpeculativeAttempt(TaskId taskID) {
LOG.info
("DefaultSpeculator.addSpeculativeAttempt -- we are speculating " + taskID);
eventHandler.handle(new TaskEvent(taskID, TaskEventType.T_ADD_SPEC_ATTEMPT));
mayHaveSpeculated.add(taskID);
}
根据代码清单10,我们看到推断执行尝试是通过发送类型为TaskEventType.T_ADD_SPEC_ATTEMPT的TaskEvent事件完成的。
在分析代码清单9时,我故意跳过了speculationValue方法的分析。speculationValue方法(见代码清单11)主要用于估算每个任务的推断值。
代码清单11 估算任务的推断值
private long speculationValue(TaskId taskID, long now) {
Job job = context.getJob(taskID.getJobId());
Task task = job.getTask(taskID);
Map attempts = task.getAttempts();
long acceptableRuntime = Long.MIN_VALUE;
long result = Long.MIN_VALUE;
if (!mayHaveSpeculated.contains(taskID)) {
acceptableRuntime = estimator.thresholdRuntime(taskID);
if (acceptableRuntime == Long.MAX_VALUE) {
return ON_SCHEDULE;
}
}
TaskAttemptId runningTaskAttemptID = null;
int numberRunningAttempts = 0;
for (TaskAttempt taskAttempt : attempts.values()) {
if (taskAttempt.getState() == TaskAttemptState.RUNNING
|| taskAttempt.getState() == TaskAttemptState.STARTING) {
if (++numberRunningAttempts > 1) {
return ALREADY_SPECULATING;
}
runningTaskAttemptID = taskAttempt.getID();
long estimatedRunTime = estimator.estimatedRuntime(runningTaskAttemptID);
long taskAttemptStartTime
= estimator.attemptEnrolledTime(runningTaskAttemptID);
if (taskAttemptStartTime > now) {
// This background process ran before we could process the task
// attempt status change that chronicles the attempt start
return TOO_NEW;
}
long estimatedEndTime = estimatedRunTime + taskAttemptStartTime;
long estimatedReplacementEndTime
= now + estimator.estimatedNewAttemptRuntime(taskID);
float progress = taskAttempt.getProgress();
TaskAttemptHistoryStatistics data =
runningTaskAttemptStatistics.get(runningTaskAttemptID);
if (data == null) {
runningTaskAttemptStatistics.put(runningTaskAttemptID,
new TaskAttemptHistoryStatistics(estimatedRunTime, progress, now));
} else {
if (estimatedRunTime == data.getEstimatedRunTime()
&& progress == data.getProgress()) {
// Previous stats are same as same stats
if (data.notHeartbeatedInAWhile(now)) {
// Stats have stagnated for a while, simulate heart-beat.
TaskAttemptStatus taskAttemptStatus = new TaskAttemptStatus();
taskAttemptStatus.id = runningTaskAttemptID;
taskAttemptStatus.progress = progress;
taskAttemptStatus.taskState = taskAttempt.getState();
// Now simulate the heart-beat
handleAttempt(taskAttemptStatus);
}
} else {
// Stats have changed - update our data structure
data.setEstimatedRunTime(estimatedRunTime);
data.setProgress(progress);
data.resetHeartBeatTime(now);
}
}
if (estimatedEndTime < now) {
return PROGRESS_IS_GOOD;
}
if (estimatedReplacementEndTime >= estimatedEndTime) {
return TOO_LATE_TO_SPECULATE;
}
result = estimatedEndTime - estimatedReplacementEndTime;
}
}
// If we are here, there's at most one task attempt.
if (numberRunningAttempts == 0) {
return NOT_RUNNING;
}
if (acceptableRuntime == Long.MIN_VALUE) {
acceptableRuntime = estimator.thresholdRuntime(taskID);
if (acceptableRuntime == Long.MAX_VALUE) {
return ON_SCHEDULE;
}
}
return result;
}
speculationValue方法的执行步骤如下:后记:个人总结整理的《深入理解Spark:核心思想与源码分析》一书现在已经正式出版上市,目前京东、当当、天猫等网站均有销售,欢迎感兴趣的同学购买。
京东:(现有满150送50活动)http://item.jd.com/11846120.html
当当:http://product.dangdang.com/23838168.html