当一个应用向YARN集群提交作业后,此作业的多个任务由于负载不均衡、资源分布不均等原因都会导致各个任务运行完成的时间不一致,甚至会出现一个任务明显慢于同一作业的其它任务的情况。如果对这种情况不加优化,最慢的任务最终会拖慢整个作业的整体执行进度。好在mapreduce框架提供了任务推断执行机制,当有必要时就启动一个备份任务。最终会采用备份任务和原任务中率先执行完的结果作为最终结果。
由于具体分析推断执行机制,篇幅很长,所以我会分成几篇内容陆续介绍。
本文在我自己搭建的集群(集群搭建可以参阅《Linux下Hadoop2.6.0集群环境的搭建》一文)上,执行wordcount例子,来验证mapreduce框架的推断机制。我们输入以下命令:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount -D mapreduce.input.fileinputformat.split.maxsize=19 /wordcount/input/ /wordcount/output/result1
我们看到map任务被划分为10:
这一次测试并没有发生推断执行的情况,我们可以再执行一次上面厄命令,最后看到的信息如下:
其中看到执行的map数量多了1个,成了11个,而且还出现了Killed map tasks=1的信息,这表示这次执行最终发生了推断执行。其中一个map任务增加了一个备份任务,当备份的map任务和原有的map任务中有一个率先完成了,那么会将另一个慢的map任务杀死。reduce任务也是类似,只不过Hadoop2.6.0的子项目hadoop-mapreduce-examples中自带的wordcount例子中,使用Job.setNumReduceTasks(int)这个API将reduce任务的数量控制为1个。在这里我们看到推断执行有时候会发生,而有时候却不会,这是为什么呢?
在《Hadoop2.6.0运行mapreduce之Uber模式验证》一文中我还简短提到,如果启用了Uber运行模式,推断执行会被取消。关于这些内部的实现原理需要我们从架构设计和源码角度进行剖析,因为我们还需要知道所以然。
在mapreduce中设计了Speculator接口作为推断执行的统一规范,DefaultSpeculator作为一种服务在实现了Speculator的同时继承了AbstractService,DefaultSpeculator是mapreduce的默认实现。DefaultSpeculator负责处理SpeculatorEvent事件,目前主要包括四种事件,分别是:
TaskRuntimeEstimator接口为推断执行提供了计算模型的规范,默认的实现类是LegacyTaskRuntimeEstimator,此外还有ExponentiallySmoothedTaskRuntimeEstimator。这里暂时不对其内容进行深入介绍,在后面会陆续展开。
Speculator的初始化和启动伴随着MRAppMaster的初始化与启动。
接下来我们以Speculator接口的默认实现DefaultSpeculator为例,逐步分析其初始化、启动、推断执行等内容的工作原理。
Speculator是MRAppMaster的子组件、自服务,所以也需要初始化。有经验的Hadoop工程师,想必知道当mapreduce作业提交给ResourceManager后,由RM负责向NodeManger通信启动一个Container用于执行MRAppMaster。启动MRAppMaster实际也是通过调用其main方法,其中会调用MRAppMaster实例的serviceInit方法,其中与Speculator有关的代码实现见代码清单1。
if (conf.getBoolean(MRJobConfig.MAP_SPECULATIVE, false) || conf.getBoolean(MRJobConfig.REDUCE_SPECULATIVE, false)) { //optional service to speculate on task attempts' progress speculator = createSpeculator(conf, context); addIfService(speculator); } speculatorEventDispatcher = new SpeculatorEventDispatcher(conf); dispatcher.register(Speculator.EventType.class, speculatorEventDispatcher);
代码清单1所示代码的执行步骤如下:
createSpeculator方法(见代码清单2)创建的推断服务的实现类默认是DefaultSpeculator,用户也可以通过参数yarn.app.mapreduce.am.job.speculator.class(即MRJobConfig.MR_AM_JOB_SPECULATOR)指定推断服务的实现类。
代码清单2 创建推断器
protected Speculator createSpeculator(Configuration conf, final AppContext context) { return callWithJobClassLoader(conf, new Action<Speculator>() { public Speculator call(Configuration conf) { Class<? extends Speculator> speculatorClass; try { speculatorClass // "yarn.mapreduce.job.speculator.class" = conf.getClass(MRJobConfig.MR_AM_JOB_SPECULATOR, DefaultSpeculator.class, Speculator.class); Constructor<? extends Speculator> speculatorConstructor = speculatorClass.getConstructor (Configuration.class, AppContext.class); Speculator result = speculatorConstructor.newInstance(conf, context); return result; } catch (InstantiationException ex) { LOG.error("Can't make a speculator -- check " + MRJobConfig.MR_AM_JOB_SPECULATOR, ex); throw new YarnRuntimeException(ex); } catch (IllegalAccessException ex) { LOG.error("Can't make a speculator -- check " + MRJobConfig.MR_AM_JOB_SPECULATOR, ex); throw new YarnRuntimeException(ex); } catch (InvocationTargetException ex) { LOG.error("Can't make a speculator -- check " + MRJobConfig.MR_AM_JOB_SPECULATOR, ex); throw new YarnRuntimeException(ex); } catch (NoSuchMethodException ex) { LOG.error("Can't make a speculator -- check " + MRJobConfig.MR_AM_JOB_SPECULATOR, ex); throw new YarnRuntimeException(ex); } } }); }
根据代码清单2,我们知道createSpeculator方法通过反射调用了DefaultSpeculator的构造器来实例化任务推断器。DefaultSpeculator的构造器如下:
public DefaultSpeculator(Configuration conf, AppContext context) { this(conf, context, context.getClock()); } public DefaultSpeculator(Configuration conf, AppContext context, Clock clock) { this(conf, context, getEstimator(conf, context), clock); }
static private TaskRuntimeEstimator getEstimator (Configuration conf, AppContext context) { TaskRuntimeEstimator estimator; try { // "yarn.mapreduce.job.task.runtime.estimator.class" Class<? extends TaskRuntimeEstimator> estimatorClass = conf.getClass(MRJobConfig.MR_AM_TASK_ESTIMATOR, LegacyTaskRuntimeEstimator.class, TaskRuntimeEstimator.class); Constructor<? extends TaskRuntimeEstimator> estimatorConstructor = estimatorClass.getConstructor(); estimator = estimatorConstructor.newInstance(); estimator.contextualize(conf, context); } catch (InstantiationException ex) { LOG.error("Can't make a speculation runtime estimator", ex); throw new YarnRuntimeException(ex); } catch (IllegalAccessException ex) { LOG.error("Can't make a speculation runtime estimator", ex); throw new YarnRuntimeException(ex); } catch (InvocationTargetException ex) { LOG.error("Can't make a speculation runtime estimator", ex); throw new YarnRuntimeException(ex); } catch (NoSuchMethodException ex) { LOG.error("Can't make a speculation runtime estimator", ex); throw new YarnRuntimeException(ex); } return estimator; }根据代码清单3可以看出推断估算器可以通过参数yarn.app.mapreduce.am.job.task.estimator.class(即MRJobConfig.MR_AM_TASK_ESTIMATOR)进行指定,如果没有指定,则默认使用LegacyTaskRuntimeEstimator。实例化LegacyTaskRuntimeEstimator后,还调用其父类StartEndTimesBase的contextualize方法(见代码清单4)进行上下文的初始化,实际就是将当前作业添加到map任务统计列表、reduce任务统计列表,并设置任务与其慢任务阈值(mapreduce.job.speculative.slowtaskthreshold)之间的映射关系。
代码清单4 StartEndTimesBase的初始化
@Override public void contextualize(Configuration conf, AppContext context) { this.context = context; Map<JobId, Job> allJobs = context.getAllJobs(); for (Map.Entry<JobId, Job> entry : allJobs.entrySet()) { final Job job = entry.getValue(); mapperStatistics.put(job, new DataStatistics()); reducerStatistics.put(job, new DataStatistics()); slowTaskRelativeTresholds.put (job, conf.getFloat(MRJobConfig.SPECULATIVE_SLOWTASK_THRESHOLD,1.0f)); } }
public DefaultSpeculator (Configuration conf, AppContext context, TaskRuntimeEstimator estimator, Clock clock) { super(DefaultSpeculator.class.getName()); this.conf = conf; this.context = context; this.estimator = estimator; this.clock = clock; this.eventHandler = context.getEventHandler(); }
至此,我们介绍完了DefaultSpeculator的初始化过程。
代码清单5 启动MRAppMaster时涉及Speculator的部分
if (job.isUber()) { speculatorEventDispatcher.disableSpeculation(); LOG.info("MRAppMaster uberizing job " + job.getID() + " in local container (\"uber-AM\") on node " + nmHost + ":" + nmPort + "."); } else { // send init to speculator only for non-uber jobs. // This won't yet start as dispatcher isn't started yet. dispatcher.getEventHandler().handle( new SpeculatorEvent(job.getID(), clock.getTime())); LOG.info("MRAppMaster launching normal, non-uberized, multi-container " + "job " + job.getID() + "."); }
分析代码清单5,其中与任务推断有关的逻辑如下:
public SpeculatorEvent(JobId jobID, long timestamp) { super(Speculator.EventType.JOB_CREATE, timestamp); this.jobID = jobID; }由此可见,在启动MRAppMaster的阶段,创建的SpeculatorEvent事件的类型是Speculator.EventType.JOB_CREATE。SpeculatorEventDispatcher的handle方法会被调度器执行,用以处理SpeculatorEvent事件,其代码实现见代码清单6。
private class SpeculatorEventDispatcher implements EventHandler<SpeculatorEvent> { private final Configuration conf; private volatile boolean disabled; public SpeculatorEventDispatcher(Configuration config) { this.conf = config; } @Override public void handle(final SpeculatorEvent event) { if (disabled) { return; } TaskId tId = event.getTaskID(); TaskType tType = null; /* event's TaskId will be null if the event type is JOB_CREATE or * ATTEMPT_STATUS_UPDATE */ if (tId != null) { tType = tId.getTaskType(); } boolean shouldMapSpec = conf.getBoolean(MRJobConfig.MAP_SPECULATIVE, false); boolean shouldReduceSpec = conf.getBoolean(MRJobConfig.REDUCE_SPECULATIVE, false); /* The point of the following is to allow the MAP and REDUCE speculative * config values to be independent: * IF spec-exec is turned on for maps AND the task is a map task * OR IF spec-exec is turned on for reduces AND the task is a reduce task * THEN call the speculator to handle the event. */ if ( (shouldMapSpec && (tType == null || tType == TaskType.MAP)) || (shouldReduceSpec && (tType == null || tType == TaskType.REDUCE))) { // Speculator IS enabled, direct the event to there. callWithJobClassLoader(conf, new Action<Void>() { public Void call(Configuration conf) { speculator.handle(event); return null; } }); } } public void disableSpeculation() { disabled = true; } }
SpeculatorEventDispatcher的实现告诉我们当启用map或者reduce任务推断时,将异步调用Speculator的handle方法处理SpeculatorEvent事件。以默认的DefaultSpeculator的handle方法为例,来看看其实现,代码如下:
@Override public void handle(SpeculatorEvent event) { processSpeculatorEvent(event); }
代码清单7 DefaultSpeculator处理SpeculatorEvent事件的实现
private synchronized void processSpeculatorEvent(SpeculatorEvent event) { switch (event.getType()) { case ATTEMPT_STATUS_UPDATE: statusUpdate(event.getReportedStatus(), event.getTimestamp()); break; case TASK_CONTAINER_NEED_UPDATE: { AtomicInteger need = containerNeed(event.getTaskID()); need.addAndGet(event.containersNeededChange()); break; } case ATTEMPT_START: { LOG.info("ATTEMPT_START " + event.getTaskID()); estimator.enrollAttempt (event.getReportedStatus(), event.getTimestamp()); break; } case JOB_CREATE: { LOG.info("JOB_CREATE " + event.getJobID()); estimator.contextualize(getConfig(), context); break; } } }当DefaultSpeculator收到类型为JOB_CREATE的SpeculatorEvent事件时会匹配执行以下代码:
case JOB_CREATE: { LOG.info("JOB_CREATE " + event.getJobID()); estimator.contextualize(getConfig(), context); break; }这里实际也调用了StartEndTimesBase的contextualize方法(见代码清单4),不再赘述。
由于DefaultSpeculator也是MRAppMaster的子组件之一,所以在启动MRAppMaster(调用MRAppMaster的serviceStart)的过程中,也会调用DefaultSpeculator的serviceStart方法(见代码清单8)启动DefaultSpeculator。
代码清单8 启动DefaultSpeculator
protected void serviceStart() throws Exception { Runnable speculationBackgroundCore = new Runnable() { @Override public void run() { while (!stopped && !Thread.currentThread().isInterrupted()) { long backgroundRunStartTime = clock.getTime(); try { int speculations = computeSpeculations(); long mininumRecomp = speculations > 0 ? SOONEST_RETRY_AFTER_SPECULATE : SOONEST_RETRY_AFTER_NO_SPECULATE; long wait = Math.max(mininumRecomp, clock.getTime() - backgroundRunStartTime); if (speculations > 0) { LOG.info("We launched " + speculations + " speculations. Sleeping " + wait + " milliseconds."); } Object pollResult = scanControl.poll(wait, TimeUnit.MILLISECONDS); } catch (InterruptedException e) { if (!stopped) { LOG.error("Background thread returning, interrupted", e); } return; } } } }; speculationBackgroundThread = new Thread (speculationBackgroundCore, "DefaultSpeculator background processing"); speculationBackgroundThread.start(); super.serviceStart(); }
启动DefaultSpeculator的主要目的是启动一个线程不断推断执行进行估算,步骤如下:
private int computeSpeculations() { // We'll try to issue one map and one reduce speculation per job per run return maybeScheduleAMapSpeculation() + maybeScheduleAReduceSpeculation(); }computeSpeculations方法返回的结果等于maybeScheduleAMapSpeculation方法(用于推断需要重新分配Container的map任务数量)和maybeScheduleAReduceSpeculation方法(用于推断需要重新分配Container的reduce任务数量)返回值的和。maybeScheduleAMapSpeculation的实现如下:
private int maybeScheduleAMapSpeculation() { return maybeScheduleASpeculation(TaskType.MAP); } private int maybeScheduleAReduceSpeculation() { return maybeScheduleASpeculation(TaskType.REDUCE); }
private int maybeScheduleASpeculation(TaskType type) { int successes = 0; long now = clock.getTime(); ConcurrentMap<JobId, AtomicInteger> containerNeeds = type == TaskType.MAP ? mapContainerNeeds : reduceContainerNeeds; for (ConcurrentMap.Entry<JobId, AtomicInteger> jobEntry : containerNeeds.entrySet()) { // This race conditon is okay. If we skip a speculation attempt we // should have tried because the event that lowers the number of // containers needed to zero hasn't come through, it will next time. // Also, if we miss the fact that the number of containers needed was // zero but increased due to a failure it's not too bad to launch one // container prematurely. if (jobEntry.getValue().get() > 0) { continue; } int numberSpeculationsAlready = 0; int numberRunningTasks = 0; // loop through the tasks of the kind Job job = context.getJob(jobEntry.getKey()); Map<TaskId, Task> tasks = job.getTasks(type); int numberAllowedSpeculativeTasks = (int) Math.max(MINIMUM_ALLOWED_SPECULATIVE_TASKS, PROPORTION_TOTAL_TASKS_SPECULATABLE * tasks.size()); TaskId bestTaskID = null; long bestSpeculationValue = -1L; // this loop is potentially pricey. // TODO track the tasks that are potentially worth looking at for (Map.Entry<TaskId, Task> taskEntry : tasks.entrySet()) { long mySpeculationValue = speculationValue(taskEntry.getKey(), now); if (mySpeculationValue == ALREADY_SPECULATING) { ++numberSpeculationsAlready; } if (mySpeculationValue != NOT_RUNNING) { ++numberRunningTasks; } if (mySpeculationValue > bestSpeculationValue) { bestTaskID = taskEntry.getKey(); bestSpeculationValue = mySpeculationValue; } } numberAllowedSpeculativeTasks = (int) Math.max(numberAllowedSpeculativeTasks, PROPORTION_RUNNING_TASKS_SPECULATABLE * numberRunningTasks); // If we found a speculation target, fire it off if (bestTaskID != null && numberAllowedSpeculativeTasks > numberSpeculationsAlready) { addSpeculativeAttempt(bestTaskID); ++successes; } } return successes; }
遍历containerNeeds的执行步骤如下:
//Add attempt to a given Task. protected void addSpeculativeAttempt(TaskId taskID) { LOG.info ("DefaultSpeculator.addSpeculativeAttempt -- we are speculating " + taskID); eventHandler.handle(new TaskEvent(taskID, TaskEventType.T_ADD_SPEC_ATTEMPT)); mayHaveSpeculated.add(taskID); }
根据代码清单10,我们看到推断执行尝试是通过发送类型为TaskEventType.T_ADD_SPEC_ATTEMPT的TaskEvent事件完成的。
在分析代码清单9时,我故意跳过了speculationValue方法的分析。speculationValue方法(见代码清单11)主要用于估算每个任务的推断值。
代码清单11 估算任务的推断值
private long speculationValue(TaskId taskID, long now) { Job job = context.getJob(taskID.getJobId()); Task task = job.getTask(taskID); Map<TaskAttemptId, TaskAttempt> attempts = task.getAttempts(); long acceptableRuntime = Long.MIN_VALUE; long result = Long.MIN_VALUE; if (!mayHaveSpeculated.contains(taskID)) { acceptableRuntime = estimator.thresholdRuntime(taskID); if (acceptableRuntime == Long.MAX_VALUE) { return ON_SCHEDULE; } } TaskAttemptId runningTaskAttemptID = null; int numberRunningAttempts = 0; for (TaskAttempt taskAttempt : attempts.values()) { if (taskAttempt.getState() == TaskAttemptState.RUNNING || taskAttempt.getState() == TaskAttemptState.STARTING) { if (++numberRunningAttempts > 1) { return ALREADY_SPECULATING; } runningTaskAttemptID = taskAttempt.getID(); long estimatedRunTime = estimator.estimatedRuntime(runningTaskAttemptID); long taskAttemptStartTime = estimator.attemptEnrolledTime(runningTaskAttemptID); if (taskAttemptStartTime > now) { // This background process ran before we could process the task // attempt status change that chronicles the attempt start return TOO_NEW; } long estimatedEndTime = estimatedRunTime + taskAttemptStartTime; long estimatedReplacementEndTime = now + estimator.estimatedNewAttemptRuntime(taskID); float progress = taskAttempt.getProgress(); TaskAttemptHistoryStatistics data = runningTaskAttemptStatistics.get(runningTaskAttemptID); if (data == null) { runningTaskAttemptStatistics.put(runningTaskAttemptID, new TaskAttemptHistoryStatistics(estimatedRunTime, progress, now)); } else { if (estimatedRunTime == data.getEstimatedRunTime() && progress == data.getProgress()) { // Previous stats are same as same stats if (data.notHeartbeatedInAWhile(now)) { // Stats have stagnated for a while, simulate heart-beat. TaskAttemptStatus taskAttemptStatus = new TaskAttemptStatus(); taskAttemptStatus.id = runningTaskAttemptID; taskAttemptStatus.progress = progress; taskAttemptStatus.taskState = taskAttempt.getState(); // Now simulate the heart-beat handleAttempt(taskAttemptStatus); } } else { // Stats have changed - update our data structure data.setEstimatedRunTime(estimatedRunTime); data.setProgress(progress); data.resetHeartBeatTime(now); } } if (estimatedEndTime < now) { return PROGRESS_IS_GOOD; } if (estimatedReplacementEndTime >= estimatedEndTime) { return TOO_LATE_TO_SPECULATE; } result = estimatedEndTime - estimatedReplacementEndTime; } } // If we are here, there's at most one task attempt. if (numberRunningAttempts == 0) { return NOT_RUNNING; } if (acceptableRuntime == Long.MIN_VALUE) { acceptableRuntime = estimator.thresholdRuntime(taskID); if (acceptableRuntime == Long.MAX_VALUE) { return ON_SCHEDULE; } } return result; }speculationValue方法的执行步骤如下: