An Analysis of Delay Scheduling in the Fair Scheduler

  The main goal of delay scheduling is to improve data locality and reduce the amount of data transferred over the network. For MapTasks whose input data is not local, the scheduler delays scheduling them and instead gives the slot to MapTasks that do have local data.

  The general idea of delay scheduling is as follows (a minimal sketch follows this list):

  If the job can find a node-local MapTask, that task is returned; if not, scheduling is delayed, i.e. within the nodeLocalityDelay window the scheduler keeps trying to find a node-local MapTask for the job and returns it;

  otherwise, once the job has waited longer than nodeLocalityDelay, the scheduler looks for a rack-local MapTask and returns it; if none is found, scheduling is again delayed, i.e. within the rackLocalityDelay window it keeps trying to find a rack-local MapTask and returns it;

  otherwise, once the job has waited longer than nodeLocalityDelay + rackLocalityDelay, the scheduler falls back to an off-switch MapTask and returns it.
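
  Before looking at the implementation, here is a minimal, self-contained sketch of this decision rule. It is illustrative only (the enum and class names here are my own, not Hadoop's); the real logic lives in FairScheduler.getAllowedLocalityLevel(), shown later in this article:

enum LocalityLevel { NODE, RACK, ANY }

class DelaySchedulingSketch {
  long nodeLocalityDelay;   // max time (ms) to hold out for a node-local map
  long rackLocalityDelay;   // additional max time (ms) to hold out for a rack-local map

  // Decide the most relaxed locality level a job may use, given the level of
  // its last launched map and how long it has waited since then.
  LocalityLevel allowedLevel(LocalityLevel lastMapLocalityLevel,
                             long timeWaitedForLocalMap) {
    switch (lastMapLocalityLevel) {
      case NODE:
        if (timeWaitedForLocalMap >= nodeLocalityDelay + rackLocalityDelay)
          return LocalityLevel.ANY;   // waited long enough: allow off-switch
        if (timeWaitedForLocalMap >= nodeLocalityDelay)
          return LocalityLevel.RACK;  // relax to rack-local
        return LocalityLevel.NODE;    // keep insisting on node-local
      case RACK:
        if (timeWaitedForLocalMap >= rackLocalityDelay)
          return LocalityLevel.ANY;
        return LocalityLevel.RACK;
      default:
        return LocalityLevel.ANY;     // last map was already non-local
    }
  }
}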

 

  The main fields related to delay scheduling in FairScheduler.java (the per-job ones are kept in its JobInfo inner class):

long nodeLocalityDelay;             // how long a job may wait for a node-local map before relaxing to rack-local
long rackLocalityDelay;             // how much longer it may then wait for a rack-local map before going off-switch
boolean skippedAtLastHeartbeat;     // whether the job was skipped (delay-scheduled) at the last heartbeat
long timeWaitedForLocalMap;         // time the job has waited for a local map since its last map was assigned
LocalityLevel lastMapLocalityLevel; // locality level of the last map assigned to this job

// By default both delays are derived from the heartbeat interval:
nodeLocalityDelay = rackLocalityDelay =
    Math.min(15000, (long) (1.5 * jobTracker.getNextHeartbeatInterval()));
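
  For example, assuming a JobTracker heartbeat interval of 3000 ms (a hypothetical small-cluster value), both delays default to Math.min(15000, 1.5 * 3000) = 4500 ms: a job will hold out for a node-local slot for up to 4.5 s, and then for a rack-local slot for a further 4.5 s, before accepting an off-switch assignment.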

  

  In the fair scheduler, each job maintains two variables for delay scheduling: the locality level of the last MapTask it launched (lastMapLocalityLevel) and the time it has waited since it was last skipped (timeWaitedForLocalMap). The workflow is as follows (the actual work is done in FairScheduler.java's getAllowedLocalityLevel() method):

  /**
   * Get the maximum locality level at which a given job is allowed to
   * launch tasks, based on how long it has been waiting for local tasks.
   * This is used to implement the "delay scheduling" feature of the Fair
   * Scheduler for optimizing data locality.
   * If the job has no locality information (e.g. it does not use HDFS), this
   * method returns LocalityLevel.ANY, allowing tasks at any level.
   * Otherwise, the job can only launch tasks at its current locality level
   * or lower, unless it has waited at least nodeLocalityDelay or
   * rackLocalityDelay milliseconds depends on the current level. If it
   * has waited (nodeLocalityDelay + rackLocalityDelay) milliseconds,
   * it can go to any level.
   */
  protected LocalityLevel getAllowedLocalityLevel(JobInProgress job,
      long currentTime) {
    JobInfo info = infos.get(job);
    if (info == null) { // Job not in infos (shouldn't happen)
      LOG.error("getAllowedLocalityLevel called on job " + job
          + ", which does not have a JobInfo in infos");
      return LocalityLevel.ANY;
    }
    if (job.nonLocalMaps.size() > 0) { // Job doesn't have locality information
      return LocalityLevel.ANY;
    }
    // Don't wait for locality if the job's pool is starving for maps
    Pool pool = poolMgr.getPool(job);
    PoolSchedulable sched = pool.getMapSchedulable();
    long minShareTimeout = poolMgr.getMinSharePreemptionTimeout(pool.getName());
    long fairShareTimeout = poolMgr.getFairSharePreemptionTimeout();
    if (currentTime - sched.getLastTimeAtMinShare() > minShareTimeout ||
        currentTime - sched.getLastTimeAtHalfFairShare() > fairShareTimeout) {
      eventLog.log("INFO", "No delay scheduling for "
          + job.getJobID() + " because it is being starved");
      return LocalityLevel.ANY;
    }
    // In the common case, compute locality level based on time waited
    switch(info.lastMapLocalityLevel) {
    case NODE: // Last task launched was node-local
      if (info.timeWaitedForLocalMap >=
          nodeLocalityDelay + rackLocalityDelay)
        return LocalityLevel.ANY;
      else if (info.timeWaitedForLocalMap >= nodeLocalityDelay)
        return LocalityLevel.RACK;
      else
        return LocalityLevel.NODE;
    case RACK: // Last task launched was rack-local
      if (info.timeWaitedForLocalMap >= rackLocalityDelay)
        return LocalityLevel.ANY;
      else
        return LocalityLevel.RACK;
    default: // Last task was non-local; can launch anywhere
      return LocalityLevel.ANY;
    }
  }

1. If lastMapLocalityLevel is NODE:

  1) if timeWaitedForLocalMap >= nodeLocalityDelay + rackLocalityDelay, MapTasks at the off-switch level and below may be scheduled;

  2) if timeWaitedForLocalMap >= nodeLocalityDelay, MapTasks at the rack-local level and below may be scheduled;

  3) otherwise only node-local MapTasks may be scheduled.

2. If lastMapLocalityLevel is RACK:

  1) if timeWaitedForLocalMap >= rackLocalityDelay, MapTasks at the off-switch level and below may be scheduled;

  2) otherwise MapTasks at the rack-local level and below may be scheduled.

3. Otherwise, MapTasks at the off-switch level and below may be scheduled.

 

  The overall delay-scheduling workflow is as follows (the actual work is done in FairScheduler.java's assignTasks() method):

  @Override
  public synchronized List<Task> assignTasks(TaskTracker tracker)
      throws IOException {
    if (!initialized) // Don't try to assign tasks if we haven't yet started up
      return null;
    String trackerName = tracker.getTrackerName();
    eventLog.log("HEARTBEAT", trackerName);
    long currentTime = clock.getTime();

    // Compute total runnable maps and reduces, and currently running ones
    int runnableMaps = 0;
    int runningMaps = 0;
    int runnableReduces = 0;
    int runningReduces = 0;
    for (Pool pool: poolMgr.getPools()) {
      runnableMaps += pool.getMapSchedulable().getDemand();
      runningMaps += pool.getMapSchedulable().getRunningTasks();
      runnableReduces += pool.getReduceSchedulable().getDemand();
      runningReduces += pool.getReduceSchedulable().getRunningTasks();
    }

    ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
    // Compute total map/reduce slots
    // In the future we can precompute this if the Scheduler becomes a
    // listener of tracker join/leave events.
    int totalMapSlots = getTotalSlots(TaskType.MAP, clusterStatus);
    int totalReduceSlots = getTotalSlots(TaskType.REDUCE, clusterStatus);

    eventLog.log("RUNNABLE_TASKS",
        runnableMaps, runningMaps, runnableReduces, runningReduces);

    // Update time waited for local maps for jobs skipped on last heartbeat
    // Note 1
    updateLocalityWaitTimes(currentTime);

    // Check for JT safe-mode
    if (taskTrackerManager.isInSafeMode()) {
      LOG.info("JobTracker is in safe-mode, not scheduling any tasks.");
      return null;
    }

    TaskTrackerStatus tts = tracker.getStatus();

    int mapsAssigned = 0; // loop counter for map in the below while loop
    int reducesAssigned = 0; // loop counter for reduce in the below while
    int mapCapacity = maxTasksToAssign(TaskType.MAP, tts);
    int reduceCapacity = maxTasksToAssign(TaskType.REDUCE, tts);
    boolean mapRejected = false; // flag used for ending the loop
    boolean reduceRejected = false; // flag used for ending the loop

    // Keep track of which jobs were visited for map tasks and which had tasks
    // launched, so that we can later mark skipped jobs for delay scheduling
    Set<JobInProgress> visitedForMap = new HashSet<JobInProgress>();
    Set<JobInProgress> visitedForReduce = new HashSet<JobInProgress>();
    Set<JobInProgress> launchedMap = new HashSet<JobInProgress>();

    ArrayList<Task> tasks = new ArrayList<Task>();
    // Scan jobs to assign tasks until neither maps nor reduces can be assigned
    // Note 2
    while (true) {
      // Computing the ending conditions for the loop
      // Reject a task type if one of the following condition happens
      // 1. number of assigned task reaches per heatbeat limit
      // 2. number of running tasks reaches runnable tasks
      // 3. task is rejected by the LoadManager.canAssign
      if (!mapRejected) {
        if (mapsAssigned == mapCapacity ||
            runningMaps == runnableMaps ||
            !loadMgr.canAssignMap(tts, runnableMaps,
                totalMapSlots, mapsAssigned)) {
          eventLog.log("INFO", "Can't assign another MAP to " + trackerName);
          mapRejected = true;
        }
      }
      if (!reduceRejected) {
        if (reducesAssigned == reduceCapacity ||
            runningReduces == runnableReduces ||
            !loadMgr.canAssignReduce(tts, runnableReduces,
                totalReduceSlots, reducesAssigned)) {
          eventLog.log("INFO", "Can't assign another REDUCE to " + trackerName);
          reduceRejected = true;
        }
      }
      // Exit while (true) loop if
      // 1. neither maps nor reduces can be assigned
      // 2. assignMultiple is off and we already assigned one task
      if (mapRejected && reduceRejected ||
          !assignMultiple && tasks.size() > 0) {
        break; // This is the only exit of the while (true) loop
      }

      // Determine which task type to assign this time
      // First try choosing a task type which is not rejected
      TaskType taskType;
      if (mapRejected) {
        taskType = TaskType.REDUCE;
      } else if (reduceRejected) {
        taskType = TaskType.MAP;
      } else {
        // If both types are available, choose the task type with fewer running
        // tasks on the task tracker to prevent that task type from starving
        if (tts.countMapTasks() + mapsAssigned <=
            tts.countReduceTasks() + reducesAssigned) {
          taskType = TaskType.MAP;
        } else {
          taskType = TaskType.REDUCE;
        }
      }

      // Get the map or reduce schedulables and sort them by fair sharing
      List<PoolSchedulable> scheds = getPoolSchedulables(taskType);
      // Sort the schedulables by fair share
      Collections.sort(scheds, new SchedulingAlgorithms.FairShareComparator());
      boolean foundTask = false;
      // Note 3
      for (Schedulable sched: scheds) { // This loop will assign only one task
        eventLog.log("INFO", "Checking for " + taskType +
            " task in " + sched.getName());
        // Note 4
        Task task = taskType == TaskType.MAP ?
                    sched.assignTask(tts, currentTime, visitedForMap) :
                    sched.assignTask(tts, currentTime, visitedForReduce);
        if (task != null) {
          foundTask = true;
          JobInProgress job = taskTrackerManager.getJob(task.getJobID());
          eventLog.log("ASSIGN", trackerName, taskType,
              job.getJobID(), task.getTaskID());
          // Update running task counts, and the job's locality level
          if (taskType == TaskType.MAP) {
            launchedMap.add(job);
            mapsAssigned++;
            runningMaps++;
            // Note 5
            updateLastMapLocalityLevel(job, task, tts);
          } else {
            reducesAssigned++;
            runningReduces++;
          }
          // Add task to the list of assignments
          tasks.add(task);
          break; // This break makes this loop assign only one task
        } // end if(task != null)
      } // end for(Schedulable sched: scheds)

      // Reject the task type if we cannot find a task
      if (!foundTask) {
        if (taskType == TaskType.MAP) {
          mapRejected = true;
        } else {
          reduceRejected = true;
        }
      }
    } // end while (true)

    // Mark any jobs that were visited for map tasks but did not launch a task
    // as skipped on this heartbeat
    for (JobInProgress job: visitedForMap) {
      if (!launchedMap.contains(job)) {
        infos.get(job).skippedAtLastHeartbeat = true;
      }
    }

    // If no tasks were found, return null
    return tasks.isEmpty() ? null : tasks;
  }

  Note 1: updateLocalityWaitTimes(). For every job that was skipped at the last heartbeat, this adds the time elapsed since that heartbeat to its timeWaitedForLocalMap and resets its skippedAtLastHeartbeat flag to false. The code is as follows:

  /**
   * Update locality wait times for jobs that were skipped at last heartbeat.
   */
  private void updateLocalityWaitTimes(long currentTime) {
    long timeSinceLastHeartbeat =
      (lastHeartbeatTime == 0 ? 0 : currentTime - lastHeartbeatTime);
    lastHeartbeatTime = currentTime;
    for (JobInfo info: infos.values()) {
      if (info.skippedAtLastHeartbeat) {
        info.timeWaitedForLocalMap += timeSinceLastHeartbeat;
        info.skippedAtLastHeartbeat = false;
      }
    }
  }

  Note 2: the while (true) loop keeps assigning MapTasks and ReduceTasks until no more can be assigned. Inside the loop the schedulables are first sorted, and the actual MapTask assignment is then done in the inner for() loop (Schedulable has two subclasses, PoolSchedulable and JobSchedulable; each PoolSchedulable delegates to the JobSchedulables of the jobs in its pool, so the Schedulable here can essentially be treated as a job).
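
  For context, the following is a simplified sketch of how a PoolSchedulable delegates to its JobSchedulables. It is not the exact Hadoop source: the real method also checks things such as the pool's task limits and scheduling mode, and getJobSchedulables() is an assumed accessor name used here only for illustration:

// Ask each job in the pool, in fair-share order, until one can supply a task.
public Task assignTask(TaskTrackerStatus tts, long currentTime,
    Collection<JobInProgress> visited) throws IOException {
  List<JobSchedulable> jobScheds = getJobSchedulables(); // assumed accessor
  Collections.sort(jobScheds, new SchedulingAlgorithms.FairShareComparator());
  for (JobSchedulable sched : jobScheds) {
    Task task = sched.assignTask(tts, currentTime, visited);
    if (task != null) {
      return task;  // the first job that yields a task wins
    }
  }
  return null;      // no job in this pool could supply a task
}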

  Notes 3 and 4: in the for() loop, JobSchedulable's assignTask() method is called to pick a suitable MapTask or ReduceTask. When picking a MapTask, it first calls FairScheduler.getAllowedLocalityLevel() to determine the locality level at which a MapTask may be scheduled (analyzed above), and then selects a MapTask at the corresponding level based on the return value. The assignTask() code is as follows:

  @Override
  public Task assignTask(TaskTrackerStatus tts, long currentTime,
      Collection<JobInProgress> visited) throws IOException {
    if (isRunnable()) {
      visited.add(job);
      TaskTrackerManager ttm = scheduler.taskTrackerManager;
      ClusterStatus clusterStatus = ttm.getClusterStatus();
      int numTaskTrackers = clusterStatus.getTaskTrackers();

      // check with the load manager whether it is safe to
      // launch this task on this taskTracker.
      LoadManager loadMgr = scheduler.getLoadManager();
      if (!loadMgr.canLaunchTask(tts, job, taskType)) {
        return null;
      }
      if (taskType == TaskType.MAP) {
        // Determine the locality level at which this job may launch a map
        LocalityLevel localityLevel = scheduler.getAllowedLocalityLevel(
            job, currentTime);
        scheduler.getEventLog().log(
            "ALLOWED_LOC_LEVEL", job.getJobID(), localityLevel);
        switch (localityLevel) {
          case NODE:
            return job.obtainNewNodeLocalMapTask(tts, numTaskTrackers,
                ttm.getNumberOfUniqueHosts());
          case RACK:
            return job.obtainNewNodeOrRackLocalMapTask(tts, numTaskTrackers,
                ttm.getNumberOfUniqueHosts());
          default:
            return job.obtainNewMapTask(tts, numTaskTrackers,
                ttm.getNumberOfUniqueHosts());
        }
      } else {
        return job.obtainNewReduceTask(tts, numTaskTrackers,
            ttm.getNumberOfUniqueHosts());
      }
    } else {
      return null;
    }
  }

  As can be seen, this method then calls the JobInProgress method that corresponds to the allowed level in order to obtain a MapTask at that level.

  Note 5: finally, updateLastMapLocalityLevel() updates the job's bookkeeping: lastMapLocalityLevel is set to the locality level of the map task that was just launched, and timeWaitedForLocalMap is reset to 0.

  /**
   * Update a job's locality level and locality wait variables given that that
   * it has just launched a map task on a given task tracker.
   */
  private void updateLastMapLocalityLevel(JobInProgress job,
      Task mapTaskLaunched, TaskTrackerStatus tracker) {
    JobInfo info = infos.get(job);
    boolean isNodeGroupAware = conf.getBoolean(
        "net.topology.nodegroup.aware", false);
    LocalityLevel localityLevel = LocalityLevel.fromTask(
        job, mapTaskLaunched, tracker, isNodeGroupAware);
    info.lastMapLocalityLevel = localityLevel;
    info.timeWaitedForLocalMap = 0;
    eventLog.log("ASSIGNED_LOC_LEVEL", job.getJobID(), localityLevel);
  }
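
  Putting the pieces together with the hypothetical numbers from the earlier example (nodeLocalityDelay = rackLocalityDelay = 4500 ms, heartbeats every 3000 ms): suppose a job last launched a node-local map, so lastMapLocalityLevel is NODE and timeWaitedForLocalMap is 0. If the job is then skipped on two consecutive heartbeats because the reporting trackers hold no node-local split for it, updateLocalityWaitTimes() adds 3000 ms at each following heartbeat, so timeWaitedForLocalMap reaches 6000 ms by the third heartbeat. Since 6000 ms >= nodeLocalityDelay, getAllowedLocalityLevel() now returns RACK and a rack-local map may be launched, after which updateLastMapLocalityLevel() sets lastMapLocalityLevel to RACK and resets timeWaitedForLocalMap to 0.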

 

  This article is based on Hadoop 1.2.1. If you spot any errors, please point them out.

  References: 《Hadoop技术内幕:深入理解MapReduce架构设计与实现原理》, by 董西成

    https://issues.apache.org/jira/secure/attachment/12457515/fair_scheduler_design_doc.pdf

  Reposts should credit the original source: http://www.cnblogs.com/gwgyk/p/4568270.html
