How Hadoop Runs a Job, Part 5: Task Scheduling

  Continuing from the previous post. Hadoop schedules the auxiliary tasks (job-setup, job-cleanup, and task-cleanup tasks) first, and that work is done by the JobTracker itself; the computational tasks, by contrast, are assigned by the task scheduler, TaskScheduler, whose default implementation is JobQueueTaskScheduler. The actual assignment happens in its assignTasks() method, which we will walk through section by section below.

 

public synchronized List<Task> assignTasks(TaskTracker taskTracker)
    throws IOException {
  // Check for JT safe-mode
  if (taskTrackerManager.isInSafeMode()) {
    LOG.info("JobTracker is in safe-mode, not scheduling any tasks.");
    return null;
  }

  TaskTrackerStatus taskTrackerStatus = taskTracker.getStatus();
  ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
  final int numTaskTrackers = clusterStatus.getTaskTrackers();
  final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
  final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();

  Collection<JobInProgress> jobQueue =
    jobQueueJobInProgressListener.getJobQueue();

  The method first checks whether the JobTracker is in safe mode. It then fetches this TaskTracker's status, the cluster status, the number of TaskTrackers in the cluster, and the maximum numbers of map and reduce tasks the cluster can run; finally it picks up the job queue whose jobs are to be scheduled.

 

//
// Get map + reduce counts for the current tracker.
//
final int trackerMapCapacity = taskTrackerStatus.getMaxMapSlots();
final int trackerReduceCapacity = taskTrackerStatus.getMaxReduceSlots();
final int trackerRunningMaps = taskTrackerStatus.countMapTasks();
final int trackerRunningReduces = taskTrackerStatus.countReduceTasks();

// Assigned tasks
List<Task> assignedTasks = new ArrayList<Task>();

  The first four lines fetch this tracker's map and reduce slot counts and the numbers of map and reduce tasks it is currently running; the list on the last line will hold the tasks assigned to this TaskTracker.

 

//
// Compute (running + pending) map and reduce task numbers across pool
//
int remainingReduceLoad = 0;
int remainingMapLoad = 0;
synchronized (jobQueue) {
  for (JobInProgress job : jobQueue) {
    if (job.getStatus().getRunState() == JobStatus.RUNNING) {
      remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
      if (job.scheduleReduces()) {
        remainingReduceLoad +=
          (job.desiredReduces() - job.finishedReduces());
      }
    }
  }
}

  This block computes how many map and reduce tasks in the job queue still need to run. job.desiredMaps() returns the job's total number of map tasks and job.finishedMaps() the number already completed; job.desiredReduces() and job.finishedReduces() do the same for the reduce side.

 

 

// Compute the 'load factor' for maps and reduces
double mapLoadFactor = 0.0;
if (clusterMapCapacity > 0) {
  mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
}
double reduceLoadFactor = 0.0;
if (clusterReduceCapacity > 0) {
  reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
}

   This computes the map and reduce "load factors": the ratio of map tasks still to run to the cluster's total map-slot capacity. Each TaskTracker is then held to the same fraction of its own map slots, so that no single tracker runs a disproportionate share of the remaining work; the reduce side is handled identically. A worked example with invented numbers follows.
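  For intuition, here is a standalone sketch of that arithmetic; all of the figures are made up for the example and do not come from a real cluster. If 60 of 100 cluster-wide map slots' worth of work remains, a tracker with 4 map slots is allowed to run at most ceil(0.6 * 4) = 3 maps:

// Standalone illustration of the load-factor arithmetic; all numbers invented.
public class LoadFactorDemo {
  public static void main(String[] args) {
    int remainingMapLoad = 60;     // pending + running maps across all jobs
    int clusterMapCapacity = 100;  // total map slots in the cluster
    int trackerMapCapacity = 4;    // map slots on this TaskTracker

    double mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity; // 0.6
    // The tracker may only fill the same fraction of its own slots:
    int trackerCurrentMapCapacity =
        Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity),
                 trackerMapCapacity);                                      // ceil(2.4) = 3
    System.out.println(mapLoadFactor + " -> " + trackerCurrentMapCapacity
                       + "/" + trackerMapCapacity + " slots usable");
  }
}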

 

//
// In the below steps, we allocate first map tasks (if appropriate),
// and then reduce tasks if appropriate. We go through all jobs
// in order of job arrival; jobs only get serviced if their
// predecessors are serviced, too.
//

//
// We assign tasks to the current taskTracker if the given machine
// has a workload that's less than the maximum load of that kind of
// task.
// However, if the cluster is close to getting loaded i.e. we don't
// have enough _padding_ for speculative executions etc., we only
// schedule the "highest priority" task i.e. the task from the job
// with the highest priority.
//

final int trackerCurrentMapCapacity =
  Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
           trackerMapCapacity);
int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
boolean exceededMapPadding = false;
if (availableMapSlots > 0) {
  exceededMapPadding =
    exceededPadding(true, clusterStatus, trackerMapCapacity);
}

  The first statement uses the map load factor computed above to work out how many map tasks this node should currently be running; the second computes availableMapSlots, the number of slots still free for map tasks on this tracker. If availableMapSlots is greater than zero there is room for more map tasks. Hadoop never hands out every last slot: some are held in reserve for failed tasks and speculative execution, and exceededPadding() checks whether that reserve would be violated; a sketch of the idea follows.
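  exceededPadding() itself is not reproduced in this post, so the self-contained sketch below only conveys the idea, not the real source: reserve a small fraction of the outstanding work, capped at one tracker's worth of slots, and stop assigning once running tasks plus that reserve would fill the cluster. The property name mapred.jobtracker.taskalloc.capacitypad and its 0.01 default are my recollection of the 1.x configuration, so treat them as assumptions.

// Toy sketch of the padding check; NOT the real exceededPadding() source.
public class PaddingDemo {
  // assumed default of mapred.jobtracker.taskalloc.capacitypad in Hadoop 1.x
  static final double PAD_FRACTION = 0.01;

  static boolean exceededPadding(int runningTasks, int neededTasks,
                                 int clusterCapacity, int trackerCapacity) {
    // hold back a slice of the outstanding work, at most one tracker's slots
    int padding = Math.min(trackerCapacity, (int) (neededTasks * PAD_FRACTION));
    // true once running tasks plus the reserve would fill the cluster
    return runningTasks + padding >= clusterCapacity;
  }

  public static void main(String[] args) {
    System.out.println(exceededPadding(97, 600, 100, 4)); // true: cluster nearly full
    System.out.println(exceededPadding(40, 600, 100, 4)); // false: headroom left
  }
}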

 

int numLocalMaps = 0;
int numNonLocalMaps = 0;
scheduleMaps:
for (int i = 0; i < availableMapSlots; ++i) {
  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() != JobStatus.RUNNING) {
        continue;
      }

      Task t = null;

      // Try to schedule a Map task with locality between node-local
      // and rack-local
      t =
        job.obtainNewNodeOrRackLocalMapTask(taskTrackerStatus,
            numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts());
      if (t != null) {
        assignedTasks.add(t);
        ++numLocalMaps;

        // Don't assign map tasks to the hilt!
        // Leave some free slots in the cluster for future task-failures,
        // speculative tasks etc. beyond the highest priority job
        if (exceededMapPadding) {
          break scheduleMaps;
        }

        // Try all jobs again for the next Map task
        break;
      }

      // Try to schedule a node-local or rack-local Map task
      t =
        job.obtainNewNonLocalMapTask(taskTrackerStatus, numTaskTrackers,
            taskTrackerManager.getNumberOfUniqueHosts());

      if (t != null) {
        assignedTasks.add(t);
        ++numNonLocalMaps;

        // We assign at most 1 off-switch or speculative task
        // This is to prevent TaskTrackers from stealing local-tasks
        // from other TaskTrackers.
        break scheduleMaps;
      }
    }
  }
}
int assignedMaps = assignedTasks.size();

  The code above is the map-assignment loop. obtainNewNodeOrRackLocalMapTask() assigns node-local/rack-local tasks, while obtainNewNonLocalMapTask() assigns non-local ones (the comment above the obtainNewNonLocalMapTask() call in the Hadoop source, "Try to schedule a node-local or rack-local Map task", looks wrong to me: that call schedules non-local tasks). Both ultimately delegate to findNewMapTask() to pick the task; the difference is the locality level they pass in. obtainNewNodeOrRackLocalMapTask() passes maxLevel, meaning node-local and rack-local tasks may be scheduled; obtainNewNonLocalMapTask() passes NON_LOCAL_CACHE_LEVEL, meaning only off-switch/speculative tasks may be scheduled. The highest level, anyCacheLevel, allows any kind of task: node-local, rack-local, off-switch, or speculative.
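  A small self-contained demo of how those levels interact with the Math.min(maxCacheLevel, maxLevel) clamp inside findNewMapTask(). Here maxLevel = 2 stands for the usual two-level node/rack topology, and the constant values reflect my reading of the 1.x source (NON_LOCAL_CACHE_LEVEL = -1 also appears in the doc comment below), so take them as assumptions:

// How maxCacheLevel bounds the bottom-up cache walk in findNewMapTask().
public class CacheLevelDemo {
  static final int NON_LOCAL_CACHE_LEVEL = -1; // off-switch/speculative only

  public static void main(String[] args) {
    int maxLevel = 2;                  // level 0 = node-local, level 1 = rack-local
    int anyCacheLevel = maxLevel + 1;  // any locality is acceptable

    // Number of nonRunningMapCache levels the bottom-up loop will scan:
    System.out.println(Math.min(maxLevel, maxLevel));              // 2 -> node + rack
    System.out.println(Math.min(NON_LOCAL_CACHE_LEVEL, maxLevel)); // -1 -> skip the cache
    System.out.println(Math.min(anyCacheLevel, maxLevel));         // 2 -> node + rack, then wider
  }
}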

 

/**
 * Find new map task
 * @param tts The task tracker that is asking for a task
 * @param clusterSize The number of task trackers in the cluster
 * @param numUniqueHosts The number of hosts that run task trackers
 * @param avgProgress The average progress of this kind of task in this job
 * @param maxCacheLevel The maximum topology level until which to schedule
 *                      maps.
 *                      A value of {@link #anyCacheLevel} implies any
 *                      available task (node-local, rack-local, off-switch and
 *                      speculative tasks).
 *                      A value of {@link #NON_LOCAL_CACHE_LEVEL} implies only
 *                      off-switch/speculative tasks should be scheduled.
 * @return the index in tasks of the selected task (or -1 for no task)
 */
private synchronized int findNewMapTask(final TaskTrackerStatus tts,
                                        final int clusterSize,
                                        final int numUniqueHosts,
                                        final int maxCacheLevel,
                                        final double avgProgress) {
  if (numMapTasks == 0) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("No maps to schedule for " + profile.getJobID());
    }
    return -1;
  }

  String taskTracker = tts.getTrackerName();
  TaskInProgress tip = null;

  //
  // Update the last-known clusterSize
  //
  this.clusterSize = clusterSize;

  if (!shouldRunOnTaskTracker(taskTracker)) {
    return -1;
  }

  // Check to ensure this TaskTracker has enough resources to
  // run tasks from this job
  long outSize = resourceEstimator.getEstimatedMapOutputSize();
  long availSpace = tts.getResourceStatus().getAvailableSpace();
  if (availSpace < outSize) {
    LOG.warn("No room for map task. Node " + tts.getHost() +
             " has " + availSpace +
             " bytes free; but we expect map to take " + outSize);

    return -1; //see if a different TIP might work better.
  }


  // When scheduling a map task:
  //  0) Schedule a failed task without considering locality
  //  1) Schedule non-running tasks
  //  2) Schedule speculative tasks
  //  3) Schedule tasks with no location information

  // First a look up is done on the non-running cache and on a miss, a look
  // up is done on the running cache. The order for lookup within the cache:
  //   1. from local node to root [bottom up]
  //   2. breadth wise for all the parent nodes at max level
  // We fall to linear scan of the list ((3) above) if we have misses in the
  // above caches

  // 0) Schedule the task with the most failures, unless failure was on this
  //    machine
  tip = findTaskFromList(failedMaps, tts, numUniqueHosts, false);
  if (tip != null) {
    // Add to the running list
    scheduleMap(tip);
    LOG.info("Choosing a failed task " + tip.getTIPId());
    return tip.getIdWithinJob();
  }

  Node node = jobtracker.getNode(tts.getHost());

  //
  // 1) Non-running TIP :
  //

  // 1. check from local node to the root [bottom up cache lookup]
  //    i.e if the cache is available and the host has been resolved
  //    (node!=null)
  if (node != null) {
    Node key = node;
    int level = 0;
    // maxCacheLevel might be greater than this.maxLevel if findNewMapTask is
    // called to schedule any task (local, rack-local, off-switch or
    // speculative), or it might be NON_LOCAL_CACHE_LEVEL (i.e. -1) if
    // findNewMapTask is to only schedule off-switch/speculative tasks
    int maxLevelToSchedule = Math.min(maxCacheLevel, maxLevel);
    for (level = 0; level < maxLevelToSchedule; ++level) {
      List<TaskInProgress> cacheForLevel = nonRunningMapCache.get(key);
      if (cacheForLevel != null) {
        tip = findTaskFromList(cacheForLevel, tts,
            numUniqueHosts, level == 0);
        if (tip != null) {
          // Add to running cache
          scheduleMap(tip);

          // remove the cache if its empty
          if (cacheForLevel.size() == 0) {
            nonRunningMapCache.remove(key);
          }

          return tip.getIdWithinJob();
        }
      }
      key = key.getParent();
    }

    // Check if we need to only schedule a local task (node-local/rack-local)
    if (level == maxCacheLevel) {
      return -1;
    }
  }

  //2. Search breadth-wise across parents at max level for non-running
  //   TIP if
  //     - cache exists and there is a cache miss
  //     - node information for the tracker is missing (tracker's topology
  //       info not obtained yet)

  // collection of node at max level in the cache structure
  Collection<Node> nodesAtMaxLevel = jobtracker.getNodesAtMaxLevel();

  // get the node parent at max level
  Node nodeParentAtMaxLevel =
    (node == null) ? null : JobTracker.getParentNode(node, maxLevel - 1);

  for (Node parent : nodesAtMaxLevel) {

    // skip the parent that has already been scanned
    if (parent == nodeParentAtMaxLevel) {
      continue;
    }

    List<TaskInProgress> cache = nonRunningMapCache.get(parent);
    if (cache != null) {
      tip = findTaskFromList(cache, tts, numUniqueHosts, false);
      if (tip != null) {
        // Add to the running cache
        scheduleMap(tip);

        // remove the cache if empty
        if (cache.size() == 0) {
          nonRunningMapCache.remove(parent);
        }
        LOG.info("Choosing a non-local task " + tip.getTIPId());
        return tip.getIdWithinJob();
      }
    }
  }

  // 3. Search non-local tips for a new task
  tip = findTaskFromList(nonLocalMaps, tts, numUniqueHosts, false);
  if (tip != null) {
    // Add to the running list
    scheduleMap(tip);

    LOG.info("Choosing a non-local task " + tip.getTIPId());
    return tip.getIdWithinJob();
  }

  //
  // 2) Running TIP :
  //

  if (hasSpeculativeMaps) {
    long currentTime = jobtracker.getClock().getTime();

    // 1. Check bottom up for speculative tasks from the running cache
    if (node != null) {
      Node key = node;
      for (int level = 0; level < maxLevel; ++level) {
        Set<TaskInProgress> cacheForLevel = runningMapCache.get(key);
        if (cacheForLevel != null) {
          tip = findSpeculativeTask(cacheForLevel, tts,
                                    avgProgress, currentTime, level == 0);
          if (tip != null) {
            if (cacheForLevel.size() == 0) {
              runningMapCache.remove(key);
            }
            return tip.getIdWithinJob();
          }
        }
        key = key.getParent();
      }
    }

    // 2. Check breadth-wise for speculative tasks

    for (Node parent : nodesAtMaxLevel) {
      // ignore the parent which is already scanned
      if (parent == nodeParentAtMaxLevel) {
        continue;
      }

      Set<TaskInProgress> cache = runningMapCache.get(parent);
      if (cache != null) {
        tip = findSpeculativeTask(cache, tts, avgProgress,
                                  currentTime, false);
        if (tip != null) {
          // remove empty cache entries
          if (cache.size() == 0) {
            runningMapCache.remove(parent);
          }
          LOG.info("Choosing a non-local task " + tip.getTIPId()
                   + " for speculation");
          return tip.getIdWithinJob();
        }
      }
    }

    // 3. Check non-local tips for speculation
    tip = findSpeculativeTask(nonLocalRunningMaps, tts, avgProgress,
                              currentTime, false);
    if (tip != null) {
      LOG.info("Choosing a non-local task " + tip.getTIPId()
               + " for speculation");
      return tip.getIdWithinJob();
    }
  }

  return -1;
}

  A digression on findNewMapTask(), since it is the method that actually picks the task. It tries candidates in the following priority order (condensed into a runnable outline after the list):

1) a failed task from failedMaps;

2) a task with locality from nonRunningMapCache, preferring node-local, then rack-local, then off-switch (how locality is achieved is covered further below);

3) a task from nonLocalMaps;

4) a running task from runningMapCache, for which a speculative (backup) attempt is launched;

5) a running task from nonLocalRunningMaps, likewise for speculation.

Finally, if findNewMapTask() returns -1, no suitable map task was found; otherwise the return value is the task's index in the maps[] array of JobInProgress.
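  The order can be condensed into the outline below. The helper methods are stand-ins that always miss, so only the fall-through control flow is real here; none of these names exist in Hadoop:

// Condensed outline of findNewMapTask()'s fall-through order; helper bodies are stubs.
public class FindNewMapTaskOutline {
  static Integer fromFailedMaps()          { return null; } // 1) failed, locality ignored
  static Integer fromNonRunningMapCache()  { return null; } // 2) node -> rack, then breadth-wise
  static Integer fromNonLocalMaps()        { return null; } // 3) splits with no location info
  static Integer speculativeFromCaches()   { return null; } // 4) runningMapCache, same order
  static Integer speculativeFromNonLocal() { return null; } // 5) nonLocalRunningMaps

  public static void main(String[] args) {
    Integer idx = fromFailedMaps();
    if (idx == null) idx = fromNonRunningMapCache();
    if (idx == null) idx = fromNonLocalMaps();
    if (idx == null) idx = speculativeFromCaches();
    if (idx == null) idx = speculativeFromNonLocal();
    System.out.println(idx == null ? -1 : idx); // -1 means no suitable map task
  }
}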

 

 

//
// Same thing, but for reduce tasks
// However we _never_ assign more than 1 reduce task per heartbeat
//
final int trackerCurrentReduceCapacity =
  Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
           trackerReduceCapacity);
final int availableReduceSlots =
  Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
boolean exceededReducePadding = false;
if (availableReduceSlots > 0) {
  exceededReducePadding = exceededPadding(false, clusterStatus,
                                          trackerReduceCapacity);

   Symmetrically, this part computes whether enough slots are being held back for failed and speculative reduce tasks. Note the Math.min(..., 1): no matter how much room the load factor leaves, at most one reduce task is assigned per heartbeat.
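  The cap is easy to miss inside the Math.min above, so here is that arithmetic as a standalone demo with invented numbers:

// The one-reduce-per-heartbeat cap in isolation; numbers are made up.
public class ReduceSlotDemo {
  public static void main(String[] args) {
    int trackerCurrentReduceCapacity = 4; // allowed by the load factor
    int trackerRunningReduces = 1;        // already running on this tracker
    int availableReduceSlots =
        Math.min(trackerCurrentReduceCapacity - trackerRunningReduces, 1);
    System.out.println(availableReduceSlots); // 1, even though 3 slots are free
  }
}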

 

  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() != JobStatus.RUNNING ||
          job.numReduceTasks == 0) {
        continue;
      }

      Task t =
        job.obtainNewReduceTask(taskTrackerStatus, numTaskTrackers,
                                taskTrackerManager.getNumberOfUniqueHosts()
                                );
      if (t != null) {
        assignedTasks.add(t);
        break;
      }

      // Don't assign reduce tasks to the hilt!
      // Leave some free slots in the cluster for future task-failures,
      // speculative tasks etc. beyond the highest priority job
      if (exceededReducePadding) {
        break;
      }
    }
  }
} // closes if (availableReduceSlots > 0)

  This part assigns reduce tasks. Notice that, unlike the doubly nested loop used for maps, a single loop suffices here, because only one reduce task is assigned at a time. obtainNewReduceTask() delegates to findNewReduceTask(), whose selection priority is:

1) a task from nonRunningReduces;

2) a task from runningReduces, for which a speculative attempt is launched.

Finally, if findNewReduceTask() returns -1, no suitable reduce task was found; otherwise the return value is the task's index in the reduces[] array of JobInProgress.
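  findNewReduceTask() itself is not reproduced in this post; the stub outline below only mirrors the two-step order just described (reduces have no input locality, so there is no cache walk). The helper names are inventions for the sketch:

// Condensed outline of the reduce selection order; helper bodies are stubs.
public class FindNewReduceTaskOutline {
  static Integer fromNonRunningReduces()  { return null; } // 1) waiting reduces
  static Integer speculativeFromRunning() { return null; } // 2) backup attempt

  public static void main(String[] args) {
    Integer idx = fromNonRunningReduces();
    if (idx == null) idx = speculativeFromRunning();
    System.out.println(idx == null ? -1 : idx); // -1 means no suitable reduce task
  }
}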

 

if (LOG.isDebugEnabled()) {
  LOG.debug("Task assignments for " + taskTrackerStatus.getTrackerName() + " --> " +
            "[" + mapLoadFactor + ", " + trackerMapCapacity + ", " +
            trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" +
            (trackerCurrentMapCapacity - trackerRunningMaps) + ", " +
            assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps +
            ")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " +
            trackerCurrentReduceCapacity + "," + trackerRunningReduces +
            "] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) +
            ", " + (assignedTasks.size() - assignedMaps) + "]");
}

return assignedTasks;

  Finally, the method returns the list of tasks assigned to this TaskTracker.

 

  A few words on the JobInProgress data structures that matter for task assignment:

1) Map<Node, List<TaskInProgress>> nonRunningMapCache: maps each Node to the TIPs waiting to run on it; derived directly from the job's InputFormat (split locations)
2) Map<Node, Set<TaskInProgress>> runningMapCache: maps each Node to the TIPs running on it; a TIP is added here once it gets a scheduling opportunity
3) final List<TaskInProgress> nonLocalMaps: non-local TIPs (no input data, i.e. an empty InputSplit) that have not yet run
4) final SortedSet<TaskInProgress> failedMaps: TIPs sorted by the number of failed task attempts
5) Set<TaskInProgress> nonLocalRunningMaps: non-local TIPs that are currently running
6) Set<TaskInProgress> nonRunningReduces: reduces waiting to run
7) Set<TaskInProgress> runningReduces: reduces currently running

 

  How map-task locality is implemented:

  The nonRunningMapCache structure in JobInProgress is what embodies locality: for each node it records the set of map tasks (TaskInProgress) waiting to run whose input lives there. It is built by createCache() in JobInProgress:

private Map<Node, List<TaskInProgress>> createCache(
                                 TaskSplitMetaInfo[] splits, int maxLevel)
                                 throws UnknownHostException {
  Map<Node, List<TaskInProgress>> cache =
    new IdentityHashMap<Node, List<TaskInProgress>>(maxLevel);

  Set<String> uniqueHosts = new TreeSet<String>();
  for (int i = 0; i < splits.length; i++) {
    String[] splitLocations = splits[i].getLocations();
    if (splitLocations == null || splitLocations.length == 0) {
      nonLocalMaps.add(maps[i]);
      continue;
    }

    for (String host: splitLocations) {
      Node node = jobtracker.resolveAndAddToTopology(host);
      uniqueHosts.add(host);
      LOG.info("tip:" + maps[i].getTIPId() + " has split on node:" + node);
      for (int j = 0; j < maxLevel; j++) {
        List<TaskInProgress> hostMaps = cache.get(node);
        if (hostMaps == null) {
          hostMaps = new ArrayList<TaskInProgress>();
          cache.put(node, hostMaps);
          hostMaps.add(maps[i]);
        }
        //check whether the hostMaps already contains an entry for a TIP
        //This will be true for nodes that are racks and multiple nodes in
        //the rack contain the input for a tip. Note that if it already
        //exists in the hostMaps, it must be the last element there since
        //we process one TIP at a time sequentially in the split-size order
        if (hostMaps.get(hostMaps.size() - 1) != maps[i]) {
          hostMaps.add(maps[i]);
        }
        node = node.getParent();
      }
    }
  }

  // Calibrate the localityWaitFactor - Do not override user intent!
  if (localityWaitFactor == DEFAULT_LOCALITY_WAIT_FACTOR) {
    int jobNodes = uniqueHosts.size();
    int clusterNodes = jobtracker.getNumberOfUniqueHosts();

    if (clusterNodes > 0) {
      localityWaitFactor =
        Math.min((float)jobNodes/clusterNodes, localityWaitFactor);
    }
    LOG.info(jobId + " LOCALITY_WAIT_FACTOR=" + localityWaitFactor);
  }

  return cache;
}

  This method walks each split's locations and registers the corresponding map task (TaskInProgress) under every node that holds the split, and under each of that node's ancestors up to maxLevel. When an unstarted map task is needed later, locality comes for free: the scheduler simply looks up the entry for the requesting node. A toy illustration follows.
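  This is a toy, non-Hadoop illustration of why the lookup gives locality for free: a task whose split lives on host h1 in rack r1 is registered under both the host and the rack, so a tracker on a different host in the same rack still finds it one level up. The topology paths and the TIP id are invented for the example:

import java.util.*;

// Toy two-level locality cache in the spirit of createCache(); not Hadoop code.
public class LocalityCacheDemo {
  public static void main(String[] args) {
    Map<String, List<String>> cache = new HashMap<>();
    String tip = "task_201411121530_0001_m_000000"; // hypothetical TIP id

    // Register the TIP under its host (level 0) and the host's rack (level 1).
    for (String node : new String[] {"/r1/h1", "/r1"}) {
      cache.computeIfAbsent(node, k -> new ArrayList<>()).add(tip);
    }

    // A tracker on h2 in the same rack: miss at level 0, hit at level 1.
    System.out.println(cache.get("/r1/h2")); // null -> not node-local
    System.out.println(cache.get("/r1"));    // [task_..._m_000000] -> rack-local
  }
}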

 

 

   This post is based on Hadoop 1.2.1.

   If you spot any mistakes, corrections are welcome.

   References: 《Hadoop技术内幕 深入理解MapReduce架构设计与实现原理》, Dong Xicheng (董西成)

         http://www.cnblogs.com/lxf20061900/p/3775963.html

   If you repost this article, please credit the source: http://www.cnblogs.com/gwgyk/p/4085627.html

 
