While running a job today, I ran into the following errors:

    11/08/12 15:19:33 INFO mapred.JobClient:  map 99% reduce 32%
    11/08/12 15:20:59 INFO mapred.JobClient:  map 99% reduce 33%
    11/08/12 15:21:10 INFO mapred.JobClient:  map 100% reduce 33%
    11/08/12 15:21:34 INFO mapred.JobClient: Task Id : attempt_201108021504_30459_m_000368_0, Status : FAILED
    Too many fetch-failures
    11/08/12 15:22:35 WARN mapred.JobClient: Error reading task outputRead timed out
    11/08/12 15:22:36 INFO mapred.JobClient:  map 100% reduce 34%
    11/08/12 15:24:44 INFO mapred.JobClient: Task Id : attempt_201108021504_30459_m_000392_0, Status : FAILED
    Too many fetch-failures
    11/08/12 15:25:44 WARN mapred.JobClient: Error reading task outputRead timed out
    11/08/12 15:25:45 INFO mapred.JobClient:  map 100% reduce 67%
    11/08/12 15:25:56 INFO mapred.JobClient: Job complete: job_201108021504_30459
    11/08/12 15:25:56 INFO mapred.JobClient: Counters: 26
    11/08/12 15:25:56 INFO mapred.JobClient:   Job Counters
    11/08/12 15:25:56 INFO mapred.JobClient:     Launched reduce tasks=303
    11/08/12 15:25:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=8591016

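After some searching, I found the cause explained here: http://blog.csdn.net/liangliyin/article/details/6455713

The first phase after a reduce task starts is the shuffle, i.e. fetching data from the maps. Each fetch can fail because of a connect timeout, a read timeout, a checksum error, and so on. The reduce task keeps one counter per map that records how many times fetching that map's output has failed. When the failure count reaches a certain threshold, the reduce task notifies the JobTracker that fetching that map's output has failed too many times, and prints the following log: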
Failed to fetch map-output from attempt_201105261254_102769_m_001802_0 even after MAX_FETCH_RETRIES_PER_MAP retries... reporting to the JobTracker
The threshold is computed as:
max(MIN_FETCH_RETRIES_PER_MAP,
getClosestPowerOf2((this.maxBackoff * 1000 / BACKOFF_INIT) + 1));
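By default MIN_FETCH_RETRIES_PER_MAP=2, maxBackoff=300 and BACKOFF_INIT=4000. The argument is therefore 300 * 1000 / 4000 + 1 = 76, whose closest power of two is 64 = 2^6, so the default threshold is 6. It can be adjusted through the mapred.reduce.copy.backoff parameter.

Once the threshold is reached, the reduce task tells its TaskTracker through the umbilical protocol, and the TaskTracker notifies the JobTracker on its next heartbeat. When the JobTracker finds that more than 50% of the reduces have reported repeated failures fetching the output of a given map, it fails that map and reschedules it, printing the following log: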
"Too many fetch-failures for output of task: attempt_201105261254_102769_m_001802_0 ... killing it"
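
The default threshold of 6 quoted above can be checked with a small standalone sketch. The class name `FetchRetryThreshold` and the method `maxFetchRetriesPerMap` are made up for illustration. The body of `getClosestPowerOf2` is an assumption: it returns the exponent of the nearest power of two, which is the only reading consistent with the default of 6. It also assumes that mapred.reduce.copy.backoff is the value that ends up in maxBackoff (in seconds). This is not a copy of the Hadoop source.

```java
public class FetchRetryThreshold {
    // Defaults quoted in the text above.
    static final int MIN_FETCH_RETRIES_PER_MAP = 2;
    static final int BACKOFF_INIT = 4000; // milliseconds

    // Assumption: returns the exponent n such that 2^n is the power of two
    // closest to value, rounding up when value >= 1.5 * (highest power of two <= value).
    static int getClosestPowerOf2(int value) {
        if (value <= 0) {
            throw new IllegalArgumentException("Undefined for " + value);
        }
        int hob = Integer.highestOneBit(value);
        // If the bit just below the highest one bit is set, round the exponent up.
        int roundUp = ((hob >>> 1) & value) == 0 ? 0 : 1;
        return Integer.numberOfTrailingZeros(hob) + roundUp;
    }

    // Hypothetical helper: the threshold formula from the text, for a given
    // maxBackoff in seconds.
    static int maxFetchRetriesPerMap(int maxBackoffSeconds) {
        return Math.max(MIN_FETCH_RETRIES_PER_MAP,
                getClosestPowerOf2((maxBackoffSeconds * 1000 / BACKOFF_INIT) + 1));
    }

    public static void main(String[] args) {
        for (int backoff : new int[] {100, 300, 600}) {
            System.out.println("maxBackoff=" + backoff
                    + "s -> threshold=" + maxFetchRetriesPerMap(backoff));
        }
    }
}
```

Under these assumptions, doubling the backoff from 300 to 600 only raises the threshold from 6 to 7, because the threshold grows with the logarithm of maxBackoff.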