Hive Tez job failures

Recently, Tez jobs on our cluster have been failing frequently. The error output is below:

Error log

Map 1: 555(+41)/596 Reducer 2: 0(+0,-2)/1   
15/09/23 14:50:35 INFO SessionState: Map 1: 555(+41)/596    Reducer 2: 0(+0,-2)/1   
Map 1: 555(+41)/596 Reducer 2: 0(+1,-2)/1   
15/09/23 14:50:37 INFO SessionState: Map 1: 555(+41)/596    Reducer 2: 0(+1,-2)/1   
Map 1: 555(+41)/596 Reducer 2: 0(+1,-3)/1   
15/09/23 14:50:38 INFO SessionState: Map 1: 555(+41)/596    Reducer 2: 0(+1,-3)/1   
Map 1: 555(+41)/596 Reducer 2: 0(+1,-3)/1   
15/09/23 14:50:41 INFO SessionState: Map 1: 555(+41)/596    Reducer 2: 0(+1,-3)/1   
Map 1: 555(+0)/596  Reducer 2: 0(+0,-4)/1   
15/09/23 14:50:44 INFO SessionState: Map 1: 555(+0)/596 Reducer 2: 0(+0,-4)/1   
Status: Failed
15/09/23 14:50:45 ERROR SessionState: Status: Failed
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1442391298043_123239_1_01, diagnostics=[Task failed, taskId=task_1442391298043_123239_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1442391298043_123239_01_008650 finished with diagnostics set to [Container preempted internally]], TaskAttempt 1 failed, info=[Container container_1442391298043_123239_01_008771 finished with diagnostics set to [Container preempted internally]], TaskAttempt 2 failed, info=[Container container_1442391298043_123239_01_009010 finished with diagnostics set to [Container preempted internally]], TaskAttempt 3 failed, info=[Container container_1442391298043_123239_01_009723 finished with diagnostics set to [Container preempted internally]]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1442391298043_123239_1_01 [Reducer 2] killed/failed due to:null]
15/09/23 14:50:45 ERROR SessionState: Vertex failed, vertexName=Reducer 2, vertexId=vertex_1442391298043_123239_1_01, diagnostics=[Task failed, taskId=task_1442391298043_123239_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1442391298043_123239_01_008650 finished with diagnostics set to [Container preempted internally]], TaskAttempt 1 failed, info=[Container container_1442391298043_123239_01_008771 finished with diagnostics set to [Container preempted internally]], TaskAttempt 2 failed, info=[Container container_1442391298043_123239_01_009010 finished with diagnostics set to [Container preempted internally]], TaskAttempt 3 failed, info=[Container container_1442391298043_123239_01_009723 finished with diagnostics set to [Container preempted internally]]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1442391298043_123239_1_01 [Reducer 2] killed/failed due to:null]
Vertex killed, vertexName=Map 1, vertexId=vertex_1442391298043_123239_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0, Vertex vertex_1442391298043_123239_1_00 [Map 1] killed/failed due to:null]
15/09/23 14:50:45 ERROR SessionState: Vertex killed, vertexName=Map 1, vertexId=vertex_1442391298043_123239_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0, Vertex vertex_1442391298043_123239_1_00 [Map 1] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
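The progress lines in the log appear to follow the pattern completed(+running,-failed attempts)/total. Reducer 2 has a single task, and its failed count climbs from -2 to -4 before the job dies. Below is a minimal Python sketch that extracts those counters from one progress line. The format is inferred from these lines only, and the regex and the function name `parse_progress` are hypothetical helpers, not part of Hive.

```python
import re

# Assumed Hive-on-Tez console progress format, inferred from the log above:
#   "<vertex>: <completed>(+<running>[,-<failed attempts>])/<total>"
PROGRESS_RE = re.compile(
    r"(?P<vertex>(?:Map|Reducer) \d+): (?P<done>\d+)"
    r"\(\+(?P<running>\d+)(?:,-(?P<failed>\d+))?\)/(?P<total>\d+)"
)


def parse_progress(line):
    """Return {vertex: (completed, running, failed, total)} for one progress line."""
    result = {}
    for m in PROGRESS_RE.finditer(line):
        # The ",-N" part is only printed once some attempts have failed.
        failed = int(m.group("failed")) if m.group("failed") else 0
        result[m.group("vertex")] = (
            int(m.group("done")),
            int(m.group("running")),
            failed,
            int(m.group("total")),
        )
    return result


print(parse_progress("Map 1: 555(+0)/596  Reducer 2: 0(+0,-4)/1"))
# {'Map 1': (555, 0, 0, 596), 'Reducer 2': (0, 0, 4, 1)}
```

Because `finditer` scans the whole line, the timestamped `SessionState` copies of the same lines parse identically.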

Analysis:

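Task task_1442391298043_123239_1_01_000000 failed 4 times. Every attempt failed for the same reason: its container was preempted by higher-priority tasks ("Container preempted internally"). The default maximum number of failed attempts per task is 4, so the fourth preemption failed the task. That failed Reducer 2, Map 1 was killed as a result, and the whole DAG failed. The busier the cluster, the more likely this problem becomes.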
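This diagnosis can be checked mechanically against the vertex diagnostics string. Below is a minimal Python sketch. It assumes the `TaskAttempt N failed, info=[Container ... finished with diagnostics set to [...]]` shape seen in this log. `preempted_attempts` and `DEFAULT_MAX_FAILED_ATTEMPTS` are hypothetical names, and the value 4 is the default stated in the analysis.

```python
import re

# Default of tez.am.task.max.failed.attempts, as stated in the analysis.
DEFAULT_MAX_FAILED_ATTEMPTS = 4

# Assumed shape of one attempt entry inside the Tez vertex diagnostics.
ATTEMPT_RE = re.compile(
    r"TaskAttempt (\d+) failed, info=\[Container (\S+) "
    r"finished with diagnostics set to \[([^\]]*)\]"
)


def preempted_attempts(diagnostics):
    """Return the attempt numbers whose container was reported as preempted."""
    return [
        int(num)
        for num, _container, reason in ATTEMPT_RE.findall(diagnostics)
        if "preempted" in reason
    ]


# Rebuild the four attempt entries from the log for this example.
containers = ["008650", "008771", "009010", "009723"]
diag = ", ".join(
    f"TaskAttempt {i} failed, info=[Container container_1442391298043_123239_01_{c} "
    "finished with diagnostics set to [Container preempted internally]]"
    for i, c in enumerate(containers)
)

attempts = preempted_attempts(diag)
print(attempts)  # [0, 1, 2, 3]
print(len(attempts) >= DEFAULT_MAX_FAILED_ATTEMPTS)  # True
```

If every failed attempt shows up as preempted and the count reaches the limit, the failure comes from preemption rather than from the query itself.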

Solution:

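Raise the defaults:

tez.am.task.max.failed.attempts=10
tez.am.max.app.attempts=5

The first setting lets each task fail up to 10 times before the task (and with it the DAG) is marked as failed. The second sets the maximum number of attempts for the Tez ApplicationMaster itself.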
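The same two overrides can be written either as `<property>` entries in tez-site.xml or as Hive `set` statements. Below is a small Python sketch that renders both forms from one dict. `OVERRIDES`, `to_tez_site` and `to_hive_set` are hypothetical helpers. One caveat is an assumption worth checking on your Hive version: `tez.am.*` settings are read when the Tez AM starts, so a `set` issued inside an already-running Tez session may not take effect.

```python
import xml.etree.ElementTree as ET

# Values from the post; note the key is "attempts", not "attemps".
OVERRIDES = {
    "tez.am.task.max.failed.attempts": "10",
    "tez.am.max.app.attempts": "5",
}


def to_tez_site(overrides):
    """Render overrides as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def to_hive_set(overrides):
    """Render overrides as Hive CLI `set` statements, one per line."""
    return "\n".join(f"set {name}={value};" for name, value in overrides.items())


print(to_hive_set(OVERRIDES))
print(to_tez_site(OVERRIDES))
```

Keeping the values in one dict avoids the kind of key typo that would make the override silently ineffective.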
