Over Ten Thousand Map Tasks FAILED at Runtime with ApplicationMaster Connection Timeouts

#Common MapReduce failures #Big data #Real production cases #MapReduce #Batch computing #Offline workloads #Notes #Lessons learned

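Note: this post summarizes how to handle a common MapReduce failure case, based on my own experience. Writing these up takes effort, so please follow and bookmark if it helps. Comments are welcome.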

More topics (see): MapReduce Compute Engine in Depth: Project Optimization (Guide)


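Symptoms

The MapReduce job launches tens of thousands of map tasks concurrently. Roughly ten thousand of them fail, and the map failures eventually cause the whole job to fail.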


Cause

The map task logs show that the failing maps keep running into connection timeouts:

2025-02-22 02:31:56,095 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 0 time(s); maxRetries=15
2025-02-22 02:32:06,110 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 1 time(s); maxRetries=15
2025-02-22 02:32:16,122 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 2 time(s); maxRetries=15
2025-02-22 02:32:26,134 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 3 time(s); maxRetries=15
2025-02-22 02:32:36,148 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 4 time(s); maxRetries=15
2025-02-22 02:32:46,167 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 5 time(s); maxRetries=15
2025-02-22 02:32:56,182 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 6 time(s); maxRetries=15
2025-02-22 02:33:06,198 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 7 time(s); maxRetries=15
2025-02-22 02:33:16,208 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 8 time(s); maxRetries=15
2025-02-22 02:33:26,214 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 9 time(s); maxRetries=15
2025-02-22 02:33:36,226 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 10 time(s); maxRetries=15
2025-02-22 02:33:46,235 INFO [communication thread] org.apache.hadoop.ipc.Client: Retrying connect to server: ihadoop-71/10.XX.XX.XX:25102. Already tried 11 time(s); 
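When thousands of task logs look like this, it helps to summarize them instead of reading them one by one. Below is a minimal Python sketch, assuming the log lines follow exactly the format shown above; the function name `summarize_retries` is my own and not part of Hadoop. It counts the retry lines per server and measures how long the task spent retrying.

```python
import re
from datetime import datetime

# Matches the Hadoop IPC client retry lines in the format shown above.
RETRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO \[[^\]]+\] "
    r"org\.apache\.hadoop\.ipc\.Client: Retrying connect to server: "
    r"(?P<server>\S+)\. Already tried (?P<tries>\d+) time\(s\)"
)


def summarize_retries(lines):
    """Return {server: (retry_line_count, seconds from first to last retry)}."""
    seen = {}
    for line in lines:
        m = RETRY_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not a retry line
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        first, _, count = seen.get(m["server"], (ts, ts, 0))
        seen[m["server"]] = (first, ts, count + 1)
    return {
        server: (count, (last - first).total_seconds())
        for server, (first, last, count) in seen.items()
    }
```

Feeding it the lines of one task log returns one entry per ApplicationMaster address, which makes it easy to see whether every failed map was stuck on the same host and port.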

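Raising the task timeout parameter, mapreduce.task.timeout, to 20 minutes (the value is given in milliseconds, so 1200000) reduced the number of failed tasks to about 6,000, but did not eliminate the failures.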

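The root cause is the sheer number of map tasks. With more than ten thousand maps running, every map's heartbeat connection to the ApplicationMaster adds load, and together they overwhelm it. The ApplicationMaster already had plenty of memory, but it was allocated only 1 vcore. After the vcore count was changed from 1 to 32, the job ran successfully.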
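To get a feel for the pressure on the ApplicationMaster, here is a rough back-of-the-envelope sketch in Python. It assumes each running map sends one status update per reporting interval. The function name is my own, and the figures in the usage note are illustrative assumptions, not measurements from this job.

```python
def am_rpc_rate(concurrent_tasks, report_interval_s):
    """Approximate status RPCs per second arriving at the ApplicationMaster.

    Assumes every running task reports once per interval (a simplification).
    """
    if report_interval_s <= 0:
        raise ValueError("report_interval_s must be positive")
    return concurrent_tasks / report_interval_s
```

For example, with a hypothetical 30,000 concurrent maps and an assumed 3-second reporting interval, `am_rpc_rate(30000, 3.0)` gives 10000.0 calls per second, all landing on a single ApplicationMaster process that has been given one vcore.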

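Solution

Increase the number of vcores for the ApplicationMaster. The parameter is yarn.app.mapreduce.am.resource.cpu-vcores, and it is a client-side parameter, so it is set where the job is submitted.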
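Since this is a client-side setting, one common way to apply it is through `-D key=value` options at submit time. This only works if the job driver parses Hadoop generic options (for example via ToolRunner), which is an assumption about your job. The small Python helper below is a sketch with a name of my own choosing; it just builds the option list and converts the 20-minute timeout mentioned earlier into milliseconds.

```python
def hadoop_generic_options(props):
    """Turn a dict of Hadoop properties into ['-D', 'key=value', ...] arguments."""
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args


MINUTE_MS = 60 * 1000  # mapreduce.task.timeout is expressed in milliseconds

job_props = {
    "yarn.app.mapreduce.am.resource.cpu-vcores": 32,  # value used in this case
    "mapreduce.task.timeout": 20 * MINUTE_MS,         # 20 minutes
}

print(" ".join(hadoop_generic_options(job_props)))
# prints: -D yarn.app.mapreduce.am.resource.cpu-vcores=32 -D mapreduce.task.timeout=1200000
```

The printed options go between the main class and the job's own arguments in the submit command. The same properties can also be set in the client's configuration instead.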


Closing

Thanks, everyone. @500佰
