Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used

I ran into this problem while running a Spark job. I first set --executor-memory to 10G, then raised it to 20G and then 30G, but the same error kept coming back.

1. One solution

Most advice online says to increase spark.yarn.executor.memoryOverhead. I tried 2048 first, then 4096, and finally raised it all the way to 15G (while lowering executor-memory to 20G), after which the error stopped.

But it kept bothering me: why does this work?

One thing is certain: increasing spark.yarn.executor.memoryOverhead does help.

The spark.yarn.XXX.memoryOverhead properties determine how much extra off-heap memory is requested from YARN for each executor, driver, or AM, on top of the JVM heap. The default is max(384, 0.07 * spark.executor.memory), in MB.
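
For illustration only, here is a spark-submit sketch of the kind of settings described above, assuming a YARN cluster in cluster mode; the memory numbers simply echo my experiments rather than a recommendation, and your_app.jar is a placeholder:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  your_app.jar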

2. Another solution

Looking at another blog post describing the same problem: "We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high. We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected."

In other words, the executors' physical memory usage was actually quite low while their virtual memory usage was very high; setting yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml made the problem go away.

Since on CentOS/RHEL 6 there is aggressive allocation of virtual memory due to OS behavior, you should disable the virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.

That is, this is really an OS-level issue: the operating system is aggressive when allocating virtual memory, so you need to either turn off the virtual memory check or raise yarn.nodemanager.vmem-pmem-ratio.
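
For reference, a minimal yarn-site.xml sketch of the two NodeManager settings just mentioned (applied on each NodeManager and followed by a restart); the ratio value of 4 is only an illustrative guess, and normally you would change just one of the two:

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>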

3. The cause

From another blog post:

Virtual/physical memory checker

NodeManager can monitor the memory usage (virtual and physical) of each container. If a container's virtual memory exceeds "yarn.nodemanager.vmem-pmem-ratio" times "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", then the container will be killed if "yarn.nodemanager.vmem-check-enabled" is true.

If its physical memory exceeds "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", the container will be killed if "yarn.nodemanager.pmem-check-enabled" is true.

These parameters can be set in yarn-site.xml on each NM node to override the default behavior.
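
As a quick sanity check of that formula: with the default yarn.nodemanager.vmem-pmem-ratio of 2.1 and a 1 GB container (mapreduce.map.memory.mb = 1024), the virtual memory limit works out to 1 GB * 2.1 = 2.1 GB, which matches the limit shown in the first sample error below.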

This is a sample error for a container killed by virtual memory checker:

Current usage: 347.3 MB of 1 GB physical memory used; 
2.2 GB of 2.1 GB virtual memory used. Killing container.

And this is a sample error for physical memory checker:

Current usage: 2.1gb of 2.0gb physical memory used; 
1.1gb of 3.15gb virtual memory used. Killing container.

As of Hadoop 2.5.1 in MapR 4.1.0, the virtual memory checker is disabled while the physical memory checker is enabled by default.

Since on CentOS/RHEL 6 there is aggressive allocation of virtual memory due to OS behavior, you should disable the virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.

If the above errors occur, it is also possible that the MapReduce job has a memory leak or that the memory for each container is simply not enough. Try to check the application logic and also tune the container memory request: "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb".

4. Summary

1. The cause of this problem lies in OS-level virtual memory allocation: not much physical memory is actually used, but the virtual memory check reports an overrun and the container is killed. It can therefore be solved by disabling the virtual memory check: yarn.nodemanager.vmem-check-enabled=false.

2. Increasing spark.yarn.executor.memoryOverhead can also fix it, though it is more of a workaround.

3. My job failed in the groupByKey stage; after calling repartition(2000) (the default number of shuffle partitions is 200), the error no longer appeared. So using more partitions can also resolve this error.

4. groupByKey performs far worse than reduceByKey, so consider replacing groupByKey with reduceByKey (see the sketch after this list).
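
To make points 3 and 4 concrete, here is a small Scala sketch; the RDD contents, names, and the 2000-partition figure are illustrative only, not my actual job:

import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupVsReduce"))

    // Hypothetical (key, value) pairs; in the real job these come from the application's data.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey pulls every value of a key onto a single executor before aggregating,
    // which is where my job was failing; more (smaller) partitions ease the per-task pressure.
    val viaGroup = pairs
      .repartition(2000)
      .groupByKey()
      .mapValues(_.sum)

    // reduceByKey combines values map-side before the shuffle, so far less data is
    // moved across the network and held in executor memory.
    val viaReduce = pairs.reduceByKey(_ + _)

    viaGroup.count()
    viaReduce.count()
    sc.stop()
  }
}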

 

 
