Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used

I ran into this error while running a Spark job. I first set --executor-memory to 10G, then raised it to 20G and 30G, but the same error kept appearing.





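The spark.yarn.XXX.memoryOverhead property (where XXX is executor, driver, or am) sets the amount of extra off-heap memory, in MB, requested from YARN for each executor, driver, or AM. Its default value is max(384, 0.07 * spark.executor.memory).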
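To make the formula concrete, here is a minimal sketch of the arithmetic in Python. The function names are hypothetical, the truncation to whole MB is an assumption, and the 0.07 factor is simply the value quoted above (it may differ in other Spark versions). It shows that raising --executor-memory alone leaves the overhead at a fixed small fraction of the heap, and that the container YARN sizes is the heap plus the overhead.

```python
# Values taken from the formula quoted above; both are in MB.
MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.07  # as quoted in this post; may differ by Spark version


def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    """Default overhead: max(384, 0.07 * executor memory), truncated to whole MB."""
    return max(MIN_OVERHEAD_MB, int(OVERHEAD_FACTOR * executor_memory_mb))


def container_request_mb(executor_memory_mb: int, overhead_mb=None) -> int:
    """Memory requested from YARN per executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = default_memory_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


if __name__ == "__main__":
    for gb in (10, 20, 30):
        heap_mb = gb * 1024
        print(gb, "GB heap ->",
              default_memory_overhead_mb(heap_mb), "MB overhead,",
              container_request_mb(heap_mb), "MB requested")
```

If the off-heap usage of the job is larger than this default, one option is to set the overhead explicitly (for example `--conf spark.yarn.executor.memoryOverhead=2048`, where 2048 is only an illustrative value) rather than only increasing --executor-memory.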


Another blog post describes the same problem: We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high. We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.


Because CentOS/RHEL 6 allocates virtual memory aggressively due to OS behavior, you should disable the virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively large value.
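As a rough illustration of the two options, the following Python sketch prints the corresponding yarn-site.xml property entries. The helper name yarn_property is hypothetical, and the ratio value of 4 is only an example, not a recommendation from either source; normally you would apply one option or the other.

```python
import xml.etree.ElementTree as ET


def yarn_property(name: str, value: str) -> str:
    """Render one <property> entry in the form used by yarn-site.xml."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Option A: turn the virtual memory checker off.
disable_vmem_check = yarn_property("yarn.nodemanager.vmem-check-enabled", "false")

# Option B: keep the checker but allow more virtual memory per unit of
# physical memory (4 is an illustrative value).
raise_vmem_ratio = yarn_property("yarn.nodemanager.vmem-pmem-ratio", "4")

if __name__ == "__main__":
    print(disable_vmem_check)
    print(raise_vmem_ratio)
```

The printed entries go inside the `<configuration>` element of yarn-site.xml on each NodeManager node.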




Virtual/physical memory checker

NodeManager can monitor the memory usage (virtual and physical) of a container. If the container's virtual memory exceeds "yarn.nodemanager.vmem-pmem-ratio" times the "mapreduce.reduce.memory.mb" or "", the container is killed if "yarn.nodemanager.vmem-check-enabled" is true.

If its physical memory exceeds "mapreduce.reduce.memory.mb" or "", the container is killed if "yarn.nodemanager.pmem-check-enabled" is true.

The parameters above can be set in yarn-site.xml on each NodeManager (NM) node to override the default behavior.

This is a sample error for a container killed by virtual memory checker:

Current usage: 347.3 MB of 1 GB physical memory used; 
2.2 GB of 2.1 GB virtual memory used. Killing container.

And this is a sample error for physical memory checker:

Current usage: 2.1gb of 2.0gb physical memory used; 
1.1gb of 3.15gb virtual memory used. Killing container.
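The two checks and the sample errors above can be sketched in a few lines of Python. This is a simplified model of the rule described in this section, not the NodeManager's actual code. The function name killed_by is hypothetical, 2.1 is assumed to be the configured yarn.nodemanager.vmem-pmem-ratio for the first sample (it is the stock YARN default), and 1.575 is inferred from the 3.15gb / 2.0gb limits in the second sample.

```python
from typing import Optional


def killed_by(pmem_used_mb: float, vmem_used_mb: float, pmem_limit_mb: float,
              vmem_pmem_ratio: float = 2.1,
              vmem_check: bool = True, pmem_check: bool = True) -> Optional[str]:
    """Return which checker would kill the container, or None if it survives."""
    # The virtual memory limit is the ratio times the physical memory limit.
    vmem_limit_mb = vmem_pmem_ratio * pmem_limit_mb
    if vmem_check and vmem_used_mb > vmem_limit_mb:
        return "virtual"
    if pmem_check and pmem_used_mb > pmem_limit_mb:
        return "physical"
    return None


# First sample: 347.3 MB of 1 GB physical, 2.2 GB of 2.1 GB virtual.
first = killed_by(347.3, 2.2 * 1024, 1024)

# Same container with the virtual memory checker disabled.
first_no_vmem_check = killed_by(347.3, 2.2 * 1024, 1024, vmem_check=False)

# Second sample: 2.1gb of 2.0gb physical, 1.1gb of 3.15gb virtual.
second = killed_by(2.1 * 1024, 1.1 * 1024, 2048, vmem_pmem_ratio=1.575)

if __name__ == "__main__":
    print(first, first_no_vmem_check, second)
```

In this model the first container is killed only because of its virtual memory, so disabling the virtual memory checker lets it run, while the second container exceeds its physical limit and needs a larger memory request instead.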

As of Hadoop 2.5.1 in MapR 4.1.0, the virtual memory checker is disabled and the physical memory checker is enabled by default.


If the above errors occur, it is also possible that the MapReduce job has a memory leak or that the memory for each container is simply not enough. Check the application logic and tune the container memory request: "mapreduce.reduce.memory.mb" or "".







