Routine YARN maintenance: NodeManager health status is false

Recently I noticed that the YARN cluster UI was showing fewer nodes than it should, which was baffling, because this had never been a problem before.

So I started digging. The UI showed that the NodeManager on host 206 had been lost, so I restarted the NM on 206 and then went through the NM log on 206 and the ResourceManager log on 207.

The logs showed nothing wrong at all: the NM on 206 reported that it had registered with 207, and 207 reported that it had received 206's registration. Completely baffling.

After a few more hours of poking around, I noticed something on the 206 NM's web UI: NodeHealthyStatus was false, and the health report said the local-dirs and log-dirs were bad.
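
Incidentally, the same information can be pulled from the command line instead of the web UI. A minimal sketch, assuming a configured Hadoop client on the cluster; the node ID below is a placeholder, copy the real one from the yarn node -list output:

    # list every node, including UNHEALTHY ones, with its current state
    yarn node -list -all

    # print the full status of one node; the Health-Report field carries
    # the "local-dirs are bad / log-dirs are bad" message
    yarn node -status node-206:45454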

So I googled it:

https://stackoverflow.com/questions/29131449/why-does-hadoop-report-unhealthy-node-local-dirs-and-log-dirs-are-bad




The most common cause of "local-dirs are bad" is disk usage on the node exceeding yarn's max-disk-utilization-per-disk-percentage default value of 90.0%.

Either clean up the disk that the unhealthy node is running on, or increase the threshold in yarn-site.xml


    <property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>98.5</value>
    </property>

Avoid disabling the disk check, because your jobs may fail when the disk eventually runs out of space or if there are permission issues. Refer to the yarn-site.xml Disk Checker section for more details.

Roughly, it means the disk holding that node's data directories has hit 90% usage, so YARN flips the NM's state to unhealthy. So we just need to raise the threshold, or roll up our sleeves and delete some data.
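
Before changing anything it is worth confirming which disk actually crossed the line. A rough sketch, assuming the NM's local-dirs and log-dirs sit under /data (substitute whatever yarn.nodemanager.local-dirs and log-dirs point to on your node):

    # show usage of every mounted filesystem; anything above 90% will
    # trip the default disk-health-checker threshold
    df -h

    # find the biggest consumers under the directory that is filling up
    du -sh /data/* | sort -rh | head -20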

My choice was to raise the threshold first.
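
Concretely, that means adding the property quoted above to yarn-site.xml on the 206 node and restarting its NodeManager so the disk health check picks up the new threshold. A rough sketch, assuming a Hadoop 2.x tarball install; the yarn-daemon.sh path is an assumption and your distribution's restart mechanism may differ:

    # edit yarn-site.xml on the unhealthy node and add the
    # max-disk-utilization-per-disk-percentage property shown above,
    # e.g. $HADOOP_HOME/etc/hadoop/yarn-site.xml

    # restart only the NodeManager on that node
    $HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
    $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager

    # verify the node comes back as RUNNING and healthy
    yarn node -list -all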
