When running a MapReduce job on Hadoop, it hangs at "mapreduce.Job: Running job: job_1477030467429_0002" and makes no further progress.
Approach 1: Analysis: when a MapReduce job hangs like this, the cause is often a misconfigured ResourceManager or NodeManager.
Check the configuration files yarn-site.xml (yarn.resourcemanager.hostname tells every node where the ResourceManager runs) and slaves (which lists the NodeManager hosts).
Possible errors (a correct example configuration is sketched after this list):
yarn.resourcemanager.hostname is configured incorrectly;
the NameNode host was not added to the slaves file (on a single-node cluster that host also runs the NodeManager);
the hosts file (/etc/hosts) maps hostnames to the wrong addresses.
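A minimal sketch of a correct single-node setup (hadoop-senior01 is the hostname used elsewhere in this post; the IP address is only a placeholder, substitute the node's real address):

yarn-site.xml — tell every node where the ResourceManager runs:
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-senior01</value>
</property>

etc/hadoop/slaves — one NodeManager/DataNode host per line:
hadoop-senior01

/etc/hosts — the hostname must resolve to the node's address:
192.168.1.101   hadoop-senior01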
Approach 2: not enough resources (memory) to run the job:
Symptom: Memory Total shows 0 (on the ResourceManager web UI);
NodeManager log:
2016-10-29 10:28:30,433 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir error, used space above threshold of 90.0%, removing from list of valid directories
2016-10-29 10:28:30,433 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/userlogs error, used space above threshold of 90.0%, removing from list of valid directories
2016-10-29 10:28:30,433 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/userlogs
2016-10-29 10:28:30,435 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/userlogs
2016-10-29 10:28:31,204 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1477706304072_0002_01_000001
2016-10-29 10:28:31,558 WARN org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Unexpected: procfs stat file is not in the expected format for process with pid 3202
2016-10-29 10:28:31,581 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 11809 for container-id container_1477706304072_0002_01_000001: 33.0 MB of 2 GB physical memory used; 1.6 GB of 4.2 GB virtual memory used
2016-10-29 10:28:31,910 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1477706304072_0002_01_000001 transitioned from RUNNING to KILLING
2016-10-29 10:28:31,910 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1477706304072_0002_01_000001
2016-10-29 10:28:31,954 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1477706304072_0002_01_000001 is : 143
2016-10-29 10:28:31,989 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1477706304072_0002_01_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-10-29 10:28:31,989 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir/usercache/kequan/appcache/application_1477706304072_0002/container_1477706304072_0002_01_000001
2016-10-29 10:28:31,991 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=kequan OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1477706304072_0002 CONTAINERID=container_1477706304072_0002_01_000001
2016-10-29 10:28:31,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1477706304072_0002_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-10-29 10:28:31,992 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1477706304072_0002_01_000001 from application application_1477706304072_0002
2016-10-29 10:28:31,992 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1477706304072_0002_01_000001 for log-aggregation
2016-10-29 10:28:31,992 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1477706304072_0002
2016-10-29 10:28:32,915 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1477706304072_0002_01_000001]
2016-10-29 10:28:34,582 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1477706304072_0002_01_000001
Analysis: "used space above threshold of 90.0%" means the disk is more than 90% full. MapReduce jobs consume a lot of local disk space; once the NodeManager's local and log directories cross the threshold they are marked as bad and the running container is killed (exit code 143 in the log above), so the job never progresses.
Method 1: raise the maximum allowed disk utilization to 95%; add the configuration below to yarn-site.xml (this treats the symptom rather than the cause, because disk usage can still climb past 95% while the job runs).
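The setting involved is the NodeManager disk-health-checker threshold, which defaults to 90.0; the property name below is the standard YARN one, so double-check it against your Hadoop version:

<property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>95.0</value>
</property>

Restart the NodeManager after editing yarn-site.xml so the new threshold takes effect.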
Method 2: free disk space by deleting files that are no longer needed.
Run df -h at the command line to see how much of each filesystem is in use:
[root@hadoop-senior01 modules]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        18G   14G  3.4G  80% /
tmpfs           1.9G  372K  1.9G   1% /dev/shm
/dev/sda1       291M   37M  240M  14% /boot
/dev/sda2 is the system disk (the root filesystem); deleting unused files or uninstalling unneeded software from it is enough to bring usage back under the threshold.
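To find what is actually taking the space, du can summarize usage per directory; the paths below are just examples, adjust them to your own layout:

du -sh /opt/* /var/log/* 2>/dev/null | sort -h                          # largest directories listed last
du -sh /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/* 2>/dev/null | sort -h      # Hadoop's own logs often grow large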
The key NodeManager error: ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/userlogs
Reference: http://www.cnblogs.com/tnsay/p/5917459.html
Method 3: if the node is a virtual machine, grow its virtual disk; on a physical machine, add another disk.
Expanding a virtual machine's disk: http://www.2cto.com/os/201405/301879.html