Troubleshooting a Hadoop job submission failure
INFO log:
2019-07-18 11:40:50 386 [QuartzScheduler_Worker-1:203538] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-18 11:40:51 388 [QuartzScheduler_Worker-1:204540] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-18 11:40:52 389 [QuartzScheduler_Worker-1:205541] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
ERROR log:
2019-07-18 10:51:12 160 [QuartzScheduler_Worker-1:669550199] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=923, name='TRAJECTORY@HOUR@1562281200000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/07","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/07/TRAJECTORY","configMap":{},"gran":"HOUR","monitorTime":1562281200000,"mrTypeName":"TRAJECTORY"}', status=2}] run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2019-07-18 11:11:22 520 [QuartzScheduler_Worker-1:670760559] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=925, name='TRAJECTORY@HOUR@1562284800000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/08","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/08/TRAJECTORY","configMap":{},"gran":"HOUR","monitorTime":1562284800000,"mrTypeName":"TRAJECTORY"}', status=2}] run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2019-07-18 11:31:32 854 [QuartzScheduler_Worker-1:671970893] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=924, name='LAUNCH@HOUR@1562284800000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/08","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/08/LAUNCH","configMap":{},"gran":"HOUR","monitorTime":1562284800000,"mrTypeName":"LAUNCH"}', status=2}] run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
A ConnectionRefused like this can also show up on 8020 or other ports; it almost always means the service behind that port is down. Here 8032 is the YARN ResourceManager address, so that is the service to check. If you do not know which service a given port belongs to, search the Hadoop configuration files to find out which service it is configured for.
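Before digging into the configs, a quick way to confirm the symptom is to probe the port from the client side. This is a sketch using bash's built-in /dev/tcp redirection (so no nc/telnet needed); the host and port come from the stack trace above.

```shell
# Probe the ResourceManager port; a refusal here matches the
# "Connection refused" in the stack trace. Substitute your host/port.
if (echo > /dev/tcp/sparka/8032) 2>/dev/null; then
  echo "sparka:8032 is accepting connections"
else
  echo "sparka:8032 refused/unreachable -- the service behind it is down"
fi
```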
First I grep all the config files for 8032, then narrow down which file actually sets it:
[root@sparka hadoop]# cat * | grep 8032
        <value>sparka:8032</value>
[root@sparka hadoop]# cat hdfs-site.xml | grep 8032
[root@sparka hadoop]# cat core-site.xml | grep 8032
[root@sparka hadoop]# cat yarn-site.xml | grep 8032
        <value>sparka:8032</value>
The match is the value of this property in yarn-site.xml:
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>sparka:8032</value>
    </property>
Check for the YARN ResourceManager process with jps:
[root@sparka hadoop]# jps
24384 jar
17312 jar
15937 Main
12746 NameNode
7370 SDK_WEB_AUTOREPORT.jar
31436 ZEUS_MANAGERSERVER.jar
26637 jar
24078 Jps
21903 jar
29427 Bootstrap
31027 jar
12947 JournalNode
18580 jar
23541 RunJar
24314 jar
13117 DFSZKFailoverController
22333 jar
15358 Main
4287 machineagent.jar
[root@sparka hadoop]#
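With this many JVMs on the box, a grep over the jps output makes the absence easier to see (a sketch; jps ships with the JDK):

```shell
# If grep finds nothing, the fallback message confirms the
# ResourceManager JVM is not running on this node.
jps 2>/dev/null | grep ResourceManager || echo "ResourceManager not found"
```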
Sure enough, the ResourceManager is not there. The next step is to find out when and why it stopped (or failed to start) by checking the service's log under Hadoop's logs directory.
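A sketch for scanning the ResourceManager log for fatal events around the time it died; the log path below is an assumption, so substitute your cluster's log directory.

```shell
# Hypothetical path; yarn-daemon.sh names the log yarn-<user>-resourcemanager-<host>.log
LOG=/opt/hadoop/logs/yarn-root-resourcemanager-sparka.log
if [ -f "$LOG" ]; then
  # keep only the last few fatal/halt entries before the shutdown
  grep -nE "FATAL|HaltException|ExitUtil" "$LOG" | tail -n 20
else
  echo "log not found: $LOG"
fi
```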
Checking the ResourceManager log:
[root@sparka logs]# tail -F yarn-root-resourcemanager-sparka.log
2019-06-22 00:08:54,794 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1555906617082_6791,name=sdk_data_action_day-determine_partitions_groupby-Optional.of([2019-06-21T00:00:00.000Z/2019-06-22T00:00:00.000Z]),user=root,queue=root.root,state=FAILED,trackingUrl=http://sparka:8088/cluster/app/application_1555906617082_6791,appMasterHost=N/A,startTime=1561132960264,finishTime=1561133311607,finalStatus=FAILED
2019-06-22 00:08:58,664 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : sparkb:32859 for container : container_1555906617082_6794_01_000001
2019-06-22 00:09:00,142 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1555906617082_6794_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
2019-06-22 00:09:00,142 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Clear node set for appattempt_1555906617082_6794_000001
2019-06-22 00:09:00,859 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1555906617082_6794 AttemptId: appattempt_1555906617082_6794_000001 MasterContainer: Container: [ContainerId: container_1555906617082_6794_01_000001, NodeId: sparkb:32859, NodeHttpAddress: sparkb:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.47.153:32859 }, ]
2019-06-22 00:09:03,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1555906617082_6794_000001 State change from SCHEDULED to ALLOCATED_SAVING
2019-06-22 00:09:06,446 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1555906617082_6794_000001 State change from ALLOCATED_SAVING to ALLOCATED
2019-06-22 00:09:07,251 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1555906617082_6794_000001
2019-06-22 00:09:13,245 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
2019-06-22 00:09:13,960 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
The last two lines show this halt. Let's look at the source, org.apache.hadoop.util.ExitUtil.halt():
/**
 * Forcibly terminates the currently running Java virtual machine.
 *
 * @param status
 *          exit code
 * @param msg
 *          message used to create the {@code HaltException}
 * @throws HaltException
 *          if Runtime.getRuntime().halt() is disabled for test purposes
 */
public static void halt(int status, String msg) throws HaltException {
  LOG.info("Halt with status " + status + " Message: " + msg);
  if (systemHaltDisabled) {
    HaltException ee = new HaltException(status, msg);
    LOG.fatal("Halt called", ee);
    if (null == firstHaltException) {
      firstHaltException = ee;
    }
    throw ee;
  }
  Runtime.getRuntime().halt(status);
}
According to the Javadoc, the JVM process was forcibly terminated. In most cases a forced shutdown like this happens because the server no longer has the resources to keep the service running; my guess is that it ran out of memory. But since the process died some time ago, I could not find any log evidence to confirm this.
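One after-the-fact way to test the memory hypothesis: if the kernel's OOM killer took a process down, it leaves a record in the kernel log (whereas a halt initiated inside the JVM, as logged here, does not), so this check at least rules the OOM killer in or out. A sketch; dmesg may require root, and the ring buffer may have rotated by now.

```shell
# Look for OOM-killer records; a hit would mean the kernel, not the JVM,
# killed a process due to memory pressure.
dmesg 2>/dev/null | grep -iE "killed process|out of memory" \
  || echo "no OOM-killer record in dmesg"
```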
Restart YARN.
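The restart itself, assuming a standard Hadoop 2.x layout (the HADOOP_HOME paths are assumptions; adjust for your installation):

```shell
# Restart only the ResourceManager on this node...
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
# ...or bounce all of YARN across the cluster:
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh
# Then confirm the process is back:
jps | grep ResourceManager
```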
If you have a better idea or approach for troubleshooting this, please point it out. Many thanks.