Hadoop job submission fails: java.net.ConnectException: Connection refused

Notes from a Hadoop job submission failure.

Table of Contents

  • Problem
  • Troubleshooting
  • Resolution

Problem

INFO log:

2019-07-18 11:40:50 386 [QuartzScheduler_Worker-1:203538] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-18 11:40:51 388 [QuartzScheduler_Worker-1:204540] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-18 11:40:52 389 [QuartzScheduler_Worker-1:205541] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

ERROR log:

2019-07-18 10:51:12 160 [QuartzScheduler_Worker-1:669550199] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=923, name='TRAJECTORY@HOUR@1562281200000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/07","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/07/TRAJECTORY","configMap":{},"gran":"HOUR","monitorTime":1562281200000,"mrTypeName":"TRAJECTORY"}', status=2}]  run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-07-18 11:11:22 520 [QuartzScheduler_Worker-1:670760559] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=925, name='TRAJECTORY@HOUR@1562284800000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/08","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/08/TRAJECTORY","configMap":{},"gran":"HOUR","monitorTime":1562284800000,"mrTypeName":"TRAJECTORY"}', status=2}]  run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-07-18 11:31:32 854 [QuartzScheduler_Worker-1:671970893] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=924, name='LAUNCH@HOUR@1562284800000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/08","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/08/LAUNCH","configMap":{},"gran":"HOUR","monitorTime":1562284800000,"mrTypeName":"LAUNCH"}', status=2}]  run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Similar failures may show Connection refused on port 8020 or some other port; in almost every case it means the service that should be listening on that port is down. Here 8032 is the YARN ResourceManager, so that is the service to check. If you do not know which service a port belongs to, grep the Hadoop configuration files to see which service is configured on it.
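
A quick sanity check is to see whether anything is listening on the port at all (a minimal sketch; it assumes netstat or ss is available on the node, and uses the port from the error above):

# Is any process listening on 8032?
netstat -lntp | grep 8032
# or, on hosts that only ship ss
ss -lntp | grep 8032

No output means the service that should own the port is not running on this host.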

Troubleshooting

First I grep every file in the Hadoop configuration directory for 8032 to confirm it is configured, then narrow down which file defines it:

[root@sparka hadoop]# cat * | grep 8032
                <value>sparka:8032</value>
[root@sparka hadoop]# cat hdfs-site.xml | grep 8032
[root@sparka hadoop]# cat core-site.xml | grep 8032
[root@sparka hadoop]# cat yarn-site.xml | grep 8032
                <value>sparka:8032</value>

The only match is the yarn.resourcemanager.address property in yarn-site.xml:

        <property>
                <name>yarn.resourcemanager.address</name>
                <value>sparka:8032</value>
        </property>

Check for the YARN ResourceManager process with jps:

[root@sparka hadoop]# jps
24384 jar
17312 jar
15937 Main
12746 NameNode
7370 SDK_WEB_AUTOREPORT.jar
31436 ZEUS_MANAGERSERVER.jar
26637 jar
24078 Jps
21903 jar
29427 Bootstrap
31027 jar
12947 JournalNode
18580 jar
23541 RunJar
24314 jar
13117 DFSZKFailoverController
22333 jar
15358 Main
4287 machineagent.jar
[root@sparka hadoop]#

Sure enough, the ResourceManager is missing. The next step is to find out when and why it stopped (or failed to start) by looking at the service's log under Hadoop's logs directory.
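
As a side note, jps by default only lists JVMs owned by the current user, so a ps-based check is a slightly more reliable way to confirm the process is really gone. A small sketch:

# Look for a ResourceManager JVM started by any user; the [R] keeps grep from matching itself
ps -ef | grep '[R]esourceManager'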

View the ResourceManager log:

[root@sparka logs]# tail -F yarn-root-resourcemanager-sparka.log
2019-06-22 00:08:54,794 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1555906617082_6791,name=sdk_data_action_day-determine_partitions_groupby-Optional.of([2019-06-21T00:00:00.000Z/2019-06-22T00:00:00.000Z]),user=root,queue=root.root,state=FAILED,trackingUrl=http://sparka:8088/cluster/app/application_1555906617082_6791,appMasterHost=N/A,startTime=1561132960264,finishTime=1561133311607,finalStatus=FAILED
2019-06-22 00:08:58,664 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : sparkb:32859 for container : container_1555906617082_6794_01_000001
2019-06-22 00:09:00,142 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1555906617082_6794_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
2019-06-22 00:09:00,142 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Clear node set for appattempt_1555906617082_6794_000001
2019-06-22 00:09:00,859 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1555906617082_6794 AttemptId: appattempt_1555906617082_6794_000001 MasterContainer: Container: [ContainerId: container_1555906617082_6794_01_000001, NodeId: sparkb:32859, NodeHttpAddress: sparkb:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.47.153:32859 }, ]
2019-06-22 00:09:03,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1555906617082_6794_000001 State change from SCHEDULED to ALLOCATED_SAVING
2019-06-22 00:09:06,446 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1555906617082_6794_000001 State change from ALLOCATED_SAVING to ALLOCATED
2019-06-22 00:09:07,251 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1555906617082_6794_000001
2019-06-22 00:09:13,245 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
2019-06-22 00:09:13,960 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException

The last two lines show this exception, so let's look at the source of ExitUtil.halt:

  /**
   * Forcibly terminates the currently running Java virtual machine.
   *
   * @param status
   *          exit code
   * @param msg
   *          message used to create the {@code HaltException}
   * @throws HaltException
   *           if Runtime.getRuntime().halt() is disabled for test purposes
   */
  public static void halt(int status, String msg) throws HaltException {
    LOG.info("Halt with status " + status + " Message: " + msg);
    if (systemHaltDisabled) {
      HaltException ee = new HaltException(status, msg);
      LOG.fatal("Halt called", ee);
      if (null == firstHaltException) {
        firstHaltException = ee;
      }
      throw ee;
    }
    Runtime.getRuntime().halt(status);
  }

According to the method's Javadoc, the JVM process was forcibly terminated. A forced shutdown like this usually means the server's resources could no longer support the program running normally, so my guess is that it was caused by insufficient memory. However, since the process had already been down for a while, I could not find any log evidence to confirm this...
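
If you want to test the memory-pressure guess on a live incident, a couple of quick checks may help (a rough sketch, not conclusive for this case: dmesg only helps while the event is still in the kernel ring buffer, and the -T flag needs a reasonably recent util-linux):

# Did the kernel OOM killer terminate any process?
dmesg -T | grep -iE 'out of memory|killed process'
# Current memory and swap headroom on the node
free -h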

Resolution

Restart YARN.
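
For reference, a minimal restart sketch (script names and the $HADOOP_HOME path depend on your Hadoop version and install layout; these are the Hadoop 2.x-style daemon scripts):

# Restart only the ResourceManager on this node
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager

# Or bounce the whole YARN layer
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh

# Verify it is back and listening on 8032
jps | grep ResourceManager
netstat -lntp | grep 8032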

If you have a better approach or idea for troubleshooting this, please point it out. Many thanks.
