Problem: Spark on YARN cluster mode, exitCode = 13

Background

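Today a colleague ran into a strange problem when submitting a Spark job. The job was submitted in cluster mode, but it never requested any resources and eventually failed with a timeout. Below is the log from the Spark ApplicationMaster (AM):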

Log Length: 19060

20/03/25 14:43:03 INFO util.SignalUtils: Registered signal handler for TERM
20/03/25 14:43:03 INFO util.SignalUtils: Registered signal handler for HUP
20/03/25 14:43:03 INFO util.SignalUtils: Registered signal handler for INT
20/03/25 14:43:03 INFO spark.SecurityManager: Changing view acls to: yarn,azkaban
20/03/25 14:43:03 INFO spark.SecurityManager: Changing modify acls to: yarn,azkaban
20/03/25 14:43:03 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/25 14:43:03 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/25 14:43:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, azkaban); groups with view permissions: Set(); users  with modify permissions: Set(yarn, azkaban); groups with modify permissions: Set()
20/03/25 14:43:04 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1578905465672_0362_000002
20/03/25 14:43:04 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
20/03/25 14:43:04 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
20/03/25 14:43:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/25 14:43:05 ERROR utils.ConfigUtil$: config error stream.process.env.alert
20/03/25 14:43:05 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
20/03/25 14:43:05 INFO spark.SparkContext: Submitted application: bigdata  data show stream  
20/03/25 14:43:05 INFO spark.SecurityManager: Changing view acls to: yarn,azkaban
20/03/25 14:43:05 INFO spark.SecurityManager: Changing modify acls to: yarn,azkaban
20/03/25 14:43:05 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/25 14:43:05 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/25 14:43:05 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, azkaban); groups with view permissions: Set(); users  with modify permissions: Set(yarn, azkaban); groups with modify permissions: Set()
20/03/25 14:43:05 INFO util.Utils: Successfully started service 'sparkDriver' on port 8623.
20/03/25 14:43:05 INFO spark.SparkEnv: Registering MapOutputTracker
20/03/25 14:43:05 INFO spark.SparkEnv: Registering BlockManagerMaster
20/03/25 14:43:05 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/25 14:43:05 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/25 14:43:05 INFO storage.DiskBlockManager: Created local directory at /data4/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/blockmgr-f79a9baf-e5c1-4328-a602-94b117bcae96
20/03/25 14:43:05 INFO storage.DiskBlockManager: Created local directory at /data3/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/blockmgr-bbda0203-956d-44d5-948f-9e818d5f44af
20/03/25 14:43:05 INFO storage.DiskBlockManager: Created local directory at /data2/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/blockmgr-6502c505-e075-4017-b160-6a37a29e1385
20/03/25 14:43:05 INFO storage.DiskBlockManager: Created local directory at /data/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/blockmgr-fb4f36e5-3b0a-4022-b85d-e15756ffbd09
20/03/25 14:43:05 INFO storage.DiskBlockManager: Created local directory at /data5/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/blockmgr-359ec979-ffed-4999-b2cb-561a4d6f375c
20/03/25 14:43:05 INFO storage.DiskBlockManager: Created local directory at /data1/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/blockmgr-fa4ab1a4-da84-4766-b6d0-4da620aef0e1
20/03/25 14:43:05 INFO memory.MemoryStore: MemoryStore started with capacity 912.3 MB
20/03/25 14:43:05 INFO spark.SparkEnv: Registering OutputCommitCoordinator
20/03/25 14:43:05 INFO util.log: Logging initialized @3192ms
20/03/25 14:43:06 INFO ui.JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/03/25 14:43:06 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: 2018-09-05T05:11:46+08:00, git hash: 3ce520221d0240229c862b122d2b06c12a625732
20/03/25 14:43:06 INFO server.Server: Started @3335ms
20/03/25 14:43:06 INFO server.AbstractConnector: Started ServerConnector@6f973872{HTTP/1.1,[http/1.1]}{0.0.0.0:23729}
20/03/25 14:43:06 INFO util.Utils: Successfully started service 'SparkUI' on port 23729.
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@e669757{/jobs,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@43cc8bba{/jobs/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@480e46cc{/jobs/job,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6bfaf5c2{/jobs/job/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@51f8e657{/stages,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5eb9402f{/stages/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@736dd434{/stages/stage,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4df56dab{/stages/stage/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5024cc55{/stages/pool,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@341e93b4{/stages/pool/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2265a66e{/storage,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@449b1fbc{/storage/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@322b251e{/storage/rdd,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@726893dc{/storage/rdd/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7db10d29{/environment,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@61cb9bdc{/environment/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@aa3412e{/executors,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3ce3c5d5{/executors/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@49ce4f25{/executors/threadDump,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@74e628bb{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@645cbd38{/static,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7619ef84{/,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@452ffb7d{/api,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@45d4d82f{/jobs/job/kill,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b05e60e{/stages/stage/kill,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hdh03.c.p.xyidc:23729
20/03/25 14:43:06 INFO executor.Executor: Starting executor ID driver on host localhost
20/03/25 14:43:06 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 20946.
20/03/25 14:43:06 INFO netty.NettyBlockTransferService: Server created on hdh03.c.p.xyidc:20946
20/03/25 14:43:06 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/25 14:43:06 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hdh03.c.p.xyidc, 20946, None)
20/03/25 14:43:06 INFO storage.BlockManagerMasterEndpoint: Registering block manager hdh03.c.p.xyidc:20946 with 912.3 MB RAM, BlockManagerId(driver, hdh03.c.p.xyidc, 20946, None)
20/03/25 14:43:06 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hdh03.c.p.xyidc, 20946, None)
20/03/25 14:43:06 INFO storage.BlockManager: external shuffle service port = 7337
20/03/25 14:43:06 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hdh03.c.p.xyidc, 20946, None)
20/03/25 14:43:06 INFO ui.JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/03/25 14:43:06 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1019fb61{/metrics/json,null,AVAILABLE,@Spark}
20/03/25 14:43:06 INFO scheduler.EventLoggingListener: Logging events to hdfs://nameservice1/user/spark/applicationHistory/local-1585118586217
20/03/25 14:43:06 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
20/03/25 14:43:06 INFO util.Utils: Extension com.cloudera.spark.lineage.NavigatorAppListener not being initialized.
20/03/25 14:43:06 WARN spark.SparkContext: Using an existing SparkContext; some configuration may not take effect.
20/03/25 14:43:07 WARN kafka010.KafkaUtils: overriding enable.auto.commit to false for executor
20/03/25 14:43:07 WARN kafka010.KafkaUtils: overriding auto.offset.reset to none for executor
20/03/25 14:43:07 WARN kafka010.KafkaUtils: overriding executor group.id to spark-executor-ezviz_bigdataV1
20/03/25 14:43:07 WARN kafka010.KafkaUtils: overriding receive.buffer.bytes to 65536 see KAFKA-3135
20/03/25 14:43:07 INFO kafka010.DirectKafkaInputDStream: Slide time = 600000 ms
20/03/25 14:43:07 INFO kafka010.DirectKafkaInputDStream: Storage level = Serialized 1x Replicated
20/03/25 14:43:07 INFO kafka010.DirectKafkaInputDStream: Checkpoint interval = null
20/03/25 14:43:07 INFO kafka010.DirectKafkaInputDStream: Remember interval = 600000 ms
20/03/25 14:43:07 INFO kafka010.DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@6528eeaa
20/03/25 14:43:07 INFO dstream.ForEachDStream: Slide time = 600000 ms
20/03/25 14:43:07 INFO dstream.ForEachDStream: Storage level = Serialized 1x Replicated
20/03/25 14:43:07 INFO dstream.ForEachDStream: Checkpoint interval = null
20/03/25 14:43:07 INFO dstream.ForEachDStream: Remember interval = 600000 ms
20/03/25 14:43:07 INFO dstream.ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@50afc6f6
20/03/25 14:43:07 INFO consumer.ConsumerConfig: ConsumerConfig values: 
	auto.commit.interval.ms = 5000
	auto.offset.reset = latest
	bootstrap.servers = [rzkafka1.p.xyidc:9092, rzkafka2.p.xyidc:9092, rzkafka3.p.xyidc:9092]
	check.crcs = true
	client.dns.lookup = default
	client.id = 
	connections.max.idle.ms = 540000
	default.api.timeout.ms = 60000
	enable.auto.commit = true
	exclude.internal.topics = true
	fetch.max.bytes = 52428800
	fetch.max.wait.ms = 500
	fetch.min.bytes = 1
	group.id = ezviz_bigdataV1
	heartbeat.interval.ms = 3000
	interceptor.classes = []
	internal.leave.group.on.close = true
	isolation.level = read_uncommitted
	key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
	max.partition.fetch.bytes = 1048576
	max.poll.interval.ms = 300000
	max.poll.records = 500
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 30000
	retry.backoff.ms = 100
	sasl.client.callback.handler.class = null
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	session.timeout.ms = 10000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = null
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer

20/03/25 14:43:07 INFO utils.AppInfoParser: Kafka version: 2.2.1-cdh6.3.1
20/03/25 14:43:07 INFO utils.AppInfoParser: Kafka commitId: null
20/03/25 14:43:07 INFO consumer.KafkaConsumer: [Consumer clientId=consumer-1, groupId=ezviz_bigdataV1] Subscribed to topic(s): bigdata_flume
20/03/25 14:43:07 INFO clients.Metadata: Cluster ID: DLztVA8bQJy39tXzct1BAA
20/03/25 14:43:07 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-1, groupId=ezviz_bigdataV1] Discovered group coordinator 10.97.202.27:9092 (id: 2147483644 rack: null)
20/03/25 14:43:07 INFO internals.ConsumerCoordinator: [Consumer clientId=consumer-1, groupId=ezviz_bigdataV1] Revoking previously assigned partitions []
20/03/25 14:43:07 INFO internals.AbstractCoordinator: [Consumer clientId=consumer-1, groupId=ezviz_bigdataV1] (Re-)joining group
20/03/25 14:44:44 ERROR yarn.ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:447)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:275)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:805)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:804)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:804)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
20/03/25 14:44:44 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:447)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:275)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:805)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:804)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:804)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
)
20/03/25 14:44:44 INFO spark.SparkContext: Invoking stop() from shutdown hook
20/03/25 14:44:44 INFO server.AbstractConnector: Stopped Spark@6f973872{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
20/03/25 14:44:44 INFO ui.SparkUI: Stopped Spark web UI at http://hdh03.c.p.xyidc:23729
20/03/25 14:44:44 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/03/25 14:44:44 INFO memory.MemoryStore: MemoryStore cleared
20/03/25 14:44:44 INFO storage.BlockManager: BlockManager stopped
20/03/25 14:44:44 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
20/03/25 14:44:44 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/03/25 14:44:44 INFO spark.SparkContext: Successfully stopped SparkContext
20/03/25 14:44:44 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://nameservice1/user/azkaban/.sparkStaging/application_1578905465672_0362
20/03/25 14:44:44 INFO util.ShutdownHookManager: Shutdown hook called
20/03/25 14:44:44 INFO util.ShutdownHookManager: Deleting directory /data1/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/spark-d051ed3a-248c-4482-b96a-40bc3d3dfad0
20/03/25 14:44:44 INFO util.ShutdownHookManager: Deleting directory /data5/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/spark-6487181b-a3e8-45e4-926f-13c6d3a05b15
20/03/25 14:44:44 INFO util.ShutdownHookManager: Deleting directory /data3/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/spark-ffb89e28-fc83-4a48-ae8a-45d4a7b7d505
20/03/25 14:44:44 INFO util.ShutdownHookManager: Deleting directory /data/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/spark-6d0ffcac-04ce-42d4-915c-24709497d5e1
20/03/25 14:44:44 INFO util.ShutdownHookManager: Deleting directory /data4/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/spark-3b588d43-001b-42b6-beda-ad001f38e5f4
20/03/25 14:44:44 INFO util.ShutdownHookManager: Deleting directory /data2/yarn/nm/usercache/azkaban/appcache/application_1578905465672_0362/spark-873e64a1-3c77-48ae-80d8-e41f9ef29e51

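The log shows that the AM started normally. At this point it should be requesting resources from YARN, yet there is no log output about launching executors. Normally you would see something like this: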

20/03/24 16:28:12 INFO yarn.ApplicationMaster: 
===============================================================================
YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}{{PWD}}/__spark_conf__{{PWD}}/__spark_libs__/*{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/jars/*{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/hive/*$HADOOP_CLIENT_CONF_DIR$HADOOP_COMMON_HOME/*$HADOOP_COMMON_HOME/lib/*$HADOOP_HDFS_HOME/*$HADOOP_HDFS_HOME/lib/*$HADOOP_YARN_HOME/*$HADOOP_YARN_HOME/lib/*$HADOOP_CLIENT_CONF_DIR$PWD/mr-framework/*$MR2_CLASSPATH{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/accessors-smart-1.2.jar:{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/accessors-smart.jar:{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/asm-5.0.4.jar:{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/asm.jar:{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/avro.jar:{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/aws-java-sdk-bundle-1.11.271.jar:{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/aws-java-sdk-bundle.jar:{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/azure-data-lake-store-sdk-2.2.9.jar:{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/client/azure-data-lake-store-sdk.jar:
    SPARK_YARN_STAGING_DIR -> hdfs://nameservice1/user/azkaban/.sparkStaging/application_1578905465672_0359
    SPARK_USER -> azkaban
    OPENBLAS_NUM_THREADS -> 1

  command:
    LD_LIBRARY_PATH=\"{{HADOOP_COMMON_HOME}}/../../../CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/lib/native:$LD_LIBRARY_PATH\" \ 
      {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx8192m \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.driver.port=23947' \ 
      '-Dspark.authenticate=false' \ 
      '-Dspark.network.crypto.enabled=false' \ 
      '-Dspark.shuffle.service.port=7337' \ 
      '-Dspark.ui.port=0' \ 
      -Dspark.yarn.app.container.log.dir= \ 
      -XX:OnOutOfMemoryError='kill %p' \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@[host elided]:23947 \ 
      --executor-id \ 
       \ 
      --hostname \ 
       \ 
      --cores \ 
      1 \ 
      --app-id \ 
      application_1578905465672_0359 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1>/stdout \ 
      2>/stderr

  resources:
    __app__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: -1 file: "/user/azkaban/.sparkStaging/application_1578905465672_0359/EzBigdataFramework-1.0-SNAPSHOT-shaded.jar" } size: 47475163 timestamp: 1585038485433 type: FILE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "nameservice1" port: -1 file: "/user/azkaban/.sparkStaging/application_1578905465672_0359/__spark_conf__.zip" } size: 172363 timestamp: 1585038485636 type: ARCHIVE visibility: PRIVATE
===============================================================================

Analysis

1. Could something be wrong with the YARN ResourceManager (RM), so that the AM cannot reach it?
I checked the YARN services and everything was healthy.

So I went on to check the RM log. On CDH the log path is: /var/log/hadoop-yarn/hadoop-cmf-yarn-RESOURCEMANAGER-hdh02.c.p.xyidc.log.out

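Tracing the state machine transitions of the application, appattempt, and container in YARN, everything looked normal at first,
but then it went wrong: there were no log entries at all for the AM requesting resources, only the log below: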

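This matches the AM log above exactly. So what was stopping the AM from requesting resources? The logs had too little to go on, so I followed the AM stack trace to the corresponding place in the AM code, shown below: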

private def runDriver(securityMgr: SecurityManager): Unit = {
    addAmIpFilter()
    userClassThread = startUserApplication()

    // This a bit hacky, but we need to wait until the spark.driver.port property has
    // been set by the Thread executing the user class.
    logInfo("Waiting for spark context initialization...")
    val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
    try {
      val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
        Duration(totalWaitTime, TimeUnit.MILLISECONDS))
      if (sc != null) {
        rpcEnv = sc.env.rpcEnv
        val driverRef = runAMEndpoint(
          sc.getConf.get("spark.driver.host"),
          sc.getConf.get("spark.driver.port"),
          isClusterMode = true)
        registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.webUrl), securityMgr)
      } else {
        // Sanity check; should never happen in normal operation, since sc should only be null
        // if the user app did not create a SparkContext.
        if (!finished) {
          throw new IllegalStateException("SparkContext is null but app is still running!")
        }
      }
      userClassThread.join()
    } catch {
      case e: SparkException if e.getCause().isInstanceOf[TimeoutException] =>
        logError(
          s"SparkContext did not initialize after waiting for $totalWaitTime ms. " +
           "Please check earlier log output for errors. Failing the application.")
        finish(FinalApplicationStatus.FAILED,
          ApplicationMaster.EXIT_SC_NOT_INITED,
          "Timed out waiting for SparkContext.")
    }
  }

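According to the log, the failure happens in the code above. Because the job was submitted in cluster mode, the driver runs as a thread inside the AM. The log confirms that the SparkContext was never initialized properly, which is why the exit code is EXIT_SC_NOT_INITED. These are the exit codes defined in the code: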
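The waiting step in runDriver can be imitated with the Scala standard library alone. This is only a sketch of the mechanism, not Spark's actual `ThreadUtils.awaitResult`: `AwaitContextSketch`, `FakeContext`, and `awaitContext` are made-up names, and the short timeout stands in for `AM_MAX_WAIT_TIME`. It shows that a promise nobody completes ends in a `TimeoutException`, which is the shape of the failure in the AM log.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object AwaitContextSketch {
  // Hypothetical stand-in for the SparkContext the AM is waiting for.
  final case class FakeContext(master: String)

  // Blocks for at most `wait`; Left carries a reason when the promise is never completed.
  def awaitContext(p: Promise[FakeContext], wait: FiniteDuration): Either[String, FakeContext] =
    try Right(Await.result(p.future, wait))
    catch {
      case _: TimeoutException =>
        Left(s"context not initialized after ${wait.toMillis} ms")
    }

  def main(args: Array[String]): Unit = {
    // Nobody completes this promise, so the wait times out.
    val neverCompleted = Promise[FakeContext]()
    println(awaitContext(neverCompleted, 200.millis))

    // A promise that is already completed returns right away.
    val completed = Promise[FakeContext]().success(FakeContext("yarn"))
    println(awaitContext(completed, 200.millis))
  }
}
```

In the real AM, the promise is `sparkContextPromise`, and it is the user application thread that is expected to complete it.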

object ApplicationMaster extends Logging {

  // exit codes for different causes, no reason behind the values
  private val EXIT_SUCCESS = 0
  private val EXIT_UNCAUGHT_EXCEPTION = 10
  private val EXIT_MAX_EXECUTOR_FAILURES = 11
  private val EXIT_REPORTER_FAILURE = 12
  private val EXIT_SC_NOT_INITED = 13
  private val EXIT_SECURITY = 14
  private val EXIT_EXCEPTION_USER_CLASS = 15
  private val EXIT_EARLY = 16
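When reading AM logs it can help to turn the number after "exitCode:" back into its constant name. The lookup below is a small sketch that simply copies the values from the excerpt above; `AmExitCodes` and `describe` are hypothetical names, not part of Spark.

```scala
object AmExitCodes {
  // Values copied from the ApplicationMaster excerpt above.
  val names: Map[Int, String] = Map(
    0  -> "EXIT_SUCCESS",
    10 -> "EXIT_UNCAUGHT_EXCEPTION",
    11 -> "EXIT_MAX_EXECUTOR_FAILURES",
    12 -> "EXIT_REPORTER_FAILURE",
    13 -> "EXIT_SC_NOT_INITED",
    14 -> "EXIT_SECURITY",
    15 -> "EXIT_EXCEPTION_USER_CLASS",
    16 -> "EXIT_EARLY"
  )

  // Returns the constant name, or a note that the code is not in the table.
  def describe(code: Int): String =
    names.getOrElse(code, s"unknown exit code $code")

  def main(args: Array[String]): Unit =
    println(describe(13))
}
```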

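private val EXIT_SC_NOT_INITED = 13 literally means that the SparkContext (sc) was not initialized, so the driver never finished starting and the AM never got as far as requesting resources. I then checked the spark-submit command and the Spark code, and found that my colleague's code called setMaster("local[*]") while the job was submitted in cluster mode. The AM log agrees: it contains "Starting executor ID driver on host localhost" and an event log named local-1585118586217, both signs of a context running in local mode. With a local master the SparkContext does not go through the YARN cluster scheduler that notifies the AM, so the wait in runDriver times out. Case closed. Time to go give him a hard time.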
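The simplest fix is to drop the hard-coded setMaster call and let spark-submit's --master option decide. To catch the same mismatch early next time, a small guard could run at job startup. The sketch below is plain Scala with no Spark dependency; `MasterGuard` and `check` are hypothetical names, and it assumes the two values are read from the job's configuration.

```scala
object MasterGuard {
  // Returns a warning when a local master is combined with cluster deploy mode.
  def check(master: String, deployMode: String): Option[String] = {
    val isLocal = master.trim.toLowerCase.startsWith("local")
    val isCluster = deployMode.trim.equalsIgnoreCase("cluster")
    if (isLocal && isCluster)
      Some(s"master '$master' conflicts with deploy mode '$deployMode'; " +
        "remove the hard-coded setMaster and pass --master to spark-submit")
    else None
  }

  def main(args: Array[String]): Unit = {
    println(check("local[*]", "cluster")) // the combination from this incident
    println(check("yarn", "cluster"))     // the expected combination
  }
}
```

In a real job the inputs might come from `conf.get("spark.master")` and `conf.get("spark.submit.deployMode", "client")` on the job's `SparkConf`. Whether both keys are populated at that point is an assumption to verify on your own cluster.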
