spark-shell启动失败

背景

公司小组分配了三台虚拟机,在虚拟机上面意欲装hadoop集群及spark on yarn

版本

Hadoop 2.7.2
spark 2.3.2

问题

配置好hadoop集群与spark配置后,启动spark-shell --master yarn报错如下

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/04/07 14:14:35 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/04/07 14:14:47 ERROR TransportClient: Failed to send RPC 4632628449949268526 to /172.17.1.150:41470: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/04/07 14:14:47 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 1 at RPC address 172.17.1.150:41480, but got no response. Marking as slave lost.
java.io.IOException: Failed to send RPC 4632628449949268526 to /172.17.1.150:41470: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
    at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
    at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/04/07 14:14:47 ERROR YarnScheduler: Lost executor 1 on wangwei01: Slave lost
Spark context Web UI available at http://wangwei02:4040
Spark context available as 'sc' (master = yarn, app id = application_1554616735892_0001).
Spark session available as 'spark'.
19/04/07 14:14:52 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!
19/04/07 14:14:52 ERROR TransportClient: Failed to send RPC 7884125299619016997 to /172.17.1.150:41490: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/04/07 14:14:52 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 7884125299619016997 to /172.17.1.150:41490: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/04/07 14:14:53 ERROR Utils: Uncaught exception in thread Yarn application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:567)
    at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
    at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1804)
    at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:112)
Caused by: java.io.IOException: Failed to send RPC 7884125299619016997 to /172.17.1.150:41490: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

ApplicationMaster日志如下

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/filecache/10/__spark_libs__2498524289906141930.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/04/07 14:14:41 INFO SignalUtils: Registered signal handler for TERM
19/04/07 14:14:41 INFO SignalUtils: Registered signal handler for HUP
19/04/07 14:14:41 INFO SignalUtils: Registered signal handler for INT
19/04/07 14:14:41 INFO SecurityManager: Changing view acls to: root
19/04/07 14:14:41 INFO SecurityManager: Changing modify acls to: root
19/04/07 14:14:41 INFO SecurityManager: Changing view acls groups to: 
19/04/07 14:14:41 INFO SecurityManager: Changing modify acls groups to: 
19/04/07 14:14:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
19/04/07 14:14:42 INFO ApplicationMaster: Preparing Local resources
19/04/07 14:14:42 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1554616735892_0001_000001
19/04/07 14:14:43 INFO ApplicationMaster: Waiting for Spark driver to be reachable.
19/04/07 14:14:43 INFO ApplicationMaster: Driver now available: wangwei02:34614
19/04/07 14:14:43 INFO TransportClientFactory: Successfully created connection to wangwei02/172.17.1.151:34614 after 47 ms (0 ms spent in bootstraps)
19/04/07 14:14:43 INFO ApplicationMaster: 
===============================================================================
YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}{{PWD}}/__spark_conf__{{PWD}}/__spark_libs__/*$HADOOP_CONF_DIR$HADOOP_COMMON_HOME/share/hadoop/common/*$HADOOP_COMMON_HOME/share/hadoop/common/lib/*$HADOOP_HDFS_HOME/share/hadoop/hdfs/*$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*$HADOOP_YARN_HOME/share/hadoop/yarn/*$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*{{PWD}}/__spark_conf__/__hadoop_conf__
    SPARK_YARN_STAGING_DIR -> hdfs://wangwei01:9000/user/root/.sparkStaging/application_1554616735892_0001
    SPARK_USER -> root

  command:
    {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx1024m \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.driver.port=34614' \ 
      -Dspark.yarn.app.container.log.dir= \ 
      -XX:OnOutOfMemoryError='kill %p' \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@wangwei02:34614 \ 
      --executor-id \ 
       \ 
      --hostname \ 
       \ 
      --cores \ 
      1 \ 
      --app-id \ 
      application_1554616735892_0001 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1>/stdout \ 
      2>/stderr

  resources:
    __spark_libs__ -> resource { scheme: "hdfs" host: "wangwei01" port: 9000 file: "/user/root/.sparkStaging/application_1554616735892_0001/__spark_libs__2498524289906141930.zip" } size: 235228182 timestamp: 1554617677670 type: ARCHIVE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "wangwei01" port: 9000 file: "/user/root/.sparkStaging/application_1554616735892_0001/__spark_conf__.zip" } size: 183160 timestamp: 1554617678303 type: ARCHIVE visibility: PRIVATE

===============================================================================
19/04/07 14:14:43 INFO RMProxy: Connecting to ResourceManager at wangwei01/172.17.1.150:8030
19/04/07 14:14:43 INFO YarnRMClient: Registering the ApplicationMaster
19/04/07 14:14:43 INFO YarnAllocator: Will request 2 executor container(s), each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
19/04/07 14:14:43 INFO YarnAllocator: Submitted 2 unlocalized container requests.
19/04/07 14:14:43 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
19/04/07 14:14:44 INFO AMRMClientImpl: Received new token for : wangwei01:41680
19/04/07 14:14:44 INFO AMRMClientImpl: Received new token for : wangwei02:35865
19/04/07 14:14:44 INFO YarnAllocator: Launching container container_1554616735892_0001_01_000002 on host wangwei01 for executor with ID 1
19/04/07 14:14:44 INFO YarnAllocator: Launching container container_1554616735892_0001_01_000003 on host wangwei02 for executor with ID 2
19/04/07 14:14:44 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
19/04/07 14:14:44 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/04/07 14:14:44 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/04/07 14:14:44 INFO ContainerManagementProtocolProxy: Opening proxy : wangwei01:41680
19/04/07 14:14:44 INFO ContainerManagementProtocolProxy: Opening proxy : wangwei02:35865
19/04/07 14:14:46 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
19/04/07 14:14:46 INFO ApplicationMaster: Final app status: UNDEFINED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
19/04/07 14:14:46 INFO ShutdownHookManager: Shutdown hook called

上图可以看出ApplicationMaster进程丢失,导致连接失败
在AM启动节点上面查看NM的日志信息如下(部分日志)

2019-04-07 14:14:44,317 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000002/default_container_executor.sh]
2019-04-07 14:14:44,321 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554616735892_0001_01_000002 transitioned from LOCALIZED to RUNNING
2019-04-07 14:14:46,166 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1554616735892_0001_01_000002
2019-04-07 14:14:46,176 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 29146 for container-id container_1554616735892_0001_01_000002: 234.3 MB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used
2019-04-07 14:14:46,188 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 29105 for container-id container_1554616735892_0001_01_000001: 330.8 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2019-04-07 14:14:46,188 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1554616735892_0001_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2254857728, current usage = 2498314240
2019-04-07 14:14:46,188 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=29105,containerID=container_1554616735892_0001_01_000001] is running beyond virtual memory limits. Current usage: 330.8 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1554616735892_0001_01_000001 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 29110 29105 29105 29105 (java) 542 24 2382413824 84382 /opt/jdk1.8.0_201/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000001/tmp -Dspark.yarn.app.container.log.dir=/opt/hadoop-2.7.2/logs/userlogs/application_1554616735892_0001/container_1554616735892_0001_01_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg wangwei02:34614 --properties-file /tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000001/__spark_conf__/__spark_conf__.properties 
    |- 29105 29104 29105 29105 (bash) 0 0 115900416 307 /bin/bash -c /opt/jdk1.8.0_201/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000001/tmp -Dspark.yarn.app.container.log.dir=/opt/hadoop-2.7.2/logs/userlogs/application_1554616735892_0001/container_1554616735892_0001_01_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg 'wangwei02:34614' --properties-file /tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /opt/hadoop-2.7.2/logs/userlogs/application_1554616735892_0001/container_1554616735892_0001_01_000001/stdout 2> /opt/hadoop-2.7.2/logs/userlogs/application_1554616735892_0001/container_1554616735892_0001_01_000001/stderr 

2019-04-07 14:14:46,189 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Removed ProcessTree with root 29105
2019-04-07 14:14:46,189 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554616735892_0001_01_000001 transitioned from RUNNING to KILLING

有此日志可以看出由于AM的vmem内存使用(2.3G)超出了默认的(2.1G),因此被kill掉了

解决方案

  1. 设置虚拟内存与物理内存比率
    在yarn-site.xml中修改yarn.nodemanager.vmem-pmem-ratio(default:2.1)

    yarn.nodemanager.vmem-pmem-ratio
    10

  1. yarn.nodemanager.vmem-check-enabled,false掉

    yarn.nodemanager.vmem-check-enabled
    false

你可能感兴趣的:(spark-shell启动失败)