Notes on a spark-submit error: scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor

A spark-submit failure worth writing down.

When a Spark job fails because of insufficient memory:
First: in most cases the assumption is that the shuffle phase is using too much memory and the executors cannot finish, so the instinct is to increase executor-memory and the number of cores.
Second: also remember that even though you request a lot of memory, the cluster may not actually have that much available:
that is, the total container memory requested at spark-submit time (number of executors × memory per executor) exceeds the total container memory configured for YARN in Ambari.
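
To make this concrete, here is a quick back-of-the-envelope check (a sketch only: it assumes roughly 1 GB per executor as the default memory overhead, and that $HADOOP_CONF_DIR points at a cluster configuration where these YARN properties are set; the numbers match the submission shown later in this post):

# Rough total memory this submission asks YARN for
# (--num-executors 26, --executor-memory 8g from the command further below,
# plus ~1 GB per executor as an approximation of the default overhead).
NUM_EXECUTORS=26
EXECUTOR_MEM_GB=8
OVERHEAD_GB=1
echo "Requested from YARN: ~$(( NUM_EXECUTORS * (EXECUTOR_MEM_GB + OVERHEAD_GB) )) GB"

# Compare with what YARN is actually configured to hand out
# (the same values Ambari manages):
grep -A1 'yarn.nodemanager.resource.memory-mb'  "$HADOOP_CONF_DIR/yarn-site.xml"
grep -A1 'yarn.scheduler.maximum-allocation-mb' "$HADOOP_CONF_DIR/yarn-site.xml"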

I. The command-line log shows:

  1. WARN : scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 8
  2. WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e143_1591884070030_22633_01_000338 on host: hd123. Exit status: 134. Diagnostics: Exception from container-launch.
  3. ERROR cluster.YarnScheduler: Lost executor 2 on hd030.corp.yodao.com: Container marked as failed:
  4. WARN scheduler.TaskSetManager: Lost task 3.1 in stage 1.0 (TID 156, hd030): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=3, message=org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

As shown in the screenshots:
[Figure 1: spark-submit console log]
[Figure 2: spark-submit console log]

II. The YARN logs show (an example of how to view the YARN logs follows the stack trace below):

  1. Exception in thread "main" java.io.IOException: Failed to connect to /IP:port
20/06/17 19:21:02 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at java.lang.Thread.run(Thread.java:745)
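
Pulling these YARN logs from the command line looks roughly like this (a minimal sketch: the application id is read off the container name container_e143_1591884070030_22633_... in the warning above, and app.log is just a local file name):

# Dump the aggregated application logs to a local file,
# then search it for the shuffle fetch failures shown above.
yarn logs -applicationId application_1591884070030_22633 > app.log
grep -nE 'OneForOneBlockFetcher|Connection reset by peer|Missing an output location' app.log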

III. The application web UI

[Figure 3: application web UI]
There you can view the logs of each executor; refresh repeatedly while the job is running, and look at the executor logs while the tasks are still executing:
[Figure 4: executor logs in the application web UI]
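
If the UI is hard to catch while the job is running, the same per-executor information (including links to each executor's stdout/stderr) is also exposed through Spark's monitoring REST API. A sketch, assuming a Spark history server on its default port 18080; the host name is a placeholder and the application id is the one taken from the YARN logs above:

# List the executors of one application, with links to their stdout/stderr logs.
curl -s "http://history-server-host:18080/api/v1/applications/application_1591884070030_22633/executors"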

Solution:

(Note: the total container memory requested (number of executors × memory per executor) must not exceed the total container memory configured for YARN in Ambari.)

spark-submit parameters: --num-executors 26 --driver-memory 15g --executor-memory 8g --executor-cores 2

  1. First reduce --num-executors; if that is not enough, then reduce --executor-memory.
  2. You can start from very small values and, once the job runs successfully, gradually increase them to a suitable configuration (a sketch follows this list).
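
Putting that together, a scaled-down submission might look like the sketch below. The class name and jar are placeholders, and the resource numbers are only illustrative; size them so that num-executors × (executor-memory + overhead) stays within the YARN container memory configured in Ambari.

# A scaled-down variant of the original submission (26 executors x 8g).
# Start small; if the job succeeds, grow these values step by step.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --driver-memory 8g \
  --executor-memory 6g \
  --executor-cores 2 \
  --class com.example.MyJob \
  my-job.jar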

There are other options as well; if the cluster has enough resources, you can also increase some parameters:

  1. --conf spark.sql.shuffle.partitions=2048 (the default is 200; a sketch follows this list)
  2. Adjust the YARN configuration in yarn-site.xml:
<property>
    <!-- Ratio of virtual to physical memory allowed per container
         (YARN default is 2.1); raising it gives containers more vmem headroom. -->
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>10</value>
</property>
<property>
    <!-- Disable the NodeManager's virtual-memory check entirely, so containers
         are no longer killed for exceeding the virtual-memory limit. -->
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
  3. And so on; I will add more here as I learn of them, and feedback is welcome.
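
For option 1, the flag goes straight onto the submit command. The sketch below reuses the placeholder job from the earlier example, only adding the larger shuffle partition count.

# Same placeholder submission as above, plus a larger shuffle partition count
# (spark.sql.shuffle.partitions defaults to 200); smaller per-task shuffle
# blocks put less memory pressure on each executor.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.sql.shuffle.partitions=2048 \
  --num-executors 10 \
  --driver-memory 8g \
  --executor-memory 6g \
  --executor-cores 2 \
  --class com.example.MyJob \
  my-job.jar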
