Common Problems in Spark Development and Their Solutions (Part 1)

Problem 1: ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-8] shutting down ActorSystem [sparkDriver]

18/05/18 15:46:59 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-8] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
   at org.spark_project.protobuf.ByteString.toByteArray(ByteString.java:515)
   at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:64)
   at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
   at scala.util.Try$.apply(Try.scala:161)
   at akka.serialization.Serialization.deserialize(Serialization.scala:98)
   at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
   at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
   at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
   at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
   at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

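Analysis: The driver ran out of memory. The OutOfMemoryError is thrown on a sparkDriver thread while it deserializes an incoming message, so the driver's heap is too small for the data it receives.

Solution: When submitting the job with spark-submit, set the driver memory with --driver-memory, for example --driver-memory 3g.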
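
As a minimal sketch of where the flag goes, the snippet below assembles a spark-submit command as an argument list. The main class `com.example.MyJob` and the jar `my-job.jar` are hypothetical placeholders, and `3g` is simply the value used above, not a recommendation for every job.

```python
import shlex

def build_spark_submit(main_class, app_jar, driver_memory="3g"):
    """Assemble a spark-submit argument list with an explicit driver heap size."""
    return [
        "spark-submit",
        "--class", main_class,
        "--driver-memory", driver_memory,  # heap size of the driver JVM
        app_jar,                           # options must come before the application jar
    ]

# Hypothetical job; substitute the real main class and jar.
cmd = build_spark_submit("com.example.MyJob", "my-job.jar")
print(" ".join(shlex.quote(part) for part in cmd))
```

Arguments placed after the jar are passed to the application itself, which is why the option is inserted before it.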


Problem 2: Map output statuses were bytes which exceeds spark.akka.frameSize

17/10/12 00:15:38 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to sparkExecutor@titannew134:46695

17/10/12 00:15:38 ERROR spark.MapOutputTrackerMasterActor: Map output statuses were 14371441 bytes which exceeds spark.akka.frameSize (10485760 bytes).
org.apache.spark.SparkException: Map output statuses were 14371441 bytes which exceeds spark.akka.frameSize (10485760 bytes).
at org.apache.spark.MapOutputTrackerMasterActor$$anonfun$receiveWithLogging$1.applyOrElse(MapOutputTracker.scala:59)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.MapOutputTrackerMasterActor.aroundReceive(MapOutputTracker.scala:42)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
17/10/12 00:15:38 INFO scheduler.TaskSetManager: Starting task 1.3 in stage 1.0 (TID 5653, titannew134, PROCESS_LOCAL, 1045 bytes)
17/10/12 00:15:38 WARN scheduler.TaskSetManager: Lost task 8.2 in stage 1.0 (TID 5649, titannew134): org.apache.spark.SparkException:Error communicating with MapOutputTracker
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:116)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:163)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

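Analysis: spark.akka.frameSize caps the size of a single message exchanged between the workers and the driver (for example, a task's result). The default is 10 MB. The error shows that the map output statuses for this job came to a little over 14 MB (14371441 bytes), so we raised the limit to 20 MB. (You can also investigate from the worker logs. When a task fails on a worker, the master's log usually shows a "Lost TID: " message. Check the log files of the failed worker, under $SPARK_HOME/worker/, and see whether the "Serialized size of result" recorded for the task exceeds 10 MB.)

Solution: When submitting the job with spark-submit, pass --conf spark.akka.frameSize=20 (the value is in MB) to raise the maximum message size accordingly.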
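
The required size can be read straight off the error message. The sketch below parses the line from the log above and rounds the needed size up to whole megabytes; the 25% headroom is an assumption of this example, not a Spark rule. On Spark 2.0 and later, where Akka was removed, the corresponding setting is `spark.rpc.message.maxSize`, so check the documentation for the version in use.

```python
import math
import re

# Error line copied from the log above.
LOG_LINE = (
    "org.apache.spark.SparkException: Map output statuses were 14371441 bytes "
    "which exceeds spark.akka.frameSize (10485760 bytes)."
)

PATTERN = re.compile(
    r"Map output statuses were (\d+) bytes which exceeds "
    r"spark\.akka\.frameSize \((\d+) bytes\)"
)

def suggest_frame_size_mb(line, headroom=1.25):
    """Return (needed_bytes, limit_bytes, suggested_mb), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    needed, limit = int(match.group(1)), int(match.group(2))
    # Add headroom (an assumed margin), then round up to whole megabytes.
    suggested_mb = math.ceil(needed * headroom / (1024 * 1024))
    return needed, limit, suggested_mb

needed, limit, suggested_mb = suggest_frame_size_mb(LOG_LINE)
print(f"needed={needed} limit={limit} suggested frame size: {suggested_mb} MB")
```

Any value at or above the suggestion works; the 20 MB chosen above leaves a little more room.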


Problem 3: shuffle FetchFailedException

(1)org.apache.spark.shuffle.FetchFailedException:Failed to connect to datasvr6/101.120.110.114:60731

18/05/18 18:00:00 WARN scheduler.TaskSetManager: Lost task 22.0 in stage 8.0 (TID 172543, datasvr4): FetchFailed(BlockManagerId(23, datasvr6.bigdata.cqtpi.org, 60731), shuffleId=3, mapId=291, reduceId=22, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to  datasvr6/101.120.110.114:60731
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:216)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:61)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to datasvr6.bigdata.cqtpi.org/10.10.10.14:60731
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
Caused by: java.net.ConnectException: Connection refused: datasvr6.bigdata.cqtpi.org/10.10.10.14:60731
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        ... 1 more

(2)org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

org.apache.spark.shuffle.MetadataFetchFailedException: 
Missing an output location for shuffle 0

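Analysis:

A shuffle has two parts: shuffle write and shuffle read.

The number of shuffle write partitions is determined by the number of partitions of the RDD in the previous stage, while the number of shuffle read partitions is controlled by Spark configuration parameters.

Shuffle write can be thought of as something like a (hypothetical) saveAsLocalDiskFile operation: the intermediate results are written temporarily, according to some partitioning rule, to the local disks of the executors.

If the shuffle read volume is large, a single task has to process a very large amount of data. This can crash the JVM and make shuffle fetches fail; the executor is eventually lost, and you see a "Failed to connect to host" error (executor lost) or long GC pauses.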
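
A back-of-the-envelope sketch makes the effect concrete: the average shuffle read per task is the total shuffle volume divided by the number of read partitions. The 200 GiB total and the 128 MiB per-task target below are assumed figures for illustration only, and the calculation ignores skew, so real tasks can be much larger than the average.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

def partitions_for_target(total_shuffle_bytes, target_bytes_per_task):
    """Smallest partition count that keeps the average task input at or below the target."""
    return max(1, math.ceil(total_shuffle_bytes / target_bytes_per_task))

total = 200 * GIB                                   # assumed total shuffle read of one stage
per_task_at_200 = total / 200                       # average per task with 200 read partitions
needed = partitions_for_target(total, 128 * MIB)    # 128 MiB per task is an assumed target

print(f"200 partitions: {per_task_at_200 / GIB:.1f} GiB per task on average")
print(f"partitions needed for 128 MiB per task: {needed}")
```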


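Solution:

(a) Reduce the amount of shuffled data and the number of shuffle operations
Consider whether a map-side join or a broadcast join can avoid the shuffle altogether.
Filter out unneeded data before the shuffle. For example, if the raw data has 20 fields, select only the fields you actually need; this reduces the shuffle volume.

(b) Control the number of partitions
For Spark SQL and DataFrame operations such as join and group by:
Set spark.sql.shuffle.partitions (default 200) and raise it according to the shuffle volume and the complexity of the computation.
For RDD operations such as join, groupBy, and reduceByKey:
Set spark.default.parallelism, which controls the number of partitions used for shuffle read and the reduce side. It defaults to the total number of cores running the job (8 in Mesos fine-grained mode, the number of local cores in local mode); the official recommendation is 2-3 times the number of cores running the job.

(c) Increase executor memory
Raise spark.executor.memory as appropriate.

(d) Increase the number of parallel tasks
More parallel tasks means less data per task.

(e) Check for data skew
Does one key have far more data than the others? If so, handle that key separately or consider changing how the data is partitioned.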
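
One common way to change the partitioning of a hot key is key salting, a technique the text above does not name. The sketch below shows the idea in plain Python on toy data: a random suffix spreads the hot key over several sub-keys, the values are aggregated per salted key, and a second aggregation removes the salt. The `#` separator, the 8 salts, and the data are assumptions of this example; in Spark the two stages would typically be two reduceByKey steps.

```python
import random
from collections import Counter

def salt_key(key, num_salts, rng):
    """Append a random suffix so one hot key is spread over num_salts sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"

def unsalt_key(salted_key):
    """Recover the original key by dropping the salt suffix."""
    return salted_key.rsplit("#", 1)[0]

# Skewed toy data: one hot key and a few rare ones.
records = [("hot", 1)] * 10_000 + [("a", 1), ("b", 1), ("c", 1)]

rng = random.Random(42)

# Stage 1: aggregate per salted key, so no single group holds all "hot" records.
partial = Counter()
for key, value in records:
    partial[salt_key(key, 8, rng)] += value

# Stage 2: strip the salt and aggregate again to get the final totals.
final = Counter()
for salted, value in partial.items():
    final[unsalt_key(salted)] += value

print("largest salted group:", max(partial.values()))
print("final count for hot key:", final["hot"])
```

The cost is an extra aggregation step, so salting is worth it only when a few keys dominate the data.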


Problem 4: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3

[Stage 0:> (0 + 4) / 42]2018-03-15 11:28:16,512 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 0 on 192.168.110.38: remote Rpc client disassociated

[Stage 0:> (0 + 4) / 42]2018-03-15 11:28:23,188 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 1 on 192.168.110.38: remote Rpc client disassociated

[Stage 0:> (0 + 4) / 42]2018-03-15 11:28:29,203 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 2 on 192.168.10.38: remote Rpc client disassociated

[Stage 0:> (0 + 4) / 42]2018-03-15 11:28:36,319 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 3 on 192.168.10.38: remote Rpc client disassociated

[Stage 0:> (0 + 4) / 42]2018-03-15 11:28:23,188 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 4 on 192.168.110.38: remote Rpc client disassociated

[Stage 0:> (0 + 4) / 42]2018-03-15 11:28:29,203 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 5 on 192.168.10.38: remote Rpc client disassociated

[Stage 0:> (0 + 4) / 42]2018-03-15 11:28:36,319 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 6 on 192.168.10.38: remote Rpc client disassociated

2016-01-15 11:28:36,321 [org.apache.spark.scheduler.TaskSetManager]-[ERROR] Task 3 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException : Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times

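Analysis: The log shows that every executor was lost before it finished its tasks. The data volume is too large for the memory allocated, so the tasks run for a long time, time out, and the connection to the executor is dropped.

Solution: Allocate more memory to each executor, for example by raising spark.executor.memory.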


Problem 5: TaskSetManager: Lost task 1.0 in stage 6.0 (TID 100, 192.168.10.37): java.lang.OutOfMemoryError: Java heap space

16/01/15 14:29:51 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.120.38:57139 (size: 42.0 KB, free: 24.2 MB)
16/01/15 14:29:53 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.10.39:53816 (size: 42.0 KB, free: 24.2 MB)
18/02/15 18:27:58 INFO TaskSetManager: Starting task 3.0 in stage 6.0 (TID 102, 192.168.120.38, ANY, 2152 bytes)
18/02/15 18:27:58 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID 100, 192.168.120.38): java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
        at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.<init>(UnsafeRowSerializer.scala:55)
        at org.apache.spark.sql.execution.UnsafeRowSerializerInstance.serializeStream(UnsafeRowSerializer.scala:52)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:92)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

18/02/15 18:27:58 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.120.38: remote Rpc client disassociated
18/02/15 18:27:58 INFO TaskSetManager: Re-queueing tasks for 6 from TaskSet 6.0
18/02/15 18:27:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:42250] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
18/02/15 18:27:58 WARN TaskSetManager: Lost task 3.0 in stage 6.0 (TID 102, 192.168.120.38): ExecutorLostFailure (executor 6 lost)
18/02/15 18:27:58 INFO DAGScheduler: Executor lost: 6 (epoch 8)
18/02/15 18:27:58 INFO BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
18/02/15 18:27:58 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, 192.168.120.38, 57139)
18/02/15 18:27:58 INFO BlockManagerMaster: Removed 6 successfully in removeExecutor
18/02/15 18:27:58 INFO AppClient$ClientEndpoint: Executor updated: app-20160115142128-0001/6 is now EXITED (Command exited with code 52)
18/02/15 18:27:58 INFO SparkDeploySchedulerBackend: Executor app-20160115142128-0001/6 removed: Command exited with code 52
WARN TaskSetManager: Lost task 4.1 in stage 6.0 (TID 137, 192.168.10.39): java.lang.OutOfMemoryError: GC overhead limit exceeded

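Analysis: The amount of data read is too large, so the worker does not have enough memory to run its tasks and runs out of heap. In Spark jobs, executor-lost problems (shuffle fetch failures, task failures and retries, and so on) are generally caused by insufficient memory or data skew.

Solution: Give the workers more memory. The following measures also help:

(a) With the same resources, increasing the number of partitions eases both data skew and memory pressure, because each task then has less data to process.
(b) Increase the parallelism of shuffle tasks. For functions of the reduce and group kind, pass the second argument, which is the parallelism (number of partitions), or set the default parallelism.

(c) Reduce the number of cores per executor as appropriate and increase the number of executors. With too many cores, an executor runs too many tasks at the same time, which raises its memory pressure and makes GC more frequent.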
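
The arithmetic behind (c) can be sketched as follows. The model is deliberately rough: it assumes one running task per core and divides the executor heap evenly, ignoring Spark's internal memory fractions and overhead. The executor counts and the 8 GB heap are assumed figures.

```python
def describe(num_executors, cores_per_executor, memory_mb_per_executor):
    """Summarize a cluster layout under a one-task-per-core, even-split assumption."""
    return {
        "task_slots": num_executors * cores_per_executor,
        "heap_per_task_mb": memory_mb_per_executor // cores_per_executor,
        "total_memory_mb": num_executors * memory_mb_per_executor,
    }

# Assumed layouts: same heap per executor, fewer cores each, more executors.
before = describe(num_executors=4, cores_per_executor=8, memory_mb_per_executor=8192)
after = describe(num_executors=8, cores_per_executor=4, memory_mb_per_executor=8192)

for name, layout in (("before", before), ("after", after)):
    print(name, layout)
```

Both layouts run the same number of tasks at once, but each task in the second layout gets twice the heap share. The price is twice the total memory across the cluster.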

