【bigger than spark.driver.maxResultSize】

After submitting a Spark job, it failed with the following error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 90 tasks (1025.7 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1490)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1478)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1477)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1477)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:826)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:826)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:826)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1715)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1670)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1659)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:651)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1943)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1956)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1969)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1983)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:293)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:78)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:75)
	at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:94)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:74)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:74)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

After some research, I found the following fix.

Changing it in the code
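    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    // Raise the limit on the total serialized result size sent back to the driver (1024.0 MB in the error above).
    val sparkConf = new SparkConf()
    sparkConf.set("spark.driver.maxResultSize", "4g")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()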

Setting spark.driver.maxResultSize to a larger value in the SparkConf, before the SparkSession is created, is enough.

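Changing it in the script

In theory the same setting can also be passed in the submit script. Below is my script; the last line is the configuration I added, but it does not take effect at runtime. If you know why, please leave a comment.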


$SPARK_HOME/bin/spark-submit \
    --cluster -hadoop \
    --conf spark.yarn.job.owners=wdp \
    --class getres \
    --master yarn \
    --deploy-mode cluster \
    --queue "$QUEUE" \
    --driver-memory 14g \
    --executor-memory 4g \
    --num-executors 100 \
    --executor-cores 1 \
    --conf spark.shuffle.io.preferDirectBufs=false \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.maxExecutors=600 \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.executorIdleTimeout=600s \
    --conf spark.yarn.executor.memoryOverhead=2048 \
    --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=512m" \
    --conf spark.network.timeout=300 \
    "$JAR_FILE" \
    --userinfo   $User_INPUT_PATH3 \
    --input_threshold   $THRESHOLD \
    --appusedaily   $User_INPUT_PATH \
    --appexpansion   $User_INPUT_PATH2 \
    --output   $User_OUTPUT_PATH \
    --hashS 100 \
    --hashKey 100 \
    --CateThread 0.7 \
    --conf spark.driver.maxResultSize=3g 
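
A likely explanation, offered as a sketch rather than something verified on this cluster: spark-submit treats everything after the application JAR as arguments for the main class, so a `--conf` placed on the last line is handed to `getres` as an ordinary argument and never reaches Spark. Moving the option ahead of `"$JAR_FILE"` should let it take effect. The abbreviated dry run below only prints the command so the ordering can be checked. The fallback values for `SPARK_HOME` and `JAR_FILE` and the `SUBMIT_CMD` variable are assumptions added for illustration, and most options from the script above are left out for brevity.

```shell
#!/bin/sh
# Hypothetical fallbacks so the sketch runs standalone; the real script sets these elsewhere.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
JAR_FILE="${JAR_FILE:-getres.jar}"

# Build the argument list in the order spark-submit expects:
# spark-submit options first, then the JAR, then the application's own arguments.
set -- \
    --class getres \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 14g \
    --conf spark.driver.maxResultSize=3g \
    "$JAR_FILE" \
    --hashS 100 \
    --hashKey 100 \
    --CateThread 0.7

# Dry run: print the command instead of submitting it.
SUBMIT_CMD="$SPARK_HOME/bin/spark-submit $*"
echo "$SUBMIT_CMD"
# To submit for real, replace the echo with: "$SPARK_HOME/bin/spark-submit" "$@"
```

Once the job is running, the Environment tab of the Spark UI lists the effective properties, which is one way to confirm whether spark.driver.maxResultSize was picked up.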
