spark.SparkException: Python worker failed to connect back.执行 spark 操作时 Python 工作线程无法连接回

SparkException:执行 spark 操作时 Python 工作线程无法连接回

spark.SparkException: Python worker failed to connect back.

问问题

当我尝试在 pyspark 执行此命令行时

from pyspark import SparkConf, SparkContext

# 创建SparkConf和SparkContext
conf = SparkConf().setMaster("local").setAppName("lichao-wordcount")
sc = SparkContext(conf=conf)

# 输入的数据
data=["hello","world","hello","word","count","count","hello"]

# 将Collection的data转化为spark中的rdd并进行操作
rdd=sc.parallelize(data)
resultRdd = rdd.map(lambda word: (word,1)).reduceByKey(lambda a,b:a+b)

# rdd转为collecton并打印
resultColl = resultRdd.collect()
for line in resultColl:
    print (line)

# 结束
sc.stop()

我收到以下错误消息:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (Yunfa-PC executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:115)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
        at java.net.ServerSocket.implAccept(ServerSocket.java:545)
        at java.net.ServerSocket.accept(ServerSocket.java:513)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:174)
        ... 19 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:115)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
        at java.net.ServerSocket.implAccept(ServerSocket.java:545)
        at java.net.ServerSocket.accept(ServerSocket.java:513)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:174)
        ... 19 more

问题

这种情况大多表示相关python环境没有配置好

解决尝试

我尝试了以下步骤:

  • 检查环境变量。
  • 在 Windows 10 步骤中检查 Apache Spark 安装。
  • 使用不同版本的 Apache Spark(试过 2.4.3 / 2.4.2 / 2.3.4)。
  • 禁用我安装的防火墙窗口和防病毒软件。
  • 尝试手动初始化 SparkContext sc = spark.sparkContext(在 Stackoverflow中的这个问题上找到了这个可能的解决方案,对我不起作用)。
  • 试图改变的价值PYSPARK_DRIVER_PYTHON来自jupyteripython,在此表示的链接,没有成功。

上述步骤都不适合我,我找不到解决方案。

实际上我正在使用以下版本:

Python 3.7.3、Java JDK 11.0.6、Windows 10、Apache Spark 2.3.4

解决方案

我只是配置了以下变量环境,现在它正常工作:

  • HADOOP_HOME = C:\Hadoop
  • JAVA_HOME = C:\Java\jdk-11.0.6
  • PYSPARK_DRIVER_PYTHON = jupyter
  • PYSPARK_DRIVER_PYTHON_OPTS = notebook
  • PYSPARK_PYTHON = python

实际上我正在使用以下版本:

Python 3.7.3、Java JDK 11.0.6、Windows 10、Apache Spark 2.4.3 和使用带有 pyspark 的 Jupyter Notebook。
如果使用的是jupyter lab,则需要更改- PYSPARK_DRIVER_PYTHON_OPTS = lab

配置教程

jupyter 的设置

  • 添加PYSPARK_DRIVER_PYTHON=jupyter到系统变量
  • 添加PYSPARK_DRIVER_PYTHON_OPTS=lab到系统变量

spark.SparkException: Python worker failed to connect back.执行 spark 操作时 Python 工作线程无法连接回_第1张图片

注意,我是用 jupyter lab 做编辑器,而不是 jupyter notebook,若是以 notebook 作编辑器,将 PYSPARK_DRIVER_PYTHON_OPTS=notebook 添加到系统变量即可。

成功的标志是运行以下代码没有出毛病:

from pyspark import SparkContext
sc = SparkContext("local", "Hello World App")

查看版本和相关信息

spark.SparkException: Python worker failed to connect back.执行 spark 操作时 Python 工作线程无法连接回_第2张图片

如图中有个 sparkUI 的链接,点进去可查看Spark的运行情况等。

你可能感兴趣的:(python,spark)