PySpark fails with the following error:
Caused by: java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:477)
at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:297)
at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:307)
at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:307)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:680)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:434)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:269)
Process finished with exit code 1
Workaround:
Edit the file spark-2.4.3-bin-hadoop2.7\python\pyspark\worker.py and add the marked lines to the process function:
def process():
    iterator = deserializer.load_stream(infile)
    serializer.dump_stream(func(split_index, iterator), outfile)
    # added snippet: drain any remaining input so the worker does not
    # exit while the executor is still writing to the socket
    for obj in iterator:
        pass
Then rebuild pyspark.zip in the python/lib folder so that the change is included.
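If you would rather script the rebuild than run a zip command by hand, a minimal Python sketch along these lines can repack the archive; the SPARK_HOME environment variable and the standard layout of the binary distribution are assumptions here:

import os
import zipfile

spark_home = os.environ["SPARK_HOME"]  # assumed to point at spark-2.4.3-bin-hadoop2.7
python_dir = os.path.join(spark_home, "python")
zip_path = os.path.join(python_dir, "lib", "pyspark.zip")

with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    pkg_dir = os.path.join(python_dir, "pyspark")
    for root, _, files in os.walk(pkg_dir):
        for name in files:
            if name.endswith(".pyc"):
                continue  # skip compiled files, keep only the sources
            full = os.path.join(root, name)
            # archive entries are relative to python/, e.g. pyspark/worker.py,
            # so that "import pyspark" works when the zip is on PYTHONPATH
            zf.write(full, os.path.relpath(full, python_dir))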
The original explanation is as follows:
The issue may be that the worker process is completing before the executor has written all the data to it. The thread writing the data down the socket throws an exception, and if this happens before the executor marks the task as complete it will cause trouble. The idea is to try to get the worker to pull all the data from the executor even if it's not needed to lazily compute the output. This is very inefficient, of course, so it is a workaround rather than a proper solution.
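For reference, here is a hypothetical job of the kind that has been reported to hit this error: an action such as take() lets the Python worker finish after consuming only part of its input while the executor is still writing the rest of the partition. The dataset size and partitioning below are made up purely for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("connection-reset-repro").getOrCreate()
sc = spark.sparkContext

# One large partition; take(1) means the Python worker can stop reading
# long before the executor has pushed the whole partition down the socket.
rdd = sc.range(0, 10000000, numSlices=1).map(lambda x: x * 2)
print(rdd.take(1))

spark.stop()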