After installing and configuring PySpark, the following error appears when running a computation:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/02 23:52:02 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
File "C:\Users\hx\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 184, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "C:\Users\hx\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "" , line 991, in _find_and_load
File "" , line 975, in _find_and_load_unlocked
File "" , line 655, in _load_unlocked
File "" , line 618, in _load_backward_compatible
File "" , line 259, in load_module
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\__init__.py", line 51, in <module>
File "" , line 991, in _find_and_load
File "" , line 975, in _find_and_load_unlocked
File "" , line 655, in _load_unlocked
File "" , line 618, in _load_backward_compatible
File "" , line 259, in load_module
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 31, in <module>
File "" , line 991, in _find_and_load
File "" , line 975, in _find_and_load_unlocked
File "" , line 655, in _load_unlocked
File "" , line 618, in _load_backward_compatible
File "" , line 259, in load_module
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\accumulators.py", line 97, in <module>
File "" , line 991, in _find_and_load
File "" , line 975, in _find_and_load_unlocked
File "" , line 655, in _load_unlocked
File "" , line 618, in _load_backward_compatible
File "" , line 259, in load_module
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 71, in <module>
File "" , line 991, in _find_and_load
File "" , line 975, in _find_and_load_unlocked
File "" , line 655, in _load_unlocked
File "" , line 618, in _load_backward_compatible
File "" , line 259, in load_module
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\cloudpickle.py", line 145, in <module>
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
TypeError: an integer is required (got type bytes)
[Stage 0:> (0 + 1) / 1]23/09/02 23:52:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:103)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 18 more
This machine has multiple Python installations (3.8, 3.7, and Anaconda's), but the PySpark environment was only configured for 3.7. Even after switching the IDE's interpreter to 3.7, execution still invoked another Python's runpy.py (note the Python38 path at the top of the traceback), causing the error. The TypeError: an integer is required (got type bytes) raised inside cloudpickle.py is the well-known symptom of running Spark 2.4.x on Python 3.8, which it does not support (2.4 supports up to Python 3.7). Reordering the PATH environment variables made no difference, so I simply deleted the other Python versions, which solved the problem.
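To confirm which interpreter Spark actually hands work to, a quick check like the following helps (a minimal sketch, not from the original post; run it after applying one of the fixes below to verify that the driver and the workers agree):

import sys
from pyspark import SparkContext

sc = SparkContext("local[*]", "interpreter-check")
print("driver:", sys.version)
# Each worker runs in its own Python process; ask one to report its version.
print("worker:", sc.parallelize([0], 1).map(lambda _: __import__("sys").version).first())
sc.stop()

If the worker line still shows the wrong version, pin the interpreter explicitly with either option below.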
Option 1: set the environment variable PYSPARK_PYTHON="path to your Python interpreter".
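On Windows this can be done once from a command prompt. The path below is only an example; substitute the interpreter that actually has pyspark configured (note that setx only takes effect in newly opened shells):

setx PYSPARK_PYTHON "C:\Users\hx\AppData\Local\Programs\Python\Python37\python.exe"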
Option 2: add the following two lines at the very top of your script, before any SparkContext is created:
import os
os.environ['PYSPARK_PYTHON'] = r"path to your Python interpreter"
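For completeness, here is a runnable sketch of Option 2. The interpreter path is a hypothetical example for this machine's Python 3.7, and setting PYSPARK_DRIVER_PYTHON as well is an extra precaution (not part of the original fix) that keeps the driver on the same interpreter as the workers:

import os
# Hypothetical path -- point this at the interpreter where pyspark is installed.
os.environ['PYSPARK_PYTHON'] = r"C:\Users\hx\AppData\Local\Programs\Python\Python37\python.exe"
os.environ['PYSPARK_DRIVER_PYTHON'] = os.environ['PYSPARK_PYTHON']

from pyspark import SparkContext

sc = SparkContext("local[*]", "demo")
# A trivial job that forces a Python worker to start; it fails with the
# errors shown above if the wrong interpreter is picked up.
print(sc.parallelize(range(10)).map(lambda x: x * x).collect())
sc.stop()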