Fixing "SparkException: Python worker failed to connect back." when running a PySpark project

Contents

1. Background

2. Cause of the error

3. Solution

4. Test code


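1. Background

I was running a PySpark project in PyCharm on a Windows 10 machine that had no Spark environment configured. The pyspark library had been installed with pip install pyspark, and the editor flagged nothing in the code, but running the project failed with "SparkException: Python worker failed to connect back.":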

[Figure 1: screenshot of the error output]

2. Cause of the error

Spark cannot locate the Python environment, so the Python interpreter it should use has to be specified explicitly.
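
To see what Spark has to work with, it can help to print the relevant values before changing anything. The sketch below is only a diagnostic and is not part of the original project; the function name describe_python_for_spark is made up for illustration. It reads the PYSPARK_PYTHON environment variable and the path of the interpreter running the script. If the first value is None, the variable is unset and Spark has to rely on whatever default Python command it finds.

```python
import os
import sys


def describe_python_for_spark():
    """Collect the values relevant to how PySpark picks a Python interpreter.

    Hypothetical helper for illustration only.
    """
    return {
        # None when the variable has not been set for this process
        "PYSPARK_PYTHON": os.environ.get("PYSPARK_PYTHON"),
        # Full path of the interpreter that is running this script
        "current_interpreter": sys.executable,
    }


if __name__ == "__main__":
    for name, value in describe_python_for_spark().items():
        print(f"{name} = {value}")
```

The path printed as current_interpreter is usually a suitable value for PYSPARK_PYTHON when the project runs with PyCharm's configured interpreter.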

3. Solution

(1) As shown in the figure, open the run configuration editor (Edit Configurations):

[Figure 2: screenshot of opening the run configuration editor]

(2) As shown in the figure, click the button for editing environment variables:

[Figure 3: screenshot of the environment variables editor]

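(3) As shown in the figure, add an environment variable named PYSPARK_PYTHON whose value points to the Python interpreter Spark should use: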

[Figure 4: screenshot of adding the PYSPARK_PYTHON variable]

(4) Click OK, then Apply, and run the project again. The error is gone.
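
The same variable can also be set from inside the script instead of through the PyCharm dialog. The following is a minimal sketch of that alternative, not the method described above; the function name point_spark_at_current_python is hypothetical. It assumes the interpreter running the script is the one Spark should use, and it leaves any value already configured (for example in PyCharm) untouched.

```python
import os
import sys


def point_spark_at_current_python():
    """Set PYSPARK_PYTHON to the running interpreter unless it is already set.

    Hypothetical helper for illustration; returns the value now in effect.
    """
    # setdefault keeps an existing value, so a PyCharm run configuration wins
    return os.environ.setdefault("PYSPARK_PYTHON", sys.executable)


if __name__ == "__main__":
    # Call this before creating the SparkContext so Spark sees the variable
    print(point_spark_at_current_python())
```

Setting the variable in code keeps the fix with the project, while the run configuration approach keeps machine-specific paths out of the source. Which one fits better depends on how the project is shared.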

4. Test code

The test code is a simple word-frequency count, included here for completeness:

from pyspark import SparkConf, SparkContext


# Word count
def word_statistics(words):
    conf = SparkConf().setMaster("local[*]").setAppName("Word_Statistics")
    sc = SparkContext(conf=conf)
    try:
        # Pair each word with 1, then sum the 1s per word
        rdd = sc.parallelize(words)
        counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
        print(counts.collect())
    finally:
        # Stop the context so the local Spark resources are released
        sc.stop()


if __name__ == "__main__":
    words = ["test1", "test2", "test1", "test2", "test3", "test2", "test1", "test5", "test4", "test2", "test6", "test7"]
    word_statistics(words)
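
To check the Spark result once the job runs, the counts can be recomputed in plain Python. This is a sketch added for verification only; expected_word_counts is a hypothetical helper and uses collections.Counter rather than Spark. Because the order of the pairs returned by collect() should not be relied on, comparing dictionaries is safer than comparing lists.

```python
from collections import Counter


def expected_word_counts(words):
    """Count words without Spark, as a reference for the Spark result."""
    return dict(Counter(words))


if __name__ == "__main__":
    words = ["test1", "test2", "test1", "test2", "test3", "test2",
             "test1", "test5", "test4", "test2", "test6", "test7"]
    reference = expected_word_counts(words)
    print(reference)
    # With the Spark version, a comparison could look like:
    #   dict(counts.collect()) == reference
```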
