While using PySpark, I needed to call the map method on an RDD and ran into the following error:
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range
from pyspark import SparkConf, SparkContext
import os

# Point PySpark workers at the interpreter inside the project's virtualenv
os.environ["PYSPARK_PYTHON"] = "/Users/week/PycharmProjects/PythonProject/venv/bin/python3"

conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

# Prepare an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Use map to multiply every element by 10
def func(data):
    return data * 10

rdd2 = rdd.map(func)
print(rdd2.collect())
The cause was that the Python version in use, 3.11, was too new: this is a known incompatibility between Python 3.11 and the PySpark release in use, whose pickler fails to serialize functions under 3.11 and raises the IndexError shown above. I switched to Python 3.10.9.
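Before recreating the SparkContext, it can help to confirm which interpreter PYSPARK_PYTHON actually points to. A minimal sketch, assuming the environment variable is set as in the script above (the variable name worker_python is just for illustration):

import os
import subprocess

# Ask the configured worker interpreter for its version string
worker_python = os.environ.get("PYSPARK_PYTHON", "python3")
result = subprocess.run([worker_python, "--version"], capture_output=True, text=True)
print(result.stdout or result.stderr)  # e.g. "Python 3.11.0"

If this prints a 3.11.x version, the pickling failure above is expected.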
Downloading from the official python.org site was very slow, so I downloaded from the Alibaba (npmmirror) mirror instead, which finished in about 3 seconds. Mirror address:
https://registry.npmmirror.com/binary.html?path=python/3.10.9/
Then update the Python interpreter path in the code:
os.environ["PYSPARK_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.10/bin/python3"
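For completeness, here is the full script after the switch; the interpreter path is the macOS framework location from the line above, and the rest is the original example unchanged:

from pyspark import SparkConf, SparkContext
import os

# Point PySpark workers at the newly installed Python 3.10.9
os.environ["PYSPARK_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.10/bin/python3"

conf = SparkConf().setMaster("local[*]").setAppName("test_spark")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])

def func(data):
    return data * 10

rdd2 = rdd.map(func)
print(rdd2.collect())  # expected output: [10, 20, 30, 40, 50]

With Python 3.10 in place, map serializes func without error and collect() returns the transformed list.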