A problem encountered when using PySpark in a notebook

Code:

from pyspark import SparkContext

sc = SparkContext()
# The original snippet used `rdd` without defining it; create a small example RDD here
rdd = sc.parallelize(range(8), 4)
rdd.getNumPartitions()   # number of partitions, 4 in this example
rdd.glom().collect()     # collect the elements grouped by partition (a list of lists)

Problem encountered:
Running rdd.glom().collect() raises the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.6 than that in driver 2.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set
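The error says the worker processes run Python 3.6 while the driver runs Python 2.7. As a quick driver-side check (this diagnostic is my own addition, not part of the original post), you can print the driver's Python version together with the two environment variables Spark consults, before touching the cluster configuration:

# Minimal sanity check on the driver side (assumption: run in the same notebook session).
import os
import sys

print("driver Python        :", "%d.%d" % sys.version_info[:2])
print("PYSPARK_PYTHON       :", os.environ.get("PYSPARK_PYTHON"))
print("PYSPARK_DRIVER_PYTHON:", os.environ.get("PYSPARK_DRIVER_PYTHON"))
# If PYSPARK_PYTHON is unset or points at a different interpreter than the
# driver's, the workers may start under another Python and raise the error above.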

Solution:
Add the following environment variables on every node in the cluster:
export PYSPARK_DRIVER_PYTHON=/usr/local/anacond/bin/python3
export PYSPARK_PYTHON=/usr/local/anacond/bin/python3
Remember to run source so the variables take effect, then restart all nodes in the cluster and restart Spark.
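As a complementary sketch (an assumption on my part, reusing the interpreter path from the post), the same two variables can also be set from inside the notebook process before the SparkContext is created; the cluster-wide exports above are still what fixes the executors on the other nodes:

# Sketch: point both driver and workers at the same Python 3 interpreter.
# The path below is copied from the post and is an assumption about this setup.
import os
from pyspark import SparkContext

os.environ["PYSPARK_PYTHON"] = "/usr/local/anacond/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anacond/bin/python3"

sc = SparkContext()   # must be created after the variables are set
rdd = sc.parallelize(range(8), 4)
print(rdd.glom().collect())   # should now run without the version-mismatch error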
