Installing PySpark 2.4 on Ubuntu 16.04

A quick note on getting PySpark 2.4 installed.
First, make sure Java 1.8 is configured (that part is easy). This guide is based on Python 3.7; if you have not installed it yet, see here.
Then download the package spark-2.4.0-bin-hadoop2.7.tgz:

$ tar -zvxf spark-2.4.0-bin-hadoop2.7.tgz
# configure the environment variables
$ vi ~/.bashrc
export JAVA_HOME=/usr/bin/java/jdk1.8.0_102
export SPARK_HOME=/home/ubuntu/spark-2.4.0-bin-hadoop2.7
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PATH=$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin

$ source ~/.bashrc
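
After re-sourcing ~/.bashrc, a quick optional check from Python confirms that the variables are visible to the interpreter (the values should match the paths exported above):

import os
# print the Spark-related variables set in ~/.bashrc;
# "<not set>" means the current shell has not picked them up yet
for var in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
    print(var, "=", os.environ.get(var, "<not set>"))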

Then install pyspark:

$ cd /home/ubuntu/spark-2.4.0-bin-hadoop2.7/python
$ python3 setup.py install
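
If the install succeeded, the package should already be importable; a minimal check (assuming the 2.4.0 distribution above is the one that was installed):

import pyspark
# should print 2.4.0 when the package built from the Spark tarball is picked up
print(pyspark.__version__)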

That completes the installation, but you may hit a "JAVA_HOME not found" error. In that case, locate the file called load-spark-env.sh under the bin directory of the pyspark package inside your Python site-packages, and add the following to it:

# command to locate the file
$ sudo find /opt/python3.7  -name load-spark-env.sh
$ vi load-spark-env.sh
export JAVA_HOME=/usr/bin/java/jdk1.8.0_102
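
If you would rather not patch load-spark-env.sh, a rough alternative (just a sketch, using this guide's JDK path; adjust it to your machine) is to export JAVA_HOME from the driver script itself before the SparkContext is created:

import os
# make JAVA_HOME visible to the process that launches the Spark JVM;
# this must run before SparkContext is created
os.environ.setdefault("JAVA_HOME", "/usr/bin/java/jdk1.8.0_102")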

Even with that, the program may still raise an error:

Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

So you can add this line to the program:

import os
os.environ["PYSPARK_PYTHON"] = "python3"

Verification program:

# -*- coding:UTF-8 -*-
import os
from pyspark import SparkContext, SparkConf

# make sure the workers use python3 as well; set this before the context is created
os.environ["PYSPARK_PYTHON"] = "python3"

conf = SparkConf().setAppName("miniProject").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

# flatMap() applies the same operation to every item in the RDD, getting a list for each item,
# then flattens all of those lists into one new collection of results
sentencesRDD = sc.parallelize(['Hello world', 'My name is Patrick'])
wordsRDD = sentencesRDD.flatMap(lambda sentence: sentence.split(" "))
print(wordsRDD.collect())
print(wordsRDD.count())
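
If everything is wired up correctly, the two print calls should output the flattened word list and its length:

['Hello', 'world', 'My', 'name', 'is', 'Patrick']
6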

If that runs, you are all set. If you also want to install Hadoop, see here.
