Developing PySpark in PyCharm

Download the Spark package and unpack it locally (the directory is referred to as ${spark_dir} below).

Configuration

Configure Spark parameters
vim ${spark_dir}/conf/spark-env.sh
export SPARK_LOCAL_IP=$(ifconfig | grep -1a en0 | grep netmask | awk '{print $2}')
export HADOOP_CONF_DIR=$SPARK_HOME/conf
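If the ifconfig pipeline does not pick up the right address on your network, the same value can be resolved with a few lines of Python and pasted into spark-env.sh by hand (a throwaway helper, not part of Spark):

import socket

# Resolve the primary local IPv4 address by "connecting" a UDP socket
# (no packets are actually sent); paste the printed value into SPARK_LOCAL_IP.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("8.8.8.8", 80))
print(s.getsockname()[0])
s.close()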

vim ${spark_dir}/conf/spark-defaults.conf
spark.master local
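The master can also be set in code when the session is built, which is handy for local development because it does not depend on the conf file; a minimal sketch (local[*] uses all local cores, plain local uses a single thread):

from pyspark.sql import SparkSession

# Settings made on the builder take precedence over spark-defaults.conf.
spark = (SparkSession.builder
         .appName("local-dev")
         .master("local[*]")
         .getOrCreate())
print(spark.sparkContext.master)  # -> local[*]
spark.stop()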

Configure the system environment (the py4j-0.10.3-src.zip name below must match the py4j zip actually shipped under ${spark_dir}/python/lib, which varies by Spark version)
vim ~/.bash_profile
SPARK_HOME=${spark_dir}
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_PYTHON=python
export SPARK_HOME

PyCharm does read .bash_profile, but when it runs a script it overwrites PYTHONPATH, so the Spark entries added above are lost.
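You can confirm this by printing sys.path from a script run inside PyCharm; if the two Spark entries are missing, PYTHONPATH was overridden (a quick diagnostic, nothing Spark-specific):

import sys

# The ${spark_dir}/python and py4j entries from PYTHONPATH should appear here.
for p in sys.path:
    if "spark" in p.lower() or "py4j" in p.lower():
        print(p)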

The fix is to add the paths to the project interpreter inside PyCharm: Preferences -> Project Interpreter -> Show All -> select the interpreter -> Show Paths for the Selected Interpreter, then add ${spark_dir}/python and the py4j zip from ${spark_dir}/python/lib.


(Screenshots: adding the two Spark paths to the PyCharm project interpreter.)
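Once the paths are added, a quick import check from a new PyCharm run configuration confirms the setup (assuming SPARK_HOME was exported in .bash_profile as above):

import os
import pyspark

# SPARK_HOME comes from .bash_profile; pyspark imports only if the
# python/ and py4j paths were added to the interpreter paths above.
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
print("pyspark", pyspark.__version__, "from", pyspark.__file__)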

With that in place, Spark jobs can be developed and run locally. The standard Pi estimation example shipped with Spark works as-is:

from __future__ import print_function
import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession.builder.appName("PythonPi").getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

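    # Monte Carlo estimate: sample points uniformly in [-1, 1] x [-1, 1];
    # the fraction that lands inside the unit circle approximates pi/4.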
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
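Run the script directly from PyCharm, or save it (as pi.py, say) and launch it with ${spark_dir}/bin/spark-submit pi.py 10, where the trailing argument is the number of partitions.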
