Windows + local PySpark

Reference: setting up a local standalone PySpark
https://www.cnblogs.com/jackchen-Net/p/6667205.html#_label3

Create a pyspark.pth file under site-packages and add the line D:\apps\spark\spark-2.2.1-bin-hadoop2.7\python to it.
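To confirm the .pth entry is picked up, here is a minimal check from a fresh Python session (a sketch; it just lists the sys.path entries that mention spark):

# Verify that site-packages\pyspark.pth added the Spark python directory to sys.path
import sys
print([p for p in sys.path if "spark" in p.lower()])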

Submitting a .py file fails with:
ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
This means the Hadoop executables are missing, because we never installed Hadoop.

Also refer to this post for installing Hadoop (remember, Hadoop only):
http://blog.csdn.net/yaoqiwaimai/article/details/59114881
winutils.exe chmod 777 c:\tmp\hive (create the folder first, then run the command)
Do not run spark-shell afterwards (that is the Scala shell; our goal here is pyspark).
If your winutils.exe does not match the Hadoop version (Hadoop 2.7.3 here), download a matching build (leave your email if you need it): unzip hadoop.dll-and-winutils.exe-for-hadoop2.7.3-on-windows_X64-master.zip and copy all of its files into Hadoop's bin folder.
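A quick sanity check that Spark will be able to find winutils (a sketch; it assumes HADOOP_HOME points at your Hadoop install directory, whose bin folder now holds winutils.exe and hadoop.dll):

# Check that HADOOP_HOME is set and that bin\winutils.exe is in place
import os
hadoop_home = os.environ.get("HADOOP_HOME", "")
winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
print("HADOOP_HOME =", hadoop_home or "<not set>")
print("winutils.exe found:", os.path.isfile(winutils))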

Submitting the script again still gives an error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
Switching from JDK 9 back to JDK 8 made the error go away (a self-inflicted trap).
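To double-check which JVM is on PATH before submitting again, a small check (a sketch; java -version writes to stderr, so stderr is captured as well — Spark 2.2 wants JDK 8, not JDK 9):

# Print the version of the java executable Spark will launch
import subprocess
out = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
print(out.decode())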

Too many log messages: in SPARK_HOME\conf, copy log4j.properties.template, drop the .template suffix, and in the line log4j.rootCategory=INFO, console change INFO to WARN.
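The relevant line in conf\log4j.properties after the edit (only this line changes; the rest stays as in the template):

log4j.rootCategory=WARN, console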
The test script passed to spark-submit:


# encoding: utf-8
import os
import sys
# os.environ['SPARK_HOME'] = r"D:\apps\spark\spark-2.2.1-bin-hadoop2.7"
# os.environ['HADOOP_HOME'] = r"D:\apps\spark"  # point at the directory whose bin\ contains winutils.exe

# You might need to enter your local IP
# os.environ['SPARK_LOCAL_IP']="192.168.2.138"

# Path for pyspark and py4j
sys.path.append(r"D:\apps\spark\spark-2.2.1-bin-hadoop2.7\python")
# sys.path.append(r"D:\apps\spark\spark-2.2.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip")  # not needed once py4j is pip-installed


try:
    from pyspark import SparkContext
    from pyspark import SparkConf

    print("Successfully imported Spark modules")
except ImportError as e:
    print("Cannot import Spark modules:", e)
    sys.exit(1)

# Create a local SparkContext and run a tiny job to verify the whole setup
sc = SparkContext(master='local')
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
print("Count result:", words.count())
sc.stop()  # shut down the local SparkContext
