For the detailed configuration steps, see: Detailed steps for setting up a Spark+Python development environment on a Windows PC
Following the configuration process above with Anaconda 5.1 (Python 3.6) + Java 1.7.0_79 + Spark 2.0.1 + Hadoop 2.6.0, the following error occurs:
AttributeError: 'module' object has no attribute 'bool_'
Possible cause of the error: Spark 2.0.1 predates official Python 3.6 support, so its bundled PySpark code is incompatible with Python 3.6.
Solution:
Configuration succeeds with Anaconda 4.2.0 (Python 3.5) + Java 1.7.0_79 + Spark 2.0.1 + Hadoop 2.6.0.
Anaconda 4.2.0 download
Note: when installing py4j during the tutorial's setup steps, close Jupyter Notebook first.
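The py4j step mentioned above is normally just a pip install from an Anaconda prompt (shown as a sketch; run it with the notebook closed so the install is not blocked by a running kernel):

```shell
# Install py4j, the Java<->Python bridge that PySpark depends on
pip install py4j
```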
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('My_App') \
    .master('local') \
    .getOrCreate()

df = spark.read.csv('example.csv', header=True)
df.printSchema()
The output describes the data's schema:
root
 |-- SHEDID: string (nullable = true)
 |-- time: string (nullable = true)
 |-- RT: string (nullable = true)
 |-- LEASE: string (nullable = true)
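Every column above comes back as string because spark.read.csv only parses types when inferSchema is enabled; with header=True alone, each field stays a raw string. The behavior can be sketched in plain Python without Spark (the sample rows below are made up to match the SHEDID/time/RT/LEASE headers; the infer helper is a crude stand-in for what inferSchema=True does):

```python
import csv
import io

# Hypothetical sample data matching the schema above; values are illustrative only.
raw = "SHEDID,time,RT,LEASE\n101,2016-01-01 08:00,1,30\n102,2016-01-01 09:15,0,45\n"

# Like spark.read.csv(..., header=True) without inferSchema: every field is a str.
rows = list(csv.DictReader(io.StringIO(raw)))
print(type(rows[0]['SHEDID']).__name__)  # str

# Rough sketch of type inference: narrow a column to int only if all values parse.
def infer(values):
    try:
        [int(v) for v in values]
        return 'int'
    except ValueError:
        return 'string'

types = {col: infer([r[col] for r in rows]) for col in rows[0]}
print(types)  # {'SHEDID': 'int', 'time': 'string', 'RT': 'int', 'LEASE': 'int'}
```

With Spark itself, passing inferSchema=True to spark.read.csv triggers an extra pass over the data to pick column types.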
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile('words.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()
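To see what each stage of the word-count pipeline produces without a running Spark context, the same flatMap → map → reduceByKey flow can be sketched in plain Python (the in-memory lines list stands in for the assumed contents of words.txt):

```python
from operator import add
from functools import reduce
from itertools import groupby

lines = ["hello spark", "hello python"]  # stand-in for sc.textFile('words.txt')

# flatMap: split every line into words and flatten into a single list
words = [w for line in lines for w in line.split(' ')]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(add): group the pairs by word, then sum the counts in each group
counts = {
    word: reduce(add, (c for _, c in group))
    for word, group in groupby(sorted(pairs), key=lambda p: p[0])
}
print(counts)  # {'hello': 2, 'python': 1, 'spark': 1}
```

Spark performs the same grouping in a distributed shuffle, so reduceByKey scales to files far larger than memory, but the per-record logic is identical.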