Setting Up a Python + Spark Development Environment on Windows

1. Configuration Process

For the detailed configuration steps, refer to: Detailed Steps for Setting Up a Spark + Python Development Environment on a Windows PC

Following the procedure above with Anaconda 5.1 (Python 3.6) + Java 1.7.0_79 + Spark 2.0.1 + Hadoop 2.6.0, the following error appears:

AttributeError: 'module' object has no attribute 'bool_'

Possible causes of this error:

  • The software versions were not downloaded exactly as specified in the tutorial, mainly because of compatibility concerns with software already installed on the system
  • Spark 2.0.1 does not yet support Python 3.6

Workaround:

  • The configuration succeeds with Anaconda 4.2.0 (Python 3.5) + Java 1.7.0_79 + Spark 2.0.1 + Hadoop 2.6.0;

  • Anaconda 4.2.0 download

Note: when installing py4j as part of the tutorial's setup steps, Jupyter Notebook must be closed first.
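
After the environment variables are set, a quick check from Python can confirm they are visible before starting Spark. The snippet below is only a sketch; it assumes the tutorial sets JAVA_HOME, SPARK_HOME and HADOOP_HOME as Windows environment variables (the exact variable names may differ in your setup).

import os

# Sanity check: print the Spark-related environment variables (assumed names)
# and warn if the directory each one points to does not exist.
for var in ('JAVA_HOME', 'SPARK_HOME', 'HADOOP_HOME'):
    path = os.environ.get(var)
    print(var, '->', path if path else 'NOT SET')
    if path and not os.path.isdir(path):
        print('  Warning: directory does not exist:', path)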

2. Verifying the Installation

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder \
    .appName('My_App') \
    .master('local') \
    .getOrCreate()

# Read a CSV file with a header row and print its schema
df = spark.read.csv('example.csv', header=True)
df.printSchema()

The output is the schema of the data:

root
 |-- SHEDID: string (nullable = true)
 |-- time: string (nullable = true)
 |-- RT: string (nullable = true)
 |-- LEASE: string (nullable = true)

3. WordCount Test

from operator import add

from pyspark import SparkContext


if __name__ == "__main__":
    # Create a SparkContext and load the input text file as an RDD of lines
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile('words.txt')

    # Split each line into words, map each word to (word, 1), and sum the counts
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)

    # Collect the results to the driver and print them
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    sc.stop()
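
To try the script end to end, a small input file is enough. The block below is a minimal, self-contained sketch: it writes a throwaway words.txt (the sample content is an assumption, not data from the original post), runs the same RDD pipeline, and checks the counts. Run it as a standalone script so that only one SparkContext is active.

from operator import add
from pyspark import SparkContext

# Create a tiny sample input file (hypothetical content, for illustration only)
with open('words.txt', 'w') as f:
    f.write('spark python spark\n')

sc = SparkContext(appName="WordCountSelfCheck")

# Same pipeline as above: split lines into words, count occurrences per word
counts = dict(sc.textFile('words.txt')
                .flatMap(lambda line: line.split(' '))
                .map(lambda word: (word, 1))
                .reduceByKey(add)
                .collect())

assert counts == {'spark': 2, 'python': 1}
print(counts)

sc.stop()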
