Running PySpark Standalone on macOS

1. Install Spark. If you do not need HDFS, you can start Spark directly; if you do need HDFS, start Hadoop first and then start Spark (a sketch of the HDFS case follows step b below).
a. Change into the Spark installation directory

cd /Users/jingwang/Documents/tools/spark-2.1.1-bin-hadoop2.7

b. Launch the interactive shell

bin/pyspark
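
If you do need HDFS, the startup order matters: bring the Hadoop daemons up before launching PySpark. A minimal sketch, assuming Hadoop is installed and that a HADOOP_HOME variable points at its install directory (the variable and path are assumptions; substitute your own):

$HADOOP_HOME/sbin/start-dfs.sh    # start the HDFS daemons first
bin/pyspark                       # then launch the PySpark shell as above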

If you have set the environment variables so that the interactive shell uses Jupyter Notebook, bin/pyspark opens a Jupyter page instead of the plain Python shell, and you can run the test program below in a notebook cell.
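
A minimal sketch of that configuration, assuming a bash shell and that the lines live in ~/.bash_profile (the file location is an assumption; adjust for your shell):

export PYSPARK_DRIVER_PYTHON=jupyter          # use Jupyter as the PySpark driver
export PYSPARK_DRIVER_PYTHON_OPTS=notebook    # launch it in notebook mode

After source ~/.bash_profile, re-running bin/pyspark opens the notebook. The test program: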

from pyspark import SparkContext
sc = SparkContext('local', 'test')
logFile = "file:///Users/jingwang/Documents/tools/spark-2.1.1-bin-hadoop2.7/README.md"
logData = sc.textFile(logFile, 2).cache()  # read README.md into an RDD with 2 partitions; cache it since it is used twice
numAs = logData.filter(lambda line: 'a' in line).count()  # count lines containing 'a'
numBs = logData.filter(lambda line: 'b' in line).count()  # count lines containing 'b'
print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))

If this fails with the error:

“Cannot run multiple SparkContexts at once; existing SparkContext(app=test, master=local) created by __init__ at :2”

it means a SparkContext is already running: the pyspark shell creates one named sc at startup. Stop the existing context before executing the program above, i.e., first run:

sc.stop()  # shut down the existing SparkContext

Then run the test program again:

from pyspark import SparkContext
sc = SparkContext('local', 'test')
logFile = "file:///Users/jingwang/Documents/tools/spark-2.1.1-bin-hadoop2.7/README.md"
logData = sc.textFile(logFile, 2).cache()
numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()
print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))

Output: Lines with a: 62, Lines with b: 30
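
Incidentally, you can sidestep the multiple-contexts error altogether with SparkContext.getOrCreate, which reuses a running context instead of trying to build a second one. A minimal sketch (the 'local' master and 'test' app name simply mirror the example above):

from pyspark import SparkConf, SparkContext

# Reuse the running SparkContext if one exists (e.g., the shell's sc);
# otherwise create a new one from this configuration.
conf = SparkConf().setMaster('local').setAppName('test')
sc = SparkContext.getOrCreate(conf)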
Finding the right way to learn really matters. I strongly recommend the Database Lab of Prof. Lin Ziyu at Xiamen University: http://dblab.xmu.edu.cn/blog/1709-2/
