Building an RDD, method 1: textFile()
The word.txt file used here contains:
Hadoop is good
Spark is good
Spark is better
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
# Create a local SparkContext and wrap it in a SparkSession
sc = SparkContext("local")
spark = SparkSession(sc)
# Read the text file as an RDD with one element per line
lines = sc.textFile("/root/pythonlearn/word.txt")
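A quick way to confirm the file was read is to ask the RDD for its size and its first element; this is only a sanity-check sketch against the word.txt shown above:
print(lines.count())   # expected: 3
print(lines.first())   # expected: 'Hadoop is good'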
filter (keep only the elements that satisfy a predicate):
linesresult = lines.filter(lambda line: "Spark" in line)
linesresult.foreach(print)
When this is run as a script, foreach(print) may show nothing, because the print runs on the executors rather than on the driver; in an interactive terminal session the output does appear.
linesresult.collect()
Result (later examples all use collect() to show output):
['Spark is good', 'Spark is better']
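For large RDDs, collect() pulls the entire dataset back to the driver; a gentler alternative is take(n), which returns only the first n elements. A minimal sketch on the same filtered RDD:
# Bring back only the first element instead of the whole RDD
print(linesresult.take(1))   # expected: ['Spark is good']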
map (apply a function to every element):
linesresult = lines.map(lambda line: line.split(" "))
linesresult.foreach(print)
linesresult.collect()
Result:
[['Hadoop', 'is', 'good'], ['Spark', 'is', 'good'], ['Spark', 'is', 'better']]
flatMap() is similar to map(): it first applies the map function to every element, then flattens the resulting collections into a single one:
linesresult = lines.flatMap(lambda line: line.split(" "))
linesresult.foreach(print)
linesresult.collect()
Result:
['Hadoop', 'is', 'good', 'Spark', 'is', 'good', 'Spark', 'is', 'better']
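To make the "map, then flatten" description concrete, the flatMap output above can be reproduced by mapping first and then flattening the collected lists on the driver (a driver-side illustration only, not how Spark itself executes flatMap):
from itertools import chain
# map gives one list of words per line; chaining those lists reproduces the flatMap output
mapped = lines.map(lambda line: line.split(" ")).collect()
flattened = list(chain.from_iterable(mapped))
print(flattened)   # same as lines.flatMap(lambda line: line.split(" ")).collect()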
Building an RDD, method 2: parallelize()
rdd = sc.parallelize([2, 3, 4])
print(sorted(rdd.flatMap(lambda x: range(1, x)).collect()))
print(sorted(rdd.map(lambda x: [(x, x), (x, x)]).collect()))
print(sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect()))
Output:
[1, 1, 1, 2, 2, 3]
[[(2, 2), (2, 2)], [(3, 3), (3, 3)], [(4, 4), (4, 4)]]
[(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
In the first line, flatMap(lambda x: range(1, x)) produces [1] for x=2, [1, 2] for x=3 and [1, 2, 3] for x=4; flattening and sorting gives [1, 1, 1, 2, 2, 3]. The last two lines show that map keeps each element's list of pairs nested, while flatMap merges them into one flat list of pairs.
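parallelize() also accepts an optional second argument, numSlices, which controls how many partitions the data is split into; a small sketch, reusing the same sc, with getNumPartitions() reporting the partition count:
rdd2 = sc.parallelize([2, 3, 4], 2)   # distribute the list across 2 partitions
print(rdd2.getNumPartitions())        # expected: 2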
PySpark depends on Spark; Spark itself is written in Scala, and Scala runs on top of Java (the JVM).
Configuring PySpark therefore requires (a quick verification sketch follows the list):
1. A Python interpreter, preferably version 3.6 or lower, with its environment variables configured
2. Java, preferably JDK 1.8, with its environment variables configured
3. A Scala environment, with its environment variables configured
4. Hadoop, with its environment variables configured
5. Spark (PySpark is bundled inside Spark), with its environment variables configured
6. The pyspark library for Python
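Once everything above is installed, a minimal smoke test run in a fresh Python session (a sketch; the master URL and app name are only illustrative) confirms that pyspark, the JDK and Spark work together:
from pyspark import SparkContext
# Start a local SparkContext, run one trivial job, then shut it down
sc = SparkContext("local", "smoke-test")
print(sc.version)                                                 # Spark version in use
print(sc.parallelize(range(5)).map(lambda x: x * x).collect())    # expected: [0, 1, 4, 9, 16]
sc.stop()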
For more detail on map, filter, and flatMap in Scala, see:
Scala_函数式编程以及简单的map,rreduce_Gadaite的博客-CSDN博客