In the previous section we looked at RDD transformations, that is, turning one RDD into another RDD. A transformation does not run immediately: Spark only records the logical operation on the dataset, and the computation is actually triggered only when an action is executed, which submits a Spark job and runs the operators, as the short sketch below illustrates.
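A minimal sketch illustrating this lazy evaluation (the app name lazyDemo is just an illustrative choice):
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="lazyDemo", master="local[*]")
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    # map is a transformation: nothing runs yet, Spark only records the lineage
    mapped = rdd.map(lambda x: x * 10)
    # count is an action: only now is a job submitted and the map actually executed
    print(mapped.count())
    # 5
    sc.stop()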
The action operations are:
reduce: aggregates the elements of the dataset with the function func (which should be commutative and associative so that it can be computed correctly in parallel) and returns the result.
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    # sum all elements: 1 + 2 + 3 + 4 + 5
    print(rdd.reduce(lambda x, y: x + y))
    # 15
    sc.stop()
collect: returns all elements of the dataset to the driver as a list. When the dataset is large this can cause an OOM on the driver, so prefer take or takeSample when inspecting large results.
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    # pull the whole dataset back to the driver as a Python list
    print(rdd.collect())
    # [1, 2, 3, 4, 5]
    sc.stop()
count: returns the number of elements in the dataset. Each partition is counted on the executors, and only the per-partition totals are sent back to the driver to be summed.
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    # number of elements in the RDD
    print(rdd.count())
    # 5
    sc.stop()
first: returns the first element of the dataset, similar to take(1) (which returns a one-element list instead).
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    # first element of the dataset
    print(rdd.first())
    # 1
    sc.stop()
take: returns a list containing the first n elements of the dataset.
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    # first three elements of the dataset
    print(rdd.take(3))
    # [1, 2, 3]
    sc.stop()
takeSample: returns a list of num elements sampled at random from the dataset. The first argument (withReplacement) controls whether an element may be drawn more than once, and seed fixes the random number generator so the sample is reproducible.
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    # sample 3 elements with replacement, using 1314 as the seed
    print(rdd.takeSample(True, 3, 1314))
    # [5, 2, 3]
    sc.stop()
takeOrdered: returns the first n elements of the dataset, using either their natural ordering or a custom key function.
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [5, 1, 4, 2, 3]
    rdd = sc.parallelize(data)
    # smallest three elements in natural order
    print(rdd.takeOrdered(3))
    # [1, 2, 3]
    # negate the key to take the three largest elements instead
    print(rdd.takeOrdered(3, key=lambda x: -x))
    # [5, 4, 3]
    sc.stop()
saveAsTextFile: writes the elements of the dataset as a text file (one element per line) to a file system such as the local file system or HDFS.
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    # writes a directory /home/data containing one part-* file per partition
    rdd.saveAsTextFile("file:///home/data")
    sc.stop()
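saveAsTextFile creates a directory at the target path (and fails if that directory already exists), with one part-* file per partition. A minimal sketch of writing and then reading the result back with textFile; the path file:///home/data_roundtrip is just an illustrative choice:
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="saveAndLoad", master="local[*]")
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    # illustrative output path; saveAsTextFile fails if the directory already exists
    rdd.saveAsTextFile("file:///home/data_roundtrip")
    # textFile reads the directory back; every line comes back as a string
    loaded = sc.textFile("file:///home/data_roundtrip")
    print(loaded.count())
    # 5
    sc.stop()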
countByKey: for an RDD of (K, V) pairs, returns the number of occurrences of each key K.
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [('a', 1), ('b', 2), ('a', 3)]
    rdd = sc.parallelize(data)
    # countByKey returns a dict of key -> count on the driver
    print(sorted(rdd.countByKey().items()))
    # [('a', 2), ('b', 1)]
    sc.stop()
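Note that countByKey collects the per-key counts into a dict on the driver, so it is only suitable when the number of distinct keys is small. With many distinct keys, a reduceByKey keeps the counting distributed; a minimal sketch (the variable names are illustrative):
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="countByKeyAlt", master="local[*]")
    rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
    # reduceByKey is a transformation, so the per-key counts stay distributed
    counts = rdd.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)
    print(sorted(counts.collect()))
    # [('a', 2), ('b', 1)]
    sc.stop()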
foreach: applies the given function to every element of the RDD.
from pyspark import SparkContext
# this function runs on the executors, once for every element
def f(x):
    print(x)
if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3]
    rdd = sc.parallelize(data)
    rdd.foreach(f)
    # 1 2 3 (the order across partitions is not guaranteed)
    sc.stop()
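In local[*] mode the print output shows up in the console, but on a cluster foreach runs inside the executors, so the output goes to the executor logs rather than to the driver. A common way to get a side-effect result back to the driver is an accumulator; a minimal sketch (the app name foreachAccumulator is just illustrative):
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext(appName="foreachAccumulator", master="local[*]")
    acc = sc.accumulator(0)
    rdd = sc.parallelize([1, 2, 3])
    # each executor adds its elements to the accumulator; the driver sees the merged value
    rdd.foreach(lambda x: acc.add(x))
    print(acc.value)
    # 6
    sc.stop()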
This covers all of the RDD action operations.