In Spark, a single-Value operator transforms each element of one RDD and returns a new RDD. The commonly used single-Value operators are illustrated below (assuming an active SparkContext `sc`):
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 2).collect()
# result: [2, 4, 6, 8, 10]
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.flatMap(lambda x: range(x, x + 3)).collect()
# result: [1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6, 5, 6, 7]
rdd = sc.parallelize([1, 2, 3, 4, 5], 2)  # two partitions
result = rdd.glom().collect()
# result: [[1, 2], [3, 4, 5]]
rdd = sc.parallelize(['apple', 'banana', 'cherry', 'date'])
result = rdd.groupBy(lambda x: x[0]).mapValues(list).collect()
# result: [('a', ['apple']), ('b', ['banana']), ('c', ['cherry']), ('d', ['date'])]
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.filter(lambda x: x % 2 == 0).collect()
# result: [2, 4]
rdd = sc.parallelize(range(10))
result = rdd.sample(False, 0.5).collect()
# result: a random sample of roughly half the elements,
# e.g. [0, 2, 3, 4, 5, 7] (varies per run)
rdd = sc.parallelize([1, 2, 2, 3, 3, 3])
result = rdd.distinct().collect()
# result: [1, 2, 3] (order may vary)
rdd = sc.parallelize([1, 2, 3, 4, 5], 4)  # 4 partitions
result = rdd.coalesce(2)
# result has the same elements [1, 2, 3, 4, 5],
# but result.getNumPartitions() is now 2
rdd = sc.parallelize([1, 2, 3, 4, 5], 2)  # 2 partitions
result = rdd.repartition(4)
# result has the same elements, shuffled across 4 partitions
# (result.getNumPartitions() is 4; exact placement varies)
rdd = sc.parallelize([3, 1, 4, 2, 5])
result = rdd.sortBy(lambda x: x).collect()
# result: [1, 2, 3, 4, 5]
These single-Value operators process each element of an RDD and return a new RDD; they cover common transformation, filtering, and deduplication needs.
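Since each of these operators returns a new RDD, they chain naturally into pipelines. As a plain-Python sketch of the semantics (no Spark required, and only a model: Spark evaluates these lazily and in parallel), the pipeline `rdd.map(...).filter(...).distinct().sortBy(...)` computes the equivalent of:

```python
# Plain-Python model of chaining single-Value operators (semantics only).
data = [3, 1, 4, 1, 5, 9, 2, 6]

mapped = [x * 2 for x in data]                # rdd.map(lambda x: x * 2)
filtered = [x for x in mapped if x % 4 == 0]  # .filter(lambda x: x % 4 == 0)
deduped = set(filtered)                       # .distinct()
result = sorted(deduped)                      # .sortBy(lambda x: x)

print(result)  # [4, 8, 12]
```

Unlike this eager list version, Spark only materializes the result when an action such as collect() is called on the final RDD.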
A two-Value operator takes two RDDs and returns a new RDD. Some commonly used two-Value operators:
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])
result = rdd1.union(rdd2).collect()
# result: [1, 2, 3, 3, 4, 5]
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])
result = rdd1.intersection(rdd2).collect()
# result: [3]
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])
result = rdd1.subtract(rdd2).collect()
# result: [1, 2]
rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize(['a', 'b'])
result = rdd1.cartesian(rdd2).collect()
# result: [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize(['a', 'b', 'c'])
result = rdd1.zip(rdd2).collect()
# result: [(1, 'a'), (2, 'b'), (3, 'c')]
rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana')])
rdd2 = sc.parallelize([(1, 'red'), (2, 'yellow')])
result = rdd1.join(rdd2).collect()
# result: [(1, ('apple', 'red')), (2, ('banana', 'yellow'))]
rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana')])
rdd2 = sc.parallelize([(1, 'red'), (3, 'yellow')])
result = rdd1.leftOuterJoin(rdd2).collect()
# result: [(1, ('apple', 'red')), (2, ('banana', None))]
rdd1 = sc.parallelize([(1, 'apple'), (2, 'banana')])
rdd2 = sc.parallelize([(1, 'red'), (3, 'yellow')])
result = rdd1.rightOuterJoin(rdd2).collect()
# result: [(1, ('apple', 'red')), (3, (None, 'yellow'))]
These two-Value operators combine two RDDs into a new one: union, intersection, and difference on plain RDDs, plus the join family, which pairs records by key.
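To make the set semantics concrete, here is a plain-Python sketch (no Spark required) of what union, intersection, and subtract compute on the sample data above. Note that Spark's union() keeps duplicates, like list concatenation, while intersection() deduplicates its result:

```python
# Plain-Python model of the two-RDD set operators (semantics only).
a = [1, 2, 3]
b = [3, 4, 5]

union_result = a + b                           # rdd1.union(rdd2): keeps duplicates
intersection_result = sorted(set(a) & set(b))  # rdd1.intersection(rdd2): deduplicated

b_set = set(b)
subtract_result = [x for x in a if x not in b_set]  # rdd1.subtract(rdd2)

print(union_result)         # [1, 2, 3, 3, 4, 5]
print(intersection_result)  # [3]
print(subtract_result)      # [1, 2]
```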
Key-Value operators work on pair RDDs, i.e. RDDs whose elements are (key, value) tuples. Some commonly used Key-Value operators:
rdd = sc.parallelize([(1, 2), (1, 3), (2, 4), (2, 5)])
result = rdd.reduceByKey(lambda x, y: x + y).collect()
# result: [(1, 5), (2, 9)]
rdd = sc.parallelize([(1, 2), (1, 3), (2, 4), (2, 5)])
result = rdd.groupByKey().mapValues(list).collect()
# result: [(1, [2, 3]), (2, [4, 5])]
rdd = sc.parallelize([(3, 'apple'), (1, 'banana'), (2, 'orange')])
result = rdd.sortByKey().collect()
# result: [(1, 'banana'), (2, 'orange'), (3, 'apple')]
rdd = sc.parallelize([(1, 'apple'), (2, 'banana')])
result = rdd.mapValues(lambda x: 'fruit ' + x).collect()
# result: [(1, 'fruit apple'), (2, 'fruit banana')]
rdd = sc.parallelize([(1, 'hello world'), (2, 'goodbye')])
result = rdd.flatMapValues(lambda x: x.split()).collect()
# result: [(1, 'hello'), (1, 'world'), (2, 'goodbye')]
rdd = sc.parallelize([(1, 'apple'), (2, 'banana')])
result = rdd.keys().collect()
# result: [1, 2]
rdd = sc.parallelize([(1, 'apple'), (2, 'banana')])
result = rdd.values().collect()
# result: ['apple', 'banana']
Besides the operators above, several other Key-Value operators are frequently useful when working with pair RDDs. Here are a few:
rdd = sc.parallelize([(1, 'apple'), (1, 'banana'), (2, 'orange'), (2, 'banana')])
result = rdd.countByKey()
# result: {1: 2, 2: 2}
rdd = sc.parallelize([(1, 'apple'), (2, 'banana')])
result = rdd.collectAsMap()
# result: {1: 'apple', 2: 'banana'}
rdd = sc.parallelize([(1, 'apple'), (2, 'banana'), (1, 'orange')])
result = rdd.lookup(1)
# result: ['apple', 'orange']
rdd = sc.parallelize([(1, 2), (1, 3), (2, 4), (2, 5)])
result = rdd.foldByKey(0, lambda x, y: x + y).collect()
# result: [(1, 5), (2, 9)]
rdd = sc.parallelize([(1, 2), (1, 3), (2, 4), (2, 5)])
result = rdd.aggregateByKey(0, lambda x, y: x + y, lambda x, y: x + y).collect()
# result: [(1, 5), (2, 9)]
These Key-Value operators support aggregation, grouping, sorting, and value mapping on pair RDDs, making keyed data much easier to process.
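The classic word-count pipeline ties several of the operators in this section together: flatMap splits lines into words, map emits (word, 1) pairs, and reduceByKey sums the counts per key. A plain-Python sketch of the same computation (no Spark required; the loop plays the role of the per-key reduce):

```python
from collections import defaultdict

lines = ["spark makes rdds", "rdds make spark fast"]

words = [w for line in lines for w in line.split()]  # flatMap(lambda l: l.split())
pairs = [(w, 1) for w in words]                      # map(lambda w: (w, 1))

counts = defaultdict(int)                            # reduceByKey(lambda x, y: x + y)
for key, value in pairs:
    counts[key] += value

print(sorted(counts.items()))
# [('fast', 1), ('make', 1), ('makes', 1), ('rdds', 2), ('spark', 2)]
```

In Spark the same pipeline would read `sc.parallelize(lines).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)`, with the reduce running in parallel per key.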