Version 2.3.0
map(f, preservesPartitioning=False)
Return a new RDD by applying a function to each element of this RDD.
>>> rdd = sc.parallelize(["b", "a", "c"])
>>> sorted(rdd.map(lambda x: (x, 1)).collect())
[('a', 1), ('b', 1), ('c', 1)]
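The second parameter, preservesPartitioning, is easy to overlook. As a rough, hedged sketch (the pair-RDD setup below is only an illustration, not part of the original example): map never changes the number of partitions, and preservesPartitioning=True only matters for key-value RDDs that already have a partitioner, where it tells Spark that the function does not change the keys, so the existing partitioner can be kept.
pairs = sc.parallelize([('a', 1), ('b', 2), ('c', 3)]).partitionBy(2)
# map keeps the partition count either way
print(pairs.map(lambda kv: (kv[0], kv[1] * 10)).getNumPartitions())  # expected: 2
# by default the partitioner is dropped, since Spark cannot know whether f changed the keys
print(pairs.map(lambda kv: (kv[0], kv[1] * 10)).partitioner)  # expected: None
# promising that the keys are untouched should keep the original partitioner
print(pairs.map(lambda kv: (kv[0], kv[1] * 10), preservesPartitioning=True).partitioner)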
1. Add 1 to each element
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local").setAppName("Map")
sc = SparkContext(conf=conf)
rdd1 = sc.parallelize([1, 2, 3, 4])
new_rdd1 = rdd1.map(lambda x: x+1)
print('new_rdd1 = ', new_rdd1.collect())
Output: new_rdd1 = [2, 3, 4, 5]
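Note that map is a transformation, so the line that builds new_rdd1 does not run a Spark job by itself; nothing is computed until an action such as collect() is called. A quick illustration (the variable name is just for this sketch):
lazy_rdd = rdd1.map(lambda x: x + 1)   # returns immediately, no job runs yet
print(lazy_rdd)                        # prints an RDD object, not the data
print(lazy_rdd.collect())              # collect() is the action that triggers the computation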
2. Split each element on whitespace
rdd2 = sc.parallelize(['a 1', 'b 2', 'c 3'])
new_rdd2 = rdd2.map(lambda x: x.split())
print('new_rdd2 = ', new_rdd2.collect())
Output: new_rdd2 = [['a', '1'], ['b', '2'], ['c', '3']]
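Because the function returns a list for every element, the result is a list of lists. If a single flat list of tokens is wanted instead, flatMap is the usual choice; a small comparison (this variant is not part of the original example):
flat_rdd2 = rdd2.flatMap(lambda x: x.split())
print('flat_rdd2 = ', flat_rdd2.collect())
Output: flat_rdd2 = ['a', '1', 'b', '2', 'c', '3']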
3. Turn each element into a tuple, with x as the first item and 1 as the second
rdd3 = sc.parallelize([1, 2, 3])
new_rdd3 = rdd3.map(lambda x: (x, 1))
print('new_rdd3 = ', new_rdd3.collect())
Output: new_rdd3 = [(1, 1), (2, 1), (3, 1)]
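Pairs of the form (x, 1) are typically the first step of a count: once every element carries a 1, the ones can be summed per key. A sketch of that follow-up step (reduceByKey is standard PySpark, but this continuation is not in the original post):
words = sc.parallelize(['a', 'b', 'a', 'c', 'b', 'a'])
counts = words.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
print('counts = ', sorted(counts.collect()))
Output: counts = [('a', 3), ('b', 2), ('c', 1)]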
4. Pass a function in to operate on x
def map1(x):
    return x+1
rdd4 = sc.parallelize([1, 2, 3])
new_rdd4 = rdd4.map(lambda x: map1(x))
print('new_rdd4 = ', new_rdd4.collect())
Output: new_rdd4 = [2, 3, 4]
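Wrapping map1 in a lambda works, but map accepts any callable, so the function can also be passed directly; the following line should produce the same result:
new_rdd4 = rdd4.map(map1)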