>>> a = sc.parallelize([(1,2),(3,4),(5,6)])
>>> a
ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:475
>>> help(a.map)
Help on RDD in module pyspark.rdd object:

class RDD(__builtin__.object)
 |  A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
 |  Represents an immutable, partitioned collection of elements that can be
 |  operated on in parallel.
 |
 |  Methods defined here:
 |
 |  __add__(self, other)
 |      Return the union of this RDD and another one.
 |
 |      >>> rdd = sc.parallelize([1, 1, 2, 3])
 |      >>> (rdd + rdd).collect()
 |      [1, 1, 2, 3, 1, 1, 2, 3]

map(self, f, preservesPartitioning=False) method of pyspark.rdd.RDD instance
    Return a new RDD by applying a function to each element of this RDD.

    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]

mapPartitions(self, f, preservesPartitioning=False) method of pyspark.rdd.RDD instance
    Return a new RDD by applying a function to each partition of this RDD.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> def f(iterator): yield sum(iterator)
    >>> rdd.mapPartitions(f).collect()
    [3, 7]

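The difference between map and mapPartitions is easier to see when partitions are modeled as plain Python lists. The sketch below does not use Spark. The names `partitions`, `map_elements`, `map_partitions`, `partition_sum` and `collect` are hypothetical helpers that only mimic the behavior described in the help text, and the split of [1, 2, 3, 4] into [1, 2] and [3, 4] is an assumption chosen to match the doctest output.

```python
# Plain-Python model (no Spark): an "RDD" is just a list of partitions.
# Assumed split of [1, 2, 3, 4] into 2 partitions, matching the doctest above.
partitions = [[1, 2], [3, 4]]


def map_elements(parts, f):
    # Like map: f is called once for every element.
    return [[f(x) for x in part] for part in parts]


def map_partitions(parts, f):
    # Like mapPartitions: f is called once per partition, receives an
    # iterator over that partition, and returns (or yields) an iterable.
    return [list(f(iter(part))) for part in parts]


def partition_sum(iterator):
    # Same shape as the f in the doctest: one output value per partition.
    yield sum(iterator)


def collect(parts):
    # Concatenate the partitions in order, as collect() would.
    return [x for part in parts for x in part]


mapped = collect(map_elements(partitions, lambda x: x * 10))
summed = collect(map_partitions(partitions, partition_sum))
print(mapped)
print(summed)
```

With this split, the per-partition sums are 3 and 7, which is why the doctest prints [3, 7] rather than a single total.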
mapValues(self, f) method of pyspark.rdd.RDD instance
    Pass each value in the key-value pair RDD through a map function
    without changing the keys; this also retains the original RDD's
    partitioning.

    >>> x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
    >>> def f(x): return len(x)
    >>> x.mapValues(f).collect()
    [('a', 3), ('b', 1)]

Help on method flatMap in module pyspark.rdd:

flatMap(self, f, preservesPartitioning=False) method of pyspark.rdd.RDD instance
    Return a new RDD by first applying a function to all elements of this
    RDD, and then flattening the results.

    >>> rdd = sc.parallelize([2, 3, 4])
    >>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
    [1, 1, 1, 2, 2, 3]
    >>> sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())
    [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]

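The "apply, then flatten" step in flatMap can be reproduced without Spark. This is a minimal sketch using only the standard library; `flat_map` is a hypothetical helper, not a PySpark function, and it works on an ordinary list instead of an RDD.

```python
from itertools import chain


def flat_map(elements, f):
    # Apply f to every element, then flatten one level of nesting.
    return list(chain.from_iterable(f(x) for x in elements))


data = [2, 3, 4]

# A plain map keeps one (nested) result per input element.
mapped = [list(range(1, x)) for x in data]

# flat_map merges those per-element results into a single flat list.
flattened = sorted(flat_map(data, lambda x: range(1, x)))

print(mapped)
print(flattened)
```

Only one level is flattened: in the second doctest above, the tuples (x, x) stay intact because they are the items of the list that f returns.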
flatMapValues(self, f) method of pyspark.rdd.RDD instance
    Pass each value in the key-value pair RDD through a flatMap function
    without changing the keys; this also retains the original RDD's
    partitioning.

    >>> x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
    >>> def f(x): return x
    >>> x.flatMapValues(f).collect()
    [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]

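mapValues and flatMapValues differ only in what happens to the value returned by f: mapValues keeps it as a single value, flatMapValues emits one pair per item in it. The following plain-Python sketch (no Spark; `map_values` and `flat_map_values` are hypothetical helpers operating on a list of pairs) shows both on the same input.

```python
def map_values(pairs, f):
    # Like mapValues: one output pair per input pair, key unchanged.
    return [(key, f(value)) for key, value in pairs]


def flat_map_values(pairs, f):
    # Like flatMapValues: f(value) must be iterable; every item it
    # produces is paired with the original key.
    return [(key, item) for key, value in pairs for item in f(value)]


pairs = [("a", ["x", "y", "z"]), ("b", ["p", "r"])]

counts = map_values(pairs, len)
expanded = flat_map_values(pairs, lambda value: value)

print(counts)
print(expanded)
```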
reduce(self, f) method of pyspark.rdd.RDD instance
    Reduces the elements of this RDD using the specified commutative and
    associative binary operator. Currently reduces partitions locally.

    >>> from operator import add
    >>> sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
    15
    >>> sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)
    10
    >>> sc.parallelize([]).reduce(add)
    Traceback (most recent call last):
        ...
    ValueError: Can not reduce() empty RDD

>>> from operator import add
>>> b.collect()
[1, 2, 3, 4, 5, 6]
>>> b.reduce(add)  # use an imported built-in function
21
>>> b.reduce(lambda a,b:a+b)  # custom anonymous function defined with lambda
21

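The help text requires a commutative and associative operator because each partition is reduced locally before the partial results are combined. The sketch below models that in plain Python (no Spark). `partitioned_reduce` is a hypothetical helper, and the partition boundaries are assumptions picked for illustration; real Spark chooses them itself.

```python
from functools import reduce
from operator import add, sub


def partitioned_reduce(parts, f):
    # Reduce each non-empty partition locally, then reduce the partials.
    partials = [reduce(f, part) for part in parts if part]
    if not partials:
        # Mimics the error shown in the help text for an empty RDD.
        raise ValueError("Can not reduce() empty RDD")
    return reduce(f, partials)


data = [1, 2, 3, 4, 5, 6]
two_parts = [data[:3], data[3:]]                 # assumed 2-way split
three_parts = [data[:2], data[2:4], data[4:]]    # assumed 3-way split

# Addition is associative and commutative: the split does not matter.
add_two = partitioned_reduce(two_parts, add)
add_three = partitioned_reduce(three_parts, add)

# Subtraction is neither: the answer depends on the split.
sub_two = partitioned_reduce(two_parts, sub)
sub_three = partitioned_reduce(three_parts, sub)

print(add_two, add_three)
print(sub_two, sub_three)
```

Both additions give 21, matching the session above, while the two subtractions disagree with each other. An operator like subtraction would therefore give results that change with the number of partitions.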
Help on method reduceByKey in module pyspark.rdd:

reduceByKey(self, func, numPartitions=None, partitionFunc=) method of pyspark.rdd.RDD instance
    Merge the values for each key using an associative and commutative reduce function.

    This will also perform the merging locally on each mapper before
    sending results to a reducer, similarly to a "combiner" in MapReduce.

    Output will be partitioned with C{numPartitions} partitions, or
    the default parallelism level if C{numPartitions} is not specified.
    Default partitioner is hash-partition.

    >>> from operator import add
    >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
    >>> sorted(rdd.reduceByKey(add).collect())
    [('a', 2), ('b', 1)]
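
The two-stage merge described in the help text (local merging on each mapper, then a final merge per key) can be sketched in plain Python. This is only a model under stated assumptions: `combine_locally` and `reduce_by_key` are hypothetical helpers, the split into two partitions is made up, and the hash partitioning of the output is left out entirely.

```python
from operator import add


def combine_locally(part, f):
    # "Combiner" step: merge values that share a key inside one partition.
    merged = {}
    for key, value in part:
        merged[key] = f(merged[key], value) if key in merged else value
    return merged


def reduce_by_key(parts, f):
    # Final step: merge the per-partition results key by key.
    result = {}
    for part in parts:
        for key, value in combine_locally(part, f).items():
            result[key] = f(result[key], value) if key in result else value
    return result


# Assumed split of [("a", 1), ("b", 1), ("a", 1)] into two partitions.
parts = [[("a", 1), ("b", 1)], [("a", 1)]]

totals = sorted(reduce_by_key(parts, add).items())
print(totals)
```

Because values for the same key are merged in two stages and in no guaranteed order, the function passed in has to be associative and commutative, as the help text says.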