mapWith has been deprecated since Spark 1.0; use mapPartitionsWithIndex instead.
====Original mapWith Usage
mapWith is another variant of map. Where map takes a single input function, mapWith takes two. It is defined as follows:
def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]
The first function, constructA, takes the RDD's partition index (starting at 0) as input and returns a value of the new type A;
the second function, f, takes the pair (T, A) as input (where T is an element of the original RDD and A is the output of the first function) and returns a value of type U.
Example: multiply the partition index by 10 and add 2, using the result as the elements of the new RDD.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
x.mapWith(a => a * 10)((a, b) => (b + 2)).collect
res4: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)
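Since mapWith is deprecated, the same result can be obtained with mapPartitionsWithIndex. A minimal sketch, using the same data and partitioning as above:

// For every element in a partition, emit partitionIndex * 10 + 2,
// which reproduces the mapWith example without the deprecated API.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
x.mapPartitionsWithIndex((index, iter) => iter.map(_ => index * 10 + 2)).collect
// expected: Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)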
====The New mapPartitionsWithIndex Usage
mapPartitionsWithIndex(func) Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
It is similar to mapPartitions; its signature is:
def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
The func passed to mapPartitionsWithIndex takes two parameters: the first is the partition index, and the second is an iterator over that partition's data. It returns an iterator over the transformed elements. In the test below, the partition index is output together with each element of the partition.
Test:
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
  iter.toList.map(x => index + "-" + x).iterator
}
//myfunc: (index: Int, iter: Iterator[Int])Iterator[String]
x.mapPartitionsWithIndex(myfunc).collect()
res: Array[String] = Array(0-1, 0-2, 0-3, 1-4, 1-5, 1-6, 2-7, 2-8, 2-9, 2-10)
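Another common use of the partition index is inspecting how data is distributed across partitions, for example counting the elements in each one. A minimal sketch (the exact counts shown assume Spark's default slicing of a Range across 3 partitions):

// Count how many elements end up in each partition.
val counts = sc.parallelize(1 to 10, 3)
  .mapPartitionsWithIndex((index, iter) => Iterator((index, iter.size)))
  .collect()
// expected with the default slicing: Array((0,3), (1,3), (2,4))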
A runnable class:
package yanan.spark.core.transformations.example

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object TransformationsTest {

  // For each partition, pair every element with its predecessor.
  def mapPartitionsFunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
    var res = List[(T, T)]()
    var pre = iter.next
    while (iter.hasNext) {
      val cur = iter.next
      res = (pre, cur) :: res
      pre = cur
    }
    res.iterator
  }

  def mapPartitionsTest(sc: SparkContext) = {
    val a = sc.parallelize(1 to 9, 3)
    a.mapPartitions(mapPartitionsFunc).collect.foreach(println)
  }

  def mapValuesTest(sc: SparkContext) = {
    val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
    val b = a.map(x => (x.length, x))
    b.mapValues("x" + _ + "x").collect.foreach(println)
  }

  def mapWithTest(sc: SparkContext) = {
    val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
    x.mapWith(a => a * 10)((a, b) => (b + 2)).collect.foreach(println)
    //res4: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)

    // mapWith is deprecated; use mapPartitionsWithIndex instead.
    val parallel = sc.parallelize(1 to 9)
    parallel.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) =>
      it.toList.map(x => index + ", " + x).iterator).collect
    // parallel.collect.foreach(println)

    val y = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
    y.mapPartitionsWithIndex(myfuncPartitionsWithIndex).collect()
    //res: Array[String] = Array(0-1, 0-2, 0-3, 1-4, 1-5, 1-6, 2-7, 2-8, 2-9, 2-10)
  }

  // Prefix every element with the index of the partition it belongs to.
  def myfuncPartitionsWithIndex(index: Int, iter: Iterator[Int]): Iterator[String] = {
    iter.toList.map(x => index + "-" + x).iterator
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Book example: Scala").setMaster("local[2]")
    val sc = new SparkContext(conf)
    //mapPartitionsTest(sc)
    mapWithTest(sc)
    sc.stop()
  }
}
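To compile and run the class above, a build definition along these lines can be used. This is a minimal sketch, not taken from the article: the project name and version numbers are assumptions, and the Spark dependency must be a 1.x release, because mapWith was removed in Spark 2.0.

// build.sbt -- minimal sketch; the name and versions below are assumptions.
name := "spark-transformations-example"

scalaVersion := "2.10.5"

// mapWith only exists (deprecated) in Spark 1.x; it was removed in Spark 2.0.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3"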
References:
https://www.zybuluo.com/jewes/note/35032
http://debugo.com/spark-programming-model/