Let's start with an example that takes two lines of input:
val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue")) // lines
rdd.collect
res0: Array[String] = Array("Roses are red", "Violets are blue")
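map transforms an RDD of length N into another RDD of length N, producing exactly one output element per input element.
For example, this map call computes the length of each of the two lines: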
rdd.map(_.length).collect
res1: Array[Int] = Array(13, 16)
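flatMap, by contrast, maps each of the N elements to a collection, giving N collections, and then flattens them into a single RDD whose length need not be N: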
rdd.flatMap(_.split(" ")).collect
res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")
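To make the flattening step visible, the sketch below applies the same split function with both map and flatMap. It uses plain Scala collections instead of an RDD so that it runs without a Spark installation; this rests on the assumption that Seq and RDD behave alike for these two operations. The object name MapVsFlatMapShape is made up for illustration.

```scala
object MapVsFlatMapShape {
  // Same two input lines as in the RDD example, held in a plain Seq (assumption: no Spark needed).
  val lines: Seq[String] = Seq("Roses are red", "Violets are blue")

  // map keeps one output per input: two lines in, two word sequences out.
  val nested: Seq[Seq[String]] = lines.map(_.split(" ").toSeq)

  // flatMap applies the same function, then concatenates the per-line results.
  val flat: Seq[String] = lines.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    println(nested.length) // number of elements after map
    println(flat.length)   // number of elements after flatMap
  }
}
```

With map the result is still nested (one collection of words per line), while flatMap returns the words themselves in a single flat collection.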
(1) The difference between the two transformations, flatMap and map, in their definitions:
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
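Both definitions have the same shape: they wrap the RDD in a MapPartitionsRDD and differ only in whether each partition's iterator is passed through iter.map or iter.flatMap. The toy model below mimics that structure without Spark. ToyRDD and ToyRDDDemo are hypothetical names, partitions are modeled as plain in-memory sequences, and closure cleaning, scoping, and ClassTag handling are left out, so this is a rough sketch rather than how Spark works internally.

```scala
// Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
final case class ToyRDD[T](partitions: Seq[Seq[T]]) {
  // Run f over a fresh iterator for every partition, loosely like MapPartitionsRDD.
  private def mapPartitions[U](f: Iterator[T] => Iterator[U]): ToyRDD[U] =
    ToyRDD(partitions.map(p => f(p.iterator).toList))

  // One output element per input element.
  def map[U](f: T => U): ToyRDD[U] = mapPartitions[U](_.map(f))

  // Zero or more output elements per input element, flattened per partition.
  def flatMap[U](f: T => Iterable[U]): ToyRDD[U] = mapPartitions[U](_.flatMap(f))

  // Concatenate all partitions in order.
  def collect: Seq[T] = partitions.flatten
}

object ToyRDDDemo {
  // Two partitions with one line each (an arbitrary layout chosen for illustration).
  val rdd: ToyRDD[String] = ToyRDD(Seq(Seq("Roses are red"), Seq("Violets are blue")))

  val lengths: Seq[Int] = rdd.map(_.length).collect
  val words: Seq[String] = rdd.flatMap(_.split(" ").toSeq).collect

  def main(args: Array[String]): Unit = {
    println(lengths)
    println(words)
  }
}
```

In this model neither operation changes the number of partitions; only the number of elements inside each partition can change, and only with flatMap.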
(2) The transformation functions:
map: One element in -> one element out.
flatMap: One element in -> 0 or more elements out (a collection).
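The "0 or more" case is what lets flatMap drop or multiply elements. As a minimal sketch on plain Scala collections (again assuming RDDs behave the same way for these operations), the example below parses strings to integers and silently discards the ones that fail, then expands each number into several copies. The object name ZeroOrMore and the sample data are invented for illustration.

```scala
import scala.util.Try

object ZeroOrMore {
  // Made-up sample input: "x" is not a valid integer.
  val raw: Seq[String] = Seq("1", "x", "3")

  // map must emit exactly one value per input, so the failure stays visible as None.
  val parsedAll: Seq[Option[Int]] = raw.map(s => Try(s.toInt).toOption)

  // flatMap may emit nothing for an input, so the unparseable string disappears.
  val parsedOk: Seq[Int] = raw.flatMap(s => Try(s.toInt).toOption)

  // flatMap may also emit several values for one input: n copies of n.
  val repeated: Seq[Int] = Seq(1, 2, 3).flatMap(n => Seq.fill(n)(n))

  def main(args: Array[String]): Unit = {
    println(parsedAll)
    println(parsedOk)
    println(repeated)
  }
}
```

Here map always returns three elements for three inputs, whereas flatMap returns two elements in the parsing case and six in the repetition case.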
References:
https://stackoverflow.com/questions/22350722/what-is-the-difference-between-map-and-flatmap-and-a-good-use-case-for-each
https://stackoverflow.com/questions/1059776/scala-iterablemap-vs-iterableflatmap
https://stackoverflow.com/questions/26684562/whats-the-difference-between-map-and-flatmap-methods-in-java-8