Spark Operators: Transformations
An introduction to commonly used Spark Transformation operators
| Operation | Description | Summary |
| --- | --- | --- |
| map(func) | Return a new distributed dataset formed by passing each element of the source through a function func. | Apply func to every element of the RDD and return a new RDD. |
| filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true. | Test each element of the RDD and keep only those for which func returns true. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). | Like map, but each element can produce 0 or more output elements. |
| mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type `Iterator<T> => Iterator<U>` when running on an RDD of type T. | Like map, but func is applied to each partition as a whole. |
| groupByKey([numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of `(K, Iterable<V>)` pairs. Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD; you can pass an optional numPartitions argument to set a different number of tasks. | Group the values by key, returning `(Key, Iterable<Value>)` pairs; for per-key aggregation, prefer reduceByKey (see the sketch after the table). |
| reduceByKey(func, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. | Aggregate the values of each key with the given reduce function. |
| sortByKey([ascending], [numPartitions]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument. | Sort the dataset by key. |
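
To make the note on groupByKey concrete, here is a minimal sketch of a per-key average computed with reduceByKey (it assumes a SparkContext named sc, like the one created in the demo below): values are partially combined on the map side before the shuffle, instead of every single value being sent across the network as groupByKey would do.

// per-key average: carry (sum, count) pairs through reduceByKey, then divide
val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
val avgByKey = scores
  .mapValues(v => (v, 1))                             // (sum, count) for each value
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))  // merge partial sums and counts
  .mapValues { case (sum, count) => sum / count }
avgByKey.foreach(println)                             // e.g. (a,2.0) (b,2.0); print order may vary
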
Demo
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Spark Transformations demo
 */
object TransformationsApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TransformationsApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    // create a dataset
    val data1 = sc.parallelize(Array("a", "b", "c", "d", "e"), 2)
    // map demo: transform every element, returning an (element, 1) tuple for each
    //val mapData = data1.map((_,1)).foreach(println)
    /*
    (a,1)
    (b,1)
    (c,1)
    (d,1)
    (e,1)
    */
    // filter demo: keep only the elements that satisfy the predicate
    // val filterData = data1.filter(x => x == "a").foreach(println) // a
    val data2 = sc.parallelize(Array(Array("a","b","c","d","e"), Array("q","w","r")))
    // val mapData = data2.map((_,1)).foreach(println(_))
    /*
    ([Ljava.lang.String;@5c4cd0b8,1)
    ([Ljava.lang.String;@b40de16,1)
    */
    // val flatMapData = data2.flatMap(_.map((_,1))).foreach(println(_))
    /*
    (a,1)
    (b,1)
    (c,1)
    (d,1)
    (e,1)
    (q,1)
    (w,1)
    (r,1)
    */
    // Comparing the two outputs shows that flatMap flattens the elements first and then applies map
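    // A common flatMap use case (a sketch with made-up input): split lines of text into
    // words and pair each word with 1 -- the first step of a word count.
    // val lines = sc.parallelize(Array("a b c", "d e"))
    // lines.flatMap(_.split(" ")).map((_, 1)).foreach(println(_))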
    // val mapPartitionsData = data1.mapPartitions(x => {
    //   x.map((_, 1))
    // }).foreach(println(_))
    /*
    (a,1)
    (c,1)
    (b,1)
    (d,1)
    (e,1)
    */
    // The result looks like map's, but mapPartitions processes one whole partition (an Iterator) at a time
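    // Where mapPartitions really pays off (a sketch; openConnection and lookup are
    // hypothetical helpers, not part of Spark): set up an expensive resource once per
    // partition and reuse it for every element, instead of once per element as map would.
    // data1.mapPartitions { iter =>
    //   val conn = openConnection()        // hypothetical: one connection per partition
    //   iter.map(x => lookup(conn, x))     // hypothetical lookup reusing that connection
    // }.foreach(println(_))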
    val data3 = sc.parallelize(Array("a","b","c","d","e","a","a","d","b"), 1)
    // group the values of each key
    // val groupByKeyData = data3.map((_,1)).groupByKey().foreach(println(_))
    /*
    (e,CompactBuffer(1))
    (d,CompactBuffer(1, 1))
    (a,CompactBuffer(1, 1, 1))
    (b,CompactBuffer(1, 1))
    (c,CompactBuffer(1))
    */
    // aggregate the values of each key with a reduce function
    //val reduceByKeyData = data3.map((_,1)).reduceByKey(_+_).foreach(println(_))
    /*
    (e,1)
    (a,3)
    (d,2)
    (c,1)
    (b,2)
    */
    // sort by key; note that when printing with foreach, ordering is only guaranteed within each partition
    data3.map((_,1)).reduceByKey(_+_).sortByKey().foreach(println(_))
    // sortBy lets you choose the sort key; ascending by default
    data3.map((_,1)).reduceByKey(_+_).sortBy(_._2).foreach(println(_))
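    // sortBy also takes an ascending flag; for example, sort by count in descending order:
    // data3.map((_,1)).reduceByKey(_+_).sortBy(_._2, ascending = false).foreach(println(_))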
    sc.stop()
  }
}