Spark Operators: Transformations

An introduction to the commonly used Spark transformation operators.

Each entry lists the operator, its official description, and a one-line summary:

map(func)
  Return a new distributed dataset formed by passing each element of the source through a function func.
  In short: applies a function to every element of the RDD and returns a new RDD.

filter(func)
  Return a new dataset formed by selecting those elements of the source on which func returns true.
  In short: tests every element of the RDD and returns the elements that satisfy the predicate.

flatMap(func)
  Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
  In short: like map, but each element can produce zero or more output elements.

mapPartitions(func)
  Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
  In short: like map, but the function is applied once per partition.

groupByKey([numPartitions])
  When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance (see the aggregateByKey sketch after this table). Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
  In short: groups the values by key and returns (Key, Iterable<Value>) pairs.

reduceByKey(func, [numPartitions])
  When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
  In short: reduces the values of each key with the given function.

sortByKey([ascending], [numPartitions])
  When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
  In short: sorts the dataset by key.
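
The groupByKey note above mentions aggregateByKey, which the demo below does not cover. As a minimal, hypothetical sketch (the pairs/avgByKey names are illustrative and an existing SparkContext named sc is assumed), a per-key average, which cannot be written as a single (V, V) => V reduce function, can be expressed with aggregateByKey; like reduceByKey, it combines values on the map side before the shuffle, which is why the documentation prefers both over groupByKey for aggregations.

// Hypothetical sketch, not part of the original demo; assumes a SparkContext named sc.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))
// The zero value (0, 0) carries (runningSum, runningCount) for every key.
val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold one value into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp: merge accumulators coming from different partitions
)
val avgByKey = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }
avgByKey.foreach(println) // prints (a,2.0) and (b,2.0), order not guaranteed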

Demo

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Spark Transformations demo
  */
object TransformationsApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TransformationsApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    // create the sample dataset
    val data1 = sc.parallelize(Array("a", "b", "c", "d", "e"), 2)
    // map demo: turn every element into a tuple
    //val mapData = data1.map((_, 1)).foreach(println)
    /*
     (a,1)
     (b,1)
     (c,1)
     (d,1)
     (e,1)
      */
    // filter demo: keep only the elements that satisfy the predicate
    //val filterData = data1.filter(x => x == "a").foreach(println) // a

    val data2 = sc.parallelize(Array(Array("a", "b", "c", "d", "e"), Array("q", "w", "r")))
    //val mapData = data2.map((_, 1)).foreach(println(_))
    /*
    ([Ljava.lang.String;@5c4cd0b8,1)
    ([Ljava.lang.String;@b40de16,1)
     */
    //val flatMapData = data2.flatMap(_.map((_, 1))).foreach(println(_))
    /*
    (a,1)
    (b,1)
    (c,1)
    (d,1)
    (e,1)
    (q,1)
    (w,1)
    (r,1)
    */
    // Comparing the two outputs shows that flatMap flattens the elements first and then applies the map.

    //val mapPartitionsData = data1.mapPartitions(x => {
    //  x.map((_, 1))
    //}).foreach(println(_))
    /*
    (a,1)
    (c,1)
    (b,1)
    (d,1)
    (e,1)
     */
    // The output looks like map's, but mapPartitions applies the function once per partition.
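    // A hypothetical sketch, not part of the original demo: mapPartitions is typically chosen
    // over map when some per-partition setup is expensive, because the setup then runs once
    // per partition instead of once per element.
    //val mapPartitionsSetup = data1.mapPartitions { iter =>
    //  val prefix = "partition-"        // imagine an expensive resource being built here
    //  iter.map(e => (prefix + e, 1))   // reuse it for every element in the partition
    //}.foreach(println(_))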
    val data3 = sc.parallelize(Array("a", "b", "c", "d", "e", "a", "a", "d", "d"), 1)
    // groupByKey demo: group the values of each key
    //val groupByKeyData = data3.map((_, 1)).groupByKey().foreach(println(_))
    /*
    (e,CompactBuffer(1))
    (d,CompactBuffer(1, 1))
    (a,CompactBuffer(1, 1, 1))
    (b,CompactBuffer(1, 1))
    (c,CompactBuffer(1))
     */
    // reduceByKey demo: reduce the values of each key
    //val reduceByKeyData = data3.map((_, 1)).reduceByKey(_ + _).foreach(println(_))
   /*
    (e,1)
    (a,3)
    (d,2)
    (c,1)
    (b,2)
    */
    // sortByKey demo: sort by key; note that the order is only guaranteed within each partition,
    // so foreach(println) may not print in global order.
    data3.map((_, 1)).reduceByKey(_ + _).sortByKey().foreach(println(_))
    // sortBy lets you choose the field to sort on; the default is ascending
    // (see the descending-order sketch after this object).
    data3.map((_, 1)).reduceByKey(_ + _).sortBy(_._2).foreach(println(_))
    sc.stop()
  }
}
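
The last comment notes that sortBy defaults to ascending order. As a small sketch (not part of the original demo, and assuming it is placed inside main before sc.stop(), reusing the demo's data3), both sortByKey and sortBy accept an ascending flag, so a descending order only needs the flag set to false:

// Hypothetical sketch: descending order via the ascending flag, reusing data3 from the demo.
val counts = data3.map((_, 1)).reduceByKey(_ + _)
counts.sortByKey(ascending = false).foreach(println(_))    // keys in descending order
counts.sortBy(_._2, ascending = false).foreach(println(_)) // counts in descending order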
