Learning Spark (Part 3)

Spark RDDs are most often used in key/value form, called pair RDDs (abbreviated below as PRDDs). Many common scenarios rely on pair RDD operations, such as reduceByKey and join, all of which are driven by the key.
The following example sums the values for each key:

package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions that add the pair RDD functions
object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdds = sc.parallelize(List((1,2),(3,4),(3,6)))
    // sum the values for each key
    println(rdds.reduceByKey((x,y) => x+y).collectAsMap())
  }
}
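If I am reading reduceByKey correctly, this should print something like Map(1 -> 2, 3 -> 10): the two values for key 3 are added together, while key 1 keeps its single value.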

The next example groups the values by key:

package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdds = sc.parallelize(List((1,2),(3,4),(3,6)))
    // println(rdds.reduceByKey((x,y) => x+y).collectAsMap())
    // collect all values with the same key into one sequence
    println(rdds.groupByKey().collectAsMap())
    // println(rdds.mapValues(x => x+1).collectAsMap())
  }
}
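Here groupByKey gathers every value for a key into one collection, so the output should look roughly like Map(1 -> CompactBuffer(2), 3 -> CompactBuffer(4, 6)); the exact collection type shown depends on the Spark version.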

The next operation transforms only the values, leaving the keys unchanged:

package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdds = sc.parallelize(List((1,2),(3,4),(3,6)))
    // println(rdds.reduceByKey((x,y) => x+y).collectAsMap())
    // println(rdds.groupByKey().collectAsMap())
    // apply a function to each value without touching the keys
    println(rdds.mapValues(x => x+1).collectAsMap())
  }
}
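mapValues(x => x+1) produces (1,3), (3,5) and (3,7), but collectAsMap() keeps only one entry per key, so the printed map shows a single value for key 3 (e.g. Map(1 -> 3, 3 -> 7)); use collect() if you want to see every pair.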

The following operations work on two pair RDDs; the most notable ones resemble database join operations. The results here differ from those in the original book, and I am not sure why.

package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(List((1,2),(3,6)))
    // println(rdds.reduceByKey((x,y) => x+y).collectAsMap())
    // println(rdds.groupByKey().collectAsMap())
    // println(rdds.mapValues(x => x+1).collectAsMap())
    val other = sc.parallelize(List((3,9)))
    println(rdd.subtractByKey(other).collectAsMap())
    println(rdd.join(other).collectAsMap())
    println(rdd.rightOuterJoin(other).collectAsMap())
    println(rdd.leftOuterJoin(other).collectAsMap())
    println(rdd.cogroup(other).collectAsMap())
  }
}
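My guess at the mismatch with the book: this rdd omits the (3,4) pair used in the earlier examples, and collectAsMap() keeps only one entry per key, so join-style results with duplicate keys can look different from results printed with collect(). I have not verified this against the book's exact data.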

Sometimes we need to compute the average value for each key. This clearly requires tracking two quantities per key: how many times the key appears, and the sum of its values.

package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(
      List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
    // per key: (sum of values, number of occurrences)
    val result = rdd.mapValues(x => (x,1)).reduceByKey(
      (x,y) => (x._1+y._1, x._2+y._2))
    println(result.collectAsMap())
  }
}
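The result above holds (sum, count) per key, e.g. Map(panda -> (1,2), pink -> (7,2), pirate -> (3,1)). To get the actual averages, one more division step is needed; a minimal sketch, assuming the result value from the program above:

// result: RDD[(String, (Int, Int))] holding (sum of values, count) per key
val averages = result.mapValues { case (sum, count) => sum.toDouble / count }
println(averages.collectAsMap())  // expected: Map(panda -> 0.5, pink -> 3.5, pirate -> 3.0)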

The same computation can also be expressed with the combineByKey function.

package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(
      List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
    val result = rdd.combineByKey(
      (x => (x,1)),                                                                // createCombiner
      (acc: (Int, Int), v) => (acc._1+v, acc._2+1),                                // mergeValue
      (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1+acc2._1, acc1._2+acc2._2))  // mergeCombiners
    println(result.collectAsMap())
  }
}
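combineByKey takes three functions: createCombiner (here x => (x,1)) builds the initial accumulator from the first value seen for a key in a partition, mergeValue folds further values for that key into the accumulator within a partition, and mergeCombiners merges accumulators for the same key across partitions. The output is the same (sum, count) pairs as before, so the averages can be derived with the same mapValues step shown above.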

The following example sorts the pairs by key. Note that different ways of printing the result look different: collect() preserves the sorted order, while collectAsMap() returns a Map, which does not preserve ordering and keeps only one value per key.

package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(
      List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
    // an unrelated pair RDD built with zip (not used below)
    val a = sc.parallelize(List("dog","cat","owl","gnu","ant"))
    val b = sc.parallelize(1 to a.count().toInt)
    val c = a.zip(b)
    // ascending
    // rdd.sortByKey(true).collect().foreach(print)
    val ss = rdd.sortByKey(true).collectAsMap()
    println(ss)
    // descending
    // rdd.sortByKey(false).collect().foreach(print)
    val ss2 = rdd.sortByKey(false).collectAsMap()
    println(ss2)
  }
}
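The commented-out collect().foreach(print) lines would show the pairs in key order, roughly (panda,0)(panda,1)(pink,3)(pink,4)(pirate,3) for the ascending sort and the reverse for the descending one. The maps printed by collectAsMap() do not preserve that order and keep only one value per key, which is why the two output styles look different.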

Finally, actions on pair RDDs; one example program is enough here:

package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(
      List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
    val r1 = rdd.countByKey()    // number of pairs for each key
    println(r1)
    val r2 = rdd.collectAsMap()  // collect as a Map (one value per key)
    println(r2)
    val r3 = rdd.lookup("panda") // all values for a single key
    println(r3)
  }
}
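For this data I would expect countByKey to give Map(panda -> 2, pink -> 2, pirate -> 1), collectAsMap to keep one value per key (e.g. Map(panda -> 1, pink -> 4, pirate -> 3)), and lookup("panda") to return the sequence of values for that key, printed as something like WrappedArray(0, 1).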
