The most commonly used Spark RDDs hold their data as key/value pairs, known as pair RDDs. Many common tasks call for pair-RDD operations such as reduceByKey and join, all of which work on a per-key basis.
The first example sums the values for each key:
package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdds = sc.parallelize(List((1,2),(3,4),(3,6)))
    // sum the values for each key: key 1 -> 2, key 3 -> 4 + 6 = 10
    println(rdds.reduceByKey((x,y)=>x+y).collectAsMap())
  }
}
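As an aside of my own (not part of the original example), this same reduceByKey pattern is what drives the classic word count: turn each word into a (word, 1) pair, then sum the counts per key. A minimal sketch, reusing sc from the listing above with a made-up lines RDD:
val lines = sc.parallelize(List("a quick brown fox", "a lazy dog"))
val wordCounts = lines
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // turn each word into a (word, 1) pair
  .reduceByKey((x, y) => x + y)       // sum the counts for each word
println(wordCounts.collectAsMap())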
The next example groups the values by key:
package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdds = sc.parallelize(List((1,2),(3,4),(3,6)))
    // println(rdds.reduceByKey((x,y)=>x+y).collectAsMap())
    // collect all values that share a key, e.g. key 3 -> (4, 6)
    println(rdds.groupByKey().collectAsMap())
    // println(rdds.mapValues(x=>x+1).collectAsMap())
  }
}
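A side note of my own: when the grouped values are only going to be aggregated, reduceByKey is usually the better choice, because it combines values on each partition before the shuffle instead of shipping every individual value across the network. A small sketch, reusing rdds from the listing above, of the two equivalent ways to get the per-key sums:
// shuffles every individual value, then sums locally
println(rdds.groupByKey().mapValues(vs => vs.sum).collectAsMap())
// pre-aggregates on each partition before shuffling
println(rdds.reduceByKey((x, y) => x + y).collectAsMap())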
The next example transforms only the values, leaving the keys untouched:
package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdds = sc.parallelize(List((1,2),(3,4),(3,6)))
    // println(rdds.reduceByKey((x,y)=>x+y).collectAsMap())
    // println(rdds.groupByKey().collectAsMap())
    // add 1 to every value while keeping the keys unchanged
    println(rdds.mapValues(x=>x+1).collectAsMap())
  }
}
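One caveat worth spelling out (my own note, not from the book): collectAsMap() materializes the result as a local Map, so when the transformed RDD still contains duplicate keys, only one value per key survives in what gets printed. mapValues does not merge keys, so one of the two entries for key 3 is silently dropped above. Reusing rdds from the listing, collect() shows everything:
// all three pairs are kept: (1,3), (3,5) and (3,7)
rdds.mapValues(x => x + 1).collect().foreach(println)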
The following example operates on two pair RDDs at once, most notably with the database-style join operations. My output here differs from what the original book shows, and I am not sure why; a possible explanation is sketched after the listing.
package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(List((1,2),(3,6)))
    val other = sc.parallelize(List((3,9)))
    // drop pairs whose key also appears in other
    println(rdd.subtractByKey(other).collectAsMap())
    // inner join, right/left outer joins, and cogroup of the two RDDs
    println(rdd.join(other).collectAsMap())
    println(rdd.rightOuterJoin(other).collectAsMap())
    println(rdd.leftOuterJoin(other).collectAsMap())
    println(rdd.cogroup(other).collectAsMap())
  }
}
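A guess at why the output differs from the book: the book prints these results with collect(), which shows the raw pair structure (and, for cogroup, the name of the buffer class, which has changed across Spark versions), whereas collectAsMap() reshapes everything into a Map. Reusing rdd and other from the listing above, the raw view looks like this:
// the inner join keeps only keys present in both RDDs, e.g. (3,(6,9))
rdd.join(other).collect().foreach(println)
// cogroup pairs every key with its values from each RDD; key 1 exists
// only in rdd, so its second collection is empty
rdd.cogroup(other).collect().foreach(println)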
Sometimes we need the average value for each key. That requires tracking two numbers per key: how many times the key occurs and the sum of its values.
package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(
      List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
    // pair each value with a count of 1, then sum both components per key,
    // yielding (sum of values, number of occurrences) for every key
    val result = rdd.mapValues(x => (x,1)).reduceByKey(
      (x,y)=>(x._1+y._1,x._2+y._2))
    println(result.collectAsMap())
  }
}
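The result above holds (sum, count) pairs rather than the averages themselves. One more mapValues, reusing result from the listing above, turns them into per-key means:
// divide the running sum by the number of occurrences for each key
val averages = result.mapValues { case (sum, count) => sum.toDouble / count }
println(averages.collectAsMap())   // panda -> 0.5, pink -> 3.5, pirate -> 3.0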
The same computation can also be expressed with combineByKey:
package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(
      List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
    val result = rdd.combineByKey(
      // createCombiner: turn the first value seen for a key into (sum, count)
      (x => (x, 1)),
      // mergeValue: fold another value for the same key into the accumulator
      (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
      // mergeCombiners: merge accumulators built on different partitions
      (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
    println(result.collectAsMap())
  }
}
The next example sorts the pairs by key. Note that how you collect the result matters: collect() preserves the sorted order, while collectAsMap() returns an unordered Map, so the ordering is no longer visible when it is printed.
package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(
      List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
    // a (word, index) pair RDD built by zipping two RDDs of equal length
    // (not used in this listing; it is sorted in the sketch after it)
    val a = sc.parallelize(List("dog","cat","owl","gnu","ant"))
    val b = sc.parallelize(1 to a.count().toInt)
    val c = a.zip(b)
    // ascending sort by key; collect() would show the order,
    // but collectAsMap() returns an unordered Map
    // rdd.sortByKey(true).collect().foreach(print)
    val ss = rdd.sortByKey(true).collectAsMap()
    println(ss)
    // descending sort by key
    // rdd.sortByKey(false).collect().foreach(print)
    val ss2 = rdd.sortByKey(false).collectAsMap()
    println(ss2)
  }
}
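As a small sketch of my own, the unused c pair RDD built above can be sorted the same way; its keys are strings, so sortByKey orders the words alphabetically, and collect() makes that order visible:
// prints the (word, index) pairs with keys in order: ant, cat, dog, gnu, owl
c.sortByKey().collect().foreach(println)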
Finally, actions on pair RDDs. A single example is enough here:
package day02
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Example4_1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkCount").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(
      List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
    // countByKey: number of elements per key, returned as a local Map
    val r1 = rdd.countByKey()
    println(r1)
    // collectAsMap: the pairs as a local Map (one value kept per key)
    val r2 = rdd.collectAsMap()
    println(r2)
    // lookup: all values stored under the given key
    val r3 = rdd.lookup("panda")
    println(r3)
  }
}
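A few words on what these actions return: countByKey gives a local Map from each key to the number of elements carrying it (panda and pink appear twice, pirate once), collectAsMap keeps a single value per key as noted earlier, and lookup("panda") returns every value stored under that key, here both 0 and 1.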