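Spark provides a set of dedicated operations for RDDs whose elements are key-value pairs. Such RDDs are called pair RDDs. Pair RDDs are a building block of many programs because they expose operations that act on each key in parallel or regroup data across nodes. For example, an ordinary RDD has countByValue, while a pair RDD offers reduceByKey.
From the transformations on ordinary RDDs covered earlier and the definition of a pair RDD, it follows that a pair RDD can be obtained from an ordinary RDD with a transformation such as map that turns each element into a (key, value) tuple.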
val nums = sc.parallelize(List(1,2,3,4,5))
val pairNums = nums map (x => (x,1))
println(pairNums.collect.mkString(","))
Format:
pairRDD reduceByKey ( => )
val nums = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
val all = nums sample (true, 100000)
val pairNums = all map (x => (x,1))
val sumNums = pairNums reduceByKey(_+_)
println(nums.collect.mkString(","))
println(pairNums.collect.mkString(","))
println(sumNums.collect.mkString(","))
val result = sumNums map (x => (x._1, x._2/100000.toDouble))
println(result.collect.mkString(","))
What happens if the count is raised to 10 billion?
Remarkably, even with 10 billion elements the job finished in 6.7 minutes.
Format:
pairRDD groupByKey
val initNums = sc parallelize ( 0 to 9)
val pairNums= initNums sample(true,20) map (x => (x, x+"x"))
println(pairNums.collect.mkString(","))
println(pairNums.groupByKey.collect.mkString(","))
Format:
pairRDD keys
val initNums = sc parallelize ( 0 to 9)
val pairNums = initNums sample(true,20) map (x => (x, x+1))
println(pairNums.collect.mkString(","))
println(pairNums.keys.collect.mkString(","))
Format:
pairRDD values
val initNums = sc parallelize ( 0 to 9)
val pairNums= initNums sample(true,20) map (x => (x, x+"x"))
println(pairNums.collect.mkString(","))
println(pairNums.values.collect.mkString(","))
Format:
pairRDD sortByKey
val nums = sc parallelize ( 0 to 9) map (x => (10 - x, x))
println(nums.collect.mkString(","))
println(nums.sortByKey().collect.mkString(","))
Format:
pairRDD mapValues ( => )
val nums = sc parallelize (0 to 9) map (x => (x%4,x))
nums collect() foreach print
nums mapValues ( x => x * 10 ) collect() foreach print
Format:
pairRDD flatMapValues( => )
val nums = sc parallelize ( 0 to 9 ) map ( x => ( x, x ))
nums collect() foreach print
nums flatMapValues ( x => x to 10 ) collect() foreach print
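Format:
pairRDD combineByKey( => , => , => )
First => (createCombiner): converts an element to the result type; parameter: the element
Second => (mergeValue): aggregates elements within a partition; parameters: result type, element
Third => (mergeCombiners): merges results across partitions; parameters: result type, result type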
val nums = sc parallelize(List(("A",66),("B",56),("C",88),("D",99),("A",33),("B",67858),("C",8987),("D",11231)))
type Mt = (Int,Int)
nums.combineByKey(a => (a,1),(x:Mt,s)=>(x._1+s,x._2 + 1),(c:Mt,d:Mt)=>(c._1+d._1,c._2+d._2)) map{ case (key,value) => (key, value._1/value._2.toDouble)} collect() foreach print
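The three functions are easier to follow when the partitions are made explicit. The following is a minimal sketch in plain Scala (no Spark needed) that models the steps described above; the object `CombineByKeyModel`, its helper, and the way data is split into two partitions are assumptions made for illustration, not Spark's actual implementation.

```scala
// Plain-Scala model of combineByKey; each inner Seq stands in for one partition.
object CombineByKeyModel {
  // Hypothetical helper mirroring the three-function shape of combineByKey.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: inside each partition, fold the values of each key into a combiner.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case None    => acc.updated(k, createCombiner(v)) // first value of k in this partition
          case Some(c) => acc.updated(k, mergeValue(c, v))  // later values of k
        }
      }
    }
    // Step 2: merge the per-partition combiners key by key.
    perPartition.flatten.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.get(k) match {
        case None        => acc.updated(k, c)
        case Some(prior) => acc.updated(k, mergeCombiners(prior, c))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed split of the data into two partitions.
    val partitions = Seq(
      Seq(("A", 66), ("B", 56), ("A", 34)),
      Seq(("A", 20), ("B", 44))
    )
    // Combiner type is (sum, count), as in the average example above.
    val sumAndCount = combineByKey[String, Int, (Int, Int)](partitions)(
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (c1, c2) => (c1._1 + c2._1, c1._2 + c2._2)
    )
    val averages = sumAndCount.map { case (k, (sum, n)) => (k, sum / n.toDouble) }
    println(averages)
  }
}
```

With this input, key A accumulates (120, 3) and key B accumulates (100, 2), so the averages are 40.0 and 50.0. Note that createCombiner runs once per key per partition, which is why the third function is needed to reconcile partial results.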
Format:
pairRDD1 subtractByKey pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",66),("C",77)))
num1 collect() foreach print
num2 collect() foreach print
num1 subtractByKey num2 collect() foreach print
Format:
pairRDD1 join pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 join num2 collect() foreach print
Format:
pairRDD1 rightOuterJoin pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 rightOuterJoin num2 collect() foreach print
Format:
pairRDD1 leftOuterJoin pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 leftOuterJoin num2 collect() foreach print
Format:
pairRDD1 cogroup pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 cogroup num2 collect() foreach print
Operation | Method | Format |
---|---|---|
Aggregate by key | reduceByKey | pairRDD reduceByKey ( => ) |
Group by key | groupByKey | pairRDD groupByKey |
Get keys | keys | pairRDD keys |
Get values | values | pairRDD values |
Sort by key | sortByKey | pairRDD sortByKey |
Transform values | mapValues | pairRDD mapValues ( => ) |
Flat-map values | flatMapValues | pairRDD flatMapValues( => ) |
Custom aggregation by key | combineByKey | pairRDD combineByKey( => , => , => ) |
Difference by key | subtractByKey | pairRDD1 subtractByKey pairRDD2 |
Inner join | join | pairRDD1 join pairRDD2 |
Right outer join | rightOuterJoin | pairRDD1 rightOuterJoin pairRDD2 |
Left outer join | leftOuterJoin | pairRDD1 leftOuterJoin pairRDD2 |
Group both RDDs by key | cogroup | pairRDD1 cogroup pairRDD2 |
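The two-RDD operations in the table differ mainly in which keys they keep and how a missing side is represented. As a rough sketch, the plain-Scala model below (no Spark needed) mimics join, leftOuterJoin, and cogroup on the num1 and num2 data used in the examples above. The object `JoinModel` and its helpers are hypothetical names for illustration, and unlike Spark this model happens to preserve input order.

```scala
// Plain-Scala model of join, leftOuterJoin and cogroup on lists of pairs.
object JoinModel {
  type Pairs[V] = Seq[(String, V)]

  // Inner join: only keys present on both sides survive.
  def join[V, W](l: Pairs[V], r: Pairs[W]): Seq[(String, (V, W))] =
    for ((k, v) <- l; (k2, w) <- r if k == k2) yield (k, (v, w))

  // Left outer join: every left pair is kept; a missing right side becomes None.
  def leftOuterJoin[V, W](l: Pairs[V], r: Pairs[W]): Seq[(String, (V, Option[W]))] =
    l.flatMap { case (k, v) =>
      val matches: Seq[Option[W]] = r.collect { case (k2, w) if k2 == k => Some(w) }
      val right = if (matches.isEmpty) Seq(None) else matches
      right.map(w => (k, (v, w)))
    }

  // cogroup: every key from either side, paired with all of its values on each side.
  def cogroup[V, W](l: Pairs[V], r: Pairs[W]): Map[String, (Seq[V], Seq[W])] =
    (l.map(_._1) ++ r.map(_._1)).distinct.map { k =>
      val lefts  = l.collect { case (k2, v) if k2 == k => v }
      val rights = r.collect { case (k2, w) if k2 == k => w }
      k -> ((lefts, rights))
    }.toMap

  def main(args: Array[String]): Unit = {
    val num1 = Seq(("A", 66), ("B", 78), ("C", 77), ("D", 88))
    val num2 = Seq(("A", 778), ("C", 899))
    println(join(num1, num2))          // keys A and C only
    println(leftOuterJoin(num1, num2)) // all four keys; B and D paired with None
    println(cogroup(num1, num2))       // all four keys; B and D have an empty right side
  }
}
```

A right outer join is the mirror image of the left outer join: every pair of the right-hand side is kept and the left value is wrapped in an Option.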
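Because a pair RDD is itself an RDD (in Scala, an RDD of tuples whose key-based operations are added through PairRDDFunctions), every ordinary RDD operation can also be used on a pair RDD.
Format:
pairRDD map {case (key,value) => (key, value')}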
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x))
pairs map {case (key,value) => (key, value * 10 )} collect() foreach print
Format:
pairRDD filter {case (key,value) => Boolean}
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
pairs filter {case (key,value) => value < 100 } collect() foreach print
Format:
pairRDD keys
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
println(pairs.keys.collect.mkString(","))
Format:
pairRDD values
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
println(pairs.keys.collect.mkString(","))
println(pairs.values.collect.mkString(","))
Format:
pairRDD mapValues ( => )
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
pairs mapValues ( x => x /10 ) collect() foreach print
Format:
pairRDD reduceByKey ( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = keys cartesian values
pairs collect() foreach print
pairs reduceByKey ((a,b) => ( if (a > b) a else b)) collect() foreach print
Format:
pairRDD foldByKey (value)( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs collect() foreach print
pairs.foldByKey(3)((a,b) => (a+b)) collect()
Format:
pairRDD aggregateByKey(value)( => , => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs collect() foreach print
pairs.aggregateByKey("")((a,b)=>(a+""+b),(s,t)=>s+t) collect()
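Format:
pairRDD combineByKey( => , => , => )
First => (createCombiner): converts an element to the result type; parameter: the element
Second => (mergeValue): aggregates elements within a partition; parameters: result type, element (the parameter order is fixed)
Third => (mergeCombiners): merges results across partitions; parameters: result type, result type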
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs combineByKey(x=>x.toDouble,(a:Double,b:Int)=>(a + b.toDouble),(a:Double,b:Double)=>(a+b)) collect
Format:
pairRDD groupBy ( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupBy{case(key,value) => key} map {case(key,value) => (key, value map (x => x._2))} collect
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupByKey() collect
Format:
pairRDD groupByKey
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupByKey() collect
Format:
pairRDD1 cogroup pairRDD2 [cogroup pairRDD3 …]
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs1 = sc parallelize (keys cartesian values collect)
val keys = sc parallelize(List("B","C"))
val values = sc parallelize(5 to 8)
val pairs2 = sc parallelize (keys cartesian values collect)
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 cogroup pairs2 collect() foreach print
Format:
pairRDD1 join pairRDD2
val keys = sc parallelize (1 to 3)
val values = sc parallelize ('A' to 'C')
val pairs1 = sc parallelize ( keys cartesian values collect)
keys collect() foreach print
values collect() foreach print
pairs1 collect() foreach print
val keys = sc parallelize (2 to 4)
val values = sc parallelize ( 'M' to 'O')
val pairs2 = sc parallelize ( keys cartesian values collect)
keys collect() foreach print
values collect() foreach print
pairs2 collect() foreach print
pairs1 join pairs2 collect() foreach print
Format:
pairRDD1 leftOuterJoin pairRDD2
val keys = sc parallelize ( 1 to 3)
val values = sc parallelize( 'A' to 'C')
keys collect() foreach print
values collect() foreach print
val pairs1 = keys cartesian values
val keys = sc parallelize ( 2 to 3)
val values = sc parallelize ('A' to 'D')
keys collect() foreach print
values collect() foreach print
val pairs2 = keys cartesian values
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 leftOuterJoin pairs2 collect() foreach print
Format:
pairRDD1 rightOuterJoin pairRDD2
val keys = sc parallelize ( 1 to 2)
val values = sc parallelize ( 'A' to 'C')
val pairs1 = keys cartesian values
keys collect() foreach print
values collect() foreach print
val keys = sc parallelize ( 2 to 4)
val values = sc parallelize ('B' to 'D')
val pairs2 = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 rightOuterJoin pairs2 collect() foreach print
Format:
pairRDD sortByKey
val keys = sc parallelize (List(3,2,1))
val values = sc parallelize ( 'M' to 'O')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs sortByKey() collect() foreach print
Format:
pairRDD countByKey
val keys = sc parallelize( 1 to 8)
val values = sc parallelize ( 'A' to 'E')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs countByKey
Format:
pairRDD collectAsMap
val keys = sc parallelize( 1 to 7)
val values = sc parallelize( 'E' to 'H')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs collectAsMap
Format:
pairRDD lookup key
val keys = sc parallelize ( 1 to 5)
val values = sc parallelize ( 'M' to 'Z')
val pair = keys cartesian values
keys collect() foreach print
values collect() foreach print
pair collect() foreach print
pair lookup 2