This article covers operations on Spark RDD key/value pairs.
1. Creating a key/value pair RDD
import org.apache.spark.{SparkConf, SparkContext};

// local-mode configuration; the later examples reuse this conf/sc setup
val conf = new SparkConf().setAppName("PairRDDExamples").setMaster("local[*]");
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val lenRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l.length,l));
lenRDD.collect().foreach(println);
Output
(4,this)
(2,is)
(5,spark)
(2,It)
(2,is)
(4,fun!)
(5,spark)
(2,is)
(4,cool)
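map is not the only way to turn an RDD into a pair RDD. As a point of comparison, here is a minimal sketch (reusing the strRDD defined above; the name lenRDD2 is just for illustration) that builds the same (length, word) pairs with keyBy:

// keyBy computes the key with the given function and keeps the element itself as the value
val lenRDD2 = strRDD.flatMap(l=>l.split(" ")).keyBy(w=>w.length);
lenRDD2.collect().foreach(println);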
2. Key/Value Pair RDD Transformations
| Name | Description |
|------|-------------|
| groupByKey([numTasks]) | Groups all the values with the same key together. For a dataset of (K, V) pairs, the returned RDD has the type (K, Iterable[V]). |
| reduceByKey(func, [numTasks]) | First groups the values with the same key, then applies the specified func to reduce each group of values down to a single value. For a dataset of (K, V) pairs, the returned RDD has the type (K, V). |
| sortByKey([ascending], [numTasks]) | Sorts the rows according to the keys. By default, the keys are sorted in ascending order. |
| join(otherRDD, [numTasks]) | Joins the rows in both RDDs by matching their keys. Each row of the returned RDD contains a tuple where the first element is the key and the second element is another tuple containing the values from both RDDs. |
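The [numTasks] argument that appears in the table controls how many partitions the resulting RDD has. A minimal sketch, reusing the lenRDD from section 1:

// ask for 2 partitions in the grouped result
val grouped = lenRDD.groupByKey(2);
println(grouped.getNumPartitions); // 2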
Example 1. groupByKey([numTasks])
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val lenRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l.length,l));
lenRDD.groupByKey().foreach(println);
Output
(5,CompactBuffer(spark, spark))
(4,CompactBuffer(this, fun!, cool))
(2,CompactBuffer(is, It, is, is))
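CompactBuffer is simply Spark's Iterable implementation, so the grouped values can be handled like any Scala collection. A minimal sketch (same lenRDD) that replaces each group with its size:

lenRDD.groupByKey().map(t=>(t._1,t._2.size)).collect().foreach(println);
// one (key, count) pair per key, e.g. (2,4)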
Example 2. reduceByKey(func, [numTasks])
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val countRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l,1));
countRDD.reduceByKey((v1,v2)=>v1+v2).collect().foreach(println);
Output
(this,1)
(is,3)
(fun!,1)
(cool,1)
(spark,2)
(It,1)
Example 3. sortByKey([ascending], [numTasks])
The sort is ascending by default; pass false to sort in descending order (see the sketch after this example).
val sc = new SparkContext(conf);
val strArray = List("this is spark","It is fun!","spark is cool");
val strRDD = sc.parallelize(strArray);
val countRDD = strRDD.flatMap(l=>l.split(" ")).map(l=>(l,1));
countRDD.reduceByKey((v1,v2)=>v1+v2).map(t=>(t._2,t._1)).sortByKey().collect().foreach(println);
Output
(1,this)
(1,fun!)
(1,cool)
(1,It)
(2,spark)
(3,is)
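For the descending order mentioned above, the same pipeline with sortByKey(false) prints the most frequent words first. A minimal sketch reusing countRDD:

countRDD.reduceByKey((v1,v2)=>v1+v2).map(t=>(t._2,t._1)).sortByKey(false).collect().foreach(println);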
Example 4. join(otherRDD)
Used to join two RDDs: joining a dataset of type (K, V) with a dataset of type (K, W) produces a dataset of type (K, (V, W)).
val sc = new SparkContext(conf);
val parentRDD = sc.parallelize(List((1,"Jason")));
val childRDD = sc.parallelize(List((1,"Tom"),(1,"Mike")));
parentRDD.join(childRDD).map(t=>t._2._1+"-->"+t._2._2).foreach(println);
Output
Jason-->Tom
Jason-->Mike
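To see the (K, (V, W)) structure from the table directly, the joined RDD can be printed without the map step. A minimal sketch reusing parentRDD and childRDD:

parentRDD.join(childRDD).collect().foreach(println);
// prints (1,(Jason,Tom)) and (1,(Jason,Mike)); the order may vary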
3. Key/Value Pair RDD Actions
| Name | Description |
|------|-------------|
| countByKey() | Returns a map in which each entry contains a key and the count of its values. |
| collectAsMap() | Similar behavior to the collect action, but the return type is a map. |
| lookup(key) | Performs a lookup by key and returns all values that have the specified key. |
Example 1. countByKey()
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).
countByKey().map(t=>{t._1+"-->"+t._2}).foreach(println);
Output
Tom-->2
Jason-->2
Example 2. collectAsMap()
Note that when a key appears more than once, collectAsMap keeps only one value per key, which is why the output below contains a single entry per name.
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).collectAsMap().foreach(println);
Output
(Tom,9)
(Jason,3)
Example 3. lookup(key)
val sc = new SparkContext(conf);
sc.setLogLevel("ERROR");
val mapRDD = sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).reduceByKey((t1,t2)=>t1+t2);
mapRDD.lookup("Jason").foreach(println);
mapRDD.lookup("Tom").foreach(println);
Output
4
10
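As the table states, lookup returns every value stored under the given key, so calling it before the reduceByKey step returns the individual values rather than their sums. A minimal sketch with the same input list:

sc.parallelize(List(("Jason",1),("Jason",3),("Tom",1),("Tom",9))).lookup("Jason").foreach(println);
// prints 1 and 3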