Spark RDD: Key/Value (K, V) Pair Operations

This article walks through the operations that Spark provides on key/value pair RDDs.

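1. Creating a key/value pair RDD
import org.apache.spark.{SparkConf, SparkContext}
// The app name and local master are placeholders; adjust them for your environment.
val conf = new SparkConf().setAppName("PairRDDDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
val strArray = List("this is spark", "It is fun!", "spark is cool")
val strRDD = sc.parallelize(strArray)
// Split each line into words, then key every word by its length.
val lenRDD = strRDD.flatMap(l => l.split(" ")).map(l => (l.length, l))
lenRDD.collect().foreach(println)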

Output

(4,this)
(2,is)
(5,spark)
(2,It)
(2,is)
(4,fun!)
(5,spark)
(2,is)
(4,cool)
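The same (length, word) pairs can also be built with the RDD method keyBy, which derives the key from each element. The following is a minimal sketch rather than the article's own code; the app name and the local[2] master are assumptions made only so the snippet can run on its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master and an arbitrary app name, chosen only for this sketch.
val conf = new SparkConf().setAppName("keyBySketch").setMaster("local[2]")
val sc = new SparkContext(conf)

val lines = sc.parallelize(List("this is spark", "It is fun!", "spark is cool"))
val words = lines.flatMap(_.split(" "))

// keyBy(f) pairs each element x with the key f(x), producing (f(x), x).
val keyByResult = words.keyBy(_.length).collect().toList
// The explicit map from the snippet above, kept for comparison.
val mapResult = words.map(w => (w.length, w)).collect().toList

keyByResult.foreach(println)
sc.stop()
```

Both approaches yield the same pairs; keyBy only saves writing the tuple by hand.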
2. Key/Value Pair RDD Transformations
Name | Description
--- | ---
groupByKey([numTasks]) | Groups all values that share a key. For a dataset of (K, V) pairs, the returned RDD has type (K, Iterable[V]).
reduceByKey(func, [numTasks]) | Merges the values that share a key by applying func, reducing them to a single value per key. For a dataset of (K, V) pairs, the returned RDD has type (K, V).
sortByKey([ascending], [numTasks]) | Sorts the rows by key. Keys are sorted in ascending order by default.
join(otherRDD, [numTasks]) | Joins the rows of both RDDs by matching keys. Each row of the returned RDD is a tuple whose first element is the key and whose second element is a tuple holding the values from the two RDDs.
Example 1. groupByKey([numTasks])
val sc = new SparkContext(conf)
val strArray = List("this is spark", "It is fun!", "spark is cool")
val strRDD = sc.parallelize(strArray)
val lenRDD = strRDD.flatMap(l => l.split(" ")).map(l => (l.length, l))
// collect() brings the groups back to the driver so they print there, not on the executors.
lenRDD.groupByKey().collect().foreach(println)
Output (key order may vary)
(5,CompactBuffer(spark, spark))
(4,CompactBuffer(this, fun!, cool))
(2,CompactBuffer(is, It, is, is))
Example 2. reduceByKey(func, [numTasks])
val sc = new SparkContext(conf)
val strArray = List("this is spark", "It is fun!", "spark is cool")
val strRDD = sc.parallelize(strArray)
// Emit (word, 1) for every word, then sum the ones per word.
val countRDD = strRDD.flatMap(l => l.split(" ")).map(l => (l, 1))
countRDD.reduceByKey(_ + _).collect().foreach(println)
Output (key order may vary)
(this,1)
(is,3)
(fun!,1)
(cool,1)
(spark,2)
(It,1)
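The word count above could also be written with groupByKey followed by a sum over each group. The sketch below checks that both routes give the same counts; the SparkConf settings are assumptions for a standalone run. reduceByKey is usually preferred for aggregations like this because it can combine values within each partition before shuffling, whereas groupByKey moves every value across the network first.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master and an arbitrary app name, chosen only for this sketch.
val conf = new SparkConf().setAppName("reduceVsGroupSketch").setMaster("local[2]")
val sc = new SparkContext(conf)

val lines = sc.parallelize(List("this is spark", "It is fun!", "spark is cool"))
val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))

// reduceByKey merges the values of each key with the given function.
val reduced = pairs.reduceByKey(_ + _).collect().toMap
// groupByKey gathers all values per key; summing each group gives the same counts.
val grouped = pairs.groupByKey().mapValues(_.sum).collect().toMap

reduced.foreach(println)
sc.stop()
```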
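Example 3. sortByKey([ascending], [numTasks])
Keys are sorted in ascending order by default; pass false to sort in descending order.
val sc = new SparkContext(conf)
val strArray = List("this is spark", "It is fun!", "spark is cool")
val strRDD = sc.parallelize(strArray)
val countRDD = strRDD.flatMap(l => l.split(" ")).map(l => (l, 1))
// Swap each (word, count) to (count, word) so that the count becomes the sort key.
countRDD.reduceByKey(_ + _).map(t => (t._2, t._1)).sortByKey().collect().foreach(println)
Output (rows with equal keys may appear in any order)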
(1,this)
(1,fun!)
(1,cool)
(1,It)
(2,spark)
(3,is)
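The descending variant mentioned in this example only requires passing false as the first argument. A minimal sketch, with the SparkConf settings assumed for a standalone run:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master and an arbitrary app name, chosen only for this sketch.
val conf = new SparkConf().setAppName("sortDescSketch").setMaster("local[2]")
val sc = new SparkContext(conf)

val lines = sc.parallelize(List("this is spark", "It is fun!", "spark is cool"))
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

// Swap to (count, word) so the count is the key, then sort with ascending = false.
val sortedDesc = counts.map(_.swap).sortByKey(ascending = false).collect().toList

sortedDesc.foreach(println)
sc.stop()
```

The most frequent word now comes first; words with the same count can still appear in any order relative to each other.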
Example 4. join(otherRDD)
Joins two RDDs by key: joining a dataset of type (K, V) with a dataset of type (K, W) yields a dataset of type (K, (V, W)).
val sc = new SparkContext(conf)
val parentRDD = sc.parallelize(List((1, "Jason")))
val childRDD = sc.parallelize(List((1, "Tom"), (1, "Mike")))
// Each joined row is (key, (parent, child)); collect() so the lines print on the driver.
parentRDD.join(childRDD).map(t => t._2._1 + "-->" + t._2._2).collect().foreach(println)
Output
Jason-->Tom
Jason-->Mike
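join is an inner join, so a key that appears in only one of the two RDDs is dropped from the result. The sketch below illustrates this and contrasts it with leftOuterJoin; the extra parent (2, "Anna") is hypothetical data added for illustration, and the SparkConf settings are assumptions for a standalone run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master and an arbitrary app name, chosen only for this sketch.
val conf = new SparkConf().setAppName("joinSketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// (2, "Anna") is hypothetical: a parent with no matching child row.
val parents = sc.parallelize(List((1, "Jason"), (2, "Anna")))
val children = sc.parallelize(List((1, "Tom"), (1, "Mike")))

// Inner join: key 2 has no match in children, so it does not appear in the result.
val inner = parents.join(children).collect().toList
// leftOuterJoin keeps every key of the left RDD; a missing right side becomes None.
val leftOuter = parents.leftOuterJoin(children).collect().toList

inner.foreach(println)
leftOuter.foreach(println)
sc.stop()
```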
3. Key/Value Pair RDD Actions
Name | Description
--- | ---
countByKey() | Returns a map in which each entry holds a key and the number of values for that key.
collectAsMap() | Behaves like the collect action, but returns a map. If a key occurs more than once, only one of its values is kept.
lookup(key) | Looks up a key and returns all values stored under it.
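Example 1. countByKey()
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
// countByKey returns a local map from each key to the number of values it has.
val counts = sc.parallelize(List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9))).countByKey()
counts.map(t => t._1 + "-->" + t._2).foreach(println)
Output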
Tom-->2
Jason-->2
Example 2. collectAsMap()
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
// Each key appears twice in the input, but the resulting map holds one value per key.
sc.parallelize(List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9))).collectAsMap().foreach(println)
Output
(Tom,9)
(Jason,3)
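Because collectAsMap keeps a single value per key, duplicate keys lose data, as the output above shows. One way to keep every value, sketched here under assumed SparkConf settings, is to group by key first so that each key maps to the full list of its values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master and an arbitrary app name, chosen only for this sketch.
val conf = new SparkConf().setAppName("collectAsMapSketch").setMaster("local[2]")
val sc = new SparkContext(conf)

val scores = sc.parallelize(List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9)))

// Plain collectAsMap: one entry per key, the other values are discarded.
val single = scores.collectAsMap()
// Group first, then collect: each key maps to all of its values (sorted for stable display).
val all = scores.groupByKey().mapValues(_.toList.sorted).collectAsMap()

all.foreach(println)
sc.stop()
```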
Example 3. lookup(key)
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
// Sum the values per key first, so each lookup below returns a single total.
val mapRDD = sc.parallelize(List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9))).reduceByKey(_ + _)
mapRDD.lookup("Jason").foreach(println)
mapRDD.lookup("Tom").foreach(println)
Output
4
10
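lookup returns a sequence, which is easy to miss in the example above because reduceByKey leaves one value per key. The following sketch, with assumed SparkConf settings and a hypothetical absent key "Alice", shows the other two cases: several values under one key, and a key that does not exist.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master and an arbitrary app name, chosen only for this sketch.
val conf = new SparkConf().setAppName("lookupSketch").setMaster("local[2]")
val sc = new SparkContext(conf)

val scores = sc.parallelize(List(("Jason", 1), ("Jason", 3), ("Tom", 1), ("Tom", 9)))

// Without a prior reduceByKey, lookup returns every value stored under the key.
val jasonValues = scores.lookup("Jason")
// "Alice" is a hypothetical key that is not in the data: the result is an empty sequence.
val missing = scores.lookup("Alice")

println(jasonValues.sorted)
println(missing.isEmpty)
sc.stop()
```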
