Spark Notebook

RDD transformations

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#transformations

map(x => (x._2, x._1))              // swap the two elements of each pair
flatMap(line => line.split(' '))    // one output element per word; flattens Options (drops None) but will not flatten a tuple like (1, 2, 3)
filter(x => x > 0)                  // keep elements for which the predicate returns true (filter needs a Boolean predicate, not a side effect like println)
distinct
sample
union
...
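A minimal sketch of these transformations, assuming a local `SparkContext` named `sc` (e.g. from `spark-shell`); the sample data is made up for illustration:

```scala
val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
val swapped = pairs.map(x => (x._2, x._1))       // ("a", 1), ("b", 2)

val lines = sc.parallelize(Seq("hello world", "hi"))
val words = lines.flatMap(_.split(' '))          // "hello", "world", "hi"

val nums = sc.parallelize(Seq(1, -2, 3))
val pos = nums.filter(_ > 0)                     // 1, 3
val uniq = nums.union(nums).distinct()           // each element appears once
```

None of these lines touch the data yet; they only build up the lineage graph until an action runs.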

RDD actions

Nothing is computed until an action is called (lazy evaluation).
Think about how to minimize shuffle operations.
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#actions

collect
count
...
val result = someRdd.sortBy(_._1, ascending = false).collect()
// sort by the first column from high to low, then bring the results to the driver
// (an RDD has no toSeq, and sortBy's second argument is named ascending)
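A short sketch of common actions, again assuming `sc` from `spark-shell` and made-up sample data:

```scala
val rdd = sc.parallelize(Seq(3, 1, 2))
val n = rdd.count()        // forces evaluation; returns 3
val all = rdd.collect()    // pulls everything to the driver; avoid on huge RDDs
val top = rdd.take(2)      // first 2 elements without a full collect
```

`collect` is convenient in notes and demos, but on real data prefer bounded actions like `take`, `count`, or writing out with `saveAsTextFile`.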

Key-value RDDs

reduceByKey ( _ + _ )
groupByKey
sortByKey
keys
values
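A sketch contrasting these pair-RDD operations, assuming `sc` from `spark-shell`:

```scala
val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums = kv.reduceByKey(_ + _)   // ("a", 4), ("b", 2); combines map-side, so less shuffle
val grouped = kv.groupByKey()      // ("a", Iterable(1, 3)), ...; shuffles every value
val sorted = kv.sortByKey()        // ordered by key
val ks = kv.keys                   // "a", "b", "a"
val vs = kv.values                 // 1, 2, 3
```

When you only need an aggregate per key, prefer `reduceByKey` over `groupByKey`: it pre-aggregates on each partition before shuffling.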

Find the average value per key (using a tuple as the value)

val totals = rdd
  .mapValues(x => (x, 1))                             // pair each value with a count of 1
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))  // per key: (sum, count)
val avg = totals.mapValues(x => x._1.toDouble / x._2) // toDouble avoids integer division
val ans = avg.collect()
ans.sorted.foreach(println)                           // sort on the driver by key

Broadcast variables
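A sketch of a broadcast variable, assuming `sc` from `spark-shell`; the lookup table is a made-up example. Broadcasting ships one read-only copy of the data to each executor, instead of serializing it with every task:

```scala
val lookup = Map(1 -> "one", 2 -> "two")   // small table to share with all executors
val bc = sc.broadcast(lookup)
val named = sc.parallelize(Seq(1, 2, 3))
  .map(id => bc.value.getOrElse(id, "?"))  // read the broadcast copy on the executor
```

This is most useful for joining a large RDD against a small in-memory table (a "map-side join"), avoiding a shuffle entirely.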
