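Whether you work with Hive or MapReduce, real business requirements almost always involve sorting. Hive provides sort by, and MapReduce sorts during the partition phase using the compareTo method of the map key. Spark has two main sorting functions: sortBy and sortByKey.
The usage of these sorting functions is analyzed below based on their source code.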
1. sortByKey
Source code:
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size)
    : RDD[(K, V)] =
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
As the source shows, sortByKey takes two parameters: ascending (the sort direction, true by default) and numPartitions (the number of partitions of the result, which defaults to the partition count of the input RDD).
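The ascending and numPartitions parameters can be exercised with a small standalone program. This is only a sketch: the object name, the app name, and the local[2] master are assumptions made for illustration, and it assumes Spark 1.3 or later, where the implicit conversion to OrderedRDDFunctions is available without importing SparkContext._.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByKeyParamsSketch {
  // Sort key-value pairs by key in descending order, requesting two output partitions.
  def sortDescending(sc: SparkContext, data: Seq[(Int, Int)]): Array[(Int, Int)] =
    sc.parallelize(data)
      .sortByKey(ascending = false, numPartitions = 2)
      .collect()

  def main(args: Array[String]): Unit = {
    // App name and local[2] master are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("sortByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sorted = sortDescending(sc, Seq((1, 4), (4, 8), (0, 4), (12, 8)))
      println(sorted.mkString(", ")) // (12,8), (4,8), (1,4), (0,4)
    } finally {
      sc.stop()
    }
  }
}
```

Because the range partitioner places each key range in its own partition, collect returns the records in one global order even when more than one partition is requested.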
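The RDD that sortByKey operates on must consist of two-element tuples (key-value pairs). The sort order is determined by the following member of class OrderedRDDFunctions:
private val ordering = implicitly[Ordering[K]]
Since this value is resolved implicitly, we can put our own implicit Ordering[K] in scope to customize the sort order.
Example: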
scala> val data = List((1,4),(4,8),(0,4),(12,8))
scala> val rdd = sc.parallelize(data)
scala> rdd.sortByKey().collect
Array[(Int, Int)] = Array((0,4), (1,4), (4,8), (12,8))
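Overriding the ordering (here the Int keys are compared as strings):
scala> implicit val st = new Ordering[Int] { override def compare(a: Int, b: Int) = a.toString.compare(b.toString) }
scala> rdd.sortByKey().collect
Array[(Int, Int)] = Array((0,4), (1,4), (12,8), (4,8))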
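An implicit Ordering[Int] defined at the top level of a shell session stays in scope for later sorts on Int keys as well. One possible way to limit its effect is to define it inside a method, so that only the sortByKey call in that method sees it. The sketch below assumes Spark 1.3 or later; the object and method names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ScopedOrderingSketch {
  // Sort by key, comparing the Int keys by their string representation.
  def sortKeysAsStrings(sc: SparkContext, data: Seq[(Int, Int)]): Array[(Int, Int)] = {
    // A local implicit takes precedence over the default Ordering.Int,
    // and it is visible only inside this method.
    implicit val byString: Ordering[Int] = new Ordering[Int] {
      override def compare(a: Int, b: Int): Int = a.toString.compareTo(b.toString)
    }
    sc.parallelize(data).sortByKey().collect()
  }

  // Sort by key with the default numeric ordering, for comparison.
  def sortKeysNumerically(sc: SparkContext, data: Seq[(Int, Int)]): Array[(Int, Int)] =
    sc.parallelize(data).sortByKey().collect()

  def main(args: Array[String]): Unit = {
    // App name and local[2] master are assumptions for a standalone run.
    val sc = new SparkContext(
      new SparkConf().setAppName("scoped-ordering-sketch").setMaster("local[2]"))
    try {
      val data = Seq((1, 4), (4, 8), (0, 4), (12, 8))
      println(sortKeysAsStrings(sc, data).mkString(", "))   // (0,4), (1,4), (12,8), (4,8)
      println(sortKeysNumerically(sc, data).mkString(", ")) // (0,4), (1,4), (4,8), (12,8)
    } finally {
      sc.stop()
    }
  }
}
```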
2. sortBy
Source code:
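/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.size)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
As the source shows, sortBy takes three parameters: the key function f, ascending, and numPartitions, plus an implicit Ordering[K] and ClassTag[K]. Internally it builds a key for every element with keyBy(f), sorts the resulting pairs with sortByKey, and then drops the keys again with values.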
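The relationship between sortBy and sortByKey in the source above can be checked with a small sketch that sorts the same records both ways. The object name, the sample records, and the local[2] master are assumptions for illustration, and Spark 1.3 or later is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortBySketch {
  // Sort records by their second field using sortBy directly.
  def viaSortBy(sc: SparkContext, data: Seq[(String, Int)]): Array[(String, Int)] =
    sc.parallelize(data).sortBy(_._2).collect()

  // The same pipeline spelled out by hand: keyBy, then sortByKey, then values.
  def viaKeyBy(sc: SparkContext, data: Seq[(String, Int)]): Array[(String, Int)] =
    sc.parallelize(data).keyBy(_._2).sortByKey().values.collect()

  def main(args: Array[String]): Unit = {
    // App name and local[2] master are assumptions for a standalone run.
    val sc = new SparkContext(
      new SparkConf().setAppName("sortBy-sketch").setMaster("local[2]"))
    try {
      val data = Seq(("a", 3), ("b", 1), ("c", 2))
      println(viaSortBy(sc, data).mkString(", ")) // (b,1), (c,2), (a,3)
      println(viaKeyBy(sc, data).mkString(", "))  // (b,1), (c,2), (a,3)
    } finally {
      sc.stop()
    }
  }
}
```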
An example follows.
rdd1 contains the following data:
(4E7NX0V9kpC0,1432295586,101829181)
(4E7TY0Zq4n55,1432292180,101829181)
(4E7XY0DceR42,1432302403,101829181)
rdd2 contains the following data:
(4E7TY0Zq4n55,1432292180,)
(4E7XY0DceR42,1432302403,)
(4E7XY0DceR42,1432305145,)
Both data sets are RDDs of triples, RDD[(String, String, String)].
The combined data needs to be sorted as a whole.
The code is as follows:
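// Merge the two data sets into one RDD
val rdd = rdd1.union(rdd2)
// Remove any existing output directory, since saveAsTextFile fails if the path already exists
if (!deleteDirIfExisted(fileSystem, new Path("hdfs:/**/result_t1"))) {
  throw new BusinessRuntimeException("Output path in hdfs :%s delete failed!")
}
// Sort ascending into a single partition so that the result is one ordered file
rdd.sortBy(line => sortBasicRddAlgo(line), true, 1)
  .saveAsTextFile("hdfs:/**/result_t1")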
The first argument of sortBy, the function f, builds the key that is used for comparison:
def sortBasicRddAlgo(line: (String, String, String)): (String, String) = {
  (String.format("%s_%s", line._1, line._2), line._3)
}
Sorted output:
(4E7NX0V9kpC0,1432295586,101829181)
(4E7TY0Zq4n55,1432292180,101829181)
(4E7TY0Zq4n55,1432292180,)
(4E7XY0DceR42,1432302403,101829181)
(4E7XY0DceR42,1432305145,)
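The key returned by sortBasicRddAlgo is a pair, and Scala's default Ordering for a pair compares the first elements and uses the second only to break ties. The sketch below illustrates this on a local collection instead of an RDD, which is an assumption made to keep it runnable without a cluster; the object name and the sample records are hypothetical.

```scala
object TupleKeySketch {
  // Same shape as sortBasicRddAlgo: the key is ("<field1>_<field2>", field3).
  def compositeKey(line: (String, String, String)): (String, String) =
    (s"${line._1}_${line._2}", line._3)

  // Local stand-in for rdd.sortBy(compositeKey): Seq.sortBy resolves the same
  // implicit Ordering[(String, String)] that RDD.sortBy would use.
  def sortLines(lines: Seq[(String, String, String)]): Seq[(String, String, String)] =
    lines.sortBy(compositeKey)

  def main(args: Array[String]): Unit = {
    val lines = Seq(("b", "1", "x"), ("a", "2", "y"), ("a", "1", "z"), ("a", "1", "k"))
    // Records with the same first two fields are ordered by the third field.
    sortLines(lines).foreach(println)
    // (a,1,k)
    // (a,1,z)
    // (a,2,y)
    // (b,1,x)
  }
}
```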