Spark sort

Whether the job is written in Hive or MapReduce, real business requirements almost always involve sorting. Hive offers SORT BY, and MapReduce sorts during the partition phase via the map key's compareTo method. In Spark, sorting is provided mainly by two functions: sortBy and sortByKey.


The usage of each function is analyzed below from its source code.

1. sortByKey

Source:

/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size)
    : RDD[(K, V)] =
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
As the signature shows, sortByKey takes two parameters:

  1. ascending: sort ascending (true) or descending (false)
  2. numPartitions: the number of partitions (analogous to partitions in MapReduce)

sortByKey requires an RDD of pairs. The sort order is determined by

private val ordering = implicitly[Ordering[K]]

in class OrderedRDDFunctions; supplying our own implicit Ordering overrides it and customizes the sort.


Example:

scala> val data = List((1,4),(4,8),(0,4),(12,8))

scala> val rdd = sc.parallelize(data)

scala> rdd.sortByKey().collect

Array[(Int, Int)] = Array((0,4), (1,4), (4,8), (12,8))

Overriding the ordering:

scala> implicit val st = new Ordering[Int] { override def compare(a: Int, b: Int) = a.toString.compare(b.toString) }

scala> rdd.sortByKey().collect

Array[(Int, Int)] = Array((0,4), (1,4), (12,8), (4,8))
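The effect of the implicit Ordering can be reproduced without a cluster. A minimal sketch using plain Scala collections, relying on the same implicit-resolution mechanism that sortByKey uses:

```scala
object CustomOrderingDemo {
  def main(args: Array[String]): Unit = {
    val data = List((1, 4), (4, 8), (0, 4), (12, 8))

    // Default Ordering[Int]: numeric comparison
    val numeric = data.sortBy(_._1)
    println(numeric) // List((0,4), (1,4), (4,8), (12,8))

    // Custom Ordering[Int]: compare the keys as strings, so "12" < "4"
    val st = new Ordering[Int] {
      override def compare(a: Int, b: Int): Int = a.toString.compare(b.toString)
    }
    val lexicographic = data.sortBy(_._1)(st)
    println(lexicographic) // List((0,4), (1,4), (12,8), (4,8))
  }
}
```

Note that 12 sorts before 4 under the custom ordering because the comparison is now on the strings "12" and "4", and '1' < '4'.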


2. sortBy

Source:

/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.size)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
  this.keyBy[K](f)
      .sortByKey(ascending, numPartitions)
      .values


As the signature shows, sortBy takes three parameters:

  1. f, the key-extraction function; the sort order is defined by the keys it produces
  2. ascending: sort ascending (true) or descending (false)
  3. numPartitions: the number of partitions (analogous to partitions in MapReduce)
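The body of sortBy above is just keyBy, then sortByKey, then values. The same three-step pipeline can be sketched with plain Scala collections (a local analogue for illustration, not the Spark API itself):

```scala
object SortByDecomposed {
  def main(args: Array[String]): Unit = {
    val data = List("banana", "apple", "cherry")
    val f: String => Int = _.length // the key function f

    // keyBy: pair each element with its key
    val keyed = data.map(x => (f(x), x)) // List((6,banana), (5,apple), (6,cherry))
    // sortByKey: sort the pairs by key
    val sorted = keyed.sortBy(_._1)      // ascending by length
    // values: drop the keys again
    val result = sorted.map(_._2)
    println(result) // List(apple, banana, cherry)
  }
}
```

Because the sort is by key only, elements with equal keys ("banana" and "cherry", both length 6) keep their relative order here; in Spark the order of ties within a partition is likewise determined by the key ordering alone.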


Example:

rdd1 contains:

(4E7NX0V9kpC0,1432295586,101829181)
(4E7TY0Zq4n55,1432292180,101829181)
(4E7XY0DceR42,1432302403,101829181)


rdd2 contains:
(4E7TY0Zq4n55,1432292180,)
(4E7XY0DceR42,1432302403,)
(4E7XY0DceR42,1432305145,)


Both RDDs consist of triples, i.e. RDD[(String, String, String)].

We want to sort the combined data as a whole.

Code:

val rdd = rdd1.union(rdd2)

// deleteDirIfExisted and BusinessRuntimeException are project-specific helpers
if (!deleteDirIfExisted(fileSystem, new Path("hdfs:/**/result_t1"))) {
  throw new BusinessRuntimeException("Output path in HDFS: hdfs:/**/result_t1 delete failed!")
}
rdd.sortBy(line => sortBasicRddAlgo(line), true, 1)
  .saveAsTextFile("hdfs:/**/result_t1")

The first argument to sortBy is the function f, which builds the key used for comparison:

def sortBasicRddAlgo(line: (String, String, String)): (String, String) = {
  (String.format("%s_%s", line._1, line._2), line._3)
}


Sorted output (the key is a (String, String) pair, and Scala's tuple Ordering compares the first component, then the second; since "" precedes any non-empty string, a row with an empty third field sorts before its non-empty counterpart):

(4E7NX0V9kpC0,1432295586,101829181)

(4E7TY0Zq4n55,1432292180,)

(4E7TY0Zq4n55,1432292180,101829181)

(4E7XY0DceR42,1432302403,)

(4E7XY0DceR42,1432302403,101829181)

(4E7XY0DceR42,1432305145,)
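Since the key function and the data are plain values, the whole sort can be checked locally with Scala collections, mimicking the union + sortBy pipeline above without Spark (Scala provides an implicit Ordering for tuples that compares componentwise):

```scala
object TripleSortDemo {
  // Same key construction as sortBasicRddAlgo: "field1_field2" plus field3
  def sortKey(line: (String, String, String)): (String, String) =
    (s"${line._1}_${line._2}", line._3)

  def main(args: Array[String]): Unit = {
    val rdd1 = List(
      ("4E7NX0V9kpC0", "1432295586", "101829181"),
      ("4E7TY0Zq4n55", "1432292180", "101829181"),
      ("4E7XY0DceR42", "1432302403", "101829181"))
    val rdd2 = List(
      ("4E7TY0Zq4n55", "1432292180", ""),
      ("4E7XY0DceR42", "1432302403", ""),
      ("4E7XY0DceR42", "1432305145", ""))

    // Union, then sort by the composite key; rows with an empty
    // third field come first among equal "field1_field2" keys.
    (rdd1 ++ rdd2).sortBy(sortKey).foreach(println)
  }
}
```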
