Three Implementations of Grouped Top N with Spark RDDs

Optimizing grouped Top N over Spark RDDs

    • Approach
    • ETL
    • Method 1: groupBy implementation
    • Method 2: repartitionAndSortWithinPartitions implementation
    • Method 3: partitionBy + mapPartitions implementation
    • Summary

Approach

1. Group the records by key
2. Sort within each group
3. Take the top N of each group
4. Merge the results (see the sketch below)
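
A minimal sketch of these four steps on a toy dataset, assuming an existing SparkContext named sc; the sample pairs and the name topNPerKey are made up purely for illustration:

//1. group, 2. sort each group descending, 3. keep the top N (here N = 2), 4. flatten back to rows
val demo = sc.parallelize(Seq(("a.com", 5), ("a.com", 9), ("a.com", 1), ("b.com", 7), ("b.com", 3)))
val topNPerKey = demo
  .groupByKey()
  .mapValues(_.toSeq.sortBy(v => -v).take(2))
  .flatMap { case (k, vs) => vs.map(v => (k, v)) }
topNPerKey.collect().foreach(println)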

ETL

//ETL: each input line is expected to have three tab-separated fields: url, time, traffic
val etlRdd = sourceRdd.map(line => {
  var url = "-"
  var fileResource = "-"
  var timeStr = "-"
  var traffic = 0
  val strArr = line.split("\t")
  if (strArr.length == 3) {
    val urlStr = strArr(0)
    val splitIndex = urlStr.indexOf(".com/")
    if (splitIndex >= 0) {
      //split the full URL into host and resource path
      url = urlStr.substring(0, splitIndex + 4)
      fileResource = urlStr.substring(splitIndex + 4)
      timeStr = strArr(1)
      try {
        traffic = strArr(2).toInt
      } catch {
        case ex: Exception => println(ex)
      }
    }
  }
  //(url, resource, time, traffic); malformed lines keep the default values
  (url, fileResource, timeStr, traffic)
})
etlRdd.cache()

val topN = 10
//aggregate by (url, resource): ((url, resource), (hit count, total traffic))
val reducedRdd = etlRdd.map(x => ((x._1, x._2), (1, x._4)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
reducedRdd.cache()
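
A quick way to sanity-check the aggregation is to peek at a few records; each one should have the shape ((url, resource), (hits, totalTraffic)):

//print a handful of aggregated records for inspection
reducedRdd.take(5).foreach(println)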

Method 1: groupBy implementation

//Method 1: about 2.5s
//group, sort, and take the top N of each group
val topNRdd = reducedRdd
  .map(x => (x._1._1, x._1._2, x._2._1, x._2._2))
  //group by url
  .groupBy(x => x._1)
  //RDD[(K, Iterable[T])]
  .map(x => {
    //rows belonging to the same group
    val rawRows = x._2.toBuffer
    //sort by hit count, descending
    val sortedRows = rawRows.sortBy(_._3).reverse
    //drop everything after the top N
    if (sortedRows.size > topN) {
      sortedRows.remove(topN, sortedRows.length - topN)
    }
    //back to RDD[(K, Iterable[T])]
    (x._1, sortedRows.toIterator)
  })
//flatten RDD[(K, Iterable[T])] into rows
topNRdd.flatMap(x => {
  for (y <- x._2) yield (y._1, y._2, y._3, y._4)
})
  //merge into a single output file
  .repartition(1)
  .saveAsTextFile("output/spark-task/01_1")

Measured runtime: about 2.5 s (timing screenshot omitted).

Drawback: the groupBy operator performs poorly here, because every row of a group is shuffled to a single task before any sorting can happen.

Method 2: repartitionAndSortWithinPartitions implementation

//number of distinct urls, used as the partition count
val partitions = etlRdd.map(x => (x._1, 1)).countByKey().size

//custom partitioner: partition by the grouping key (the url)
class MyPartitioner(partitions: Int) extends Partitioner {
  //number of partitions
  override def numPartitions: Int = partitions
  //route each record by the url part of the key
  override def getPartition(key: Any): Int = {
    val keyTuple = key.asInstanceOf[(String, Int)]
    Math.abs(keyTuple._1.hashCode % numPartitions)
  }
}

//custom ordering: the key is (url, hit count), sort by hit count descending
implicit val keyOrdering: Ordering[(String, Int)] = new Ordering[(String, Int)] {
  override def compare(x: (String, Int), y: (String, Int)): Int = y._2.compareTo(x._2)
}

//((url, resource), (hits, traffic)) => ((url, hits), (resource, traffic))
//the key is (url, hits): the url drives partitioning, the hit count drives the sort
reducedRdd.map(x => ((x._1._1, x._2._1), (x._1._2, x._2._2)))
  //partition and sort during the shuffle
  .repartitionAndSortWithinPartitions(new MyPartitioner(partitions))
  //keep only the top N rows of each partition
  .mapPartitions(x => {
    val rows = x.toBuffer
    if (rows.size > topN) {
      rows.remove(topN, rows.length - topN)
    }
    rows.toIterator
  })
  //merge into a single output file
  .repartition(1)
  .saveAsTextFile("output/spark-task/01_2")

Measured runtime: (timing screenshot omitted).

**Pros:** compared with groupBy, the data is sorted while it is being partitioned during the shuffle, which improves performance considerably.
**Cons:** if there are many distinct key values, this produces a large number of partitions.
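
To check how the keys end up spread across partitions (and whether the partition count is getting out of hand), a small diagnostic like the one below can help. This is only a sketch: partitionedRdd stands in for the RDD returned by repartitionAndSortWithinPartitions above.

//count how many rows each partition holds, e.g. to spot skew or an excessive partition count
partitionedRdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }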

Method 3: partitionBy + mapPartitions implementation

//Method 3
val partitions3 = etlRdd.map(x => (x._1, 1)).countByKey().size
reducedRdd
  //((url, resource), (hits, traffic)) => (url, (resource, hits, traffic))
  .map(x => (x._1._1, (x._1._2, x._2._1, x._2._2)))
  //partition directly by the grouping key (requires org.apache.spark.HashPartitioner)
  .partitionBy(new HashPartitioner(partitions3))
  //process each partition: sort by hit count descending and keep the top N
  .mapPartitions(x => {
    val buffer = x.toBuffer
    val sorted = buffer.sortBy(row => row._2._2).reverse
    if (sorted.length > topN) {
      sorted.remove(topN, sorted.length - topN)
    }
    sorted.toIterator
  })
  .repartition(1)
  .saveAsTextFile("output/spark-task/01_3")

Summary

A detailed comparison of the three methods. To be continued...
