Spark: taking N rows from each group

If you use the groupBy API and collect each group's rows, a single large group can cause an OOM. A window function with row_number lets you keep only the first N rows per group instead:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rand, row_number}

// Randomly order the rows within each group, then keep only the first 100 per group.
val windowFun = Window.partitionBy("groupby_column").orderBy(rand())
val resultDF = dataDF
  .withColumn("rank", row_number().over(windowFun))
  .filter("rank <= 100")
  // .map needs an implicit Encoder for its result type (import spark.implicits._ for common types)
  .map((row: Row) => {
    //...
  })
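For reference, here is a minimal self-contained sketch of the same technique. The column names (groupby_column, value), the sample data, and the limit of 2 rows per group are assumptions for illustration only; ordering by rand() gives a random sample per group, and you can order by a real column instead if you want the top N by some metric.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rand, row_number}

object TopNPerGroup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("TopNPerGroup").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: (groupby_column, value)
    val dataDF = Seq(
      ("a", 1), ("a", 2), ("a", 3),
      ("b", 4), ("b", 5),
      ("c", 6)
    ).toDF("groupby_column", "value")

    // Keep at most 2 randomly chosen rows per group; rows beyond the limit are
    // filtered out, so no group ever has to be fully collected into memory.
    val windowFun = Window.partitionBy("groupby_column").orderBy(rand())
    val topN = dataDF
      .withColumn("rank", row_number().over(windowFun))
      .filter($"rank" <= 2)
      .drop("rank")

    topN.show()
    spark.stop()
  }
}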
