SparkSQL Custom Functions

UDF Functions

// Register the function; once registered it can be used anywhere in the Application.
// Note: the lambda parameter needs an explicit type (x: String) for registration to compile;
// the name usable in SQL is the one passed to register ("add" here).
val addName = sparkSession.udf.register("add", (x: String) => x + "-")
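Putting registration and invocation together, a minimal end-to-end sketch (the view name `people` and column `name` are assumptions for illustration, and `local[*]` is used only to make the example self-contained):

```scala
import org.apache.spark.sql.SparkSession

object UdfDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-demo")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Register the UDF under the name "add"; the explicit (x: String)
    // parameter type is required.
    spark.udf.register("add", (x: String) => x + "-")

    Seq("Alice", "Bob").toDF("name").createOrReplaceTempView("people")
    // Call the registered UDF from SQL by its registered name
    spark.sql("SELECT add(name) AS tagged FROM people").show()

    spark.stop()
  }
}
```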

UDAF Functions (strongly typed)

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// Input row type (assumed here, since the original post does not define it)
case class People(name: String, age: Int)
// Intermediate buffer: running sum and count
case class Average(var sum: Int, var count: Int)

// Type parameters: input type, intermediate buffer type, output type
class UDAF2 extends Aggregator[People, Average, Double] {
  // Initial (zero) value of the buffer
  override def zero: Average = Average(0, 0)
  // Partition-level aggregation: fold one input row into the buffer
  override def reduce(b: Average, a: People): Average = {
    b.sum = b.sum + a.age
    b.count = b.count + 1
    b
  }
  // Merge the buffers produced by different partitions
  override def merge(b1: Average, b2: Average): Average = {
    b1.sum = b1.sum + b2.sum
    b1.count = b1.count + b2.count
    b1
  }
  // Produce the final result from the merged buffer
  override def finish(reduction: Average): Double = reduction.sum / reduction.count.toDouble
  // Encoders for the buffer type and the output type
  override def bufferEncoder: Encoder[Average] = Encoders.product
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
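A typed Aggregator is applied to a Dataset through `toColumn`. A minimal usage sketch, assuming the `People` and `UDAF2` definitions above and a `SparkSession` named `spark` (the sample data is made up for illustration):

```scala
import spark.implicits._

// Build a small typed Dataset[People]
val ds = Seq(People("a", 20), People("b", 30)).toDS()

// Turn the Aggregator into a TypedColumn and apply it to the Dataset;
// the single result row is the average age, (20 + 30) / 2 = 25.0
val avgAge = new UDAF2().toColumn.name("avg_age")
ds.select(avgAge).show()
```

In Spark 3.x the same Aggregator can also be registered for use in SQL via `spark.udf.register("myAvg", org.apache.spark.sql.functions.udaf(new UDAF2()))`.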

Built-in Functions

count(), distinct(), sum(), etc.
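The same built-ins are available in the DataFrame API via `org.apache.spark.sql.functions`. A minimal sketch, assuming a local session and a made-up sample table:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object BuiltinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("builtin-demo")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 34), ("Bob", 28), ("Alice", 41)).toDF("name", "age")

    // count / countDistinct / sum / avg applied in one aggregation
    df.agg(
      count("age").as("rows"),
      countDistinct("name").as("distinct_names"),
      sum("age").as("total_age"),
      avg("age").as("avg_age")).show()

    spark.stop()
  }
}
```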

Window Functions

row_number() over(partition by ... order by ...)
rank() over(partition by ... order by ...)
dense_rank() over(partition by ... order by ...)
count() over(partition by ... order by ...)
max() over(partition by ... order by ...)
min() over(partition by ... order by ...)
sum() over(partition by ... order by ...)
avg() over(partition by ... order by ...)
first_value() over(partition by ... order by ...)
last_value() over(partition by ... order by ...)
lag() over(partition by ... order by ...)
lead() over(partition by ... order by ...)
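All of the patterns above share the same `over(partition by ... order by ...)` shape. A sketch of one of them, `rank()`, run through Spark SQL (the table `people` and its columns are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("window-demo")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    Seq(("sales", "Alice", 34), ("sales", "Bob", 28), ("hr", "Carol", 41))
      .toDF("dept", "name", "age")
      .createOrReplaceTempView("people")

    // rank() over(partition by ... order by ...):
    // rank people within each department by descending age
    spark.sql(
      """SELECT dept, name, age,
        |       rank() OVER (PARTITION BY dept ORDER BY age DESC) AS age_rank
        |FROM people""".stripMargin).show()

    spark.stop()
  }
}
```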
