An Introduction to the Two Forms of Window Function Usage with Spark DataFrames

1. Overview

The previous article covered the commonly used operators on Spark DataFrames. Beyond those, Spark also offers a rather special class of operations: window functions.

Window functions are most often seen in SQL, and Spark SQL supports them as well. The DataFrame API provides the same functions, but the Spark SQL form and the DataFrame form are written differently.

1.1 Spark SQL form

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
from userlogs_date
where dt=20210720

1.2 Spark DataFrame form

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._   // sum()
import spark.implicits._                   // the $"..." column syntax
val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")
df_userlogs_date.select(
    $"pcode",
    $"event_date",
    sum($"duration").over(first_2_now_window).as("sum_duration")
).show

1.3 Window function forms

A window function takes the form over(partition by A order by B): the rows are partitioned by A, ordered by B within each partition, and some computation, such as count or max, is then performed over that window. The common forms are listed below; a DataFrame-API sketch of a few of them follows the list.

count(...) over(partition by ... order by ...) -- count of rows within the group.
sum(...) over(partition by ... order by ...) -- sum within the group.
max(...) over(partition by ... order by ...) -- maximum within the group.
min(...) over(partition by ... order by ...) -- minimum within the group.
avg(...) over(partition by ... order by ...) -- average within the group.
rank() over(partition by ... order by ...) -- rank; values may have gaps after ties.
dense_rank() over(partition by ... order by ...) -- rank; values are consecutive, with no gaps.
first_value(...) over(partition by ... order by ...) -- first value within the group.
last_value(...) over(partition by ... order by ...) -- last value within the group.
lag(...) over(partition by ... order by ...) -- value from the row n rows before the current row.
lead(...) over(partition by ... order by ...) -- value from the row n rows after the current row.
ratio_to_report(...) over(partition by ... order by ...) -- the expression inside ratio_to_report() is the numerator; the window defined in over() supplies the denominator.
percent_rank() over(partition by ... order by ...) -- relative (percentile) rank of the current row within its partition.
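
Most of these are also available through the DataFrame API. Below is a minimal sketch of a few of them, reusing the df_userlogs_date columns (pcode, event_date, duration) from section 1.2; it assumes a SparkSession named spark is in scope for the implicits import.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val w = Window.partitionBy("pcode").orderBy("event_date")

df_userlogs_date.select(
  $"pcode",
  $"event_date",
  rank().over(w).as("rank"),                        // may leave gaps after ties
  dense_rank().over(w).as("dense_rank"),            // consecutive, no gaps
  lag($"duration", 1).over(w).as("prev_duration"),  // duration of the previous row in the window
  lead($"duration", 1).over(w).as("next_duration")  // duration of the next row in the window
).show()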

2. Window function examples in Spark SQL form

Window functions make it easy to implement logic such as:

a. the percentage each row contributes to its group after aggregation
b. a cumulative (running) total over historical data

2.1 Percentage of each row within its group

val data = spark.read.json(spark.createDataset(
      Seq(
         """{"name":"A","lesson":"Math","score":100}""",
         """{"name":"B","lesson":"Math","score":100}""",
         """{"name":"C","lesson":"Math","score":99}""",
         """{"name":"D","lesson":"Math","score":98}""",
         """{"name":"A","lesson":"English","score":100}""",
         """{"name":"B","lesson":"English","score":99}""",
         """{"name":"C","lesson":"English","score":99}""",
         """{"name":"D","lesson":"English","score":98}"""
      )))

data.show

scala> data.show
+-------+----+-----+
| lesson|name|score|
+-------+----+-----+
|   Math|   A|  100|
|   Math|   B|  100|
|   Math|   C|   99|
|   Math|   D|   98|
|English|   A|  100|
|English|   B|   99|
|English|   C|   99|
|English|   D|   98|
+-------+----+-----+

data.createOrReplaceTempView("score")

// Each person's score in a single subject as a share of all scores (y1) and of that person's own total (y2)
spark.sql(
      s"""
         |select name, lesson, score, (score/sum(score) over()) as y1, (score/sum(score) over(partition by name)) as y2
         |from score
         |""".stripMargin).show


+----+-------+-----+-------------------+-------------------+
|name| lesson|score|                 y1|                 y2|
+----+-------+-----+-------------------+-------------------+
|   B|   Math|  100|0.12610340479192939| 0.5025125628140703|
|   B|English|   99|0.12484237074401008|0.49748743718592964|
|   D|   Math|   98| 0.1235813366960908|                0.5|
|   D|English|   98| 0.1235813366960908|                0.5|
|   C|   Math|   99|0.12484237074401008|                0.5|
|   C|English|   99|0.12484237074401008|                0.5|
|   A|   Math|  100|0.12610340479192939|                0.5|
|   A|English|  100|0.12610340479192939|                0.5|
+----+-------+-----+-------------------+-------------------+
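
For comparison, a rough DataFrame-API equivalent of the query above might look like the sketch below (not from the original); calling sum(...).over() with an empty over clause plays the role of sum(score) over().

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

data.select(
  $"name", $"lesson", $"score",
  ($"score" / sum($"score").over()).as("y1"),                           // share of the sum over all rows
  ($"score" / sum($"score").over(Window.partitionBy("name"))).as("y2")  // share of this person's own total
).show()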

2.2 Cumulative totals over historical data

For example, suppose we need the cumulative number of items built per year from 2018 through 2020.

val data1 = spark.read.json(spark.createDataset(
      Seq(
        """{"date":"2020-01-01","build":1}""",
        """{"date":"2020-01-01","build":1}""",
        """{"date":"2020-04-01","build":1}""",
        """{"date":"2020-04-01","build":1}""",
        """{"date":"2020-05-01","build":1}""",
        """{"date":"2020-09-01","build":1}""",
        """{"date":"2019-01-01","build":1}""",
        """{"date":"2019-01-01","build":1}""",
        """{"date":"2018-01-01","build":1}"""
      )))

data1.show()
+-----+----------+
|build|      date|
+-----+----------+
|    1|2020-01-01|
|    1|2020-01-01|
|    1|2020-04-01|
|    1|2020-04-01|
|    1|2020-05-01|
|    1|2020-09-01|
|    1|2019-01-01|
|    1|2019-01-01|
|    1|2018-01-01|
+-----+----------+


data1.createOrReplaceTempView("data1")
/**
 * Cumulative total over history
 */
// Running total of the build column, accumulated by year
spark.sql(
      s"""
         |select c.dd,sum(c.sum_build) over(partition by 1 order by dd asc) from
         |(select  substring(date,0,4) as dd, sum(build) as sum_build  from data1 group by dd) c
         |""".stripMargin).show

+----+------------------------------------------------------------------------------------------------------------------+
|  dd|sum(sum_build) OVER (PARTITION BY 1 ORDER BY dd ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|
+----+------------------------------------------------------------------------------------------------------------------+
|2018|                                                                                                                 1|
|2019|                                                                                                                 3|
|2020|                                                                                                                 9|
+----+------------------------------------------------------------------------------------------------------------------+


spark.sql(
      s"""
         |select c.dd,sum(c.sum_build) over (partition by 1) from
         |(select  substring(date,0,4) as dd, sum(build) as sum_build  from data1 group by dd) c
         |""".stripMargin).show    

+----+---------------------------------------------------------------------------------------------+
|  dd|sum(sum_build) OVER (PARTITION BY 1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|
+----+---------------------------------------------------------------------------------------------+
|2020|                                                                                            9|
|2019|                                                                                            9|
|2018|                                                                                            9|
+----+---------------------------------------------------------------------------------------------+
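
The difference between the two results comes from the default window frame: with an order by, the frame defaults to range between unbounded preceding and current row, which gives a running total; without it, the frame covers the whole partition, so every row gets the grand total. Below is a DataFrame-API sketch of the same two queries with the frames written out explicitly, assuming the data1 DataFrame defined above.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val byYear = data1.groupBy(substring($"date", 0, 4).as("dd"))
  .agg(sum($"build").as("sum_build"))

// order by => running-total frame (unbounded preceding .. current row)
val running = Window.orderBy("dd").rangeBetween(Window.unboundedPreceding, Window.currentRow)
// no order by => the frame spans the entire partition
val whole = Window.partitionBy(lit(1))

byYear.select(
  $"dd",
  sum($"sum_build").over(running).as("running_total"),
  sum($"sum_build").over(whole).as("grand_total")
).show()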

3. Window function examples in DataFrame form

3.1 Prepare the data

import spark.implicits._   // for toDF

val df1 = Seq(
  ("a", "10", "m1"),
  ("b", "20", "m1"),
  (null, "30", "m1"),
  ("b", "30", "m2"),
  ("c", "40", "m2"),
  (null, "50", "m2")
).toDF("val", "count", "id")

scala> df1.show()
+----+-----+---+
| val|count| id|
+----+-----+---+
|   a|   10| m1|
|   b|   20| m1|
|null|   30| m1|
|   b|   30| m2|
|   c|   40| m2|
|null|   50| m2|
+----+-----+---+

3.2 Ranking within each group with row_number() and a window

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

df1.withColumn("rank_num", row_number().over(Window.partitionBy("id").orderBy("count"))).show

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> df1.withColumn("rank_num", row_number().over(Window.partitionBy("id").orderBy("count"))).show
+----+-----+---+--------+
| val|count| id|rank_num|
+----+-----+---+--------+
|   a|   10| m1|       1|
|   b|   20| m1|       2|
|null|   30| m1|       3|
|   b|   30| m2|       1|
|   c|   40| m2|       2|
|null|   50| m2|       3|
+----+-----+---+--------+

After filtering out the rows where val is null:

df1.withColumn("rank_num", row_number().over(Window.partitionBy("id").orderBy("count"))).where("val <> 'null'").show()

scala> df1.withColumn("rank_num", row_number().over(Window.partitionBy("id").orderBy("count"))).where("val <> 'null'").show()
+---+-----+---+--------+
|val|count| id|rank_num|
+---+-----+---+--------+
|  a|   10| m1|       1|
|  b|   20| m1|       2|
|  b|   30| m2|       1|
|  c|   40| m2|       2|
+---+-----+---+--------+

3.3 Global ordering with row_number() and a window

Without partitionBy, an orderBy-only window produces a global ordering:

scala> df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).show
2021-07-26 09:41:47,061 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+-----+---+--------+
| val|count| id|rank_num|
+----+-----+---+--------+
|   a|   10| m1|       1|
|   b|   20| m1|       2|
|null|   30| m1|       3|
|   b|   30| m2|       4|
|   c|   40| m2|       5|
|null|   50| m2|       6|
+----+-----+---+--------+

3.4 Given a slice size partitionSize, slice the globally ordered result and find each slice's boundary values

scala> val partitionSize=3
partitionSize: Int = 3

scala> df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).where(s"rank_num % $partitionSize == 0").show
2021-07-27 11:43:00,567 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+-----+---+--------+
| val|count| id|rank_num|
+----+-----+---+--------+
|null|   30| m1|       3|
|null|   50| m2|       6|
+----+-----+---+--------+


scala> val df=df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).where(s"rank_num % $partitionSize == 0").select(col("rank_num"))
df: org.apache.spark.sql.DataFrame = [rank_num: int]

scala> df.show
2021-07-26 09:50:19,415 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+--------+
|rank_num|
+--------+
|       3|
|       6|
+--------+


scala> val partitionSize=2
partitionSize: Int = 2

scala> df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).where(s"rank_num % $partitionSize == 0").show
2021-07-27 11:44:03,205 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+-----+---+--------+
| val|count| id|rank_num|
+----+-----+---+--------+
|   b|   20| m1|       2|
|   b|   30| m2|       4|
|null|   50| m2|       6|
+----+-----+---+--------+


scala> val df=df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).where(s"rank_num % $partitionSize == 0").select(col("rank_num"))
df: org.apache.spark.sql.DataFrame = [rank_num: int]

scala> df.show
2021-07-26 09:51:13,534 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+--------+
|rank_num|
+--------+
|       2|
|       4|
|       6|
+--------+

3.5 Loop over the lower and upper bound of each slice

scala> val ids = df.collect().map(_.get(0).asInstanceOf[Number].longValue)
ids: Array[Long] = Array(2, 4, 6)

val partitionColumn = "main_idx"   // name of the column the boundary predicates will filter on

import scala.collection.mutable.ArrayBuffer

val results = ArrayBuffer[String]()

// first slice: everything below the first boundary
results += s"$partitionColumn < ${ids(0)}"
// middle slices: [previous boundary, current boundary)
for (i <- 1 until ids.length) {
  val start = ids(i - 1)
  val end = ids(i)
  results += s"$partitionColumn >= ${start} and $partitionColumn < ${end}"
}
// last slice: everything at or above the last boundary
results += s"$partitionColumn >= ${ids(ids.length - 1)}"
results.toArray


scala> ids
res31: Array[Long] = Array(2, 4, 6)

scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer

scala> val results = ArrayBuffer[String]()
results: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer()

scala> results += s"$partitionColumn < ${ids(0)}"
res51: results.type = ArrayBuffer(main_idx < 2)

scala> for (i <- 1 until ids.length) {
     |       val start = ids(i - 1)
     |       val end = ids(i)
     |       results += s"$partitionColumn >= ${start} and $partitionColumn < ${end}"
     |     }

scala> results += s"$partitionColumn >= ${ids(ids.length - 1)}"
res53: results.type = ArrayBuffer(main_idx < 2, main_idx >= 2 and main_idx < 4, main_idx >= 4 and main_idx < 6, main_idx >= 6)

scala> 

scala> results.toArray
res54: Array[String] = Array(main_idx < 2, main_idx >= 2 and main_idx < 4, main_idx >= 4 and main_idx < 6, main_idx >= 6)
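
One hypothetical way to use these boundary predicates (this step is not shown above) is to tag each row with the partition column and then pull the slices one at a time:

// Hypothetical follow-up: reuse rank_num as the main_idx column and filter one slice per predicate.
val ranked = df1.withColumn("main_idx", row_number().over(Window.orderBy("count")))

results.foreach { predicate =>
  println(s"--- slice: $predicate ---")
  ranked.where(predicate).show()
}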
