The previous sections covered the common Spark DataFrame operators. Beyond those, Spark has a rather special class of operations: window functions.
Window functions are used mostly in SQL, and Spark SQL supports them; the DataFrame API provides them as well, although the DataFrame syntax differs somewhat from the Spark SQL syntax. The same windowed sum, written first in Spark SQL and then with the DataFrame API:
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
from userlogs_date
where dt=20210720
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._   // sum and the other aggregate functions
import spark.implicits._                   // the $"..." column syntax

val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")

df_userlogs_date.select(
  $"pcode",
  $"event_date",
  sum($"duration").over(first_2_now_window).as("sum_duration")
).show
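In the DataFrame version the window is named first_2_now_window because, when orderBy is specified and no frame is given, Spark defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. from the first row of the partition up to the current row (the generated column name further below shows the same frame). A minimal sketch, not from the original example, that writes this frame out explicitly:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Equivalent to first_2_now_window above, with the default frame made explicit.
val first_2_now_explicit = Window
  .partitionBy("pcode")
  .orderBy("event_date")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

df_userlogs_date
  .withColumn("sum_duration", sum(col("duration")).over(first_2_now_explicit))
  .show()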
A window function has the form over(partition by A order by B): partition the rows by A, order them within each partition by B, and then compute something over that window, such as a count or a max. Commonly used forms are listed below (a DataFrame-API sketch of a few of them follows the list).
count(...) over(partition by ... order by ...) -- row count within the partition.
sum(...) over(partition by ... order by ...) -- sum within the partition.
max(...) over(partition by ... order by ...) -- maximum within the partition.
min(...) over(partition by ... order by ...) -- minimum within the partition.
avg(...) over(partition by ... order by ...) -- average within the partition.
rank() over(partition by ... order by ...) -- rank within the partition; values may be non-consecutive (gaps after ties).
dense_rank() over(partition by ... order by ...) -- rank within the partition; values are consecutive (no gaps after ties).
first_value(...) over(partition by ... order by ...) -- first value within the partition.
last_value(...) over(partition by ... order by ...) -- last value within the partition.
lag() over(partition by ... order by ...) -- value taken from n rows before the current row.
lead() over(partition by ... order by ...) -- value taken from n rows after the current row.
ratio_to_report(...) over(partition by ... order by ...) -- the expression inside ratio_to_report() is the numerator and the window defined in over() supplies the denominator; note this is an Oracle function and is not available in Spark SQL, where value / sum(value) over(...) achieves the same thing.
percent_rank() over(partition by ... order by ...) -- the relative rank of the current row, as a percentage.
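Most of the functions above are also exposed in the DataFrame API (ratio_to_report is not, as noted). A minimal sketch, assuming a hypothetical DataFrame df with a partition key grp, an ordering key ord, and a value column v:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("grp").orderBy("ord")

df.select(
  col("grp"), col("ord"), col("v"),
  rank().over(w).as("rank"),               // may leave gaps after ties
  dense_rank().over(w).as("dense_rank"),   // no gaps after ties
  lag(col("v"), 1).over(w).as("prev_v"),   // value from the previous row in the window
  lead(col("v"), 1).over(w).as("next_v"),  // value from the next row in the window
  percent_rank().over(w).as("pct_rank")    // relative rank in [0, 1]
).show()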
Window functions make it easy to implement logic such as:
a. computing each individual row's share of its group's aggregate;
b. computing a running (historical) total.
Both are demonstrated below.
val data = spark.read.json(spark.createDataset(
Seq(
"""{"name":"A","lesson":"Math","score":100}""",
"""{"name":"B","lesson":"Math","score":100}""",
"""{"name":"C","lesson":"Math","score":99}""",
"""{"name":"D","lesson":"Math","score":98}""",
"""{"name":"A","lesson":"English","score":100}""",
"""{"name":"B","lesson":"English","score":99}""",
"""{"name":"C","lesson":"English","score":99}""",
"""{"name":"D","lesson":"English","score":98}"""
)))
scala> data.show
+-------+----+-----+
| lesson|name|score|
+-------+----+-----+
|   Math|   A|  100|
|   Math|   B|  100|
|   Math|   C|   99|
|   Math|   D|   98|
|English|   A|  100|
|English|   B|   99|
|English|   C|   99|
|English|   D|   98|
+-------+----+-----+
data.createOrReplaceTempView("score")
// For each person, compute the share each subject's score contributes: y1 is the share of the overall total, y2 the share of that person's own total.
spark.sql(
s"""
|select name, lesson, score, (score/sum(score) over()) as y1, (score/sum(score) over(partition by name)) as y2
|from score
|""".stripMargin).show
+----+-------+-----+-------------------+-------------------+
|name| lesson|score|                 y1|                 y2|
+----+-------+-----+-------------------+-------------------+
|   B|   Math|  100|0.12610340479192939| 0.5025125628140703|
|   B|English|   99|0.12484237074401008|0.49748743718592964|
|   D|   Math|   98| 0.1235813366960908|                0.5|
|   D|English|   98| 0.1235813366960908|                0.5|
|   C|   Math|   99|0.12484237074401008|                0.5|
|   C|English|   99|0.12484237074401008|                0.5|
|   A|   Math|  100|0.12610340479192939|                0.5|
|   A|English|  100|0.12610340479192939|                0.5|
+----+-------+-----+-------------------+-------------------+
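The same two ratios can also be computed with the DataFrame API instead of SQL. A minimal sketch, reusing the data DataFrame from above (the window names wAll and wName are my own):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val wAll  = Window.partitionBy()        // no partition key: one window over the whole dataset
val wName = Window.partitionBy("name")  // one window per person

data.select(
  col("name"), col("lesson"), col("score"),
  (col("score") / sum("score").over(wAll)).as("y1"),   // share of the grand total
  (col("score") / sum("score").over(wName)).as("y2")   // share of this person's total
).show()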
For example, suppose we need the cumulative number of items built per year, from 2018 through 2020.
val data1 = spark.read.json(spark.createDataset(
Seq(
"""{"date":"2020-01-01","build":1}""",
"""{"date":"2020-01-01","build":1}""",
"""{"date":"2020-04-01","build":1}""",
"""{"date":"2020-04-01","build":1}""",
"""{"date":"2020-05-01","build":1}""",
"""{"date":"2020-09-01","build":1}""",
"""{"date":"2019-01-01","build":1}""",
"""{"date":"2019-01-01","build":1}""",
"""{"date":"2018-01-01","build":1}"""
)))
data1.show()
+-----+----------+
|build|      date|
+-----+----------+
|    1|2020-01-01|
|    1|2020-01-01|
|    1|2020-04-01|
|    1|2020-04-01|
|    1|2020-05-01|
|    1|2020-09-01|
|    1|2019-01-01|
|    1|2019-01-01|
|    1|2018-01-01|
+-----+----------+
data1.createOrReplaceTempView("data1")
/**
 * Running (historical) total
 */
// Cumulative total of the build field, year over year
spark.sql(
s"""
|select c.dd,sum(c.sum_build) over(partition by 1 order by dd asc) from
|(select substring(date,0,4) as dd, sum(build) as sum_build from data1 group by dd) c
|""".stripMargin).show
+----+------------------------------------------------------------------------------------------------------------------+
| dd|sum(sum_build) OVER (PARTITION BY 1 ORDER BY dd ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|
+----+------------------------------------------------------------------------------------------------------------------+
|2018| 1|
|2019| 3|
|2020| 9|
+----+------------------------------------------------------------------------------------------------------------------+
spark.sql(
s"""
|select c.dd,sum(c.sum_build) over (partition by 1) from
|(select substring(date,0,4) as dd, sum(build) as sum_build from data1 group by dd) c
|""".stripMargin).show
+----+---------------------------------------------------------------------------------------------+
| dd|sum(sum_build) OVER (PARTITION BY 1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|
+----+---------------------------------------------------------------------------------------------+
|2020| 9|
|2019| 9|
|2018| 9|
+----+---------------------------------------------------------------------------------------------+
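In the first query the ordered window makes the sum cumulative (the default frame ends at the current row, as the generated column name shows); in the second, without order by, the frame spans the whole partition, so every row gets the grand total. A minimal sketch of both variants with the DataFrame API, assuming data1 from above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Yearly totals, as in the inner query.
val byYear = data1
  .withColumn("dd", substring(col("date"), 0, 4))
  .groupBy("dd")
  .agg(sum("build").as("sum_build"))

// Running (historical) total: the ordered window ends at the current row.
byYear.withColumn("cum_build", sum("sum_build").over(Window.orderBy("dd"))).show()

// Grand total on every row: no ordering, so the frame covers the whole window.
byYear.withColumn("total_build", sum("sum_build").over(Window.partitionBy())).show()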
Next, a row_number example on a small DataFrame that also contains NULL values:
import spark.implicits._   // provides .toDF; already imported automatically in spark-shell

val df1 = Seq(
  ("a", "10", "m1"),
  ("b", "20", "m1"),
  (null, "30", "m1"),
  ("b", "30", "m2"),
  ("c", "40", "m2"),
  (null, "50", "m2")
).toDF("val", "count", "id")
scala> df1.show()
+----+-----+---+
| val|count| id|
+----+-----+---+
|   a|   10| m1|
|   b|   20| m1|
|null|   30| m1|
|   b|   30| m2|
|   c|   40| m2|
|null|   50| m2|
+----+-----+---+
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> df1.withColumn("rank_num", row_number().over(Window.partitionBy("id").orderBy("count"))).show
+----+-----+---+--------+
| val|count| id|rank_num|
+----+-----+---+--------+
|   a|   10| m1|       1|
|   b|   20| m1|       2|
|null|   30| m1|       3|
|   b|   30| m2|       1|
|   c|   40| m2|       2|
|null|   50| m2|       3|
+----+-----+---+--------+
After filtering out the records whose val is NULL (an explicit Column-API version of this filter is sketched after the output below):
scala> df1.withColumn("rank_num", row_number().over(Window.partitionBy("id").orderBy("count"))).where("val <> 'null'").show()
+---+-----+---+--------+
|val|count| id|rank_num|
+---+-----+---+--------+
|  a|   10| m1|       1|
|  b|   20| m1|       2|
|  b|   30| m2|       1|
|  c|   40| m2|       2|
+---+-----+---+--------+
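The filter val <> 'null' drops the NULL rows only because comparing NULL with anything yields NULL, which the WHERE clause treats as false; isNotNull states the intent directly. A minimal sketch, not part of the original transcript:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df1
  .withColumn("rank_num", row_number().over(Window.partitionBy("id").orderBy("count")))
  .where(col("val").isNotNull)   // keep only rows whose val is not NULL
  .show()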
Without partitionBy, an orderBy alone produces a global ranking:
scala> df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).show
2021-07-26 09:41:47,061 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+-----+---+--------+
| val|count| id|rank_num|
+----+-----+---+--------+
|   a|   10| m1|       1|
|   b|   20| m1|       2|
|null|   30| m1|       3|
|   b|   30| m2|       4|
|   c|   40| m2|       5|
|null|   50| m2|       6|
+----+-----+---+--------+
A global row_number can also be used to split the data into roughly equal ranges: keep every partitionSize-th rank as a boundary, collect those boundary values, and turn them into range predicates over a partition column (a consolidated helper is sketched at the end of this section).
scala> val partitionSize=3
partitionSize: Int = 3
scala> val df=df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).where(s"rank_num % $partitionSize == 0").show
2021-07-27 11:43:00,567 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+-----+---+--------+
| val|count| id|rank_num|
+----+-----+---+--------+
|null|   30| m1|       3|
|null|   50| m2|       6|
+----+-----+---+--------+
df: Unit = ()
scala> val df=df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).where(s"rank_num % $partitionSize == 0").select(col("rank_num"))
df: org.apache.spark.sql.DataFrame = [rank_num: int]
scala> df.show
2021-07-26 09:50:19,415 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+--------+
|rank_num|
+--------+
|       3|
|       6|
+--------+
scala> val partitionSize=2
partitionSize: Int = 2
scala> val df=df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).where(s"rank_num % $partitionSize == 0").show
2021-07-27 11:44:03,205 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+-----+---+--------+
| val|count| id|rank_num|
+----+-----+---+--------+
|   b|   20| m1|       2|
|   b|   30| m2|       4|
|null|   50| m2|       6|
+----+-----+---+--------+
df: Unit = ()
scala> val df=df1.withColumn("rank_num", row_number().over(Window.orderBy("count"))).where(s"rank_num % $partitionSize == 0").select(col("rank_num"))
df: org.apache.spark.sql.DataFrame = [rank_num: int]
scala> df.show
2021-07-26 09:51:13,534 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+--------+
|rank_num|
+--------+
|       2|
|       4|
|       6|
+--------+
scala> val ids = df.collect().map(_.get(0).asInstanceOf[Number].longValue)
ids: Array[Long] = Array(2, 4, 6)
val partitionColumn = "main_idx"
import scala.collection.mutable.ArrayBuffer

val results = ArrayBuffer[String]()
results += s"$partitionColumn < ${ids(0)}"
for (i <- 1 until ids.length) {
  val start = ids(i - 1)
  val end = ids(i)
  results += s"$partitionColumn >= ${start} and $partitionColumn < ${end}"
}
results += s"$partitionColumn >= ${ids(ids.length - 1)}"
results.toArray
scala> ids
res31: Array[Long] = Array(2, 4, 6)
scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer
scala> val results = ArrayBuffer[String]()
results: scala.collection.mutable.ArrayBuffer[String] = ArrayBuffer()
scala> results += s"$partitionColumn < ${ids(0)}"
res51: results.type = ArrayBuffer(main_idx < 2)
scala> for (i <- 1 until ids.length) {
| val start = ids(i - 1)
| val end = ids(i)
| results += s"$partitionColumn >= ${start} and $partitionColumn < ${end}"
| }
scala> results += s"$partitionColumn >= ${ids(ids.length - 1)}"
res53: results.type = ArrayBuffer(main_idx < 2, main_idx >= 2 and main_idx < 4, main_idx >= 4 and main_idx < 6, main_idx >= 6)
scala>
scala> results.toArray
res54: Array[String] = Array(main_idx < 2, main_idx >= 2 and main_idx < 4, main_idx >= 4 and main_idx < 6, main_idx >= 6)
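The steps above (collect the boundary rank values, then build range predicates over a partition column) can be folded into a single helper. A minimal sketch; buildRangePredicates is a hypothetical name, and the parallel-read use case mentioned in the comment is only one possible application:

// Given sorted boundary values, produce range predicates over partitionColumn
// (e.g. for splitting a read into parallel partitions).
def buildRangePredicates(bounds: Array[Long], partitionColumn: String): Array[String] = {
  require(bounds.nonEmpty, "at least one boundary value is required")
  val first  = s"$partitionColumn < ${bounds.head}"
  val middle = bounds.sliding(2).collect {
    case Array(start, end) => s"$partitionColumn >= $start and $partitionColumn < $end"
  }.toArray
  val last   = s"$partitionColumn >= ${bounds.last}"
  (first +: middle) :+ last
}

// buildRangePredicates(ids, "main_idx")
// => Array(main_idx < 2, main_idx >= 2 and main_idx < 4, main_idx >= 4 and main_idx < 6, main_idx >= 6)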