Suppose we have the following data:
col_names = ["name", "date", "score"]
value = [
("Ali", "2020-01-01", 10.0),
("Ali", "2020-01-02", 15.0),
("Ali", "2020-01-03", 20.0),
("Ali", "2020-01-04", 25.0),
("Ali", "2020-01-05", 30.0),
("Bob", "2020-01-01", 15.0),
("Bob", "2020-01-02", 20.0),
("Bob", "2020-01-03", 30.0)
]
# build the example DataFrame (assumes an existing SparkSession named spark)
df = spark.createDataFrame(value, col_names)
df.show()
Next, we rank the scores within each name; the code and result are as follows:
from pyspark.sql.window import Window
from pyspark.sql.functions import col
from pyspark.sql import functions as F

# rank each name's scores in ascending order
win1 = Window.partitionBy("name").orderBy("score")
df_rank = df.withColumn("rank", F.rank().over(win1))
df_rank.show()
+----+----------+-----+----+
|name| date|score|rank|
+----+----------+-----+----+
| Bob|2020-01-01| 15.0| 1|
| Bob|2020-01-02| 20.0| 2|
| Bob|2020-01-03| 30.0| 3|
| Ali|2020-01-01| 10.0| 1|
| Ali|2020-01-02| 15.0| 2|
| Ali|2020-01-03| 20.0| 3|
| Ali|2020-01-04| 25.0| 4|
| Ali|2020-01-05| 30.0| 5|
+----+----------+-----+----+
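A quick caveat: F.rank() assigns tied values the same rank, so if a name ever had two rows tied for its lowest score, both would receive rank 1 and a rank-based sum would include both of them. When exactly one lowest row per name is required, F.row_number() with an explicit tie-breaker is the safer choice. A minimal sketch, continuing from the code above and assuming date is an acceptable tie-breaker:
# row_number() breaks ties, so exactly one row per name gets rank 1
win_rn = Window.partitionBy("name").orderBy("score", "date")
df_rn = df.withColumn("rank", F.row_number().over(win_rn))
df_rn.show()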
Now the requirement is to get, for each name, both the lowest score and the sum of all the other scores. To do that we group by name and, inside the aggregation, sum the scores under two conditions: rank 1 (the lowest score) and rank other than 1 (everything else). Note that F.when() without an otherwise() returns null for non-matching rows, and F.sum simply ignores nulls, which is what makes these conditional sums work. The code is as follows:
df_sum = df_rank.groupBy("name").agg(
    F.sum(F.when(col("rank") == 1, col("score"))).alias("first"),
    F.sum(F.when(col("rank") != 1, col("score"))).alias("common"),
)
df_sum.show()
This gives the following result:
+----+-----+------+
|name|first|common|
+----+-----+------+
| Bob| 15.0| 50.0|
| Ali| 10.0| 90.0|
+----+-----+------+
See, pretty simple, right? This situation also comes up in real business scenarios. For example, mobile users frequently upgrade their system version, and we may need to count, for a given version, how many users had it as their initial version versus how many upgraded to it from the previous version; that is exactly where conditional aggregation comes in.
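As a rough sketch of that scenario (the data and the column names user_id, version, upgrade_date are made up purely for illustration), the same window-plus-conditional-aggregation pattern would look roughly like this:
# hypothetical upgrade events: one row per (user, version) with the upgrade date
upgrades = spark.createDataFrame(
    [("u1", "v1", "2020-01-01"), ("u1", "v2", "2020-02-01"), ("u2", "v2", "2020-01-15")],
    ["user_id", "version", "upgrade_date"])
# rank 1 marks each user's initial version
w = Window.partitionBy("user_id").orderBy("upgrade_date")
flagged = upgrades.withColumn("rank", F.rank().over(w))
# per version: users who started on it vs. users who upgraded into it
flagged.groupBy("version").agg(
    F.count(F.when(F.col("rank") == 1, True)).alias("initial_users"),
    F.count(F.when(F.col("rank") != 1, True)).alias("upgraded_users"),
).show()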
Reference: how-to-filter-rows-for-a-specific-aggregate-with-spark-sql
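The reference above approaches this from the Spark SQL side. For comparison, here is a minimal SQL sketch of the same conditional aggregation, assuming df_rank is registered as a temporary view; a CASE with no ELSE yields NULL, which SUM skips, just like F.sum with F.when:
df_rank.createOrReplaceTempView("df_rank_view")
spark.sql("""
    SELECT name,
           SUM(CASE WHEN `rank` = 1  THEN score END) AS `first`,
           SUM(CASE WHEN `rank` <> 1 THEN score END) AS common
    FROM df_rank_view
    GROUP BY name
""").show()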
There is another conditional-aggregation approach similar to the one above, but with a rather clever trick. Assume the data is the same as before, and again rank each name's scores. We then add a flag column whose value is 1 when rank is 1 and 0 otherwise, as shown below:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
col_names = ["name", "date", "score"]
value = [
("Ali", "2020-01-01", 10.0),
("Ali", "2020-01-02", 15.0),
("Ali", "2020-01-03", 20.0),
("Ali", "2020-01-04", 25.0),
("Ali", "2020-01-05", 30.0),
("Bob", "2020-01-01", 15.0),
("Bob", "2020-01-02", 20.0),
("Bob", "2020-01-03", 30.0)
]
df = spark.createDataFrame(value, col_names)
win1 = Window.partitionBy("name").orderBy("score")
df_rank = df.withColumn("rank", F.rank().over(win1))
# flag is 1 on each name's lowest-score row, 0 everywhere else
df_rank = df_rank.withColumn("flag", F.when(F.col("rank") == 1, 1).otherwise(0))
df_rank.show()
+----+----------+-----+----+----+
|name| date|score|rank|flag|
+----+----------+-----+----+----+
| Bob|2020-01-01| 15.0| 1| 1|
| Bob|2020-01-02| 20.0| 2| 0|
| Bob|2020-01-03| 30.0| 3| 0|
| Ali|2020-01-01| 10.0| 1| 1|
| Ali|2020-01-02| 15.0| 2| 0|
| Ali|2020-01-03| 20.0| 3| 0|
| Ali|2020-01-04| 25.0| 4| 0|
| Ali|2020-01-05| 30.0| 5| 0|
+----+----------+-----+----+----+
Next we group by name and aggregate in two ways: one sums all the scores directly to get the total, the other multiplies each score by the flag before summing, which keeps only each name's lowest score. The code and result are as follows:
df_flag = df_rank.groupBy("name").agg(
F.sum(F.col("score")*F.col("flag")).alias("first"),
F.sum(F.col("score")).alias("all"),
)
df_flag.show()
+----+-----+-----+
|name|first| all|
+----+-----+-----+
| Bob| 15.0| 65.0|
| Ali| 10.0|100.0|
+----+-----+-----+
The result is essentially the same as with the previous conditional-aggregation approach; the only difference is that the second column here is the overall total rather than the sum of the non-minimum scores, and the latter can be recovered as all minus first.
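To make that equivalence concrete, the common column from the first method can be reproduced by subtracting first from all; continuing from the code above:
# common = all - first reproduces the first method's output
df_flag.withColumn("common", F.col("all") - F.col("first")).select("name", "first", "common").show()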
Reference: pyspark-groupby-with-filter-optimizing-speed