Spark SQL 之 聚合函数 (Aggregation)
案例来自于世界杯足球赛相关数据
下载文件 https://www.kaggle.com/abecklas/fifa-world-cup#WorldCups.csv
数据样式
创建 WorldCUP 的case class
case class WorldCup(year:String,
country:String,
winner:String,
runnerup:String,
third:String,
fouth:String,
goalscored:Integer,
qualifiedTeam:Integer,
matchedPlayed: Integer,
attendance:Long);
案例一: 统计夺冠前三的球队
val fifa_file = "file:///d:/WorldCups.csv";
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val fifa_ds = spark.read.format("csv").
option("sep",",").
option("header",true).
option("inferSchema",true).
load(fifa_file).as[WorldCup];
fifa_ds.printSchema();
fifa_ds.persist();
fifa_ds.groupBy("winner").count().orderBy(col("count").desc).show(3);
输出日志
+----------+-----+
| winner|count|
+----------+-----+
| Brazil| 5|
| Italy| 4|
|Germany FR| 3|
+----------+-----+
only showing top 3 rows
案例二: 统计夺冠前三的球队的首次和末次夺冠时间
fifa_ds.groupBy("winner").agg(
count("winner").as("count"),
min(col("year")),max(columnName = "year")
).orderBy(col("count").desc).show(3);
输出日志
+----------+-----+---------+---------+
| winner|count|min(year)|max(year)|
+----------+-----+---------+---------+
| Brazil| 5| 1958| 2002|
| Italy| 4| 1934| 2006|
|Germany FR| 3| 1954| 1990|
+----------+-----+---------+---------+
only showing top 3 rows
案例三: 统计夺冠前三的球队的所有夺冠年份用逗号分隔
fifa_ds.groupBy("winner").agg(
collect_list("year"),
count("winner").as("count")
).orderBy(col("count").desc).show(3,false);
输出日志
+----------+------------------------------+-----+
|winner |collect_list(year) |count|
+----------+------------------------------+-----+
|Brazil |[1958, 1962, 1970, 1994, 2002]|5 |
|Italy |[1934, 1938, 1982, 2006] |4 |
|Germany FR|[1954, 1974, 1990] |3 |
+----------+------------------------------+-----+
only showing top 3 rows
案例四: 统计不同决赛场次的冠军数
fifa_ds.groupBy("qualifiedTeam").pivot("winner").count().show();
输出日志
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+
|qualifiedTeam|Argentina|Brazil|England|France|Germany|Germany FR|Italy|Spain|Uruguay|
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+
| 13| null| null| null| null| null| null| null| null| 2|
| 16| 1| 3| 1| null| null| 2| 1| null| null|
| 15| null| null| null| null| null| null| 1| null| null|
| 24| 1| 1| null| null| null| 1| 1| null| null|
| 32| null| 1| null| 1| 1| null| 1| 1| null|
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+