Spark SQL 之 Aggregation

Spark SQL 之 聚合函数 (Aggregation)

案例来自于世界杯足球赛相关数据
下载文件 https://www.kaggle.com/abecklas/fifa-world-cup#WorldCups.csv

Spark SQL 之 Aggregation_第1张图片

数据样式

Spark SQL 之 Aggregation_第2张图片

创建 WorldCUP 的case class
case class WorldCup(year:String,
                      country:String,
                      winner:String,
                      runnerup:String,
                      third:String,
                      fouth:String,
                      goalscored:Integer,
                      qualifiedTeam:Integer,
                      matchedPlayed: Integer,
                      attendance:Long);
案例一: 统计夺冠前三的球队
val fifa_file = "file:///d:/WorldCups.csv";
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val fifa_ds = spark.read.format("csv").
                  option("sep",",").
                  option("header",true).
                  option("inferSchema",true).
                  load(fifa_file).as[WorldCup];

fifa_ds.printSchema();
fifa_ds.persist();
fifa_ds.groupBy("winner").count().orderBy(col("count").desc).show(3);
输出日志
+----------+-----+
|    winner|count|
+----------+-----+
|    Brazil|    5|
|     Italy|    4|
|Germany FR|    3|
+----------+-----+
only showing top 3 rows
案例二: 统计夺冠前三的球队的首次和末次夺冠时间
fifa_ds.groupBy("winner").agg(
                             count("winner").as("count"),
                             min(col("year")),max(columnName = "year")
).orderBy(col("count").desc).show(3);
输出日志
+----------+-----+---------+---------+
|    winner|count|min(year)|max(year)|
+----------+-----+---------+---------+
|    Brazil|    5|     1958|     2002|
|     Italy|    4|     1934|     2006|
|Germany FR|    3|     1954|     1990|
+----------+-----+---------+---------+
only showing top 3 rows
案例三: 统计夺冠前三的球队的所有夺冠年份用逗号分隔
 fifa_ds.groupBy("winner").agg(
      collect_list("year"),
      count("winner").as("count")
    ).orderBy(col("count").desc).show(3,false);
输出日志
+----------+------------------------------+-----+
|winner    |collect_list(year)            |count|
+----------+------------------------------+-----+
|Brazil    |[1958, 1962, 1970, 1994, 2002]|5    |
|Italy     |[1934, 1938, 1982, 2006]      |4    |
|Germany FR|[1954, 1974, 1990]            |3    |
+----------+------------------------------+-----+
only showing top 3 rows
案例四: 统计不同决赛场次的冠军数
 fifa_ds.groupBy("qualifiedTeam").pivot("winner").count().show();
输出日志
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+
|qualifiedTeam|Argentina|Brazil|England|France|Germany|Germany FR|Italy|Spain|Uruguay|
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+
|           13|     null|  null|   null|  null|   null|      null| null| null|      2|
|           16|        1|     3|      1|  null|   null|         2|    1| null|   null|
|           15|     null|  null|   null|  null|   null|      null|    1| null|   null|
|           24|        1|     1|   null|  null|   null|         1|    1| null|   null|
|           32|     null|     1|   null|     1|      1|      null|    1|    1|   null|
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+

你可能感兴趣的:(Spark)