SparkSQL Built-in Functions: groupBy() and agg()

Create a DataFrame

scala> val df = Seq(
     |     ("01","Jack","08012345566","28","SALES","1000",1),
     |     ("02","Tom", "08056586761","19","MANAGEMENT","2500",1),
     |     ("03","Mike","08009097878","25","MARKET","2000",1),
     |     ("04","Tina","07099661234","30","LOGISTICS","3000",0),
     |     ("05","Alex","08019208960","18","MARKET","3500",1),
     |     ("06","Bob","08011223344","22","CLERK","1500",1),
     |     ("07","Dvaid","08022557788","25","CLERK","2500",1),
     |     ("08","Ben","08080201682","35","MARKET","500",1),
     |     ("09","Allen","08099206680","20","MARKET","2500",1),
     |     ("10","Caesar","09011020806","32","SALES","1000",1)).toDF("id","name","cellphone","age","department","expense","gender")
df: org.apache.spark.sql.DataFrame = [id: string, name: string ... 5 more fields]
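
Because age and expense are quoted in the Seq, both columns are created as strings. The following is a minimal sketch, not part of the original walkthrough, of converting them to integers. It assumes it is pasted into spark-shell or run with Spark on the classpath; the names raw and typed are hypothetical, and only two of the rows are reproduced.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// getOrCreate() reuses the session that spark-shell already provides
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Two sample rows with the same column layout as df (hypothetical name: raw)
val raw = Seq(
  ("01", "Jack", "08012345566", "28", "SALES", "1000", 1),
  ("08", "Ben", "08080201682", "35", "MARKET", "500", 1)
).toDF("id", "name", "cellphone", "age", "department", "expense", "gender")

// Hypothetical name: typed. age and expense are cast from string to int.
val typed = raw
  .withColumn("age", col("age").cast("int"))
  .withColumn("expense", col("expense").cast("int"))

// Prints the schema so the new column types can be checked
typed.printSchema()
```

With numeric columns, aggregations such as max and min compare values numerically rather than character by character.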

Display the DataFrame with show

scala> df.show
+---+------+-----------+---+----------+-------+------+
| id|  name|  cellphone|age|department|expense|gender|
+---+------+-----------+---+----------+-------+------+
| 01|  Jack|08012345566| 28|     SALES|   1000|     1|
| 02|   Tom|08056586761| 19|MANAGEMENT|   2500|     1|
| 03|  Mike|08009097878| 25|    MARKET|   2000|     1|
| 04|  Tina|07099661234| 30| LOGISTICS|   3000|     0|
| 05|  Alex|08019208960| 18|    MARKET|   3500|     1|
| 06|   Bob|08011223344| 22|     CLERK|   1500|     1|
| 07| Dvaid|08022557788| 25|     CLERK|   2500|     1|
| 08|   Ben|08080201682| 35|    MARKET|    500|     1|
| 09| Allen|08099206680| 20|    MARKET|   2500|     1|
| 10|Caesar|09011020806| 32|     SALES|   1000|     1|
+---+------+-----------+---+----------+-------+------+

Count the number of people by gender.

scala> df.groupBy("gender").count.show()
+------+-----+
|gender|count|
+------+-----+
|     1|    9|
|     0|    1|
+------+-----+
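
The same count can also be written through agg, which makes it possible to name the result column. This is a small sketch under the same assumptions as the session above (spark-shell or Spark on the classpath); the people DataFrame and the headcount alias are hypothetical and use only three rows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

val spark = SparkSession.builder().master("local[*]").appName("count-sketch").getOrCreate()
import spark.implicits._

// Three sample rows (hypothetical name: people)
val people = Seq(("Jack", 1), ("Tom", 1), ("Tina", 0)).toDF("name", "gender")

// count(lit(1)) counts the rows in each group; as(...) renames the output column
val headcount = people
  .groupBy("gender")
  .agg(count(lit(1)).as("headcount"))
  .as[(Int, Long)]
  .collect()
  .toMap

println(headcount)
```

The count shortcut on the grouped data is shorter; agg is mainly useful when several aggregates or custom column names are needed in one pass.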

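Group by department, then compute the maximum, minimum, and sum of expense and the average age within each group. Note that expense and age were created as string columns: max and min compare strings lexicographically (which is why MARKET shows a max of 500 and a min of 2000), while sum and mean implicitly cast the values to double.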

scala> df.groupBy("department").agg(max("expense"), min("expense"), sum("expense"), mean("age")).show
+----------+------------+------------+------------+--------+
|department|max(expense)|min(expense)|sum(expense)|avg(age)|
+----------+------------+------------+------------+--------+
|     CLERK|        2500|        1500|      4000.0|    23.5|
|     SALES|        1000|        1000|      2000.0|    30.0|
|    MARKET|         500|        2000|      8500.0|    24.5|
| LOGISTICS|        3000|        3000|      3000.0|    30.0|
|MANAGEMENT|        2500|        2500|      2500.0|    19.0|
+----------+------------+------------+------------+--------+

Or, equivalently, using avg instead of mean:
scala> df.groupBy("department").agg(max("expense"), min("expense"), sum("expense"), avg("age")).show
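
To get the numeric maximum and minimum, the string column can be cast before aggregating. The sketch below is an illustration under stated assumptions, not the original author's code: it reproduces only the four MARKET expense values, and the names market, asString, asInt, max_expense, and min_expense are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, min}

val spark = SparkSession.builder().master("local[*]").appName("agg-cast-sketch").getOrCreate()
import spark.implicits._

// The four MARKET expense values, kept as strings (hypothetical name: market)
val market = Seq(
  ("MARKET", "2000"), ("MARKET", "3500"), ("MARKET", "500"), ("MARKET", "2500")
).toDF("department", "expense")

// String comparison: "500" sorts after "3500", and "2000" sorts first
val asString = market.groupBy("department")
  .agg(max("expense").as("max_expense"), min("expense").as("min_expense"))
  .head()

// Numeric comparison after an explicit cast to int
val asInt = market.groupBy("department")
  .agg(max(col("expense").cast("int")).as("max_expense"),
       min(col("expense").cast("int")).as("min_expense"))
  .head()

println(s"string: max=${asString.getString(1)} min=${asString.getString(2)}") // string: max=500 min=2000
println(s"int:    max=${asInt.getInt(1)} min=${asInt.getInt(2)}")             // int:    max=3500 min=500
```

The cast version returns 3500 and 500 for MARKET, which is what a reader would normally expect from max and min on an expense column.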

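Filter for rows whose cellphone contains "080". Because contains matches the substring anywhere in the value, id 10 (09011020806) is kept, while id 04 (07099661234) is dropped.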

scala> df.filter($"cellphone".contains("080")).show
+---+------+-----------+---+----------+-------+------+
| id|  name|  cellphone|age|department|expense|gender|
+---+------+-----------+---+----------+-------+------+
| 01|  Jack|08012345566| 28|     SALES|   1000|     1|
| 02|   Tom|08056586761| 19|MANAGEMENT|   2500|     1|
| 03|  Mike|08009097878| 25|    MARKET|   2000|     1|
| 05|  Alex|08019208960| 18|    MARKET|   3500|     1|
| 06|   Bob|08011223344| 22|     CLERK|   1500|     1|
| 07| Dvaid|08022557788| 25|     CLERK|   2500|     1|
| 08|   Ben|08080201682| 35|    MARKET|    500|     1|
| 09| Allen|08099206680| 20|    MARKET|   2500|     1|
| 10|Caesar|09011020806| 32|     SALES|   1000|     1|
+---+------+-----------+---+----------+-------+------+
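
If the intent is to keep only numbers that begin with "080", Column.startsWith is the closer fit. The following sketch contrasts the two predicates on three of the phone numbers; it assumes spark-shell or Spark on the classpath, and the names phones, containing, and prefixed are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("filter-sketch").getOrCreate()
import spark.implicits._

// Three of the phone numbers from the table (hypothetical name: phones)
val phones = Seq(
  ("04", "07099661234"),
  ("09", "08099206680"),
  ("10", "09011020806")
).toDF("id", "cellphone")

// contains: "080" may appear anywhere in the value
val containing = phones.filter($"cellphone".contains("080"))
  .select("id").as[String].collect().toSeq.sorted

// startsWith: "080" must be the prefix
val prefixed = phones.filter($"cellphone".startsWith("080"))
  .select("id").as[String].collect().toSeq.sorted

println(containing) // ids 09 and 10
println(prefixed)   // id 09 only
```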

After filtering for cellphone values that contain "080", group by department, compute the sum of expense for each group, and sort the result by department.

scala> df.filter($"cellphone".contains("080")).groupBy($"department").agg(sum($"expense")).orderBy($"department").show(false)
+----------+------------+
|department|sum(expense)|
+----------+------------+
|CLERK     |4000.0      |
|MANAGEMENT|2500.0      |
|MARKET    |8500.0      |
|SALES     |2000.0      |
+----------+------------+
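
The same pipeline can also be expressed as a SQL query against a temporary view. This is a hedged sketch rather than the original author's approach: the view name staff, the alias total_expense, and the four-row sample are assumptions made for illustration, and the cast to DOUBLE is written out explicitly instead of relying on implicit conversion.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sql-sketch").getOrCreate()
import spark.implicits._

// Four sample rows (hypothetical view name: staff)
Seq(
  ("08012345566", "SALES", "1000"),
  ("07099661234", "LOGISTICS", "3000"),
  ("08011223344", "CLERK", "1500"),
  ("08022557788", "CLERK", "2500")
).toDF("cellphone", "department", "expense").createOrReplaceTempView("staff")

// Filter, group, aggregate, and sort in one SQL statement
val totals = spark.sql(
  """SELECT department, SUM(CAST(expense AS DOUBLE)) AS total_expense
    |FROM staff
    |WHERE cellphone LIKE '%080%'
    |GROUP BY department
    |ORDER BY department""".stripMargin
).as[(String, Double)].collect().toSeq

totals.foreach(println)
```

On this sample the LOGISTICS row is filtered out, leaving CLERK with 4000.0 and SALES with 1000.0.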
