Create a DataFrame
scala> val df = Seq(
| ("01","Jack","08012345566","28","SALES","1000",1),
| ("02","Tom", "08056586761","19","MANAGEMENT","2500",1),
| ("03","Mike","08009097878","25","MARKET","2000",1),
| ("04","Tina","07099661234","30","LOGISTICS","3000",0),
| ("05","Alex","08019208960","18","MARKET","3500",1),
| ("06","Bob","08011223344","22","CLERK","1500",1),
| ("07","Dvaid","08022557788","25","CLERK","2500",1),
| ("08","Ben","08080201682","35","MARKET","500",1),
| ("09","Allen","08099206680","20","MARKET","2500",1),
| ("10","Caesar","09011020806","32","SALES","1000",1)).toDF("id","name","cellphone","age","department","expense","gender")
df: org.apache.spark.sql.DataFrame = [id: string, name: string ... 5 more fields]
Display the contents with show
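Every quoted value above becomes a string column, while gender, written as a bare integer literal, becomes an integer column. If numeric behavior is wanted for age and expense, one option is to cast them. The following is a minimal sketch on a two-row sample rather than the full data; the names `sample` and `typed` are hypothetical, and in spark-shell `getOrCreate()` should simply return the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Two sample rows with the same string-typed columns as the data above.
val sample = Seq(
  ("01", "Jack", "28", "1000"),
  ("02", "Tom", "19", "2500")
).toDF("id", "name", "age", "expense")

// Cast the numeric-looking string columns to integers.
val typed = sample
  .withColumn("age", $"age".cast("int"))
  .withColumn("expense", $"expense".cast("int"))

typed.printSchema()
```

Under these assumptions, printSchema should report age and expense as integer and leave id and name as string.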
scala> df.show
+---+------+-----------+---+----------+-------+------+
| id|  name|  cellphone|age|department|expense|gender|
+---+------+-----------+---+----------+-------+------+
| 01|  Jack|08012345566| 28|     SALES|   1000|     1|
| 02|   Tom|08056586761| 19|MANAGEMENT|   2500|     1|
| 03|  Mike|08009097878| 25|    MARKET|   2000|     1|
| 04|  Tina|07099661234| 30| LOGISTICS|   3000|     0|
| 05|  Alex|08019208960| 18|    MARKET|   3500|     1|
| 06|   Bob|08011223344| 22|     CLERK|   1500|     1|
| 07| Dvaid|08022557788| 25|     CLERK|   2500|     1|
| 08|   Ben|08080201682| 35|    MARKET|    500|     1|
| 09| Allen|08099206680| 20|    MARKET|   2500|     1|
| 10|Caesar|09011020806| 32|     SALES|   1000|     1|
+---+------+-----------+---+----------+-------+------+
Count the number of people by gender.
scala> df.groupBy("gender").count.show()
+------+-----+
|gender|count|
+------+-----+
|     1|    9|
|     0|    1|
+------+-----+
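Group by department, then compute the maximum, minimum, and sum of expense and the average age within each group. Note that expense is a string column, so max and min compare values lexicographically (for MARKET, "500" sorts above "3500"), whereas sum and mean cast the strings to numbers.
scala> df.groupBy("department").agg(max("expense"), min("expense"), sum("expense"), mean("age")).show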
+----------+------------+------------+------------+--------+
|department|max(expense)|min(expense)|sum(expense)|avg(age)|
+----------+------------+------------+------------+--------+
|     CLERK|        2500|        1500|      4000.0|    23.5|
|     SALES|        1000|        1000|      2000.0|    30.0|
|    MARKET|         500|        2000|      8500.0|    24.5|
| LOGISTICS|        3000|        3000|      3000.0|    30.0|
|MANAGEMENT|        2500|        2500|      2500.0|    19.0|
+----------+------------+------------+------------+--------+
Or, equivalently, with avg (mean is an alias for avg):
scala> df.groupBy("department").agg(max("expense"), min("expense"), sum("expense"), avg("age")).show
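For a numeric maximum and minimum rather than string ordering, one option is to cast expense to an integer before aggregating. The sketch below uses only the four MARKET rows from the data above; the names `market` and `stats` and the aliases `max_expense` and `min_expense` are hypothetical choices for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("numeric-agg-sketch").getOrCreate()
import spark.implicits._

// The four MARKET rows, with expense still stored as a string.
val market = Seq(
  ("MARKET", "2000"),
  ("MARKET", "3500"),
  ("MARKET", "500"),
  ("MARKET", "2500")
).toDF("department", "expense")

// Cast once, then aggregate on the integer column; as(...) gives readable headers.
val stats = market
  .withColumn("expense", $"expense".cast("int"))
  .groupBy("department")
  .agg(max("expense").as("max_expense"), min("expense").as("min_expense"))

stats.show()
```

With the cast in place, MARKET should report a maximum of 3500 and a minimum of 500.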
Filter for rows whose cellphone contains "080".
scala> df.filter($"cellphone".contains("080")).show
+---+------+-----------+---+----------+-------+------+
| id|  name|  cellphone|age|department|expense|gender|
+---+------+-----------+---+----------+-------+------+
| 01|  Jack|08012345566| 28|     SALES|   1000|     1|
| 02|   Tom|08056586761| 19|MANAGEMENT|   2500|     1|
| 03|  Mike|08009097878| 25|    MARKET|   2000|     1|
| 05|  Alex|08019208960| 18|    MARKET|   3500|     1|
| 06|   Bob|08011223344| 22|     CLERK|   1500|     1|
| 07| Dvaid|08022557788| 25|     CLERK|   2500|     1|
| 08|   Ben|08080201682| 35|    MARKET|    500|     1|
| 09| Allen|08099206680| 20|    MARKET|   2500|     1|
| 10|Caesar|09011020806| 32|     SALES|   1000|     1|
+---+------+-----------+---+----------+-------+------+
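Because contains matches the substring anywhere in the value, id 10 (09011020806) is included: "080" appears near the end of that number. If the intent is to keep only numbers that begin with 080, Column.startsWith may be a better fit. A small sketch on three of the rows above; `phones`, `anywhere`, and `prefix` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("prefix-filter-sketch").getOrCreate()
import spark.implicits._

// Three rows from the data above: one 080 prefix, one 070, one 090.
val phones = Seq(
  ("01", "08012345566"),
  ("04", "07099661234"),
  ("10", "09011020806")
).toDF("id", "cellphone")

// Substring match anywhere in the value: keeps ids 01 and 10.
val anywhere = phones.filter($"cellphone".contains("080"))

// Prefix match only: keeps id 01.
val prefix = phones.filter($"cellphone".startsWith("080"))

anywhere.show()
prefix.show()
```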
Filter for rows whose cellphone contains "080", then group by department, sum expense within each group, and sort the result by department.
scala> df.filter($"cellphone".contains("080")).groupBy($"department").agg(sum($"expense")).orderBy($"department").show(false)
+----------+------------+
|department|sum(expense)|
+----------+------------+
|CLERK     |4000.0      |
|MANAGEMENT|2500.0      |
|MARKET    |8500.0      |
|SALES     |2000.0      |
+----------+------------+
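The same filter, group, and sort pipeline can also be written in SQL against a temporary view. This is a sketch under assumptions, not part of the original session: the view name `people`, the alias `total_expense`, and the four-row sample are hypothetical, and expense is cast explicitly so the sum is an integer total rather than relying on implicit string conversion.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("sql-sketch").getOrCreate()
import spark.implicits._

// Four rows from the data above; Tina's number does not contain "080".
val sample = Seq(
  ("Jack", "08012345566", "SALES", "1000"),
  ("Tina", "07099661234", "LOGISTICS", "3000"),
  ("Bob", "08011223344", "CLERK", "1500"),
  ("Dvaid", "08022557788", "CLERK", "2500")
).toDF("name", "cellphone", "department", "expense")

// Register the DataFrame under a (hypothetical) view name for SQL access.
sample.createOrReplaceTempView("people")

val result = spark.sql("""
  SELECT department, SUM(CAST(expense AS INT)) AS total_expense
  FROM people
  WHERE cellphone LIKE '%080%'
  GROUP BY department
  ORDER BY department
""")

result.show(false)
```

On this sample the LOGISTICS row is filtered out, leaving CLERK with 4000 and SALES with 1000.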