Original data and table schema
+----------+------------+------------+-------+--------+-----------+
|train_code|station_name|station_code|is_late|late_min|arrive_date|
+----------+------------+------------+-------+--------+-----------+
|K8363 |昆山 |KSH |1 |1435 |2019-03-19 |
|2149 |兖州 |YZK |1 |1424 |2019-03-19 |
|K1084 |唐山 |TSP |0 |0 |2019-03-19 |
|K7755 |唐山北 |FUP |1 |1415 |2019-03-19 |
|K451 |马兰 |MLR |0 |0 |2019-03-19 |
|K567 |麻城 |MCN |0 |0 |2019-03-19 |
|T396 |济宁 |JIK |1 |4 |2019-03-19 |
|K346 |锦州 |JZD |0 |0 |2019-03-19 |
|K1126 |衢州 |QEH |0 |0 |2019-03-19 |
|K1295 |中卫 |ZWJ |0 |0 |2019-03-19 |
|K125 |唐山 |TSP |0 |0 |2019-03-19 |
|K1137 |兖州 |YZK |0 |0 |2019-03-19 |
|K1074 |潢川 |KCN |0 |0 |2019-03-19 |
|Z180 |呼和浩特东 |NDC |1 |3 |2019-03-19 |
|K748 |三江县 |SOZ |1 |7 |2019-03-19 |
|K928 |天津 |TJP |1 |7 |2019-03-19 |
|K549 |四平 |SPT |0 |0 |2019-03-19 |
|K96 |鞍山 |AST |0 |0 |2019-03-19 |
|K1669 |玉山 |YNG |1 |10 |2019-03-19 |
|K70 |蕲春 |QRN |0 |0 |2019-03-19 |
+----------+------------+------------+-------+--------+-----------+
root
|-- train_code: string (nullable = true)
|-- station_name: string (nullable = true)
|-- station_code: string (nullable = true)
|-- is_late: long (nullable = true)
|-- late_min: long (nullable = true)
|-- arrive_date: string (nullable = true)
Goal: collect all the delay times recorded for a given train (grouped by train and station).
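// Requires: import static org.apache.spark.sql.functions.*; plus org.apache.spark.sql.Dataset and org.apache.spark.sql.Row
// Collect the distinct (arrive_date, late_min) pairs for each train/station group
Dataset<Row> structData = tableData.groupBy("train_code", "station_code")
        .agg(collect_set(struct("arrive_date", "late_min")).as("detail_set"));
structData.printSchema();
structData.show(false);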
The result is as follows:
root
|-- train_code: string (nullable = true)
|-- station_code: string (nullable = true)
|-- detail_set: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- arrive_date: string (nullable = true)
| | |-- late_min: long (nullable = true)
+----------+------------+----------------------------------------------------------------+
|train_code|station_code|detail_set |
+----------+------------+----------------------------------------------------------------+
|0000 |GQD |[[2019-03-31,0], [2019-03-19,0]] |
|0000 |LYT |[[2019-03-21,0], [2019-03-25,0], [2019-03-19,0]] |
|1133 |CGV |[[2019-03-24,0]] |
|1133 |DTV |[[2019-03-19,0], [2019-03-30,0]] |
|1133 |FZC |[[2019-03-18,0]] |
|1133 |JAC |[[2019-03-27,0]] |
|1133 |YOV |[[2019-03-20,0], [2019-03-25,0]] |
|1134 |BXP |[[2019-03-24,0], [2019-03-19,3], [2019-03-18,8]] |
|1134 |DTV |[[2019-03-31,0], [2019-03-23,0], [2019-03-30,0]] |
|1134 |FZC |[[2019-03-31,0], [2019-03-27,0]] |
|1134 |WQC |[[2019-03-25,0], [2019-03-19,0]] |
|1134 |WVC |[[2019-03-23,5]] |
|1134 |XHP |[[2019-03-19,7]] |
|1147 |CJY |[[2019-03-31,12], [2019-03-25,14]] |
|1147 |LKF |[[2019-03-18,1], [2019-03-20,0]] |
|1147 |PJH |[[2019-03-31,6], [2019-03-28,0], [2019-03-25,0], [2019-03-30,0]]|
|1147 |WNY |[[2019-03-19,23]] |
|1148 |DKH |[[2019-03-30,2], [2019-03-23,0], [2019-03-21,0]] |
|1148 |KFF |[[2019-03-30,0]] |
|1148 |UIH |[[2019-03-25,7]] |
+----------+------------+----------------------------------------------------------------+
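To make the semantics of the aggregation concrete, here is a minimal plain-Java sketch (no Spark involved) of what grouping by train_code and station_code and collecting a set does. The class name CollectSetSketch, the method groupLateDetails, and the duplicated input row are hypothetical and exist only for illustration. Like collect_set, the sketch drops duplicates and makes no promise about element order inside each set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the original Spark job.
class CollectSetSketch {

    // Each row is {train_code, station_code, arrive_date, late_min}.
    // Rows are grouped by "train_code|station_code"; each group keeps the
    // distinct "arrive_date:late_min" entries, in no particular order.
    static Map<String, Set<String>> groupLateDetails(List<String[]> rows) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String[] r : rows) {
            String key = r[0] + "|" + r[1];
            String detail = r[2] + ":" + r[3];
            grouped.computeIfAbsent(key, k -> new HashSet<>()).add(detail);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"1134", "BXP", "2019-03-24", "0"},
                new String[]{"1134", "BXP", "2019-03-19", "3"},
                // Assumed duplicate row, added to show that the set collapses it
                new String[]{"1134", "BXP", "2019-03-19", "3"},
                new String[]{"1134", "XHP", "2019-03-19", "7"});
        System.out.println(groupLateDetails(rows));
    }
}
```

Because a set is used, the order of entries within a group can differ from run to run, which matches the unordered dates visible in the detail_set column above.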
Next, I want to convert the detail_set field from an array to a string.
Modified code:
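// Format each record as "arrive_date:late_min", then collect the distinct strings per group
Dataset<Row> lateDetail = tableData.groupBy("train_code", "station_code")
        .agg(collect_set(concat_ws(":", col("arrive_date"), col("late_min"))).as("late_list"));
// Join the array elements with "," to get a single string column
Dataset<Row> finalResult = lateDetail.withColumn("date_list_str", concat_ws(",", col("late_list")));
finalResult.printSchema();
finalResult.show(false);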
The result is as follows:
root
|-- train_code: string (nullable = true)
|-- station_code: string (nullable = true)
|-- late_list: array (nullable = true)
| |-- element: string (containsNull = true)
|-- date_list_str: string (nullable = false)
+----------+------------+--------------------------------------------------------+---------------------------------------------------+
|train_code|station_code|late_list |date_list_str |
+----------+------------+--------------------------------------------------------+---------------------------------------------------+
|0000 |GQD |[2019-03-19:0, 2019-03-31:0] |2019-03-19:0,2019-03-31:0 |
|0000 |LYT |[2019-03-25:0, 2019-03-19:0, 2019-03-21:0] |2019-03-25:0,2019-03-19:0,2019-03-21:0 |
|1133 |CGV |[2019-03-24:0] |2019-03-24:0 |
|1133 |DTV |[2019-03-19:0, 2019-03-30:0] |2019-03-19:0,2019-03-30:0 |
|1133 |FZC |[2019-03-18:0] |2019-03-18:0 |
|1133 |JAC |[2019-03-27:0] |2019-03-27:0 |
|1133 |YOV |[2019-03-25:0, 2019-03-20:0] |2019-03-25:0,2019-03-20:0 |
|1134 |BXP |[2019-03-19:3, 2019-03-18:8, 2019-03-24:0] |2019-03-19:3,2019-03-18:8,2019-03-24:0 |
|1134 |DTV |[2019-03-30:0, 2019-03-23:0, 2019-03-31:0] |2019-03-30:0,2019-03-23:0,2019-03-31:0 |
|1134 |FZC |[2019-03-27:0, 2019-03-31:0] |2019-03-27:0,2019-03-31:0 |
|1134 |WQC |[2019-03-25:0, 2019-03-19:0] |2019-03-25:0,2019-03-19:0 |
|1134 |WVC |[2019-03-23:5] |2019-03-23:5 |
|1134 |XHP |[2019-03-19:7] |2019-03-19:7 |
|1147 |CJY |[2019-03-25:14, 2019-03-31:12] |2019-03-25:14,2019-03-31:12 |
|1147 |LKF |[2019-03-20:0, 2019-03-18:1] |2019-03-20:0,2019-03-18:1 |
|1147 |PJH |[2019-03-25:0, 2019-03-30:0, 2019-03-31:6, 2019-03-28:0]|2019-03-25:0,2019-03-30:0,2019-03-31:6,2019-03-28:0|
|1147 |WNY |[2019-03-19:23] |2019-03-19:23 |
|1148 |DKH |[2019-03-21:0, 2019-03-23:0, 2019-03-30:2] |2019-03-21:0,2019-03-23:0,2019-03-30:2 |
|1148 |KFF |[2019-03-30:0] |2019-03-30:0 |
|1148 |UIH |[2019-03-25:7] |2019-03-25:7 |
+----------+------------+--------------------------------------------------------+---------------------------------------------------+
only showing top 20 rows
Done.
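One thing visible in the output is that the dates inside date_list_str are not in chronological order, since collect_set does not guarantee any ordering. If ordered output matters, one option is to sort the entries before joining them. The plain-Java sketch below shows the idea on one row from the result above; the class JoinSortedSketch and the method joinSorted are hypothetical names. It relies on the assumption that every entry starts with a yyyy-MM-dd date, so plain string sorting matches date order.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of the original Spark job.
class JoinSortedSketch {

    // Sorts the "arrive_date:late_min" entries lexicographically and joins
    // them with commas. With yyyy-MM-dd dates this is chronological order.
    static String joinSorted(Collection<String> entries) {
        return entries.stream().sorted().collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Entries taken from the 0000 / LYT row of the result table
        String joined = joinSorted(
                Arrays.asList("2019-03-25:0", "2019-03-19:0", "2019-03-21:0"));
        System.out.println(joined); // prints 2019-03-19:0,2019-03-21:0,2019-03-25:0
    }
}
```

In Spark itself, wrapping the collected array in sort_array before passing it to concat_ws would be one way to get a similar effect; that variation is a suggestion and was not part of the original code.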