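Requirements analysis:
Converting between rows and columns is a common ETL requirement. In Spark SQL, turning rows into columns is straightforward because a built-in PIVOT function is available.
Turning columns into rows takes slightly more work.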
Execution Environment
Spark 3.2.0
The examples below cover both approaches: the Spark SQL DSL and Spark SQL.
data-file: city.csv
city,yearm,count
北京,202001,1000
北京,202004,1023
北京,202007,1980
北京,202010,1098
北京,202101,988
北京,202104,976
北京,202107,1098
北京,202110,1221
上海,202001,1222
上海,202004,800
上海,202007,908
上海,202010,1009
上海,202101,709
上海,202104,799
上海,202107,980
上海,202110,897
import org.apache.spark.sql.SparkSession

// Create the execution environment
val session: SparkSession = SparkSession.builder()
  .appName(this.getClass.getSimpleName)
  .master("local[2]")
  .getOrCreate()

// Load the data file as a DataFrame (without inferSchema, every column is read as a string)
val piovtCsvDF = session.read.format("csv")
  .option("header", true)
  .load("/Users/zhoulei/Documents/workspaces/zholei-core/ToolsLibrary/src/main/resources/city.csv")
Spark SQL DSL
import org.apache.spark.sql.functions.sum

// Spark DSL Method
piovtCsvDF
  .groupBy("city")
  .pivot("yearm")
  .agg(sum("count"))
  .show()
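When the pivot values are known in advance, they can be passed to pivot directly, which saves Spark the extra job it otherwise runs to find the distinct values of the pivot column. The following is a minimal sketch, not the author's code: the object name, the helper pivotByYearm, and the in-memory sample (a few rows copied from city.csv) are assumptions made for illustration. It also casts count to long, since the CSV is loaded without a schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object PivotWithValuesSketch {
  // Hypothetical helper: pivots yearm into columns using a caller-supplied value list.
  def pivotByYearm(df: DataFrame, yearms: Seq[String]): DataFrame =
    df.withColumn("count", col("count").cast("long")) // CSV columns load as strings
      .groupBy("city")
      .pivot("yearm", yearms) // explicit values, so no distinct-value scan is needed
      .agg(sum("count"))

  // A few rows copied from city.csv so the sketch runs without the file.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("北京", "202001", "1000"),
      ("北京", "202004", "1023"),
      ("上海", "202001", "1222"),
      ("上海", "202004", "800")
    ).toDF("city", "yearm", "count")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PivotWithValuesSketch")
      .master("local[2]")
      .getOrCreate()

    // One row per city, one column per listed yearm value.
    pivotByYearm(sample(spark), Seq("202001", "202004")).show()
    spark.stop()
  }
}
```

The explicit list also fixes the order of the output columns, which is convenient when downstream code depends on it.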
Spark SQL
// Register a temporary view
piovtCsvDF.createOrReplaceTempView("city_main")

// Spark SQL Method
session.sql(
  """
    |select * from city_main
    |pivot(
    |  SUM(count) for yearm in (
    |    '202001' as Q202001,
    |    '202004' as Q202004,
    |    '202007' as Q202007,
    |    '202010' as Q202010,
    |    '202101' as Q202101,
    |    '202104' as Q202104,
    |    '202107' as Q202107,
    |    '202110' as Q202110
    |  )
    |)
    |""".stripMargin).show(false)
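Unlike the DSL, the SQL PIVOT clause needs its IN list written out. If the list should follow the data, one option is to collect the distinct yearm values first and generate the statement. The sketch below assumes that approach; the object name, the helper buildPivotSql, and the in-memory sample are hypothetical, and the values are interpolated into the SQL text unescaped, so this is only suitable for trusted data.

```scala
import org.apache.spark.sql.SparkSession

object DynamicPivotSqlSketch {
  // Hypothetical helper: builds the same PIVOT statement shape as above from a value list.
  def buildPivotSql(view: String, yearms: Seq[String]): String = {
    val inList = yearms.sorted.map(v => s"'$v' as Q$v").mkString(", ")
    s"select * from $view pivot (sum(count) for yearm in ($inList))"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DynamicPivotSqlSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // A few rows copied from city.csv so the sketch runs without the file.
    Seq(
      ("北京", "202001", "1000"),
      ("北京", "202004", "1023"),
      ("上海", "202001", "1222"),
      ("上海", "202004", "800")
    ).toDF("city", "yearm", "count").createOrReplaceTempView("city_main")

    // Collect the distinct pivot values to the driver, then generate and run the SQL.
    val yearms = spark.table("city_main")
      .select("yearm").distinct()
      .collect().map(_.getString(0)).toSeq

    spark.sql(buildPivotSql("city_main", yearms)).show(false)
    spark.stop()
  }
}
```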
data-file: studentscore.csv
un,chinese,math,English
张三,91,92,93
李四,80,81,32
王五,70,78,80
// getResource returns a java.net.URL, while load expects a path string, hence .getPath
val unpiovtCsvDF = session.read.format("csv")
  .option("header", true)
  .load(getClass.getResource("/studentscore.csv").getPath)
Spark SQL DSL
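stack(n, expr1, …, exprk) splits expr1, …, exprk into n rows. Here each row is a (subject name, score) pair, so the nine cells of the three score columns become three rows per student.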
unpiovtCsvDF
  .selectExpr("un", "stack(3,'chinese',chinese,'math',math,'english',english) as (subject,score)")
  .show(false)
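To check the shape that stack produces without needing the CSV file, the same expression can be run against a small in-memory DataFrame. This is a sketch under assumptions: the object name, the helper toLong, and the two sample rows (copied from studentscore.csv, here typed as integers) are illustrative only. Note that stack expects the values that land in the same output column to share a type, which holds here because all three score columns are integers.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StackSketch {
  // Hypothetical helper: unpivots the three score columns into (subject, score) rows.
  def toLong(df: DataFrame): DataFrame =
    df.selectExpr(
      "un",
      "stack(3, 'chinese', chinese, 'math', math, 'english', english) as (subject, score)"
    )

  // Two rows copied from studentscore.csv so the sketch runs without the file.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("张三", 91, 92, 93),
      ("李四", 80, 81, 32)
    ).toDF("un", "chinese", "math", "english")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StackSketch")
      .master("local[2]")
      .getOrCreate()

    // Each input row yields three output rows, one per subject.
    toLong(sample(spark)).show(false)
    spark.stop()
  }
}
```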
Spark SQL
// Register a temporary view
unpiovtCsvDF.createOrReplaceTempView("student_score_main")
// Spark SQL Method (stack function)
session.sql(
  """
    |select
    |un,stack(3,'chinese',chinese,'math',math,'english',english) as (subject,score)
    |from student_score_main
    |""".stripMargin).show(false)
// Spark SQL Method (lateral view explode)
session.sql(
  """
    |select
    |un,split(temp1,':')[0] as subject,split(temp1,':')[1] as score
    |from (
    |select un,concat('chinese:',chinese,',','math:',math,',','english:',english) temp
    |from student_score_main
    |) lateral view explode(split(temp,',')) sub as temp1
    |""".stripMargin).show(false)
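The explode idea can also be expressed in the DSL without concatenating and re-splitting strings, by exploding an array of (subject, score) structs. This is an alternative sketch rather than the approach used above; the object name, the helper unpivot, and the in-memory sample are assumptions for illustration. As with stack, the value columns need a common type so that they fit in one array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

object ExplodeStructSketch {
  // Hypothetical helper: one (subject, score) struct per value column, then explode.
  def unpivot(df: DataFrame, idCol: String, valueCols: Seq[String]): DataFrame = {
    val pairs = valueCols.map(c => struct(lit(c).as("subject"), col(c).as("score")))
    df.select(col(idCol), explode(array(pairs: _*)).as("kv"))
      .select(col(idCol), col("kv.subject"), col("kv.score"))
  }

  // Two rows copied from studentscore.csv; scores stay strings, as in the CSV load.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("张三", "91", "92", "93"),
      ("李四", "80", "81", "32")
    ).toDF("un", "chinese", "math", "english")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExplodeStructSketch")
      .master("local[2]")
      .getOrCreate()

    unpivot(sample(spark), "un", Seq("chinese", "math", "english")).show(false)
    spark.stop()
  }
}
```

Because the column list is an ordinary Seq, this variant does not need the row count n to be kept in sync by hand, which the stack expression does.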
Knowledge corner: