Handling Column Names That Contain Dots in Spark SQL

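Copyright notice: this is an original article by the author and may not be reproduced without the author's permission.

Writing this by hand took real effort, so please respect the work. Thank you.

Author: http://blog.csdn.net/wang_wbq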

Because this question comes up often, I have excerpted this section from another post of mine: https://blog.csdn.net/wang_wbq/article/details/79672768

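This post covers how to access fields inside struct columns by path, including paths whose names contain special characters such as the dot "." and the backtick "`". First, let's create a basic DataFrame: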

scala> case class A(a: Int, b: String, c: Boolean, d: java.sql.Date, e: java.sql.Timestamp, f: Array[Byte])
defined class A

scala> case class B(`b.g`: Byte, `b.h`: Short, `b.i`: Long, `b.j`: Float)
defined class B

scala> case class C(x: A, y: A, z: B)
defined class C

scala> val ss = (1 to 10).map(i => C(A(i, s"str_$i", i % 2 == 0, new java.sql.Date(i * 10000000), new java.sql.Timestamp(i * 10000000), s"str_$i".getBytes), A(i, s"str_$i", i % 2 == 0, new java.sql.Date(i * 10000000), new java.sql.Timestamp(i * 10000000), s"str_$i".getBytes), B(i.toByte, i.toShort, i, i)))
ss: scala.collection.immutable.IndexedSeq[C] = Vector(C(A(1,str_1,false,1970-01-01,1970-01-01 10:46:40.0,[B@6f558842),A(1,str_1,false,1970-01-01,1970-01-01 10:46:40.0,[B@6a7fd5ba),B(1,1,1,1.0)), C(A(2,str_2,true,1970-01-01,1970-01-01 13:33:20.0,[B@38726836),A(2,str_2,true,1970-01-01,1970-01-01 13:33:20.0,[B@43a5e5a6),B(2,2,2,2.0)), C(A(3,str_3,false,1970-01-01,1970-01-01 16:20:00.0,[B@f17208),A(3,str_3,false,1970-01-01,1970-01-01 16:20:00.0,[B@1437b5a9),B(3,3,3,3.0)), C(A(4,str_4,true,1970-01-01,1970-01-01 19:06:40.0,[B@3f2f4cc1),A(4,str_4,true,1970-01-01,1970-01-01 19:06:40.0,[B@534b0657),B(4,4,4,4.0)), C(A(5,str_5,false,1970-01-01,1970-01-01 21:53:20.0,[B@58215f04),A(5,str_5,false,1970-01-01,1970-01-01 21:53:20.0,[B@71923354),B(5,5,5,5.0)), C(A(6,str_6,true,1970-01-02,1970-01-02 00:40...

scala> val df = spark.createDataFrame(ss).select(col("x").as("`x`"), col("y"), col("z").as("c.z"))
df: org.apache.spark.sql.DataFrame = [`x`: struct<a: int, b: string ... 4 more fields>, y: struct<a: int, b: string ... 4 more fields> ... 1 more field]

scala> df.show(10, false)
+-----------------------------------------------------------+-----------------------------------------------------------+------------------+
|`x`                                                        |y                                                          |c.z               |
+-----------------------------------------------------------+-----------------------------------------------------------+------------------+
|[1, str_1, false, 1970-01-01, 1970-01-01 10:46:40, str_1]  |[1, str_1, false, 1970-01-01, 1970-01-01 10:46:40, str_1]  |[1, 1, 1, 1.0]    |
|[2, str_2, true, 1970-01-01, 1970-01-01 13:33:20, str_2]   |[2, str_2, true, 1970-01-01, 1970-01-01 13:33:20, str_2]   |[2, 2, 2, 2.0]    |
|[3, str_3, false, 1970-01-01, 1970-01-01 16:20:00, str_3]  |[3, str_3, false, 1970-01-01, 1970-01-01 16:20:00, str_3]  |[3, 3, 3, 3.0]    |
|[4, str_4, true, 1970-01-01, 1970-01-01 19:06:40, str_4]   |[4, str_4, true, 1970-01-01, 1970-01-01 19:06:40, str_4]   |[4, 4, 4, 4.0]    |
|[5, str_5, false, 1970-01-01, 1970-01-01 21:53:20, str_5]  |[5, str_5, false, 1970-01-01, 1970-01-01 21:53:20, str_5]  |[5, 5, 5, 5.0]    |
|[6, str_6, true, 1970-01-02, 1970-01-02 00:40:00, str_6]   |[6, str_6, true, 1970-01-02, 1970-01-02 00:40:00, str_6]   |[6, 6, 6, 6.0]    |
|[7, str_7, false, 1970-01-02, 1970-01-02 03:26:40, str_7]  |[7, str_7, false, 1970-01-02, 1970-01-02 03:26:40, str_7]  |[7, 7, 7, 7.0]    |
|[8, str_8, true, 1970-01-02, 1970-01-02 06:13:20, str_8]   |[8, str_8, true, 1970-01-02, 1970-01-02 06:13:20, str_8]   |[8, 8, 8, 8.0]    |
|[9, str_9, false, 1970-01-02, 1970-01-02 09:00:00, str_9]  |[9, str_9, false, 1970-01-02, 1970-01-02 09:00:00, str_9]  |[9, 9, 9, 9.0]    |
|[10, str_10, true, 1970-01-02, 1970-01-02 11:46:40, str_10]|[10, str_10, true, 1970-01-02, 1970-01-02 11:46:40, str_10]|[10, 10, 10, 10.0]|
+-----------------------------------------------------------+-----------------------------------------------------------+------------------+


scala> df.printSchema
root
 |-- `x`: struct (nullable = true)
 |    |-- a: integer (nullable = false)
 |    |-- b: string (nullable = true)
 |    |-- c: boolean (nullable = false)
 |    |-- d: date (nullable = true)
 |    |-- e: timestamp (nullable = true)
 |    |-- f: binary (nullable = true)
 |-- y: struct (nullable = true)
 |    |-- a: integer (nullable = false)
 |    |-- b: string (nullable = true)
 |    |-- c: boolean (nullable = false)
 |    |-- d: date (nullable = true)
 |    |-- e: timestamp (nullable = true)
 |    |-- f: binary (nullable = true)
 |-- c.z: struct (nullable = true)
 |    |-- b.g: byte (nullable = false)
 |    |-- b.h: short (nullable = false)
 |    |-- b.i: long (nullable = false)
 |    |-- b.j: float (nullable = false)

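When the column names contain neither a dot "." nor a backtick "`", we can reach nested values directly with the dot separator. Because the example uses the renaming expression name AS alias, which is SQL expression syntax, we use selectExpr: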

scala> df.selectExpr("y.a", "y.b", "y.c as boolean_value", "y.d as data_value", "y.e as timestmp_value", "y.f").show
+---+------+-------------+----------+-------------------+-------------------+
|  a|     b|boolean_value|data_value|     timestmp_value|                  f|
+---+------+-------------+----------+-------------------+-------------------+
|  1| str_1|        false|1970-01-01|1970-01-01 10:46:40|   [73 74 72 5F 31]|
|  2| str_2|         true|1970-01-01|1970-01-01 13:33:20|   [73 74 72 5F 32]|
|  3| str_3|        false|1970-01-01|1970-01-01 16:20:00|   [73 74 72 5F 33]|
|  4| str_4|         true|1970-01-01|1970-01-01 19:06:40|   [73 74 72 5F 34]|
|  5| str_5|        false|1970-01-01|1970-01-01 21:53:20|   [73 74 72 5F 35]|
|  6| str_6|         true|1970-01-02|1970-01-02 00:40:00|   [73 74 72 5F 36]|
|  7| str_7|        false|1970-01-02|1970-01-02 03:26:40|   [73 74 72 5F 37]|
|  8| str_8|         true|1970-01-02|1970-01-02 06:13:20|   [73 74 72 5F 38]|
|  9| str_9|        false|1970-01-02|1970-01-02 09:00:00|   [73 74 72 5F 39]|
| 10|str_10|         true|1970-01-02|1970-01-02 11:46:40|[73 74 72 5F 31 30]|
+---+------+-------------+----------+-------------------+-------------------+


scala> df.selectExpr("y.a", "y.b", "y.c as boolean_value", "y.d as data_value", "y.e as timestmp_value", "y.f").printSchema
root
 |-- a: integer (nullable = true)
 |-- b: string (nullable = true)
 |-- boolean_value: boolean (nullable = true)
 |-- data_value: date (nullable = true)
 |-- timestmp_value: timestamp (nullable = true)
 |-- f: binary (nullable = true)

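If a name in the path itself contains a dot ".", using it unquoted raises an error. The message can be puzzling: the column is clearly listed among the input columns, yet Spark reports that it cannot resolve it: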

scala> df.selectExpr("c.z")
org.apache.spark.sql.AnalysisException: cannot resolve '`c.z`' given input columns: [`x`, y, c.z]; line 1 pos 0;
'Project ['c.z]
+- AnalysisBarrier
      +- Project [x#142 AS `x`#148, y#143, z#144 AS c.z#149]
         +- LocalRelation [x#142, y#143, z#144]

When a name in the path contains a dot ".", wrap the complete name in backticks "`" so that Spark SQL treats it as a single name rather than as a two-level path:

scala> df.select("`c.z`").show
+------------------+
|               c.z|
+------------------+
|    [1, 1, 1, 1.0]|
|    [2, 2, 2, 2.0]|
|    [3, 3, 3, 3.0]|
|    [4, 4, 4, 4.0]|
|    [5, 5, 5, 5.0]|
|    [6, 6, 6, 6.0]|
|    [7, 7, 7, 7.0]|
|    [8, 8, 8, 8.0]|
|    [9, 9, 9, 9.0]|
|[10, 10, 10, 10.0]|
+------------------+

scala> df.select("`c.z`").printSchema
root
 |-- c.z: struct (nullable = true)
 |    |-- b.g: byte (nullable = false)
 |    |-- b.h: short (nullable = false)
 |    |-- b.i: long (nullable = false)
 |    |-- b.j: float (nullable = false)
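To make the "single name versus two-level path" distinction concrete, here is a minimal sketch in plain Scala of how a path string can be split on dots that fall outside backticks. It is an illustration only, not Spark's actual resolution code: the object name NameParts and the method split are hypothetical, and the sketch deliberately ignores doubled backticks.

```scala
// Illustration only: a simplified model of splitting a column path
// into name parts. This is NOT Spark's implementation.
object NameParts {
  /** Splits `path` on dots that are outside backticks and drops the backticks. */
  def split(path: String): Seq[String] = {
    val parts = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inBackticks = false
    for (ch <- path) {
      if (ch == '`') {
        // A backtick toggles quoting; escaped (doubled) backticks are not handled here.
        inBackticks = !inBackticks
      } else if (ch == '.' && !inBackticks) {
        // An unquoted dot ends the current name part.
        parts += current.toString
        current.clear()
      } else {
        current.append(ch)
      }
    }
    parts += current.toString
    parts.toSeq
  }
}
```

Under this model, "c.z" splits into the two parts c and z (field z of a column named c, which does not exist in df), while "`c.z`" stays one part, which matches the behaviour shown in the error and in the successful query above.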

// Different levels of the path are still separated by dots

scala> df.select(col("`c.z`.`b.g`"), expr("`c.z`.`b.g` AS czbg"), col("`c.z`.`b.i`").as("czbi"), $"`c.z`.`b.j`").show
+---+----+----+----+
|b.g|czbg|czbi| b.j|
+---+----+----+----+
|  1|   1|   1| 1.0|
|  2|   2|   2| 2.0|
|  3|   3|   3| 3.0|
|  4|   4|   4| 4.0|
|  5|   5|   5| 5.0|
|  6|   6|   6| 6.0|
|  7|   7|   7| 7.0|
|  8|   8|   8| 8.0|
|  9|   9|   9| 9.0|
| 10|  10|  10|10.0|
+---+----+----+----+


scala> df.select(col("`c.z`.`b.g`"), expr("`c.z`.`b.g` AS czbg"), col("`c.z`.`b.i`").as("czbi"), $"`c.z`.`b.j`").printSchema
root
 |-- b.g: byte (nullable = true)
 |-- czbg: byte (nullable = true)
 |-- czbi: long (nullable = true)
 |-- b.j: float (nullable = true)

// You can also use square brackets or parentheses to reach the next level

scala> df.select(expr("`c.z`['b.g'] As czbg"), col("`c.z`")("b.i").as("czbi")).printSchema
root
 |-- czbg: byte (nullable = true)
 |-- czbi: long (nullable = true)

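When a name in the path contains a backtick "`", write two backticks in place of each backtick, and still wrap the whole name in backticks: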

scala> df.select(expr("```x```.a"), expr("```x```")("b")).show
+---+------+
|  a| `x`.b|
+---+------+
|  1| str_1|
|  2| str_2|
|  3| str_3|
|  4| str_4|
|  5| str_5|
|  6| str_6|
|  7| str_7|
|  8| str_8|
|  9| str_9|
| 10|str_10|
+---+------+


scala> df.select(expr("```x```.a"), expr("```x```")("b")).printSchema
root
 |-- a: integer (nullable = true)
 |-- `x`.b: string (nullable = true)
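The two quoting rules used so far (wrap each name in backticks, and double any backtick inside the name) can be folded into a small helper. This is a minimal sketch in plain Scala; the names QuoteIdent, quote and path are hypothetical and are not part of the Spark API.

```scala
// Hypothetical helper, not part of Spark: builds backtick-quoted paths
// following the two rules described in this post.
object QuoteIdent {
  /** Wraps one name in backticks, doubling every backtick inside it. */
  def quote(name: String): String =
    "`" + name.replace("`", "``") + "`"

  /** Quotes each level separately and joins the levels with dots. */
  def path(parts: String*): String =
    parts.map(quote).mkString(".")
}
```

For example, QuoteIdent.path("c.z", "b.g") yields the string `c.z`.`b.g` and QuoteIdent.quote("`x`") yields ```x```, the same strings passed to col and expr in the examples above. As noted in this post, the doubled-backtick form has to go through expr.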

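The basis for this in the source code is the class org.apache.spark.sql.catalyst.parser.PostProcessor extends SqlBaseBaseListener, where Spark SQL overrides an Antlr 4 listener method to replace each pair of backticks with a single backtick. Because this is defined in the expression parser, the doubled-backtick form must be used inside the expr function: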

  override def exitQuotedIdentifier(ctx: SqlBaseParser.QuotedIdentifierContext): Unit = {
    replaceTokenByIdentifier(ctx, 1) { token =>
      // Remove the double back ticks in the string.
      token.setText(token.getText.replace("``", "`"))
      token
    }
  }
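The effect of this listener on a single token can be modelled in plain Scala. The sketch below assumes that the argument 1 passed to replaceTokenByIdentifier means "strip one character from each end of the token"; that is my reading, since the body of that helper is not shown here. The names UnquoteSketch and unquote are hypothetical.

```scala
// A model of what the listener appears to do to one quoted token.
// Assumption: the outer backticks are stripped first, then doubled
// backticks inside the name are collapsed to single ones.
object UnquoteSketch {
  def unquote(token: String): String = {
    require(
      token.length >= 2 && token.head == '`' && token.last == '`',
      "expected a backtick-quoted token")
    token
      .substring(1, token.length - 1) // drop the outer pair of backticks
      .replace("``", "`")             // same replacement as in the listener
  }
}
```

With this model, the token ```x``` becomes the three-character name `x`, which is the column name created by as("`x`") at the start of this post, and `c.z` becomes c.z.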

In the Antlr 4 grammar file, we can see that quotedIdentifier matches a backtick-quoted string:

quotedIdentifier
    : BACKQUOTED_IDENTIFIER
    ;

BACKQUOTED_IDENTIFIER
    : '`' ( ~'`' | '``' )* '`'
    ;
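The BACKQUOTED_IDENTIFIER rule can be transcribed into a regular expression to check by hand which strings it accepts. This is a sketch in plain Scala for experimentation, not something Spark itself uses; the object name BackquotedIdentifier is hypothetical.

```scala
// Regex transcription of the lexer rule:
//   '`' ( ~'`' | '``' )* '`'
// that is: an opening backtick, then any number of non-backtick
// characters or doubled backticks, then a closing backtick.
object BackquotedIdentifier {
  private val Pattern = "`(?:[^`]|``)*`"

  /** True when the whole string is a single backtick-quoted identifier. */
  def matches(text: String): Boolean = text.matches(Pattern)
}
```

Both `c.z` and ```x``` are accepted as single identifiers, while the unquoted c.z is not, and neither is a string such as `a`b` whose inner backtick is not doubled.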

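The following post explains how to find the Antlr 4 grammar file for Spark SQL in Spark 2.0 and later. If you are interested, you can use it to study Antlr 4 and the Spark SQL source code in more depth: https://blog.csdn.net/wang_wbq/article/details/79673780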
