Copyright notice: this is an original article by the author and may not be reposted without permission.
Typing all of this up by hand takes real effort, so please respect the work. Thank you.
Author: http://blog.csdn.net/wang_wbq
Because this question comes up often, this section is excerpted from another post of mine: https://blog.csdn.net/wang_wbq/article/details/79672768
Here we look at how to index into struct data, including paths that contain special characters such as the dot "." and the backtick "`". First, let's create a basic DataFrame:
scala> case class A(a: Int, b: String, c: Boolean, d: java.sql.Date, e: java.sql.Timestamp, f: Array[Byte])
defined class A
scala> case class B(`b.g`: Byte, `b.h`: Short, `b.i`: Long, `b.j`: Float)
defined class B
scala> case class C(x: A, y: A, z: B)
defined class C
scala> val ss = (1 to 10).map(i => C(A(i, s"str_$i", i % 2 == 0, new java.sql.Date(i * 10000000), new java.sql.Timestamp(i * 10000000), s"str_$i".getBytes), A(i, s"str_$i", i % 2 == 0, new java.sql.Date(i * 10000000), new java.sql.Timestamp(i * 10000000), s"str_$i".getBytes), B(i.toByte, i.toShort, i, i)))
ss: scala.collection.immutable.IndexedSeq[C] = Vector(C(A(1,str_1,false,1970-01-01,1970-01-01 10:46:40.0,[B@6f558842),A(1,str_1,false,1970-01-01,1970-01-01 10:46:40.0,[B@6a7fd5ba),B(1,1,1,1.0)), C(A(2,str_2,true,1970-01-01,1970-01-01 13:33:20.0,[B@38726836),A(2,str_2,true,1970-01-01,1970-01-01 13:33:20.0,[B@43a5e5a6),B(2,2,2,2.0)), C(A(3,str_3,false,1970-01-01,1970-01-01 16:20:00.0,[B@f17208),A(3,str_3,false,1970-01-01,1970-01-01 16:20:00.0,[B@1437b5a9),B(3,3,3,3.0)), C(A(4,str_4,true,1970-01-01,1970-01-01 19:06:40.0,[B@3f2f4cc1),A(4,str_4,true,1970-01-01,1970-01-01 19:06:40.0,[B@534b0657),B(4,4,4,4.0)), C(A(5,str_5,false,1970-01-01,1970-01-01 21:53:20.0,[B@58215f04),A(5,str_5,false,1970-01-01,1970-01-01 21:53:20.0,[B@71923354),B(5,5,5,5.0)), C(A(6,str_6,true,1970-01-02,1970-01-02 00:40...
scala> val df = spark.createDataFrame(ss).select(col("x").as("`x`"), col("y"), col("z").as("c.z"))
df: org.apache.spark.sql.DataFrame = [`x`: struct<a: int, b: string ... 4 more fields>, y: struct<a: int, b: string ... 4 more fields> ... 1 more field]
scala> df.show(10, false)
+-----------------------------------------------------------+-----------------------------------------------------------+------------------+
|`x` |y |c.z |
+-----------------------------------------------------------+-----------------------------------------------------------+------------------+
|[1, str_1, false, 1970-01-01, 1970-01-01 10:46:40, str_1] |[1, str_1, false, 1970-01-01, 1970-01-01 10:46:40, str_1] |[1, 1, 1, 1.0] |
|[2, str_2, true, 1970-01-01, 1970-01-01 13:33:20, str_2] |[2, str_2, true, 1970-01-01, 1970-01-01 13:33:20, str_2] |[2, 2, 2, 2.0] |
|[3, str_3, false, 1970-01-01, 1970-01-01 16:20:00, str_3] |[3, str_3, false, 1970-01-01, 1970-01-01 16:20:00, str_3] |[3, 3, 3, 3.0] |
|[4, str_4, true, 1970-01-01, 1970-01-01 19:06:40, str_4] |[4, str_4, true, 1970-01-01, 1970-01-01 19:06:40, str_4] |[4, 4, 4, 4.0] |
|[5, str_5, false, 1970-01-01, 1970-01-01 21:53:20, str_5] |[5, str_5, false, 1970-01-01, 1970-01-01 21:53:20, str_5] |[5, 5, 5, 5.0] |
|[6, str_6, true, 1970-01-02, 1970-01-02 00:40:00, str_6] |[6, str_6, true, 1970-01-02, 1970-01-02 00:40:00, str_6] |[6, 6, 6, 6.0] |
|[7, str_7, false, 1970-01-02, 1970-01-02 03:26:40, str_7] |[7, str_7, false, 1970-01-02, 1970-01-02 03:26:40, str_7] |[7, 7, 7, 7.0] |
|[8, str_8, true, 1970-01-02, 1970-01-02 06:13:20, str_8] |[8, str_8, true, 1970-01-02, 1970-01-02 06:13:20, str_8] |[8, 8, 8, 8.0] |
|[9, str_9, false, 1970-01-02, 1970-01-02 09:00:00, str_9] |[9, str_9, false, 1970-01-02, 1970-01-02 09:00:00, str_9] |[9, 9, 9, 9.0] |
|[10, str_10, true, 1970-01-02, 1970-01-02 11:46:40, str_10]|[10, str_10, true, 1970-01-02, 1970-01-02 11:46:40, str_10]|[10, 10, 10, 10.0]|
+-----------------------------------------------------------+-----------------------------------------------------------+------------------+
scala> df.printSchema
root
|-- `x`: struct (nullable = true)
| |-- a: integer (nullable = false)
| |-- b: string (nullable = true)
| |-- c: boolean (nullable = false)
| |-- d: date (nullable = true)
| |-- e: timestamp (nullable = true)
| |-- f: binary (nullable = true)
|-- y: struct (nullable = true)
| |-- a: integer (nullable = false)
| |-- b: string (nullable = true)
| |-- c: boolean (nullable = false)
| |-- d: date (nullable = true)
| |-- e: timestamp (nullable = true)
| |-- f: binary (nullable = true)
|-- c.z: struct (nullable = true)
| |-- b.g: byte (nullable = false)
| |-- b.h: short (nullable = false)
| |-- b.i: long (nullable = false)
| |-- b.j: float (nullable = false)
When the column names contain neither a dot "." nor a backtick "`", we can fetch nested values directly with the dot separator. Since we also want the rename expression name AS alias, we use selectExpr here:
scala> df.selectExpr("y.a", "y.b", "y.c as boolean_value", "y.d as data_value", "y.e as timestmp_value", "y.f").show
+---+------+-------------+----------+-------------------+-------------------+
| a| b|boolean_value|data_value| timestmp_value| f|
+---+------+-------------+----------+-------------------+-------------------+
| 1| str_1| false|1970-01-01|1970-01-01 10:46:40| [73 74 72 5F 31]|
| 2| str_2| true|1970-01-01|1970-01-01 13:33:20| [73 74 72 5F 32]|
| 3| str_3| false|1970-01-01|1970-01-01 16:20:00| [73 74 72 5F 33]|
| 4| str_4| true|1970-01-01|1970-01-01 19:06:40| [73 74 72 5F 34]|
| 5| str_5| false|1970-01-01|1970-01-01 21:53:20| [73 74 72 5F 35]|
| 6| str_6| true|1970-01-02|1970-01-02 00:40:00| [73 74 72 5F 36]|
| 7| str_7| false|1970-01-02|1970-01-02 03:26:40| [73 74 72 5F 37]|
| 8| str_8| true|1970-01-02|1970-01-02 06:13:20| [73 74 72 5F 38]|
| 9| str_9| false|1970-01-02|1970-01-02 09:00:00| [73 74 72 5F 39]|
| 10|str_10| true|1970-01-02|1970-01-02 11:46:40|[73 74 72 5F 31 30]|
+---+------+-------------+----------+-------------------+-------------------+
scala> df.selectExpr("y.a", "y.b", "y.c as boolean_value", "y.d as data_value", "y.e as timestmp_value", "y.f").printSchema
root
|-- a: integer (nullable = true)
|-- b: string (nullable = true)
|-- boolean_value: boolean (nullable = true)
|-- data_value: date (nullable = true)
|-- timestmp_value: timestamp (nullable = true)
|-- f: binary (nullable = true)
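The same access also works through the DataFrame API without selectExpr. A minimal sketch (not from the original post), assuming org.apache.spark.sql.functions.col is in scope as in the snippets above:

// Equivalent to the selectExpr call above: dot paths in col() plus .as() for the aliases
df.select(
  col("y.a"),
  col("y.c").as("boolean_value"),
  col("y.e").as("timestmp_value")
).printSchema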
If a path name contains a dot ".", using the dot directly raises an error, and the message can be puzzling: the column is clearly listed among the input columns, yet it cannot be resolved:
scala> df.selectExpr("c.z")
org.apache.spark.sql.AnalysisException: cannot resolve '`c.z`' given input columns: [`x`, y, c.z]; line 1 pos 0;
'Project ['c.z]
+- AnalysisBarrier
+- Project [x#142 AS `x`#148, y#143, z#144 AS c.z#149]
+- LocalRelation [x#142, y#143, z#144]
When a path name contains a dot ".", we have to wrap the complete name in backticks "`" so that Spark SQL treats it as one whole name instead of a two-level path:
scala> df.select("`c.z`").show
+------------------+
| c.z|
+------------------+
| [1, 1, 1, 1.0]|
| [2, 2, 2, 2.0]|
| [3, 3, 3, 3.0]|
| [4, 4, 4, 4.0]|
| [5, 5, 5, 5.0]|
| [6, 6, 6, 6.0]|
| [7, 7, 7, 7.0]|
| [8, 8, 8, 8.0]|
| [9, 9, 9, 9.0]|
|[10, 10, 10, 10.0]|
+------------------+
scala> df.select("`c.z`").printSchema
root
|-- c.z: struct (nullable = true)
| |-- b.g: byte (nullable = false)
| |-- b.h: short (nullable = false)
| |-- b.i: long (nullable = false)
| |-- b.j: float (nullable = false)
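The same backtick quoting should also work inside SQL expression strings, for example via selectExpr; a small sketch, not taken from the original post:

// Backticks make "c.z" and "b.g" single identifiers inside the expression string
df.selectExpr("`c.z`", "`c.z`.`b.g` AS czbg").printSchema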
// Queries across nesting levels still use the dot separator
scala> df.select(col("`c.z`.`b.g`"), expr("`c.z`.`b.g` AS czbg"), col("`c.z`.`b.i`").as("czbi"), $"`c.z`.`b.j`").show
+---+----+----+----+
|b.g|czbg|czbi| b.j|
+---+----+----+----+
| 1| 1| 1| 1.0|
| 2| 2| 2| 2.0|
| 3| 3| 3| 3.0|
| 4| 4| 4| 4.0|
| 5| 5| 5| 5.0|
| 6| 6| 6| 6.0|
| 7| 7| 7| 7.0|
| 8| 8| 8| 8.0|
| 9| 9| 9| 9.0|
| 10| 10| 10|10.0|
+---+----+----+----+
scala> df.select(col("`c.z`.`b.g`"), expr("`c.z`.`b.g` AS czbg"), col("`c.z`.`b.i`").as("czbi"), $"`c.z`.`b.j`").printSchema
root
|-- b.g: byte (nullable = true)
|-- czbg: byte (nullable = true)
|-- czbi: long (nullable = true)
|-- b.j: float (nullable = true)
// You can also use square brackets or parentheses to reach the next level
scala> df.select(expr("`c.z`['b.g'] As czbg"), col("`c.z`")("b.i").as("czbi")).printSchema
root
|-- czbg: byte (nullable = true)
|-- czbi: long (nullable = true)
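If you prefer a named method, Column.getField takes the field name as a literal string, so no backticks are needed for the inner name; a brief sketch under the same schema:

// getField treats "b.g" as a single field name rather than a path
df.select(col("`c.z`").getField("b.g").as("czbg")).printSchema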
When a path name itself contains a backtick "`", we have to write two backticks in place of each single backtick:
scala> df.select(expr("```x```.a"), expr("```x```")("b")).show
+---+------+
| a| `x`.b|
+---+------+
| 1| str_1|
| 2| str_2|
| 3| str_3|
| 4| str_4|
| 5| str_5|
| 6| str_6|
| 7| str_7|
| 8| str_8|
| 9| str_9|
| 10|str_10|
+---+------+
scala> df.select(expr("```x```.a"), expr("```x```")("b")).printSchema
root
|-- a: integer (nullable = true)
|-- `x`.b: string (nullable = true)
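Presumably the same doubled-backtick escaping applies wherever the SQL parser is involved, for example when querying a temporary view; a sketch (the view name demo is made up here):

// The column literally named `x` is written as ```x``` in the SQL text
df.createOrReplaceTempView("demo")
spark.sql("SELECT ```x```.a, `c.z`.`b.g` FROM demo").printSchema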
The basis for this doubled-backtick escaping is the class org.apache.spark.sql.catalyst.parser.PostProcessor extends SqlBaseBaseListener, where Spark SQL overrides an Antlr 4 listener method to replace two backticks with a single one. Since this is defined in the expression parser, we have to go through the expr function to use it:
override def exitQuotedIdentifier(ctx: SqlBaseParser.QuotedIdentifierContext): Unit = {
  replaceTokenByIdentifier(ctx, 1) { token =>
    // Remove the double back ticks in the string.
    token.setText(token.getText.replace("``", "`"))
    token
  }
}
In the Antlr 4 grammar file, we can see that quotedIdentifier is the rule that matches backtick-quoted strings:
quotedIdentifier
    : BACKQUOTED_IDENTIFIER
    ;

BACKQUOTED_IDENTIFIER
    : '`' ( ~'`' | '``' )* '`'
    ;
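To make the two snippets above concrete, here is a tiny standalone sketch (my own illustration, not Spark source) of what the grammar plus the listener do with an identifier written as ```x```:

// Mirror of the rule: drop the surrounding backticks, then collapse '``' into '`'
def unescapeBackquoted(raw: String): String = {
  val inner = raw.substring(1, raw.length - 1) // strip the outer '`' pair
  inner.replace("``", "`")                     // '``' -> '`'
}

unescapeBackquoted("```x```") // yields "`x`", the literal column name used above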
This post describes how to find the Spark SQL Antlr 4 grammar file in Spark 2.0 and later; if you're interested, it's a good starting point for digging further into Antlr 4 and the Spark SQL source: https://blog.csdn.net/wang_wbq/article/details/79673780