spark.read does not provide a built-in method for reading XML files.
Two possible approaches:
1) Find a third-party jar that provides this
2) Implement a custom external data source
After searching Baidu and Google, there is indeed a jar that solves this:
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.5.0
Official site: https://github.com/databricks/spark-xml
Usage:
1) Add the dependency:
Scala 2.11:
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.6.0
For spark-shell/spark-submit:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.6.0
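For an sbt build, the same dependency would look roughly like this (a sketch, assuming a Scala 2.11 Spark build; adjust the version to match yours):

```scala
// build.sbt -- sketch; %% appends the Scala binary version (_2.11) automatically
libraryDependencies += "com.databricks" %% "spark-xml" % "0.6.0"
```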
2) XML file format (the tags were lost in the original post; reconstructed here from the schema and the result below):
<Item>
    <CDate>2018-05-08T00:00::00</CDate>
    <ListItemData>
        <ItemData>
            <IdKey>2</IdKey>
            <Value>1</Value>
        </ItemData>
        <ItemData>
            <IdKey>61</IdKey>
            <Value>2</Value>
        </ItemData>
    </ListItemData>
</Item>
3) Parsing code:
package com.ruoze

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._ // needed for the explode function

object TestMy {

  val innerSchema = StructType(
    StructField("ItemData",
      ArrayType(
        StructType(
          StructField("IdKey", LongType, true) ::
          StructField("Value", LongType, true) :: Nil
        )
      ), true) :: Nil
  )

  val schema = StructType(
    StructField("CDate", StringType, true) ::
    StructField("ListItemData", innerSchema, true) :: Nil
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("TestMy")
      .getOrCreate()
    import spark.implicits._

    val xmlFile = "file:///C:/Users/小西学舞/Desktop/test.xml"
    val df: DataFrame = spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "Item")
      .schema(schema)
      .load(xmlFile)

    // df.printSchema()
    /*
    root
     |-- CDate: string (nullable = true)
     |-- ListItemData: struct (nullable = true)
     |    |-- ItemData: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- IdKey: long (nullable = true)
     |    |    |    |-- Value: long (nullable = true)
    */

    // Flatten ItemData: explode turns one row into multiple rows
    df.withColumn("ItemData", explode($"ListItemData.ItemData"))
      .select("CDate", "ItemData.*") // select the required columns
      .show()

    spark.stop()
  }
}
The result:
+--------------------+-----+-----+
|CDate |IdKey|Value|
+--------------------+-----+-----+
|2018-05-08T00:00::00|2 |1 |
|2018-05-08T00:00::00|61 |2 |
+--------------------+-----+-----+
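Conceptually, explode does for a DataFrame what flatMap does for plain Scala collections: each element of the nested array becomes its own output row, with the outer field (CDate) repeated alongside it. A minimal plain-Scala sketch of that flattening, with hypothetical case classes standing in for the parsed XML rows (no Spark required):

```scala
// Plain-Scala analogy of the explode + select step above.
// Item / ItemData are hypothetical stand-ins for the parsed XML structure.
case class ItemData(idKey: Long, value: Long)
case class Item(cDate: String, itemData: List[ItemData])

val row = Item("2018-05-08T00:00::00", List(ItemData(2L, 1L), ItemData(61L, 2L)))

// One Item becomes one output tuple per ItemData element, carrying CDate along --
// the same shape explode($"ListItemData.ItemData") + select("CDate", "ItemData.*") produces.
val flattened: List[(String, Long, Long)] =
  List(row).flatMap(item => item.itemData.map(d => (item.cDate, d.idKey, d.value)))

flattened.foreach(println)
// (2018-05-08T00:00::00,2,1)
// (2018-05-08T00:00::00,61,2)
```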
Reposted from: https://blog.csdn.net/zpf336/article/details/88827081