Reading text files with Spark can be a headache given the sheer variety of file formats. Fortunately, Databricks provides a rich set of APIs for parsing them: pull in the corresponding dependency, read the file through Spark's SQLContext, and you get nicely structured data back.
Below we walk through how Spark reads, writes, and parses several common text file formats on HDFS.
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.11</artifactId>
    <version>1.4.0</version>
</dependency>
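When experimenting in spark-shell rather than building a project, the same artifact can be supplied on the command line with --packages (a sketch; the coordinates mirror the Maven dependency above):
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0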
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", ",") // field delimiter
.option("header", "true") // whether to treat the first line as the header row
.option("inferSchema", "false") // whether to automatically infer column types from the content
.option("codec", "none") // compression codec
.load(csvFile) // HDFS path + file name of the CSV file
df.write.format("com.databricks.spark.csv")
.option("header", "true")
.option("codec", "none")
.save(tempHdfsPath) // HDFS output path; create it ahead of time
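With inferSchema left at "false" (as above), every column is read back as a string; to get typed columns without a second pass over the data, an explicit schema can be supplied instead (a sketch; the column names and types are made up for illustration):
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)))

val typedDf = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.schema(customSchema) // skip inference and use the declared types
.load(csvFile)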
https://github.com/databricks/spark-csv
Sample contents of testTxt.txt:
a,b,c,d
123,345,789,5
34,45,90,9878
scala> sqlContext.read.text("/home/test/testTxt.txt").show
+-------------+
| value|
+-------------+
| a,b,c,d|
|123,345,789,5|
|34,45,90,9878|
+-------------+
A txt file is read line by line as whole rows; to get individual fields you have to split the DataFrame's single column (a quick sketch of one way to do this follows below). For details, see the next post, 《Spark DataFrame列拆分与合并》 (Spark DataFrame column splitting and merging).
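For instance, with the comma-separated sample above, one way to break the single value column into fields is the split function (a sketch; the column names a, b, c, d simply mirror the sample's header line):
import org.apache.spark.sql.functions.split

val txtDf = sqlContext.read.text("/home/test/testTxt.txt")
val parts = split(txtDf("value"), ",") // split every line on the delimiter
val fieldsDf = txtDf.select(
  parts.getItem(0).as("a"),
  parts.getItem(1).as("b"),
  parts.getItem(2).as("c"),
  parts.getItem(3).as("d"))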
import org.apache.spark.sql.functions.concat_ws
import org.apache.spark.sql.types.StringType

// collect all columns of the DataFrame as Column objects
val columnArr = df.columns.map { colName =>
  df.col(colName)
}
df.select(concat_ws(",", columnArr: _*) // join the column values with a delimiter
.cast(StringType))
.write.format("text")
.save(tempHdfsPath) // HDFS output path; create it ahead of time
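By default this produces one part file per partition; if a single output file is preferred, the DataFrame can be coalesced before writing (a sketch; note that coalesce(1) funnels all data through a single task):
df.select(concat_ws(",", columnArr: _*).cast(StringType))
.coalesce(1) // merge partitions so only one part file is written
.write.format("text")
.save(tempHdfsPath)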
{"a":"1747","b":"id抽取_SDK_按小时","c":1,"d":"2018112713"}
{"a":"456","b":"232","c":10,"d":"203227324"}
scala> sqlContext.read.format("json").load("/home/test/testJson.json").show
+----+------------+---+----------+
| a| b| c| d|
+----+------------+---+----------+
|1747|id抽取_SDK_按小时| 1|2018112713|
| 456| 232| 10| 203227324|
+----+------------+---+----------+
df.write.format("json")
.save(tempHdfsPath) // HDFS output path; create it ahead of time
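Note that Spark expects JSON Lines input: each line must be a complete JSON object, as in the sample above. If the file is instead a single pretty-printed JSON document spanning multiple lines, it can be read with the multiLine option (a sketch, assuming Spark 2.2 or later where the option was introduced; spark is a SparkSession as in the later examples and jsonFile is a placeholder path):
val multiLineDf = spark.read
.option("multiLine", "true") // treat the whole file as one JSON document
.format("json")
.load(jsonFile) // jsonFile: HDFS path of the JSON file (placeholder)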
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.11</artifactId>
    <version>0.12.2</version>
</dependency>
import org.apache.spark.sql._
val spark: SparkSession = ???
val df = spark.read
.format("com.crealytics.spark.excel")
.option("useHeader", "true") // 是否将第一行作为表头
.option("inferSchema", "false") // 是否推断schema
.option("workbookPassword", "None") // excel文件的打开密码
.load(excelFile) //excel文件路径 + 文件名
df.write.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
.option("inferSchema", "false")
.option("workbookPassword", "None")
.save(tempHdfsPath) // HDFS output path; create it ahead of time
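To read a specific sheet or cell range instead of the default first sheet, recent spark-excel releases document a dataAddress option (a sketch; the sheet name "Sheet2" and the starting cell are placeholders, and availability may vary across versions):
val sheetDf = spark.read
.format("com.crealytics.spark.excel")
.option("dataAddress", "'Sheet2'!A1") // start reading at cell A1 of sheet "Sheet2"
.option("useHeader", "true")
.load(excelFile)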
https://github.com/crealytics/spark-excel
Sample contents of the XML file (each <testXml> element is one row):
<catalog>
    <testXml>
        <to>Tove</to>
        <from>Jani</from>
        <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
    </testXml>
    <testXml>
        <to>ksdhf</to>
        <from>Jasfdi</from>
        <heading>Re</heading>
        <body>Don't forget me</body>
    </testXml>
</catalog>
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.11</artifactId>
    <version>0.6.0</version>
</dependency>
import org.apache.spark.sql._
val spark: SparkSession = ???
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "testXml") // xml文件rowTag,分行标识,"testXml"即为上文rowTag
.load(xmlFile) //xml文件路径+文件名
df.write.format("com.databricks.spark.xml")
.option("rowTag", "testXml")
.option("rootTag", "catalog")
.save(tempHdfsPath) // HDFS output path; create it ahead of time
https://github.com/databricks/spark-xml
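After loading, each child element of the row tag becomes a column, so the DataFrame can be inspected and queried like any other (a sketch; the to and heading column names follow the child elements of the sample above):
df.printSchema() // columns are derived from the child elements of each <testXml> row
df.select("to", "heading").show() // pick individual fields just like any other DataFrame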
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.4</version>
</dependency>
import org.apache.avro.Schema
import org.apache.spark.sql._

val spark: SparkSession = ???
spark.conf.set("spark.sql.avro.compression.codec", "deflate") // compression codec for Avro output
spark.conf.set("spark.sql.avro.deflate.level", "2") // deflate compression level

// the avroSchema option expects the schema itself as JSON text, so parse the .avsc file first
val avroSchema = new Schema.Parser().parse(new java.io.File("/.../.../test.avsc")).toString
val df: DataFrame = spark.read
.format("avro")
.option("avroSchema", avroSchema) // user-supplied Avro schema describing the file's fields
.load(avroFilePath) // path to the Avro files
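The other formats above all include a write example; for completeness, writing a DataFrame back out as Avro with the compression settings above looks like this (a minimal sketch):
df.write
.format("avro")
.save(tempHdfsPath) // HDFS output path, as in the earlier examples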