SparkSQL: Writing Output Partitioned by Column Value to the Local Filesystem

Raw data:

[hadoop@hadoop000 data]$ cat infos.txt 
1,ruoze,30
2,jepson,18
3,spark,30
[hadoop@hadoop000 bin]$ ./spark-shell --master local[2] --jars ~/software/mysql-connector-java-5.1.27.jar
scala> case class Info (id: Int, name: String, age: Int)
scala> val info = spark.sparkContext.textFile("file:///home/hadoop/data/infos.txt")
scala> val infoDF = info.map(_.split(",")).map(x=> Info(x(0).toInt,x(1),x(2).toInt)).toDF
scala> infoDF.show()
+---+------+---+                                                                
| id|  name|age|
+---+------+---+
|  1| ruoze| 30|
|  2|jepson| 18|
|  3| spark| 30|
+---+------+---+
scala> infoDF.write.mode("overwrite").format("json").partitionBy("age").save("file:///home/hadoop/data/output")
[hadoop@hadoop000 ~]$ cd ~/data/output
[hadoop@hadoop000 output]$ ll
total 8
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 24 07:29 age=18
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 24 07:29 age=30
-rw-r--r--. 1 hadoop hadoop    0 Sep 24 07:29 _SUCCESS
[hadoop@hadoop000 output]$ cd age=30
[hadoop@hadoop000 age=30]$ ll
total 8
-rw-r--r--. 1 hadoop hadoop 24 Sep 24 07:29 part-00000-986f0ca2-b9bd-4ff7-8b63-9acf21f9e0d4.c000.json
-rw-r--r--. 1 hadoop hadoop 24 Sep 24 07:29 part-00001-986f0ca2-b9bd-4ff7-8b63-9acf21f9e0d4.c000.json
[hadoop@hadoop000 age=30]$ cat part*
{"id":1,"name":"ruoze"}
{"id":3,"name":"spark"}

The same logic implemented as a standalone application in IDEA:

import org.apache.spark.sql.SparkSession

object SparkSQLApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkSQLApp")
      .master("local[2]")
      .getOrCreate()

    // Read the raw file as an RDD of lines.
    val info = spark.sparkContext.textFile("file:///E:/BigDataSoftware/data/infos.txt")

    // toDF needs the implicit encoders in scope.
    import spark.implicits._
    val infoDF = info.map(_.split(",")).map(x => Info(x(0).toInt, x(1), x(2).toInt)).toDF()

    // timestampFormat must be overridden here, otherwise the write
    // fails with "Illegal pattern component: XXX" (see below).
    infoDF.write
      .mode("overwrite")
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .format("json")
      .partitionBy("age")
      .save("file:///E:/BigDataSoftware/data/output")

    spark.stop()
  }

  case class Info(id: Int, name: String, age: Int)
}
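
If one file per age=... directory is preferred for data this small, coalescing to a single partition before the write is a simple (if serial) variant — a sketch, not part of the original app:

    // Variant: coalesce to one partition so a single task writes the
    // output, yielding one part file per age=... directory.
    infoDF.coalesce(1)
      .write.mode("overwrite")
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .format("json")
      .partitionBy("age")
      .save("file:///E:/BigDataSoftware/data/output")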

Without .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ"), the write fails with:

Exception in thread "main" java.lang.IllegalArgumentException: Illegal pattern component: XXX
    at org.apache.commons.lang3.time.FastDateFormat.parsePattern(FastDateFormat.java:577)
    at org.apache.commons.lang3.time.FastDateFormat.init(FastDateFormat.java:444)
    at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:437)
    ...

Cause:

The JSON writer's default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX. The XXX (ISO 8601 timezone) pattern is only supported by commons-lang3's FastDateFormat from version 3.5 on, so with an older commons-lang3 jar on the classpath the default pattern itself is rejected as illegal. Setting timestampFormat explicitly when writing the DataFrame out, here ending in ZZ (which still includes the timezone), avoids the unsupported pattern.
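
An alternative fix, assuming the root cause really is an older commons-lang3 jar shadowing the one Spark ships with, is to pin the dependency in build.sbt to a version whose FastDateFormat understands XXX (3.5 or later) — a sketch:

// build.sbt: force a commons-lang3 that supports the XXX pattern.
dependencyOverrides += "org.apache.commons" % "commons-lang3" % "3.5"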
