SparkSQL: Working with Various Data Sources

Contents

  • JSON files
  • Text files
  • Parquet files
  • Converting JSON to Parquet
  • MySQL
  • Hive

JSON Files

def json(spark: SparkSession): Unit = {
  // Read the JSON file into a DataFrame; the schema is inferred automatically
  val jsonDF: DataFrame = spark.read.json("D:\\study\\workspace\\spark-sql-train\\data\\people.json")
  // jsonDF.show()
  // Filter and project, then write the result back out as JSON
  jsonDF.filter("age > 20").select("name").write.mode(SaveMode.Overwrite).json("out")
}
  
Under the hood, json("out") is simply shorthand for format("json").save("out").
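The same write can therefore be expressed through the generic DataFrameWriter API. A minimal sketch reusing the jsonDF from above:

// Equivalent to .json("out"): name the format explicitly and call save()
jsonDF.filter("age > 20").select("name")
  .write.mode(SaveMode.Overwrite)
  .format("json")
  .save("out")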

Text Files

people.txt:
Michael, 29
Andy, 30
Justin, 19

def text(spark: SparkSession): Unit = {
  import spark.implicits._
  // The text source reads each line into a single string column named "value"
  val textDF: DataFrame = spark.read.text("D:\\study\\workspace\\spark-sql-train\\data\\people.txt")
  // textDF.show()
  val result: Dataset[String] = textDF.map(x => {
    val splits: Array[String] = x.getString(0).split(",")
    (splits(0).trim) // , splits(1).trim
  })
  result.write.mode("overwrite").text("out")
}

Notes:

  • (splits(1).trim.toInt) fails at runtime with: Text data source does not support int data type.
  • (splits(0).trim, splits(1).trim) also fails: Text data source supports only a single column, and you have 2 columns. (A workaround is sketched below.)
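
Because the text sink only accepts a single string column, one workaround is to concatenate the fields back into one string before writing. A minimal sketch (it assumes spark.implicits._ is in scope as in the text method above; the comma separator is an arbitrary choice):

// Keep both fields by merging them into a single string column
val both: Dataset[String] = textDF.map(x => {
  val splits: Array[String] = x.getString(0).split(",")
  s"${splits(0).trim},${splits(1).trim}"
})
both.write.mode("overwrite").text("out")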

Parquet Files

def parquet(spark: SparkSession): Unit = {
  val parquetDF: DataFrame = spark.read.parquet("D:\\study\\workspace\\spark-sql-train\\data\\users.parquet")
  parquetDF.printSchema()
  parquetDF.show()
  parquetDF.select("name", "favorite_numbers")
    .write.mode("overwrite")
    .option("compression", "none") // do not use the default snappy compression
    .parquet("out")

  spark.read.parquet("out").show()
}

For an introduction to Parquet and how it works internally, see the reference articles below.

Converting JSON to Parquet

def convert(spark: SparkSession): Unit = {
  val jsonDF: DataFrame = spark.read.format("json").load("D:\\study\\workspace\\spark-sql-train\\data\\people.json")
  // jsonDF.show()
  jsonDF.filter("age > 20").write.format("parquet").mode(SaveMode.Overwrite).save("out")
  spark.read.parquet("out").show()
}

MySQL

def jdbc2(spark: SparkSession): Unit = {
  import spark.implicits._
  // Load the connection settings from application.conf (Typesafe Config)
  val config = ConfigFactory.load()
  val url = config.getString("db.default.url")
  val user = config.getString("db.default.user")
  val password = config.getString("db.default.password")
  val driver = config.getString("db.default.driver")
  val database = config.getString("db.default.database")
  val table = config.getString("db.default.table")
  val sinkTable = config.getString("db.default.sink.table")

  val connectionProperties = new Properties()
  connectionProperties.put("user", user)
  connectionProperties.put("password", password)
  connectionProperties.put("driver", driver) // pass the JDBC driver class to Spark

  // Read the source table over JDBC, then write the filtered rows to the sink table
  val jdbcDF: DataFrame = spark.read.jdbc(url, s"$database.$table", connectionProperties)
  jdbcDF.filter($"id" > 1).write.jdbc(url, s"$database.$sinkTable", connectionProperties)
}
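
The db.* keys are read from a Typesafe Config file via ConfigFactory.load(). A sketch of the corresponding application.conf (the key names come from the code above; every value is a placeholder):

db.default.url = "jdbc:mysql://localhost:3306"
db.default.user = "root"
db.default.password = "******"
db.default.driver = "com.mysql.jdbc.Driver"
db.default.database = "spark_demo"
db.default.table = "people"
db.default.sink.table = "people_out"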

Hive

The hive-site.xml file needs to be added to the project (it tells Spark where the Hive metastore is).

val spark: SparkSession = SparkSession.builder().master("local").appName("HiveSourceApp")
  .enableHiveSupport() // important: Hive support must be enabled
  .getOrCreate()
spark.table("test_db.helloworld").show() // read a table from Hive


val jdbcDF: DataFrame = spark.read
  .jdbc(url, s"$database.$table", connectionProperties)
  .filter($"id" > 2)                              // read data from MySQL
jdbcDF.write.saveAsTable(s"$database.$sinkTable") // write to Hive; the table is created automatically if missing
jdbcDF.write.insertInto(s"$database.$sinkTable2") // write to Hive; the table must be created manually first
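
insertInto writes into an existing table and matches columns by position, so the sink table has to be created beforehand. A minimal sketch of creating it with Spark SQL (the column names and types are assumptions for illustration):

// insertInto requires the target table to already exist
spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$sinkTable2 (id INT, name STRING)")
jdbcDF.write.insertInto(s"$database.$sinkTable2")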

References

Parquet:
https://blog.csdn.net/wulantian/article/details/82869741
https://zhuanlan.zhihu.com/p/141908285
