Spark Learning (6): Spark SQL, Part 2

Contents

4. Data read, write, and SaveMode
4.1 Reading data
4.2 Writing data
4.3 Save modes
5. Spark SQL data sources
5.1 Data source: JSON
5.2 Data source: Parquet
5.3 Data source: CSV
5.4 Data source: JDBC
5.5 Data source: Hive


4. Data read, write, and SaveMode

4.1 Reading data

Below are a few common data sources. For parquet, the path is the directory written by an earlier parquet job; Spark reads all the files under that directory.

student.json

{"name":"jack", "age":"22"}
{"name":"rose", "age":"21"}
{"name":"mike", "age":"19"}

product.csv

phone,5000,100
xiaomi,3000,300

val spark = SparkSession.builder()
  .master("local[*]")
  .appName(this.getClass.getSimpleName)
  .getOrCreate()

// Method 1: format-specific readers
val jsonSource: DataFrame = spark.read.json("E:\\student.json")
val csvSource: DataFrame = spark.read.csv("e://product.csv")
val parquetSource: DataFrame = spark.read.parquet("E:/parquetOutput/*")

// Method 2: generic read with an explicit format
val jsonSource1: DataFrame = spark.read.format("json").load("E:\\student.json")
val csvSource1: DataFrame = spark.read.format("csv").load("e://product.csv")
val parquetSource1: DataFrame = spark.read.format("parquet").load("E:/parquetOutput/*")

// Method 3: no format specified, defaults to parquet
val df: DataFrame = spark.sqlContext.load("E:/parquetOutput/*")
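
Note that sqlContext.load comes from the older SQLContext API; with a SparkSession, the usual equivalent is spark.read.load, which likewise defaults to the parquet format. A minimal sketch:

// No format given, so this also reads parquet by default
val dfDefault: DataFrame = spark.read.load("E:/parquetOutput/*")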

4.2 Writing data

// Method 1: format-specific writers
jsonSource.write.json("./jsonOutput")
jsonSource.write.parquet("./parquetOutput")
jsonSource.write.csv("./csvOutput")
// Method 2: generic save with an explicit format
jsonSource.write.format("json").save("./jsonOutput")
jsonSource.write.format("parquet").save("./parquetOutput")
jsonSource.write.format("csv").save("./csvOutput")
// Method 3: no format specified, defaults to parquet
jsonSource.write.save("./parquetOutput")

4.3 Save modes

result1.write.mode(SaveMode.Append).json("spark_day01/jsonOutput1")

Scala/Java                         Any Language        Meaning
SaveMode.ErrorIfExists (default)   "error" (default)   Throw an error if the output already exists
SaveMode.Append                    "append"            Append to the existing data
SaveMode.Overwrite                 "overwrite"         Overwrite the existing data
SaveMode.Ignore                    "ignore"            Skip the write if data already exists
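
The string forms in the "Any Language" column can be passed straight to mode(); for example, an overwriting variant of the write above could be expressed as:

// Equivalent to SaveMode.Overwrite, using the string form of the mode
result1.write.mode("overwrite").json("spark_day01/jsonOutput1")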

5. Spark SQL data sources

5.1 Data source: JSON

As shown above; the examples so far have all used JSON as the data source.

5.2 Data source: Parquet

As shown above, parquet is Spark's default data source format; it is a compressed, columnar format.
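
The compression codec can also be set explicitly on the writer; a minimal sketch, assuming the jsonSource DataFrame from above (the gzip codec and output path are illustrative):

// Write parquet with an explicit compression codec instead of the default (snappy)
jsonSource.write.option("compression", "gzip").parquet("./parquetGzipOutput")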

5.3 Data source: CSV

As shown above.

The default delimiter is a comma, so these files can be opened directly in Excel.

By default the generated schema names the columns _c0, _c1, ...; every column is typed as String.
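
To get meaningful column names and types instead of _c0/String, pass options to the reader; a minimal sketch for the product.csv above, which has no header row (the column names name/price/amount are illustrative):

// Let Spark infer column types, then name the columns explicitly
val productDF: DataFrame = spark.read
  .option("inferSchema", "true")
  .csv("e://product.csv")
  .toDF("name", "price", "amount")
productDF.printSchema() // price and amount should now be numeric rather than string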

5.4 Data source: JDBC

To work with a MySQL database, first add the MySQL connector dependency:



<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.35</version>
</dependency>

Then add an application.conf file under the resources directory:

db.driver="com.mysql.jdbc.Driver"
db.url="jdbc:mysql://localhost:3306/test?characterEncoding=utf-8"
db.user="root"
db.password="1234"

Test code:

import java.util.Properties
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql._

object JDBCSource {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .getOrCreate()

    import spark.implicits._
    // Default load order: application.conf, then application.json, then application.properties
    val config = ConfigFactory.load()

    // Read from MySQL ---> transform ---> write the result back to MySQL
    // Connection settings for the database
    val url = config.getString("db.url")
    val conn = new Properties()
    conn.setProperty("user", config.getString("db.user"))
    conn.setProperty("password", config.getString("db.password"))

    // Read the data
    val jdbc: DataFrame = spark.read.jdbc(url, "emp", conn)
    jdbc.printSchema()

    val result1: Dataset[Row] = jdbc.where("sal > 2500").select("empno")
    // Write the data
    result1.write.mode(SaveMode.Append).jdbc(url, "emp10", conn)
    spark.close()
  }
}
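
The reader also has an options-based form, and large tables can be read in parallel by giving Spark a numeric partition column. A sketch reusing url and config from the example above, assuming empno is numeric (the bounds and partition count are illustrative):

val jdbcPartitioned: DataFrame = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "emp")
  .option("user", config.getString("db.user"))
  .option("password", config.getString("db.password"))
  .option("partitionColumn", "empno") // assumed numeric column
  .option("lowerBound", "1")          // illustrative bounds
  .option("upperBound", "10000")
  .option("numPartitions", "4")
  .load()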

5.5 Data source: Hive

As preparation, add the dependency below:



<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>

When developing in IDEA, add a hive-site.xml file under the resources directory; on a cluster, Hive's configuration files must be copied to the $SPARK_HOME/conf directory.

Contents of hive-site.xml (adjust the connection URL, database name, username, and password for your environment):

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop101:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>

Test code:

import org.apache.spark.sql.{DataFrame, Dataset, SaveMode, SparkSession}

object HiveDemo {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport() // enable Spark's Hive support
      .getOrCreate()

    import spark.implicits._

    // Impersonate a user before connecting to databases on the cluster
    System.setProperty("HADOOP_USER_NAME","root")

    /*val result1 = spark.sql("select * from db_hive.student")
    result1.show()*/

    // Create a table
    //spark.sql("create table student(name string, age string, sex string) row format delimited fields terminated by ','")

    // Drop a table
    //spark.sql("drop table student")

    // Insert data
    //val result = spark.sql("insert into student select * from db_hive.student")

    // Overwrite existing data
    //spark.sql("insert overwrite table student select * from db_hive.student")

    // Overwrite with newly loaded data
    //spark.sql("load data local inpath 'spark_day01/student.txt' overwrite into table default.student")

    // Truncate the table
    //spark.sql("truncate table student")

    // Build a custom dataset
    val students: Dataset[String] = spark.createDataset(List("jack,18,male","rose,19,female","mike,20,male"))
    val result: DataFrame = students.map(student => {
      val fields = student.split(",")
      (fields(0), fields(1), fields(2))
    }).toDF("name", "age", "sex")
    result.show()

    result.createTempView("v_student")

    // Insert the custom data into the table (SQL form)
    //spark.sql("insert into student select * from v_student")

    // Write the custom data into the table via the DataFrame API
    result.write.mode(SaveMode.Append).insertInto("student")

    // Query
    spark.sql("select * from default.student").show()

    spark.close()
  }
}
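
Note that insertInto writes into an existing table by column position, so the target table must already exist with a matching layout. If the table should instead be created from the DataFrame's own schema, saveAsTable is the alternative; a minimal sketch (the table name student_copy is illustrative):

// Create (or overwrite) a managed table from the DataFrame's schema
result.write.mode(SaveMode.Overwrite).saveAsTable("default.student_copy")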

 
