Table of Contents
4. Data read, write, and SaveMode
4.1 Reading data
4.2 Writing data
4.3 Save modes
5. Spark SQL data sources
5.1 JSON data source
5.2 Parquet data source
5.3 CSV data source
5.4 JDBC data source
5.5 Hive data source
Some common data sources are shown below. For parquet, the path is the directory where Parquet files were written earlier; all files under that directory are read.
student.json
{"name":"jack", "age":"22"}
{"name":"rose", "age":"21"}
{"name":"mike", "age":"19"}
product.csv
phone,5000,100
xiaomi,3000,300
val spark = SparkSession.builder()
  .master("local[*]")
  .appName(this.getClass.getSimpleName)
  .getOrCreate()

// Approach 1:
val jsonSource: DataFrame = spark.read.json("E:\\student.json")
val csvSource: DataFrame = spark.read.csv("e://product.csv")
val parquetSource: DataFrame = spark.read.parquet("E:/parquetOutput/*")

// Approach 2:
val jsonSource1: DataFrame = spark.read.format("json").load("E:\\student.json")
val csvSource1: DataFrame = spark.read.format("csv").load("e://product.csv")
val parquetSource1: DataFrame = spark.read.format("parquet").load("E:/parquetOutput/*")

// Approach 3: the default format is parquet
val df: DataFrame = spark.sqlContext.load("E:/parquetOutput/*")
// Approach 1:
jsonSource.write.json("./jsonOutput")
jsonSource.write.parquet("./parquetOutput")
jsonSource.write.csv("./csvOut")

// Approach 2:
jsonSource.write.format("json").save("./jsonOutput")
jsonSource.write.format("parquet").save("./parquetOutput")
jsonSource.write.format("csv").save("./csvOut")

// Approach 3: the default format is parquet
jsonSource.write.save("./parquetOutput")
result1.write.mode(SaveMode.Append).json("spark_day01/jsonOutput1")
| Scala/Java | Any Language | Meaning |
| --- | --- | --- |
| SaveMode.ErrorIfExists (default) | "error" (default) | Throws an error if the output already exists |
| SaveMode.Append | "append" | Append to existing data |
| SaveMode.Overwrite | "overwrite" | Overwrite existing data |
| SaveMode.Ignore | "ignore" | Ignore the write if data already exists |
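For completeness, the string values from the table can be passed straight to mode(); a minimal sketch reusing the jsonSource DataFrame from the read examples above (the output path is just an illustration):

// String form of the save mode, equivalent to SaveMode.Overwrite
jsonSource.write.mode("overwrite").format("json").save("./jsonOutput")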
As shown above, the data source used in the earlier example programs was JSON.
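As a small illustration of using the JSON source with SQL, the DataFrame can be registered as a temporary view and queried; a minimal sketch reusing jsonSource from above (the view name is made up here):

// Register the JSON-backed DataFrame and query it with Spark SQL
jsonSource.createTempView("v_student_json")
spark.sql("select name, age from v_student_json").show()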
As shown above, Parquet is the default format for Spark data sources; it is a compressed, columnar format.
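Because Parquet is compressed, the codec can also be set explicitly at write time; a minimal sketch (snappy is the usual default codec, and the output path is made up):

// Write Parquet with an explicit compression codec
jsonSource.write.option("compression", "snappy").parquet("./parquetSnappyOutput")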
As shown above, for CSV the default delimiter is a comma, so the file can be opened directly in Excel. The schema generated by default names the columns _c0, _c1, ..., and every column's type is String.
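To get meaningful column names and types instead of _c0, _c1, ... of type String, a schema can be supplied when reading the CSV; a minimal sketch, assuming the three columns of product.csv are a name, a price, and a quantity (these column names are assumptions, not taken from the file):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema instead of the default _c0/_c1/_c2 String columns
val productSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("price", IntegerType, nullable = true),
  StructField("amount", IntegerType, nullable = true)
))
val typedCsv: DataFrame = spark.read.schema(productSchema).csv("e://product.csv")
typedCsv.printSchema()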
To work with a database (MySQL), first import the MySQL driver dependency:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.35</version>
</dependency>
Add an application.conf configuration file with the connection settings:
db.driver="com.mysql.jdbc.Driver"
db.url="jdbc:mysql://localhost:3306/test?characterEncoding=utf-8"
db.user="root"
db.password="1234"
import java.util.Properties

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql._

object JDBCSource {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .getOrCreate()
    import spark.implicits._

    // Default load order: application.conf, application.json, application.properties
    val config = ConfigFactory.load()

    // Read from MySQL ---> transform ---> write back to MySQL
    // Set up the database connection information
    val url = config.getString("db.url")
    val conn = new Properties()
    conn.setProperty("user", config.getString("db.user"))
    conn.setProperty("password", config.getString("db.password"))

    // Read data
    val jdbc: DataFrame = spark.read.jdbc(url, "emp", conn)
    jdbc.printSchema()
    val result1: Dataset[Row] = jdbc.where("sal > 2500").select("empno")

    // Write data
    result1.write.mode(SaveMode.Append).jdbc(url, "emp10", conn)

    spark.close()
  }
}
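The same JDBC read can also be expressed with the generic source API; a minimal sketch under the same connection settings (the options used are the standard Spark JDBC options):

// Equivalent read through format("jdbc") and options
val jdbcAlt: DataFrame = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "emp")
  .option("user", config.getString("db.user"))
  .option("password", config.getString("db.password"))
  .load()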
As preparation, import the dependency below:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
When developing in IDEA, add a hive-site.xml file under the resources directory; in a cluster environment, Hive's configuration file must be copied to the $SPARK_HOME/conf directory.
Contents of hive-site.xml; adjust the connection URL, database name, table name, username, and password to match your environment:
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop101:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>
Test code:
import org.apache.spark.sql.{DataFrame, Dataset, SaveMode, SparkSession}

object HiveDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport() // enable Spark's Hive support
      .getOrCreate()
    import spark.implicits._

    // To connect to the cluster's databases, impersonate the proper user first
    System.setProperty("HADOOP_USER_NAME", "root")

    /*val result1 = spark.sql("select * from db_hive.student")
    result1.show()*/

    // Create a table
    //spark.sql("create table student(name string, age string, sex string) row format delimited fields terminated by ','")
    // Drop a table
    //spark.sql("drop table student")
    // Insert data
    //val result = spark.sql("insert into student select * from db_hive.student")
    // Overwrite data
    //spark.sql("insert overwrite table student select * from db_hive.student")
    // Overwrite by loading new data
    //spark.sql("load data local inpath 'spark_day01/student.txt' overwrite into table default.student")
    // Truncate the table
    //spark.sql("truncate table student")

    // Write custom data
    val students: Dataset[String] = spark.createDataset(List("jack,18,male", "rose,19,female", "mike,20,male"))
    val result: DataFrame = students.map(student => {
      val fields = student.split(",")
      (fields(0), fields(1), fields(2))
    }).toDF("name", "age", "sex")
    result.show()
    result.createTempView("v_student")

    // Insert the custom data into the table via SQL
    //spark.sql("insert into student select * from v_student")
    // Write the custom data into the Hive table
    result.write.mode(SaveMode.Append).insertInto("student")

    // Query
    spark.sql("select * from default.student").show()

    spark.close()
  }
}
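As an alternative to insertInto, the DataFrame could also be written as a managed Hive table in one step; a minimal sketch (the table name student_copy is made up for illustration):

// Create or append to a managed Hive table directly from the DataFrame
result.write.mode(SaveMode.Append).saveAsTable("student_copy")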