SparkSQL can read data from JSON strings, JDBC, Parquet, Hive, HDFS, and other sources.
SparkSQL internal architecture
A SQL statement is first parsed into an unresolved logical plan. The analyzer resolves it into an analyzed logical plan, a set of optimization rules rewrites that into an optimized logical plan, the SparkPlanner strategies turn the optimized plan into physical plans, and the selected physical plan is finally converted into Spark tasks that are executed on the cluster.
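These stages can be inspected for a concrete query. A minimal sketch, assuming a SparkSession named session and a registered temporary view person (both hypothetical here): calling explain with the extended flag prints the parsed, analyzed, and optimized logical plans as well as the physical plan.
// Assumes `session` is a SparkSession and a temp view "person" has been registered
val planDemo = session.sql("select name, age from person where age > 18")
planDemo.explain(true) // prints Parsed / Analyzed / Optimized Logical Plan and Physical Plan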
1. Creating a DataFrame by reading a JSON file
package scala

import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameCreate {
  def main(args: Array[String]): Unit = {
    val session = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()

    // Read a JSON file (one JSON object per line); the schema is inferred automatically
    val df: DataFrame = session.read.json("path")
    // Register the DataFrame as a temporary view so it can be queried with SQL
    df.createOrReplaceTempView("tmp_data")
  }
}
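Once the temporary view is registered, it can be queried through session.sql, for example:
session.sql("select * from tmp_data").show()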
2. Creating a DataFrame from an RDD of JSON strings
package scala

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameCreate {
  def main(args: Array[String]): Unit = {
    val session = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()

    // Each RDD element is a JSON string describing one record
    val nameRDD: RDD[String] = session.sparkContext.makeRDD(Array(
      """
        |{"name":"zhangsan","age":18}
      """.stripMargin,
      """
        |{"name":"lisi","age":11}
      """.stripMargin
    ))
    val scoreRDD: RDD[String] = session.sparkContext.makeRDD(Array(
      """
        |{"name":"zhangsan","score":100}
      """.stripMargin,
      """
        |{"name":"lisi","score":200}
      """.stripMargin
    ))

    // Parse the JSON RDDs into DataFrames
    // (read.json(RDD[String]) is deprecated in newer Spark versions in favor of read.json(Dataset[String]))
    val nameDF: DataFrame = session.read.json(nameRDD)
    val scoreDF: DataFrame = session.read.json(scoreRDD)

    // Register both as temporary views and join them with SQL
    nameDF.createOrReplaceTempView("name")
    scoreDF.createOrReplaceTempView("score")
    val result = session.sql("select name.name, name.age, score.score from name, score where name.name = score.name")
    result.show()
  }
}
3. Creating a DataFrame from a non-JSON RDD (important)
- Converting a non-JSON RDD to a DataFrame via reflection (not recommended)
The custom class must be serializable and must be declared public.
After the RDD is converted to a DataFrame, the mapped fields are ordered by ASCII code.
When converting the DataFrame back to an RDD, a field can be read from a Row in two ways:
by index, e.g. row.getInt(0) (not recommended);
or by column name, e.g. row.getAs("columnName") (recommended).
Notes on serialization (a minimal sketch follows this list):
1. Deserialization fails when the serialVersionUID recorded in the stream does not match the class's current one.
2. If a subclass implements Serializable but its parent class does not, the parent's fields are not serialized; after deserialization they are not restored from the stream (a reference field typically comes back as null).
Note: if the parent implements Serializable and the subclass does not, the subclass can still be serialized normally.
3. Fields marked with the transient keyword are not serialized.
4. Static fields are not serialized: they belong to the class rather than to any object instance.
Also: when the same object is written to a file several times with writeObject, only the first copy is actually stored; later writes only record a reference to it, so every readObject returns the object as it was first written.
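A minimal sketch of rules 2 and 3 above, using plain Java serialization independent of Spark (the class and field names are made up for illustration):
import java.io._

class Parent { var parentField: String = _ }        // parent does NOT implement Serializable
class Child extends Parent with Serializable {
  var name: String = _
  @transient var password: String = _               // transient: skipped during serialization
}

object SerializationSketch {
  def main(args: Array[String]): Unit = {
    val child = new Child()
    child.parentField = "fromParent"; child.name = "spark"; child.password = "secret"

    // Serialize to an in-memory buffer, then read it back
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(child); out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    val restored = in.readObject().asInstanceOf[Child]

    println(restored.name)        // spark -> serialized and restored normally
    println(restored.password)    // null  -> transient field was never written to the stream
    println(restored.parentField) // null  -> the non-serializable parent's state was not saved
  }
}
The reflection-based conversion itself looks like this: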
package scala

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// The case class drives schema inference: its field names become the column names
case class Person(
  id: String,
  name: String,
  age: Int
)

object DataFrameCreate {
  def main(args: Array[String]): Unit = {
    val session = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()
    // needed for the rdd.toDF() conversion
    import session.implicits._

    val lineRDD: RDD[String] = session.sparkContext.textFile("path")
    // Map each comma-separated line to a Person
    val personRDD: RDD[Person] = lineRDD.map { line =>
      val fields = line.split(",")
      Person(fields(0), fields(1), fields(2).toInt)
    }
    val df: DataFrame = personRDD.toDF()

    // Convert back to an RDD and read each field from the Row by column name
    val rdd = df.rdd
    val result: RDD[Person] = rdd.map { row =>
      Person(row.getAs("id"), row.getAs("name"), row.getAs("age"))
    }
    result.foreach(println)
  }
}
- Converting a non-JSON RDD to a DataFrame by specifying the schema dynamically (recommended)
package scala

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object DataFrameCreate {
  def main(args: Array[String]): Unit = {
    val session = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()

    val lineRDD: RDD[String] = session.sparkContext.textFile("path")
    // Turn each comma-separated line into a generic Row
    val rowRDD: RDD[Row] = lineRDD.map { line =>
      val split: Array[String] = line.split(",")
      Row(split(0), split(1), split(2).toInt)
    }
    // Build the schema explicitly; it must match the Row structure field by field
    val schema: StructType = StructType(List(
      StructField("id", StringType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))
    val df: DataFrame = session.createDataFrame(rowRDD, schema)
    df.show()
    df.printSchema()
  }
}
4. Creating a DataFrame by reading a Parquet file
A DataFrame can also be saved as a Parquet file, in either of two ways:
df.write.mode(SaveMode.Overwrite).format("parquet").save("./sparksql/parquet")
df.write.mode(SaveMode.Overwrite).parquet("./sparksql/parquet")
SaveMode controls what happens when the target location already exists:
Overwrite: replace the existing data
Append: append to the existing data
ErrorIfExists: fail with an error if the target exists
Ignore: skip the write if the target exists
package scala

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object DataFrameCreate {
  def main(args: Array[String]): Unit = {
    val session = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()

    // Build a DataFrame from an RDD of JSON strings, then write it out as Parquet
    val RDDStr: RDD[String] = session.sparkContext.textFile("path/jsondata")
    val df: DataFrame = session.read.json(RDDStr)
    df.write.mode(SaveMode.Overwrite).format("parquet").save("path")

    // Read the Parquet files back into a DataFrame
    val result: DataFrame = session.read.format("parquet").load("path")
    result.show()
  }
}
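Since Parquet is a self-describing format, the column names and types that were written are recovered when the files are read back, which result.printSchema() would confirm.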
5. Creating a DataFrame from a JDBC source (MySQL as the example)
package scala

import java.util.Properties

import org.apache.spark.sql.{DataFrame, DataFrameReader, SaveMode, SparkSession}

import scala.collection.mutable.HashMap

object DataFrameCreate {
  def main(args: Array[String]): Unit = {
    val session = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()

    // Way 1: pass all JDBC options at once as a map
    val options = new HashMap[String, String]()
    options.put("url", "jdbc:mysql://IP:3306/spark")
    options.put("driver", "com.mysql.jdbc.Driver")
    options.put("user", "root")
    options.put("password", "123456")
    options.put("dbtable", "person")
    val person: DataFrame = session.read.format("jdbc").options(options).load()
    person.show()

    // Way 2: set the options one by one on a DataFrameReader
    val reader: DataFrameReader = session.read.format("jdbc")
      .option("url", "jdbc:mysql://IP:3306/spark")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "root")
      .option("password", "123456")
      .option("dbtable", "score")
    val score: DataFrame = reader.load()

    // Write a DataFrame back to MySQL over JDBC
    val properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "123456")
    score.write.mode(SaveMode.Append).jdbc("jdbc:mysql://IP:3306/spark", "tableName", properties)
  }
}
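Running this example requires the MySQL JDBC driver on the classpath. As an assumption about the build setup, with sbt that would look roughly like:
libraryDependencies += "mysql" % "mysql-connector-java" % "<version>"
Note that with Connector/J 8.x the driver class is com.mysql.cj.jdbc.Driver rather than the legacy com.mysql.jdbc.Driver used above.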
6. Creating a DataFrame from a Seq
val df = session.createDataFrame(Seq(
  ("ming", 20, 15552211521L),
  ("hong", 19, 13287994007L),
  ("zhi", 21, 15552211523L)
)).toDF("name", "age", "phone")
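The column types are inferred from the tuple element types: name becomes a StringType column, age an IntegerType, and phone a LongType (hence the L suffix on the literals), which df.printSchema() would show.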
7. Creating a DataFrame by reading a CSV file
// header = true treats the first line of the file as column names
// (adding .option("inferSchema", true) would also infer column types instead of reading everything as strings)
val dfCsv = session.read.format("csv")
  .option("header", true)
  .load("/Users/data.csv")
dfCsv.show()
8. Creating a DataFrame from a Dataset of JSON strings
import session.implicits._
// Build a Dataset[String] whose elements are JSON strings and let read.json parse it
val jsonDataSet = session.createDataset(Seq(
  """{"name":"ming","age":20,"phone":15552211521}""",
  """{"name":"hong","age":19,"phone":13287994007}""",
  """{"name":"zhi","age":21,"phone":15552211523}"""
))
val jsonDataSetDf = session.read.json(jsonDataSet)
jsonDataSetDf.show()