DataFrame: a distributed dataset built on top of RDDs. A DataFrame carries schema information, exposing named columns much like the fields of a table.
Dataset: an extension of DataFrame that adds the typing advantages of RDDs; each row of a DataFrame can be mapped to a typed object.
Older versions provided SQLContext for plain SQL queries and HiveContext for connecting to Hive. SparkSession covers both; in spark-shell a ready-made spark object is available for running SQL.
Common traits of RDD, DataFrame, and Dataset: all are distributed datasets, all are lazily evaluated, they share many common methods, all are partitioned, and all cache data as their situation requires.
Differences: an RDD carries no schema; a DataFrame has a schema but each row is an untyped Row; a Dataset is strongly typed, mapping each row to an object. The code below demonstrates the conversions among the three:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// case class matching the rows of 1.json (fields assumed from the examples below)
case class User(name: String, age: Long)

val conf = new SparkConf().setAppName("sparkSQL").setMaster("local[*]")
val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._ // enables toDF/toDS and as[...] below

val df: DataFrame = spark.read.json("D:\\ideaWorkspace\\scala0105\\spark-sql\\input\\1.json")
val rdd: RDD[(String, Int)] = spark.sparkContext.makeRDD(List(("zhaoliu", 29), ("liuqi", 30)))

// DataFrame -> SQL: register a temporary view and query it
df.createOrReplaceTempView("user")
val sql: DataFrame = spark.sql("select * from user")
sql.show()
df.select("*").show()

// DataFrame -> RDD: rows come back as untyped Row objects
val df2rdd: RDD[Row] = df.rdd
df2rdd.foreach(println)

// RDD -> DataFrame
val r2f: DataFrame = rdd.toDF()
r2f.select("*").show()

// RDD -> Dataset
val rdd2ds: Dataset[(String, Int)] = rdd.toDS()
rdd2ds.show()

// DataFrame -> Dataset: bind the schema to the User case class
val df2ds: Dataset[User] = df.as[User]
df2ds.show()
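// Two more conversions to round out the picture (my own additions, not in the original notes):
// Dataset -> RDD
val ds2rdd: RDD[(String, Int)] = rdd2ds.rdd
ds2rdd.foreach(println)
// RDD -> DataFrame with explicit column names
val named: DataFrame = rdd.toDF("name", "age")
named.show()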
spark.stop()
UDF: one row in, one row out (a one-to-one mapping).
// register the UDF, then call it from SQL
df.createOrReplaceTempView("user_temp")
spark.udf.register("addPre", (name: String) => "hello " + name)
val addPreDF: DataFrame = spark.sql("select addPre(name) from user_temp")
addPreDF.show()
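The registered function covers the SQL path; the same logic can also be used directly through the DataFrame API with org.apache.spark.sql.functions.udf. A minimal sketch (the addPreUdf name is my own):

import org.apache.spark.sql.functions.udf

val addPreUdf = udf((name: String) => "hello " + name)
df.select(addPreUdf($"name")).show()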
UDAF: multiple rows in, one row out (a many-to-one mapping). The weakly typed variant is implemented by extending UserDefinedAggregateFunction; a strongly typed variant based on Aggregator follows further below.
// Weakly typed
// instantiate and register the UDAF, then call it from SQL
val myAvg = new MyAvg
spark.udf.register("myavg", myAvg)
val ageAvg: DataFrame = spark.sql("select myavg(age) from user_temp")
ageAvg.show()
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyAvg extends UserDefinedAggregateFunction {
  // type of the input column
  override def inputSchema: StructType = StructType(Array(StructField("age", LongType)))
  // intermediate buffer: running sum and row count
  override def bufferSchema: StructType = StructType(Array(StructField("sum", LongType), StructField("count", LongType)))
  // type of the final result
  override def dataType: DataType = DoubleType
  // the same input always produces the same output
  override def deterministic: Boolean = true
  // initialize the buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // fold one input row into the buffer
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + input.getLong(0)
    buffer(1) = buffer.getLong(1) + 1L
  }
  // combine two partial buffers (across partitions)
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // compute the final average
  override def evaluate(buffer: Row): Any = buffer.getLong(0).toDouble / buffer.getLong(1)
}
// Strongly typed
// strongly typed usage: bind rows to Student, turn the Aggregator into a typed column
import org.apache.spark.sql.TypedColumn

val ds: Dataset[Student] = df.as[Student]
val myAvgNew = new MyAvgNew
val col: TypedColumn[Student, Double] = myAvgNew.toColumn
ds.select(col).show()
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Student(name: String, age: Long)
case class StudentBuffer(var allAge: Long, var allCount: Long)

class MyAvgNew extends Aggregator[Student, StudentBuffer, Double] {
  // initial (zero) buffer
  override def zero: StudentBuffer = StudentBuffer(0L, 0L)
  // fold one input record into the buffer
  override def reduce(b: StudentBuffer, a: Student): StudentBuffer = {
    b.allAge = b.allAge + a.age
    b.allCount = b.allCount + 1L
    b
  }
  // combine two partial buffers (across partitions)
  override def merge(b1: StudentBuffer, b2: StudentBuffer): StudentBuffer = {
    b1.allAge = b1.allAge + b2.allAge
    b1.allCount = b1.allCount + b2.allCount
    b1
  }
  // compute the final average
  override def finish(reduction: StudentBuffer): Double = reduction.allAge.toDouble / reduction.allCount
  // encoders for the buffer and output types
  override def bufferEncoder: Encoder[StudentBuffer] = Encoders.product
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
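A side note beyond these notes' Spark 2.x scope: since Spark 3.0, UserDefinedAggregateFunction is deprecated, and an Aggregator written over a single column type can be registered for SQL use through org.apache.spark.sql.functions.udaf. A minimal sketch under that assumption (MyAvgCol is my own name, reusing StudentBuffer):

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf
import org.apache.spark.sql.{Encoder, Encoders}

// same average, but aggregating a single Long column so SQL can call it
class MyAvgCol extends Aggregator[Long, StudentBuffer, Double] {
  override def zero: StudentBuffer = StudentBuffer(0L, 0L)
  override def reduce(b: StudentBuffer, age: Long): StudentBuffer = { b.allAge += age; b.allCount += 1L; b }
  override def merge(b1: StudentBuffer, b2: StudentBuffer): StudentBuffer = { b1.allAge += b2.allAge; b1.allCount += b2.allCount; b1 }
  override def finish(r: StudentBuffer): Double = r.allAge.toDouble / r.allCount
  override def bufferEncoder: Encoder[StudentBuffer] = Encoders.product
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

spark.udf.register("myavgcol", udaf(new MyAvgCol))
spark.sql("select myavgcol(age) from user_temp").show()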
spark.read.load and spark.write.save default to the Parquet format; the default data source can be changed through the spark.sql.sources.default configuration option.
Loading: spark.read.format(...).load(path), with shorthands such as spark.read.json(path)
Saving: df.write.format(...).mode(...).save(path)
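A minimal generic load/save sketch (the paths are placeholders of my own):

import org.apache.spark.sql.SaveMode

val parquetDF: DataFrame = spark.read.load("input/users.parquet") // Parquet by default
val jsonDF: DataFrame = spark.read.format("json").load("input/1.json") // explicit format
jsonDF.write.format("parquet").mode(SaveMode.Overwrite).save("output/users")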
import org.apache.spark.sql.SaveMode

case class people(name: String, age: Long)

// write a Dataset to MySQL over JDBC
val rdd: RDD[people] = spark.sparkContext.makeRDD(List(people("wusong", 29), people("dawu", 29)))
rdd.toDS().write.format("jdbc")
  .option("url", "jdbc:mysql://hadoop101:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "000000")
  .option("dbtable", "people")
  .mode(SaveMode.Append)
  .save()
// read the table back from MySQL over JDBC
spark.read.format("jdbc")
  .option("url", "jdbc:mysql://hadoop101:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "000000")
  .option("dbtable", "people")
  .load()
  .show()
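By default the JDBC read above lands in a single partition; for larger tables the standard JDBC partitioning options split it across executors (the column and bounds below are my own assumptions):

spark.read.format("jdbc")
  .option("url", "jdbc:mysql://hadoop101:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "000000")
  .option("dbtable", "people")
  .option("partitionColumn", "age") // must be a numeric column
  .option("lowerBound", "0")
  .option("upperBound", "100")
  .option("numPartitions", "4")
  .load()
  .show()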
1. Embedded Hive: uses an embedded Derby metastore stored locally; the warehouse directory is local as well.
2. External Hive: copy hive-site.xml into Spark's conf directory and make sure Hive itself works first.
3. Connecting from code: add hive-site.xml to the classpath and include the dependencies below.
Enable Hive support with .enableHiveSupport() on the SparkSession builder (a minimal sketch follows the dependency list).
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
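With hive-site.xml on the classpath and the dependencies above, a minimal sketch of a Hive-enabled session:

val spark: SparkSession = SparkSession.builder()
  .appName("sparkSQL-hive")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("show tables").show()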