Spark SQL is an ETL (Extract, Transform, Load) tool built on top of RDDs. On top of RDDs, Spark SQL abstracts two classes, Dataset and DataFrame. Both provide RDD-like functionality, which means users can still use higher-order operators such as map, flatMap and filter, while also getting column-based, named queries. In other words, Dataset/DataFrame offers two sets of APIs for manipulating data; these APIs give the Spark engine more information about the computation, and the system can use that information to optimize execution. Spark SQL currently offers two ways of interacting with data:
- SQL scripts (sketched below)
- Dataset API (strongly typed and untyped operations)
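The SQL-script style is not demonstrated elsewhere in these notes; the following is a minimal sketch, assuming a SparkSession named spark and a DataFrame df already exist (the view name t_user is illustrative):
// register the DataFrame as a temporary view so it is visible to the SQL engine
df.createOrReplaceTempView("t_user")
// query it with plain SQL text; the result is again a DataFrame
spark.sql("select name, salary from t_user where salary > 15000").show()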
A Dataset is a distributed collection of data. It is a new API introduced in Spark 1.6 that builds on RDDs (strongly typed, usable with lambda expressions) while also taking advantage of Spark SQL's optimized execution engine, so data transformations written against a Dataset improve on raw RDD operators in both functionality and performance. A Dataset can therefore be regarded as an enhanced RDD. Besides being created from JVM arrays or collections, a Dataset can also be obtained by converting any RDD.
Python does not support the Dataset API.
A DataFrame is a special case of a Dataset. A Dataset can hold elements of any object type, whereas a DataFrame's elements are always of type Row. Querying through Row is very similar to working with a ResultSet in a traditional database: each Row represents one element of the DataFrame, analogous to a row in a database table, and its fields can be accessed either by index or by column name. Language support for the Dataset API is limited, while the DataFrame API is fully supported in Scala, Java, R and Python.
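A minimal sketch of Row access, assuming an existing DataFrame df that contains id and name columns:
import org.apache.spark.sql.Row

// each DataFrame element is a Row; fields can be read by position or by column name
val first: Row = df.first()
val id = first.getInt(0)                   // access by positional index
val name = first.getAs[String]("name")     // access by column name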
Maven dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
import org.apache.spark.sql.{Dataset, SparkSession}

def main(args: Array[String]): Unit = {
  //1. create the SparkSession
  val spark = SparkSession
    .builder()
    .appName("wordcount")
    .master("local[6]")
    .getOrCreate()
  //2. import Spark's implicit enhancements/conversions
  import spark.implicits._
  //3. create the Dataset
  val lines = Array("this is a demo", "hello spark")
  val wordRDD = spark.sparkContext
    .makeRDD(lines)
    .flatMap(_.split("\\s+"))
    .map((_, 1))
  val ds: Dataset[(String, Int)] = wordRDD.toDS()
  //4. apply SQL-style operators to the Dataset
  ds.groupBy($"_1")
    //untyped, column-based operation
    .sum("_2")
    .withColumnRenamed("_1", "word")
    .withColumnRenamed("sum(_2)", "total")
    .show()
  //5. stop spark
  spark.stop()
}
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.expressions.scalalang.typed

def main(args: Array[String]): Unit = {
  //1. create the SparkSession
  val spark = SparkSession
    .builder()
    .appName("wordcount")
    .master("local[6]")
    .getOrCreate()
  //2. import Spark's implicit enhancements/conversions
  import spark.implicits._
  //3. create the Dataset
  val lines = Array("this is a demo", "hello spark")
  val wordRDD = spark.sparkContext
    .makeRDD(lines)
    .flatMap(_.split("\\s+"))
    .map((_, 1))
  val ds: Dataset[(String, Int)] = wordRDD.toDS()
  //4. apply a strongly typed aggregation to the Dataset
  ds.groupByKey(t => t._1)
    .agg(typed.sum[(String, Int)](tuple => tuple._2).name("total"))
    .show()
  //5. stop spark
  spark.stop()
}
A Dataset is similar to an RDD, but Spark SQL has its own serialization mechanism, called Encoders, that is independent of RDD serialization (Java/Kryo). Unlike RDD serialization, a Dataset supports untyped operations in which the user does not need to deal with the element type and refers only to column names; this lets Spark SQL skip the deserialization step when executing such operators, which improves execution efficiency.
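A minimal sketch of inspecting an Encoder directly, assuming the Person case class defined just below; Encoders.product derives an encoder from a case class and exposes the columnar schema Spark uses internally:
import org.apache.spark.sql.{Encoder, Encoders}

// an Encoder describes how a JVM object maps to Spark SQL's internal binary format
val personEncoder: Encoder[Person] = Encoders.product[Person]
personEncoder.schema.printTreeString()   // prints the derived columns: id, name, age, sex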
// case-class
case class Person(id: Int, name: String, age: Int, sex: Boolean)
/**
* case-class
*/
val person: Dataset[Person] = List(
Person(1, "zhangsan", 18, true),
Person(2, "lisi", 28, true)
)
.toDS()
person.select($"id", $"name")
.show()
Note:
- The implicit conversions (import spark.implicits._) must be placed after the statement that creates the spark object.
- The definition of case class Person(id: Int, name: String, age: Int, sex: Boolean) must be placed outside the method scope (i.e., at the member level of the enclosing Scala/Java class or object).
- If the case class is written in Java, the Java bean must implement Serializable and provide get/set methods.
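Putting these rules together, a minimal runnable skeleton of the example above could look like this (the object name is illustrative):
import org.apache.spark.sql.SparkSession

object CaseClassDatasetDemo {
  // the case class lives outside the method body so Spark SQL can derive its Encoder
  case class Person(id: Int, name: String, age: Int, sex: Boolean)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("case-class-ds").master("local[*]").getOrCreate()
    import spark.implicits._ // must come after the SparkSession has been created

    val person = List(Person(1, "zhangsan", 18, true), Person(2, "lisi", 28, true)).toDS()
    person.select($"id", $"name").show()

    spark.stop()
  }
}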
Result:
+---+--------+
| id| name|
+---+--------+
| 1|zhangsan|
| 2| lisi|
+---+--------+
/**
* tuple
*/
val person: Dataset[(Int, String, Int, Boolean)] = List(
(1, "zhangsan", 18, true),
(2, "lisi", 28, true)
)
.toDS()
person.select($"_1", $"_2").show()
Result:
+---+--------+
| _1| _2|
+---+--------+
| 1|zhangsan|
| 2| lisi|
+---+--------+
Data:
{"name":"张三","age":18}
{"name":"李四","age":28}
{"name":"王五","age":38}
// numeric fields in JSON default to Long
case class User(name: String, age: Long)
/**
* JSON format
*/
spark.read
.json("file:///Users/mashikang/IdeaProjects/spark_sql/src/main/resources/json/")
.as[User]
.show()
Result:
+---+----+
|age|name|
+---+----+
| 18|张三|
| 28|李四|
| 38|王五|
+---+----+
/**
* RDD tuple
*/
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
userRDD.toDS().show()
Result:
+---+----+----+---+-------+
| _1| _2| _3| _4| _5|
+---+----+----+---+-------+
| 1|张三|true| 18|15000.0|
+---+----+----+---+-------+
// case-class
case class Person(id: Int, name: String, age: Int, sex: Boolean)
/**
* RDD case-class
*/
val userRDD = spark.sparkContext.makeRDD(List(Person(1,"张三",18,true)))
userRDD.toDS().show()
Result:
+---+----+---+----+
| id|name|age| sex|
+---+----+---+----+
| 1|张三| 18|true|
+---+----+---+----+
A DataFrame is a dataset of named columns, so users can operate on columns directly; for that reason, almost all recommended DataFrame operations are untyped operations. A DataFrame can also be viewed as a dataset of type Dataset[Row].
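A minimal sketch of that equivalence, assuming an existing DataFrame df (for example the JSON one read below):
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// DataFrame is just a type alias for Dataset[Row], so the assignment compiles without any conversion
val rows: Dataset[Row] = df
rows.printSchema()   // identical to df.printSchema()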
/**
* json
*/
val dataFrame: DataFrame = spark.read
.json("file:///Users/mashikang/IdeaProjects/spark_sql/src/main/resources/json/")
dataFrame.printSchema()
dataFrame.show()
Result:
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
+---+----+
|age|name|
+---+----+
| 18|张三|
| 28|李四|
| 38|王五|
+---+----+
case class User(id:Int,name:String,sex:Boolean)
/**
* case-class
*/
var userDF=List(User(1,"张三",true))
.toDF()
userDF.show()
Result:
+---+----+----+
| id|name| sex|
+---+----+----+
| 1|张三|true|
+---+----+----+
/**
* Tuple
*/
var userDF=List((1,"张三",true))
.toDF("id","name","sex")
userDF.show()
Result:
+---+----+----+
| id|name| sex|
+---+----+----+
| 1|张三|true|
+---+----+----+
/**
* RDD Tuple
*/
var userDF = spark.sparkContext.parallelize(List((1, "张三", true)))
.toDF("id", "name", "sex") //可以指定列
userDF.show()
Result:
+---+----+----+
| id|name| sex|
+---+----+----+
| 1|张三|true|
+---+----+----+
/**
* RDD case-class
*/
var userDF = spark.sparkContext.parallelize(List(User(1, "张三", true)))
.toDF()
userDF.show()
Result:
+---+----+----+
| id|name| sex|
+---+----+----+
| 1|张三|true|
+---+----+----+
/**
* RDD[Row] conversion
*/
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructType}

var userRDD: RDD[Row] = spark.sparkContext.parallelize(List(User(1, "张三", true)))
.map(u => Row(u.id, u.name, u.sex))
var schema = new StructType()
.add("id", IntegerType)
.add("name", StringType)
.add("sex", BooleanType)
var userDF = spark.createDataFrame(userRDD, schema)
userDF.show()
Result:
+---+----+----+
| id|name| sex|
+---+----+----+
| 1|张三|true|
+---+----+----+
/**
* RDD case-class
*/
var userRDD: RDD[User] = spark.sparkContext
.makeRDD(List(User(1, "张三", true)))
var userDF = spark.createDataFrame(userRDD)
userDF.show()
Result:
+---+----+----+
| id|name| sex|
+---+----+----+
| 1|张三|true|
+---+----+----+
/**
* RDD Tuple
*/
var userRDD:RDD[(Int,String,Boolean)]=spark.sparkContext
.makeRDD(List((1,"张三",true)))
var userDF=spark.createDataFrame(userRDD)
userDF.show()
Result:
+---+----+----+
| _1| _2| _3|
+---+----+----+
| 1|张三|true|
+---+----+----+
var df = List((1, "张三", true))
.toDF("id", "name", "sex")
df.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- sex: boolean (nullable = false)
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 1, 15000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"salary")
.show()
+---+----+------+
| id|name|salary|
+---+----+------+
| 1| zs| 15000|
| 2| ls| 15000|
+---+----+------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 1, 18000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id",$"name",$"sex",$"dept",$"salary" * 12 as "annual_salary")
.show()
+---+----+-----+----+-------------+
| id|name| sex|dept|annual_salary|
+---+----+-----+----+-------------+
| 1| zs| true| 1| 180000|
| 2| ls|false| 1| 216000|
+---+----+-----+----+-------------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 1, 18000)
)
.toDF("id", "name", "sex", "dept", "salary")
//equivalent to: df.select($"id",$"name",$"sex",$"dept",$"salary" * 12 as "annual_salary")
df.selectExpr("id", "name", "sex", "dept", "salary * 12 as annual_salary")
.show()
+---+----+-----+----+-------------+
| id|name| sex|dept|annual_salary|
+---+----+-----+----+-------------+
| 1| zs| true| 1| 180000|
| 2| ls|false| 1| 216000|
+---+----+-----+----+-------------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 1, 18000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
.withColumn("annual_salary", $"salary" * 12)
.show()
+---+----+-----+----+------+-------------+
| id|name| sex|dept|salary|annual_salary|
+---+----+-----+----+------+-------------+
| 1| zs| true| 1| 15000| 180000|
| 2| ls|false| 1| 18000| 216000|
+---+----+-----+----+------+-------------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 1, 18000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
.withColumn("annula_salary", $"salary" * 12)
.withColumnRenamed("dept", "department")
.withColumnRenamed("name", "username")
.show()
+---+--------+-----+----------+------+-------------+
| id|username| sex|department|salary|annual_salary|
+---+--------+-----+----------+------+-------------+
| 1| zs| true| 1| 15000| 180000|
| 2| ls|false| 1| 18000| 216000|
+---+--------+-----+----------+------+-------------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 1, 18000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id",$"name",$"sex",$"dept",$"salary")
.withColumn("annula_salary",$"salary" * 12)
.withColumnRenamed("dept","department")
.withColumnRenamed("name","username")
.drop("sex")
.show()
+---+--------+----------+------+-------------+
| id|username|department|salary|annual_salary|
+---+--------+----------+------+-------------+
| 1| zs| 1| 15000| 180000|
| 2| ls| 1| 18000| 216000|
+---+--------+----------+------+-------------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 1, 18000),
(3, "ww", false, 1, 19000),
(4, "zl", false, 1, 18000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
.dropDuplicates("sex", "salary")
.show()
+---+----+-----+----+------+
| id|name| sex|dept|salary|
+---+----+-----+----+------+
| 3| ww|false| 1| 19000|
| 1| zs| true| 1| 15000|
| 2| ls|false| 1| 18000|
+---+----+-----+----+------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 2, 18000),
(3, "ww", false, 2, 14000),
(4, "zl", false, 1, 18000),
(5, "zl", false, 1, 16000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
.orderBy($"salary" desc, $"id" asc)
//.sort($"salary" desc,$"id" asc)
.show()
+---+----+-----+----+------+
| id|name| sex|dept|salary|
+---+----+-----+----+------+
| 2| ls|false| 2| 18000|
| 4| zl|false| 1| 18000|
| 5| zl|false| 1| 16000|
| 1| zs| true| 1| 15000|
| 3| ww|false| 2| 14000|
+---+----+-----+----+------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 2, 18000),
(3, "ww", false, 2, 14000),
(4, "zl", false, 1, 18000),
(5, "zl", false, 1, 16000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
.groupBy($"dept")
.max("salary")
.show()
+----+-----------+
|dept|max(salary)|
+----+-----------+
| 1| 18000|
| 2| 18000|
+----+-----------+
Similar aggregation operators include max, min, avg|mean, sum, and count.
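For instance, avg and count can be called directly on the grouped data (a brief sketch using the same df as above):
df.groupBy($"dept").avg("salary").show()   // produces a column named avg(salary)
df.groupBy($"dept").count().show()         // produces a column named count with the rows per department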
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 2, 18000),
(3, "ww", false, 2, 14000),
(4, "zl", false, 1, 18000),
(5, "zl", false, 1, 16000)
)
.toDF("id", "name", "sex", "dept", "salary")
import org.apache.spark.sql.functions._
df.select($"id", $"name", $"sex", $"dept", $"salary")
.groupBy($"dept")
.agg(max("salary") as "max_salary", avg("salary") as "avg_salary")
.show()
+----+----------+------------------+
|dept|max_salary| avg_salary|
+----+----------+------------------+
| 1| 18000|16333.333333333334|
| 2| 18000| 16000.0|
+----+----------+------------------+
agg can also be passed a Map of column-to-aggregate-function expressions:
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 2, 18000),
(3, "ww", false, 2, 14000),
(4, "zl", false, 1, 18000),
(5, "zl", false, 1, 16000)
)
.toDF("id", "name", "sex", "dept", "salary")
import org.apache.spark.sql.functions._
df.select($"id", $"name", $"sex", $"dept", $"salary")
.groupBy($"dept")
.agg(Map("salary"->"max","id"->"count"))
.show()
+----+-----------+---------+
|dept|max(salary)|count(id)|
+----+-----------+---------+
| 1| 18000| 3|
| 2| 18000| 2|
+----+-----------+---------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 2, 18000),
(3, "ww", false, 2, 14000),
(4, "zl", false, 1, 18000),
(5, "zl", false, 1, 16000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
.orderBy($"id" desc)
.limit(4)
.show()
+---+----+-----+----+------+
| id|name| sex|dept|salary|
+---+----+-----+----+------+
| 5| zl|false| 1| 16000|
| 4| zl|false| 1| 18000|
| 3| ww|false| 2| 14000|
| 2| ls|false| 2| 18000|
+---+----+-----+----+------+
var df = List(
(1, "zs", true, 1, 15000),
(2, "ls", false, 2, 18000),
(3, "ww", false, 2, 14000),
(4, "zl", false, 1, 18000),
(5, "win7", false, 1, 16000)
)
.toDF("id", "name", "sex", "dept", "salary")
df.select($"id", $"name", $"sex", $"dept", $"salary")
//where("(name like '%s%' and salary > 15000) or name = 'win7'")
.where(($"name" like "%s%" and $"salary" > 15000) or $"name" === "win7")
.show()
+---+----+-----+----+------+
| id|name| sex|dept|salary|
+---+----+-----+----+------+
| 2| ls|false| 2| 18000|
| 5|win7|false| 1| 16000|
+---+----+-----+----+------+
var scoreDF = List(
(1, "math", 85),
(1, "chinese", 80),
(1, "english", 90),
(2, "math", 90),
(2, "chinese", 80)
)
.toDF("id", "course", "score")
import org.apache.spark.sql.functions._
//select id,max(case course when 'math' then score else 0 end )as math ,max(case course when 'chinese' then score else 0 end) as chinese from t_course group by id;
scoreDF.selectExpr("id", "case course when 'math' then score else 0 end as math", "case course when 'chinese' then score else 0 end as chinese", "case course when 'english' then score else 0 end as english")
.groupBy("id")
.agg(max($"math"), max($"chinese"), max($"english"))
.show()
+---+---------+------------+------------+
| id|max(math)|max(chinese)|max(english)|
+---+---------+------------+------------+
| 1| 85| 80| 90|
| 2| 90| 80| 0|
+---+---------+------------+------------+
A simpler way, using pivot:
var scoreRDD = List(
(1, "math", 85),
(1, "chinese", 80),
(1, "english", 90),
(2, "math", 90),
(2, "chinese", 80)
)
scoreRDD.toDF("id", "course", "score")
.groupBy("id")
// pivot rows into columns; the listed values become the new columns
.pivot("course", scoreRDD.map(t => t._2).distinct)
.max("score")
.show()
+---+----+-------+-------+
| id|math|chinese|english|
+---+----+-------+-------+
| 1| 85| 80| 90|
| 2| 90| 80| null|
+---+----+-------+-------+
var scoreRDD = List(
(1, "math", 85),
(1, "chinese", 80),
(1, "english", 90),
(2, "math", 90),
(2, "chinese", 80),
(3, "math", 100)
)
scoreRDD.toDF("id", "course", "score")
.groupBy("id")
// pivot rows into columns; the listed values become the new columns
.pivot("course", scoreRDD.map(t => t._2).distinct)
.max("score")
.na.fill(Map("english" -> -1, "chinese" -> 0))
.show()
+---+----+-------+-------+
| id|math|chinese|english|
+---+----+-------+-------+
| 1| 85| 80| 90|
| 3| 100| 0| -1|
| 2| 90| 80| -1|
+---+----+-------+-------+
case class UserCost(id: Int, category: String, totalCost: Double)
case class User(id: Int, name: String, sex: Boolean, age: Int, salary: Double)
var userCostDF = spark.sparkContext
.parallelize(List(
UserCost(1, "电脑配件", 100),
UserCost(1, "母婴用品", 100),
UserCost(1, "生活用品", 100),
UserCost(2, "居家美食", 79),
UserCost(2, "消费电子", 80),
UserCost(2, "生活用品", 100)
))
.toDF()
.withColumnRenamed("id", "uid")
val categories = userCostDF
.select("category")
.as[(String)]
.rdd
.distinct
.collect()
var userDF = spark.sparkContext
.parallelize(List(
User(1, "张晓三", true, 18, 15000),
User(2, "李晓四", true, 18, 18000),
User(3, "王晓五", false, 18, 10000)
))
.toDF()
userDF.join(userCostDF, $"id" === $"uid", "left_outer")
.drop("uid")
.groupBy("id", "name")
.pivot($"category", categories)
.sum("totalCost")
.na.fill(0.0)
.show()
+---+------+--------+--------+--------+--------+--------+
| id| name| 电脑配件| 生活用品| 母婴用品| 居家美食| 消费电子|
+---+------+--------+--------+--------+--------+--------+
| 1|张晓三| 100.0| 100.0| 100.0| 0.0| 0.0|
| 3|王晓五| 0.0| 0.0| 0.0| 0.0| 0.0|
| 2|李晓四| 0.0| 100.0| 0.0| 79.0| 80.0|
+---+------+--------+--------+--------+--------+--------+
import org.apache.spark.sql.functions._
List(
(110, 50, 80, 80),
(120, 60, 95, 75),
(120, 50, 96, 70)
)
.toDF("height", "weight", "IQ", "EQ")
.cube($"height", $"weight")
.agg(avg("IQ"), avg("EQ"))
.show()
+------+------+-----------------+-------+
|height|weight| avg(IQ)|avg(EQ)|
+------+------+-----------------+-------+
| 110| 50| 80.0| 80.0|
| 120| null| 95.5| 72.5|
| 120| 60| 95.0| 75.0|
| null| 60| 95.0| 75.0|
| null| null|90.33333333333333| 75.0|
| 120| 50| 96.0| 70.0|
| 110| null| 80.0| 80.0|
| null| 50| 88.0| 75.0|
+------+------+-----------------+-------+
Because strongly typed operations work on whole objects rather than on columns, Spark SQL recommends the column-based DataFrame operations for most work; the strongly typed Dataset operators are therefore generally not the first choice.
val lines = Array("this is a demo", "hello spark")
val wordRDD = spark.sparkContext.makeRDD(lines)
.flatMap(_.split("\\s+"))
.map((_, 1))
import org.apache.spark.sql.expressions.scalalang.typed
val ds: Dataset[(String, Int)] = wordRDD.toDS()
ds.groupByKey(t => t._1)
.agg(typed.sum[(String, Int)](tuple => tuple._2).name("total"))
.filter(tuple => tuple._1.contains("o"))
.show()
+-----+-----+
|value|total|
+-----+-----+
|hello| 1.0|
| demo| 1.0|
+-----+-----+