Spark SQL is a Spark module for processing structured data. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine. Spark SQL can read data from an existing Hive installation as well as from RDBMS databases.
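As a rough sketch of the RDBMS path (using the SparkSession entry point introduced below; the JDBC URL, table name, and credentials are hypothetical placeholders, and a matching JDBC driver must be on the classpath), a table can be loaded like this:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().getOrCreate()
// Hypothetical MySQL connection details, for illustration only
val ordersDF: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/shop")
  .option("dbtable", "orders")
  .option("user", "root")
  .option("password", "secret")
  .load()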
Spark 2.0 introduced a new concept, the SparkSession. A SparkSession is essentially a combination of SQLContext and HiveContext, so the APIs available on SQLContext and HiveContext can also be used on a SparkSession. Internally, a SparkSession wraps a SparkContext, so the actual work is still carried out by the SparkContext; the SparkSession simply gives users a unified entry point to all of Spark's functionality.
The following sections briefly introduce how to create a SparkSession and how to use Spark SQL.
SparkSession is constructed via the builder pattern (builder design pattern); the following code shows how to create one.
val conf: SparkConf = new SparkConf()
conf.setMaster("local[1]").setAppName("UserOrderDataFrameExample")
val sparkSession: SparkSession = SparkSession.builder().config(conf).getOrCreate()
If we need the SparkContext, it can be obtained from the SparkSession:
val sparkContext: SparkContext = sparkSession.sparkContext
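For instance, the SparkContext obtained this way can build an RDD that is then converted into a DataFrame (a minimal sketch; the spark.implicits._ import supplies the toDF conversion):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._
// Build an RDD with the SparkContext, then convert it to a DataFrame
val rdd: RDD[Int] = spark.sparkContext.parallelize(1 to 5)
val df: DataFrame = rdd.toDF("value")
df.show()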
Once the SparkSession has been created, it can be used to create DataFrames and to run Spark SQL against the data. Within a single Spark application, only one SparkSession is created; subsequent calls to getOrCreate() return the existing session.
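Because getOrCreate() returns the already-created session when one exists, different parts of an application can share the same SparkSession without passing it around, as the following minimal sketch shows:
import org.apache.spark.sql.SparkSession

val first: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("GetOrCreateExample")
  .getOrCreate()
// The second call does not build a new session; it returns the one above
val second: SparkSession = SparkSession.builder().getOrCreate()
assert(first eq second) // same instance
This is exactly the pattern the example below relies on: initSpark() creates the session, and dataFrameForSql() and dataFrameForJoin() later retrieve it with getOrCreate().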
The following example uses Spark SQL to join user and order data; the user and order data are generated automatically by the program. The data structures are defined as follows:
import java.sql.Timestamp

/**
 * Data structure for a user
 **/
case class User(userId: Int, userName: String, tel: String)

/**
 * Data structure linking a user to an order
 **/
case class UserOrder(userId: Int, orderId: String)

/**
 * Data structure for an order
 **/
case class Order(orderId: String, time: Timestamp, money: Double)

/**
 * Holder for the generated user, user-order, and order data
 **/
class UserOrderSource(var users: List[User], var userOrder: List[UserOrder], var orders: List[Order]) {
}
The order generator is as follows:
import java.sql.Timestamp
import java.util.UUID
import scala.collection.mutable.ListBuffer

/**
 * Order generator
 **/
object OrderGenerator {
  var tempUserId: Int = 0

  def makerOrder(): UserOrderSource = {
    val users: ListBuffer[User] = new ListBuffer[User]()
    val orders: ListBuffer[Order] = new ListBuffer[Order]()
    val userOrders: ListBuffer[UserOrder] = new ListBuffer[UserOrder]()
    var user: User = null
    var order: Order = null
    var userOrder: UserOrder = null
    // Create 10 users
    for (index <- 1 to 10) {
      user = User(createUserId(), s"UserName-${index}", s"1882345889${index - 1}")
      users += user
      // Create 3 orders for each user
      for (num <- 1 to 3) {
        order = Order(createOrderId(), new Timestamp(System.currentTimeMillis()), createOrderMoney())
        userOrder = UserOrder(user.userId, order.orderId)
        orders += order
        userOrders += userOrder
      }
    }
    new UserOrderSource(users.toList, userOrders.toList, orders.toList)
  }

  def createUserId(): Int = {
    this.synchronized {
      tempUserId = tempUserId + 1
      tempUserId
    }
  }

  def createOrderId(): String = {
    val uuid: UUID = UUID.randomUUID()
    uuid.toString()
  }

  def createOrderMoney(): Double = {
    // Random amount between 0.00 and 999.99 with two decimal places
    (Math.random() * 100000).toInt / 100d
  }
}
The driver program is as follows:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
 * DataFrame operations on user orders
 */
object UserOrderDataFrame {
  var userOrderSource: UserOrderSource = null

  def main(args: Array[String]): Unit = {
    println("this is about user order dataframe for spark example.===================")
    try {
      // Generate the user/order data
      userOrderSource = OrderGenerator.makerOrder()
      // Initialize Spark
      initSpark()
      dataFrameForSql()
      dataFrameForJoin()
    } catch {
      // A single Throwable case covers both Exception and Error
      case e: Throwable => e.printStackTrace()
    } finally {
      SparkSession.builder().getOrCreate().close()
    }
    println("this is one user order dataframe for spark example.===================")
  }

  def initSpark(): Unit = {
    val conf: SparkConf = new SparkConf()
    conf.setMaster("local[1]").setAppName("UserOrderDataFrameExample")
    // getOrCreate() registers the session so later calls can retrieve it
    val sparkSession: SparkSession = SparkSession.builder().config(conf).getOrCreate()
  }

  /**
   * Operate on DataFrames with SQL
   */
  def dataFrameForSql(): Unit = {
    val sparkSession: SparkSession = SparkSession.builder().getOrCreate()
    // Create the DataFrames
    val dfUsers: DataFrame = sparkSession.createDataFrame[User](userOrderSource.users)
    val dfOrders: DataFrame = sparkSession.createDataFrame[Order](userOrderSource.orders)
    val dfUserOrders: DataFrame = sparkSession.createDataFrame[UserOrder](userOrderSource.userOrder)
    dfUsers.printSchema()
    dfOrders.printSchema()
    dfUserOrders.printSchema()
    // Register temporary views
    dfUsers.createOrReplaceTempView("User")
    dfOrders.createOrReplaceTempView("Order")
    dfUserOrders.createOrReplaceTempView("UserOrder")
    // Spark SQL
    val sql = "SELECT T1.*,T2.*,T3.* FROM User T1 INNER JOIN UserOrder T2 ON T1.userId=T2.userId INNER JOIN Order T3 ON T2.orderId=T3.orderId"
    val dfResult: DataFrame = sparkSession.sql(sql)
    println("SparkSQL join result=================================")
    dfResult.printSchema()
    dfResult.show(100)
  }
}
Running the code above produces the following output:
root
|-- userId: integer (nullable = false)
|-- userName: string (nullable = true)
|-- tel: string (nullable = true)
root
|-- orderId: string (nullable = true)
|-- time: timestamp (nullable = true)
|-- money: double (nullable = false)
root
|-- userId: integer (nullable = false)
|-- orderId: string (nullable = true)
2018-05-30 18:18:06 INFO SparkSqlParser: Parsing command: User
2018-05-30 18:18:06 INFO SparkSqlParser: Parsing command: Order
2018-05-30 18:18:06 INFO SparkSqlParser: Parsing command: UserOrder
2018-05-30 18:18:06 INFO SparkSqlParser: Parsing command: SELECT T1.*,T2.*,T3.* FROM User T1 INNER JOIN UserOrder T2 ON T1.userId=T2.userId INNER JOIN Order T3 ON T2.orderId=T3.orderId
SparkSQL join result=================================
root
|-- userId: integer (nullable = false)
|-- userName: string (nullable = true)
|-- tel: string (nullable = true)
|-- userId: integer (nullable = false)
|-- orderId: string (nullable = true)
|-- orderId: string (nullable = true)
|-- time: timestamp (nullable = true)
|-- money: double (nullable = false)
+------+-----------+-----------+------+--------------------+--------------------+--------------------+------+
|userId| userName| tel|userId| orderId| orderId| time| money|
+------+-----------+-----------+------+--------------------+--------------------+--------------------+------+
| 1| UserName-1|18823458890| 1|f44be3a5-8e97-47f...|f44be3a5-8e97-47f...|2018-05-30 18:17:...|513.33|
| 1| UserName-1|18823458890| 1|ef516070-acec-4ee...|ef516070-acec-4ee...|2018-05-30 18:17:...|400.59|
| 1| UserName-1|18823458890| 1|ef004d7d-b0d8-4d7...|ef004d7d-b0d8-4d7...|2018-05-30 18:17:...|336.88|
| 2| UserName-2|18823458891| 2|f45b08ff-d04e-408...|f45b08ff-d04e-408...|2018-05-30 18:17:...|693.88|
| 2| UserName-2|18823458891| 2|e2d6fb42-1558-42e...|e2d6fb42-1558-42e...|2018-05-30 18:17:...|278.61|
| 2| UserName-2|18823458891| 2|9bc923d4-528a-4dc...|9bc923d4-528a-4dc...|2018-05-30 18:17:...|719.47|
| 3| UserName-3|18823458892| 3|e0dfd87b-36e5-49f...|e0dfd87b-36e5-49f...|2018-05-30 18:17:...|425.51|
| 3| UserName-3|18823458892| 3|149f0659-12c2-492...|149f0659-12c2-492...|2018-05-30 18:17:...|129.92|
| 3| UserName-3|18823458892| 3|c155a1c8-72b4-4f3...|c155a1c8-72b4-4f3...|2018-05-30 18:17:...|258.57|
| 4| UserName-4|18823458893| 4|54b0c494-a096-4e6...|54b0c494-a096-4e6...|2018-05-30 18:17:...|955.21|
| 4| UserName-4|18823458893| 4|bc6a24b8-6d77-4a5...|bc6a24b8-6d77-4a5...|2018-05-30 18:17:...| 69.42|
| 4| UserName-4|18823458893| 4|b4d68db1-7b02-44d...|b4d68db1-7b02-44d...|2018-05-30 18:17:...|571.32|
| 5| UserName-5|18823458894| 5|a6ccd370-494c-4f2...|a6ccd370-494c-4f2...|2018-05-30 18:17:...| 72.92|
| 5| UserName-5|18823458894| 5|646adb1c-73f1-44e...|646adb1c-73f1-44e...|2018-05-30 18:17:...| 69.54|
| 5| UserName-5|18823458894| 5|532be792-5343-47a...|532be792-5343-47a...|2018-05-30 18:17:...|179.44|
| 6| UserName-6|18823458895| 6|78842349-3c60-486...|78842349-3c60-486...|2018-05-30 18:17:...|111.39|
| 6| UserName-6|18823458895| 6|c4cda44d-42ae-46c...|c4cda44d-42ae-46c...|2018-05-30 18:17:...|111.26|
| 6| UserName-6|18823458895| 6|26b90354-7e46-482...|26b90354-7e46-482...|2018-05-30 18:17:...|336.82|
| 7| UserName-7|18823458896| 7|b0e51c7b-7538-4c6...|b0e51c7b-7538-4c6...|2018-05-30 18:17:...|399.73|
| 7| UserName-7|18823458896| 7|fd8acde2-b115-485...|fd8acde2-b115-485...|2018-05-30 18:17:...|295.53|
| 7| UserName-7|18823458896| 7|2c233d01-59fa-430...|2c233d01-59fa-430...|2018-05-30 18:17:...| 52.69|
| 8| UserName-8|18823458897| 8|a73308fd-f3de-4e4...|a73308fd-f3de-4e4...|2018-05-30 18:17:...| 91.96|
| 8| UserName-8|18823458897| 8|a21deab3-8d88-493...|a21deab3-8d88-493...|2018-05-30 18:17:...|343.63|
| 8| UserName-8|18823458897| 8|25092940-ecde-487...|25092940-ecde-487...|2018-05-30 18:17:...|860.76|
| 9| UserName-9|18823458898| 9|5f2298bf-0859-425...|5f2298bf-0859-425...|2018-05-30 18:17:...|907.78|
| 9| UserName-9|18823458898| 9|cb71a2f9-f973-4ad...|cb71a2f9-f973-4ad...|2018-05-30 18:17:...|666.09|
| 9| UserName-9|18823458898| 9|f64b4ede-7faa-421...|f64b4ede-7faa-421...|2018-05-30 18:17:...|134.23|
| 10|UserName-10|18823458899| 10|2eb50d4e-5230-487...|2eb50d4e-5230-487...|2018-05-30 18:17:...|957.02|
| 10|UserName-10|18823458899| 10|faa13220-d459-4b4...|faa13220-d459-4b4...|2018-05-30 18:17:...|888.55|
| 10|UserName-10|18823458899| 10|8d07cc86-9b13-4d2...|8d07cc86-9b13-4d2...|2018-05-30 18:17:...|228.51|
+------+-----------+-----------+------+--------------------+--------------------+--------------------+------+
The code is as follows:
/**
 * Join DataFrames with the DataFrame join method
 */
def dataFrameForJoin(): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().getOrCreate()
  // Create the DataFrames
  val dfUsers: DataFrame = sparkSession.createDataFrame[User](userOrderSource.users)
  val dfOrders: DataFrame = sparkSession.createDataFrame[Order](userOrderSource.orders)
  val dfUserOrders: DataFrame = sparkSession.createDataFrame[UserOrder](userOrderSource.userOrder)
  dfUsers.printSchema()
  dfOrders.printSchema()
  dfUserOrders.printSchema()
  // Join on userId first, then on orderId
  val dfResult: DataFrame = dfUsers.join(dfUserOrders, "userId").join(dfOrders, "orderId")
  println("DataFrame join result=================================")
  dfResult.printSchema()
  dfResult.show(100)
}
The output is as follows:
root
|-- userId: integer (nullable = false)
|-- userName: string (nullable = true)
|-- tel: string (nullable = true)
root
|-- orderId: string (nullable = true)
|-- time: timestamp (nullable = true)
|-- money: double (nullable = false)
root
|-- userId: integer (nullable = false)
|-- orderId: string (nullable = true)
DataFrame join result=================================
root
|-- orderId: string (nullable = true)
|-- userId: integer (nullable = false)
|-- userName: string (nullable = true)
|-- tel: string (nullable = true)
|-- time: timestamp (nullable = true)
|-- money: double (nullable = false)
+--------------------+------+-----------+-----------+--------------------+------+
| orderId|userId| userName| tel| time| money|
+--------------------+------+-----------+-----------+--------------------+------+
|f44be3a5-8e97-47f...| 1| UserName-1|18823458890|2018-05-30 18:17:...|513.33|
|ef516070-acec-4ee...| 1| UserName-1|18823458890|2018-05-30 18:17:...|400.59|
|ef004d7d-b0d8-4d7...| 1| UserName-1|18823458890|2018-05-30 18:17:...|336.88|
|f45b08ff-d04e-408...| 2| UserName-2|18823458891|2018-05-30 18:17:...|693.88|
|e2d6fb42-1558-42e...| 2| UserName-2|18823458891|2018-05-30 18:17:...|278.61|
|9bc923d4-528a-4dc...| 2| UserName-2|18823458891|2018-05-30 18:17:...|719.47|
|e0dfd87b-36e5-49f...| 3| UserName-3|18823458892|2018-05-30 18:17:...|425.51|
|149f0659-12c2-492...| 3| UserName-3|18823458892|2018-05-30 18:17:...|129.92|
|c155a1c8-72b4-4f3...| 3| UserName-3|18823458892|2018-05-30 18:17:...|258.57|
|54b0c494-a096-4e6...| 4| UserName-4|18823458893|2018-05-30 18:17:...|955.21|
|bc6a24b8-6d77-4a5...| 4| UserName-4|18823458893|2018-05-30 18:17:...| 69.42|
|b4d68db1-7b02-44d...| 4| UserName-4|18823458893|2018-05-30 18:17:...|571.32|
|a6ccd370-494c-4f2...| 5| UserName-5|18823458894|2018-05-30 18:17:...| 72.92|
|646adb1c-73f1-44e...| 5| UserName-5|18823458894|2018-05-30 18:17:...| 69.54|
|532be792-5343-47a...| 5| UserName-5|18823458894|2018-05-30 18:17:...|179.44|
|78842349-3c60-486...| 6| UserName-6|18823458895|2018-05-30 18:17:...|111.39|
|c4cda44d-42ae-46c...| 6| UserName-6|18823458895|2018-05-30 18:17:...|111.26|
|26b90354-7e46-482...| 6| UserName-6|18823458895|2018-05-30 18:17:...|336.82|
|b0e51c7b-7538-4c6...| 7| UserName-7|18823458896|2018-05-30 18:17:...|399.73|
|fd8acde2-b115-485...| 7| UserName-7|18823458896|2018-05-30 18:17:...|295.53|
|2c233d01-59fa-430...| 7| UserName-7|18823458896|2018-05-30 18:17:...| 52.69|
|a73308fd-f3de-4e4...| 8| UserName-8|18823458897|2018-05-30 18:17:...| 91.96|
|a21deab3-8d88-493...| 8| UserName-8|18823458897|2018-05-30 18:17:...|343.63|
|25092940-ecde-487...| 8| UserName-8|18823458897|2018-05-30 18:17:...|860.76|
|5f2298bf-0859-425...| 9| UserName-9|18823458898|2018-05-30 18:17:...|907.78|
|cb71a2f9-f973-4ad...| 9| UserName-9|18823458898|2018-05-30 18:17:...|666.09|
|f64b4ede-7faa-421...| 9| UserName-9|18823458898|2018-05-30 18:17:...|134.23|
|2eb50d4e-5230-487...| 10|UserName-10|18823458899|2018-05-30 18:17:...|957.02|
|faa13220-d459-4b4...| 10|UserName-10|18823458899|2018-05-30 18:17:...|888.55|
|8d07cc86-9b13-4d2...| 10|UserName-10|18823458899|2018-05-30 18:17:...|228.51|
+--------------------+------+-----------+-----------+--------------------+------+
The DataFrame join method produces the same result as Spark SQL. Besides the inner join shown above, the join method accepts a join type parameter such as "left_outer" or "right_outer", which produce the same results as SQL's LEFT OUTER JOIN and RIGHT OUTER JOIN.
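A minimal sketch of a left outer join, reusing the dfUsers, dfUserOrders, and dfOrders DataFrames from dataFrameForJoin above:
// "left_outer" keeps every row from the left side even without a match;
// passing "right_outer" instead keeps every row from the right side
val dfLeft: DataFrame = dfUsers
  .join(dfUserOrders, Seq("userId"), "left_outer")
  .join(dfOrders, Seq("orderId"), "left_outer")
dfLeft.show(100)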
Spark SQL makes it easy to join data across multiple DataFrames, which is a great convenience for users accustomed to SQL. For aggregations and statistics, Spark SQL also eliminates a lot of hand-written collection-processing code.
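For example, the total order amount per user can be computed either with SQL against the temporary views registered in dataFrameForSql(), or with the DataFrame API; the following is a sketch reusing the DataFrames and views from the examples above:
import org.apache.spark.sql.functions.sum

// SQL form, against the UserOrder and Order temporary views
val totalsSql: DataFrame = sparkSession.sql(
  "SELECT T2.userId, SUM(T3.money) AS totalMoney " +
    "FROM UserOrder T2 INNER JOIN Order T3 ON T2.orderId=T3.orderId " +
    "GROUP BY T2.userId")
totalsSql.show()

// Equivalent DataFrame form
val totalsDf: DataFrame = dfUserOrders
  .join(dfOrders, "orderId")
  .groupBy("userId")
  .agg(sum("money").as("totalMoney"))
totalsDf.show()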
Spark SQL is fast, easy to use, and general-purpose, and it runs wherever Spark runs, which is why it is favored by so many developers.