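The main change in Spark 2.0 is the mature Dataset API, which offers a higher-level abstraction than the native RDD API and makes data development work more convenient. Spark 2.0 also improves its SQL syntax support, which makes SQL-based big data analysis easier.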
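Why develop this complex business logic with the Spark RDD API instead of using SQL directly? SQL certainly works, but the two have different strengths. SQL mainly suits the large volume of offline batch ETL jobs and statistical analysis logic mentioned earlier. Statistical reports have flexible, changing requirements: new ones are added often and the logic changes often, so SQL is a good fit there. Ours, however, is a complete system that works together with a Java web application, and its modules and requirements are fixed.
The drawback of SQL is that Spark generates the execution plan and code automatically under the hood, so we have almost no room for deep tuning, and problems are hard to resolve. For a system like ours, with fixed requirements, few modules, and demands on speed and stability, the Spark RDD API is the best choice. RDD is the lowest-level API, so we control almost everything, including parameter tuning and restructuring the job to deal with data skew. When an error occurs, the stack trace points at the lowest-level source code, which makes the problem easy to locate and fix. That is why we build complex business modules largely on the Spark RDD API.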
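1. The 10 users with the most visits in a given time range
2. The 10 users with the highest purchase amount in a given time range
3. The 10 users whose visit count grew the most in the latest period compared with the previous period
4. The 10 users whose purchase amount grew the most in the latest period compared with the previous period
5. The 10 new users, registered within a given period, with the most visits in their first 7 days
6. The 10 new users, registered within a given period, with the highest purchase amount in their first 7 days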
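As a rough illustration of what the ranking features compute, the sketch below runs feature 1 on an in-memory list with plain Scala collections rather than Spark. The `Action` case class, the `topVisitors` helper, and the sample rows are hypothetical and exist only for this example; the one convention borrowed from the Spark code in this post is that `actionType` 0 marks a visit.

```scala
object TopVisitorsSketch {
  // Hypothetical record shape: a trimmed-down action log entry
  final case class Action(userId: Long, actionTime: String, actionType: Long)

  // Count visits (actionType == 0) per user inside [start, end) and keep the n largest.
  // Times are "yyyy-MM-dd HH:mm:ss" strings, which sort chronologically as plain text.
  def topVisitors(actions: Seq[Action], start: String, end: String, n: Int): Seq[(Long, Int)] =
    actions
      .filter(a => a.actionType == 0L && a.actionTime >= start && a.actionTime < end)
      .groupBy(_.userId)
      .map { case (userId, visits) => (userId, visits.size) }
      .toSeq
      .sortBy { case (userId, count) => (-count, userId) } // most visits first, ties by userId
      .take(n)

  def main(args: Array[String]): Unit = {
    val actions = Seq(
      Action(1L, "2016-09-03 10:00:00", 0L),
      Action(1L, "2016-10-05 11:30:00", 0L),
      Action(2L, "2016-10-07 09:15:00", 0L),
      Action(2L, "2016-10-08 09:15:00", 1L), // a purchase, not counted as a visit
      Action(3L, "2016-11-02 08:00:00", 0L)  // outside the time range
    )
    topVisitors(actions, "2016-09-01", "2016-11-01", 2).foreach(println)
  }
}
```

The filter, group, count, sort, and take steps map one-to-one onto `filter`, `groupBy`, `agg(count(...))`, `sort`, and `limit` in the Dataset version.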
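package com.scala.MyUserActiveDegreeAnalyze

import org.apache.spark.sql.SparkSession

object MyScalaActiveAnalyze {
  // One record of user_action_log.json; actionType 0 is a visit, 1 is a purchase
  case class UserActionLog(logId: Long, userId: Long, actionTime: String, actionType: Long, purchaseMoney: Double)
  // A visit tagged +1 (latest period) or -1 (previous period)
  case class UserActionLogValue(logId: Long, userId: Long, actionValue: Long)
  // A purchase whose amount is negated when it falls in the previous period
  case class UserActionLogWithPurchaseMoneyValue(logId: Long, userId: Long, purchaseMoney: Double)

  def main(args: Array[String]): Unit = {
    val startDate = "2016-09-01"
    val endDate = "2016-11-01"
    val spark = SparkSession.builder()
      .appName("UserAnalyze")
      .master("local")
      //.config()
      .getOrCreate()
    // Import Spark's implicit conversions (encoders and the $"col" syntax)
    import spark.implicits._
    // Import the Spark SQL functions (count, sum, round, date_add)
    import org.apache.spark.sql.functions._
    // Load the data sets
    val userBaseInfo = spark.read.json("D:\\study\\拟定新\\user_base_info.json")
    // The sample logIds carry leading zeros (00, 01, ...), which strict JSON parsing rejects;
    // this reader option tells Spark to accept them
    val userActionLog = spark.read
      .option("allowNumericLeadingZeros", "true")
      .json("D:\\study\\拟定新\\user_action_log.json")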

    // Feature 1: the 10 users with the most visits in the given time range
    userActionLog
      // 1. Keep only visit records (actionType = 0) inside the time range
      .filter("actionTime >= '" + startDate + "' and actionTime <= '" + endDate + "' and actionType = 0")
      // 2. Join with the user base info
      .join(userBaseInfo, userActionLog("userId") === userBaseInfo("userId"))
      // 3. Group by userId and username
      .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
      // 4. Aggregate: count the visits of each user
      .agg(count(userActionLog("logId")).alias("actionCount"))
      // 5. Sort by visit count, descending
      .sort($"actionCount".desc)
      // 6. Keep the top 10
      .limit(10)
      // 7. Display the result
      .show()

    // Feature 2: the 10 users with the highest purchase amount in the given time range
    userActionLog
      .filter("actionTime >= '" + startDate + "' and actionTime <= '" + endDate + "' and actionType = 1")
      .join(userBaseInfo, userActionLog("userId") === userBaseInfo("userId"))
      .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
      .agg(round(sum(userActionLog("purchaseMoney")), 2).alias("totalPurchaseMoney"))
      .sort($"totalPurchaseMoney".desc)
      .limit(10)
      .show()

    // Feature 3: the 10 users whose visit count grew the most in the latest period
    // compared with the previous period.
    // The period length is configurable; here it is one month. For example, if user
    // Zhang San visited 100 times in September and 200 times in October, his growth
    // across the two periods is 100 visits.
    // Latest period: 2016-10-01 to 2016-10-31; previous period: 2016-09-01 to 2016-09-30.
    // actionTime is a "yyyy-MM-dd HH:mm:ss" string, so each upper bound is written as
    // "< first day of the next month"; "<= '2016-10-31'" would exclude every record
    // later than midnight on the 31st.
    val userActionNewPeriod = userActionLog.as[UserActionLog]
      .filter("actionTime >= '2016-10-01' and actionTime < '2016-11-01' and actionType = 0")
      .map(userActionLogEntry => UserActionLogValue(userActionLogEntry.logId, userActionLogEntry.userId, 1))
    val userActionLastPeriod = userActionLog.as[UserActionLog]
      .filter("actionTime >= '2016-09-01' and actionTime < '2016-10-01' and actionType = 0")
      .map(userActionLogEntry => UserActionLogValue(userActionLogEntry.logId, userActionLogEntry.userId, -1))
    // Summing the +1/-1 tags per user yields latest-period visits minus previous-period visits
    val userActionLogsDataSet = userActionNewPeriod.union(userActionLastPeriod)
    userActionLogsDataSet
      .join(userBaseInfo, userActionLogsDataSet("userId") === userBaseInfo("userId"))
      .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
      .agg(sum(userActionLogsDataSet("actionValue")).alias("actionIncr"))
      .sort($"actionIncr".desc)
      .limit(10)
      .show()

    // Feature 4: the 10 users whose purchase amount grew the most in the latest period
    // compared with the previous period
    val userActionMoneyNewPeriod = userActionLog.as[UserActionLog]
      .filter("actionTime >= '2016-10-01' and actionTime < '2016-11-01' and actionType = 1")
      .map(actionEntry => UserActionLogWithPurchaseMoneyValue(actionEntry.logId, actionEntry.userId, actionEntry.purchaseMoney))
    val userActionMoneyLastPeriod = userActionLog.as[UserActionLog]
      .filter("actionTime >= '2016-09-01' and actionTime < '2016-10-01' and actionType = 1")
      // Negate previous-period amounts so that the per-user sum is the growth
      .map(actionEntry => UserActionLogWithPurchaseMoneyValue(actionEntry.logId, actionEntry.userId, -actionEntry.purchaseMoney))
    val userActionPeriodMoneyDataSet = userActionMoneyNewPeriod.union(userActionMoneyLastPeriod)
    userActionPeriodMoneyDataSet
      // Join on the unioned Dataset's own userId column, as feature 3 does
      .join(userBaseInfo, userActionPeriodMoneyDataSet("userId") === userBaseInfo("userId"))
      .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
      .agg(round(sum(userActionPeriodMoneyDataSet("purchaseMoney")), 2).alias("purchaseMoneyIncr"))
      // Sort before limiting; without it, limit(10) returns 10 arbitrary users
      .sort($"purchaseMoneyIncr".desc)
      .limit(10)
      .show()

    // Feature 5: among users registered in the given period, the 10 with the most
    // visits in their first 7 days
    userActionLog
      .join(userBaseInfo, userActionLog("userId") === userBaseInfo("userId"))
      .filter(
        userBaseInfo("registTime") >= "2016-10-01"
          // "< 2016-11-01" keeps users who registered during the day on 2016-10-31
          && userBaseInfo("registTime") < "2016-11-01"
          && userActionLog("actionTime") >= userBaseInfo("registTime")
          // date_add works at day granularity: the registration date plus 7 days
          && userActionLog("actionTime") <= date_add(userBaseInfo("registTime"), 7)
          && userActionLog("actionType") === 0
      )
      .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
      .agg(count(userActionLog("logId")).alias("actionCount"))
      .sort($"actionCount".desc)
      .limit(10)
      .show()

    // Feature 6: among users registered in the given period, the 10 with the highest
    // purchase amount in their first 7 days
    userActionLog
      .join(userBaseInfo, userActionLog("userId") === userBaseInfo("userId"))
      .filter(
        userBaseInfo("registTime") >= "2016-10-01"
          && userBaseInfo("registTime") < "2016-11-01"
          && userActionLog("actionTime") >= userBaseInfo("registTime")
          && userActionLog("actionTime") <= date_add(userBaseInfo("registTime"), 7)
          // Purchase records only, matching features 2 and 4
          && userActionLog("actionType") === 1
      )
      .groupBy(userActionLog("userId"), userBaseInfo("username"))
      .agg(sum(userActionLog("purchaseMoney")).alias("MoneyBuy"))
      .sort($"MoneyBuy".desc)
      .limit(10)
      .show()
  }
}
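Features 3 and 4 rest on one trick: tag every record of the latest period with a positive value, tag every record of the previous period with the negated value, and sum per user. The sketch below shows the visit-count variant on plain Scala collections, without Spark. The `Visit` case class, the `visitGrowth` helper, and the sample rows are hypothetical names and data introduced for this illustration only.

```scala
object PeriodGrowthSketch {
  // Hypothetical record: one visit by one user at a given time
  final case class Visit(userId: Long, actionTime: String)

  // Tag each visit +1 if it falls in the latest period and -1 if it falls in the
  // previous period, then sum the tags per user. Each period is a half-open
  // [start, end) range over "yyyy-MM-dd HH:mm:ss" strings.
  def visitGrowth(
      visits: Seq[Visit],
      previous: (String, String),
      latest: (String, String)
  ): Map[Long, Int] = {
    def inRange(time: String, range: (String, String)): Boolean =
      time >= range._1 && time < range._2

    val tagged = visits.collect {
      case v if inRange(v.actionTime, latest)   => v.userId -> 1
      case v if inRange(v.actionTime, previous) => v.userId -> -1
    }
    tagged.groupBy(_._1).map { case (userId, tags) => userId -> tags.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val visits = Seq(
      Visit(1L, "2016-09-10 08:00:00"),
      Visit(1L, "2016-10-02 08:00:00"),
      Visit(1L, "2016-10-15 08:00:00"),
      Visit(1L, "2016-10-31 20:00:00"), // last day of the period, still counted
      Visit(2L, "2016-09-05 08:00:00"),
      Visit(2L, "2016-09-30 08:00:00"),
      Visit(2L, "2016-10-20 08:00:00")
    )
    val growth = visitGrowth(visits, ("2016-09-01", "2016-10-01"), ("2016-10-01", "2016-11-01"))
    // Rank users by growth, largest first
    growth.toSeq.sortBy { case (_, incr) => -incr }.foreach(println)
  }
}
```

A user who was active only in the previous period ends up with a negative total, so such users sink to the bottom of a descending sort.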
Sample records from user_action_log.json:
{"logId": 00,"userId": 0, "actionTime": "2016-10-04 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 01,"userId": 0, "actionTime": "2016-10-17 15:42:45", "actionType": 1, "purchaseMoney": 33.36}
{"logId": 02,"userId": 0, "actionTime": "2016-10-18 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 03,"userId": 0, "actionTime": "2016-10-14 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 04,"userId": 0, "actionTime": "2016-11-09 08:45:33", "actionType": 1, "purchaseMoney": 664.35}
{"logId": 05,"userId": 0, "actionTime": "2016-10-13 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 06,"userId": 0, "actionTime": "2016-10-03 15:42:45", "actionType": 1, "purchaseMoney": 606.34}
{"logId": 07,"userId": 0, "actionTime": "2016-10-04 15:42:45", "actionType": 1, "purchaseMoney": 120.72}
{"logId": 08,"userId": 0, "actionTime": "2016-09-24 15:42:45", "actionType": 1, "purchaseMoney": 264.96}
{"logId": 09,"userId": 0, "actionTime": "2016-09-25 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 010,"userId": 0, "actionTime": "2016-11-09 08:45:33", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 011,"userId": 0, "actionTime": "2016-09-28 15:42:45", "actionType": 1, "purchaseMoney": 342.04}
{"logId": 012,"userId": 0, "actionTime": "2016-11-12 08:45:33", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 013,"userId": 0, "actionTime": "2016-10-02 15:42:45", "actionType": 1, "purchaseMoney": 945.63}
{"logId": 014,"userId": 0, "actionTime": "2016-10-09 15:42:45", "actionType": 1, "purchaseMoney": 907.8}
{"logId": 015,"userId": 0, "actionTime": "2016-10-18 15:42:45", "actionType": 1, "purchaseMoney": 520.64}
{"logId": 016,"userId": 0, "actionTime": "2016-11-11 08:45:33", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 017,"userId": 0, "actionTime": "2016-10-12 15:42:45", "actionType": 1, "purchaseMoney": 72.19}
{"logId": 018,"userId": 0, "actionTime": "2016-10-02 15:42:45", "actionType": 1, "purchaseMoney": 776.1}
{"logId": 019,"userId": 0, "actionTime": "2016-09-30 15:42:45", "actionType": 1, "purchaseMoney": 737.01}
{"logId": 020,"userId": 0, "actionTime": "2016-10-14 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 021,"userId": 0, "actionTime": "2016-09-27 15:42:45", "actionType": 1, "purchaseMoney": 912.38}
{"logId": 022,"userId": 0, "actionTime": "2016-10-14 15:42:45", "actionType": 1, "purchaseMoney": 524.03}
{"logId": 023,"userId": 0, "actionTime": "2016-10-07 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
Sample records from user_base_info.json:
{"userId": 3, "username": "user3", "registTime": "2016-10-04 18:06:25"}
{"userId": 4, "username": "user4", "registTime": "2016-10-05 18:06:25"}
{"userId": 5, "username": "user5", "registTime": "2016-10-05 18:06:25"}
{"userId": 6, "username": "user6", "registTime": "2016-11-08 11:09:12"}
{"userId": 7, "username": "user7", "registTime": "2016-10-07 18:06:25"}
{"userId": 8, "username": "user8", "registTime": "2016-10-05 18:06:25"}
{"userId": 9, "username": "user9", "registTime": "2016-10-07 18:06:25"}
{"userId": 10, "username": "user10", "registTime": "2016-09-27 18:06:25"}
{"userId": 11, "username": "user11", "registTime": "2016-10-01 18:06:25"}
{"userId": 12, "username": "user12", "registTime": "2016-10-11 18:06:25"}
{"userId": 13, "username": "user13", "registTime": "2016-10-11 18:06:25"}
{"userId": 14, "username": "user14", "registTime": "2016-10-12 18:06:25"}
{"userId": 15, "username": "user15", "registTime": "2016-10-09 18:06:25"}
{"userId": 16, "username": "user16", "registTime": "2016-10-11 18:06:25"}
{"userId": 17, "username": "user17", "registTime": "2016-09-29 18:06:25"}
{"userId": 18, "username": "user18", "registTime": "2016-10-07 18:06:25"}
{"userId": 19, "username": "user19", "registTime": "2016-10-09 18:06:25"}
{"userId": 20, "username": "user20", "registTime": "2016-10-14 18:06:25"}
{"userId": 21, "username": "user21", "registTime": "2016-09-28 18:06:25"}
{"userId": 22, "username": "user22", "registTime": "2016-11-08 11:09:12"}
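Features 5 and 6 both depend on a "first 7 days after registration" window. The sketch below checks that window for a single pair of timestamps in the format used by the sample data, using `java.time` instead of Spark. The `inFirstWeek` helper is a hypothetical name, and it assumes an exact window of 7 × 24 hours; Spark's `date_add` returns a date at day granularity, so the Spark job may treat records near the boundary differently.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object FirstWeekSketch {
  // Timestamp format used by registTime and actionTime in the sample data
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // True when the action happened at or after registration and no later than
  // exactly 7 days after it (an assumption of this sketch, see the note above)
  def inFirstWeek(registTime: String, actionTime: String): Boolean = {
    val registered = LocalDateTime.parse(registTime, fmt)
    val acted = LocalDateTime.parse(actionTime, fmt)
    !acted.isBefore(registered) && !acted.isAfter(registered.plusDays(7))
  }

  def main(args: Array[String]): Unit = {
    val registTime = "2016-10-04 18:06:25"
    println(inFirstWeek(registTime, "2016-10-06 09:00:00")) // inside the window
    println(inFirstWeek(registTime, "2016-10-04 15:42:45")) // before registration
    println(inFirstWeek(registTime, "2016-10-12 00:00:00")) // more than 7 days later
  }
}
```

Counting the records for which `inFirstWeek` holds, grouped by user, gives the same kind of ranking that the `groupBy` and `agg` steps produce in the Spark job.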