Interactive user activity analysis with Spark 2.2: top 10 users by visit count in a given range

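The headline change in Spark 2.0 is a mature Dataset API, a higher-level abstraction than the native RDD API that makes data development work more convenient. It also improves SQL syntax support, which makes SQL-based big data analysis easier.
Why develop this complex business logic with the Spark RDD API rather than plain SQL? SQL certainly works, but it is best suited to large offline batch ETL jobs and statistical analysis. Statistical reports have flexible requirements that are added to and changed often, and SQL fits that well. What we are building, however, is a system that works together with a Java web application, with a fixed set of modules and requirements. The drawback of SQL is that Spark generates the execution plan and code automatically, so deep tuning is hardly possible and problems are hard to resolve. For a system like ours, with fixed requirements, few modules, and a need for speed and stability, the Spark RDD API is the best choice: RDD is the lowest-level API, so we control almost everything, including parameter tuning and restructuring to deal with data skew, and when an error occurs the stack trace points at the lowest-level source, which makes the problem easy to locate and fix. That is why complex business modules are largely built on the Spark RDD API.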
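To make the SQL versus RDD trade-off concrete, here is a minimal sketch that computes the same "top visitors" result both ways. The `Visit` case class, the sample rows, and the object name are made up for illustration and are not part of the original project. The SQL version leaves planning to Spark, while the RDD version spells out every step (filter, map, `reduceByKey`, `takeOrdered`), which is the kind of control the paragraph above argues for.

```scala
import org.apache.spark.sql.SparkSession

object RddVersusSqlSketch {
  // Hypothetical record type used only in this sketch.
  case class Visit(userId: Long, actionType: Long)

  // Made-up rows: actionType 0 is a visit, 1 is a purchase.
  val sample: Seq[Visit] = Seq(
    Visit(1, 0), Visit(1, 0), Visit(1, 0),
    Visit(2, 0), Visit(2, 0), Visit(2, 1),
    Visit(3, 0))

  // Returns the top-n visitors as (userId, visitCount), once via SQL and once via the RDD API.
  def topVisitors(spark: SparkSession, n: Int): (Seq[(Long, Long)], Seq[(Long, Long)]) = {
    import spark.implicits._
    val visits = sample.toDS()

    // SQL: Spark derives the execution plan from the query text.
    visits.createOrReplaceTempView("visits")
    val viaSql = spark.sql(
      s"""SELECT userId, COUNT(*) AS cnt
         |FROM visits
         |WHERE actionType = 0
         |GROUP BY userId
         |ORDER BY cnt DESC, userId ASC
         |LIMIT $n""".stripMargin)
      .as[(Long, Long)]
      .collect()
      .toSeq

    // RDD API: each step is written out explicitly.
    val viaRdd = visits.rdd
      .filter(_.actionType == 0)
      .map(v => (v.userId, 1L))
      .reduceByKey(_ + _)
      // Highest count first, ties broken by the smaller userId.
      .takeOrdered(n)(Ordering.by[(Long, Long), (Long, Long)] { case (id, cnt) => (-cnt, id) })
      .toSeq

    (viaSql, viaRdd)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddVersusSqlSketch").master("local[*]").getOrCreate()
    try {
      val (viaSql, viaRdd) = topVisitors(spark, 2)
      println(s"SQL: $viaSql")
      println(s"RDD: $viaRdd")
    } finally spark.stop()
  }
}
```

Both variants should agree on the result. The difference lies in how much of the execution is visible and adjustable in the code.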

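Requirements

1. The 10 users with the most visits within a given time range
2. The 10 users with the highest purchase amount within a given time range
3. The 10 users whose visit count grew the most in the latest period compared with the previous period
4. The 10 users whose purchase amount grew the most in the latest period compared with the previous period
5. Among new users registered within a given period, the 10 users with the most visits in their first 7 days
6. Among new users registered within a given period, the 10 users with the highest purchase amount in their first 7 days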
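The period-over-period requirements can be reduced to one aggregation: weight each record in the latest period with +1, each record in the previous period with -1, and sum per user. The sketch below shows that idea on plain Scala collections, without Spark, so the arithmetic is easy to follow. The `Visit` type, the sample rows, and the function names are hypothetical, and the period bounds are half-open (`[start, end)`) so that the last day of a month is fully included.

```scala
object PeriodGrowthSketch {
  // Hypothetical minimal record: who visited and when ("yyyy-MM-dd HH:mm:ss").
  final case class Visit(userId: Long, actionTime: String)

  // Made-up sample data for illustration.
  val sample: Seq[Visit] = Seq(
    Visit(1, "2016-09-10 08:00:00"),
    Visit(1, "2016-10-02 09:00:00"),
    Visit(1, "2016-10-31 23:59:59"),
    Visit(2, "2016-09-05 10:00:00"),
    Visit(2, "2016-09-20 10:00:00"),
    Visit(2, "2016-10-15 10:00:00"),
    Visit(3, "2016-10-01 00:00:00"),
    Visit(3, "2016-10-03 00:00:00"))

  // +1 for a visit in [curStart, curEnd), -1 for a visit in [prevStart, curStart), 0 otherwise.
  // Timestamps in this format sort correctly as plain strings.
  def weight(v: Visit, prevStart: String, curStart: String, curEnd: String): Int =
    if (v.actionTime >= curStart && v.actionTime < curEnd) 1
    else if (v.actionTime >= prevStart && v.actionTime < curStart) -1
    else 0

  // Sum the weights per user and keep the n users with the largest growth.
  def topGrowth(visits: Seq[Visit], prevStart: String, curStart: String,
                curEnd: String, n: Int): Seq[(Long, Int)] =
    visits
      .groupBy(_.userId)
      .map { case (id, vs) => id -> vs.map(weight(_, prevStart, curStart, curEnd)).sum }
      .toSeq
      .sortBy { case (id, incr) => (-incr, id) }
      .take(n)

  def main(args: Array[String]): Unit =
    topGrowth(sample, "2016-09-01", "2016-10-01", "2016-11-01", 2).foreach(println)
}
```

In this sample, user 3 has two October visits and none in September, so the growth is 2. User 1 has two October visits and one September visit, so the growth is 1. The same weighting works for purchase amounts by using the amount and its negation instead of +1 and -1.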

Code

package com.scala.MyUserActiveDegreeAnalyze

import org.apache.spark.sql.SparkSession

object MyScalaActiveAnalyze {
    case class UserActionLog(logId:Long,userId:Long,actionTime:String,actionType:Long,purchaseMoney:Double)
    case class UserActionLogValue(logId:Long,userId:Long,actionValue:Long)
    case class UserActionLogWithPurchaseMoneyValue(logId:Long,userId:Long,purchaseMoney:Double)

    def main(args: Array[String]): Unit = {

        val startDate = "2016-09-01"
        val endDate = "2016-11-01"

        val spark=SparkSession.builder().appName("UserAnalyze").master("local")
            //.config()
            .getOrCreate()
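        // Import Spark's implicit conversions (enables $"col" and .as[T])
        import spark.implicits._
        // Import the Spark SQL functions (count, sum, round, date_add, ...)
        import org.apache.spark.sql.functions._
        // Load the data sets
        val userBaseInfo = spark.read.json("D:\\study\\拟定新\\user_base_info.json")
        // The sample action log writes logId with leading zeros (00, 01, ...),
        // which the JSON reader rejects by default, so allow them explicitly
        val userActionLog = spark.read
            .option("allowNumericLeadingZeros", "true")
            .json("D:\\study\\拟定新\\user_action_log.json")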


    /*    // Feature 1: the 10 users with the most visits in the given time range

        // 1. Filter: keep only visit records (actionType = 0) inside the time range
        userActionLog.filter(s"actionTime >= '$startDate' and actionTime <= '$endDate' and actionType = 0")
            // 2. Join with the matching user base info
            .join(userBaseInfo, userActionLog("userId") === userBaseInfo("userId"))
            // 3. Group by userId and username
            .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
            // 4. Aggregate: count the visits per user
            .agg(count(userActionLog("logId")).alias("actionCount"))
            // 5. Sort by visit count, descending
            .sort($"actionCount".desc)
            // 6. Keep the first 10 rows
            .limit(10)
            // 7. Print the result
            .show()*/
        // Feature 2: the 10 users with the highest purchase amount in the given time range
  /*      userActionLog
            .filter(s"actionTime >= '$startDate' and actionTime <= '$endDate' and actionType = 1")
            .join(userBaseInfo, userActionLog("userId") === userBaseInfo("userId"))
            .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
            .agg(round(sum(userActionLog("purchaseMoney")), 2).alias("totalPurchaseMoney"))
            .sort($"totalPurchaseMoney".desc)
            .limit(10)
            .show()

        // Feature 3: the 10 users whose visit count grew the most in the latest period
        // compared with the previous period.
        // The period is set to one month here, but it can be chosen freely.
        // Example: if user Zhang San visits 100 times in September and 200 times in October,
        // the growth over these two periods is 100.
        // With a one-month period, the latest period is 2016-10-01~2016-10-31 and
        // the previous period is 2016-09-01~2016-09-30.
        // Each visit in the latest period counts +1 and each visit in the previous period
        // counts -1, so the per-user sum is the growth.
        val userActionNewPeriod = userActionLog.as[UserActionLog]
            // actionTime includes a time of day, so use an exclusive upper bound to keep all of 10-31
            .filter("actionTime >= '2016-10-01' and actionTime < '2016-11-01' and actionType = 0")
            .map(userActionLogEntry => UserActionLogValue(userActionLogEntry.logId, userActionLogEntry.userId, 1))
        val userActionLastPeriod = userActionLog.as[UserActionLog]
            .filter("actionTime >= '2016-09-01' and actionTime < '2016-10-01' and actionType = 0")
            .map(userActionLogEntry => UserActionLogValue(userActionLogEntry.logId, userActionLogEntry.userId, -1))

        val userActionLogsDataSet = userActionNewPeriod.union(userActionLastPeriod)
        userActionLogsDataSet.join(userBaseInfo, userActionLogsDataSet("userId") === userBaseInfo("userId"))
            .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
            .agg(sum(userActionLogsDataSet("actionValue")).alias("actionIncr"))
            .sort($"actionIncr".desc)
            .limit(10)
            .show()

*/


        // Feature 4: the 10 users whose purchase amount grew the most in the latest period compared with the previous period
//         val userActionMoneyNewPeriod = userActionLog.as[UserActionLog]
//             // actionTime includes a time of day, so use an exclusive upper bound to keep all of 10-31
//             .filter("actionTime >= '2016-10-01' and actionTime < '2016-11-01' and actionType = 1")
//             .map(actionEntry => UserActionLogWithPurchaseMoneyValue(actionEntry.logId, actionEntry.userId, actionEntry.purchaseMoney))
//         val userActionMoneyLastPeriod = userActionLog.as[UserActionLog]
//             .filter("actionTime >= '2016-09-01' and actionTime < '2016-10-01' and actionType = 1")
//             .map(actionEntry => UserActionLogWithPurchaseMoneyValue(actionEntry.logId, actionEntry.userId, -actionEntry.purchaseMoney))
//        val userActionPeriodMoneyDataSet = userActionMoneyNewPeriod.union(userActionMoneyLastPeriod)
//        // Join on the unioned Dataset's own userId column
//        userActionPeriodMoneyDataSet.join(userBaseInfo, userActionPeriodMoneyDataSet("userId") === userBaseInfo("userId"))
//            .groupBy(userBaseInfo("userId"), userBaseInfo("username"))
//            .agg(round(sum(userActionPeriodMoneyDataSet("purchaseMoney")), 2).alias("purchaseMoneyIncr"))
//            // Sort by growth before taking the top 10
//            .sort($"purchaseMoneyIncr".desc)
//            .limit(10)
//            .show()

        // Feature 5: among users registered in the given range, the 10 users with the most visits in their first 7 days
        userActionLog.join(userBaseInfo, userActionLog("userId") === userBaseInfo("userId"))
            .filter(userBaseInfo("registTime") >= "2016-10-01"
                // registTime includes a time of day, so use an exclusive upper bound to keep all of 10-31
                && userBaseInfo("registTime") < "2016-11-01"
                && userActionLog("actionTime") >= userBaseInfo("registTime")
                && userActionLog("actionTime") <= date_add(userBaseInfo("registTime"), 7)
                && userActionLog("actionType") === 0
            ).groupBy(userBaseInfo("userId"), userBaseInfo("username"))
            .agg(count(userActionLog("logId")).alias("actionCount"))
            .sort($"actionCount".desc)
            .limit(10)
            .show()


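        // Feature 6: among users registered in the given range, the 10 users with the highest purchase amount in their first 7 days
        userActionLog.join(userBaseInfo, userActionLog("userId") === userBaseInfo("userId"))
            .filter(userBaseInfo("registTime") >= "2016-10-01"
                // registTime includes a time of day, so use an exclusive upper bound to keep all of 10-31
                && userBaseInfo("registTime") < "2016-11-01"
                && userActionLog("actionTime") >= userBaseInfo("registTime")
                && userActionLog("actionTime") <= date_add(userBaseInfo("registTime"), 7)
                // Only purchase records carry a non-zero purchaseMoney
                && userActionLog("actionType") === 1
            ).groupBy(userActionLog("userId"), userBaseInfo("username"))
            .agg(sum(userActionLog("purchaseMoney")).alias("MoneyBuy"))
            .sort($"MoneyBuy".desc)
            .limit(10)
            .show()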

    }

}
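The listing compares `actionTime` and `registTime` as strings. As an alternative, the "first 7 days after registration" window can be expressed on real timestamps. The following is a minimal sketch under stated assumptions: the in-memory rows, the column names `actionUserId`, `registTs`, `actionTs`, and the object name are invented for illustration, the session time zone is pinned to UTC so that the window arithmetic is deterministic, and the window is half-open (`[registTs, registTs + 7 days)`).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FirstWeekWindowSketch {
  // Returns (userId, visitCount) for visits made in the first 7 days after registration.
  def firstWeekVisitCounts(spark: SparkSession): Seq[(Long, Long)] = {
    import spark.implicits._
    // Assumption: pin the session time zone so string-to-timestamp casts are deterministic.
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    // Made-up users, with registTime cast to a real timestamp.
    val users = Seq(
      (1L, "2016-10-04 18:06:25"),
      (2L, "2016-10-31 23:00:00")
    ).toDF("userId", "registTime")
      .withColumn("registTs", $"registTime".cast("timestamp"))

    // Made-up action log rows: (user, time, actionType), 0 = visit, 1 = purchase.
    val actions = Seq(
      (1L, "2016-10-04 18:06:25", 0L), // at the registration instant: inside the window
      (1L, "2016-10-11 18:06:24", 0L), // one second before the window closes: inside
      (1L, "2016-10-11 18:06:25", 0L), // exactly 7 days later: outside
      (1L, "2016-10-05 10:00:00", 1L), // a purchase, not a visit
      (2L, "2016-11-02 09:00:00", 0L)  // inside user 2's window
    ).toDF("actionUserId", "actionTime", "actionType")
      .withColumn("actionTs", $"actionTime".cast("timestamp"))

    actions.join(users, $"actionUserId" === $"userId")
      .where($"registTs" >= lit("2016-10-01").cast("timestamp")
        && $"registTs" < lit("2016-11-01").cast("timestamp")
        && $"actionType" === 0
        && $"actionTs" >= $"registTs"
        && $"actionTs" < expr("registTs + INTERVAL 7 DAYS"))
      .groupBy($"userId")
      .agg(count(lit(1)).alias("actionCount"))
      .orderBy($"actionCount".desc, $"userId")
      .as[(Long, Long)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FirstWeekWindowSketch").master("local[*]").getOrCreate()
    try firstWeekVisitCounts(spark).foreach(println)
    finally spark.stop()
  }
}
```

Whether the seventh day should end at the registration time of day or at midnight is a business decision. The sketch picks the former only to make the boundary explicit.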


Data format

user_action_log.json (one JSON object per line):

{"logId": 00,"userId": 0, "actionTime": "2016-10-04 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 01,"userId": 0, "actionTime": "2016-10-17 15:42:45", "actionType": 1, "purchaseMoney": 33.36}
{"logId": 02,"userId": 0, "actionTime": "2016-10-18 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 03,"userId": 0, "actionTime": "2016-10-14 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 04,"userId": 0, "actionTime": "2016-11-09 08:45:33", "actionType": 1, "purchaseMoney": 664.35}
{"logId": 05,"userId": 0, "actionTime": "2016-10-13 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 06,"userId": 0, "actionTime": "2016-10-03 15:42:45", "actionType": 1, "purchaseMoney": 606.34}
{"logId": 07,"userId": 0, "actionTime": "2016-10-04 15:42:45", "actionType": 1, "purchaseMoney": 120.72}
{"logId": 08,"userId": 0, "actionTime": "2016-09-24 15:42:45", "actionType": 1, "purchaseMoney": 264.96}
{"logId": 09,"userId": 0, "actionTime": "2016-09-25 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 010,"userId": 0, "actionTime": "2016-11-09 08:45:33", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 011,"userId": 0, "actionTime": "2016-09-28 15:42:45", "actionType": 1, "purchaseMoney": 342.04}
{"logId": 012,"userId": 0, "actionTime": "2016-11-12 08:45:33", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 013,"userId": 0, "actionTime": "2016-10-02 15:42:45", "actionType": 1, "purchaseMoney": 945.63}
{"logId": 014,"userId": 0, "actionTime": "2016-10-09 15:42:45", "actionType": 1, "purchaseMoney": 907.8}
{"logId": 015,"userId": 0, "actionTime": "2016-10-18 15:42:45", "actionType": 1, "purchaseMoney": 520.64}
{"logId": 016,"userId": 0, "actionTime": "2016-11-11 08:45:33", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 017,"userId": 0, "actionTime": "2016-10-12 15:42:45", "actionType": 1, "purchaseMoney": 72.19}
{"logId": 018,"userId": 0, "actionTime": "2016-10-02 15:42:45", "actionType": 1, "purchaseMoney": 776.1}
{"logId": 019,"userId": 0, "actionTime": "2016-09-30 15:42:45", "actionType": 1, "purchaseMoney": 737.01}
{"logId": 020,"userId": 0, "actionTime": "2016-10-14 15:42:45", "actionType": 0, "purchaseMoney": 0.0}
{"logId": 021,"userId": 0, "actionTime": "2016-09-27 15:42:45", "actionType": 1, "purchaseMoney": 912.38}
{"logId": 022,"userId": 0, "actionTime": "2016-10-14 15:42:45", "actionType": 1, "purchaseMoney": 524.03}
{"logId": 023,"userId": 0, "actionTime": "2016-10-07 15:42:45", "actionType": 0, "purchaseMoney": 0.0}

user_base_info.json (one JSON object per line):

{"userId": 3, "username": "user3", "registTime": "2016-10-04 18:06:25"}
{"userId": 4, "username": "user4", "registTime": "2016-10-05 18:06:25"}
{"userId": 5, "username": "user5", "registTime": "2016-10-05 18:06:25"}
{"userId": 6, "username": "user6", "registTime": "2016-11-08 11:09:12"}
{"userId": 7, "username": "user7", "registTime": "2016-10-07 18:06:25"}
{"userId": 8, "username": "user8", "registTime": "2016-10-05 18:06:25"}
{"userId": 9, "username": "user9", "registTime": "2016-10-07 18:06:25"}
{"userId": 10, "username": "user10", "registTime": "2016-09-27 18:06:25"}
{"userId": 11, "username": "user11", "registTime": "2016-10-01 18:06:25"}
{"userId": 12, "username": "user12", "registTime": "2016-10-11 18:06:25"}
{"userId": 13, "username": "user13", "registTime": "2016-10-11 18:06:25"}
{"userId": 14, "username": "user14", "registTime": "2016-10-12 18:06:25"}
{"userId": 15, "username": "user15", "registTime": "2016-10-09 18:06:25"}
{"userId": 16, "username": "user16", "registTime": "2016-10-11 18:06:25"}
{"userId": 17, "username": "user17", "registTime": "2016-09-29 18:06:25"}
{"userId": 18, "username": "user18", "registTime": "2016-10-07 18:06:25"}
{"userId": 19, "username": "user19", "registTime": "2016-10-09 18:06:25"}
{"userId": 20, "username": "user20", "registTime": "2016-10-14 18:06:25"}
{"userId": 21, "username": "user21", "registTime": "2016-09-28 18:06:25"}
{"userId": 22, "username": "user22", "registTime": "2016-11-08 11:09:12"}
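
The action log lines above write `logId` with leading zeros (`00`, `01`, `010`), which strict JSON does not allow. One possible way to read such lines is to pass an explicit schema together with the JSON reader option `allowNumericLeadingZeros`. The sketch below parses two of the sample lines from an in-memory `Dataset[String]` instead of a file. The object name and the choice of `LongType`/`DoubleType` for the fields are assumptions that mirror the `UserActionLog` case class in the listing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object JsonSchemaSketch {
  // Assumed schema, mirroring the UserActionLog case class.
  val actionLogSchema: StructType = StructType(Seq(
    StructField("logId", LongType),
    StructField("userId", LongType),
    StructField("actionTime", StringType),
    StructField("actionType", LongType),
    StructField("purchaseMoney", DoubleType)))

  // Two lines copied from the sample data above.
  val sampleLines: Seq[String] = Seq(
    """{"logId": 01,"userId": 0, "actionTime": "2016-10-17 15:42:45", "actionType": 1, "purchaseMoney": 33.36}""",
    """{"logId": 010,"userId": 0, "actionTime": "2016-11-09 08:45:33", "actionType": 0, "purchaseMoney": 0.0}""")

  // Parses the sample lines and returns (logId, actionType, purchaseMoney) per record.
  def parse(spark: SparkSession): Seq[(Long, Long, Double)] = {
    import spark.implicits._
    spark.read
      .schema(actionLogSchema)                    // skip schema inference
      .option("allowNumericLeadingZeros", "true") // accept numbers such as 01 and 010
      .json(sampleLines.toDS())                   // json(Dataset[String]) is available since Spark 2.2
      .select($"logId", $"actionType", $"purchaseMoney")
      .as[(Long, Long, Double)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JsonSchemaSketch").master("local[*]").getOrCreate()
    try parse(spark).foreach(println)
    finally spark.stop()
  }
}
```

An explicit schema also avoids a full pass over the file for schema inference, which can matter for large logs.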
