Big Data E-commerce Project in Practice

1. Background and Requirements

We built a big data platform for an app with over ten million registered users and three to four million daily active users. The project addresses four pain points: gaps in marketing analysis, product iteration that could not be quantified, imprecise user operations, and the lack of real-time monitoring of company-wide operational metrics. It replaced coarse, manual operations management with data-driven operations.

2. Project Requirements

The requirements fall into three domains: traffic-domain analysis, business-domain analysis, and user-profile domain analysis.

3. Traffic-Domain Analysis

1. Basic metrics
(1) Overall overview
(2) User acquisition
covering channel traffic and APP data
(3) Activity and retention
covering visit traffic and user retention
(4) Event conversion
covering conversion of key events and of revenue events
(5) User characteristics
2. Advanced user behavior analysis
(1) Funnel analysis
(2) Retention analysis
(3) Distribution analysis
(4) Attribution analysis
(5) User path analysis
(6) Interval analysis
(7) Ad-hoc (custom) queries

4. Business-Domain Analysis

(1) Transaction domain
covering shopping-cart analysis, order GMV analysis, and repurchase analysis
(2) Marketing domain
covering coupon analysis, group-buy analysis, flash-sale (limited-time purchase) analysis, and other promotional activities
(3) Operations/campaign domain
covering ad placement analysis and new-user acquisition/registration analysis
(4) Membership domain

5. User-Profile Domain

Analysis in this domain centers on building user profiles. It relies on data modeling; this project uses classic dimensional modeling, specifically a fact constellation (galaxy) schema, as illustrated by the sketch below.
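As a hedged illustration of what the constellation schema means here (all table and column names below are hypothetical, not the project's actual DDL), several fact tables share the same conformed dimension tables:

-- hypothetical constellation schema: two fact tables sharing the same dimensions
create table dim_user  (guid string, gender string, city string, member_level string);
create table dim_date  (date_id string, year int, quarter int, month int, week_of_year int);

create table fact_traffic (date_id string, guid string, session_id string, pv_cnt bigint);
create table fact_order   (date_id string, guid string, order_id string, gmv decimal(10,2));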

6. Implementation: Traffic Domain

The first step of this data warehouse project was technology selection. After comparing alternatives we chose the following components: data collection: Flume; storage: HDFS; warehouse infrastructure: Hive; compute engine: Spark SQL; resource scheduling: YARN; job scheduling: Azkaban; metadata management: Atlas.
(1) First, we need to collect massive amounts of data to support the project. The data comes mainly from user behavior logs, business (transactional) data, and historical data; the behavior logs are produced by event tracking embedded in the Java front-end pages.
(2) Next, we use Kafka to buffer the log data. ZooKeeper must be started on the servers first, and then the Kafka brokers are run so that the events reported by the front-end pages can be collected into Kafka. The commands are: list the existing topics: bin/kafka-topics.sh --list --zookeeper linux01:2181; create the required topic: bin/kafka-topics.sh --create --topic app_log --replication-factor 2 --partitions 2 --zookeeper linux01:2181; and, to confirm that data is actually arriving in Kafka, start a console consumer on the topic: bin/kafka-console-consumer.sh --topic app_log --bootstrap-server linux01:9092,linux02:9092,linux03:9092 --from-beginning.
(3) Next, the warehouse is layered. We first create the ODS (operational data store) layer, whose tables hold the raw data ETL'd into the warehouse; these tables live in Hive. The raw ODS data is plain text, and parsing it as JSON requires the external SerDe jar "org.openx.data.jsonserde.JsonSerDe". Flume collects the JSON log data from Kafka into HDFS, and since the table schema lives in Hive, the table must be created before the data is loaded:
create database ods;

drop table ods.app_event_log;
create external table ods.app_event_log
(
    account         string,
    appId           string,
    appVersion      string,
    carrier         string,
    deviceId        string,
    deviceType      string,
    eventId         string,
    ip              string,
    latitude        double,
    longitude       double,
    netType         string,
    osName          string,
    osVersion       string,
    properties      map<string,string>,
    releaseChannel  string,
    resolution      string,
    sessionId       string,
    `timeStamp`       bigint
)
    partitioned by (y string,m string,d string)
    row format serde 'org.openx.data.jsonserde.JsonSerDe'
    stored as textfile
;

Since the data already sits on HDFS, it is loaded into the Hive partition with: load data inpath '/eventlog/app_event/20200730' into table ods.app_event_log partition (y='2020',m='07',d='30');
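As a quick sanity check (a hedged example; the 'pageid' key inside properties is assumed here based on how the field is used in later sections), the JSON fields, including the map-typed properties column, can be queried directly:

select eventId, sessionId, properties['pageid'] as pageid
from ods.app_event_log
where y='2020' and m='07' and d='30'
limit 10;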
(4) After ODS comes the DWD layer, which holds the flattened detail data produced by ETL'ing the ODS data; it is modeled with a loosely star-shaped ("not strictly star") dimensional model. Building it requires the current day's event log, the previous day's and current day's id-mapping binding tables, the geohash location table, and the IP location table. As preparation we build three dictionaries: a geo location dictionary, an IP location dictionary, and a device-guid binding dictionary. The first one starts from a GPS location dictionary in MySQL, a table named t_md_areas, which is flattened with the following SQL:

create table geo_tmp as
SELECT
    a.BD09_LNG,
    a.BD09_LAT,
    a.AREANAME as district,
    b.AREANAME as city,
    c.AREANAME as province
from t_md_areas a join t_md_areas b on a.`LEVEL`=3 and a.PARENTID = b.ID
                  join t_md_areas c on b.PARENTID = c.ID

Next, convert the GPS location dictionary from the previous step into a geohash dictionary and store it in Hive by running the program cn.doitedu.dwetl.area.GeoDictGen in the project:
// Imports needed by this listing; ch.hsr.geohash is assumed as the geohash dependency,
// and SparkUtil is a project-internal helper that builds the SparkSession.
import java.util.Properties

import ch.hsr.geohash.GeoHash
import org.apache.spark.sql.DataFrame

object GeoDictGen {

  def main(args: Array[String]): Unit = {

    val spark = SparkUtil.getSparkSession(GeoDictGen.getClass.getSimpleName)
    import spark.implicits._

    // load the GPS-coordinate location dictionary from MySQL
    val props = new Properties()
    props.setProperty("user","root")
    props.setProperty("password","123456")
    val tmp = spark.read.jdbc("jdbc:mysql://localhost:3306/realtimedw", "geo_tmp", props)
    tmp.show(10,false)


    // convert the GPS coordinates into geohash codes
    val res: DataFrame = tmp.map(row=>{
      // extract the fields: |BD09_LAT |BD09_LNG |province|city  |district|
      val lat = row.getAs[Double]("BD09_LAT")
      val lng = row.getAs[Double]("BD09_LNG")
      val province = row.getAs[String]("province")
      val city = row.getAs[String]("city")
      val district = row.getAs[String]("district")

      // call the geohash utility with the latitude/longitude to get a geohash string
      val geo = GeoHash.geoHashStringWithCharacterPrecision(lat, lng, 5)

      // assemble the result row
      (geo,province,city,district)
    }).toDF("geo","province","city","district")
    
    // write the result to the Hive table dim.geo_dict
    res.write.saveAsTable("dim.geo_dict")


    spark.close()

  }

}

Remember to change the source MySQL host, user/password, database name, and table name, and to put your own core-site.xml and hive-site.xml configuration files into the project. Next, prepare the IP location database, a file named ip2region.db, and place it under the HDFS directory /ip2region/. IP lookup can then be done with binary search; the IP geolocation toolkit is ip2region (which ships with the IP database), available at https://gitee.com/lionsoul/ip2region. Finally comes the device & guid binding dictionary (a job that runs every day): prepare an initial binding file (a.txt) with the content {"deviceid":"","guid":"","lst":[]} and put it under /idmp/bindtable/2020-07-29/ on HDFS;
then run the program cn.doitedu.dwetl.idmp.AppIdBInd to generate the binding table for 2020-07-30. The code is as follows:

case class BindScore(account: String, timestamp: Long, score: Int)

object AppIdBInd {
  def main(args: Array[String]): Unit = {
    var pre_bind_dt = ""
    var cur_bind_dt = ""
    var log_y = ""
    var log_m = ""
    var log_d = ""
    try {
      cur_bind_dt = args(0)
      pre_bind_dt = args(1)

      log_y = cur_bind_dt.split("-")(0)
      log_m = cur_bind_dt.split("-")(1)
      log_d = cur_bind_dt.split("-")(2)
    } catch {
      case e: Exception => println(
        """
          |
          |Usage:
          |arg 1: the date of the data to process, e.g. 2020-07-31
          |arg 2: the date of the previous (reference) binding table, e.g. 2020-07-30
          |arg 3: run mode: local|yarn
          |
          |""".stripMargin)
        sys.exit(1)
    }


    val spark = SparkUtil.getSparkSession("APP device-id and account-id binding",args(2))
    import spark.implicits._
    import org.apache.spark.sql.functions._

    // 1. Load the logs for day T (the processing day)
    val isBlank = udf((s: String) => {
      if (StringUtils.isBlank(s)) null else s
    })

    val tLog = spark.read.table("ods.app_event_log")
      .where(s"y=${log_y} and m=${log_m} and d=${log_d}")
      .select(isBlank('account) as "account", $"deviceid", $"timestamp")


    // 2. Deduplicate the (device, account) pairs in today's log, keeping only the earliest record per pair
    val window = Window.partitionBy('account, 'deviceid).orderBy('timestamp)
    // Spark SQL is used here and the shuffle resets the parallelism to the default of 200;
    // tune it with --conf spark.sql.shuffle.partitions if needed
    val pairs = tLog.select('account, 'deviceid, 'timestamp, row_number() over (window) as "rn")
      .where("rn=1")
      .select('account, 'deviceid, 'timestamp)
      .coalesce(2) // coalesce returns a new Dataset, so it must be chained (the original standalone call was a no-op)

    // 3. Within each device, give different accounts different scores (the earlier the login, the higher the score)
    // 3.1 Group the records by device
    val scoredRdd = pairs.rdd.map(row => {
      val account = row.getAs[String]("account")
      val deviceid = row.getAs[String]("deviceid")
      val timestamp = row.getAs[Long]("timestamp")
      (account, deviceid, timestamp)
    }).groupBy(_._2).map(tp => {
      val deviceid = tp._1

      // 3.2 Score the accounts in chronological order
      val lst = tp._2.toList.filter(_._1 != null).sortBy(_._3)
      val scoreLst: immutable.Seq[BindScore] = for (i <- 0 until lst.size) yield BindScore(lst(i)._1, lst(i)._3, 100 - 10 * i)

      // (device id, all accounts that logged in on it, with their scores)
      (deviceid, scoreLst.toList)
    })

    // 4. Load the T-1 day binding table; seed/sample record: {"deviceid":"","lst":[],"guid":""}
    val bindTable = spark.read.textFile(s"/idmp/bindtable/${pre_bind_dt}").coalesce(1)
    val bindTableRdd = bindTable.rdd.map(line => {
      val obj = JSON.parseObject(line)
      val deviceid = obj.getString("deviceid")
      val guid = obj.getString("guid")

      val lstArray = obj.getJSONArray("lst")
      val lst = new ListBuffer[BindScore]()
      for (i <- 0 until lstArray.size()) {
        val bindObj = lstArray.getJSONObject(i)
        val bindScore = BindScore(bindObj.getString("account"), bindObj.getLong("timestamp"), bindObj.getIntValue("score"))
        lst += bindScore
      }
      (deviceid, (lst.toList, guid))
    })

    /**
     * (d0,(List(BindScore(u0,7,100), BindScore(u1,8,20)),u0))
     * (d1,(List(BindScore(u1,9,100)),u1))
     * (d2,(List(BindScore(u2,8,100)),u2))
     * (d3,(List(BindScore(u2,9,90)),u2))
     * (d4,(List(),d4))
     * (d5,(List(),d5))
     */

    val joined = scoredRdd.fullOuterJoin(bindTableRdd)

    /**
     * (d0,(Some(List()),Some((List(BindScore(u0,7,100), BindScore(u1,8,20)),u0))))
     * (d1,(Some(List(BindScore(u1,11,100), BindScore(u2,13,90))),Some((List(BindScore(u1,9,100)),u1))))
     * (d2,(Some(List(BindScore(u2,14,100))),Some((List(BindScore(u2,8,100)),u2))))
     * (d3,(Some(List(BindScore(u3,14,100))),Some((List(BindScore(u2,9,90)),u2))))
     * (d4,(Some(List(BindScore(u4,15,100))),Some((List(),d4))))
     * (d5,(Some(List()),Some((List(),d5))))
     * (d8,(Some(List()),None))
     * (d9,(Some(List(BindScore(u4,18,100))),None))
     * (d6,(None,Some((List(BindScore(u4,18,100)),u4))))
     */
    val result = joined.map(tp => {
      val deviceid = tp._1
      val left: Option[List[BindScore]] = tp._2._1
      val right: Option[(List[BindScore], String)] = tp._2._2

      // pre-declare the variables to return: (deviceid, lst, guid)
      var resLst = List.empty[BindScore]
      var resGuid: String = ""

      // merge case by case
      // Case 1: the right side (history) has no record of this device
      if (right.isEmpty) {
        resLst = left.get
        if (resLst.size < 1) resGuid = deviceid else resGuid = getGuid(resLst)
      }

      // Case 2: the left side (today) has no record of this device
      // keep both lst and guid from history
      if (left.isEmpty) {
        resGuid = right.get._2
        resLst = right.get._1
      }

      // Case 3: both sides are defined; merge the scores of the two lists
      if (left.isDefined && right.isDefined) {

        val lst1 = left.get
        val lst2 = right.get._1

        // check whether both lists are empty
        if (lst1.size < 1 && lst2.size < 1) {
          resGuid = deviceid
        } else {
          // merge the scores
          resLst = mergeScoreList(lst1, lst2)
          resGuid = getGuid(resLst)
        }
      }

      // return the final result: (device id, account score list, guid)
      (deviceid, resLst, resGuid)
    })

    val jsonResult = result.map(tp => {
      val deviceid = tp._1
      val lst: List[BindScore] = tp._2
      val guid = tp._3

      val scores: util.ArrayList[AccountScore] = new util.ArrayList[AccountScore]()
      for (elem <- lst) {
        val as = new AccountScore(elem.account, elem.timestamp, elem.score)
        scores.add(as)
      }

      val gb = new GuidBinBean(deviceid, guid, scores)

      val gson = new Gson()
      gson.toJson(gb)
    })

    jsonResult.coalesce(1).saveAsTextFile(s"/idmp/bindtable/${cur_bind_dt}")

    spark.close()
  }

  def getGuid(lst: List[BindScore]): String = {
    val sorted = lst.sortBy(b => (-b.score, b.timestamp))
    sorted(0).account
  }

  def mergeScoreList(lst1: List[BindScore], lst2: List[BindScore]): List[BindScore] = {
    val lst = lst1 ::: lst2
    // List(BindScore(u1,11,100), BindScore(u2,13,90),BindScore(u1,9,100))
    val sorted = lst.sortBy(b => (b.account, b.timestamp))
    val res: immutable.Iterable[BindScore] = sorted.groupBy(_.account).map(tp => {
      val account = tp._1
      tp._2.reduce((x, y) => BindScore(x.account, x.timestamp, x.score + y.score))
    })
    res.toList
  }

}

With the preparation above in place, this stage is implemented as a Spark program. Analyzing the requirements gives the following processing steps:
(1) Cleaning and filtering
1. Remove deprecated fields from the JSON body (useless fields left behind after the front-end tracking schema changed);
2. Filter out records whose JSON is malformed (dirty data);
3. Filter out records where both account and deviceid are empty;
4. Filter out records missing any key field (event/eventid/sessionid: missing any one of them disqualifies the record);
5. Filter out records whose timestamps fall outside the processing window (app log uploads may be delayed, so some data arrives late);
6. For web logs, filter out crawler requests (identified from the user-agent string).
(2) Data parsing
Flatten the JSON data into a flat record format.
Note: the properties field does not need to be flattened; storing it as a Map type is sufficient.
(3) Session splitting
1. Web logs already carry natural sessions and need no processing.
2. For app logs, login keep-alive means that an app returning to the foreground after a long time in the background still reports the same session, which does not match the analytical definition of a session; sessions must be re-split by event gap (the industry convention is 30 minutes); see the HiveQL sketch after this requirements list.
3. WeChat mini-program logs behave like app logs: sessions stay valid for a very long time and must likewise be split by a 30-minute event gap.
(4) Data normalization
Boolean fields appear as 1/0 in some records and as true/false in others; normalize them to Y/N/U.
String fields contain both empty strings and nulls; normalize them to null.
(5) Dimension integration
1. Resolve the GPS coordinates in the log into province / city / county (district) information (to support later geographic analysis);
2. Resolve the IP address in the log into province / city / county (district) information (same purpose).
Note: app and mini-program logs carry the GPS coordinates captured at the time of the user event,
whereas web logs cannot capture GPS coordinates but do capture the IP address;
GPS coordinates give a precise location, while an IP address only gives a coarse, lower-accuracy one.
3. Parse the timestamp into year, quarter, month, day, week of year, week of month, and day of year.
(6) ID_MAPPING
Generate a globally unique identifier (guid) for every user.
Choosing a suitable user identifier has a large impact on the accuracy of behavior analysis, especially user-centric features such as funnels, retention, and sessions.
Therefore, before ingesting any data, we must first decide how users will be identified.
New/returning visitor flag:
new visitors are flagged as 1,
returning visitors are flagged as 0.
Saving the result
Finally, write the output in Parquet format with Snappy compression.
Note: Parquet and ORC are both columnar file formats with similar advantages for analytical read workloads.
In practical benchmarks (read, write, and compression performance) ORC is slightly better than Parquet.
Either could be used here; the argument for Parquet is its broader framework compatibility, e.g., Impala supports Parquet but not ORC.
The id-mapping step, i.e., the new/returning user flag, is already implemented in the code shown above.
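The session-splitting step itself is not shown in this write-up; the sketch below illustrates the 30-minute-gap rule in HiveQL. It is only a hedged illustration: the staging table dwd_tmp.app_event_washed and its columns are hypothetical, and in the actual pipeline this logic would live inside the Spark ETL program.

-- 30-minute-gap session splitting (hypothetical staging table)
select
    guid,
    sessionid,
    ts,
    -- a running count of "gap > 30 min" flags, appended to the reported sessionid,
    -- yields a new session id every time the gap is exceeded
    concat(sessionid, '_',
           sum(if(prev_ts is null or ts - prev_ts > 30*60*1000, 1, 0))
               over (partition by guid, sessionid order by ts)) as new_session_id
from (
    select
        guid,
        sessionid,
        ts,
        lag(ts) over (partition by guid, sessionid order by ts) as prev_ts
    from dwd_tmp.app_event_washed
) t;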
With this preparation done, the corresponding table is created in Hive to store the processed DWD logs:

drop table  if exists  dwd.app_event_dtl;
create table dwd.app_event_dtl(
account             string,
appid               string,
appversion          string,
carrier             string,
deviceid            string,
devicetype          string,
eventid             string,
ip                  string,
latitude            double,
longitude           double,
nettype             string,
osname              string,
osversion           string,
properties          map<string,string>,
releasechannel      string,
resolution          string,
sessionid           string,
ts                  bigint,
province            string,
city                string,
district            string,
guid                string,
isNew               int
)
partitioned by (dt string)
stored as orc
;

To move the data from ODS into DWD, the following program is run:

/**
 * ODS -> DWD ETL program.
 * Requirements:
 *   1. JSON parsing, cleaning and filtering
 *   2. Geolocation resolution (resolve the GPS coordinates first, then fall back to the IP address)
 *   3. GUID assignment
 *   4. New/returning user flag
 * Input data:
 *   1. The current day's event log
 *   2. The previous day's and current day's id-mapping binding tables
 *   3. The geohash location table
 *   4. The IP location table
 */

object Ods2Dwd {

  def main(args: Array[String]): Unit = {
    var cur_dt = ""
    var pre_dt = ""
    var log_y = ""
    var log_m = ""
    var log_d = ""

    try {
      cur_dt = args(0)
      pre_dt = args(1)

      log_y = cur_dt.split("-")(0)
      log_m = cur_dt.split("-")(1)
      log_d = cur_dt.split("-")(2)
    } catch {
      case e: Exception => println(
        """
          |
          |Usage:
          |arg 1: the date of the log data to process, e.g. 2020-07-31
          |arg 2: the date of the previous day's device binding table, e.g. 2020-07-30
          |arg 3: run mode: local|yarn
          |
          |""".stripMargin)
        sys.exit(1)
    }
    val spark = SparkUtil.getSparkSession("ODS to DWD processing",args(2))
    import spark.implicits._

    // 1. Load the current day's log data from the Hive table ods.app_event_log
    val eventLog = spark
      .read
      .table("ods.app_event_log")
      .where(s"y=${log_y} and m=${log_m}  and d=${log_d}")

    // 2. Filter dirty data: missing key fields, out-of-range timestamps, crawler requests, etc.
    val washedEventLog = eventLog.rdd.map(BeanUtil.row2EventLogBean(_))
      .filter(bean => {
        var flag = true
        if (bean == null) {
          flag = false
        } else {
          if (StringUtils.isBlank(bean.account) && StringUtils.isBlank(bean.deviceid)) flag = false
          if (StringUtils.isBlank(bean.eventid) || StringUtils.isBlank(bean.sessionid) || bean.properties == null) flag = false

          // keep only records whose timestamp falls inside the processing day
          // (this check sits inside the else branch so a null bean is never dereferenced)
          val sdf = new SimpleDateFormat("yyyy-MM-dd")
          val startTime = sdf.parse(s"${cur_dt}").getTime
          val endTime = startTime + 24 * 60 * 60 * 1000
          if (bean.timestamp > endTime || bean.timestamp < startTime) flag = false
        }

        flag
      })

    // 3. Geolocation resolution
    // load the geo location dictionary and broadcast it
    val geoDF = spark.read.table("dim.geo_dict")
    val geoMap = geoDF.rdd.map(row => {
      val geo = row.getAs[String]("geo")
      val province = row.getAs[String]("province")
      val city = row.getAs[String]("city")
      val district = row.getAs[String]("district")
      (geo, (province, city, district))
    }).collectAsMap()
    val bc = spark.sparkContext.broadcast(geoMap)

    // load the ip2region database bytes from HDFS and broadcast them
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://doitedu01:8020")
    val fs = FileSystem.get(conf)
    val file = new Path("/ip2region/ip2region.db")
    val len = fs.getFileStatus(file).getLen
    val inputStream = fs.open(file)
    val ba = new Array[Byte](len.toInt)
    inputStream.readFully(ba)

    val bc2 = spark.sparkContext.broadcast(ba)

    /**
     * Geolocation resolution
     */
    val regionedRdd = washedEventLog.mapPartitions(iter => {
      // take the geo dictionary out of the broadcast variable
      val geoDict: collection.Map[String, (String, String, String)] = bc.value
      // take the ip2region database out of the broadcast variable
      val ipDictBytes: Array[Byte] = bc2.value

      // build the IP-range index searcher
      val config = new DbConfig()
      val searcher = new DbSearcher(config, ipDictBytes)

      iter.map(
        bean => {
          val longitude = bean.longitude
          val latitude = bean.latitude

          var area: (String, String, String) = (null, null, null)
          // if the coordinates are valid, convert them to a geohash and look it up in the geo dictionary
          if (longitude > -180 && longitude < 180 && latitude > -90 && latitude < 90) {
            val geo = GeoHash.geoHashStringWithCharacterPrecision(latitude, longitude, 5)
            area = geoDict.getOrElse(geo, (null, null, null))
          }

          // if the geohash lookup yielded nothing, fall back to the IP address
          if (area._1 == null && StringUtils.isNotBlank(bean.ip)) {
            val block = searcher.memorySearch(bean.ip)
            val arr = block.getRegion.split("\\|")
            if (arr.length > 3) {
              area = (arr(2), arr(3), null)
            }

          }
          // fill in the resolved location
          bean.province = if (area._1 == null || area._1.equals("0")) null else area._1
          bean.city = if (area._2 == null || area._2.equals("0")) null else area._2
          bean.district = if (area._3 == null || area._3.equals("0")) null else area._3
          // return the enriched bean
          bean
        }
      )
    })

    /**
     * GUID assignment
     */
    // load the current day's guid/device binding table
    val bindTableRdd = BindTableUtil.loadBindTable(spark, s"${cur_dt}")
    val bindMap = bindTableRdd.collectAsMap()
    val bc3 = spark.sparkContext.broadcast(bindMap)
    val guidedRdd = regionedRdd.map(bean => {
      val guidDict = bc3.value

      var guid: String = bean.account
      // if account is present, guid = account; otherwise look the deviceid up in the binding dictionary
      if (StringUtils.isBlank(bean.account)) {
        guid = guidDict.getOrElse(bean.deviceid, null)
      }

      // fill in the guid
      bean.guid = guid

      // return the bean
      bean
    })

    /**
     * Flag new vs. returning visitors
     * Rule: if the guid already exists in the previous day's device binding table, the visitor is returning; otherwise it is new
     */
    // load the previous day's device binding table  // {"deviceid":"","lst":[],"guid":""}
    val bindTableRdd2: RDD[(String, String)] = BindTableUtil.loadBindTable(spark, s"${pre_dt}")
    val yesterdayGUIDs = bindTableRdd2.map(_._2).collect().toSet
    val bc4 = spark.sparkContext.broadcast(yesterdayGUIDs)
    val newOldFlagRdd: RDD[EventLogBean] = guidedRdd.map(bean => {
      val oldGuids = bc4.value
      if (oldGuids.contains(bean.guid)) bean.isNew = 0

      bean
    })

    /**
     * Save the result
     * Destination: the DWD-layer table dwd.app_event_dtl
     */
    val resultDF = newOldFlagRdd.toDF().withColumnRenamed("timestamp", "ts")

    resultDF.createTempView("res")
    spark.sql(
      s"""
        |
        |insert into table dwd.app_event_dtl partition(dt='${cur_dt}')
        |select
        |account             ,
        |appid               ,
        |appversion          ,
        |carrier             ,
        |deviceid            ,
        |devicetype          ,
        |eventid             ,
        |ip                  ,
        |latitude            ,
        |longitude           ,
        |nettype             ,
        |osname              ,
        |osversion           ,
        |properties          ,
        |releasechannel      ,
        |resolution          ,
        |sessionid           ,
        |ts                  ,
        |province            ,
        |city                ,
        |district            ,
        |guid                ,
        |isNew
        |
        |from res
        |
        |""".stripMargin)

    spark.close()
  }
}

Once this program has run, the processed data sits in Hive at the DWD layer.
(5) After DWD comes the DWS layer, which holds lightly aggregated data derived from DWD. For DWS we store on HDFS and compute with Spark or Hive; the modeling approach is subject-oriented (theme) modeling: the inputs are DWD fact tables and DIM dimension tables, and the outputs are DWS aggregate tables. First, the overall user behavior log table is created in Hive (here stored as Parquet):

drop table  if exists  dwd.app_event_dtl;
create table dwd.app_event_dtl(
account             string,
appid               string,
appversion          string,
carrier             string,
deviceid            string,
devicetype          string,
eventid             string,
ip                  string,
latitude            double,
longitude           double,
nettype             string,
osname              string,
osversion           string,
properties          map<string,string>,
releasechannel      string,
resolution          string,
sessionid           string,
ts                  bigint,
province            string,
city                string,
district            string,
guid                string,
isNew               int
)
partitioned by (dt string)
stored as parquet
;

After that, DWS tables are built per analysis subject: a traffic overview table, a user distribution table, a user activity table, a new-user retention table, an interaction event overview table, an on-site ad placement table, an off-site campaign table, a coupon analysis table, a red-packet analysis table, and a standard fixed-funnel analysis table. The traffic overview (session aggregate) table is created as follows:

DROP TABLE IF EXISTS DWS.APP_PV_AGG_SESSION;
CREATE TABLE DWS.APP_PV_AGG_SESSION(
   guid          STRING,  -- globally unique user id
   session_id    STRING,  -- session id
   start_time    BIGINT,  -- session start time
   end_time      BIGINT,
   first_page    STRING,
   last_page     STRING,
   pv_cnt        BIGINT,
   isnew         INT,
   hour          INT,
   province      STRING,
   city          STRING,
   district      STRING,
   device_type   STRING
)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
;

During development we also used the Gitee (码云) code-hosting platform; its usage can be looked up online.
The following script loads data into the table created above (note the Hive caveat in the comments):

-- extraction/aggregation SQL script for dws.app_pv_agg_session
-- a Hive bug may break this statement; if it does, disable vectorized execution:
-- set hive.vectorized.execution.enabled = false;
INSERT INTO TABLE dws.app_pv_agg_session PARTITION (dt = '2020-07-31')
SELECT guid,
       sessionid       as session_id,
       min(ts)         as start_time,
       max(ts)         as end_time,
       max(split(area,',')[3]) as first_page,
       max(last_page)  as last_page,
       count(1)        as pv_cnt,
       max(isnew)      as isnew,
       max(split(area,',')[4])       as hour,
       max(split(area,',')[0])   as province,
       max(split(area,',')[1])       as city,
       max(split(area,',')[2])   as district,
       max(devicetype) as device_type

FROM (
         SELECT guid,
                sessionid,
                ts,
                isnew,
                first_value(area) over (partition by sessionid order by ts)     as area,
                last_value(pageid) over (partition by sessionid order by ts)    as last_page,
                devicetype
         FROM (
                  SELECT
                    guid,
                    isnew,
                    properties['pageid'] as pageid,
                    concat_ws(',',nvl(province,'UNKNOWN'),nvl(city,'UNKNOWN'),nvl(district,'UNKNOWN'),nvl(lpad(properties['pageid'],5,'0'),'00000'),nvl(lpad(hour(from_unixtime(cast((ts/1000) as bigint))),2,'0'),'00')) as area  ,
                    sessionid,
                    devicetype,
                    ts
                  FROM dwd.app_event_dtl
                  WHERE dt = '2020-07-31'
                    AND eventid = 'pageView' AND properties['pageid'] is not null
              ) o1
     ) o2
GROUP BY guid, sessionid
;
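Once the session aggregate exists, the traffic-overview metrics become simple queries over it. The query below is a hedged example of downstream use, not part of the original scripts:

-- daily PV, UV, session count and average session length (ms) per province
select province,
       sum(pv_cnt)                as pv,
       count(distinct guid)       as uv,
       count(1)                   as session_cnt,
       avg(end_time - start_time) as avg_session_len_ms
from dws.app_pv_agg_session
where dt = '2020-07-31'
group by province;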

The user distribution analysis subject can also be derived from the table built above. Next comes the user activity (continuous-active-range) table:

drop table if exists dws.app_user_active_range;
create table dws.app_user_active_range
(
    dtstr     string,  -- computation date
    guid      string,  -- guid
    first_dt  string,  -- user's first-visit date (acquisition date)
    rng_start string,  -- start date of the continuous-active range
    rng_end   string   -- end date of the continuous-active range
)
    partitioned by (dt string) stored as parquet
;
-- Source: the day-T traffic aggregate table dws.app_pv_agg_session, plus the day T-1 range table dws.app_user_active_range
-- Target: the day-T partition of the range table dws.app_user_active_range


-- Computation logic
-- 1. Process every range record: keep already-closed ranges as-is; for open ranges, check whether the user was active today and close the range if not
-- 2. Find users in the range table whose ranges are all closed, join them with today's active returning users, and open new ranges for them
-- 3. Filter the new users out of the traffic table and open ranges for them


-- returning users: handling of existing ranges
with a as (
    select *
    from dws.app_user_active_range
    where dt = '2020-07-29'
),
     oldu as (
         select guid
         from dws.app_pv_agg_session
         where dt = '2020-07-30'
           and isnew = 0
         group by guid
     ),
     newu as (
         select guid
         from dws.app_pv_agg_session
         where dt = '2020-07-30'
           and isnew = 1
         group by guid
     )

INSERT
INTO TABLE dws.app_user_active_range PARTITION (dt = '2020-07-30')
-- returning users: existing ranges
select '2020-07-30'                                                                as dtstr,
       a.guid,
       a.first_dt,
       a.rng_start,
       if(a.rng_end = '9999-12-31' and oldu.guid is null, '2020-07-29', a.rng_end) as rng_end
from a
         left join oldu on a.guid = oldu.guid


UNION ALL

-- returning users: open a new range
select '2020-07-30' as dtstr,
       o1.guid,
       o1.first_dt,
       '2020-07-30' as rng_start,
       '9999-12-31' as rng_end
from (
         select guid,
                max(first_dt) as first_dt
         from a
         group by guid
         having max(rng_end) != '9999-12-31'
     ) o1
         join oldu on o1.guid = oldu.guid

UNION ALL

-- new users: open a new range
select '2020-07-30' as dtstr,
       newu.guid,
       '2020-07-30' as first_dt,
       '2020-07-30' as rng_start,
       '9999-12-31' as rng_end
from newu
;
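As a hedged example of how this range table is consumed (not part of the original scripts), both "users active on a given day" and "users continuously active for at least N days" can be read straight off the ranges:

-- users active on 2020-07-28 (the date falls inside one of their ranges)
select count(distinct guid)
from dws.app_user_active_range
where dt = '2020-07-30'
  and rng_start <= '2020-07-28'
  and rng_end   >= '2020-07-28';

-- users continuously active for at least 7 days as of 2020-07-30
select count(distinct guid)
from dws.app_user_active_range
where dt = '2020-07-30'
  and rng_end = '9999-12-31'
  and datediff('2020-07-30', rng_start) >= 6;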

Next is the interaction event overview table:

drop table if exists dws.app_itr_agg_session;
create table dws.app_itr_agg_session
(
    dtstr        string,  -- date
    guid         string,  -- guid
    session_id   string,  -- session id
    event_id     string,  -- event type
    cnt          int,     -- occurrence count
    product_id   string,  -- product id
    page_id      string,  -- page id
    share_method string,  -- share method
    province     string,
    city         string,
    district     string,
    device_type  string
)
    partitioned by (dt string)
    stored as parquet
;
-- Source: dwd.app_event_dtl
-- Target: dws.app_itr_agg_session

-- aggregation script
insert into table dws.app_itr_agg_session partition (dt = '2020-07-30')
select '2020-07-30'              as dtstr,
       guid,
       sessionid                 as session_id,
       eventid                   as event_id,
       count(1)                  as cnt,
       properties['productid']   as product_id,
       properties['pageid']      as page_id,
       properties['sharemethod'] as share_method,
       province                  as province,
       city                      as city,
       district                  as district,
       devicetype                as device_type
from dwd.app_event_dtl
where eventid in ('collect', 'thumbup', 'share')
  and dt = '2020-07-30'
GROUP BY guid, sessionid, eventid, properties['productid']
        , properties['pageid'], properties['sharemethod'],
         province, city, district, devicetype
;

Next is the new-user retention analysis table:

-- new-user retention table

drop table  if exists  dws.app_user_retention;
create table dws.app_user_retention(
    dtstr    string,
    start_dt  string,
    retention_days int,
    retention_cnt  int
)
stored as parquet;
-- Source: the current-day partition of the continuous-active-range table dws.app_user_active_range
-- Target: the new-user retention table dws.app_user_retention
with tmp as (
    select first_dt,
           if(datediff(dt, first_dt)<=30,datediff(dt, first_dt),-1) as retention_days
    from dws.app_user_active_range
    where dt = '2020-07-30'
      and rng_end = '9999-12-31'
)

insert into table dws.app_user_retention
select '2020-07-30' as dtstr,
       first_dt,
       retention_days,
       count(1)
from tmp
group by first_dt, retention_days;
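A hedged example of reading this table: dividing each retained count by the cohort's day-0 count (the row with retention_days = 0 written on the cohort's first day) gives the day-N retention rate. This assumes the day-0 rows exist, which holds under the logic above because a new user is active on their first day:

-- day-N retention rate per cohort
select ret.start_dt,
       ret.retention_days,
       ret.retention_cnt / base.retention_cnt as retention_rate
from dws.app_user_retention ret
join dws.app_user_retention base
  on base.start_dt = ret.start_dt
 and base.retention_days = 0
where ret.retention_days between 1 and 30;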

Finally, the standard fixed-funnel analysis table:

create table dws.app_funnel_model(

    dtstr     string,
    guid      string,
    funnel_name string,
    funnel_step  int

)
partitioned by (dt string)
stored as parquet
;
drop table ev;
create table ev(guid string,eventid string,properties map<string,string>,ts bigint)
    row format delimited fields terminated by ','
        collection items terminated by '\|'
        map keys terminated by ':'
;

load data local inpath '/root/ev.txt' into table ev;

-- sample contents of /root/ev.txt:
a,e8,p1:v1|p6:v8,1
a,e1,p3:v4|p7:v1,2
a,e2,p1:v1|p6:v8,3
a,e1,p1:v1|p6:v8,4
a,e3,p4:v3|p6:v8,5
a,e4,p8:v5|p5:v3,6
a,e2,p1:v1|p6:v8,7
a,e3,p1:v1|p6:v8,8
a,e4,p1:v1|p6:v2,9
b,e3,p1:v1|p3:v8,1
b,e4,p1:v1|p6:v8,2
b,e2,p1:v1|p6:v1,3
b,e1,p3:v4|p2:v8,4
b,e4,p1:v1|p6:v8,5
b,e3,p4:v3|p6:v8,6
b,e3,p1:v1|p6:v5,7
b,e2,p1:v1|p6:v8,8

-- funnel model: search-to-purchase (搜购)
--   step 1: e1, with property p3 = v4
--   step 2: e3, with property p4 = v3
--   step 3: e4, with property p8 = v5

-- expected final result:
-- 搜购,step1,32
-- 搜购,step2,10
-- 搜购,step3,8

-- key technique: matching an ordered event sequence with regexp_extract
select regexp_extract('e1,e1,e4,e5','.*?(e1).*?(e3).*?',2);

-- development
-- filter out events that do not belong to the funnel model
with tmp as (
    select
        *
    from ev
    where (eventid='e1' and properties['p3']='v4')
       or (eventid='e3' and properties['p4']='v3')
       or (eventid='e4' and properties['p8']='v5')
    distribute by guid
    sort by ts
)

select
    '2020-07-30' as dtstr,
    '搜索购买' as model_name,
    guid,
    step
from
    (
        select
            guid,
            array(
                if(regexp_extract(evelst,'.*?(e1).*?',1)='e1',1,null),
                if(regexp_extract(evelst,'.*?(e1).*?(e3).*?',2)='e3',2,null),
                if(regexp_extract(evelst,'.*?(e1).*?(e3).*?(e4).*?',3)='e4',3,null)
                ) as arr
        from
            (
                select
                    guid,
                       -- 1_ev1,2_ev4,3_ev3,4_ev4
                    concat_ws(',',sort_array(collect_list(concat_ws('_',lpad(ts,12,'0'),eventid)))) as evelst
                from tmp
                group by guid
            ) o1
    ) o2
        lateral view explode(arr) tmp as step
where step is not null
;
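To produce the step counts shown in the expected result above, the exploded per-user steps can be aggregated. This is a hedged follow-up that assumes the select above has been inserted into dws.app_funnel_model (with its funnel_name/funnel_step columns) first:

-- funnel step counts: a user who reached step N contributes one row per step <= N,
-- so counting users per step gives "users reaching at least this step"
select dtstr,
       funnel_name,
       concat('step', funnel_step) as step,
       count(distinct guid)        as user_cnt
from dws.app_funnel_model
where dt = '2020-07-30'
group by dtstr, funnel_name, funnel_step;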

That completes the DWS-layer development.
This concludes the traffic-domain work; next comes the concrete development of the business domain.

7. Implementation: Business Domain

To analyze and store this data we use Sqoop for extraction: it imports the raw data held in MySQL into Hive. The business-domain ODS table for the order main information (extracted incrementally) is created as follows:

drop table if exists ods.oms_order_info;
CREATE TABLE `ods.oms_order_info` (
  `id` bigint                                      ,
  `member_id` bigint                               ,
  `coupon_id` bigint                               ,
  `order_sn` string                               ,
  `create_time` timestamp                          ,
  `member_username` string                        ,
  `total_amount` decimal                           ,
  `pay_amount` decimal                             ,
  `freight_amount` decimal                         ,
  `promotion_amount` decimal                       ,
  `integration_amount` decimal                     ,
  `coupon_amount` decimal                          ,
  `discount_amount` decimal                        ,
  `pay_type` int                                   ,
  `source_type` int                                ,
  `status` int                                     ,
  `order_type` int                                 ,
  `delivery_company` string                       ,
  `delivery_sn` string                            ,
  `auto_confirm_day` int                           ,
  `integration` int                                ,
  `growth` int                                     ,
  `promotion_info` string                         ,
  `bill_type` int                                  ,
  `bill_header` string                            ,
  `bill_content` string                           ,
  `bill_receiver_phone` string                    ,
  `bill_receiver_email` string                    ,
  `receiver_name` string                          ,
  `receiver_phone` string                         ,
  `receiver_post_code` string                     ,
  `receiver_province` string                      ,
  `receiver_city` string                          ,
  `receiver_region` string                        ,
  `receiver_detail_address` string                ,
  `note` string                                   ,
  `confirm_status` int                             ,
  `delete_status` int                              ,
  `use_integration` int                            ,
  `payment_time` timestamp                         ,
  `delivery_time` timestamp                        ,
  `receive_time` timestamp                         ,
  `comment_time` timestamp                         ,
  `modify_time` timestamp                          
)
partitioned by (dt string)
row format delimited fields terminated by '\001'
;

The Sqoop incremental import command is:

bin/sqoop import \
--connect jdbc:mysql://linux01:3306/realtimedw \
--username root \
--password tf069826 \
--table oms_order \
--target-dir '/user/hive/warehouse/ods.db/oms_order_info/dt=2020-07-30/'  \
--incremental lastmodified \
--check-column modify_time \
--last-value '2020-07-29 23:59:59'  \
--fields-terminated-by '\001' \
--null-string '\\N'       \
--null-non-string '\\N'   \
--compress   \
--compression-codec gzip  \
--split-by 'id'   \
-m 2

Next, map the imported data files to a Hive partition:

alter table ods.oms_order_info add partition ( dt= '2020-07-30') ;

Then create the DWD-layer table:

DROP TABLE IF EXISTS DWD.oms_order_info;
CREATE TABLE `DWD.oms_order_info` (
  `id` bigint                                      ,
  `member_id` bigint                               ,
  `coupon_id` bigint                               ,
  `order_sn` string                               ,
  `create_time` timestamp                          ,
  `member_username` string                        ,
  `total_amount` decimal                           ,
  `pay_amount` decimal                             ,
  `freight_amount` decimal                         ,
  `promotion_amount` decimal                       ,
  `integration_amount` decimal                     ,
  `coupon_amount` decimal                          ,
  `discount_amount` decimal                        ,
  `pay_type` int                                   ,
  `source_type` int                                ,
  `status` int                                     ,
  `order_type` int                                 ,
  `delivery_company` string                       ,
  `delivery_sn` string                            ,
  `auto_confirm_day` int                           ,
  `integration` int                                ,
  `growth` int                                     ,
  `promotion_info` string                         ,
  `bill_type` int                                  ,
  `bill_header` string                            ,
  `bill_content` string                           ,
  `bill_receiver_phone` string                    ,
  `bill_receiver_email` string                    ,
  `receiver_name` string                          ,
  `receiver_phone` string                         ,
  `receiver_post_code` string                     ,
  `receiver_province` string                      ,
  `receiver_city` string                          ,
  `receiver_region` string                        ,
  `receiver_detail_address` string                ,
  `note` string                                   ,
  `confirm_status` int                             ,
  `delete_status` int                              ,
  `use_integration` int                            ,
  `payment_time` timestamp                         ,
  `delivery_time` timestamp                        ,
  `receive_time` timestamp                         ,
  `comment_time` timestamp                         ,
  `modify_time` timestamp                          
)
partitioned by (dt string)
row format delimited fields terminated by '\001'
;

Next, load the 07-30 ODS increment directly into the 07-30 DWD full table as the initial full snapshot:

insert into dwd.oms_order_info partition(dt='2020-07-30')
select
`id`                           ,
`member_id`                    ,
`coupon_id`                    ,
`order_sn`                     ,
`create_time`                  ,
`member_username`              ,
`total_amount`                 ,
`pay_amount`                   ,
`freight_amount`               ,
`promotion_amount`             ,
`integration_amount`           ,
`coupon_amount`                ,
`discount_amount`              ,
`pay_type`                     ,
`source_type`                  ,
`status`                       ,
`order_type`                   ,
`delivery_company`             ,
`delivery_sn`                  ,
`auto_confirm_day`             ,
`integration`                  ,
`growth`                       ,
`promotion_info`               ,
`bill_type`                    ,
`bill_header`                  ,
`bill_content`                 ,
`bill_receiver_phone`          ,
`bill_receiver_email`          ,
`receiver_name`                ,
`receiver_phone`               ,
`receiver_post_code`           ,
`receiver_province`            ,
`receiver_city`                ,
`receiver_region`              ,
`receiver_detail_address`      ,
`note`                         ,
`confirm_status`               ,
`delete_status`                ,
`use_integration`              ,
`payment_time`                 ,
`delivery_time`                ,
`receive_time`                 ,
`comment_time`                 ,
`modify_time`                  
from ods.oms_order_info 
where dt='2020-07-30'
;

Sqoop import command for day 2 (07-31):

bin/sqoop import \
--connect jdbc:mysql://linux01:3306/realtimedw \
--username root \
--password tf069826 \
--table oms_order \
--target-dir '/user/hive/warehouse/ods.db/oms_order_info/dt=2020-07-31/'  \
--incremental lastmodified \
--check-column modify_time \
--last-value '2020-07-31 22:00:00'  \
--fields-terminated-by '\001' \
--null-string '\\N'       \
--null-non-string '\\N'   \
--compress   \
--compression-codec gzip  \
--split-by 'id'   \
-m 1

Map the data files written into the table directory to the daily partition of ods.oms_order_info:

hive> alter table ods.oms_order_info add partition ( dt= '2020-07-31') ;

DWD-layer ETL (merging the increment into a full snapshot):

bin/sqoop codegen \
--connect jdbc:mysql://linux01:3306/realtimedw \
--username root \
--password tf069826 \
--table oms_order \
--bindir /opt/apps/code/oms_order \
--class-name OmsOrder \
--fields-terminated-by "\001"
bin/sqoop merge \
--new-data /user/hive/warehouse/ods.db/oms_order_info/dt=2020-07-31 \
--onto /user/hive/warehouse/dwd.db/oms_order_info/dt=2020-07-30 \
--target-dir /user/hive/warehouse/dwd.db/oms_order_info/dt=2020-07-31 \
--jar-file /opt/apps/code/oms_order/OmsOrder.jar \
--class-name OmsOrder \
--merge-key id

Union the 07-30 DWD full snapshot with the 07-31 ODS increment to obtain the 07-31 DWD full snapshot:

with tmp as (
select
*
from dwd.oms_order_info where dt='2020-07-30'

union all

select
*
from ods.oms_order_info where dt='2020-07-31'
)

insert into table dwd.oms_order_info partition(dt='2020-07-31')
select
`id`                           ,
`member_id`                    ,
`coupon_id`                    ,
`order_sn`                     ,
`create_time`                  ,
`member_username`              ,
`total_amount`                 ,
`pay_amount`                   ,
`freight_amount`               ,
`promotion_amount`             ,
`integration_amount`           ,
`coupon_amount`                ,
`discount_amount`              ,
`pay_type`                     ,
`source_type`                  ,
`status`                       ,
`order_type`                   ,
`delivery_company`             ,
`delivery_sn`                  ,
`auto_confirm_day`             ,
`integration`                  ,
`growth`                       ,
`promotion_info`               ,
`bill_type`                    ,
`bill_header`                  ,
`bill_content`                 ,
`bill_receiver_phone`          ,
`bill_receiver_email`          ,
`receiver_name`                ,
`receiver_phone`               ,
`receiver_post_code`           ,
`receiver_province`            ,
`receiver_city`                ,
`receiver_region`              ,
`receiver_detail_address`      ,
`note`                         ,
`confirm_status`               ,
`delete_status`                ,
`use_integration`              ,
`payment_time`                 ,
`delivery_time`                ,
`receive_time`                 ,
`comment_time`                 ,
`modify_time`                  
from 
(
select
*,
row_number() over(partition by id order by modify_time desc) as rn
from tmp
) o
where rn=1

;

The extraction principle in this domain is: small tables are extracted in full, large tables incrementally.
The warehouse also uses the zipper-table (拉链表) pattern; an example follows:
-- zipper-table computation manual for the order table oms_order_info
-- 1. create the zipper table

drop table if exists dws.oms_order_info_zipper;
create table dws.oms_order_info_zipper(
   id                       bigint     ,                         
   member_id                bigint     ,                  
   coupon_id                bigint     ,                  
   order_sn                 string     ,                   
   create_time              timestamp  ,                
   member_username          string     ,            
   total_amount             decimal    ,               
   pay_amount               decimal    ,                 
   freight_amount           decimal    ,             
   promotion_amount         decimal    ,           
   integration_amount       decimal    ,         
   coupon_amount            decimal    ,              
   discount_amount          decimal    ,            
   pay_type                 int        ,                   
   source_type              int        ,                
   status                   int        ,                     
   order_type               int        ,                 
   delivery_company         string     ,           
   delivery_sn              string     ,                
   auto_confirm_day         int        ,           
   integration              int        ,                
   growth                   int        ,                     
   promotion_info           string     ,             
   bill_type                int        ,                  
   bill_header              string     ,                
   bill_content             string     ,               
   bill_receiver_phone      string     ,        
   bill_receiver_email      string     ,        
   receiver_name            string     ,              
   receiver_phone           string     ,             
   receiver_post_code       string     ,         
   receiver_province        string     ,          
   receiver_city            string     ,              
   receiver_region          string     ,            
   receiver_detail_address  string     ,    
   note                     string     ,                       
   confirm_status           int        ,             
   delete_status            int        ,              
   use_integration          int        ,            
   payment_time             timestamp  ,               
   delivery_time            timestamp  ,              
   receive_time             timestamp  ,               
   comment_time             timestamp  ,               
   modify_time              timestamp  ,  
   start_dt                 string     ,
   end_dt                   string     
)
partitioned by (dt  string)
;

-- computation
-- Source: the T-1 day zipper table plus the day-T order increment table ods.oms_order_info

with zipper as (

select * from dws.oms_order_info_zipper where dt='2020-07-31'

),
incr as (

select * from ods.oms_order_info where dt='2020-08-01'

)

insert into dws.oms_order_info_zipper partition(dt='2020-08-01')

select
zipper.id                                      ,
zipper.member_id                               ,
zipper.coupon_id                               ,
zipper.order_sn                                ,
zipper.create_time                             ,
zipper.member_username                         ,
zipper.total_amount                            ,
zipper.pay_amount                              ,
zipper.freight_amount                          ,
zipper.promotion_amount                        ,
zipper.integration_amount                      ,
zipper.coupon_amount                           ,
zipper.discount_amount                         ,
zipper.pay_type                                ,
zipper.source_type                             ,
zipper.status                                  ,
zipper.order_type                              ,
zipper.delivery_company                        ,
zipper.delivery_sn                             ,
zipper.auto_confirm_day                        ,
zipper.integration                             ,
zipper.growth                                  ,
zipper.promotion_info                          ,
zipper.bill_type                               ,
zipper.bill_header                             ,
zipper.bill_content                            ,
zipper.bill_receiver_phone                     ,
zipper.bill_receiver_email                     ,
zipper.receiver_name                           ,
zipper.receiver_phone                          ,
zipper.receiver_post_code                      ,
zipper.receiver_province                       ,
zipper.receiver_city                           ,
zipper.receiver_region                         ,
zipper.receiver_detail_address                 ,
zipper.note                                    ,
zipper.confirm_status                          ,
zipper.delete_status                           ,
zipper.use_integration                         ,
zipper.payment_time                            ,
zipper.delivery_time                           ,
zipper.receive_time                            ,
zipper.comment_time                            ,
zipper.modify_time                             ,
zipper.start_dt                                ,
if(zipper.end_dt='9999-12-31' and incr.id is not null,'2020-07-31',zipper.end_dt) as end_dt                  

from  zipper left join incr on zipper.id=incr.id  -- left join, so unchanged historical records carry forward

union all

select
id                                      ,
member_id                               ,
coupon_id                               ,
order_sn                                ,
create_time                             ,
member_username                         ,
total_amount                            ,
pay_amount                              ,
freight_amount                          ,
promotion_amount                        ,
integration_amount                      ,
coupon_amount                           ,
discount_amount                         ,
pay_type                                ,
source_type                             ,
status                                  ,
order_type                              ,
delivery_company                        ,
delivery_sn                             ,
auto_confirm_day                        ,
integration                             ,
growth                                  ,
promotion_info                          ,
bill_type                               ,
bill_header                             ,
bill_content                            ,
bill_receiver_phone                     ,
bill_receiver_email                     ,
receiver_name                           ,
receiver_phone                          ,
receiver_post_code                      ,
receiver_province                       ,
receiver_city                           ,
receiver_region                         ,
receiver_detail_address                 ,
note                                    ,
confirm_status                          ,
delete_status                           ,
use_integration                         ,
payment_time                            ,
delivery_time                           ,
receive_time                            ,
comment_time                            ,
modify_time                             ,
'2020-08-01' as start_dt                ,
'9999-12-31' as end_dt                  
from incr;
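One benefit of the zipper table is that any historical full snapshot can be reconstructed from a single partition. The query below is a hedged illustration of that use, not part of the original scripts:

-- orders as they looked at the end of 2020-07-31, read from the 2020-08-01 zipper partition
select *
from dws.oms_order_info_zipper
where dt = '2020-08-01'
  and start_dt <= '2020-07-31'
  and end_dt   >= '2020-07-31';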

Next is the wide-table design of the business-domain DWD layer, using the flash-sale (seckill) activity as an example:
-- flash-sale activity: DWD-layer table model

create table dwd.flash_promotion_dtl(
dtstr                         string,
id                            int,
member_id                     int,
product_id                    bigint,
subscribe_time                timestamp,
send_time                     timestamp,
member_level_name             string,
gender                        string,
city                          string,
brand_name                    string,
product_name                  string,
product_category_name         string,
flash_promotion_session_id    bigint,
flash_promotion_session_name  string,
flash_promotion_id            bigint,
flash_promotion_title         string         
)
partitioned by (dt string)
stored as parquet
;

-- ETL code

insert into table dwd.flash_promotion_dtl partition(dt ='2020-07-30')
select
'2020-07-30' as dtstr                               ,                       
a.id                                                ,
a.member_id                                         ,
a.product_id                                        ,
a.subscribe_time                                    ,
a.send_time                                         ,
l.name as member_level_name                         ,
case u.gender  
  when  0 then '未知'  
  when  1 then '男'
  else  '女'
end as gender                                       ,  
u.city                                              ,
p.brand_name                                        ,
p.product_name                                      ,
p.product_category_name                             ,
s.id as  flash_promotion_session_id                 ,
s.name as flash_promotion_session_name              ,
m.id as  flash_promotion_id                         ,
m.title as flash_promotion_title                    

from dwd.sms_flash_promotion_log a 
  join dwd.ums_member u on a.member_id=u.id
  join dwd.ums_member_level l on u.member_level_id=l.id
  join dwd.pms_product p on a.product_id=p.id
  join dwd.sms_flash_promotion_product_relation r on a.product_id=r.product_id
  join dwd.sms_flash_promotion_session s on r.flash_promotion_session_id=s.id
  join dwd.sms_flash_promotion m on r.flash_promotion_id=m.id
where to_date(subscribe_time)>'2020-07-29';
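With the wide table in place, typical flash-sale questions reduce to single-table queries. The following is a hedged illustration of such downstream use:

-- subscribers per flash-sale session and product category on 2020-07-30
select flash_promotion_session_name,
       product_category_name,
       count(distinct member_id) as subscriber_cnt
from dwd.flash_promotion_dtl
where dt = '2020-07-30'
group by flash_promotion_session_name, product_category_name;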

The following is the design and development of the business-domain DWS layer:
-- DWS aggregate table for order amount analysis

create table dws.oms_order_amount_agg(
gmv                             decimal(10,2),
pay_amount                      decimal(10,2),
coupon_amount                   decimal(10,2),
promotion_discount_amount       decimal(10,2),
integration_discount_amount     decimal(10,2),
time_range_hour                 int,
product_brand                   string,  -- product brand
product_category_name           string,  -- product category
member_level_name               string,  -- membership level
order_type                      string,  -- order type
source_type                     string,  -- order source
promotion_name                  string   -- promotion / campaign name
)
partitioned by (dt string)
stored as parquet
;

-- ETL development
-- Sources:
--   dwd.oms_order_info     order table
--   dwd.oms_order_item     order line-item table
--   dwd.ums_member         member table
--   dwd.ums_member_level   member level table
--   dwd.pms_product        product table
-- Target: dws.oms_order_amount_agg, the sales-amount dimensional aggregate table
-- Logic: start from the order table and join the other relevant tables, producing rows shaped like:
-- order id, member id, product id, sale price, promotion discount, points deduction, coupon discount, order type, order source, promotion, category, brand, member level, hour
-- o1 m1 p1 20 5 1 1 普通 PC 大放送 c1 b1 lev1 3
-- o1 m1 p2 80 0 5 5 活动 H5 新品秒 c2 b2 lev1 3
-- o2

with tmp as (
select
od.id                      as          order_id                     ,
od.member_id               as          member_id                    ,
it.product_id              as          product_id                   ,
od.order_type              as          order_type                   ,
od.source_type             as          source_type                  ,
od.promotion_info          as          promotion_name               ,
it.product_price           as          product_price                ,
it.product_quantity        as          product_quantity             ,
it.promotion_amount        as          promotion_amount             ,
it.coupon_amount           as          coupon_amount                ,
it.integration_amount      as          integration_amount           ,
it.real_amount             as          real_amount                  ,
pd.brand_name              as          product_brand                ,
pd.product_category_name   as          product_category_name        ,
lv.name                    as          member_level_name            ,
hour(od.create_time)       as          time_range_hour
from  dwd.oms_order_info od
        join dwd.oms_order_item it on to_date(od.create_time) = '2020-07-30' and od.id=it.order_id
        join dwd.ums_member mb on od.member_id = mb.id
        join dwd.ums_member_level lv on mb.member_level_id = lv.id
        join dwd.pms_product pd on it.product_id=pd.id
)

-- load the aggregate, grouped by every dimension column of the target table
insert into table dws.oms_order_amount_agg partition (dt = '2020-07-30')
select
sum(product_price*product_quantity)    as  gmv,
sum(real_amount)                       as  pay_amount,
sum(coupon_amount)                     as  coupon_amount,
sum(promotion_amount)                  as  promotion_discount_amount,
sum(integration_amount)                as  integration_discount_amount,
time_range_hour                        ,
product_brand                          ,
product_category_name                  ,
member_level_name                      ,
order_type                             ,
source_type                            ,
promotion_name
from  tmp
group by
     time_range_hour       ,
     product_brand         ,
     product_category_name ,
     member_level_name     ,
     order_type            ,
     source_type           ,
     promotion_name
;
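The aggregate table then answers order-amount questions directly. A hedged example of downstream use:

-- GMV and paid amount by brand and hour for 2020-07-30
select product_brand,
       time_range_hour,
       sum(gmv)        as gmv,
       sum(pay_amount) as pay_amount
from dws.oms_order_amount_agg
where dt = '2020-07-30'
group by product_brand, time_range_hour;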

On top of this, tables for user consumption profiles, return/rejection profile analysis, and shopping-preference profiles can also be built.

8. Components for Stable Operation

Shell scripts can give us one-click start/stop of the individual big data components, but hand-run scripts cannot keep a big data project's jobs running on schedule, so we use Azkaban for scheduled task execution. To simplify operations we also use Atlas, the metadata management component, which ships with a web UI.
