Data Warehouse 1.0

This article comes from my Hexo blog. It gives a brief walkthrough of building a data warehouse system, covering ETL, data modeling, slowly changing dimensions, data governance, metadata management, and so on.

ODS -> DWD (user behavior data analysis)

Dimension integration

GeoHash encoding
//store the latitude/longitude dictionary table into the database
object Geo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName(" ")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///C:\\Users\\hp\\IdeaProjects\\TiTan\\dataware\\src\\main\\resources\\geodict.txt")
    import spark.implicits._
    val maprdd = rdd.map(data => {
      val txt = data.split(" ")
      val str = txt(4).split(":")
      //first argument is the latitude, second is the longitude
      val geo: String = GeoHash.withCharacterPrecision(txt(3).toDouble, txt(2).toDouble, 6).toBase32
      (geo, str(0), str(1), str(2))
    })
    val frame = maprdd.toDF("geo", "province", "city", "district")
    frame.write.format("jdbc")
      // 注意格式
      .option("url", "jdbc:mysql://linux03:3306/db_demo1?useUnicode=true&characterEncoding=utf-8&useSSL=false")
      .option("dbtable", "geo")
      .option("user", "root")
      .option("password", "123456")
      .mode("append")
      .save()
  }
}

IDMapping

Option 1: ID-bind scoring

Steps

The score table consists of: deviceID, account, score, timestamp
1. Build today's bind table:
It is the UNION ALL of two parts.
Part 1: today's new devices.
Filter out today's records that have no account, join them with yesterday's bind table, and keep the rows whose deviceid is null on the yesterday side.
Part 2: today's records whose account is not empty (rows where both fields are empty are not worth discussing).
Key point: how to fill the table's fields.
Device: if it appeared yesterday, take yesterday's value; otherwise take today's.
Account: if it appeared yesterday, take yesterday's value; otherwise take today's.
Timestamp: take the maximum.
Score: if the deviceid appears both today and yesterday, add the scores; if it appears only today, keep today's score; if it appeared yesterday but not today, decay the score.

2. Generate today's guid field
Take today's score table, keep the highest-scoring row within each device group, and left join today's log table with this processed score table.
Rules:
If today's account is not empty, the guid is that account.
Otherwise take the account from the score table as the guid; this handles records that have no account field today.
In all other cases the guid is today's device id.

Code

object IdBindMappingTest {

  /*  每日滚动更新: 设备和账号的绑定评分(表)
    设备id,登录账号,绑定得分,最近一次登录时间戳
    deviceid,account,score,ts
    这个数据的加工,应该是一个逐日滚动计算的方式

    具体计算逻辑:
        1. 加载 T-1 日的绑定评分表
        2. 计算  T日的  "设备-账号绑定评分"
        3. 综合 T-1日 和 T日的  绑定评分数据得到 T日的绑定评分数据最终结果
    T日的设备-账号绑定评分,得分规则:  每登录一次,+100分

    两日数据综合合并的逻辑
    T-1 日
    d01,a01,900
    d02,a02,800
    d01,a11,600
    d03,a03,700
    d04,a04,600

    T日
    d01,a01,200
    d03,a03,100
    d03,a13,100
    d06,a06,200
    d06,a04,100*/

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache").setLevel(Level.WARN)

    val spark = SparkSession.builder().enableHiveSupport().config(new SparkConf().setAppName(this
      .getClass.getSimpleName).setMaster("local[*]")).getOrCreate()
    import spark.implicits._
    import org.apache.spark.sql.functions._
    val schema = new StructType(Array(
      StructField("deviceid", DataTypes.StringType),
      StructField("account", DataTypes.StringType),
      StructField("timestamp", DataTypes.DoubleType),
      StructField("sessionid", DataTypes.StringType)
    ))

    val file = spark.read.format("csv").option("header", false.toString).schema(schema).load("file:///C:\\Users\\hp\\IdeaProjects\\TiTan\\dataware\\src\\main\\resources\\idbind.csv")
    val frame = file.select("deviceid", "account", "timestamp", "sessionid")
    frame.createTempView("cur_day_log")


    /*  deviceid,account,timestamp,sessionid
        d01,a01,12235,s01
        d01,a01,12266,s01
        d01,a01,12345,s02
        d01,a01,12368,s02
        d01,,12345,s03
        d01,,12376,s03
        d02,a02,12445,s04
        d02,a02,12576,s04
        d03,a03,13345,s05
        d03,a03,13376,s05
        d04,,14786,s06
        d04,,14788,s06*/

    // TODO 最好将这些标识数据写入到hive中进行查询与覆盖


    // todo 按设备和账号来分组,存在一次就+100分 ---- 今天的绑定评分
    val today_temp = spark.sql(
      """
        |select
        |deviceid,account,cast(count(distinct sessionid ) * 100 as double) as score ,max(timestamp) as  timestamp
        |from cur_day_log
        |group by deviceid,account
        |""".stripMargin)

    today_temp
      // TODO  此处过滤出没有账号登录的log,为什么?
      // TODO 生成今天的评分表,过滤出账号为空的记录,假设这个设备正好是新设备,
      //   确实要过滤出来防止重复,如果不是新的设备那么之前的score表一定有记录

      .where("trim(account) != '' and account is not null")
      .select("deviceid", "account", "score", "timestamp")
      .createTempView("today_score")

    /* 测试 != null 和 is not null
     spark.sql(
        """
          |select
          |deviceid,account,cast(count(distinct sessionid ) * 100 as double) as score ,max(timestamp) as  timestamp
          |from cur_day_log
          |group by deviceid,account
          |""".stripMargin).where("trim(account) != '' and account != null").show()*/

    spark.sql(
      """
        |select * from  today_score
        |""".stripMargin).show()



    // TODO 昨天的绑定评分表

    val preFrame = spark.read.format("csv").option("header", true.toString).load("file:///C:\\Users\\hp\\IdeaProjects\\TiTan\\dataware\\src\\main\\resources\\idscore.csv")
    preFrame.selectExpr("deviceid", "account", "cast(timestamp as double) as timestamp ", "cast(score as double) as score")
      .createTempView("last_score")

    spark.sql(
      """
        |select * from  last_score
        |""".stripMargin).show()

    // TODO 整合昨天和今天的评分表,临时
    // 历史出现 今天出现 评分相加(取昨天的)
    // 历史出现 今天不出现 评分衰减(取昨天的)
    // 历史不出现 今天出现  取今天的

    val score_temp = spark.sql(
      """
        |select
        |nvl(pre.deviceid,cur.deviceid) as deviceid, -- 设备
        |nvl(pre.account,cur.account) as account, -- 账号
        |nvl(cur.timestamp,pre.timestamp) as timestamp, -- 最近一次访问时间戳
        |case
        |when pre.deviceid is not null and cur.deviceid is not null then pre.score+cur.score
        |when pre.deviceid is not null and cur.deviceid is null then pre.score*0.5
        |when pre.deviceid is  null and cur.deviceid is not  null then cur.score
        |end as score
        |from
        |last_score pre
        |full join
        |today_score cur
        |on pre.deviceid = cur.deviceid and  pre.account = cur.account
        |""".stripMargin)

    // 今天临时的评分表,没有新设备,已经过滤掉了
    score_temp.createTempView("score_temp")

    //  TODO 定义今天的新设备 与昨天left join

    //取出今天账号为null或者空的数据,与昨天的评分表join且deviceid is  null,获取新的设备
    today_temp
      // 注意这里是 or
      .where("trim(account) = '' or account is  null").createTempView("cur_may_new")


    //与昨天left join 取出 昨天设备为null的数据

    //注意字段的对齐
    //获取今天的新的设备与临时的没有新设备的进行union all
    spark.sql(
      """
        |select
        |a.deviceid,
        |a.account,
        |a.timestamp,
        |a.score
        |from
        |cur_may_new a
        |left join last_score b
        |on a.deviceid = b.deviceid
        |where b.deviceid is  null
        |
        |union all
        |select deviceid,account,timestamp,score from  score_temp
        |""".stripMargin).show()

    //risk: two different users may be identified as the same person
    // the join is computationally expensive
    // the score-decay weighting rule is somewhat arbitrary
  }
}
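Step 2 above (generating today's guid) is not implemented in this snippet; a minimal sketch under the same assumptions, reusing the `spark` session and the cur_day_log / score_temp temp views created in the code:

// Sketch only: pick the highest-scoring account for each device, then apply the guid rules from step 2.
spark.sql(
  """
    |select
    |  log.deviceid, log.account, log.timestamp, log.sessionid,
    |  case
    |    when log.account is not null and trim(log.account) != '' then log.account  -- today's account wins
    |    when best.account is not null then best.account                            -- fall back to the best-bound account
    |    else log.deviceid                                                          -- otherwise use the device id itself
    |  end as guid
    |from cur_day_log log
    |left join (
    |  select deviceid, account
    |  from (
    |    select deviceid, account,
    |           row_number() over(partition by deviceid order by score desc, timestamp desc) as rn
    |    from score_temp
    |  ) t
    |  where rn = 1
    |) best
    |on log.deviceid = best.deviceid
    |""".stripMargin).show(50, false)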

Option 2: graph computation

object GraphxUniqueLdentification {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(new SparkConf())
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport()
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    import org.apache.spark.sql.functions._
    val frame = spark.read.table("db_demo1.app_event_log")

    val dataFrame = frame.select("account", "deviceId")
    //暂时的一个用户标识
    val temporaryOne: RDD[Array[String]] = dataFrame.rdd.map(row => {
      val account = row.getAs[String](0)
      val deviceId = row.getAs[String](1)
      Array(account, deviceId).filter(StringUtils.isNotBlank(_))
    })
    //println(temporaryOne.collect().length)

    // TODO 构造图的点和边

    val today_dot: RDD[(Long, String)] = temporaryOne.flatMap(arr => {
      for (element <- arr) yield (element.hashCode.toLong, element)
    })

    val today_side: RDD[Edge[String]] = temporaryOne.flatMap(arr => {
      for (i <- 0 to arr.length - 2; j <- i + 1 until arr.length) yield Edge(arr(i).hashCode.toLong, arr(j).hashCode.toLong, "")
    }).map((_, 1))
      .reduceByKey(_ + _)
      .filter(_._2 > 2)
      .map(_._1)

    // TODO 这里是把第一次的日志获取的用户标签写入到hive中
    //firstOneGuid(today_dot,today_side)


    // TODO 生成今天的 DUID && 用户标签写入到hive中
    //today_Guid(today_dot,today_side)


    // TODO  将今日的图计算结果与昨天的结果进行比对,如果最小的 guid 出现在今天了,则替换为昨天的guid

    //分别读取hive表中的用户标签字段
    val last_identifying: collection.Map[VertexId, VertexId] = spark.read.table("dw.graphData").select("identifying_hash", "guid").where("dy='2020-12-16'").rdd
      .map(row => {
        (row.getAs[VertexId](0), row.getAs[VertexId](1))
      })
      .collectAsMap()

    //把昨天的标签数据广播出去 把今天的数据按照guid按照分组
    val last_identifying_broadcast = spark.sparkContext.broadcast(last_identifying)

    val today_identifying: RDD[(VertexId, VertexId)] = spark.read.table("dw.graphData").select("identifying_hash", "guid").where("dy='2020-12-17'").rdd
      .map(row => {
        (row.getAs[VertexId]("guid"), row.getAs[VertexId]("identifying_hash"))
      })
      .groupByKey()
      .mapPartitions(iter => {
        //获取映射字典
        val idmpMap = last_identifying_broadcast.value
        iter.map(tp => {
          var guid = tp._1
          val identifying_hash = tp._2
          var findmin_guid = false
          for (elem <- identifying_hash if !findmin_guid) {
            //如果在今日存在 昨天的那个最小的guid 那么就把今天的guid替换为昨天的那个
            val maybeId: Option[VertexId] = idmpMap.get(elem)
            if (maybeId.isDefined) {
              guid = maybeId.get
              findmin_guid = true
            }
          }
          (guid, identifying_hash)
        })
      }).flatMap(tp => {
      //扁平化
      val guid = tp._1
      val identifying_hash = tp._2
      for (elem <- identifying_hash) yield (elem, guid)
    })


    //today_identifying.toDF("identifying_hash", "guid").show(20)
    //把数据写入到hive表中,覆盖今天的标签表
    today_identifying.toDF("identifying_hash", "guid").createTempView("graph")

    spark.sql(
      """
        |insert overwrite table  dw.graphData   partition(dy='2020-12-17')
        |select
        |identifying_hash,guid
        |from
        |graph
        |""".stripMargin)


    def today_Guid(today_dot: RDD[(Long, String)], today_side: RDD[Edge[String]]) {
      // TODO  读取上一日的生成的标签字段
      val oneday_identifying: RDD[Row] = spark.read.table("dw.graphData").select("identifying_hash", "guid").where("dy='2020-12-16'").rdd

      val last_dot: RDD[(VertexId, String)] = oneday_identifying.map(row => {
        val identifying_hash = row.getAs[VertexId](0)
        (identifying_hash, "")
      })

      val last_side: RDD[Edge[String]] = oneday_identifying.map(row => {
        val scr: VertexId = row.getAs[VertexId]("identifying_hash")
        val dst = row.getAs[VertexId]("guid")
        Edge(scr, dst, "")

      })
      // 构建图
      val graph = Graph(today_dot.union(last_dot), today_side.union(last_side))
      graph.connectedComponents().vertices.toDF("identifying_hash", "guid").createTempView("graph")

      spark.sql(
        """
          |insert  into  table  dw.graphData   partition(dy='2020-12-17')
          |select
          |identifying_hash,guid
          |from
          |graph
          |""".stripMargin)
    }


    def firstOneGuid(dot: RDD[(Long, String)], side: RDD[Edge[String]]) {
      val graph = Graph(dot, side)
      val conngraph = graph.connectedComponents().vertices
      // todo 此处的guid是标识中的最小的值,我们可以把这个guid替换为一个生成的唯一的UUID来作为一个guid

      val graphDataFrame = conngraph.toDF("identifying_hash", "guid")
      graphDataFrame.createTempView("graph")

      // TODO 把用户唯一标识库写入到hive表中
      spark.sql(
        """
          |insert  into  table  dw.graphData   partition(dy='2020-12-16')
          |select
          |identifying_hash,guid
          |from
          |graph
          |""".stripMargin).show()
    }
  }
}

New vs. returning user identification

Option 1: Bloom filter
//test Bloom filters
object BloomFilterTest {
  def main(args: Array[String]): Unit = {
    //hadoopBloom()
    sparkBloom()
  }

  //use Hadoop's Bloom filter
  // TODO note that Hadoop's serializer differs from Spark's; specify the serializer in the Spark config

  def hadoopBloom() {
    val filter = new BloomFilter(1000000, 5, Hash.MURMUR_HASH)
    filter.add(new Key("a".getBytes()))
    filter.add(new Key("b".getBytes()))
    filter.add(new Key("c".getBytes()))
    filter.add(new Key("d".getBytes()))

    val bool = filter.membershipTest(new Key("a".getBytes))
    val bool1 = filter.membershipTest(new Key("dwd".getBytes))
    println(bool)
    println(bool1)
  }
  //使用spark的布隆过滤器
  def sparkBloom(): Unit = {
    val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]").setAppName(this.getClass.getSimpleName)).getOrCreate()
    import spark.implicits._
    val filter = spark.sparkContext.makeRDD(List("A", "B", "C", "D", "E", "F")).toDF("alphabet").stat.bloomFilter("alphabet", 100000, 0.001)
    val bool = filter.mightContain("A")
    val bool1 = filter.mightContain("R")
    println(bool)
    println(bool1)
  }
}
Option 2: Spark broadcast variable

Broadcast the deviceid and account fields of the historical table; if a record's device or account already exists in the broadcast set it belongs to an existing user, otherwise mark it as new (isnew = 1).
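A minimal sketch of this approach; the table names db_demo1.history_ids and db_demo1.app_event_log and their columns are illustrative assumptions:

import org.apache.spark.sql.SparkSession

object BroadcastNewUserFlag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName(this.getClass.getSimpleName).enableHiveSupport().getOrCreate()
    import spark.implicits._

    // collect the historical deviceids and accounts and broadcast them (assumed table/column names)
    val known: Set[String] = spark.read.table("db_demo1.history_ids")
      .select("deviceid", "account").rdd
      .flatMap(r => Seq(Option(r.getString(0)), Option(r.getString(1))).flatten)
      .collect().toSet
    val knownBc = spark.sparkContext.broadcast(known)

    // flag today's log: a device or account already in the broadcast set means an existing user
    val flagged = spark.read.table("db_demo1.app_event_log").select("deviceid", "account")
      .map { r =>
        val ids = knownBc.value
        val isnew = if (ids.contains(r.getString(0)) || ids.contains(r.getString(1))) 0 else 1
        (r.getString(0), r.getString(1), isnew)
      }.toDF("deviceid", "account", "isnew")
    flagged.show(20, false)
  }
}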

Option 3: join

Full join yesterday's guids with today's guids; a guid with no match on yesterday's side is a new user.
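A minimal sketch of the join approach; dw.guid_yesterday and dw.guid_today are assumed tables that each hold one distinct guid per row:

import org.apache.spark.sql.SparkSession

object NewUserByFullJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName(this.getClass.getSimpleName).enableHiveSupport().getOrCreate()
    spark.read.table("dw.guid_yesterday").createTempView("y")  // assumed table name
    spark.read.table("dw.guid_today").createTempView("t")      // assumed table name
    spark.sql(
      """
        |select
        |  coalesce(t.guid, y.guid) as guid,
        |  case when y.guid is null then 1 else 0 end as isnew  -- no record yesterday => new user
        |from y
        |full join t on y.guid = t.guid
        |where t.guid is not null  -- keep only the users active today
        |""".stripMargin).show(20, false)
  }
}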

DWD -> DWS (user behavior data analysis)

Traffic / user analysis subject area

Session aggregation table (light aggregation)

Aggregating at user-session granularity can support many reports, e.g. visit counts, visit depth (defined to include repeated pages), bounce pages, traffic comparisons, and so on.

Light aggregation is a per-individual statistic; to aggregate per individual, just group by guid.

-- From the DWD table enent_app_detail, aggregate at user-session granularity
-- Within one session a user may appear in several locations; use the first city/region

CREATE TABLE dw.traffic_aggr_session
(
    guid             string,
    session_id       string,
    start_time       bigint,
    end_time         bigint,
    in_page          string,
    out_page         string,
    pv_cnt           bigint, --  一次会话访问的页面的总数
    isnew            int,
    hour_segment     int,    --  定义为start_time的访问hour,用来统计某时段的会话总数
    province         string,
    city             string,
    district         string,
    device_type      string,
    release_channel  string,
    app_version      string,
    os_name          string
)
PARTITIONED BY (dy  string)
stored as parquet
tblproperties("parquet.compress"="snappy");

-- HQL

with x as (
  select
    guid, -- 唯一标识(日活)
    sessionId,
    first_value(`timeStamp`) over ( --  timeStamp不规范
      partition by guid,
      sessionId
      order by
        `timeStamp` rows between unbounded preceding
        and unbounded following
    ) as start_time,
    last_value(`timeStamp`) over (
      partition by guid,
      sessionId
      order by
        `timeStamp` rows between unbounded preceding
        and unbounded following
    ) as end_time,
    first_value(properties ['pageid']) over ( -- 从map中获取访问的页面信息
      partition by guid,
      sessionId
      order by
        `timeStamp` rows between unbounded preceding
        and unbounded following
    ) as in_page,
    last_value(properties ['pageid']) over (
      partition by guid,
      sessionId
      order by
        `timeStamp` rows between unbounded preceding
        and unbounded following
    ) as out_page,
    newuser,
    first_value(province) over (
      partition by guid,
      sessionId
      order by
        `timestamp` rows between unbounded preceding
        and unbounded following
    ) as province,
    first_value(city) over (
      partition by guid,
      sessionId
      order by
        `timestamp` rows between unbounded preceding
        and unbounded following
    ) as city,
    first_value(district) over (
      partition by guid,
      sessionId
      order by
        `timestamp` rows between unbounded preceding
        and unbounded following
    ) as district,
    devicetype,
    releasechannel,
    appversion,
    osName
  from
    enent_app_detail
  WHERE
    dy = '2020-12-11'
    and eventid = 'pageView'
)
insert into
  table dw.traffic_aggr_session partition(dy = '2020-12-11')
select
  guid,
  sessionId,
  min(start_time) as start_time,
  min(end_time) as end_time,
  min(in_page) as in_page,
  min(out_page) as out_page,
  count(1) as pv_cnt,
  min(newuser) as isnew,
  hour(
    from_unixtime(cast(min(start_time) / 1000 as bigint)) -- 这里要把double转成bigint类型
  ) as hour_segment, -- 此次在某时段访问的
  min(province) as province,
  min(city) as city,
  min(district) as district,
  min(devicetype) as device_type,
  min(releasechannel) as release_channel,
  min(appversion) as app_version,
  min(osname) as os_name
from
  x
group by
  guid,
  sessionId
Traffic user aggregation table

Aggregate at the user level; this is a further rollup of the session-level aggregation table above.
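A minimal sketch of that rollup, written as Spark SQL in the same style as the rest of the project and assuming a SparkSession `spark` with Hive support; dw.traffic_aggr_user is an assumed target table name:

// Sketch only: roll the session table up to one row per user and day.
spark.sql(
  """
    |insert into table dw.traffic_aggr_user partition(dy = '2020-12-11')
    |select
    |  guid,
    |  count(1)                   as session_cnt,  -- sessions per user
    |  sum(pv_cnt)                as pv_cnt,       -- pages viewed per user
    |  sum(end_time - start_time) as acc_timelong, -- total time spent
    |  min(isnew)                 as isnew
    |from dw.traffic_aggr_session
    |where dy = '2020-12-11'
    |group by guid
    |""".stripMargin)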

Multidimensional cube (high-order aggregation)

In production, the same metrics have to be computed under many different dimension combinations, which is tedious work.

Key point: build one unified target table that contains every dimension field that might be needed.

Use Hive's high-order aggregation functions to produce all the required dimension combinations in a single SQL statement.

High-order aggregation is not a per-individual statistic; it is a global statistic.

with cube: covers every possible combination

grouping sets(): the user enumerates the desired combinations explicitly

with rollup: multi-level aggregation that drops grouping columns from right to left
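For comparison, a minimal sketch of the three constructs on two illustrative dimensions (assuming a SparkSession `spark`); the production script below uses the GROUPING SETS form:

// with cube: every combination of (province, city), including the grand total
spark.sql("select province, city, count(1) as session_cnt from dw.traffic_aggr_session group by province, city with cube")
// grouping sets: only the combinations listed explicitly
spark.sql("select province, city, count(1) as session_cnt from dw.traffic_aggr_session group by province, city grouping sets ((province), (province, city), ())")
// with rollup: (province, city) -> (province) -> (), dropping grouping columns from right to left
spark.sql("select province, city, count(1) as session_cnt from dw.traffic_aggr_session group by province, city with rollup")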

-- 多维组合表

create table dw.traffic_overview_cube(
  province         string,
  city             string,
  district         string,
  device_type      string,
  release_channel  string,
  app_version      string,
  os_name          string,
  hour_segment     int   ,
  session_cnt      bigint, -- 会话总和
  pv_cnt           bigint, -- 访问页面总和
  acc_timelong     bigint, -- 访问总时长
  uv_cnt           bigint  -- 访问总人数
)
partitioned by (dy  string)
stored as parquet
tblproperties ("parquet.compress"="snappy")
;

#!/bin/bash
# run as a shell script
# high-order aggregation: traffic_aggr_session -> traffic_overview_cube
export HADOOP_HOME=/opt/hadoop/hadoop-3.1.1/
export SPARK_HOME=/opt/spark/spark-3.0.1-bin/
export HIVE_HOME=/opt/hive/hive-3.1.2/

DT_CAL=$(date -d'-1 day' +%Y-%m-%d)

if [ $# -eq 1 ]
then
DT_CAL=$1
fi

echo "准备启动任务,要计算的数据日期: $DT_CAL  ..........."

$HIVE_HOME/bin/hive -e "
insert into  table dw.traffic_overview_cube partition (dy='${DT_CAL}')
select
  province,
  city,
  district,
  device_type,
  release_channel,
  app_version,
  os_name,
  hour_segment,
  count(1) as session_cnt, -- 会话总数
  sum(pv_cnt) as pv_cnt,   -- 访问页面总数
  sum(end_time - start_time) as time_long, -- 总访问时长
  count( distinct guid) as  uv_cnt -- 访问用户总数
from
  dw.traffic_aggr_session
where dy='${DT_CAL}'
group by
  province,
  city,
  district,
  device_type,
  release_channel,
  app_version,
  os_name,
  hour_segment
  grouping sets (
    (),(province),
    (province, city),
    (province, city, district),
    (device_type),
    (release_channel),
    (device_type, app_version),
    (hour_segment),
    (os_name, app_version)
  );
"

if [ $? -eq 0 ]
then
echo ">>>>   任务成功    >>>>      "
else
echo ">>>>   任务失败 >>>>>   "
fi

User activity analysis subject area

Quasi-zipper table

Step 1: from the session table dw.traffic_aggr_session, compute the guids of the users who logged in today.

Step 2: full join yesterday's activity-range table with today's daily-active table; the rules are:

first_dt, guid and range_start follow the same rule: take yesterday's value if it exists, otherwise today's (the latter case is a new user).

range_end rule: if the user was active yesterday but not today, use yesterday's date (a broken streak must be sealed); if there is no record yesterday but there is one today, use today's date (a new user); in every other case keep yesterday's range_end (already-closed ranges stay as they are).

Step 3: one situation is not covered by the full join: users who existed before with all ranges closed (max(range_end) != '9999-12-31') and who logged in again today, so a UNION ALL is needed.

Take such users' guid and first_dt from the activity-range table and LEFT SEMI JOIN them with today's daily-active table.

-- quasi-zipper table
create table dw.user_act_range(
first_dt    string,
guid        string,
range_start string,
range_end   string
)
partitioned by (dy string)
stored as parquet
tblproperties ("parquet.compress"="snappy");

-- users who logged in today
-- a user whose ranges are all closed but who logged in today will not be handled correctly here; hence the UNION ALL below
select
  nvl(pre.first_dt, cur.cur_time) as first_dt,
  -- take the old value if the user existed before, otherwise it is a new user
  nvl(pre.guid, cur.guid) as guid,
  nvl(pre.range_start, cur.cur_time) as range_start,
  case when pre.range_end = '9999-12-31'
  and cur.cur_time is null then pre.pre_time --  active yesterday, not today: close the range
  when pre.range_end is null then cur.cur_time -- new user (no match in the join)
  else pre.range_end -- already-closed ranges stay as they are
  end as range_end
from
  (
    select
      first_dt,
      guid,
      range_start,
      range_end,
      dy as pre_time
    from
      dw.user_act_range
    where
      dy = '2020-12-10'
  ) pre full
  join (
    select
      guid,
      max(dy) as cur_time
    from
      dw.traffic_aggr_session
    where
      dy = '2020-12-11'
    group by
      guid
  ) cur on pre.guid = cur.guid
union all
  -- 从会话层获取今日登陆的用户
  -- range_end封闭且今日登陆的情况
select
  first_dt as first_dt,
  o1.guid as guid,
  '2020-12-11' as range_start,
  '9999-12-31' as range_end
from
  (
    select
      guid,
      first_dt
    from
      dw.user_act_range
    where
      dy = '2020-12-10'
    group by
      guid,
      first_dt
    having
      max(range_end) != '9999-12-31'
  ) o1 -- 从会话层取出今天登陆的所有用户
  left semi
  join (
    select
      guid
    from
      dw.traffic_aggr_session
    where
      dy = '2020-12-11'
    group by
      guid
  ) o2 on o1.guid = o2.guid
-- Report: distribution of continuous-activity span lengths
-- take the longest active span per user

with x as (
select 
max(datediff(if(range_end='9999-12-31','2020-12-29',range_end),if(date_sub('2020-12-29' , 30)< range_start,range_start,date_sub('2020-12-29',30)))) as num
from 
dw.user_act_range
where dy='2020-12-17' and date_sub('2020-12-29',30) <= range_end
group by guid
)
select 
count(if(num<=10,1,null)) as continous_10,
count(if(num>10 and num<=20 ,1,null)) as continous_20,
count(if(num>20 and num<=30 ,1,null)) as continous_30
from 
x

Bitmap approach
Computation plan
Step 1: deduplicate by guid and date.
Step 2: group and aggregate by guid.
Use sum(pow(2, datediff(reference_date, dt))); the type is double, and the sum encodes the set of days the user logged in.
Convert the result with bin(); the most recent login sits at the right end of the bit string, so use lpad() to pad it to a fixed length.
Then use reverse() so that the leftmost bit represents the most recent day.
Step 3: update the bitmap table every day.
Join it with today's table on guid.
Rebuild the bit string according to the following rules:
Take substr(bitstr, 1, 30); if the guid appears both today and in the bitmap table, prepend '1'; if it appears only today it is a new guid; if it appears only in the bitmap table the user did not log in, so prepend '0'.

-- test data
-- assume a table holding all user login records with fields dt, guid; deduplicate/aggregate it by dt, guid into a temp table dou, which is used to build the initial bitmap table
-- reporting on top of the bitmap: the like(), replace() and split() functions are enough to answer login questions
select 
reverse(lpad(bin(cast(sum(pow(2,datediff('2020-11-18',dt))) as bigint)),31,'0'))
from 
db_demo1.dou
group by guid
-- in fact only the sum() value needs to be stored; the expression above is just for the computation
-- update today's bitmap table; type-cast errors are possible, this has not been tested
-- option 1:
with a as (
  select
    -- 会话视图层
    guid
  from
    dw.traffic_aggr_session
  where
    dy = '2020-12-17'
  group by
    guid
),
b as (
  select
    -- bitmap表
    guid,
    reverse(lpad(cast(bin(bitmap) as string), 31, '0')) as bitstr
  from
    dw.bitmp_30d
  where
    dy = '2020-12-16'
)
insert into
  table dw.bitmp_30d partition(dy = '2020-12-17')
select
  nvl(a.guid, b.guid) as guid,
  conv(
    -- conv函数把二进制转为10进制
    reverse(
      -- 反转把最近登陆反转到结尾,离今天越近数值越小
      case when a.guid is not null
      and b.guid is not null then concat('1', substr(bitstr, 1, 30)) when a.guid is null
      and b.guid is not null then concat('0', substr(bitstr, 1, 30)) when b.guid is null then rpad('1', 31, '0') end
    ),
    2,
    10
  ) as bitmap
from
  a full
  join b on a.guid = b.guid
-- option 2: bitwise AND

On day T+1, if the user was not active, update like this:
select bin(1073741823 & cast(conv('111111111111111111111111111111',2,10)*2 as int));

On day T+1, if the user was active, update like this:
select bin(1073741823 & cast(conv('111111111111111111111111111111',2,10)*2+1 as int));

Continuous-activity analysis

//days of continuous activity
//guid,first_dt,rng_start,rng_end
object UserActivityAnalysis {
  //注意传入日期的格式:'2020-06-03'
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName(this.getClass.getSimpleName).getOrCreate()
    val dataFrame = spark.read.option("header",true.toString).csv("file:///C:\\Users\\hp\\IdeaProjects\\TiTan\\dataware\\src\\main\\resources\\Useractivityanalysis.csv")
      .where("rng_end = '9999-12-31'")
      .selectExpr("cast(guid as int) as guid", s"datediff(${args(0)},rng_start) as days")

      import  spark.implicits._
      // 循环获取天数
    dataFrame.show(100,false)
    val value = dataFrame.rdd.flatMap(row => {
      val guid = row.getAs[Int]("guid")
      val days = row.getAs[Int]("days")
      for (i <- 1 to days + 1) yield (guid, i)
    })

    value.toDF("guid","days").createTempView("UserActivity")

    spark.sql(
      s"""
        |select
        |${args(0)} as dt,
        |days ,--天数
        |count(1) ,-- 人数
        |collect_set(guid) -- 人
        |from
        |UserActivity
        |group by days
        |order by days
        |""".stripMargin).show(100,false)
  }
}

User visit-interval analysis

//user visit-interval analysis
//guid,first_dt,rng_start,rng_end
object UserAccessInterval {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
    val schema = new StructType()
      .add("guid", DataTypes.LongType)
      .add("first_dt", DataTypes.StringType)
      .add("rng_start", DataTypes.StringType)
      .add("rng_end", DataTypes.StringType)
    val frame = spark.read.schema(schema).csv("file:///C:\\Users\\hp\\IdeaProjects\\TiTan\\dataware\\src\\main\\resources\\UserAccessInterval.csv")
    frame.createTempView("tmp")
    /*1.过滤出rng_end时间与今天相比小于30的数据
    2.把rng_end为9999-12-31的日期改成2020-03-14
    3.如果first_dt与今天相比大于30天则更改为今天的日期-30,否则不变*/
    val frame1 = spark.sql(
      """
        |select
        |  guid,
        |  if(datediff('2020-03-14',rng_start)>=30,date_sub('2020-03-14',30),rng_start) as rng_start,
        |  if(rng_end = '9999-12-31','2020-03-14',rng_end) as rng_end
        |from
        |  tmp
        |where datediff('2020-03-14',rng_end) <=30
        |""".stripMargin)

    //统计登陆间隔
    //按用户来分组
    val guidrdd = frame1.rdd.map(row => {
      val guid = row.getAs[Long]("guid")
      val rng_start = row.getAs[String]("rng_start")
      val rng_end = row.getAs[String]("rng_end")
      (guid, (rng_start, rng_end))
    }).groupByKey()

    val value = guidrdd.flatMap(data => {
      val guid = data._1
      val sorted = data._2.toList.sortBy(_._2)
      //统计间隔为0的次数
      val list: List[(Long, Int, Int)] = for (elem <- sorted) yield (guid, 0, datediff(elem._1, elem._2))

      //统计间隔为N的次数(要排序),用后一个的rng_start- 前一个的 rng_end
      val list1 = for (i <- 0 until sorted.length - 1) yield (guid, datediff(sorted(i)._2, sorted(i + 1)._1), 1)
      list ++ list1
    })
    import spark.implicits._
    import org.apache.spark.sql.functions._
    value.toDF("guid", "interval", "times")
      .groupBy("guid", "interval").agg(sum("times") as "times")
      .show(100, false)
  }

  //时间相减
  def datediff(date1: String, date2: String): Int = {
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val s1: Date = sdf.parse(date2)
    val s2: Date = sdf.parse(date1)
    ((s1.getTime - s2.getTime) / (24 * 60 * 60 * 1000)).toInt
  }
}
//用户访问间隔分析
//guid,first_dt,rng_start,rng_end
object UserAccessInterval1 {
  def main(args: Array[String]): Unit = {
    //UserAccessInterval是spark实现,UserAccessInterval1是SQL实现

    val spark = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
    val schema = new StructType()
      .add("guid", DataTypes.LongType)
      .add("first_dt", DataTypes.StringType)
      .add("rng_start", DataTypes.StringType)
      .add("rng_end", DataTypes.StringType)
    val frame = spark.read.option("header", true.toString).schema(schema).csv("file:///C:\\Users\\hp\\IdeaProjects\\TiTan\\dataware\\src\\main\\resources\\UserAccessInterval.csv")
    frame.createTempView("usertable")
    spark.sql(
      """
        |with x as (
        |select
        |  guid,
        |  if(datediff('2020-03-14',rng_start) >= 30,date_sub('2020-03-14',30),rng_start) as rng_start,
        |  if(rng_end = '9999-12-31','2020-03-14',rng_end) as rng_end
        |from
        |  usertable -- 测试省略分区
        |where datediff('2020-03-14',rng_end) <= 30
        |)
        |
        |-- 计算间隔为0的天数
        |select
        |  guid,
        |  0 as interval ,
        |  datediff(rng_end,rng_start) as times
        |from
        |x
        |
        |union all
        |-- 计算间隔为N的天数
        |-- 去重rn为null的值,无法计算
        |select
        |  guid,
        |  datediff(date_rn,rng_end) as interval,
        |  1 as times
        |from
        |(
        |select
        |  guid,
        |  rng_end,
        |  lead(rng_start,1,null) over(partition by guid order by rng_end) as date_rn
        |from
        |  x
        |) o
        |where date_rn is not null
        |""".stripMargin).show(100,false)
  }
}

New-user retention analysis subject area

New-user retention report
-- based on the traffic / user-activity subject area
-- retention analysis for today (2020-12-16); a "new user" of a given day is a user whose first login date is that day
select
  '2020-12-16' as calc_dt, -- 计算日
  first_dt as first_dt, -- 用户第一次登陆日期                                     
  datediff('2020-12-16', first_dt) as retention_days, -- 留存日
  count(1) as user_cnt -- 留存数
from
  dw.user_act_range
where
  dy = '2020-12-16'
  and range_end = '9999-12-31'
  and datediff('2020-12-16', first_dt) <= 30 -- 过滤出超过一个月的情况
  and first_dt < '2020-12-16'
group by
  first_dt,
  datediff('2020-12-16', first_dt)
-- aggregate by user and event to get each user's per-event activity counts

create table dw.event_overview_aggr_user(
  guid string,
  event_id string,
  -- 事件类型
  act_cnt int -- 事件访问次数
) PARTITIONED BY (dy string) STORED AS PARQUET TBLPROPERTIES("parquet.compress" = "snappy");
insert into
  table dw.event_overview_aggr_user partition(dy = '2020-12-11')
select
  guid as guid,
  eventid as event_id,
  count(1) as act_cnt
from
  dw.enent_app_detail
where
  dy = '2020-12-11'
group by
  guid,
  eventid;
-- report analysis: the distribution (range) of event visit counts
-- without this per-user event aggregation table, a direct group by guid could not tell each individual user's visit counts
select
  '2020-12-11' as dt,
  event_id as event_id,
  sum(act_cnt) as act_cnt,  -- 一共访问了多少次
  count(1) as act_users,   -- 有多少用户
  count(if(act_cnt <= 10, 1, null)) as cnt_0_users,
  -- 访问占比 cnt11_20_users
  count(
    if(
      act_cnt <= 20
      and act_cnt > 10,
      1,
      null
    )
  ) as cnt_1_users,
  count(if(act_cnt > 20, 1, null)) as cnt_3_users
from
  dw.event_overview_aggr_user
where
  dy = '2020-12-11'
group by
  event_id

Funnel analysis subject area

Business-path overview statistics
create table dw.funnel_statistic_1d
( 
  funnel_name  string,-- 漏斗的业务步骤
  guid         string,-- 用户
  comp_step    int -- 用户完成这个步骤多少步
)
partitioned by (dy string)
stored as parquet 
tblproperties("parquet.compress" = "snappy")
;

insert into table dw.funnel_statistic_1d partition(dy='2020-12-12')
select
  '浏览分享添加' as funnel_name,
  guid,
  comp_step
from
  (
    select
      guid as guid,
      -- sort_array()返回的是一个数组  regexp_extract() 只能对字符串操作
      case when regexp_extract(
        concat_ws(
          ',',
          sort_array(
            collect_list(
              concat_ws('_', cast(`timestamp` as string), eventId)
            )
          )
        ),
        '.*?(pageView).*?(share).*?(addCart).*?',
        3
      ) = 'addCart' then 3 when regexp_extract(
        concat_ws(
          ',',
          sort_array(
            collect_list(
              concat_ws('_', cast(`timestamp` as string), eventId)
            )
          )
        ),
        '.*?(pageView).*?(share).*?',
        2
      ) = 'share' then 2 when regexp_extract(
        concat_ws(
          ',',
          sort_array(
            collect_list(
              concat_ws('_', cast(`timestamp` as string), eventId)
            )
          )
        ),
        '.*?(pageView).*?',
        1
      ) = 'pageView' then 1 else 0 end as comp_step
    from
      dw.enent_app_detail
    where
      dy = '2020-12-12'
      and -- 漏斗步骤(
        (
          eventId = 'pageView'
          and properties ['pageId'] = '877'
        )
        or (
          eventId = 'share'
          and properties ['pageId'] = '791'
        )
        or (
          eventId = 'addCart'
          and properties ['pageId'] = '72'
           )
         group by guid
     )o
where
  comp_step > 0;

-- Report example: count the users who completed each step

select 
count(if(comp_step>=1,1,null)) as step_1, -- users who completed at least step 1
count(if(comp_step>=2,1,null)) as step_2,
count(if(comp_step>=3,1,null)) as step_3
from 
dw.funnel_statistic_1d 
where dy='2020-12-12';
-- visit-path detail table
-- guid  ssessionid  eventid  ts  referral  stay_time
with x as (
select 
guid,
ssessionid,
eventid['url'] as eventid, 
row_number() over(partition by guid , ssessionid order by ts) as step,
lag(eventid['url'],1,null) over(partition by guid , ssessionid order by ts) as referral,
lead(ts,1,null) over(partition by guid , ssessionid order by ts) - ts as stay_time
from 
ods_etl -- 经过etl处理的原表
where dy = '2020-12-12'
)
insert  into dwd_atl_rut_dtl  partition (dy = '2020-12-12')
select 
guid, -- 用户唯一标识
ssessionid, -- 会话
eventid, --  事件
step , --第几步
referral ,  -- 前一个事件
if(stay_time is null ,3000,stay_time) as stay_time -- 最后一个页面停留时间怎么处理(给定一个定长),停留时间
from 
x 

-- 访问路径概况统计报表
-- 使用count() over() 也可以实现
with x as (
select
  eventid,step,referral,
  count(ssessionid)  as rut_cnt -- 不需要去重
from 
  dwd_atl_rut_dtl
group by  eventid,step,referral
) 
select 
  eventid,step,referral,
  rut_cnt, -- 路径会话数
  sum(rut_cnt) over(partition by eventid,step order by eventid) as step_cnt, -- 步骤会话数
  sum(rut_cnt) over (partition by eventid order by eventid) as page_cnt -- 页面会话数
from 
  x 

Event attribution analysis subject area*

Attribution analysis studies users' complex conversion paths; the question it answers is how the credit for an advertising conversion should be reasonably allocated across channels.

Last-touch attribution: for businesses with short conversion paths and cycles, or for ads whose job is the final push, where the click lands directly on the product detail page to trigger the purchase.

First-touch attribution: suits companies with little brand awareness that care most about the channel that first brought the customer in; useful for expanding the market.

Last non-direct-click attribution: when most direct traffic actually comes from customers attracted by other channels, direct traffic needs to be excluded.

Linear attribution: suits companies that want to stay in touch with the customer and maintain brand awareness throughout the whole sales cycle; every channel is assumed to contribute equally during the customer's consideration.

Time-decay attribution: for short decision and sales cycles, e.g. a two-day promotion, where the ads of those two days deserve a higher weight.

U-shaped (position-based) attribution: a mix of first-touch and last-touch attribution.

object FactorAnalysis {
  //归因分析:对用户复杂的消费行为路径的分析
  //目标事件 e6
  //待归因事件 'e1','e3','e5'
  //数据:见项目factor.csv

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").enableHiveSupport().getOrCreate()
    spark.read.format("csv").load("file:///C:\\Users\\hp\\IdeaProjects\\TiTan\\dataware\\src\\main\\resources\\factor.csv").toDF("guid", "event", "timestamp").createTempView("factor")
    //使用SQL的方式聚合出每个用户的事件信息


    spark.sql(
      """
        |select
        |guid,
        |sort_array(collect_list(concat_ws('_',timestamp,event) )) as eventId
        |from
        |factor
        |where event in ('e1','e3','e5','e6')
        |group by guid
        |having array_contains(collect_list(event),'e6')
        |""".stripMargin).createTempView("attr_event")

    spark.udf.register("first_time", firstTime)
    spark.udf.register("first_time1", firstTime1)
    spark.udf.register("Thelastattribution", Thelastattribution)
    spark.udf.register("linearattribution", linearattribution)
    spark.udf.register("decay", decay)

    spark.sql(
      """
        |select
        |guid,
        |decay(eventId,array('e1','e3','e5')) as attribute
        |from
        |attr_event
        |""".stripMargin).show(10, false)

  }
  
  // 其实只要返回第一个值即可,发现问题:可能会用多个"e6"
  val firstTime = (event_list: mutable.WrappedArray[String], attribute: mutable.WrappedArray[String]) => {
    //判断首次出现的事件'e1','e3','e5'
    val event = event_list.map(data => {
      val strings = data.split("_")
      (strings(0), strings(1))
    })

    val array = for (i <- event; j <- attribute if (i._2 == j)) yield (i, j)
    // 按照 value 来去重获取第一次出现的值
    implicit var sort = Ordering[(String, String)].on[(String, (String, String))](t => t._2).reverse
    var set = mutable.TreeSet[((String, String), String)]()
    for (elem <- array) {
      set += elem
    }
    set.head._2
  }
  
  val firstTime1 = (event_list: mutable.WrappedArray[String], attribute: mutable.WrappedArray[String]) => {
    var tuple: (List[String], List[String]) = event_list.toList.map(_.split("_")(1)).span(_ != "e6")
    var temp: (List[String], List[String]) = tuple
    var strings = ListBuffer[String]()
    while (temp._2.nonEmpty) {
      if (temp._1.nonEmpty) {
        strings += temp._1.head
      }
      temp = temp._2.tail.span(_ != "e6")
    }
    strings
  }
  // 末次触点归因
  val Thelastattribution = (event_list: mutable.WrappedArray[String], attribute: mutable.WrappedArray[String]) => {
    var tuple: (List[String], List[String]) = event_list.toList.map(_.split("_")(1)).span(_ != "e6")
    var strings = ListBuffer[String]()
    while (tuple._2.nonEmpty) {
      if (tuple._1.nonEmpty) {
        //init 方法是返回除最后一个元素之外的全部元素
        strings += tuple._1.last
      }
      tuple = tuple._2.tail.span(_ != "e6")
    }
    strings
  }

  //线性归因
  val linearattribution = (event_list: mutable.WrappedArray[String], attribute: mutable.WrappedArray[String]) => {
    //要对每段的匹配到的去重
    var tuple: (List[String], List[String]) = event_list.toList.map(_.split("_")(1)).span(_ != "e6")
    var buffer = ListBuffer[(String, Int)]()
    while (tuple._2.nonEmpty) {
      if (tuple._1.nonEmpty) {
        val value = tuple._1.toSet
        val size = value.size
        for (elem <- value) {
          buffer += ((elem, 100 / size))
        }
      }
      tuple = tuple._2.tail.span(_ != "e6")
    }
    buffer
  }
  //时间衰减归因
  val decay = (event_list: mutable.WrappedArray[String], attribute: mutable.WrappedArray[String]) => {
    var tuple: (List[String], List[String]) = event_list.toList.map(_.split("_")(1)).span(_ != "e6")
    var buffer = ListBuffer[(String, Double)]()
    while (tuple._2.nonEmpty) {
      if (tuple._1.nonEmpty) {
        val list = tuple._1.toSet.toList
        val seq = for (i <- list.indices) yield Math.pow(0.9, i)
        val tuples1: immutable.Seq[(String, Double)] = for (j <- list.indices) yield (list(j), seq(j) / seq.sum)
        for (elem <- tuples1) {
          buffer += elem
        }
      }
      tuple = tuple._2.tail.span(_ != "e6")
    }
    buffer
  }
  //|guid|attribute                                                                       |
  //+----+--------------------------------------------------------------------------------+
  //|g02 |[[e3, 0.5263157894736842], [e1, 0.4736842105263158], [e1, 1.0]]                 |
  //|g01 |[[e1, 0.5263157894736842], [e3, 0.4736842105263158]]                            |
  //|g03 |[[e3, 0.36900369003690037], [e1, 0.33210332103321033], [e5, 0.2988929889298893]]|

  //Position-based (U-shaped) attribution
  //Same idea; a weighted sketch is given after this object

  //Last non-direct-touch attribution: just drop the last touch, e.g. list.init.last returns the second-to-last element
}
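A sketch of the position-based (U-shaped) variant mentioned above, in the same UDF style as the functions inside FactorAnalysis; the 40%/20%/40% split is an assumed convention, not something fixed by the project:

  // U-shaped attribution: first and last touches get 40% each, the middle touches share the remaining 20%
  // (weights are an assumed convention); register it with spark.udf.register like the other UDFs.
  val uShaped = (event_list: mutable.WrappedArray[String], attribute: mutable.WrappedArray[String]) => {
    var tuple: (List[String], List[String]) = event_list.toList.map(_.split("_")(1)).span(_ != "e6")
    var buffer = ListBuffer[(String, Double)]()
    while (tuple._2.nonEmpty) {
      if (tuple._1.nonEmpty) {
        val touches = tuple._1
        touches.size match {
          case 1 => buffer += ((touches.head, 1.0))
          case 2 => buffer += ((touches.head, 0.5)); buffer += ((touches.last, 0.5))
          case n =>
            buffer += ((touches.head, 0.4))
            buffer += ((touches.last, 0.4))
            for (elem <- touches.slice(1, n - 1)) buffer += ((elem, 0.2 / (n - 2)))
        }
      }
      tuple = tuple._2.tail.span(_ != "e6")
    }
    buffer
  }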

ODS -> DWD (business data analysis)

Sqoop

-- Sqoop installation omitted
-- import a MySQL table into a Hive table (if it errors, copy the hive common jar from hive/lib into sqoop's lib directory)
bin/sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \ -- this -D option is needed when there is no numeric primary key
--connect jdbc:mysql://linux03/db_demo1 \
--username root \
--password 123456 \
--table TEST \
--hive-import \
--hive-table db_demo1.tb_mysql \ -- the database is given here, no need to specify it again
--delete-target-dir \ -- delete the intermediate files
--as-textfile \   -- file format of the generated Hive data
--fields-terminated-by ',' \ 
--compress   \
--compression-codec gzip \ -- compression codec
--null-string '\\N' \    -- null handling
--null-non-string '\\N' \ 
--hive-overwrite \     -- whether to overwrite
--split-by ITEM_CODE \ -- column used to split the work
-m 2 -- two map tasks
-- query import: data first lands in the HDFS directory given by --target-dir
-- avoid complex queries where possible
-- $CONDITIONS must be included as a placeholder for the split conditions Sqoop appends
-- quoting: either double quotes outside with single quotes inside, or single quotes outside with $ escaped as \$ inside

bin/sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://linux03/db_demo1 \
--username root \
--password 123456 \
--hive-import \
--hive-table db_demo1.tb_geo1 \
--target-dir /mydata/temporary \
--as-textfile \
--fields-terminated-by ',' \
--compress   \
--compression-codec gzip \
--split-by geo \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-overwrite  \
--query " select * from geo   where  geo ='wx4g1n'  and  \$CONDITIONS"  \
-m 2

-- other options
-- selective import: --columns specifies the columns to import
-- incremental import by column / bounded by a timestamp
--check-column      -- the column examined to decide which rows to import
--incremental       -- two modes: append and lastmodified; append follows a growing key, lastmodified follows a
                    timestamp
--last-value        -- the largest value of the check column seen in the previous import
--last-value        -- '2020-03-18 23:59:59' bound the data by its modification timestamp
-- importing into a Hive partition
-- with --target-dir "...dt=''" Hive will not recognize the partition by itself; add it with alter table tablename add partition(dt='') location ''

-- import into a Hive partition, manually: alter table db_demo1.geo1 add partition (dt='2020-12-23');
-- target-dir -- the result is written to this directory; if it is not under the Hive table's location, point the partition's location to the HDFS path manually
bin/sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://linux03/db_demo1 \
--username root \
--password 123456 \
--target-dir /user/hive/warehouse/db_demo1.db/geo1/dt=2020-12-23 \ 
--delete-target-dir \
--as-textfile \
--fields-terminated-by '\001' \
--compress   \
--compression-codec gzip \
--split-by geo \
--null-string '\\N' \
--null-non-string '\\N' \
--query " select * from geo   where  geo ='wx4g1n'  and  \$CONDITIONS"  \
-m 2

Incremental tables and full tables
Incremental table
Characteristic: each partition contains exactly the records the business system updated that day.
How to import it from the business system:
Method 1: take the rows after a given timestamp (i.e. the rows updated today).
Method 2: use a monotonically increasing column; this suits append-only source tables.

Full table
Why not export the entire table from the business system every day? Because that table may be very large, while the data platform is distributed, so merging there puts much less pressure on the source system.
Full join today's incremental data with the full table: rows that match take the incremental version; rows that do not match are new records or records without updates. Write the join result into today's partition. Whether to keep yesterday's partition of the full table is a choice: keeping it gives you daily snapshots, or the full table can be maintained with the quasi-zipper approach, which reduces the data volume. A sketch of this merge follows the next paragraph.

A detailed discussion of full vs. incremental tables can be found in 《大数据系统构建原理与最佳实践》.
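A minimal sketch of that merge, assuming a SparkSession `spark` with Hive support; the tables dwd_orders_full (the full table, partitioned by dy) and ods_orders_inc (the daily increment) and their columns are illustrative:

// Sketch only: rows present in today's increment take the newer values, everything else is carried over unchanged.
spark.sql(
  """
    |insert overwrite table dwd_orders_full partition(dy = '2020-12-21')
    |select
    |  coalesce(inc.id,          his.id)          as id,
    |  coalesce(inc.order_money, his.order_money) as order_money,  -- matched rows take the incremental value
    |  coalesce(inc.update_time, his.update_time) as update_time
    |from      (select * from dwd_orders_full where dy = '2020-12-20') his  -- yesterday's full snapshot
    |full join (select * from ods_orders_inc  where dy = '2020-12-21') inc  -- today's increment
    |on his.id = inc.id
    |""".stripMargin)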

-- an example of a quasi-zipper table
-- option 1:
-- the zipper table as of the 20th
create table dwd_append_zip(
id  int,
name string,
age  int,
nick string,
start_dt  string,
end_dt    string
)
partitioned by (dt string)
row format delimited fields terminated by ','

vi zip.20
1,zs,24,ss,2020-12-20,9999-12-31 
2,ww,25,ww,2020-12-20,9999-12-31 
3,tq,22,qq,2020-12-20,9999-12-31 

load data local inpath '/root/zip.20' into table dwd_append_zip partition(dt='2020-12-20')

-- incremental data for the 21st
create table ods_append(
id  int,
name string,
age  int,
nick string
)
partitioned by (dt string)
row format delimited fields terminated by ','


vi append.21
2,ww,25,zz
4,wb,24,bb

load data local inpath '/root/append.21' into table ods_append partition(dt='2020-12-21')

-- join rule: rows that existed yesterday carry their values forward; rows that only appear today are new rows

with y as ( -- yesterday's zipper table
select 
*
from 
dwd_append_zip 
where dt='2020-12-20'
 )
select
nvl(x.id,y.id) as id,
nvl(x.name,y.name) as name,
nvl(x.age,y.age) as age,
nvl(x.nick,y.nick) as nick,
nvl(y.start_dt,'2020-12-21') as start_dt,
case 
   -- already-closed versions keep their end_dt
   when y.end_dt != '9999-12-31'  then y.end_dt
   -- open yesterday and present again today
   when y.end_dt = '9999-12-31' and x.id is not null  then '9999-12-31'
   -- open yesterday, absent today: leave it as it is
   when y.end_dt = '9999-12-31' and x.id is null then y.end_dt
   -- absent yesterday, present today: a new row
   when y.end_dt is null and x.id is not null then '2020-12-21'
   end 
   as end_dt
from
(select * from ods_append where dt='2020-12-21') x
full join y
on x.id = y.id 

-- option 2:
-- all of today's incremental rows are appended as new open versions
-- rows whose id appears today get their open version closed with yesterday's partition date; already-closed rows keep their end_dt

with a as (
  SELECT
    id,
    name,
    age,
    nick,
    start_dt,
    end_dt,
    dt
  FROM
    dwd_append_zip
  where
    (dt = '2020-12-20')
),
b as (
  SELECT
    id,
    name,
    age,
    nick,
    dt
  FROM
    ods_append
  WHERE
    (dt = '2020-12-21')
)
SELECT
  a.id,
  a.name,
  a.age,
  a.nick,
  a.start_dt,
  if(
    b.id is not null
    and a.end_dt = '9999-12-31',
    a.dt,
    a.end_dt
  ) as end_dt
FROM
  a
  left join b on a.id = b.id
UNION ALL
SELECT
  id,
  name,
  age,
  nick,
  dt as start_dt,
  '9999-12-31' as end_dt
FROM
  b

DWD -> DWS (business data analysis)

Report analysis examples

- Repeat-purchase rate analysis
-- needs two tables: the order header table order and the order-item table order_goods
-- order id, user id, goods id, category id
with x as (
  select
    t1.order_id,
    t1.use_id,
    t2.good_id,
    t2.type_id
  from
    (
      select
        order_id,
        use_id
      from
        order
      where
        dy = '2020-12-12'
    ) t1
    join order_goods t2 on t1.order_id = t2.order_id
)
select
  '2020-12-12' as dy,
  -- reporting date
  type_id,
  -- category
  count(use_id), -- how many users bought this category (the inner query already produced one row per user via group by use_id)
  count(if(user_cnt >= 2, 1, null)) as time_2,
  -- users who bought this category at least twice
  count(if(user_cnt >= 3, 1, null)) as time_3
from
  (
    select
      -- 一个用户购买多少该品类
      type_id,
      use_id,
      count (distinct order_id) as user_cnt 
      -- 为什么要对order_id 去重因为一个品类一条数据,一个用户一次可能购买多个品类
    from
      x
    group by
      use_id,
      type_id
  ) o
group by
  o.type_id


-- Complex report analysis
-- Report columns: shop, province, month, monthly sales, total sales of all shops in the same province and month, the shop's cumulative sales up to that month
-- data
shop  province        -> provincetable
shop  month  sale     -> shoptable


with y as (
  with x as (
    select
      a.shop,
      max(b.province) as province,
      a.month,
      sum(a.sale) as sale_cnt
    from
      shoptable a
      join provincetable b on a.shop = b.shop
    group by
      a.shop,
      a.month
  )
  select
    -- 增加字段累计金额
    shop,
    province,
    month,
    sale_cnt,
    sum(sale_cnt) over(
      partition by shop
      order by
        month rows between unbounded preceding
        and current row
    ) as shop_account_cnt
  from
    x
) -- 按地区和月份进行分组,求同地区同月份下所有店铺的销售金额
select
  y.shop,
  y.month,
  y.province,
  y.sale_cnt,
  z.province_sale,
  y.shop_account_cnt
from
  (
    select
      province,
      month,
      sum(s.sale) as province_sale
    from
      shoptable s
      join provincetable p on s.shop = p.shop
    group by
      s.month,
      p.province
  ) z
  join y on z.province = y.province
  and z.month = y.month
  
|shop|month|province|sale_cnt|province_sale|shop_account_cnt|
|c   |1    |陕西    |100     |100          |100             |
|c   |2    |陕西    |100     |100          |200             |
|c   |3    |陕西    |100     |100          |300             |
|b   |1    |湖北    |200     |500          |200             |
|b   |2    |湖北    |200     |1300         |400             |
|b   |3    |湖北    |300     |400          |700             |
|a   |1    |湖北    |300     |500          |300             |
|a   |2    |湖北    |1100    |1300         |1400            |
|a   |3    |湖北    |100     |400          |1500            |

User behavior profile

Order tags
-- source tables: order header table and order consignee/detail table
-- order, order_desc
-- important considerations: whether to deduplicate, the where clause (partition field), ...
-- consider materializing temp as a separate table to make the later computations easier
with temp as (
select
  a.order_id   ,-- 订单ID
  a.order_date ,-- 订单日期
  a.user_id    , -- 用户ID
  a.order_money , -- 订单金额(应付金额)
  a.order_status , -- order status (6: returned, 7: rejected)
  a.pay_type , -- 订单支付类型
  b.area_name ,-- 收货人地址
  b.address ,-- 手工地址
  b.coupen_money -- 代金券金额
from 
  order a join order_desc b on a.user_id = b.user_id
)

-- 模块一
select 
  user_id , -- 用户
  min(order_date) , -- 首单
  max(order_date) , -- 末单日期
  datediff('2020-12-12',min(order_date)) , -- 首单距今时间
  datediff('2020-12-12',max(order_date)) , -- 末单距今时间
  count(if(datediff('2020-12-12',order_date) <=30,1,null)), -- 最近三十天购买次数
  sum(if(datediff('2020-12-12',order_date) <=30,order_money,0)), -- 最近三十天购买的金额
  count(if(datediff('2020-12-12',order_date) <=60,1,null)), -- 最近六十天购买次数
  sum(if(datediff('2020-12-12',order_date) <=60,order_money,0)), -- 最近六十天购买的金额
  count(if(datediff('2020-12-12',order_date) <=90,1,null)), -- 最近九十天购买次数
  sum(if(datediff('2020-12-12',order_date) <=90,order_money,0)), -- 最近九十天购买的金额
  min(order_money) , -- 最小的金额
  max(order_money) , -- 最大的金额
  count(if(order_status != '6' and order_status != '7',1,null)) , -- 累计消费次数(不含推拒),注意是6.7是字符串
  sum(if(order_status != '6' and order_status != '7',order_money,0)) , -- 累计代金券金额(不含推拒)
  avg(order_money) , -- 平均订单金额
  avg(if(datediff('2020-12-12',order_date) <=90,order_money,null)) -- 最近九十天的平均的订单的金额           
from
  temp
group by user_id -- 用户画像标签计算


-- 模块二:常用地址
select 
  user_id, 
  common_address -- 常用地址
from 
(
select 
  user_id  ,-- 用户
  concat_ws(' ',address,area_name ) as common_address  ,-- 常用地址(可能为null)
  row_number() over(partition by user_id order by count(1) desc) as rn
from
  temp 
group by user_id,concat_ws(' ',address,area_name )
) o 
where o.rn = 1

-- 模块三:常用支付方式
select
  user_id, -- user ID
  pay_type -- most common payment type
from 
(
select 
  user_id,
  pay_type,
  row_number() over(partition by user_id order by num_pay desc) as rn
from 
(
select 
  user_id,
  pay_type,
  count(1) as num_pay
from
  temp
group by user_id,pay_type
) o 
) o1
where rn = 1 

-- 模块四:购物车
select 
  user_id , -- 用户ID
  count(1),      --最近30天加购次数
  sum(number) ,-- 最近30天加购商品件数
  sum(if(submit_time is not null ,number ,0)) -- 最近30天提交次数
from
  cart -- 购物车信息表
where datediff('2020-12-12',add_time) <= 30 -- 先过滤
group  by user_id

-- 模块五:可能有的用户订单购物车某一部分没有,考虑是否这两张表的uid作为数据链接的基础
with temp5 as (
select  user_id from order 
union 
select  user_id from cart
)

-- 整合五个模块即可求出

Preference tags

-- user shopping-preference tags
-- joins three tables
-- order table order, order-item table orders_good, goods description table goods
-- build an intermediate table to store the three-way join result

create table tag(
user_id string ,
first_cat_name  string ,
second_cat_name string ,
third_cat_name string,
brand_id_name  string
)
stored as parquet tblproperties("parquet.compress" = "snappy")

insert  into  tag 
select 
  order.user_id, -- 用户ID
  goods.first_cat_name, -- 一类标签
  goods.second_cat_name,
  goods.third_cat_name,
  goods.brand_id_name -- 商品标签
from 
  order 
join orders_good on order.order_id = orders_good.order_id
join goods on  orders_good.goods_id = goods.sku_id

with tmp1 as (
select 
  user_id,
  first_cat_name
from 
(
select 
  user_id,
  first_cat_name,
  row_number() over(partition by user_id order by count(1) desc) as rn -- tag counts
from
  tag
group by user_id,first_cat_name
) o
where rn = 1
),
tmp2 as (
select 
  user_id,
  second_cat_name
from 
(
select 
  user_id,
  second_cat_name,
  row_number() over(partition by user_id order by count(1) desc) as rn 
from
  tag
group by user_id,second_cat_name
) o
where rn = 1
),
tmp3 as (
select 
  user_id,
  third_cat_name
from 
(
select 
  user_id,
  third_cat_name,
  row_number() over(partition by user_id order by count(1) desc) as rn 
from
  tag
group by user_id,third_cat_name
) o
where rn = 1
),
tmp4 as (
select 
  user_id,
  brand_id_name
from 
(
select 
  user_id,
  brand_id_name,
  row_number() over(partition by user_id order by count(1) desc) as rn 
from
  tag
group by user_id,brand_id_name
) o
where rn = 1
)

select 
  tmp1.user_id, -- 用户ID
  tmp1.first_cat_name as common_first_cat, -- 最常购买的一类标签
  tmp2.second_cat_name as common_second_cat,
  tmp3.third_cat_name as common_third_cat, -- 最常购买的三类标签
  tmp4.brand_id_name as common_brand_id -- 最常购买的标签
from 
  tmp1 
join  tmp2 on tmp1.user_id = tmp2.user_id
join  tmp3 on tmp1.user_id = tmp3.user_id
join  tmp4 on tmp1.user_id = tmp4.user_id

Task Scheduling

Azkaban

-- installation omitted
-- the .job file contains:
type=command
command=/home/atguigu/bin/dwd_to_dws.sh ${dt}
dependencies=ods_to_dwd_db,ods_to_dwd_start_log

-- Handling common situations
For the scheduled run, leave the value of the dt parameter empty.
If a run is interrupted, fix the script and click Prepare Execution on the web page to continue from where it stopped.

Metadata Management

Atlas

Installation omitted.
For Hive SQL jobs, table- and column-level lineage can be captured cleanly through Atlas's Hive hook; for Spark SQL jobs, the spark-atlas-connector only provides table-level lineage.

User Profile

Naive Bayes example

-- Example 1: simple image recognition: build the feature matrix (omitted)

-- Example 2: predicting infidelity: turn the raw features into numeric features
object CheatPredict {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("出轨预测")
      .master("local[*]")
      .getOrCreate()
    //加载原始样本数据
    spark.read.option("header", "true").csv("C:\\Users\\hp\\IdeaProjects\\TiTan\\userprofile\\src\\main\\resources\\sample.csv").selectExpr("name", "job", "cast(income as double) as income", "age", "sex", "label").createTempView("simple")
    //加载测试数据
    spark.read.option("header", "true").csv("C:\\Users\\hp\\IdeaProjects\\TiTan\\userprofile\\src\\main\\resources\\test.csv").selectExpr("name", "job", "cast(income as double) as income", "age", "sex").createTempView("test")
    //原始数据的特征,加工成数值特征
   spark.sql(
      """
        |select
        |name as name ,
        |cast(
        |case job
        |     when '程序员'  then 0.0
        |     when '老师'    then 1.0
        |     when '公务员'  then 2.0
        |     end as double ) as job,
        |
        |cast(
        |case
        |  when income<10000                   then  0.0
        |  when income>=10000 and income<20000 then  1.0
        |  when income>=20000 and income<30000 then  2.0
        |  when income>=30000 and income<40000 then  3.0
        |  else 4.0
        |end as double) as income,
        |
        |cast(
        |case
        |  when  age='青年' then 0.0
        |  when  age='中年' then 1.0
        |  when  age='老年' then 2.0
        |  end as double ) as age,
        |
        |cast(if(sex='男',1,0) as double) as sex,
        |cast(if(label='出轨',0.0,1.0)  as double) as label  -- 标签
        |from
        |simple
        |""".stripMargin)
    .createTempView("simpledata")

    spark.sql(
      """
        |select
        |name as name ,
        |cast(
        |case  job
        |   when  '老师' then 0.0
        |   when '程序员' then 1.0
        |   when '公务员' then 2.0
        |   end as double
        |) as job ,
        |
        |cast(
        |case
        |  when income<10000                   then  0.0
        |  when income>=10000 and income<20000 then  1.0
        |  when income>=20000 and income<30000 then  2.0
        |  when income>=30000 and income<40000 then  3.0
        |  else 4.0
        |end as double) as income ,
        |cast(
        |case
        |  when  age='青年' then 0.0
        |  when  age='中年' then 1.0
        |  when  age='老年' then 2.0
        |  end as double ) as age,
        |
        |cast(if(sex='男',1,0) as double) as sex
        |from
        |test
        |""".stripMargin)
    .createTempView("testdata")


    // 将数值化的特征数据,向量化!(把特征转成特征向量 Vector=>DenseVector密集型向量 , SparseVector 稀疏型向量)
    val arr_Vector = (arr: mutable.WrappedArray[Double]) => {
      Vectors.dense(arr.toArray)
    }

    spark.udf.register("arr_vector", arr_Vector)

    val simple = spark.sql(
      """
        |select
        |name,
        |arr_vector(array(job,income,age,sex)) as features,
        |label
        |from
        |simpledata
        |""".stripMargin)

    val test = spark.sql(
      """
        |select
        |name,
        |arr_vector(array(job,income,age,sex)) as features
        |from
        |testdata
        |""".stripMargin)


    //算法
    // 构建对象
    val bayes = new NaiveBayes()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setSmoothing(0.01) // 拉普拉斯平滑系数
      .setPredictionCol("cheat")

    val model = bayes.fit(simple)
    //持久化训练集
    model.save("C:\\Users\\hp\\IdeaProjects\\TiTan\\userprofile\\src\\main\\resources\\mode")
    val naiveBayesModel = NaiveBayesModel.load("C:\\Users\\hp\\IdeaProjects\\TiTan\\userprofile\\src\\main\\resources\\mode")
    val frame = naiveBayesModel.transform(test)
    frame.show(10, false)
  }
}

KNN example

//label,f1,f2,f3,f4,f5 simple
//id,f1,f2,f3,f4,f5 test
object KNNdemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName(this.getClass.getSimpleName).getOrCreate()
    val schema = new StructType()
      .add("label", DataTypes.DoubleType)
      .add("f1", DataTypes.DoubleType)
      .add("f2", DataTypes.DoubleType)
      .add("f3", DataTypes.DoubleType)
      .add("f4", DataTypes.DoubleType)
      .add("f5", DataTypes.DoubleType)
    //测试数据
    val simpleframe = spark.read.schema(schema).option("header", true.toString).csv("userprofile/src/main/resources/KNN/simple.csv")
    simpleframe.createTempView("simple")
    val schema1 = new StructType()
      .add("id", DataTypes.DoubleType)
      .add("f1", DataTypes.DoubleType)
      .add("f2", DataTypes.DoubleType)
      .add("f3", DataTypes.DoubleType)
      .add("f4", DataTypes.DoubleType)
      .add("f5", DataTypes.DoubleType)

    val testframe = spark.read.schema(schema1).option("header", true.toString).csv("userprofile/src/main/resources/KNN/test.csv")
    testframe.createTempView("test")


    import org.apache.spark.sql.functions._
    //计算欧氏距离的函数
    val eudi: UserDefinedFunction = udf(
      (arr1: mutable.WrappedArray[Double], arr2: mutable.WrappedArray[Double]) => {
        val v1 = Vectors.dense(arr1.toArray)
        val v2 = Vectors.dense(arr2.toArray)
        Vectors.sqdist(v1, v2)
      }
    )
    //计算欧氏距离的函数
    val eudi1 = (arr1: mutable.WrappedArray[Double], arr2: mutable.WrappedArray[Double]) => {
      val data: Double = arr1.zip(arr2).map(it => Math.pow((it._2 - it._1), 2)).sum
      1 / (Math.pow(data, 0.5) + 1)
    }
    spark.udf.register("eudi1", eudi1)
    //笛卡尔积
    spark.sql(
      """
        |select
        |a.id,
        |b.label,
        |eudi1(array(a.f1,a.f2,a.f3,a.f4,a.f5),array(b.f1,b.f2,b.f3,b.f4,b.f5)) as dist
        |from
        |test  a cross join simple b
        |""".stripMargin).createTempView("tmp")

    //|id |label|dist                |
    //+---+-----+--------------------+
    //|1.0|0.0  |0.17253779651421453 |
    //|2.0|0.0  |0.11696132920126338 |
    //|3.0|0.0  |0.0389439561817535  |
    //|4.0|0.0  |0.03583298491583323 |
    //|5.0|0.0  |0.03583298491583323 |
    //|1.0|0.0  |0.18660549686337075 |
    //|2.0|0.0  |0.11189119247086728 |
    spark.sql(
      """
        |select
        |id,label
        |from
        |(
        |select
        |id,label,
        |row_number() over(partition by id order by dist desc  ) as rn
        |from
        |tmp
        |) o
        |where rn = 1
        |""".stripMargin).show(100, false)

    //|id |label|
    //+---+-----+
    //|1.0|1.0  |
    //|4.0|0.0  |
    //|3.0|0.0  |
    //|2.0|1.0  |
    //|5.0|0.0  |
  }
}

TF-IDF example

//TF: term frequency
//IDF: inverse document frequency, log(totalDocs / (1 + docsContainingTheTerm))
object TFIDFdemo {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().master("local[*]").appName(this.getClass.getSimpleName).getOrCreate()
    import spark.implicits._
    import org.apache.spark.sql.functions._
    //分别读取三类评价数据
    //添加标签
    val frame0 = spark.read.textFile("userprofile/src/main/resources/TFIDF/good.txt").selectExpr("value as cmt", "cast(0.0 as double) as label")
    val frame1 = spark.read.textFile("userprofile/src/main/resources/TFIDF/general.txt").selectExpr("value as cmt", "cast(1.0 as double) as label")
    val frame2 = spark.read.textFile("userprofile/src/main/resources/TFIDF/poor.txt").selectExpr("value as cmt", "cast(2.0 as double) as label")
    //读取停止数据
    val framestop: Array[String] = spark.read.textFile("userprofile/src/main/resources/TFIDF/poor.txt").collect()
    //广播停止数据
    val broadcastvalue = spark.sparkContext.broadcast(framestop)
    val frame3 = frame0.union(frame1).union(frame2)
    //持久化
    frame3.cache()
    val hanrdd: Dataset[(mutable.Seq[String], Double)] = frame3.map(row => {
      val str = row.getAs[String]("cmt")
      val label = row.getAs[Double]("label")
      (str, label)
      //一个分区的数据
    }).mapPartitions(data => {
      val value = broadcastvalue.value
      //一个分区的一行数据
      data.map(txt => {
        import scala.collection.JavaConversions._
        //HanLP中文分词器
        val words: mutable.Seq[String] = HanLP.segment(txt._1).map(_.word).filter(st => (!value.contains(st)) && st.length >= 2)
        (words, txt._2)
      })
    })
    val dataFrame = hanrdd.toDF("words", "label")
    //hash映射
    val tf = new HashingTF()
      .setInputCol("words")
      .setNumFeatures(100000)
      //输出字段
      .setOutputCol("tf_vec")
    val dataFrame1 = tf.transform(dataFrame)
    // 用idf算法,将上面tf特征向量集合变成 TF-IDF特征值向量集合
    val idf = new IDF()
      .setInputCol("tf_vec")
      .setOutputCol("tf_idf_vec")
    //fit得到一个模型
    val iDFModel = idf.fit(dataFrame1)
    val tfidfVecs = iDFModel.transform(dataFrame1)
   tfidfVecs.show(100,false)
       //|[好评]            |0.0  |(100000,[10695],[1.0])                    |(100000,[10695],[1.791759469228055])                                                  |
    //|[不错, ....]      |0.0  |(100000,[42521,70545],[1.0,1.0])          |(100000,[42521,70545],[2.1972245773362196,2.1972245773362196])                        |
    //|[红包]            |0.0  |(100000,[10970],[1.0])                    |(100000,[10970],[2.1972245773362196]) 
      
    //将总的数据分区测试集合样本集
    val array = tfidfVecs.randomSplit(Array(0.8, 0.2))
    val train = array(0)
    val test = array(1)

    //训练
    // 训练朴素贝叶斯模型
    val bayes = new NaiveBayes()
      .setLabelCol("label")
      .setFeaturesCol("tf_idf_vec")
      .setSmoothing(1.0)
      .setModelType("multinomial")
    //模型
    val model = bayes.fit(train)
    //测试模型的效果
    val frame = model.transform(test)
  }
}

Logistic regression example

//churn prediction
//one difference from Naive Bayes: the feature values in logistic regression carry quantitative meaning, whereas Naive Bayes features are just discrete labels with no notion of magnitude
object Logistic {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    val spark = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
    val simple = spark.read.option("header", "true").option("inferSchema", true).csv("userprofile/src/main/resources/Logistic/loss_predict.csv")

    //将数组转为向量
    val arr2Vec = (arr: mutable.WrappedArray[Double]) => Vectors.dense(arr.toArray)

    //label,gid,3_cs,15_cs,3_xf,15_xf,3_th,15_th,3_hp,15_hp,3_cp,15_cp,last_dl,last_xf
    import spark.implicits._
    spark.udf.register("arr2Vec", arr2Vec)
    //将样本数据向量化
    val frame = simple.selectExpr("label", "arr2Vec(array(`3_cs`,`15_cs`,`3_xf`,`15_xf`,`3_th`,`15_th`,`3_hp`,`15_hp`,`3_cp`,`15_cp`,last_dl,last_xf)) as vec")
    val minMaxScaler = new MinMaxScaler()
      .setInputCol("vec")
      .setOutputCol("features")
    //生成模型.训练样本
    val maxScalerModel = minMaxScaler.fit(frame)
    val minMaxFrame = maxScalerModel.transform(frame).drop("vec")
    minMaxFrame.show(100, false)
    //|0.0 |[0.6666666666666666,0.9354838709677419,0.3333333333333333,0.7692307692307693,0.0,0.3333333333333333,0.7142857142857142,0.7894736842105263,0.3333333333333333,0.5,0.0,0.0]                  |
    //|0.0 |[0.7777777777777777,0.967741935483871,0.4444444444444444,0.7692307692307693,0.5,0.6666666666666666,0.7857142857142857,0.8421052631578947,0.6666666666666666,0.75,0.0,0.0]                  |
    //|0.0 |[0.8888888888888888,0.9032258064516129,0.6666666666666666,0.8461538461538463,1.0,0.6666666666666666,0.8571428571428571,0.894736842105263,0.3333333333333333,0.25,0.0,0.07692307692307693]  |

    //构建逻辑回归算法
    val logisticRegression = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
    val  Array(train,test) = minMaxFrame.randomSplit(Array(0.8, 0.2))

    val logisticRegressionModel = logisticRegression.fit(train)
    val frame1: DataFrame = logisticRegressionModel.transform(test)
    frame1.select("label","prediction").show(100,false)
    //|label|prediction|
    //+-----+----------+
    //|0.0  |0.0       |
    //|0.0  |0.0       |
    //|1.0  |1.0       |

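    // A sketch of the accuracy check using Spark ML's MulticlassClassificationEvaluator
    // ("accuracy" is one of its built-in metric names); the local import is only for illustration.
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    val accuracy = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
      .evaluate(frame1)
    println(s"accuracy = $accuracy")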
    //accuracy = rows where label = prediction, divided by the total row count
  }
}

Model evaluation examples

1. Confusion Matrix:
Take binary classification, the simplest case: the model must decide whether each sample is 0 or 1, i.e. positive or negative.
From the collected samples we already know the ground truth: which records are positive and which negative. Running the same samples through the classifier also tells us which records the model regards as positive and which as negative.
This gives four basic counts, which I call the first-level (lowest-level) metrics:
the number of samples whose true value is positive and which the model calls positive (True Positive = TP)
the number of samples whose true value is positive and which the model calls negative (False Negative = FN)
the statistical Type II error
the number of samples whose true value is negative and which the model calls positive (False Positive = FP)
the statistical Type I error
the number of samples whose true value is negative and which the model calls negative (True Negative = TN)
Laying these four counts out in one table gives a matrix known as the confusion matrix.

2. ROC curve
Precision: P = TP / (TP + FP), the fraction of the samples predicted positive that are truly positive.
Recall: R = TP / (TP + FN), the fraction of the truly positive samples that are predicted positive.
Sensitivity (True Positive Rate, TPR): TPR = TP / (TP + FN), which is the same definition as recall.
1 - specificity (False Positive Rate, FPR): FPR = FP / (FP + TN).
Plotting TPR on the y axis against FPR on the x axis while sweeping the decision threshold produces the ROC curve.
The area under the ROC curve is called the AUC (Area Under Curve); geometrically, the larger the area under the curve, the better the model.

3. Regression evaluation metrics
RMSE (Root Mean Square Error)
Measures the deviation between predicted and observed values.
Commonly used as the yardstick for a model's predictions.

MSE (Mean Square Error)
The mean of the squared differences between true and predicted values.
The squared form is easy to differentiate, so it is often used as the loss function of linear regression.

MAE (Mean Absolute Error)
The mean of the absolute errors.
It reflects the actual magnitude of the prediction errors more directly.

SD (Standard Deviation)
The square root of the variance.
It measures how spread out a set of values is.
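For reference, the formulas behind these metrics for n samples with true values y_i, predictions ŷ_i and mean ȳ:

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2,\qquad \mathrm{RMSE}=\sqrt{\mathrm{MSE}},\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert,\qquad \mathrm{SD}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2}$$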

OLAP

HBase row key design

1. Mitigating region hotspots
Reverse: for fixed-length row keys, store them reversed so the frequently changing part comes first, which effectively randomizes the row key.
Salting: prepend a random prefix to each row key so the rows are scattered across different regions, balancing the region load.
Hashing / mod: using a hash prefix instead of a random salt has the advantage that a given row always gets the same prefix, so the region load is spread while reads can still locate the row.
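A minimal sketch of the three techniques as plain string manipulation (no HBase client involved); the bucket count and key layout are illustrative choices:

import java.security.MessageDigest

object RowKeyDesign {
  private val buckets = 16

  // Reverse: put the frequently changing tail (e.g. a phone number suffix) first
  def reverseKey(key: String): String = key.reverse

  // Salt: prepend a random bucket prefix; scatters writes, but scans must touch every bucket
  def saltKey(key: String): String = {
    val salt = scala.util.Random.nextInt(buckets)
    f"$salt%02d-$key"
  }

  // Hash/mod: a deterministic prefix, so the same key always maps to the same bucket
  def hashKey(key: String): String = {
    val md5 = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"))
    val bucket = (md5(0) & 0xff) % buckets
    f"$bucket%02d-$key"
  }

  def main(args: Array[String]): Unit = {
    val rowkey = "13800001234_20201217"
    println(reverseKey(rowkey))
    println(saltKey(rowkey))
    println(hashKey(rowkey))
  }
}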

Kylin

1. Cube build process
Build an intermediate flat table (Hive table): join the Model's fact table and lookup tables into one large flat Hive table.
Redistribute the flat Hive table.
Extract the distinct values of each dimension from the fact table.
Dictionary-encode all dimension tables to build the dimension dictionaries.
Compute and save all dimension combinations; each combination is called a Cuboid.
Create the HTable.
Build the base Cuboid.
Build the Cuboids from N dimensions down to 0 dimensions with the chosen algorithm.
Build the Cube.
Convert the Cuboid data into HFiles.
Bulk-load the HFiles into the HBase table.
Update the Cube metadata.
Clean up Hive.

2. Bitmaps in Kylin
Exact counting: Bitmap
For example, the set [2,3,5,8] corresponds to the bitmap array [001101001]. An Integer is 32 bits, so a bitmap that can hold up to Integer.MAX_VALUE values needs at least 2^32 bits, which is 512 MB (2^32 / 8 / 1024 / 1024). The obvious problem with such a bitmap: whether it holds 1 element or 4 billion elements, it always occupies 512 MB.
Roaring bitmap: https://github.com/lemire/RoaringBitmap

3. See the documentation: http://kylin.apache.org/docs/

Presto

See the official documentation: https://prestodb.io/docs/current/

SuperSet

Installation tutorial: https://www.jianshu.com/p/36a7e1cf97b5

See the official documentation: https://superset.apache.org/
