E-commerce Metrics Project: Real-Time Channel Hotspot Analysis Development

1. Business Overview

Channel hotspot analysis counts the number of times each channel is visited (clicked).

The analysis produces data like the following:

Channel ID   Visits
Channel 1    128
Channel 2    401
Channel 3    501

Historical click counts need to be accumulated into the totals.

2. Business Development

Steps

  1. Create a case class dedicated to the real-time hotspot calculation
  2. Map the preprocessed data into the case class holding the fields to analyze (channel, visit count)
  3. Group (partition) by channel
  4. Divide into time windows (one window every 3 seconds)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelRealHotTask singleton object
  2. Add a ChannelRealHot case class that wraps the two business fields to compute: channel ID (channelid) and visit count (visited)
  3. Write a process method in ChannelRealHotTask that accepts the preprocessed DataStream
  4. Use the map operator to convert each ClickLogWide into a ChannelRealHot
  5. Key the stream by channel ID
  6. Divide into time windows (one window every 3 seconds)
  7. Run a reduce to merge the counts
  8. Sink the merged data to HBase
    • Check whether a result record already exists in HBase
    • If it exists, read it and add the new count to it
    • If not, write the value directly
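The read-accumulate-write pattern in step 8 can be sketched with an in-memory stand-in for HBase (the `MockHBase` object and its simplified `getData`/`putData` signatures are hypothetical; the real `HBaseUtil` also takes a table name and column family):

```scala
import scala.collection.mutable

// In-memory stand-in for the HBase table (hypothetical mock; the real helper
// issues Get/Put calls against HBase with table and column-family arguments).
object MockHBase {
  private val store = mutable.Map[String, String]()
  def getData(rowkey: String, column: String): String =
    store.getOrElse(s"$rowkey/$column", "")
  def putData(rowkey: String, column: String, value: String): Unit =
    store(s"$rowkey/$column") = value
}

// Read-accumulate-write: if a result record exists, add the new count to it;
// otherwise write the new value directly.
def upsertVisited(rowkey: String, delta: Long): Long = {
  val existing = MockHBase.getData(rowkey, "visited")
  val total = if (existing.isEmpty) delta else existing.toLong + delta
  MockHBase.putData(rowkey, "visited", total.toString)
  total
}
```

Note this read-modify-write is not atomic; the real sink tolerates that because each key is processed by a single Flink subtask.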
package com.xu.realprocess.task

import com.xu.realprocess.bean.ClickLogWide.ClickLogWide
import com.xu.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

case class ChannelRealHot(var channelid: String, var visited: Long)

/**
  * Channel hotspot analysis
  *
  * 1. Field conversion
  * 2. Grouping
  * 3. Time window
  * 4. Aggregation
  * 5. Sink to HBase
  */
object ChannelRealHotTask {

  def process(clickLogWideDataStream: DataStream[ClickLogWide]) = {

    // 1. Field conversion: channelid, visited
    val realHotDataStream: DataStream[ChannelRealHot] = clickLogWideDataStream.map {
      clickLogWide: ClickLogWide =>
        ChannelRealHot(clickLogWide.channelID, clickLogWide.count)
    }

    // 2. Grouping
    val keyedStream: KeyedStream[ChannelRealHot, String] = realHotDataStream.keyBy(_.channelid)

    // 3. Time window
    val windowedStream: WindowedStream[ChannelRealHot, String, TimeWindow] = keyedStream.timeWindow(Time.seconds(3))

    // 4. Aggregation
    val reduceDataStream: DataStream[ChannelRealHot] = windowedStream.reduce {
      (t1: ChannelRealHot, t2: ChannelRealHot) =>
        ChannelRealHot(t1.channelid, t1.visited + t2.visited)
    }

    // 5. Sink to HBase
    reduceDataStream.addSink(new SinkFunction[ChannelRealHot] {

      override def invoke(value: ChannelRealHot): Unit = {

        // HBase-related names
        val tableName = "channel"
        val clfName = "info"
        val channelIdColumn = "channelId"
        val visitedColumn = "visited"
        val rowkey = value.channelid

        // Query HBase for an existing record
        val visitedValue: String = HBaseUtil.getData(tableName, rowkey, clfName, visitedColumn)
        // Temporary variable for the running total
        var totalCount: Long = 0

        if (StringUtils.isBlank(visitedValue)) {
          totalCount = value.visited
        } else {
          totalCount = visitedValue.toLong + value.visited
        }

        // Save the data
        HBaseUtil.putMapData(tableName, rowkey, clfName, Map(
          channelIdColumn -> value.channelid,
          visitedColumn -> totalCount.toString
        ))
      }
    })
  }

}

三 Real-Time Channel PV/UV Analysis

Analyze channel PV and UV across different time dimensions. There are three dimensions:

  • Hour
  • Day
  • Month

3.1 Business Overview

PV (page views)

Page View: every page refresh counts as one view.

UV (unique visitors)

Unique Visitor: within a given time range, the same client is counted only once.

The statistics produce data like the following:

Channel ID   Time         PV     UV
Channel 1    2017010116   1230   350
Channel 2    2017010117   1251   330
Channel 3    2017010118   5512   610

3.2 Hour-Dimension PV/UV Development

Steps

  1. Create a channel PV/UV case class
  2. Map the preprocessed data into the case class holding the fields to analyze (channel, PV, UV)
  3. Group (partition) by channel and time
  4. Divide into time windows (one window every 3 seconds)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelPvUvTask singleton object
  2. Add a ChannelPvUv case class that wraps the four business fields to compute: channel ID (channelID), yearMonthDayHour, PV, UV
  3. Write a processHourDim method in ChannelPvUvTask that accepts the preprocessed DataStream
  4. Use the map operator to convert each ClickLogWide into a ChannelPvUv
  5. Key the stream by channel ID and yearMonthDayHour
  6. Divide into time windows (one window every 3 seconds)
  7. Run a reduce to merge the counts
  8. Print for testing
  9. Sink the merged data to HBase
    • Check whether a result record already exists in HBase
    • If it exists, read it and accumulate
    • If not, write the value directly
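The hour-dimension conversion and merge described above can be sketched in plain Scala (the `ClickLogWideLite` stand-in and its fields are assumptions based on the field names used later in this document):

```scala
// Hypothetical minimal shapes; field names follow the implementation notes above.
case class ClickLogWideLite(channelID: String, yearMonthDayHour: String,
                            count: Long, isHourNew: Int)
case class ChannelPvUv(channelID: String, yearMonthDayHour: String, pv: Long, uv: Long)

// Step 4 (map): one widened click log becomes one hour-dimension ChannelPvUv.
// PV comes from count; UV contributes 1 only when the client is new for that hour.
def toHourDim(log: ClickLogWideLite): ChannelPvUv =
  ChannelPvUv(log.channelID, log.yearMonthDayHour, log.count, log.isHourNew)

// Step 7 (reduce): merge two records with the same (channel, hour) key.
def merge(a: ChannelPvUv, b: ChannelPvUv): ChannelPvUv =
  ChannelPvUv(a.channelID, a.yearMonthDayHour, a.pv + b.pv, a.uv + b.uv)
```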

3.3 Day-Dimension PV/UV Development

Computing PV/UV by day is almost identical to the hour dimension; only the grouping field differs. You can copy the hour-dimension PV/UV code and modify it.

3.4 Hour/Day/Month-Dimension PV/UV Development

Since the code for all three dimensions is essentially the same, we can emit the data for all three time dimensions together and group once.

Idea

  1. Generate three ChannelPvUv records from each ClickLogWide, one per dimension:
  • ChannelPvUv --> hour dimension
  • ChannelPvUv --> day dimension
  • ChannelPvUv --> month dimension


Implementation

  1. Use the flatMap operator to convert each ClickLogWide into three ChannelPvUv records
  2. Re-run the test

Core code:

```scala
  def process(clicklogWideDataStream: DataStream[ClickLogWide]) = {
    ...
    val channelPvUvDataStream: DataStream[ChannelPvUv] = clicklogWideDataStream.flatMap {
      clicklog =>
        List(
          ChannelPvUv(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.count, clicklog.isHourNew),
          ChannelPvUv(clicklog.channelID, clicklog.yearMonthDay, clicklog.count, clicklog.isDayNew),
          ChannelPvUv(clicklog.channelID, clicklog.yearMonth, clicklog.count, clicklog.isMonthNew)
        )
    }
    ...
  }
```

四 Real-Time Channel User Freshness Analysis

4.1 Business Overview

User freshness analyzes the ratio of active new to returning users per hour, day, and month.

Freshness lets you:

  • understand, at a macro level, the daily ratio of new to returning users and where they come from

  • check whether the day's new users correlate with the day's promotional activity

The statistics produce data like the following:

Channel ID   Time         New users   Returning users
Channel 1    201703       512         144
Channel 1    20170318     411         4123
Channel 1    2017031810   342         4412

4.2 Business Development

Steps

  1. Create a channel freshness case class with the fields (channel, time, new users, returning users)
  2. Map the preprocessed data into the freshness case class
  3. Group (partition) by channel and time
  4. Divide into time windows (one window every 3 seconds)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelFreshnessTask singleton object

  2. Add a ChannelFreshness case class that wraps the four business fields to compute: channel ID (channelID), date (date), new users (newCount), returning users (oldCount)

  3. Write a process method in ChannelFreshnessTask that accepts the preprocessed DataStream

  4. Use the flatMap operator to convert each ClickLogWide into three ChannelFreshness records, one per time dimension

  5. Key the stream by channel ID and date

  6. Divide into time windows (one window every 3 seconds)

  7. Run a reduce to merge the counts

  8. Print for testing

  9. Sink the merged data to HBase

    • Prepare the HBase table name, column family name, rowkey, and column names
    • Check whether a result record already exists in HBase
    • If it exists, read it and accumulate
    • If not, write the value directly

Note:

Returning users need careful handling here: without the extra check, the same user's visits would be counted repeatedly.

  1. A new user is identified by the isNew flag on the widened click log.

  2. A returning user requires a per-dimension check:

  • If isNew is 0 and the dimension's first-visit flag (isHourNew / isDayNew / isMonthNew) is 1, count the returning user as 1
  • Otherwise count 0
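The rule above reduces to a one-line helper, matching the `isOld` function used in the code below:

```scala
// A user counts as "returning" in a dimension only on their first visit of that
// dimension (isDateNew == 1) and only when they are not brand-new (isNew == 0).
// Any later visit in the same dimension returns 0, so the user is not double-counted.
val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0
```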

Core code:

// 1. Add a `ChannelFreshness` case class wrapping the four business fields: channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
case class ChannelFreshness(var channelID: String,
                            var date: String,
                            var newCount: Long,
                            var oldCount: Long)


object ChannelFreshnessTask {

  // 2. Write a `process` method in `ChannelFreshnessTask` that accepts the preprocessed `DataStream`
  def process(clicklogWideDataStream: DataStream[ClickLogWide]) = {

    // 3. Use the flatMap operator to convert each `ClickLogWide` into three `ChannelFreshness` records
    val channelFreshnessDataStream: DataStream[ChannelFreshness] = clicklogWideDataStream.flatMap {
      clicklog =>
        // A returning user is a non-new user on their first visit of the dimension
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.isNew, isOld(clicklog.isNew, clicklog.isHourNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDay, clicklog.isNew, isOld(clicklog.isNew, clicklog.isDayNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonth, clicklog.isNew, isOld(clicklog.isNew, clicklog.isMonthNew))
        )
    }

    // 4. Key the stream by `channel ID` + `date`
    val groupedDateStream: KeyedStream[ChannelFreshness, String] = channelFreshnessDataStream.keyBy {
      freshness =>
        freshness.channelID + freshness.date
    }

    // 5. Divide into time windows (one window every 3 seconds)
    val windowStream: WindowedStream[ChannelFreshness, String, TimeWindow] = groupedDateStream.timeWindow(Time.seconds(3))

    // 6. Run a reduce to merge the counts
    val reduceDataStream: DataStream[ChannelFreshness] = windowStream.reduce {
      (freshness1, freshness2) =>
        ChannelFreshness(freshness2.channelID, freshness2.date, freshness1.newCount + freshness2.newCount, freshness1.oldCount + freshness2.oldCount)
    }

    // Print for testing
    reduceDataStream.print()

    // 7. Sink the merged data to HBase
    reduceDataStream.addSink(new SinkFunction[ChannelFreshness] {
      override def invoke(value: ChannelFreshness): Unit = {
        val tableName = "channel_freshness"
        val cfName = "info"
        // channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
        val channelIdColName = "channelID"
        val dateColName = "date"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"
        val rowkey = value.channelID + ":" + value.date

        // - Check whether a result record already exists in HBase
        val newCountOldCountMap = HBaseUtil.getData(tableName, rowkey, cfName, List(newCountColName, oldCountColName))

        var totalNewCount = 0L
        var totalOldCount = 0L

        // - If it exists, read it and accumulate; if not, write the value directly
        if (newCountOldCountMap != null && StringUtils.isNotBlank(newCountOldCountMap.getOrElse(newCountColName, ""))) {
          totalNewCount = value.newCount + newCountOldCountMap(newCountColName).toLong
        } else {
          totalNewCount = value.newCount
        }

        if (newCountOldCountMap != null && StringUtils.isNotBlank(newCountOldCountMap.getOrElse(oldCountColName, ""))) {
          totalOldCount = value.oldCount + newCountOldCountMap(oldCountColName).toLong
        } else {
          totalOldCount = value.oldCount
        }

        // Save the accumulated data
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          dateColName -> value.date,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }
}

4.3 Extracting a Common Base Class with the Template Method Pattern

The template method pattern defines the skeleton of an algorithm in a parent class and defers the concrete implementation of individual steps to subclasses, so those steps can be redefined without changing the algorithm's structure.

BaseTask.scala

package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// A common trait extracted for all tasks to implement
trait BaseTask[T] {

  /**
    * Map the raw log stream, then group, window, aggregate, and sink to HBase
    * @param clickLogWideDataStream
    * @return
    */
  def process(clickLogWideDataStream: DataStream[ClickLogWide]): Any = {
    val mapDataStream: DataStream[T] = map(clickLogWideDataStream)
    val keyedStream: KeyedStream[T, String] = keyBy(mapDataStream)
    val windowedStream: WindowedStream[T, String, TimeWindow] = timeWindow(keyedStream)
    val reduceDataStream: DataStream[T] = reduce(windowedStream)
    sink2HBase(reduceDataStream)
  }

  // Map/convert the stream
  def map(source: DataStream[ClickLogWide]): DataStream[T]

  // Group
  def keyBy(mapDataStream: DataStream[T]): KeyedStream[T, String]

  // Time window
  def timeWindow(keyedStream: KeyedStream[T, String]): WindowedStream[T, String, TimeWindow]

  // Aggregate
  def reduce(windowedStream: WindowedStream[T, String, TimeWindow]): DataStream[T]

  // Sink to HBase
  def sink2HBase(reduceDataStream: DataStream[T])
}
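The template-method flow can be illustrated with a dependency-free toy (the `BasePipeline` and `WordCountPipeline` names are illustrative, not part of the project):

```scala
// Toy version of BaseTask: the skeleton is fixed in the trait,
// and concrete tasks only fill in the individual steps.
trait BasePipeline[T] {
  def process(input: Seq[String]): Seq[T] = reduce(map(input)) // fixed skeleton
  def map(input: Seq[String]): Seq[T]                          // step 1: convert
  def reduce(mapped: Seq[T]): Seq[T]                           // step 2: aggregate
}

// A concrete task overrides the steps, never the skeleton.
object WordCountPipeline extends BasePipeline[(String, Int)] {
  def map(input: Seq[String]): Seq[(String, Int)] = input.map(w => (w, 1))
  def reduce(mapped: Seq[(String, Int)]): Seq[(String, Int)] =
    mapped.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }.toSeq
}
```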

Refactored code:

// Add a `ChannelFreshness` case class wrapping the four business fields: channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
case class ChannelFreshness(var channelID: String,
                            var date: String,
                            var newCount: Long,
                            var oldCount: Long)


object ChannelFreshnessTask extends BaseTask[ChannelFreshness] {

  // 1. Use the flatMap operator to convert each `ClickLogWide` into three `ChannelFreshness` records
  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelFreshness] = {
    source.flatMap {
      clicklog =>
        // A returning user is a non-new user on their first visit of the dimension
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.isNew, isOld(clicklog.isNew, clicklog.isHourNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDay, clicklog.isNew, isOld(clicklog.isNew, clicklog.isDayNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonth, clicklog.isNew, isOld(clicklog.isNew, clicklog.isMonthNew))
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelFreshness]): KeyedStream[ChannelFreshness, String] = {
    mapDataStream.keyBy {
      freshness =>
        freshness.channelID + freshness.date
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelFreshness, String]): WindowedStream[ChannelFreshness, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelFreshness, String, TimeWindow]): DataStream[ChannelFreshness] = {
    windowedStream.reduce {
      (freshness1, freshness2) =>
        ChannelFreshness(freshness2.channelID, freshness2.date, freshness1.newCount + freshness2.newCount, freshness1.oldCount + freshness2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelFreshness]) = {
    reduceDataStream.addSink {
      value => {
        val tableName = "channel_freshness"
        val cfName = "info"
        // channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
        val channelIdColName = "channelID"
        val dateColName = "date"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"
        val rowkey = value.channelID + ":" + value.date

        // - Check whether a result record already exists in HBase
        val newCountInHBase = HBaseUtil.getData(tableName, rowkey, cfName, newCountColName)
        val oldCountInHBase = HBaseUtil.getData(tableName, rowkey, cfName, oldCountColName)

        var totalNewCount = 0L
        var totalOldCount = 0L

        // If HBase already holds historical values, accumulate onto them
        if (StringUtils.isNotBlank(newCountInHBase)) {
          totalNewCount = newCountInHBase.toLong + value.newCount
        } else {
          totalNewCount = value.newCount
        }

        if (StringUtils.isNotBlank(oldCountInHBase)) {
          totalOldCount = oldCountInHBase.toLong + value.oldCount
        } else {
          totalOldCount = value.oldCount
        }

        // Write the accumulated values back to HBase
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          dateColName -> value.date,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    }
  }
}

五 Real-Time Channel Region Analysis

5.1 Business Overview

Region analysis shows PV/UV and user freshness broken down by region.

Metrics to compute

  • PV

  • UV

  • New users

  • Returning users

Dimensions to analyze

  • Region (country/province/city); to save time we only analyze the city level here, the other levels are left as an exercise
  • Time (hour, day, month)

The statistics produce data like the following:

Channel ID   Region (country/province/city)   Time         PV     UV    New users   Returning users
Channel 1    China/Beijing/Chaoyang           201809       1000   300   123         171
Channel 1    China/Beijing/Chaoyang           20180910     512    123   23          100
Channel 1    China/Beijing/Chaoyang           2018091010   100    41    11          30

5.2 Business Development

Steps

  1. Create a channel region analysis case class (channel, region (country/province/city), time, PV, UV, new users, returning users)
  2. Use flatMap to convert the preprocessed data into the case class
  3. Group (partition) by channel, time, and region
  4. Divide into time windows (one window every 3 seconds)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelAreaTask singleton object

  2. Add a ChannelArea case class that wraps the business fields to compute: channel ID (channelID), region (area), date (date), pv, uv, new users (newCount), returning users (oldCount)

  3. Write a process method in ChannelAreaTask that accepts the preprocessed DataStream

  4. Use the flatMap operator to convert each ClickLogWide into three ChannelArea records, one per time dimension

  5. Key the stream by channel ID, time, and region

  6. Divide into time windows (one window every 3 seconds)

  7. Run a reduce to merge the counts

  8. Print for testing

  9. Sink the merged data to HBase

    • Prepare the HBase table name, column family name, rowkey, and column names

    • Check whether a result record already exists in HBase

    • If it exists, read it and accumulate

    • If not, write the value directly

Core code:
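The `ChannelArea` fan-out, key, and merge logic follows the same pattern as `ChannelNetworkTask` in the next section; it can be sketched in plain Scala (a sketch under assumptions: simplified stand-in beans, and an `address` field standing in for the region — the real `ClickLogWide` may expose country/province/city separately):

```scala
// Simplified stand-ins for the real beans (field names here are assumptions).
case class ClickLogLite(channelID: String, address: String,
                        yearMonthDayHour: String, yearMonthDay: String, yearMonth: String,
                        count: Long, isHourNew: Int, isDayNew: Int, isMonthNew: Int, isNew: Int)

case class ChannelArea(channelID: String, area: String, date: String,
                       pv: Long, uv: Long, newCount: Long, oldCount: Long)

val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

// flatMap: one click log fans out into hour / day / month dimension records
def toAreas(log: ClickLogLite): List[ChannelArea] = List(
  ChannelArea(log.channelID, log.address, log.yearMonthDayHour, log.count, log.isHourNew, log.isNew, isOld(log.isNew, log.isHourNew)),
  ChannelArea(log.channelID, log.address, log.yearMonthDay, log.count, log.isDayNew, log.isNew, isOld(log.isNew, log.isDayNew)),
  ChannelArea(log.channelID, log.address, log.yearMonth, log.count, log.isMonthNew, log.isNew, isOld(log.isNew, log.isMonthNew))
)

// keyBy: channel ID + date + area
def areaKey(a: ChannelArea): String = a.channelID + a.date + a.area

// reduce: merge two records with the same key
def mergeArea(a: ChannelArea, b: ChannelArea): ChannelArea =
  ChannelArea(b.channelID, b.area, b.date, a.pv + b.pv, a.uv + b.uv, a.newCount + b.newCount, a.oldCount + b.oldCount)
```

The HBase sink is then the same read-accumulate-write routine used by the other tasks, with rowkey `channelID:date:area`.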

六 Real-Time Carrier Analysis

6.1 Business Overview

Compute metrics per carrier (China Mobile, China Unicom, China Telecom, etc.) to see which carrier the traffic mainly comes from, so that network promotion can be targeted more accurately.

Metrics to compute

  • PV

  • UV

  • New users

  • Returning users

Dimensions to analyze

  • Carrier
  • Time (hour, day, month)

The statistics produce data like the following:

Channel ID   Carrier         Time         PV     UV    New users   Returning users
Channel 1    China Mobile    201809       1000   300   0           300
Channel 1    China Unicom    20180910     123    1     0           1
Channel 1    China Telecom   2018091010   55     2     2           0

6.2 Business Development

Steps

  1. Map the preprocessed data into the case class holding the fields to analyze (channel, carrier, time, PV, UV, new users, returning users)
  2. Group (partition) by channel, time, and carrier
  3. Divide into time windows (one window every 3 seconds)
  4. Merge and count within each window
  5. Print for testing
  6. Sink the computed data to HBase

Implementation

  1. Create a ChannelNetworkTask singleton object

  2. Add a ChannelNetwork case class that wraps the business fields to compute: channel ID (channelID), carrier (network), date (date), pv, uv, new users (newCount), returning users (oldCount)

  3. Write a process method in ChannelNetworkTask that accepts the preprocessed DataStream

  4. Use the flatMap operator to convert each ClickLogWide into three ChannelNetwork records, one per time dimension

  5. Key the stream by channel ID, time, and carrier

  6. Divide into time windows (one window every 3 seconds)

  7. Run a reduce to merge the counts

  8. Print for testing

  9. Sink the merged data to HBase

    • Prepare the HBase table name, column family name, rowkey, and column names

    • Check whether a result record already exists in HBase

    • If it exists, read it and accumulate

    • If not, write the value directly

Core code:

package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import com.itheima.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// 2. Add a `ChannelNetwork` case class wrapping the business fields: channel ID (channelID), carrier (network), date (date), pv, uv, new users (newCount), returning users (oldCount)
case class ChannelNetwork(var channelID: String,
                          var network: String,
                          var date: String,
                          var pv: Long,
                          var uv: Long,
                          var newCount: Long,
                          var oldCount: Long)

object ChannelNetworkTask extends BaseTask[ChannelNetwork] {

  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelNetwork] = {
    source.flatMap {
      clicklog =>
        // A returning user is a non-new user on their first visit of the dimension
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonthDayHour,
            clicklog.count,
            clicklog.isHourNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isHourNew)), // hour dimension
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonthDay,
            clicklog.count,
            clicklog.isDayNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isDayNew)), // day dimension
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonth,
            clicklog.count,
            clicklog.isMonthNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isMonthNew)) // month dimension
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelNetwork]): KeyedStream[ChannelNetwork, String] = {
    mapDataStream.keyBy {
      network =>
        network.channelID + network.date + network.network
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelNetwork, String]): WindowedStream[ChannelNetwork, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelNetwork, String, TimeWindow]): DataStream[ChannelNetwork] = {
    windowedStream.reduce {
      (network1, network2) =>
        ChannelNetwork(network2.channelID,
          network2.network,
          network2.date,
          network1.pv + network2.pv,
          network1.uv + network2.uv,
          network1.newCount + network2.newCount,
          network1.oldCount + network2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelNetwork]): Unit = {
    reduceDataStream.addSink(new SinkFunction[ChannelNetwork] {
      override def invoke(value: ChannelNetwork): Unit = {
        // - Prepare the HBase table name, column family name, rowkey, and column names
        val tableName = "channel_network"
        val cfName = "info"
        // channel ID (channelID), carrier (network), date (date), pv, uv, newCount, oldCount
        val rowkey = s"${value.channelID}:${value.date}:${value.network}"
        val channelIdColName = "channelID"
        val networkColName = "network"
        val dateColName = "date"
        val pvColName = "pv"
        val uvColName = "uv"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"

        // - Check whether a result record already exists in HBase
        val resultMap: Map[String, String] = HBaseUtil.getMapData(tableName, rowkey, cfName, List(
          pvColName,
          uvColName,
          newCountColName,
          oldCountColName
        ))

        var totalPv = 0L
        var totalUv = 0L
        var totalNewCount = 0L
        var totalOldCount = 0L

        // - If a stored value exists, accumulate onto it; otherwise use the new value directly
        if (resultMap != null && resultMap.nonEmpty && StringUtils.isNotBlank(resultMap(pvColName))) {
          totalPv = resultMap(pvColName).toLong + value.pv
        } else {
          totalPv = value.pv
        }

        if (resultMap != null && resultMap.nonEmpty && StringUtils.isNotBlank(resultMap(uvColName))) {
          totalUv = resultMap(uvColName).toLong + value.uv
        } else {
          totalUv = value.uv
        }

        if (resultMap != null && resultMap.nonEmpty && StringUtils.isNotBlank(resultMap(newCountColName))) {
          totalNewCount = resultMap(newCountColName).toLong + value.newCount
        } else {
          totalNewCount = value.newCount
        }

        if (resultMap != null && resultMap.nonEmpty && StringUtils.isNotBlank(resultMap(oldCountColName))) {
          totalOldCount = resultMap(oldCountColName).toLong + value.oldCount
        } else {
          totalOldCount = value.oldCount
        }

        // channel ID (channelID), carrier (network), date (date), pv, uv, newCount, oldCount
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          networkColName -> value.network,
          dateColName -> value.date,
          pvColName -> totalPv.toString,
          uvColName -> totalUv.toString,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }
}

七 Real-Time Channel Browser Analysis

7.1 Business Overview

Compute the share of traffic per browser (or client).

Metrics to compute

  • PV

  • UV

  • New users

  • Returning users

Dimensions to analyze

  • Browser
  • Time (hour, day, month)

The statistics produce data like the following:

Channel ID   Browser       Time         PV     UV    New users   Returning users
Channel 1    360 Browser   201809       1000   300   0           300
Channel 1    IE            20180910     123    1     0           1
Channel 1    Chrome        2018091010   55     2     2           0

7.2 Business Development

Steps

  1. Create a channel browser analysis case class (channel, browser, time, PV, UV, new users, returning users)
  2. Use flatMap to convert the preprocessed data into the case class
  3. Group (partition) by channel, time, and browser
  4. Divide into time windows (one window every 3 seconds)
  5. Merge and count within each window
  6. Print for testing
  7. Sink the computed data to HBase

Implementation

  1. Create a ChannelBrowserTask singleton object

  2. Add a ChannelBrowser case class that wraps the business fields to compute: channel ID (channelID), browser (browser), date (date), pv, uv, new users (newCount), returning users (oldCount)

  3. Write a process method in ChannelBrowserTask that accepts the preprocessed DataStream

  4. Use the flatMap operator to convert each ClickLogWide into three ChannelBrowser records, one per time dimension

  5. Key the stream by channel ID, time, and browser

  6. Divide into time windows (one window every 3 seconds)

  7. Run a reduce to merge the counts

  8. Print for testing

  9. Sink the merged data to HBase

    • Prepare the HBase table name, column family name, rowkey, and column names

    • Check whether a result record already exists in HBase

    • If it exists, read it and accumulate

    • If not, write the value directly

Core code:

package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import com.itheima.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// 2. Add a `ChannelBrowser` case class wrapping the business fields: channel ID (channelID), browser (browser), date (date), pv, uv, new users (newCount), returning users (oldCount)
case class ChannelBrowser(var channelID: String,
                          var browser: String,
                          var date: String,
                          var pv: Long,
                          var uv: Long,
                          var newCount: Long,
                          var oldCount: Long)


object ChannelBrowserTask extends BaseTask[ChannelBrowser] {

  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelBrowser] = {
    source.flatMap {
      clicklog =>
        // A returning user is a non-new user on their first visit of the dimension
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0

        List(
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonthDayHour,
            clicklog.count,
            clicklog.isHourNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isHourNew)), // hour dimension
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonthDay,
            clicklog.count,
            clicklog.isDayNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isDayNew)), // day dimension
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonth,
            clicklog.count,
            clicklog.isMonthNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isMonthNew)) // month dimension
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelBrowser]): KeyedStream[ChannelBrowser, String] = {
    mapDataStream.keyBy {
      browser =>
        browser.channelID + browser.date + browser.browser
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelBrowser, String]): WindowedStream[ChannelBrowser, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelBrowser, String, TimeWindow]): DataStream[ChannelBrowser] = {
    windowedStream.reduce {
      (browser1, browser2) =>
        ChannelBrowser(browser2.channelID,
          browser2.browser,
          browser2.date,
          browser1.pv + browser2.pv,
          browser1.uv + browser2.uv,
          browser1.newCount + browser2.newCount,
          browser1.oldCount + browser2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelBrowser]): Unit = {

    reduceDataStream.addSink(new SinkFunction[ChannelBrowser] {
      override def invoke(value: ChannelBrowser): Unit = {
        // - Prepare the HBase table name, column family name, rowkey, and column names
        val tableName = "channel_browser"
        val cfName = "info"
        // channel ID (channelID), browser (browser), date (date), pv, uv, newCount, oldCount
        val rowkey = s"${value.channelID}:${value.date}:${value.browser}"
        val channelIDColName = "channelID"
        val browserColName = "browser"
        val dateColName = "date"
        val pvColName = "pv"
        val uvColName = "uv"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"

        var totalPv = 0L
        var totalUv = 0L
        var totalNewCount = 0L
        var totalOldCount = 0L

        // - Check whether a result record already exists in HBase
        val resultMap: Map[String, String] = HBaseUtil.getMapData(tableName, rowkey, cfName, List(
          pvColName,
          uvColName,
          newCountColName,
          oldCountColName
        ))

        // Compute PV: if HBase already holds a pv value, accumulate onto it
        if (resultMap != null && resultMap.nonEmpty && StringUtils.isNotBlank(resultMap(pvColName))) {
          totalPv = resultMap(pvColName).toLong + value.pv
        } else {
          totalPv = value.pv
        }

        if (resultMap != null && resultMap.nonEmpty && StringUtils.isNotBlank(resultMap(uvColName))) {
          totalUv = resultMap(uvColName).toLong + value.uv
        } else {
          totalUv = value.uv
        }

        // - If a stored value exists, accumulate; otherwise write the new value directly
        if (resultMap != null && resultMap.nonEmpty && StringUtils.isNotBlank(resultMap(newCountColName))) {
          totalNewCount = resultMap(newCountColName).toLong + value.newCount
        } else {
          totalNewCount = value.newCount
        }

        if (resultMap != null && resultMap.nonEmpty && StringUtils.isNotBlank(resultMap(oldCountColName))) {
          totalOldCount = resultMap(oldCountColName).toLong + value.oldCount
        } else {
          totalOldCount = value.oldCount
        }

        // channel ID (channelID), browser (browser), date (date), pv, uv, newCount, oldCount
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIDColName -> value.channelID,
          browserColName -> value.browser,
          dateColName -> value.date,
          pvColName -> totalPv.toString,
          uvColName -> totalUv.toString,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }

}
