Channel hotspot analysis counts the number of visits (clicks) each channel receives.
The analysis should produce data like the following:

| Channel ID | Visits |
| --- | --- |
| Channel ID 1 | 128 |
| Channel ID 2 | 401 |
| Channel ID 3 | 501 |

Historical click counts must be accumulated, so each new window result is added to the totals already stored.
Steps

1. Transform: convert the data into a case class holding the fields to analyze (channel, visit count)
2. Group (key) the stream by channel

Implementation

1. Create a `ChannelRealHotTask` singleton object
2. Add a `ChannelRealHot` case class that encapsulates the two business fields to count: channel ID (`channelID`) and visit count (`visited`)
3. Write a `process` method in `ChannelRealHotTask` that receives the preprocessed `DataStream`
4. Use the `map` operator to convert `ClickLog` objects into `ChannelRealHot`
```scala
package com.xu.realprocess.task

import com.xu.realprocess.bean.ClickLogWide
import com.xu.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

case class ChannelRealHot(var channelid: String, var visited: Long)

/**
 * Channel hotspot analysis
 *
 * 1. Field conversion
 * 2. Grouping
 * 3. Time window
 * 4. Aggregation
 * 5. Sink to HBase
 */
object ChannelRealHotTask {

  def process(clickLogWideDataStream: DataStream[ClickLogWide]) = {
    // 1. Field conversion: channelid, visited
    val realHotDataStream: DataStream[ChannelRealHot] = clickLogWideDataStream.map {
      clickLogWide: ClickLogWide =>
        ChannelRealHot(clickLogWide.channelID, clickLogWide.count)
    }

    // 2. Grouping
    val keyedStream: KeyedStream[ChannelRealHot, String] = realHotDataStream.keyBy(_.channelid)

    // 3. Time window
    val windowedStream: WindowedStream[ChannelRealHot, String, TimeWindow] = keyedStream.timeWindow(Time.seconds(3))

    // 4. Aggregation
    val reduceDataStream: DataStream[ChannelRealHot] = windowedStream.reduce {
      (t1: ChannelRealHot, t2: ChannelRealHot) =>
        ChannelRealHot(t1.channelid, t1.visited + t2.visited)
    }

    // 5. Sink to HBase
    reduceDataStream.addSink(new SinkFunction[ChannelRealHot] {
      override def invoke(value: ChannelRealHot): Unit = {
        // HBase table/column metadata
        val tableName = "channel"
        val clfName = "info"
        val channelIdColumn = "channelId"
        val visitedColumn = "visited"
        val rowkey = value.channelid

        // Query HBase for the existing record, if any
        val visitedValue: String = HBaseUtil.getData(tableName, rowkey, clfName, visitedColumn)

        // Running total for the accumulated visits
        var totalCount: Long = 0

        if (StringUtils.isBlank(visitedValue)) {
          totalCount = value.visited
        } else {
          totalCount = visitedValue.toLong + value.visited
        }

        // Save the accumulated result
        HBaseUtil.putMapData(tableName, rowkey, clfName, Map(
          channelIdColumn -> value.channelid,
          visitedColumn -> totalCount.toString
        ))
      }
    })
  }
}
```
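To run the task, `process` only needs the widened click stream produced by the preprocessing stage. A minimal wiring sketch; the `App` object name and the source of `clickLogWideDataStream` are illustrative (in the real job the stream comes from Kafka plus the preprocessing task):

```scala
import org.apache.flink.streaming.api.scala._
import com.xu.realprocess.bean.ClickLogWide
import com.xu.realprocess.task.ChannelRealHotTask

object App {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Assumed to be produced by the preprocessing stage (Kafka source + widening);
    // replace ??? with the actual stream construction.
    val clickLogWideDataStream: DataStream[ClickLogWide] = ???

    ChannelRealHotTask.process(clickLogWideDataStream)

    env.execute("real-process")
  }
}
```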
Next, analyze channel PV and UV across three time dimensions: hour, day, and month.

PV (page views)

Page View: every page refresh counts once.

UV (unique visitors)

Unique Visitor: within a given time range, the same client is counted only once.

The analysis should produce data like the following:

| Channel ID | Time | PV | UV |
| --- | --- | --- | --- |
| Channel 1 | 2017010116 | 1230 | 350 |
| Channel 2 | 2017010117 | 1251 | 330 |
| Channel 3 | 2017010118 | 5512 | 610 |
Steps

1. Transform: convert the data into a case class holding the fields to analyze (channel, PV, UV)
2. Group (key) the stream by channel and time

Implementation

1. Create a `ChannelPvUvTask` singleton object
2. Add a `ChannelPvUv` case class that encapsulates the four business fields to count: channel ID (`channelID`), year-month-day-hour, PV, and UV
3. Write a `processHourDim` method in `ChannelPvUvTask` that receives the preprocessed `DataStream`
4. Use the `map` operator to convert `ClickLog` objects into `ChannelPvUv`
5. Key the stream by channel ID and year-month-day-hour

Counting PV/UV by day works the same way as by hour; only the grouping field differs, so you could simply copy the hour-dimension PV/UV code and adjust it.
That copied code would be essentially identical, though. Instead, we can put the hour, day, and month time dimensions together and group them in one pass.

Idea

1. Reuse the same `ChannelPvUv` case class for all three dimension counts
2. Use the `flatMap` operator to convert each `ClickLog` into three `ChannelPvUv` records, one per time dimension (see the sketch and core code below)
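The `ChannelPvUv` case class itself is not listed in this section; a minimal sketch, with the time-bucket field name (`dateTime`) assumed from the `flatMap` arguments below:

```scala
// Channel ID, time bucket (yyyyMM / yyyyMMdd / yyyyMMddHH), PV, UV
case class ChannelPvUv(var channelID: String,
                       var dateTime: String,
                       var pv: Long,
                       var uv: Long)
```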
Core code:
```scala
def process(clicklogWideDataStream: DataStream[ClickLogWide]) = {
  ...
  val channelPvUvDataStream: DataStream[ChannelPvUv] = clicklogWideDataStream.flatMap {
    clicklog =>
      List(
        ChannelPvUv(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.count, clicklog.isHourNew),
        ChannelPvUv(clicklog.channelID, clicklog.yearMonthDay, clicklog.count, clicklog.isDayNew),
        ChannelPvUv(clicklog.channelID, clicklog.yearMonth, clicklog.count, clicklog.isMonthNew)
      )
  }
  ...
}
```
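The elided steps follow the same pattern as the other tasks: key by channel ID plus the time bucket, window, reduce, and sink. A sketch of the grouping and aggregation, assuming the `ChannelPvUv` fields sketched above (imports and the HBase sink would mirror `ChannelRealHotTask`):

```scala
// Continues inside process, after the flatMap above
val keyedStream: KeyedStream[ChannelPvUv, String] =
  channelPvUvDataStream.keyBy(pvuv => pvuv.channelID + pvuv.dateTime)

val reduceDataStream: DataStream[ChannelPvUv] =
  keyedStream
    .timeWindow(Time.seconds(3))
    .reduce((a, b) => ChannelPvUv(a.channelID, a.dateTime, a.pv + b.pv, a.uv + b.uv))
```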
User freshness analyzes the ratio of active new to returning users per hour, day, and month.

Freshness tells you:

- at a macro level, the daily ratio of new to returning users and where they come from
- whether the day's new users correlate with that day's promotional activity

The analysis should produce data like the following:

| Channel ID | Time | New users | Returning users |
| --- | --- | --- | --- |
| Channel 1 | 201703 | 512 | 144 |
| Channel 1 | 20170318 | 411 | 4123 |
| Channel 1 | 2017031810 | 342 | 4412 |
Steps

1. Group (key) the stream by channel and time

Implementation

1. Create a `ChannelFreshnessTask` singleton object
2. Add a `ChannelFreshness` case class that encapsulates the four business fields to count: channel ID (`channelID`), date (`date`), new users (`newCount`), returning users (`oldCount`)
3. Write a `process` method in `ChannelFreshnessTask` that receives the preprocessed `DataStream`
4. Use the `flatMap` operator to convert each `ClickLog` object into three `ChannelFreshness` records, one per time dimension
5. Key the stream by channel ID and date
6. Apply a time window (3 seconds per window)
7. Reduce (merge) within the window
8. Print for testing
9. Sink the merged data to HBase
Note:

Returning users need careful handling here: without this check, some user visits would be counted more than once.

- A new user is identified by the `isNew` flag on the widened click log.
- A returning user is counted per time dimension:
  - if `isNew` is 0 and the dimension's first-visit flag is 1 (`isHourNew`, `isDayNew`, or `isMonthNew` respectively), count 1 returning user;
  - otherwise count 0.

For example, a visit with `isNew = 0` and `isDayNew = 1` contributes one returning user to the day dimension, while a second visit on the same day (`isDayNew = 0`) contributes nothing.
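The rule collapses into a single predicate; a quick worksheet-style sketch with sample evaluations:

```scala
// 1 = count as a returning user for this time dimension, 0 = don't count
val isOld = (isNew: Int, isDateNew: Int) =>
  if (isNew == 0 && isDateNew == 1) 1 else 0

isOld(0, 1) // 1: existing user, first visit in this time bucket
isOld(1, 1) // 0: brand-new user, counted under newCount instead
isOld(0, 0) // 0: already counted in this time bucket
```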
Core code:

```scala
// 1. Add a `ChannelFreshness` case class that encapsulates the four business fields:
//    channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
case class ChannelFreshness(var channelID: String,
                            var date: String,
                            var newCount: Long,
                            var oldCount: Long)

object ChannelFreshnessTask {

  // 2. Write a `process` method in `ChannelFreshnessTask` that receives the preprocessed `DataStream`
  def process(clicklogWideDataStream: DataStream[ClickLogWide]) = {
    // 3. Use the flatMap operator to convert each `ClickLog` into three `ChannelFreshness` records
    val channelFreshnessDataStream: DataStream[ChannelFreshness] = clicklogWideDataStream.flatMap {
      clicklog =>
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0
        List(
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.isNew, isOld(clicklog.isNew, clicklog.isHourNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDay, clicklog.isNew, isOld(clicklog.isNew, clicklog.isDayNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonth, clicklog.isNew, isOld(clicklog.isNew, clicklog.isMonthNew))
        )
    }

    // 4. Key the stream by channel ID and date
    val groupedDateStream: KeyedStream[ChannelFreshness, String] = channelFreshnessDataStream.keyBy {
      freshness =>
        freshness.channelID + freshness.date
    }

    // 5. Apply a time window (3 seconds per window)
    val windowStream: WindowedStream[ChannelFreshness, String, TimeWindow] = groupedDateStream.timeWindow(Time.seconds(3))

    // 6. Reduce (merge) within the window
    val reduceDataStream: DataStream[ChannelFreshness] = windowStream.reduce {
      (freshness1, freshness2) =>
        ChannelFreshness(freshness2.channelID, freshness2.date, freshness1.newCount + freshness2.newCount, freshness1.oldCount + freshness2.oldCount)
    }

    // Print for testing
    reduceDataStream.print()

    // 7. Sink the merged data to HBase
    reduceDataStream.addSink(new SinkFunction[ChannelFreshness] {
      override def invoke(value: ChannelFreshness): Unit = {
        val tableName = "channel_freshness"
        val cfName = "info"
        // channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
        val channelIdColName = "channelID"
        val dateColName = "date"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"
        val rowkey = value.channelID + ":" + value.date

        // - Check whether a result record already exists in HBase
        val newCountOldCountMap = HBaseUtil.getMapData(tableName, rowkey, cfName, List(newCountColName, oldCountColName))

        var totalNewCount = 0L
        var totalOldCount = 0L

        // - If it exists, read it and accumulate; otherwise write the window's value directly
        if (newCountOldCountMap != null && StringUtils.isNotBlank(newCountOldCountMap.getOrElse(newCountColName, ""))) {
          totalNewCount = value.newCount + newCountOldCountMap(newCountColName).toLong
        } else {
          totalNewCount = value.newCount
        }
        if (newCountOldCountMap != null && StringUtils.isNotBlank(newCountOldCountMap.getOrElse(oldCountColName, ""))) {
          totalOldCount = value.oldCount + newCountOldCountMap(oldCountColName).toLong
        } else {
          totalOldCount = value.oldCount
        }

        // Write the accumulated totals back to HBase
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          dateColName -> value.date,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }
}
```
The template method pattern defines the skeleton of an algorithm in a parent class and defers the concrete steps to subclasses, letting you redefine individual steps of the algorithm without changing its overall structure.
BaseTask.scala

```scala
package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// Extract a common trait that every task implements
trait BaseTask[T] {

  /**
   * Map-transform the raw log stream, group, window, aggregate, and sink to HBase
   *
   * @param clickLogWideDataStream the preprocessed (widened) click log stream
   * @return
   */
  def process(clickLogWideDataStream: DataStream[ClickLogWide]): Any = {
    val mapDataStream: DataStream[T] = map(clickLogWideDataStream)
    val keyedStream: KeyedStream[T, String] = keyBy(mapDataStream)
    val windowedStream: WindowedStream[T, String, TimeWindow] = timeWindow(keyedStream)
    val reduceDataStream: DataStream[T] = reduce(windowedStream)
    sink2HBase(reduceDataStream)
  }

  // Map-transform the data stream
  def map(source: DataStream[ClickLogWide]): DataStream[T]

  // Grouping
  def keyBy(mapDataStream: DataStream[T]): KeyedStream[T, String]

  // Time window
  def timeWindow(keyedStream: KeyedStream[T, String]): WindowedStream[T, String, TimeWindow]

  // Aggregation
  def reduce(windowedStream: WindowedStream[T, String, TimeWindow]): DataStream[T]

  // Sink to HBase
  def sink2HBase(reduceDataStream: DataStream[T])
}
```
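With the skeleton in place, each concrete task only supplies the five step implementations, and callers keep invoking `process` exactly as before; for example:

```scala
// BaseTask.process fixes the order: map -> keyBy -> timeWindow -> reduce -> sink2HBase;
// the task object supplies only the step bodies.
ChannelFreshnessTask.process(clickLogWideDataStream)
```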
Refactored code:

```scala
// Add a `ChannelFreshness` case class that encapsulates the four business fields:
// channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
case class ChannelFreshness(var channelID: String,
                            var date: String,
                            var newCount: Long,
                            var oldCount: Long)

object ChannelFreshnessTask extends BaseTask[ChannelFreshness] {

  // 1. Use the flatMap operator to convert each `ClickLog` into three `ChannelFreshness` records
  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelFreshness] = {
    source.flatMap {
      clicklog =>
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0
        List(
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDayHour, clicklog.isNew, isOld(clicklog.isNew, clicklog.isHourNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonthDay, clicklog.isNew, isOld(clicklog.isNew, clicklog.isDayNew)),
          ChannelFreshness(clicklog.channelID, clicklog.yearMonth, clicklog.isNew, isOld(clicklog.isNew, clicklog.isMonthNew))
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelFreshness]): KeyedStream[ChannelFreshness, String] = {
    mapDataStream.keyBy {
      freshness =>
        freshness.channelID + freshness.date
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelFreshness, String]): WindowedStream[ChannelFreshness, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelFreshness, String, TimeWindow]): DataStream[ChannelFreshness] = {
    windowedStream.reduce {
      (freshness1, freshness2) =>
        ChannelFreshness(freshness2.channelID, freshness2.date, freshness1.newCount + freshness2.newCount, freshness1.oldCount + freshness2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelFreshness]) = {
    reduceDataStream.addSink {
      value => {
        val tableName = "channel_freshness"
        val cfName = "info"
        // channel ID (channelID), date (date), new users (newCount), returning users (oldCount)
        val channelIdColName = "channelID"
        val dateColName = "date"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"
        val rowkey = value.channelID + ":" + value.date

        // - Check whether a result record already exists in HBase
        val newCountInHBase = HBaseUtil.getData(tableName, rowkey, cfName, newCountColName)
        val oldCountInHBase = HBaseUtil.getData(tableName, rowkey, cfName, oldCountColName)

        var totalNewCount = 0L
        var totalOldCount = 0L

        // Accumulate onto any historical values already in HBase
        if (StringUtils.isNotBlank(newCountInHBase)) {
          totalNewCount = newCountInHBase.toLong + value.newCount
        } else {
          totalNewCount = value.newCount
        }
        if (StringUtils.isNotBlank(oldCountInHBase)) {
          totalOldCount = oldCountInHBase.toLong + value.oldCount
        } else {
          totalOldCount = value.oldCount
        }

        // Write the merged, accumulated totals to HBase
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          dateColName -> value.date,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    }
  }
}
```
Area analysis helps you view PV/UV and user freshness by geographic region.

Metrics to compute

- PV
- UV
- New users
- Returning users

Dimensions to analyze

- Area (country/province/city)
- Time

The analysis should produce data like the following:

| Channel ID | Area (country/province/city) | Time | PV | UV | New users | Returning users |
| --- | --- | --- | --- | --- | --- | --- |
| Channel 1 | China, Beijing, Chaoyang | 201809 | 1000 | 300 | 123 | 171 |
| Channel 1 | China, Beijing, Chaoyang | 20180910 | 512 | 123 | 23 | 100 |
| Channel 1 | China, Beijing, Chaoyang | 2018091010 | 100 | 41 | 11 | 30 |
Steps

1. Use `flatMap` to convert the data into the target case class
2. Group (key) the stream by channel, time, and area

Implementation

1. Create a `ChannelAreaTask` singleton object
2. Add a `ChannelArea` case class that encapsulates the business fields to count: channel ID (`channelID`), area (`area`), date (`date`), PV, UV, new users (`newCount`), returning users (`oldCount`)
3. Write a `process` method in `ChannelAreaTask` that receives the preprocessed `DataStream`
4. Use the `flatMap` operator to convert each `ClickLog` object into three `ChannelArea` records, one per time dimension
5. Key the stream by channel ID, time, and area
6. Apply a time window (3 seconds per window)
7. Reduce (merge) within the window
8. Print for testing
9. Sink the merged data to HBase, as sketched below:
   - Prepare the HBase table name, column family, rowkey, and column names
   - Check whether a result record already exists in HBase
   - If it exists, read it and accumulate
   - If not, write the value directly
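A sketch of `ChannelAreaTask` built on the same `BaseTask` template as the other tasks (imports match `ChannelNetworkTask` below). The `address` field used for the area and the `channel_area` table name are assumptions; substitute the actual area field from your `ClickLogWide` schema:

```scala
case class ChannelArea(var channelID: String,
                       var area: String,
                       var date: String,
                       var pv: Long,
                       var uv: Long,
                       var newCount: Long,
                       var oldCount: Long)

object ChannelAreaTask extends BaseTask[ChannelArea] {

  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelArea] = {
    source.flatMap {
      clicklog =>
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0
        List(
          ChannelArea(clicklog.channelID, clicklog.address, clicklog.yearMonthDayHour,
            clicklog.count, clicklog.isHourNew, clicklog.isNew, isOld(clicklog.isNew, clicklog.isHourNew)),  // hour dimension
          ChannelArea(clicklog.channelID, clicklog.address, clicklog.yearMonthDay,
            clicklog.count, clicklog.isDayNew, clicklog.isNew, isOld(clicklog.isNew, clicklog.isDayNew)),    // day dimension
          ChannelArea(clicklog.channelID, clicklog.address, clicklog.yearMonth,
            clicklog.count, clicklog.isMonthNew, clicklog.isNew, isOld(clicklog.isNew, clicklog.isMonthNew)) // month dimension
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelArea]): KeyedStream[ChannelArea, String] = {
    mapDataStream.keyBy(area => area.channelID + area.date + area.area)
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelArea, String]): WindowedStream[ChannelArea, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelArea, String, TimeWindow]): DataStream[ChannelArea] = {
    windowedStream.reduce {
      (a, b) =>
        ChannelArea(b.channelID, b.area, b.date, a.pv + b.pv, a.uv + b.uv, a.newCount + b.newCount, a.oldCount + b.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelArea]): Unit = {
    reduceDataStream.addSink(new SinkFunction[ChannelArea] {
      override def invoke(value: ChannelArea): Unit = {
        val tableName = "channel_area" // assumed table name
        val cfName = "info"
        val rowkey = s"${value.channelID}:${value.date}:${value.area}"

        // Read any existing totals, accumulate, and write back,
        // exactly as in ChannelNetworkTask's sink below
        val resultMap = HBaseUtil.getMapData(tableName, rowkey, cfName, List("pv", "uv", "newCount", "oldCount"))

        def total(col: String, v: Long): Long =
          if (resultMap != null && StringUtils.isNotBlank(resultMap.getOrElse(col, ""))) resultMap(col).toLong + v else v

        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          "channelID" -> value.channelID,
          "area" -> value.area,
          "date" -> value.date,
          "pv" -> total("pv", value.pv).toString,
          "uv" -> total("uv", value.uv).toString,
          "newCount" -> total("newCount", value.newCount).toString,
          "oldCount" -> total("oldCount", value.oldCount).toString
        ))
      }
    })
  }
}
```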
Carrier analysis breaks the metrics down by carrier (China Mobile, China Unicom, China Telecom, and so on) to find out which carrier most traffic comes from, so that network promotion can be targeted more accurately.

Metrics to compute

- PV
- UV
- New users
- Returning users

Dimensions to analyze

- Carrier
- Time

The analysis should produce data like the following:

| Channel ID | Carrier | Time | PV | UV | New users | Returning users |
| --- | --- | --- | --- | --- | --- | --- |
| Channel 1 | China Mobile | 201809 | 1000 | 300 | 0 | 300 |
| Channel 1 | China Unicom | 20180910 | 123 | 1 | 0 | 1 |
| Channel 1 | China Telecom | 2018091010 | 55 | 2 | 2 | 0 |
Steps

1. Group (key) the stream by channel, time, and carrier

Implementation

1. Create a `ChannelNetworkTask` singleton object
2. Add a `ChannelNetwork` case class that encapsulates the business fields to count: channel ID (`channelID`), carrier (`network`), date (`date`), PV, UV, new users (`newCount`), returning users (`oldCount`)
3. Write a `process` method in `ChannelNetworkTask` that receives the preprocessed `DataStream`
4. Use the `flatMap` operator to convert each `ClickLog` object into three `ChannelNetwork` records, one per time dimension
5. Key the stream by channel ID, time, and carrier
6. Apply a time window (3 seconds per window)
7. Reduce (merge) within the window
8. Print for testing
9. Sink the merged data to HBase
   - Prepare the HBase table name, column family, rowkey, and column names
   - Check whether a result record already exists in HBase
   - If it exists, read it and accumulate
   - If not, write the value directly

Core code:
```scala
package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import com.itheima.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.windowing.time.Time

// 2. Add a `ChannelNetwork` case class that encapsulates the business fields:
//    channel ID (channelID), carrier (network), date (date), pv, uv, new users (newCount), returning users (oldCount)
case class ChannelNetwork(var channelID: String,
                          var network: String,
                          var date: String,
                          var pv: Long,
                          var uv: Long,
                          var newCount: Long,
                          var oldCount: Long)

object ChannelNetworkTask extends BaseTask[ChannelNetwork] {

  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelNetwork] = {
    source.flatMap {
      clicklog =>
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0
        List(
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonthDayHour,
            clicklog.count,
            clicklog.isHourNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isHourNew)), // hour dimension
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonthDay,
            clicklog.count,
            clicklog.isDayNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isDayNew)), // day dimension
          ChannelNetwork(clicklog.channelID,
            clicklog.network,
            clicklog.yearMonth,
            clicklog.count,
            clicklog.isMonthNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isMonthNew)) // month dimension
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelNetwork]): KeyedStream[ChannelNetwork, String] = {
    mapDataStream.keyBy {
      network =>
        network.channelID + network.date + network.network
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelNetwork, String]): WindowedStream[ChannelNetwork, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelNetwork, String, TimeWindow]): DataStream[ChannelNetwork] = {
    windowedStream.reduce {
      (network1, network2) =>
        ChannelNetwork(network2.channelID,
          network2.network,
          network2.date,
          network1.pv + network2.pv,
          network1.uv + network2.uv,
          network1.newCount + network2.newCount,
          network1.oldCount + network2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelNetwork]): Unit = {
    reduceDataStream.addSink(new SinkFunction[ChannelNetwork] {
      override def invoke(value: ChannelNetwork): Unit = {
        // - Prepare the HBase table name, column family, rowkey, and column names
        val tableName = "channel_network"
        val cfName = "info"
        // channel ID (channelID), carrier (network), date (date), pv, uv, newCount, oldCount
        val rowkey = s"${value.channelID}:${value.date}:${value.network}"
        val channelIdColName = "channelID"
        val networkColName = "network"
        val dateColName = "date"
        val pvColName = "pv"
        val uvColName = "uv"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"

        // - Check whether a result record already exists in HBase
        val resultMap: Map[String, String] = HBaseUtil.getMapData(tableName, rowkey, cfName, List(
          pvColName,
          uvColName,
          newCountColName,
          oldCountColName
        ))

        var totalPv = 0L
        var totalUv = 0L
        var totalNewCount = 0L
        var totalOldCount = 0L

        // - If a value exists, accumulate onto it; otherwise use the window's value directly
        if (resultMap != null && resultMap.size > 0 && StringUtils.isNotBlank(resultMap(pvColName))) {
          totalPv = resultMap(pvColName).toLong + value.pv
        } else {
          totalPv = value.pv
        }
        if (resultMap != null && resultMap.size > 0 && StringUtils.isNotBlank(resultMap(uvColName))) {
          totalUv = resultMap(uvColName).toLong + value.uv
        } else {
          totalUv = value.uv
        }
        if (resultMap != null && resultMap.size > 0 && StringUtils.isNotBlank(resultMap(newCountColName))) {
          totalNewCount = resultMap(newCountColName).toLong + value.newCount
        } else {
          totalNewCount = value.newCount
        }
        if (resultMap != null && resultMap.size > 0 && StringUtils.isNotBlank(resultMap(oldCountColName))) {
          totalOldCount = resultMap(oldCountColName).toLong + value.oldCount
        } else {
          totalOldCount = value.oldCount
        }

        // channel ID (channelID), carrier (network), date (date), pv, uv, newCount, oldCount
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIdColName -> value.channelID,
          networkColName -> value.network,
          dateColName -> value.date,
          pvColName -> totalPv.toString,
          uvColName -> totalUv.toString,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }
}
```
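One detail worth tightening: keying on the bare concatenation `channelID + date + network` can produce the same key for different field combinations, since nothing marks the field boundaries. Joining the parts with a delimiter, mirroring the rowkey format the sink already uses, avoids that; a sketch:

```scala
// Key parts joined with ":" so field boundaries stay unambiguous,
// matching the rowkey format used in sink2HBase
override def keyBy(mapDataStream: DataStream[ChannelNetwork]): KeyedStream[ChannelNetwork, String] = {
  mapDataStream.keyBy(n => s"${n.channelID}:${n.date}:${n.network}")
}
```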
Finally, compute the share of each browser (or client) separately.

Metrics to compute

- PV
- UV
- New users
- Returning users

Dimensions to analyze

- Browser
- Time

The analysis should produce data like the following:

| Channel ID | Browser | Time | PV | UV | New users | Returning users |
| --- | --- | --- | --- | --- | --- | --- |
| Channel 1 | 360 Browser | 201809 | 1000 | 300 | 0 | 300 |
| Channel 1 | IE | 20180910 | 123 | 1 | 0 | 1 |
| Channel 1 | Chrome | 2018091010 | 55 | 2 | 2 | 0 |
Steps

1. Use `flatMap` to convert the data into the target case class
2. Group (key) the stream by channel, time, and browser

Implementation

1. Create a `ChannelBrowserTask` singleton object
2. Add a `ChannelBrowser` case class that encapsulates the business fields to count: channel ID (`channelID`), browser (`browser`), date (`date`), PV, UV, new users (`newCount`), returning users (`oldCount`)
3. Write a `process` method in `ChannelBrowserTask` that receives the preprocessed `DataStream`
4. Use the `flatMap` operator to convert each `ClickLog` object into three `ChannelBrowser` records, one per time dimension
5. Key the stream by channel ID, time, and browser
6. Apply a time window (3 seconds per window)
7. Reduce (merge) within the window
8. Print for testing
9. Sink the merged data to HBase
   - Prepare the HBase table name, column family, rowkey, and column names
   - Check whether a result record already exists in HBase
   - If it exists, read it and accumulate
   - If not, write the value directly

Core code:
```scala
package com.itheima.realprocess.task

import com.itheima.realprocess.bean.ClickLogWide
import com.itheima.realprocess.util.HBaseUtil
import org.apache.commons.lang.StringUtils
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, WindowedStream}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.windowing.time.Time

// 2. Add a `ChannelBrowser` case class that encapsulates the business fields:
//    channel ID (channelID), browser (browser), date (date), pv, uv, new users (newCount), returning users (oldCount)
case class ChannelBrowser(var channelID: String,
                          var browser: String,
                          var date: String,
                          var pv: Long,
                          var uv: Long,
                          var newCount: Long,
                          var oldCount: Long)

object ChannelBrowserTask extends BaseTask[ChannelBrowser] {

  override def map(source: DataStream[ClickLogWide]): DataStream[ChannelBrowser] = {
    source.flatMap {
      clicklog =>
        val isOld = (isNew: Int, isDateNew: Int) => if (isNew == 0 && isDateNew == 1) 1 else 0
        List(
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonthDayHour,
            clicklog.count,
            clicklog.isHourNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isHourNew)), // hour dimension
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonthDay,
            clicklog.count,
            clicklog.isDayNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isDayNew)), // day dimension
          ChannelBrowser(clicklog.channelID,
            clicklog.browserType,
            clicklog.yearMonth,
            clicklog.count,
            clicklog.isMonthNew,
            clicklog.isNew,
            isOld(clicklog.isNew, clicklog.isMonthNew)) // month dimension
        )
    }
  }

  override def keyBy(mapDataStream: DataStream[ChannelBrowser]): KeyedStream[ChannelBrowser, String] = {
    mapDataStream.keyBy {
      browser =>
        browser.channelID + browser.date + browser.browser
    }
  }

  override def timeWindow(keyedStream: KeyedStream[ChannelBrowser, String]): WindowedStream[ChannelBrowser, String, TimeWindow] = {
    keyedStream.timeWindow(Time.seconds(3))
  }

  override def reduce(windowedStream: WindowedStream[ChannelBrowser, String, TimeWindow]): DataStream[ChannelBrowser] = {
    windowedStream.reduce {
      (browser1, browser2) =>
        ChannelBrowser(browser2.channelID,
          browser2.browser,
          browser2.date,
          browser1.pv + browser2.pv,
          browser1.uv + browser2.uv,
          browser1.newCount + browser2.newCount,
          browser1.oldCount + browser2.oldCount)
    }
  }

  override def sink2HBase(reduceDataStream: DataStream[ChannelBrowser]): Unit = {
    reduceDataStream.addSink(new SinkFunction[ChannelBrowser] {
      override def invoke(value: ChannelBrowser): Unit = {
        // - Prepare the HBase table name, column family, rowkey, and column names
        val tableName = "channel_browser"
        val cfName = "info"
        // channel ID (channelID), browser (browser), date (date), pv, uv, newCount, oldCount
        val rowkey = s"${value.channelID}:${value.date}:${value.browser}"
        val channelIDColName = "channelID"
        val browserColName = "browser"
        val dateColName = "date"
        val pvColName = "pv"
        val uvColName = "uv"
        val newCountColName = "newCount"
        val oldCountColName = "oldCount"

        var totalPv = 0L
        var totalUv = 0L
        var totalNewCount = 0L
        var totalOldCount = 0L

        val resultMap: Map[String, String] = HBaseUtil.getMapData(tableName, rowkey, cfName, List(
          pvColName,
          uvColName,
          newCountColName,
          oldCountColName
        ))

        // Compute PV: if HBase already holds a pv value, accumulate onto it
        if (resultMap != null && resultMap.size > 0 && StringUtils.isNotBlank(resultMap(pvColName))) {
          totalPv = resultMap(pvColName).toLong + value.pv
        } else {
          totalPv = value.pv
        }
        if (resultMap != null && resultMap.size > 0 && StringUtils.isNotBlank(resultMap(uvColName))) {
          totalUv = resultMap(uvColName).toLong + value.uv
        } else {
          totalUv = value.uv
        }
        // - Check whether a result record already exists in HBase
        // - If it exists, read it and accumulate
        // - If not, write the value directly
        if (resultMap != null && resultMap.size > 0 && StringUtils.isNotBlank(resultMap(newCountColName))) {
          totalNewCount = resultMap(newCountColName).toLong + value.newCount
        } else {
          totalNewCount = value.newCount
        }
        if (resultMap != null && resultMap.size > 0 && StringUtils.isNotBlank(resultMap(oldCountColName))) {
          totalOldCount = resultMap(oldCountColName).toLong + value.oldCount
        } else {
          totalOldCount = value.oldCount
        }

        // channel ID (channelID), browser (browser), date (date), pv, uv, newCount, oldCount
        HBaseUtil.putMapData(tableName, rowkey, cfName, Map(
          channelIDColName -> value.channelID,
          browserColName -> value.browser,
          dateColName -> value.date,
          pvColName -> totalPv.toString,
          uvColName -> totalUv.toString,
          newCountColName -> totalNewCount.toString,
          oldCountColName -> totalOldCount.toString
        ))
      }
    })
  }
}
```