Course plan:
Introduction to Flink (features, integrations), Flink environment setup (standalone, YARN), Flink DataSet (batch processing)
Flink's core computation module: runtime
Roles:
bin/yarn-session.sh -n 2 -tm 800 -jm 800 -s 1 -d
-n : number of TaskManager containers
-s : number of slots per TaskManager
-tm: memory of each TaskManager container (MB)
-jm: memory of the JobManager container (MB)
-d : detached mode
Three containers in total: 2 TaskManagers and 1 JobManager
Submit a job:
bin/flink run examples/batch/WordCount.jar
List the yarn-session applications:
yarn application -list
Kill a specific yarn-session application:
yarn application -kill application_1571196306040_0002
Show the other yarn-session options:
yarn-session -help
Submit command (per-job on YARN):
bin/flink run -m yarn-cluster -yn 2 -ys 2 -ytm 1024 -yjm 1024 /export/servers/flink-1.7.0/examples/batch/WordCount.jar
-m : execution mode (yarn-cluster here)
-yn : number of containers (TaskManagers)
-ys : number of slots
-ytm: TaskManager memory
-yjm: JobManager memory
/export/servers/flink-1.7.0/examples/batch/WordCount.jar : the jar to execute
Show help:
bin/flink run -m yarn-cluster -help
Extension: custom Flume source
https://www.cnblogs.com/nstart/p/7699904.html
[image: photo/1571198075931.png]
1. WordCount development
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object WordCount {
def main(args: Array[String]): Unit = {
/**
* 1.获取批处理执行环境
* 2.加载数据源
* 3.数据转换:切分,分组,聚合
* 4.数据打印
* 5.触发执行
*/
//1.获取批处理执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val source: DataSet[String] = env.fromElements("ss dd dd ss ff")
source.flatMap(line=>line.split("\\W+"))//正则表达式W+ ,多个空格
.map(line=>(line,1)) //(ss,1)
//分组
.groupBy(0)
.sum(1) //求和
.print()//数据打印,在批处理中,print是一个触发算子
//env.execute() //表示触发执行
}
}
2. Packaging and deployment
Option 1: package with Maven
Option 2: package from IDEA
[image: photo/1571208964476.png]
[image: photo/1571209075510.png]
Run the job:
bin/flink run -m yarn-cluster -yn 1 -ys 1 -ytm 1024 -yjm 1024 /export/servers/tmp/flink-1016.jar
Keep dependency jars out of the application jar:
the jar stays small
upgrades and maintenance are easier
map: transforms each element into another element
[image: photo/1571209951856.png]
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object MapDemo {
def main(args: Array[String]): Unit = {
/**
* 1. 获取 ExecutionEnvironment 运行环境
* 2. 使用 fromCollection 构建数据源
* 3. 创建一个 User 样例类
* 4. 使用 map 操作执行转换
* 5. 打印测试
*/
//1. 获取 ExecutionEnvironment 运行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. 使用 fromCollection 构建数据源
val source: DataSet[String] = env.fromCollection(List("1,张三", "2,李四", "3,王五", "4,赵六"))
//3.数据转换
source.map(line=>{
val arr: Array[String] = line.split(",")
User(arr(0).toInt,arr(1))
}).print()
}
}
case class User(id:Int,userName:String)
flatMap: transforms each element into 0, 1, or n elements
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object FlatMap {
def main(args: Array[String]): Unit = {
/**
* 1. 获取 ExecutionEnvironment 运行环境
* 2. 使用 fromElements构建数据源
* 3. 使用 flatMap 执行转换
* 4. 使用groupBy进行分组
* 5. 使用sum求值
* 6. 打印测试
*/
//1. 获取 ExecutionEnvironment 运行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. 使用 fromElements构建数据源
val source: DataSet[List[(String, Int)]] = env.fromElements(List(("java", 1), ("scala", 1), ("java", 1)) )
source.flatMap(line=>line)
.groupBy(0) //对第一个元素进行分组
.sum(1) //对第二个元素求和
.print() //打印和触发执行
}
}
mapPartition: a partition-wise transformation; the elements of one partition are transformed together
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object MapPartition {
def main(args: Array[String]): Unit = {
/**
* 1. 获取 ExecutionEnvironment 运行环境
* 2. 使用 fromElements构建数据源
* 3. 创建一个 Demo 样例类
* 4. 使用 mapPartition 操作执行转换
* 5. 打印测试
*/
//1. 获取 ExecutionEnvironment 运行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. 使用 fromElements构建数据源
val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
//3.数据转换
source.mapPartition(line=>{
line.map(y=>(y._1,y._2))
}).print()
}
}
filter: keeps only the elements for which the predicate returns true
package cn.itcast
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object Filter {
def main(args: Array[String]): Unit = {
//1. 获取 ExecutionEnvironment 运行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. 使用 fromElements构建数据源
val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
//3.数据过滤
source.filter(line=>line._1.contains("java"))
.print()
}
}
reduce: incremental aggregation that reduces the (grouped) data set to a single element
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object Reduce {
def main(args: Array[String]): Unit = {
/**
* 1. 获取 ExecutionEnvironment 运行环境
* 2. 使用 fromElements 构建数据源
* 3. 使用 map和group执行转换操作
* 4.使用reduce进行聚合操作
* 5.打印测试
*/
//1. 获取 ExecutionEnvironment 运行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. 使用 fromElements 构建数据源
val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
//3.数据转换
source.groupBy(0)
//4.使用reduce进行聚合操作
.reduce((x,y)=>(x._1,x._2+y._2))
//5.打印测试
.print()
}
}
[image: photo/1571213462938.png]
package cn.itcast
import java.lang
import org.apache.flink.api.common.functions.{GroupCombineFunction, GroupReduceFunction}
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector
/**
* @Date 2019/10/16
*/
object GroupCombineDemo {
def main(args: Array[String]): Unit = {
/**
* 1. 获取 ExecutionEnvironment 运行环境
* 2. 使用 fromElements 构建数据源
* 3. 使用 map和group执行转换操作
* 4.使用reduce进行聚合操作
* 5.打印测试
*/
//1. 获取 ExecutionEnvironment 运行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. 使用 fromElements 构建数据源
val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1))
//3.数据转换
source.groupBy(0)
//4.使用reduce进行聚合操作
//.reduce((x,y)=>(x._1,x._2+y._2))
//reduceGroup写法
// .reduceGroup(line => {
// line.reduce((x, y) => (x._1, x._2 + y._2))
// })
// .reduceGroup{
// (in:Iterator[(String,Int)],out:Collector[(String,Int)])=>{
// val tuple: (String, Int) = in.reduce((x,y)=>(x._1,x._2+y._2))
// out.collect(tuple)
// }
// }
//combine
.combineGroup(new GroupCombineAndReduce)
//5.打印测试
.print()
}
}
//导入包 java语法改成可以使用scala语法
import collection.JavaConverters._
class GroupCombineAndReduce extends GroupReduceFunction[(String,Int),(String,Int)]
with GroupCombineFunction[(String,Int),(String,Int)] {
//后执行
override def reduce(values: lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
for(line<- values.asScala){
out.collect(line)
}
}
//先执行,能够预先合并数据
override def combine(values: lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
var key= ""
var sum:Int =0
for(line<- values.asScala){
key =line._1
sum = sum+ line._2
}
out.collect((key,sum))
}
}
Note: reduceGroup/combineGroup receive an entire group at once, so the amount of data per group must not be too large
7. Aggregate functions
package cn.itcast
import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import scala.collection.mutable
/**
* @Date 2019/10/16
*/
object Aggregate {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 89.0))
data.+=((2, "shuxue", 92.2))
data.+=((3, "yingyu", 89.99))
data.+=((4, "wuli", 98.9))
data.+=((5, "yuwen", 88.88))
data.+=((6, "wuli", 93.00))
data.+=((7, "yuwen", 94.3))
val source: DataSet[(Int, String, Double)] = env.fromCollection(data)
//3.数据分组
val groupData: GroupedDataSet[(Int, String, Double)] = source.groupBy(1)
//4.数据聚合
groupData
//根据第三个元素取最小值
// .minBy(2)
//.maxBy(2) //返回满足条件的一组元素
//.min(2)
// .max(2) //返回满足条件的最值
.aggregate(Aggregations.MAX,2)
.print()
}
}
distinct: data deduplication
package cn.itcast
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import scala.collection.mutable
/**
* @Date 2019/10/16
*/
object DistinctDemo {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 89.0))
data.+=((2, "shuxue", 92.2))
data.+=((3, "yingyu", 89.99))
data.+=((4, "wuli", 93.00))
data.+=((5, "yuwen", 89.0))
data.+=((6, "wuli", 93.00))
val source: DataSet[(Int, String, Double)] = env.fromCollection(data)
source.distinct(1) //去重
.print()
}
}
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object LeftAndRightAndFull {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val s1: DataSet[(Int, String)] = env.fromElements((1, "zhangsan") , (2, "lisi") ,(3 , "wangwu") ,(4 , "zhaoliu"))
val s2: DataSet[(Int, String)] = env.fromElements((1, "beijing"), (2, "shanghai"), (4, "guangzhou"))
//3.join关联
//leftJoin
// s1.leftOuterJoin(s2).where(0).equalTo(0){
// (s1,s2)=>{
// if(s2 == null){
// (s1._1,s1._2,null)
// }else{
// (s1._1,s1._2,s2._2)
// }
// }
// }
//rightJoin
// s1.rightOuterJoin(s2).where(0).equalTo(0) {
// (s1, s2) => {
// if (s1 == null) {
// (s2._1, null, s2._2)
// } else {
// (s2._1, s1._2, s2._2)
// }
// }
// }
//fullJoin
s1.fullOuterJoin(s2).where(0).equalTo(0){
(s1,s2)=>{
if (s1 == null) {
(s2._1, null, s2._2)
}else if(s2 == null){
(s1._1,s1._2,null)
} else {
(s2._1, s1._2, s2._2)
}
}
}
.print()
}
}
union: merges multiple data sets
object Union {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
val s1: DataSet[String] = env.fromElements("java")
val s2: DataSet[String] = env.fromElements("scala")
val s3: DataSet[String] = env.fromElements("java")
//union数据合并
s1.union(s2).union(s3).print()
}
}
package cn.itcast
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
/**
* @Date 2019/10/16
*/
object Rebalance {
def main(args: Array[String]): Unit = {
/**
* 1. 获取 ExecutionEnvironment 运行环境
* 2. 生成序列数据源
* 3. 使用filter过滤大于50的数字
* 4. 执行rebalance操作
* 5.使用map操作传入 RichMapFunction ,将当前子任务的ID和数字构建成一个元组
* 6. 打印测试
*/
//1. 获取 ExecutionEnvironment 运行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. 生成序列数据源
val source: DataSet[Long] = env.generateSequence(0,100)
//3. 使用filter过滤大于50的数字
val filterData: DataSet[Long] = source.filter(_>50)
//4.避免数据倾斜
val rebData: DataSet[Long] = filterData.rebalance()
//5.数据转换
rebData.map(new RichMapFunction[Long,(Int,Long)] {
var subtask: Int = 0
//open方法会在map方法之前执行
override def open(parameters: Configuration): Unit = {
//获取线程任务执行id
//通过上下文对象获取
subtask = getRuntimeContext.getIndexOfThisSubtask
}
override def map(value: Long): (Int, Long) = {
(subtask,value)
}
})
//数据打印,触发执行
.print()
}
}
Partitioning operators
package cn.itcast
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
import scala.collection.mutable
/**
* @Date 2019/10/16
*/
object PartitionDemo {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//加载数据源
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val source = env.fromCollection(data)
//3.partitionByHash分区
// val result: DataSet[(Int, Long, String)] = source.partitionByHash(0).setParallelism(2).mapPartition(line => {
// line.map(line => (line._1, line._2, line._3))
// })
//partitionByRange
// val result: DataSet[(Int, Long, String)] = source.partitionByRange(0).setParallelism(2).mapPartition(line => {
// line.map(line => (line._1, line._2, line._3))
// })
//sortPartition
val result: DataSet[(Int, Long, String)] = source.sortPartition(0,Order.DESCENDING).setParallelism(2).mapPartition(line => {
line.map(line => (line._1, line._2, line._3))
})
//4.数据落地
result.writeAsText("sort",WriteMode.OVERWRITE)
//5.触发执行
env.execute("partition")
}
}
13. first
Takes the first N records (N counts from 1). A grouped variant is sketched right after the example below.
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import scala.collection.mutable
/**
* @Date 2019/10/16
*/
object FirstDemo {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2.加载数据
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val ds = env.fromCollection(data)
// ds.first(10).print()
//还可以先goup分组,然后在使用first取值
ds.first(2).print()
}
}
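As the comment above notes, first can also be combined with grouping and in-group sorting. A minimal sketch (the object name FirstByGroupDemo is just for illustration; the data is a shortened copy of the tuples above):
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._
object FirstByGroupDemo {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
val ds = env.fromElements((1, 1L, "Hi"), (2, 2L, "Hello"), (3, 2L, "Hello world"), (4, 3L, "I am fine."))
//within each group (keyed by the second field), sort descending by the first field
//and keep only the first record of every group
ds.groupBy(1)
.sortGroup(0, Order.DESCENDING)
.first(1)
.print()
}
}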
package cn.itcast
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
/**
* @Date 2019/10/16
*/
object ReadTxtDemo {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.读取本地磁盘文件
//val source: DataSet[String] = env.readTextFile("C:\\Users\\zhb09\\Desktop\\tmp\\user.txt")
//读取hdfs文件
val source: DataSet[String] = env.readTextFile("hdfs://node01:8020/tmp/user.txt")
//3.数据转换,单词统计
val result: AggregateDataSet[(String, Int)] = source.flatMap(_.split(","))
.map((_, 1))
.groupBy(0)
.sum(1)
//4.数据写入hdfs,OVERWRITE:数据覆盖
result.writeAsText("hdfs://node01:8020/tmp/user2.txt",WriteMode.OVERWRITE)
env.execute()
}
}
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object ReadCsv {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2.读取CSV文件
val result: DataSet[(String, String, Int)] = env.readCsvFile[(String, String, Int)](
"C:\\Users\\zhb09\\Desktop\\write\\test\\test.csv",
lineDelimiter = "\n", //行分隔符
fieldDelimiter = ",", //字段之间的分隔符
ignoreFirstLine = true, //忽略首行
lenient = false, //不忽略解析错误的行
includedFields = Array(0, 1, 2) //读取列
)
result.first(5).print()
}
}
Course review:
Flink environment setup
standalone
Flink on YARN
Packaging: separate the code from its dependency jars, and the code from its configuration files
Operators
[image: photo/1571277131670.png]
package cn.itcast.dataset
import java.util
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.mutable
/**
* @Date 2019/10/17
*/
object BrocastDemo {
def main(args: Array[String]): Unit = {
/**
* 1.获取批处理执行环境
* 2.加器数据源
* 3.数据转换
* (1)共享广播变量
* (2)获取广播变量
* (3)数据合并
* 4.数据打印/触发执行
*需求:从内存中拿到data2的广播数据,再与data1数据根据第二列元素组合成(Int, Long, String, String)
*/
//1.获取批处理执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.加器数据源
val data1 = new mutable.MutableList[(Int, Long, String)]
data1 .+=((1, 1L, "xiaoming"))
data1 .+=((2, 2L, "xiaoli"))
data1 .+=((3, 2L, "xiaoqiang"))
val ds1 = env.fromCollection(data1)
val data2 = new mutable.MutableList[(Int, Long, Int, String, Long)]
data2 .+=((1, 1L, 0, "Hallo", 1L))
data2 .+=((2, 2L, 1, "Hallo Welt", 2L))
data2 .+=((2, 3L, 2, "Hallo Welt wie", 1L))
val ds2 = env.fromCollection(data2)
//3.数据转换
ds1.map(new RichMapFunction[(Int,Long,String),(Int, Long, String, String)] {
var ds: util.List[(Int, Long, Int, String, Long)] = null
//open在map方法之前先执行
override def open(parameters: Configuration): Unit = {
//(2)获取广播变量
ds = getRuntimeContext.getBroadcastVariable[(Int, Long, Int, String, Long)]("ds2")
}
//(3)数据合并
import collection.JavaConverters._
override def map(value: (Int, Long, String)): (Int, Long, String, String) = {
var tuple: (Int, Long, String, String) = null
for(line<- ds.asScala){
if(line._2 == value._2){
tuple = (value._1,value._2,value._3,line._4)
}
}
tuple
}
}).withBroadcastSet(ds2,"ds2") //(1)共享广播变量
//4.数据打印/触发执行
.print()
}
}
package cn.itcast.dataset
import java.io.File
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
/**
* @Date 2019/10/17
*/
object DistributeCache {
def main(args: Array[String]): Unit = {
/**
* 1.获取执行环境
* 2.加载数据源
* 3.注册分布式缓存
* 4.数据转换
* (1)获取缓存文件
* (2)解析文件
* (3)数据转换
* 5.数据打印/以及触发执行
*/
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val clazz:DataSet[Clazz] = env.fromElements(
Clazz(1,"class_1"),
Clazz(2,"class_1"),
Clazz(3,"class_2"),
Clazz(4,"class_2"),
Clazz(5,"class_3"),
Clazz(6,"class_3"),
Clazz(7,"class_4"),
Clazz(8,"class_1")
)
//3.注册分布式缓存
val url = "hdfs://node01:8020/tmp/subject.txt"
env.registerCachedFile(url,"cache")
//4.数据转换
clazz.map(new RichMapFunction[Clazz,Info] {
val buffer = new ArrayBuffer[String]()
override def open(parameters: Configuration): Unit = {
//(1)获取缓存文件
val file: File = getRuntimeContext.getDistributedCache.getFile("cache")
//(2)解析文件
val strs: Iterator[String] = Source.fromFile(file.getAbsoluteFile).getLines()
strs.foreach(line=>{
buffer.append(line)
})
}
override def map(value: Clazz): Info = {
var info:Info = null
//(3)数据转换
for(line <- buffer){
val arr: Array[String] = line.split(",")
if(arr(0).toInt == value.id){
info = Info(value.id,value.clazz,arr(1),arr(2).toDouble)
}
}
info
}
}).print() // 5.数据打印/以及触发执行
}
}
//(学号 , 班级 , 学科 , 分数)
case class Info(id:Int,clazz:String,subject:String,score:Double)
case class Clazz(id:Int,clazz:String)
package cn.itcast.dataset
import org.apache.flink.api.common.JobExecutionResult
import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
/**
* @Date 2019/10/17
*/
object AccumulatorCount {
def main(args: Array[String]): Unit = {
/**
* 1.获取执行环境
* 2.加载数据源
* 3.数据转换
* (1)新建累加器
* (2)注册累加器
* (3)使用累加器
* 4.批量数据sink
* 5.触发执行
* 6.获取累加器的结果
*/
//1.获取执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val source: DataSet[Int] = env.fromElements(1, 2, 3, 4, 5, 6)
//3.数据转换
val result: DataSet[Int] = source.map(new RichMapFunction[Int, Int] {
//(1)新建累加器
var counter = new IntCounter()
override def open(parameters: Configuration): Unit = {
//(2)注册累加器
getRuntimeContext.addAccumulator("accumulator", counter)
}
override def map(value: Int): Int = {
//(3)使用累加器
counter.add(value)
value
}
})
//4.批量数据sink
result.writeAsText("accumulator")
//5.触发执行
val execuResult: JobExecutionResult = env.execute()
//6.获取累加器的结果
val i: Int = execuResult.getAccumulatorResult[Int]("accumulator")
println("累加器结果:" + i)
}
}
package cn.itcast.datastream
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/17
*/
object WordCountStream {
def main(args: Array[String]): Unit = {
/**
* 1.获取流处理执行环境
* 2.数据源加载
* 3.数据转换
* 切分,分组,聚合
* 4.数据打印
* 5.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.数据源加载
val source: DataStream[String] = env.socketTextStream("node01",8090)
//3.数据转换
source.flatMap(_.split("\\W+")) //split each line into words
.map((_,1))
.keyBy(0) //分组
.sum(1) //数据聚合
.print() //4.数据打印
//5.触发执行
env.execute()
}
}
package cn.itcast.datastream
import org.apache.flink.streaming.api.functions.co.CoMapFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/17
*/
object Connect {
def main(args: Array[String]): Unit = {
/**
* 1.获取流处理执行环境
* 2.加载/创建数据源
* 3.使用connect连接数据流,并做map转换
* 4.打印测试
* 5.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.加载/创建数据源
val source: DataStream[Int] = env.fromElements(1,2,3,4,5,6)
//3.使用connect连接数据流,并做map转换
val strSource: DataStream[String] = source.map(line=>line+"==")
//connect
source.connect(strSource).map(new CoMapFunction[Int,String,String] {
override def map1(value: Int): String = {
value+"xxxxx"
}
override def map2(value: String): String = {
value
}
}).print()//4.打印测试
//5.触发执行
env.execute()
}
}
package cn.itcast.datastream
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/17
*/
object SplitAndSelect {
def main(args: Array[String]): Unit = {
//1.获取执行
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val source: DataStream[Int] = env.fromElements(1,2,3,4,5,6)
//3.数据切分
val splitData: SplitStream[Int] = source.split(line => {
line % 2 match {
case 0 => List("even")
case 1 => List("odd")
}
})
//4.切分流数据查询
splitData.select("even").print()
env.execute()
}
}
From local collections
From distributed file systems
Custom sources
From sockets
Parallel data sources (a short sketch of the built-in variants follows; custom and parallel sources are implemented in the code below)
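A minimal sketch of the built-in source variants listed above, reusing the host, port, and HDFS path from earlier examples (the object name BuiltInSourcesSketch is just for illustration):
import org.apache.flink.streaming.api.scala._
object BuiltInSourcesSketch {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
//1. local collection
val fromList: DataStream[String] = env.fromCollection(List("java", "scala"))
//2. (distributed) file system, e.g. HDFS -- the path is a placeholder
val fromFile: DataStream[String] = env.readTextFile("hdfs://node01:8020/tmp/user.txt")
//3. socket -- host and port follow the earlier examples
val fromSocket: DataStream[String] = env.socketTextStream("node01", 8090)
fromList.print()
env.execute("sources sketch")
}
}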
package cn.itcast.datastream.source
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/17
*/
object SourceFunDemo {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.自定义数据源
env.addSource(new SourceFun).setParallelism(1)
.print()
env.execute()
}
}
//Parallel source: can run with parallelism > 1
class SourceFun extends RichParallelSourceFunction[Int]{
//flag used to stop the loop when the job is cancelled
var isRunning = true
override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
var i:Int = 0
while (isRunning){
i+=1
ctx.collect(i)
Thread.sleep(1000)
}
}
override def cancel(): Unit = isRunning = false
}
//Single (non-parallel) source: cannot run with parallelism > 1
//class SourceFun extends RichSourceFunction[Int] {
// override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
// var i:Int = 0
// while (true){
// i+=1
// ctx.collect(i)
// Thread.sleep(1000)
// }
//
// }
//
// override def cancel(): Unit = ???
//}
Extension
[image: photo/1571284628010.png]
1. Check the offsets of a consumer group for a topic
kafka-consumer-groups.sh --group test1017 --describe --bootstrap-server node01:9092,node02:9092,node03:9092
2. Create a topic
kafka-topics.sh --create --topic demo --partitions 3 --replication-factor 2 --zookeeper node01:2181,node02:2181,node03:2181
3. Produce data
kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic demo
4. Consume data
kafka-console-consumer.sh --topic demo --bootstrap-server node01:9092,node02:9092,node03:9092
5. Increase the number of partitions
kafka-topics.sh --alter --partitions 4 --topic demo --zookeeper node01:2181,node02:2181,node03:2181
package cn.itcast.datastream.source
import java.{lang, util}
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition
/**
* @Date 2019/10/17
*/
object KafkaConsumer {
def main(args: Array[String]): Unit = {
/**
* 1.获取流处理执行环境
* 2.配置kafka参数
* 3.整合kafka
* 4.设置kafka消费者模式
* 5.加载数据源
* 6.数据打印
* 7.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.配置kafka参数
val properties = new Properties()
properties.setProperty("bootstrap.servers","node01:9092,node02:9092,node03:9092")
properties.setProperty("group.id","test1017")
properties.setProperty("auto.offset.reset", "latest") //最近消费,与offset相关,从消费组上次消费的偏移量开始消费
//3.整合kafka
val kafkaConsumer = new FlinkKafkaConsumer011[String]("demo",new SimpleStringSchema(),properties)
//4.设置kafka消费者模式
//默认值,当前消费组记录的偏移量开始,接着上次的偏移量消费
//kafkaConsumer.setStartFromGroupOffsets()
//从头消费
//kafkaConsumer.setStartFromEarliest()
//从最近消费,与offset无关,会导致数据丢失
// kafkaConsumer.setStartFromLatest()
//指定偏移量消费数据
// val map = new util.HashMap[KafkaTopicPartition,lang.Long]()
// map.put(new KafkaTopicPartition("demo",0),6L)
// map.put(new KafkaTopicPartition("demo",1),6L)
// map.put(new KafkaTopicPartition("demo",2),6L)
//
// kafkaConsumer.setStartFromSpecificOffsets(map)
//动态感知kafka主题分区的增加 单位毫秒
properties.setProperty("flink.partition-discovery.interval-millis", "5000");
//5.加载数据源
val source: DataStream[String] = env.addSource(kafkaConsumer)
//6.数据打印
source.print()
//7.触发执行
env.execute()
}
}
package cn.itcast.datastream.source
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/17
*/
object MysqlSource {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.自定义数据源,读取mysql数据
env.addSource(new MySqlSourceDemo)
.print()
//3.触发执行
env.execute()
}
}
case class Demo(id:Int,name:String,age:Int)
class MySqlSourceDemo extends RichSourceFunction[Demo] {
var conn: Connection = null
var pst: PreparedStatement = null
//初始化数据源
override def open(parameters: Configuration): Unit = {
val driver = "com.mysql.jdbc.Driver"
Class.forName(driver)
val url = "jdbc:mysql://node02:3306/itcast"
//获取连接
conn = DriverManager.getConnection(url,"root","123456")
pst = conn.prepareStatement("select * from demo")
pst.setMaxRows(100) //查询最大行数
}
//执行业务查询的主方法
override def run(ctx: SourceFunction.SourceContext[Demo]): Unit = {
val rs: ResultSet = pst.executeQuery()
while (rs.next()){
val id: Int = rs.getInt(1)
val name: String = rs.getString(2)
val age: Int = rs.getInt(3)
ctx.collect(Demo(id,name,age))
}
}
//关流
override def close(): Unit = {
if(pst != null){
pst.close()
}
if(conn != null){
conn.close()
}
}
//nothing to cancel: the query in run() terminates on its own
override def cancel(): Unit = {}
}
package cn.itcast.datastream
import java.sql.{Connection, DriverManager, PreparedStatement}
import cn.itcast.datastream.source.Demo
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/17
*/
object MysqlSinkDemo {
def main(args: Array[String]): Unit = {
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val source: DataStream[Demo] = env.fromElements(Demo(20, "xiaoli", 20))
//3.数据写入mysql
source.addSink(new SinkMysql)
//4.触发执行
env.execute()
}
}
class SinkMysql extends RichSinkFunction[Demo] {
var conn: Connection = null
var pst: PreparedStatement = null
//初始化数据源
override def open(parameters: Configuration): Unit = {
val driver = "com.mysql.jdbc.Driver"
Class.forName(driver)
val url = "jdbc:mysql://node02:3306/itcast"
//获取连接
conn = DriverManager.getConnection(url, "root", "123456")
pst = conn.prepareStatement("insert into demo values(?,?,?)")
}
//数据插入操作
override def invoke(value: Demo): Unit = {
pst.setInt(1, value.id)
pst.setString(2, value.name)
pst.setInt(3, value.age)
pst.executeUpdate()
}
//关流
override def close(): Unit = {
if (pst != null) {
pst.close()
}
if (conn != null) {
conn.close()
}
}
}
package cn.itcast.datastream.sink
import java.util.Properties
import cn.itcast.datastream.source.Demo
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011
/**
* @Date 2019/10/17
*/
object KafkaSinkDemo {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.加载本地数据
val source: DataStream[Demo] = env.fromElements(Demo(1,"xiaoli",20))
val sourceStr: DataStream[String] = source.map(line => {
line.toString
})
//3.flink整合kafka
val properties = new Properties()
properties.setProperty("bootstrap.servers","node01:9092,node02:9092,node03:9092")
val kafkaProducer: FlinkKafkaProducer011[String] = new FlinkKafkaProducer011[String]("demo",new SimpleStringSchema(),properties)
//4.数据写入kafka
sourceStr.addSink(kafkaProducer)
//5.触发执行
env.execute()
}
}
package cn.itcast.datastream.sink
import java.net.{InetAddress, InetSocketAddress}
import java.util
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisClusterConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
/**
* @Date 2019/10/17
*/
object RedisSinkDemo {
def main(args: Array[String]): Unit = {
//1.获取执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.实时读取数据
val source: DataStream[String] = env.socketTextStream("node01",8090)
//3.将数据求和
val result: DataStream[(String, Int)] = source.flatMap(_.split("\\W+"))
.map((_, 1))
.keyBy(0)
.sum(1)
//4.将结果放入redis
//节点配置
val set = new util.HashSet[InetSocketAddress]()
set.add(new InetSocketAddress(InetAddress.getByName("node01"),7001))
set.add(new InetSocketAddress(InetAddress.getByName("node01"),7002))
set.add(new InetSocketAddress(InetAddress.getByName("node01"),7003))
//配置对象
val config: FlinkJedisClusterConfig = new FlinkJedisClusterConfig.Builder()
.setNodes(set)
.setMaxTotal(5)
.build()
//5.数据写入redis
result.addSink(new RedisSink(config,new MySinkRedis))
//6.触发执行
env.execute()
}
}
class MySinkRedis extends RedisMapper[(String,Int)] {
//指定redis的数据类型
override def getCommandDescription: RedisCommandDescription = {
new RedisCommandDescription(RedisCommand.HSET,"sinkRedis")
}
//redis key
override def getKeyFromData(data: (String, Int)): String = {
data._1
}
//redis value
override def getValueFromData(data: (String, Int)): String = {
data._2.toString
}
}
Time types:
Window types:
Time windows:
[image: photo/1571302824199.png]
Count windows:
[image: photo/1571303238317.png]
Time-window and count-window statistics:
package cn.itcast.datastream
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
/**
* @Date 2019/10/17
*/
object WindowDemo {
def main(args: Array[String]): Unit = {
/**
* 1.获取流处理执行环境
* 2.实时加载数据源
* 3.数据转换:Car(id,count)
* 4.数据分组
* 5.划分窗口
* 6.求和
* 7.数据打印
* 8.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.实时加载数据源
val source: DataStream[String] = env.socketTextStream("node01", 8090)
//3.数据转换:Car(id,count)
val cars: DataStream[Car] = source.map(line => {
val arr: Array[String] = line.split(",")
Car(arr(0).toInt, arr(1).toInt)
})
//4.数据分组
cars.keyBy(_.id)
//5.划分窗口
//无重叠的时间窗口
.timeWindow(Time.seconds(3))
//有重叠的时间窗口
// .timeWindow(Time.seconds(6),Time.seconds(3))
//无重叠的数量窗口
//.countWindow(3)
//有重叠的数量窗口
//.countWindow(6,3)
//.reduce((x,y)=>Car(x.id,x.count+y.count))
//.apply(new CarWindow)
.fold(100){
(x,y)=>{
x+y.count
}
}
//6.求和
//.sum(1)
//7.数据打印
.print()
// 8.触发执行
env.execute()
}
}
case class Car(id: Int, count: Int)
class CarWindow extends WindowFunction[Car,Car,Int,TimeWindow] {
override def apply(key: Int, window: TimeWindow, input: Iterable[Car], out: Collector[Car]): Unit = {
//use a separate name so the local accumulator does not shadow the key parameter
var carId = key
var sum = 0
for(line<- input){
carId = line.id
sum = sum+line.count
}
out.collect(Car(carId,sum))
}
}
Local mode
Standalone mode (HA)
Flink on YARN
DataSet
DataStream
Time, classified by where the timestamp is taken from (a minimal sketch of selecting the time characteristic follows this list):
Event time (EventTime): the time carried by the event itself
Ingestion time: the time at which Flink ingests the event (current system time at the source)
Processing time: the time at which an operator processes the event (current system time)
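A minimal sketch of selecting the time characteristic on the streaming environment (Flink 1.7-era API, matching the code in these notes; a real job would use only one of the three calls; the object name TimeCharacteristicSketch is just for illustration):
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
object TimeCharacteristicSketch {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
//processing time is the default: operators use their own system clock
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
//ingestion time: timestamps are assigned when events enter Flink at the source
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
//event time: timestamps come from the events themselves and require watermarks
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
}
}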
package cn.itcast
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
/**
* @Date 2019/10/19
*/
object EventTimeDemo {
def main(args: Array[String]): Unit = {
/**
* 需求:以EventTime划分窗口,计算3秒钟内出价最高的产品
* 步骤:
* 1.获取流处理执行环境
* 2.设置事件时间
* 3.加载数据源
* 4.数据转换,新建样例类
* 5.设置水位线(延迟的时间轴)
* 6.分组
* 7.划分窗口时间 3s
* 8.统计最大值
* 9.数据打印
* 10.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2.设置事件时间,必须设置
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//3.加载数据源
val source: DataStream[String] = env.socketTextStream("node01", 8090)
//4.数据转换,新建样例类
val bossData: DataStream[Boss] = source.map(line => {
val arr: Array[String] = line.split(",")
Boss(arr(0).toLong, arr(1), arr(2), arr(3).toDouble)
})
//5.设置水位线(延迟的时间轴),周期性水位线
//实现一:
val waterData: DataStream[Boss] = bossData.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[Boss] {
//延时时间,自己定义的,到公司中,看具体实际情况
val delayTime: Long = 2000L
var currentTimestamp: Long = 0L //当前时间 22s 23s
//2.后执行, 获取当前水位线
override def getCurrentWatermark: Watermark = {
new Watermark(currentTimestamp - delayTime) //这就是水位线的时间,是一个延迟的时间轴
}
//1.先执行, 提取事件时间(消息源本身的时间)
override def extractTimestamp(element: Boss, previousElementTimestamp: Long): Long = {
//消息本身的时间 22s 23s 19s
val timestamp = element.time
//谁打大取谁
currentTimestamp = Math.max(timestamp, currentTimestamp) //保证时间永远往前走
timestamp
}
})
//实现二:
//Time.seconds(2) 是延时时间
// val waterData: DataStream[Boss] = bossData.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[Boss](Time.seconds(2)) {
// override def extractTimestamp(element: Boss): Long = {
// val time: Long = element.time
// time
// }
// })
//6.分组
waterData.keyBy(_.boss)
//7.划分窗口时间 3s
.timeWindow(Time.seconds(3))
//8.统计最大值
.maxBy(3)
//9.数据打印
.print()
//10.触发执行
env.execute()
}
}
//数据:(时间,公司,产品,价格)
case class Boss(time: Long, boss: String, product: String, price: Double)
1. Window boundaries do not depend on the event times themselves but on the window definition: with a 3 s window, windows are aligned from 00 s, i.e. (00,01,02), (03,04,05), and so on.
2. To assign windows and watermarks by event time, TimeCharacteristic.EventTime must be set.
3. A periodic watermark assigner (AssignerWithPeriodicWatermarks) must be implemented.
A small worked sketch of the window alignment and the watermark arithmetic follows.
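This sketch uses Flink's TimeWindow.getWindowStartWithOffset helper to illustrate the alignment rule; the concrete timestamps and the 2 s delay are illustrative values taken from the comments in the code above:
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
object WindowAlignmentSketch {
def main(args: Array[String]): Unit = {
//a 3 s tumbling window containing the event time 22 000 ms starts at 21 000 ms:
//windows are aligned to 0 s, 3 s, 6 s, ... independently of the event times themselves
val start = TimeWindow.getWindowStartWithOffset(22000L, 0L, 3000L)
println(s"window = [$start, ${start + 3000})")
//with a 2 s watermark delay, the watermark after seeing event time 27 000 ms is 25 000 ms,
//which is >= the window end 24 000 ms, so the window [21000, 24000) fires
val watermark = 27000L - 2000L
println(s"watermark = $watermark, window fires: ${watermark >= start + 3000}")
}
}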
package cn.itcast
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
/**
* @Date 2019/10/19
*/
object EventTimeDemo {
def main(args: Array[String]): Unit = {
/**
* 需求:以EventTime划分窗口,计算3秒钟内出价最高的产品
* 步骤:
* 1.获取流处理执行环境
* 2.设置事件时间
* 3.加载数据源
* 4.数据转换,新建样例类
* 5.设置水位线(延迟的时间轴)
* 6.分组
* 7.划分窗口时间 3s
* 8.统计最大值
* 9.数据打印
* 10.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2.设置事件时间,必须设置
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//3.加载数据源
val source: DataStream[String] = env.socketTextStream("node01", 8090)
//4.数据转换,新建样例类
val bossData: DataStream[Boss] = source.map(line => {
val arr: Array[String] = line.split(",")
Boss(arr(0).toLong, arr(1), arr(2), arr(3).toDouble)
})
//5.设置水位线(延迟的时间轴),周期性水位线
//实现一:
val waterData: DataStream[Boss] = bossData.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[Boss] {
//延时时间,自己定义的,到公司中,看具体实际情况
val delayTime: Long = 2000L
var currentTimestamp: Long = 0L //当前时间 22s 23s
//2.后执行, 获取当前水位线
override def getCurrentWatermark: Watermark = {
new Watermark(currentTimestamp - delayTime) //这就是水位线的时间,是一个延迟的时间轴
}
//1.先执行, 提取事件时间(消息源本身的时间)
override def extractTimestamp(element: Boss, previousElementTimestamp: Long): Long = {
//消息本身的时间 22s 23s 19s
val timestamp = element.time
//谁打大取谁
currentTimestamp = Math.max(timestamp, currentTimestamp) //保证时间永远往前走
timestamp
}
})
//定义侧输出流
val outputTag = new OutputTag[Boss]("outPutTag")
//6.分组
val result: DataStream[Boss] = waterData.keyBy(_.boss)
//7.划分窗口时间 3s
.timeWindow(Time.seconds(3))
//在水位线延迟的基础之上,再延迟2s钟
.allowedLateness(Time.seconds(2))
//收集延迟数据
.sideOutputLateData(outputTag)
//8.统计最大值
.maxBy(3)
//9.数据打印
result.print("正常数据:")
//打印延迟数据
val outputData: DataStream[Boss] = result.getSideOutput(outputTag)
outputData.print("延迟数据:")
//10.触发执行
env.execute()
}
}
//数据:(时间,公司,产品,价格)
case class Boss(time: Long, boss: String, product: String, price: Double)
package cn.itcast;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.concurrent.TimeUnit;
/**
* @Date 2019/10/19
*/
public class AsyIoDemo {
public static void main(String[] args) throws Exception {
/**
* 1.获取流处理执行环境
* 2.加载ab.txt
* 3.异步流对象AsyncDataStream使用有序模式
* 4.自定义异步处理函数,初始化redis
* 5.CompletableFuture发起异步请求
* 6.thenAccept接收异步返回数据
* 7.打印结果
* 8.触发执行
*/
//1.获取流处理执行环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//2.加载ab.txt
DataStreamSource<String> source = env.readTextFile("C:\\Users\\zhb09\\Desktop\\tmp\\fs\\ab.txt");
//3. Wrap the stream with AsyncDataStream in ordered-wait mode
SingleOutputStreamOperator<String> result = AsyncDataStream.orderedWait(source, new MyAsyncFun(), 60000, TimeUnit.SECONDS, 1);
//7.打印结果
result.print();
//8.触发执行
env.execute();
}
}
package cn.itcast;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.scala.async.AsyncFunction;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;
import redis.clients.jedis.JedisPoolConfig;
import java.util.Collections;
import java.util.HashSet;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;
/**
* @Date 2019/10/19
*/
//
public class MyAsyncFun extends RichAsyncFunction<String, String> {
//初始化redis连接
JedisCluster jedisCluster = null;
@Override
public void open(Configuration parameters) throws Exception {
HashSet<HostAndPort> set = new HashSet<>();
set.add(new HostAndPort("node01", 7001));
set.add(new HostAndPort("node01", 7002));
set.add(new HostAndPort("node01", 7003));
JedisPoolConfig jedisPoolConfig = new JedisPoolConfig();
jedisPoolConfig.setMaxTotal(10);
jedisPoolConfig.setMaxIdle(10);
jedisPoolConfig.setMinIdle(5);
jedisCluster = new JedisCluster(set, jedisPoolConfig);
}
@Override
public void asyncInvoke(String input, ResultFuture<String> resultFuture) throws Exception {
//5. Issue the asynchronous request
CompletableFuture.supplyAsync(new Supplier<String>() {
//第一步,获取所需结果值
@Override
public String get() {
//数据切分,获取name值
String name = input.split(",")[1];
String redisValue = jedisCluster.hget("AsyncReadRedis", name);
return redisValue;
}
}).thenAccept((String str) -> { //6.thenAccept接收异步返回数据
//第二,异步回调
resultFuture.complete(Collections.singleton(str));
});
}
}
package cn.itcast
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/19
*/
object ValueState {
def main(args: Array[String]): Unit = {
/**
* 开发步骤:
* 1.获取流处理执行环境
* 2.加载数据源,以及数据处理
* 3.数据分组
* 4.数据转换,定义ValueState,保存中间结果
* 5.数据打印
* 6.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val source: DataStream[(Long, Long)] = env.fromCollection(List(
(1L, 4L),
(2L, 3L),
(3L, 1L),
(1L, 2L),
(3L, 2L),
(1L, 2L),
(2L, 2L),
(2L, 9L)
))
//数据处理
//3.数据分组
val keyData: KeyedStream[(Long, Long), Tuple] = source.keyBy(0)
//4.数据转换,定义ValueState,保存中间结果
keyData.map(new RichMapFunction[(Long, Long), (Long, Long)] {
var vsState: ValueState[(Long, Long)] = _
//定义ValueState
override def open(parameters: Configuration): Unit = {
//新建状态描述器
val vs = new ValueStateDescriptor[(Long, Long)]("vs", TypeInformation.of(new TypeHint[(Long, Long)] {}))
//获取ValueState
vsState = getRuntimeContext.getState(vs)
}
//计算并保存中间结果
override def map(value: (Long, Long)): (Long, Long) = {
//获取vsState内的值
val tuple: (Long, Long) = vsState.value()
val currnetTuple = if (tuple == null) {
(0L, 0L)
} else {
tuple
}
val tupleResult: (Long, Long) = (value._1, value._2 + currnetTuple._2)
//更新vsState
vsState.update(tupleResult)
tupleResult
}
}).print() //5.数据打印
//6.触发执行
env.execute()
}
}
package cn.itcast
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/19
*/
object MapState {
def main(args: Array[String]): Unit = {
/**
* 1.获取流处理执行环境
* 2.加载数据源
* 3.数据分组
* 4.数据转换,定义MapState,保存中间结果
* 5.数据打印
* 6.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.加载数据源
val source: DataStream[(String, Int)] = env.fromCollection(List(
("java", 1),
("python", 3),
("java", 2),
("scala", 2),
("python", 1),
("java", 1),
("scala", 2)
))
//3.数据分组
val keyData: KeyedStream[(String, Int), Tuple] = source.keyBy(0)
//4.数据转换,定义MapState,保存中间结果
keyData.map(new RichMapFunction[(String, Int), (String, Int)] {
var mapState: MapState[String, Int] = _
//定义MapState
override def open(parameters: Configuration): Unit = {
//定义MapState的描述器
val ms = new MapStateDescriptor[String, Int]("ms", TypeInformation.of(new TypeHint[String] {}),
TypeInformation.of(new TypeHint[Int] {}))
//注册和获取mapState
mapState = getRuntimeContext.getMapState(ms)
}
//计算并保存中间结果
override def map(value: (String, Int)): (String, Int) = {
//先获取mapState内的值
val i: Int = mapState.get(value._1)
//mapState数据更新
mapState.put(value._1, value._2 + i)
(value._1, value._2 + i)
}
}).print() //5.数据打印
//6.触发执行
env.execute()
}
}
(1) Data accumulation (running sum)
package cn.itcast
import java.{lang, util}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.java.typeutils.ListTypeInfo
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/14
*/
object ListState {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val source: DataStream[(String, Int)] = env.fromElements(
("java", 1),
("python", 3),
("java", 4),
("scala", 2),
("python", 1),
("java", 1))
source.keyBy(0).map(new RichMapFunction[(String,Int),(String,Int)] {
//定义listState
var listState: ListState[(String,Int)] = _
override def open(parameters: Configuration): Unit = {
//定义描述器
val liState: ListStateDescriptor[(String,Int)] = new ListStateDescriptor[(String,Int)]("liState",TypeInformation.of(new TypeHint[(String,Int)] {}))
//注册和获取listState
listState = getRuntimeContext.getListState(liState)
}
//数据转换和计算
override def map(value: (String,Int)): (String,Int) = {
//获取最新的listState的值
val ints: lang.Iterable[(String,Int)] = listState.get()
val v2: util.Iterator[(String,Int)] = ints.iterator()
var i: (String,Int) =null
while (v2.hasNext){
i = v2.next()
}
val iData = if(i == null){
("null",0)
}else{
i
}
val v3 = (value._1,value._2+iData._2)
listState.clear()
listState.add(v3)
v3
}
}).print()
env.execute()
}
}
(2) Simulating a Kafka offset
package cn.itcast
import java.util
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.common.time.Time
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/19
*/
object OperateListState {
def main(args: Array[String]): Unit = {
/**
* 1.获取执行环境
* 2.设置检查点机制:路径,重启策略
* 3.自定义数据源
* (1)需要继承并行数据源和CheckpointedFunction
* (2)设置listState,通过上下文对象context获取
* (3)数据处理,保留offset
* (4)制作快照
* 4.数据打印
* 5.触发执行
*/
//1.获取执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2.设置检查点机制:路径,重启策略
env.enableCheckpointing(1000)//每1s,启动一次检查点
//检查点保存路径
env.setStateBackend(new FsStateBackend("hdfs://node01:8020/checkpoint"))
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)//强一致性
env.getCheckpointConfig.setCheckpointTimeout(60000)//检查点制作超时时间
env.getCheckpointConfig.setFailOnCheckpointingErrors(false) //检查点制作失败,任务继续运行
//任务取消时,删除检查点
env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION)
//重启策略 ,重启3次,每次间隔5s
//注意:一旦配置检查点机制,会无限重启 ,宕机如果取消检查点机制,出现异常直接宕机
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,Time.seconds(5)))
//3.自定义数据源
env.addSource(new OpeSource)
.print() // 4.数据打印
//5.触发执行
env.execute()
}
}
class OpeSource extends RichSourceFunction[Long] with CheckpointedFunction {
var lsState: ListState[Long] = _
var offset:Long = 0L
//flag used to stop the emit loop when the job is cancelled
var isRunning = true
//业务逻辑处理
override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
//获取listState的数据
val lsData: util.Iterator[Long] = lsState.get().iterator()
while (lsData.hasNext){
offset = lsData.next()
}
//注意:我们时模拟kafka的offset的提交
while (isRunning){
offset +=1
ctx.collect(offset)
Thread.sleep(1000)
if(offset>10){
1/0
}
}
}
//cancel the task: stop the emit loop
override def cancel(): Unit = isRunning = false
//制作offset 的快照
override def snapshotState(context: FunctionSnapshotContext): Unit = {
lsState.clear() //清空历史offset数据
lsState.add(offset) //更新最新的offset
}
//(2)设置listState,通过上下文对象context获取
//初始化状态
override def initializeState(context: FunctionInitializationContext): Unit = {
//定义并获取listState
val ls = new ListStateDescriptor[Long]("ls",TypeInformation.of(new TypeHint[Long] {}))
lsState = context.getOperatorStateStore.getListState(ls)
}
}
Course plan:
[image: photo/1571565339794.png]
Checkpoint storage: three options (see the configuration sketch below)
Checkpoint execution modes:
fixed-delay: fixed-delay restart strategy; the job goes down after N failed restarts
failure-rate: failure-rate restart strategy; the job goes down if it restarts more than N times within a given time window
no restart: do not restart
If a checkpoint mechanism is configured, the job restarts indefinitely (by default)
If checkpointing is configured and the job should not restart after a failure, no restart must be configured explicitly
By default (no checkpointing configured), Flink does not restart
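A minimal configuration sketch of the state-backend options and the three restart strategies (the HDFS path is a placeholder; the RocksDB backend is omitted because it needs the extra flink-statebackend-rocksdb dependency; a real job would set only one backend and one strategy; the object name RestartAndBackendSketch is just for illustration):
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.runtime.state.memory.MemoryStateBackend
import org.apache.flink.streaming.api.scala._
object RestartAndBackendSketch {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
//checkpoint storage (state backends): JobManager memory, a file system such as HDFS,
//or RocksDB (not shown here)
env.setStateBackend(new MemoryStateBackend())
env.setStateBackend(new FsStateBackend("hdfs://node01:8020/checkpoint"))
//fixed-delay: give up after 3 failed attempts, waiting 5 s between attempts
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(5)))
//failure-rate: give up when more than 3 failures happen within 1 minute, 5 s between attempts
env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.minutes(1), Time.seconds(5)))
//no restart: fail the job on the first exception even when checkpointing is enabled
env.setRestartStrategy(RestartStrategies.noRestart())
}
}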
package cn.itcast.checkpoint
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
* @Date 2019/10/20
*/
object CheckpointDemo {
def main(args: Array[String]): Unit = {
/**
* 输入三次zhangsan,程序挂掉
*/
/**
* 1.获取流处理执行环境
* 2.设置检查点机制
* 3.设置重启策略
* 4.数据打印
* 5.触发执行
*/
// //1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// //2.设置检查点机制
// env.enableCheckpointing(5000) //开启检查点,每5s钟触发一次
// //设置检查点存储路径
// env.setStateBackend(new FsStateBackend("hdfs://node01:8020/tmp/checkpoint"))
// env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE) //强一致性
// env.getCheckpointConfig.setCheckpointTimeout(60000) //检查点制作的超时时间
// env.getCheckpointConfig.setFailOnCheckpointingErrors(false) //如果检查点制作失败,任务继续运行
// env.getCheckpointConfig.setMaxConcurrentCheckpoints(1) //检查点最大次数
// //DELETE_ON_CANCELLATION:任务取消的时候,会删除检查点
// //RETAIN_ON_CANCELLATION:任务取消的时候,会保留检查点,需要手动删除检查点,生产上主要使用这种方式
// env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
//3.设置重启策略 失败重启三次,重启间隔时间是5s
//env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,Time.seconds(5)))
//4.数据打印
val source: DataStream[String] = env.socketTextStream("node01",8090)
source.flatMap(_.split("\\W+"))
.map(line=>{
if(line.equals("zhangsan")){
throw new RuntimeException("失败重启........")
}
line
}).print()
//5.触发执行
env.execute()
}
}
If neither checkpointing nor a restart strategy is configured, an exception brings the job down immediately.
If a restart strategy is configured without checkpointing, the restart strategy is applied.
If both checkpointing and a restart strategy are configured, the restart strategy is applied.
If checkpointing is configured but no restart strategy, the job restarts indefinitely.
Saving the checkpoint results to HDFS
package cn.itcast.checkpoint
import java.util
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
/**
* @Date 2019/10/20
*/
object CheckpointSumResult {
/**
* 1)使用自定义算子每秒钟产生大约10000条数据。
* 2)产生的数据为一个四元组(Long,String,String,Integer)—------(id,name,info,count)
* 3)数据经统计后,统计结果打印到终端输出
* 4)打印输出的结果为Long类型的数据
*
* 开发思路
* 1)source算子每隔1秒钟发送1000条数据,并注入到Window算子中。
* 2)window算子每隔1秒钟统计一次最近4秒钟内数据数量。
* 3)每隔1秒钟将统计结果打印到终端
* 4)每隔6秒钟触发一次checkpoint,然后将checkpoint的结果保存到HDFS中。
*/
def main(args: Array[String]): Unit = {
/**
* 开发步骤:
* 1.获取执行环境
* 2.设置检查点机制
* 3.自定义数据源 ,新建样例类
* 4.设置水位线(必须设置处理时间)
* 5.数据分组
* 6.划分时间窗口
* 7.数据聚合,然后将checkpoint的结果保存到HDFS中
* 8.数据打印
* 9.触发执行
*/
//1.获取执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.设置检查点存储路径
env.enableCheckpointing(5000)
env.setStateBackend(new FsStateBackend("hdfs://node01:8020/tmp/checkpoint"))
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE) //强一致性
env.getCheckpointConfig.setCheckpointTimeout(60000) //检查点制作的超时时间
env.getCheckpointConfig.setFailOnCheckpointingErrors(false) //如果检查点制作失败,任务继续运行
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1) //检查点最大次数
//DELETE_ON_CANCELLATION:任务取消的时候,会删除检查点
//RETAIN_ON_CANCELLATION:任务取消的时候,会保留检查点,需要手动删除检查点,生产上主要使用这种方式
env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
//3.自定义数据源 ,新建样例类
val source: DataStream[Info] = env.addSource(new CheckpointSourceFunc)
//4.设置水位线(必须设置处理时间)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
source.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[Info] {
override def getCurrentWatermark: Watermark = {
new Watermark(System.currentTimeMillis())
}
override def extractTimestamp(element: Info, previousElementTimestamp: Long): Long = {
System.currentTimeMillis()
}
})
//5.数据分组
.keyBy(0)
//6.划分时间窗口
.timeWindow(Time.seconds(4), Time.seconds(1))
//7.数据聚合,然后将checkpoint的结果保存到HDFS中
.apply(new CheckpointWindowFunc)
//8.数据打印
.print()
//9.触发执行
env.execute()
}
}
//(Long, String, String, Integer),即(id, name, info, count)
case class Info(id: Long, name: String, info: String, count: Long)
class CheckpointSourceFunc extends RichSourceFunction[Info] {
  //flag flipped by cancel() so the loop can stop cleanly
  @volatile private var isRunning = true
  override def run(ctx: SourceFunction.SourceContext[Info]): Unit = {
    while (isRunning) {
      for (line <- 0 until 1000) {
        ctx.collect(Info(1, "test", "test:" + line, 1))
      }
      Thread.sleep(1000)
    }
  }
  //stop the source instead of throwing NotImplementedError (the original ???)
  override def cancel(): Unit = {
    isRunning = false
  }
}
class State extends Serializable {
var total: Long = 0
def getTotal = total
def setTotal(value: Long) = {
total = value
}
}
//聚合结果,并将结果保存到hdfs
class CheckpointWindowFunc extends WindowFunction[Info, Long, Tuple, TimeWindow]
with ListCheckpointed[State] {
var total: Long = 0
//数据处理,数据聚合
override def apply(key: Tuple, window: TimeWindow, input: Iterable[Info], out: Collector[Long]): Unit = {
var sum: Long = 0L
for (line <- input) {
sum = sum + line.count
}
total += sum
out.collect(total)
}
//制作快照
override def snapshotState(checkpointId: Long, timestamp: Long): util.List[State] = {
val states = new util.ArrayList[State]()
val state = new State
state.setTotal(total)
states.add(state)
states
}
//修复快照,更新数据,通过快照,拿到最新数据
override def restoreState(state: util.List[State]): Unit = {
total = state.get(0).getTotal
}
}
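Because the checkpoints are externalized with RETAIN_ON_CANCELLATION, the data under hdfs://node01:8020/tmp/checkpoint survives a cancel, and the job can be resumed from the last retained checkpoint with the -s option of flink run. The job id and checkpoint number below are placeholders, and the main class name matches the object above:
bin/flink run -s hdfs://node01:8020/tmp/checkpoint/<job-id>/chk-<n> -c cn.itcast.checkpoint.CheckpointSumResult <your-job>.jar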
官方文档:
https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
package cn.itcast.checkpoint
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper
/**
* @Date 2019/10/20
*/
object ExactlyKafka {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//隐式转换
import org.apache.flink.api.scala._
//checkpoint配置
env.enableCheckpointing(5000)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
env.getCheckpointConfig.setCheckpointTimeout(60000)
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
//数据加载及数据转换
val source: DataStream[String] = env.socketTextStream("node01", 8090)
val strValue: DataStream[String] = source.flatMap(_.split(" ")).map(line=>line)
//配置kafka生产,参数
val topic = "demo"
val prop = new Properties()
prop.setProperty("bootstrap.servers", "node01:9092,node02:9092,node03:9092")
//设置事务超时时间,也可在kafka配置中设置
prop.setProperty("transaction.timeout.ms",60000*15+"")
val kafkaProducer = new FlinkKafkaProducer011[String](topic, new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema), prop, FlinkKafkaProducer011.Semantic.EXACTLY_ONCE)
//使用至少一次语义的形式
//val myProducer = new FlinkKafkaProducer011[String](brokerList, topic, new SimpleStringSchema)
//使用支持仅一次语义的形式
strValue.addSink(kafkaProducer)
env.execute("StreamingKafkaSinkScala")
}
}
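To see the effect of the exactly-once sink, the consumer must only read committed transactional messages; records from an in-flight or aborted Flink transaction stay invisible. A quick check with the console consumer (isolation.level is a standard Kafka consumer property; the broker matches the producer config above):
bin/kafka-console-consumer.sh --bootstrap-server node01:9092 --topic demo --consumer-property isolation.level=read_committed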
package cn.itcast.checkpoint
import java.net.{InetAddress, InetSocketAddress}
import java.util
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisClusterConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
/**
* @Date 2019/10/20
*/
object ExactlyRedis {
def main(args: Array[String]): Unit = {
/**
* 1.获取流处理执行环境
* 2.设置检查点机制
* 3.定义kafkaConsumer
* 4.数据转换:分组,求和
* 5.数据写入redis
* 6.触发执行
*/
//1.获取流处理执行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2.设置检查点机制
//checkpoint配置
env.enableCheckpointing(5000)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
env.getCheckpointConfig.setCheckpointTimeout(60000)
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
//3.定义kafkaConsumer
val properties = new Properties()
properties.setProperty("bootstrap.servers","node01:9092,node02:9092,node03:9092")
properties.setProperty("group.id","test1020")
//不自动提交偏移量
properties.setProperty("enable.auto.commit", "false")
val kafkaConsumer = new FlinkKafkaConsumer011[String]("demo",new SimpleStringSchema(),properties)
//检查点制作成功之后,再提交偏移量
kafkaConsumer.setCommitOffsetsOnCheckpoints(true)
val source: DataStream[String] = env.addSource(kafkaConsumer)
//4.数据转换:分组,求和
val sumResult: DataStream[(String, Int)] = source.flatMap(_.split(" "))
.map((_, 1))
.keyBy(0)
.sum(1)
//5.数据写入redis
//设置redis属性
//redis的节点
val set = new util.HashSet[InetSocketAddress]()
set.add(new InetSocketAddress(InetAddress.getByName("node01"),7001))
set.add(new InetSocketAddress(InetAddress.getByName("node01"),7002))
set.add(new InetSocketAddress(InetAddress.getByName("node01"),7003))
val config: FlinkJedisClusterConfig = new FlinkJedisClusterConfig.Builder()
.setNodes(set)
.setMaxTotal(5)
.setMinIdle(2)
.setMaxIdle(5)
.build()
sumResult.addSink(new RedisSink[(String, Int)](config,new ExactlyRedisMapper))
env.execute("redis exactly")
}
}
class ExactlyRedisMapper extends RedisMapper[(String,Int)] {
override def getCommandDescription: RedisCommandDescription = {
//设置redis的数据结构类型
new RedisCommandDescription(RedisCommand.HSET,"exactlyRedis")
}
override def getKeyFromData(data: (String, Int)): String = {
data._1
}
override def getValueFromData(data: (String, Int)): String = {
data._2.toString
}
}
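With RedisCommand.HSET and the additional key exactlyRedis, each word is written as a field of that hash and its running count as the value. Assuming the cluster nodes configured above, the result can be inspected with redis-cli:
redis-cli -c -h node01 -p 7001 HGETALL exactlyRedis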
package cn.itcast.sql
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala.BatchTableEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.table.sinks.CsvTableSink
import org.apache.flink.types.Row
/**
* @Date 2019/10/20
*/
object BatchDataSetSql {
def main(args: Array[String]): Unit = {
/**
* 1)获取一个批处理运行环境
* 2)获取一个Table运行环境
* 3)创建一个样例类 Order 用来映射数据(订单名、用户名、订单日期、订单金额)
* 4)基于本地 Order 集合创建一个DataSet source
* 5)使用Table运行环境将DataSet注册为一张表
* 6)使用SQL语句来操作数据(统计用户消费订单的总金额、最大金额、最小金额、订单总数)
* 7)使用TableEnv.toDataSet将Table转换为DataSet
* 8)打印测试
*/
//1)获取一个批处理运行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2)获取一个Table运行环境
val tblEnv: BatchTableEnvironment = TableEnvironment.getTableEnvironment(env)
//4)基于本地 Order 集合创建一个DataSet source
val source: DataSet[Order] = env.fromElements(
Order(1, "zhangsan", "2018-10-20 15:30", 358.5),
Order(2, "zhangsan", "2018-10-20 16:30", 131.5),
Order(3, "lisi", "2018-10-20 16:30", 127.5),
Order(4, "lisi", "2018-10-20 16:30", 328.5),
Order(5, "lisi", "2018-10-20 16:30", 432.5),
Order(6, "zhaoliu", "2018-10-20 22:30", 451.0),
Order(7, "zhaoliu", "2018-10-20 22:30", 362.0),
Order(8, "zhaoliu", "2018-10-20 22:30", 364.0),
Order(9, "zhaoliu", "2018-10-20 22:30", 341.0)
)
//将批量数据注册成表
tblEnv.registerDataSet("order2", source)
//1.table api数据查询
//val table: Table = tblEnv.scan("order").select("userId,userName")
//2.sql数据查询
//val table: Table = tblEnv.sqlQuery("select * from order2")
//统计用户消费订单的总金额、最大金额、最小金额、订单总数
val sql =
"""
| select
| userName,
| sum(price) totalMoney,
| max(price) maxMoney,
| min(price) minMoney,
| count(1) totalCount
| from order2
| group by userName
|""".stripMargin //在scala中stripMargin默认是“|”作为多行连接符
// val sql = "select userName,sum(price),max(price),min(price),count(*) from order2 group by userName"
val table: Table = tblEnv.sqlQuery(sql)
// table.writeToSink(new CsvTableSink("C:\\Users\\zhb09\\Desktop\\write\\test\\orderCsv.csv",
// ",",1,WriteMode.OVERWRITE))
//打印数据到控制台
val rows: DataSet[Row] = tblEnv.toDataSet[Row](table)
rows.print()
// env.execute()
}
}
//3)创建一个样例类 Order 用来映射数据(订单名、用户名、订单日期、订单金额)
case class Order(userId: Int, userName: String, eventTime: String, price: Double)
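The same aggregation can also be expressed with the Table API instead of SQL. The sketch below is an assumption based on the Flink 1.7 Scala expression DSL ('price.sum, as, count); it sits in the same package so it can reuse the Order case class above:
package cn.itcast.sql
import org.apache.flink.api.scala._
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
object BatchTableApi {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tblEnv = TableEnvironment.getTableEnvironment(env)
    val source = env.fromElements(
      Order(1, "zhangsan", "2018-10-20 15:30", 358.5),
      Order(3, "lisi", "2018-10-20 16:30", 127.5))
    tblEnv.registerDataSet("order2", source)
    //same aggregation as the SQL above, written with the expression DSL
    val table: Table = tblEnv.scan("order2")
      .groupBy('userName)
      .select('userName,
        'price.sum as 'totalMoney,
        'price.max as 'maxMoney,
        'price.min as 'minMoney,
        'userName.count as 'totalCount)
    tblEnv.toDataSet[Row](table).print()
  }
}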
package cn.itcast.sql
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.{Table, TableEnvironment, Types}
import org.apache.flink.table.api.scala.BatchTableEnvironment
import org.apache.flink.table.sources.CsvTableSource
import org.apache.flink.types.Row
import org.apache.flink.api.scala._
/**
* @Date 2019/10/20
*/
object ReadCsvSource {
def main(args: Array[String]): Unit = {
/**
* 1.获取批处理执行环境
* 2.获取表执行环境
* 3.加载csv数据
* 4.注册表
* 5.查询表数据
* 6.table 转换为批量数据
* 7.数据打印
*/
//1.获取批处理执行环境
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2.获取表执行环境
val tblEnv: BatchTableEnvironment = TableEnvironment.getTableEnvironment(env)
//3.加载csv数据
val tableSource: CsvTableSource = CsvTableSource.builder()
.path("C:\\Users\\zhb09\\Desktop\\write\\test\\test.csv")
.field("name", Types.STRING)
.field("address", Types.STRING)
.field("age", Types.INT)
.ignoreFirstLine() //忽略首行
.lineDelimiter("\n") //换行符
.fieldDelimiter(",")
.build()
//4.注册表
tblEnv.registerTableSource("csv",tableSource)
//5.查询表数据
val table: Table = tblEnv.sqlQuery("select * from csv")
//6.table 转换为批量数据
val values: DataSet[Row] = tblEnv.toDataSet[Row](table)
//数据打印
values.print()
}
}
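The CsvTableSource above assumes a header line (skipped by ignoreFirstLine) followed by comma-separated name, address and age fields. A hypothetical test.csv matching that schema:
name,address,age
zhangsan,beijing,20
lisi,shanghai,30
wangwu,shenzhen,25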
需求:使用Flink SQL来统计5秒内 用户的 订单总数、订单的最大金额、订单的最小金额。
package cn.itcast.sql
import java.util.UUID
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.types.Row
import scala.util.Random
/**
* @Date 2019/10/20
*/
object StreamSql {
def main(args: Array[String]): Unit = {
/**
* 1)获取流处理运行环境
* 2)获取Table运行环境
* 3)设置时间语义为 EventTime
* 4)创建一个订单样例类 Order ,包含四个字段(订单ID、用户ID、订单金额、时间戳)
* 5)创建一个自定义数据源
* 6)添加水印,允许延迟2秒
* 7)导入 import org.apache.flink.table.api.scala._ 隐式参数
* 8)使用 registerDataStream 注册表,并分别指定字段,还要指定rowtime字段
* 9)编写SQL语句统计用户订单总数、最大金额、最小金额
* 分组时要使用 tumble(时间列, interval '窗口时间' second) 来创建窗口
* 10)使用 tableEnv.sqlQuery 执行sql语句
* 11)将SQL的执行结果转换成DataStream再打印出来
* 12)启动流处理程序
* 需求:使用Flink SQL来统计5秒内 用户的 订单总数、订单的最大金额、订单的最小金额。
*/
//1)获取流处理运行环境
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2)获取Table运行环境
val tblEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)
//3)设置时间语义为 EventTime
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//5)创建一个自定义数据源
val source: DataStream[Order3] = env.addSource(new OrderSourceFunc)
//6)添加水印,允许延迟2秒
val waterData: DataStream[Order3] = source.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[Order3](Time.seconds(2)) {
override def extractTimestamp(element: Order3): Long = {
val time: Long = element.orderTime
time
}
})
//7)导入 import org.apache.flink.table.api.scala._ 隐式参数
import org.apache.flink.table.api.scala._
//8)使用 registerDataStream 注册表,并分别指定字段,还要指定rowtime字段
tblEnv.registerDataStream("order3",waterData,'orderId,'userId,'orderPrice,'orderTime.rowtime)
// 9)编写SQL语句统计用户订单总数、最大金额、最小金额
// * 分组时要使用 tumble(时间列, interval '窗口时间' second) 来创建窗口
val sql = "select userId,count(orderId),max(orderPrice),min(orderPrice) from order3 group by userId,tumble(orderTime, interval '5' second) "
//10)使用 tableEnv.sqlQuery 执行sql语句
val table: Table = tblEnv.sqlQuery(sql)
//11)将SQL的执行结果转换成DataStream再打印出来
val values: DataStream[Row] = tblEnv.toAppendStream[Row](table)
values.print()
//12)启动流处理程序
env.execute()
}
}
//4)创建一个订单样例类 Order ,包含四个字段(订单ID、用户ID、订单金额、时间戳)
case class Order3(orderId:String,userId:Int,orderPrice:Double,orderTime:Long)
class OrderSourceFunc extends RichSourceFunction[Order3] {
/**
* * a.使用for循环生成1000个订单
* * b.随机生成订单ID(UUID)
* * c.随机生成用户ID(0-2)
* * d.随机生成订单金额(0-100)
* * e.时间戳为当前系统时间
* * f.每隔100毫秒生成一个订单
*
* @param ctx
*/
  //flag flipped by cancel() so the generator can stop early
  @volatile private var isRunning = true
  override def run(ctx: SourceFunction.SourceContext[Order3]): Unit = {
    for (line <- 0 until 1000 if isRunning) {
      ctx.collect(Order3(UUID.randomUUID().toString, Random.nextInt(2), Random.nextInt(100), System.currentTimeMillis()))
      Thread.sleep(100)
    }
  }
  //stop emitting instead of throwing NotImplementedError (the original ???)
  override def cancel(): Unit = {
    isRunning = false
  }
}
通过对用户的消费行为分析,用大数据技术,进行后台计算,对各种消费指标进行统计分析,目的是为了提高市场占有率,提高营业额。
[image: photo/1571562941575.png]
流式计算框架: flink
消息队列: kafka(100万/s)
数据库: hbase(NoSql),单表亿级别
以上三个框架的特点:大吞吐量,高并发,高可用
canal: 实时读取mysql binlog日志
springBoot: 快速java开发框架,纯注解开发,不需要xml配置文件
开发语言: java、scala
1):掌握 HBASE 的搭建和基本运维操作
2):掌握 flink 基本语法
3):掌握 kafka 的搭建和基本运维操作
4):掌握 canal 的使用
5):能够独立开发出上报服务
6):能够使用flink:处理实时热点数据及数据落地 Hbase
7):能够使用flink:处理频道的 PV、UV 及数据落地 Hbase
8):能够使用flink:处理新鲜度
9):能够使用flink:处理频道地域分布
10):能够使用flink:处理运营商平台数据
11):能够使用flink:处理浏览器类型
12):能够使用代码对接 canal,并将数据同步到 kafka
13):能够使用flink 同步数据到 hbase
[image: photo/1571563693185.png]
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>pyg-1020</artifactId>
        <groupId>cn.itcast</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>report</artifactId>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.8</java.version>
        <spring-cloud.version>Greenwich.M3</spring-cloud.version>
    </properties>
    <repositories>
        <repository>
            <id>alimaven</id>
            <name>alimaven</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-tomcat</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tomcat</groupId>
            <artifactId>tomcat-catalina</artifactId>
            <version>8.5.35</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.kafka</groupId>
            <artifactId>spring-kafka</artifactId>
            <version>1.0.6.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-autoconfigure</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
package cn.itcast;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
/**
* @Date 2019/10/20
* 启动类
* SpringBoot内置tomcat
*/
@SpringBootApplication
public class ReportApplication {
//通过main方法启动服务
public static void main(String[] args) {
SpringApplication.run(ReportApplication.class,args);
}
}
package cn.itcast.controller;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
/**
* @Date 2019/10/20
*/
@Controller
@RequestMapping("report")
public class ReportTestController {
@RequestMapping("test")
public void acceptData(String str) {
System.out.println("<<<接收的数据<<<:"+str);
}
}
Spring Boot can be used on its own as a web framework to develop Java projects.
Spring Cloud must be built on top of Spring Boot to develop Java projects.
package cn.itcast.task
import cn.itcast.`trait`.ProcessData
import cn.itcast.bean.Message
import cn.itcast.map.ChannelPvuvFlatMap
import cn.itcast.reduce.ChannelPvuvReduce
import cn.itcast.sink.ChannelPvuvSink
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
/**
* @Date 2019/10/22
*/
object ChannelPvuvTask extends ProcessData {
override def process(waterData: DataStream[Message]): Unit = {
/**
* 开发步骤一:
* 开发userState
*/
/** 开发步骤二:
* (1)数据转换
* (2)数据分组
* (3)划分时间窗口
* (4)数据聚合
* (5)数据落地
*/
//(1)数据转换
waterData.flatMap(new ChannelPvuvFlatMap)
//(2)数据分组
.keyBy(line => line.getChannelId + line.getTime)
//(3)划分时间窗口
.timeWindow(Time.seconds(3))
//(4)数据聚合
.reduce(new ChannelPvuvReduce)
//5)数据落地
.addSink(new ChannelPvuvSink)
}
}
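ChannelPvuvTask (and the freshness and region tasks below) extend a ProcessData trait that is not reproduced in these notes. Judging from how it is called, it is simply the common contract every task implements; a minimal sketch under that assumption:
package cn.itcast.`trait`
import cn.itcast.bean.Message
import org.apache.flink.streaming.api.scala.DataStream
//each task receives the watermarked Message stream and decides how to transform and sink it
trait ProcessData {
  def process(waterData: DataStream[Message]): Unit
}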
package cn.itcast.map
import cn.itcast.bean.{ChannelPvuv, Message, UserBrowse, UserState}
import cn.itcast.util.TimeUtil
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector
/**
* @Date 2019/10/23
*/
class ChannelPvuvFlatMap extends RichFlatMapFunction[Message,ChannelPvuv]{
//格式化模板
val hour = "yyyyMMddHH"
val day ="yyyyMMdd"
val month ="yyyyMM"
override def flatMap(in: Message, out: Collector[ChannelPvuv]): Unit = {
val userBrowse: UserBrowse = in.userBrowse
val timestamp: Long = userBrowse.timestamp
val channelID: Long = userBrowse.channelID
val userID: Long = userBrowse.userID
//查询用户访问状态
val userState: UserState = UserState.getUserState(userID,timestamp)
val isNew: Boolean = userState.isNew
val firstHour: Boolean = userState.isFirstHour
val firstDay: Boolean = userState.isFirstDay
val firstMonth: Boolean = userState.isFirstMonth
//封装数据到ChannelPvuv
val channelPvuv = new ChannelPvuv
channelPvuv.setChannelId(channelID)
channelPvuv.setPv(1)
//日期格式化
val hourTime: String = TimeUtil.parseTime(timestamp,hour)
val dayTime: String = TimeUtil.parseTime(timestamp,day)
val monthTime: String = TimeUtil.parseTime(timestamp,month)
//判断用户访问状态
if(isNew == true){
channelPvuv.setUv(1L)
}
//小时
if(firstHour == true){
channelPvuv.setUv(1L)
channelPvuv.setTime(hourTime)
}else{
channelPvuv.setUv(0L)
channelPvuv.setTime(hourTime)
}
out.collect(channelPvuv)
//天
if(firstDay == true){
channelPvuv.setUv(1L)
channelPvuv.setTime(dayTime)
}else{
channelPvuv.setUv(0L)
channelPvuv.setTime(dayTime)
}
out.collect(channelPvuv)
//月
if(firstMonth == true){
channelPvuv.setUv(1L)
channelPvuv.setTime(monthTime)
}else{
channelPvuv.setUv(0L)
channelPvuv.setTime(monthTime)
}
out.collect(channelPvuv)
}
}
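TimeUtil.parseTime is used above to format the event timestamp with the hour/day/month patterns, but its implementation is not shown in these notes. A minimal sketch, assuming it simply formats a millisecond timestamp with the given pattern (FastDateFormat from commons-lang3 is thread-safe, unlike SimpleDateFormat):
package cn.itcast.util
import java.util.Date
import org.apache.commons.lang3.time.FastDateFormat
object TimeUtil {
  //format a millisecond timestamp with a pattern such as "yyyyMMddHH"
  def parseTime(timestamp: Long, pattern: String): String = {
    FastDateFormat.getInstance(pattern).format(new Date(timestamp))
  }
}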
package cn.itcast.reduce
import cn.itcast.bean.ChannelPvuv
import org.apache.flink.api.common.functions.ReduceFunction
/**
* @Date 2019/10/23
*/
class ChannelPvuvReduce extends ReduceFunction[ChannelPvuv] {
override def reduce(value1: ChannelPvuv, value2: ChannelPvuv): ChannelPvuv = {
//增量聚合,最终聚合成一条数据
val pvuv = new ChannelPvuv
pvuv.setTime(value1.getTime)
pvuv.setChannelId(value1.getChannelId)
pvuv.setPv(value1.getPv + value2.getPv)
pvuv.setUv(value1.getUv + value2.getUv)
pvuv
}
}
package cn.itcast.sink
import cn.itcast.bean.ChannelPvuv
import cn.itcast.util.HbaseUtil
import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
/**
* @Date 2019/10/23
*/
class ChannelPvuvSink extends RichSinkFunction[ChannelPvuv] {
/**
* 落地数据到hbase
*/
override def invoke(value: ChannelPvuv): Unit = {
/**
* 表名: channel
* rowkey: channelId+ time(格式化)
* 字段: channelId, time,pv,uv
* 列名:channelId,time,pv,uv
* 列族: info
*/
val tableName = "channel"
val rowkey = value.getChannelId + value.getTime
val pvCol = "pv"
val uvCol = "uv"
val family = "info"
//需要查询hbase,如果有pv/uv数据,需要累加,再插入数据库,如果没有数据,直接插入
val pvData: String = HbaseUtil.queryByRowkey(tableName, family, pvCol, rowkey)
val uvData: String = HbaseUtil.queryByRowkey(tableName, family, uvCol, rowkey)
var pv = value.getPv
var uv = value.getUv
//数据非空判断
if (StringUtils.isNotBlank(pvData)) {
pv = pv + pvData.toLong
}
if (StringUtils.isNotBlank(uvData)) {
uv = uv + uvData.toLong
}
//需要封装map多列数据
var map = Map[String,Any]()
map+=("channelId"->value.getChannelId)
map+=("time"->value.getTime)
map+=(pvCol -> pv)
map+=(uvCol -> uv)
//将数据插入hbase
HbaseUtil.putMapDataByRowkey(tableName,family,map,rowkey)
}
}
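The sinks in this project all go through an HbaseUtil helper that is not shown in these notes. The sketch below is an assumption inferred from the call sites: queryByRowkey(table, family, column, rowkey) returns the cell value as a String (null when absent), and putMapDataByRowkey(table, family, map, rowkey) writes every map entry as a column of one row; the ZooKeeper quorum is also an assumption based on the cluster used above:
package cn.itcast.util
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes
object HbaseUtil {
  //shared connection, created once per JVM
  private val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181")
  private val connection = ConnectionFactory.createConnection(conf)
  //read one column of one row; returns null when the row or column does not exist
  def queryByRowkey(tableName: String, family: String, column: String, rowkey: String): String = {
    val table = connection.getTable(TableName.valueOf(tableName))
    try {
      val get = new Get(Bytes.toBytes(rowkey))
      get.addColumn(Bytes.toBytes(family), Bytes.toBytes(column))
      val value = table.get(get).getValue(Bytes.toBytes(family), Bytes.toBytes(column))
      if (value == null) null else Bytes.toString(value)
    } finally {
      table.close()
    }
  }
  //write every (column -> value) entry of the map into a single row
  def putMapDataByRowkey(tableName: String, family: String, map: Map[String, Any], rowkey: String): Unit = {
    val table = connection.getTable(TableName.valueOf(tableName))
    try {
      val put = new Put(Bytes.toBytes(rowkey))
      for ((col, value) <- map) {
        put.addColumn(Bytes.toBytes(family), Bytes.toBytes(col), Bytes.toBytes(String.valueOf(value)))
      }
      table.put(put)
    } finally {
      table.close()
    }
  }
}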
package cn.itcast.task
import cn.itcast.`trait`.ProcessData
import cn.itcast.bean.Message
import cn.itcast.map.ChannelFreshnessFlatMap
import cn.itcast.reduce.ChannelFreshnessReduce
import cn.itcast.sink.ChannelFreshnessSink
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
/**
* @Date 2019/10/23
*/
object ChannelFreshnessTask extends ProcessData{
override def process(waterData: DataStream[Message]): Unit = {
/**
* 1.数据转换
* 2.数据分组
* 3.划分时间窗口
* 4.数据聚合
* 5.数据落地
*/
//1.数据转换
waterData.flatMap(new ChannelFreshnessFlatMap)
//2.数据分组
.keyBy(line=>line.getChannelId+line.getTime)
//3.划分时间窗口
.timeWindow(Time.seconds(3))
//4.数据聚合
.reduce(new ChannelFreshnessReduce)
//5.数据落地
.addSink(new ChannelFreshnessSink)
}
}
package cn.itcast.map
import cn.itcast.bean.{ChannelFreshness, Message, UserBrowse, UserState}
import cn.itcast.util.TimeUtil
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector
/**
* @Date 2019/10/23
*/
class ChannelFreshnessFlatMap extends RichFlatMapFunction[Message, ChannelFreshness] {
//格式化模板
val hour = "yyyyMMddHH"
val day = "yyyyMMdd"
val month = "yyyyMM"
override def flatMap(in: Message, out: Collector[ChannelFreshness]): Unit = {
val userBrowse: UserBrowse = in.userBrowse
val timestamp: Long = userBrowse.timestamp
val channelID: Long = userBrowse.channelID
val userID: Long = userBrowse.userID
//获取用户访问状态
val userState: UserState = UserState.getUserState(userID, timestamp)
val isNew: Boolean = userState.isNew
val firstHour: Boolean = userState.isFirstHour
val firstDay: Boolean = userState.isFirstDay
val firstMonth: Boolean = userState.isFirstMonth
//日期格式化
val hourTime: String = TimeUtil.parseTime(timestamp, hour)
val dayTime: String = TimeUtil.parseTime(timestamp, day)
val monthTime: String = TimeUtil.parseTime(timestamp, month)
val freshness = new ChannelFreshness
freshness.setChannelId(channelID)
//根据用户访问状态设置结果值
isNew match {
case true =>
freshness.setNewCount(1L)
case false =>
freshness.setOldCount(1L)
}
//小时
firstHour match {
case true =>
freshness.setNewCount(1L)
freshness.setTime(hourTime)
case false =>
freshness.setOldCount(1L)
freshness.setTime(hourTime)
}
out.collect(freshness)
//天
firstDay match {
case true =>
freshness.setNewCount(1L)
freshness.setTime(dayTime)
case false =>
freshness.setOldCount(1L)
freshness.setTime(dayTime)
}
out.collect(freshness)
//月
firstMonth match {
case true =>
freshness.setNewCount(1L)
freshness.setTime(monthTime)
case false =>
freshness.setOldCount(1L)
freshness.setTime(monthTime)
}
out.collect(freshness)
}
}
package cn.itcast.reduce
import cn.itcast.bean.ChannelFreshness
import org.apache.flink.api.common.functions.ReduceFunction
/**
* @Date 2019/10/23
*/
class ChannelFreshnessReduce extends ReduceFunction[ChannelFreshness]{
override def reduce(value1: ChannelFreshness, value2: ChannelFreshness): ChannelFreshness = {
val freshness = new ChannelFreshness
freshness.setChannelId(value1.getChannelId)
freshness.setTime(value1.getTime)
freshness.setNewCount(value1.getNewCount + value2.getNewCount)
freshness.setOldCount(value1.getOldCount+value2.getOldCount)
freshness
}
}
package cn.itcast.sink
import cn.itcast.bean.ChannelFreshness
import cn.itcast.util.HbaseUtil
import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
/**
* @Date 2019/10/23
*/
class ChannelFreshnessSink extends RichSinkFunction[ChannelFreshness]{
override def invoke(value: ChannelFreshness): Unit = {
/**
* 表名: channel
* rowkey: channelId+ time(格式化)
* 字段: channelId, time,newCount,oldCount
* 列名:channelId,time,newCount,oldCount
* 列族: info
*/
val tableName = "channel"
val rowkey = value.getChannelId + value.getTime
val newCountCol = "newCount"
val oldCountCol = "oldCount"
val family ="info"
var newCount = value.getNewCount
var oldCount = value.getOldCount
//需要先查询hbase,如果数据库有数据,需要进行累加,如果没有数据,直接插入数据
val newCountData: String = HbaseUtil.queryByRowkey(tableName,family,newCountCol,rowkey)
val oldCountData: String = HbaseUtil.queryByRowkey(tableName,family,oldCountCol,rowkey)
//数据非空判断,并累加
if(StringUtils.isNotBlank(newCountData)){
newCount = newCount + newCountData.toLong
}
if(StringUtils.isNotBlank(oldCountData)){
oldCount = oldCount + oldCountData.toLong
}
//封装map数据
var map = Map[String ,Any]()
map+=("channelId"->value.getChannelId)
map+=("time"->value.getTime)
map+=(newCountCol -> newCount)
map+=(oldCountCol -> oldCount)
//插入数据
HbaseUtil.putMapDataByRowkey(tableName,family,map,rowkey)
}
}
package cn.itcast.task
import cn.itcast.`trait`.ProcessData
import cn.itcast.bean.Message
import cn.itcast.map.ChannelRegionFlatMap
import cn.itcast.reduce.ChannelRegionReduce
import cn.itcast.sink.ChannelRegionSink
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
/**
* @Date 2019/10/23
*/
object ChannelRegionTask extends ProcessData {
override def process(waterData: DataStream[Message]): Unit = {
/** 开发步骤:
* (1)数据转换
* (2)数据分组
* (3)划分时间窗口
* (4)数据聚合
* (5)数据落地
*/
//(1)数据转换
waterData.flatMap(new ChannelRegionFlatMap)
//(2)数据分组
.keyBy(line => line.getChannelId + line.getTime)
//(3)划分时间窗口
.timeWindow(Time.seconds(3))
//(4)数据聚合
.reduce(new ChannelRegionReduce)
//(5)数据落地
.addSink(new ChannelRegionSink)
}
}
package cn.itcast.map
import cn.itcast.bean._
import cn.itcast.util.TimeUtil
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector
/**
* @Date 2019/10/23
*/
class ChannelRegionFlatMap extends RichFlatMapFunction[Message,ChannelRegion]{
//格式化模板
val hour = "yyyyMMddHH"
val day ="yyyyMMdd"
val month ="yyyyMM"
override def flatMap(in: Message, out: Collector[ChannelRegion]): Unit = {
val userBrowse: UserBrowse = in.userBrowse
val timestamp: Long = userBrowse.timestamp
val userID: Long = userBrowse.userID
val channelID: Long = userBrowse.channelID
//查询用户访问状态
val userState: UserState = UserState.getUserState(userID,timestamp)
val isNew: Boolean = userState.isNew
val firstHour: Boolean = userState.isFirstHour
val firstDay: Boolean = userState.isFirstDay
val firstMonth: Boolean = userState.isFirstMonth
//日期格式化
val hourTime: String = TimeUtil.parseTime(timestamp,hour)
val dayTime: String = TimeUtil.parseTime(timestamp,day)
val monthTime: String = TimeUtil.parseTime(timestamp,month)
//封装一部分数据
val channelRegion = new ChannelRegion
channelRegion.setChannelId(channelID)
channelRegion.setCity(userBrowse.city)
channelRegion.setCountry(userBrowse.country)
channelRegion.setProvince(userBrowse.province)
channelRegion.setPv(1L)
//需要根据用户访问状态设置结果值
isNew match {
case true=>
channelRegion.setUv(1L)
channelRegion.setNewCount(1L)
case false =>
channelRegion.setUv(0L)
channelRegion.setOldCount(1L)
}
//小时
firstHour match {
case true =>
channelRegion.setUv(1L)
channelRegion.setNewCount(1L)
case false =>
channelRegion.setUv(0L)
channelRegion.setOldCount(1L)
}
channelRegion.setTime(hourTime)
out.collect(channelRegion)
//天
firstDay match {
case true =>
channelRegion.setUv(1L)
channelRegion.setNewCount(1L)
case false =>
channelRegion.setUv(0L)
channelRegion.setOldCount(1L)
}
channelRegion.setTime(dayTime)
out.collect(channelRegion)
//月
firstMonth match {
case true =>
channelRegion.setUv(1L)
channelRegion.setNewCount(1L)
case false =>
channelRegion.setUv(0L)
channelRegion.setOldCount(1L)
}
channelRegion.setTime(monthTime)
out.collect(channelRegion)
}
}
package cn.itcast.reduce
import cn.itcast.bean.ChannelRegion
import org.apache.flink.api.common.functions.ReduceFunction
/**
* @Date 2019/10/23
*/
class ChannelRegionReduce extends ReduceFunction[ChannelRegion] {
override def reduce(value1: ChannelRegion, value2: ChannelRegion): ChannelRegion = {
val region = new ChannelRegion
region.setChannelId(value1.getChannelId)
region.setCity(value1.getCity)
region.setCountry(value1.getCountry)
region.setNewCount(value1.getNewCount + value2.getNewCount)
region.setOldCount(value1.getOldCount + value2.getOldCount)
region.setProvince(value1.getProvince)
region.setPv(value1.getPv + value2.getPv)
region.setUv(value1.getUv + value2.getUv)
region.setTime(value1.getTime)
region
}
}
package cn.itcast.sink
import cn.itcast.bean.ChannelRegion
import cn.itcast.util.HbaseUtil
import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
/**
* @Date 2019/10/23
*/
class ChannelRegionSink extends RichSinkFunction[ChannelRegion] {
override def invoke(value: ChannelRegion): Unit = {
/**
* 设计表:
* 表名: region
* rowkey: channelId+ time(格式化)
* 字段: channelId, time,newCount,oldCount,pv,uv,country,province,city
* 列名:channelId,time,newCount,oldCount,pv,uv,country,province,city
* 列族: info
*/
val tableName = "region"
val family = "info"
val rowkey = value.getChannelId + value.getTime
val pvCol = "pv"
val uvCol = "uv"
val newCountCol = "newCount"
val oldCountCol = "oldCount"
//需要先查询hbase数据库,并进行数据累加
val pvData: String = HbaseUtil.queryByRowkey(tableName, family, pvCol, rowkey)
val uvData: String = HbaseUtil.queryByRowkey(tableName, family, uvCol, rowkey)
val newCountData: String = HbaseUtil.queryByRowkey(tableName, family, newCountCol, rowkey)
val oldCountData: String = HbaseUtil.queryByRowkey(tableName, family, oldCountCol, rowkey)
//先从value取待累加值
var pv = value.getPv
var uv = value.getUv
var newCount = value.getNewCount
var oldCount = value.getOldCount
if (StringUtils.isNotBlank(pvData)) {
pv = pv + pvData.toLong
}
if (StringUtils.isNotBlank(uvData)) {
uv = uv + uvData.toLong
}
if (StringUtils.isNotBlank(newCountData)) {
newCount = newCount + newCountData.toLong
}
if (StringUtils.isNotBlank(oldCountData)) {
oldCount = oldCount + oldCountData.toLong
}
//封装插入数据
//channelId, time,newCount,oldCount,pv,uv,country,province,city
var map = Map[String, Any]()
map += ("channelId" -> value.getChannelId)
map += ("time" -> value.getTime)
map += (newCountCol -> newCount)
map += (oldCountCol -> oldCount)
map += (pvCol -> pv)
map += (uvCol -> uv)
map += ("country" -> value.getCountry)
map += ("province" -> value.getProvince)
map += ("city" -> value.getCity)
HbaseUtil.putMapDataByRowkey(tableName,family,map,rowkey)
}
}