1. In Flink, an operator transforms one or more DataStreams into a new DataStream; multiple transformations can be combined into a complex dataflow topology.
2. Flink has several different DataStream types, and conversions between them are performed by applying the various operators.
3. To develop with Scala in Flink, the following imports are required:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
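To give a feel for how several operators chain into one topology, here is a minimal sketch using the Scala API (the object name ChainedOperators and the sample elements are made up for illustration):
package com.kn.operator
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
object ChainedOperators {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements("flink spark", "flink hadoop")
      .flatMap(_.split(" "))   // one element becomes many
      .map(word => (word, 1L)) // map each word to a (word, count) pair
      .keyBy(0)                // partition by the word
      .sum(1)                  // running count per word
      .print()
    env.execute("flink chained operators")
  }
}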
map can be understood as a mapping: each element is transformed in some way and mapped to a new element.
package com.kn.operator
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.api.java.tuple.Tuple1
object MapOperator {
  def main(args: Array[String]): Unit = {
    // get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // prepare the data, of type DataStreamSource
    val dataStreamSource = env.fromElements(Tuple1.of("flink")
      , Tuple1.of("spark")
      , Tuple1.of("hadoop"))
      .map(new MapFunction[Tuple1[String], String] { // the map operation: transform/map each element
        override def map(value: Tuple1[String]): String = {
          "i like " + value.f0
        }
      })
      .print()
    env.execute("flink map operator")
  }
}
Output:
2> i like flink
4> i like hadoop
3> i like spark
package com.kn.operator
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
object MapOperator {
  def main(args: Array[String]): Unit = {
    // get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // prepare the source data
    val dataStreamSource = env.fromElements(Tuple1.apply("flink")
      , Tuple1.apply("spark")
      , Tuple1.apply("hadoop"))
      .map("i like " + _._1)
      .print()
    env.execute("flink map operator")
  }
}
Output:
3> i like hadoop
2> i like spark
1> i like flink
flatMap can be understood as flattening: each element can be turned into zero, one, or many elements.
package com.kn.operator
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.api.java.tuple.Tuple1
import org.apache.flink.util.Collector
object FlatMapOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(Tuple1.of("flink jobmanger taskmanager")
      , Tuple1.of("spark streaming")
      , Tuple1.of("hadoop hdfs"))
      // note: for FlatMapFunction the first type parameter is the input type, the second is the output type
      .flatMap(new FlatMapFunction[Tuple1[String], Tuple1[String]]() {
        override def flatMap(value: Tuple1[String], out: Collector[Tuple1[String]]): Unit = {
          for (s: String <- value.f0.split(" ")) {
            out.collect(Tuple1.of(s))
          }
        }
      })
      .print()
    env.execute("flink flatmap operator")
  }
}
Output:
4> (spark)
3> (flink)
1> (hadoop)
3> (jobmanger)
4> (streaming)
3> (taskmanager)
1> (hdfs)
package com.kn.operator
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
object FlatMapOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(Tuple1.apply("flink jobmanger taskmanager")
      , Tuple1.apply("spark streaming")
      , Tuple1.apply("hadoop hdfs"))
      // note: the flatMap function takes the input element and a Collector of the output type
      .flatMap((t1, out: Collector[Tuple2[String, Long]]) => {
        t1._1.split(" ").foreach(s => out.collect(Tuple2.apply(s, 1L)))
      })
      .print()
    env.execute("flink flatmap operator")
  }
}
Output:
2> (flink,1)
4> (hadoop,1)
3> (spark,1)
4> (hdfs,1)
2> (jobmanger,1)
3> (streaming,1)
2> (taskmanager,1)
filter is used to filter records.
package com.kn.operator
import org.apache.flink.api.common.functions.RichFilterFunction
import org.apache.flink.api.java.tuple.Tuple1
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
object FilterOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(Tuple1.of("flink")
      , Tuple1.of("spark")
      , Tuple1.of("hadoop"))
      .filter(new RichFilterFunction[Tuple1[String]] {
        override def filter(value: Tuple1[String]): Boolean = {
          !"flink".equals(value.f0) // drop the "flink" records
        }
      })
      .print()
    env.execute("flink filter operator")
  }
}
Output:
2> (spark)
3> (hadoop)
package com.kn.operator
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
object FilterOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(Tuple1.apply("flink")
      , Tuple1.apply("spark")
      , Tuple1.apply("hadoop"))
      .filter(!"flink".equals(_._1)) // drop the "flink" records
      .print()
    env.execute("flink filter operator")
  }
}
Output:
3> (spark)
4> (hadoop)
keyBy logically partitions the stream by the specified key; the partitioning is based on the hash of the key.
Note: keyed state can only be used after keyBy().
package com.kn.operator
import org.apache.flink.api.common.functions.{RichFlatMapFunction, RichMapFunction}
import org.apache.flink.api.java.tuple.{Tuple1, Tuple2}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.util.Collector
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.configuration.Configuration
/*
 * 1. split each string on spaces
 * 2. count the number of elements per key
 * */
object KeyByOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(Tuple1.of("flink jobmanger taskmanager")
      , Tuple1.of("spark hadoop")
      , Tuple1.of("hadoop hdfs"))
      .flatMap(new RichFlatMapFunction[Tuple1[String], Tuple2[String, Long]] {
        override def flatMap(value: Tuple1[String], out: Collector[Tuple2[String, Long]]): Unit = {
          for (s: String <- value.f0.split(" ")) {
            out.collect(Tuple2.of(s, 1L))
          }
        }
      })
      .keyBy(0)
      .map(new RichMapFunction[Tuple2[String, Long], Tuple2[String, Long]] {
        private var state: ValueState[Tuple2[String, Long]] = null
        override def open(parameters: Configuration): Unit = {
          super.open(parameters)
          val descriptor = new ValueStateDescriptor[Tuple2[String, Long]]("keyby-wordCount",
            TypeInformation.of(new TypeHint[Tuple2[String, Long]]() {}))
          state = getRuntimeContext.getState(descriptor)
        }
        override def map(value: Tuple2[String, Long]): Tuple2[String, Long] = {
          val oldState = state.value()
          if (oldState != null && value.f0.equals(oldState.f0)) {
            state.update(Tuple2.of(value.f0, oldState.f1 + 1L))
            Tuple2.of(value.f0, oldState.f1 + 1L)
          } else {
            state.update(Tuple2.of(value.f0, 1L))
            Tuple2.of(value.f0, 1L)
          }
        }
      })
      .print()
    env.execute("flink keyby operator")
  }
}
Output:
1> (jobmanger,1)
1> (taskmanager,1)
3> (hdfs,1)
1> (spark,1)
4> (hadoop,1)
4> (flink,1)
4> (hadoop,2)
package com.kn.operator
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
/*
 * 1. split each string on spaces
 * 2. count the number of elements per key
 * */
object KeyByOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.fromElements(Tuple1.apply("flink jobmanger taskmanager")
      , Tuple1.apply("spark hadoop")
      , Tuple1.apply("hadoop hdfs"))
      .flatMap((t1, out: Collector[Tuple2[String, Long]]) => {
        t1._1.split(" ").foreach(s => out.collect(Tuple2.apply(s, 1L)))
      })
      .keyBy(0)
      .reduce((t1, t2) => Tuple2.apply(t1._1, t1._2 + t2._2))
      .print()
    env.execute("flink keyby operator")
  }
}
reduce is an aggregation operation: it turns a KeyedStream back into a DataStream; in essence it accumulates values per key.
package com.kn.operator
import org.apache.flink.api.common.functions.{RichFlatMapFunction, RichMapFunction, RichReduceFunction}
import org.apache.flink.api.java.tuple.{Tuple1, Tuple2}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.util.Collector
object ReduceOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(Tuple1.of("flink hadoop taskmanager")
      , Tuple1.of("spark hadoop")
      , Tuple1.of("hadoop hdfs"))
      .flatMap(new RichFlatMapFunction[Tuple1[String], Tuple2[String, Long]] {
        override def flatMap(value: Tuple1[String], out: Collector[Tuple2[String, Long]]): Unit = {
          for (s: String <- value.f0.split(" ")) {
            out.collect(Tuple2.of(s, 1L))
          }
        }
      })
      .keyBy(0)
      .reduce(new RichReduceFunction[Tuple2[String, Long]] {
        override def reduce(value1: Tuple2[String, Long], value2: Tuple2[String, Long]): Tuple2[String, Long] = {
          Tuple2.of(value1.f0, value1.f1 + value2.f1)
        }
      })
      .print()
    env.execute("flink reduce operator")
  }
}
Output:
4> (hadoop,1)
3> (hdfs,1)
4> (flink,1)
4> (hadoop,2)
4> (hadoop,3)
1> (spark,1)
1> (taskmanager,1)
reduce is an aggregation; the type of the reduced result is the same as the input type, so in Scala the return type of the reduce function does not need to be specified.
package com.kn.operator
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
object ReduceOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(Tuple1.apply("flink hadoop taskmanager")
      , Tuple1.apply("spark hadoop")
      , Tuple1.apply("hadoop hdfs"))
      .flatMap((t1, out: Collector[Tuple2[String, Long]]) => {
        t1._1.split(" ").foreach(s => out.collect(Tuple2.apply(s, 1L)))
      })
      .keyBy(0)
      .reduce((t1, t2) => Tuple2.apply(t1._1, t1._2 + t2._2))
      .print()
    env.execute("flink reduce operator")
  }
}
Output:
1> (spark,1)
3> (hdfs,1)
1> (taskmanager,1)
4> (hadoop,1)
4> (hadoop,2)
4> (flink,1)
4> (hadoop,3)
union merges multiple streams into a single stream so that the merged data can be processed together; it is a horizontal concatenation of streams.
All input streams must have the same type.
package com.kn.operator
import org.apache.flink.api.java.tuple.Tuple1
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
object UnionOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val df1 = env.fromElements(Tuple1.of("flink")
      , Tuple1.of("spark")
      , Tuple1.of("hadoop"))
    val df2 = env.fromElements(Tuple1.of("oracle")
      , Tuple1.of("mysql")
      , Tuple1.of("sqlserver"))
    // merge multiple streams of the same type into one so they can be processed together
    df1.union(df2).print()
    env.execute("flink union operator")
  }
}
package com.kn.operator
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
object UnionOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val df1 = env.fromElements(Tuple1.apply("flink")
      , Tuple1.apply("spark")
      , Tuple1.apply("hadoop"))
    val df2 = env.fromElements(Tuple1.apply("oracle")
      , Tuple1.apply("mysql")
      , Tuple1.apply("sqlserver"))
    // merge multiple streams of the same type into one so they can be processed together
    df1.union(df2).filter(!_._1.equals("hadoop")).print()
    env.execute("flink union operator")
  }
}
Output:
3> (sqlserver)
1> (oracle)
2> (mysql)
3> (spark)
2> (flink)
join correlates two streams on a specified key.
package com.kn.operator
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector
object JoinOperator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val df1 = env.fromElements(
      Tuple2.apply("flink", 1L)
      , Tuple2.apply("spark", 2L)
      , Tuple2.apply("hadoop", 3L))
    val df2 = env.fromElements(Tuple2.apply("flink", 1L)
      , Tuple2.apply("mysql", 1L)
      , Tuple2.apply("spark", 1L))
    df1.join(df2)
      .where(_._1)
      .equalTo(_._1)
      .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
      .trigger(CountTrigger.of(1)) // every arriving element immediately triggers the window computation
      .apply((t1, t2, out: Collector[Tuple2[String, Long]]) => {
        out.collect(Tuple2.apply(t1._1, t1._2 + t2._2))
      })
      .print()
    env.execute("flink join operator")
  }
}
Output:
4> (flink,2)
1> (spark,3)
(1) Flink's event time and watermark mechanism is used to deal with out-of-order data. To be clear: it only provides a strategy for handling out-of-order data; it cannot completely eliminate data lateness and disorder.
(2) Watermarks can be generated in two ways:
With Periodic Watermarks: watermarks are generated and emitted periodically (the more common approach)
With Punctuated Watermarks: watermarks are generated and emitted when certain events occur (see the sketch below)
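The full example further down uses the periodic variant. For comparison, here is a minimal, hypothetical sketch of a punctuated assigner; the rule of emitting a watermark only when a message whose first field is "flush" arrives is made up purely for illustration:
// Hypothetical sketch: emit a watermark only when a "flush" marker event arrives.
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.watermark.Watermark
class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[Tuple2[String, Long]] {
  // called for every element after extractTimestamp; returning null emits no watermark
  override def checkAndGetNextWatermark(lastElement: Tuple2[String, Long], extractedTimestamp: Long): Watermark = {
    if ("flush".equals(lastElement._1)) new Watermark(extractedTimestamp) else null
  }
  // extract the event timestamp from the second field of the message
  override def extractTimestamp(element: Tuple2[String, Long], previousElementTimestamp: Long): Long = {
    element._2
  }
}
It would be attached with .assignTimestampsAndWatermarks(new PunctuatedAssigner), in the same place where the periodic assigner is used below.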
Here is the code for the periodic approach; for a detailed walkthrough of the execution see https://blog.csdn.net/xu470438000/article/details/83271123
package com.kn.operator
import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector
object WaterMarkOpertor {
  def main(args: Array[String]): Unit = {
    val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // use event time (the default is processing time)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    // add source: each message has two comma-separated fields, the context and the timestamp
    env.socketTextStream("localhost", 9000)
      .map(s => Tuple2.apply(s.split(",")(0), s.split(",")(1).toLong))
      .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[Tuple2[String, Long]] {
        private var currentMaxTimestamp = 0L
        private val maxOutofOrderness = 10000L // maximum allowed out-of-orderness: 10 seconds
        // for every record, extractTimestamp runs first to pull the event time out of the record and update
        // currentMaxTimestamp; the watermark is then determined by calling getCurrentWatermark
        override def getCurrentWatermark: Watermark = new Watermark(currentMaxTimestamp - maxOutofOrderness)
        override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
          currentMaxTimestamp = math.max(element._2, currentMaxTimestamp)
          println("key:" + element._1, "eventtime_format:" + sdf.format(element._2)
            , "currentMaxTimestamp_format:" + sdf.format(currentMaxTimestamp)
            , "currentWaterMark_format:" + sdf.format(getCurrentWatermark.getTimestamp))
          element._2
        }
      })
      .keyBy(0)
      .window(TumblingEventTimeWindows.of(Time.seconds(3))) // assign windows by the message's event time, same effect as calling timeWindow
      .allowedLateness(Time.seconds(2)) // allow data to be 2 seconds late: the window is re-triggered only while (watermark < window_end_time + lateness) and new data arrives for the window
      // .trigger(CountTrigger.of(1))
      .apply((t1, w1, input, out: Collector[Tuple2[String, String]]) => { // invoked only when the window is triggered; t1 is the key, w1 the window
        val key = t1.toString
        input.foreach(t => {
          out.collect(t._1, sdf.format(t._2)) // the elements collected in the window
        })
        println("key:" + key, "window start time:" + sdf.format(w1.getStart), " window end time:" + sdf.format(w1.getEnd))
      })
      .print()
    env.execute("flink window watermark")
  }
}
Code walkthrough:
(1) The time characteristic of the environment defaults to processing time; here we set it to event time, so window contents are computed based on the event time carried in the data.
(2) The source reads directly from a socket; each line contains two comma-separated fields: the context and the data's timestamp.
(3) A map function wraps each message into a Tuple2[String, Long].
(4) Timestamp extraction and the watermark generation strategy are configured with AssignerWithPeriodicWatermarks, i.e. watermarks are generated periodically. Two methods are implemented (both are invoked for every message):
extractTimestamp: extracts the timestamp from the message and tracks the largest timestamp seen so far by taking the max with the message's timestamp; currentMaxTimestamp only ever increases, which guarantees that the watermark never goes backwards.
getCurrentWatermark: defines how the watermark is produced; here it is the largest timestamp seen so far minus the maximum allowed out-of-orderness (10 seconds in this example).
(5) window: TumblingEventTimeWindows assigns fixed-size windows. With a 3-second window size the system automatically assigns the ranges [0,3), [3,6), [6,9) ... [57,60), closed on the left and open on the right. (The window boundaries depend only on the window size, not on when the data arrives; for example, an event with timestamp 10:00:04.500 falls into the window [10:00:03, 10:00:06).)
A window computation is triggered when: watermark >= window_end_time && the window contains data.
The watermark mechanism trades timeliness for correctness: it delays the computation to make the result as accurate as possible. If results are needed immediately, a trigger can be added so that every consumed element fires a window computation, until the data exceeds the allowed lateness.
(6) For data that is so late that it misses its window computation, Flink's default behavior is to drop it (the window is not triggered again). Flink also offers two other mechanisms for handling late data (see the side-output sketch below):
#1. allowedLateness: the lateness is measured against the watermark. If we allow 2 seconds of lateness, then as long as watermark < window_end_time + lateness and new data arrives for the window, the window computation is triggered again (each late element can re-trigger the window); data that arrives even later than that is dropped.
#2. sideOutputLateData: collects the late data so that it can be gathered and stored (possibly externally) for later analysis and troubleshooting.
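A minimal sketch of the side-output mechanism (#2) might look like the following; the tag name "late-data", the object name, and the use of assignAscendingTimestamps to keep the sketch short are illustrative choices, not the only way to wire this up:
package com.kn.operator
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
object SideOutputLateDataSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    // tag that identifies the side output holding the late records
    val lateTag = OutputTag[Tuple2[String, Long]]("late-data")
    val windowed = env.socketTextStream("localhost", 9000)
      .map(s => Tuple2.apply(s.split(",")(0), s.split(",")(1).toLong))
      .assignAscendingTimestamps(_._2) // simplest assigner, just to keep the sketch short; the periodic assigner above would normally be used
      .keyBy(0)
      .window(TumblingEventTimeWindows.of(Time.seconds(3)))
      .allowedLateness(Time.seconds(2))
      .sideOutputLateData(lateTag) // records later than watermark + allowedLateness go to the side output
      .reduce((t1, t2) => Tuple2.apply(t1._1, t1._2 + t2._2))
    windowed.print() // normal window results
    windowed.getSideOutput(lateTag).print() // late records; these could also be written to external storage
    env.execute("flink side output late data")
  }
}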