Contents
1. The watermark concept
  1.1 What is a watermark
  1.2 A watermark example
2. Code walkthrough
  2.1 Simple version
  2.2 Custom watermark class
1. The watermark concept

1.1 What is a watermark

A watermark ("watermaker" in the original code identifiers, also called a water level line) is a special timestamped marker that Flink inserts into a DataStream. It exists to deal with out-of-order data in real-time computation: the window trigger is delayed so that late records can still be included, and a window fires once the watermark passes the window's end time.

The watermark is computed as:

watermark = maximum event time seen so far - allowed delay (out-of-orderness)
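The formula can be illustrated with a few lines of plain Java (a minimal sketch, not Flink's internal code; the class and field names here are only illustrative):

public class WatermarkFormula {
    private long maxEventTime = Long.MIN_VALUE; // largest event time observed so far
    private final long maxDelayMillis;          // allowed out-of-orderness / delay

    public WatermarkFormula(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    // Called for every incoming record: only the maximum event time is remembered.
    public void onEvent(long eventTimeMillis) {
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
    }

    // watermark = max event time seen so far - allowed delay
    public long currentWatermark() {
        return maxEventTime - maxDelayMillis;
    }
}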
Flink distinguishes three notions of time:

Event time (EventTime): the time at which the event actually occurred.
Ingestion time (IngestionTime): the time at which the event arrived in Flink.
Processing time (ProcessingTime): the time at which the event is actually processed/computed.

Event time is the most important of the three: once an event is produced, its event time never changes, so it best reflects what really happened.
1.2 A watermark example

Consider the time window 10:00:00-10:10:00 and six records arriving one after another:

a 10:01:00
b 10:11:00
c 10:08:00
d 10:14:00
e 10:15:00
f 10:09:00

a, b, c, d, e, f arrive in this order, the maximum allowed delay is 5 minutes, and we want the statistics for the window 10:00:00-10:10:00.
Without a watermark:

When a arrives, it is counted normally.
When b arrives, the window is triggered because the end of the statistics period is considered reached; only a gets counted and nothing that comes later is. For example, by the time c arrives the statistics window has already moved on to [10:10:01-10:20:00], so c no longer fits and is discarded.
With a watermark and a maximum allowed delay of 5 minutes:

a arrives: watermark = max(10:01:00) - 5 min = 09:56:00 < window end 10:10:00 – no trigger
b arrives: watermark = max(10:01:00, 10:11:00) - 5 min = 10:06:00 < window end 10:10:00 – no trigger
c arrives: watermark = max(10:01:00, 10:11:00, 10:08:00) - 5 min = 10:06:00 < window end 10:10:00 – no trigger (the maximum event time seen so far is still 10:11:00)
d arrives: watermark = max(..., 10:14:00) - 5 min = 10:09:00 < window end 10:10:00 – no trigger
e arrives: watermark = max(..., 10:15:00) - 5 min = 10:10:00 = window end 10:10:00 – the trigger condition is met and the window fires, computing over the records that belong to [10:00:00, 10:10:00) and have arrived so far, namely a and c (b, d and e fall into later windows).

The watermark mechanism can, to a certain extent, tolerate out-of-order, late-arriving data, but it does not solve the problem completely: by the time f arrives the window has already been computed, so f is still lost.
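The arithmetic above can be replayed with a small, self-contained Java program (a sketch in plain Java, not Flink; the class name and the use of LocalTime are only for illustration):

import java.time.Duration;
import java.time.LocalTime;

public class WatermarkReplay {
    public static void main(String[] args) {
        Duration maxDelay = Duration.ofMinutes(5);      // maximum allowed delay
        LocalTime windowEnd = LocalTime.of(10, 10, 0);  // end of window [10:00:00, 10:10:00)
        String names = "abcdef";
        LocalTime[] eventTimes = {
                LocalTime.of(10, 1, 0),  LocalTime.of(10, 11, 0), LocalTime.of(10, 8, 0),
                LocalTime.of(10, 14, 0), LocalTime.of(10, 15, 0), LocalTime.of(10, 9, 0)
        };

        LocalTime maxEventTime = LocalTime.MIN;
        boolean fired = false;
        for (int i = 0; i < eventTimes.length; i++) {
            if (eventTimes[i].isAfter(maxEventTime)) {
                maxEventTime = eventTimes[i];           // track the max event time seen so far
            }
            LocalTime watermark = maxEventTime.minus(maxDelay);
            if (!fired && !watermark.isBefore(windowEnd)) {  // watermark >= window end
                fired = true;
                System.out.println(names.charAt(i) + " arrives, watermark=" + watermark
                        + " -> window [10:00:00, 10:10:00) fires");
            } else {
                System.out.println(names.charAt(i) + " arrives, watermark=" + watermark
                        + (fired ? " (window already fired, record is late)" : " -> no trigger"));
            }
        }
    }
}

Running this prints that the window fires when e arrives, and that f only shows up after the window has already been computed.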
To use event time, you normally have to tell Flink which timestamp in the source data to use, and (on older Flink versions) select the event-time characteristic:

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
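A minimal sketch of how the three notions of time from the previous section map onto this setting (the class name is illustrative; note that since Flink 1.12 event time is the default and setStreamTimeCharacteristic is deprecated, so the call is optional on recent versions):

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeCharacteristicSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);        // event time
        // env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);  // ingestion time
        // env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); // processing time
    }
}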
2. Code walkthrough

Required Maven dependencies (the ${flink.version} and ${kafka.version} properties are defined elsewhere in the pom):

<dependencies>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.10</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>${kafka.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.google.code.findbugs</groupId>
        <artifactId>jsr305</artifactId>
    </dependency>
</dependencies>
2.1 Simple version

We construct simulated order data as a real-time stream and then compute, for each user, the total amount spent within 5-second windows.
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class check_waterMaker1 {
    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Emit watermarks periodically every 100 ms
        env.getConfig().setAutoWatermarkInterval(100);
        env.getCheckpointConfig().setCheckpointTimeout(60 * 60 * 1000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        //2.Source
        // Simulate real-time order data (delayed and out of order)
        DataStream<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            private boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    // Simulate delay and disorder: shift the event time back by a random 0-4 seconds
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;
                    ctx.collect(new Order(orderId, userId, money, eventTime));
                    System.out.println("orderId:" + orderId + "\tuserId:" + userId + "\tmoney:" + money + "\teventTime:" + eventTime);
                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        }).setParallelism(1);

        //3.Watermark: built-in bounded-out-of-orderness strategy with a 3-second allowed delay
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner((event, timestamp) -> event.getEventTime())
                ).setParallelism(1);

        // At this point the stream carries watermarks, so event-time windows can be computed.
        //4.Transformation: every 5 seconds (tumbling event-time window), sum each user's order amount
        SingleOutputStreamOperator<Order> money = watermakerDS
                .keyBy(order -> order.userId)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum("money")
                .setParallelism(1);
        money.print();

        //5.execute
        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {
        public String orderId;
        public Integer userId;
        public Integer money;
        public Long eventTime; // event time in epoch milliseconds

        public Order(String orderId, int userId, int money, long eventTime) {
            this.orderId = orderId;
            this.userId = userId;
            this.money = money;
            this.eventTime = eventTime;
        }

        public long getEventTime() {
            return this.eventTime;
        }

        public static Integer getUserId(Order order) {
            return order.userId;
        }
    }
}
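Watermarks alone cannot recover records that arrive after a window has fired (like f in the example earlier). In Flink this is usually handled with allowedLateness plus a side output for very late data. The sketch below shows the idea, assuming it lives next to check_waterMaker1 above and reuses its Order POJO; the window size, lateness value and the "late-orders" tag name are illustrative choices, not part of the original example:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataSketch {

    // Expects a stream that already has timestamps and watermarks assigned,
    // e.g. the watermakerDS stream from check_waterMaker1.
    public static SingleOutputStreamOperator<check_waterMaker1.Order> sumWithLateHandling(
            DataStream<check_waterMaker1.Order> watermarkedOrders,
            OutputTag<check_waterMaker1.Order> lateOrders) {

        return watermarkedOrders
                .keyBy(order -> order.userId)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                // Keep the window state around and re-fire for records up to 10 s late.
                .allowedLateness(Time.seconds(10))
                // Records even later than that are not dropped silently; they go to the side output.
                .sideOutputLateData(lateOrders)
                .sum("money");
    }
}

The caller would create the tag with new OutputTag<check_waterMaker1.Order>("late-orders") {} and read the too-late records via getSideOutput(lateOrders) on the returned operator.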
2.2 Custom watermark class

The simulated order data again relies on the Lombok-annotated Order POJO, so the lombok dependency is still required. The version with a hand-written WatermarkStrategy/WatermarkGenerator looks like this:
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.commons.lang.time.FastDateFormat;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.*;
import java.util.concurrent.TimeUnit;

public class check_waterMaker2 {
    public static void main(String[] args) throws Exception {
        FastDateFormat df = FastDateFormat.getInstance("HH:mm:ss");

        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Interval at which watermarks are emitted periodically. With 1000 ms, onPeriodicEmit below
        // prints roughly once per record; with 100 ms it prints about ten times per record.
        env.getConfig().setAutoWatermarkInterval(1000);
        env.setParallelism(1);
        // env.enableCheckpointing(checkpointInteval); // with checkpointing disabled, no offsets are committed
        env.getCheckpointConfig().setCheckpointTimeout(60 * 60 * 1000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        //2.Source
        // Simulate real-time order data (delayed and out of order)
        DataStreamSource<Order> orderDS = env.addSource(new SourceFunction<Order>() {
            private boolean flag = true;

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                Random random = new Random();
                while (flag) {
                    String orderId = UUID.randomUUID().toString();
                    int userId = random.nextInt(3);
                    int money = random.nextInt(100);
                    // Simulate delay and disorder: shift the event time back by a random 0-4 seconds
                    long eventTime = System.currentTimeMillis() - random.nextInt(5) * 1000;
                    System.out.println("sent: " + orderId + " \t userId:" + userId + "\t" + money + "\t" + df.format(eventTime));
                    ctx.collect(new Order(orderId, userId, money, eventTime));
                    TimeUnit.SECONDS.sleep(1);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });
        // orderDS.print();

        //3.Watermark: hand-written strategy, equivalent to forBoundedOutOfOrderness with a 3-second delay
        DataStream<Order> watermakerDS = orderDS
                .assignTimestampsAndWatermarks(
                        new WatermarkStrategy<Order>() {
                            @Override
                            public WatermarkGenerator<Order> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                                return new WatermarkGenerator<Order>() {
                                    private int userId = 0;
                                    private long eventTime = 0L;
                                    private final long outOfOrdernessMillis = 3000;
                                    // Start low enough that the first emitted watermark does not underflow
                                    private long maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;

                                    @Override
                                    public void onEvent(Order event, long eventTimestamp, WatermarkOutput output) {
                                        userId = event.userId;
                                        eventTime = event.eventTime;
                                        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
                                    }

                                    @Override
                                    public void onPeriodicEmit(WatermarkOutput output) {
                                        // Watermark = current max event time - maximum allowed delay/out-of-orderness
                                        Watermark watermark = new Watermark(maxTimestamp - outOfOrdernessMillis - 1);
                                        System.out.println("key:" + userId
                                                + ",system time:" + df.format(System.currentTimeMillis())
                                                + ",event time:" + df.format(eventTime)
                                                + ",watermark:" + df.format(watermark.getTimestamp()));
                                        output.emitWatermark(watermark);
                                    }
                                };
                            }
                        }.withTimestampAssigner((event, timestamp) -> event.getEventTime())
                ).setParallelism(1);

        // For learning/testing, the window function below prints more detail when a window fires:
        // the key, the window bounds, and the event times of the records inside it.
        // (The pipeline from here on reconstructs the part truncated in the original post,
        //  assuming the same 5-second tumbling event-time window as in the first example.)
        //4.Transformation and sink with detailed output
        SingleOutputStreamOperator<String> result = watermakerDS
                .keyBy(order -> order.userId)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .apply(new WindowFunction<Order, String, Integer, TimeWindow>() {
                    @Override
                    public void apply(Integer key, TimeWindow window, Iterable<Order> input, Collector<String> out) {
                        List<String> eventTimes = new ArrayList<>();
                        for (Order order : input) {
                            eventTimes.add(df.format(order.eventTime));
                        }
                        out.collect("key:" + key
                                + ", window:[" + df.format(window.getStart()) + "," + df.format(window.getEnd()) + ")"
                                + ", event times:" + eventTimes);
                    }
                });
        result.print();

        //5.execute
        env.execute();
    }

    // Same Order POJO as in check_waterMaker1
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order {
        public String orderId;
        public Integer userId;
        public Integer money;
        public Long eventTime; // event time in epoch milliseconds

        public long getEventTime() {
            return this.eventTime;
        }
    }
}
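The hand-written generator above mirrors what Flink's built-in bounded-out-of-orderness strategy does: onEvent only remembers the largest event timestamp seen so far, and onPeriodicEmit (called every AutoWatermarkInterval milliseconds) emits maxTimestamp - outOfOrdernessMillis - 1 as the watermark, so the printed watermark always trails the largest event time by a little over 3 seconds. Initializing maxTimestamp to Long.MIN_VALUE + outOfOrdernessMillis + 1 simply keeps that subtraction from underflowing before the first record arrives.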