@羲凡 (only to live a better life)
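Aggregating the data from the last 5 minutes is one of the most common patterns in stream processing. It raises a classic question: what happens when data arrives late or out of order?
Flink offers three main mechanisms for handling late or out-of-order data:
1. Add a watermark
2. Delay the closing of the window (allowed lateness)
3. Send data that is too late to a side output
The example below aggregates data in 10 s windows, with a watermark delay of 2 s, a further 4 s before the window is closed, and a side output for any data that arrives after that.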
package flink.window;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import static org.apache.flink.streaming.api.windowing.time.Time.seconds;
public class Test {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use parallelism 1 so that every record passes through the same subtask. With a higher
        // parallelism the records are spread across threads, and the watermark only advances
        // once every subtask has seen data, which makes the demo output hard to follow.
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStreamSource<String> inputStream = env.socketTextStream("ml20.com", 8888);

        // Tag for records that arrive too late; created as an anonymous subclass so that
        // the element type is retained
        OutputTag<Tuple3<String, Long, Long>> outputTag = new OutputTag<Tuple3<String, Long, Long>>("side") {
        };

        // Parse each line into (user, timestamp in seconds, value) and use the second field
        // as the event time; the watermark lags 2 s behind the largest event time seen
        DataStream<Tuple3<String, Long, Long>> dataStream = inputStream
                .map(new MapFunction<String, Tuple3<String, Long, Long>>() {
                    @Override
                    public Tuple3<String, Long, Long> map(String value) throws Exception {
                        String[] arr = value.split(",");
                        Tuple3<String, Long, Long> tuple3 = new Tuple3<>();
                        tuple3.f0 = arr[0].trim();
                        tuple3.f1 = Long.valueOf(arr[1].trim());
                        tuple3.f2 = Long.valueOf(arr[2].trim());
                        return tuple3;
                    }
                })
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Long>>(seconds(2)) {
                            @Override
                            public long extractTimestamp(Tuple3<String, Long, Long> element) {
                                // The input timestamp is in seconds; Flink expects milliseconds
                                return element.f1 * 1000;
                            }
                        });

        // Sum the values per key and attach the start and end time of each window
        SingleOutputStreamOperator<Tuple4<String, Long, Long, Long>> sumStream = dataStream
                .keyBy(0)
                .timeWindow(seconds(10))
                .allowedLateness(seconds(4))
                .sideOutputLateData(outputTag)
                .aggregate(new AggregateFunction<Tuple3<String, Long, Long>, Long, Long>() {
                    @Override
                    public Long createAccumulator() {
                        return 0L;
                    }

                    @Override
                    public Long add(Tuple3<String, Long, Long> value, Long accumulator) {
                        return accumulator + value.f2;
                    }

                    @Override
                    public Long getResult(Long accumulator) {
                        return accumulator;
                    }

                    @Override
                    public Long merge(Long a, Long b) {
                        // Only called for merging windows (e.g. session windows), but it should
                        // still return a valid accumulator rather than null
                        return a + b;
                    }
                }, new WindowFunction<Long, Tuple4<String, Long, Long, Long>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Long> input,
                                      Collector<Tuple4<String, Long, Long, Long>> out) throws Exception {
                        long windowStart = window.getStart();
                        long windowEnd = window.getEnd();
                        // The pre-aggregated result of the window (a single element)
                        Long sum = input.iterator().next();
                        // The key is wrapped in a Tuple because keyBy(0) uses a field position
                        String key = tuple.getField(0);
                        // Emit (key, window start, window end, sum)
                        out.collect(new Tuple4<>(key, windowStart, windowEnd, sum));
                    }
                });

        // Print the parsed input, the window results, and the late records
        dataStream.print("data");
        sumStream.print("sum");
        sumStream.getSideOutput(outputTag).print("sideOutput");
        env.execute("Demo227");
    }
}
Input (sent to the socket one line at a time):
user1, 1592470610,1
user1, 1592470620,2
user1, 1592470621,3
user1, 1592470622,4
user1, 1592470612,5
user1, 1592470625,6
user1, 1592470614,7
user1, 1592470626,8
user1, 1592470616,9
Output:
data> (user1,1592470610,1)
data> (user1,1592470620,2)
data> (user1,1592470621,3)
data> (user1,1592470622,4)
sum> (user1,1592470610000,1592470620000,1)
data> (user1,1592470612,5)
sum> (user1,1592470610000,1592470620000,6)
data> (user1,1592470625,6)
data> (user1,1592470614,7)
sum> (user1,1592470610000,1592470620000,13)
data> (user1,1592470626,8)
data> (user1,1592470616,9)
sideOutput> (user1,1592470616,9)
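Answers to questions readers may have
Q: Why is the 10-20 s window not computed as soon as (user1,1592470620,2) is entered?
A: Because the watermark is configured to lag 2 s behind the largest event time seen so far. In other words, Flink waits 2 more seconds of event time before computing, so the 10-20 s window first fires only once a record with event time 22 s arrives, which is why the first sum appears after (user1,1592470622,4).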
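The firing rule can be checked with a few lines of plain Java. This is only a sketch of the arithmetic, not Flink code: the class `WatermarkSketch` and its method names are made up for illustration, and it assumes the usual rule that the watermark is the largest event time seen minus the 2 s bound, and that a window [start, end) fires once the watermark reaches end - 1 ms.

```java
final class WatermarkSketch {
    // Same 2 s out-of-orderness bound as in the demo (assumption: value in milliseconds)
    static final long MAX_OUT_OF_ORDERNESS_MS = 2_000L;

    // Watermark after the given largest event time has been seen
    static long watermark(long maxEventTimeMs) {
        return maxEventTimeMs - MAX_OUT_OF_ORDERNESS_MS;
    }

    // A window [start, end) fires once the watermark reaches end - 1 ms
    static boolean windowFires(long windowEndMs, long watermarkMs) {
        return watermarkMs >= windowEndMs - 1;
    }

    // Returns the first event time (in seconds) whose arrival makes the window
    // ending at windowEndSec fire, or -1 if the window never fires
    static long firstFiringEvent(long[] eventTimesSec, long windowEndSec) {
        long maxSeenMs = Long.MIN_VALUE;
        for (long sec : eventTimesSec) {
            maxSeenMs = Math.max(maxSeenMs, sec * 1000);
            if (windowFires(windowEndSec * 1000, watermark(maxSeenMs))) {
                return sec;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] events = {1592470610L, 1592470620L, 1592470621L, 1592470622L};
        // prints 1592470622: the 10-20 s window fires when the 22 s record arrives
        System.out.println(firstFiringEvent(events, 1592470620L));
    }
}
```

In a real job the watermark is emitted periodically rather than per record, so the firing happens shortly after the 22 s record arrives rather than at exactly that moment.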
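Q: Why is the sum of the 10-20 s window 1 rather than 3?
A: Because window intervals are closed on the left and open on the right: the 10-20 s window includes 10 s but not 20 s. The record (user1,1592470620,2) therefore belongs to the 20-30 s window.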
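The half-open interval follows directly from how a tumbling window start is derived from the timestamp. The sketch below is plain Java, not Flink code; `WindowAssignSketch` is a hypothetical name, and it assumes 10 s tumbling windows with no offset, as in the demo.

```java
final class WindowAssignSketch {
    // Window size of the demo: 10 s, in milliseconds
    static final long WINDOW_SIZE_MS = 10_000L;

    // Start of the tumbling window that contains the timestamp (inclusive)
    static long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, WINDOW_SIZE_MS);
    }

    // End of that window (exclusive)
    static long windowEnd(long timestampMs) {
        return windowStart(timestampMs) + WINDOW_SIZE_MS;
    }

    public static void main(String[] args) {
        // The 10 s record opens the window [1592470610000, 1592470620000)
        System.out.println(windowStart(1592470610000L)); // prints 1592470610000
        // The 20 s record falls on the excluded right edge, so it starts the next window
        System.out.println(windowStart(1592470620000L)); // prints 1592470620000
        // One millisecond earlier still belongs to the 10-20 s window
        System.out.println(windowStart(1592470619999L)); // prints 1592470610000
    }
}
```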
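Q: Why can (user1,1592470612,5) and (user1,1592470614,7) still be added to the 10-20 s window, while (user1,1592470616,9) cannot?
A: The window is configured to close 4 s late. Late records for the 10-20 s window are still added to the existing result, and the window fires again with the updated sum (6, then 13), as long as the largest event time seen so far is between 22 s and 26 s. Once a record with event time 26 s or later has arrived, the watermark reaches 24 s, the window is closed for good, and any further record for it, such as (user1,1592470616,9), can only go to the side output.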
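To make the three outcomes concrete, the following plain Java sketch replays the demo input and classifies each record. It is a simplified model rather than Flink's implementation: `LatenessSketch` and the labels are hypothetical, and it assumes that each record is judged against the watermark produced by the records before it, that a record is late once the watermark has reached the last millisecond of its window, and that it is dropped to the side output once the watermark has reached that millisecond plus the allowed lateness.

```java
import java.util.ArrayList;
import java.util.List;

final class LatenessSketch {
    static final long WINDOW_SIZE_MS = 10_000L;         // 10 s tumbling windows
    static final long MAX_OUT_OF_ORDERNESS_MS = 2_000L; // watermark lag
    static final long ALLOWED_LATENESS_MS = 4_000L;     // extra time before the window closes

    // Classifies each event time (in seconds), in arrival order
    static List<String> classify(long[] eventTimesSec) {
        List<String> result = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (long sec : eventTimesSec) {
            long ts = sec * 1000;
            long windowEnd = ts - Math.floorMod(ts, WINDOW_SIZE_MS) + WINDOW_SIZE_MS;
            long lastTimestampInWindow = windowEnd - 1;
            if (lastTimestampInWindow + ALLOWED_LATENESS_MS <= watermark) {
                result.add("SIDE_OUTPUT");        // window already closed for good
            } else if (lastTimestampInWindow <= watermark) {
                result.add("LATE_BUT_ACCEPTED");  // window already fired, result is updated
            } else {
                result.add("ON_TIME");
            }
            // The record advances the watermark only after it has been classified
            watermark = Math.max(watermark, ts - MAX_OUT_OF_ORDERNESS_MS);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] demoInput = {
            1592470610L, 1592470620L, 1592470621L, 1592470622L, 1592470612L,
            1592470625L, 1592470614L, 1592470626L, 1592470616L
        };
        // prints [ON_TIME, ON_TIME, ON_TIME, ON_TIME, LATE_BUT_ACCEPTED, ON_TIME,
        //         LATE_BUT_ACCEPTED, ON_TIME, SIDE_OUTPUT] on a single line
        System.out.println(classify(demoInput));
    }
}
```

The two LATE_BUT_ACCEPTED records correspond to the updated sums 6 and 13 in the output above, and the final SIDE_OUTPUT record is the one printed with the sideOutput prefix.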
If you have any questions about this post, feel free to leave a comment.