Flink window functions and late data (Watermark and SideOutput)

@羲凡——只为了更好的活着


Computing statistics over, say, the last 5 minutes of data is one of the most common patterns in stream processing. It immediately raises a classic question: what do we do about data that arrives late or out of order?

Flink offers several complementary mechanisms for handling late or out-of-order data:
1. Assign a watermark
2. Keep the window open longer (allowed lateness)
3. Send data that is still too late to a side output

The example below sums data over 10-second windows, with a 2-second watermark delay, an extra 4 seconds of allowed lateness before the window closes, and a side output for data that arrives even later.
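Before the full program, the first mechanism can be sketched in plain Java with no Flink dependency (illustrative only): a bounded-out-of-orderness watermark simply trails the largest event timestamp seen so far by the configured delay, which is what `BoundedOutOfOrdernessTimestampExtractor` computes internally.

```java
// Sketch (no Flink dependency): how a bounded-out-of-orderness watermark advances.
// watermark = max event timestamp seen so far - allowed out-of-orderness.
public class WatermarkSketch {
    private long maxTimestamp = Long.MIN_VALUE;
    private final long maxOutOfOrdernessMs;

    public WatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Record an event's timestamp and return the current watermark. */
    public long onEvent(long eventTimestampMs) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMs);
        return maxTimestamp - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(2000); // 2 s delay, as in the example
        System.out.println(wm.onEvent(1592470610000L)); // 1592470608000
        System.out.println(wm.onEvent(1592470620000L)); // 1592470618000
        // A late event does not move the watermark backwards:
        System.out.println(wm.onEvent(1592470612000L)); // still 1592470618000
    }
}
```

Note that the watermark never goes backwards: a late event leaves it unchanged, which is why late records can still fall into an already-fired window.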

1. The code
package flink.window;

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import static org.apache.flink.streaming.api.windowing.time.Time.seconds;

public class Test {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Parallelism must be 1 here, otherwise records are spread across threads
        env.setParallelism(1);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStreamSource<String> inputStream = env.socketTextStream("ml20.com", 8888);
        // OutputTag for late data; the anonymous subclass keeps the generic type information
        OutputTag<Tuple3<String, Long, Long>> outputTag = new OutputTag<Tuple3<String, Long, Long>>("side") {
        };
        // Parse the input; the second field will serve as the event timestamp
        DataStream<Tuple3<String, Long, Long>> dataStream = inputStream.map(new MapFunction<String, Tuple3<String, Long, Long>>() {
            @Override
            public Tuple3<String, Long, Long> map(String value) throws Exception {
                String[] arr = value.split(",");
                Tuple3<String, Long, Long> tuple3 = new Tuple3<>();
                tuple3.f0 = arr[0].trim();
                tuple3.f1 = Long.valueOf(arr[1].trim());
                tuple3.f2 = Long.valueOf(arr[2].trim());
                return tuple3;
            }
        }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Long>>(seconds(2)) {
            @Override
            public long extractTimestamp(Tuple3<String, Long, Long> element) {
                return element.f1 * 1000;
            }
        });
        // Aggregate per key and attach each window's start and end time to the result
        SingleOutputStreamOperator<Tuple4<String, Long, Long, Long>> sumStream = dataStream
                .keyBy(0).timeWindow(seconds(10)).allowedLateness(seconds(4)).sideOutputLateData(outputTag)
                .aggregate(new AggregateFunction<Tuple3<String, Long, Long>, Long, Long>() {
                    @Override
                    public Long createAccumulator() {
                        return 0L;
                    }

                    @Override
                    public Long add(Tuple3<String, Long, Long> value, Long accumulator) {
                        return accumulator + value.f2;
                    }

                    @Override
                    public Long getResult(Long accumulator) {
                        return accumulator;
                    }

                    @Override
                    public Long merge(Long a, Long b) {
                        return a + b;
                    }
                }, new WindowFunction<Long, Tuple4<String, Long, Long, Long>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Long> input, Collector<Tuple4<String, Long, Long, Long>> out) throws Exception {
                        long windowStart = window.getStart();
                        long windowEnd = window.getEnd();
                        // Aggregated result for this window
                        Long aLong = input.iterator().next();
                        // Emit the result together with the window bounds
                        out.collect(new Tuple4<>(tuple.getField(0), windowStart, windowEnd, aLong));
                    }
                });
        // Print results
        dataStream.print("data");
        sumStream.print("sum");
        sumStream.getSideOutput(outputTag).print("sideOutput");
        env.execute("Demo227");
    }
}
2. Test data (enter the records one at a time)
user1, 1592470610,1
user1, 1592470620,2
user1, 1592470621,3
user1, 1592470622,4
user1, 1592470612,5
user1, 1592470625,6
user1, 1592470614,7
user1, 1592470626,8
user1, 1592470616,9
3. Test results
data> (user1,1592470610,1)
data> (user1,1592470620,2)
data> (user1,1592470621,3)
data> (user1,1592470622,4)
sum> (user1,1592470610000,1592470620000,1)
data> (user1,1592470612,5)
sum> (user1,1592470610000,1592470620000,6)
data> (user1,1592470625,6)
data> (user1,1592470614,7)
sum> (user1,1592470610000,1592470620000,13)
data> (user1,1592470626,8)
data> (user1,1592470616,9)
sideOutput> (user1,1592470616,9)
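A note on the window bounds printed above: with a 10 s tumbling window, Flink aligns window starts to multiples of the window size (see `TimeWindow.getWindowStartWithOffset`). A minimal sketch, assuming zero offset and non-negative timestamps:

```java
// Sketch of Flink's tumbling-window start calculation with zero offset:
// start = timestamp - (timestamp % windowSize). Windows are [start, start + size).
public class WindowAssignSketch {
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    public static void main(String[] args) {
        long ts = 1592470610L * 1000;          // first test record, in ms
        long start = windowStart(ts, 10_000);  // 10 s tumbling window
        long end = start + 10_000;
        // Left-closed, right-open: [start, end)
        System.out.println(start + " - " + end); // 1592470610000 - 1592470620000
    }
}
```

This is why the first record lands in the window [1592470610000, 1592470620000) shown in the `sum>` lines.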

Answers to questions readers may have

Q: Why didn't entering (user1,1592470620,2) trigger the computation for the [10, 20) window?
A: Because we set the watermark delay to 2 seconds. Put simply, the window only fires once the watermark (max event time minus 2 s) passes the window end, so the computation waits an extra 2 seconds.
Q: Why is the sum for the [10, 20) window 1 rather than 3?
A: Because windows are left-closed and right-open: the [10, 20) window includes second 10 but excludes second 20.
Q: Why can (user1,1592470612,5) and (user1,1592470614,7) still be added into the [10, 20) window, while (user1,1592470616,9) cannot?
A: We allowed the window an extra 4 seconds before closing, so late data for the [10, 20) window that arrives while event time is between 22 s and 26 s is still added onto the earlier result. From 26 s onward, late data for that window can only go to the side output.
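Putting the three mechanisms together, the fate of each test record with respect to the [10, 20) window can be simulated in plain Java (an illustrative model, not Flink internals; it ignores Flink's 1 ms `maxTimestamp` adjustment):

```java
import java.util.ArrayList;
import java.util.List;

// Simulates which in-window records arrive too late and are routed to the side
// output, given a 2 s watermark delay and 4 s allowed lateness (as in the example).
public class LateDataSketch {
    static List<Long> lateToSideOutput(long[] eventSeconds, long windowStart, long windowEnd,
                                       long watermarkDelay, long allowedLateness) {
        long maxSeen = Long.MIN_VALUE;
        List<Long> side = new ArrayList<>();
        for (long t : eventSeconds) {
            maxSeen = Math.max(maxSeen, t);
            long watermark = maxSeen - watermarkDelay;            // watermark trails max event time
            boolean inWindow = t >= windowStart && t < windowEnd; // left-closed, right-open
            // Once the watermark passes windowEnd + allowedLateness,
            // late records for this window can only go to the side output.
            if (inWindow && watermark >= windowEnd + allowedLateness) {
                side.add(t);
            }
        }
        return side;
    }

    public static void main(String[] args) {
        // Relative seconds of the nine test records above (1592470610 -> 10, etc.)
        long[] events = {10, 20, 21, 22, 12, 25, 14, 26, 16};
        System.out.println(lateToSideOutput(events, 10, 20, 2, 4)); // [16]
    }
}
```

Only the record at second 16 ends up in the side output, because it arrives after the record at second 26 has pushed the watermark to 24 = 20 + 4. This matches the `sideOutput>` line in the test results.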


====================================================================


If you have any questions about this post, feel free to leave a comment.
