Flink: Counting Ingested Data Volume with Tumbling Windows and State

1. Overview

In production you frequently need to reconcile record counts with upstream and downstream systems. In an offline (batch) scenario you can simply group by and count, but in a real-time pipeline that uses Kafka as the middleware, data passes through several jobs of filtering and transformation before reaching the final layer in a store such as Doris or ClickHouse. If records go missing, it is hard to tell at which layer they were lost.

2. Counting data volume with a side output + a processing-time tumbling window + state

package com.flink.feature.windowcount;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/**
 * Mirrors every input record into a side output, keys the side stream by the
 * record value, and counts records per key in a 10-second processing-time
 * tumbling window using MapState. Sample inputs and the resulting console
 * output are shown in the test results section below.
 */


public class UseWindowValidateData {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Side output tag: every record is duplicated into this stream for counting,
        // so the main business stream is left untouched.
        OutputTag<Tuple2<String,Integer>> windowCountTag = new OutputTag<Tuple2<String,Integer>>("window_count"){};

        DataStreamSource<String> source = env.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<String> process = source.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String input, ProcessFunction<String, String>.Context ctx, Collector<String> collector) throws Exception {
                ctx.output(windowCountTag,new Tuple2<>("窗口统计=>"+input,1));
                collector.collect("业务处理=>" + input);
            }
        });

        process.getSideOutput(windowCountTag).keyBy(new KeySelector<Tuple2<String,Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String,Integer> tp) throws Exception {
                        return tp.f0;
                    }
                }).window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<Tuple2<String, Integer>, Tuple4<String,String,String,Integer>, String, TimeWindow>() {
                    // Keyed state: scoped per key and retained across windows unless cleared.
                    private MapState<String, Integer> mapState;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        MapStateDescriptor<String, Integer> stateDescriptor = new MapStateDescriptor<>("map-state", String.class, Integer.class);
                        mapState = getRuntimeContext().getMapState(stateDescriptor);
                    }

                    @Override
                    public void process(String key, ProcessWindowFunction<Tuple2<String, Integer>, Tuple4<String, String, String, Integer>, String, TimeWindow>.Context ctx, Iterable<Tuple2<String, Integer>> elements, Collector<Tuple4<String, String, String, Integer>> out) throws Exception {

                        for (Tuple2<String, Integer> tp : elements) {
                            Integer res = mapState.get(tp.f0);

                            if (res == null) {
                                res = 0;
                            }

                            res += 1;
                            mapState.put(tp.f0, res);
                        }

                        out.collect(new Tuple4<>(String.valueOf(ctx.window().getStart()),String.valueOf(ctx.window().getEnd()),key,mapState.get(key)));

                        // Clear the per-key state after each window fires;
                        // otherwise counts would accumulate across windows.
                        mapState.clear();
                    }
                }).print("每10秒每个key接受到的数据量=>");

        process.print("main=>");

        env.execute();
    }
}
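Note that because mapState is cleared at the end of every process() call, the MapState never carries information across windows here: the emitted count is simply the number of elements per key inside that window. A minimal stdlib model of that per-window logic (class and method names are illustrative, not part of the Flink API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerWindowCount {
    // Count occurrences of each key among the elements collected in one window,
    // mirroring what the ProcessWindowFunction above computes before clearing state.
    static Map<String, Integer> count(List<String> keysInWindow) {
        Map<String, Integer> counts = new HashMap<>();
        for (String k : keysInWindow) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }
}
```

If per-window counts are all that is needed, Flink's incremental aggregation (an AggregateFunction passed to window().aggregate(...)) avoids buffering every element until the window fires; the MapState version is mainly useful when you want counts to survive beyond a single window.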

3. Test Results

1) First, input:

1
1
1
1

Output:

main=>:8> 业务处理=>1
main=>:1> 业务处理=>1
main=>:2> 业务处理=>1
main=>:3> 业务处理=>1

每10秒每个key接受到的数据量=>:2> (1698913020000,1698913030000,窗口统计=>1,4)

2) Then input:

1
2
2
3
3
4
4

Output:

main=>:4> 业务处理=>1
main=>:5> 业务处理=>2
main=>:6> 业务处理=>2
main=>:7> 业务处理=>3
main=>:8> 业务处理=>3
main=>:1> 业务处理=>4
main=>:2> 业务处理=>4

每10秒每个key接受到的数据量=>:2> (1698913030000,1698913040000,窗口统计=>1,1)
每10秒每个key接受到的数据量=>:7> (1698913030000,1698913040000,窗口统计=>4,2)
每10秒每个key接受到的数据量=>:6> (1698913030000,1698913040000,窗口统计=>2,2)
每10秒每个key接受到的数据量=>:6> (1698913030000,1698913040000,窗口统计=>3,2)
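The boundary timestamps in the tuples above (e.g. 1698913020000 and 1698913030000) are epoch milliseconds. A tumbling window with no offset aligns each timestamp down to a multiple of the window size, which is why consecutive windows line up on exact 10-second boundaries. A sketch of that alignment rule (class and method names here are illustrative; compare Flink's TimeWindow.getWindowStartWithOffset):

```java
class WindowBounds {
    // Start of the tumbling window containing timestampMs (offset 0).
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // End of that window (exclusive).
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }
}
```

For example, any timestamp between 1698913020000 and 1698913029999 maps to the [1698913020000, 1698913030000) window shown in the first batch of output.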
