FlinkSQL - Cascading Window Aggregation and Sink to HBase

I. Background

The requirement comes from a real-time metric: compute the number of matched orders over the past hour.

A matched order is defined as one where the user places an order, a driver accepts it, and the order is not subsequently cancelled (a ride-hailing scenario).

Computationally, this metric needs two aggregations: aggregate the orders falling in the past one-hour window and tag any order that is later cancelled, then, in a second window built on top of the first, exclude the cancelled orders and count the matched ones. The requirement can be abstracted one level further:

any cascading GROUP BY over the past N hours of windowed data fits this pattern.

II. Development

1. The official documentation on cascading windows

Cascading Window Aggregation
The window_start and window_end columns are regular timestamp columns, not time attributes. Thus they can’t be used as time attributes in subsequent time-based operations. In order to propagate time attributes, you need to additionally add window_time column into GROUP BY clause. The window_time is the third column produced by Windowing TVFs which is a time attribute of the assigned window. Adding window_time into GROUP BY clause makes window_time also to be group key that can be selected. Then following queries can use this column for subsequent time-based operations, such as cascading window aggregations and Window TopN.
The following shows a cascading window aggregation where the first window aggregation propagates the time attribute for the second window aggregation.
-- tumbling 5 minutes for each supplier_id

CREATE VIEW window1 AS
-- Note: The window start and window end fields of inner Window TVF are optional in the select clause. However, if they appear in the clause, they need to be aliased to prevent name conflicting with the window start and window end of the outer Window TVF.
SELECT window_start as window_5mintumble_start, window_end as window_5mintumble_end, window_time as rowtime, SUM(price) as partial_price
  FROM TABLE(
    TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '5' MINUTES))
  GROUP BY supplier_id, window_start, window_end, window_time;

-- tumbling 10 minutes on the first window
SELECT window_start, window_end, SUM(partial_price) as total_price
  FROM TABLE(
      TUMBLE(TABLE window1, DESCRIPTOR(rowtime), INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end;

The above is the official explanation and demo of cascading windows. The key point in cascading real-time windows is propagating the time attribute. When a view is created from an ordinary SELECT, a selected time-attribute column keeps its time attribute and is passed downstream unchanged; in a windowed query, however, you must additionally select the window_time column, which carries the time attribute of the assigned window, so that it can serve as the event time of the next window level.
PS: production environments generally use event time as the time semantics, and event time is used throughout this post.
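
As a minimal sketch of this propagation (the events table, user_id and event_time are hypothetical, and events is assumed to declare WATERMARK FOR event_time):

CREATE VIEW events_view AS
SELECT user_id, event_time   -- event_time keeps its time attribute through the view
FROM events;

-- event_time can therefore still drive a window on the view
SELECT window_start, window_end, COUNT(*) AS cnt
  FROM TABLE(
    TUMBLE(TABLE events_view, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
  GROUP BY window_start, window_end;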

2. Code

Computation approach:

  • Level 1: a sliding (HOP) window with a 1-minute slide and a 1-hour size. Within each window, GROUP BY order and SUM over the order status, mapping any cancellation status to -9999; an order cancelled within the hour then ends up with a negative sum, which is the marker used to drop it later (see the sketch after this list).
  • Level 2: a 1-minute tumbling window. Because level 1 is already windowed, it hands the next level the complete past hour of per-order results every minute; this level only needs to tumble by 1 minute, drop the orders with a negative sum, and count the rest.
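
A self-contained sketch of the -9999 trick (the rows below are made-up example data; the status codes follow the comments in the code further down, where statuses like 1 and 2 count as active and 3/4/5 mean cancellation):

-- one order that was created (1), accepted (2), and then cancelled (4)
SELECT order_id,
       SUM(CASE WHEN order_status IN (1, 2) THEN 1
                WHEN order_status IN (3, 4, 5) THEN -9999
                ELSE 0 END) AS order_match_num
  FROM (VALUES (1001, 1), (1001, 2), (1001, 4)) AS t(order_id, order_status)
  GROUP BY order_id;
-- result: 1 + 1 - 9999 = -9997, negative, so level 2 drops this order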

Versions:
Flink 1.13.2
HBase 2.0.3
Kafka 2.1.0

public class LabelSinkHbase {
    public static void main(String[] args) throws Exception{
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpoint configuration: 5-minute interval, exactly-once semantics
        env.enableCheckpointing(300000, CheckpointingMode.EXACTLY_ONCE);
        FileSystemCheckpointStorage checkpointStorage = new FileSystemCheckpointStorage("hdfs://aly-hn1-bigdata-realtime01/torrent/flink/statebackend/xxx");
        env.getCheckpointConfig().setCheckpointStorage(checkpointStorage);

        // create the table execution environment and set the pipeline name
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
        Configuration configuration = tableEnv.getConfig().getConfiguration();
        configuration.setString("pipeline.name","LABEL CALCULATE");

        // register the source table: order events from Kafka
        tableEnv.executeSql("create table order_table(\n" +
                "`order_datetime` Timestamp(3),\n" +
                "`order_status` int,\n" +
                "`modify_time` Timestamp(3),\n" +
                "`order_id` bigint,\n" +
                "WATERMARK FOR modify_time AS modify_time - INTERVAL '1' SECOND" +
                ") with (\n" +
                "  'connector' = 'kafka',\n" +
                "  'topic' = 'topic',\n" +
                "  'properties.bootstrap.servers' = 'server1:9092,server2:9092,server3:9092',\n" +
                "  'properties.group.id' = 'groupid',\n" +
                "  'scan.startup.mode' = 'latest-offset',\n" +
                "  'format' = 'json',\n" +
                "  'json.ignore-parse-errors' = 'true'\n" +
                ")");

        // register the sink table: HBase
        tableEnv.executeSql("CREATE TABLE hbase_order_labels (\n" +
                " rowkey string,\n" +
                " c ROW,\n" +
                " PRIMARY KEY (rowkey) NOT ENFORCED" +
                ") WITH (" +
                " 'connector' = 'hbase-2.2',\n" +
                " 'table-name' = 'rtc_dws:rtc_dws_order_labels',\n" +
                " 'zookeeper.quorum' = 'server:2181',\n" +
                " 'zookeeper.znode.parent' = '/hbase-unsecure'" +
                ")");

        // create a view; modify_time keeps its time attribute and is passed downstream as the rowtime
        tableEnv.executeSql("create view order_table_view as " +
                " select\n" +
                "    order_id\n" +
                "    ,order_status\n" +
                "    ,modify_time\n" +
                "from order_table\n" +
                "where DATE_FORMAT(modify_time,'yyyy-MM-dd')=DATE_FORMAT(order_datetime,'yyyy-MM-dd')");

        // first aggregation: 1-minute slide over a 1-hour HOP window; encode the order
        // status so that cancelled orders can be excluded in the next level
        // order_status in (3, 4, 5) means the order was cancelled
        tableEnv.executeSql("create view hop_temp as \n" +
                "SELECT\n" +
                "    order_id\n" +
                "    ,window_time as rowtime\n" +
                "    ,sum(case when order_status in (1, 2, 7, 10, 12, 13, 14, 15, 16) then 1\n" +
                "                when order_status in (3, 4, 5) then -9999\n" +
                "                else 0 end\n" +
                "        ) as order_match_num\n" +
                "from TABLE(\n" +
                "    HOP(TABLE order_table_veiw, DESCRIPTOR(modify_time), INTERVAL '1' MINUTE, INTERVAL '1' HOUR))\n" +
                "group by window_start, window_end, window_time ,order_id"
                );
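        // Note: each HOP emission covers the full trailing hour and slides by one minute,
        // so hop_temp hands the next level a complete hourly per-order picture every minute.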
        // for debugging: print the intermediate result
        // Table ta1 = tableEnv.sqlQuery("select * from hop_temp");
        // tableEnv.toRetractStream(ta1, Row.class).print("hop_temp----->");

        // second aggregation (cascading window): a 1-minute TUMBLE over the output of the previous window
        tableEnv.executeSql("create view tumble_temp as \n" +
                "SELECT \n" +
                "    window_end\n" +
                "    ,cast(sum(case when order_match_num <= 0 then 0 else 1 end) as double) as match_num_h\n" +
                "FROM TABLE(\n" +
                "   TUMBLE(TABLE hop_temp, DESCRIPTOR(rowtime), INTERVAL '1' MINUTE))\n" +
                "group by window_start, window_end");

        // sink to HBase; the rowkey here is built from window_end, since tumble_temp
        // only exposes window_end and match_num_h (a per-city key would require
        // carrying city_id through the views above)
        tableEnv.executeSql("insert into hbase_order_labels" +
                " select" +
                "    concat_ws('_',md5(cast(city_id as varchar)),'sc') as rowkey" +
                "    ,ROW(cast(match_num_h as varchar))" +
                " from tumble_temp");

        // no env.execute() needed: the executeSql INSERT above already submits the job
        // env.execute("LABEL CALCULATE");
    }
}
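
For a quick sanity check of tumble_temp before wiring up HBase, a sketch using Flink's built-in print connector (print_sink is a made-up name; the columns mirror tumble_temp above):

CREATE TABLE print_sink (
    window_end TIMESTAMP(3),
    match_num_h DOUBLE
) WITH (
    'connector' = 'print'
);

INSERT INTO print_sink
SELECT window_end, match_num_h FROM tumble_temp;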

3. Additional notes

a. Since 1.13 there is no need to declare the time semantics explicitly via env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); declaring a WATERMARK on a field when creating the source table is enough.

b. For the HBase connector, the official docs list dependencies for only two versions (1.4.x and 2.2.x). This job runs against HBase 2.0.3, and in practice the hbase-2.2 connector dependency works with it.
[Figure: HBase connector dependency versions listed in the Flink documentation]

c. In theory, this kind of requirement could skip the window in the second level and compute directly after filtering out the negative-sum orders. The problem is that in streaming, every incoming record triggers a recomputation, so the emitted result only gradually converges toward the true value. If the business side calls the serving API before the computation has settled, it reads a wrong intermediate result. Hence the second level also uses a window, so each window's data is computed once and then sunk to the target store. (See the sketch below.)
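
As a contrast, a minimal sketch of the non-windowed variant described above, reusing hop_temp from the code; the running result is retracted and re-emitted on every input row, which is exactly the partial-result problem:

SELECT
    rowtime,
    SUM(CASE WHEN order_match_num <= 0 THEN 0 ELSE 1 END) AS match_num_h
  FROM hop_temp
  GROUP BY rowtime;  -- continuous aggregation: emits an updated result per record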


This post is for learning and exchange; feel free to point out any issues in the comments.
