Flink SQL: writing top-N aggregation results to Kafka (Flink 1.11.0)

First, a quick plug for Brother Ji's blog: https://me.csdn.net/weixin_47482194

His posts are well written and have been featured on the site's recommendations several times.

 

Now to the point. When you work with Flink SQL you can't avoid Kafka, and when writing to Kafka you may well have run into the following problem:

 Exception in thread "main" org.apache.flink.table.api.TableException: AppendStreamTableSink requires that Table has only insert changes.

Wait, seriously? Even a statement as basic as select count(*) from table isn't supported?

The official explanation: the problem is caused by Flink's internal retract mechanism; until changelog support covers the whole pipeline, retract/upsert support cannot be added on top of an append-only message queue like Kafka.
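
For context, this error shows up as soon as you try to INSERT an updating (non-append-only) query result, such as a grouped count, into a plain Kafka table. Below is a minimal sketch of the failing pattern, assuming a source table kafka_table and a Kafka sink table kafka_sink with a single BIGINT column have already been created (the table names here are placeholders, and the exact error text differs between the legacy and the new connector stack):

// a grouped count produces update (retract/upsert) changes, which an
// append-only Kafka sink cannot consume, hence the exception above
tableEnv.executeSql("insert into kafka_sink select count(*) as cnt from kafka_table");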

Fortunately a Table can be converted into a DataStream; the code is below (here I take the top N within each group):

If connecting to Kafka feels like too much trouble, you can generate the data with a custom source instead of reading from Kafka.

import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.util.Collector;

import java.util.Properties;

public class FlinkTopN2Doris {

    private static final String KAFKA_SQL = "CREATE TABLE kafka_table (" +
            " category_id STRING," +
            " user_id STRING ," +
            " item_id STRING ," +
            " behavior STRING ," +
            " ts STRING ," +
//            " proctime as PROCTIME() ," +
            " row_ts AS TO_TIMESTAMP(FROM_UNIXTIME(cast(ts AS BIGINT), 'yyyy-MM-dd HH:mm:ss'))," +
            " WATERMARK FOR row_ts AS row_ts - INTERVAL '5' SECOND " +
            ") WITH (" +
            " 'connector' = 'kafka'," +
            " 'topic' = 'flink_test'," +
            " 'properties.bootstrap.servers' = '192.168.12.188:9092'," +
            " 'properties.group.id' = 'test1'," +
            " 'format' = 'json'," +
            " 'scan.startup.mode' = 'earliest-offset'" +
            ")";

    private static final String SINK_KAFKA_SQL = "CREATE TABLE kafka_table2 (" +
            " ts STRING," +
            " user_id STRING ," +
            " behavior STRING ," +
            "row_num BIGINT " +
            ") WITH (" +
            " 'connector' = 'kafka'," +
            " 'topic' = 'flink_test2'," +
            " 'properties.bootstrap.servers' = '192.168.12.188:9092'," +
            " 'properties.group.id' = 'test1'," +
            " 'format' = 'json'," +
            " 'scan.startup.mode' = 'earliest-offset'" +
            ")";

    private static final String PRINT_SQL = "create table sink_print (" +
            "  p_count BIGINT ," +
            "  b STRING " +
            ") with ('connector' = 'print' )";


    private static final String PRINT_SQL2 = "create table sink_print2 (" +
            "  a STRING," +
            "  b STRING," +
            "  c STRING," +
            "  d BIGINT " +
            ") with ('connector' = 'print' )";

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
        bsEnv.enableCheckpointing(5000);
        bsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
//        tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        // TODO: read the generated data with Flink, transform it, and write it to Kafka.
        tableEnv.executeSql(KAFKA_SQL);
        tableEnv.executeSql(PRINT_SQL);
        tableEnv.executeSql(PRINT_SQL2);
        tableEnv.executeSql(SINK_KAFKA_SQL);
//        tableEnv.executeSql("select * from kafka_table").print();

        // TODO: group by behavior and compute the distinct user count
//        tableEnv.executeSql("insert into sink_print select COUNT(DISTINCT user_id) AS uv, behavior from kafka_table GROUP BY behavior ");

        // TODO: top-N - keep only the latest record per key
        /*String top1Sql = "insert into sink_print2 SELECT * " +
                "FROM (" +
                "   SELECT user_id,behavior," +
                "       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) as row_num" +
                "   FROM kafka_table ) " +
                "WHERE row_num = 1";

        tableEnv.executeSql(top1Sql);*/

        String top1Sql = "insert into sink_print2 SELECT * " +
                "FROM (" +
                "   SELECT ts,user_id,behavior," +
                "       ROW_NUMBER() OVER (PARTITION BY behavior ORDER BY ts DESC) as row_num" +
                "   FROM kafka_table ) " +
                "WHERE row_num = 1";
//        tableEnv.executeSql(top1Sql);


        // TODO: 2020/7/31  write the result to Kafka

        String sinkToKafka = "SELECT * " +
                "FROM (" +
                "   SELECT ts,user_id,behavior," +
                "       ROW_NUMBER() OVER (PARTITION BY behavior ORDER BY ts DESC) as row_num" +
                "   FROM kafka_table ) " +
                "WHERE row_num = 1";
//        tableEnv.executeSql(sinkToKafka);
        Table table = tableEnv.sqlQuery(sinkToKafka);
//        DataStream<Tuple2<Boolean, Row>> tuple2DataStream = tableEnv.toRetractStream(table, Row.class);

        DataStream<Tuple2<Boolean, Cookies>> tuple3DataStream = tableEnv.toRetractStream(table, Cookies.class);

        DataStream<String> sinkKafka = tuple3DataStream.flatMap(new FlatMapFunction<Tuple2<Boolean, Cookies>, String>() {
            @Override
            public void flatMap(Tuple2<Boolean, Cookies> value, Collector<String> out) throws Exception {
//                System.out.println("value.f0 = " + value.f0);
                // NOTE: at this point there no longer seem to be any 'false' (retract) records, only 'true' ones
                String outStr = JSONObject.toJSONString(value.f1);
                out.collect(outStr);
            }
        });

        FlinkKafkaProducer<String> myProducer = new FlinkKafkaProducer<>(
                "flink_doris",
                new SimpleStringSchema(),
                getProperties());   // serialization schema
        myProducer.setWriteTimestampToKafka(true);
        sinkKafka.addSink(myProducer).name("ods").uid("ods").setParallelism(1);

        // TODO: keep only the latest record
       /* tuple2DataStream.flatMap(new FlatMapFunction<Tuple2<Boolean, Row>, JSONObject>() {
            @Override
            public void flatMap(Tuple2<Boolean, Row> value, Collector<JSONObject> out) throws Exception {
                Boolean lastValue = value.f0;
                Row row = value.f1;

                JSONObject json = new JSONObject();
                json.put("state",lastValue);
                json.put("ts",row.getField(0));
                json.put("user_id",row.getField(1));
                json.put("behavior",row.getField(2));
                json.put("row_num",row.getField(3));
                out.collect(json);
            }
        }).print();*/

//        insert into kafka_table2
        bsEnv.execute("running the job......................");
    }

    public static Properties getProperties() {
        Properties producerConfig = new Properties();
        producerConfig.setProperty("bootstrap.servers", "192.168.12.188:9092");
        producerConfig.setProperty("ack", "all");
        producerConfig.setProperty("buffer.memory", "102400");
        producerConfig.setProperty("compression.type", "snappy");
        producerConfig.setProperty("batch.size", "1000");
        producerConfig.setProperty("linger.ms", "1");
        producerConfig.setProperty("transaction.timeout.ms", 1000 * 60 * 5 + "");
        return producerConfig;
    }

    /**
     * Final output:
     *  +I(1535452032,130701,pv,1)
     *  +I(1512316772,106260,fav,1)
     *  +I(1512316741,1015357,buy,1)
     *  +I(1512316781,1014597,cart,1)
     *
     */
}
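
One note on the code above: the Cookies class passed to toRetractStream(table, Cookies.class) is not shown in the original post. It is simply a POJO whose fields match the columns of the top-N query; a minimal sketch (field names assumed from the SELECT list, and Flink POJOs need a public no-arg constructor plus public fields or getters/setters):

public class Cookies {
    // matches the query columns: ts, user_id, behavior, row_num
    public String ts;
    public String user_id;
    public String behavior;
    public Long row_num;

    public Cookies() {
    }
}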

 

Custom source that generates data:

Then convert the stream into a Table and register it as a view:

SingleOutputStreamOperator<Row> ds = env.addSource(new RichSourceFunction<Row>() {
    @Override
    public void run(SourceContext<Row> ctx) throws Exception {
        Row r = new Row(2);
        r.setField(0, "a");
        r.setField(1, "a");
        ctx.collect(r);
    }

    @Override
    public void cancel() {

    }
}).returns(Types.ROW(Types.STRING, Types.STRING));


blinkStreamTableEnv.createTemporaryView("t",ds,"id,order_key,proctime.proctime");
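
Once registered, the view can be queried just like the Kafka table earlier. For example (a sketch that reuses the column names from the view definition above):

// a simple continuous query against the registered view
Table result = blinkStreamTableEnv.sqlQuery(
        "select order_key, count(*) as cnt from t group by order_key");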

 

Actually, there is also a way to write the result to Kafka with SQL alone. We follow the example a community expert gave, based on the official Q&A; his article is here:

https://mp.weixin.qq.com/s/MSs7HSaegyWWU3Fig2PYYA

1. First create two classes: add the package org.apache.flink.streaming.connectors.kafka to your project and put the two classes below into it, so that they override the implementation inside the official Kafka connector (see the project-layout sketch after the links).

  • KafkaTableSinkBase.java 

    https://github.com/sunjincheng121/know_how_know_why/blob/master/QA/upsertKafka/src/main/java/org/apache/flink/streaming/connectors/kafka/KafkaTableSinkBase.java

  • KafkaTableSourceSinkFactoryBase.java

    https://github.com/sunjincheng121/know_how_know_why/blob/master/QA/upsertKafka/src/main/java/org/apache/flink/streaming/connectors/kafka/KafkaTableSourceSinkFactoryBase.java
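
The trick here is classpath shadowing: classes compiled in your own project under the same fully-qualified names normally take precedence over the ones bundled in the connector jar. Roughly, the project layout looks like this (a sketch; the exact source root depends on your build tool):

src/main/java/
    org/apache/flink/streaming/connectors/kafka/
        KafkaTableSinkBase.java                  // copied from the repo above, then modified
        KafkaTableSourceSinkFactoryBase.java     // copied from the repo above, then modified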

2. That example targets Flink 1.10, while my local setup is 1.11.0, so the code that no longer compiles has to be commented out:

[Figure 1: the code that needs to be commented out]

 

OK, let's test our top-N code:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

/**
 * @program: flink-tech
 * @description: write Flink top-N results to Kafka, entirely in SQL
 * @author: Mr.Wang
 * @create: 2020-07-31 14:46
 **/
public class FlinkTopNBySql {


    private static final String KAFKA_SQL = "CREATE TABLE kafka_table (" +
            " category_id STRING," +
            " user_id STRING ," +
            " item_id STRING ," +
            " behavior STRING ," +
            " ts STRING ," +
//            " proctime as PROCTIME() ," +
            " row_ts AS TO_TIMESTAMP(FROM_UNIXTIME(cast(ts AS BIGINT), 'yyyy-MM-dd HH:mm:ss'))," +
            " WATERMARK FOR row_ts AS row_ts - INTERVAL '5' SECOND " +
            ") WITH (" +
            " 'connector' = 'kafka'," +
            " 'topic' = 'flink_test'," +
            " 'properties.bootstrap.servers' = '192.168.12.188:9092'," +
            " 'properties.group.id' = 'test1'," +
            " 'format' = 'json'," +
            " 'scan.startup.mode' = 'earliest-offset'" +
            ")";



    private static final String PRINT_SQL = "create table sink_print (" +
            " b BIGINT " +
            ") with ('connector' = 'print' )";



    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
        bsEnv.enableCheckpointing(5000);
        bsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
//        tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        // TODO: read the generated data with Flink, transform it, and write it to Kafka.
        tableEnv.executeSql(KAFKA_SQL);
        tableEnv.executeSql(PRINT_SQL);


        String top1Sql = "insert into sink_print SELECT count(category_id) from kafka_table GROUP BY behavior" ;

        tableEnv.executeSql(top1Sql);


    }
}

 

 

Console output:

[Figure 2: console output]

 

Sorry... sorry... that approach is for Flink 1.10; on Flink 1.11 it still fails with:

 Table sink 'default_catalog.default_database.kafkaSink' doesn't support consuming update changes which is produced by node GroupAggregate(groupBy=[id], select=[id, SUM(cnt) AS EXPR$1])

The official reply says:

[Figure 3: the official reply]

 

1. Either patch the source code,

2. or convert the Table to a DataStream and carry on from there.

 

 

===== Test result: on Flink 1.11.0, writing aggregated data to Kafka only requires changing one spot in the source code:

Open the source class on GitHub:

Path:

https://github.com/apache/flink/blob/master/flink-formats/flink-json/src/main/java/org/apache/flink/formats/json/JsonFormatFactory.java

 

Copy the code into your project and create the package:

org.apache.flink.formats.json

As shown:

[Figure 4: project layout with the org.apache.flink.formats.json package]

Then modify the code:

[Figure 5: the original code to be replaced]
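
For reference, the method being replaced is getChangelogMode() inside the EncodingFormat returned by JsonFormatFactory#createEncodingFormat. In 1.11 it declares the JSON format as insert-only, roughly like this (paraphrased from memory, so the surrounding code may differ slightly):

@Override
public ChangelogMode getChangelogMode() {
    // the stock json format only accepts INSERT rows
    return ChangelogMode.insertOnly();
}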

Replace it with:

return ChangelogMode.newBuilder()
        .addContainedKind(RowKind.INSERT)
        .addContainedKind(RowKind.UPDATE_BEFORE)
        .addContainedKind(RowKind.UPDATE_AFTER)
        .addContainedKind(RowKind.DELETE)
        .build();

 

Local test code:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class KafkaToKafka_test {

    private static final String KAFKA_SOURCE_SQL = "CREATE TABLE source (" +
            "  id INT," +
            "  name STRING " +
            ") WITH (" +
            " 'connector' = 'kafka'," +
            " 'topic' = 'order_test'," +
            " 'properties.bootstrap.servers' = 'dev-ct6-dc-worker01:9092,dev-ct6-dc-worker02:9092,dev-ct6-dc-worker03:9092'," +
            " 'properties.group.id' = 'test1'," +
            " 'format' = 'json'," +
            " 'scan.startup.mode' = 'earliest-offset'" +
            ")";

    private static final String KAFKA_SINK_SQL = "CREATE TABLE sink (\n" +
            "  id BIGINT " +
            ") WITH (" +
            " 'connector' = 'kafka'," +
            " 'topic' = 'order_test_sink'," +
            " 'properties.bootstrap.servers' = 'dev-ct6-dc-worker01:9092,dev-ct6-dc-worker02:9092,dev-ct6-dc-worker03:9092'," +
            " 'properties.group.id' = 'test1'," +
            " 'format' = 'json'," +
            " 'scan.startup.mode' = 'earliest-offset'" +
            ")";

    private static final String PRINT_SINK_SQL = "create table sink_print ( \n" +
            "  id BIGINT" +
//            "  name STRING" +
            ") with ('connector' = 'print' )";

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
        bsEnv.enableCheckpointing(5000);
        bsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        tEnv.executeSql(KAFKA_SOURCE_SQL);
        tEnv.executeSql(KAFKA_SINK_SQL);
        tEnv.executeSql(PRINT_SINK_SQL);
        String sql = "insert into  sink select sum(id) as ss from source ";

        tEnv.executeSql(sql);

    }
}

Insert a record into the source Kafka topic:

 {"id":10,"name":"wj"}

Result:

{"id":10}

 

Insert again; the aggregate is updated, but this time two output records are produced:

{"id":20,"name":"wj"}

Corresponding output:

{"id":10}

{"id":30}

 

What does that mean? We inserted two records in total, so the source Kafka topic holds two messages, but the sink Kafka topic now holds three, as shown:

[Figure 6: messages in the source topic]

[Figure 7: messages in the sink topic]

 

Look familiar? Right, it behaves just like printing to the console: every time the aggregate is updated, an extra update-before (retract) record is emitted, so the Kafka topic ends up with a duplicate "delete" message. You therefore have to deduplicate downstream yourself; since this is an aggregation, simply taking the latest result is enough. And because the output is plain JSON, it is more flexible than canal-json, which comes with more constraints.
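
To make the duplicate concrete, the changelog that select sum(id) from source produces after the two inserts looks roughly like this (RowKind on the left; with the modified JSON format every kind is serialized as an ordinary JSON message, which is exactly the three messages seen in the sink topic):

+I  {"id":10}    -- first insert: the initial aggregate
-U  {"id":10}    -- update-before (the retract), written as a duplicate message
+U  {"id":30}    -- update-after: the new aggregate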
