Joining a Stream Table with a Dimension Table in Flink SQL

A stream table holds data arriving from a message queue such as Kafka. A dimension table holds (more or less) static data, for example in MySQL.

The goal is to join the two tables using Flink SQL. The static dimension table acts like a dictionary: records keep arriving on the stream table, each one is looked up against the dictionary, enriched, and the result is inserted into a MySQL database.

Dependencies

I'll skip the dependencies every Flink Table project already needs. Below are the Kafka and JDBC connector dependencies.


<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
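One easy-to-miss point: the JDBC connector does not bundle a database driver, so the MySQL driver has to be added separately (the version below is only an example):

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.21</version>
</dependency>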

Code

At first glance the code below could hardly be simpler, yet the number of pitfalls I hit along the way was remarkable.

package it.kenn.demo;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import static org.apache.flink.table.api.Expressions.*;

/**
 * Joining a stream table with a dimension table
 */
public class JoinDemo {
    // Dimension table: static user data in MySQL, read via the JDBC connector.
    private static String dimTable = "CREATE TABLE dimTable (\n" +
            "  id int,\n" +
            "  user_name STRING,\n" +
            "  age INT,\n" +
            "  gender STRING,\n" +
            "  PRIMARY KEY (id) NOT ENFORCED\n" +
            ") WITH (\n" +
            "   'connector'='jdbc',\n" +
            "   'username'='root',\n" +
            "   'password'='root',\n" +
            "   'url'='jdbc:mysql://localhost:3306/aspirin',\n" +
            "   'table-name'='user_data_for_join'\n" +
            ")";

    // Stream table: JSON events consumed from a Kafka topic.
    private static String kafkaTable = "CREATE TABLE KafkaTable (\n" +
            "  `user` STRING,\n" +
            "  `site` STRING,\n" +
            "  `time` STRING\n" +
            ") WITH (\n" +
            "  'connector' = 'kafka',\n" +
            "  'topic' = 'test-old',\n" +
            "  'properties.bootstrap.servers' = 'localhost:9092',\n" +
            "  'properties.group.id' = 'testGroup',\n" +
            "  'scan.startup.mode' = 'earliest-offset',\n" +
            "  'format' = 'json'\n" +
            ")";

    // Sink: the enriched "wide" table, written back to MySQL.
    private static String wideTable = "CREATE TABLE wideTable (\n" +
            "  id int,\n" +
            "  site STRING,\n" +
            "  user_name STRING,\n" +
            "  age INT,\n" +
            "  ts STRING,\n" +
            "  PRIMARY KEY (id) NOT ENFORCED\n" +
            ") WITH (\n" +
            "   'connector'='jdbc',\n" +
            "   'username'='root',\n" +
            "   'password'='root',\n" +
            "   'url'='jdbc:mysql://localhost:3306/aspirin',\n" +
            "   'table-name'='wide_table'\n" +
            ")";
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnvironment = StreamTableEnvironment.create(env);
        tableEnvironment.executeSql(dimTable);
        tableEnvironment.executeSql(kafkaTable);
        tableEnvironment.executeSql(wideTable);
        // These Table handles are not used by the SQL join below; they only
        // show that the registered tables are reachable from the Table API.
        Table mysqlTable = tableEnvironment.from("dimTable")
                .select($("id"), $("user_name"), $("age"), $("gender"));
        Table kafkaSource = tableEnvironment.from("KafkaTable")
                .select($("user"), $("site"), $("time"));

        // A regular join (not a lookup join) between the Kafka stream and the
        // MySQL dimension table; note the backticks around reserved words.
        String joinSql = "insert into wideTable " +
                " select " +
                "   dimTable.id as `id`, " +
                "   t.site as site, " +
                "   dimTable.user_name as user_name, " +
                "   dimTable.age as age, " +
                "   t.`time` as ts " +
                "from KafkaTable as t " +
                "left join dimTable on dimTable.user_name = t.`user`";
        // executeSql already submits the INSERT job; calling env.execute() as
        // well would fail with "No operators defined in streaming topology".
        tableEnvironment.executeSql(joinSql);
    }
}

Did you notice that in the final INSERT, fields such as `time` and `user` are wrapped in backticks? These are reserved words in Flink SQL and must be escaped with backticks, otherwise you will keep hitting baffling SQL errors. One more observation: as soon as you join against the dimension table this way, Flink effectively treats the job as a batch program, and a batch program has to stop at some point.
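As a minimal illustration of the reserved-word issue (a hypothetical snippet, not part of the demo above):

// Fails to parse: user and time are reserved words in Flink SQL.
// tableEnvironment.executeSql("SELECT user, time, site FROM KafkaTable");

// Parses fine once the reserved columns are backtick-quoted.
tableEnvironment.executeSql("SELECT `user`, `time`, site FROM KafkaTable");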

Conclusion

The program above does run, but my test still failed. Once the stream table is joined with the dimension table, the job terminates, so what lands in the database is only part of the result I wanted. Worse, the job stops even while the Kafka service is still up and data keeps flowing into the topic; it does not stop because of an error, it simply decides it is finished. Presumably this is because the JDBC connector's scan source is bounded: it reads the MySQL table exactly once and never sees later changes.

To make the dimension table act as a "dictionary" that enriches the stream data, you need Flink's async I/O. Something to dig into when I get the chance.
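For reference, here is a minimal sketch of that async I/O idea. This is not code from this post: the class name MysqlAsyncLookup, the String-in/String-out shape, and the wiring are hypothetical simplifications, reusing only the MySQL table and credentials from the demo above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Enriches each user name with its age from MySQL, emitting "user,age".
public class MysqlAsyncLookup extends RichAsyncFunction<String, String> {
    private transient Connection conn;

    @Override
    public void open(Configuration parameters) throws Exception {
        // One connection per parallel subtask; a real job would pool these.
        conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/aspirin", "root", "root");
    }

    @Override
    public void asyncInvoke(String userName, ResultFuture<String> resultFuture) {
        // JDBC itself is blocking, so push the lookup off the task thread;
        // production code would use a dedicated executor or an async client.
        CompletableFuture.supplyAsync(() -> {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT age FROM user_data_for_join WHERE user_name = ?")) {
                ps.setString(1, userName);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? userName + "," + rs.getInt("age")
                                     : userName + ",unknown";
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).thenAccept(row -> resultFuture.complete(Collections.singleton(row)));
    }

    @Override
    public void close() throws Exception {
        if (conn != null) {
            conn.close();
        }
    }
}

// Wiring it in, assuming a DataStream<String> of user names parsed from Kafka:
// DataStream<String> enriched = AsyncDataStream.unorderedWait(
//         userNames, new MysqlAsyncLookup(), 5, TimeUnit.SECONDS, 100);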