A stream table holds data from a message queue such as Kafka; a dimension table holds static data in a store such as MySQL.
The goal here is to join the two with Flink SQL. The static dimension table acts like a dictionary: records keep arriving on the stream table, each one is enriched by looking up that dictionary, and the enriched rows are written into a MySQL database.
I'll skip the usual Flink Table dependencies; below are the ones for the Kafka and JDBC connectors.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
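Before running the job, the two MySQL tables referenced below have to exist. A minimal sketch, with column types inferred from the Flink DDL in the code (the VARCHAR lengths are my own assumption):

-- Dimension table (the "dictionary"); columns inferred from the Flink DDL below.
CREATE TABLE user_data_for_join (
  id INT PRIMARY KEY,
  user_name VARCHAR(64),
  age INT,
  gender VARCHAR(16)
);

-- Result ("wide") table the join writes into.
CREATE TABLE wide_table (
  id INT PRIMARY KEY,
  site VARCHAR(255),
  user_name VARCHAR(64),
  age INT,
  ts VARCHAR(64)
);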
The code below looks deceptively simple, but it hides a surprising number of pitfalls.
package it.kenn.demo;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import static org.apache.flink.table.api.Expressions.*;

/**
 * Joining a stream table (Kafka) with a dimension table (MySQL via JDBC).
 */
public class JoinDemo {

    // Dimension table backed by MySQL through the JDBC connector.
    private static final String dimTable = "CREATE TABLE dimTable (\n" +
            "  id INT,\n" +
            "  user_name STRING,\n" +
            "  age INT,\n" +
            "  gender STRING,\n" +
            "  PRIMARY KEY (id) NOT ENFORCED\n" +
            ") WITH (\n" +
            "  'connector' = 'jdbc',\n" +
            "  'username' = 'root',\n" +
            "  'password' = 'root',\n" +
            "  'url' = 'jdbc:mysql://localhost:3306/aspirin',\n" +
            "  'table-name' = 'user_data_for_join'\n" +
            ")";

    // Stream table reading JSON records from Kafka.
    private static final String kafkaTable = "CREATE TABLE KafkaTable (\n" +
            "  `user` STRING,\n" +
            "  `site` STRING,\n" +
            "  `time` STRING\n" +
            ") WITH (\n" +
            "  'connector' = 'kafka',\n" +
            "  'topic' = 'test-old',\n" +
            "  'properties.bootstrap.servers' = 'localhost:9092',\n" +
            "  'properties.group.id' = 'testGroup',\n" +
            "  'scan.startup.mode' = 'earliest-offset',\n" +
            "  'format' = 'json'\n" +
            ")";

    // Result (wide) table, written back to MySQL.
    private static final String wideTable = "CREATE TABLE wideTable (\n" +
            "  id INT,\n" +
            "  site STRING,\n" +
            "  user_name STRING,\n" +
            "  age INT,\n" +
            "  ts STRING,\n" +
            "  PRIMARY KEY (id) NOT ENFORCED\n" +
            ") WITH (\n" +
            "  'connector' = 'jdbc',\n" +
            "  'username' = 'root',\n" +
            "  'password' = 'root',\n" +
            "  'url' = 'jdbc:mysql://localhost:3306/aspirin',\n" +
            "  'table-name' = 'wide_table'\n" +
            ")";

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnvironment = StreamTableEnvironment.create(env);

        tableEnvironment.executeSql(dimTable);
        tableEnvironment.executeSql(kafkaTable);
        tableEnvironment.executeSql(wideTable);

        // Optional sanity checks; the SQL join below does not use these Table
        // handles. Note two pitfalls: the local variable must not shadow the
        // static field kafkaTable, and the string-based select("a, b") is
        // deprecated in favor of Expressions.$().
        Table mysqlSource = tableEnvironment.from("dimTable")
                .select($("id"), $("user_name"), $("age"), $("gender"));
        Table kafkaSource = tableEnvironment.from("KafkaTable")
                .select($("user"), $("site"), $("time"));

        String joinSql = "INSERT INTO wideTable " +
                "SELECT " +
                "  dimTable.id AS `id`, " +
                "  t.site AS site, " +
                "  dimTable.user_name AS user_name, " +
                "  dimTable.age AS age, " +
                "  t.`time` AS ts " +
                "FROM KafkaTable AS t " +
                "LEFT JOIN dimTable ON dimTable.user_name = t.`user`";

        // executeSql() submits the INSERT job itself; a trailing env.execute()
        // would fail with "No operators defined in streaming topology" because
        // no DataStream operators were added. await() (Flink 1.12+) blocks
        // until the job finishes, keeping a local MiniCluster alive while testing.
        tableEnvironment.executeSql(joinSql).await();
    }
}
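To actually see output, the dimension table needs rows whose user_name matches the `user` field of the JSON messages in Kafka. A hypothetical seed row (all values made up):

INSERT INTO user_data_for_join (id, user_name, age, gender)
VALUES (1, 'tom', 25, 'male');
-- A matching Kafka message would then look like:
-- {"user": "tom", "site": "example.com", "time": "2020-11-11 11:11:11"}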
Finally, did you notice that fields such as `time` and `user` are wrapped in backticks in the INSERT statement? These are Flink SQL reserved keywords and must be escaped this way, otherwise you get an endless series of baffling SQL errors. One more thing: once the stream is joined with the dimension table like this, Flink effectively treats the job as a batch program, and batch programs eventually stop.
The program above does run, but my test still failed: when the stream table is joined with the dimension table this way, the job stops on its own, so what ends up in the database is not the complete result I wanted. Worse, the job stops even though Kafka is still up and data keeps flowing into the topic; it doesn't die with an error, it simply decides it is finished.
To really use the dimension table as a "dictionary" that enriches the stream, a different mechanism is needed: a lookup (temporal) join at the SQL level, or Flink's async I/O at the DataStream level. Something to dig into when I get the chance.
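For reference, a minimal sketch of the lookup-join variant (the standard pattern from the Flink docs, assuming Flink 1.12+; I have not verified it against this exact setup): give the Kafka table a processing-time attribute and join with FOR SYSTEM_TIME AS OF. Each incoming record then triggers a point lookup against MySQL instead of a one-shot bounded scan, so the job keeps running as a streaming job.

-- Kafka table gains a processing-time attribute (computed column):
CREATE TABLE KafkaTable (
  `user` STRING,
  `site` STRING,
  `time` STRING,
  proc_time AS PROCTIME()  -- required for the lookup join below
) WITH (
  -- same connector options as in the Java code above
  'connector' = 'kafka',
  'topic' = 'test-old',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Lookup join: dimTable is queried per record at processing time.
INSERT INTO wideTable
SELECT
  d.id,
  t.site,
  d.user_name,
  d.age,
  t.`time` AS ts
FROM KafkaTable AS t
LEFT JOIN dimTable FOR SYSTEM_TIME AS OF t.proc_time AS d
  ON d.user_name = t.`user`;

The JDBC connector also has lookup.cache.max-rows and lookup.cache.ttl options to cache lookup results, which is the usual way to avoid hammering MySQL with a query per record; the async-I/O route mentioned above is the DataStream-level equivalent of this lookup.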