This article walks through a two-table (products/orders) stream-join example built with Flink CDC 2.2 (the latest release at the time of writing), Flink 1.14, and the DataStream API.
Products table: lives in a MySQL source.
Orders table: lives in a PostgreSQL source.
The software versions used are:
| Software | Version |
|---|---|
| Flink | 1.14.4 |
| Flink CDC | 2.2 |
| MySQL | 8.0.24 |
| PostgreSQL | 12.10 |
The table and stream schemas are as follows.
Products table (products) columns:
[
id int primary key COMMENT 'primary key',
name varchar(255) not null COMMENT 'product name',
]
Orders table (orders) columns:
[
order_id int4 primary key COMMENT 'primary key',
order_name varchar(255) not null COMMENT 'order name',
order_date timestamp(3) not null COMMENT 'order time',
order_price numeric(10,2) not null COMMENT 'order price',
product_id int4 not null COMMENT 'product id'
]
Products (products) stream fields after flattening:
[
id: int, // primary key
name: string, // product name
op: string, // change-event operation type
ts_ms: long // event timestamp in milliseconds
]
Orders (orders) stream fields after flattening:
[
order_id: int, // order primary key
order_name: string, // order name
order_date: long, // order time as epoch milliseconds
order_price: double, // order price (a double because decimal.handling.mode is set to double in the source)
product_id: int, // product id
op: string, // change-event operation type
ts_ms: long // event timestamp in milliseconds
]
Connect to MySQL (e.g. with Navicat) and run the DDL and seed data below. Note that the MySQL server must have binlog enabled in ROW format for CDC to capture changes.
-- Create the database
create database flinkcdc_product_manage;
-- Switch to it
use flinkcdc_product_manage;
-- Create the products table
CREATE TABLE products (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT 'primary key' ,
name VARCHAR(255) NOT NULL COMMENT 'product name'
)AUTO_INCREMENT = 101 COMMENT = 'products table';
-- Insert two rows
INSERT INTO products VALUES (default,"篮球"),(default,"乒乒球");
Connect to PostgreSQL (e.g. with Navicat) and run the following DDL and seed data:
-- Create the database
create database flinkcdc_order_manage;
-- PostgreSQL has no USE statement: reconnect to flinkcdc_order_manage
-- (e.g. \c flinkcdc_order_manage in psql) before running the DDL below
-- Create the orders table
CREATE TABLE orders (
order_id SERIAL NOT NULL PRIMARY KEY,
order_name VARCHAR(255) NOT NULL,
order_date TIMESTAMP(3) NOT NULL ,
order_price DECIMAL(10, 2) NOT NULL,
product_id INTEGER NOT NULL
);
-- Add column and table comments
COMMENT ON COLUMN "public"."orders"."order_id" IS 'order primary key';
COMMENT ON COLUMN "public"."orders"."order_name" IS 'order name';
COMMENT ON COLUMN "public"."orders"."order_date" IS 'order time';
COMMENT ON COLUMN "public"."orders"."order_price" IS 'order price';
COMMENT ON COLUMN "public"."orders"."product_id" IS 'product id';
COMMENT ON TABLE "public"."orders" IS 'orders table';
-- Start order ids at 1001
ALTER SEQUENCE public.orders_order_id_seq RESTART WITH 1001;
-- Emit full row images so Debezium can populate "before" on UPDATE/DELETE
ALTER TABLE public.orders REPLICA IDENTITY FULL;
-- Insert seed data
INSERT INTO orders VALUES (default, '篮球订单', '2022-05-13 17:08:22', 88.88, 101);
Debezium encodes temporal columns such as TIMESTAMP(3) as epoch milliseconds; see:
https://debezium.io/documentation/reference/1.9/connectors/postgresql.html#postgresql-temporal-values
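Since Debezium delivers `order_date` as epoch milliseconds, downstream code typically has to convert it back to a date-time. A minimal sketch (the class and helper name are mine, not part of the project):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class TimestampDemo {
    // Convert Debezium's epoch-millis encoding of TIMESTAMP(3) back to a LocalDateTime.
    // TIMESTAMP without time zone is encoded as if the wall-clock value were UTC,
    // so decoding with UTC gives a lossless round trip.
    static LocalDateTime toDateTime(long epochMillis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        // 1652461702000L is the order_date value of the seeded row ('2022-05-13 17:08:22')
        System.out.println(toDateTime(1652461702000L)); // 2022-05-13T17:08:22
    }
}
```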
The parent project (datastream-etl-demo) pom.xml is as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>cn.mfox</groupId>
    <artifactId>datastream-etl-demo</artifactId>
    <packaging>pom</packaging>
    <version>1.0-SNAPSHOT</version>
    <modules>
        <module>telephone-etl-demo</module>
        <module>teacher-course-etl-demo</module>
        <module>datastream-common</module>
        <module>product-orders-etl-demo</module>
    </modules>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <flink.version>1.14.4</flink.version>
        <scala.binary.version>2.12</scala.binary.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_2.12</artifactId>
            <version>1.12.0</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.23</version>
        </dependency>
        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-sql-connector-mysql-cdc</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.80</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
The child project (product-orders-etl-demo) pom.xml is as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>datastream-etl-demo</artifactId>
        <groupId>cn.mfox</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>product-orders-etl-demo</artifactId>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>cn.mfox</groupId>
            <artifactId>datastream-common</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-sql-connector-postgres-cdc</artifactId>
            <version>2.2.0</version>
        </dependency>
    </dependencies>
</project>
The OpEnum.java enumeration of CDC op types is as follows:
package cn.mfox.common.enumeration;
/**
* Debezium CDC "op" field types
*
* @author hy
* @version 1.0
* @date 2022/5/12 17:28
*/
public enum OpEnum {
/**
* Insert
*/
CREATE("c", "create", "insert"),
/**
* Update
*/
UPDATE("u", "update", "update"),
/**
* Delete
*/
DELETE("d", "delete", "delete"),
/**
* Snapshot read
*/
READ("r", "read", "snapshot read");
/**
* Dictionary code: the one-letter op value in the Debezium event
*/
private String dictCode;
/**
* Human-readable value for the code
*/
private String dictValue;
/**
* Description
*/
private String description;
OpEnum(String dictCode, String dictValue, String description) {
this.dictCode = dictCode;
this.dictValue = dictValue;
this.description = description;
}
public String getDictCode() {
return dictCode;
}
public String getDictValue() {
return dictValue;
}
public String getDescription() {
return description;
}
}
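A reverse lookup from Debezium's one-letter code back to the enum constant is often useful when branching on `op`. A hedged sketch (the `fromDictCode` helper is my addition, shown against a trimmed copy of the enum rather than the project's class):

```java
import java.util.Arrays;

public class OpLookupDemo {
    // Trimmed copy of OpEnum, just enough to demonstrate a reverse lookup.
    enum Op {
        CREATE("c"), UPDATE("u"), DELETE("d"), READ("r");

        final String dictCode;

        Op(String dictCode) { this.dictCode = dictCode; }

        // Resolve Debezium's one-letter op code to the matching constant.
        static Op fromDictCode(String code) {
            return Arrays.stream(values())
                    .filter(op -> op.dictCode.equals(code))
                    .findFirst()
                    .orElseThrow(() -> new IllegalArgumentException("unknown op: " + code));
        }
    }

    public static void main(String[] args) {
        System.out.println(Op.fromDictCode("d")); // DELETE
    }
}
```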
The TransformUtil.java utility class is as follows:
package cn.mfox.common.utils;
import cn.mfox.common.enumeration.OpEnum;
import com.alibaba.fastjson.JSONObject;
/**
* Transformation utilities
*
* @author hy
* @version 1.0
* @date 2022/5/12 16:25
*/
public class TransformUtil {
/**
* Flatten an extracted Debezium change event,
* dropping the redundant before/after/source envelopes
*
* @param extractData raw event JSON string
* @return flattened JSONObject
*/
public static JSONObject formatResult(String extractData) {
JSONObject formatDataObj = new JSONObject();
JSONObject rawDataObj = JSONObject.parseObject(extractData);
formatDataObj.putAll(rawDataObj);
formatDataObj.remove("before");
formatDataObj.remove("after");
formatDataObj.remove("source");
String op = rawDataObj.getString("op");
if (OpEnum.DELETE.getDictCode().equals(op)) {
// Deletes carry the row image in the "before" struct
formatDataObj.putAll(rawDataObj.getJSONObject("before"));
} else {
// All other ops carry it in the "after" struct
formatDataObj.putAll(rawDataObj.getJSONObject("after"));
}
return formatDataObj;
}
}
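The branch in `formatResult` is the crux: deletes carry the row image in `before`, everything else in `after`. The same logic can be sketched with plain standard-library maps (class and method names are mine; fastjson's JSONObject behaves like a `Map<String, Object>` for this purpose):

```java
import java.util.HashMap;
import java.util.Map;

public class FlattenDemo {
    // Mirrors TransformUtil.formatResult: keep top-level fields, drop the
    // before/after/source envelopes, then merge in "before" for deletes ("d")
    // and "after" for every other op.
    @SuppressWarnings("unchecked")
    static Map<String, Object> flatten(Map<String, Object> raw) {
        Map<String, Object> out = new HashMap<>(raw);
        Map<String, Object> before = (Map<String, Object>) out.remove("before");
        Map<String, Object> after = (Map<String, Object>) out.remove("after");
        out.remove("source");
        out.putAll("d".equals(raw.get("op")) ? before : after);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> after = new HashMap<>();
        after.put("id", 101);
        after.put("name", "篮球");
        Map<String, Object> raw = new HashMap<>();
        raw.put("op", "r");
        raw.put("ts_ms", 1652690846199L);
        raw.put("after", after);
        System.out.println(flatten(raw).get("name")); // 篮球
    }
}
```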
The OrdersDataStream.java source class is as follows:
package cn.mfox.datastream;
import cn.mfox.common.utils.TransformUtil;
import com.alibaba.fastjson.JSONObject;
import com.ververica.cdc.connectors.postgres.PostgreSQLSource;
import com.ververica.cdc.debezium.DebeziumSourceFunction;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.Properties;
/**
* Extracts the orders-table source stream
*
* @author hy
* @version 1.0
* @date 2022/5/13 14:33
*/
public class OrdersDataStream {
/**
* Build the raw JSON source stream
*
* @param env stream execution environment
* @return source stream of raw Debezium JSON strings
*/
public static DataStreamSource<String> getJsonDebeziumDataStreamSource(StreamExecutionEnvironment env) {
// 1. Create the Flink PostgreSQL CDC source
Properties debeziumProperties = new Properties();
debeziumProperties.setProperty("decimal.handling.mode", "double");
DebeziumSourceFunction<String> postgresqlSource = PostgreSQLSource.<String>builder()
.hostname("192.168.18.102")
.port(5432)
.username("postgres")
.password("123456")
.database("flinkcdc_order_manage")
.schemaList("public")
.tableList("public.orders")
.deserializer(new JsonDebeziumDeserializationSchema())
.debeziumProperties(debeziumProperties)
.slotName("order")
.build();
// 2. Read from the CDC source
return env.addSource(postgresqlSource, "Orders Source");
}
/**
* Get the stream as flattened JSONObject records
*
* @param env stream execution environment
* @return JSONObject stream
*/
public static DataStream<JSONObject> getJSONObjectDebeziumDataStream(StreamExecutionEnvironment env) {
// 1. Extract the Debezium-format stream
DataStreamSource<String> debeziumDataStream = getJsonDebeziumDataStreamSource(env);
// 2. Flatten each event into a JSONObject
return debeziumDataStream.map(TransformUtil::formatResult);
}
}
Notes: `decimal.handling.mode` is set to `double` so Debezium emits numeric columns such as order_price as plain doubles rather than its default encoded form; `slotName` names the PostgreSQL replication slot the connector uses and must be unique per running job.
The ProductDataStream.java source class is as follows:
package cn.mfox.datastream;
import cn.mfox.common.utils.TransformUtil;
import com.alibaba.fastjson.JSONObject;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.time.Duration;
/**
* Extracts the products-table source stream
*
* Two stream variants:
* with watermarks, using ts_ms as the event timestamp
* without watermarks
*
* @author hy
* @version 1.0
* @date 2022/5/13 10:33
*/
public class ProductDataStream {
/**
* Stream with watermarks, using ts_ms as the event timestamp
*
* @param env stream execution environment
* @return products stream with watermarks
*/
public static DataStream<JSONObject> getDataStreamWithWatermark(StreamExecutionEnvironment env) {
WatermarkStrategy<String> watermarkStrategy
= WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L))
.withTimestampAssigner(
new SerializableTimestampAssigner<String>() {
@Override
public long extractTimestamp(String extractData, long l) {
return JSONObject.parseObject(extractData).getLong("ts_ms");
}
}
);
return getDataStream(env, watermarkStrategy);
}
/**
* Stream without watermarks
*
* @param env stream execution environment
* @return products stream without watermarks
*/
public static DataStream<JSONObject> getDataStreamNoWatermark(StreamExecutionEnvironment env) {
return getDataStream(env, WatermarkStrategy.noWatermarks());
}
/**
* Build the products stream
*
* @param env stream execution environment
* @param watermark watermark strategy to apply
* @return flattened JSONObject stream
*/
public static DataStream<JSONObject> getDataStream(StreamExecutionEnvironment env, WatermarkStrategy<String> watermark) {
// 1. Create the Flink MySQL CDC source
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
.hostname("192.168.18.101")
.port(3306)
.username("root")
.password("123456")
.databaseList("flinkcdc_product_manage")
.tableList("flinkcdc_product_manage.products")
.startupOptions(StartupOptions.initial())
.deserializer(new JsonDebeziumDeserializationSchema())
.serverTimeZone("Asia/Shanghai")
.build();
// 2. Read from the CDC source
DataStreamSource<String> mysqlDataStreamSource = env.fromSource(
mySqlSource, watermark,
"ProductDataStream Source"
);
// 3. Flatten each event into a JSONObject
return mysqlDataStreamSource.map(TransformUtil::formatResult);
}
}
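The `forBoundedOutOfOrderness(Duration.ofSeconds(1))` strategy used above tracks the highest `ts_ms` seen and emits watermarks trailing it by the configured delay; Flink's `BoundedOutOfOrdernessWatermarks` computes `maxTimestamp - delay - 1`. A stand-alone simulation of that rule (not Flink code; class and method names are mine):

```java
public class WatermarkDemo {
    // Simulates Flink's BoundedOutOfOrdernessWatermarks: the watermark trails
    // the maximum observed event timestamp by the allowed out-of-orderness,
    // minus 1 ms so an element exactly at the bound still counts as on time.
    private long maxTimestamp = Long.MIN_VALUE;
    private final long outOfOrdernessMillis;

    WatermarkDemo(long outOfOrdernessMillis) {
        this.outOfOrdernessMillis = outOfOrdernessMillis;
    }

    void onEvent(long tsMs) {
        maxTimestamp = Math.max(maxTimestamp, tsMs);
    }

    long currentWatermark() {
        return maxTimestamp - outOfOrdernessMillis - 1;
    }

    public static void main(String[] args) {
        WatermarkDemo wm = new WatermarkDemo(1000); // 1s, as in getDataStreamWithWatermark
        wm.onEvent(1652690846199L);
        wm.onEvent(1652690845000L); // a late event does not move the watermark back
        System.out.println(wm.currentWatermark()); // 1652690845198
    }
}
```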
Create an OrdersJsonDebeziumDataStreamTest.java test class with the following content:
package cn.mfox.etl.v1;
import cn.mfox.datastream.OrdersDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* Orders stream extraction test
*
* @author hy
* @version 1.0
* @date 2022/5/13 14:41
*/
public class OrdersJsonDebeziumDataStreamTest {
public static void main(String[] args) throws Exception {
// 1. Get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// 2. Read the orders table stream
DataStream<String> dataStream = OrdersDataStream.getJsonDebeziumDataStreamSource(env);
// 3. Print it
dataStream.print();
// 4. Run the job
env.execute();
}
}
Running the test prints the raw Debezium change event for the snapshot row:
{
"before": null,
"after": {
"order_id": 1001,
"order_name": "篮球订单",
"order_date": 1652461702000,
"order_price": 88.88,
"product_id": 101
},
"source": {
"version": "1.5.4.Final",
"connector": "postgresql",
"name": "postgres_cdc_source",
"ts_ms": 1652690846192,
"snapshot": "last",
"db": "flinkcdc_order_manage",
"sequence": "[null,\"39098752\"]",
"schema": "public",
"table": "orders",
"txId": 576,
"lsn": 39098752,
"xmin": null
},
"op": "r",
"ts_ms": 1652690846199,
"transaction": null
}
Conclusion: the raw Debezium envelope carries redundant before/after/source structure; flattening it with TransformUtil keeps only the row fields plus op and ts_ms.
Create an OrdersJsonObjectDebeziumDataStreamTest.java test class with the following content:
package cn.mfox.etl.v1;
import cn.mfox.datastream.OrdersDataStream;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* Orders stream extraction test, returning records as JSONObject
*
* @author hy
* @version 1.0
* @date 2022/5/16 15:41
*/
public class OrdersJsonObjectDebeziumDataStreamTest {
public static void main(String[] args) throws Exception {
// 1. Get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// 2. Read the orders table stream as JSONObjects
DataStream<JSONObject> dataStream = OrdersDataStream.getJSONObjectDebeziumDataStream(env);
// 3. Print it
dataStream.print();
// 4. Run the job
env.execute();
}
}
Create a ProductInnerJoinOrderByProcessTimeTest.java class with the following content:
package cn.mfox.etl.v2.join.window;
import cn.mfox.datastream.OrdersDataStream;
import cn.mfox.datastream.ProductDataStream;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
/**
* Inner join of products and orders
* Processing-time window join: only matched records are emitted
*
* @author hy
* @version 1.0
* @date 2022/5/16 17:07
*/
public class ProductInnerJoinOrderByProcessTimeTest {
public static void main(String[] args) throws Exception {
// 1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// 2. Get both streams
DataStream<JSONObject> productDataStream = ProductDataStream.getDataStreamNoWatermark(env);
DataStream<JSONObject> orderDataStream = OrdersDataStream.getJSONObjectDebeziumDataStream(env);
// 3. Window-join the product and order streams and print the result
windowInnerJoinAndPrint(productDataStream, orderDataStream);
// 4. Run the job
env.execute("ProductInnerJoinOrderTest Job");
}
/**
* Window-join the two streams and print the result
* Window joins only support inner join: pairs matched inside the window are emitted,
* unmatched records are dropped. For an outer join over windows, use the coGroup operator.
*
* @param productDataStream products stream
* @param orderDataStream orders stream
*/
private static void windowInnerJoinAndPrint(DataStream<JSONObject> productDataStream,
DataStream<JSONObject> orderDataStream) {
DataStream<JSONObject> innerJoinDataStream = productDataStream
.join(orderDataStream)
.where(product -> product.getInteger("id"))
.equalTo(order -> order.getInteger("product_id"))
.window(TumblingProcessingTimeWindows.of(Time.seconds(3L)))
.apply(
new JoinFunction<JSONObject, JSONObject, JSONObject>() {
@Override
public JSONObject join(JSONObject jsonObject,
JSONObject jsonObject2) {
// Merge the two records
jsonObject.putAll(jsonObject2);
return jsonObject;
}
}
);
innerJoinDataStream.print("Window Inner Join By Process Time");
}
}
Console output:
Window Inner Join By Process Time> {"op":"r","order_date":1652461702000,"product_id":101,"name":"篮球","order_price":88.88,"id":101,"order_id":1001,"ts_ms":1652692946143,"order_name":"篮球订单"}
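The 3-second tumbling processing-time windows place each record in the window whose start is its timestamp rounded down to a multiple of the window size (matching Flink's `TimeWindow.getWindowStartWithOffset` with offset 0, for non-negative timestamps). A toy illustration (names are mine):

```java
public class TumblingWindowDemo {
    // Start of the tumbling window (size in ms) that a given non-negative
    // timestamp falls into: the timestamp rounded down to a multiple of the size.
    static long windowStart(long tsMs, long sizeMs) {
        return tsMs - (tsMs % sizeMs);
    }

    public static void main(String[] args) {
        long size = 3000; // Time.seconds(3), as in the join above
        // ts_ms of the joined record printed above
        System.out.println(windowStart(1652692946143L, size)); // 1652692944000
    }
}
```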
Notes: only the matched pair is emitted: product 101 joined with order 1001. Product 102, which has no order, is dropped by the inner join.
Create a ProductOuterJoinOrderByProcessTimeTest.java class with the following content:
package cn.mfox.etl.v2.join.window;
import cn.mfox.datastream.OrdersDataStream;
import cn.mfox.datastream.ProductDataStream;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
* Outer join of products and orders
* Processing-time window coGroup: emits matched records as well as unmatched single-side records
*
* @author hy
* @version 1.0
* @date 2022/5/16 17:30
*/
public class ProductOuterJoinOrderByProcessTimeTest {
public static void main(String[] args) throws Exception {
// 1. Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// 2. Get both streams
DataStream<JSONObject> productDataStream = ProductDataStream.getDataStreamNoWatermark(env);
DataStream<JSONObject> orderDataStream = OrdersDataStream.getJSONObjectDebeziumDataStream(env);
// 3. Window outer join the product and order streams and print the result
windowOuterJoinAndPrint(productDataStream, orderDataStream);
// 4. Run the job
env.execute("WindowOuterJoinByProcessTimeTest Job");
}
/**
* Window outer join and print
* Implemented with the coGroup operator, which also forwards records that found no match in the window
*
* @param productDataStream products stream
* @param orderDataStream orders stream
*/
private static void windowOuterJoinAndPrint(DataStream<JSONObject> productDataStream,
DataStream<JSONObject> orderDataStream) {
DataStream<JSONObject> outerJoinDataStream = productDataStream
.coGroup(orderDataStream)
.where(product -> product.getInteger("id"))
.equalTo(order -> order.getInteger("product_id"))
.window(TumblingProcessingTimeWindows.of(Time.seconds(3L)))
.apply(
new CoGroupFunction<JSONObject, JSONObject, JSONObject>() {
@Override
public void coGroup(Iterable<JSONObject> iterable,
Iterable<JSONObject> iterable1,
Collector<JSONObject> collector) {
JSONObject result = new JSONObject();
for (JSONObject jsonObject : iterable) {
result.putAll(jsonObject);
}
for (JSONObject jsonObject : iterable1) {
result.putAll(jsonObject);
}
collector.collect(result);
}
}
);
outerJoinDataStream.print("Window Outer Join By Process Time");
}
}
Console output:
Window Outer Join By Process Time> {"op":"r","order_date":1652461702000,"product_id":101,"name":"篮球","order_price":88.88,"id":101,"order_id":1001,"ts_ms":1652693277047,"order_name":"篮球订单"}
Window Outer Join By Process Time> {"op":"r","name":"乒乒球","id":102,"ts_ms":1652693276870}
Notes: unlike the inner join, the coGroup-based outer join also emits product 102 (second line above), which has no matching order, with only its product-side fields populated.