Table of Contents
Business Background
Technology Selection
Technical Feasibility Study
Code Implementation
Pitfalls Encountered
We need to run very complex queries over product attributes that are spread across five or six tables, so the data has to be extracted into Elasticsearch to make filtering and querying practical. Because the business also requires near-real-time freshness, we chose Flink for real-time, online synchronization.
In the first iteration we used Flink SQL, joining all of the tables and writing the result into Elasticsearch.
CREATE TEMPORARY TABLE es_sink_beesforce_poc_list (
id STRING,
pocMiddleId STRING,
...,
validationType INT,
PRIMARY KEY (id) NOT ENFORCED -- The primary key is optional; if defined, it is used as the document ID, otherwise the document ID is generated randomly.
) WITH (
'connector' = 'elasticsearch-7',
'index' = '*****',
'hosts' = '*****',
'username' ='*****',
'password' ='*****'
);
INSERT INTO es_sink_beesforce_poc_list WITH -- channel level 1
channel_info_level_1 AS (
SELECT
channel.channel_code,
channel.channel_name,
channel.channel_code AS parent_channel_code,
channel.channel_name AS parent_channel_name
FROM
`****`.`****`.poc_channel_info /*+ OPTIONS('server-id'='5400-5409') */
AS channel
WHERE
channel.channel_level = 1
),-- channel level 2
channel_info_level_2 AS (
SELECT
channel.channel_code,
channel.channel_name,
concat( parent.parent_channel_code, ',', channel.channel_code ) AS parent_channel_code,
concat( parent.parent_channel_name, '-', channel.channel_name ) AS parent_channel_name
FROM
`****`.`****`.poc_channel_info /*+ OPTIONS('server-id'='5410-5419') */
AS channel
LEFT JOIN channel_info_level_1 AS parent ON channel.parent_channel_code = parent.channel_code
WHERE
channel.channel_level = 2
),-- channel level 3
channel_info_level_3 AS (
SELECT
channel.channel_code,
channel.channel_name,
concat( parent.parent_channel_code, ',', channel.channel_code ) AS parent_channel_code,
concat( parent.parent_channel_name, '-', channel.channel_name ) AS parent_channel_name
FROM
`****`.`****`.poc_channel_info /*+ OPTIONS('server-id'='5420-5429') */
AS channel
LEFT JOIN channel_info_level_2 AS parent ON channel.parent_channel_code = parent.channel_code
WHERE
channel.channel_level = 3
),-- channel level 4
channel_info_level_4 AS (
SELECT
channel.channel_code,
channel.channel_name,
concat( parent.parent_channel_code, ',', channel.channel_code ) AS parent_channel_code,
concat( parent.parent_channel_name, '-', channel.channel_name ) AS parent_channel_name
FROM
`****`.`****`.poc_channel_info /*+ OPTIONS('server-id'='5430-5439') */
AS channel
LEFT JOIN channel_info_level_3 AS parent ON channel.parent_channel_code = parent.channel_code
WHERE
channel.channel_level = 4
),-- union of all channel levels
channel_info_level AS (
SELECT * FROM channel_info_level_1
UNION ALL SELECT * FROM channel_info_level_2
UNION ALL SELECT * FROM channel_info_level_3
UNION ALL SELECT * FROM channel_info_level_4
)
SELECT
concat(
poc_info.poc_middle_id,
'_',
IF
( salesman_ref.id IS NOT NULL, salesman_ref.id, '' )) AS id,
poc_info.poc_middle_id AS pocMiddleId,
...,
poc_info.validation_type AS validationType
FROM
`****`.`****`.poc_base_info /*+ OPTIONS('server-id'='5440-5449') */
AS poc_info
LEFT JOIN (
SELECT
label_ref.poc_middle_id,
LISTAGG ( label_info.label_code ) label_code,
LISTAGG ( label_info.label_name ) label_name
FROM
`****`.`****`.poc_label_ref /*+ OPTIONS('server-id'='5450-5459') */
AS label_ref
INNER JOIN `****`.`****`.poc_label_info /*+ OPTIONS('server-id'='5460-5469') */
AS label_info ON label_ref.label_code = label_info.label_code
AND label_ref.deleted = 0
GROUP BY
label_ref.poc_middle_id
) label_info ON poc_info.poc_middle_id = label_info.poc_middle_id
LEFT JOIN channel_info_level AS channel_info ON poc_info.channel_format_code = channel_info.channel_code
LEFT JOIN `****`.`****`.poc_salesman_ref /*+ OPTIONS('server-id'='5470-5479') */
AS salesman_ref ON poc_info.poc_middle_id = salesman_ref.poc_middle_id
LEFT JOIN `****`.`****`.poc_extend_info /*+ OPTIONS('server-id'='5480-5489') */
AS extend_info ON poc_info.poc_middle_id = extend_info.poc_middle_id
LEFT JOIN `****`.`****`.wccs_dict_info /*+ OPTIONS('server-id'='5490-5499') */
AS wccs_chain ON extend_info.wccs_chain_code_2022_version = wccs_chain.dict_code
AND wccs_chain.dict_type = 4
LEFT JOIN `****`.`****`.wccs_dict_info /*+ OPTIONS('server-id'='5500-5509') */
AS wccs_grade ON extend_info.wccs_grade_code_2022_version = wccs_grade.dict_code
AND wccs_grade.dict_type = 6
LEFT JOIN `****`.`****`.poc_bees_project_info /*+ OPTIONS('server-id'='6300-6309') */
AS bees_project_info ON poc_info.poc_middle_id = bees_project_info.poc_middle_id
AND bees_project_info.deleted = 0
WHERE
poc_info.deleted = 0
AND poc_info.poc_middle_id IS NOT NULL;
As development and testing progressed we found that Flink SQL has a number of shortcomings, and for some of them the Flink community has no mature solution. Specifically:
1. In a Flink SQL dual-stream join, the historical data of both sides is kept in state, and that state has a time-to-live (36 hours by default in our environment). When the job starts, all of the involved table data is loaded into state; after 36 hours an entry expires (an update to the row resets its 36-hour TTL). Once the state has expired, a change read from the MySQL binlog can no longer find its join counterpart, so part of the data is silently lost.
2. If several Flink jobs start at the same time and listen to the same MySQL instance, errors occur frequently and cause the jobs to restart (this can be fixed by specifying the binlog server-id with the /*+ OPTIONS('server-id'='5430-5439') */ hint). The root cause is that the MySQL CDC source impersonates a slave of the MySQL cluster; from the master's point of view each slave must have a unique id, the server-id, and each server-id tracks its own binlog position. If several slaves share the same server-id, the data they pull gets mixed up. When no server-id is specified, a random one is assigned, which makes collisions likely, so it is best for the teams involved to agree on unique server-id ranges up front.
3. Flink SQL itself cannot write Elasticsearch nested fields directly (we later found this can be worked around with a user-defined function uploaded as a JAR; a sketch follows below).
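For reference, the UDF workaround mentioned in point 3 can look roughly like the sketch below (the class, function and field names are made up for illustration, not taken from our project): a ScalarFunction that folds the aggregated label codes and names into an ARRAY&lt;ROW&lt;...&gt;&gt;, which the connector serializes as a JSON array of objects. The nested mapping itself still has to be declared on the Elasticsearch index.
import org.apache.flink.table.annotation.DataTypeHint;
import org.apache.flink.table.functions.ScalarFunction;
import org.apache.flink.types.Row;

// On a self-managed setup this can be registered with
// tableEnv.createTemporarySystemFunction("LABELS_TO_NESTED", LabelsToNested.class);
// on a managed platform the class is uploaded as a JAR and registered there.
// It is then called in SQL as LABELS_TO_NESTED(label_code, label_name).
public class LabelsToNested extends ScalarFunction {

    @DataTypeHint("ARRAY<ROW<labelCode STRING, labelName STRING>>")
    public Row[] eval(String codes, String names) {
        if (codes == null || names == null) {
            return new Row[0];
        }
        String[] codeArr = codes.split(",");
        String[] nameArr = names.split(",");
        Row[] rows = new Row[Math.min(codeArr.length, nameArr.length)];
        for (int i = 0; i < rows.length; i++) {
            // Each ROW becomes one object in the nested array field.
            rows[i] = Row.of(codeArr[i], nameArr[i]);
        }
        return rows;
    }
}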
Because problem 1 could not be solved, the team decided to switch to a Flink DataStream job.
To let the tables be synchronized independently, we dropped the join-everything-then-write approach and instead update Elasticsearch every time a binlog record is read. This requires that, within a single ES document, the writes coming from the different tables are fully decoupled from one another. To cover the case where a child-table record is picked up by Flink before its parent record, we also introduced a new esLastOptTime field: whether or not the parent record has been seen yet, we upsert a document into ES. A document created from a child record contains only that table's business fields and no id field; the id field (the business-system id, not the ES document _id) is only filled in once the parent record is read.
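To illustrate the upsert pattern behind this decoupling, a handler for one child-table record might look roughly like the sketch below (class, index and field names are assumptions for illustration; the real job code appears later in this post). Whichever table's record arrives first, the doc()/upsert() pair guarantees that the document exists and that only the fields owned by that table are touched.
import java.util.Map;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.Requests;

public class PocUpsertSketch {
    // Build an upsert for one child-table row keyed by poc_middle_id.
    public static UpdateRequest buildUpsert(String indexName, Map<String, Object> childRow) {
        // Record when this document was last touched from the binlog stream.
        childRow.put("esLastOptTime", System.currentTimeMillis());
        // poc_middle_id is the business key shared by the parent and child tables.
        String docId = String.valueOf(childRow.get("poc_middle_id"));
        IndexRequest insertIfAbsent = Requests.indexRequest(indexName)
                .id(docId)
                .source(childRow);
        // doc() is applied when the document already exists, upsert() when it does not,
        // so the order in which parent and child records arrive no longer matters.
        return new UpdateRequest(indexName, docId)
                .doc(insertIfAbsent)
                .upsert(insertIfAbsent);
    }
}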
Maven dependencies (the provided scopes must be commented out for the project to start locally):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>com.abi</groupId>
        <artifactId>abi-cloud-beesforce-data-board</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>

    <artifactId>abi-cloud-beesforce-data-board-flink</artifactId>
    <name>abi-cloud-beesforce-data-board-flink</name>
    <description>abi api project</description>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.13.1</flink.version>
        <target.java.version>1.8</target.java.version>
        <scala.binary.version>2.11</scala.binary.version>
        <maven.compiler.source>${target.java.version}</maven.compiler.source>
        <maven.compiler.target>${target.java.version}</maven.compiler.target>
        <log4j.version>2.12.4</log4j.version>
    </properties>

    <repositories>
        <repository>
            <id>apache.snapshots</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases><enabled>false</enabled></releases>
            <snapshots><enabled>true</enabled></snapshots>
        </repository>
    </repositories>

    <dependencies>
        <!-- Flink core (provided by the cluster; comment out the scope to run locally) -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <!-- Connectors -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-elasticsearch7_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-connector-mysql-cdc</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-base</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-common</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- Logging -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>
        <!-- Utilities -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.24</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid</artifactId>
            <version>1.1.10</version>
        </dependency>
        <dependency>
            <groupId>cn.hutool</groupId>
            <artifactId>hutool-json</artifactId>
            <version>5.7.16</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.9</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${target.java.version}</source>
                    <target>${target.java.version}</target>
                </configuration>
            </plugin>
            <!-- Build a fat jar, excluding classes that the Flink runtime already provides -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals><goal>shade</goal></goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.apache.flink:force-shading</exclude>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>org.apache.logging.log4j:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <pluginManagement>
            <plugins>
                <!-- Keeps Eclipse m2e from flagging the shade/compiler plugin executions -->
                <plugin>
                    <groupId>org.eclipse.m2e</groupId>
                    <artifactId>lifecycle-mapping</artifactId>
                    <version>1.0.0</version>
                    <configuration>
                        <lifecycleMappingMetadata>
                            <pluginExecutions>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>org.apache.maven.plugins</groupId>
                                        <artifactId>maven-shade-plugin</artifactId>
                                        <versionRange>[3.1.1,)</versionRange>
                                        <goals><goal>shade</goal></goals>
                                    </pluginExecutionFilter>
                                    <action><ignore/></action>
                                </pluginExecution>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>org.apache.maven.plugins</groupId>
                                        <artifactId>maven-compiler-plugin</artifactId>
                                        <versionRange>[3.1,)</versionRange>
                                        <goals>
                                            <goal>testCompile</goal>
                                            <goal>compile</goal>
                                        </goals>
                                    </pluginExecutionFilter>
                                    <action><ignore/></action>
                                </pluginExecution>
                            </pluginExecutions>
                        </lifecycleMappingMetadata>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
Log level configuration, log4j2.properties:
################################################################################
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
rootLogger.level = INFO
rootLogger.appenderRef.console.ref = ConsoleAppender
appender.console.name = ConsoleAppender
appender.console.type = CONSOLE
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n
Entity class for the binlog data stream:
@NoArgsConstructor
@Data
public class BinlogStreamBean implements Serializable {
@JsonProperty("before")
private Map<String, Object> before;
@JsonProperty("after")
private Map<String, Object> after;
@JsonProperty("source")
private SourceDTO source;
@JsonProperty("op")
private String op;
@JsonProperty("ts_ms")
private Long tsMs;
@JsonProperty("transaction")
private Object transaction;
@NoArgsConstructor
@Data
public static class SourceDTO {
@JsonProperty("version")
private String version;
@JsonProperty("connector")
private String connector;
@JsonProperty("name")
private String name;
@JsonProperty("ts_ms")
private Integer tsMs;
@JsonProperty("snapshot")
private String snapshot;
@JsonProperty("db")
private String db;
@JsonProperty("sequence")
private Object sequence;
@JsonProperty("table")
private String table;
@JsonProperty("server_id")
private Integer serverId;
@JsonProperty("gtid")
private Object gtid;
@JsonProperty("file")
private String file;
@JsonProperty("pos")
private Integer pos;
@JsonProperty("row")
private Integer row;
@JsonProperty("thread")
private Object thread;
@JsonProperty("query")
private Object query;
}
}
Part of the synchronization code:
@Slf4j
public class SinkToEsSteaming {
public static void main(String[] args) {
//1. Parse the arguments passed to the main method
ParameterTool parameterTool = ParameterTool.fromArgs(args);
//2. Data processing: turn each binlog record into Elasticsearch requests
ElasticsearchSinkFunction<String> elasticsearchSinkFunction = (row, ctx, indexer) -> {
//1- Prepare the data
//Fetch the global job parameters
ParameterTool parameterTool1 = (ParameterTool) ctx.getExecutionConfig().getGlobalJobParameters();
//Index name
String indexName = "es_sink_jar_beesforce_poc_list" + EnvUtil.getUnderscoreEnv(parameterTool1.get("env"));
//Binlog payload
BinlogStreamBean binlogStreamBean = JSONUtil.toBean(row, BinlogStreamBean.class);
//2- Determine the actual operation type
String optType = binlogStreamBean.getOp();
if("u".equals(optType)){
String deleted = String.valueOf(binlogStreamBean.getAfter().get("deleted"));
//Treat a logical (soft) delete as a delete
if("1".equals(deleted)){
optType = "d";
}else{
optType = "u";
}
}
//3- Build the request for this operation
List<UpdateRequest> updateRequestList = Lists.newArrayList();
Map<String, Object> data = Maps.newHashMap();
if("c".equals(optType)){
data = binlogStreamBean.getAfter();
}else if("u".equals(optType)){
data = binlogStreamBean.getAfter();
}else if("d".equals(optType)){
data = binlogStreamBean.getBefore();
}else if ("r".equals(optType)) {
//When reading the initial snapshot, only rows with deleted = 0 are needed
String deleted = String.valueOf(binlogStreamBean.getAfter().get("deleted"));
if (!"1".equals(deleted)) {
data = binlogStreamBean.getAfter();
}
}
//Time of the last write to Elasticsearch
data.put("esLastOptTime",new Date().getTime());
//id
String id = IStringUtil.valueOf(data.get("poc_middle_id"));
IndexRequest indexRequest = Requests.indexRequest(indexName)
.id(id)
.source(data);
UpdateRequest updateRequest = new UpdateRequest(indexName, id)
.doc(indexRequest)
.upsert(indexRequest);
updateRequestList.add(updateRequest);
//4- Hand the requests to the indexer
for (UpdateRequest esUpdateRequest : updateRequestList) {
indexer.add(esUpdateRequest);
}
};
//3. Configure the Elasticsearch connection and the sink function
ElasticsearchSink<String> esSink = getEsSink(elasticsearchSinkFunction, parameterTool);
//4. Create the MySQL binlog source stream and attach the sink
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(3000);
DataStreamSource<String> dataStreamSource = env
.fromSource(getMysqlSource(parameterTool), WatermarkStrategy.noWatermarks(), "MySQL Source");
//Register the global job parameters
env.getConfig().setGlobalJobParameters(parameterTool);
String parallelism = parameterTool.get("parallelism");
if (StringUtils.isNotBlank(parallelism)) {
env.setParallelism(Integer.parseInt(parallelism));
}
dataStreamSource.addSink(esSink);
try {
dataStreamSource.getExecutionEnvironment().execute();
} catch (Exception e) {
log.info("启动失败");
}
}
private static MySqlSource<String> getMysqlSource(ParameterTool parameterTool) {
String mysqlHost = parameterTool.get("mysqlHost");
String mysqlPort = parameterTool.get("mysqlPort");
String mysqlUsername = parameterTool.get("mysqlUsername");
String mysqlPassword = parameterTool.get("mysqlPassword");
String serverId = parameterTool.get("serverId");
String pocBaseInfo = "poc_base_info";
String pocLabelRef = "poc_label_ref";
String pocSalesmanRef = "poc_salesman_ref";
String pocExtendInfo = "poc_extend_info";
String pocBeesProjectInfo = "poc_bees_project_info";
String pocWholesalerRef = "poc_wholesaler_ref";
String databaseMiddle = "abi-cloud-middle-platform-poc" + EnvUtil.getHorizontalEnv(parameterTool.get("env"));
//Create the deserializer
Map<String, Object> config = Maps.newHashMap();
config.put(JsonConverterConfig.DECIMAL_FORMAT_CONFIG, DecimalFormat.NUMERIC.name());
JsonDebeziumDeserializationSchema jdd = new JsonDebeziumDeserializationSchema(false, config);
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
.hostname(mysqlHost)
.port(Integer.parseInt(mysqlPort))
.databaseList(databaseMiddle)
.tableList(
MessageFormat.format("{0}.{1}", databaseMiddle, pocBaseInfo),
MessageFormat.format("{0}.{1}", databaseMiddle, pocLabelRef),
MessageFormat.format("{0}.{1}", databaseMiddle, pocSalesmanRef),
MessageFormat.format("{0}.{1}", databaseMiddle, pocExtendInfo),
MessageFormat.format("{0}.{1}", databaseMiddle, pocBeesProjectInfo),
MessageFormat.format("{0}.{1}", databaseMiddle, pocWholesalerRef)
)
.username(mysqlUsername)
.password(mysqlPassword)
.startupOptions(StartupOptions.initial())
.deserializer(jdd)
.serverId(serverId)
.build();
return mySqlSource;
}
private static ElasticsearchSink<String> getEsSink(ElasticsearchSinkFunction<String> elasticsearchSinkFunction, ParameterTool parameterTool) {
String esAddress = parameterTool.get("esAddress");
String esPort = parameterTool.get("esPort");
String esUsername = parameterTool.get("esUsername");
String esPassword = parameterTool.get("esPassword");
//1-httpHosts
List<HttpHost> httpHosts = new ArrayList<>();
httpHosts.add(new HttpHost(esAddress, Integer.parseInt(esPort), "http"));
//2-restClientFactory
RestClientFactory restClientFactory = restClientBuilder -> {
Node node = new Node(new HttpHost(esAddress, Integer.parseInt(esPort), "https"));
List<Node> nodes = new ArrayList<>();
nodes.add(node);
Header[] header = new Header[1];
BasicHeader authHeader = new BasicHeader("Authorization", "Basic " + Base64.encode((esUsername + ":" + esPassword).getBytes()));
header[0] = authHeader;
restClientBuilder.setDefaultHeaders(header);
restClientBuilder.build().setNodes(
nodes
);
};
//3- Build the ElasticsearchSink
ElasticsearchSink.Builder<String> esSinkBuilder = new ElasticsearchSink.Builder<>(
httpHosts, elasticsearchSinkFunction
);
esSinkBuilder.setRestClientFactory(restClientFactory);
//Enable backoff retries for bulk flushes
esSinkBuilder.setBulkFlushBackoff(true);
//Maximum number of actions per bulk request
esSinkBuilder.setBulkFlushMaxActions(3000);
//Maximum size of a bulk request, in MB
esSinkBuilder.setBulkFlushMaxSizeMb(50);
//Bulk flush interval, in milliseconds
esSinkBuilder.setBulkFlushInterval(100);
//Number of backoff retries
esSinkBuilder.setBulkFlushBackoffRetries(1);
//Delay between backoff retries, in milliseconds
esSinkBuilder.setBulkFlushBackoffDelay(2000L);
//Backoff strategy. CONSTANT: e.g. with a 2s delay and 3 retries, attempts happen at 2s -> 4s -> 6s; EXPONENTIAL: e.g. with a 2s delay and 3 retries, attempts happen at 2s -> 4s -> 8s
esSinkBuilder.setBulkFlushBackoffType(ElasticsearchSinkBase.FlushBackoffType.CONSTANT);
//Failure handler: retry requests rejected because the bulk queue was full
esSinkBuilder.setFailureHandler(new RetryRejectedExecutionFailureHandler());
ElasticsearchSink<String> esSink = esSinkBuilder.build();
return esSink;
}
}
Sample arguments for the main method:
--esAddress
********
--esPort
********
--esUsername
********
--esPassword
********
--mysqlHost
********
--mysqlJdbcUrl
********
--mysqlPort
********
--mysqlDriver
********
--mysqlUsername
********
--mysqlPassword
********
--env
dev
--serverId
5500-5529
1. The Flink TaskManager and JobManager environments are isolated from each other; a variable defined in the body of the main method cannot be read inside a process/sink function (this can be solved with Flink's built-in ParameterTool by registering the values as global job parameters, as in the sketch below).
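A minimal sketch of that ParameterTool pattern (the class and parameter names here are assumed for illustration): register the parsed arguments as global job parameters in main, then read them back from the runtime context inside any rich function.
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class GlobalParamsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        // Make the arguments visible to every operator, not just the main method.
        env.getConfig().setGlobalJobParameters(parameterTool);
        env.fromElements("a", "b").addSink(new PrintEnvSink());
        env.execute("global-params-example");
    }

    // A rich function reads the global parameters back from its runtime context.
    private static class PrintEnvSink extends RichSinkFunction<String> {
        private String envName;

        @Override
        public void open(Configuration parameters) {
            ParameterTool globalParams =
                    (ParameterTool) getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
            envName = globalParams.get("env", "dev");
        }

        @Override
        public void invoke(String value, Context context) {
            System.out.println(envName + ": " + value);
        }
    }
}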
2. A custom deserializer must be configured to parse the binlog records, otherwise some fields, such as DECIMAL columns, end up encoded (as Base64 bytes) instead of plain numbers:
Map<String, Object> config = Maps.newHashMap();
config.put(JsonConverterConfig.DECIMAL_FORMAT_CONFIG, DecimalFormat.NUMERIC.name());
JsonDebeziumDeserializationSchema jdd = new JsonDebeziumDeserializationSchema(false, config);
3. Nested fields must be handled separately; updating a nested field requires a Painless script that covers the various edge cases:
public static List<UpdateRequest> buildNestedUpdateRequest(String indexName, String id,
String fieldName, String optType,
String nestedIdField, String nestedId, Map<String, Object> nestedObj){
List<UpdateRequest> updateRequestList = new ArrayList<>();
if (StringUtils.isBlank(id) || null == nestedObj || nestedObj.isEmpty()) {
return updateRequestList;
}
//1- Make sure the document exists: create it if missing, otherwise apply a harmless empty update
List<UpdateRequest> createDocIfNotExistReqList = buildUpdateRequest(indexName, id, new HashMap<>());
updateRequestList.addAll(createDocIfNotExistReqList);
//2- Build the real update request
Script script = null;
//Build a different script depending on the operation type
if("u".equals(optType) || "c".equals(optType)){
//Script body: create the nested array if absent, remove any existing entry with the same id, then add the new entry
String code = String.format("if (ctx._source.%s == null) {ctx._source.%s = [];}ctx._source.%s.removeIf(item -> item.%s == params.detail.%s);ctx._source.%s.add(params.detail);",
fieldName,
fieldName,
fieldName,
nestedIdField,
nestedIdField,
fieldName);
//Script parameters
Map<String, Object> paramMap = new HashMap<>();
paramMap.put("detail", nestedObj);
//Build the script
script = new Script(ScriptType.INLINE, "painless", code, paramMap);
}else if("d".equals(optType)){
//Script body: remove the entry with the given id, and drop the field entirely once it becomes empty
String code = String.format("if(ctx._source.%s==null){return;}ctx._source.%s.removeIf(item -> item.%s == params.nestedId);if(ctx._source.%s.size()==0){ctx._source.remove('%s')}",
fieldName,
fieldName,
nestedIdField,
fieldName,
fieldName);
//Script parameters
Map<String, Object> paramMap = new HashMap<>();
paramMap.put("nestedId", nestedId);
//Build the script
script = new Script(ScriptType.INLINE, "painless", code, paramMap);
}
//Create the update request
UpdateRequest updateRequest = new UpdateRequest(indexName, id)
.script(script);
updateRequestList.add(updateRequest);
return updateRequestList;
}
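For context, a hypothetical call site for this helper (the field, index and variable names here are assumptions, reusing names from the sink function shown earlier): folding one poc_salesman_ref row into a salesmanList nested field of its parent POC document.
//Inside the sink function, after extracting the row and operation type from the binlog record:
Map<String, Object> salesman = binlogStreamBean.getAfter();
List<UpdateRequest> requests = buildNestedUpdateRequest(
        indexName,                                           // target ES index
        IStringUtil.valueOf(salesman.get("poc_middle_id")),  // parent document id
        "salesmanList",                                      // nested field in the document
        optType,                                             // binlog operation: c / u / d
        "id",                                                // id field inside each nested object
        IStringUtil.valueOf(salesman.get("id")),             // id of this nested object
        salesman);                                           // the nested object itself
//Hand every request to the indexer, just like the flat-field requests.
for (UpdateRequest request : requests) {
    indexer.add(request);
}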