Hands-On Development with Alibaba Cloud Realtime Compute for Apache Flink

Table of Contents

Business Background

Technology Selection

Technical Feasibility Study

Code Implementation

Pitfalls Encountered


Business Background

        We need to run very complex queries on product attributes that are scattered across five or six tables, so the data has to be extracted into Elasticsearch to make filtering and querying convenient. Because the business also has high real-time requirements, Flink was chosen for real-time, online synchronization.

Technology Selection

        In the initial phase we developed the job with Flink SQL, joining all of the tables and writing the result into Elasticsearch:

CREATE TEMPORARY TABLE es_sink_beesforce_poc_list (
  id STRING,
  pocMiddleId STRING,
  ...,
  validationType INT,
  PRIMARY KEY (id) NOT ENFORCED -- The primary key is optional; if defined, it is used as the document ID, otherwise a random document ID is generated.
) WITH (
  'connector' = 'elasticsearch-7',
  'index' = '*****',
  'hosts' = '*****',
  'username' ='*****',
  'password' ='*****'
);

INSERT INTO es_sink_beesforce_poc_list WITH -- channel hierarchy, level 1
channel_info_level_1 AS (
	SELECT
		channel.channel_code,
		channel.channel_name,
		channel.channel_code AS parent_channel_code,
		channel.channel_name AS parent_channel_name 
	FROM
		`****`.`****`.poc_channel_info /*+ OPTIONS('server-id'='5400-5409') */
		AS channel 
	WHERE
		channel.channel_level = 1 
	), -- channel hierarchy, level 2
	channel_info_level_2 AS (
	SELECT
		channel.channel_code,
		channel.channel_name,
		concat( parent.parent_channel_code, ',', channel.channel_code ) AS parent_channel_code,
		concat( parent.parent_channel_name, '-', channel.channel_name ) AS parent_channel_name 
	FROM
		`****`.`****`.poc_channel_info /*+ OPTIONS('server-id'='5410-5419') */
		AS channel
		LEFT JOIN channel_info_level_1 AS parent ON channel.parent_channel_code = parent.channel_code 
	WHERE
		channel.channel_level = 2 
	), -- channel hierarchy, level 3
	channel_info_level_3 AS (
	SELECT
		channel.channel_code,
		channel.channel_name,
		concat( parent.parent_channel_code, ',', channel.channel_code ) AS parent_channel_code,
		concat( parent.parent_channel_name, '-', channel.channel_name ) AS parent_channel_name 
	FROM
		`****`.`****`.poc_channel_info /*+ OPTIONS('server-id'='5420-5429') */
		AS channel
		LEFT JOIN channel_info_level_2 AS parent ON channel.parent_channel_code = parent.channel_code 
	WHERE
		channel.channel_level = 3 
	), -- channel hierarchy, level 4
	channel_info_level_4 AS (
	SELECT
		channel.channel_code,
		channel.channel_name,
		concat( parent.parent_channel_code, ',', channel.channel_code ) AS parent_channel_code,
		concat( parent.parent_channel_name, '-', channel.channel_name ) AS parent_channel_name 
	FROM
		`****`.`****`.poc_channel_info /*+ OPTIONS('server-id'='5430-5439') */
		AS channel
		LEFT JOIN channel_info_level_3 AS parent ON channel.parent_channel_code = parent.channel_code 
	WHERE
		channel.channel_level = 4 
	), -- all channel levels combined
	channel_info_level AS ( SELECT * FROM channel_info_level_1 UNION ALL SELECT * FROM channel_info_level_2 UNION ALL SELECT * FROM channel_info_level_3 UNION ALL SELECT * FROM channel_info_level_4 )
	SELECT
	concat(
		poc_info.poc_middle_id,
		'_',
	IF
	( salesman_ref.id IS NOT NULL, salesman_ref.id, '' )) AS id,
	poc_info.poc_middle_id AS pocMiddleId,
	...,
	poc_info.validation_type AS validationType 
FROM
	`****`.`****`.poc_base_info /*+ OPTIONS('server-id'='5440-5449') */
	AS poc_info
	LEFT JOIN (
	SELECT
		label_ref.poc_middle_id,
		LISTAGG ( label_info.label_code ) label_code,
		LISTAGG ( label_info.label_name ) label_name 
	FROM
		`****`.`****`.poc_label_ref /*+ OPTIONS('server-id'='5450-5459') */
		AS label_ref 
		INNER JOIN `****`.`****`.poc_label_info /*+ OPTIONS('server-id'='5460-5469') */
		AS label_info ON label_ref.label_code = label_info.label_code 
		AND label_ref.deleted = 0 
	GROUP BY
		label_ref.poc_middle_id 
	) label_info ON poc_info.poc_middle_id = label_info.poc_middle_id 
	LEFT JOIN channel_info_level AS channel_info ON poc_info.channel_format_code = channel_info.channel_code
	LEFT JOIN `****`.`****`.poc_salesman_ref /*+ OPTIONS('server-id'='5470-5479') */
	AS salesman_ref ON poc_info.poc_middle_id = salesman_ref.poc_middle_id
	LEFT JOIN `****`.`****`.poc_extend_info /*+ OPTIONS('server-id'='5480-5489') */
	AS extend_info ON poc_info.poc_middle_id = extend_info.poc_middle_id
	LEFT JOIN `****`.`****`.wccs_dict_info /*+ OPTIONS('server-id'='5490-5499') */
	AS wccs_chain ON extend_info.wccs_chain_code_2022_version = wccs_chain.dict_code 
	AND wccs_chain.dict_type = 4
	LEFT JOIN `****`.`****`.wccs_dict_info /*+ OPTIONS('server-id'='5500-5509') */
	AS wccs_grade ON extend_info.wccs_grade_code_2022_version = wccs_grade.dict_code 
	AND wccs_grade.dict_type = 6
	LEFT JOIN `****`.`****`.poc_bees_project_info /*+ OPTIONS('server-id'='6300-6309') */
	AS bees_project_info ON poc_info.poc_middle_id = bees_project_info.poc_middle_id 
	AND bees_project_info.deleted = 0 
WHERE
	poc_info.deleted = 0 
	AND poc_info.poc_middle_id IS NOT NULL;
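
        To make the channel CTEs concrete: a level-3 channel C under parent B (level 2) under A (level 1) ends up in channel_info_level with parent_channel_code = 'codeA,codeB,codeC' and parent_channel_name = 'A-B-C' (codes and names here are made up), so every POC row joined on channel_format_code carries its full channel ancestry in a single record.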

        As development and testing progressed, we found that Flink SQL has quite a few shortcomings, and for some of them the Flink community has no mature solution. Specifically:

1. In Flink SQL, the history of both sides of a dual-stream join is kept in state, and that state is time-limited: 36 hours by default. When the job starts, all rows of the tables involved are loaded into state; after 36 hours an entry expires (an update within that window renews its 36-hour TTL). Once the state has expired, a change read from the MySQL binlog can no longer find its join counterpart, and those records are lost (see the state-retention sketch after this list).

2. If several Flink jobs are started at the same time against the same MySQL instance, errors are thrown frequently and the jobs restart (this can be fixed by pinning the binlog server-id with the /*+ OPTIONS('server-id'='5430-5439') */ hint). The root cause is that the MySQL CDC source registers itself as a slave of the MySQL cluster; from the primary's point of view every slave must have a unique id (the serverId), and each serverId tracks its own binlog position. If several slaves share one serverId, the data they pull gets mixed up. When no serverId is specified a random one is assigned, which collides easily, so it is best for the teams involved to agree on and explicitly set unique serverIds.


3. Flink SQL itself cannot write nested-type Elasticsearch fields directly (later research showed this can be worked around by uploading a JAR with a custom UDF).
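
For reference, the join-state retention from defect 1 can be raised, though not eliminated. Below is a minimal sketch, assuming the SQL job were driven from a Java TableEnvironment; the 72-hour value is purely illustrative, and the 'table.exec.state.ttl' key mentioned in the comment is our assumption of the equivalent setting on a managed deployment.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

import java.time.Duration;

public class JoinStateTtlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Keep join state around longer than the 36 h default described above.
        // 72 h is an arbitrary illustrative value; longer retention means more state.
        tEnv.getConfig().setIdleStateRetention(Duration.ofHours(72));

        // On a managed deployment the equivalent (assumed) setting is the
        // 'table.exec.state.ttl' key in the job's Flink configuration.
    }
}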

Because problem 1 could not be solved, the team decided to switch to a Flink DataStream job.

Technical Feasibility Study

        To let the tables be synchronized independently, we dropped the previous approach of joining all the data first and then writing it to ES, and instead update ES every time a binlog record is read. For this to work, the tables must be completely decoupled with respect to a single ES document. To handle the case where a child-table record is picked up by Flink before its parent-table record, we introduced a new esLastOptTime field: a document is created in ES whether or not the parent-table row has been seen yet, initially without the id field and carrying only the child table's business fields; the id field is filled in only once the parent-table row is read (this id is the business-system id, not the ES document _id).
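
        The sketch below illustrates the decoupling idea (a simplified illustration, not the production code; the index prefix and the poc_middle_id business key are the ones used in this project, everything else is assumed): every child-table binlog event becomes a partial-document upsert keyed by poc_middle_id, writing only that table's own fields plus esLastOptTime, so the document exists even if the parent row has not arrived yet.

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.Requests;

import java.util.HashMap;
import java.util.Map;

public class PartialUpsertSketch {

    /**
     * Builds an upsert that only touches the fields owned by one source table.
     * If the document does not exist yet, it is created with just these fields;
     * the parent table's own event will fill in the business "id" field later.
     */
    public static UpdateRequest buildPartialUpsert(String indexName, String pocMiddleId,
                                                   Map<String, Object> childFields) {
        Map<String, Object> doc = new HashMap<>(childFields);
        doc.put("esLastOptTime", System.currentTimeMillis());

        IndexRequest insert = Requests.indexRequest(indexName)
                .id(pocMiddleId)        // ES document _id = business key shared by all tables
                .source(doc);
        return new UpdateRequest(indexName, pocMiddleId)
                .doc(doc)
                .upsert(insert);        // create-if-missing, partial update otherwise
    }
}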

Code Implementation

Maven dependencies, pom.xml (the provided scopes must be commented out for the project to run locally):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <artifactId>abi-cloud-beesforce-data-board</artifactId>
        <groupId>com.abi</groupId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>

    <artifactId>abi-cloud-beesforce-data-board-flink</artifactId>
    <name>abi-cloud-beesforce-data-board-flink</name>
    <description>abi api project</description>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.13.1</flink.version>
        <target.java.version>1.8</target.java.version>
        <scala.binary.version>2.11</scala.binary.version>
        <maven.compiler.source>${target.java.version}</maven.compiler.source>
        <maven.compiler.target>${target.java.version}</maven.compiler.target>
        <log4j.version>2.12.4</log4j.version>
    </properties>

    <repositories>
        <repository>
            <id>apache.snapshots</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <!-- Apache Flink dependencies -->
        <!-- These are provided because they must not be packaged into the job JAR
             (comment the scopes out to run the job locally in the IDE). -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- Connector dependencies (default scope, packaged into the JAR) -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-elasticsearch7_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-connector-mysql-cdc</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-base</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-common</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!-- Logging framework, to produce console output when running in the IDE. -->
        <!-- These dependencies are excluded from the application JAR by default. -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>

        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.24</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid</artifactId>
            <version>1.1.10</version>
        </dependency>
        <dependency>
            <groupId>cn.hutool</groupId>
            <artifactId>hutool-json</artifactId>
            <version>5.7.16</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.9</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>

            <!-- Java compiler -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${target.java.version}</source>
                    <target>${target.java.version}</target>
                </configuration>
            </plugin>

            <!-- Build a fat JAR, excluding what Flink already provides. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <!-- Run the shade goal on the package phase -->
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.apache.flink:force-shading</exclude>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>org.apache.logging.log4j:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <!-- Do not copy META-INF signatures, otherwise SecurityExceptions may occur when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <pluginManagement>
            <plugins>

                <!-- This improves the out-of-the-box experience in Eclipse by resolving some warnings. -->
                <plugin>
                    <groupId>org.eclipse.m2e</groupId>
                    <artifactId>lifecycle-mapping</artifactId>
                    <version>1.0.0</version>
                    <configuration>
                        <lifecycleMappingMetadata>
                            <pluginExecutions>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>org.apache.maven.plugins</groupId>
                                        <artifactId>maven-shade-plugin</artifactId>
                                        <versionRange>[3.1.1,)</versionRange>
                                        <goals>
                                            <goal>shade</goal>
                                        </goals>
                                    </pluginExecutionFilter>
                                    <action>
                                        <ignore/>
                                    </action>
                                </pluginExecution>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>org.apache.maven.plugins</groupId>
                                        <artifactId>maven-compiler-plugin</artifactId>
                                        <versionRange>[3.1,)</versionRange>
                                        <goals>
                                            <goal>testCompile</goal>
                                            <goal>compile</goal>
                                        </goals>
                                    </pluginExecutionFilter>
                                    <action>
                                        <ignore/>
                                    </action>
                                </pluginExecution>
                            </pluginExecutions>
                        </lifecycleMappingMetadata>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>

Log level configuration, log4j2.properties:

################################################################################
#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

rootLogger.level = INFO
rootLogger.appenderRef.console.ref = ConsoleAppender

appender.console.name = ConsoleAppender
appender.console.type = CONSOLE
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n

Entity class for the binlog data stream:

@NoArgsConstructor
@Data
public class BinlogStreamBean implements Serializable {
    @JsonProperty("before")
    private Map<String, Object> before;
    @JsonProperty("after")
    private Map<String, Object> after;
    @JsonProperty("source")
    private SourceDTO source;
    @JsonProperty("op")
    private String op;
    @JsonProperty("ts_ms")
    private Long tsMs;
    @JsonProperty("transaction")
    private Object transaction;

    @NoArgsConstructor
    @Data
    public static class SourceDTO {
        @JsonProperty("version")
        private String version;
        @JsonProperty("connector")
        private String connector;
        @JsonProperty("name")
        private String name;
        @JsonProperty("ts_ms")
        private Integer tsMs;
        @JsonProperty("snapshot")
        private String snapshot;
        @JsonProperty("db")
        private String db;
        @JsonProperty("sequence")
        private Object sequence;
        @JsonProperty("table")
        private String table;
        @JsonProperty("server_id")
        private Integer serverId;
        @JsonProperty("gtid")
        private Object gtid;
        @JsonProperty("file")
        private String file;
        @JsonProperty("pos")
        private Integer pos;
        @JsonProperty("row")
        private Integer row;
        @JsonProperty("thread")
        private Object thread;
        @JsonProperty("query")
        private Object query;
    }
}
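
For context, here is a trimmed and entirely hypothetical Debezium change event mapped onto the bean (all field values are made up). The sketch uses Jackson, which matches the @JsonProperty annotations and is assumed to be on the classpath; the job itself feeds the raw JSON string to hutool's JSONUtil.toBean in the same spirit.

import com.fasterxml.jackson.databind.ObjectMapper;

public class BinlogParseSketch {
    public static void main(String[] args) throws Exception {
        // Trimmed, hypothetical update event ("op":"u") on poc_base_info.
        String json = "{"
                + "\"before\": {\"poc_middle_id\": \"P001\", \"deleted\": 0},"
                + "\"after\":  {\"poc_middle_id\": \"P001\", \"deleted\": 1},"
                + "\"source\": {\"db\": \"abi-cloud-middle-platform-poc\", \"table\": \"poc_base_info\"},"
                + "\"op\": \"u\","
                + "\"ts_ms\": 1660000000000"
                + "}";

        BinlogStreamBean event = new ObjectMapper().readValue(json, BinlogStreamBean.class);
        // op = "u" with after.deleted = 1, which the sink below treats as a logical delete.
        System.out.println(event.getOp() + " on " + event.getSource().getTable());
    }
}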

Part of the synchronization code:

@Slf4j
public class SinkToEsSteaming {

    public static void main(String[] args) {

        //1. Parse the main-method arguments
        ParameterTool parameterTool = ParameterTool.fromArgs(args);

        //2. Data processing
        ElasticsearchSinkFunction<String> elasticsearchSinkFunction = (row, ctx, indexer) -> {
            //1 - Preparation
            //Fetch the global job parameters
            ParameterTool parameterTool1 = (ParameterTool) ctx.getExecutionConfig().getGlobalJobParameters();

            //Index name
            String indexName = "es_sink_jar_beesforce_poc_list" + EnvUtil.getUnderscoreEnv(parameterTool1.get("env"));

            //Binlog payload
            BinlogStreamBean binlogStreamBean = JSONUtil.toBean(row, BinlogStreamBean.class);

            //2 - Determine the effective operation type
            String optType = binlogStreamBean.getOp();
            if ("u".equals(optType)) {
                String deleted = String.valueOf(binlogStreamBean.getAfter().get("deleted"));
                //Treat a logical delete (deleted = 1) as a delete
                if ("1".equals(deleted)) {
                    optType = "d";
                } else {
                    optType = "u";
                }
            }

            //3 - Build the requests for this operation
            List<UpdateRequest> updateRequestList = Lists.newArrayList();
            Map<String, Object> data = Maps.newHashMap();
            if ("c".equals(optType)) {
                data = binlogStreamBean.getAfter();
            } else if ("u".equals(optType)) {
                data = binlogStreamBean.getAfter();
            } else if ("d".equals(optType)) {
                data = binlogStreamBean.getBefore();
            } else if ("r".equals(optType)) {
                //During the initial snapshot read, only rows with deleted = 0 are taken
                String deleted = String.valueOf(binlogStreamBean.getAfter().get("deleted"));
                if (!"1".equals(deleted)) {
                    data = binlogStreamBean.getAfter();
                }
            }
            //Timestamp of the last ES operation
            data.put("esLastOptTime", new Date().getTime());
            //Document id
            String id = IStringUtil.valueOf(data.get("poc_middle_id"));
            IndexRequest indexRequest = Requests.indexRequest(indexName)
                    .id(id)
                    .source(data);
            UpdateRequest updateRequest = new UpdateRequest(indexName, id)
                    .doc(indexRequest)
                    .upsert(indexRequest);
            updateRequestList.add(updateRequest);

            //4 - Hand the requests to the indexer
            for (UpdateRequest esUpdateRequest : updateRequestList) {
                indexer.add(esUpdateRequest);
            }
        };

        //3. Configure the Elasticsearch connection and the sink function
        ElasticsearchSink<String> esSink = getEsSink(elasticsearchSinkFunction, parameterTool);

        //4. Create the MySQL binlog source stream and attach the sink
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000);
        DataStreamSource<String> dataStreamSource = env
                .fromSource(getMysqlSource(parameterTool), WatermarkStrategy.noWatermarks(), "MySQL Source");
        //Register the global job parameters
        env.getConfig().setGlobalJobParameters(parameterTool);
        String parallelism = parameterTool.get("parallelism");
        if (StringUtils.isNotBlank(parallelism)) {
            env.setParallelism(Integer.parseInt(parallelism));
        }

        dataStreamSource.addSink(esSink);
        try {
            env.execute();
        } catch (Exception e) {
            log.error("Failed to start the job", e);
        }
    }

    private static MySqlSource<String> getMysqlSource(ParameterTool parameterTool) {
        String mysqlHost = parameterTool.get("mysqlHost");
        String mysqlPort = parameterTool.get("mysqlPort");
        String mysqlUsername = parameterTool.get("mysqlUsername");
        String mysqlPassword = parameterTool.get("mysqlPassword");
        String serverId = parameterTool.get("serverId");

        String pocBaseInfo = "poc_base_info";
        String pocLabelRef = "poc_label_ref";
        String pocSalesmanRef = "poc_salesman_ref";
        String pocExtendInfo = "poc_extend_info";
        String pocBeesProjectInfo = "poc_bees_project_info";
        String pocWholesalerRef = "poc_wholesaler_ref";
        String databaseMiddle = "abi-cloud-middle-platform-poc" + EnvUtil.getHorizontalEnv(parameterTool.get("env"));
        //Custom deserializer so that DECIMAL columns are emitted as plain numbers
        Map<String, Object> config = Maps.newHashMap();
        config.put(JsonConverterConfig.DECIMAL_FORMAT_CONFIG, DecimalFormat.NUMERIC.name());
        JsonDebeziumDeserializationSchema jdd = new JsonDebeziumDeserializationSchema(false, config);

        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname(mysqlHost)
                .port(Integer.parseInt(mysqlPort))
                .databaseList(databaseMiddle)
                .tableList(
                        MessageFormat.format("{0}.{1}", databaseMiddle, pocBaseInfo),
                        MessageFormat.format("{0}.{1}", databaseMiddle, pocLabelRef),
                        MessageFormat.format("{0}.{1}", databaseMiddle, pocSalesmanRef),
                        MessageFormat.format("{0}.{1}", databaseMiddle, pocExtendInfo),
                        MessageFormat.format("{0}.{1}", databaseMiddle, pocBeesProjectInfo),
                        MessageFormat.format("{0}.{1}", databaseMiddle, pocWholesalerRef)
                )
                .username(mysqlUsername)
                .password(mysqlPassword)
                .startupOptions(StartupOptions.initial())
                .deserializer(jdd)
                .serverId(serverId)
                .build();
        return mySqlSource;
    }

    private static ElasticsearchSink<String> getEsSink(ElasticsearchSinkFunction<String> elasticsearchSinkFunction, ParameterTool parameterTool) {
        String esAddress = parameterTool.get("esAddress");
        String esPort = parameterTool.get("esPort");
        String esUsername = parameterTool.get("esUsername");
        String esPassword = parameterTool.get("esPassword");

        //1 - httpHosts
        List<HttpHost> httpHosts = new ArrayList<>();
        httpHosts.add(new HttpHost(esAddress, Integer.parseInt(esPort), "http"));

        //2 - restClientFactory
        RestClientFactory restClientFactory = restClientBuilder -> {
            Node node = new Node(new HttpHost(esAddress, Integer.parseInt(esPort), "https"));
            List<Node> nodes = new ArrayList<>();
            nodes.add(node);
            Header[] header = new Header[1];
            BasicHeader authHeader = new BasicHeader("Authorization", "Basic " + Base64.encode((esUsername + ":" + esPassword).getBytes()));
            header[0] = authHeader;
            restClientBuilder.setDefaultHeaders(header);
            restClientBuilder.build().setNodes(
                    nodes
            );
        };

        //3 - Build the ElasticsearchSink
        ElasticsearchSink.Builder<String> esSinkBuilder = new ElasticsearchSink.Builder<>(
                httpHosts, elasticsearchSinkFunction
        );
        esSinkBuilder.setRestClientFactory(restClientFactory);
        //Enable back-off for failed bulk requests
        esSinkBuilder.setBulkFlushBackoff(true);
        //Maximum number of actions per bulk request
        esSinkBuilder.setBulkFlushMaxActions(3000);
        //Maximum size of a bulk request, in MB
        esSinkBuilder.setBulkFlushMaxSizeMb(50);
        //Bulk flush interval
        esSinkBuilder.setBulkFlushInterval(100);
        //Number of back-off retries
        esSinkBuilder.setBulkFlushBackoffRetries(1);
        //Delay between back-off retries
        esSinkBuilder.setBulkFlushBackoffDelay(2000L);
        //Back-off strategy: CONSTANT retries at a fixed 2 s interval (2 s -> 4 s -> 6 s from the start); EXPONENTIAL doubles the delay (2 s -> 4 s -> 8 s)
        esSinkBuilder.setBulkFlushBackoffType(ElasticsearchSinkBase.FlushBackoffType.CONSTANT);
        //Retry requests that ES rejected because its bulk queue was full
        esSinkBuilder.setFailureHandler(new RetryRejectedExecutionFailureHandler());

        ElasticsearchSink<String> esSink = esSinkBuilder.build();

        return esSink;
    }
}

Sample main-method arguments:

--esAddress
********
--esPort
********
--esUsername
********
--esPassword
********
--mysqlHost
********
--mysqlJdbcUrl
********
--mysqlPort
********
--mysqlDriver
********
--mysqlUsername
********
--mysqlPassword
********
--env
dev
--serverId
5500-5529

Pitfalls Encountered

1. The Flink TaskManagers and the JobManager run in isolated environments, so variables assigned in the main method cannot be read inside the process method (this can be solved with Flink's built-in ParameterTool class by turning those variables into global job parameters; see the sketch below).
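
A minimal sketch of that pattern (the "env" key is one of the real job arguments; everything else is illustrative): register the ParameterTool as global job parameters in main, then read it back from the runtime context inside a rich function.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class GlobalParamsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        // Ship the parameters to every TaskManager instead of relying on main-method variables.
        env.getConfig().setGlobalJobParameters(parameterTool);

        env.fromElements("a", "b")
           .map(new RichMapFunction<String, String>() {
               @Override
               public String map(String value) {
                   ParameterTool params = (ParameterTool) getRuntimeContext()
                           .getExecutionConfig().getGlobalJobParameters();
                   // "env" is one of the real job arguments; a missing key falls back to "dev" here.
                   return value + "-" + params.get("env", "dev");
               }
           })
           .print();

        env.execute("global-params-sketch");
    }
}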

2. A custom deserializer must be configured to parse the binlog, otherwise some fields are encoded; DECIMAL columns, for example, arrive as Base64-encoded byte strings instead of plain numbers:

Map<String, Object> config = Maps.newHashMap();
config.put(JsonConverterConfig.DECIMAL_FORMAT_CONFIG, DecimalFormat.NUMERIC.name());
JsonDebeziumDeserializationSchema jdd = new JsonDebeziumDeserializationSchema(false, config);

3. Nested-type fields must be handled separately; updating a nested field requires a Painless script to cover the various edge cases:

public static List<UpdateRequest> buildNestedUpdateRequest(String indexName, String id,
                                                               String fieldName, String optType,
                                                               String nestedIdField, String nestedId, Map<String, Object> nestedObj){

        List<UpdateRequest> updateRequestList = new ArrayList<>();

        if (StringUtils.isBlank(id) || null == nestedObj || nestedObj.isEmpty()) {
            return updateRequestList;
        }

        //1 - Make sure the document exists: create it if missing, otherwise just touch a harmless field
        List<UpdateRequest> createDocIfNotExistReqList = buildUpdateRequest(indexName, id, new HashMap<>());
        updateRequestList.addAll(createDocIfNotExistReqList);

        //2 - Build the real update request
        Script script = null;
        //Choose the script according to the operation type
        if("u".equals(optType) || "c".equals(optType)){
            //Script source
            String code = String.format("if (ctx._source.%s == null) {ctx._source.%s = [];}ctx._source.%s.removeIf(item -> item.%s == params.detail.%s);ctx._source.%s.add(params.detail);",
                    fieldName,
                    fieldName,
                    fieldName,
                    nestedIdField,
                    nestedIdField,
                    fieldName);
            //Script parameters
            Map<String, Object> paramMap = new HashMap<>();
            paramMap.put("detail", nestedObj);
            //Build the script
            script = new Script(ScriptType.INLINE, "painless", code, paramMap);

        }else if("d".equals(optType)){
            //Script source
            String code = String.format("if(ctx._source.%s==null){return;}ctx._source.%s.removeIf(item -> item.%s == params.nestedId);if(ctx._source.%s.length==0){ctx._source.remove('%s')}",
                    fieldName,
                    fieldName,
                    nestedIdField,
                    fieldName,
                    fieldName);
            //Script parameters
            Map<String, Object> paramMap = new HashMap<>();
            paramMap.put("nestedId", nestedId);
            //Build the script
            script = new Script(ScriptType.INLINE, "painless", code, paramMap);
        }
        //Create the request
        UpdateRequest updateRequest = new UpdateRequest(indexName, id)
                .script(script);
        updateRequestList.add(updateRequest);

        return updateRequestList;
    }
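
A hedged usage example inside the same class (the nested field name salesmanList, the nested id field id, and the sample values are made up; only the index naming and the business key follow the job above):

// Hypothetical usage: replace-or-insert one salesman entry inside the "salesmanList"
// nested field of the POC document whose business id is "P001".
Map<String, Object> salesman = new HashMap<>();
salesman.put("id", "S42");
salesman.put("salesman_name", "demo");

List<UpdateRequest> requests = buildNestedUpdateRequest(
        "es_sink_jar_beesforce_poc_list_dev", // indexName
        "P001",                               // ES document id (business poc_middle_id)
        "salesmanList",                       // nested field to modify
        "u",                                  // binlog operation type
        "id",                                 // id field inside the nested objects
        "S42",                                // nested id value (used by the "d" branch)
        salesman);                            // the nested object itself
// The returned requests are then handed to indexer.add(...) just like in the sink function above.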
