How to guarantee idempotency with flink-connector-elasticsearch

Official documentation links

  1. Flink documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.11/
  2. Flink Elasticsearch Connector documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/elasticsearch.html
  3. Elasticsearch scripted upsert documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts

Problem description

On the sink side, older records can overwrite newer ones. For example, suppose record A has two versions, A1 and A2, where 1 and 2 can be read as timestamps. When writing to Elasticsearch we cannot guarantee the order in which A1 and A2 arrive: if they are written as A1 then A2, Elasticsearch ends up holding the latest data; if they are written as A2 then A1, the document left in Elasticsearch is stale. How do we solve this? By using Elasticsearch's upsert + script feature.
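The idea is a compare-before-write rule: a write only takes effect when the incoming record is newer than the version already stored. Below is a minimal, purely illustrative sketch of that rule in plain Java (the Record class and shouldOverwrite method are not part of the demo further down); in the real pipeline the same comparison has to run inside Elasticsearch itself, because retries or parallel subtasks may write the same document concurrently, and that is what the Painless script in the demo does.

import java.time.LocalDate;

// Illustrative only: a standalone version of the "newer version wins" rule that the
// Painless script in the demo below enforces inside Elasticsearch.
public class NewerWinsRule {

    public static class Record {
        public String name;
        public String date; // "yyyy-MM-dd", acts as the version timestamp
    }

    // Overwrite only when the stored document is strictly older than the incoming one.
    public static boolean shouldOverwrite(Record stored, Record incoming) {
        return LocalDate.parse(stored.date).isBefore(LocalDate.parse(incoming.date));
    }
}

Applied on the sink side, this rule makes writes order-insensitive: whether A1 and A2 arrive as A1 A2 or as A2 A1, the document ends up holding A2.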

Demo code

The following example, based on Flink 1.11.1, updates Elasticsearch documents according to a timestamp carried in the data. Assume a simple stream of records containing a name, an event timestamp, and a few other fields; we write these records to Elasticsearch and use the timestamp to decide whether an existing document may be updated.

First, we need the Flink Elasticsearch Connector and the Elasticsearch Java client library (the client is pulled in transitively by the connector). The following Maven dependency can be used:


<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch7_2.11</artifactId>
    <version>1.11.1</version>
</dependency>

Next, we create an Elasticsearch sink and add it to the Flink program:

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.*;
/**
 * @className: ESSinkTest
 * @Description: Idempotent Elasticsearch sink demo based on script + upsert
 * @Author: wangyifei
 * @Date: 2023/7/1 12:00
 */
public class ESSinkTest {
    private static Logger logger = LoggerFactory.getLogger(ESSinkTest.class);
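    // Painless script template executed inside Elasticsearch: it parses the stored
    // document's date and the incoming date (__date__) and overwrites the fields only
    // when the stored version is older. The __name__ / __desc__ / __date__ / __price__
    // placeholders are substituted per record before the request is built.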
    private static String script = "String pattern = \"yyyy-MM-dd HH:mm:ss\" ;\n" +
            "        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);\n" +
            "        LocalDateTime parse1 = LocalDateTime.parse(\"__date__ 00:00:00\" , formatter);\n" +
            "        long l = parse1.toInstant(ZoneOffset.ofHours(8)).toEpochMilli();\n" +
            "        LocalDateTime parse2 = LocalDateTime.parse(ctx._source.date + \" 00:00:00\" , formatter);\n" +
            "        long ll = parse2.toInstant(ZoneOffset.ofHours(8)).toEpochMilli();\n" +
            "        \n" +
            "       if(ll < l){\n" +
            "        ctx._source.name=\"__name__\";\n" +
            "        ctx._source.desc=\"__desc__\";\n" +
            "        ctx._source.date=\"__date__\";\n" +
            "        ctx._source.price=__price__;\n" +
            "       }\n";
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        DataStreamSource<String> source = env.socketTextStream("127.0.0.1", 999);
        List<HttpHost> httpHosts = new ArrayList<>();
        httpHosts.add(new HttpHost("127.0.0.1", 9200, "http"));
        ElasticsearchSink.Builder<String> builder
                = new ElasticsearchSink.Builder<String>(
                httpHosts
                , new ElasticsearchSinkFunction<String>() {
            @Override
            public void process(String cnt, RuntimeContext runtimeContext, RequestIndexer requestIndexer) {

                String[] split = cnt.split(",");
                String id = split[0];
                String name = split[1];
                String desc = split[2];
                String date = split[3];
                String price = split[4];
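                // Scripted upsert: if the document already exists, the script above decides
                // whether it may be overwritten; if it does not exist yet, the script runs
                // against the upsert document below (ctx._source) and the result is indexed.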
                UpdateRequest request = new UpdateRequest();
                String localScript = null;
                localScript = script.replace("__name__" , name);
                localScript = localScript.replace("__desc__" , desc);
                localScript = localScript.replace("__date__" , date);
                localScript = localScript.replace("__price__" , price);
                request.index("product")
                        .scriptedUpsert(true)
                        .id(id)
                        .script(new Script(ScriptType.INLINE, "painless", localScript, Collections.emptyMap()));
                Map<String,Object> bean = new HashMap<>();
                bean.put("name",name);
                bean.put("desc",desc);
                bean.put("date",date);
                bean.put("price",price);
                System.out.println(bean);
                request.upsert(bean);
                requestIndexer.add(request);
            }
        }
        );
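        // Flush after every single action so the demo writes immediately
        // (use a larger batch size in real jobs).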
        builder.setBulkFlushMaxActions(1);
        ElasticsearchSink<String> esSink = builder.build();
        source.addSink(esSink);
        env.execute();
    }
}

In the code above we read records from a simple socket source and write them to Elasticsearch with an ElasticsearchSink. The script + upsert combination makes the sink idempotent, so an older version of a record can no longer overwrite a newer one.
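To try this out locally (assuming Elasticsearch is reachable on 127.0.0.1:9200 and a socket server is listening on port 999, for example one started with nc -lk 999), feed two versions of the same record through the socket. The field order matches the split(",") call above: id, name, desc, date, price, with the date in yyyy-MM-dd format:

1,apple,newer version,2023-07-02,3.5
1,apple,older version,2023-07-01,3.0

The first line creates document 1 through the upsert; when the second, older line arrives, the script sees that the stored date (2023-07-02) is not older than the incoming one (2023-07-01) and leaves the document unchanged. If the two lines arrived in the opposite order, the newer line would overwrite the older one, so the final document is the same either way.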
