KafkaSink writes a data stream to one or more Kafka topics.
Full working source code, ready to download: https://gitee.com/shawsongyue/aurora.git
Module: aurora_flink_connector_kafka
Main class: KafkaSinkStreamingJob
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>3.0.2-1.18</version>
</dependency>
If the following dependency is missing, the job fails right at startup with: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/flink/connector/base/source/reader/RecordEmitter
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-base</artifactId>
    <version>1.18.0</version>
</dependency>
The Kafka sink provides a builder class for constructing an instance of KafkaSink:
DataStream<String> stream = ...;
KafkaSink<String> sink = KafkaSink.<String>builder()
.setBootstrapServers(brokers)
.setRecordSerializer(KafkaRecordSerializationSchema.builder()
.setTopic("topic-name")
.setValueSerializationSchema(new SimpleStringSchema())
.build()
)
.setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
.build();
stream.sinkTo(sink);
The following properties are required when building a KafkaSink:
Bootstrap servers, setBootstrapServers(String)
Record serializer, setRecordSerializer(KafkaRecordSerializationSchema)
If DeliveryGuarantee.EXACTLY_ONCE semantics are used, a transactional ID prefix must also be set via setTransactionalIdPrefix(String); see the sketch after this list.
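As a hedged sketch of how those three settings fit together (the prefix string and topic name are placeholders, not taken from the source repo), an exactly-once sink could be wired like this:

KafkaSink<String> exactlyOnceSink = KafkaSink.<String>builder()
        .setBootstrapServers(brokers)
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("topic-name")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        //required for EXACTLY_ONCE; must be unique across applications (placeholder value)
        .setTransactionalIdPrefix("my-app-sink")
        .build();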
When building the sink, a KafkaRecordSerializationSchema must be provided to convert each input element into a Kafka ProducerRecord. Flink offers a schema builder that covers the common building blocks, namely key/value serialization, topic selection, and message partitioning; richer control is possible by implementing the corresponding interface directly.
The value serialization method and the topic selection method must be specified. In addition, setKafkaKeySerializer(Serializer) or setKafkaValueSerializer(Serializer) lets you use a serializer provided by Kafka instead of one provided by Flink; a sketch follows the example below.
KafkaRecordSerializationSchema.<String>builder()
    //the topic selector must return the target topic for each record
    .setTopicSelector((element) -> "topic-name")
    .setValueSerializationSchema(new SimpleStringSchema())
    .setKeySerializationSchema(new SimpleStringSchema())
    .setPartitioner(new FlinkFixedPartitioner<>())
    .build();
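Alternatively, the value (or key) can be serialized by a Kafka-provided serializer class instead of a Flink SerializationSchema. A minimal sketch, assuming Kafka's org.apache.kafka.common.serialization.StringSerializer (shipped with the kafka-clients dependency pulled in by the connector):

import org.apache.kafka.common.serialization.StringSerializer;

KafkaRecordSerializationSchema<String> schema = KafkaRecordSerializationSchema.<String>builder()
        .setTopic("topic-name")
        //Kafka-provided serializer class instead of a Flink SerializationSchema
        .setKafkaValueSerializer(StringSerializer.class)
        .build();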
05 Fault Tolerance and Recovery
`KafkaSink` supports three delivery guarantees (`DeliveryGuarantee`). For `DeliveryGuarantee.AT_LEAST_ONCE` and `DeliveryGuarantee.EXACTLY_ONCE`, Flink checkpointing must be enabled. By default, `KafkaSink` uses `DeliveryGuarantee.NONE`. The three guarantees are:

`DeliveryGuarantee.NONE`: no guarantee at all. Messages may be lost due to Kafka broker issues and may be duplicated after a Flink failure.
`DeliveryGuarantee.AT_LEAST_ONCE`: on checkpoint, the sink waits until all records buffered by the Kafka producer have been acknowledged. No messages are lost due to events on the Kafka broker side, but they may be duplicated when Flink restarts, because Flink reprocesses old input records.
`DeliveryGuarantee.EXACTLY_ONCE`: in this mode the Kafka sink writes all data in transactions that are committed on checkpoint. Consequently, if consumers read only committed data (see the Kafka consumer setting isolation.level), no duplicates appear after a Flink restart. However, data only becomes visible once the checkpoint completes, so tune the checkpoint interval accordingly. Make sure the transactional ID prefix (transactionIdPrefix) is unique per application, so that transactions of different jobs do not interfere with each other. In addition, it is strongly recommended to set Kafka's transaction timeout to well above the maximum checkpoint interval plus the maximum restart time; otherwise Kafka's expiry of uncommitted transactions causes data loss. A sketch of these settings follows.
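To make the last two points concrete, a minimal sketch (the 900000 ms timeout is an assumed value, sized to exceed the checkpoint interval plus maximum restart time; it must not exceed the broker's transaction.max.timeout.ms, and the prefix is a placeholder):

KafkaSink<String> transactionalSink = KafkaSink.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("topic_a")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        //placeholder prefix; must be unique per application
        .setTransactionalIdPrefix("aurora-tx")
        //forwarded to the Kafka producer; assumed 15-minute timeout
        .setProperty("transaction.timeout.ms", "900000")
        .build();

//Downstream consumers should read only committed records:
java.util.Properties consumerProps = new java.util.Properties();
consumerProps.setProperty("isolation.level", "read_committed");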
The Kafka sink reports the following metrics in their respective scopes:

Scope | Metric | User variable | Description | Type |
---|---|---|---|---|
Operator | currentSendTime | n/a | Time it took to send the last record; an instantaneous value measured on the most recent record. | Gauge |
Complete pom.xml of the module:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.xsy</groupId>
    <artifactId>aurora_flink_connector_kafka</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <java.version>11</java.version>
        <maven.plugin.version>3.8.1</maven.plugin.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <fastjson.version>1.2.75</fastjson.version>
        <log4j.version>2.17.1</log4j.version>
        <flink.version>1.18.0</flink.version>
        <scala.binary.version>2.11</scala.binary.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka</artifactId>
            <version>3.0.2-1.18</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-base</artifactId>
            <version>1.18.0</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>${project.name}</finalName>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
            </resource>
            <resource>
                <directory>src/main/java</directory>
                <includes>
                    <include>**/*.xml</include>
                </includes>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.apache.flink:force-shading</exclude>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>org.apache.logging.log4j:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.aurora.KafkaSinkStreamingJob</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <version>${spring.boot.version}</version>
                <configuration>
                    <fork>true</fork>
                    <finalName>${project.build.finalName}</finalName>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>UTF-8</encoding>
                    <compilerArgs>
                        <arg>-parameters</arg>
                    </compilerArgs>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>aliyun-repos</id>
            <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
            <snapshots><enabled>false</enabled></snapshots>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>aliyun-plugin</id>
            <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
            <snapshots><enabled>false</enabled></snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>
(1) application.properties
#Kafka cluster address
kafka.bootstrapServers=localhost:9092
#Kafka topic
kafka.topic=topic_a
#Kafka consumer group
kafka.group=aurora_group
(2) log4j2.properties
rootLogger.level=INFO
rootLogger.appenderRef.console.ref=ConsoleAppender
appender.console.name=ConsoleAppender
appender.console.type=CONSOLE
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=%d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n
log.file=D:\\tmp
package com.aurora;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
/**
 * @author 浅夏的猫
 * @description Demo job for the Kafka sink connector
 * @datetime 22:21 2024/2/1
 */
public class KafkaSinkStreamingJob {
private static final Logger logger = LoggerFactory.getLogger(KafkaSinkStreamingJob.class);
public static void main(String[] args) throws Exception {
        //===============1. Load parameters==============================
        //Path to the properties file
        String propertiesFilePath = "E:\\project\\aurora_dev\\aurora_flink_connector_kafka\\src\\main\\resources\\application.properties";
        //Option 1: use Flink's built-in ParameterTool
        ParameterTool paramsMap = ParameterTool.fromPropertiesFile(propertiesFilePath);

        //================2. Initialize Kafka parameters==============================
        String bootstrapServers = paramsMap.get("kafka.bootstrapServers");
        String topic = paramsMap.get("kafka.topic");

        //================3. Build the KafkaSink==============================
        KafkaSink<String> sink = KafkaSink.<String>builder()
                //Kafka cluster address
                .setBootstrapServers(bootstrapServers)
                //Record (value) serializer
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic(topic)
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build()
                )
                //At-least-once delivery guarantee
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();
        //=================4. Create the Flink execution environment=================
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        ArrayList<String> listData = new ArrayList<>();
        listData.add("test");
        listData.add("java");
        listData.add("c++");
        DataStreamSource<String> dataStreamSource = env.fromCollection(listData);

        //=================5. Simple transformation======================
        SingleOutputStreamOperator<String> flatMap = dataStreamSource.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String record, Collector<String> collector) throws Exception {
                logger.info("Processing record bound for Kafka: {}", record);
                collector.collect(record);
            }
        });
        //Sink operator: write the stream to Kafka
        flatMap.sinkTo(sink);
        //=================6. Start the job=========================================
        //Enable checkpointing: trigger a checkpoint every 1000 ms (the checkpoint interval)
        env.enableCheckpointing(1000);
        //Advanced checkpoint options
        //Set the checkpointing mode to exactly-once (this is also the default)
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        //Require at least 500 ms between two checkpoints (the minimum pause)
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
        //A checkpoint must finish within 1 minute or it is discarded (the checkpoint timeout)
        env.getCheckpointConfig().setCheckpointTimeout(60000);
        //Allow only one checkpoint to be in progress at a time
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        //Retain checkpoint data when the job is cancelled, so it can later be restored from a chosen checkpoint
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        //Configure checkpoint storage: where state and checkpoint data are persisted (checkpoints need durable storage)
        env.getCheckpointConfig().setCheckpointStorage("file:///E:/flink/checkPoint");
env.execute();
}
}
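To verify the sink end to end, here is a hedged sketch of a read-back job using the same connector's KafkaSource (topic, group, and bootstrap servers are taken from application.properties above; the extra imports come from org.apache.flink.connector.kafka.source and org.apache.flink.api.common.eventtime):

KafkaSource<String> verifySource = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics("topic_a")
        .setGroupId("aurora_group")
        //read from the beginning so earlier test records are included
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
StreamExecutionEnvironment verifyEnv = StreamExecutionEnvironment.getExecutionEnvironment();
verifyEnv.fromSource(verifySource, WatermarkStrategy.noWatermarks(), "kafka-verify-source").print();
verifyEnv.execute("kafka-sink-verify");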