Flink Kafka Producer error: The producer has been rejected from the broker because it tried to use an old

Background

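While developing an APM project, we needed Flink to consume data from Alibaba Cloud SLS and write it to Kafka. The sink is FlinkKafkaProducer, from Flink's officially supported connector library. Once it was wired up, the following exception appeared fairly often at runtime: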

2021-07-07 11:25:56.080 [ERROR] [APM_ANRProcess -> Sink: APM_ANRSink (1/1)] [org.apache.flink.streaming.runtime.tasks.StreamTask][732] - Error during disposal of stream operator.
org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: The producer has been rejected from the broker because it tried to use an old epoch with the transactionalId
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1282)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:920)
	at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:729)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:645)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:549)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: The producer has been rejected from the broker because it tried to use an old epoch with the transactionalId
2021-07-07 11:25:56.082 [WARN ] [APM_ANRProcess -> Sink: APM_ANRSink (1/1)] [org.apache.flink.runtime.taskmanager.Task][970] - APM_ANRProcess -> Sink: APM_ANRSink (1/1) (2b3de5183e98607918b2976d46795740) switched from RUNNING to FAILED.
org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1282)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:816)
	at com.shizhuang.apm.flinkwork.sink.ApmKafkaIngestEventsProducer.invoke(ApmKafkaIngestEventsProducer.java:58)
	at com.shizhuang.apm.flinkwork.sink.ApmKafkaIngestEventsProducer.invoke(ApmKafkaIngestEventsProducer.java:18)
	at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:235)
	at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:717)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:692)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:672)
	at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:52)
	at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:30)
	at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:53)

Problem Analysis

The key line in the log is:

The producer has been rejected from the broker because it tried to use an old epoch with the transactionalId

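Taken literally, the message says the producer used a stale epoch.
Every transactional producer is identified by a transactional.id. This identifier is configured on the client side, and the broker's transaction coordinator associates it with a producer ID and an epoch, which is how Kafka deals with zombie instances. The epoch is a piece of metadata attached to the transactional.id. As for how an epoch becomes stale, a search online turned up this explanation: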

Once the epoch is bumped, any producers with same transactional.id and an older epoch are considered zombies and are fenced off, ie. future transactional writes from those producers are rejected. [emphasis added]

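Our first guess was therefore that the producer's transaction had timed out before it was committed: the broker treated the producer as gone, and when the producer later tried to commit under the same transactional.id with its now-stale epoch, the request was rejected. The second message in the log points the same way ("or the producer's transaction has been expired by the broker").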
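To make the fencing rule concrete, here is a toy in-memory model of it. This is only a sketch for illustration, not Kafka's actual coordinator code: the class names `ToyTransactionCoordinator` and `FencingDemo`, the method names, and the id `apm-anr-sink-0` are all made up for this example. It assumes the behaviour described above: each bump of the epoch for a transactional.id invalidates every older epoch.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of epoch-based fencing. Hypothetical class, NOT Kafka's real coordinator. */
final class ToyTransactionCoordinator {
    // Latest epoch handed out for each transactional.id.
    private final Map<String, Integer> latestEpoch = new HashMap<>();

    /** Registers a producer, or bumps the epoch (e.g. after a timeout), and returns the new epoch. */
    int bumpEpoch(String transactionalId) {
        // First call stores 1; every later call adds 1 to the stored epoch.
        return latestEpoch.merge(transactionalId, 1, Integer::sum);
    }

    /** Accepts a transactional write only if the caller holds the latest epoch. */
    boolean accept(String transactionalId, int epoch) {
        Integer latest = latestEpoch.get(transactionalId);
        return latest != null && latest == epoch;
    }
}

final class FencingDemo {
    public static void main(String[] args) {
        ToyTransactionCoordinator coordinator = new ToyTransactionCoordinator();
        String id = "apm-anr-sink-0"; // hypothetical transactional.id

        int oldEpoch = coordinator.bumpEpoch(id);
        System.out.println(coordinator.accept(id, oldEpoch)); // true: epoch is current

        // The epoch is bumped, e.g. because the open transaction expired.
        int newEpoch = coordinator.bumpEpoch(id);
        System.out.println(coordinator.accept(id, oldEpoch)); // false: old epoch is fenced
        System.out.println(coordinator.accept(id, newEpoch)); // true
    }
}
```

In this model the second `accept` call with `oldEpoch` plays the role of our sink: it is still running, but the epoch it holds is no longer the latest one, so its write is refused.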

Solution

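A code review showed that our Flink job has checkpointing enabled. The official Flink documentation explains that when checkpointing is enabled, the Kafka transaction is committed along with the checkpoint, so the time between commits follows Flink's checkpoint interval.
Our checkpoint interval, and therefore the interval between transaction commits, was 5 minutes. After we changed the checkpoint interval to 1 minute, the problem went away.
If you would rather not change the checkpoint interval, you can instead raise transaction.timeout.ms appropriately, for example to 15 minutes, which matches the broker's default upper limit (transaction.max.timeout.ms).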
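The relationship between the two settings can be sketched as a small check. This is a minimal sketch under stated assumptions, not part of our project: the class `KafkaTxnTimeoutSketch` and its methods are hypothetical helpers, the rule "timeout must be longer than the checkpoint interval" is a simplification that ignores how long the checkpoint itself takes, and the 1-minute timeout in the last line is an illustrative value rather than what our job actually used. Only the property name `transaction.timeout.ms` and the 15-minute broker default come from Kafka itself.

```java
import java.time.Duration;
import java.util.Properties;

/** Hypothetical helper showing how the timeout relates to the checkpoint interval. */
final class KafkaTxnTimeoutSketch {
    // Broker-side default of transaction.max.timeout.ms (15 minutes).
    static final Duration BROKER_MAX_DEFAULT = Duration.ofMinutes(15);

    /** Builds producer properties carrying the chosen transaction timeout in milliseconds. */
    static Properties producerProps(Duration txnTimeout) {
        Properties props = new Properties();
        props.setProperty("transaction.timeout.ms", Long.toString(txnTimeout.toMillis()));
        return props;
    }

    /**
     * Simplified rule of thumb (an assumption of this sketch): the timeout should be
     * longer than the checkpoint interval and must not exceed the broker's limit.
     */
    static boolean isSafe(Duration checkpointInterval, Duration txnTimeout, Duration brokerMax) {
        return txnTimeout.compareTo(checkpointInterval) > 0
                && txnTimeout.compareTo(brokerMax) <= 0;
    }

    public static void main(String[] args) {
        // These properties would be passed as the producer config of FlinkKafkaProducer.
        Properties props = producerProps(Duration.ofMinutes(15));
        System.out.println(props.getProperty("transaction.timeout.ms")); // 900000

        // 5-minute checkpoints with a 15-minute timeout: within the rule of thumb.
        System.out.println(isSafe(Duration.ofMinutes(5), Duration.ofMinutes(15), BROKER_MAX_DEFAULT)); // true

        // 5-minute checkpoints with an (illustrative) 1-minute timeout: the transaction can expire first.
        System.out.println(isSafe(Duration.ofMinutes(5), Duration.ofMinutes(1), BROKER_MAX_DEFAULT)); // false
    }
}
```

In a real job the checkpoint interval is the value passed to `StreamExecutionEnvironment.enableCheckpointing(...)`. Leaving some headroom above it is sensible, since a slow checkpoint keeps the transaction open for longer than the interval.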

Conclusion

  • Shortening Flink's checkpoint interval resolved the Kafka producer transaction timeout
  • If you do not want to change the checkpoint interval, you can raise transaction.timeout.ms instead
