Kafka 发送端error：This server is not the leader for that topic-partition

kafka版本：0.10.2

问题描述

线上kafka的生产者程序大概每一周都会抛出以下异常后停止，重启后恢复。观察监控发现，异常前网卡流量平稳，不存在抖动。（是不是很神奇）
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.

追踪日志

server.log

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition

stat-change.log

[2019-02-25 15:01:03,023] TRACE Broker 2 received LeaderAndIsr request PartitionState(controllerEpoch=8, leader=3, leaderEpoch=10, isr=[1, 2, 3], zkVersion=19, replicas=[1, 2, 3]) correlation id 202 from controller 1 epoch 8 for partition [__consumer_offsets,36] (state.change.logger)
[2019-02-25 15:01:03,023] TRACE Broker 2 handling LeaderAndIsr request correlationId 202 from controller 1 epoch 8 starting the become-follower transition for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,023] TRACE Broker 2 stopped fetchers as part of become-follower request from controller 1 epoch 8 with correlation id 202 for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 truncated logs and checkpointed recovery boundaries for partition __consumer_offsets-36 as part of become-follower request with correlation id 202 from controller 1 epoch 8 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 started fetcher to new leader as part of become-follower request from controller 1 epoch 8 with correlation id 202 for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 completed LeaderAndIsr request correlationId 202 from controller 1 epoch 8 for the become-follower transition for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,045] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:3,ISR:1,2,3,LeaderEpoch:10,ControllerEpoch:8),ReplicationFactor:3),AllReplicas:1,2,3) for partition __consumer_offsets-36 in response to UpdateMetadata request sent by controller 1 epoch 8 with correlation id 203 (state.change.logger)

通过以下关键字received-->handling-->stopped-->truncated-->started-->completed LeaderAndIsr request-->cached leader info ，可见 partition 的leader选举还是非常快的，毫秒级

错误分析

我们的producer端的代码里没加 reties 参数，默认就发送一次，遇到leader选举时，找不到leader就会发送失败，造成程序停止

解决办法

producer端加上参数 reties=3，重试发送三次（默认100ms重试一次由 retry.backoff.ms控制）；
如果还需要保证消息发送的有序性，记得加上参数 max.in.flight.requests.per.connection = 1 限制客户端在单个连接上能够发送的未响应请求的个数，设置此值是1表示kafka broker在响应请求之前client不能再向同一个broker发送请求。（注意：设置此参数是为了满足必须顺序消费的场景，比如binlog数据）

Kafka 发送端error：This server is not the leader for that topic-partition

问题描述

追踪日志

server.log

stat-change.log

通过以下关键字received-->handling-->stopped-->truncated-->started-->completed LeaderAndIsr request-->cached leader info ，可见 partition 的leader选举还是非常快的，毫秒级

错误分析

解决办法

你可能感兴趣的:(Kafka 发送端error：This server is not the leader for that topic-partition)