Troubleshooting infinite retries in a Kafka consumer

Background

kafka: 0.9
spring-kafka: 2.2.4.RELEASE
kafka-client: 2.0.1
Test environment: the topic has 2 partitions and there is 1 consumer

Problem

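Our project configures a retry-on-failure rule for Kafka consumption, with the maximum retry count maxRetryTimes set to 15. In testing we deliberately throw an exception in the listener, and after the message has been consumed 15 times it is written to the dead-letter queue.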

kafkaListenerContainerFactory.setErrorHandler(new SeekToCurrentErrorHandler(new DeadLetterPublishingRecoverer(kafkaTemplate), maxRetryTimes));

Recently, however, the test environment started showing the consumer retrying a failed message indefinitely.

Solution

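Connecting a local instance to the test environment's Kafka and debugging the consumer showed the following:

  1. The topic has 2 partitions.
  2. There is only one consumer, and its concurrency is set to 1.
  3. Two batches of failing messages are being retried at the same time.

Under these conditions the consumer retries forever.

I stepped through the code in a debugger to understand how spring-kafka retries and what causes the problem.
Start with the configured SeekToCurrentErrorHandler. It seeks the consumer back to the offset of the failed message, so that the consumer fetches the failed message again: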

public class SeekToCurrentErrorHandler implements ContainerAwareErrorHandler {
    @Override
    public void handle(Exception thrownException, List<ConsumerRecord<?, ?>> records,
            Consumer<?, ?> consumer, MessageListenerContainer container) {
        if (!SeekUtils.doSeeks(records, consumer, thrownException, true, this.failureTracker::skip, LOGGER)) {
            throw new KafkaException("Seek to current after exception", thrownException);
        }
        else if (this.commitRecovered) {
            if (container.getContainerProperties().getAckMode().equals(AckMode.MANUAL_IMMEDIATE)) {
                ConsumerRecord<?, ?> record = records.get(0);
                Map<TopicPartition, OffsetAndMetadata> offsetToCommit = Collections.singletonMap(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
                if (container.getContainerProperties().isSyncCommits()) {
                    consumer.commitSync(offsetToCommit);
                }
                else {
                    OffsetCommitCallback commitCallback = container.getContainerProperties().getCommitCallback();
                    if (commitCallback == null) {
                        commitCallback = LOGGING_COMMIT_CALLBACK;
                    }
                    consumer.commitAsync(offsetToCommit, commitCallback);
                }
            }
            else {
                LOGGER.warn("'commitRecovered' ignored, container AckMode must be MANUAL_IMMEDIATE");
            }
        }
    }
}

Here, if (!SeekUtils.doSeeks(records, consumer, thrownException, true, this.failureTracker::skip, LOGGER)) decides whether to keep retrying, and doSeeks repositions the offsets:

public final class SeekUtils {
    public static boolean doSeeks(List<ConsumerRecord<?, ?>> records, Consumer<?, ?> consumer, Exception exception,
            boolean recoverable, BiPredicate<ConsumerRecord<?, ?>, Exception> skipper, Log logger) {
        Map<TopicPartition, Long> partitions = new LinkedHashMap<>();
        AtomicBoolean first = new AtomicBoolean(true);
        AtomicBoolean skipped = new AtomicBoolean();
        records.forEach(record ->  {
            if (recoverable && first.get()) {
                skipped.set(skipper.test(record, exception));
                if (skipped.get() && logger.isDebugEnabled()) {
                    logger.debug("Skipping seek of: " + record);
                }
            }
            if (!recoverable || !first.get() || !skipped.get()) {
                partitions.computeIfAbsent(new TopicPartition(record.topic(), record.partition()),
                        offset -> record.offset());
            }
            first.set(false);
        });
        boolean tracing = logger.isTraceEnabled();
        partitions.forEach((topicPartition, offset) -> {
            try {
                if (tracing) {
                    logger.trace("Seeking: " + topicPartition + " to: " + offset);
                }
                consumer.seek(topicPartition, offset);
            }
            catch (Exception e) {
                logger.error("Failed to seek " + topicPartition + " to " + offset, e);
            }
        });
        return skipped.get();
    }
}

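As you can see, SeekUtils does two main things:

  1. It calls skipper.test(record, exception) to decide whether the failed record should be skipped (that is, no longer retried) and returns that result.
  2. It calls consumer.seek(topicPartition, offset) to reposition the consumer's offsets.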
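
To make the second point concrete, here is a minimal stand-alone sketch of how doSeeks chooses one seek target per partition. It is not spring-kafka code: SeekPlanDemo, Rec and seekPlan are hypothetical names, and Rec is a stand-in that carries only the three fields doSeeks reads from a ConsumerRecord. The list passed in is assumed to be the failed record followed by the rest of the polled batch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone model of the offset selection in SeekUtils.doSeeks (illustration only). */
final class SeekPlanDemo {

    /** Hypothetical stand-in for ConsumerRecord: only topic, partition and offset. */
    static final class Rec {
        final String topic;
        final int partition;
        final long offset;

        Rec(String topic, int partition, long offset) {
            this.topic = topic;
            this.partition = partition;
            this.offset = offset;
        }
    }

    /**
     * @param records      the failed record followed by the unprocessed remainder of the batch
     * @param firstSkipped whether the skipper gave up on the first (failed) record
     * @return "topic-partition" mapped to the offset the consumer would seek to
     */
    static Map<String, Long> seekPlan(List<Rec> records, boolean firstSkipped) {
        Map<String, Long> plan = new LinkedHashMap<>();
        boolean first = true;
        for (Rec r : records) {
            // A skipped first record is left out, so its partition resumes at the next record.
            if (!(first && firstSkipped)) {
                // Only the earliest remaining offset of each partition is kept.
                plan.putIfAbsent(r.topic + "-" + r.partition, r.offset);
            }
            first = false;
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Rec> batch = Arrays.asList(new Rec("t", 0, 5L), new Rec("t", 0, 6L), new Rec("t", 1, 9L));
        System.out.println(seekPlan(batch, false)); // {t-0=5, t-1=9}: offset 5 is fetched again
        System.out.println(seekPlan(batch, true));  // {t-0=6, t-1=9}: offset 5 is left behind
    }
}
```

Note that every partition present in the batch is rewound, not only the one whose record failed.
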
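
The skipper passed in above is this.failureTracker::skip, i.e. FailedRecordTracker.skip:

class FailedRecordTracker {
    private final ThreadLocal<FailedRecord> failures = new ThreadLocal<>(); // intentionally not static
    boolean skip(ConsumerRecord<?, ?> record, Exception exception) {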
        FailedRecord failedRecord = this.failures.get();
        if (failedRecord == null || !failedRecord.getTopic().equals(record.topic())
                || failedRecord.getPartition() != record.partition() || failedRecord.getOffset() != record.offset()) {
            this.failures.set(new FailedRecord(record.topic(), record.partition(), record.offset()));
            return false;
        }
        else {
            if (this.maxFailures >= 0 && failedRecord.incrementAndGet() >= this.maxFailures) {
                this.recoverer.accept(record, exception);
                return true;
            }
            return false;
        }
    }
    
    private static final class FailedRecord {
        private final String topic;
        private final int partition;
        private final long offset;
        private int count;
        FailedRecord(String topic, int partition, long offset) {
            this.topic = topic;
            this.partition = partition;
            this.offset = offset;
            this.count = 1;
        }
        private String getTopic() {
            return this.topic;
        }
        private int getPartition() {
            return this.partition;
        }
        private long getOffset() {
            return this.offset;
        }
        private int incrementAndGet() {
            return ++this.count;
        }
    }    
}    

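FailedRecordTracker.skip checks:

  1. Whether the message the current thread just failed on is the same one it failed on last time (same topic, partition and offset).
  2. If condition 1 holds, whether the retry count for this message has reached the configured maximum (on each retry, the count in the FailedRecord held in the ThreadLocal is incremented by 1).

When both conditions hold, spring-kafka stops retrying and hands the record to the recoverer, which here is the DeadLetterPublishingRecoverer.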
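
The counting rule is easy to get off by one, so here is a small sketch of it under simplifying assumptions: a single thread, a plain field instead of the ThreadLocal, and a string key instead of FailedRecord. SingleSlotTracker and deliveriesUntilSkipped are hypothetical names used only for illustration.

```java
/** Simplified single-thread model of the tracker shown above (illustration only). */
final class SingleSlotTracker {
    private final int maxFailures;
    private String lastKey; // "topic-partition@offset" of the most recent failure
    private int count;      // failures seen so far for lastKey

    SingleSlotTracker(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Returns true when the record should be skipped, i.e. handed to the recoverer. */
    boolean skip(String topic, int partition, long offset) {
        String key = topic + "-" + partition + "@" + offset;
        if (!key.equals(lastKey)) {
            lastKey = key;
            count = 1;      // the first failure already counts as 1
            return false;
        }
        return maxFailures >= 0 && ++count >= maxFailures;
    }

    /** Fails the same record repeatedly and counts deliveries until it is skipped. */
    static int deliveriesUntilSkipped(int maxFailures) {
        SingleSlotTracker tracker = new SingleSlotTracker(maxFailures);
        int deliveries = 0;
        boolean skipped;
        do {
            deliveries++;
            skipped = tracker.skip("orders", 0, 42L);
        } while (!skipped);
        return deliveries;
    }

    public static void main(String[] args) {
        System.out.println(deliveriesUntilSkipped(15)); // 15
    }
}
```

Because the count starts at 1 on the first failure, a record that keeps failing is recovered on its 15th delivery, which matches the behaviour described in the Problem section.
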
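Debugging showed, however, that the condition if (failedRecord == null || !failedRecord.getTopic().equals(record.topic()) || failedRecord.getPartition() != record.partition() || failedRecord.getOffset() != record.offset()) evaluates to true on every call. In other words, every failure is treated as a brand-new failed message: each time, a new FailedRecord is created and stored in the ThreadLocal, overwriting the old one, its count is reset to 1, and skip returns false.
Why does skip return false every time?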

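Go back to the conditions under which the problem appears:

  1. The topic has 2 partitions.
  2. There is only one consumer, and its concurrency is set to 1.
  3. Two batches of failing messages are being retried at the same time.

Testing showed that when concurrency is set equal to the number of partitions, the infinite retry no longer occurs. So what does concurrency actually do?
Back to the source code: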

public class ConcurrentMessageListenerContainer<K, V> extends AbstractMessageListenerContainer<K, V> {
    protected void doStart() {
        if (!isRunning()) {
            checkTopics();
            ContainerProperties containerProperties = getContainerProperties();
            TopicPartitionInitialOffset[] topicPartitions = containerProperties.getTopicPartitions();
            if (topicPartitions != null && this.concurrency > topicPartitions.length) {
                this.logger.warn("When specific partitions are provided, the concurrency must be less than or "
                        + "equal to the number of partitions; reduced from " + this.concurrency + " to "
                        + topicPartitions.length);
                this.concurrency = topicPartitions.length;
            }
            setRunning(true);
            for (int i = 0; i < this.concurrency; i++) {
                KafkaMessageListenerContainer<K, V> container;
                if (topicPartitions == null) {
                    container = new KafkaMessageListenerContainer<>(this, this.consumerFactory, containerProperties);
                }
                else {
                    container = new KafkaMessageListenerContainer<>(this, this.consumerFactory,
                            containerProperties, partitionSubset(containerProperties, i));
                }
                String beanName = getBeanName();
                container.setBeanName((beanName != null ? beanName : "consumer") + "-" + i);
                if (getApplicationEventPublisher() != null) {
                    container.setApplicationEventPublisher(getApplicationEventPublisher());
                }
                container.setClientIdSuffix("-" + i);
                container.setGenericErrorHandler(getGenericErrorHandler());
                container.setAfterRollbackProcessor(getAfterRollbackProcessor());
                container.setEmergencyStop(() -> {
                    stop(() -> {
                        // NOSONAR
                    });
                    publishContainerStoppedEvent();
                });
                container.start();
                this.containers.add(container);
            }
        }
    }
    private TopicPartitionInitialOffset[] partitionSubset(ContainerProperties containerProperties, int i) {
        TopicPartitionInitialOffset[] topicPartitions = containerProperties.getTopicPartitions();
        if (this.concurrency == 1) {
            return topicPartitions;
        }
        else {
            int numPartitions = topicPartitions.length;
            if (numPartitions == this.concurrency) {
                return new TopicPartitionInitialOffset[] { topicPartitions[i] };
            }
            else {
                int perContainer = numPartitions / this.concurrency;
                TopicPartitionInitialOffset[] subset;
                if (i == this.concurrency - 1) {
                    subset = Arrays.copyOfRange(topicPartitions, i * perContainer, topicPartitions.length);
                }
                else {
                    subset = Arrays.copyOfRange(topicPartitions, i * perContainer, (i + 1) * perContainer);
                }
                return subset;
            }
        }
    }    
}    

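As you can see, spring-kafka creates as many KafkaMessageListenerContainer instances as the configured concurrency value. The partitions are divided among these containers (by partitionSubset when partitions are assigned explicitly, otherwise by Kafka's consumer group assignment), and each container has its own consumer thread.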
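
For the explicit-assignment case, the split performed by partitionSubset can be reproduced on plain partition numbers. The sketch below uses hypothetical names (PartitionSubsetDemo, subset) and assumes, as doStart guarantees, that concurrency is not larger than the number of partitions.

```java
import java.util.Arrays;

/** Reproduces the index arithmetic of partitionSubset on plain ints (illustration only). */
final class PartitionSubsetDemo {

    /** Returns the partitions that child container number i would receive. */
    static int[] subset(int[] partitions, int concurrency, int i) {
        if (concurrency == 1) {
            return partitions; // a single container takes everything
        }
        int perContainer = partitions.length / concurrency;
        int from = i * perContainer;
        // The last container also takes whatever is left over after integer division.
        int to = (i == concurrency - 1) ? partitions.length : (i + 1) * perContainer;
        return Arrays.copyOfRange(partitions, from, to);
    }

    public static void main(String[] args) {
        int[] partitions = {0, 1, 2, 3, 4};
        System.out.println(Arrays.toString(subset(partitions, 2, 0))); // [0, 1]
        System.out.println(Arrays.toString(subset(partitions, 2, 1))); // [2, 3, 4]
    }
}
```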

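Now back to the question of why the condition if (failedRecord == null || !failedRecord.getTopic().equals(record.topic()) || failedRecord.getPartition() != record.partition() || failedRecord.getOffset() != record.offset()) is true on every call.
Because concurrency was set to 1, spring-kafka created a single KafkaMessageListenerContainer to consume both partitions. Both partitions are therefore handled by the same thread, and the FailedRecord is stored in a ThreadLocal.
So when the following holds:

  1. Two batches of failing messages need to be retried at the same time.
  2. The two batches belong to different partitions.

this is what happens:

  1. Consuming message-0 from partition-0 fails.
  2. The current message does not match the topic, partition and offset of the message in the ThreadLocal, so a FailedRecord for message-0 is created and put into the ThreadLocal, with its count set to 1.
  3. Consuming message-1 from partition-1 fails.
  4. The current message does not match the topic, partition and offset of the message in the ThreadLocal, so a FailedRecord for message-1 is created and put into the ThreadLocal (overwriting the previous value), with its count set to 1.
  5. Consuming message-0 from partition-0 fails.
  6. The current message does not match the topic, partition and offset of the message in the ThreadLocal, so a FailedRecord for message-0 is created and put into the ThreadLocal (overwriting the previous value), with its count set to 1.
  7. The steps above repeat forever.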
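
The loop above can be reproduced without Kafka. The following sketch is a toy model, not spring-kafka code: ThreadLocalTrackerDemo and its methods are hypothetical names, and a long[] of {partition, offset, count} stands in for FailedRecord. It compares one thread alternating between two partitions with one dedicated thread per partition.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the 2.2.4 tracker: each thread remembers only its last failed record. */
final class ThreadLocalTrackerDemo {
    static final int MAX_FAILURES = 15;

    // {partition, offset, count} of the last failure seen by the current thread
    private final ThreadLocal<long[]> last = new ThreadLocal<>();

    boolean skip(int partition, long offset) {
        long[] failed = last.get();
        if (failed == null || failed[0] != partition || failed[1] != offset) {
            last.set(new long[] {partition, offset, 1}); // different record: count restarts at 1
            return false;
        }
        return ++failed[2] >= MAX_FAILURES;
    }

    /** concurrency = 1: one thread alternates between two partitions. True if anything is ever skipped. */
    static boolean sharedThreadEverSkips(int rounds) {
        ThreadLocalTrackerDemo tracker = new ThreadLocalTrackerDemo();
        for (int i = 0; i < rounds; i++) {
            if (tracker.skip(0, 100L) || tracker.skip(1, 200L)) {
                return true;
            }
        }
        return false;
    }

    /** concurrency = 2: a dedicated thread per partition. Returns deliveries until the record is skipped. */
    static int deliveriesOnOwnThread(ThreadLocalTrackerDemo tracker, int partition, long offset) {
        AtomicInteger deliveries = new AtomicInteger();
        Thread worker = new Thread(() -> {
            do {
                deliveries.incrementAndGet();
            } while (!tracker.skip(partition, offset));
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return deliveries.get();
    }

    public static void main(String[] args) {
        System.out.println(sharedThreadEverSkips(10_000));           // false
        ThreadLocalTrackerDemo tracker = new ThreadLocalTrackerDemo();
        System.out.println(deliveriesOnOwnThread(tracker, 0, 100L)); // 15
        System.out.println(deliveriesOnOwnThread(tracker, 1, 200L)); // 15
    }
}
```

On the shared thread the count never gets past 1, no matter how many rounds are run. With one thread per partition each record reaches the limit after 15 deliveries.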

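When concurrency is set to 2, the same as the number of partitions, partition-0 and partition-1 are handled by separate threads. The FailedRecord objects for message-0 and message-1 are stored in each thread's own ThreadLocal and do not overwrite each other, so the infinite retry no longer occurs.
If a large number of failing messages are being retried at once, more than the number of partitions, could a single KafkaMessageListenerContainer end up retrying several failing messages and loop forever again?
The answer is no. After a message fails, KafkaMessageListenerContainer uses seek to reposition the current partition and fetches again, which returns the message that just failed. Only when that message has finished retrying does the container move on to the next one.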
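
A rough simulation of that behaviour for a single partition is sketched below. It is a simplification under stated assumptions: the partition is modelled as offsets 0 to logEnd - 1, the consumer handles one record per fetch, and the names SinglePartitionRetryDemo, attempts and poison are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simulates seek-based retry within ONE partition (illustration only). */
final class SinglePartitionRetryDemo {

    /**
     * @param logEnd      number of records in the partition (offsets 0 to logEnd - 1)
     * @param poison      offsets whose processing always fails
     * @param maxFailures failure limit before a record goes to deadLetters
     * @param deadLetters receives the offsets that were given up on
     * @return every offset that was handed to the listener, in order
     */
    static List<Long> attempts(long logEnd, Set<Long> poison, int maxFailures, List<Long> deadLetters) {
        List<Long> attempts = new ArrayList<>();
        long position = 0;    // next offset the consumer will fetch
        long lastFailed = -1; // single-slot failure tracker: offset ...
        int count = 0;        // ... and its failure count
        while (position < logEnd) {
            long offset = position;
            attempts.add(offset);
            if (!poison.contains(offset)) {
                position = offset + 1; // success: move on to the next record
                continue;
            }
            boolean skip;
            if (offset != lastFailed) {
                lastFailed = offset;
                count = 1;
                skip = false;
            } else {
                skip = ++count >= maxFailures;
            }
            if (skip) {
                deadLetters.add(offset);
                position = offset + 1; // recovered: continue after the bad record
            } else {
                position = offset;     // seek back: the same record is fetched again
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        List<Long> deadLetters = new ArrayList<>();
        Set<Long> poison = new HashSet<>(Arrays.asList(1L, 3L));
        System.out.println(attempts(5, poison, 3, deadLetters)); // [0, 1, 1, 1, 2, 3, 3, 3, 4]
        System.out.println(deadLetters);                         // [1, 3]
    }
}
```

Offset 3 is not touched until offset 1 has been given up on, so within one partition the tracker only ever sees one failing record at a time.
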
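Note: with multiple consumers that all use the same concurrency value, the rule to configure is "concurrency × number of consumers = number of partitions". Kafka's partition assignment strategies are outside the scope of this article.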

Optimization

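Setting concurrency equal to the number of partitions does solve the problem, but it means each KafkaMessageListenerContainer can consume only one partition, which seems unreasonable.
With that in mind I searched around. Two reports, "Spring Kafka SeekToCurrentErrorHandler maxFailures doesn't work when concurrency level is less than partition number" and "SeekToCurrentErrorHandler with exponential backoff - potential infinite loop", confirm that this is a bug in spring-kafka 2.2.4.RELEASE. The official answer says it was fixed in 2.2.8.RELEASE. However, from my own testing and reading of the source, 2.2.8.RELEASE did not fix it. After upgrading to 2.2.14.RELEASE the bug no longer appears.
How does 2.2.14.RELEASE handle it? From the analysis above, the fault lies in FailedRecordTracker.skip: the ThreadLocal holds only the most recently failed FailedRecord, so messages from different partitions keep overwriting each other.
2.2.14.RELEASE therefore stores the most recent FailedRecord per topic-partition: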

private final ThreadLocal<FailedRecord> failures = new ThreadLocal<>();

was changed to

private final ThreadLocal<Map<TopicPartition, FailedRecord>> failures = new ThreadLocal<>();

When skip runs, it looks up the last failed message for that specific topic-partition, which prevents partitions from overwriting each other.

class FailedRecordTracker {
    private final ThreadLocal<Map<TopicPartition, FailedRecord>> failures = new ThreadLocal<>(); // intentionally not static
    boolean skip(ConsumerRecord<?, ?> record, Exception exception) {
        if (this.noRetries) {
            this.recoverer.accept(record, exception);
            return true;
        }
        Map<TopicPartition, FailedRecord> map = this.failures.get();
        if (map == null) {
            this.failures.set(new HashMap<>());
            map = this.failures.get();
        }
        TopicPartition topicPartition = new TopicPartition(record.topic(), record.partition());
        FailedRecord failedRecord = map.get(topicPartition);
        if (failedRecord == null || failedRecord.getOffset() != record.offset()) {
            failedRecord = new FailedRecord(record.offset());
            map.put(topicPartition, failedRecord);
            return false;
        }
        else if (this.maxFailures > 0 && failedRecord.incrementAndGet() >= this.maxFailures) {
            this.recoverer.accept(record, exception);
            map.remove(topicPartition);
            if (map.isEmpty()) {
                this.failures.remove();
            }
            return true;
        }
        else {
            return false;
        }
    }
}    
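
To check that keeping one entry per topic-partition really breaks the loop, the fixed logic can be modelled the same way as before. This is a sketch under simplifying assumptions: a plain HashMap replaces the ThreadLocal (everything runs on one thread), a long[] of {offset, count} replaces FailedRecord, the noRetries shortcut is left out, and PerPartitionTrackerDemo and roundsUntilBothSkipped are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the per-partition tracker (illustration only, single thread). */
final class PerPartitionTrackerDemo {
    private final int maxFailures;
    // "topic-partition" mapped to {offset, count} of that partition's last failure
    private final Map<String, long[]> failures = new HashMap<>();

    PerPartitionTrackerDemo(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    boolean skip(String topic, int partition, long offset) {
        String key = topic + "-" + partition;
        long[] failed = failures.get(key);
        if (failed == null || failed[0] != offset) {
            failures.put(key, new long[] {offset, 1}); // only this partition's entry is replaced
            return false;
        }
        if (maxFailures > 0 && ++failed[1] >= maxFailures) {
            failures.remove(key);
            return true;
        }
        return false;
    }

    /** One thread alternates between two failing partitions; returns rounds until both are skipped. */
    static int roundsUntilBothSkipped(int maxFailures) {
        PerPartitionTrackerDemo tracker = new PerPartitionTrackerDemo(maxFailures);
        boolean p0 = false;
        boolean p1 = false;
        int rounds = 0;
        while (!(p0 && p1) && rounds < 1_000) { // bounded so the demo always terminates
            rounds++;
            if (!p0) {
                p0 = tracker.skip("orders", 0, 100L);
            }
            if (!p1) {
                p1 = tracker.skip("orders", 1, 200L);
            }
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(roundsUntilBothSkipped(15)); // 15
    }
}
```

Even with a single thread interleaving both partitions, each record now reaches the limit after 15 deliveries.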
