Troubleshooting infinite retries in a Kafka consumer

Background

kafka: 0.9
spring-kafka: 2.2.4.RELEASE
kafka-client: 2.0.1
Test environment: the topic has 2 partitions and there is 1 consumer

Problem

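The project configures a retry-on-failure rule for Kafka consumption, with the maximum retry count maxRetryTimes set to 15. When we deliberately throw an exception in the listener during testing, the message is written to the dead-letter queue after it has been consumed 15 times.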

kafkaListenerContainerFactory.setErrorHandler(new SeekToCurrentErrorHandler(new DeadLetterPublishingRecoverer(kafkaTemplate), maxRetryTimes));

Recently, however, the consumer in the test environment started retrying a failing message indefinitely.

Solution

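Connecting a local instance to the test environment's Kafka and debugging the consumer showed the following:

1. The topic has 2 partitions.
2. There is only one consumer, and its concurrency is set to 1.
3. Two batches of failing messages are being retried at the same time.

Under these conditions the consumer retries forever.

I stepped through with the debugger to understand how spring-kafka retries and what causes the problem.
Start from the configured SeekToCurrentErrorHandler. It seeks the consumer back to the offset of the failed message, so the consumer fetches the failed message again: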

public class SeekToCurrentErrorHandler implements ContainerAwareErrorHandler {
    @Override
    public void handle(Exception thrownException, List<ConsumerRecord<?, ?>> records,
            Consumer<?, ?> consumer, MessageListenerContainer container) {
        if (!SeekUtils.doSeeks(records, consumer, thrownException, true, this.failureTracker::skip, LOGGER)) {
            throw new KafkaException("Seek to current after exception", thrownException);
        }
        else if (this.commitRecovered) {
            if (container.getContainerProperties().getAckMode().equals(AckMode.MANUAL_IMMEDIATE)) {
                ConsumerRecord<?, ?> record = records.get(0);
                Map<TopicPartition, OffsetAndMetadata> offsetToCommit = Collections.singletonMap(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
                if (container.getContainerProperties().isSyncCommits()) {
                    consumer.commitSync(offsetToCommit);
                }
                else {
                    OffsetCommitCallback commitCallback = container.getContainerProperties().getCommitCallback();
                    if (commitCallback == null) {
                        commitCallback = LOGGING_COMMIT_CALLBACK;
                    }
                    consumer.commitAsync(offsetToCommit, commitCallback);
                }
            }
            else {
                LOGGER.warn("'commitRecovered' ignored, container AckMode must be MANUAL_IMMEDIATE");
            }
        }
    }
}

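Here, if (!SeekUtils.doSeeks(records, consumer, thrownException, true, this.failureTracker::skip, LOGGER)) decides whether the message still needs to be retried, and doSeeks also repositions the offset. When doSeeks returns false, the handler throws a KafkaException and the message is fetched again on the next poll.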

public final class SeekUtils {
    public static boolean doSeeks(List<ConsumerRecord<?, ?>> records, Consumer<?, ?> consumer, Exception exception,
            boolean recoverable, BiPredicate<ConsumerRecord<?, ?>, Exception> skipper, Log logger) {
        Map<TopicPartition, Long> partitions = new LinkedHashMap<>();
        AtomicBoolean first = new AtomicBoolean(true);
        AtomicBoolean skipped = new AtomicBoolean();
        records.forEach(record ->  {
            if (recoverable && first.get()) {
                skipped.set(skipper.test(record, exception));
                if (skipped.get() && logger.isDebugEnabled()) {
                    logger.debug("Skipping seek of: " + record);
                }
            }
            if (!recoverable || !first.get() || !skipped.get()) {
                partitions.computeIfAbsent(new TopicPartition(record.topic(), record.partition()),
                        offset -> record.offset());
            }
            first.set(false);
        });
        boolean tracing = logger.isTraceEnabled();
        partitions.forEach((topicPartition, offset) -> {
            try {
                if (tracing) {
                    logger.trace("Seeking: " + topicPartition + " to: " + offset);
                }
                consumer.seek(topicPartition, offset);
            }
            catch (Exception e) {
                logger.error("Failed to seek " + topicPartition + " to " + offset, e);
            }
        });
        return skipped.get();
    }
}

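SeekUtils.doSeeks does two main things:

1. It calls skipper.test(record, exception) on the first record to decide whether retries are exhausted, and uses that result as its return value (true means the record was recovered and is skipped; false means it will be retried).
2. It calls consumer.seek(topicPartition, offset) to reposition the consumer offset of each affected partition.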
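
To make the second point concrete, here is a minimal sketch of how the seek targets are chosen. It is not spring-kafka code: DemoRecord and SeekPlanSketch are hypothetical names, the topic name "orders" is made up, and the firstSkipped flag stands in for the result of skipper.test on the first record. It only mirrors the computeIfAbsent logic shown above, using plain JDK types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ConsumerRecord: only the fields doSeeks looks at.
final class DemoRecord {
    final String topic;
    final int partition;
    final long offset;

    DemoRecord(String topic, int partition, long offset) {
        this.topic = topic;
        this.partition = partition;
        this.offset = offset;
    }
}

class SeekPlanSketch {
    // Returns "topic-partition" -> offset, i.e. what would be passed to consumer.seek().
    // firstSkipped plays the role of skipper.test(...) for the first (failed) record.
    static Map<String, Long> seekTargets(List<DemoRecord> remaining, boolean firstSkipped) {
        Map<String, Long> targets = new LinkedHashMap<>();
        boolean first = true;
        for (DemoRecord r : remaining) {
            // The first record is left out only when it has been recovered (skipped).
            if (!(first && firstSkipped)) {
                // Keep the earliest remaining offset of each partition.
                targets.putIfAbsent(r.topic + "-" + r.partition, r.offset);
            }
            first = false;
        }
        return targets;
    }

    public static void main(String[] args) {
        List<DemoRecord> remaining = Arrays.asList(
                new DemoRecord("orders", 0, 5L),   // the record that just failed
                new DemoRecord("orders", 0, 6L),
                new DemoRecord("orders", 1, 9L));
        System.out.println(seekTargets(remaining, false)); // {orders-0=5, orders-1=9}
        System.out.println(seekTargets(remaining, true));  // {orders-0=6, orders-1=9}
    }
}
```

While the failed record still has retries left, its own offset is the seek target, so the next poll returns it again. Once it has been recovered, the seek target for that partition moves to the following record.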

class FailedRecordTracker {
    private final ThreadLocal<FailedRecord> failures = new ThreadLocal<>(); // intentionally not static
    boolean skip(ConsumerRecord<?, ?> record, Exception exception) {
        FailedRecord failedRecord = this.failures.get();
        if (failedRecord == null || !failedRecord.getTopic().equals(record.topic())
                || failedRecord.getPartition() != record.partition() || failedRecord.getOffset() != record.offset()) {
            this.failures.set(new FailedRecord(record.topic(), record.partition(), record.offset()));
            return false;
        }
        else {
            if (this.maxFailures >= 0 && failedRecord.incrementAndGet() >= this.maxFailures) {
                this.recoverer.accept(record, exception);
                return true;
            }
            return false;
        }
    }
    
    private static final class FailedRecord {
        private final String topic;
        private final int partition;
        private final long offset;
        private int count;
        FailedRecord(String topic, int partition, long offset) {
            this.topic = topic;
            this.partition = partition;
            this.offset = offset;
            this.count = 1;
        }
        private String getTopic() {
            return this.topic;
        }
        private int getPartition() {
            return this.partition;
        }
        private long getOffset() {
            return this.offset;
        }
        private int incrementAndGet() {
            return ++this.count;
        }
    }    
}

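FailedRecordTracker.skip, the skipper passed to doSeeks above, checks two things:

1. Whether the message the current thread is consuming is the same one that failed last time (same topic, partition and offset).
2. If condition 1 holds, whether the retry count of this message has reached the configured maximum (on each retry, the count of the FailedRecord stored in the ThreadLocal is incremented by 1).

When both conditions hold, spring-kafka hands the message to the recoverer (here, the DeadLetterPublishingRecoverer) and stops retrying it.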
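However, debugging showed that the condition if (failedRecord == null || !failedRecord.getTopic().equals(record.topic()) || failedRecord.getPartition() != record.partition() || failedRecord.getOffset() != record.offset()) evaluated to true every time. In other words, the message was always treated as a new failure, so each time a new FailedRecord was created and stored in the ThreadLocal, overwriting the old value, with its retry count reset to 1. As a result, skip returned false every time.
Why does it always return false here?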

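Go back to the conditions that trigger the problem:

1. The topic has 2 partitions.
2. There is only one consumer, and its concurrency is set to 1.
3. Two batches of failing messages are being retried at the same time.

Testing showed that when concurrency is set equal to the number of partitions, the infinite retry does not occur. So what exactly does concurrency do?
Reading further into the source: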

public class ConcurrentMessageListenerContainer<K, V> extends AbstractMessageListenerContainer<K, V> {
    protected void doStart() {
        if (!isRunning()) {
            checkTopics();
            ContainerProperties containerProperties = getContainerProperties();
            TopicPartitionInitialOffset[] topicPartitions = containerProperties.getTopicPartitions();
            if (topicPartitions != null && this.concurrency > topicPartitions.length) {
                this.logger.warn("When specific partitions are provided, the concurrency must be less than or "
                        + "equal to the number of partitions; reduced from " + this.concurrency + " to "
                        + topicPartitions.length);
                this.concurrency = topicPartitions.length;
            }
            setRunning(true);
            for (int i = 0; i < this.concurrency; i++) {
                KafkaMessageListenerContainer<K, V> container;
                if (topicPartitions == null) {
                    container = new KafkaMessageListenerContainer<>(this, this.consumerFactory, containerProperties);
                }
                else {
                    container = new KafkaMessageListenerContainer<>(this, this.consumerFactory,
                            containerProperties, partitionSubset(containerProperties, i));
                }
                String beanName = getBeanName();
                container.setBeanName((beanName != null ? beanName : "consumer") + "-" + i);
                if (getApplicationEventPublisher() != null) {
                    container.setApplicationEventPublisher(getApplicationEventPublisher());
                }
                container.setClientIdSuffix("-" + i);
                container.setGenericErrorHandler(getGenericErrorHandler());
                container.setAfterRollbackProcessor(getAfterRollbackProcessor());
                container.setEmergencyStop(() -> {
                    stop(() -> {
                        // NOSONAR
                    });
                    publishContainerStoppedEvent();
                });
                container.start();
                this.containers.add(container);
            }
        }
    }
    private TopicPartitionInitialOffset[] partitionSubset(ContainerProperties containerProperties, int i) {
        TopicPartitionInitialOffset[] topicPartitions = containerProperties.getTopicPartitions();
        if (this.concurrency == 1) {
            return topicPartitions;
        }
        else {
            int numPartitions = topicPartitions.length;
            if (numPartitions == this.concurrency) {
                return new TopicPartitionInitialOffset[] { topicPartitions[i] };
            }
            else {
                int perContainer = numPartitions / this.concurrency;
                TopicPartitionInitialOffset[] subset;
                if (i == this.concurrency - 1) {
                    subset = Arrays.copyOfRange(topicPartitions, i * perContainer, topicPartitions.length);
                }
                else {
                    subset = Arrays.copyOfRange(topicPartitions, i * perContainer, (i + 1) * perContainer);
                }
                return subset;
            }
        }
    }    
}

As the code shows, spring-kafka creates as many KafkaMessageListenerContainer instances as the configured concurrency. The partitions are divided evenly among them, and each container has its own consumer thread.
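
The index arithmetic in partitionSubset can be tried out in isolation. The sketch below is my own simplification, not library code: PartitionSplitSketch is a hypothetical name, partitions are represented by their index in the configured array, and concurrency is assumed to be no larger than the partition count (doStart already clamps it). Note that this split only applies when explicit partitions are configured. When topicPartitions is null, each container subscribes on its own and the assignment is left to Kafka's consumer group.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionSplitSketch {
    // Partition indexes handled by container i, following partitionSubset's arithmetic.
    static List<Integer> subset(int numPartitions, int concurrency, int i) {
        int perContainer = numPartitions / concurrency;
        int from = i * perContainer;
        // The last container also takes whatever is left over after integer division.
        int to = (i == concurrency - 1) ? numPartitions : (i + 1) * perContainer;
        List<Integer> result = new ArrayList<>();
        for (int p = from; p < to; p++) {
            result.add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(subset(2, 1, 0)); // [0, 1]    one container owns both partitions
        System.out.println(subset(2, 2, 0)); // [0]
        System.out.println(subset(2, 2, 1)); // [1]
        System.out.println(subset(5, 2, 0)); // [0, 1]
        System.out.println(subset(5, 2, 1)); // [2, 3, 4] last container gets the remainder
    }
}
```

The first line of output is the setup from this article: with concurrency 1, a single container, and therefore a single thread, handles both partitions.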

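Now back to the question of why the condition if (failedRecord == null || !failedRecord.getTopic().equals(record.topic()) || failedRecord.getPartition() != record.partition() || failedRecord.getOffset() != record.offset()) evaluates to true, and skip therefore returns false, every time.
Because concurrency was set to 1, spring-kafka created a single KafkaMessageListenerContainer to consume both partitions. Both partitions were therefore handled by the same thread, and the FailedRecord is stored in a ThreadLocal.
So when the following holds:

1. Two batches of failing messages need to be retried at the same time.
2. The two batches belong to different partitions.

this is what happens:

1. Message-0 from partition-0 fails.
2. The current message does not match the topic, partition and offset of the one in the ThreadLocal, so a FailedRecord for message-0 is created and stored in the ThreadLocal with a retry count of 1.
3. Message-1 from partition-1 fails.
4. The current message does not match the topic, partition and offset of the one in the ThreadLocal, so a FailedRecord for message-1 is created and stored in the ThreadLocal (overwriting the previous value) with a retry count of 1.
5. Message-0 from partition-0 fails again.
6. The current message does not match the topic, partition and offset of the one in the ThreadLocal, so a FailedRecord for message-0 is created and stored in the ThreadLocal (overwriting the previous value) with a retry count of 1.
7. These steps repeat forever.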
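
The steps above can be reproduced without Kafka. The following is a rough simulation, not the library class: LegacyTrackerSketch is a hypothetical name, the ThreadLocal is replaced by a plain field because only one thread is involved, the topic name "demo" is made up, and the failing messages are assumed to alternate between partitions exactly as in the steps above.

```java
class LegacyTrackerSketch {
    private final int maxFailures;
    private String lastFailed; // single slot, like the ThreadLocal<FailedRecord> in 2.2.4
    private int count;

    LegacyTrackerSketch(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    // Returns true once the message should be recovered (sent to the dead-letter queue).
    boolean skip(String topic, int partition, long offset) {
        String key = topic + "-" + partition + "@" + offset;
        if (!key.equals(lastFailed)) {
            lastFailed = key; // a different message: overwrite the slot and restart at 1
            count = 1;
            return false;
        }
        return maxFailures >= 0 && ++count >= maxFailures;
    }

    // Fails one message per listed partition in turn, round after round.
    // Returns the round in which a message was first recovered, or -1 if that
    // never happened within maxRounds.
    static int roundsUntilRecovered(int maxFailures, int[] partitions, int maxRounds) {
        LegacyTrackerSketch tracker = new LegacyTrackerSketch(maxFailures);
        for (int round = 1; round <= maxRounds; round++) {
            for (int p : partitions) {
                if (tracker.skip("demo", p, 0L)) {
                    return round;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // One failing partition: recovered on the 15th failure.
        System.out.println(roundsUntilRecovered(15, new int[] {0}, 1000));    // 15
        // Two failing partitions on one thread: the slot is overwritten every time.
        System.out.println(roundsUntilRecovered(15, new int[] {0, 1}, 1000)); // -1
    }
}
```

The cap of 1000 rounds is only there so that the simulation terminates. In the real consumer nothing bounds the loop.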

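When concurrency is set to 2, matching the number of partitions, partition-0 and partition-1 are handled by separate threads that process message-0 and message-1 respectively. Each FailedRecord is stored in its own thread's ThreadLocal and is never overwritten, so the infinite retry goes away.
If a large number of failing messages are being retried at once, more than the number of partitions, could a single KafkaMessageListenerContainer end up retrying several failing messages and fall into the infinite retry again?
The answer is no. After a message fails, KafkaMessageListenerContainer seeks the current partition back and fetches again, which returns the message that just failed. Only after that message has finished retrying does the container move on to the next one.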
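
A small simulation of a single partition illustrates this. It is a sketch under simplifying assumptions: InOrderRetrySketch is a hypothetical name, the consumer is reduced to a position counter, every offset below firstGoodOffset is assumed to fail on every attempt, and maxFailures is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;

class InOrderRetrySketch {
    // Returns the offsets delivered to the listener, in order, for one partition.
    static List<Long> deliveries(long endOffset, long firstGoodOffset, int maxFailures) {
        List<Long> delivered = new ArrayList<>();
        long position = 0;    // next offset the consumer will fetch
        long lastFailed = -1; // single-slot tracker, as in 2.2.4
        int count = 0;
        while (position < endOffset) {
            long offset = position;
            delivered.add(offset);
            if (offset >= firstGoodOffset) {
                position = offset + 1; // processed successfully, move on
                continue;
            }
            boolean recovered;
            if (offset != lastFailed) {
                lastFailed = offset;
                count = 1;
                recovered = false;
            } else {
                recovered = ++count >= maxFailures;
            }
            // Seek back to the failed offset, or step past it once it is recovered.
            position = recovered ? offset + 1 : offset;
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Offsets 0 and 1 always fail, offset 2 succeeds, maxFailures = 3.
        System.out.println(deliveries(3, 2, 3)); // [0, 0, 0, 1, 1, 1, 2]
    }
}
```

Within one partition the failing offsets never interleave, so the single slot is not overwritten until the current message has been recovered.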
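Note: when there are multiple consumers with the same concurrency value, the sizing rule is "concurrency * number of consumers = number of partitions". Kafka's partition assignment strategy is outside the scope of this article.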

Optimization

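Setting concurrency equal to the number of partitions does solve the problem, but it restricts each KafkaMessageListenerContainer to a single partition, which seems unreasonable.
With that in mind, I searched for more information. The questions "Spring Kafka SeekToCurrentErrorHandler maxFailures doesn't work when concurrency level is less than partition number" and "SeekToCurrentErrorHandler with exponential backoff - potential infinite loop" confirm that this is a bug in spring-kafka 2.2.4.RELEASE, and the official answer says it was fixed in 2.2.8.RELEASE. In my own testing and reading of the source, however, 2.2.8.RELEASE did not fix it. After upgrading to 2.2.14.RELEASE the bug no longer occurred.
How does 2.2.14.RELEASE handle it? From the analysis above, the flaw is in the check in FailedRecordTracker.skip: the ThreadLocal holds only the most recently failed FailedRecord, so messages from different partitions keep overwriting each other.
So 2.2.14.RELEASE stores the last FailedRecord for each topic-partition instead.
The field changes from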

private final ThreadLocal<FailedRecord> failures = new ThreadLocal<>();

to

private final ThreadLocal<Map<TopicPartition, FailedRecord>> failures = new ThreadLocal<>();

When skip runs, it can look up the last failed message of each partition by its topic-partition, which avoids the mutual overwriting.

class FailedRecordTracker {
    private final ThreadLocal<Map<TopicPartition, FailedRecord>> failures = new ThreadLocal<>(); // intentionally not static
    boolean skip(ConsumerRecord<?, ?> record, Exception exception) {
        if (this.noRetries) {
            this.recoverer.accept(record, exception);
            return true;
        }
        Map<TopicPartition, FailedRecord> map = this.failures.get();
        if (map == null) {
            this.failures.set(new HashMap<>());
            map = this.failures.get();
        }
        TopicPartition topicPartition = new TopicPartition(record.topic(), record.partition());
        FailedRecord failedRecord = map.get(topicPartition);
        if (failedRecord == null || failedRecord.getOffset() != record.offset()) {
            failedRecord = new FailedRecord(record.offset());
            map.put(topicPartition, failedRecord);
            return false;
        }
        else if (this.maxFailures > 0 && failedRecord.incrementAndGet() >= this.maxFailures) {
            this.recoverer.accept(record, exception);
            map.remove(topicPartition);
            if (map.isEmpty()) {
                this.failures.remove();
            }
            return true;
        }
        else {
            return false;
        }
    }
}
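
To check that keeping one entry per partition is enough, the same kind of interleaved failure can be run against a simplified version of the new logic. As before, this is a sketch and not the library class: PerPartitionTrackerSketch is a hypothetical name, the ThreadLocal is dropped because a single thread is simulated, the noRetries shortcut is left out, and the topic name "demo" is made up.

```java
import java.util.HashMap;
import java.util.Map;

class PerPartitionTrackerSketch {
    private final int maxFailures;
    // "topic-partition" -> {offset, count}: one entry per partition, as in the 2.2.14 code
    private final Map<String, long[]> failures = new HashMap<>();

    PerPartitionTrackerSketch(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    // Returns true once the message should be recovered.
    boolean skip(String topic, int partition, long offset) {
        String topicPartition = topic + "-" + partition;
        long[] failed = failures.get(topicPartition);
        if (failed == null || failed[0] != offset) {
            failures.put(topicPartition, new long[] {offset, 1});
            return false;
        }
        if (maxFailures > 0 && ++failed[1] >= maxFailures) {
            failures.remove(topicPartition);
            return true;
        }
        return false;
    }

    // Alternates a failing message on partition 0 and partition 1 and returns the
    // round in which each one was recovered (-1 if not recovered within maxRounds).
    static int[] recoveryRounds(int maxFailures, int maxRounds) {
        PerPartitionTrackerSketch tracker = new PerPartitionTrackerSketch(maxFailures);
        int[] recoveredAt = {-1, -1};
        for (int round = 1; round <= maxRounds; round++) {
            for (int p = 0; p < 2; p++) {
                if (recoveredAt[p] == -1 && tracker.skip("demo", p, 0L)) {
                    recoveredAt[p] = round;
                }
            }
        }
        return recoveredAt;
    }

    public static void main(String[] args) {
        int[] rounds = recoveryRounds(15, 1000);
        System.out.println(rounds[0] + ", " + rounds[1]); // 15, 15
    }
}
```

Both messages are now recovered on their 15th failure even though they share one thread, which matches what I observed after upgrading.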
