https://kafka.apache.org/23/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.3.0</version>
</dependency>
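KafkaConsumer consumes messages from a Kafka cluster. It transparently handles the failure of the brokers it is connected to, adapts as topic partitions move between brokers in the cluster, and uses consumer groups to load-balance message consumption.
KafkaConsumer maintains TCP connections to the brokers it needs. A KafkaConsumer that is no longer in use must be closed, otherwise these connections leak. Note that KafkaConsumer is not thread-safe.
This version of KafkaConsumer works with Kafka brokers of version 0.10.0 or newer; brokers older than that are not compatible.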
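A Kafka cluster stores messages in partitions, and every message in a partition has a unique numeric position within it, called its offset. A consumer reads a partition by following these offsets, one message after another. How do we know how far a consumer has read? By the consumer position, which is the offset of the last message handed to the consumer plus one, that is, the offset of the next message to be read. For example, once the consumer has consumed the message at offset 4, its position is 5. This is like iterating over a Java array: after processing the element at index 4, the loop index is 5.
A consumer has two notions of position that need to be understood: the consumer position described above, and the committed position, which is the last position that has been stored securely and from which consumption resumes after a failure or restart.
These separate notions give the consumer the means to decide for itself which messages count as consumed; more detail follows below.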
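To make the difference between the consumer position and the committed position concrete, here is a minimal toy model of a single partition. It is only a sketch: the class PositionDemo and its methods are hypothetical names invented for this illustration and are not part of the Kafka client API.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of one partition and one consumer; NOT the Kafka client API.
class PositionDemo {
    private final List<String> log;   // records stored at offsets 0, 1, 2, ...
    private int position = 0;         // offset of the next record to hand out
    private int committed = 0;        // last position that was stored durably

    PositionDemo(List<String> log) { this.log = log; }

    // Hand out up to max records and advance the position past them.
    List<String> poll(int max) {
        int end = Math.min(position + max, log.size());
        List<String> batch = log.subList(position, end);
        position = end;
        return batch;
    }

    // Store the current position as the committed position.
    void commit() { committed = position; }

    // After a crash or restart, reading resumes from the committed position.
    void restart() { position = committed; }

    int position() { return position; }
    int committed() { return committed; }

    public static void main(String[] args) {
        PositionDemo c = new PositionDemo(Arrays.asList("a", "b", "c", "d", "e", "f", "g"));
        c.poll(5);      // hands out offsets 0..4
        System.out.println("position=" + c.position() + " committed=" + c.committed());
        c.commit();     // committed position becomes 5
        c.poll(2);      // hands out offsets 5..6, never committed
        c.restart();    // falls back to the committed position
        System.out.println("position after restart=" + c.position());
    }
}
```

In this model the records at offsets 5 and 6 are handed out a second time after the restart, because only position 5 had been committed.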
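group.id: a KafkaConsumer configuration setting. Consumers that share the same group.id are treated by the Kafka cluster as members of the same group.
Each consumer in a group chooses the topics it subscribes to by calling subscribe(). Kafka delivers each message of a subscribed topic to exactly one consumer in each subscribing group. It does this by dividing the topic's partitions evenly among the consumers of the group. For example, if a topic has 4 partitions and a group with 2 consumers subscribes to it, each consumer is assigned 2 partitions.
Membership in a consumer group is dynamic. If a consumer in the group fails, the partitions assigned to it are automatically reassigned to the other consumers in the group, which gives failover. Likewise, when a new consumer joins the group, some partitions are moved from existing members to the new one, which rebalances the load. This is discussed in more detail below. The same rebalancing happens when partitions are added to a subscribed topic, or when a new topic is created and subscribed to.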
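The way partitions are spread over the members of a group, and redistributed when membership changes, can be sketched with a simple round-robin split. This is an illustration only: AssignmentDemo and assign are hypothetical names, the consumer names are made up, and Kafka's real partition assignors are configurable and differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative round-robin split of partitions over group members.
// This is NOT Kafka's assignor, only a sketch of the idea.
class AssignmentDemo {
    // Assumes consumers is non-empty.
    static Map<String, List<Integer>> assign(int partitionCount, List<String> consumers) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String consumer : consumers) {
            result.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = consumers.get(partition % consumers.size());
            result.get(owner).add(partition);
        }
        return result;
    }

    public static void main(String[] args) {
        // 4 partitions, 2 consumers: each consumer owns 2 partitions.
        System.out.println(assign(4, Arrays.asList("c1", "c2")));
        // c2 fails: its partitions go to the remaining member.
        System.out.println(assign(4, Arrays.asList("c1")));
        // A third consumer joins: some partitions move to it.
        System.out.println(assign(4, Arrays.asList("c1", "c2", "c3")));
    }
}
```

Recomputing the whole assignment on every membership change is the simplest possible model of a rebalance; it says nothing about how Kafka coordinates the change between consumers.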
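When a rebalance happens within the group, a consumer can be notified through a ConsumerRebalanceListener. A consumer can also assign topic partitions to itself manually by calling assign(); in that case dynamic partition assignment and rebalancing within the group are disabled for that consumer.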
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "test");
props.setProperty("enable.auto.commit", "true");
props.setProperty("auto.commit.interval.ms", "1000");
props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
    }
}
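Some of the settings in the code above were explained earlier. The remaining ones are: bootstrap.servers lists one or more brokers used to discover the rest of the cluster; enable.auto.commit set to true makes the consumer commit offsets automatically, at the interval given by auto.commit.interval.ms (here every 1000 ms); key.deserializer and value.deserializer name the classes that turn the key and value bytes of a record into objects. The next example sets enable.auto.commit to false and commits offsets manually instead: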
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "test");
props.setProperty("enable.auto.commit", "false");
props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
final int minBatchSize = 200;
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    if (buffer.size() >= minBatchSize) {
        insertIntoDb(buffer);
        consumer.commitSync();
        buffer.clear();
    }
}
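Automatic commit means that once messages have been returned by poll(), KafkaConsumer treats them as consumed and commits their offsets on its own schedule. In the example above, if the actual processing fails, for instance the database insert, then with automatic commit those messages can be lost. With manual commit the consumer decides which messages have been consumed and which have failed; in the code above, messages whose processing failed are not committed, so they can be consumed again and are not lost. Manual commit has a drawback too: if the messages have been processed but the consumer fails before commitSync() succeeds, consumption later resumes from the old committed position and the same messages are delivered again. In the code above, that means duplicate rows are inserted into the database. Which trade-off to accept is up to the application.
The example above calls commitSync() with no arguments to commit manually. Sometimes the consumer wants finer control over this step. The following example from the official documentation shows how to commit an explicit position for each partition: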
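The trade-off between losing and duplicating messages comes down to whether the commit happens before or after processing. The following toy simulation sketches both orderings for one batch of offsets 0 to 9, with a simulated crash while processing offset 6. It uses no Kafka classes; CommitOrderDemo and run are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation, not Kafka code: one batch of offsets 0..9 and a crash
// while processing offset 6 during the first run.
class CommitOrderDemo {
    // Returns every offset that was processed, across the crash and the restart.
    static List<Integer> run(boolean commitBeforeProcessing) {
        final int batchEnd = 10;
        final int crashAt = 6;
        List<Integer> processed = new ArrayList<>();
        int committed = 0;

        // First run: the batch is delivered, then processing crashes at crashAt.
        if (commitBeforeProcessing) {
            committed = batchEnd;          // offsets committed as soon as delivered
        }
        for (int offset = 0; offset < batchEnd; offset++) {
            if (offset == crashAt) {
                break;                     // simulated crash
            }
            processed.add(offset);
        }
        // With commit-after-processing, the commit is never reached in this run.

        // Second run: resume from the committed position, no crash this time.
        for (int offset = committed; offset < batchEnd; offset++) {
            processed.add(offset);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("commit first:  " + run(true));
        System.out.println("process first: " + run(false));
    }
}
```

With the commit first, offsets 6 to 9 are never processed (messages lost). With processing first, offsets 0 to 5 are processed twice (duplicates), but nothing is lost.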
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(Long.MAX_VALUE));
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                System.out.println(record.offset() + ": " + record.value());
            }
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}
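As the code shows, the committed position should be the position of the next message that the application will read, which is normally the offset of the last successfully processed message plus one.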
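The "last processed offset plus one" rule can be sketched without any Kafka classes. In the snippet below each record is represented as a {partition, offset} pair; CommitOffsets and offsetsToCommit are hypothetical names for this illustration, and a real consumer would pass TopicPartition and OffsetAndMetadata objects to commitSync() as in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: computes, per partition, the position that should be committed
// after all given records have been processed successfully.
class CommitOffsets {
    // records: each element is {partition, offset}.
    static Map<Integer, Long> offsetsToCommit(long[][] records) {
        Map<Integer, Long> next = new LinkedHashMap<>();
        for (long[] record : records) {
            int partition = (int) record[0];
            long offset = record[1];
            // Keep the highest "offset + 1" seen for this partition.
            next.merge(partition, offset + 1, Long::max);
        }
        return next;
    }

    public static void main(String[] args) {
        long[][] polled = { {0, 7}, {0, 8}, {1, 3}, {0, 9} };
        // Partition 0 was processed up to offset 9, partition 1 up to offset 3.
        System.out.println(offsetsToCommit(polled));
    }
}
```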
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
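Once partitions have been assigned manually, the consumer can call poll() to consume messages. Manual assignment disables automatic rebalancing among the consumers in the group: if a consumer fails, its partitions are not reassigned to other consumers in the group, so the messages in those partitions go unconsumed until the failed consumer is brought back. Manual assignment can also lead to conflicting committed positions, and therefore to lost or duplicated messages, so it is recommended to give each consumer that assigns partitions manually its own consumer group id (group.id).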
Note that it isn’t possible to mix manual partition assignment (i.e. using assign) with dynamic partition assignment through topic subscription (i.e. using subscribe).
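Explanation: automatic and manual partition assignment cannot be mixed, that is, the same consumer instance cannot call both assign() and subscribe().
Note: when the Kafka cluster assigns partitions automatically, exactly-once consumption semantics are harder to get right, because the implementation depends on the onPartitionsRevoked and onPartitionsAssigned callbacks. The ConsumerRebalanceListener Javadoc states that all consumers in the group invoke onPartitionsRevoked before any of them invokes onPartitionsAssigned, but a consumer that crashes never runs onPartitionsRevoked, so a position it was supposed to save there is missing when another consumer takes over its partitions. For that reason it is simpler and safer to implement exactly-once semantics with manual partition assignment. If that case can be handled, automatic assignment is of course preferable, since it gives better failover.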
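import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class KafkaConsumerRunner implements Runnable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final KafkaConsumer<String, String> consumer;

    public KafkaConsumerRunner(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void run() {
        try {
            consumer.subscribe(Arrays.asList("topic"));
            while (!closed.get()) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(10000));
                // Handle new records
            }
        } catch (WakeupException e) {
            // Ignore exception if closing
            if (!closed.get()) throw e;
        } finally {
            // Always close the consumer on the thread that owns it
            consumer.close();
        }
    }

    // Shutdown hook which can be called from a separate thread
    public void shutdown() {
        closed.set(true);
        consumer.wakeup();
    }
}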
From another thread, the KafkaConsumer instance can be shut down like this:
closed.set(true);
consumer.wakeup();
Note: the official documentation recommends interrupting a running KafkaConsumer with wakeup() rather than with thread interrupts. The original text reads: Note that while it is possible to use thread interrupts instead of wakeup() to abort a blocking operation (in which case, InterruptException will be raised), we discourage their use since they may cause a clean shutdown of the consumer to be aborted. Interrupts are mainly supported for those cases where using wakeup() is impossible, e.g. when a consumer thread is managed by code that is unaware of the Kafka client.
The official documentation also says:
We have intentionally avoided implementing a particular threading model for processing. This leaves several options for implementing multi-threaded processing of records.
- One Consumer Per Thread
A simple option is to give each thread its own consumer instance. Here are the pros and cons of this approach:
PRO: It is the easiest to implement
PRO: It is often the fastest as no inter-thread co-ordination is needed
PRO: It makes in-order processing on a per-partition basis very easy to implement (each thread just processes messages in the order it receives them).
CON: More consumers means more TCP connections to the cluster (one per thread). In general Kafka handles connections very efficiently so this is generally a small cost.
CON: Multiple consumers means more requests being sent to the server and slightly less batching of data which can cause some drop in I/O throughput.
CON: The number of total threads across all processes will be limited by the total number of partitions.
- Decouple Consumption and Processing
Another alternative is to have one or more consumer threads that do all data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing. This option likewise has pros and cons:
PRO: This option allows independently scaling the number of consumers and processors. This makes it possible to have a single consumer that feeds many processor threads, avoiding any limitation on partitions.
CON: Guaranteeing order across the processors requires particular care as the threads will execute independently an earlier chunk of data may actually be processed after a later chunk of data just due to the luck of thread execution timing. For processing that has no ordering requirements this is not a problem.
CON: Manually committing the position becomes harder as it requires that all threads co-ordinate to ensure that processing is complete for that partition.
There are many possible variations on this approach. For example each processor thread can have its own queue, and the consumer threads can hash into these queues using the TopicPartition to ensure in-order consumption and simplify commit.
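The last variation quoted above, one queue per processor thread with records routed by partition, can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: PartitionedWorkers and its methods are hypothetical names, a record is reduced to a partition number and an offset, and the "processing" step merely records the offset. Each worker is a single-thread executor, so it has its own queue and handles records in submission order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stdlib-only sketch: records of one partition always go to the same worker,
// so per-partition order is preserved while partitions run in parallel.
class PartitionedWorkers {
    private final List<ExecutorService> workers = new ArrayList<>();
    // Offsets in the order they were processed, per partition (for inspection).
    final Map<Integer, List<Long>> processed = new ConcurrentHashMap<>();

    PartitionedWorkers(int workerCount) {
        for (int i = 0; i < workerCount; i++) {
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    // Called by the consuming thread for every record it has polled.
    void dispatch(int partition, long offset) {
        ExecutorService worker = workers.get(Math.floorMod(partition, workers.size()));
        worker.execute(() -> processed
                .computeIfAbsent(partition, p -> Collections.synchronizedList(new ArrayList<>()))
                .add(offset));
    }

    // Stop accepting work and wait for the queued records to be processed.
    void shutdown() {
        for (ExecutorService worker : workers) {
            worker.shutdown();
        }
        try {
            for (ExecutorService worker : workers) {
                worker.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PartitionedWorkers pool = new PartitionedWorkers(2);
        for (long offset = 0; offset < 5; offset++) {
            for (int partition = 0; partition < 4; partition++) {
                pool.dispatch(partition, offset);
            }
        }
        pool.shutdown();
        System.out.println(pool.processed);
    }
}
```

Because a partition is always handled by one thread, the highest offset that thread has finished is also a safe basis for that partition's commit, which is the simplification the quoted passage refers to. The commit itself would still have to be issued from the thread that owns the KafkaConsumer, since the consumer is not thread-safe.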