Contents:
"Kafka Producer Design Analysis"
"KafkaProducer Class Code Analysis"
"RecordAccumulator Class Code Analysis"
"Sender Class Code Analysis"
"NetworkClient Class Code Analysis"
-------------------------------------------------------------------
Previous section: "RecordAccumulator Class Code Analysis"
In the previous section we saw that messages waiting to be sent are buffered in the RecordAccumulator. Once the KafkaProducer learns that a batch is full, it notifies the Sender to get to work. In this article we look at the design and implementation of the Sender.
The Sender runs as a separate thread inside KafkaProducer and ships the sealed ProducerBatches out of the RecordAccumulator. All network IO is triggered by the Sender.
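For reference, this is roughly how the KafkaProducer constructor starts that thread (abridged from Kafka 2.x source; the exact arguments to newSender vary across versions):

this.sender = newSender(logContext, kafkaClient, this.metadata);
String ioThreadName = NETWORK_THREAD_PREFIX + " | " + clientId;
// the Sender is a Runnable wrapped in a daemon KafkaThread
this.ioThread = new KafkaThread(ioThreadName, this.sender, true); // true = daemon
this.ioThread.start();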
The Sender groups ProducerBatches by the broker node they are destined for: batches headed to the same node are packed together and sent in a single network IO. For example, if partition 1 of topic A and partition 2 of topic B both have their leader on the same node, the Sender packs the messages of both topic partitions into one request object before doing IO and sends them together, which greatly reduces network overhead.
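To make the regrouping concrete, here is a minimal, self-contained sketch (not Kafka source; the class name, topic-partition strings, and node ids are all made up) of turning per-partition batches into per-node groups, which is essentially the shape of what RecordAccumulator.drain() hands the Sender:

import java.util.*;

public class GroupByNodeSketch {
    public static void main(String[] args) {
        // hypothetical leader assignment: topic-partition -> broker node id
        Map<String, Integer> leaderByPartition = Map.of(
                "topicA-1", 0,   // topic A partition 1 lives on node 0
                "topicB-2", 0,   // topic B partition 2 also lives on node 0
                "topicC-0", 1);

        // drained batches, one entry per topic-partition
        List<String> drained = List.of("topicA-1", "topicB-2", "topicC-0");

        // re-group: everything bound for the same node goes into one request
        Map<Integer, List<String>> batchesByNode = new HashMap<>();
        for (String tp : drained)
            batchesByNode.computeIfAbsent(leaderByPartition.get(tp), k -> new ArrayList<>()).add(tp);

        // node 0 gets [topicA-1, topicB-2] in a single request
        System.out.println(batchesByNode);
    }
}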
Let me return to the parcel-delivery analogy, this time the scene of loading boxes onto trucks:
In the figure, the worker in green loads each box onto the right truck. Note that the trucks are organized by delivery region, so boxes bound for Beijing Chaoyang and for Tangshan, Hebei both go onto the North China Zone 1 truck. One truck can then carry the Beijing and Tangshan parcels together, instead of dispatching two trucks separately.
That green worker is the Sender: it packs ProducerBatches (boxes) of different topic partitions into the same ClientRequest (the truck's container), provided those topic partitions live on the same broker (all belong to North China Zone 1), and then performs the actual network IO to the Broker (the North China Zone 1 station) through the NetworkClient (the truck).
Now let's walk through the Sender code.
As the design suggests, the Sender holds two important references: the warehouse, a RecordAccumulator, and the fleet of trucks, a KafkaClient object responsible for network IO. The KafkaClient finds the route for each truck to its destination and delivers the goods. KafkaClient (whose implementation is NetworkClient) will be covered separately; in this article we focus on how the Sender regroups messages and loads them onto the trucks.
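Abridged from the Sender class, trimmed down to the fields just discussed (other fields such as metadata and metrics are omitted):

public class Sender implements Runnable {
    // the client used for network IO ("the trucks")
    private final KafkaClient client;
    // the record accumulator that holds the batches to send ("the warehouse")
    private final RecordAccumulator accumulator;
    // true while the main loop should keep running
    private volatile boolean running;
    // true when the sender should shut down immediately, abandoning unsent batches
    private volatile boolean forceClose;
    // ... constructor and methods omitted ...
}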
The Sender implements the Runnable interface and overrides the run method, which is the entry point of the thread:
public void run() {
    log.debug("Starting Kafka producer I/O thread.");

    // main loop, runs until close is called
    while (running) {
        try {
            run(time.milliseconds());
        } catch (Exception e) {
            log.error("Uncaught error in kafka producer I/O thread: ", e);
        }
    }

    log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");

    // okay we stopped accepting requests but there may still be
    // requests in the accumulator or waiting for acknowledgment,
    // wait until these are completed.
    while (!forceClose && (this.accumulator.hasUndrained() || this.client.inFlightRequestCount() > 0)) {
        try {
            run(time.milliseconds());
        } catch (Exception e) {
            log.error("Uncaught error in kafka producer I/O thread: ", e);
        }
    }
    if (forceClose) {
        // We need to fail all the incomplete batches and wake up the threads waiting on
        // the futures.
        log.debug("Aborting incomplete batches due to forced shutdown");
        this.accumulator.abortIncompleteBatches();
    }
    try {
        this.client.close();
    } catch (Exception e) {
        log.error("Failed to close network client", e);
    }

    log.debug("Shutdown of Kafka producer I/O thread has completed.");
}
The main loop of run calls run(time.milliseconds()); the Sender's core logic actually lives in that overloaded method. Once running becomes false, the main loop exits, and the shutdown state determines whether to finish sending what remains or to close down forcefully. On a forced close, accumulator.abortIncompleteBatches() deals with the unfinished ProducerBatches held in the RecordAccumulator's incomplete set: each batch is sealed so that no new messages can be appended, removed from its owning Deque, and its space in the BufferPool is released. The core of that code is:
void abortBatches(final RuntimeException reason) {
    for (ProducerBatch batch : incomplete.copyAll()) {
        Deque<ProducerBatch> dq = getDeque(batch.topicPartition);
        synchronized (dq) {
            batch.abortRecordAppends();
            dq.remove(batch);
        }
        batch.abort(reason);
        deallocate(batch);
    }
}
Finally, run closes the NetworkClient responsible for network IO.
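How does running become false in the first place? For reference, these are the two shutdown entry points in Sender (abridged from Kafka 2.x source); roughly speaking, KafkaProducer.close() calls initiateClose(), and falls back to forceClose() when the close timeout expires:

public void initiateClose() {
    // close the accumulator first so no more appends are accepted
    // after the main loop breaks
    this.accumulator.close();
    this.running = false;
    this.wakeup();
}

public void forceClose() {
    // skip draining; run() will abort all incomplete batches instead
    this.forceClose = true;
    initiateClose();
}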
Now for the overloaded run(long now) method, where the Sender's main logic lives. A long stretch at the beginning of this method handles transactional sends; to keep the walkthrough simple we skip that part and go straight to the non-transactional path.
The non-transactional send boils down to two key lines:
long pollTimeout = sendProducerData(now);
client.poll(pollTimeout, now);
1. sendProducerData(now) prepares the produce requests from the data that is ready to go.
2. client.poll(pollTimeout, now) actually sends the prepared requests over the network.
Let's look first at sendProducerData(), the method that prepares the requests.
This method is the Sender's main flow and deserves careful study. The code comes first, followed immediately by the logic walkthrough.
private long sendProducerData(long now) {
    Cluster cluster = metadata.fetch();
    // get the list of partitions with data ready to send
    RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

    // if there are any partitions whose leaders are not known yet, force metadata update
    if (!result.unknownLeaderTopics.isEmpty()) {
        // The set of topics with unknown leader contains topics with leader election pending as well as
        // topics which may have expired. Add the topic again to metadata to ensure it is included
        // and request metadata update, since there are messages to send to the topic.
        for (String topic : result.unknownLeaderTopics)
            this.metadata.add(topic);

        log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
            result.unknownLeaderTopics);
        this.metadata.requestUpdate();
    }

    // remove any nodes we aren't ready to send to
    Iterator<Node> iter = result.readyNodes.iterator();
    long notReadyTimeout = Long.MAX_VALUE;
    while (iter.hasNext()) {
        Node node = iter.next();
        if (!this.client.ready(node, now)) {
            iter.remove();
            notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
        }
    }

    // create produce requests
    Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);
    addToInflightBatches(batches);
    if (guaranteeMessageOrder) {
        // Mute all the partitions drained
        for (List<ProducerBatch> batchList : batches.values()) {
            for (ProducerBatch batch : batchList)
                this.accumulator.mutePartition(batch.topicPartition);
        }
    }

    accumulator.resetNextBatchExpiryTime();
    List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
    List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
    expiredBatches.addAll(expiredInflightBatches);

    // Reset the producer id if an expired batch has previously been sent to the broker. Also update the metrics
    // for expired batches. see the documentation of @TransactionState.resetProducerId to understand why
    // we need to reset the producer id here.
    if (!expiredBatches.isEmpty())
        log.trace("Expired {} batches in accumulator", expiredBatches.size());
    for (ProducerBatch expiredBatch : expiredBatches) {
        String errorMessage = "Expiring " + expiredBatch.recordCount + " record(s) for " + expiredBatch.topicPartition
            + ":" + (now - expiredBatch.createdMs) + " ms has passed since batch creation";
        failBatch(expiredBatch, -1, NO_TIMESTAMP, new TimeoutException(errorMessage), false);
        if (transactionManager != null && expiredBatch.inRetry()) {
            // This ensures that no new batches are drained until the current in flight batches are fully resolved.
            transactionManager.markSequenceUnresolved(expiredBatch.topicPartition);
        }
    }
    sensors.updateProduceRequestMetrics(batches);

    // If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
    // loop and try sending more data. Otherwise, the timeout will be the smaller value between next batch expiry
    // time, and the delay time for checking data availability. Note that the nodes may have data that isn't yet
    // sendable due to lingering, backing off, etc. This specifically does not include nodes with sendable data
    // that aren't ready to send since they would cause busy looping.
    long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
    pollTimeout = Math.min(pollTimeout, this.accumulator.nextExpiryTimeMs() - now);
    pollTimeout = Math.max(pollTimeout, 0);
    if (!result.readyNodes.isEmpty()) {
        log.trace("Nodes with data ready to send: {}", result.readyNodes);
        // if some partitions are already ready to be sent, the select time would be 0;
        // otherwise if some partition already has some data accumulated but not ready yet,
        // the select time will be the time difference between now and its linger expiry time;
        // otherwise the select time will be the time difference between now and the metadata expiry time;
        pollTimeout = 0;
    }
    sendProduceRequests(batches, now);
    return pollTimeout;
}
The main logic, step by step:
1. Fetch the current Cluster metadata and call accumulator.ready() to get the set of nodes with data ready to send.
2. If any topics have an unknown leader, re-add them to the metadata and request a metadata update.
3. Remove from readyNodes any node the NetworkClient is not yet ready to talk to (no usable connection, or still backing off), and record how long until it is worth re-checking.
4. Call accumulator.drain() to pull batches out of the accumulator regrouped by node, yielding a Map<Integer, List<ProducerBatch>> keyed by node id.
5. If guaranteeMessageOrder is enabled (max.in.flight.requests.per.connection = 1), mute every drained partition so no further batch for it is drained until the in-flight one completes.
6. Collect batches that have expired, whether in flight or still in the accumulator, and fail them with a TimeoutException.
7. Compute pollTimeout: 0 if some node is ready to send right now, otherwise the shortest wait until something becomes actionable (a worked example follows this list).
8. Call sendProduceRequests() to build one request per node and hand it to the NetworkClient.
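Here is a minimal, runnable sketch of step 7's pollTimeout computation; the variable values are hypothetical, chosen only to illustrate how the three candidate waits combine:

public class PollTimeoutExample {
    public static void main(String[] args) {
        long nextReadyCheckDelayMs = 300; // a lingering batch becomes sendable in 300 ms
        long notReadyTimeout = 120;       // a not-ready node finishes its backoff in 120 ms
        long msUntilNextExpiry = 5000;    // the earliest batch would expire in 5 s
        boolean hasReadyNodes = false;    // nothing is fully sendable right now

        long pollTimeout = Math.min(nextReadyCheckDelayMs, notReadyTimeout); // 120
        pollTimeout = Math.min(pollTimeout, msUntilNextExpiry);              // still 120
        pollTimeout = Math.max(pollTimeout, 0);                              // clamp negative values
        if (hasReadyNodes)
            pollTimeout = 0; // send immediately instead of blocking in poll()

        System.out.println(pollTimeout); // 120: client.poll() blocks at most 120 ms
    }
}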
Now let's focus on step 8's sendProduceRequests() method. It simply loops over the Map<Integer, List<ProducerBatch>> obtained in step 4 and calls sendProduceRequest() (singular) once per node id.
The code follows; if you would rather not read it, skip straight to the logic analysis below.
private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
    if (batches.isEmpty())
        return;

    Map<TopicPartition, MemoryRecords> produceRecordsByPartition = new HashMap<>(batches.size());
    final Map<TopicPartition, ProducerBatch> recordsByPartition = new HashMap<>(batches.size());

    // find the minimum magic version used when creating the record sets
    byte minUsedMagic = apiVersions.maxUsableProduceMagic();
    for (ProducerBatch batch : batches) {
        if (batch.magic() < minUsedMagic)
            minUsedMagic = batch.magic();
    }

    for (ProducerBatch batch : batches) {
        TopicPartition tp = batch.topicPartition;
        MemoryRecords records = batch.records();

        // down convert if necessary to the minimum magic used. In general, there can be a delay between the time
        // that the producer starts building the batch and the time that we send the request, and we may have
        // chosen the message format based on out-dated metadata. In the worst case, we optimistically chose to use
        // the new message format, but found that the broker didn't support it, so we need to down-convert on the
        // client before sending. This is intended to handle edge cases around cluster upgrades where brokers may
        // not all support the same message format version. For example, if a partition migrates from a broker
        // which is supporting the new magic version to one which doesn't, then we will need to convert.
        if (!records.hasMatchingMagic(minUsedMagic))
            records = batch.records().downConvert(minUsedMagic, 0, time).records();
        produceRecordsByPartition.put(tp, records);
        recordsByPartition.put(tp, batch);
    }

    String transactionalId = null;
    if (transactionManager != null && transactionManager.isTransactional()) {
        transactionalId = transactionManager.transactionalId();
    }
    ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
            produceRecordsByPartition, transactionalId);
    RequestCompletionHandler callback = new RequestCompletionHandler() {
        public void onComplete(ClientResponse response) {
            handleProduceResponse(response, recordsByPartition, time.milliseconds());
        }
    };

    String nodeId = Integer.toString(destination);
    ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
            requestTimeoutMs, callback);
    client.send(clientRequest, now);
    log.trace("Sent produce request to {}: {}", nodeId, requestBuilder);
}
The List<ProducerBatch> passed in all targets a single node (destination). The main steps:
1. Loop over batches; MemoryRecords records = batch.records() fetches the message payload assembled in each ProducerBatch (down-converting it first if some broker only supports an older message format). The batches are grouped by topic partition into two maps keyed by TopicPartition: produceRecordsByPartition (TopicPartition -> MemoryRecords) and recordsByPartition (TopicPartition -> ProducerBatch).
2. Build a ProduceRequest.Builder, which internally holds a reference to produceRecordsByPartition.
3. Create a ClientRequest, attaching a callback that will handle the response.
4. Call NetworkClient's send method.
Inside NetworkClient's send method, the builder's build method produces the ProduceRequest object; request.toSend(destination, header) then turns it into a NetworkSend object, send. An InFlightRequest object is created and stored to track the outstanding request. Finally selector.send(send) is called; this only queues the send, and a subsequent poll() call actually writes it out over the network.
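The "queue now, write later" pattern is worth pausing on. Here is a toy model of it (not Kafka code; the class and method names are invented, and real Selector IO is replaced with a println) that shows why nothing leaves the machine until poll() runs:

import java.util.ArrayDeque;
import java.util.Queue;

public class QueueThenPollSketch {
    private final Queue<String> pendingSends = new ArrayDeque<>();

    void send(String payload) {          // analogous to selector.send(send)
        pendingSends.offer(payload);     // just remember it; no IO happens here
    }

    void poll() {                        // analogous to a later poll() call
        String payload;
        while ((payload = pendingSends.poll()) != null)
            System.out.println("writing to socket: " + payload); // the real IO
    }

    public static void main(String[] args) {
        QueueThenPollSketch selector = new QueueThenPollSketch();
        selector.send("ProduceRequest for node 0");
        selector.poll(); // nothing was sent until this call
    }
}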
At this point we are touching NIO concepts, and Kafka wraps NIO in its own abstractions, so this part can be hard to follow. For this article it is enough to understand that sendProduceRequest() produces one ClientRequest per Kafka node and hands it to the NetworkClient. How the NetworkClient pushes a ClientRequest out over NIO is the subject of a later chapter.
To sum up, the Sender thread mainly does the following:
1. Pulls sealed ProducerBatches out of the RecordAccumulator, regrouped by the broker node that leads each partition.
2. Packs all batches bound for the same node into a single ClientRequest.
3. Hands each ClientRequest to the NetworkClient, which performs the actual network IO.
4. Handles housekeeping along the way: metadata updates for unknown leaders, expiring timed-out batches, and draining or aborting outstanding batches on shutdown.
Next section: "NetworkClient Class Code Analysis"