Overall flow diagram
Notes:
- A single KafkaProducer instance is thread-safe and can be shared by multiple threads (see the sketch after this list).
- The KafkaProducer instance appends each message to the last RecordBatch of the queue for the corresponding topic and partition (t1 stands for topic1, p1 for partition1, t2 for topic2, p2 for partition2).
- The Sender thread keeps pulling data out of the RecordAccumulator and packing it into ClientRequests. Note that one ClientRequest targets one broker, so a single ClientRequest may carry data for multiple topics and partitions.
- The Sender thread hands each ClientRequest over to the NetworkClient, which manages the request-level network logic: success, failure, timeout handling, and so on.
- The NetworkClient passes the outgoing ClientRequest to the corresponding KafkaChannel inside the Selector. This Selector is not the JDK's selector; it is Kafka's own implementation, and it performs the actual I/O.
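As a quick illustration of the first point, the sketch below shows several threads sharing one KafkaProducer instance. It is not taken from the Kafka code base; the topic name, keys and thread count are made up for the example.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class SharedProducerSketch {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // one producer instance, shared safely by every sending thread
        final KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        Thread[] senders = new Thread[3];
        for (int i = 0; i < senders.length; i++) {
            final int id = i;
            senders[i] = new Thread(() ->
                    producer.send(new ProducerRecord<>("topic1", "key-" + id, "value-" + id)));
            senders[i].start();
        }
        for (Thread t : senders)
            t.join();
        producer.close(); // flush any remaining batches before exiting
    }
}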
Detailed flow
1. Example code
package kafka.examples;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Date;
import java.util.Properties;
public class ProducerDemo {
public static void main(String[] args) {
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("client.id", "DemoProducer");
props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<Integer, String> producer = new KafkaProducer<>(props);
String topic = "topic1";
int key = 1;
String messageStr = "Message_" + key;
producer.send(new ProducerRecord<>(topic,
key,
messageStr), new DemoCallBack(new Date().getTime(), key, messageStr));
// close() flushes buffered records before the JVM exits; without it the
// asynchronously sent message may never leave the RecordAccumulator
producer.close();
}
private static class DemoCallBack implements Callback {
private final long startTime;
private final int key;
private final String message;
public DemoCallBack(long startTime, int key, String message) {
this.startTime = startTime;
this.key = key;
this.message = message;
}
/**
* A callback method the user can implement to provide asynchronous handling of request completion. This method will
* be called when the record sent to the server has been acknowledged. Exactly one of the arguments will be
* non-null.
*
* @param metadata The metadata for the record that was sent (i.e. the partition and offset). Null if an error
* occurred.
* @param exception The exception thrown during processing of this record. Null if no error occurred.
*/
public void onCompletion(RecordMetadata metadata, Exception exception) {
long elapsedTime = System.currentTimeMillis() - startTime;
if (metadata != null) {
System.out.println(
"message(" + key + ", " + message + ") sent to partition(" + metadata.partition() +
"), " +
"offset(" + metadata.offset() + ") in " + elapsedTime + " ms");
} else {
exception.printStackTrace();
}
}
}
}
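Besides the callback style used above, send also returns a Future<RecordMetadata>, so a caller can block until the broker acknowledges the record. A minimal sketch (the helper class and its name are invented for illustration, reusing the same topic and values as the demo):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.concurrent.ExecutionException;

class SyncSendSketch {
    // Blocking variant of the demo's send: get() waits for the acknowledgement.
    static void sendAndWait(KafkaProducer<Integer, String> producer)
            throws InterruptedException, ExecutionException {
        RecordMetadata metadata =
                producer.send(new ProducerRecord<>("topic1", 1, "Message_1")).get();
        System.out.println("stored in partition " + metadata.partition()
                + " at offset " + metadata.offset());
    }
}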
2. Sending messages to the RecordAccumulator
After the example calls producer.send, KafkaProducer.doSend is invoked; doSend contains the core code path of sending a message. The code is shown below:
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
TopicPartition tp = null;
try {
// first make sure the metadata for the topic is available
long waitedOnMetadataMs = waitOnMetadata(record.topic(), this.maxBlockTimeMs);
long remainingWaitMs = Math.max(0, this.maxBlockTimeMs - waitedOnMetadataMs);
byte[] serializedKey;
try {
serializedKey = keySerializer.serialize(record.topic(), record.key());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
" specified in key.serializer");
}
byte[] serializedValue;
try {
serializedValue = valueSerializer.serialize(record.topic(), record.value());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
" specified in value.serializer");
}
int partition = partition(record, serializedKey, serializedValue, metadata.fetch());
int serializedSize = Records.LOG_OVERHEAD + Record.recordSize(serializedKey, serializedValue);
ensureValidRecordSize(serializedSize);
tp = new TopicPartition(record.topic(), partition);
long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();
log.trace("Sending record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
// producer callback will make sure to call both 'callback' and interceptor callback
Callback interceptCallback = this.interceptors == null ? callback : new InterceptorCallback<>(callback, this.interceptors, tp);
RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);
if (result.batchIsFull || result.newBatchCreated) {
log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
this.sender.wakeup();
}
return result.future;
// handling exceptions and record the errors;
// for API exceptions return them in the future,
// for other exceptions throw directly
} catch (ApiException e) {
// ... the remaining exception handling (this and several further catch blocks) is omitted
}
}
Notes:
- waitOnMetadata makes sure the metadata contains the target topic before anything else happens.
- keySerializer.serialize(record.topic(), record.key()) serializes the key (the keySerializer can be specified when the KafkaProducer is created).
- valueSerializer.serialize(record.topic(), record.value()) serializes the value (the valueSerializer can likewise be specified when the KafkaProducer is created).
- partition(record, serializedKey, serializedValue, metadata.fetch()) decides which partition the message is sent to. If the record specifies a partition, that partition is used; otherwise the DefaultPartitioner, or a partitioner set in the configuration, makes the final choice (a simplified sketch follows this list).
- accumulator.append appends the message to the last RecordBatch of the deque for the corresponding topic and partition.
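To make the partitioning step concrete, here is a simplified sketch of default-style partition selection. It is not the actual DefaultPartitioner source (which uses murmur2 hashing and only counts partitions with an available leader); the class and method names are made up for the example.

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

class PartitionChoiceSketch {
    private final AtomicInteger counter = new AtomicInteger(0);

    // Records with an explicit partition keep it; keyed records are hashed so the
    // same key always lands on the same partition; key-less records go round-robin.
    int choosePartition(Integer declaredPartition, byte[] serializedKey, int numPartitions) {
        if (declaredPartition != null)
            return declaredPartition;
        if (serializedKey != null)
            return toPositive(Arrays.hashCode(serializedKey)) % numPartitions;
        return toPositive(counter.getAndIncrement()) % numPartitions;
    }

    private static int toPositive(int number) {
        return number & 0x7fffffff; // strip the sign bit so the modulo is never negative
    }
}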
3. Inside accumulator.append
public RecordAppendResult append(TopicPartition tp,
long timestamp,
byte[] key,
byte[] value,
Callback callback,
long maxTimeToBlock) throws InterruptedException {
// We keep track of the number of appending thread to make sure we do not miss batches in
// abortIncompleteBatches().
appendsInProgress.incrementAndGet();
try {
// check if we have an in-progress batch
Deque<RecordBatch> dq = getOrCreateDeque(tp);
synchronized (dq) {
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
if (appendResult != null)
return appendResult;
}
// we don't have an in-progress record batch try to allocate a new batch
int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
synchronized (dq) {
// Need to check if producer is closed again after grabbing the dequeue lock.
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
if (appendResult != null) {
// Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
free.deallocate(buffer);
return appendResult;
}
MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));
dq.addLast(batch);
incomplete.add(batch);
return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
}
} finally {
appendsInProgress.decrementAndGet();
}
}
Notes:
- Deque is not thread-safe, so access to it must be locked; synchronized blocks are used here.
- The first synchronized block attempts tryAppend; if that succeeds the result is returned immediately, otherwise buffer space is allocated and the append is retried.
- Buffer space is requested with free.allocate(size, maxTimeToBlock). Because free.allocate may block, it is deliberately kept outside the synchronized block. If all of the append work were done inside a single synchronized block, a thread sending a large message that blocks in free.allocate for a long time would stall every other thread that only wants to send small messages. The internals of free.allocate are analyzed in detail later.
- The second synchronized block calls tryAppend again before using the newly allocated buffer, because another thread may already have created a batch with enough room in the meantime; in that case the buffer is handed back via free.deallocate. This mirrors the double-checked pattern used in singleton initialization (a simplified sketch follows this list).
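The lock-allocate-recheck pattern from the last two points can be boiled down to the following sketch. It is not Kafka code: StringBuilder stands in for RecordBatch and the size limit of 32 characters is arbitrary, but the locking structure is the same.

import java.util.ArrayDeque;
import java.util.Deque;

class AppendSketch {
    private final Deque<StringBuilder> dq = new ArrayDeque<>();

    String append(String value) {
        synchronized (dq) {
            StringBuilder last = dq.peekLast();
            if (last != null && last.length() + value.length() <= 32)
                return last.append(value).toString();   // fits into the in-progress "batch"
        }
        // the potentially slow allocation happens outside the lock so that other
        // threads appending small records are not blocked behind it
        StringBuilder fresh = new StringBuilder(32);
        synchronized (dq) {
            StringBuilder last = dq.peekLast();
            if (last != null && last.length() + value.length() <= 32)
                return last.append(value).toString();   // another thread made room meanwhile; drop "fresh"
            dq.addLast(fresh);
            return fresh.append(value).toString();
        }
    }
}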
4. The Sender thread's execution flow
The Sender thread repeatedly executes the run method below.
void run(long now) {
Cluster cluster = metadata.fetch();
// get the list of partitions with data ready to send
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
// if there are any partitions whose leaders are not known yet, force metadata update
if (result.unknownLeadersExist)
this.metadata.requestUpdate();
// remove any nodes we aren't ready to send to
Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
Node node = iter.next();
if (!this.client.ready(node, now)) {
iter.remove();
notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
}
}
// create produce requests
Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,
result.readyNodes,
this.maxRequestSize,
now);
if (guaranteeMessageOrder) {
// Mute all the partitions drained
for (List<RecordBatch> batchList : batches.values()) {
for (RecordBatch batch : batchList)
this.accumulator.mutePartition(batch.topicPartition);
}
}
List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
// update sensors
for (RecordBatch expiredBatch : expiredBatches)
this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);
sensors.updateProduceRequestMetrics(batches);
List<ClientRequest> requests = createProduceRequests(batches, now);
// If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
// loop and try sending more data. Otherwise, the timeout is determined by nodes that have partitions with data
// that isn't yet sendable (e.g. lingering, backing off). Note that this specifically does not include nodes
// with sendable data that aren't ready to send since they would cause busy looping.
long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
if (result.readyNodes.size() > 0) {
log.trace("Nodes with data ready to send: {}", result.readyNodes);
log.trace("Created {} produce requests: {}", requests.size(), requests);
pollTimeout = 0;
}
for (ClientRequest request : requests)
client.send(request, now);
// if some partitions are already ready to be sent, the select time would be 0;
// otherwise if some partition already has some data accumulated but not ready yet,
// the select time will be the time difference between now and its linger expiry time;
// otherwise the select time will be the time difference between now and the metadata expiry time;
this.client.poll(pollTimeout, now);
}
Notes:
- this.accumulator.ready picks out the broker nodes that can be sent to right now; ReadyCheckResult holds that list of broker nodes. Readiness is driven mainly by batch size and linger time (a configuration sketch follows this list).
- client.ready checks, based on the network state, whether messages can actually be sent to a given broker; brokers that are not ready are removed from the set.
- accumulator.drain returns a mapping from broker node to the list of RecordBatches to send to it.
- createProduceRequests builds the list of ClientRequests.
- client.send(request, now) records the request in InFlightRequests and sets it as the pending send on the corresponding KafkaChannel.
- this.client.poll(pollTimeout, now) is where all of the network logic, including the actual I/O, is handled.
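Since readiness in accumulator.ready is driven largely by batch.size and linger.ms, a rough sketch of the batching-related producer configuration looks like the following; the values are only illustrative, not recommendations.

import java.util.Properties;

class BatchingConfigSketch {
    static Properties batchingProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("batch.size", 16384);        // a batch becomes sendable once it is full...
        props.put("linger.ms", 5);             // ...or once it has waited this long for more records
        props.put("buffer.memory", 33554432);  // total memory the accumulator may use for unsent records
        return props;
    }
}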
5. A closer look at client.poll
The method looks like this:
public List<ClientResponse> poll(long timeout, long now) {
long metadataTimeout = metadataUpdater.maybeUpdate(now);
try {
this.selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs));
} catch (IOException e) {
log.error("Unexpected error during I/O", e);
}
// process completed actions
long updatedNow = this.time.milliseconds();
List<ClientResponse> responses = new ArrayList<>();
handleCompletedSends(responses, updatedNow);
handleCompletedReceives(responses, updatedNow);
handleDisconnections(responses, updatedNow);
handleConnections();
handleTimedOutRequests(responses, updatedNow);
// invoke callbacks
for (ClientResponse response : responses) {
if (response.request().hasCallback()) {
try {
response.request().callback().onComplete(response);
} catch (Exception e) {
log.error("Uncaught error in request completion:", e);
}
}
}
return responses;
}
Notes:
- selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs)) is where the network I/O ultimately happens, covering OP_CONNECT, OP_READ and OP_WRITE.
- handleCompletedSends, handleCompletedReceives, handleDisconnections, handleConnections and handleTimedOutRequests all prepare the List<ClientResponse>; they hold the request-level bookkeeping and are kept separate from the network I/O itself. handleCompletedSends deals with requests whose send has completed; handleCompletedReceives processes responses received from the server; handleDisconnections handles requests to brokers whose connection has been lost; handleConnections updates the connection state of the corresponding brokers; handleTimedOutRequests handles requests that have timed out, which may be resent.
- response.request().callback().onComplete(response) calls back into Sender.handleProduceResponse, which in turn calls completeBatch. On success it invokes the user callback (DemoCallBack.onCompletion in the example) and releases the RecordBatch's buffer; on failure, under certain conditions the batch is put back into the RecordAccumulator to be sent again (the retry-related settings are sketched after this list).
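Whether completeBatch re-enqueues a failed batch depends on the producer's retry settings. A minimal configuration sketch, with illustrative values only:

import java.util.Properties;

class RetryConfigSketch {
    static Properties retryProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");               // wait for all in-sync replicas before a send counts as successful
        props.put("retries", 3);                // how many times a retriable failure may be re-sent
        props.put("retry.backoff.ms", 100);     // pause between retries
        props.put("request.timeout.ms", 30000); // requests older than this are picked up by handleTimedOutRequests
        return props;
    }
}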
Environment
- Java JDK: 1.8.0_92
- Scala: 2.10.6
- Gradle: 3.1
- ZooKeeper: 3.4.9
- Kafka: 0.10.0.1
- IntelliJ IDEA Scala plugin: 2017.2.13