[Learning Kafka from Scratch] The Overall Kafka Producer Flow

Overall Flow Diagram

[Figure: overall flow of the Kafka producer (original image: 绘图1.png)]

Notes:

  • A single KafkaProducer instance is thread-safe and can be shared by multiple threads (see the sketch after this list).
  • The KafkaProducer instance appends each message to the last RecordBatch of the deque for the corresponding topic partition (in the diagram, t1 stands for topic1, p1 for partition1, t2 for topic2, p2 for partition2).
  • The Sender thread keeps pulling data from the RecordAccumulator and packaging it into ClientRequests. Note that each ClientRequest targets exactly one broker, so a single ClientRequest may carry data for multiple topics and partitions.
  • The Sender thread hands each ClientRequest to the NetworkClient, which manages the request-related network logic, including handling success, failure, and timeouts.
  • The NetworkClient passes the outgoing ClientRequest to the corresponding KafkaChannel in the Selector. This Selector is not the JDK Selector but Kafka's own implementation; it performs the actual I/O.
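The first point deserves a concrete illustration. Below is a minimal sketch of several threads sharing one KafkaProducer instance; the topic name, pool size and the SharedProducerSketch class are assumptions made for illustration, not part of the original article:

package kafka.examples;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedProducerSketch {

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // One producer instance for the whole application; KafkaProducer is thread-safe.
        KafkaProducer<Integer, String> producer = new KafkaProducer<>(props);

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            final int workerId = i;
            // All workers call send() concurrently on the same producer instance.
            pool.submit(() -> producer.send(new ProducerRecord<>("topic1", workerId, "msg-" + workerId)));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        producer.close();   // flush any buffered records before exiting
    }
}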

Detailed Flow

1. Example Code

package kafka.examples;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Date;
import java.util.Properties;

public class ProducerDemo {

   public static void main(String[] args) {
       Properties props = new Properties();
       props.put("bootstrap.servers", "localhost:9092");
       props.put("client.id", "DemoProducer");
       props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
       props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
       KafkaProducer<Integer, String> producer = new KafkaProducer<>(props);

       String topic = "topic1";
       int key = 1;
       String messageStr = "Message_" + key;
       producer.send(new ProducerRecord<>(topic,
               key,
               messageStr), new DemoCallBack(new Date().getTime(), key, messageStr));
       // Close the producer so buffered records are flushed before the JVM exits.
       producer.close();
   }

   private static class DemoCallBack implements Callback {

       private final long startTime;
       private final int key;
       private final String message;

       public DemoCallBack(long startTime, int key, String message) {
           this.startTime = startTime;
           this.key = key;
           this.message = message;
       }

       /**
        * A callback method the user can implement to provide asynchronous handling of request completion. This method will
        * be called when the record sent to the server has been acknowledged. Exactly one of the arguments will be
        * non-null.
        *
        * @param metadata  The metadata for the record that was sent (i.e. the partition and offset). Null if an error
        *                  occurred.
        * @param exception The exception thrown during processing of this record. Null if no error occurred.
        */
       public void onCompletion(RecordMetadata metadata, Exception exception) {
           long elapsedTime = System.currentTimeMillis() - startTime;
           if (metadata != null) {
               System.out.println(
                       "message(" + key + ", " + message + ") sent to partition(" + metadata.partition() +
                               "), " +
                               "offset(" + metadata.offset() + ") in " + elapsedTime + " ms");
           } else {
               exception.printStackTrace();
           }
       }
   }
}


2. Sending Messages to the RecordAccumulator

After the example code calls producer.send, producer.doSend is invoked; doSend contains the core flow for sending a message, shown below:

private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
        TopicPartition tp = null;
        try {
            // first make sure the metadata for the topic is available
            long waitedOnMetadataMs = waitOnMetadata(record.topic(), this.maxBlockTimeMs);
            long remainingWaitMs = Math.max(0, this.maxBlockTimeMs - waitedOnMetadataMs);
            byte[] serializedKey;
            try {
                serializedKey = keySerializer.serialize(record.topic(), record.key());
            } catch (ClassCastException cce) {
                throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
                        " to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
                        " specified in key.serializer");
            }
            byte[] serializedValue;
            try {
                serializedValue = valueSerializer.serialize(record.topic(), record.value());
            } catch (ClassCastException cce) {
                throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
                        " to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
                        " specified in value.serializer");
            }
            int partition = partition(record, serializedKey, serializedValue, metadata.fetch());
            int serializedSize = Records.LOG_OVERHEAD + Record.recordSize(serializedKey, serializedValue);
            ensureValidRecordSize(serializedSize);
            tp = new TopicPartition(record.topic(), partition);
            long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();
            log.trace("Sending record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
            // producer callback will make sure to call both 'callback' and interceptor callback
            Callback interceptCallback = this.interceptors == null ? callback : new InterceptorCallback<>(callback, this.interceptors, tp);
            RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);
            if (result.batchIsFull || result.newBatchCreated) {
                log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
                this.sender.wakeup();
            }
            return result.future;
            // handling exceptions and record the errors;
            // for API exceptions return them in the future,
            // for other exceptions throw directly
        } catch (ApiException e) {
            // ... exception handling omitted for brevity
Notes:

  • waitOnMetadata makes sure the metadata for the record's topic is available.
  • keySerializer.serialize serializes the key (the key serializer is specified when the KafkaProducer is created; the topic is passed in as context).
  • valueSerializer.serialize serializes the value (the value serializer is likewise specified when the KafkaProducer is created).
  • partition(record, serializedKey, serializedValue, metadata.fetch()) determines which partition the message goes to. If the record explicitly specifies a partition, that one is used; otherwise the DefaultPartitioner, or a partitioner configured via partitioner.class, makes the decision (see the sketch after this list).
  • accumulator.append appends the message to the last RecordBatch of the deque for the corresponding topic partition.
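To make the partitioning rule concrete, here is a minimal sketch of the idea behind the default behavior (hash the serialized key, otherwise round-robin). SimplePartitionChooser is a hypothetical illustration, not Kafka's DefaultPartitioner source, which uses murmur2 and only rotates over available partitions for keyless records:

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class SimplePartitionChooser {
    private final AtomicInteger counter = new AtomicInteger(0);

    // Pick a partition for a record: keyed records always map to the same partition,
    // keyless records are spread round-robin across all partitions.
    public int choosePartition(byte[] serializedKey, int numPartitions) {
        if (serializedKey != null)
            return (Arrays.hashCode(serializedKey) & 0x7fffffff) % numPartitions;
        return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
    }
}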

3. A Look at accumulator.append

public RecordAppendResult append(TopicPartition tp,
                                     long timestamp,
                                     byte[] key,
                                     byte[] value,
                                     Callback callback,
                                     long maxTimeToBlock) throws InterruptedException {
        // We keep track of the number of appending thread to make sure we do not miss batches in
        // abortIncompleteBatches().
        appendsInProgress.incrementAndGet();
        try {
            // check if we have an in-progress batch
            Deque<RecordBatch> dq = getOrCreateDeque(tp);
            synchronized (dq) {
                if (closed)
                    throw new IllegalStateException("Cannot send after the producer is closed.");
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
                if (appendResult != null)
                    return appendResult;
            }

            // we don't have an in-progress record batch try to allocate a new batch
            int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
            log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
            ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
            synchronized (dq) {
                // Need to check if producer is closed again after grabbing the dequeue lock.
                if (closed)
                    throw new IllegalStateException("Cannot send after the producer is closed.");

                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
                if (appendResult != null) {
                    // Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
                    free.deallocate(buffer);
                    return appendResult;
                }
                MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
                RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
                FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));

                dq.addLast(batch);
                incomplete.add(batch);
                return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
            }
        } finally {
            appendsInProgress.decrementAndGet();
        }
    }

Notes:

  • Deque is not thread-safe, so it must be accessed under a lock; synchronized is used here.
  • The first synchronized block attempts tryAppend; if it succeeds the result is returned immediately, otherwise a new buffer is allocated and the append is retried.
  • The buffer is allocated with free.allocate(size, maxTimeToBlock). Because free.allocate may block, it is kept outside the synchronized block. If all of the appending above were done inside a single synchronized block, the following could happen: thread 1 sends a large message, blocks for a long time inside free.allocate, and thereby stalls every other thread that only wants to send small messages. The internals of free.allocate will be analyzed in detail later.
  • The second synchronized block calls tryAppend again before using the newly allocated buffer, because another thread may already have created a batch in the meantime, much like the double check in a double-checked-locking singleton (see the sketch after this list).
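The "check, allocate outside the lock, re-check" structure can be boiled down to a small sketch. Everything below (the DoubleCheckedAppend class, the StringBuilder standing in for a RecordBatch, the fixed sizes) is a hypothetical illustration of the pattern, not Kafka code:

import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

public class DoubleCheckedAppend {
    private final Deque<StringBuilder> dq = new ArrayDeque<>();   // StringBuilder stands in for RecordBatch

    public void append(String value) throws InterruptedException {
        synchronized (dq) {                    // first check: try the current last "batch"
            if (tryAppend(value))
                return;
        }
        // potentially blocking allocation happens OUTSIDE the lock, as in RecordAccumulator.append
        ByteBuffer buffer = allocate(1024);
        synchronized (dq) {                    // second check: another thread may have added a batch meanwhile
            if (tryAppend(value))
                return;                        // the real code would return the buffer to the BufferPool here
            dq.addLast(new StringBuilder(value));
        }
    }

    private boolean tryAppend(String value) {
        StringBuilder last = dq.peekLast();
        if (last == null || last.length() > 1000)   // treat a long builder as a "full batch"
            return false;
        last.append(value);
        return true;
    }

    private ByteBuffer allocate(int size) throws InterruptedException {
        // Kafka's BufferPool may block until memory is available; plain allocation here for the sketch.
        return ByteBuffer.allocate(size);
    }
}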

4. The Sender Thread's Run Loop

The Sender thread executes the run method below in a loop.

void run(long now) {
        Cluster cluster = metadata.fetch();
        // get the list of partitions with data ready to send
        RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

        // if there are any partitions whose leaders are not known yet, force metadata update
        if (result.unknownLeadersExist)
            this.metadata.requestUpdate();

        // remove any nodes we aren't ready to send to
        Iterator<Node> iter = result.readyNodes.iterator();
        long notReadyTimeout = Long.MAX_VALUE;
        while (iter.hasNext()) {
            Node node = iter.next();
            if (!this.client.ready(node, now)) {
                iter.remove();
                notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
            }
        }

        // create produce requests
        Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,
                                                                         result.readyNodes,
                                                                         this.maxRequestSize,
                                                                         now);
        if (guaranteeMessageOrder) {
            // Mute all the partitions drained
            for (List<RecordBatch> batchList : batches.values()) {
                for (RecordBatch batch : batchList)
                    this.accumulator.mutePartition(batch.topicPartition);
            }
        }

        List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
        // update sensors
        for (RecordBatch expiredBatch : expiredBatches)
            this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);

        sensors.updateProduceRequestMetrics(batches);
        List<ClientRequest> requests = createProduceRequests(batches, now);
        // If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
        // loop and try sending more data. Otherwise, the timeout is determined by nodes that have partitions with data
        // that isn't yet sendable (e.g. lingering, backing off). Note that this specifically does not include nodes
        // with sendable data that aren't ready to send since they would cause busy looping.
        long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
        if (result.readyNodes.size() > 0) {
            log.trace("Nodes with data ready to send: {}", result.readyNodes);
            log.trace("Created {} produce requests: {}", requests.size(), requests);
            pollTimeout = 0;
        }
        for (ClientRequest request : requests)
            client.send(request, now);

        // if some partitions are already ready to be sent, the select time would be 0;
        // otherwise if some partition already has some data accumulated but not ready yet,
        // the select time will be the time difference between now and its linger expiry time;
        // otherwise the select time will be the time difference between now and the metadata expiry time;
        this.client.poll(pollTimeout, now);
    }

Notes:

  • this.accumulator.ready selects the broker nodes to which data can be sent; ReadyCheckResult holds that set of ready broker nodes.
  • client.ready checks, based on the connection state, whether requests can currently be sent to a given broker; brokers that are not ready are removed from the set.
  • accumulator.drain returns a mapping from broker (node id) to the list of RecordBatches to send to it (see the sketch after this list).
  • createProduceRequests builds the list of ClientRequests, one produce request per broker.
  • client.send(request, now) adds the request to InFlightRequests and hands it to the Selector, which stages it as the pending send on the corresponding KafkaChannel.
  • this.client.poll(pollTimeout, now) is where all the network processing, including the actual I/O, takes place.
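As a rough illustration of why a single produce request can carry several topics and partitions, the sketch below groups "batches" by their leader broker id, mirroring the Map<Integer, List<RecordBatch>> returned by drain. The partition-to-leader assignments and the strings standing in for RecordBatches are made up for the example:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DrainSketch {
    public static void main(String[] args) {
        // partition -> leader broker id (normally taken from the Cluster metadata)
        Map<String, Integer> leaderOf = new HashMap<>();
        leaderOf.put("topic1-p0", 0);
        leaderOf.put("topic1-p1", 1);
        leaderOf.put("topic2-p0", 0);

        // group the "batches" by broker: one produce request will later be created per broker
        Map<Integer, List<String>> batchesPerBroker = new HashMap<>();
        for (Map.Entry<String, Integer> e : leaderOf.entrySet())
            batchesPerBroker.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add("batch for " + e.getKey());

        // broker 0 gets the batches for topic1-p0 and topic2-p0, broker 1 gets topic1-p1
        System.out.println(batchesPerBroker);
    }
}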

5. A Closer Look at client.poll

The method looks like this:

public List<ClientResponse> poll(long timeout, long now) {
        long metadataTimeout = metadataUpdater.maybeUpdate(now);
        try {
            this.selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs));
        } catch (IOException e) {
            log.error("Unexpected error during I/O", e);
        }

        // process completed actions
        long updatedNow = this.time.milliseconds();
        List<ClientResponse> responses = new ArrayList<>();
        handleCompletedSends(responses, updatedNow);
        handleCompletedReceives(responses, updatedNow);
        handleDisconnections(responses, updatedNow);
        handleConnections();
        handleTimedOutRequests(responses, updatedNow);

        // invoke callbacks
        for (ClientResponse response : responses) {
            if (response.request().hasCallback()) {
                try {
                    response.request().callback().onComplete(response);
                } catch (Exception e) {
                    log.error("Uncaught error in request completion:", e);
                }
            }
        }

        return responses;
    }

Notes:

  • selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs)) is where the network I/O finally happens, covering OP_CONNECT, OP_READ and OP_WRITE.
  • handleCompletedSends, handleCompletedReceives, handleDisconnections, handleConnections and handleTimedOutRequests all prepare the List<ClientResponse>; they contain the concrete network-related bookkeeping, kept separate from the raw I/O. handleCompletedSends deals with requests whose send has completed; handleCompletedReceives processes responses received from the server; handleDisconnections handles requests to brokers whose connections have dropped; handleConnections updates the connection state of the corresponding brokers; handleTimedOutRequests handles requests that have timed out, which may be resent.
  • response.request().callback().onComplete(response) invokes Sender.handleProduceResponse, which in turn calls completeBatch. On success, the user callback (DemoCallBack.onCompletion in the example) is invoked and the RecordBatch's buffer is released; on failure, under certain conditions the batch is re-enqueued into the RecordAccumulator to be sent again (a synchronous-send variant is sketched after this list).
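Because send returns a Future<RecordMetadata> that is completed by exactly this response handling, the caller can also block on the future instead of (or in addition to) passing a callback. A minimal sketch, assuming the producer instance from the example in section 1 and a caller that may propagate InterruptedException and ExecutionException:

// Blocking, "synchronous" send: get() returns once the response has been processed in client.poll.
Future<RecordMetadata> future = producer.send(new ProducerRecord<>("topic1", 1, "Message_1"));
RecordMetadata metadata = future.get();
System.out.println("partition=" + metadata.partition() + ", offset=" + metadata.offset());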

Environment Used

  • Java JDK: 1.8.0_92
  • Scala: 2.10.6
  • Gradle: 3.1
  • ZooKeeper: 3.4.9
  • Kafka: 0.10.0.1
  • IntelliJ IDEA Scala plugin: 2017.2.13
