Kafka Producer Source Code Analysis

Kafka sharing session

Producer configuration properties

Parameter / Description / Default
bootstrap.servers  List of broker addresses used to connect to the Kafka cluster  ""
metadata.max.age.ms  Interval after which metadata is forcibly refreshed, in milliseconds  300000 (5 minutes)
batch.size  Size of the memory area used for a ProducerBatch  16 KB
acks  How many replicas of a partition must receive the message before the send counts as successful  "1" (string type)
linger.ms  How many milliseconds a ProducerBatch waits before being sent; if the batch reaches batch.size during this delay, it is sent immediately without further waiting  0
client.id  User-defined id used to trace messages  ""
send.buffer.bytes  Socket send buffer size  128 KB; -1 uses the operating system default
receive.buffer.bytes  Socket receive buffer size  32 KB; -1 uses the operating system default
max.request.size  Maximum size of a message/request the producer client may send  1 MB
reconnect.backoff.ms  Interval to wait before retrying a failed connection to Kafka  50 ms
reconnect.backoff.max.ms  Maximum time the producer client waits between reconnection attempts to Kafka  1000 ms
max.block.ms  How long the producer's send() and partitionsFor() methods may block, e.g. when the send buffer is full or no metadata is available  60 s
buffer.memory  Size of the buffer the producer client uses to cache messages  32 MB
retry.backoff.ms  Interval between retries after a failed send  100 ms
compression.type  Compression codec for messages  none
metrics.sample.window.ms  Time window for metric samples  30000 ms
metrics.num.samples  Number of samples maintained for metrics  2
metrics.recording.level  Recording level for metrics  INFO
metric.reporters  List of classes used as metrics reporters  empty list
max.in.flight.requests.per.connection  Number of unacknowledged ("in flight") requests allowed on a single connection; batching them reduces overhead, but errors may cause messages to be reordered  5
retries  Number of retries after a failed send  0
key.serializer  Serializer class for keys
value.serializer  Serializer class for values
connections.max.idle.ms  How long before idle connections are closed  540000 ms
partitioner.class  Partitioner class implementing the Partitioner interface, for custom partitioning rules
request.timeout.ms  Maximum time the client waits for a response; if none arrives the request is resent, and an exception is thrown once the retries are exhausted  30000 ms
interceptor.classes  Interceptor classes implementing ProducerInterceptor, for custom interceptors
enable.idempotence  true enables idempotence
transaction.timeout.ms  Transaction timeout  60000 ms
transactional.id  Transaction id; must be unique
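
The same parameters can also be set programmatically. Below is a minimal sketch (the broker address, topic-independent values and class name are made up for illustration) showing how a few of the properties above map onto the ProducerConfig constants when a KafkaProducer is built:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // bootstrap.servers, acks, retries, batch.size, linger.ms, buffer.memory from the table above
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);                   // 16 KB
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 32 * 1024 * 1024);     // 32 MB
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // the producer is ready here; sending is covered in the sections below
        }
    }
}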

Consumer configuration properties

Parameter / Description / Default
group.id  Unique identifier of the consumer group the consumer belongs to
max.poll.records  Maximum number of records returned by one poll request  500
max.poll.interval.ms  Maximum idle time of the polling thread between polls  300000 ms
session.timeout.ms  Timeout used to detect consumer failures  10000 ms
heartbeat.interval.ms  Consumer heartbeat interval  3000 ms
bootstrap.servers  Broker addresses used to connect to the cluster
enable.auto.commit  Whether consumed offsets are committed automatically  true
auto.commit.interval.ms  Interval at which offsets are committed automatically  5000 ms
partition.assignment.strategy  Partition assignment strategy for consumers  RangeAssignor
auto.offset.reset  What to do when a partition has no initial offset, or the current offset no longer exists on the server: earliest consumes from the beginning, latest from the most recent offset, none throws an exception
fetch.min.bytes  Minimum amount of data one fetch request returns from Kafka; if Kafka has less than this, the request waits until the minimum is satisfied  1 B
fetch.max.bytes  Maximum amount of data one fetch request returns from Kafka  50 MB
fetch.max.wait.ms  Maximum time to wait when fetch.min.bytes cannot be satisfied  500 ms
metadata.max.age.ms  Interval after which metadata is forcibly refreshed, in milliseconds  300000 (5 minutes)
max.partition.fetch.bytes  Maximum amount of data returned to the consumer per partition (as opposed to fetch.max.bytes, which is per request)  1 MB
send.buffer.bytes  Socket send buffer size  128 KB; -1 uses the operating system default
receive.buffer.bytes  Socket receive buffer size  64 KB; -1 uses the operating system default
client.id  Id of the consumer client
reconnect.backoff.ms  Interval to wait before retrying a failed connection to Kafka  50 ms
reconnect.backoff.max.ms  Maximum time the client waits between reconnection attempts to Kafka  1000 ms
retry.backoff.ms  Interval between retries after a failed request  100 ms
metrics.sample.window.ms  Time window for metric samples  30000 ms
metrics.num.samples  Number of samples maintained for metrics  2
metrics.recording.level  Recording level for metrics  INFO
metric.reporters  List of classes used as metrics reporters  empty list
check.crcs  Automatically check the CRC32 of consumed records
key.deserializer  Deserializer class for keys
value.deserializer  Deserializer class for values
connections.max.idle.ms  How long before idle connections are closed  540000 ms
request.timeout.ms  Maximum time the client waits for a response; if none arrives the request is resent, and an exception is thrown once the retries are exhausted  30000 ms
default.api.timeout.ms  Default timeout for consumer APIs  60000 ms
interceptor.classes  Custom interceptor classes
exclude.internal.topics  Kafka's internal topics are __consumer_offsets and __transaction_state. This parameter controls whether internal topics are exposed to consumers; the default is true. When true, internal topics can only be subscribed to with subscribe(Collection) and not with subscribe(Pattern); when false there is no such restriction
isolation.level  Transaction isolation level of the consumer. With "read_committed" the consumer ignores messages from uncommitted transactions and can only consume up to the LSO (LastStableOffset); the default is "read_uncommitted", which allows consuming up to the HW (High Watermark)
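
For reference, a minimal consumer sketch using a few of the parameters above (the broker address, group id and topic name are made up for illustration):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // made-up address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");                // made-up group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));        // made-up topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
            }
        }
    }
}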

server.properties configuration properties

Parameter / Description / Default
num.partitions  Number of partitions a newly created topic gets. If automatic topic creation is enabled (it is by default), new topics are created with this many partitions  1
broker.id  Every broker needs an identifier, set with broker.id. It can be any integer, but it must be unique within the cluster  0
port  With the sample configuration, Kafka listens on port 9092; change the port parameter to use any other available port
zookeeper.connect  Address where broker metadata is stored. localhost:2181 means a ZooKeeper running locally on port 2181; the value is a comma-separated list of hostname:port/path entries
log.dirs  Kafka stores all messages on disk; the directories holding the log segments are given by log.dirs, a comma-separated list of local filesystem paths. If several paths are given, the broker keeps all segments of one partition under a single path and follows a "least used" rule: it places new partitions on the path that currently holds the fewest partitions, not on the path with the most free disk space
num.recovery.threads.per.data.dir  Note that this number applies per log directory listed in log.dirs: if num.recovery.threads.per.data.dir is 8 and log.dirs lists 3 paths, a total of 24 threads are used
host.name  Host address of the broker. If set, the broker binds to this address; if not, it binds to all interfaces and publishes one of them to ZooKeeper. Usually left unset

Producer overview

1. The producer consists of: interceptors, serializers, the RecordAccumulator that buffers messages, the main (user) thread, and the Sender thread that sends the data.
2. How it works

  1. When the user sends a message, the payload together with the topic and partition information is wrapped in a ProducerRecord object and passed to KafkaProducer's send method.

  2. When the KafkaProducer is constructed it creates the Sender, the RecordAccumulator and the Metadata; the Sender runs on a daemon thread.

  3. KafkaTemplate's buildCallback() method builds an implementation of the Callback interface to handle the send result.

  4. The main thread first appends the message to the double-ended queues held by the RecordAccumulator.

  5. The Sender thread uses Java NIO underneath to connect to the brokers and stores the cluster information it obtains in the Metadata.

  6. Using that cluster information, the Sender thread sends the messages accumulated in the RecordAccumulator to the brokers in batches.
    Side note: the Sender runs on a daemon thread.
    Java threads are divided into user threads and daemon threads.
    A daemon thread (also called a "service thread") has lower priority and exists to serve the other objects and threads in the system,
    similar to the GC threads.

  7. Producer architecture diagram
    [Figure 1: producer architecture / message flow diagram]

The ProducerRecord class

Core properties of ProducerRecord

  • String topic — the topic the message belongs to.

  • Integer partition — the partition of the topic to send to. It can be set explicitly; if it is not set but a key is given, the hash of the key modulo the number of partitions selects the partition; if neither is set, the partitions of the topic are used in round-robin fashion.

  • Headers headers — extra key-value attributes of the message, stored separately from the message body.

  • K key — the message key; if set, the hash of the key modulo the number of partitions selects the partition.

  • V value — the message body.

  • Long timestamp — the message timestamp; its meaning depends on the topic's message.timestamp.type configuration (see the example sketch after this list).

    • CreateTime — the timestamp taken when the client sends the message.
    • LogAppendTime — the timestamp taken when the broker appends the message to the log.
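
A minimal sketch (topic name and values are made up) showing how these core fields map onto the ProducerRecord constructors, with one header added:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.internals.RecordHeader;

public class ProducerRecordExample {
    public static void main(String[] args) {
        // topic + value only: the partition is chosen later by key hash / round-robin
        ProducerRecord<String, String> simple = new ProducerRecord<>("demo-topic", "hello");

        // topic, partition, timestamp, key and value all set explicitly
        ProducerRecord<String, String> full = new ProducerRecord<>(
                "demo-topic", 0, System.currentTimeMillis(), "order-42", "payload");

        // headers are extra attributes stored apart from the message body
        full.headers().add(new RecordHeader("trace-id", "abc123".getBytes(StandardCharsets.UTF_8)));

        System.out.println(simple);
        System.out.println(full);
    }
}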

Core properties of KafkaProducer

public class KafkaProducer<K, V> implements Producer<K, V> {
    private final Logger log;
    private static final String JMX_PREFIX = "kafka.producer";
    public static final String NETWORK_THREAD_PREFIX = "kafka-producer-network-thread";
    public static final String PRODUCER_METRIC_GROUP_NAME = "producer-metrics";

    private final String clientId;
    // Visible for testing
    final Metrics metrics;
    private final Partitioner partitioner;  //partitioning strategy (can be customized)
    private final int maxRequestSize;
    private final long totalMemorySize;
    private final ProducerMetadata metadata;  //cluster metadata
    private final RecordAccumulator accumulator; //message buffer (accumulator)
    private final Sender sender; //runnable that drains the accumulator and sends to the brokers
    private final Thread ioThread;  
    private final CompressionType compressionType;
    private final Sensor errors;
    private final Time time;
    private final Serializer<K> keySerializer; //key serializer
    private final Serializer<V> valueSerializer; //value serializer
    private final ProducerConfig producerConfig;  //configuration
    private final long maxBlockTimeMs;  //maximum time to block while fetching cluster metadata
    private final ProducerInterceptors<K, V> interceptors;  //interceptors
    private final ApiVersions apiVersions;
    private final TransactionManager transactionManager; //transaction support
    
     KafkaProducer(Map<String, Object> configs,
                  Serializer<K> keySerializer,
                  Serializer<V> valueSerializer,
                  ProducerMetadata metadata,
                  KafkaClient kafkaClient,
                  ProducerInterceptors<K, V> interceptors,
                  Time time) {
        ProducerConfig config = new ProducerConfig(ProducerConfig.appendSerializerToConfig(configs, keySerializer,
                valueSerializer));
        try {
            Map<String, Object> userProvidedConfigs = config.originals();
            this.producerConfig = config;
            this.time = time;

            String transactionalId = (String) userProvidedConfigs.get(ProducerConfig.TRANSACTIONAL_ID_CONFIG);

            this.clientId = config.getString(ProducerConfig.CLIENT_ID_CONFIG);

            LogContext logContext;
            if (transactionalId == null)
                logContext = new LogContext(String.format("[Producer clientId=%s] ", clientId));
            else
                logContext = new LogContext(String.format("[Producer clientId=%s, transactionalId=%s] ", clientId, transactionalId));
            log = logContext.logger(KafkaProducer.class);
            log.trace("Starting the Kafka producer");

            Map<String, String> metricTags = Collections.singletonMap("client-id", clientId);
            MetricConfig metricConfig = new MetricConfig().samples(config.getInt(ProducerConfig.METRICS_NUM_SAMPLES_CONFIG))
                    .timeWindow(config.getLong(ProducerConfig.METRICS_SAMPLE_WINDOW_MS_CONFIG), TimeUnit.MILLISECONDS)
                    .recordLevel(Sensor.RecordingLevel.forName(config.getString(ProducerConfig.METRICS_RECORDING_LEVEL_CONFIG)))
                    .tags(metricTags);
            List<MetricsReporter> reporters = config.getConfiguredInstances(ProducerConfig.METRIC_REPORTER_CLASSES_CONFIG,
                    MetricsReporter.class,
                    Collections.singletonMap(ProducerConfig.CLIENT_ID_CONFIG, clientId));
            JmxReporter jmxReporter = new JmxReporter();
            jmxReporter.configure(userProvidedConfigs);
            reporters.add(jmxReporter);
            MetricsContext metricsContext = new KafkaMetricsContext(JMX_PREFIX,
                    config.originalsWithPrefix(CommonClientConfigs.METRICS_CONTEXT_PREFIX));
            this.metrics = new Metrics(metricConfig, reporters, time, metricsContext);
            this.partitioner = config.getConfiguredInstance(ProducerConfig.PARTITIONER_CLASS_CONFIG, Partitioner.class);
            long retryBackoffMs = config.getLong(ProducerConfig.RETRY_BACKOFF_MS_CONFIG);
            if (keySerializer == null) {
                this.keySerializer = config.getConfiguredInstance(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                                                                                         Serializer.class);
                this.keySerializer.configure(config.originals(Collections.singletonMap(ProducerConfig.CLIENT_ID_CONFIG, clientId)), true);
            } else {
                config.ignore(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG);
                this.keySerializer = keySerializer;
            }
            if (valueSerializer == null) {
                this.valueSerializer = config.getConfiguredInstance(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                                                                                           Serializer.class);
                this.valueSerializer.configure(config.originals(Collections.singletonMap(ProducerConfig.CLIENT_ID_CONFIG, clientId)), false);
            } else {
                config.ignore(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG);
                this.valueSerializer = valueSerializer;
            }

            // load interceptors and make sure they get clientId
            userProvidedConfigs.put(ProducerConfig.CLIENT_ID_CONFIG, clientId);
            ProducerConfig configWithClientId = new ProducerConfig(userProvidedConfigs, false);
            List<ProducerInterceptor<K, V>> interceptorList = (List) configWithClientId.getConfiguredInstances(
                    ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, ProducerInterceptor.class);
            if (interceptors != null)
                this.interceptors = interceptors;
            else
                this.interceptors = new ProducerInterceptors<>(interceptorList);
            ClusterResourceListeners clusterResourceListeners = configureClusterResourceListeners(keySerializer,
                    valueSerializer, interceptorList, reporters);
            this.maxRequestSize = config.getInt(ProducerConfig.MAX_REQUEST_SIZE_CONFIG);
            this.totalMemorySize = config.getLong(ProducerConfig.BUFFER_MEMORY_CONFIG);
            this.compressionType = CompressionType.forName(config.getString(ProducerConfig.COMPRESSION_TYPE_CONFIG));

            this.maxBlockTimeMs = config.getLong(ProducerConfig.MAX_BLOCK_MS_CONFIG);
            int deliveryTimeoutMs = configureDeliveryTimeout(config, log);

            this.apiVersions = new ApiVersions();
            this.transactionManager = configureTransactionState(config, logContext);
            //initialize the RecordAccumulator that buffers messages
            this.accumulator = new RecordAccumulator(logContext,
                    config.getInt(ProducerConfig.BATCH_SIZE_CONFIG),
                    this.compressionType,
                    lingerMs(config),
                    retryBackoffMs,
                    deliveryTimeoutMs,
                    metrics,
                    PRODUCER_METRIC_GROUP_NAME,
                    time,
                    apiVersions,
                    transactionManager,
                    new BufferPool(this.totalMemorySize, config.getInt(ProducerConfig.BATCH_SIZE_CONFIG), metrics, time, PRODUCER_METRIC_GROUP_NAME));

            List<InetSocketAddress> addresses = ClientUtils.parseAndValidateAddresses(
                    config.getList(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG),
                    config.getString(ProducerConfig.CLIENT_DNS_LOOKUP_CONFIG));
            if (metadata != null) {
                this.metadata = metadata;
            } else {
                this.metadata = new ProducerMetadata(retryBackoffMs,
                        config.getLong(ProducerConfig.METADATA_MAX_AGE_CONFIG),
                        config.getLong(ProducerConfig.METADATA_MAX_IDLE_CONFIG),
                        logContext,
                        clusterResourceListeners,
                        Time.SYSTEM);
                this.metadata.bootstrap(addresses);
            }
            this.errors = this.metrics.sensor("errors");
            //create the Sender runnable
            this.sender = newSender(logContext, kafkaClient, this.metadata);
            String ioThreadName = NETWORK_THREAD_PREFIX + " | " + clientId;
            this.ioThread = new KafkaThread(ioThreadName, this.sender, true);
            //start the I/O thread that runs the Sender and connects to the brokers
            this.ioThread.start();
            config.logUnused();
            AppInfoParser.registerAppInfo(JMX_PREFIX, clientId, metrics, time.milliseconds());
            log.debug("Kafka producer started");
        } catch (Throwable t) {
            // call close methods if internal objects are already constructed this is to prevent resource leak. see KAFKA-2121
            close(Duration.ofMillis(0), true);
            // now propagate the exception
            throw new KafkaException("Failed to construct kafka producer", t);
        }
    }
}

Kafka producer send flow

1. KafkaTemplate.doSend()

Code @1: obtain/initialize the KafkaProducer; the Sender thread is running.
Code @2: append the message to memory.

buildCallback() builds the callback implementation.
KafkaProducer.send() appends the message to memory (the buffer queue of the target partition).
//@1
final Producer<K, V> producer = getTheProducer();
//@2
producer.send(producerRecord, buildCallback(producerRecord, producer, future));

2. KafkaProducer.send()

Code @1: first run the send interceptors. They are configured via interceptor.classes, a List<String> in which every element is the fully qualified class name of an interceptor.

Code @2: doSend() appends the message to memory.

//@1
ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
//@2
return doSend(interceptedRecord, callback);

3. KafkaProducer.doSend(): appending the message to memory

1. Fetch the cluster metadata (looping until it is available)

Code @1: get the partition list of the topic. If the partition information is not available locally, it is fetched from a remote broker; the method returns how long fetching the metadata took, and that time is deducted from the maximum wait time of the send.

// first make sure the metadata for the topic is available
ClusterAndWaitTime clusterAndWaitTime;
try {
        //@1
       clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs);
} catch (KafkaException e) {
    if (metadata.isClosed())
        throw new KafkaException("Producer closed while send in progress", e);
    throw e;
}
long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
1. KafkaProducer.waitOnMetadata()

The producer obtains the broker information from the metadata.
Code @1: read the currently cached cluster information; if it already covers the topic, return a ClusterAndWaitTime immediately.

Code @2: if the cache does not have it, loop and wait for it.

Code @3: get the current metadata version.

Code @4: wake up the Sender thread so that it fetches the cluster information.

Code @5: once the Sender thread has updated the metadata the version is bumped and the loop exits; otherwise a timeout exception is thrown.

  private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long maxWaitMs) throws InterruptedException {
        // add topic to metadata topic list if it is not there already and reset expiry
        metadata.add(topic);
        //@1
        //read the topic's cached metadata; if it is present, return a ClusterAndWaitTime immediately
        Cluster cluster = metadata.fetch();
      	//number of partitions of the topic
        Integer partitionsCount = cluster.partitionCountForTopic(topic);
        // Return cached metadata if we have it, and if the record's partition is either undefined
        // or within the known partition range
      	//return the cached cluster information
        if (partitionsCount != null && (partition == null || partition < partitionsCount))
            return new ClusterAndWaitTime(cluster, 0);

        long begin = time.milliseconds();
        long remainingWaitMs = maxWaitMs;
        long elapsed;
             
        //@2 the topic's metadata is not available: keep looping and waiting until it times out with a TimeoutException
        do {
            log.trace("Requesting metadata update for topic {}.", topic);
            metadata.add(topic);
             //@3
            int version = metadata.requestUpdate();
            //@4
            sender.wakeup();
            try {
                //@5
                //wait for the metadata to be updated
                metadata.awaitUpdate(version, remainingWaitMs);
            } catch (TimeoutException ex) {
               throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms.");
            }
            cluster = metadata.fetch();
            elapsed = time.milliseconds() - begin;
            if (elapsed >= maxWaitMs)
                throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms.");
            if (cluster.unauthorizedTopics().contains(topic))
                throw new TopicAuthorizationException(topic);
            remainingWaitMs = maxWaitMs - elapsed;
            partitionsCount = cluster.partitionCountForTopic(topic);
            //keep looping until partitionsCount != null, i.e. the cluster information has been obtained, or the wait times out
        } while (partitionsCount == null);

        if (partition != null && partition >= partitionsCount) {
            throw new KafkaException(
                    String.format("Invalid partition given with record: %d is not in the range [0...%d).", partition, partitionsCount));
        }

        return new ClusterAndWaitTime(cluster, elapsed);
    }
2. Metadata.awaitUpdate()

Code @1: once the Sender has updated the metadata the version increases, the loop ends, and the cluster information is available.

public synchronized void awaitUpdate(final int lastVersion, final long maxWaitMs) throws InterruptedException {
        if (maxWaitMs < 0) {
            throw new IllegalArgumentException("Max time to wait for metadata updates should not be < 0 milli seconds");
        }
        long begin = System.currentTimeMillis();
        long remainingWaitMs = maxWaitMs;
        //@1
        //when the Sender successfully updates the metadata, version is incremented; until then, keep waiting
        while (this.version <= lastVersion) {  
            if (remainingWaitMs != 0)
                wait(remainingWaitMs); 
            long elapsed = System.currentTimeMillis() - begin;
             //the wait exceeded the maximum wait time
            if (elapsed >= maxWaitMs) 
                throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms.");
            remainingWaitMs = maxWaitMs - elapsed;
        }
    }
2. Serialize the key

Note: although the serialize method receives the topic and the Headers, only the key itself is serialized.
Code @1: serialize the provided key.

byte[] serializedKey;
try {
    //@1
    serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
} catch (ClassCastException cce) {
    throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
                        " to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
                        " specified in key.serializer", cce);
}
3. Serialize the message body

Code @1: serialize the provided value.

byte[] serializedValue;
try {
    //@1
    serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
} catch (ClassCastException cce) {
    throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
                        " to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
                        " specified in value.serializer", cce);
}
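
The key.serializer / value.serializer classes referenced above only need to implement org.apache.kafka.common.serialization.Serializer. A minimal, hypothetical sketch (the class name and transformation are made up; configure and close have default implementations and are not required):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

// hypothetical value serializer that upper-cases strings before encoding them as UTF-8
public class UpperCaseStringSerializer implements Serializer<String> {

    @Override
    public byte[] serialize(String topic, String data) {
        // a null value must stay null so tombstone records keep working
        if (data == null) {
            return null;
        }
        return data.toUpperCase().getBytes(StandardCharsets.UTF_8);
    }
}

It would then be wired in through the value.serializer configuration, or passed directly to the KafkaProducer constructor as the constructor source above shows.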
4. Choose the partition

The target partition of the message is computed by the partitioning logic; the default implementation is DefaultPartitioner, whose routing rules are:
If a key is specified, hash the key and take it modulo the number of partitions.
If no key is specified, round-robin over all partitions.

int partition = partition(record, serializedKey, serializedValue, cluster);
tp = new TopicPartition(record.topic(), partition);
1. Partitioning source code

Code @1: all partitions of the topic.
Code @2: the key byte array is empty, i.e. the producer did not set a key: use the round-robin strategy.
Code @3: get the nextValue counter for the topic.
Code @4: get the available partitions.
Code @5: if there are available partitions, take the generated counter value modulo the number of available partitions to pick the partition.
Code @6: if there are no available partitions, take the counter value modulo the total number of partitions.
Code @7: if a key was specified, hash it (murmur2) and take it modulo the number of partitions.
Application: how to get ordered processing in Kafka —
use the same key and the same topic, so the messages land in the same partition (a custom Partitioner sketch follows the source below).

/*KafkaProducer*/
private int partition(ProducerRecord<K, V> record, byte[] serializedKey, byte[] serializedValue, Cluster cluster) {
	Integer partition = record.partition();
    //if a partition was specified on the record, use it directly; otherwise delegate to the partitioner
	return partition != null ?
			partition :
			partitioner.partition(
					record.topic(), record.key(), serializedKey, record.value(), serializedValue, cluster);
}

/*DefaultPartitioner*/
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    //@1 all partitions of the topic
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    //number of partitions of the topic
	int numPartitions = partitions.size();
	//@2 the key byte array is empty, i.e. the producer did not set a key: use the round-robin strategy
	if (keyBytes == null) {
		//@3 get the nextValue counter for the topic
		int nextValue = nextValue(topic);
		//@4 get the available partitions
		List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
		if (availablePartitions.size() > 0) {
			//@5 there are available partitions: take the counter value modulo the number of available partitions
          //the remainder together with the AtomicInteger increment implements round-robin
			int part = Utils.toPositive(nextValue) % availablePartitions.size();
			return availablePartitions.get(part).partition();
		} else {
			 //@6 no partition is available: take the counter value modulo the total number of partitions
			// no partitions are available, give a non-available partition
			return Utils.toPositive(nextValue) % numPartitions;
		}
	} else {
		//@7 a key was specified: hash it (murmur2) and take it modulo the number of partitions
		// hash the keyBytes to choose a partition; the same key always maps to the same partition
		return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
	}
}

/*Utils*/
public static int toPositive(int number) {
     // why AND the hash with 0x7FFFFFFF?
     // it clears the sign bit so the resulting index is always non-negative,
     // since for signed integers a leading 0 bit means positive and 1 means negative
	return number & 0x7fffffff;
}


private int nextValue(String topic) {
	//get the counter for this topic; AtomicInteger (from java.util.concurrent) keeps the value consistent across threads
	AtomicInteger counter = topicCounterMap.get(topic);
	if (null == counter) {
         //ThreadLocalRandom is a Random re-implemented on top of ThreadLocal
		//seed the counter with a random value
		counter = new AtomicInteger(ThreadLocalRandom.current().nextInt());
		//putIfAbsent: if the key already exists in the map, return the value associated with it;
       //             if the key does not exist, return null
       //store the counter for this topic in the map
		AtomicInteger currentCounter = topicCounterMap.putIfAbsent(topic, counter);
		//if another thread registered a counter first, use the one already in the map instead of ours
		if (currentCounter != null) {
			counter = currentCounter;
		}
	}
	//return the current value and increment the counter, so the next call gets the next value in the round-robin
	return counter.getAndIncrement();
}
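
The partitioner.class configuration mentioned earlier accepts any implementation of org.apache.kafka.clients.producer.Partitioner. A minimal, hypothetical sketch (class name and routing rule are made up) that keys records the same way as branch @7 above and puts keyless records in partition 0:

import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

// hypothetical partitioner: keyed records go to murmur2(key) % numPartitions, keyless records to partition 0
public class KeyHashPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed in this sketch
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        if (keyBytes == null) {
            return 0; // all keyless records land in partition 0 in this sketch
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % partitions.size();
    }

    @Override
    public void close() {
        // nothing to release
    }
}

It would be registered through partitioner.class with the fully qualified class name.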

The difference between partitionsForTopic and availablePartitionsForTopic

Map<String, List<PartitionInfo>> tmpAvailablePartitionsByTopic = new HashMap<>(tmpPartitionsByTopic.size());
//iterate over the map of topic -> partition list
for (Map.Entry<String, List<PartitionInfo>> entry : tmpPartitionsByTopic.entrySet()) {
    String topic = entry.getKey();
    //make the partition list read-only
    List<PartitionInfo> partitionsForTopic = Collections.unmodifiableList(entry.getValue());
    //put it back into the tmpPartitionsByTopic map
    tmpPartitionsByTopic.put(topic, partitionsForTopic);
    // Optimise for the common case where all partitions are available
    // does every partition have a leader?
    // stream().anyMatch: returns true if at least one element makes the predicate evaluate to true
    boolean foundUnavailablePartition = partitionsForTopic.stream().anyMatch(p -> p.leader() == null);
    List<PartitionInfo> availablePartitionsForTopic;
    if (foundUnavailablePartition) {
        availablePartitionsForTopic = new ArrayList<>(partitionsForTopic.size());
        for (PartitionInfo p : partitionsForTopic) {
            if (p.leader() != null)
                availablePartitionsForTopic.add(p);
        }
        availablePartitionsForTopic = Collections.unmodifiableList(availablePartitionsForTopic);
    } else {
        availablePartitionsForTopic = partitionsForTopic;
    }
    tmpAvailablePartitionsByTopic.put(topic, availablePartitionsForTopic);
}
this.partitionsByTopic = Collections.unmodifiableMap(tmpPartitionsByTopic);
this.availablePartitionsByTopic = Collections.unmodifiableMap(tmpAvailablePartitionsByTopic);
5. Set the message headers

If the record has headers (RecordHeaders), mark them read-only.

setReadOnly(record.headers());
Header[] headers = record.headers().toArray();
6. Compute the message size

Compute the serialized size of the message according to the message format of the magic version in use, and check whether it exceeds the configured limit; if it does, an exception is thrown.

int serializedSize = AbstractRecords.estimateSizeInBytesUpperBound(apiVersions.maxUsableProduceMagic(),
                    compressionType, serializedKey, serializedValue, headers);
ensureValidRecordSize(serializedSize);

7. Set up the callback

Initialize the message timestamp, then wrap the supplied Callback together with the interceptor chain.

long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();
log.trace("Sending record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
Callback interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
8. Append the message to memory

Code @1: append the message to the accumulator (buffer).

//@1
RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, headers, interceptCallback, remainingWaitMs);
//@2
if (result.batchIsFull || result.newBatchCreated) {
    log.trace("Waking up the sender since topic {} partition {} 
              is either full or getting a new batch", record.topic(), partition);
    //@3
    this.sender.wakeup();
}
//@4
return result.future;
9. Wake up the Sender thread to send the messages

Code @1: if the current batch is full or a new batch has just been created,
Code @2: wake up the Sender (the sending thread) so the buffered messages are sent to the broker,
Code @3: and finally return the future.
This is the classic Future pattern, so message sending is asynchronous.

//@1
if (result.batchIsFull || result.newBatchCreated) {
    log.trace("Waking up the sender since topic {} partition {} 
              is either full or getting a new batch", record.topic(), partition);
    //@2
    this.sender.wakeup();
}
//@3
return result.future;
10. Exception handling code
	(omitted)

How the producer implements the listener callback

1. Overview

KafkaTemplate declares a producerListener property and initializes it with a LoggingProducerListener.

LoggingProducerListener implements the ProducerListener interface and overrides the onError() method (onSuccess() keeps its default implementation).

buildCallback() creates an implementation of the Callback interface.

KafkaProducer.doSend() wraps that Callback into the interceptor callback chain.

The interceptor callback also implements Callback; when a request completes, its onCompletion() method is invoked.
Source of the Callback interface

/**
 * A callback interface that the user can implement to allow code to execute when the request is complete. This callback
 * will generally execute in the background I/O thread so it should be fast.
 */
public interface Callback {

    /**
     * A callback method the user can implement to provide asynchronous handling of request completion. This method will
     * be called when the record sent to the server has been acknowledged. Exactly one of the arguments will be
     * non-null.
     * @param metadata The metadata for the record that was sent (i.e. the partition and offset). Null if an error
     *        occurred.
     * @param exception The exception thrown during processing of this record. Null if no error occurred.
     *                  Possible thrown exceptions include:
     *
     *                  Non-Retriable exceptions (fatal, the message will never be sent):
     *
     *                  InvalidTopicException
     *                  OffsetMetadataTooLargeException
     *                  RecordBatchTooLargeException
     *                  RecordTooLargeException
     *                  UnknownServerException
     *
     *                  Retriable exceptions (transient, may be covered by increasing #.retries):
     *
     *                  CorruptRecordException
     *                  InvalidMetadataException
     *                  NotEnoughReplicasAfterAppendException
     *                  NotEnoughReplicasException
     *                  OffsetOutOfRangeException
     *                  TimeoutException
     *                  UnknownTopicOrPartitionException
     */
    public void onCompletion(RecordMetadata metadata, Exception exception);
}
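
A minimal sketch of passing a Callback directly to KafkaProducer.send (producer construction is omitted and the topic name is made up). As the javadoc above states, exactly one of metadata/exception is non-null:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CallbackExample {

    static void sendWithCallback(KafkaProducer<String, String> producer) {
        ProducerRecord<String, String> record = new ProducerRecord<>("demo-topic", "key", "value");
        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                System.out.printf("sent to partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
            } else {
                exception.printStackTrace(); // retriable vs non-retriable, see the list above
            }
        });
    }
}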

2. Source code walkthrough

Source of the ProducerListener interface

The ProducerListener interface mainly defines the onSuccess and onError methods.

public interface ProducerListener<K, V> {

	default void onSuccess(ProducerRecord<K, V> producerRecord, RecordMetadata recordMetadata) {
		onSuccess(producerRecord.topic(), producerRecord.partition(),
				producerRecord.key(), producerRecord.value(), recordMetadata);
	}

	default void onSuccess(String topic, Integer partition, K key, V value, RecordMetadata recordMetadata) {
	}


	default void onError(ProducerRecord<K, V> producerRecord, Exception exception) {
		onError(producerRecord.topic(), producerRecord.partition(),
				producerRecord.key(), producerRecord.value(), exception);
	}

	default void onError(String topic, Integer partition, K key, V value, Exception exception) {
	}

	@Deprecated
	default boolean isInterestedInSuccess() {
		return false;
	}
}

Source of LoggingProducerListener

LoggingProducerListener implements the ProducerListener interface and overrides the onError method.
Side note: why does it not have to implement onSuccess?

  • The default keyword introduced in JDK 1.8: default methods do not have to be overridden.
  • Only when a class implements two interfaces that declare the same method is it forced to provide its own implementation.
public class LoggingProducerListener<K, V> implements ProducerListener<K, V> {

	public static final int DEFAULT_MAX_CONTENT_LOGGED = 100;

	private static final Log logger = LogFactory.getLog(LoggingProducerListener.class); // NOSONAR

	private boolean includeContents = true;

	private int maxContentLogged = DEFAULT_MAX_CONTENT_LOGGED;


	public void setIncludeContents(boolean includeContents) {
		this.includeContents = includeContents;
	}

	public void setMaxContentLogged(int maxContentLogged) {
		this.maxContentLogged = maxContentLogged;
	}

	@Override
	public void onError(String topic, Integer partition, K key, V value, Exception exception) {
		if (logger.isErrorEnabled()) {
			StringBuffer logOutput = new StringBuffer();
			logOutput.append("Exception thrown when sending a message");
			if (this.includeContents) {
				logOutput.append(" with key='")
					.append(toDisplayString(ObjectUtils.nullSafeToString(key), this.maxContentLogged))
					.append("'")
					.append(" and payload='")
					.append(toDisplayString(ObjectUtils.nullSafeToString(value), this.maxContentLogged))
					.append("'");
			}
			logOutput.append(" to topic ").append(topic);
			if (partition != null) {
				logOutput.append(" and partition ").append(partition);
			}
			logOutput.append(":");
			logger.error(logOutput, exception);
		}
	}

	private String toDisplayString(String original, int maxCharacters) {
		if (original.length() <= maxCharacters) {
			return original;
		}
		return original.substring(0, maxCharacters) + "...";
	}

}
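
A hypothetical sketch of swapping in a custom listener, assuming the spring-kafka KafkaTemplate#setProducerListener setter and compiling against the interface version quoted above (the class name and counting logic are made up, and the counters are not thread-safe; this is only an illustration):

import org.apache.kafka.clients.producer.RecordMetadata;
import org.springframework.kafka.support.ProducerListener;

// hypothetical listener that counts outcomes instead of logging them
public class CountingProducerListener<K, V> implements ProducerListener<K, V> {

    private long success;
    private long failure;

    @Override
    public void onSuccess(String topic, Integer partition, K key, V value, RecordMetadata recordMetadata) {
        success++; // invoked from the producer I/O thread, so keep it fast
    }

    @Override
    public void onError(String topic, Integer partition, K key, V value, Exception exception) {
        failure++;
    }
}

It would be registered with something like kafkaTemplate.setProducerListener(new CountingProducerListener<>()).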

Source related to setting up the callback

Code @0: KafkaTemplate's producerListener property is initialized with a LoggingProducerListener, which implements the ProducerListener interface.

Code @1: when the Kafka template sends a message it also builds the callback.

Code @2: buildCallback creates an implementation of the Callback interface, i.e. its onCompletion method.

Code @3: if there is no exception, the producerListener's onSuccess method is called.

Code @4: if there is an exception, the producerListener's onError method is called.

@0 
private volatile ProducerListener<K, V> producerListener = new LoggingProducerListener<K, V>();


/**KafkaTemplate#send()方法**/
protected ListenableFuture<SendResult<K, V>> doSend(final ProducerRecord<K, V> producerRecord) {
        /*省略*/
        final SettableListenableFuture<SendResult<K, V>> future = new SettableListenableFuture<>();
        // @1
        producer.send(producerRecord, buildCallback(producerRecord, producer, future));
        /*省略*/
}


/**KafkaTemplate#buildCallback方法**/
private Callback buildCallback(final ProducerRecord<K, V> producerRecord, final Producer<K, V> producer,
			final SettableListenableFuture<SendResult<K, V>> future) {
			//Java 8 lambda, equivalent to
			// new Callback(){
			//		@Override
            //		public void onCompletion(metadata,exception){
            //			...
            //		}
			//	}
                  // @2
		return (metadata, exception) -> {
			try {
                                    // @3
				//no exception: call the producerListener's (default) onSuccess() method
				if (exception == null) {
					future.set(new SendResult<>(producerRecord, metadata));
					if (KafkaTemplate.this.producerListener != null) {
						KafkaTemplate.this.producerListener.onSuccess(producerRecord, metadata);
					}
					if (KafkaTemplate.this.logger.isTraceEnabled()) {
						KafkaTemplate.this.logger.trace("Sent ok: " + producerRecord + ", metadata: " + metadata);
					}
				}
                                    // @4
				//exception present: call the producerListener's overridden onError() method
				else {
					future.setException(new KafkaProducerException(producerRecord, "Failed to send", exception));
					if (KafkaTemplate.this.producerListener != null) {
						KafkaTemplate.this.producerListener.onError(producerRecord, exception);
					}
					if (KafkaTemplate.this.logger.isDebugEnabled()) {
						KafkaTemplate.this.logger.debug("Failed to send: " + producerRecord, exception);
					}
				}
			}
			finally {
				//if the template is not transactional, close the producer
				if (!KafkaTemplate.this.transactional) {
					closeProducer(producer, false);
				}
			}
		};
	}

Source of the interceptor classes ProducerInterceptors and ProducerInterceptor

Code @1: ProducerInterceptors is the container holding all producer interceptors.

Code @2: ProducerInterceptor is the interceptor interface; it extends Configurable, and custom interceptor classes can be added through the configuration.

//@1
public class ProducerInterceptors<K, V> implements Closeable {
    private static final Logger log = LoggerFactory.getLogger(ProducerInterceptors.class);
    private final List<ProducerInterceptor<K, V>> interceptors;

    public ProducerInterceptors(List<ProducerInterceptor<K, V>> interceptors) {
        this.interceptors = interceptors;
    }

    public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
        ProducerRecord<K, V> interceptRecord = record;
        for (ProducerInterceptor<K, V> interceptor : this.interceptors) {
            try {
                interceptRecord = interceptor.onSend(interceptRecord);
            } catch (Exception e) {
                // do not propagate interceptor exception, log and continue calling other interceptors
                // be careful not to throw exception from here
                if (record != null)
                    log.warn("Error executing interceptor onSend callback for topic: {}, partition: {}", record.topic(), record.partition(), e);
                else
                    log.warn("Error executing interceptor onSend callback", e);
            }
        }
        return interceptRecord;
    }


    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        for (ProducerInterceptor<K, V> interceptor : this.interceptors) {
            try {
                interceptor.onAcknowledgement(metadata, exception);
            } catch (Exception e) {
                // do not propagate interceptor exceptions, just log
                log.warn("Error executing interceptor onAcknowledgement callback", e);
            }
        }
    }

        
    public void onSendError(ProducerRecord<K, V> record, TopicPartition interceptTopicPartition, Exception exception) {
        for (ProducerInterceptor<K, V> interceptor : this.interceptors) {
            try {
                if (record == null && interceptTopicPartition == null) {
                    interceptor.onAcknowledgement(null, exception);
                } else {
                    if (interceptTopicPartition == null) {
                        interceptTopicPartition = new TopicPartition(record.topic(),
                                record.partition() == null ? RecordMetadata.UNKNOWN_PARTITION : record.partition());
                    }
                    interceptor.onAcknowledgement(new RecordMetadata(interceptTopicPartition, -1, -1,
                                    RecordBatch.NO_TIMESTAMP, Long.valueOf(-1L), -1, -1), exception);
                }
            } catch (Exception e) {
                // do not propagate interceptor exceptions, just log
                log.warn("Error executing interceptor onAcknowledgement callback", e);
            }
        }
    }

    @Override
    public void close() {
        for (ProducerInterceptor<K, V> interceptor : this.interceptors) {
            try {
                interceptor.close();
            } catch (Exception e) {
                log.error("Failed to close producer interceptor ", e);
            }
        }
    }
}

    //@2
public interface ProducerInterceptor<K, V> extends Configurable {

    public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record);


    public void onAcknowledgement(RecordMetadata metadata, Exception exception);

 
    public void close();
}
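
A hypothetical implementation of the interface above (the class name and prefixing logic are made up), which would be registered through interceptor.classes:

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// hypothetical interceptor that prefixes every String value before it is serialized
public class PrefixingProducerInterceptor implements ProducerInterceptor<String, String> {

    @Override
    public void configure(Map<String, ?> configs) {
        // configure() comes from Configurable; nothing to read in this sketch
    }

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // called on the user thread from KafkaProducer.send(), before serialization and partitioning
        return new ProducerRecord<>(record.topic(), record.partition(), record.timestamp(),
                record.key(), "audited-" + record.value(), record.headers());
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // called from the I/O thread when the broker acknowledges or the send fails; keep it fast
    }

    @Override
    public void close() {
        // nothing to release
    }
}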

KafkaProducer source

Code @1: the callback is wrapped into the interceptor callback chain.

Code @2: the collection of producer interceptors.

Code @3: after a request completes, KafkaProducer asynchronously invokes the Callback's onCompletion method.

Code @4: onAcknowledgement is called on the interceptor chain, i.e. on every producer interceptor.

Code @5: then the onCompletion method built by KafkaTemplate is invoked, which reports success or failure.

/**KafkaProducer#doSend方法**/
private Future<RecordMetadata> doSend(ProducerRecord<K, V> 
record, Callback callback) {
    /*省略*/
     // @1
    Callback interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
     /*省略*/
}

/**KafkaProducer#InterceptorCallback方法**/
private static class InterceptorCallback<K, V> implements Callback {
	private final Callback userCallback;
            // @2
	private final ProducerInterceptors<K, V> interceptors;
	private final TopicPartition tp;

	private InterceptorCallback(Callback userCallback, ProducerInterceptors<K, V> interceptors, TopicPartition tp) {
		this.userCallback = userCallback;
		this.interceptors = interceptors;
		this.tp = tp;
	}

        // @3
	public void onCompletion(RecordMetadata metadata, Exception exception) {
		metadata = metadata != null ? metadata : new RecordMetadata(tp, -1, -1, RecordBatch.NO_TIMESTAMP, Long.valueOf(-1L), -1, -1);
                // @4
		this.interceptors.onAcknowledgement(metadata, exception);
		if (this.userCallback != null)
                         // @5
			this.userCallback.onCompletion(metadata, exception);
	}
}

public class ProducerInterceptors<K, V> implements Closeable {    
        private static final Logger log = LoggerFactory.getLogger(ProducerInterceptors.class);    
        private final List<ProducerInterceptor<K, V>> interceptors;

}

3. Class flow diagram

Flow of the listener callback
[Figure 2: listener callback flow diagram]

4. The Future interface

A Future represents the result of an asynchronous computation. It provides methods to check whether the computation is complete, to wait for its completion, and to retrieve its result.

The result can only be retrieved with get once the computation has completed, blocking if necessary until it is ready.

Cancellation is performed by the cancel method.

Other methods tell whether the task completed normally or was cancelled. Once the computation has completed, it can no longer be cancelled.

If you want to use a Future for cancellability without providing a usable result, you can declare it as Future<?> and return null as the result of the underlying task.

The cancel method

Parameters:

mayInterruptIfRunning — whether the thread executing the task should be interrupted; otherwise an in-progress task is allowed to complete

Returns:
true if the task was cancelled
false if the task could not be cancelled, typically because it had already completed normally

boolean cancel(boolean mayInterruptIfRunning);

The isCancelled method

Returns true if this task was cancelled before it completed normally.

Returns:
true if this task was cancelled before it completed normally

boolean isCancelled();

The isDone method

Returns true if this task completed. Completion may be due to normal termination, an exception, or cancellation; in all of these cases this method returns true.

Returns:
true if this task completed

boolean isDone();

The get method

Waits if necessary for the computation to complete, then retrieves its result.
Returns:
V — the computed result
Throws:
CancellationException if the computation was cancelled
ExecutionException if the computation threw an exception
InterruptedException if the current thread was interrupted while waiting

V get() throws InterruptedException, ExecutionException;

The get method with a timeout

Waits if necessary for at most the given time for the computation to complete, then retrieves its result if available.

Parameters:
timeout — the maximum time to wait
unit — the time unit of the timeout argument
Returns:
V — the computed result
Throws:
CancellationException if the computation was cancelled
ExecutionException if the computation threw an exception
InterruptedException if the current thread was interrupted while waiting
TimeoutException if the wait timed out

V get(long timeout, TimeUnit unit) throws InterruptedException, ExecutionException, TimeoutException;
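
This is how the Future returned by send is typically used: a bounded get() turns the asynchronous send into a synchronous one. A minimal sketch (producer construction omitted, topic name made up):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SyncSendExample {

    static RecordMetadata sendAndWait(KafkaProducer<String, String> producer)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<RecordMetadata> future = producer.send(new ProducerRecord<>("demo-topic", "value"));
        // block for at most 10 seconds; TimeoutException if the broker has not answered by then
        return future.get(10, TimeUnit.SECONDS);
    }
}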

How messages are appended to the producer's send buffer

Producer flow diagram
[Figure 3: producer flow diagram]

RecordAccumulator, the local message buffer class

  • private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;

1. The RecordAccumulator.append method

Parameters

  • TopicPartition tp — the topic and partition, i.e. which partition of which topic to send to.
  • long timestamp — the timestamp taken when the client sent the message.
  • byte[] key — the message key.
  • byte[] value — the message body.
  • Header[] headers — the message headers, i.e. extra message attributes.
  • Callback callback — the callback.
  • long maxTimeToBlock — the maximum time the append may block.
public RecordAppendResult append(TopicPartition tp,
                                     long timestamp,
                                     byte[] key,
                                     byte[] value,
                                     Header[] headers,
                                     Callback callback,
                                     long maxTimeToBlock) throws InterruptedException {}

2. The RecordAccumulator.append() method

1. Adding the message to the double-ended queue

Try to get the double-ended queue for the topic-partition from the accumulator; if it does not exist, create one, then call tryAppend to append the message to the buffer. Kafka creates one message buffer per partition of each topic: messages are first appended to this buffer and the send API returns immediately, while a separate Sender thread periodically sends the buffered messages to the broker. The buffer here is implemented with an ArrayDeque. tryAppend then tries to append the message to that buffer and, if the append succeeds, returns the result.

//get (or create) the deque for this topic-partition from batches
Deque<ProducerBatch> dq = getOrCreateDeque(tp);
synchronized (dq) {
    if (closed)
        throw new KafkaException("Producer closed while send in progress");
        //append the message to the buffer
    RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
    if (appendResult != null)
        return appendResult;
}
1. ProducerBatch.tryAppend

Code @1: first check whether the ProducerBatch can still hold the current message; if there is not enough room, return null. A null return makes the caller try to create a new ProducerBatch.

Code @2: MemoryRecordsBuilder writes the message, in the Kafka message format, into memory, i.e. into the ByteBuffer that was allocated when the ProducerBatch was created (the message formats of the various Kafka versions are left for a separate article).

Code @3: update the ProducerBatch's maxRecordSize and lastAppendTime, i.e. the largest message size in the batch and the time of the last append.

Code @4: build a FutureRecordMetadata object, the classic Future pattern. It mainly contains the batch's produceFuture, the message's index within the batch, the key length, the value length and the current time.

Code @5: add the callback and the message's future to the batch's thunks, the collection that stores the completion handles of all messages in the batch.

At this point KafkaProducer's send method is done, and the caller gets back a FutureRecordMetadata object.

public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers, Callback callback, long now) {
    // @1
    if (!recordsBuilder.hasRoomFor(timestamp, key, value, headers)) {  
        return null;
    } else {
    	 // @2
        Long checksum = this.recordsBuilder.append(timestamp, key, value, headers);                    
        // @3
        this.maxRecordSize = Math.max(this.maxRecordSize, AbstractRecords.estimateSizeInBytesUpperBound(magic(),
                    recordsBuilder.compressionType(), key, value, headers)); 
        this.lastAppendTime = now;                                                                                       
        // @4
        FutureRecordMetadata future = new FutureRecordMetadata(this.produceFuture,
                                                               this.recordCount,
                                                               timestamp, 
                                                               checksum,
                                                               key == null ? -1 : key.length,
                                                               value == null ? -1 : value.length,
                                                               Time.SYSTEM); 
        // @5
        thunks.add(new Thunk(callback, future));                                                           				
        this.recordCount++;
        return future;                                                                            
    }
}
2. The relationship between a ProducerBatch and the double-ended queue

The in-memory structure is ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches.
ConcurrentHashMap is a thread-safe and efficient HashMap.
The producer maintains one queue, an ArrayDeque, per partition of each topic; its elements are ProducerBatch objects, i.e. batches, because Kafka messages are sent in batches (a small sketch of this structure follows the figure).

The relationship between the deque and its batches
[Figure 4: double-ended queue and batches]
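
The shape of that structure can be sketched independently of Kafka's internal classes; here ProducerBatch is stood in for by a plain String, and the lazy getOrCreateDeque is reproduced with computeIfAbsent (the real RecordAccumulator uses its own map implementation, so this is only a structural sketch):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.kafka.common.TopicPartition;

public class BatchesShapeSketch {
    // one deque of batches per topic-partition, like RecordAccumulator.batches
    private final ConcurrentMap<TopicPartition, Deque<String>> batches = new ConcurrentHashMap<>();

    Deque<String> getOrCreateDeque(TopicPartition tp) {
        // create the deque lazily the first time the partition is seen
        return batches.computeIfAbsent(tp, key -> new ArrayDeque<>());
    }

    void append(TopicPartition tp, String batch) {
        Deque<String> dq = getOrCreateDeque(tp);
        synchronized (dq) {          // ArrayDeque itself is not thread-safe, so callers lock the deque
            dq.addLast(batch);       // new batches go to the tail; the Sender drains from the head
        }
    }
}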

3. How to send synchronously and asynchronously

For synchronous sending, take the result of the send method and call its get() method: if the message has not yet reached the broker, get() blocks until the broker returns the send result. For asynchronous sending, use send(ProducerRecord<K, V> record, Callback callback) and do not call get; the Callback is invoked when the broker's response arrives, and interceptors are supported as well.

    public void sendMessage(String topic, String dataBuyPoint) throws ExecutionException, InterruptedException {
        ListenableFuture<SendResult<String, String>> future = kafkaTemplate
                .send(topic, dataBuyPoint);
         //calling get() here blocks until the broker responds, i.e. a synchronous send
        future.get();
        future.addCallback(new ListenableFutureCallback<SendResult<String, String>>() {
            public void onSuccess(SendResult<String, String> result) {
            }
            @Override
            public void onFailure(Throwable ex) {
            }
        });
    }

2. Allocating memory

If the first step did not append successfully, there is no usable ProducerBatch and a new one has to be created, so batch.size bytes are first requested from the BufferPool. If the BufferPool has no free memory, the call waits at most maxTimeToBlock; if the memory cannot be allocated within that time, an exception is thrown.

int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
buffer = free.allocate(size, maxTimeToBlock);

3. Creating a new batch

Create a new ProducerBatch, write the message into it, and return the append result. The key points are:

  • Create the ProducerBatch. It holds a MemoryRecordsBuilder, the object that writes messages into the memory owned by the ProducerBatch, whose size equals batch.size.
  • Append the message to the ProducerBatch.
  • Add the newly created ProducerBatch to the tail of the double-ended queue.
  • Add the batch to the incomplete collection, which holds the batches that have not yet been sent to the broker; once the Sender thread has sent a batch to the broker, it is removed and its memory is released.
  • Return the append result.
    Overall, RecordAccumulator.append takes a not-yet-full ProducerBatch from the double-ended queue and tries to write the message into it (into the in-memory buffer); if that fails, it creates a new ProducerBatch and appends again.
synchronized (dq) {
    // Need to check if producer is closed again after grabbing the dequeue lock.
    if (closed)
        throw new KafkaException("Producer closed while send in progress");
    // part of the code omitted
    MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
    ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, time.milliseconds());
    FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, headers, callback, time.milliseconds()));
    dq.addLast(batch);
    incomplete.add(batch);
    // Don't deallocate this buffer in the finally block as it's being used in the record batch
    buffer = null;
    return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true);
}

RecordAccumulator core methods

Fields used by ready()

  • long waitedTimeMs — how long the ProducerBatch has already waited, i.e. the current timestamp minus the batch's lastAttemptMs; lastAttemptMs is set to the current time when the batch is created or when it needs to be retried.
  • retryBackoffMs — how long to wait before retrying after an error, 100 ms by default, configurable via retry.backoff.ms.
  • batch.attempts() — how many times this batch has already been retried.
  • backingOff — whether the batch is still backing off: if the batch needs a retry and its wait time is less than retryBackoffMs, backingOff = true, which also means the batch is not ready yet.
  • timeToWaitMs — how long the Sender should wait before sending. If backingOff is true, the batch is being retried and has waited less than the configured backoff, so timeToWaitMs = retryBackoffMs; otherwise the wait is lingerMs.
  • boolean full — whether the batch is full; true if either condition holds:
    • the Deque<ProducerBatch> contains more than one element, which means at least one ProducerBatch is already full;
    • the ProducerBatch itself is full.
  • boolean expired — whether the batch has already waited at least as long as it needs to; thinking of sending as timer-driven, expired = true means the timer has fired and the batch should be sent.
  • boolean exhausted — the producer's buffer pool is running low: the number of threads blocked waiting to allocate memory for new ProducerBatches is greater than 0, so the buffered messages should be sent to the server right away.
  • boolean sendable — whether the batch can be sent; true if any of the following holds:
    • the batch is full (full = true);
    • it has waited the required time (expired = true);
    • the producer's buffer pool is exhausted and new threads need memory (exhausted = true);
    • the producer's close method has been called (closed = true);
    • the producer's flush method has been called.

The RecordAccumulator.ready() method

Code @1: iterate over the producer buffer ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches and pick out the batches that are ready to be sent.
Code @2: try to find the leader of the partition (TopicPartition) in the producer's cached metadata; if it is unknown, add the topic to unknownLeaderTopics.
Code @3: a metadata update request will be sent later to ask the broker for the partition's routing information.
Code @4: if the leader is not yet in readyNodes, check whether the conditions are met; isMuted relates to ordered messages and is not discussed here (it is covered in the later section on ordered messages).
Code @5: this is where the readiness conditions are evaluated.

public ReadyCheckResult ready(Cluster cluster, long nowMs) {
    Set<Node> readyNodes = new HashSet<>();
    long nextReadyCheckDelayMs = Long.MAX_VALUE;
    Set<String> unknownLeaderTopics = new HashSet<>();
 
    boolean exhausted = this.free.queued() > 0;
    for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {   // @1
        TopicPartition part = entry.getKey();
        Deque<ProducerBatch> deque = entry.getValue();
 
        Node leader = cluster.leaderFor(part);   // @2
        synchronized (deque) {
            if (leader == null && !deque.isEmpty()) {   // @3
                // This is a partition for which leader is not known, but messages are available to send.
                // Note that entries are currently not removed from batches when deque is empty.
                unknownLeaderTopics.add(part.topic());
            } else if (!readyNodes.contains(leader) && !isMuted(part, nowMs)) {    // @4
                ProducerBatch batch = deque.peekFirst();
                if (batch != null) {
                    long waitedTimeMs = batch.waitedTimeMs(nowMs);
                    boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
                    long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                    boolean full = deque.size() > 1 || batch.isFull();
                    boolean expired = waitedTimeMs >= timeToWaitMs;
                    boolean sendable = full || expired || exhausted || closed || flushInProgress();
                    if (sendable && !backingOff) {   // @5
                        readyNodes.add(leader);
                    } else {
                        long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                        // Note that this results in a conservative estimate since an un-sendable partition may have
                        // a leader that will later be found to have sendable data. However, this is good enough
                        // since we'll just wake up and then sleep again for the remaining time.
                        nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);   
                    }
                }
            }
        }
    }
    return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}

The RecordAccumulator.drain() method

Code @1: the parameters of the method:

  • Cluster cluster — the cluster information.
  • Set<Node> nodes — the set of nodes that are ready.
  • int maxSize — the maximum number of bytes in one request.
  • long now — the current time.

Code @2: iterate over all nodes and call drainBatchesForOneNode to extract the data, assembling a Map<Integer, List<ProducerBatch>> batches.

//@1
public Map<Integer, List<ProducerBatch>> drain(Cluster cluster, Set<Node> nodes, int maxSize, long now) { // @1
    if (nodes.isEmpty())
        return Collections.emptyMap();
 
    Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
    //@2
    for (Node node : nodes) {                                                                                                    
        List<ProducerBatch> ready = drainBatchesForOneNode(cluster, node, maxSize, now);                      // @2
        batches.put(node.id(), ready);
    }
    return batches;
}

The RecordAccumulator.drainBatchesForOneNode() method

Code @1: get all the partitions whose leader is on this broker, by brokerId.
Code @2: initialize start. A word about start and drainIndex:

  • start — the partition index at which this iteration begins.
  • drainIndex — the queue index where the previous drain stopped, so that draining does not always start from partition zero.

Code @3: loop and extract the accumulated data of each partition from the buffer.
Code @4: get the accumulated double-ended queue for the topic + partition from the producer's send buffer.
Code @5: take one element from the head of the deque (appends go to the tail).
Code @6: if the current batch is a retry and its backoff time has not yet elapsed, skip this partition.
Code @7: if the total size extracted so far plus the new batch would exceed maxSize (the maximum request size), stop extracting.
Code @8: add the current batch to the ready list and close it, so no more messages can be appended to it.

private List<ProducerBatch> drainBatchesForOneNode(Cluster cluster, Node node, int maxSize, long now) {
    int size = 0;
    List<PartitionInfo> parts = cluster.partitionsForNode(node.id());   // @1
    List<ProducerBatch> ready = new ArrayList<>();
    int start = drainIndex = drainIndex % parts.size();                        // @2
    do {                                                                                                // @3 
        PartitionInfo part = parts.get(drainIndex);
        TopicPartition tp = new TopicPartition(part.topic(), part.partition()); 
        this.drainIndex = (this.drainIndex + 1) % parts.size();                     
            
        if (isMuted(tp, now))
            continue;
 
        Deque<ProducerBatch> deque = getDeque(tp);                              // @4
        if (deque == null)
            continue;
 
        synchronized (deque) {
            // invariant: !isMuted(tp,now) && deque != null
            ProducerBatch first = deque.peekFirst();                                         // @5
            if (first == null)
                continue;
 
            // first != null
            boolean backoff = first.attempts() > 0 && first.waitedTimeMs(now) < retryBackoffMs;   // @6
            // Only drain the batch if it is not during backoff period.
            if (backoff)                                                                                     
                continue;
 
            if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty()) {     // @7
                break;
            } else {
                if (shouldStopDrainBatchesForPartition(first, tp))                                  
                    break;
 
                // code related to transactional messages is omitted here; it is covered separately
                batch.close();                                                                                            // @8
                size += batch.records().sizeInBytes();
                ready.add(batch);                                                                            
 
                batch.drained(now);                                                                             
            }
        }
    } while (start != drainIndex);
    return ready;
}

The double-ended queue ArrayDeque

1. Structure

ArrayDeque is not thread-safe.
Internally, ArrayDeque stores its elements in an array and uses two int values, head and tail, for the two ends.
Note that tail is not the index of the last element but the index one past it, i.e. the index where the next element will be inserted.

//the elements are stored in an array
transient Object[] elements; // non-private to simplify nested class access
//index of the head element
transient int head;
//index at which the next element will be added at the tail
transient int tail;
//minimum capacity, must be a power of two
private static final int MIN_INITIAL_CAPACITY = 8;

2. Characteristics

No fixed capacity limit; the capacity grows as needed.
Not thread-safe; there is no synchronization, so it does not support safe concurrent access.
Faster than Stack when used as a stack, and faster than LinkedList when used as a queue.
Both ends can be operated on.
Fail-fast iterators.
Cannot store null.
Supports iteration in both directions.

3. Common methods (see the usage sketch after this list)

1. Adding elements
        addFirst(E e) adds an element at the front
        addLast(E e) adds an element at the back
        offerFirst(E e) adds an element at the front and returns whether it succeeded
        offerLast(E e) adds an element at the back and returns whether it succeeded

  2. Removing elements
        removeFirst() removes and returns the first element; throws an exception if the deque is empty
        pollFirst() removes and returns the first element; returns null if the deque is empty
        removeLast() removes and returns the last element; throws an exception if the deque is empty
        pollLast() removes and returns the last element; returns null if the deque is empty
        removeFirstOccurrence(Object o) removes the first occurrence of the given element
        removeLastOccurrence(Object o) removes the last occurrence of the given element
   

   3. Getting elements
        getFirst() returns the first element; throws an exception if there is none
        getLast() returns the last element; throws an exception if there is none
   

    4. Queue operations
        add(E e) appends an element at the tail of the queue
        offer(E e) appends an element at the tail of the queue and returns whether it succeeded
        remove() removes and returns the first element; throws an exception if the queue is empty (it delegates to removeFirst())
        poll() removes and returns the first element; returns null if the queue is empty (it delegates to pollFirst())
        element() returns the first element; throws an exception if there is none
        peek() returns the first element, or null if there is none
      

    5. Stack operations
        push(E e) pushes an element onto the top of the stack
        pop() removes and returns the element on top of the stack; throws an exception if the stack is empty
        

    6. Others
        size() returns the number of elements
        isEmpty() returns whether the deque is empty
        iterator() iterator from front to back
        descendingIterator() iterator from back to front
        contains(Object o) returns whether the deque contains the element
        toArray() converts the deque to an array
        clear() empties the deque
        clone() copies the deque
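
A small, self-contained example of operating on both ends, which is exactly why the accumulator uses a deque (a retried batch can be put back at the head):

import java.util.ArrayDeque;

public class ArrayDequeExample {
    public static void main(String[] args) {
        ArrayDeque<String> deque = new ArrayDeque<>();
        deque.addLast("batch-1");            // normal appends go to the tail
        deque.addLast("batch-2");

        String inFlight = deque.pollFirst(); // the sender drains from the head
        System.out.println("sending " + inFlight);

        deque.addFirst(inFlight);            // a failed batch can be put back at the head for retry
        System.out.println(deque);           // [batch-1, batch-2]
    }
}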

4. Why a Deque (double-ended queue) helps

Elements can be inserted or removed at both the head and the tail.
Suppose one element takes a long time to handle:
when handling fails, the element can be re-inserted at the head of the queue instead of being appended at the tail and waiting its turn all over again.

5. Queue

A (single-ended) Queue is a first-in, first-out structure: elements enter at the tail and leave at the head.

The Sender thread

Sender fields

  • KafkaClient client — the Kafka network client, which encapsulates the network communication with the brokers.
  • RecordAccumulator accumulator — the message accumulator; messages enter it through RecordAccumulator.append.
  • Metadata metadata — the metadata manager, i.e. the routing/partition information of the topics.
  • boolean guaranteeMessageOrder — whether message order must be guaranteed.
  • int maxRequestSize — the maximum request size for send; the total serialized size of key and message body must not exceed it. Set via the max.request.size parameter.
  • short acks — the condition under which the broker considers a message "committed" and acknowledges it to the client; possible values are 0, -1 and 1.
  • int retries — the number of retries.
  • Time time — time utility.
  • boolean running — the thread state; true means the thread is running.
  • boolean forceClose — whether to force-close, ignoring messages that are still in flight.
  • SenderMetrics sensors — collector of send-related metrics.
  • int requestTimeoutMs — the request timeout.
  • long retryBackoffMs — how long to wait before retrying a failed request.
  • ApiVersions apiVersions — API version information.
  • TransactionManager transactionManager — the transaction manager.
  • Map<TopicPartition, List<ProducerBatch>> inFlightBatches — the batches currently being sent.

What happens after the Sender is woken up

Flow diagram

Network-layer flow
[Figure 5: network-layer flow diagram]

Call order

  1. KafkaProducer.doSend();
  2. Sender.run();
  3. NetworkClient.poll() (NetworkClient.doSend());
  4. Selector.poll();

KafkaProducer.doSend()

doSend() mainly does two things:

  1. waitOnMetadata(): requests an update of the topic-partition (tp) metadata, calling sender.wakeup() along the way;
  2. accumulator.append(): writes the message into the deque of its tp; if a new batch had to be created for that deque, sender.wakeup() is called at the end as well.

The sender.wakeup() method

The call chain is Sender -> NetworkClient -> Selector (Kafka's wrapper) -> Selector (Java NIO).

sender.wakeup() wakes the Sender thread out of its blocking select() call. select() polls the channels registered on the multiplexer and keeps blocking until one of the following conditions is met:

  • at least one channel is selected;
  • this selector's {@link #wakeup wakeup} method is invoked;
  • the current thread is interrupted;
  • the given timeout period expires.

Otherwise select() keeps polling and blocks right there until one of these conditions holds (a small NIO sketch follows the wakeup source below).

When KafkaProducer.doSend() calls sender.wakeup(), the effect is: once a new RecordBatch has been created, the old RecordBatch can be sent (or there is a Metadata request to send). If the thread is blocked in select(), it is woken up, the Sender runs its run() method again, and in that run the old RecordBatch (or the Metadata request) is selected and sent out promptly.

// org.apache.kafka.clients.producer.internals.Sender
/**
* Wake up the selector associated with this send thread
*/
public void wakeup() {
    this.client.wakeup();
}
 
// org.apache.kafka.clients.NetworkClient
/**
* Interrupt the client if it is blocked waiting on I/O.
*/
@Override
public void wakeup() {
    this.selector.wakeup();
}
 
// org.apache.kafka.common.network.Selector
/**
* Interrupt the nioSelector if it is blocked waiting to do I/O.
*/
//note: wake up the selector if it is blocked
@Override
public void wakeup() {
    this.nioSelector.wakeup();
}

Sender.run() method

Code @1: the Sender thread's main processing while it is running; it sends the messages buffered in the accumulator to the brokers.

Code @2: when the Sender thread is shut down gracefully (not a forced close) and the buffer still holds messages to send, the per-iteration run method is called again until the remaining messages have been sent, and only then does the thread exit.

Code @3: when the Sender thread is forced to close, incomplete (uncommitted) batches are aborted.

Code @4: close the Kafka client, i.e. the network communication object.

public void run() {
	log.debug("Starting Kafka producer I/O thread.");

	// main loop, runs until close is called
	while (running) {
		try {       
			 // @1
			run(time.milliseconds());
		} catch (Exception e) {
			log.error("Uncaught error in kafka producer I/O thread: ", e);
		}
	}

	log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");

	// okay we stopped accepting requests but there may still be
	// requests in the accumulator or waiting for acknowledgment,
	// wait until these are completed.
	 // @2
	while (!forceClose && (this.accumulator.hasUndrained() || this.client.inFlightRequestCount() > 0)) {
		try {
			run(time.milliseconds());
		} catch (Exception e) {
			log.error("Uncaught error in kafka producer I/O thread: ", e);
		}
	}
	 // @3
	if (forceClose) {
		// We need to fail all the incomplete batches and wake up the threads waiting on
		// the futures.
		log.debug("Aborting incomplete batches due to forced shutdown");
		this.accumulator.abortIncompleteBatches();
	}
	try {
		// @4
		this.client.close();
	} catch (Exception e) {
		log.error("Failed to close network client", e);
	}

	log.debug("Shutdown of Kafka producer I/O thread has completed.");
}

Sender.run(long now) method

Code @1 accumulator.ready(): iterates over all tps (topic-partitions); for every tp whose RecordBatch is ready to send (the batch has reached batch.size or linger.ms has elapsed), the tp's leader is selected; the result is the set of nodes to which Produce requests can be sent (the actual return type is ReadyCheckResult, of which this Set is the main member);

Code @2: if some tp has no known leader, requestUpdate() is called here to refresh the metadata. In fact this is detected during the iteration in step 1: every tp without a leader is added to a set called unknownLeaderTopics, whose metadata is then requested;

Code @3 accumulator.drain(): for every leader selected in step 1, iterate over its tps; if a tp's RecordBatch is not in its backoff period (it has not been retried, or the retry interval retryBackoffMs has already elapsed) and adding it keeps the request below maxSize (the maximum request size, 1MB by default), the RecordBatch is added to the list. The final result is a Map whose key is leader.id and whose value is the list of RecordBatches to send;

Code @4 sendProduceRequests(): sends the Produce requests; as the diagram shows, this ends up calling NetworkClient.send() with the clientRequest;

Code @5 NetworkClient.poll(): all socket I/O happens here; it delegates to the (Kafka) Selector, which in turn wraps the Java NIO interfaces.

Note on code @3: when a Produce request is about to be sent to a leader, RecordBatches of that leader's tps may be drained even if they have not met the usual conditions (neither batch.size nor linger.ms reached). The benefit is a lower request rate, which improves sending efficiency.

//note: the work done by the Sender thread on each loop iteration
    void run(long now) {
        Cluster cluster = metadata.fetch();
        //@1 get the nodes whose RecordBatches are ready to send
        RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
 
        //@2 if any topic-partition has an unknown leader, force a metadata update
        if (!result.unknownLeaderTopics.isEmpty()) {
            for (String topic : result.unknownLeaderTopics)
                this.metadata.add(topic);
            this.metadata.requestUpdate();
        }
 
        //note: if there is no connection to a node (a connection is initiated if possible), remove that node for now
        Iterator<Node> iter = result.readyNodes.iterator();
        long notReadyTimeout = Long.MAX_VALUE;
        while (iter.hasNext()) {
            Node node = iter.next();
            if (!this.client.ready(node, now)) {//note: for brokers without an established connection, a connection is initiated here
                iter.remove();
                notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
            }
        }
 
        //@3 return, per node (key is node.id), all sendable RecordBatches; each node's batches will go out in a single request
        Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,
                                                                         result.readyNodes,
                                                                         this.maxRequestSize,
                                                                         now);
        //note: ensure only one RecordBatch per tp is in flight, to preserve ordering
        //note: this is guaranteed when max.in.flight.requests.per.connection is set to 1
        if (guaranteeMessageOrder) {
            // Mute all the partitions drained
            for (List<RecordBatch> batchList : batches.values()) {
                for (RecordBatch batch : batchList)
                    this.accumulator.mutePartition(batch.topicPartition);
            }
        }
 
        //note: remove RecordBatches that have timed out (e.g. because metadata was unavailable)
        List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
        for (RecordBatch expiredBatch : expiredBatches)
            this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);
 
        sensors.updateProduceRequestMetrics(batches);
 
        long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
        if (!result.readyNodes.isEmpty()) {
            log.trace("Nodes with data ready to send: {}", result.readyNodes);
            pollTimeout = 0;
        }
        //@4 send the RecordBatches
        sendProduceRequests(batches, now);
 
        //note: if some partition can send data right away, pollTimeout is 0
        //@5 the actual socket read/write operations
        this.client.poll(pollTimeout, now);
    }

Sender.sendProducerData() method

Step 1: based on the current time and the data in the buffer queues, determine which partitions of which topics have met the send conditions.
Code @1: fetch the cluster information from the metadata; the Sender thread has already obtained it beforehand.

Code @2: determine from the double-ended queues which batches have met the send conditions.

 // @1
Cluster cluster = metadata.fetch();
 // @2
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

Step 2: if the routing information (the partition leader) for pending messages is unknown, it must first be fetched from the broker.
Code @1: check whether any pending batches target partitions whose leader is unknown.

Code @2: put the topics without a known leader back into the metadata's topic set.

Code @3: request a metadata update (bumping the update flag/version) so that the producer fetches the latest cluster information and routing assignments from the brokers.

// @1
if (!result.unknownLeaderTopics.isEmpty()) {
      // @2
	for (String topic : result.unknownLeaderTopics)
		this.metadata.add(topic);

	log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}", result.unknownLeaderTopics);
    // @3
	this.metadata.requestUpdate();
}

Step 3: remove the nodes that are not ready at the network level, and estimate for how long they will remain not ready.
1. A node counts as ready at the network level when all of the following hold:

  • there is no pending metadata update request;
  • the producer has an established TCP connection (three-way handshake completed) to the broker;
  • if SSL, ACL or similar mechanisms are enabled, the related state is ready;
  • the number of in-flight requests on the connection does not exceed the limit (default 5, configurable via max.in.flight.requests.per.connection).

2. client.pollDelayMs estimates how long the node will stay not ready, as follows:

  • if the TCP connection is already established and in the connected state: return 0 when no throttling is in effect, otherwise return the remaining throttle time;
  • if the TCP connection is still being established: return Long.MAX_VALUE, since the sender thread will be woken up once the connection completes.
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
    Node node = iter.next();
    if (!this.client.ready(node, now)) {
        iter.remove();
        notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
    }
}

Step 4: based on the nodes that are ready, drain the ProducerBatches to be sent from the accumulator, organized as nodeId -> List<ProducerBatch>. Note that once a ProducerBatch has been drained, no more records can be appended to it, even if it still has free space.

Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);

Step 5: add the drained ProducerBatches to the inFlightBatches structure, declared as Map<TopicPartition, List<ProducerBatch>> inFlightBatches, i.e. keyed by topic-partition; it holds the batches that are about to be sent. From this structure one can tell, per partition, how "backed up" the Sender thread is; max.in.flight.requests.per.connection caps that backlog, and once the cap is reached sends for that queue are throttled. (A simplified model of the mutePartition ordering trick in the code below is sketched right after it.)

if (guaranteeMessageOrder) {
	// Mute all the partitions drained
	for (List<ProducerBatch> batchList : batches.values()) {
		for (ProducerBatch batch : batchList)
			this.accumulator.mutePartition(batch.topicPartition);
	}
}
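
The mutePartition call above is what keeps at most one batch per partition in flight when guaranteeMessageOrder is true. The following standalone sketch (a simplified model, not Kafka source; class and variable names are made up) illustrates the idea: a muted partition is skipped while draining and is unmuted only after its in-flight batch gets a response:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MutePartitionDemo {
    // partition -> queued batches (stands in for the accumulator's per-partition deques)
    private final Map<String, Deque<String>> queues = new HashMap<>();
    // partitions that already have a batch in flight
    private final Set<String> muted = new HashSet<>();

    void append(String partition, String batch) {
        queues.computeIfAbsent(partition, p -> new ArrayDeque<>()).offerLast(batch);
    }

    // drain at most one batch per un-muted partition, then mute it (ordering guarantee)
    List<String> drain() {
        List<String> drained = new ArrayList<>();
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            String partition = e.getKey();
            if (muted.contains(partition) || e.getValue().isEmpty()) {
                continue;
            }
            drained.add(e.getValue().pollFirst());
            muted.add(partition);
        }
        return drained;
    }

    // called when the broker response for that partition's in-flight batch arrives
    void onResponse(String partition) {
        muted.remove(partition);
    }

    public static void main(String[] args) {
        MutePartitionDemo demo = new MutePartitionDemo();
        demo.append("topic-0", "batch-A");
        demo.append("topic-0", "batch-B");

        System.out.println(demo.drain());  // [batch-A] -> topic-0 is now muted
        System.out.println(demo.drain());  // []        -> batch-B must wait for the response
        demo.onResponse("topic-0");        // response received, unmute
        System.out.println(demo.drain());  // [batch-B]
    }
}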

Step 6: look for expired message batches (ProducerBatch) in inFlightBatches and batches; a batch counts as expired when the time since its creation exceeds 120s; this expiry can be configured via delivery.timeout.ms.

List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(this.requestTimeoutMs, now);

Step 7: handle the expired batches by failing them, i.e. completing the ProduceRequestResult inside the FutureRecordMetadata returned by KafkaProducer#send, so that calls to its get() method no longer block.

if (!expiredBatches.isEmpty())
	log.trace("Expired {} batches in accumulator", expiredBatches.size());
for (ProducerBatch expiredBatch : expiredBatches) {
	failBatch(expiredBatch, -1, NO_TIMESTAMP, expiredBatch.timeoutException(), false);
	if (transactionManager != null && expiredBatch.inRetry()) {
		// This ensures that no new batches are drained until the current in flight batches are fully resolved.
		transactionManager.markSequenceUnresolved(expiredBatch.topicPartition);
	}
}

Step 8: collect metrics.

sensors.updateProduceRequestMetrics(batches);

Step 9: compute the delay before the next send (the poll timeout).

long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
if (!result.readyNodes.isEmpty()) {
	log.trace("Nodes with data ready to send: {}", result.readyNodes);
	// if some partitions are already ready to be sent, the select time would be 0;
	// otherwise if some partition already has some data accumulated but not ready yet,
	// the select time will be the time difference between now and its linger expiry time;
	// otherwise the select time will be the time difference between now and the metadata expiry time;
	pollTimeout = 0;
}

Step 10: build one send request per broker id, i.e. for each broker multiple ProducerBatches are packed into a single request; at any given time only one request per broker connection is in flight. Note that this step only builds the request and eventually hands it to NetworkClient#send, which stages the data inside the NetworkClient as pending output; no real network call is triggered yet.

sendProduceRequests(batches, now);

private void sendProduceRequests(Map<Integer, List<ProducerBatch>> collated, long now) {
	for (Map.Entry<Integer, List<ProducerBatch>> entry : collated.entrySet())
		sendProduceRequest(now, entry.getKey(), acks, requestTimeoutMs, entry.getValue());
}

private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
	if (batches.isEmpty())
		return;

	Map<TopicPartition, MemoryRecords> produceRecordsByPartition = new HashMap<>(batches.size());
	final Map<TopicPartition, ProducerBatch> recordsByPartition = new HashMap<>(batches.size());

	// find the minimum magic version used when creating the record sets
	byte minUsedMagic = apiVersions.maxUsableProduceMagic();
	for (ProducerBatch batch : batches) {
		if (batch.magic() < minUsedMagic)
			minUsedMagic = batch.magic();
	}

	for (ProducerBatch batch : batches) {
		TopicPartition tp = batch.topicPartition;
		MemoryRecords records = batch.records();

		// down convert if necessary to the minimum magic used. In general, there can be a delay between the time
		// that the producer starts building the batch and the time that we send the request, and we may have
		// chosen the message format based on out-dated metadata. In the worst case, we optimistically chose to use
		// the new message format, but found that the broker didn't support it, so we need to down-convert on the
		// client before sending. This is intended to handle edge cases around cluster upgrades where brokers may
		// not all support the same message format version. For example, if a partition migrates from a broker
		// which is supporting the new magic version to one which doesn't, then we will need to convert.
		if (!records.hasMatchingMagic(minUsedMagic))
			records = batch.records().downConvert(minUsedMagic, 0, time).records();
		produceRecordsByPartition.put(tp, records);
		recordsByPartition.put(tp, batch);
	}

	String transactionalId = null;
	if (transactionManager != null && transactionManager.isTransactional()) {
		transactionalId = transactionManager.transactionalId();
	}
	ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
			produceRecordsByPartition, transactionalId);
	RequestCompletionHandler callback = new RequestCompletionHandler() {
		public void onComplete(ClientResponse response) {
			handleProduceResponse(response, recordsByPartition, time.milliseconds());
		}
	};

	String nodeId = Integer.toString(destination);
	ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
			requestTimeoutMs, callback);
	client.send(clientRequest, now);
	log.trace("Sent produce request to {}: {}", nodeId, requestBuilder);
}

NetworkClient.poll() method

Code @1: decide whether the metadata needs updating, and if so issue the update request (this is where metadata update requests originate).
Code @2: call Selector.poll() to perform the socket I/O and trigger the real network communication; internally it ends up calling NIO's Selector#select() and processes the channels' read/write readiness events. Once a write event is ready, the staged message in the channel is sent to the remote broker.
Code @3: the results of sends, receives, disconnections, API-version requests and timeouts are then collected.
Code @4: the collected responses are completed one by one; each response is written into the future returned by KafkaProducer#send, which wakes up the sending client and completes one full send cycle.
handleCompletedSends handles requests that do not expect a response; it runs as soon as the send has completed.
handleCompletedReceives handles requests that do expect a response; it runs once the response to the request has been received.

What the two methods have in common: both remove the corresponding request from the inFlightRequests queue and add a response for it (requests that need no reply get a default response rather than a real one).
How they differ:

  1. the former runs when acks=0, the latter otherwise;
  2. when adding the response, a request that needs no reply is given a default value instead of a real response;
  3. the former runs as soon as the send has been successfully handed to NIO, the latter only after a valid response has come back.

completeResponses invokes the callback defined on each ClientRequest; that callback in turn invokes the callbacks attached to the records in the request.
The ClientRequest callback is the RequestCompletionHandler created in sendProduceRequest, which delegates to handleProduceResponse.

 public List<ClientResponse> poll(long timeout, long now) {
    ensureActive();
 
    if (!abortedSends.isEmpty()) {
        // If there are aborted sends because of unsupported version exceptions or disconnects,
        // handle them immediately without waiting for Selector#poll.
        List<ClientResponse> responses = new ArrayList<>();
        handleAbortedSends(responses);
        completeResponses(responses);
        return responses;
    }
 
    long metadataTimeout = metadataUpdater.maybeUpdate(now);   // @1
    try {
        this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));    // @2
    } catch (IOException e) {
        log.error("Unexpected error during I/O", e);
    }
 
    // process completed actions
    long updatedNow = this.time.milliseconds();
    List<ClientResponse> responses = new ArrayList<>();            // @3
    //note: handle completed sends (requests that do not expect a response)
    handleCompletedSends(responses, updatedNow);
    //note: handle Receives from the server (e.g. Metadata responses)
    handleCompletedReceives(responses, updatedNow);
    //note: handle failed connections and request a metadata refresh
    handleDisconnections(responses, updatedNow);
    //note: handle newly established connections (not yet able to send requests, e.g. not authenticated)
    handleConnections();
    // for newly established connections, send an ApiVersionsRequest (by default sent right after a connection is first established)
    handleInitiateApiVersionRequests(updatedNow);
    // handle timed-out requests: close the connection and refresh the metadata
    handleTimedOutRequests(responses, updatedNow);
    completeResponses(responses);                                               // @4
    return responses;
}

Selector.poll()

public void poll(long timeout) throws IOException {
	if (timeout < 0)
		throw new IllegalArgumentException("timeout should be >= 0");

	boolean madeReadProgressLastCall = madeReadProgressLastPoll;
	 //note: Step 1: clear state left over from the previous poll
	clear();

	boolean dataInBuffers = !keysWithBufferedRead.isEmpty();

	if (hasStagedReceives() || !immediatelyConnectedKeys.isEmpty() || (madeReadProgressLastCall && dataInBuffers))
		timeout = 0;

	if (!memoryPool.isOutOfMemory() && outOfMemory) {
		//we have recovered from memory pressure. unmute any channel not explicitly muted for other reasons
		log.trace("Broker no longer low on memory - unmuting incoming sockets");
		for (KafkaChannel channel : channels.values()) {
			if (channel.isInMutableState() && !explicitlyMutedChannels.contains(channel)) {
				channel.maybeUnmute();
			}
		}
		outOfMemory = false;
	}

	/* check ready keys */
    
	//note: Step 2: get the number of keys with ready events
	long startSelect = time.nanoseconds();
	int numReadyKeys = select(timeout);//poll the selector
	long endSelect = time.nanoseconds();
	this.sensors.selectTime.record(endSelect - startSelect, time.milliseconds());

 	//note: Step 3: handle the I/O operations
	if (numReadyKeys > 0 || !immediatelyConnectedKeys.isEmpty() || dataInBuffers) {
		Set<SelectionKey> readyKeys = this.nioSelector.selectedKeys();

		// Poll from channels that have buffered data (but nothing more from the underlying socket)
		if (dataInBuffers) {
			keysWithBufferedRead.removeAll(readyKeys); //so no channel gets polled twice
			Set<SelectionKey> toPoll = keysWithBufferedRead;
			keysWithBufferedRead = new HashSet<>(); //poll() calls will repopulate if needed
			pollSelectionKeys(toPoll, false, endSelect);
		}

		// Poll from channels where the underlying socket has more data
		pollSelectionKeys(readyKeys, false, endSelect);
		// Clear all selected keys so that they are included in the ready count for the next select
		readyKeys.clear();

		pollSelectionKeys(immediatelyConnectedKeys, true, endSelect);
		immediatelyConnectedKeys.clear();
	} else {
		madeReadProgressLastPoll = true; //no work is also "progress"
	}

	long endIo = time.nanoseconds();
	this.sensors.ioTime.record(endIo - endSelect, time.milliseconds());

	// we use the time at the end of select to ensure that we don't close any connections that
	// have just been processed in pollSelectionKeys
    
	//note: called once after every poll
	//TODO: the connection is closed, but client-side caches for it still exist
	maybeCloseOldestConnection(endSelect);

	// Add to completedReceives after closing expired connections to avoid removing
	// channels with completed receives until all staged receives are completed.
    
	//note: Step 4: move the processed stagedReceives into completedReceives
	addToCompletedReceives();
}

Selector.clear()

clear() is the first step of every poll(); its job is to clean up part of the state left over from the previous poll.

//note: the following state is cleared before every poll
private void clear() {
    this.completedSends.clear();
    this.completedReceives.clear();
    this.connected.clear();
    this.disconnected.clear();
    // Remove closed channels after all their staged receives have been processed or if a send was requested
    for (Iterator<Map.Entry<String, KafkaChannel>> it = closingChannels.entrySet().iterator(); it.hasNext(); ) {
        KafkaChannel channel = it.next().getValue();
        Deque<NetworkReceive> deque = this.stagedReceives.get(channel);
        boolean sendFailed = failedSends.remove(channel.id());
        if (deque == null || deque.isEmpty() || sendFailed) {
            doClose(channel, true);
            it.remove();
        }
    }
    this.disconnected.addAll(this.failedSends);
    this.failedSends.clear();
}

Selector.select()

Selector.select() is implemented on top of the native Java NIO API; the nioSelector here is an instance of java.nio.channels.Selector. In the worst case this method blocks for ms milliseconds, but as soon as any channel has a ready event during the poll it returns immediately.

private int select(long ms) throws IOException {
    if (ms < 0L)
        throw new IllegalArgumentException("timeout should be >= 0");
 
    if (ms == 0L)
        return this.nioSelector.selectNow();
    else
        return this.nioSelector.select(ms);
}

Selector.pollSelectionKeys()

This is where the bulk of the socket I/O happens: sending Sends and receiving Receives are both done here. Within poll(), this method is called twice (plus once more for keys that still have buffered data):

  1. the first call handles the events that are already ready and performs the corresponding I/O;
  2. the second call handles newly established connections, setting up their buffers and performing the handshake and authentication of the (Kafka-wrapped) transport layer.
private void pollSelectionKeys(Iterable<SelectionKey> selectionKeys,
                                   boolean isImmediatelyConnected,
                                   long currentTimeNanos) {
        Iterator<SelectionKey> iterator = selectionKeys.iterator();
        while (iterator.hasNext()) {
            SelectionKey key = iterator.next();
            iterator.remove();
            KafkaChannel channel = channel(key);
 
            // register all per-connection metrics at once
            sensors.maybeRegisterConnectionMetrics(channel.id());
            if (idleExpiryManager != null)
                idleExpiryManager.update(channel.id(), currentTimeNanos);
 
            try {
 
                /* complete any connections that have finished their handshake (either normally or immediately) */
                //note: handle channels whose TCP connection has just been established
                if (isImmediatelyConnected || key.isConnectable()) {
                    if (channel.finishConnect()) {//note: the connection has been established
                        this.connected.add(channel.id());
                        this.sensors.connectionCreated.record();
                        SocketChannel socketChannel = (SocketChannel) key.channel();
                        log.debug("Created socket with SO_RCVBUF = {}, SO_SNDBUF = {}, SO_TIMEOUT = {} to node {}",
                                socketChannel.socket().getReceiveBufferSize(),
                                socketChannel.socket().getSendBufferSize(),
                                socketChannel.socket().getSoTimeout(),
                                channel.id());
                    } else
                        continue;
                }
 
                /* if channel is not ready finish prepare */
                //note: for channels that are connected but not yet ready, perform the transport-layer handshake and authentication
                if (channel.isConnected() && !channel.ready())
                    channel.prepare();
 
                /* if channel is ready read from any connections that have readable data */
                if (channel.ready() && key.isReadable() && !hasStagedReceive(channel)) {
                    NetworkReceive networkReceive;
                    while ((networkReceive = channel.read()) != null)//note: only once a complete Receive has been read is it added to the collection
                        addToStagedReceives(channel, networkReceive);//note: read data
                }
 
                /* if channel is ready write to any sockets that have space in their buffer and for which we have data */
                if (channel.ready() && key.isWritable()) {
                    Send send = channel.write();
                    if (send != null) {
                        this.completedSends.add(send);//note: add the completed send to the list
                        this.sensors.recordBytesSent(channel.id(), send.size());
                    }
                }
 
                /* cancel any defunct sockets */
                //note: close defunct connections
                if (!key.isValid())
                    close(channel, true);
 
            } catch (Exception e) {
                String desc = channel.socketDescription();
                if (e instanceof IOException)
                    log.debug("Connection with {} disconnected", desc, e);
                else
                    log.warn("Unexpected error from {}; closing connection", desc, e);
                close(channel, true);
            }
        }
    }

Selector.addToCompletedReceives()

The purpose of this method is to process the Receives that have been read. Since the Selector class is used on both the client side and the server side, there are two cases:

  1. On the server side (covered in detail in a later article): to preserve request ordering, the Selector provides mute(String id) and unmute(String id), which mark a KafkaChannel so that only one request per channel is processed at a time (think of it as an exclusive lock). When the server receives a request it first puts it into stagedReceives; since the channel is not yet muted, the Receive is then moved into completedReceives. While processing a request from completedReceives the server mutes the channel, and only after the response has been fully sent does it unmute the channel and move on to that channel's next request;
  2. On the client side, the Selector's mute() and unmute() are not called; the client's ordering is instead guaranteed by InFlightRequests and the RecordAccumulator's mutePartition, so every Receive arriving here is put straight into completedReceives for later processing.

This method only fully makes sense together with the server-side usage; it unifies the API used by client and server so that both can share the Selector class.

/**
 * checks if there are any staged receives and adds to completedReceives
 */
private void addToCompletedReceives() {
    if (!this.stagedReceives.isEmpty()) {//note: process stagedReceives
        Iterator<Map.Entry<KafkaChannel, Deque<NetworkReceive>>> iter = this.stagedReceives.entrySet().iterator();
        while (iter.hasNext()) {
            Map.Entry<KafkaChannel, Deque<NetworkReceive>> entry = iter.next();
            KafkaChannel channel = entry.getKey();
            if (!channel.isMute()) {
                Deque<NetworkReceive> deque = entry.getValue();
                addToCompletedReceives(channel, deque);
                if (deque.isEmpty())
                    iter.remove();
            }
        }
    }
}
 
private void addToCompletedReceives(KafkaChannel channel, Deque<NetworkReceive> stagedDeque) {
    NetworkReceive networkReceive = stagedDeque.poll();
    this.completedReceives.add(networkReceive); //note: add to completedReceives
    this.sensors.recordBytesReceived(channel.id(), networkReceive.payload().limit());
}

run() method flow chart

(figure: kafka-生产者源码解析_第6张图片)

How the Sender updates Metadata

Metadata update flow
(figure: kafka-生产者源码解析_第7张图片)

(1) Periodic update: refresh at a fixed interval, implemented via the Metadata fields lastRefreshMs and lastSuccessfulRefreshMs.

The corresponding ProducerConfig setting is:
metadata.max.age.ms // defaults to 300000 ms, i.e. once every 5 minutes

(2) Invalidation-triggered, forced update: once the metadata is detected to be stale or invalid, metadata.requestUpdate() is called to force an update.
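
Before looking at the source below, here is a small standalone sketch (a simplified model, not the real Metadata class; field and method names are assumptions) of how these two triggers combine into a single "time until the next update" value, which is essentially what Metadata.timeToNextUpdate computes:

public class MetadataUpdateTimer {
    private final long metadataExpireMs;   // metadata.max.age.ms (e.g. 300000)
    private long lastSuccessfulRefreshMs;  // set on every successful update
    private boolean needUpdate;            // set by requestUpdate() (forced update)

    public MetadataUpdateTimer(long metadataExpireMs, long nowMs) {
        this.metadataExpireMs = metadataExpireMs;
        this.lastSuccessfulRefreshMs = nowMs;
    }

    // forced update, e.g. an unknown partition leader was detected
    public void requestUpdate() {
        this.needUpdate = true;
    }

    // called after a successful metadata refresh
    public void update(long nowMs) {
        this.needUpdate = false;
        this.lastSuccessfulRefreshMs = nowMs;
    }

    // 0 means "update now"; otherwise the time left until the periodic refresh is due
    public long timeToNextUpdate(long nowMs) {
        long expiry = needUpdate ? 0 : lastSuccessfulRefreshMs + metadataExpireMs;
        return Math.max(expiry - nowMs, 0);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        MetadataUpdateTimer timer = new MetadataUpdateTimer(300_000, now);

        System.out.println(timer.timeToNextUpdate(now)); // ~300000: wait for the periodic refresh
        timer.requestUpdate();                           // e.g. unknown leader -> force an update
        System.out.println(timer.timeToNextUpdate(now)); // 0: update immediately
    }
}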

public void run() {
  // main loop, runs until close is called
  while (running) {
    try {
      run(time.milliseconds());
    } catch (Exception e) {
      log.error("Uncaught error in kafka producer I/O thread: ", e);
    }
  }
  。。。
}

public void run(long now) {
  Cluster cluster = metadata.fetch();
  。。。
    RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);   // iterate over the buffered messages and find the nodes that are already ready

  if (result.unknownLeadersExist)  // if any partition's leader is unknown, request a metadata update
    this.metadata.requestUpdate();

  。。。

    // the client's two key calls: one sends ClientRequests, the other receives ClientResponses; both are built on NIO poll
    for (ClientRequest request : requests)
      client.send(request, now);

  this.client.poll(pollTimeout, now);
}

//NetworkClient
public List<ClientResponse> poll(long timeout, long now) {
  long metadataTimeout = metadataUpdater.maybeUpdate(now); // key point: on every poll, check whether the metadata needs updating

  try {
    // use Kafka's network Selector (Selectable) to perform I/O with the brokers
    this.selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs));
  } catch (IOException e) {
    log.error("Unexpected error during I/O", e);
  }

  // process completed actions
  long updatedNow = this.time.milliseconds();
  List<ClientResponse> responses = new ArrayList<>();
  handleCompletedSends(responses, updatedNow);
  handleCompletedReceives(responses, updatedNow);   // the metadata update is handled in the handler invoked for the response
  handleDisconnections(responses, updatedNow);
  handleConnections();
  handleTimedOutRequests(responses, updatedNow);

  // invoke callbacks
  for (ClientResponse response : responses) {
    if (response.request().hasCallback()) {
      try {
        response.request().callback().onComplete(response);
      } catch (Exception e) {
        log.error("Uncaught error in request completion:", e);
      }
    }
  }

  return responses;
}
 
 //DefaultMetadataUpdater
@Override
public long maybeUpdate(long now) {
  // should we update our metadata?
  long timeToNextMetadataUpdate = metadata.timeToNextUpdate(now);
  long timeToNextReconnectAttempt = Math.max(this.lastNoNodeAvailableMs + metadata.refreshBackoff() - now, 0);
  long waitForMetadataFetch = this.metadataFetchInProgress ? Integer.MAX_VALUE : 0;
  // if there is no node available to connect, back off refreshing metadata
  long metadataTimeout = Math.max(Math.max(timeToNextMetadataUpdate, timeToNextReconnectAttempt),
                                  waitForMetadataFetch);

  if (metadataTimeout == 0) {
    // highly dependent on the behavior of leastLoadedNode.
    Node node = leastLoadedNode(now);  // pick the least loaded node
    maybeUpdate(now, node); // send the metadata update request to that node
  }

  return metadataTimeout;
}

private void maybeUpdate(long now, Node node) {
  if (node == null) {
    log.debug("Give up sending metadata request since no node is available");
    // mark the timestamp for no node available to connect
    this.lastNoNodeAvailableMs = now;
    return;
  }
  String nodeConnectionId = node.idString();

  if (canSendRequest(nodeConnectionId)) {
    Set<String> topics = metadata.needMetadataForAllTopics() ? new HashSet<String>() : metadata.topics();
    this.metadataFetchInProgress = true;
    ClientRequest metadataRequest = request(now, nodeConnectionId, topics);  // key point: build the metadata update request
    log.debug("Sending metadata request {} to node {}", metadataRequest, node.id());
    // stage the request in the NIO channel
    doSend(metadataRequest, now); // asynchronous send only; the response is handled later in handleCompletedReceives above
  } else if (connectionStates.canConnect(nodeConnectionId, now)) {
    log.debug("Initialize connection to node {} for sending metadata request", node.id());
    initiateConnect(node, now);

  } else { // connected, but can't send more OR connecting
    this.lastNoNodeAvailableMs = now;
  }
}
 
private void handleCompletedReceives(List<ClientResponse> responses, long now) {
  for (NetworkReceive receive : this.selector.completedReceives()) {
    String source = receive.source();
    ClientRequest req = inFlightRequests.completeNext(source);
    ResponseHeader header = ResponseHeader.parse(receive.payload());
    // Always expect the response version id to be the same as the request version id
    short apiKey = req.request().header().apiKey();
    short apiVer = req.request().header().apiVersion();
    Struct body = (Struct) ProtoUtils.responseSchema(apiKey, apiVer).read(receive.payload());
    correlate(req.request().header(), header);
    if (!metadataUpdater.maybeHandleCompletedReceive(req, now, body))
      responses.add(new ClientResponse(req, now, false, body));
  }
}
 
 
@Override
public boolean maybeHandleCompletedReceive(ClientRequest req, long now, Struct body) {
  short apiKey = req.request().header().apiKey();
  if (apiKey == ApiKeys.METADATA.id && req.isInitiatedByNetworkClient()) {
    handleResponse(req.request().header(), body, now);
    return true;
  }
  return false;
}

// key method
private void handleResponse(RequestHeader header, Struct body, long now) {
  this.metadataFetchInProgress = false;
  MetadataResponse response = new MetadataResponse(body);
  Cluster cluster = response.cluster();   // obtain a new Cluster object from the response
  if (response.errors().size() > 0) {
    log.warn("Error while fetching metadata with correlation id {} : {}", header.correlationId(), response.errors());
  }

  if (cluster.nodes().size() > 0) {
    this.metadata.update(cluster, now);   // update the metadata: the new cluster replaces the old one
  } else {
    log.trace("Ignoring empty metadata response with correlation id {}.", header.correlationId());
    this.metadata.failedUpdate(now);  // metadata update failed; run the failure handling logic
  }
}


// successful update: version + 1, and refresh the other fields
public synchronized void update(Cluster cluster, long now) {
    this.needUpdate = false;
    this.lastRefreshMs = now;
    this.lastSuccessfulRefreshMs = now;
    this.version += 1;

    for (Listener listener: listeners)
        // notify anyone listening for metadata updates
        listener.onMetadataUpdate(cluster);

    this.cluster = this.needMetadataForAllTopics ? getClusterForCurrentTopics(cluster) : cluster;    // the new cluster replaces the old one
    // wake up all blocked producer threads
    notifyAll();

    log.debug("Updated cluster metadata version {} to {}", this.version, this.cluster);
}

// failed update: only lastRefreshMs is refreshed
public synchronized void failedUpdate(long now) {
    this.lastRefreshMs = now;
}

How the Sender finds the least loaded node

1. The first choice is a node whose connection is in the READY state: connected and ready to send data.
2. The next choice is a node in the CONNECTING or CHECKING_API_VERSIONS state, i.e. a connection is being established.
3. Otherwise, a node that can be (re)connected: it is disconnected and the reconnect backoff has elapsed since the last connection attempt.

@Override
public Node leastLoadedNode(long now) {
    List<Node> nodes = this.metadataUpdater.fetchNodes();
    if (nodes.isEmpty())
        throw new IllegalStateException("There are no nodes in the Kafka cluster");
    int inflight = Integer.MAX_VALUE;

    Node foundConnecting = null;
    Node foundCanConnect = null;
    Node foundReady = null;
    
    // pick a random offset within nodes.size()
    int offset = this.randOffset.nextInt(nodes.size());
    for (int i = 0; i < nodes.size(); i++) {
        // round-robin over the node indices, starting from the random offset
        int idx = (offset + i) % nodes.size();
        Node node = nodes.get(idx);
        // first choice: READY state, connected and able to send data
        if (canSendRequest(node.idString(), now)) {
            int currInflight = this.inFlightRequests.count(node.idString());
            if (currInflight == 0) {
                // if we find an established connection with no in-flight requests we can stop right away
                log.trace("Found least loaded node {} connected with no in-flight requests", node);
                return node;
            } else if (currInflight < inflight) {
                // otherwise if this is the best we have found so far, record that
                inflight = currInflight;
                foundReady = node;
            }
         // CONNECTING or CHECKING_API_VERSIONS: a connection is in progress
        } else if (connectionStates.isPreparingConnection(node.idString())) {
            foundConnecting = node;
         // disconnected, and the reconnect backoff has elapsed, so a new connection attempt is allowed
        } else if (canConnect(node, now)) {
            if (foundCanConnect == null ||
                    this.connectionStates.lastConnectAttemptMs(foundCanConnect.idString()) >
                            this.connectionStates.lastConnectAttemptMs(node.idString())) {
                foundCanConnect = node;
            }
        } else {
            log.trace("Removing node {} from least loaded node selection since it is neither ready " +
                    "for sending or connecting", node);
        }
    }

    // We prefer established connections if possible. Otherwise, we will wait for connections
    // which are being established before connecting to new nodes.
    if (foundReady != null) {
        log.trace("Found least loaded node {} with {} inflight requests", foundReady, inflight);
        return foundReady;
    } else if (foundConnecting != null) {
        log.trace("Found least loaded connecting node {}", foundConnecting);
        return foundConnecting;
    } else if (foundCanConnect != null) {
        log.trace("Found least loaded node {} with no active connection", foundCanConnect);
        return foundCanConnect;
    } else {
        log.trace("Least loaded node selection failed to find an available node");
        return null;
    }
}

Node selection checks in detail

// READY state: connected and able to send data
private boolean canSendRequest(String node, long now) {
    return connectionStates.isReady(node, now) && selector.isChannelReady(node) &&
        inFlightRequests.canSendMore(node);
}
private boolean isReady(NodeConnectionState state, long now) {
    return state != null && state.state == ConnectionState.READY && state.throttleUntilTimeMs <= now;
}

// CONNECTING or CHECKING_API_VERSIONS: a connection is in progress
public boolean isPreparingConnection(String id) {
    NodeConnectionState state = nodeState.get(id);
    return state != null &&
        (state.state == ConnectionState.CONNECTING || state.state == ConnectionState.CHECKING_API_VERSIONS);
}

// disconnected, and the reconnect backoff has elapsed since the last attempt
public boolean canConnect(String id, long now) {
    NodeConnectionState state = nodeState.get(id);
    if (state == null)
        return true;
    else
        return state.state.isDisconnected() &&
        now - state.lastConnectAttemptMs >= state.reconnectBackoffMs;
}

DISCONNECTED: no connection has been successfully established yet

CONNECTING: a connection is being established

CHECKING_API_VERSIONS: the connection is established and the API versions are being checked; if this check fails, the connection is closed

READY: the connection is ready to send requests

AUTHENTICATION_FAILED: the connection failed due to an authentication error

public enum ConnectionState {
    DISCONNECTED, CONNECTING, CHECKING_API_VERSIONS, READY, AUTHENTICATION_FAILED;

    public boolean isDisconnected() {
        return this == AUTHENTICATION_FAILED || this == DISCONNECTED;
    }

    public boolean isConnected() {
        return this == CHECKING_API_VERSIONS || this == READY;
    }
}

How a message is staged as a Send and sent to the broker

NetworkClient class

NetworkClient.send() method

public void send(ClientRequest request, long now) {
    doSend(request, false, now);
}

NetworkClient.doSend() method

All producer-side requests go out through NetworkClient.doSend(); it:

  • checks the version information and builds the Request according to apiKey();
  • creates a NetworkSend instance;
  • calls Selector.send() to hand the Send to the selector.
//note: send a request
private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now) {
	String nodeId = clientRequest.destination();
	if (!isInternalRequest) {
		// If this request came from outside the NetworkClient, validate
		// that we can send data.  If the request is internal, we trust
		// that that internal code has done this validation.  Validation
		// will be slightly different for some internal requests (for
		// example, ApiVersionsRequests can be sent prior to being in
		// READY state.)
		if (!canSendRequest(nodeId))
			throw new IllegalStateException("Attempt to send a request to node " + nodeId + " which is not ready.");
	}
	AbstractRequest request = null;
	AbstractRequest.Builder<?> builder = clientRequest.requestBuilder();
	//note: build the AbstractRequest and check its version
	try {
		NodeApiVersions versionInfo = nodeApiVersions.get(nodeId);
		// Note: if versionInfo is null, we have no server version information. This would be
		// the case when sending the initial ApiVersionRequest which fetches the version
		// information itself.  It is also the case when discoverBrokerVersions is set to false.
		if (versionInfo == null) {
			if (discoverBrokerVersions && log.isTraceEnabled())
				log.trace("No version information found when sending message of type {} to node {}. " +
						"Assuming version {}.", clientRequest.apiKey(), nodeId, builder.version());
		} else {
			short version = versionInfo.usableVersion(clientRequest.apiKey());
			builder.setVersion(version);
		}
		// The call to build may also throw UnsupportedVersionException, if there are essential
		// fields that cannot be represented in the chosen version.
		request = builder.build();//note: builds a ProduceRequest for a Produce request, a MetadataRequest for a Metadata request
	} catch (UnsupportedVersionException e) {
		// If the version is not supported, skip sending the request over the wire.
		// Instead, simply add it to the local queue of aborted requests.
		log.debug("Version mismatch when attempting to send {} to {}",
				clientRequest.toString(), clientRequest.destination(), e);
		ClientResponse clientResponse = new ClientResponse(clientRequest.makeHeader(),
				clientRequest.callback(), clientRequest.destination(), now, now,
				false, e, null);
		abortedSends.add(clientResponse);
		return;
	}
	RequestHeader header = clientRequest.makeHeader();
	if (log.isDebugEnabled()) {
		int latestClientVersion = ProtoUtils.latestVersion(clientRequest.apiKey().id);
		if (header.apiVersion() == latestClientVersion) {
			log.trace("Sending {} to node {}.", request, nodeId);
		} else {
			log.debug("Using older server API v{} to send {} to node {}.",
				header.apiVersion(), request, nodeId);
		}
	}
	//note: Send is an interface; a NetworkSend is returned here, which extends ByteBufferSend
	Send send = request.toSend(nodeId, header);
	InFlightRequest inFlightRequest = new InFlightRequest(
			header,
			clientRequest.createdTimeMs(),
			clientRequest.destination(),
			clientRequest.callback(),
			clientRequest.expectResponse(),
			isInternalRequest,
			send,
			now);
	this.inFlightRequests.add(inFlightRequest);
	//note: bind the send to its KafkaChannel and enable the write event on that channel's underlying socket
	selector.send(inFlightRequest.send);
}

Selector class

Selector.send() method

It looks up the KafkaChannel that corresponds to the Send and calls setSend(), which registers an OP_WRITE event for that KafkaChannel.

//note: send a request
public void send(Send send) {
    String connectionId = send.destination();
    if (closingChannels.containsKey(connectionId))
        this.failedSends.add(connectionId);
    else {
        KafkaChannel channel = channelOrFail(connectionId, false);
        try {
            channel.setSend(send);
        } catch (CancelledKeyException e) {
            this.failedSends.add(connectionId);
            close(channel, false);
        }
    }
}

KafkaChannel

KafkaChannel.setSend()

setSend() should be read together with write() (the latter is invoked from Selector.poll()):

setSend(): assigns the Send to be sent to the current KafkaChannel and registers an OP_WRITE event;
write(): sends the current Send and, once it has completed, removes the registered OP_WRITE event.

//note: every call registers an OP_WRITE event
public void setSend(Send send) {
    if (this.send != null)
        throw new IllegalStateException("Attempt to begin a send operation with prior send operation still in progress.");
    this.send = send;
    this.transportLayer.addInterestOps(SelectionKey.OP_WRITE);
}
 
//note: call send() to write the current Send
public Send write() throws IOException {
    Send result = null;
    if (send != null && send(send)) {
        result = send;
        send = null;
    }
    return result;
}
 
//note: once the send has completed, remove the OP_WRITE interest
private boolean send(Send send) throws IOException {
    send.writeTo(transportLayer);
    if (send.completed())
        transportLayer.removeInterestOps(SelectionKey.OP_WRITE);
 
    return send.completed();
}
