Kafka RecordAccumulator Source Code

RecordAccumulator

RecordAccumulator acts as a buffer queue. It groups messages by topic and partition (TopicPartition objects): each TopicPartition maps to a double-ended queue (Deque), and a ProducerBatch (called RecordBatch in the client version examined below) represents one batch of messages. When KafkaProducer appends a message, it always works on the batch at the tail of the queue (if the queue is not empty), while the Sender takes batches from the head of the queue for processing.
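Conceptually, each partition's queue is used like this (an illustrative sketch, not actual RecordAccumulator code; batches and tp stand for the map and a TopicPartition):

Deque<RecordBatch> dq = batches.get(tp);   // one double-ended queue per partition
RecordBatch last  = dq.peekLast();         // KafkaProducer appends records at the tail
RecordBatch first = dq.peekFirst();        // the Sender drains finished batches from the head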


RecordAccumulator contains the field ConcurrentMap<TopicPartition, Deque<RecordBatch>> batches;

ConcurrentMap --> implementation class: CopyOnWriteMap

TopicPartition --> the corresponding partition

Deque<RecordBatch> --> each RecordBatch wraps a MemoryRecords, which is where the messages are actually stored

The overall structure:

ConcurrentMap<TopicPartition, Deque<RecordBatch>> batches;

Kafka implements this core data structure with the CopyOnWriteMap idea because the map's key-value pairs are updated infrequently: a new TopicPartition -> Deque entry is created only the first time a partition is written to, so the update frequency is very low.

Its get operation, however, is a high-frequency read: the Deque for a TopicPartition is looked up on every send so records can be enqueued and dequeued. For this map, get dominates.

Kafka therefore applies the copy-on-write idea here so that updating a key-value pair never blocks the high-frequency reads, achieving lock-free reads and better concurrency.
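The trick is easy to see in an abridged sketch of Kafka's CopyOnWriteMap (the real class in org.apache.kafka.common.utils implements the full ConcurrentMap interface; this keeps only the parts relevant here):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class CopyOnWriteMap<K, V> {
    // volatile: readers always see the latest published snapshot without locking
    private volatile Map<K, V> map = Collections.emptyMap();

    // get is the hot path: no lock, just a volatile read
    public V get(Object k) {
        return map.get(k);
    }

    // writes are rare: copy the whole map under a lock, mutate the copy, publish it
    public synchronized V put(K k, V v) {
        Map<K, V> copy = new HashMap<>(this.map);
        V prev = copy.put(k, v);
        this.map = Collections.unmodifiableMap(copy);
        return prev;
    }

    // putIfAbsent is what getOrCreateDeque relies on when two threads race
    public synchronized V putIfAbsent(K k, V v) {
        return map.containsKey(k) ? map.get(k) : put(k, v);
    }
}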

 


 

MemoryRecords has four particularly important fields:
buffer: the java.nio ByteBuffer that holds the message data
writeLimit: the maximum number of bytes that may be written into buffer
compressor: the compressor; it compresses the message data and writes the compressed output into buffer
writable: flags whether the object is in read-only or writable mode
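For orientation, the declarations look roughly like this in the 0.10.x client (simplified):

private final Compressor compressor; // compresses records and writes them into buffer
private final int writeLimit;        // upper bound on the bytes that may be written
private ByteBuffer buffer;           // the java.nio.ByteBuffer holding the record data
private boolean writable;            // true while appending; close() sets it to false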
 

public Compressor(ByteBuffer buffer, CompressionType type) {
		// record the compression type and the buffer's starting position
        this.type = type;
        this.initPos = buffer.position();

        this.numRecords = 0;
        this.writtenUncompressed = 0;
        this.compressionRate = 1;
        this.maxTimestamp = Record.NO_TIMESTAMP;

        if (type != CompressionType.NONE) {
            // for compressed records, leave space for the header and the shallow message metadata
            // and move the starting position to the value payload offset
            buffer.position(initPos + Records.LOG_OVERHEAD + Record.RECORD_OVERHEAD);
        }

        // create the underlying output stream backed by the buffer
        bufferStream = new ByteBufferOutputStream(buffer);
        // wrap it in a compression stream chosen by the compression type
        appendStream = wrapForOutput(bufferStream, type, COMPRESSION_DEFAULT_BUFFER_SIZE);
    }

 

At the end of the Compressor constructor, wrapForOutput is called to create a compression stream of the specified type over the current buffer:

public static DataOutputStream wrapForOutput(ByteBufferOutputStream buffer, CompressionType type, int bufferSize) {
        try {
            switch (type) {
                case NONE:
                    return new DataOutputStream(buffer);
                case GZIP:
                    return new DataOutputStream(new GZIPOutputStream(buffer, bufferSize));
                case SNAPPY:
                    try {
                        OutputStream stream = (OutputStream) snappyOutputStreamSupplier.get().newInstance(buffer, bufferSize);
                        return new DataOutputStream(stream);
                    } catch (Exception e) {
                        throw new KafkaException(e);
                    }
                case LZ4:
                    try {
                        OutputStream stream = (OutputStream) lz4OutputStreamSupplier.get().newInstance(buffer);
                        return new DataOutputStream(stream);
                    } catch (Exception e) {
                        throw new KafkaException(e);
                    }
                default:
                    throw new IllegalArgumentException("Unknown compression type: " + type);
            }
        } catch (IOException e) {
            throw new KafkaException(e);
        }
    }

wrapForOutput simply builds an output stream for the requested compression type. So, by stacking decorators, Compressor gives the buffer both automatic expansion and compression.
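For GZIP, for example, the resulting decorator stack is:

appendStream = DataOutputStream
                 -> GZIPOutputStream          (compression)
                   -> ByteBufferOutputStream  (auto-expands the underlying ByteBuffer)
                     -> ByteBuffer            (the actual storage)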
Next let's look at a few of the more important MemoryRecords methods:
emptyRecords: the only way to obtain a MemoryRecords instance
append: checks that the object is in writable mode, then delegates to the Compressor's put methods
hasRoomFor: estimates whether another record fits, based on the byte count the Compressor has already written
close: if the buffer was expanded, MemoryRecords.buffer and ByteBufferOutputStream.buffer no longer point at the same ByteBuffer, so close() re-points MemoryRecords.buffer at the expanded object and switches it to read-only mode
sizeInBytes: in writable mode returns the size of ByteBufferOutputStream.buffer; in read-only mode returns the size of MemoryRecords.buffer
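Putting these methods together, a typical write lifecycle looks like this (a sketch; keyBytes, valueBytes and timestamp are assumed to be prepared by the caller):

ByteBuffer buffer = ByteBuffer.allocate(16 * 1024);
// emptyRecords is the only public way to obtain a writable MemoryRecords
MemoryRecords records = MemoryRecords.emptyRecords(buffer, CompressionType.GZIP, 16 * 1024);
if (records.hasRoomFor(keyBytes, valueBytes)) {
    long checksum = records.append(0L, timestamp, keyBytes, valueBytes); // returns the record's CRC
}
records.close();                   // re-points buffer if the stream expanded, flips to read-only
int bytes = records.sizeInBytes(); // now measured against MemoryRecords.buffer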
Now let's look at RecordBatch's core method, tryAppend:

public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Callback callback, long now) {
		// check whether there is still room for this record
        if (!this.records.hasRoomFor(key, value)) {
            return null;
        } else {
        	// append the record to the underlying MemoryRecords
            long checksum = this.records.append(offsetCounter++, timestamp, key, value);
           // update batch statistics
            this.maxRecordSize = Math.max(this.maxRecordSize, Record.recordSize(key, value));
            this.lastAppendTime = now;
            // create the FutureRecordMetadata for this record
            FutureRecordMetadata future = new FutureRecordMetadata(this.produceFuture, this.recordCount,
                                                                   timestamp, checksum,
                                                                   key == null ? -1 : key.length,
                                                                   value == null ? -1 : value.length);
            // wrap the user callback and the future into a Thunk and remember it in thunks
            if (callback != null)
                thunks.add(new Thunk(callback, future));
            this.recordCount++;
            return future;
        }
    }

When the batch receives a successful response, times out, or the producer is closed, RecordBatch.done() is invoked:

public void done(long baseOffset, long timestamp, RuntimeException exception) {
        log.trace("Produced messages to topic-partition {} with base offset offset {} and error: {}.",
                  topicPartition,
                  baseOffset,
                  exception);
        // execute all of the saved callbacks
        for (int i = 0; i < this.thunks.size(); i++) {
            try {
                Thunk thunk = this.thunks.get(i);
                if (exception == null) {
                    // wrap the broker's response in a RecordMetadata
                    RecordMetadata metadata = new RecordMetadata(this.topicPartition,  baseOffset, thunk.future.relativeOffset(),
                                                                 timestamp == Record.NO_TIMESTAMP ? thunk.future.timestamp() : timestamp,
                                                                 thunk.future.checksum(),
                                                                 thunk.future.serializedKeySize(),
                                                                 thunk.future.serializedValueSize());
                    thunk.callback.onCompletion(metadata, null);
                } else {
                    thunk.callback.onCompletion(null, exception);
                }
            } catch (Exception e) {
                log.error("Error executing user-provided callback on message for topic-partition {}:", topicPartition, e);
            }
        }
        // mark the whole RecordBatch as finished
        this.produceFuture.done(topicPartition, baseOffset, exception);
    }

In done, every message's callback is executed, and at the end the whole RecordBatch is marked as finished.
Creating and releasing ByteBuffers is relatively expensive. Just as Netty (covered earlier) allocates most of its memory from its own pool, the Kafka client has its own memory management: BufferPool reuses ByteBuffers. Each BufferPool manages ByteBuffers of one specific size only; buffers of any other size are not cached in the pool. Its important fields:
free: an ArrayDeque that caches ByteBuffers of the poolable size
lock (ReentrantLock): multiple threads allocate and release ByteBuffers concurrently, so a lock guards the state
waiters: records threads blocked because they could not get enough space; the queue actually stores each blocked thread's Condition object
totalMemory: the total size of the pool
availableMemory: the amount of unallocated space

 

public ByteBuffer allocate(int size, long maxTimeToBlockMs) throws InterruptedException {
        if (size > this.totalMemory)
            throw new IllegalArgumentException("Attempt to allocate " + size + " bytes, but there is a hard limit of "   + this.totalMemory   + " on memory allocations.");
		// take the pool lock
        this.lock.lock();
        try {
            // if the request is exactly the poolable size and a free ByteBuffer is cached, return it directly
            if (size == poolableSize && !this.free.isEmpty())
                return this.free.pollFirst();

            // total bytes held by the free list
            int freeListSize = this.free.size() * this.poolableSize;
            // compare available memory plus the free list against the requested size
            if (this.availableMemory + freeListSize >= size) {
                // enough memory overall: release ByteBuffers from the free list into availableMemory until the request fits
                freeUp(size);
                // account for the allocation
                this.availableMemory -= size;
                // unlock
                lock.unlock();
                // allocate a buffer of the requested size directly
                return ByteBuffer.allocate(size);
            } else {
                // not enough memory right now; this thread has to block
                int accumulated = 0;
                ByteBuffer buffer = null;
                Condition moreMemory = this.lock.newCondition();
                long remainingTimeToBlockNs = TimeUnit.MILLISECONDS.toNanos(maxTimeToBlockMs);
                // join the waiters queue
                this.waiters.addLast(moreMemory);
                // loop until enough memory has been gathered
                while (accumulated < size) {
                    long startWaitNs = time.nanoseconds();
                    long timeNs;
                    boolean waitingTimeElapsed;
                    try {
                        waitingTimeElapsed = !moreMemory.await(remainingTimeToBlockNs, TimeUnit.NANOSECONDS);
                    } catch (InterruptedException e) {
                        this.waiters.remove(moreMemory);
                        throw e;
                    } finally {
                    	// record how long we waited
                        long endWaitNs = time.nanoseconds();
                        timeNs = Math.max(0L, endWaitNs - startWaitNs);
                        this.waitTime.record(timeNs, time.milliseconds());
                    }
					// on timeout, give up and throw
                    if (waitingTimeElapsed) {
                        this.waiters.remove(moreMemory);
                        throw new TimeoutException("Failed to allocate memory within the configured max blocking time " + maxTimeToBlockMs + " ms.");
                    }

                    remainingTimeToBlockNs -= timeNs;
                    // re-check: a poolable-sized buffer may have been freed while we waited
                    if (accumulated == 0 && size == this.poolableSize && !this.free.isEmpty()) {
                        // just grab a buffer from the free list
                        buffer = this.free.pollFirst();
                        accumulated = size;
                    } else {
                        // take whatever is available now and keep waiting for the rest
                        freeUp(size - accumulated);
                        int got = (int) Math.min(size - accumulated, this.availableMemory);
                        this.availableMemory -= got;
                        accumulated += got;
                    }
                }

                // allocation succeeded: remove ourselves from the waiters queue
                Condition removed = this.waiters.removeFirst();
                if (removed != moreMemory)
                    throw new IllegalStateException("Wrong condition: this shouldn't happen.");

                // if there is still free space, wake the next waiting thread
                if (this.availableMemory > 0 || !this.free.isEmpty()) {
                    if (!this.waiters.isEmpty())
                        this.waiters.peekFirst().signal();
                }

                // unlock and return the buffer
                lock.unlock();
                if (buffer == null)
                    return ByteBuffer.allocate(size);
                else
                    return buffer;
            }
        } finally {
            if (lock.isHeldByCurrentThread())
                lock.unlock();
        }
    }

Next, the deallocate method:

public void deallocate(ByteBuffer buffer, int size) {
		// take the pool lock
        lock.lock();
        try {
        	// if the buffer is exactly the poolable size, clear it and put it straight back on the free list
            if (size == this.poolableSize && size == buffer.capacity()) {
                buffer.clear();
                this.free.add(buffer);
            } else {
            	// otherwise just count the space as available and let GC reclaim the buffer
                this.availableMemory += size;
            }
            // wake a thread from the waiters queue, if any
            Condition moreMem = this.waiters.peekFirst();
            if (moreMem != null)
                moreMem.signal();
        } finally {
        	// unlock
            lock.unlock();
        }
}

When releasing memory, the size is distinguished: buffers of the managed (poolable) size go back into free, while anything else is left for GC. After releasing, one thread in the waiters queue is woken up.
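A minimal usage sketch (the constructor signature follows the 0.10.x client; metrics and time are assumed to exist in scope):

// a 32 MB pool that recycles 16 KB buffers (batch.size); other sizes bypass the free list
BufferPool pool = new BufferPool(32 * 1024 * 1024, 16 * 1024, metrics, time, "producer-metrics");
ByteBuffer buf = pool.allocate(16 * 1024, 60000); // may block up to max.block.ms, here 60 s
try {
    // ... write record data into buf ...
} finally {
    pool.deallocate(buf, buf.capacity()); // poolable size: cleared and returned to free
}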
That covers the classes RecordAccumulator builds on. Now let's look at RecordAccumulator's key fields and methods:
batches: maps each TopicPartition to its Deque of RecordBatches; the type is CopyOnWriteMap
batchSize: the size of the ByteBuffer underlying each RecordBatch
compression: the compression type
incomplete: the set of RecordBatches that have not finished sending, backed by a Set
free: the BufferPool
drainIndex: when drain() exports batches, drainIndex remembers where the previous pass stopped so that the next pass resumes from there, preventing starvation of partitions
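A simplified view of the declarations (0.10.x era, for orientation):

private final ConcurrentMap<TopicPartition, Deque<RecordBatch>> batches; // a CopyOnWriteMap
private final int batchSize;                      // ByteBuffer size for each RecordBatch
private final CompressionType compression;        // compression type
private final IncompleteRecordBatches incomplete; // batches not yet fully sent, backed by a Set
private final BufferPool free;                    // the ByteBuffer pool
private int drainIndex;                           // where the previous drain() pass stopped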
Now back to the line we skipped in the previous section, where the message is appended to the RecordAccumulator:

public RecordAppendResult append(TopicPartition tp,
                                     long timestamp,
                                     byte[] key,
                                     byte[] value,
                                     Callback callback,
                                     long maxTimeToBlock) throws InterruptedException {
        // track the number of threads currently appending
        appendsInProgress.incrementAndGet();
        try {
            // find (or create) the Deque for this partition
            Deque<RecordBatch> dq = getOrCreateDeque(tp);
            // lock the queue
            synchronized (dq) {
                if (closed)
                    throw new IllegalStateException("Cannot send after the producer is closed.");
                // try to append to the last batch in the queue
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
                // if the append succeeded, return immediately
                if (appendResult != null)
                    return appendResult;
            }

            int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
            log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
            // the append failed, so allocate a new buffer from the BufferPool
            ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
            // lock the queue again
            synchronized (dq) {
                // Need to check if producer is closed again after grabbing the dequeue lock.
                if (closed)
                    throw new IllegalStateException("Cannot send after the producer is closed.");
				// retry the append (another thread may have created a batch meanwhile)
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
                if (appendResult != null) {
                    // the retry succeeded, so release the buffer we just allocated
                    free.deallocate(buffer);
                    return appendResult;
                }
                // otherwise build a new RecordBatch and add it to the batches collection
                MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
                RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
                
                FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));

                dq.addLast(batch);
                // register the new batch in the incomplete set
                incomplete.add(batch);
                return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
            }
        } finally {
            appendsInProgress.decrementAndGet();
        }
    }

To summarize what append does:
1. Look up the TopicPartition's Deque in batches, creating one if it does not exist
2. Lock the Deque
3. Call tryAppend to append the record to the last RecordBatch in the Deque
4. Unlock
5. If the append succeeded, return
6. If it failed, allocate a new buffer from the BufferPool
7. Repeat steps 2 and 3
8. If this retry succeeded, release the buffer from step 6 and return; if it failed, create a new RecordBatch
9. Append the record to the new RecordBatch and add the batch to the tail of the Deque
10. Add the new RecordBatch to the incomplete set
11. Unlock
12. Return a RecordAppendResult, whose fields decide whether the Sender thread is woken up
The last step of doSend uses the RecordAppendResult to decide whether to wake the Sender thread (see the snippet below). The Sender is woken when:
1. the last RecordBatch of the message's queue is full, or
2. the queue contains more than one RecordBatch
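In KafkaProducer.doSend this check is only a couple of lines (abridged from the 0.10.x source):

RecordAccumulator.RecordAppendResult result =
        accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);
if (result.batchIsFull || result.newBatchCreated) {
    // the last batch is full, or a new batch had to be created: wake the Sender
    this.sender.wakeup();
}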
Before the client sends messages to the brokers, it calls RecordAccumulator.ready() to compute the set of cluster nodes that are ready to receive messages:

public ReadyCheckResult ready(Cluster cluster, long nowMs) {
		// records which nodes we can send messages to
        Set<Node> readyNodes = new HashSet<>();
        // the delay until ready() needs to be called again
        long nextReadyCheckDelayMs = Long.MAX_VALUE;
        // whether any partition's leader replica is unknown in the Metadata
        boolean unknownLeadersExist = false;
		// whether any thread is blocked waiting for the BufferPool to free space
        boolean exhausted = this.free.queued() > 0;
        // iterate over batches
        for (Map.Entry<TopicPartition, Deque<RecordBatch>> entry : this.batches.entrySet()) {
            TopicPartition part = entry.getKey();
            // the queue for this TopicPartition
            Deque<RecordBatch> deque = entry.getValue();
			// find the node hosting this partition's leader replica
            Node leader = cluster.leaderFor(part);
            // if the leader is unknown, we cannot send; a Metadata update will be triggered later
            if (leader == null) {
                unknownLeadersExist = true;
            } else if (!readyNodes.contains(leader) && !muted.contains(part)) {
            	// lock the queue
                synchronized (deque) {
                	// look at the first RecordBatch
                    RecordBatch batch = deque.peekFirst();
                    // if it exists
                    if (batch != null) {
                    	// evaluate the sendable conditions
                        boolean backingOff = batch.attempts > 0 && batch.lastAttemptMs + retryBackoffMs > nowMs;
                        long waitedTimeMs = nowMs - batch.lastAttemptMs;
                        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                        long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                        boolean full = deque.size() > 1 || batch.records.isFull();
                        boolean expired = waitedTimeMs >= timeToWaitMs;
                        boolean sendable = full || expired || exhausted || closed || flushInProgress();
                        if (sendable && !backingOff) {
                            readyNodes.add(leader);
                        } else {
                            // remember the shortest delay until the next ready() check
                            nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
                        }
                    }
                }
            }
        }

        return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeadersExist);
    }

 

A node is selected when any of the following conditions holds:
1. The Deque contains more than one RecordBatch, or its first RecordBatch is full
2. The first batch has waited long enough (lingerMs, or the retry backoff, has elapsed)
3. Another thread is blocked waiting for the BufferPool to free space
4. A thread is waiting for a flush operation to complete
5. The Sender thread is preparing to close
After ready() returns the readyNodes set, the set is further filtered by the NetworkClient before the final target nodes are known.
With the node set in hand, RecordAccumulator calls drain() to convert the TopicPartition -> RecordBatch mapping into a NodeId -> RecordBatch mapping, because at the network layer the producer only cares about which node it is sending to, not which TopicPartition the data belongs to.
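The two steps are wired together in Sender.run (an abridged sketch; error handling and metadata updates omitted):

// 1. ask the accumulator which nodes have sendable data
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

// 2. drop nodes whose network connections are not yet ready
Iterator<Node> iter = result.readyNodes.iterator();
while (iter.hasNext()) {
    Node node = iter.next();
    if (!this.client.ready(node, now))
        iter.remove();
}

// 3. regroup the batches by target broker id
Map<Integer, List<RecordBatch>> batches =
        this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);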

public Map<Integer, List<RecordBatch>> drain(Cluster cluster,
                                                 Set<Node> nodes,
                                                 int maxSize,
                                                 long now) {
      	// if nodes is empty, return an empty map immediately
        if (nodes.isEmpty())
            return Collections.emptyMap();

        Map<Integer, List<RecordBatch>> batches = new HashMap<>();
        // iterate over the selected nodes
        for (Node node : nodes) {
            int size = 0;
            // all partitions hosted on this node
            List<PartitionInfo> parts = cluster.partitionsForNode(node.id());
            List<RecordBatch> ready = new ArrayList<>();
           // the index this pass starts from (resumes where the last pass stopped)
            int start = drainIndex = drainIndex % parts.size();
            do {
            	// the partition at the current index
                PartitionInfo part = parts.get(drainIndex);
                // wrap it in a TopicPartition
                TopicPartition tp = new TopicPartition(part.topic(), part.partition());
                // Only proceed if the partition has no in-flight batches.
                if (!muted.contains(tp)) {
                    Deque<RecordBatch> deque = getDeque(new TopicPartition(part.topic(), part.partition()));
                    if (deque != null) {
                    	// lock the queue
                        synchronized (deque) {
                        	// look at the first RecordBatch in the queue
                            RecordBatch first = deque.peekFirst();
                            // if it exists
                            if (first != null) {
                                boolean backoff = first.attempts > 0 && first.lastAttemptMs + retryBackoffMs > now;
                                // Only drain the batch if it is not during backoff period.
                                if (!backoff) {
                                    if (size + first.records.sizeInBytes() > maxSize && !ready.isEmpty()) {
                                        // the request is full; this batch will go in a later request
                                        break;
                                    } else {
                                    	// take the batch off the queue and add it to the ready list
                                        RecordBatch batch = deque.pollFirst();
                                        batch.records.close();
                                        size += batch.records.sizeInBytes();
                                        ready.add(batch);
                                        batch.drainedMs = now;
                                    }
                                }
                            }
                        }
                    }
                }
                this.drainIndex = (this.drainIndex + 1) % parts.size();
            } while (start != drainIndex);
            batches.put(node.id(), ready);
        }
        return batches;
    }

drain iterates over all the nodes and, for each node's partitions, looks up the queue by topic and partition, taking at most one RecordBatch per queue per pass. If adding a batch would exceed maxSize, that batch is left for the next request. When the pass completes, the map holds, for each node id, the list of batches to send to that node.

 

 

The main flow


1. ProducerInterceptors intercept the message.
2. The Serializers serialize the message key and value.
3. The Partitioner chooses a partition for the message.
4. The RecordAccumulator collects messages so they can be sent in batches.
5. The Sender fetches messages from the RecordAccumulator.
6. The Sender constructs a ClientRequest.
7. The ClientRequest is handed to the NetworkClient, ready to be sent.
8. The NetworkClient puts the request into the KafkaChannel's buffer.
9. Network I/O is performed and the request is sent.
10. On receiving the response, the ClientRequest's callback is invoked.
11. That in turn invokes the RecordBatch's callback (done), which finally runs the callback registered with each message.
Two threads cooperate throughout the send path: the main thread wraps business data in ProducerRecord objects and calls send(), which parks the messages in the RecordAccumulator; the Sender thread builds the requests and performs the actual network I/O, pulling messages out of the RecordAccumulator and sending them in batches.
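From the application's point of view, the pieces above map directly onto familiar producer configs; a minimal end-to-end example (broker address and topic name are placeholders):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // placeholder
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("buffer.memory", "33554432"); // BufferPool totalMemory
props.put("batch.size", "16384");       // RecordBatch buffer size / BufferPool poolableSize
props.put("linger.ms", "5");            // the lingerMs used by ready()

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// send() only appends to the RecordAccumulator; the Sender thread performs the I/O.
// The callback is eventually run from RecordBatch.done() on the Sender's thread.
producer.send(new ProducerRecord<>("demo-topic", "key", "value"), (metadata, exception) -> {
    if (exception != null)
        exception.printStackTrace();
    else
        System.out.printf("sent to %s-%d@%d%n", metadata.topic(), metadata.partition(), metadata.offset());
});
producer.close();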
