
How to choose the number of topics/partitions in a Kafka cluster?



This is a common question asked by many Kafka users. The goal of this post is to explain a few important determining factors and provide a few simple formulas.


More Partitions Lead to Higher Throughput


The first thing to understand is that a topic partition is the unit of parallelism in Kafka. On both the producer and the broker side, writes to different partitions can be done fully in parallel. So expensive operations such as compression can utilize more hardware resources. On the consumer side, Kafka always gives a single partition's data to one consumer thread. Thus, the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed. Therefore, in general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve.





A rough formula for picking the number of partitions is based on throughput. You measure the throughput that you can achieve on a single partition for production (call it p) and consumption (call it c). Let's say your target throughput is t. Then you need to have at least max(t/p, t/c) partitions. The per-partition throughput that one can achieve on the producer depends on configurations such as the batching size, compression codec, type of acknowledgement, replication factor, etc. However, in general, one can produce at 10s of MB/sec on just a single partition as shown in this benchmark. The consumer throughput is often application dependent since it corresponds to how fast the consumer logic can process each message. So, you really need to measure it.
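
The formula above can be sketched in a few lines of Python. The function name and the sample numbers are hypothetical; the only rule taken from the text is max(t/p, t/c), rounded up because a partition count must be a whole number.

```python
import math


def min_partitions(target, per_partition_produce, per_partition_consume):
    """Smallest whole partition count satisfying max(t/p, t/c).

    All three arguments must use the same unit (for example MB/sec).
    """
    if min(target, per_partition_produce, per_partition_consume) <= 0:
        raise ValueError("all throughputs must be positive")
    needed = max(target / per_partition_produce,
                 target / per_partition_consume)
    return math.ceil(needed)


# Hypothetical measurements in MB/sec: t = 200, p = 20, c = 8.
# The slower side (consumption) dictates the result.
print(min_partitions(200, 20, 8))  # 25
```

In this made-up case the consumer is the bottleneck, which is why measuring c for your own application matters more than relying on a generic producer benchmark.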


Although it's possible to increase the number of partitions over time, one has to be careful if messages are produced with keys. When publishing a keyed message, Kafka deterministically maps the message to a partition based on the hash of the key. This provides a guarantee that messages with the same key are always routed to the same partition. This guarantee can be important for certain applications since messages within a partition are always delivered in order to the consumer. If the number of partitions changes, such a guarantee may no longer hold. To avoid this situation, a common practice is to over-partition a bit. Basically, you determine the number of partitions based on a future target throughput, say for one or two years later. Initially, you can just have a small Kafka cluster based on your current throughput. Over time, you can add more brokers to the cluster and proportionally move a subset of the existing partitions to the new brokers (which can be done online). This way, you can keep up with the throughput growth without breaking the semantics in the application when keys are used.
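
To see why changing the partition count breaks the key-to-partition guarantee, here is a minimal sketch of hash-modulo routing. It uses CRC32 purely as a stand-in hash; this is an assumption for illustration and not Kafka's own partitioner. The key names and partition counts are also made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Kafka's partitioner uses a
    # different hash function, but the modulo effect is the same.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"user-{i}" for i in range(1000)]

# Compare where each key lands before and after growing from 8 to 12 partitions.
moved = [k for k in keys if partition_for(k, 8) != partition_for(k, 12)]
print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Any key that moves will have its older messages in one partition and its newer messages in another, so per-key ordering is no longer guaranteed across the change.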



In addition to throughput, there are a few other factors that are worth considering when choosing the number of partitions. As you will see, in some cases, having too many partitions may also have a negative impact.


More Partitions Require More Open File Handles


Each partition maps to a directory in the file system in the broker. Within that log directory, there will be two files (one for the index and another for the actual data) per log segment. Currently, in Kafka, each broker opens a file handle of both the index and the data file of every log segment. So, the more partitions, the higher that one needs to configure the open file handle limit in the underlying operating system. This is mostly just a configuration issue. We have seen production Kafka clusters running with more than 30 thousand open file handles per broker.
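
As a rough sizing aid, the file handle count described above can be estimated as partitions times segments times two files. The function name and the inputs below are hypothetical, and the estimate covers log segment files only, not sockets or other handles the broker holds.

```python
def estimated_open_file_handles(partitions_on_broker,
                                segments_per_partition,
                                files_per_segment=2):
    """Estimate handles held for log segments on one broker.

    files_per_segment=2 reflects the index file plus the data file
    described in the text.
    """
    return partitions_on_broker * segments_per_partition * files_per_segment


# Hypothetical broker: 2000 partitions, 8 retained segments each.
print(estimated_open_file_handles(2000, 8))  # 32000
```

Comparing this figure with the operating system's open file limit for the broker process shows whether the limit needs raising.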


More Partitions May Increase Unavailability


Kafka supports intra-cluster replication, which provides higher availability and durability. A partition can have multiple replicas, each stored on a different broker. One of the replicas is designated as the leader and the rest of the replicas are followers. Internally, Kafka manages all those replicas automatically and makes sure that they are kept in sync. Both the producer and the consumer requests to a partition are served on the leader replica. When a broker fails, partitions with a leader on that broker become temporarily unavailable. Kafka will automatically move the leader of those unavailable partitions to some other replicas to continue serving the client requests. This process is done by one of the Kafka brokers designated as the controller. It involves reading and writing some metadata for each affected partition in ZooKeeper. Currently, operations to ZooKeeper are done serially in the controller.


In the common case when a broker is shut down cleanly, the controller will proactively move the leaders off the shutting down broker one at a time. The moving of a single leader takes only a few milliseconds. So, from the client's perspective, there is only a small window of unavailability during a clean broker shutdown.


However, when a broker is shut down uncleanly (e.g., kill -9), the observed unavailability could be proportional to the number of partitions. Suppose that a broker has a total of 2000 partitions, each with 2 replicas. Roughly, this broker will be the leader for about 1000 partitions. When this broker fails uncleanly, all those 1000 partitions become unavailable at exactly the same time. Suppose that it takes 5 ms to elect a new leader for a single partition. It will take up to 5 seconds to elect the new leader for all 1000 partitions. So, for some partitions, their observed unavailability can be 5 seconds plus the time taken to detect the failure.


If one is unlucky, the failed broker may be the controller. In this case, the process of electing the new leaders won't start until the controller fails over to a new broker. The controller failover happens automatically, but requires the new controller to read some metadata for every partition from ZooKeeper during initialization. For example, if there are 10,000 partitions in the Kafka cluster and initializing the metadata from ZooKeeper takes 2 ms per partition, this can add 20 more seconds to the unavailability window.
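
The two estimates above are simple multiplications, because the controller handles partitions one after another. The sketch below reproduces them; the function names are hypothetical, the per-partition costs are the example figures from the text, and failure detection time is deliberately left out.

```python
def leader_election_seconds(leader_partitions, ms_per_election):
    """Worst-case wait while leaders are elected one at a time."""
    return leader_partitions * ms_per_election / 1000


def controller_init_seconds(total_partitions, ms_per_partition):
    """Extra delay when the new controller reloads metadata serially."""
    return total_partitions * ms_per_partition / 1000


election = leader_election_seconds(1000, 5)   # 1000 leaders at 5 ms each
init = controller_init_seconds(10_000, 2)     # 10,000 partitions at 2 ms each

print(election)         # 5.0
print(init)             # 20.0
print(election + init)  # 25.0, if the failed broker was also the controller
```

Both terms grow linearly with partition count, which is why the partition limits discussed in this section are stated per broker and per cluster.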


In general, unclean failures are rare. However, if one cares about availability in those rare cases, it's probably better to limit the number of partitions per broker to two to four thousand and the total number of partitions in the cluster to the low tens of thousands.


More Partitions May Increase End-to-end Latency


The end-to-end latency in Kafka is defined by the time from when a message is published by the producer to when the message is read by the consumer. Kafka only exposes a message to a consumer after it has been committed, i.e., when the message is replicated to all the in-sync replicas. So, the time to commit a message can be a significant portion of the end-to-end latency. By default, a Kafka broker only uses a single thread to replicate data from another broker, for all partitions that share replicas between the two brokers. Our experiments show that replicating 1000 partitions from one broker to another can add about 20 ms latency, which implies that the end-to-end latency is at least 20 ms. This can be too high for some real-time applications.


Note that this issue is alleviated on a larger cluster. For example, suppose that there are 1000 partition leaders on a broker and there are 10 other brokers in the same Kafka cluster. Each of the remaining 10 brokers only needs to fetch 100 partitions from the first broker on average. Therefore, the added latency due to committing a message will be just a few ms, instead of tens of ms.
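
A back-of-the-envelope version of this argument is sketched below. It assumes the added latency scales linearly with the number of partitions a follower fetches from one broker, anchored on the figure of about 20 ms per 1000 partitions quoted above. The linear scaling and the function name are assumptions for illustration, not measured behavior.

```python
def added_commit_latency_ms(leaders_on_broker, other_brokers,
                            ms_per_1000_partitions=20.0):
    """Estimate replication latency, assuming linear scaling (hypothetical)."""
    if other_brokers < 1:
        raise ValueError("need at least one other broker to replicate to")
    # Followers are assumed to be spread evenly over the other brokers.
    partitions_per_follower = leaders_on_broker / other_brokers
    return partitions_per_follower * ms_per_1000_partitions / 1000


print(added_commit_latency_ms(1000, 1))   # 20.0
print(added_commit_latency_ms(1000, 10))  # 2.0
```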


As a rule of thumb, if you care about latency, it's probably a good idea to limit the number of partitions per broker to 100 x b x r, where b is the number of brokers in a Kafka cluster and r is the replication factor.
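
The rule of thumb translates directly into code. The function name and the example cluster are hypothetical; the constant 100 comes from the text.

```python
def max_partitions_per_broker(brokers, replication_factor, per_unit=100):
    """Latency-oriented limit from the rule of thumb: 100 x b x r."""
    return per_unit * brokers * replication_factor


# Hypothetical cluster: b = 10 brokers, r = 2 replicas per partition.
print(max_partitions_per_broker(10, 2))  # 2000
```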


More Partitions May Require More Memory In the Client


In the most recent 0.8.2 release which we ship with the Confluent Platform 1.0, we have developed a more efficient Java producer. One of the nice features of the new producer is that it allows users to set an upper bound on the amount of memory used for buffering incoming messages. Internally, the producer buffers messages per partition. After enough data has been accumulated or enough time has passed, the accumulated messages are removed from the buffer and sent to the broker.


If one increases the number of partitions, messages will be accumulated in more partitions in the producer. The aggregate amount of memory used may now exceed the configured memory limit. When this happens, the producer has to either block or drop any new message, neither of which is ideal. To prevent this from happening, one will need to reconfigure the producer with a larger memory size.


As a rule of thumb, to achieve good throughput, one should allocate at least a few tens of KB per partition being produced in the producer and adjust the total amount of memory if the number of partitions increases significantly.
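
A quick way to apply this rule is to multiply the number of partitions a producer writes to by a per-partition allowance. In the sketch below the 32 KB default is an assumed reading of "a few tens of KB", and the function name is hypothetical. The result can be compared against the Java producer's buffer.memory setting, which bounds the total buffering memory in bytes.

```python
def min_producer_buffer_bytes(partitions, kb_per_partition=32):
    """Lower bound on producer buffer memory, in bytes.

    kb_per_partition=32 is an assumed value standing in for
    "a few tens of KB" per partition.
    """
    return partitions * kb_per_partition * 1024


# Hypothetical producer writing to 1000 partitions.
print(min_producer_buffer_bytes(1000))  # 32768000
```

If the partition count grows significantly, rerunning the estimate shows whether the configured limit still leaves room for every partition's batch.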


A similar issue exists in the consumer as well. The consumer fetches a batch of messages per partition. The more partitions that a consumer consumes, the more memory it needs. However, this is typically only an issue for consumers that are not real time.




In general, more partitions in a Kafka cluster lead to higher throughput. However, one does have to be aware of the potential impact of having too many partitions in total or per broker on things like availability and latency. In the future, we do plan to improve some of those limitations to make Kafka more scalable in terms of the number of partitions.

