PUBLISH & SUBSCRIBE
Read and write streams of data like a messaging system.
[Learn more »](https://kafka.apache.org/documentation/#producerapi)

PROCESS
Write scalable stream processing applications that react to events in real-time.
[Learn more »](https://kafka.apache.org/documentation/streams)

STORE
Store streams of data safely in a distributed, replicated, fault-tolerant cluster.
[Learn more »](https://kafka.apache.org/intro#kafka_storage)
Protocol-oriented programming versus the MQ-oriented programming model:
With direct, protocol-oriented integration, applications A and B are strongly coupled.
By exchanging messages through an MQ instead, systems A and B can be decoupled.
Introduction
Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant durable way.
- Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts:
- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp (illustrated in the sketch below).
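To make the record structure concrete, here is a minimal sketch using the Java client; the topic name, key, and value are illustrative assumptions, not part of the original text:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

// A record carries an optional key, a value, and a timestamp; if the
// timestamp is omitted, the producer/broker assigns one automatically.
ProducerRecord<String, String> record = new ProducerRecord<>(
        "user-events",              // topic (hypothetical)
        null,                       // partition: null lets Kafka choose
        System.currentTimeMillis(), // timestamp
        "user-42",                  // key
        "logged_in");               // value
```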
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics. (A short sketch of this API follows the list.)
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams. (Introduced in Kafka 0.10; conceptually similar to Java 8's Stream API, transforming one stream into another.)
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
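For instance, publishing with the Producer API looks roughly like the following sketch; the broker address, topic name, and serializer choices are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to the (hypothetical) topic "my-topic".
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello kafka"));
        }
    }
}
```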
In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.
Topics and Logs
Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
Tips: in practice a topic is usually consumed by multiple consumers; the differences come from consumer groups, which determine whether delivery behaves as unicast (within a group) or multicast (across groups).
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
[Figure: Anatomy of a Topic]
Tips: the topic above is divided into three partitions, and the topic's messages are the union of all three. Messages within each partition are ordered, but ordering across partitions is not guaranteed; for example, the record at offset 1 in Partition 0 may have been sent after the record at offset 3 in another partition.
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period (the default is 168 hours, i.e. seven days). For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
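The retention period, like the partition count, can be set per topic at creation time. A hedged sketch using the Java AdminClient; names and values are assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("my-topic", 3, (short) 2)  // 3 partitions, replication factor 2
                    .configs(Map.of("retention.ms", "172800000"));   // retain records for 2 days
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```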
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
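Because the consumer's position is just an offset under its own control, "reprocess the past" or "skip to now" is a single seek call. A minimal sketch with the Java consumer; the topic and partition are assumptions:

```java
import java.util.Collections;
import org.apache.kafka.common.TopicPartition;

// Assumes `consumer` is a configured KafkaConsumer<String, String>.
TopicPartition tp = new TopicPartition("my-topic", 0);
consumer.assign(Collections.singletonList(tp));

consumer.seek(tp, 0L);                             // rewind: reprocess data from the past
consumer.seekToEnd(Collections.singletonList(tp)); // or skip ahead and consume from "now"
```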
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Distribution
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Tips: each partition can be stored as multiple replicas holding identical data; replication provides fault tolerance and high availability.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers" (a leader/follower arrangement similar in spirit to ZooKeeper's). The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader through leader election. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
Geo-Replication
Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple datacenters or cloud regions. You can use this in active/passive scenarios for backup and recovery; or in active/active scenarios to place data closer to your users, or support data locality requirements.
Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
Tips: in practice, the semantic (key-based) partitioning approach is the more commonly used of the two.
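Both strategies are visible in the Java producer API; a sketch, with topic and key names assumed:

```java
// Assumes `producer` is the KafkaProducer<String, String> from the earlier sketch.

// No key: the default partitioner spreads records across partitions
// (round-robin or sticky, depending on the client version).
producer.send(new ProducerRecord<>("page-views", "some-value"));

// With a key: the default partitioner hashes the key, so every record
// for "user-42" lands in the same partition, preserving its order.
producer.send(new ProducerRecord<>("page-views", "user-42", "clicked-checkout"));
```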
Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
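A sketch of one group member using the Java consumer; the group id, topic, and broker address are assumptions. Running several copies of this process with the same group.id splits the partitions among them, while a different group.id gives that group its own full copy of the stream:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "group-a");                 // same id => load balancing; different id => broadcast
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
            }
        }
    }
}
```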
More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
Tips: Kafka binds consumer instances to partitions rather than to individual records. You can observe this with a simple test: create a topic with a single partition and a consumer group containing two consumers, then send messages with a producer. Only one of the two consumers will receive the messages, and as long as that consumer does not die, it remains the sole consumer of that partition; the other consumer receives nothing.
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
Multi-tenancy
You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas. Administrators can define and enforce quotas on requests to control the broker resources that are used by clients. For more information, see the security documentation.
Guarantees
At a high level, Kafka gives the following guarantees:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log. (A short sketch of this guarantee follows the list.)
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
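The first guarantee can be observed from the producer side; a hedged sketch (topic name assumed) that sends two records with the same key, so both land in the same partition, and compares the offsets the broker assigned:

```java
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Assumes `producer` is a KafkaProducer<String, String>; .get() blocks
// until the broker acknowledges (and may throw checked exceptions).
RecordMetadata m1 = producer.send(new ProducerRecord<>("my-topic", "k", "M1")).get();
RecordMetadata m2 = producer.send(new ProducerRecord<>("my-topic", "k", "M2")).get();

// M1 was sent first, so it occupies the earlier position in the partition's log.
assert m1.offset() < m2.offset();
```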
More details on these guarantees are given in the design section of the documentation.
Kafka as a Messaging System
How does Kafka's notion of streams compare to a traditional enterprise messaging system?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.
Kafka has stronger ordering guarantees than a traditional messaging system, too.
A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
Kafka as a Storage System
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.
Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
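In the Java producer this acknowledgement behaviour is controlled by the acks setting; a brief sketch extending the earlier producer example (values are assumptions):

```java
// Set before constructing the producer: require acknowledgement from all
// in-sync replicas before a write is considered complete.
props.put("acks", "all");

// Blocking on the returned future surfaces the broker's acknowledgement
// (or the failure) to the caller.
producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
```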
The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
For details about Kafka's commit log storage and replication design, please read this page.
Kafka for Stream Processing
It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.
This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.
The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
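As an illustration of a Streams application built on these primitives, here is the canonical word-count sketch; the topic names and application id are assumptions:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");  // also names the consumer group
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("input-topic");       // consume the input stream
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)                                // stateful: repartition by word
                .count();                                                    // stateful: maintain running counts
        counts.toStream().to("output-topic",
                Produced.with(Serdes.String(), Serdes.Long()));              // produce the output stream

        new KafkaStreams(builder.build(), props).start();
    }
}
```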
Putting the Pieces Together
This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.
A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.
A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.
Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.
By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is, a single application can process historical, stored data, but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.
Likewise for streaming data pipelines, the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably makes it possible to use it for critical data where the delivery of data must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.
For more information on the guarantees, APIs, and capabilities Kafka provides see the rest of the documentation.