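Most of the Kafka Streams API is easy to understand and use, but time-window aggregation, that is, the windowedBy method, deserves careful study because it is easy to misuse.
This article first introduces Kafka Streams and then analyzes the time-window aggregation API, windowedBy(), in detail.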
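Kafka Streams is a client library for building applications and microservices whose input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying Java/Scala applications on the client side with the advantages of a Kafka server cluster.
Kafka Streams hides the complexity of using the Kafka Consumer directly: there is no need to call poll() in a loop or to take care of commit(). It also makes real-time computation and real-time analysis straightforward.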
public class WordCountApplication {
    public static void main(final String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "500"); // by default a commit happens every 30 s
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        StreamsBuilder builder = new StreamsBuilder();
        // Create a stream from the topic named "TextLinesTopic".
        KStream |
Start kafka-console-producer, create the topic TextLinesTopic0, and send some messages.
.\bin\windows\kafka-console-producer.bat --broker-list localhost:9092 --topic TextLinesTopic
.\bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic TextLinesTopic0
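With the Kafka Streams API, real-time computations are easy to write, for example with the groupBy and count methods above, or with the windowedBy method discussed next. Implementing the same logic directly on top of the Kafka Consumer, without Kafka Streams, would be considerably more work.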
Window name | Behavior | Short description |
Tumbling time window | Time-based | Fixed-size, non-overlapping, gap-less windows |
Hopping time window | Time-based | Fixed-size, overlapping windows |
Sliding time window | Time-based | Fixed-size, overlapping windows that work on differences between record timestamps |
Session window | Session-based | Dynamically-sized, non-overlapping, data-driven windows |
Tumbling time windows are a special case of hopping time windows: the advance interval equals the window size.
Tumbling time windows are aligned to the epoch, with the lower interval bound being inclusive and the upper bound being exclusive. "Aligned to the epoch" means that the first window starts at timestamp zero. For example, tumbling windows with a size of 5000ms have predictable window boundaries [0;5000),[5000;10000),... and not [1000;6000),[6000;11000),... or even something "random" like [1452;6452),[6452;11452),...
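The alignment rule can be checked with a little arithmetic. The following is a minimal plain-Java sketch, not Kafka Streams' own code: the class and method names are made up for illustration, and it assumes non-negative timestamps in milliseconds.

```java
/** Illustrative helper (hypothetical name): epoch-aligned tumbling window arithmetic. */
final class TumblingWindowSketch {

    /** Start of the tumbling window [start; start + sizeMs) that contains timestampMs. */
    static long windowStart(long timestampMs, long sizeMs) {
        // Windows start at 0, sizeMs, 2 * sizeMs, ... so we round down to a multiple of sizeMs.
        return timestampMs - (timestampMs % sizeMs);
    }

    /** Renders the window in the [lower;upper) notation used in the text. */
    static String describe(long timestampMs, long sizeMs) {
        long start = windowStart(timestampMs, sizeMs);
        return "[" + start + ";" + (start + sizeMs) + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(1452L, 5000L)); // prints [0;5000)
        System.out.println(describe(4999L, 5000L)); // prints [0;5000), the upper bound is exclusive
        System.out.println(describe(5000L, 5000L)); // prints [5000;10000), the lower bound is inclusive
    }
}
```

A record with timestamp 1452 therefore lands in [0;5000), never in a window that starts at 1452.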
private static final String BOOT_STRAP_SERVERS = "localhost:9092";
private static final String TEST_TOPIC = "test_topic";
private static final long TIME_WINDOW_SECONDS = 5L; // window size

@Test
public void testTumblingTimeWindows() throws InterruptedException {
    Properties props = configStreamProperties();
    StreamsBuilder builder = new StreamsBuilder();
    KStream |
@BeforeClass
public static void generateValue() {
    Properties props = new Properties();
    props.put("bootstrap.servers", BOOT_STRAP_SERVERS);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("request.required.acks", "0");
    new Thread(() -> {
        Producer |
Below is some shared code that later examples also use:
private Properties configStreamProperties() {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-ljf-test");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, BOOT_STRAP_SERVERS);
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "500"); // todo: the default is 30 s, so data would be committed only every 30 s
    return props;
}

private boolean isOldWindow(Windowed |
@BeforeClass
public static void generateValue() {
    Properties props = new Properties();
    // ... configuration unchanged, omitted here
    new Thread(() -> {
        Producer |
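PS: Besides groupByKey there is also groupBy. The former can be seen as a special case of the latter, which lets you define custom grouping logic based on the message's key and value. For details, see the Stateless transformations section of the official API documentation.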
Sliding windows are actually quite different from hopping and tumbling windows. In Kafka Streams, sliding windows are used only for join operations, and can be specified through the JoinWindows class.
A sliding window models a fixed-size window that slides continuously over the time axis; here, two data records are said to be included in the same window if (in the case of symmetric windows) the difference of their timestamps is within the window size. Thus, sliding windows are not aligned to the epoch, but to the data record timestamps. In contrast to hopping and tumbling windows, the lower and upper window time interval bounds of sliding windows are both inclusive.
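The "same window" rule for symmetric sliding windows reduces to one comparison. Here is a minimal plain-Java sketch of that rule as stated above; it is not the JoinWindows implementation, and the class and method names are hypothetical.

```java
/** Illustrative helper (hypothetical name) for the symmetric sliding-window rule. */
final class SlidingWindowSketch {

    /**
     * Two records share a sliding window when their timestamps differ by at most
     * windowSizeMs. Both bounds are inclusive, so a difference of exactly
     * windowSizeMs still counts.
     */
    static boolean inSameWindow(long timestampA, long timestampB, long windowSizeMs) {
        return Math.abs(timestampA - timestampB) <= windowSizeMs;
    }

    public static void main(String[] args) {
        System.out.println(inSameWindow(1000L, 6000L, 5000L)); // prints true
        System.out.println(inSameWindow(1000L, 6001L, 5000L)); // prints false
    }
}
```

Note that no epoch-aligned boundary appears anywhere in the check: only the distance between the two record timestamps matters.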
Session windows are used to aggregate key-based events into so-called sessions, the process of which is referred to as sessionization. Sessions represent a period of activity separated by a defined gap of inactivity (or “idleness”). Any events processed that fall within the inactivity gap of any existing sessions are merged into the existing sessions. If an event falls outside of the session gap, then a new session will be created.
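Sessionization can be illustrated without Kafka. The sketch below groups the timestamps of a single key into sessions; it is a simplified model, not Kafka Streams' implementation. The names are hypothetical, sorting the input stands in for the merging that out-of-order events would trigger, and treating a difference exactly equal to the gap as "still the same session" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative helper (hypothetical name): groups one key's timestamps into sessions. */
final class SessionWindowSketch {

    /** Returns the sessions as {start, end} pairs, ordered by start time. */
    static List<long[]> sessions(long[] timestampsMs, long inactivityGapMs) {
        long[] sorted = timestampsMs.clone();
        Arrays.sort(sorted); // process in time order so late events merge into earlier sessions
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts - last[1] <= inactivityGapMs) {
                last[1] = ts; // within the gap: extend the current session
            } else {
                result.add(new long[] {ts, ts}); // outside the gap: start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] s : sessions(new long[] {0L, 2000L, 9000L, 10000L, 3000L}, 5000L)) {
            System.out.println("[" + s[0] + ";" + s[1] + "]");
        }
        // prints [0;3000] and then [9000;10000]
    }
}
```

The event at 3000 arrives last but still ends up in the first session, which mirrors the merging behavior described above.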
Note Hopping windows vs. sliding windows: Hopping windows are sometimes called “sliding windows” in other stream processing tools. Kafka Streams follows the terminology in academic literature, where the semantics of sliding windows are different to those of hopping windows.
Hopping time windows are aligned to the epoch, with the lower interval bound being inclusive and the upper bound being exclusive. "Aligned to the epoch" means that the first window starts at timestamp zero. For example, hopping windows with a size of 5000ms and an advance interval ("hop") of 3000ms have predictable window boundaries [0;5000),[3000;8000),... and not [1000;6000),[4000;9000),... or even something "random" like [1452;6452),[4452;9452),...
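Because hopping windows overlap, one record can belong to several of them. The following plain-Java sketch lists the start of every epoch-aligned hopping window that contains a timestamp. It is an illustration rather than Kafka Streams' own code; the names are hypothetical, and it assumes non-negative timestamps and an advance interval no larger than the window size.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical name): which hopping windows contain a timestamp? */
final class HoppingWindowSketch {

    /** Starts of all windows [start; start + sizeMs) with start a multiple of advanceMs that contain timestampMs. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Smallest multiple of advanceMs that is greater than timestampMs - sizeMs (but not below zero).
        long first = Math.max(0L, timestampMs - sizeMs + advanceMs) / advanceMs * advanceMs;
        for (long start = first; start <= timestampMs; start += advanceMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(windowStarts(7000L, 5000L, 3000L)); // prints [3000, 6000]
        System.out.println(windowStarts(8000L, 5000L, 3000L)); // prints [6000]
        System.out.println(windowStarts(7500L, 5000L, 1000L)); // prints [3000, 4000, 5000, 6000, 7000]
    }
}
```

With a 5-second window and a 1-second advance, each record falls into five windows, so a single record updates five aggregates.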
A hopping time window is defined by its window size and its advance interval (the "hop").
private static final long TIME_WINDOW_SECONDS = 5L; // window size: 5 seconds
private static final long ADVANCED_BY_SECONDS = 1L; // advance interval: 1 second

@Test
public void testHoppingTimeWindowWithSuppress() throws InterruptedException {
    Properties props = configStreamProperties();
    StreamsBuilder builder = new StreamsBuilder();
    KStream |
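The latter, suppress, means: hold back the output of the upstream and send data downstream only after the current time window has closed. As noted earlier, every change to an aggregate is otherwise sent downstream immediately. In some cases that is not what you want. Suppose the messages in Kafka record the duration of each GC pause of an application, and the stream job computes the average GC time per time window and sends it downstream (the downstream may be the console, another Kafka topic, or some alerting logic). Then a single final result, sent when the window closes, is enough. In some cases, sending results before the window has closed can even lead to incorrect behavior, such as false alarms caused by fluctuating intermediate values.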
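To make the "emit only when the window closes" behavior concrete, here is a small plain-Java simulation. It is a simplified model and not how Kafka Streams implements suppress: the names are hypothetical, windows are tumbling, stream time is taken to be the largest timestamp seen so far, and a zero grace period is assumed (a window counts as closed as soon as stream time reaches its end).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative simulation (hypothetical name) of emitting one final count per window. */
final class SuppressSketch {

    /** Counts records per tumbling window and emits each window once, after it has closed. */
    static List<String> finalCounts(long[] timestampsMs, long sizeMs) {
        TreeMap<Long, Long> open = new TreeMap<>(); // window start -> running count (held back)
        List<String> emitted = new ArrayList<>();
        long streamTime = Long.MIN_VALUE;
        for (long ts : timestampsMs) {
            streamTime = Math.max(streamTime, ts);
            long start = ts - (ts % sizeMs);
            if (start + sizeMs <= streamTime) {
                continue; // the window has already closed: the late record is dropped
            }
            open.merge(start, 1L, Long::sum); // update the count without emitting it
            // Flush every window whose end has been reached by stream time.
            while (!open.isEmpty() && open.firstKey() + sizeMs <= streamTime) {
                Map.Entry<Long, Long> closed = open.pollFirstEntry();
                emitted.add("[" + closed.getKey() + ";" + (closed.getKey() + sizeMs) + ")=" + closed.getValue());
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] input = {1000L, 2000L, 4000L, 6000L, 3000L, 11000L};
        System.out.println(finalCounts(input, 5000L)); // prints [[0;5000)=3, [5000;10000)=1]
    }
}
```

Two things stand out in the output. The first window is reported once with its final count of 3 instead of three times with 1, 2, and 3. And the window [10000;15000) is never reported, because no later record arrives to advance stream time past its end; in this model, time only moves forward when new data arrives.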
@Test
public void testHoppingTimeWindow() throws InterruptedException {
    Properties props = configStreamProperties();
    StreamsBuilder builder = new StreamsBuilder();
    KStream |
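I deliberately stressed two points above: first, every window that a record falls into is aggregated; second, each aggregation result is sent downstream immediately. The second point has already been verified. To verify it further, we add suppress to the Tumbling time window program from the beginning.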
@Test
public void testTumblingTimeWindowWithSuppress() throws InterruptedException {
    Properties props = configStreamProperties();
    StreamsBuilder builder = new StreamsBuilder();
    KStream |
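Going one step further, we verify again with hopping time windows: an arriving record immediately and separately invokes the aggregation function of every window it falls into.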
@Test
public void testHoppingTimeWindowWithSuppress() throws InterruptedException {
    Properties props = configStreamProperties();
    StreamsBuilder builder = new StreamsBuilder();
    KStream |
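Finally, let us look at the notions of time in Kafka Streams.
This topic is not specific to Kafka Streams; it also concerns Kafka's basic producer/consumer model. Because of the nature of real-time computation, however, it needs particular attention in Kafka Streams.
Kafka has the following notions of time: http://kafka.apache.org/23/documentation/streams/core-concepts#streams_time
Ingestion time versus event time: the former is the time at which the message is stored in the topic, the latter is the time at which the event occurred. Ingestion time versus processing time: the latter is the point in time at which the record is consumed by the Kafka Streams application. A record that is never consumed has an ingestion time but no processing time.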
The choice between event-time and ingestion-time is actually done through the configuration of Kafka (not Kafka Streams): From Kafka 0.10.x onwards, timestamps are automatically embedded into Kafka messages. Depending on Kafka’s configuration these timestamps represent event-time or ingestion-time. The respective Kafka configuration setting can be specified on the broker level or per topic. The default timestamp extractor in Kafka Streams will retrieve these embedded timestamps as-is. Hence, the effective time semantics of your application depend on the effective Kafka configuration for these embedded timestamps.
public class MyTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord |
@BeforeClass
public static void generateValue() {
    Properties props = new Properties();
    props.put("bootstrap.servers", BOOT_STRAP_SERVERS);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("request.required.acks", "0");
    new Thread(() -> {
        Producer |
private static final long TIME_WINDOW_SECONDS = 5L;

@Test
public void testEventTime() throws InterruptedException {
    Properties props = configStreamProperties();
    // Use the custom timestamp extractor
    props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, MyTimestampExtractor.class);
    StreamsBuilder builder = new StreamsBuilder();
    KStream |
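Remember the example in the Tumbling time windows section? Its output was 123451234512345…. Now we use a custom timestamp extractor that reads the time from the message content, and we played a small trick when sending: for all messages received within the same minute, the extracted time has its seconds set to 0, so they all fall into the first time window (the 0 s to 5 s window).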
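The idea of taking the event time from the message content can be sketched in plain Java. This is not the MyTimestampExtractor from the test and does not use the Kafka Streams TimestampExtractor interface, so that it runs without dependencies. The message format "<epochMillis>,<payload>", the fallback parameter, and all names are assumptions made for illustration only.

```java
/** Illustrative helper (hypothetical name and message format) for event-time extraction. */
final class EventTimeSketch {

    /**
     * Reads the event time from a value shaped like "<epochMillis>,<payload>".
     * Returns fallbackMs when the value does not carry a parsable timestamp.
     */
    static long extractEventTime(String value, long fallbackMs) {
        if (value == null) {
            return fallbackMs;
        }
        int comma = value.indexOf(',');
        if (comma <= 0) {
            return fallbackMs; // no timestamp prefix present
        }
        try {
            return Long.parseLong(value.substring(0, comma).trim());
        } catch (NumberFormatException e) {
            return fallbackMs; // prefix is not a number
        }
    }

    public static void main(String[] args) {
        long sizeMs = 5000L;
        // Two messages sent 3 seconds apart, both stamped with the same event time.
        long first = extractEventTime("60000,hello", 61000L);
        long second = extractEventTime("60000,world", 64000L);
        System.out.println(first - first % sizeMs);   // prints 60000
        System.out.println(second - second % sizeMs); // prints 60000
    }
}
```

Because windowing uses the extracted time rather than the arrival time, both messages land in the window that starts at 60000 even though they were sent seconds apart.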
If no custom timestamp extractor is specified, where does the time come from? Every Kafka message carries its own timestamp, which serves as the CreateTime.
producer.send(new ProducerRecord<>(TOPIC, key, value) |
public ProducerRecord(String topic, K key, V value) {
    this(topic, null, null, key, value, null);
}

/**
 * Creates a record with a specified timestamp to be sent to a specified topic and partition
 *
 * @param topic The topic the record will be appended to
 * @param partition The partition to which the record should be sent
 * @param timestamp The timestamp of the record, in milliseconds since epoch. If null, the producer will assign
 *                  the timestamp using System.currentTimeMillis().
 * @param key The key that will be included in the record
 * @param value The record contents
 * @param headers the headers that will be included in the record
 */
public ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value, Iterable |
name | desc | type | default | VALID VALUES |
message.timestamp.type | Define whether the timestamp in the message is message create time or log append time | string | CreateTime | [CreateTime, LogAppendTime] |
@Test
public void testEventTime() throws InterruptedException {
    Properties props = configStreamProperties();
    // Use the custom timestamp extractor (commented out here, so the default extractor is used)
    // props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, MyTimestampExtractor.class);
    StreamsBuilder builder = new StreamsBuilder();
    KStream |
Whenever a Kafka Streams application writes records to Kafka, then it will also assign timestamps to these new records. The way the timestamps are assigned depends on the context:
- When new output records are generated via processing some input record, for example, context.forward() triggered in the process() function call, output record timestamps are inherited from input record timestamps directly.
- When new output records are generated via periodic functions such as Punctuator#punctuate(), the output record timestamp is defined as the current internal time (obtained through context.timestamp()) of the stream task.
- For aggregations, the timestamp of a resulting aggregate update record will be that of the latest arrived input record that triggered the update.
Note that the described default behavior can be changed in the Processor API by assigning timestamps to output records explicitly when calling #forward().
Tumbling time window, Hopping time window, sliding time window, and session time window configuration. Reference: the official Kafka Streams documentation.