Spark Structured Streaming java example

场景

实时数据存储在kafka，时间顺序不一定，计算需使用到其他静态资源（rest API或数据库中）
要求按天计算，计算时有时间顺序要求，每小时计算一次，结果输出到kafka

关键点

window

参考：spark window on event time

checkpointLocation

主要用于记录一些metadata，offset和算子计算的中间结果，用于故障恢复和重启
参考：spark-checkpointing

startingOffsets

初始读取kafka的偏移量，当checkpointLocation不存在时使用，或者当算子更新checkpointLocation失效时
参考：http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

方案1:窗口计算

使用structured streaming 窗口计算，窗口长度24小时，步长24小时，设置watermark为48小时。

Dataset lines = sparkSession.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "_")
    .option("subscribe", "topic")
    .option("startingOffsets", "{\"topic\":{\"0\":_offset_}}")
    .load();

Dataset

Spark Structured Streaming java example

场景

关键点

window

checkpointLocation

startingOffsets

方案1:窗口计算

方案2 mapGroupsWithState状态流

你可能感兴趣的:(Spark Structured Streaming java example)