



Structured Streaming Programming Guide

  • Overview
  • Quick Example
  • Programming Model
    • Basic Concepts
    • Handling Event-time and Late Data
    • Fault Tolerance Semantics
  • API using Datasets and DataFrames
    • Creating streaming DataFrames and streaming Datasets
      • Input Sources
      • Schema inference and partition of streaming DataFrames/Datasets
    • Operations on streaming DataFrames/Datasets
      • Basic Operations - Selection, Projection, Aggregation
      • Window Operations on Event Time
      • Handling Late Data and Watermarking
      • Join Operations
      • Streaming Deduplication
      • Arbitrary Stateful Operations
      • Unsupported Operations
    • Starting Streaming Queries
      • Output Modes
      • Output Sinks
      • Using Foreach
    • Managing Streaming Queries
    • Monitoring Streaming Queries
      • Interactive APIs
      • Asynchronous API
    • Recovering from Failures with Checkpointing
  • Where to go from here

Quick Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
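words = lines.select(
    # explode() turns each element of the split array into its own row
    explode(split(lines.value, " ")).alias("word")
)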

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

# Block until the query is stopped or fails
query.awaitTermination()


Creating streaming DataFrames and streaming Datasets

spark = SparkSession. ...

# Read text from socket
socketDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

socketDF.isStreaming    # Property (not a method); True for DataFrames that have streaming sources


# Read all the csv files written atomically in a directory
from pyspark.sql.types import StructType

userSchema = StructType().add("name", "string").add("age", "integer")
csvDF = spark \
    .readStream \
    .option("sep", ";") \
    .schema(userSchema) \
    .csv("/path/to/directory")  # Equivalent to format("csv").load("/path/to/directory")

Basic Operations - Selection, Projection, Aggregation

df = ...  # streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: DateType }

# Select the devices which have signal more than 10
df.select("device").where("signal > 10")

# Running count of the number of updates for each device type
df.groupBy("deviceType").count()

Window Operations on Event Time

from pyspark.sql.functions import window

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),  # 10-minute windows, sliding every 5 minutes
    words.word
).count()
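
As a rough illustration of how a 10-minute window sliding every 5 minutes maps one event to several overlapping windows, the plain-Python sketch below lists the windows that contain a given timestamp. It is not Spark code: the helper name sliding_windows and the alignment of window starts to the Unix epoch are assumptions made for this example.

```python
from datetime import datetime, timedelta


def sliding_windows(event_time,
                    window=timedelta(minutes=10),
                    slide=timedelta(minutes=5)):
    """Return the [start, end) windows that contain event_time.

    Assumption of this sketch: window starts are aligned to the Unix epoch.
    """
    epoch = datetime(1970, 1, 1)
    # Latest window start that is <= event_time
    start = event_time - ((event_time - epoch) % slide)
    windows = []
    # Walk backwards while the window still covers event_time
    while start > event_time - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


# An event at 12:07 belongs to two overlapping windows
for start, end in sliding_windows(datetime(2024, 1, 1, 12, 7)):
    print(start.time(), end.time())
# prints:
# 12:00:00 12:10:00
# 12:05:00 12:15:00
```

Each event therefore contributes to window / slide groups (two here), which is why the same word can be counted in more than one row of windowedCounts.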

Handling Late Data and Watermarking

from pyspark.sql.functions import window

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word) \
    .count()
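
The idea behind a 10-minute watermark can be sketched in plain Python: track the maximum event time seen so far and treat anything older than that maximum minus the delay as too late. This is a deliberately simplified model, not Spark's implementation. Spark advances the watermark once per trigger and applies it to window state, whereas this sketch updates after every event and tests the event time directly. The function name filter_late is hypothetical.

```python
from datetime import datetime, timedelta


def filter_late(event_times, delay=timedelta(minutes=10)):
    """Split events (in arrival order) into (kept, dropped) using a watermark.

    Simplification: the watermark is max event time seen so far minus delay,
    updated after every event rather than once per trigger.
    """
    max_event_time = None
    kept, dropped = [], []
    for t in event_times:
        if max_event_time is not None and t < max_event_time - delay:
            dropped.append(t)  # older than the watermark: too late
            continue
        kept.append(t)
        if max_event_time is None or t > max_event_time:
            max_event_time = t
    return kept, dropped


def at(hour, minute):
    # Helper for readable test data (hypothetical date)
    return datetime(2024, 1, 1, hour, minute)


arrivals = [at(12, 14), at(12, 9), at(12, 21), at(12, 4), at(12, 15)]
kept, dropped = filter_late(arrivals)
print([t.strftime("%H:%M") for t in dropped])  # prints ['12:04']
```

The event at 12:09 is out of order but still within the delay, so it is kept. The event at 12:04 arrives after 12:21 has been seen, when the watermark is 12:11, so it is dropped.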

Join Operations

staticDf = ...
streamingDf = spark.readStream. ...
streamingDf.join(staticDf, "type")  # inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_outer")  # right outer join with a static DF

Streaming Deduplication

streamingDf = spark.readStream. ...

# Without watermark using guid column
streamingDf.dropDuplicates(["guid"])

# With watermark using guid and eventTime columns
streamingDf \
  .withWatermark("eventTime", "10 seconds") \
  .dropDuplicates(["guid", "eventTime"])
