Structured Streaming Programming Guide (Spark 3.3.0)

Reference (official documentation): https://wiki.huawei.com/domains/1185/wiki/8/WIKI2021121601648

Overview

By default, Structured Streaming queries are executed with a micro-batch processing engine, which processes the stream as a series of small batch jobs and achieves end-to-end latencies as low as 100 ms. If lower latency is required, Continuous Processing can be used instead, bringing end-to-end latency down to about 1 ms.

Example

1. Create a SparkSession:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
```
2. Create a streaming DataFrame that receives data from localhost port 9999, then compute the running word counts:

```python
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
   explode(
       split(lines.value, " ")
   ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()
```

Note that at this point we have only set up the query; no data is actually being received or processed yet.
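Conceptually, the running word count that this query maintains behaves like the following plain-Python sketch (illustrative only, not Spark API): each arriving batch of lines is split into words and folded into a cumulative count.

```python
from collections import Counter

# Cumulative state, analogous to wordCounts = words.groupBy("word").count() above.
running_counts = Counter()

def process_batch(lines):
    """Split each line into words and fold them into the running counts."""
    for line in lines:
        running_counts.update(line.split(" "))
    return dict(running_counts)

# Two "batches" arriving from the socket:
print(process_batch(["apache spark", "apache hadoop"]))
# → {'apache': 2, 'spark': 1, 'hadoop': 1}
print(process_batch(["spark streaming"]))
# → {'apache': 2, 'spark': 2, 'hadoop': 1, 'streaming': 1}
```

Each batch updates the full result table, which is why the complete output mode below prints all counts, not just the new ones.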
3. Start the query to begin receiving data and computing:

```python
# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
# The "complete" output mode prints the entire updated counts table to the console after every update
query.awaitTermination()
```
4. Run the test. Start Netcat in one terminal, then submit the example from a second terminal:

```shell
$ nc -lk 9999
```

```shell
$ ./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999
```

Lines typed into the Netcat terminal are then counted, and the running word counts are printed to the console.

