Structured Streaming Word Count Model

Environment

Everything in this walkthrough runs on a single machine; the main Spark configuration is listed below.

Spark
Version: spark-2.4.4-bin-hadoop2.7.tgz (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz)

spark-env.sh

SPARK_LOCAL_IP=192.168.33.50
SPARK_MASTER_HOST=192.168.33.50
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1G

Start the services

$ ./sbin/start-all.sh
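
To confirm both daemons started, jps should list a Master and a Worker process; the standalone master also serves a web UI, on port 8080 by default (assuming the port has not been changed):

$ jps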

Start the socket server

$ nc -lk 9999  

# input (lines entered at different times)
apache
apache hadd
apache spa
wow
apache

Streaming job 1: aggregate and display

Split each input line on spaces, set a 1-minute watermark so that records arriving more than one minute behind the latest observed event time are dropped, and aggregate over a 1-minute window, i.e. all words that fall within the same minute are counted together. The query triggers every 10 seconds and emits the complete result table, so a frontend can always render the full, up-to-date aggregation (an alternative sink for that use case is sketched at the end of this job). For example, a word stamped 12:00:30 is counted in the window [12:00, 12:01).

sp_test5_4.py

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession.builder.master(
    "spark://192.168.33.50:7077"
).getOrCreate()

stream_data = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option('includeTimestamp', 'true') \
    .load()
stream_data.printSchema()

# split each line into words, one word per row
words = stream_data.select(
    explode(
        split(stream_data.value, " ")
    ).alias("zhexiao"),
    stream_data.timestamp
)
words.printSchema()

# tumbling 1-minute window (slide equals duration) keyed by word;
# the 1-minute watermark drops events that arrive too late
window_words = words.withWatermark(
    "timestamp", "1 minutes"
).groupBy(
    window(
        words.timestamp,
        '1 minutes',
        '1 minutes'
    ),
    words.zhexiao
).count().orderBy('window')  # sorting the full table is allowed in complete output mode

# Start running the query that prints the running counts to the console
query = window_words \
    .writeStream \
    .trigger(processingTime='10 seconds') \
    .outputMode("complete") \
    .format("console") \
    .option('truncate', 'false') \
    .start()

query.awaitTermination()

Run

$ ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 /test/sp_test5_4.py
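
As an aside, the --packages flag above pulls in the Kafka connector, which the socket source does not actually need. With the connector on the classpath, the same job could read from Kafka instead of a socket; a minimal sketch of the swapped-in source, where the broker address and the topic name "wordcount" are assumptions:

stream_data = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.33.50:9092") \
    .option("subscribe", "wordcount") \
    .load() \
    .selectExpr("CAST(value AS STRING) AS value", "timestamp")

The Kafka source attaches a timestamp column to every record, so the rest of the pipeline (split, window, watermark) works unchanged.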

Output

[Screenshot: console output of the complete windowed word-count table]
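
Since the goal of this job is frontend display, the console sink could also be swapped for Spark's built-in memory sink, which keeps the complete result table queryable from the same SparkSession. A minimal sketch, where the query name "wordcounts" is an assumption:

query = window_words \
    .writeStream \
    .trigger(processingTime='10 seconds') \
    .outputMode("complete") \
    .format("memory") \
    .queryName("wordcounts") \
    .start()

# each poll sees the latest complete result table
spark.sql("SELECT * FROM wordcounts ORDER BY window").show(truncate=False)

The memory sink is meant for debugging and small result tables; a production frontend would more likely read from a store populated via foreachBatch, as sketched in job 2 below.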

Streaming job 2: aggregate and persist

As in job 1, split the input on spaces, apply a 1-minute watermark, and aggregate over a 1-minute window. This time the query triggers once per minute and uses update output mode, so it emits only the rows whose counts changed during that minute; these per-minute aggregates can be written straight from the stream into a database (a foreachBatch sketch is given at the end of this job).

sp_test5_5.py

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession.builder.master(
    "spark://192.168.33.50:7077"
).getOrCreate()

stream_data = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option('includeTimestamp', 'true') \
    .load()
stream_data.printSchema()

# split each line into words, one word per row
words = stream_data.select(
    explode(
        split(stream_data.value, " ")
    ).alias("zhexiao"),
    stream_data.timestamp
)
words.printSchema()

# same tumbling 1-minute window; no orderBy here, since update mode
# emits only changed rows rather than the full sorted table
window_words = words.withWatermark(
    "timestamp", "1 minutes"
).groupBy(
    window(
        words.timestamp,
        '1 minutes',
        '1 minutes'
    ),
    words.zhexiao
).count()

# Start running the query that prints the running counts to the console
query = window_words \
    .writeStream \
    .trigger(processingTime='1 minutes') \
    .outputMode("update") \
    .format("console") \
    .option('truncate', 'false') \
    .start()

query.awaitTermination()

Run

$ ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 /test/sp_test5_5.py

Input
a

pache
apache wow
apache spark
apace
apache
apache wow
apache spark

Output

[Screenshot: console output for the first batch of inputs]

Input

wow
apache
apache spark

Output

[Screenshot: console output for the second batch of inputs]
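
To actually persist each minute's aggregates rather than printing them, the console sink can be replaced with foreachBatch (available since Spark 2.4), which hands every triggered micro-batch to an ordinary batch writer. A minimal sketch using the JDBC writer, where the connection URL, credentials, and table name are assumptions and the matching JDBC driver jar must be on the classpath:

def write_to_db(batch_df, batch_id):
    # append this trigger's updated per-minute counts to a relational table
    batch_df.write \
        .format("jdbc") \
        .option("url", "jdbc:mysql://192.168.33.50:3306/streaming") \
        .option("dbtable", "word_counts") \
        .option("user", "root") \
        .option("password", "secret") \
        .mode("append") \
        .save()

query = window_words \
    .writeStream \
    .trigger(processingTime='1 minutes') \
    .outputMode("update") \
    .foreachBatch(write_to_db) \
    .start()

query.awaitTermination()

Because update mode re-emits a row whenever its count changes, plain appends can leave multiple rows per (window, word); an upsert inside write_to_db would keep the table consistent.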
