Everything in this walkthrough runs in a single-machine environment; the main Spark and Kafka configuration is listed below.
Spark
Version: spark-2.4.4-bin-hadoop2.7.tgz (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz)
spark-env.sh
SPARK_LOCAL_IP=192.168.33.50
SPARK_MASTER_HOST=192.168.33.50
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1G
Start the services
$ ./sbin/start-all.sh
Open the socket
$ nc -lk 9999
# Input (entered at different times)
apache
apache hadd
apache spa
wow
apache
Split the input on spaces, set a watermark so that data older than 1 minute expires, and use a 1-minute window, i.e. aggregate the data that falls into the same 1-minute window. The query is triggered every 10 seconds and prints the complete result table, so the full aggregated data is available in near real time and can be displayed on a front end. (Note that in complete output mode Spark retains all window state, so the watermark does not actually purge old aggregates here.)
sp_test5_4.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window
spark = SparkSession.builder.master(
"spark://192.168.33.50:7077"
).getOrCreate()
stream_data = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.option('includeTimestamp', 'true') \
.load()
stream_data.printSchema()
# Split each line on spaces and explode the words into rows
words = stream_data.select(
explode(
split(stream_data.value, " ")
).alias("zhexiao"),
stream_data.timestamp
)
words.printSchema()
window_words = words.withWatermark(
"timestamp", "1 minutes"
).groupBy(
window(
words.timestamp,
'1 minutes',
'1 minutes'
),
words.zhexiao
).count().orderBy('window')
# Start running the query that prints the running counts to the console
query = window_words \
.writeStream \
.trigger(processingTime='10 seconds') \
.outputMode("complete") \
.format("console") \
.option('truncate', 'false') \
.start()
query.awaitTermination()
Run (the Kafka connector passed via --packages is not required for the socket source, but it is harmless)
$ ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 /test/sp_test5_4.py
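As a quick sanity check of the tumbling-window semantics used above (window(timestamp, '1 minutes', '1 minutes')), the same aggregation can be run on a small static DataFrame. This sketch is not part of the original walkthrough; the timestamps and words in it are made up purely for illustration.
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample events; the timestamp decides which window each row lands in
rows = [
    (datetime(2019, 10, 1, 10, 0, 5), "apache"),
    (datetime(2019, 10, 1, 10, 0, 40), "apache"),
    (datetime(2019, 10, 1, 10, 1, 20), "spark"),
]
df = spark.createDataFrame(rows, ["timestamp", "zhexiao"])

# The first two rows fall into [10:00, 10:01), the last one into [10:01, 10:02)
df.groupBy(
    window(df.timestamp, "1 minutes", "1 minutes"),
    df.zhexiao
).count().orderBy("window").show(truncate=False)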
Split the input on spaces, set a watermark so that data older than 1 minute expires, and use the same 1-minute window. This time the query is triggered every 1 minute in update mode and outputs only the aggregates that changed in that minute, which makes it easy to load each minute's aggregated data straight from the stream into a database (a sketch of such a sink is shown after the run command below).
sp_test5_5.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window
spark = SparkSession.builder.master(
"spark://192.168.33.50:7077"
).getOrCreate()
stream_data = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.option('includeTimestamp', 'true') \
.load()
stream_data.printSchema()
# Split each line on spaces and explode the words into rows
words = stream_data.select(
explode(
split(stream_data.value, " ")
).alias("zhexiao"),
stream_data.timestamp
)
words.printSchema()
window_words = words.withWatermark(
"timestamp", "1 minutes"
).groupBy(
window(
words.timestamp,
'1 minutes',
'1 minutes'
),
words.zhexiao
).count()
# Start running the query that prints the running counts to the console
query = window_words \
.writeStream \
.trigger(processingTime='1 minutes') \
.outputMode("update") \
.format("console") \
.option('truncate', 'false') \
.start()
query.awaitTermination()
Run
$ ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 /test/sp_test5_5.py
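For the database-loading effect mentioned above, the console sink in sp_test5_5.py can be swapped for foreachBatch (available since Spark 2.4). The sketch below is only an illustration under assumptions: the JDBC URL, table name word_counts, and credentials are placeholders, and the corresponding JDBC driver jar would have to be supplied to spark-submit.
# Minimal sketch, not from the original setup: write each micro-batch over JDBC
def write_to_db(batch_df, batch_id):
    # The window column is a struct, so flatten it before writing over JDBC
    (batch_df
        .selectExpr("window.start AS window_start",
                    "window.end AS window_end",
                    "zhexiao AS word",
                    "`count` AS cnt")
        .write
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/test")   # placeholder URL
        .option("dbtable", "word_counts")                    # placeholder table
        .option("user", "root")                              # placeholder credentials
        .option("password", "password")
        .mode("append")
        .save())

query = window_words \
    .writeStream \
    .trigger(processingTime='1 minutes') \
    .outputMode("update") \
    .foreachBatch(write_to_db) \
    .start()
query.awaitTermination()
Since update mode can emit the same window again when late data arrives, plain appends may produce duplicate rows per (window, word); in practice an upsert keyed on window_start and word would be safer.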
Input
a
pache
apache wow
apache spark
apace
apache
apache wow
apache spark
Input
wow
apache
apache spark