Spark Streaming: RDD Queue Stream

Use streamingContext.queueStream(queueOfRDD) to create a DStream backed by a queue of RDDs: an RDD is created and pushed into the queue every 1 second, and the DStream is processed in 2-second batches.
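For reference, the PySpark signature is queueStream(rdds, oneAtATime=True, default=None); the extra parameters are not used in this walkthrough. With the default oneAtATime=True, each batch interval dequeues exactly one RDD; with oneAtATime=False, all RDDs currently in the queue are unioned and processed in a single batch. A minimal sketch of the two modes (variable names are illustrative):

# Default: each 2-second batch dequeues exactly one RDD
streamOne = ssc.queueStream(rddQueue)
# oneAtATime=False: all RDDs currently in the queue are unioned into one batch
streamAll = ssc.queueStream(rddQueue, oneAtATime=False)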
cd ....
vim RDDQueueStream.py

#!/usr/bin/env python3
import time

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingQueueStream")
    ssc = StreamingContext(sc, 2)  # 2-second batch interval

    # Build a queue of 5 RDDs, pushing one into the queue every second
    rddQueue = []
    for i in range(5):
        # 10 partitions; each RDD holds the 1000 integers 1..1000
        rddQueue += [ssc.sparkContext.parallelize([j for j in range(1, 1001)], 10)]
        time.sleep(1)

    # Turn the queue into a DStream and count elements by their last digit
    inputStream = ssc.queueStream(rddQueue)
    mappedStream = inputStream.map(lambda x: (x % 10, 1))
    reducedStream = mappedStream.reduceByKey(lambda a, b: a + b)
    reducedStream.pprint()

    ssc.start()
    time.sleep(6)  # give the queued batches time to drain before stopping
    ssc.stop(stopSparkContext=True, stopGraceFully=True)
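Since each RDD contains the integers 1 through 1000, x % 10 produces every residue 0 through 9 exactly 100 times, so each batch should reduce to the ten pairs (0, 100) through (9, 100) (key order is not guaranteed).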

Then switch to the code directory and run it:

/usr/local/spark/bin/spark-submit RDDQueueStream.py
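With no --master flag, spark-submit runs in local mode, which is enough here since queueStream does not use a receiver. If everything works, each 2-second batch printed by pprint() should look roughly like this (the timestamp is illustrative, and key order may vary):

-------------------------------------------
Time: 2024-01-01 00:00:02
-------------------------------------------
(0, 100)
(1, 100)
...
(9, 100)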
