Let's explore, from the source code's perspective, how Spark Streaming implements window computations.
Spark Streaming's window operations are implemented by the class WindowedDStream.
Looking at the constructor, it takes a parent DStream, a windowDuration (the window size), and a slideDuration (the sliding interval):
private[streaming]
class WindowedDStream[T: ClassManifest](
    parent: DStream[T],
    _windowDuration: Duration,
    _slideDuration: Duration)
  extends DStream[T](parent.ssc) {
  if (!_windowDuration.isMultipleOf(parent.slideDuration))
    throw new Exception("The window duration of WindowedDStream (" + _windowDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")

  if (!_slideDuration.isMultipleOf(parent.slideDuration))
    throw new Exception("The slide duration of WindowedDStream (" + _slideDuration + ") " +
      "must be multiple of the slide duration of parent DStream (" + parent.slideDuration + ")")
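The validation above only requires that both durations divide evenly by the parent's batch interval. A minimal sketch of that check (assumption: Duration is modeled here as a plain millisecond count; this is not Spark's actual Duration class):

```scala
// Duration modeled as milliseconds (assumption; not Spark's actual class).
case class Duration(ms: Long) {
  // A duration is a multiple of another when dividing leaves no remainder.
  def isMultipleOf(that: Duration): Boolean = that.ms > 0 && ms % that.ms == 0
}

// With a parent batch interval of 2s, a 6s window is valid but a 5s one is not.
val parentSlide = Duration(2000)
assert(Duration(6000).isMultipleOf(parentSlide))
assert(!Duration(5000).isMultipleOf(parentSlide))
```

This is why a window must be sized in whole batches: the window is assembled by unioning the parent's batch RDDs, so a fractional batch cannot be represented.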
Next, the storage level and the dependency on the parent DStream are set:
  parent.persist(StorageLevel.MEMORY_ONLY_SER)  // cache the parent's RDDs, serialized in memory

  def windowDuration: Duration = _windowDuration           // accessor for the window size

  override def dependencies = List(parent)                 // clearly a one-to-one dependency on the parent

  override def slideDuration: Duration = _slideDuration    // accessor for the slide interval

  override def parentRememberDuration: Duration = rememberDuration + windowDuration
  override def compute(validTime: Time): Option[RDD[T]] = {
    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
    Some(new UnionRDD(ssc.sc, parent.slice(currentWindow)))
  }
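The interval arithmetic in compute() can be sketched on its own (assumption: times as plain milliseconds; Interval here is a simplified stand-in, not Spark's actual class):

```scala
// Simplified stand-in for Spark's Interval (assumption; not the real class).
case class Interval(beginMs: Long, endMs: Long)

// The window ending at validTime starts one windowDuration back, shifted
// forward by one parent batch so exactly windowDuration of batches is covered.
def currentWindow(validTimeMs: Long, windowMs: Long, parentSlideMs: Long): Interval =
  Interval(validTimeMs - windowMs + parentSlideMs, validTimeMs)

// window = 6s, parent batch = 2s, validTime = 10s:
// the window spans the parent batches ending at 6s, 8s and 10s.
assert(currentWindow(10000L, 6000L, 2000L) == Interval(6000L, 10000L))
```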
That is, the parent DStream's RDDs falling within the interval determined by the two duration parameters are combined with a union operation, producing one window's worth of data.
Summary:
1. Validate the parameters and set up the dependency on the parent DStream.
2. Slice the parent DStream's RDDs according to the input durations, then union one window's worth of them via UnionRDD.
3. The result is the final WindowedDStream.
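The three steps above can be simulated end-to-end without Spark. A minimal sketch (assumption: each batch is modeled as a plain Seq keyed by its batch-end time, and "union" is simply flattening; none of these are the real Spark APIs):

```scala
// Parent stream's batches, keyed by batch-end time (parent batch = 2s).
val batches: Map[Long, Seq[Int]] = Map(
  2000L -> Seq(1), 4000L -> Seq(2), 6000L -> Seq(3),
  8000L -> Seq(4), 10000L -> Seq(5))

// Stand-in for parent.slice(interval): all batches whose end time falls
// inside the interval, in time order.
def slice(fromMs: Long, toMs: Long): Seq[Seq[Int]] =
  batches.toSeq.sortBy(_._1).collect { case (t, rdd) if t >= fromMs && t <= toMs => rdd }

// Stand-in for compute(): slice one window's worth of batches, then union them.
def windowAt(validTimeMs: Long, windowMs: Long, parentSlideMs: Long): Seq[Int] =
  slice(validTimeMs - windowMs + parentSlideMs, validTimeMs).flatten

// A 6s window at validTime 10s unions the batches ending at 6s, 8s and 10s.
assert(windowAt(10000L, 6000L, 2000L) == Seq(3, 4, 5))
```

This mirrors the design choice in WindowedDStream: no data is copied or recomputed per window; each window is just a cheap logical union over the already-materialized (and persisted) parent batch RDDs.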
Original post; when reposting, please cite the source: http://blog.csdn.net/oopsoom/article/details/23777843