[Spark Streaming] Window Operations

Window Operations

Spark Streaming also provides windowed computations, which let you apply transformations over a sliding window of data. The figure below illustrates such a sliding window.

As the figure shows, the window slides over the source DStream; the RDDs that fall inside the window are combined and operated on, producing the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data and slides by 2 time units. This means that every window operation needs two parameters:

  1. window length: the duration of the window (3 time units in the figure)
  2. sliding interval: the interval at which the window operation is performed (2 time units in the figure)

Both of these parameters must be multiples of the batch interval of the source DStream.

Let's illustrate the window operations with an example. Say you want to extend the earlier WordCount example to count the words in the last 30 seconds of data, every 10 seconds, with data received in 10-second batches. To do this, we apply the reduceByKey operation to the (word, 1) pairs of the DStream over the last 30 seconds of data. This is done with the reduceByKeyAndWindow operation, as sketched below. Several common window operations are covered in the rest of this post; all of them take the two parameters above, the window length and the sliding interval.
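A minimal sketch of that 30-second / 10-second word count (the object name, socket source, and variable names here are illustrative assumptions, not the original WordCount code):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch of a windowed word count: words over the last 30 seconds, computed every 10 seconds.
    object WindowedWordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WindowedWordCountSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))   // data is received in 10-second batches

        val lines = ssc.socketTextStream("master", 9999)
        val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

        // reduce the (word, 1) pairs over the last 30 seconds of data, every 10 seconds
        val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
        windowedWordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because no inverse reduce function is used here, no checkpoint directory is required; the inverse-function variants introduced below do need one.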

 

------------------------- Experimental data ----------------------------------------------------------------------

spark
Streaming
better
than
storm
you
need
it
yes
do
it

 

(Each second, one word is randomly drawn from the list above and sent as the input on the socket side.) The socket data simulator, the test driver programs, and the other supporting code are available via the Baidu Cloud link in the appendix; a minimal sketch of such a simulator follows.
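The sketch below is only an illustration of what the simulator could look like (the actual one is in the appendix link and may differ); it serves one random word per second to whatever connects on port 9999:

    import java.io.PrintWriter
    import java.net.ServerSocket
    import scala.util.Random

    // Stand-in for the socket data simulator: listens on port 9999 and, once the
    // Spark Streaming receiver connects, writes one random word per second.
    object SocketWordSimulator {
      def main(args: Array[String]): Unit = {
        val words = Seq("spark", "Streaming", "better", "than", "storm",
                        "you", "need", "it", "yes", "do", "it")
        val server = new ServerSocket(9999)
        val socket = server.accept()                              // wait for the receiver to connect
        val out = new PrintWriter(socket.getOutputStream, true)   // autoflush so each word is sent immediately
        while (true) {
          out.println(words(Random.nextInt(words.size)))
          Thread.sleep(1000)
        }
      }
    }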

----------------------------------------------- window operation -------------------------------------------------------------------------

 

 
    // Input: the window length (the slide duration implicitly defaults to this DStream's own slide
    // duration, i.e. the interval at which the DStream is generated)
    // Output: a new DStream containing all the elements that fall inside the sliding window
    def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)

    // Input: the window length and the slide duration
    // Output: a new DStream containing all the elements that fall inside the sliding window
    def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
      new WindowedDStream(this, windowDuration, slideDuration)
    }

 

 

 
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    object windowOnStreaming {
      def main(args: Array[String]) {
        /**
         * this is a test of the Streaming operation ----- window
         */
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the Window operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))

        // set the checkpoint directory
        ssc.checkpoint("/Res")

        // get the socket streaming data
        val socketStreaming = ssc.socketTextStream("master", 9999)

        val data = socketStreaming.map(x => (x, 1))
        // def window(windowDuration: Duration): DStream[T]
        val getedData1 = data.window(Seconds(6))
        println("windowDuration only : ")
        getedData1.print()
        // same as
        // def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
        // val getedData2 = data.window(Seconds(9), Seconds(3))
        // println("Duration and SlideDuration : ")
        // getedData2.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

 

 


 

 

 

-------------------- reduceByKeyAndWindow operation --------------------------------

 
    /** Return a DStream by applying reduceByKey over each sliding window. This is similar to
     * `DStream.reduceByKey()`, except that the function is applied only to the data inside the
     * window that slides in. Hash partitioning with the Spark cluster's default partitioner is used.
     * @param reduceFunc associative reduce function, applied from left to right
     * @param windowDuration window duration
     * The slide duration defaults to one batch interval, and the number of partitions is the
     * RDD default (depends on the cluster's cores).
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        windowDuration: Duration
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
    }

    /** Return a DStream by applying reduceByKey over each sliding window. This is similar to
     * `DStream.reduceByKey()`, except that the function is applied only to the data inside the
     * window that slides in. Hash partitioning with the Spark cluster's default partitioner is used.
     * @param reduceFunc associative reduce function, applied from left to right
     * @param windowDuration window duration
     * @param slideDuration slide duration
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
    }

    /** Return a DStream by applying reduceByKey over each sliding window. This is similar to
     * `DStream.reduceByKey()`, except that the function is applied only to the data inside the
     * window that slides in. Hash partitioning is used.
     * @param reduceFunc associative reduce function, applied from left to right
     * @param windowDuration window duration
     * @param slideDuration slide duration
     * @param numPartitions number of partitions of each RDD in the new DStream
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration,
        numPartitions: Int
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
        defaultPartitioner(numPartitions))
    }

    /** Return a DStream by applying reduceByKey over each sliding window. This is similar to
     * `DStream.reduceByKey()`, except that the function is applied only to the data inside the
     * window that slides in.
     * @param reduceFunc associative reduce function, applied from left to right
     * @param windowDuration window duration
     * @param slideDuration slide duration
     * @param partitioner partitioner used to control the partitioning of each RDD in the new DStream
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration,
        partitioner: Partitioner
      ): DStream[(K, V)] = ssc.withScope {
      self.reduceByKey(reduceFunc, partitioner)
          .window(windowDuration, slideDuration)
          .reduceByKey(reduceFunc, partitioner)
    }

    /** Return a DStream by applying reduceByKey over each sliding window, while also applying
     * invReduceFunc to the old RDDs that leave the window. Hash partitioning with the Spark
     * cluster's default partitioner is used.
     * @param reduceFunc associative reduce function, applied from left to right
     * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
     *                      `invReduceFunc(reduceFunc(x, y), x) = y`
     * @param windowDuration window duration
     * @param slideDuration slide duration
     * @param filterFunc optional function used to filter out key-value pairs that meet a condition
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        invReduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration = self.slideDuration,
        numPartitions: Int = ssc.sc.defaultParallelism,
        filterFunc: ((K, V)) => Boolean = null
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(
        reduceFunc, invReduceFunc, windowDuration,
        slideDuration, defaultPartitioner(numPartitions), filterFunc
      )
    }

    /** Return a DStream by applying reduceByKey over each sliding window, while also applying
     * invReduceFunc to the old RDDs that leave the window.
     * @param reduceFunc associative reduce function, applied from left to right
     * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
     *                      `invReduceFunc(reduceFunc(x, y), x) = y`
     * @param windowDuration window duration
     * @param slideDuration slide duration
     * @param partitioner partitioner used to control the partitioning of each RDD in the new DStream
     * @param filterFunc optional function used to filter out key-value pairs that meet a condition
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        invReduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration,
        partitioner: Partitioner,
        filterFunc: ((K, V)) => Boolean
      ): DStream[(K, V)] = ssc.withScope {

      val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
      val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
      val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
      new ReducedWindowedDStream[K, V](
        self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
        windowDuration, slideDuration, partitioner
      )
    }

 

 

 
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    object reduceByWindowOnStreaming {

      def main(args: Array[String]) {
        /**
         * this is a test of the Streaming operation ----- reduceByKeyAndWindow
         */
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the reduceByKeyAndWindow operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))

        // set the checkpoint directory (required when an inverse reduce function is used)
        ssc.checkpoint("/Res")

        // get the socket streaming data
        val socketStreaming = ssc.socketTextStream("master", 9999)

        val data = socketStreaming.map(x => (x, 1))
        // def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
        // val getedData1 = data.reduceByKeyAndWindow(_ + _, Seconds(6))

        // caution: (a, b) => a + b * 0 simply returns a, so it does NOT invert _ + _;
        // counts produced this way never shrink when old batches slide out of the window
        val getedData2 = data.reduceByKeyAndWindow(_ + _,
          (a, b) => a + b * 0,
          Seconds(6), Seconds(2))

        // the proper inverse of _ + _ is _ - _; window and slide durations
        // must both be multiples of the 2-second batch interval
        val getedData1 = data.reduceByKeyAndWindow(_ + _, _ - _, Seconds(8), Seconds(6))

        println("reduceByKeyAndWindow : ")
        getedData1.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

 

 


The invReduceFunc that shows up here is a little special, and it is easy to get wrong. Internally, the source code builds a ReducedWindowedDStream, which uses invReduceFunc to update the window incrementally: the batches that slide into the window are combined with reduceFunc, and the batches that slide out are removed with invReduceFunc. The sketch below illustrates this.
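The following is a simplified, self-contained illustration (plain Scala arithmetic, not the actual ReducedWindowedDStream implementation; the object name is invented) of that incremental update for a single key, and of why a second function that does not truly invert reduceFunc gives wrong results:

    // Batch-by-batch illustration of the update ReducedWindowedDStream relies on for one key.
    object InvReduceIllustration {
      def main(args: Array[String]): Unit = {
        val reduceFunc: (Int, Int) => Int    = _ + _   // applied to batches sliding INTO the window
        val invReduceFunc: (Int, Int) => Int = _ - _   // applied to batches sliding OUT of the window

        // per-batch counts for one key; window = 3 batches, slide = 1 batch
        val batchCounts = Seq(2, 5, 1, 4, 3)

        var windowValue = batchCounts.take(3).reduce(reduceFunc)          // 2 + 5 + 1 = 8
        println(s"window over batches 0..2 = $windowValue")

        // slide by one batch: add the new batch, "subtract" the oldest one
        windowValue = invReduceFunc(reduceFunc(windowValue, batchCounts(3)), batchCounts(0))  // (8 + 4) - 2 = 10
        println(s"window over batches 1..3 = $windowValue")

        // if invReduceFunc does not really undo reduceFunc (e.g. (a, b) => a + b * 0, which just
        // returns a), the old batches are never removed and the window value grows incorrectly
      }
    }

For counting, `_ + _` paired with `_ - _` satisfies the contract `invReduceFunc(reduceFunc(x, y), x) = y` quoted in the source comments above.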

 

------------------ reduceByWindow operation ---------------------------

 
    // Input: reduceFunc, the window length and the slide duration
    // reduceFunc takes two elements (a, b) at a time, from left to right,
    // and reduces them to a single value over the whole window
    def reduceByWindow(
        reduceFunc: (T, T) => T,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[T] = ssc.withScope {
      this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
    }

    /**
     * Input: reduceFunc, invReduceFunc, the window length and the slide duration
     */
    def reduceByWindow(
        reduceFunc: (T, T) => T,
        invReduceFunc: (T, T) => T,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[T] = ssc.withScope {
      this.map((1, _))
          .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
          .map(_._2)
    }

 

 

 
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    /**
     * Created by root on 6/23/16.
     */
    object reduceByWindow {
      def main(args: Array[String]) {
        /**
         * this is a test of the Streaming operation ----- reduceByWindow
         */
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the reduceByWindow operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))
        // set the checkpoint directory
        ssc.checkpoint("/Res")

        // get the socket streaming data
        val socketStreaming = ssc.socketTextStream("master", 9999)

        // on a DStream[String], _ + _ concatenates all the lines inside the window
        val data = socketStreaming.reduceByWindow(_ + _, Seconds(6), Seconds(2))
        // caution: string concatenation has no valid inverse, so passing _ + _ again as the
        // inverse function, as below, re-adds old data instead of removing it
        // val data = socketStreaming.reduceByWindow(_ + _, _ + _, Seconds(6), Seconds(2))

        println("reduceByWindow: concatenate the elements in the window")
        data.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

 

 


 

 

----------------------------------------------- countByWindow operation ---------------------------------

 

 
    /**
     * Input: the window length and the slide duration; returns the number of elements inside the window.
     * @param windowDuration window length
     * @param slideDuration slide duration
     */
    def countByWindow(
        windowDuration: Duration,
        slideDuration: Duration): DStream[Long] = ssc.withScope {
      // map every element of the DStream to 1L, then apply reduceByWindow over the window
      this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
    }

 

 

 

 
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    /**
     * Created by root on 6/23/16.
     */
    object countByWindow {
      def main(args: Array[String]) {
        /**
         * this is a test of the Streaming operation ----- countByWindow
         */
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the countByWindow operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))
        // set the checkpoint directory (countByWindow uses an inverse reduce function internally)
        ssc.checkpoint("/Res")

        // get the socket streaming data
        val socketStreaming = ssc.socketTextStream("master", 9999)

        val data = socketStreaming.countByWindow(Seconds(6), Seconds(2))

        println("countByWindow: count the number of elements")
        data.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

 

 

-------------------------------- countByValueAndWindow-------------

 

 

 
    /**
     * Input: the window length, the slide duration and the number of RDD partitions
     * (by default equal to the default parallelism).
     * @param windowDuration width of the window; must be a multiple of this DStream's
     *                       batching interval
     * @param slideDuration sliding interval of the window (i.e., the interval after which
     *                      the new DStream will generate RDDs); must be a multiple of this
     *                      DStream's batching interval
     * @param numPartitions number of partitions of each RDD in the new DStream.
     */
    def countByValueAndWindow(
        windowDuration: Duration,
        slideDuration: Duration,
        numPartitions: Int = ssc.sc.defaultParallelism)
        (implicit ord: Ordering[T] = null)
        : DStream[(T, Long)] = ssc.withScope {
      this.map((_, 1L)).reduceByKeyAndWindow(
        (x: Long, y: Long) => x + y,
        (x: Long, y: Long) => x - y,
        windowDuration,
        slideDuration,
        numPartitions,
        (x: (T, Long)) => x._2 != 0L
      )
    }

 

 
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}

    /**
     * Created by root on 6/23/16.
     */
    object countByValueAndWindow {
      def main(args: Array[String]) {
        /**
         * this is a test of the Streaming operation ----- countByValueAndWindow
         */
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

        val conf = new SparkConf().setAppName("the countByValueAndWindow operation of Spark Streaming").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(2))
        // set the checkpoint directory (countByValueAndWindow uses an inverse reduce function internally)
        ssc.checkpoint("/Res")

        // get the socket streaming data
        val socketStreaming = ssc.socketTextStream("master", 9999)

        val data = socketStreaming.countByValueAndWindow(Seconds(6), Seconds(2))

        println("countByValueAndWindow: count the occurrences of each distinct value")
        data.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }



 

 

 


 

Appendix

Link: http://pan.baidu.com/s/1slkqwBb  Password: d92r
