Using the Spark Streaming operator updateStateByKey

@羲凡 - just to live a better life

updateStateByKey groups the records of each batch by key and updates a per-key state value by combining the values of the current batch with the state kept for that key from previous batches.
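As a hand-worked illustration (my own sketch, not from the original post), assuming a word-count update function like the one used in the application code below, the state for a single key evolves across batches like this:

// Hypothetical update function: add this batch's counts to the previous state.
val update: (Seq[Int], Option[Long]) => Option[Long] =
  (values, state) => Some(values.sum + state.getOrElse(0L))

// Batch 1: the key appears twice, no previous state -> state becomes 2
val afterBatch1 = update(Seq(1, 1), None)      // Some(2)
// Batch 2: the key appears once, previous state is 2 -> state becomes 3
val afterBatch2 = update(Seq(1), afterBatch1)  // Some(3)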

1. Source code notes:
def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S],
      numPartitions: Int
    ): DStream[(K, S)] = ssc.withScope {
    updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
  }

1. Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of each key.
2. In every batch the updateFunc will be called for each state even if there are no new values (see the sketch after this list).
3. Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
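Point 2 matters in practice: because the update function also runs for keys that received no new values in a batch, returning None is how a key's state gets removed. A minimal sketch (my own illustration, assuming the same Seq[Int]/Option[Long] types as the application code below):

// Drop a key's state as soon as a batch arrives with no new values for it;
// returning None removes the key from the state DStream, otherwise accumulate.
val expireWhenIdle: (Seq[Int], Option[Long]) => Option[Long] = (values, state) =>
  if (values.isEmpty) None
  else Some(values.sum + state.getOrElse(0L))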

2. Example application code:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TestUpdateStateByKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TestStreamingWC")
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    // 5-second batch interval
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
    // updateStateByKey requires a checkpoint directory (see note 1 below)
    ssc.checkpoint("/aarontest/sparkstreaming/checkpoint/test1")

    // Read lines from the socket source at deptest22:9999
    val lines = ssc.socketTextStream("deptest22", 9999)
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      // Add this batch's count to the running total kept as state for each word
      .updateStateByKey((seq: Seq[Int], state: Option[Long]) => {
        val sum: Int = seq.sum
        val preState = state.getOrElse(0L)
        Some(sum + preState)
      }, 4)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
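To try it out, something must be listening on deptest22:9999 before the job starts (for example `nc -lk 9999` on that host, assuming netcat is available). Words typed into that socket are counted, and the accumulated per-word totals are printed every 5 seconds.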
3. Notes:

1. When using the updateStateByKey operator you must call ssc.checkpoint(), otherwise the job fails with: ERROR StreamingContext: Error starting the context, marking it as stopped. java.lang.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint().

2. If port 9999 on deptest22 is closed while the Spark Streaming job is still running, the job reports: ERROR ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error connecting to deptest22:9999 - java.net.ConnectException: Connection refused: connect. Once port 9999 is reopened, stream processing resumes.
3. If the numPartitions argument of updateStateByKey is omitted, Spark falls back to its default partitioner, sized by the default parallelism (see the sketch after this list).
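A minimal sketch of the variant without numPartitions, wrapped in a helper function so it stands alone (the function name statefulCounts is mine, not from the original post):

import org.apache.spark.streaming.dstream.DStream

// Same word-count state update as in section 2, but letting updateStateByKey
// fall back to the default partitioner instead of passing numPartitions.
def statefulCounts(lines: DStream[String]): DStream[(String, Long)] =
  lines
    .flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
    .updateStateByKey((seq: Seq[Int], state: Option[Long]) =>
      Some(seq.sum + state.getOrElse(0L)))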

====================================================================

@羲凡 - just to live a better life

If you have any questions about this post, feel free to leave a comment.
