updateStateByKey

1、DStream transformation operations

       The commonly used transformation operators are listed below; a short usage sketch follows the table.

       

| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream. |
| count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. |
| countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
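To show how several of these operators compose, here is a minimal per-batch word-count sketch. The local master, batch interval, and socket host/port are assumptions chosen for illustration, not part of the original text:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformationDemo")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second batch interval

    // Hypothetical source: lines of text arriving on a socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // flatMap: split each line into words; filter: drop empty tokens
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

    // map + reduceByKey: word counts within each batch (no state across batches)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that reduceByKey here only aggregates within a single batch; carrying counts across batches is exactly what updateStateByKey, described next, adds.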
2、updateStateByKey

  Spark Streaming processes data along the time axis, so a DStream only holds the data of the current batch interval. For use cases such as sales reporting, however, we want cumulative figures, e.g. total sales to date; this is where the updateStateByKey operator becomes essential. As the name suggests, updateStateByKey maintains state information per key.

   2-1) Key steps

   (1) Define what the state holds, i.e. what information the state is meant to track and in what format the data is stored.

   (2) Define the state update function, i.e. how the previous state and the newly arrived data are combined to produce the new state.

   (3) Using a state update function requires setting a checkpoint directory, via ssc.checkpoint("hdfs://spark/checkpoint/streaming"). The official description of this method is: "Set the context to periodically checkpoint the DStream operations for driver fault-tolerance." The checkpoint directory should therefore point to relatively robust storage, such as HDFS.

  Within each batch, Spark invokes the state update function for every key that has state, regardless of whether new data arrived for that key in the batch. If the state update function returns None, the corresponding key is removed from the state (see the sketch after this paragraph for an example that uses this to expire idle keys).
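As an illustration of the None-removal behavior, the following sketch keeps a running count together with an idle-batch counter and drops a key after three consecutive batches without new values. The expiry rule (three idle batches) is a hypothetical example, not something prescribed by Spark:

def updateWithExpiry(newValues: Seq[Int], state: Option[(Int, Int)]): Option[(Int, Int)] = {
  val (count, idleBatches) = state.getOrElse((0, 0))
  if (newValues.isEmpty) {
    // No data arrived for this key in the current batch
    if (idleBatches + 1 >= 3) None               // returning None removes the key from the state
    else Some((count, idleBatches + 1))          // keep the count, remember one more idle batch
  } else {
    Some((count + newValues.sum, 0))             // add the new values, reset the idle counter
  }
}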

 2-2) Code example

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // add the new values to the previous running count to get the new count
    val newCount = runningCount.getOrElse(0) + newValues.sum
    Some(newCount)
}
It is invoked as follows:

val runningCounts = pairs.updateStateByKey[Int](updateFunction _)

updateFunction plays the same role as the anonymous function passed to reduceByKey.
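Putting the pieces together, a minimal end-to-end stateful word count might look like the sketch below. The socket source, host/port, master setting, and checkpoint path are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    Some(runningCount.getOrElse(0) + newValues.sum)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Checkpointing is mandatory for updateStateByKey
    ssc.checkpoint("hdfs://spark/checkpoint/streaming")

    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Cumulative count per word across all batches
    val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}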



