updateStateByKey explained, with a word count example

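The updateStateByKey operation lets you maintain arbitrary state per key while continuously updating it with new information. Using it takes two steps:
1. Define the state. The state can be of any data type.
2. Define the state update function. This function specifies how to compute the new state from the previous state and the new values arriving from the input stream.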

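The Spark documentation is vague about how to use this function and the material available online is not very thorough, so I read through the source code, summarized it here, and added a complete example.
updateStateByKey has six overloads:
1. Pass only an update function. This is the simplest form.
The update function takes two parameters, Seq[V] and Option[S]. The first is the collection of new values that arrived for a key in the current batch; the second is the state currently saved for that key. It returns the new state as an Option[S]; returning None removes the key from the state.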
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}

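For example, for word count, the update function can be defined like this:
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  // Running count for this word: start from the saved state, or 0 if the word has no state yet
  var newValue = state.getOrElse(0)
  for (value <- values) {
    newValue += value // add this batch's occurrences to the running count
  }
  Option(newValue)
}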
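To make the per-batch behavior concrete, here is a minimal sketch in plain Scala (no Spark required) of how such an update function is applied batch after batch. The names UpdateFuncSketch, updateCount, and applyBatch are hypothetical and are not Spark APIs; applyBatch only imitates, under simplifying assumptions, what updateStateByKey does for each key that has saved state or new values.

```scala
object UpdateFuncSketch {
  // Per-key update function with the shape updateStateByKey expects:
  // new values for the key in this batch + previous state => new state.
  def updateCount(values: Seq[Int], state: Option[Int]): Option[Int] =
    Some(state.getOrElse(0) + values.sum)

  // Hypothetical helper (not a Spark API): apply the update function to one batch.
  // Every key that has saved state or new values is visited once.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    // Group the batch by key, keeping only the values.
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      // A key without new values receives an empty Seq; a new key receives None as state.
      updateCount(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val afterBatch1 = applyBatch(Map.empty, Seq("hello" -> 1, "world" -> 1, "hello" -> 1))
    val afterBatch2 = applyBatch(afterBatch1, Seq("hello" -> 1))
    println(afterBatch1)
    println(afterBatch2)
  }
}
```

In this sketch "world" keeps its count in the second batch even though no new value arrives for it, because the update function is still called for it with an empty Seq.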

2. Pass an update function and the number of partitions
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    numPartitions: Int
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
3. Pass an update function and a custom partitioner
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
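4. Pass the full state update function
The update functions passed to the previous overloads are partial: each handles a single key. When those overloads run, they wrap the per-key function into a full state update function as well.
The full function has the type Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]. Its input is an iterator of triples: the first element is the key, the second is the collection of values that arrived for this key in the current batch, and the third is the current state. The output is an iterator of (key, new state) pairs.
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}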

For example, for word count (function1 is the per-key update function defined in the complete example at the end):

val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
  iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
}
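The wrapping step can be tried without Spark. The following is a minimal sketch in plain Scala that lifts a per-key function into the iterator form and runs it on a small in-memory list; LiftSketch, lift, countWords, and dropIdle are hypothetical names introduced only for this illustration. The second function returns None when a key receives no new values, which shows how flatMap drops such keys from the result.

```scala
object LiftSketch {
  // Lift a per-key update function into the iterator-based form,
  // following the same flatMap/map pattern as the overloads above.
  def lift[K, V, S](f: (Seq[V], Option[S]) => Option[S])
      : Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    iterator => iterator.flatMap(t => f(t._2, t._3).map(s => (t._1, s)))

  // Word count: always keep the key.
  val countWords: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Variant that returns None for keys with no new values in this batch.
  val dropIdle: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) =>
      if (values.isEmpty) None else Some(values.sum + state.getOrElse(0))

  // Sample input: (key, new values in this batch, previous state).
  val input: List[(String, Seq[Int], Option[Int])] = List(
    ("hello", Seq(1, 1), Some(3)),
    ("world", Seq.empty, Some(2)),
    ("spark", Seq(1), None)
  )

  def main(args: Array[String]): Unit = {
    println(lift[String, Int, Int](countWords)(input.iterator).toList)
    println(lift[String, Int, Int](dropIdle)(input.iterator).toList)
  }
}
```

Because the per-key result is an Option, a None simply produces no output pair for that key, so the key disappears from the new state.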

5. Add an initial state
initialRDD: RDD[(K, S)] holds the initial state for each key.
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
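The effect of an initial state can be illustrated with a small plain-Scala simulation (no Spark required). This is only a sketch under the assumption that the initial state plays the role of the "previous state" for the first batch; InitialStateSketch, updateCount, and run are hypothetical names, not Spark APIs.

```scala
object InitialStateSketch {
  // Per-key update function for word count.
  def updateCount(values: Seq[Int], state: Option[Int]): Option[Int] =
    Some(values.sum + state.getOrElse(0))

  // Hypothetical driver (not a Spark API): fold a sequence of batches of words
  // over a starting state, applying updateCount to every known key per batch.
  def run(initial: Map[String, Int], batches: Seq[Seq[String]]): Map[String, Int] =
    batches.foldLeft(initial) { (state, words) =>
      // Each occurrence of a word contributes the value 1.
      val newValues: Map[String, Seq[Int]] =
        words.groupBy(identity).map { case (w, ws) => w -> ws.map(_ => 1) }
      (state.keySet ++ newValues.keySet).flatMap { k =>
        updateCount(newValues.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
      }.toMap
    }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("hello", "spark"))
    println(run(Map("hello" -> 1, "world" -> 1), batches)) // seeded state
    println(run(Map.empty, batches))                       // no initial state
  }
}
```

With the seeded state, "hello" continues from its initial count and "world" is present even though it never appears in the batch; without it, only the words seen in the stream show up.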
6. Pass the full update function, an initial state, and whether to remember the partitioner
rememberPartitioner controls whether the partitioner object is remembered in the generated RDDs.
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner,
    rememberPartitioner, Some(initialRDD))
}

A complete example:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

def testUpdate(): Unit = {
  // SparkUtils.getSpark is a project-specific helper (not part of Spark); replace it with your own session setup
  val sc = SparkUtils.getSpark("test", "db01").sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))
  // updateStateByKey requires a checkpoint directory
  ssc.checkpoint("hdfs://ns1/config/checkpoint")
  val initialRDD = sc.parallelize(List(("hello", 1), ("world", 1)))
  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://ns1/config/data/")
  val words = lines.flatMap(x => x._2.toString.split(","))
  val wordDstream: DStream[(String, Int)] = words.map(x => (x, 1))
  val result = wordDstream.reduceByKey(_ + _)

  // Per-key update: add the new values to the previous running count to get the new count
  def function1(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum + runningCount.getOrElse(0)
    Some(newCount)
  }
  // Full update function: apply function1 to every (key, new values, state) triple
  val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
  }
  val stateDS = result.updateStateByKey(newUpdateFunc, new HashPartitioner(sc.defaultParallelism), true, initialRDD)
  stateDS.print()
  ssc.start()
  ssc.awaitTermination()
}
