The updateStateByKey operation lets you maintain arbitrary state and keep updating it as new information arrives. Using it takes two steps:
1、Define the state - the state can be any data type.
2、Define the state update function - a function that specifies how to produce the new state from the previous state and the new values from the input stream.
The Spark documentation is vague about how to use this function and material online is thin, so I went through the source code myself, summarized it here, and added a complete example.
updateStateByKey has 6 overloads:
1、Pass only an update function - the simplest form.
The update function takes two parameters, Seq[V] and Option[S]: the former is the collection of values newly arrived for each key in this batch, the latter is the currently saved state.
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
For example, for wordcount we can define the update function like this:

(values: Seq[Int], state: Option[Int]) => {
  // running count for this word; getOrElse falls back to 0 when no state
  // is saved yet (shorthand for an if...else)
  var newValue = state.getOrElse(0)
  for (value <- values) {
    newValue += value // accumulate the occurrences from this batch
  }
  Option(newValue)
}
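
A minimal sketch of calling this simplest overload, assuming wordDstream is a DStream[(String, Int)] like the one built in the complete example at the end, and that checkpointing is already enabled on the StreamingContext:

val updateFunc = (values: Seq[Int], state: Option[Int]) => Option(state.getOrElse(0) + values.sum)
val stateDS = wordDstream.updateStateByKey[Int](updateFunc)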
2、Pass an update function and a number of partitions
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    numPartitions: Int
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
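
This overload only lets you choose how many partitions the state RDDs use; internally it just builds defaultPartitioner(numPartitions). A minimal sketch, reusing the updateFunc from the sketch above (8 is an arbitrary partition count):

val stateDS = wordDstream.updateStateByKey[Int](updateFunc, 8)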
3、Pass an update function and a custom partitioner
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
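
A minimal sketch, again reusing updateFunc; HashPartitioner is just one choice, any org.apache.spark.Partitioner works (sc here is the SparkContext):

import org.apache.spark.HashPartitioner
val stateDS = wordDstream.updateStateByKey[Int](updateFunc, new HashPartitioner(sc.defaultParallelism))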
4、Pass the complete state update function
The functions passed to the previous overloads are partial update functions that only handle a single key; at execution time they are wrapped into a complete state update function. The complete function has the type Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]: the input is an iterator whose elements carry (1) the key, (2) the collection of values arriving for that key in this batch, and (3) the current state, and the output maps each key to its new state.
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}
For example, for wordcount (function1 is the per-key function defined in the complete example below):

val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
  iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
}
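
A sketch of calling this overload directly with the complete function, assuming result is the reduced DStream[(String, Int)] from the complete example below:

val stateDS = result.updateStateByKey(
  newUpdateFunc,
  new HashPartitioner(sc.defaultParallelism),
  true // rememberPartitioner: keep the same partitioning across batches
)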
5、Add an initial state
initialRDD: RDD[(K, S)] - the initial set of (key, state) pairs.
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
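
A sketch reusing updateFunc and seeding the state; the contents of initialRDD (here two hard-coded pairs) stand in for counts carried over from a previous run:

val initialRDD = sc.parallelize(List(("hello", 1), ("world", 1)))
val stateDS = wordDstream.updateStateByKey[Int](
  updateFunc,
  new HashPartitioner(sc.defaultParallelism),
  initialRDD
)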
6、The full form: complete update function, partitioner, whether to remember the partitioner, and initial state
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, Some(initialRDD))
}
A complete example (using overload 6):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

def testUpdate = {
  // SparkUtils.getSpark is a project-local helper that builds a SparkSession
  val sc = SparkUtils.getSpark("test", "db01").sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))
  // updateStateByKey requires a checkpoint directory
  ssc.checkpoint("hdfs://ns1/config/checkpoint")
  val initialRDD = sc.parallelize(List(("hello", 1), ("world", 1)))
  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://ns1/config/data/")
  val words = lines.flatMap(x => x._2.toString.split(","))
  val wordDstream: DStream[(String, Int)] = words.map(x => (x, 1))
  val result = wordDstream.reduceByKey(_ + _)

  // per-key update function
  def function1(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // add the new values to the previous running count to get the new count
    val newCount = newValues.sum + runningCount.getOrElse(0)
    Some(newCount)
  }

  // wrap the per-key function into the complete (iterator-based) update function
  val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
  }

  val stateDS = result.updateStateByKey(
    newUpdateFunc,
    new HashPartitioner(sc.defaultParallelism),
    true,
    initialRDD
  )
  stateDS.print()
  ssc.start()
  ssc.awaitTermination()
}
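
For reference, if a new file containing the single line hello,spark lands in hdfs://ns1/config/data/ during one batch, the printed state should look roughly like (hello,2), (spark,1), (world,1): the counts seeded by initialRDD plus the occurrences counted in that batch.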