1. updateStateByKey operator example
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreamingApp")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Read lines from a socket source and count words within each 5-second batch
    val lines = ssc.socketTextStream("hadoop000", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(x => (x, 1))
    val wordcounts = pairs.reduceByKey(_ + _)
    wordcounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The simple word-count code above processes each batch in isolation: every batch of input produces statistics for that batch only. How can we get a running total over all the data received so far? That is what updateStateByKey is for:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStatebyKeyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("UpdateStatebyKeyApp")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Stateful operators require a checkpoint directory to persist state across batches
    ssc.checkpoint("hdfs://192.168.137.251:9000/spark/data")
    val lines = ssc.socketTextStream("hadoop000", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(x => (x, 1))
    // Merge each batch's counts into the running state per key
    val wordcounts = pairs.updateStateByKey(updateFunction)
    wordcounts.print()
    ssc.start()
    ssc.awaitTermination()
  }

  // newValues: this key's counts in the current batch; oldValues: the accumulated state
  def updateFunction(newValues: Seq[Int], oldValues: Option[Int]): Option[Int] = {
    val newCount = newValues.sum
    val oldCount = oldValues.getOrElse(0)
    Some(newCount + oldCount)
  }
}
Input:
[hadoop@hadoop000 ~]$ nc -lk 9999
huluwa huluwa dawa erwa
Spark output:
18/11/13 14:19:00 INFO CheckpointWriter: Submitted checkpoint of time 1542089940000 ms to writer queue
18/11/13 14:19:00 INFO CheckpointWriter: Saving checkpoint for time 1542089940000 ms to file 'hdfs://192.168.137.251:9000/spark/data/checkpoint-1542089940000'
-------------------------------------------
Time: 1542089940000 ms
-------------------------------------------
(huluwa,2)
(dawa,1)
(erwa,1)
...
...
...
-------------------------------------------
Time: 1542089945000 ms
-------------------------------------------
(huluwa,2)
(dawa,1)
(erwa,1)
...
...
...
-------------------------------------------
Time: 1542089950000 ms
-------------------------------------------
(huluwa,2)
(dawa,1)
(erwa,1)
If no new data arrives, updateStateByKey keeps emitting the previous accumulated results on every batch.
Now enter one more line:
[hadoop@hadoop000 ~]$ nc -lk 9999
huluwa huluwa dawa erwa
huluwa huluwa dae erwa
Output:
-------------------------------------------
Time: 1542089970000 ms
-------------------------------------------
(dae,1)
(huluwa,4)
(dawa,1)
(erwa,2)
Keep entering data:
[hadoop@hadoop000 ~]$ nc -lk 9999
huluwa huluwa dawa erwa
huluwa huluwa dae erwa
huluwa huluwa dawa erwa
huluwa huluwa dawa erwa
huluwa huluwa dawa erwa
huluwa huluwa dae erwa
huluwa huluwa dae erwa
Result:
-------------------------------------------
Time: 1542090080000 ms
-------------------------------------------
(dae,3)
(huluwa,14)
(dawa,4)
(erwa,7)
Looking in the checkpoint directory, there are many state files named like checkpoint-1542090065000:
[hadoop@hadoop000 data]$ hadoop fs -ls /spark/data
Found 15 items
-rw-r--r-- 1 梦candybear supergroup 3514 2018-11-13 22:21 /spark/data/checkpoint-1542090065000
-rw-r--r-- 1 梦candybear supergroup 3518 2018-11-13 22:21 /spark/data/checkpoint-1542090065000.bk
-rw-r--r-- 1 梦candybear supergroup 3511 2018-11-13 22:21 /spark/data/checkpoint-1542090070000
-rw-r--r-- 1 梦candybear supergroup 3518 2018-11-13 22:21 /spark/data/checkpoint-1542090070000.bk
-rw-r--r-- 1 梦candybear supergroup 3514 2018-11-13 22:21 /spark/data/checkpoint-1542090075000
-rw-r--r-- 1 梦candybear supergroup 3518 2018-11-13 22:21 /spark/data/checkpoint-1542090075000.bk
-rw-r--r-- 1 梦candybear supergroup 3511 2018-11-13 22:21 /spark/data/checkpoint-1542090080000
-rw-r--r-- 1 梦candybear supergroup 3518 2018-11-13 22:21 /spark/data/checkpoint-1542090080000.bk
-rw-r--r-- 1 梦candybear supergroup 3512 2018-11-13 22:21 /spark/data/checkpoint-1542090085000
-rw-r--r-- 1 梦candybear supergroup 3516 2018-11-13 22:21 /spark/data/checkpoint-1542090085000.bk
drwxr-xr-x - 梦candybear supergroup 0 2018-11-13 22:21 /spark/data/e02daf3a-0805-4612-b612-67ca34d32ff8
drwxr-xrwx - 梦candybear supergroup 0 2018-11-13 22:21 /spark/data/receivedBlockMetadata
These checkpoint files are all small files, which puts a lot of pressure on HDFS. How do we deal with that? We will come back to it at the end.
Appendix: the updateStateByKey source (this was the option before Spark 1.6):
/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * In every batch the updateFunc will be called for each state even if there are no new values.
 * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
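As the scaladoc points out, returning None from the update function removes that key's state entirely. Below is a minimal sketch (not from the original post) of a variant of the updateFunction above that drops keys which received no new values in the current batch:

  // Sketch only: same (Seq[Int], Option[Int]) signature as updateFunction above.
  // Returning None removes the key from the state instead of keeping a stale count.
  def updateAndExpire(newValues: Seq[Int], oldValues: Option[Int]): Option[Int] = {
    if (newValues.isEmpty) {
      None
    } else {
      Some(newValues.sum + oldValues.getOrElse(0))
    }
  }

Dropping idle keys this way keeps the state, and therefore the checkpoint size, from growing without bound, at the cost of losing the running totals for quiet keys.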
Since Spark 1.6 there is also mapWithState:
/**
 * :: Experimental ::
 * Return a [[MapWithStateDStream]] by applying a function to every key-value element of
 * `this` stream, while maintaining some state data for each unique key. The mapping function
 * and other specification (e.g. partitioners, timeouts, initial state data, etc.) of this
 * transformation can be specified using `StateSpec` class. The state data is accessible in
 * as a parameter of type `State` in the mapping function.
 *
 * Example of using `mapWithState`:
 * {{{
 *    // A mapping function that maintains an integer state and return a String
 *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
 *      // Use state.exists(), state.get(), state.update() and state.remove()
 *      // to manage state, and return the necessary string
 *    }
 *
 *    val spec = StateSpec.function(mappingFunction).numPartitions(10)
 *
 *    val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)
 * }}}
 *
 * @param spec          Specification of this transformation
 * @tparam StateType    Class type of the state data
 * @tparam MappedType   Class type of the mapped data
 */
@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}
2. mapWithState operator example
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MapWithStateApp")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs://192.168.137.251:9000/spark/data")
    val lines = ssc.socketTextStream("hadoop000", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(x => (x, 1)).reduceByKey(_ + _)

    // Update the cumulative count using mapWithState.
    // This will give a DStream made of state (which is the cumulative count of the words).
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      val output = (word, sum)
      state.update(sum)
      output
    }

    val wordcounts = pairs.mapWithState(StateSpec.function(mappingFunc))
    wordcounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Enter a line of data in the console:
[hadoop@hadoop000 ~]$ nc -lk 9999
hadoop spark spark hive
Spark output:
-------------------------------------------
Time: 1542098100000 ms
-------------------------------------------
(hive,1)
(spark,2)
(hadoop,1)
Enter two more lines:
[hadoop@hadoop000 ~]$ nc -lk 9999
hadoop spark spark hive
hadoop spark spark hive
hadoop spark spark hive
Output:
-------------------------------------------
Time: 1542098115000 ms
-------------------------------------------
(hive,3)
(spark,6)
(hadoop,3)
Now enter a line containing new words:
[hadoop@hadoop000 ~]$ nc -lk 9999
hadoop spark spark hive
hadoop spark spark hive
hadoop spark spark hive
huluwa huluwa dawa erwa
Notice that, unlike updateStateByKey, the Spark output now only shows results for the keys that appeared in the current batch:
-------------------------------------------
Time: 1542098120000 ms
-------------------------------------------
(huluwa,2)
(dawa,1)
(erwa,1)
[hadoop@hadoop000 ~]$ nc -lk 9999
hadoop spark spark hive
hadoop spark spark hive
hadoop spark spark hive
huluwa huluwa dawa erwa
huluwa huluwa dawa erwa
-------------------------------------------
Time: 1542098125000 ms
-------------------------------------------
(huluwa,4)
(dawa,2)
(erwa,2)
Is the earlier state still there, then? Let's verify:
[hadoop@hadoop000 ~]$ nc -lk 9999
hadoop spark spark hive
hadoop spark spark hive
hadoop spark spark hive
huluwa huluwa dawa erwa
huluwa huluwa dawa erwa
hadoop spark spark hive
As the output shows, the earlier counts are still held in the state; mapWithState simply emits only the keys that appear in the current batch.
-------------------------------------------
Time: 1542098135000 ms
-------------------------------------------
(hive,4)
(spark,8)
(hadoop,4)
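If you do want the full key/state table every batch, the way updateStateByKey prints it, the MapWithStateDStream also exposes stateSnapshots(). Below is a minimal sketch, reusing the wordcounts value from MapWithStateApp above:

    // Sketch only: stateSnapshots() returns a DStream of (key, state) pairs for every key
    // currently held in the state, not just the keys seen in this batch.
    wordcounts.stateSnapshots().print()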
Open the checkpoint directory and, just as with updateStateByKey, you will find many small checkpoint-<timestamp> files:
[hadoop@hadoop000 data]$ hadoop fs -ls /spark/data
Found 12 items
-rw-r--r-- 1 梦candybear supergroup 3974 2018-11-14 00:35 /spark/data/checkpoint-1542098120000
-rw-r--r-- 1 梦candybear supergroup 3978 2018-11-14 00:35 /spark/data/checkpoint-1542098120000.bk
-rw-r--r-- 1 梦candybear supergroup 3975 2018-11-14 00:35 /spark/data/checkpoint-1542098125000
-rw-r--r-- 1 梦candybear supergroup 3979 2018-11-14 00:35 /spark/data/checkpoint-1542098125000.bk
-rw-r--r-- 1 梦candybear supergroup 3975 2018-11-14 00:35 /spark/data/checkpoint-1542098130000
-rw-r--r-- 1 梦candybear supergroup 3979 2018-11-14 00:35 /spark/data/checkpoint-1542098130000.bk
-rw-r--r-- 1 梦candybear supergroup 4037 2018-11-14 00:35 /spark/data/checkpoint-1542098135000
-rw-r--r-- 1 梦candybear supergroup 3979 2018-11-14 00:35 /spark/data/checkpoint-1542098135000.bk
-rw-r--r-- 1 梦candybear supergroup 4043 2018-11-14 00:35 /spark/data/checkpoint-1542098140000
-rw-r--r-- 1 梦candybear supergroup 4047 2018-11-14 00:35 /spark/data/checkpoint-1542098140000.bk
drwxr-xr-x - 梦candybear supergroup 0 2018-11-14 00:35 /spark/data/d55f3470-753c-4735-9b18-1b2c75f3a300
drwxr-xrwx - 梦candybear supergroup 0 2018-11-14 00:34 /spark/data/receivedBlockMetadata
So how should the small files produced by updateStateByKey and mapWithState be handled, or better, avoided in the first place?
The fix is actually simple: if what you need are statistics over some time range, you can skip both operators. Process each batch as a plain stateless count, tag the result with its processing time, and save it to a database such as MySQL; when a total is needed, pull the historical rows back out and aggregate them there. This avoids creating the small files at the source (a sketch of the write is given after the table). The stored records would look like this:
+---------+-------+---------------+
| word | count | timestamp |
+---------+-------+---------------+
| hive | 1 | 1542098135000 |
| spark | 2 | 1542098135000 |
| hadoop | 1 | 1542098135000 |
| hive | 1 | 1542098140000 |
| spark | 2 | 1542098140000 |
| hadoop | 1 | 1542098140000 |
| hive | 1 | 1542098145000 |
| spark | 2 | 1542098145000 |
| hadoop | 1 | 1542098145000 |
+---------+-------+---------------+
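Below is a minimal sketch of that write path (not from the original post), reusing the stateless wordcounts DStream from the first example; the wordcount table, the JDBC URL and the credentials are placeholder assumptions:

import java.sql.DriverManager

// Sketch only: append each batch's (word, count) pairs to MySQL, tagged with the batch time.
wordcounts.foreachRDD { (rdd, batchTime) =>
  rdd.foreachPartition { partition =>
    // One connection per partition; a production job would use a connection pool.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/test", "user", "password")
    val stmt = conn.prepareStatement(
      "INSERT INTO wordcount(word, count, timestamp) VALUES (?, ?, ?)")
    partition.foreach { case (word, count) =>
      stmt.setString(1, word)
      stmt.setInt(2, count)
      stmt.setLong(3, batchTime.milliseconds)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }
}

A cumulative count is then just a query on the MySQL side, for example SELECT word, SUM(count) FROM wordcount GROUP BY word, optionally with a WHERE clause on timestamp to restrict it to a specific time range.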