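Spark Streaming is similar to Apache Storm and is used to process streaming data. According to its official documentation, Spark Streaming offers high throughput and strong fault tolerance.
Spark Streaming supports many input sources, for example Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once ingested, the data can be processed with Spark's high-level primitives such as map, reduce, join, and window. The results can be stored in many places, such as HDFS or a database. Spark Streaming also integrates with MLlib (machine learning) and GraphX.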
Spark | Storm
Development language: Scala | Development language: Clojure
Programming model: DStream | Programming model: Spout/Bolt
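The Discretized Stream (DStream) is the basic abstraction in Spark Streaming. It represents a continuous stream of data, as well as the result streams produced by applying Spark primitives to it. Internally, a DStream is represented as a continuous series of RDDs, each holding the data from one time interval, as shown in the figure below:
Operations on the data are likewise applied one RDD at a time.
The computation itself is carried out by the Spark engine.
The figure also shows that the input data is split into multiple batches for processing. Strictly speaking, Spark Streaming is not a true real-time framework, because it processes data batch by batch.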
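The batching idea can be pictured without Spark at all. The following is a minimal sketch in plain Scala, not Spark code: `Event` and `toBatches` are hypothetical names introduced only for illustration, and each inner sequence stands in for the RDD of one batch interval.

```scala
object MicroBatchSketch {
  // A record with an arrival time in milliseconds (hypothetical type, for illustration only).
  final case class Event(timeMs: Long, value: String)

  // Assign every event to the batch interval that contains its arrival time and
  // return the batches in time order as (batch start time, values in that batch).
  // Unlike Spark, this sketch simply skips intervals that received no data.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] = {
    require(batchIntervalMs > 0, "batch interval must be positive")
    events
      .groupBy(e => e.timeMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (idx, es) => (idx * batchIntervalMs, es.sortBy(_.timeMs).map(_.value)) }
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(4900, "b"), Event(5000, "c"), Event(12000, "d"))
    // With a 5-second interval, "a" and "b" share a batch; "c" and "d" each get their own.
    toBatches(events, 5000).foreach { case (start, values) =>
      println(s"batch starting at $start ms: $values")
    }
  }
}
```

A record arriving just after a batch boundary waits for the next batch to complete before it is processed, which is why the latency of this model is bounded below by the batch interval.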
Transformation | Meaning
map(func) | Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items.
filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
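To make a few of these transformations concrete, here is a rough sketch that applies flatMap, filter, map, and a reduceByKey-like aggregation to the contents of a single batch. It uses plain Scala collections as a stand-in for one RDD, so it is not the Spark API itself; `BatchWordCountSketch` and `wordCount` are hypothetical names, and `groupBy` plus `sum` only approximates what `reduceByKey(_ + _)` does.

```scala
object BatchWordCountSketch {
  // Word count over the lines of ONE batch, using collection methods that
  // mirror the DStream transformations of the same name.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // flatMap: one line becomes zero or more words
      .filter(_.nonEmpty)      // filter: drop empty tokens
      .map(word => (word, 1))  // map: word becomes (word, 1)
      .groupBy(_._1)           // group the pairs by key ...
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // ... and sum the values per key

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("hello spark", "hello storm")
    val batch2 = Seq("hello kafka")
    // Each batch is counted on its own; nothing carries over from batch1 to batch2.
    println(wordCount(batch1))
    println(wordCount(batch2))
  }
}
```

Because every batch is processed independently, the count for "hello" in the second batch starts again from zero. Carrying counts across batches requires a stateful operation such as updateStateByKey.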
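1. UpdateStateByKey Operation
The UpdateStateByKey primitive keeps track of historical state; the Word Count example in section 5.2 relies on it. Without UpdateStateByKey, each batch is analyzed and its result is output, but nothing is retained afterwards.
2. Transform Operation
The Transform primitive allows an arbitrary RDD-to-RDD function to be applied to a DStream, which makes it a convenient way to extend the Spark API. MLlib (machine learning) and GraphX are also combined with streaming through this function.
3. Window Operations
Window Operations are somewhat similar to State in Storm: by setting the window size and the sliding interval, you can dynamically obtain the current running state of the stream (a code example appears later).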
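Of the three special operations above, transform is perhaps the easiest to picture in isolation. The sketch below models a stream as a sequence of batches in plain Scala and applies a whole-batch function to each one. `TransformSketch`, `Batch`, and this `transform` are hypothetical illustrations of the idea, not Spark's implementation.

```scala
object TransformSketch {
  // A batch stands in for one RDD; a stream is modelled as a sequence of batches.
  type Batch[A] = Seq[A]

  // Apply a batch-to-batch function to every batch of the stream,
  // mirroring how transform applies an RDD-to-RDD function to every RDD.
  def transform[A, B](stream: Seq[Batch[A]])(f: Batch[A] => Batch[B]): Seq[Batch[B]] =
    stream.map(f)

  def main(args: Array[String]): Unit = {
    val stream = Seq(Seq("a", "b", "a"), Seq("c", "c"))
    // distinct needs to see the whole batch, so it cannot be written as an element-wise map.
    println(transform(stream)(_.distinct))
  }
}
```

The point of the design is that the function receives the whole batch, so any operation available on a batch (here `distinct`) becomes available on the stream, even if the stream API does not expose it directly.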
Output Operations write the data of a DStream to an external database or file system. As with RDD actions, the streaming program only starts the actual computation once an output operation has been invoked.
Output Operation | Meaning
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
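First, install the netcat tool on a Linux machine; it can be installed with yum:
yum install -y nc
Then start the nc server and listen on port 6666 (any free port will do):
nc -lk 6666
The architecture is shown below:
After netcat receives the input, the client sends the data to the server, and the server forwards it to Spark Streaming. Because the data is ultimately processed by executors, it is delivered to the executor side, which needs at least two threads: one for the Receiver that receives the data, and one that executes the analysis. For this reason, when running in local mode, setMaster on SparkConf must be given "local[2]" (or a larger thread count).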
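The thread-count reasoning above can be written down as a small check. This is a hedged sketch in plain Scala: `LocalMasterCheck`, `localThreads`, and `enoughForReceivers` are hypothetical helpers invented for illustration and are not part of Spark. It only recognises the `local` and `local[N]` forms and returns `None` for anything it cannot judge statically.

```scala
object LocalMasterCheck {
  private val LocalN = """local\[(\d+)\]""".r

  // Number of worker threads implied by a local master string, when it is stated explicitly.
  def localThreads(master: String): Option[Int] = master match {
    case "local"   => Some(1)        // plain "local" runs with a single thread
    case LocalN(n) => Some(n.toInt)
    case _         => None           // e.g. "local[*]" or a cluster URL: not decidable here
  }

  // Each receiver occupies one thread, so at least one more is needed to process the data.
  def enoughForReceivers(master: String, receivers: Int): Option[Boolean] =
    localThreads(master).map(_ > receivers)

  def main(args: Array[String]): Unit = {
    println(enoughForReceivers("local[2]", receivers = 1))
    println(enoughForReceivers("local", receivers = 1))
  }
}
```

With one socket receiver, "local[2]" passes the check and "local" does not: in the latter case the single thread is taken by the receiver and no batch would ever be processed.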
5.1 WordCount with Spark Streaming (without accumulating across batches)
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object SparkStreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkStreamingWordCount")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Create the Spark Streaming context with a 5-second batch interval
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
    // Read the data served by netcat
    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("min1", 6666)
    // Analyze the data: count the words within each batch
    val res: DStream[(String, Int)] = dStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    res.print()
    // Start the streaming job
    ssc.start()
    // Block the main thread while subsequent batches are processed
    ssc.awaitTermination()
  }
}
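From the program above, the boilerplate code is as follows:
val conf = new SparkConf()
  .setAppName("SparkStreamingWordCount")
  .setMaster("local[2]")
val sc = new SparkContext(conf)
// Create the Spark Streaming context with a 5-second batch interval
val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
// Read the data served by netcat
val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("min1", 6666)
When this runs, log4j prints a large amount of log output to the console. The following code raises the log level so that the console shows only the results: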
import org.apache.log4j.{Logger, Level}
import org.apache.spark.Logging // public in Spark 1.x, which the code in this article targets
object LoggerLevels extends Logging {
  // Raise the root log level to WARN unless log4j has already been configured
  def setStreamingLogLevels(): Unit = {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
The same word count can also be implemented with transform, calling the LoggerLevels object above to raise the log level:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
object TransformDemo {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf()
      .setAppName("TransformDemo")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Milliseconds(5000))
    // Set the checkpoint directory (only stateful operations need it to keep
    // results from earlier batches; this stateless transform does not depend on it)
    ssc.checkpoint("hdfs://min1:9000/cp-20180611-1")
    // Read the data
    val dStream = ssc.socketTextStream("min1", 6666)
    // Count the words by applying an RDD-to-RDD function through transform
    val res: DStream[(String, Int)] = dStream.transform(rdd => rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _))
    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
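5.2 WordCount with updateStateByKey, accumulating across batches
The updateStateByKey overload used here takes three arguments:
First argument: the function that does the actual update. Its parameter is an iterator.
Each element of the Iterator has three components:
String: the key of the tuple, that is, an individual word
Seq[Int]: the occurrences of the word in the current batch, for example Seq(1,1,1)
Option[Int]: the result accumulated over the previous batches. It may or may not hold a value, which is why it is wrapped in an Option
When reading the value from the Option, prefer getOrElse so that an initial value can be supplied
Second argument: the partitioner to use
Third argument: whether to remember the partitioner from the previous batch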
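The update function described above can be exercised without a cluster. In the sketch below, `updateFunc` has the same shape as the function this section passes to updateStateByKey, while `step` is a hypothetical stand-in for what Spark does between batches (grouping the new values by key and pairing them with the previous state). It is an illustration in plain Scala, not Spark's implementation.

```scala
object UpdateStateSketch {
  // (key, values seen in the current batch, previous state) => (key, new state)
  val updateFunc: Iterator[(String, Seq[Int], Option[Int])] => Iterator[(String, Int)] =
    it => it.map { case (word, counts, previous) => (word, counts.sum + previous.getOrElse(0)) }

  // Hypothetical driver step: feed one batch of (word, 1) pairs plus the old state
  // into updateFunc and return the new state. Keys that already have state but no
  // new values are passed with an empty Seq, so their count is carried over unchanged.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val keys = state.keySet ++ grouped.keySet
    val input = keys.iterator.map(k => (k, grouped.getOrElse(k, Seq.empty[Int]), state.get(k)))
    updateFunc(input).toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a" -> 1, "b" -> 1, "a" -> 1), Seq("b" -> 1, "c" -> 1))
    // scanLeft threads the state through the batches; tail drops the empty initial state.
    batches.scanLeft(Map.empty[String, Int])(step).tail.foreach(println)
  }
}
```

After the first batch the state holds a = 2 and b = 1. After the second it holds a = 2, b = 2, and c = 1, which shows both the `getOrElse(0)` default for a new key and the carry-over for a key that did not appear in the batch.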
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}
/**
 * Accumulating results across batches with updateStateByKey
 */
object SparkStreamingACCWC {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf()
      .setAppName("SparkStreamingACCWC")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Set the checkpoint directory; it is needed to keep the results of earlier batches
    ssc.checkpoint("hdfs://min1:9000/cp-20180611-1")
    // Read the data
    val dStream = ssc.socketTextStream("192.168.158.115", 6666)
    // Analyze the data
    val tuples = dStream.flatMap(_.split(" ")).map((_, 1))
    // defaultParallelism is the default number of parallel tasks, used here as the partition count
    val res: DStream[(String, Int)] =
      tuples.updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
  /**
   * The updateStateByKey overload used above takes three arguments:
   * First argument: the function that does the actual update; its parameter is an iterator.
   *   Each element of the Iterator has three components:
   *   String: the key of the tuple, that is, an individual word
   *   Seq[Int]: the occurrences of the word in the current batch, for example Seq(1,1,1)
   *   Option[Int]: the result accumulated over the previous batches; it may or may not hold a value, hence the Option
   *   When reading the value from the Option, prefer getOrElse so that an initial value can be supplied
   * Second argument: the partitioner to use
   * Third argument: whether to remember the partitioner from the previous batch
   */
  val func = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    it.map(x => {
      (x._1, x._2.sum + x._3.getOrElse(0))
    })
  }
}
The boilerplate code for accumulating across batches is as follows:
LoggerLevels.setStreamingLogLevels()
val conf = new SparkConf()
  .setAppName("SparkStreamingACCWC")
  .setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))
// Set the checkpoint directory; it is needed to keep the results of earlier batches
ssc.checkpoint("hdfs://min1:9000/cp-20180611-1")
// Read the data
val dStream = ssc.socketTextStream("192.168.158.115", 6666)
5.3 WordCount with data read from Kafka
ZooKeeper and Kafka must be running.
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
object LoadKafkaDataAndWordCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf()
      .setAppName("SparkStreamingACCWC")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Milliseconds(5000))
    // Set the checkpoint directory; it is needed to keep the results of earlier batches
    ssc.checkpoint("hdfs://min1:9000/cp-20180611-3")
    // Parameters required to connect to Kafka:
    // ZooKeeper quorum, consumer group name, comma-separated topics, consumer threads per topic
    val Array(zkQuorum, group, topics, numThread) = args
    // Put every topic into a Map of topic -> thread count
    val topicMap: Map[String, Int] = topics.split(",").map((_, numThread.toInt)).toMap
    // Read from the Kafka cluster; each element is a (message key, message value) pair
    val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK)
    // Drop the message key and keep only the value
    val lines: DStream[String] = data.map(_._2)
    // Analyze the data
    val tuples: DStream[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1))
    val res: DStream[(String, Int)] = tuples.updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
  // Add the counts of the current batch to the total accumulated so far
  val func = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    it.map {
      case (x, y, z) => {
        (x, y.sum + z.getOrElse(0))
      }
    }
  }
}
Now send messages with Kafka's shell commands, and the console prints the WordCount results.
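The argument handling in the Kafka example can be checked on its own, before any cluster is involved. The sketch below is a slightly more defensive variant of that logic in plain Scala. `TopicMapSketch`, `topicMap`, and `parseArgs` are hypothetical helpers, and the trimming and validation are additions that the original program does not perform.

```scala
object TopicMapSketch {
  // Build the topic -> consumer-thread-count map from a comma-separated topic list.
  // Assumes numThreads has already been validated as a number (see parseArgs).
  def topicMap(topics: String, numThreads: String): Map[String, Int] =
    topics.split(",").map(_.trim).filter(_.nonEmpty).map(t => (t, numThreads.toInt)).toMap

  // Validate the four expected arguments instead of failing with a MatchError.
  def parseArgs(args: Array[String]): Either[String, (String, String, Map[String, Int])] =
    args match {
      case Array(zkQuorum, group, topics, numThreads)
          if numThreads.nonEmpty && numThreads.forall(_.isDigit) =>
        Right((zkQuorum, group, topicMap(topics, numThreads)))
      case _ =>
        Left("usage: <zkQuorum> <group> <topics> <numThreads>")
    }

  def main(args: Array[String]): Unit =
    // The host name, group, and topic names here are made-up sample values.
    println(parseArgs(Array("min1:2181", "group1", "wordcount,logs", "2")))
}
```

The destructuring `val Array(zkQuorum, group, topics, numThread) = args` used in the program throws a `MatchError` when the argument count is wrong, so a check of this kind gives a clearer message at startup.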
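6. Window Operations
A window operation computes over the data received during a period of time, so it shows how the data changes across that period.
Two important parameters must be supplied when using a window function:
Window length: the duration of the window, that is, the time range covered by each result
Sliding interval: the interval at which the window operation is performed, that is, the time between successive results
Both parameters must be multiples of the batch interval of the source DStream.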
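The relationship between the two parameters and the batch interval is simple arithmetic, sketched here in plain Scala. `WindowSketch.windows` is a hypothetical helper that lists which batch indices (0-based) each window would cover. It is an illustration under the stated rule, not Spark's scheduler.

```scala
object WindowSketch {
  // For numBatches consecutive batches, return the batch indices covered by each window.
  // All durations are in seconds.
  def windows(numBatches: Int, batchSec: Int, windowSec: Int, slideSec: Int): Seq[Seq[Int]] = {
    require(batchSec > 0 && windowSec > 0 && slideSec > 0, "durations must be positive")
    require(windowSec % batchSec == 0, "window length must be a multiple of the batch interval")
    require(slideSec % batchSec == 0, "sliding interval must be a multiple of the batch interval")
    val perWindow = windowSec / batchSec // batches covered by one window
    val perSlide = slideSec / batchSec   // batches between two successive results
    // A result is produced after every perSlide batches and covers the last perWindow batches.
    (perSlide to numBatches by perSlide).map { end =>
      (math.max(0, end - perWindow) until end).toList
    }.toList
  }

  def main(args: Array[String]): Unit = {
    println(windows(numBatches = 4, batchSec = 5, windowSec = 10, slideSec = 10))
    println(windows(numBatches = 4, batchSec = 5, windowSec = 10, slideSec = 5))
  }
}
```

With a 5-second batch interval, a 10-second window sliding every 10 seconds covers batches (0, 1) and then (2, 3), so no batch is counted twice. Sliding every 5 seconds instead gives (0), (0, 1), (1, 2), (2, 3), so consecutive windows overlap by one batch. A window length of 12 seconds is rejected because it is not a multiple of 5.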
WordCount with a window function:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
object WindowOperationWordCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf()
      .setAppName("WindowOperationWordCount")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Set the checkpoint directory used to keep the results of earlier batches
    ssc.checkpoint("hdfs://min1:9000/cp-20180611-5")
    // Read the data
    val dStream = ssc.socketTextStream("192.168.158.115", 6666)
    val tuples = dStream.flatMap(_.split(" ")).map((_, 1))
    // Window length 10s, sliding interval 10s: both are multiples of the 5s batch interval
    val res: DStream[(String, Int)] = tuples.reduceByKeyAndWindow((x: Int, y: Int) => (x + y), Seconds(10), Seconds(10))
    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
}