Spark Streaming ingests data from a variety of sources, processes it, and writes the results out to external file systems.
Key properties: low latency; efficient recovery from failures; scales to hundreds of nodes; composes with Spark's other sub-frameworks (batch processing, machine learning, graph processing).
How it works:
Coarse-grained view: Spark Streaming receives a live data stream, slices it into small blocks by a configured time interval, and hands those blocks to the Spark engine for processing.
Fine-grained view: the Spark application runs on the Driver (SparkContext, StreamingContext). The Driver asks an Executor to start a Receiver. When input data arrives (the Input Stream), the Receiver splits it into blocks and stores them in memory, copying the blocks to other Executors if multiple replicas are configured. Once replication finishes, the Receiver reports the block metadata (which machines hold which blocks) to the StreamingContext. At every batch interval, the StreamingContext asks the SparkContext to launch jobs, which are distributed to the Executors for execution.
DStream: a continuous stream of data represented as a sequence of RDDs; each RDD holds the data for one batch.
Operators applied to a DStream (map, flatMap, ...) are translated under the hood into the same operation on each of the DStream's RDDs, because a DStream is simply a series of per-batch RDDs.
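For instance, the following two lines are equivalent (a sketch; `lines` stands for any DStream[String]):

// map on the DStream...
val upper1 = lines.map(_.toUpperCase)
// ...amounts to applying map to each batch's RDD via transform
val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))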
Input DStreams and Receivers: every input DStream (except those backed by a file system) is associated with a Receiver, which pulls data from the source and stores it in Spark's memory.
Transformations: modify the data coming from an input DStream (map, flatMap, filter, ...).
Output operations: write the data in a DStream to an external system (a database or a file system).
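To make the three roles concrete, here is a minimal sketch of a full pipeline (the port and the output prefix are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setMaster("local[2]").setAppName("Pipeline"), Seconds(5))
    val lines = ssc.socketTextStream("localhost", 6789)                     // input DStream (backed by a Receiver)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) // transformations
    counts.saveAsTextFiles("wcOutput")                                      // output operation: write each batch out
    ssc.start()
    ssc.awaitTermination()
  }
}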
Dependencies (pom.xml):
<properties>
    <scala.version>2.11.8</scala.version>
    <kafka.version>0.8.2.1</kafka.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
    <hbase.version>1.2.0-cdh5.7.0</hbase.version>
</properties>

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.11</artifactId>
        <version>${kafka.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume-sink_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.module</groupId>
        <artifactId>jackson-module-scala_2.11</artifactId>
        <version>2.6.5</version>
    </dependency>
    <dependency>
        <groupId>net.jpountz.lz4</groupId>
        <artifactId>lz4</artifactId>
        <version>1.3.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.37</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flume.flume-ng-clients</groupId>
        <artifactId>flume-ng-log4jappender</artifactId>
        <version>1.6.0</version>
    </dependency>
</dependencies>
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming over socket data.
 * Test with: nc
 */
object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    System.setProperty("hadoop.home.dir", "E:/winutils/")
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // StreamingContext takes two arguments: a SparkConf and the batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 6789)
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Test by running nc -lk 6789 and typing words into it.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming over file system (local/HDFS) data.
 */
object FileWordCount {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    System.setProperty("hadoop.home.dir", "E:/winutils/")
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // File system sources do not need a Receiver
    val lines = ssc.textFileStream("file:///D:/BidDataTestFile/ss/")
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Note: textFileStream only picks up files newly added to the directory; files that have already been processed are ignored. New files must appear atomically, e.g. by being moved into the directory after they are fully written.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Word count with running totals across batches using Spark Streaming.
 */
object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    System.setProperty("hadoop.home.dir", "E:/winutils/")
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Stateful operators require a checkpoint directory.
    // In production, put the checkpoint in a directory on HDFS.
    // "." stores the checkpoint files in the current directory.
    ssc.checkpoint(".")
    val lines = ssc.socketTextStream("localhost", 6789)
    val result = lines.flatMap(_.split(" ")).map((_, 1))
    // updateStateByKey maintains a global count of each word across all batches
    val state = result.updateStateByKey[Int](updateFunction _)
    state.print()
    ssc.start()
    ssc.awaitTermination()
  }
  /**
   * Merge the values from the current batch into the existing (old) state.
   * @param currentValues the values for this key in the current batch
   * @param preValues the previously accumulated state
   * @return the new state
   */
  def updateFunction(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
    // Sum the counts for this key in the current batch
    val current = currentValues.sum
    val pre = preValues.getOrElse(0) // default to 0 if there is no previous state
    Some(current + pre)
  }
}
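A worked trace of updateFunction with hypothetical input: suppose the word "a" appears twice in batch 1 and once in batch 2.

updateFunction(Seq(1, 1), None)    // batch 1: current = 2, pre = 0 => Some(2)
updateFunction(Seq(1), Some(2))    // batch 2: current = 1, pre = 2 => Some(3)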
import java.sql.DriverManager

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Word count with Spark Streaming, writing the results to a MySQL database.
 */
object ForeachRDDApp {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    System.setProperty("hadoop.home.dir", "E:/winutils/")
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("ForeachRDDApp")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 6789)
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // Write the results to MySQL: connections are not serializable, so open
    // one connection per partition on the executors, not on the driver
    result.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        val connection = createConnection()
        partitionOfRecords.foreach(record => {
          val sql = "insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")"
          connection.createStatement().execute(sql)
        })
        connection.close()
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
  /**
   * Open a MySQL connection.
   * @return
   */
  def createConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_spark", "root", "MySQLKenan_07")
  }
}
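Opening a connection for every partition of every batch still adds up; the usual refinement is a connection pool plus prepared statements (which also avoid the string-concatenated SQL above). A sketch only — ConnectionPool is a hypothetical helper (for example backed by a static DataSource), not part of this project, and result is the DStream from the example above:

result.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    val connection = ConnectionPool.getConnection() // hypothetical: borrow a pooled connection
    val stmt = connection.prepareStatement("insert into wordcount(word, wordcount) values(?, ?)")
    partitionOfRecords.foreach(record => {
      stmt.setString(1, record._1)
      stmt.setInt(2, record._2)
      stmt.executeUpdate()
    })
    stmt.close()
    ConnectionPool.returnConnection(connection) // hypothetical: hand the connection back to the pool
  })
})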
Window operations: process the data from a sliding time window at a regular interval.
window length: the duration of the window
sliding interval: how often the window computation is triggered
Both parameters must be integer multiples of the batch interval.
e.g. every 10 seconds, compute a word count over the previous 10 minutes ==> sliding interval 10s, window length 10min
// window length, sliding interval
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
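A runnable version of the 10min/10s example, following the socket setup used throughout these notes (both durations are multiples of the 5-second batch interval):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("WindowWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val pairs = ssc.socketTextStream("localhost", 6789).flatMap(_.split(" ")).map((_, 1))
    // window length: 10 minutes; sliding interval: 10 seconds
    val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(600), Seconds(10))
    windowedWordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}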
Access log: 20180808,zs 20180808,ls 20180808,ww 【 ==> (zs, <20180808,zs>)】
Blacklist: zs, ls 【 ==> (zs, true)】
left join ==> (zs, (<20180808,zs>, Some(true))); records whose flag is Some(true) are filtered out, leaving only ww's record.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Blacklist filtering.
 */
object TransformApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("TransformApp")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Build the blacklist as an RDD of (name, true)
    val blacks = List("zs", "ls")
    val blacksRDD = ssc.sparkContext.parallelize(blacks).map(x => (x, true))
    val lines = ssc.socketTextStream("localhost", 6789)
    // Key each line by the name (second field); keep the full record as the value
    val clickLog = lines.map(x => (x.split(",")(1), x)).transform(rdd => {
      rdd.leftOuterJoin(blacksRDD)
        .filter(x => x._2._2.getOrElse(false) != true) // drop blacklisted names
        .map(x => x._2._1)
    })
    clickLog.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
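Test: run nc -lk 6789 and enter records such as 20180808,zs 20180808,ls 20180808,ww; only ww's record should be printed, since zs and ls are blacklisted.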
4.5 Spark Streaming integration with Spark SQL
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming integrated with Spark SQL for word counting.
 */
object SqlNetworkWordCount {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    System.setProperty("hadoop.home.dir", "E:/winutils/")
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 6789)
    val words = lines.flatMap(_.split(" "))
    // Process each batch (RDD) of the DStream with Spark SQL
    words.foreachRDD((rdd, time) => {
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()
      wordsDataFrame.createOrReplaceTempView("words")
      val wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
      println(s"=========== $time ===========")
      wordCountsDataFrame.show()
    })
    ssc.start()
    ssc.awaitTermination()
  }
  // Case class used to convert the RDD to a DataFrame
  case class Record(word: String)
  // Lazily instantiated singleton SparkSession
  object SparkSessionSingleton {
    @transient private var instance: SparkSession = _
    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }
}
Push approach: Flume collects the data and pushes it straight to Spark Streaming.
vim flume_push_streaming.conf
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444
simple-agent.sinks.avro-sink.type = avro
# for local testing, this should be the local IP
simple-agent.sinks.avro-sink.hostname = hadoop000
simple-agent.sinks.avro-sink.port = 41414
simple-agent.channels.memory-channel.type = memory
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel
Dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming integrated with Flume, approach 1 --- Push
 */
object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    //System.setProperty("hadoop.home.dir", "E:/winutils/")
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args
    // When submitting to a server via spark-submit, keep the setMaster/setAppName line commented out
    val sparkConf = new SparkConf()
    //.setMaster("local[2]").setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Integrate Spark Streaming with Flume:
    // receive the data that Flume pushes to hostname:port
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print(10)
    ssc.start()
    ssc.awaitTermination()
  }
}
Test: start the application first, then start Flume.
flume-ng agent \
--name simple-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console
Local test summary:
1> Start the Spark Streaming job
2> Start the Flume agent
3> Feed in data with telnet and watch the output in the IDEA console
Packaging: mvn clean package -DskipTests
./spark-submit \
--class com.xq.spark.examples.FlumePushWordCount \
--name FlumePushWordCount \
--master local[2] \
--executor-memory 1G \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0,org.apache.flume:flume-ng-sdk:1.6.0 \
/home/Kiku/lib/sparktrain-1.0-SNAPSHOT.jar \
hadoop000 41414
flume-ng agent \
--name simple-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console
telnet hadoop000 44444
Test passed!
Pull approach: Flume collects the data and hands it to a custom sink; Spark Streaming then pulls the data from that sink.
Dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume-sink_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.5</version>
</dependency>
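Note: in the pull-based approach the custom SparkSink runs inside the Flume agent, so the spark-streaming-flume-sink jar (along with scala-library and commons-lang3) must also be placed on the Flume agent's classpath.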
vim flume_pull_streaming.conf
simple-agent.sources = netcat-source
simple-agent.sinks = spark-sink
simple-agent.channels = memory-channel
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444
simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
simple-agent.sinks.spark-sink.hostname = hadoop000
simple-agent.sinks.spark-sink.port = 41414
simple-agent.channels.memory-channel.type = memory
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.spark-sink.channel = memory-channel
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming integrated with Flume, approach 2 --- Pull
 */
object FlumePullWordCount {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    //System.setProperty("hadoop.home.dir", "E:/winutils/")
    if (args.length != 2) {
      System.err.println("Usage: FlumePullWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args
    // When submitting to a server via spark-submit, keep the setMaster/setAppName line commented out
    val sparkConf = new SparkConf()
    //.setMaster("local[2]").setAppName("FlumePullWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Poll the SparkSink running inside the Flume agent at hostname:port
    val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Local debugging:
Start Flume first, then start the Spark application.
flume-ng agent \
--name simple-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume_pull_streaming.conf \
-Dflume.root.logger=INFO,console
Server debugging:
Package the project (note: comment out the local[2] and appName settings).
./spark-submit \
--class com.xq.spark.examples.FlumePullWordCount \
--name FlumePullWordCount \
--master local[2] \
--executor-memory 1G \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0,org.apache.flume:flume-ng-sdk:1.6.0,org.apache.spark:spark-streaming-flume-sink_2.11:2.2.0 \
/home/Kiku/lib/sparktrain-1.0-SNAPSHOT.jar \
hadoop000 41414
Receiver-based approach: a Receiver receives the data from Kafka and stores it on the Spark Executors; when a job starts, it processes the stored data.
1) Start ZooKeeper: zkServer.sh start
2) Start Kafka: kafka-server-start.sh -daemon /home/Kiku/app/kafka_2.11-0.9.0.0/config/server.properties
3) Create a topic: kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 1 --partitions 1 --topic xxxxxx
4) Verify from the console that the topic can send and receive messages:
kafka-console-producer.sh --broker-list hadoop000:9092 --topic kafka_streaming_topic
kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic kafka_streaming_topic
Dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming integrated with Kafka, approach 1 -- Receiver
 */
object KafkaReceiverWordCount {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    //System.setProperty("hadoop.home.dir", "E:/winutils/")
    if (args.length != 4) {
      System.err.println("Usage: KafkaReceiverWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf()
    //.setMaster("local[2]").setAppName("KafkaReceiverWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Create a Receiver-based Kafka input stream
    val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    // The second element of each tuple is the message payload
    messages.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Remove local[2] and the appName setting from the code, then submit:
spark-submit \
--class com.xq.spark.examples.KafkaReceiverWordCount \
--name KafkaReceiverWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
/home/Kiku/lib/sparktrain-1.0-SNAPSHOT.jar \
192.168.6.130:2181 test kafka_streaming_topic 1
Direct approach: no Receiver, with end-to-end delivery guarantees. Spark periodically queries Kafka for the latest offsets of each topic partition, and each batch processes the corresponding offset range.
Advantages: simplified parallelism — no need to create multiple input streams; a single direct stream does the work.
Zero data loss without the overhead: in the Receiver-based approach, avoiding data loss requires a write-ahead log (WAL, enabled via spark.streaming.receiver.writeAheadLog.enable) that stores the data with replication, which is inefficient. The direct approach needs no WAL, so performance improves.
Drawback: offsets are no longer committed to ZooKeeper automatically; you have to update them yourself (see the sketch after the code below).
import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming integrated with Kafka, approach 2 -- Direct
 */
object KafkaDirectWordCount {
  def main(args: Array[String]): Unit = {
    // The host has no Hadoop environment configured, so this property is needed
    //System.setProperty("hadoop.home.dir", "E:/winutils/")
    if (args.length != 2) {
      System.err.println("Usage: KafkaDirectWordCount <brokers> <topics>")
      System.exit(1)
    }
    val Array(brokers, topics) = args
    val sparkConf = new SparkConf()
    //.setMaster("local[2]").setAppName("KafkaDirectWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    // Create a direct (receiver-less) Kafka input stream
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet
    )
    // The second element of each tuple is the message payload
    messages.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
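To address the drawback noted above (offsets are not committed to ZooKeeper), the 0.8 direct API exposes each batch's offset ranges. A minimal sketch, assuming messages is the direct stream created above; where to persist the offsets (ZooKeeper, a database, ...) is left to you:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]
messages.transform(rdd => {
  // transform runs on the driver once per batch, so capturing the offsets here is safe
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}).foreachRDD(rdd => {
  offsetRanges.foreach(o =>
    // persist o.topic / o.partition / o.fromOffset / o.untilOffset to your store of choice
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
})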
Submit to the server:
spark-submit \
--class com.xq.spark.examples.KafkaDirectWordCount \
--name KafkaDirectWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
/home/Kiku/lib/sparktrain-1.0-SNAPSHOT.jar \
192.168.6.130:9092 kafka_streaming_topic