Data streams
Data streams are everywhere
Stream processing
Stream processing delivers insights faster, typically within milliseconds to seconds
Most data is produced as a never-ending stream of events
Common stream processing frameworks
Spark Streaming is an extension of the Spark Core API for processing streaming data
Highly fault tolerant
Scalable
High throughput
Low latency (1 ms as of Spark 2.3.1; about 100 ms before that)
Micro-batching: input -> split into batches -> result sets
1. Only one StreamingContext can be active in a JVM at the same time
2. Once a StreamingContext has been stopped, it cannot be restarted
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf=new SparkConf().setMaster("local[2]").setAppName("kgc streaming demo")
val ssc=new StreamingContext(conf,Seconds(8))
/*
In spark-shell, the following error appears:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM
Solutions:
Option 1: sc.stop  // stop the SparkContext started by spark-shell before creating the ssc
Option 2: create the ssc from the existing sc: val ssc = new StreamingContext(sc, Seconds(8))
*/
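Regarding rule 2 above: a stopped StreamingContext cannot be restarted, but the underlying SparkContext can be kept alive and used to build a new one. A minimal sketch, assuming the spark-shell sc:
ssc.stop(stopSparkContext = false)               // stop the streaming context but keep the SparkContext
val ssc2 = new StreamingContext(sc, Seconds(8))  // "restarting" means creating a fresh StreamingContext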
//install the nc (netcat) server
yum install nc
//data server: once the ssc has started, type test data here and watch the Spark Streaming output
nc -lk 9999
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)//specify the data source
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
A discretized stream (DStream) is the high-level abstraction provided by Spark Streaming
A DStream represents a continuous sequence of RDDs
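To make "a sequence of RDDs" concrete: each batch interval produces one RDD, which can be accessed directly through foreachRDD together with the batch time. A small sketch, reusing wordCounts from the demo above:
import org.apache.spark.rdd.RDD
wordCounts.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
  // one RDD per batch interval; time is that batch's timestamp
  println(s"batch at $time contains ${rdd.count()} records")
}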
An input DStream is a DStream that receives data from a streaming source
File systems
def textFileStream(directory: String): DStream[String]
Socket
def socketTextStream(hostname: String, port: Int, storageLevel: StorageLevel): ReceiverInputDStream[String]
Flume Sink
val ds = FlumeUtils.createPollingStream(streamCtx, [sink hostname], [sink port]);
Kafka Consumer
val ds = KafkaUtils.createStream(streamCtx, zooKeeper, consumerGrp, topicMap);
map,flatMap
filter
count, countByValue
repartition
union, join, cogroup
reduce, reduceByKey
transform
updateStateByKey
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
//DStream supports transformations similar to those on RDDs
val input1 = List((1, true), (2, false), (3, false), (4, true), (5, false))
val input2 = List((1, false), (2, false), (3, true), (4, true), (5, true))
val rdd1 = sc.parallelize(input1)
val rdd2 = sc.parallelize(input2)
val ssc = new StreamingContext(sc, Seconds(3))
import scala.collection.mutable
val ds1 = ssc.queueStream[(Int, Boolean)](mutable.Queue(rdd1))
val ds2 = ssc.queueStream[(Int, Boolean)](mutable.Queue(rdd2))
val ds = ds1.join(ds2)
ds.print()
ssc.start()
ssc.awaitTerminationOrTimeout(5000)
ssc.stop()
Transformation operator: transform
// RDD containing spam information
// created from the Hadoop API
val spamRDD = ssc.sparkContext.newAPIHadoopRDD(...)
val cleanedDStream = wordCounts.transform { rdd =>
  // join the data stream with the spam information to clean the data
  rdd.join(spamRDD).filter( /* code... */)
  // other operations...
}
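The elided parts above follow the official example. A minimal self-contained sketch of transform, assuming a static blacklist built with parallelize instead of newAPIHadoopRDD and reusing wordCounts and ssc from the word-count demo:
// blacklistRDD: words to drop from the stream (illustrative data)
val blacklistRDD = ssc.sparkContext.parallelize(Seq(("spamword", true)))
val cleaned = wordCounts.transform { rdd =>
  rdd.leftOuterJoin(blacklistRDD)                     // (word, (count, Some(true) if blacklisted))
    .filter { case (_, (_, flag)) => flag.isEmpty }   // keep only words not in the blacklist
    .map { case (word, (count, _)) => (word, count) }
}
cleaned.print()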
DStream output operators
Output operator: foreachRDD
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed on the driver
rdd.foreach { record =>
connection.send(record) // executed on the workers
}
}
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val connection = createNewConnection()
partitionOfRecords.foreach(record =>
connection.send(record))
}
}
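The first variant is the anti-pattern from the Spark programming guide: the connection is created on the driver and would have to be serialized and shipped to the workers, which normally fails. The second creates one connection per partition. The guide's further refinement is to reuse connections across batches through a static, lazily initialized pool; a sketch where ConnectionPool is an illustrative helper supplied by the application, not a library class:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool: hypothetical static, lazily created pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for reuse
  }
}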
val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// create a file-based input DStream that reads data from the file system
val lines = ssc.textFileStream("/data/input")
//split each line into words on spaces
val words = lines.flatMap(_.split(" "))
//as with RDDs, map each word to 1 and reduce by key
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
Requirement: count the cumulative word frequencies up to the present
Analysis: DStream transformations are either stateless or stateful
Stateless transformations: each batch is processed independently of previous batches
Stateful transformations: processing the current batch requires data from previous batches
updateStateByKey is a stateful transformation that can track changes of state
Implementation points
Define the state: the state data can be of any type
Define the state update function: its parameters are the previous state and the new data from the stream
Key code: UpdateStateByKeyDemo.scala
//define the state update function
def updateFunction(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
val curr = currentValues.sum
val pre = preValues.getOrElse(0)
Some(curr + pre)
}
val sparkConf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
//set a checkpoint directory to store the state data (required by updateStateByKey)
ssc.checkpoint(".")
val lines = ssc.socketTextStream("localhost", 6789)
val result = lines.flatMap(_.split(" ")).map((_, 1))
val state = result.updateStateByKey(updateFunction)
state.print()
ssc.start()
ssc.awaitTermination()
case class Word(word: String)
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("NetworkSQLWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val spark=SparkSession.builder.config(sparkConf).getOrCreate()
val lines = ssc.socketTextStream("localhost", 6789)
val result = lines.flatMap(_.split(" "))
result.print()
result.foreachRDD(rdd => {
if (rdd.count() != 0) {
import spark.implicits._
//convert the RDD to a DataFrame
val df = rdd.map(x => Word(x)).toDF
df.createOrReplaceTempView("tb_word")
spark.sql("select word, count(*) from tb_word group by word").show
}})
ssc.start()
ssc.awaitTermination()
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
vi flumeStream
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop01
a1.sinks.k1.port = 55555
package flume
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.{FlumeUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object FlumePushDemo extends App {
//create a StreamingContext
val ssc = new StreamingContext(new SparkConf().setAppName("testflume").setMaster("local[*]"),Seconds(5))
//create a push-based flume stream: Spark Streaming receives Avro events on port 55555
val flumeStream = FlumeUtils.createStream(ssc,"hadoop01",55555)
flumeStream.map(x=>new String(x.event.getBody.array()))
.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).print()
ssc.start()
ssc.awaitTermination()
}
How to run
spark-submit --class flume.FlumePushDemo spark_day05-1.0-SNAPSHOT.jar
flume-ng agent -c /opt/install/flume/conf/ -f /opt/install/flume/conf/job/flumeStream -n a1
telnet hadoop01 44444
vi streaming_pull_flume.conf
agent.sources = s1
agent.channels = c1
agent.sinks = sk1
#set the source type to netcat and attach it to channel c1
agent.sources.s1.type = netcat
agent.sources.s1.bind = hadoop01
agent.sources.s1.port = 44444
agent.sources.s1.channels = c1
#SparkSink: requires spark-streaming-flume-sink_2.11-x.x.x.jar in the flume lib directory
agent.sinks.sk1.type=org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.sk1.hostname=hadoop01
agent.sinks.sk1.port=55555
agent.sinks.sk1.channel = c1
#channel settings
#memory channel
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
package flume
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object FlumePushDemo2 extends App {
//create a StreamingContext
val ssc = new StreamingContext(new SparkConf().setAppName("testflume").setMaster("local[*]"),Seconds(5))
//create a pull-based flume stream: Spark Streaming polls the SparkSink on port 55555
FlumeUtils.createPollingStream(ssc,"hadoop01",55555).map(x=>new String(x.event.getBody.array()))
.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).print()
ssc.start()
ssc.awaitTermination()
}
How to run
flume-ng agent -c /opt/install/flume/conf/ -f /opt/install/flume/conf/job/streaming_pull_flume.conf -n agent
spark-submit --class flume.FlumePushDemo2 spark_day05-1.0-SNAPSHOT.jar
telnet hadoop01 44444
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.0.0</version>
</dependency>
package cn.kafaka
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkKafkaDirectDemo extends App {
//Kafka consumer configuration
val kafkaParams = Map(
(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG->"hadoop01:9092"),
(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG->"org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG->"org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.GROUP_ID_CONFIG->"kafka_01")
)
val ssc = new StreamingContext(new SparkConf().setAppName("testkafka").setMaster("local[*]"),Seconds(5))
val message: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe(Set("testPartition"), kafkaParams))
val value: DStream[(String, Int)] = message.map(x=>x.value()).flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_)
value.print()
ssc.start()
ssc.awaitTermination()
}
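The demo above relies on the Kafka consumer's automatic offset commits. For at-least-once output semantics, the kafka-0-10 integration also lets the application commit offsets itself after its output has completed; a sketch based on the official integration guide:
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
message.foreachRDD { rdd =>
  // the offset ranges covered by this batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the results somewhere, then commit the offsets asynchronously
  message.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}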
Reduce the batch processing time
Set an appropriate batch interval
Memory tuning
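As an illustration of these three points, a sketch of common Spark Streaming settings (the concrete values are assumptions and must be tuned per workload):
val tunedConf = new SparkConf().setAppName("TunedStreaming").setMaster("local[*]")
  // reduce batch processing time: enable backpressure and cap the per-partition Kafka ingest rate
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // example value, records/sec per partition
  // memory tuning: Kryo serialization reduces the footprint of received and cached data
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// batch interval: pick it large enough that each batch finishes processing within the interval
val tunedSsc = new StreamingContext(tunedConf, Seconds(5))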