Day 6 -- Kafka API -- Spark Streaming -- DStream

Table of Contents

  • Day 6 -- Kafka API -- Spark Streaming -- DStream
      • 1. Kafka API
          • Producer
          • Consumer
          • Custom Partitioner
      • 2. Kafka File Storage Mechanism
          • Basic Structure of Kafka File Storage
          • Kafka Partition Segments
          • How Kafka Looks Up a Message
      • 3. Introduction to Spark Streaming
          • What Spark Streaming Is
          • Advantages of Spark Streaming
      • 4. DStream
          • Introduction to DStream
          • DStream Operations
      • 5. DStream Examples
          • Installing netcat
          • Implementing WordCount
          • Per-Batch Accumulation with updateStateByKey
          • The transform Primitive
          • Reading Data from Kafka and Computing on It
          • Window Operations

1. Kafka API

Producer
import java.util.Properties

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object ProducerDemo {
  def main(args: Array[String]): Unit = {

    // producer configuration
    val prop = new Properties()
    // broker list
    prop.put("metadata.broker.list","cdhnocms01:9092,cdhnocms02:9092,cdhnocms03:9092")
    // serializer class for the message payload
    prop.put("serializer.class","kafka.serializer.StringEncoder")
    // acknowledgement level for sent messages (0 = do not wait for acks)
    prop.put("request.required.acks","0")
    // default partitioner class
//    prop.put("partitioner.class", "kafka.producer.DefaultPartitioner")
    // custom partitioner class
    prop.put("partitioner.class", "cry.day05.kafka.CustomPartitioner")

    // create the ProducerConfig object
    val config: ProducerConfig = new ProducerConfig(prop)
    // create the producer
    val producer: Producer[String, String] = new Producer(config)

    // topic that will receive the data
    val topic = "testapi"

    // generate some test data
    for(i <- 1 to 1000){
      val msg = s"producer send data:$i"
      producer.send(new KeyedMessage[String,String](topic, msg))
      Thread.sleep(500)
    }

    // release the producer's resources
    producer.close()
  }
}
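
Unless the broker auto-creates topics (auto.create.topics.enable is true by default), the testapi topic has to exist before the producer runs. It can be created with the kafka-topics.sh script that ships with Kafka; the partition and replication counts below are only example values:

kafka-topics.sh --create --zookeeper cdhnocms01:2181,cdhnocms02:2181,cdhnocms03:2181 --topic testapi --partitions 3 --replication-factor 2
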
Consumer
import java.util.Properties
import java.util.concurrent.{ExecutorService, Executors}

import kafka.consumer.{Consumer, ConsumerConfig, KafkaStream}

import scala.collection.mutable

class ConsumerDemo(val consumer: String, val stream: KafkaStream[Array[Byte], Array[Byte]]) extends Runnable{
  override def run(): Unit = {
    // iterate over the messages of the stream assigned to this thread
    val it = stream.iterator()
    while (it.hasNext()){
      val data = it.next()
      val topic = data.topic
      val partition = data.partition
      val offset = data.offset
      val msg = new String(data.message())
      println(s"consumer:$consumer, topic:$topic, partition:$partition, offset:$offset, msg:$msg")
    }
  }
}

object ConsumerDemo {
  def main(args: Array[String]): Unit = {
    val topic = "testapi"

    // map that can hold several topics
    // key: topic name, value: number of threads used to consume that topic
    val topics = new mutable.HashMap[String, Int]()
    topics.put(topic, 2)

    // consumer configuration
    val prop = new Properties()
    // ZooKeeper list
    prop.put("zookeeper.connect","cdhnocms01:2181,cdhnocms02:2181,cdhnocms03:2181")
    // consumer group id
    prop.put("group.id", "xxx")
    // where to start when no valid offset is available
    prop.put("auto.offset.reset", "smallest")

    // create the consumer configuration object
    val config = new ConsumerConfig(prop)

    // create the consumer connector (one per process)
    val consumer = Consumer.create(config)

    // fetch the streams: the returned map has the topic name as key and its list of streams as value
    val streams: collection.Map[String, List[KafkaStream[Array[Byte], Array[Byte]]]] = consumer.createMessageStreams(topics)

    // streams of the requested topic (empty list if the topic is missing)
    val streamList: List[KafkaStream[Array[Byte], Array[Byte]]] = streams.getOrElse(topic, Nil)

    // fixed-size thread pool, one thread per stream
    val pool: ExecutorService = Executors.newFixedThreadPool(3)

    for(i <- streamList.indices){
      pool.execute(new ConsumerDemo(s"Consumer:$i", streamList(i)))
    }
  }
}
Custom Partitioner
import kafka.producer.Partitioner
import kafka.utils.{Utils, VerifiableProperties}
// 1. a custom partitioner must extend the Partitioner trait
// 2. its primary constructor must take a parameter of type VerifiableProperties, otherwise instantiation fails
class CustomPartitioner(props: VerifiableProperties = null) extends Partitioner{
  override def partition(key: Any, numPartitions: Int): Int = {
    // hash the message key onto one of the partitions
    Utils.abs(key.hashCode) % numPartitions
  }
}
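
Note that the old Scala producer only consults the partitioner when a message carries a key; the keyless messages sent by the producer demo above are distributed without going through CustomPartitioner. A minimal sketch of sending keyed messages so the partitioner actually receives a non-null key (the key format here is just an illustration):

// inside the producer loop: pass an explicit key as the second argument
producer.send(new KeyedMessage[String, String](topic, s"key-$i", s"producer send data:$i"))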

2. Kafka File Storage Mechanism

Basic Structure of Kafka File Storage

In Kafka's file storage, the data of a topic is spread over several partitions. Each partition is a directory named after the topic plus an ordered sequence number: the first partition is numbered 0 and the largest number is the partition count minus 1 (for example, a topic testapi with 3 partitions yields the directories testapi-0, testapi-1 and testapi-2).

Each partition behaves like one huge file that is divided into multiple equally sized segment data files. The number of messages per segment file is not necessarily the same; this property makes it easy to delete old segment files quickly. By default data is retained for 7 days.

Each partition only needs to support sequential reads and writes. The lifecycle of a segment file, i.e. when it is created and when it is deleted, is determined by broker-side configuration parameters.

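As a rough illustration, the broker-side settings that govern retention and segment rolling look like this (the values shown are the usual defaults, given as examples rather than recommendations):

# server.properties (illustrative values)
log.retention.hours=168        # keep data for 7 days
log.segment.bytes=1073741824   # roll a new segment once the current one reaches 1 GB
log.roll.hours=168             # also roll a segment after this much time has passed
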
Kafka Partition Segments

A segment consists of two parts, an index file and a data file, which correspond to each other one to one: the .index suffix marks the segment index file and the .log suffix marks the segment data file.

[Figure 1]

Segment file naming rule: the first segment of a partition starts at 0, and every subsequent segment is named after the offset of the last message of the previous segment. The name is the offset value, a 64-bit long, left-padded with zeros to 20 decimal digits.

The index file stores a large amount of metadata while the data file stores the messages themselves; each metadata entry in the index file points to the physical offset of the corresponding message in the data file.

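A quick sketch of the segment naming scheme described above, just to illustrate the zero padding (segmentFileName is a made-up helper, not a Kafka API):

// format a base offset as a 20-digit, zero-padded segment file name
def segmentFileName(baseOffset: Long, suffix: String): String =
  f"$baseOffset%020d.$suffix"

segmentFileName(0L, "log")        // 00000000000000000000.log
segmentFileName(368769L, "index") // 00000000000000368769.index
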
How Kafka Looks Up a Message

Reading the message with offset=368776 from the files in the figure above takes two steps.

Step 1: locate the segment file.

00000000000000000000.index is the very first file; its starting offset is 1.

The starting offset of 00000000000000368769.index is 368770 = 368769 + 1.

The starting offset of 00000000000000737337.index is 737338 = 737337 + 1.

Subsequent files follow the same pattern.

Because the files are named after these starting offsets and kept in order, a binary search over the file list by offset quickly locates the right file; for offset=368776 this lands on 00000000000000368769.index and its corresponding .log file.

Step 2: for offset=368776, locate the metadata entry in 00000000000000368769.index and the physical position in 00000000000000368769.log that it points to, then scan 00000000000000368769.log sequentially from there until the message with offset=368776 is reached.

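A rough sketch of the segment-location step in Scala; baseOffsets stands in for the sorted base offsets of the segment files and is not part of any Kafka API:

// sorted base offsets of the segment files in one partition directory
val baseOffsets = Vector(0L, 368769L, 737337L)

// step 1: binary search for the last segment whose base offset is below the target offset
def findSegment(target: Long): Long = {
  var lo = 0
  var hi = baseOffsets.length - 1
  while (lo < hi) {
    val mid = (lo + hi + 1) / 2
    if (baseOffsets(mid) < target) lo = mid else hi = mid - 1
  }
  baseOffsets(lo)
}

findSegment(368776L)  // 368769 -> read 00000000000000368769.index / .log
// step 2 (not shown): use the sparse index entries to jump close to the message,
// then scan the .log file sequentially until offset 368776 is reached
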
3. Introduction to Spark Streaming

What Spark Streaming Is

Spark Streaming is used for processing streaming data; it offers high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume and ZeroMQ. Once the data has been ingested, it can be processed with Spark's high-level primitives such as map, reduce, join and window, and the results can be written to many destinations, such as HDFS or a database.

[Figure 2]

Advantages of Spark Streaming

Spark Streaming is easy to use, fault tolerant and integrates easily with the rest of the Spark ecosystem.

[Figure 3]

4. DStream

Introduction to DStream

Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data, either the input stream or the stream that results from applying Spark primitives to it. Internally, a DStream is represented as a sequence of consecutive RDDs, each holding the data of one time interval, as shown below.

[Figure 4]

Operations on the data are applied RDD by RDD.

[Figure 5]

The computation itself is carried out by the Spark engine.

[Figure 6]

[Figure 7]

DStream Operations

The primitives on a DStream are similar to RDD operators and fall into two groups: Transformations and Output Operations. Among the transformations there are also some special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.

| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream. |
| count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel. |
| countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |

Special Transformations

  1. updateStateByKey Operation

    The updateStateByKey primitive keeps state across batches. Without updateStateByKey, each batch is analyzed and its result output independently, and nothing is retained once the result has been emitted.

  2. transform Operation

    The transform primitive allows arbitrary RDD-to-RDD functions to be applied to a DStream. It is a convenient way to extend the Spark API.

  3. window Operations

    By setting the window length and the sliding interval you can dynamically obtain the state of the stream over the current window.

    [Figure 8]

Output Operations write a DStream's data to external databases or file systems. Only when an Output Operation is invoked (analogous to an RDD action) does the streaming program actually start the computation.

| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. |
| saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
| saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
| saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
| foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
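
The foreachRDD operator is usually combined with foreachPartition so that one connection to the external system is opened per partition rather than per record. A minimal sketch, assuming res is a DStream[(String, Int)] like in the examples below and SomeSink is a stand-in for whatever client library is actually used:

// write each batch to an external store, one connection per partition
res.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    val conn = SomeSink.connect()                          // hypothetical client, shown only for illustration
    iter.foreach { case (word, count) => conn.put(word, count) }
    conn.close()
  }
}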

5. DStream Examples

Installing netcat

A DStream needs a live data source. The netcat tool gives us a simple way to type data in interactively.

Install it as the root user:

yum install -y nc

Start netcat as root, listening on port 6666:

nc -lk 6666

Once started, the window waits for input.

Implementing WordCount
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // create the Spark Streaming context with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    // receive data from the netcat server
    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("cdhnocms01", 6666)

    // process the data: split into words and count them per batch
    val res: DStream[(String, Int)] = dStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)

    // print the result to the console
    res.print()

    // start the streaming job
    ssc.start()

    // block and keep processing the following batches
    ssc.awaitTermination()
  }
}

Type data into the netcat window; Spark Streaming processes one batch of data every five seconds.
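
Typing, for example, hello world hello into the netcat window within one batch interval produces output along these lines (the timestamp will of course differ):

-------------------------------------------
Time: 1543305600000 ms
-------------------------------------------
(hello,2)
(world,1)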

Per-Batch Accumulation with updateStateByKey

The WordCount above only counts the words of each individual batch; it does not add them to the values of the same key from earlier batches. To accumulate across batches we need the special DStream primitive updateStateByKey. Note that this primitive requires a checkpoint directory, otherwise an error is thrown.

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingAccWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingAccWordCount").setMaster("local[2]")
    // create the Spark Streaming context directly from the SparkConf
    val ssc = new StreamingContext(conf, Seconds(5))

    // set a checkpoint directory, preferably on HDFS
    // checkpointing is mandatory whenever the results of earlier batches must be kept
    // besides RDD metadata and lineage, the checkpoint also stores the accumulated results
    ssc.checkpoint("hdfs://cdhnocms01:8020/out/cp-20181127-1")

    // receive data from netcat
    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("cdhnocms01", 6666)

    // split into words and pair each word with 1
    val tups: DStream[(String, Int)] = dStream.flatMap(_.split(" ")).map((_, 1))
    // update the running count of every word with the values of the current batch
    val res: DStream[(String, Int)] = tups.updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)

    res.print()

    ssc.start()
    ssc.awaitTermination()
  }

  // updateStateByKey takes a function that combines the historical state with the current batch
  // the iterator holds tuples of:
  // String: the key, i.e. each word
  // Seq[Int]: the values of that key in the current batch, e.g. Seq(1,1,1,1)
  // Option[Int]: the accumulated result of the same key from earlier batches, which may or may not exist
  // use getOrElse when reading the historical value
  val func = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    it.map(tup => {
      (tup._1, tup._2.sum + tup._3.getOrElse(0))
    })
  }
}

The transform Primitive

The transform primitive lets you operate on any RDD inside a DStream directly.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TransformDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("cdhnocms01", 6666)

    // use transform to apply ordinary RDD operations to each RDD of the DStream
    val res: DStream[(String, Int)] = dStream.transform(rdd => {
      rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    })
    res.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Reading Data from Kafka and Computing on It
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Consume a Kafka topic and implement WordCount
  * with accumulation across batches.
  */
object LoadKafkaDataAndWC {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LoadKafkaDataAndWC").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    ssc.checkpoint("hdfs://cdhnocms01:8020/out/cp-20181127-2")

    // read the Kafka connection parameters from the command-line arguments
    val Array(zkQuorum, group, topics, numThreads) = args

    // put every topic into a map: topic name -> number of receiver threads
    val topicMap: Map[String, Int] = topics.split(",").map((_, numThreads.toInt)).toMap

    // use the receiver-based KafkaUtils API to consume the Kafka data
    val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK)

    // the stream contains (message key, message value) pairs;
    // the key is not needed here, so keep only the value
    val lines: DStream[String] = data.map(_._2)

    // process the data
    val tups: DStream[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1))
    val res: DStream[(String, Int)] = tups.updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)

    res.print()
    ssc.start()
    ssc.awaitTermination()
  }

  // combine the accumulated count of each word with the values of the current batch
  val func = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    it.map{
      case (x, y, z) => {
        (x, y.sum + z.getOrElse(0))
      }
    }
  }
}
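
For reference, the job could be launched roughly as follows, passing the ZooKeeper list, consumer group, topic list and thread count as arguments (the jar name and group are placeholders, and the spark-streaming-kafka dependency is assumed to be packaged into the jar):

spark-submit --master local[2] --class LoadKafkaDataAndWC streaming-demo.jar \
  cdhnocms01:2181,cdhnocms02:2181,cdhnocms03:2181 group01 testapi 2
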
Window Operations

[Figure 9]

Use a window operation to implement WordCount with a batch interval of 5 seconds and a window length and sliding interval of 10 seconds.

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowOprationsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowOprationsDemo").setMaster("local[2]")
    // batch interval of 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))

    ssc.checkpoint("hdfs://cdhnocms01:8020/out/cp-20181127-3")

    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("cdhnocms01", 6666)

    val tups: DStream[(String, Int)] = dStream.flatMap(_.split(" ")).map((_, 1))

    // accumulate word counts across batches
    val update: DStream[(String, Int)] = tups.updateStateByKey(func, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)

    // aggregate over a 10-second window that slides every 10 seconds
    val res: DStream[(String, Int)] = update.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(10), Seconds(10))

    res.print()
    ssc.start()
    ssc.awaitTermination()
  }

  val func = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    it.map{
      case (x, y, z) => {
        (x, y.sum + z.getOrElse(0))
      }
    }
  }
}
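
reduceByKeyAndWindow also has a variant that takes an inverse reduce function; each window is then computed incrementally from the previous one instead of from scratch, which pays off when windows are long and overlap heavily (this form requires checkpointing, which the example above already enables). A sketch using the same tups stream:

// incremental window aggregation: add batches entering the window, subtract batches leaving it
val resInc: DStream[(String, Int)] = tups.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y,   // reduce function
  (x: Int, y: Int) => x - y,   // inverse reduce function
  Seconds(10),                 // window length
  Seconds(10)                  // sliding interval
)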
