Using an Arbitrary Data Source as a Stream in Spark Streaming

Motivation

When a stream-processing problem comes up in engineering work, the usual choices are Spark Streaming or Storm. Storm ingests streams through Spouts, while Spark Streaming ingests them as DStreams. For easier local testing I went with Spark Streaming, but it only supports the sources shown below out of the box. When some other high-throughput data source has to be consumed as a stream, the protagonist of this article, the Receiver, enters the stage:

[Figure 1: the input sources Spark Streaming supports out of the box]


The key class

Receiver:

Receiver is a mechanism built into Spark: subclass Receiver to define a custom data source, then pass an instance to the StreamingContext's receiverStream method, and the received data is turned into RDDs, so the rest of the job looks just like a normal Spark Streaming job consuming Kafka or Flume. Under the hood, what receiverStream returns is a ReceiverInputDStream.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(storageLevel: StorageLevel) extends Receiver[String](storageLevel) {
    def onStart() {
        // Setup stuff (start threads, open sockets, etc.) to start receiving data.
        // Must start new thread to receive data, as onStart() must be non-blocking.

        // Call store(...) in those threads to store received data into Spark's memory.

        // Call stop(...), restart(...) or reportError(...) on any thread based on how
        // different errors need to be handled.

        // See corresponding method documentation for more details
    }

    def onStop() {
        // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
    }
}


Two methods have to be implemented here: onStart and onStop. onStart holds the actual logic of your data source. Per the official docs, in onStart you start threads and open sockets to begin receiving data; a new thread must be started for the receiving work, because onStart() is required to be non-blocking. Those threads call store() to put the received data into Spark's memory as stream content; store() is provided by Receiver, so you do not implement it yourself. Note that the client you connect with must itself be non-blocking: connecting to several ports at once, or sharing a key that only one thread may consume, will otherwise raise exceptions.
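The contract above (spawn a thread in onStart, call store() from it, return immediately) can be sketched without Spark at all. In this hedged sketch, MockReceiver is a hypothetical stand-in and store() is mocked with a thread-safe queue playing the role of Spark's memory:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch}

// MockReceiver is a hypothetical stand-in for Spark's Receiver:
// store() just appends to a queue instead of Spark's block store.
class MockReceiver {
  val stored = new ConcurrentLinkedQueue[String]()
  private val done = new CountDownLatch(1)

  def store(record: String): Unit = stored.add(record)

  // Like onStart(): must return immediately, so the "receive loop"
  // runs on its own thread and calls store() from there.
  def onStart(records: Seq[String]): Unit = {
    new Thread("mock-receiver") {
      override def run(): Unit = {
        records.foreach(store)
        done.countDown()
      }
    }.start()
  }

  def awaitDone(): Unit = done.await()
}
```

In a real Receiver the loop would read from a socket or client instead of a fixed Seq, but the threading handoff is exactly the same.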


Implementation

The Spark Streaming driver:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // appName, interval, host, port and key are supplied elsewhere (e.g. from args)
    val sparkConf = new SparkConf().setAppName(appName)
    val ssc = new StreamingContext(sparkConf, Seconds(interval.toInt))
    val stream = ssc.receiverStream(new MyReceiver(host, port, key))
    stream.foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        partition.foreach(line => {
          println(line)
        })
      })
    })

    try {
      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        e.printStackTrace()
    }
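One related knob worth knowing (an assumption to verify against your Spark version, not part of the original code): the graceful-shutdown flag lets a SIGTERM drain in-flight batches and run the receiver's onStop before the context dies.

```scala
// Set on the SparkConf before the StreamingContext is created:
// on JVM shutdown, stop receivers first, then finish queued batches.
sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")
```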


The MyReceiver class:

Roughly: onStart starts a thread that runs the receive() method; receive() initializes the connection to your data server, gets data, and passes each record to store(), which puts it into Spark's memory. Normally the loop in receive() can simply be while (true), unless you are doing time-limited stream processing (which is rare).

1) onStop can be left empty; the main thing is to implement onStart.

2) Adjust the StorageLevel to suit your own server environment.

3) If the client is non-blocking, you can also start multiple threads in onStart to raise throughput.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver


class MyReceiver(host: String, port: String, key: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receiving must happen off the calling thread: onStart() has to return quickly.
    new Thread("Socket Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up here; the receive loop exits once isStopped() is true.
  }

  // MyClient stands in for any client connection
  private def receive(): Unit = {
    var client: MyClient = null
    try {
      client = new MyClient(host, port)
    } catch {
      case e: Exception =>
        // Let Spark restart the receiver after the configured delay.
        restart("MyClient connection failed!", e)
        return
    }

    while (!isStopped()) {
      try {
        val message = client.get(key)
        if (message != null) store(message)
      } catch {
        case e: Exception =>
          e.printStackTrace()
      }
    }
  }

}
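Tip 3 above (several threads in onStart) amounts to fanning multiple workers into one store(). A Spark-free sketch of the pattern, with store() again mocked by a concurrent queue and all names illustrative; in a real Receiver the inherited store(), which may be called from multiple threads, would be used instead:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Several worker threads feeding one mocked store().
object MultiThreadSketch {
  val stored = new ConcurrentLinkedQueue[String]()

  def store(record: String): Unit = stored.add(record)

  // onStart-style fan-out: one thread per source partition / connection.
  def startWorkers(partitions: Seq[Seq[String]]): Seq[Thread] =
    partitions.map { part =>
      val t = new Thread { override def run(): Unit = part.foreach(store) }
      t.start()
      t
    }
}
```

Callers would keep the returned threads and join them (or signal them to stop) during shutdown, mirroring what onStop should do.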


Tips:

Spark's own source contains two concrete Receiver implementations, RawNetworkReceiver and SocketReceiver; if you are interested, you can write your own by referring to the docs and the code above. The core is always the same: onStart defines how the data source is attached.
