Reference: Spark Streaming Programming Guide (official documentation): http://spark.apache.org/docs/2.0.0-preview/streaming-programming-guide.html
The code in this article is written in Scala.
The overall flow breaks down into the following steps.
When implementing a custom receiver, keep the following points in mind.
A custom receiver is essentially the client side of socket programming: it connects to a server at a specific IP address and port and receives the data that server sends. The IP and port therefore have to track whatever socket server is acting as the data source.
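To make the client analogy concrete, here is a minimal plain-Scala sketch of the same pattern, with no Spark involved: connect to a host and port, then read lines until the server closes the connection. The tiny in-process server used for the demo is illustrative only.

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import java.net.{ServerSocket, Socket}
import java.nio.charset.StandardCharsets

object SocketClientSketch {
  // Read lines from host:port until the server closes the connection
  def readLines(host: String, port: Int): List[String] = {
    val socket = new Socket(host, port)
    try {
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      // readLine() returns null once the stream ends
      Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    } finally socket.close()
  }

  def main(args: Array[String]): Unit = {
    // Tiny in-process server, for demonstration only
    val server = new ServerSocket(0) // port 0: pick any free port
    new Thread {
      override def run(): Unit = {
        val s = server.accept()
        val out = new PrintWriter(s.getOutputStream, true)
        out.println("hello")
        out.println("world")
        s.close()
        server.close()
      }
    }.start()

    val lines = readLines("localhost", server.getLocalPort)
    println(lines.mkString(","))
  }
}
```

A Spark receiver follows the same read loop, except that each line is handed to `store()` instead of being collected locally.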
The class to extend is org.apache.spark.streaming.receiver.Receiver:
class CustomReceiver(host: String, port: Int, isTimeOut: Boolean, sec: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
Constructor parameters: host: String and port: Int are required, where host is the data-source server's IP (or hostname) and port is its port number. The remaining parameters are optional: isTimeOut: Boolean indicates whether to enable a read timeout (true enables it, false disables it), and sec: Int is the timeout in seconds.
Overridden methods: two methods must be overridden, onStart() and onStop().
override def onStart(): Unit = {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    override def run() { receive() }
  }.start()
}

override def onStop() {
  // There is nothing much to do as the thread calling receive()
  // is designed to stop by itself if isStopped() returns false
}

In this code, onStart() starts the thread that receives data over a connection; that thread calls receive(), which connects to the data-source server, reads its data, and pushes it to Spark.
onStop() does not need to do anything.
The thread started by onStart() runs receive(), which receives the data sent by the data-source server and pushes it to Spark. It is implemented as follows:
/** Create a socket connection and receive data until receiver is stopped */
private def receive() {
  val _pattern: String = "yyyy-MM-dd HH:mm:ss SSS"
  val format: SimpleDateFormat = new SimpleDateFormat(_pattern)
  val _isTimeOut: Boolean = isTimeOut
  val _sec: Int = sec
  var socket: Socket = null
  var userInput: String = null
  try {
    // Connect to host:port
    socket = new Socket(host, port)
    println(format.format(new Date()))
    println("Connection established\n")
    // setSoTimeout expects milliseconds, so convert from seconds
    if (_isTimeOut) socket.setSoTimeout(_sec * 1000)
    // Until stopped or connection broken continue reading
    val reader = new BufferedReader(
      new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
    userInput = reader.readLine()
    while (!isStopped && userInput != null) {
      println(userInput)
      store(userInput)
      userInput = reader.readLine()
    }
    reader.close()
    socket.close()
    // Restart in an attempt to connect again when server is active again
    restart("Trying to connect again")
  } catch {
    case e: java.net.ConnectException =>
      // restart if could not connect to server
      restart("Error connecting to " + host + ":" + port, e)
    case t: Throwable =>
      // restart if there is any other error
      restart("Error receiving data", t)
  }
}

The key point is the call to store(userInput), inherited from the parent class Receiver: it pushes the data to Spark, where userInput is one line of data received from the custom data-source server. The rest of the code is ordinary socket client programming.
/**
 * A custom socket server that sends messages to CustomReceiver.
 */
class CustomServer(port: Int, isTimeOut: Boolean, sec: Int) {
  val _pattern: String = "yyyy-MM-dd HH:mm:ss SSS"
  val format: SimpleDateFormat = new SimpleDateFormat(_pattern)
  val _isTimeOut = isTimeOut
  val _sec: Int = sec
  val _port = port
  // Restart counter; a field so it survives across sServer() calls
  var tryingCreateServer = 1

  def onStart(): Unit = {
    // Start the thread that serves data over a connection
    new Thread("Socket Server") {
      override def run() { sServer() }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: the thread running sServer() stops by itself
  }

  def sServer(): Unit = {
    println("----------Server----------")
    println(format.format(new Date()))
    try {
      val server = new ServerSocket(_port)
      println("Listening; waiting for a client\n")
      // setSoTimeout expects milliseconds, so convert from seconds
      if (_isTimeOut) server.setSoTimeout(_sec * 1000)
      val socket = server.accept()
      println(format.format(new Date()))
      println("Connected to a client")
      val writer = new OutputStreamWriter(socket.getOutputStream)
      println(format.format(new Date()))
      val in = new Scanner(System.in)
      // Use \n as the input delimiter (the default is whitespace)
      in.useDelimiter("\n")
      println("Enter data to send")
      var flag = in.hasNext
      while (flag) {
        val s = in.next()
        // Note: append \n when writing, otherwise the client's readLine() never returns
        writer.write(s + "\n")
        Thread.sleep(1000)
        if (socket.isClosed) {
          println("socket is closed!")
        } else {
          try {
            writer.flush()
          } catch {
            case e: java.net.SocketException =>
              println("Error: the client disconnected!")
              flag = false
              writer.close()
              socket.close()
              server.close()
              onStart()
              return
          }
        }
      }
      println(format.format(new Date()))
      println("Finished writing\n")
      // Try to re-establish the listener (at most 5 times)
      if (tryingCreateServer < 5) {
        writer.close()
        socket.close()
        server.close()
        onStart()
        tryingCreateServer += 1
      }
    } catch {
      case e: SocketTimeoutException =>
        println(format.format(new Date()) + "\nNo data for " + _sec + " seconds; shutting down\n")
        e.printStackTrace()
      case e: SocketException => e.printStackTrace()
      case e: Exception => e.printStackTrace()
    }
  }
}

object CustomServer {
  def main(args: Array[String]): Unit = {
    new CustomServer(8888, false, 0).onStart()
  }
}
val receiverInputDStream = ssc.receiverStream(new CustomReceiver("hadoop01", 8888, false, 0))
In this instantiation of the custom receiver, the first argument, "hadoop01", is the IP address or hostname of the custom data-source server; the second, 8888, is that server's port number; the third, false, disables the timeout; and the fourth, 0, is the timeout in seconds.
Note: the data-source server must be started before the Spark Streaming application. You can then type data into the server's console to test the whole pipeline end to end.
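Putting the pieces together, a minimal driver program might look like the sketch below. This assumes Spark 2.x streaming is on the classpath; the app name, batch interval, master URL, and the word-count transformation are illustrative choices, not part of the original example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CustomReceiverApp {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads, one for the receiver and one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("CustomReceiverApp")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches

    // Plug in the custom receiver: no timeout (false, 0), as in the example above
    val lines = ssc.receiverStream(new CustomReceiver("hadoop01", 8888, false, 0))

    // A simple word count over each batch, printed to the console
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Running `CustomServer` first and then this driver, each line typed into the server's console should appear in the driver's batch output within one batch interval.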