Spark Streaming and Flume Integration in Practice

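Flume is a framework for real-time log collection, and it can be connected to Spark Streaming, a real-time processing framework: Flume produces data as it arrives and Spark Streaming processes it. Spark Streaming can integrate with Flume NG in two ways. In push mode, Flume NG pushes events to Spark Streaming; in poll mode, Spark Streaming pulls events from Flume.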

1. Poll mode

(1) Install Flume 1.6 or later

(2) Download the dependency

Place spark-streaming-flume-sink_2.11-2.0.2.jar in Flume's lib directory.

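(3) Write the Flume agent. Because this is pull mode, Flume only needs to deliver data to a sink on its own machine; Spark Streaming connects to that machine to fetch it.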

(4) Write the flume-poll-spark.conf configuration file

    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # source
    a1.sources.r1.channels = c1
    a1.sources.r1.type = spooldir
    # directory where the data files are placed
    a1.sources.r1.spoolDir = /root/data
    a1.sources.r1.fileHeader = true

    # channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 20000
    a1.channels.c1.transactionCapacity = 5000

    # sink
    a1.sinks.k1.channel = c1
    a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
    a1.sinks.k1.hostname = node-1
    a1.sinks.k1.port = 8888
    a1.sinks.k1.batchSize = 2000

Flume startup command:

bin/flume-ng agent -n a1 -c conf/ -f conf/flume-poll-spark.conf -Dflume.root.logger=INFO,console

(5) Start the Spark Streaming application, which pulls data from the machine where Flume is running

(6) Code implementation

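Add the following dependency to pom.xml:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.11</artifactId>
        <version>2.0.2</version>
    </dependency>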
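For projects built with sbt rather than Maven, the same dependency might be declared roughly as follows. This is only a sketch: the `scalaVersion` value and the explicit `spark-streaming` entry are assumptions, chosen to match the `_2.11` artifact and the 2.0.2 version used above.

```scala
// build.sbt (sketch). Assumption: the project uses a Scala 2.11.x release,
// so that %% resolves to the _2.11 artifacts used in this article.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "2.0.2",
  "org.apache.spark" %% "spark-streaming-flume" % "2.0.2"
)
```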

The complete code is as follows:

    package cn.testdemo.dstream.flume

    import java.net.InetSocketAddress
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
    import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

    // Word count over Flume data with Spark Streaming (poll mode)
    object SparkStreamingFlume_Poll {
      def main(args: Array[String]): Unit = {
        // 1. Create the SparkConf
        val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingFlume_Poll").setMaster("local[2]")
        // 2. Create the SparkContext
        val sc = new SparkContext(sparkConf)
        sc.setLogLevel("WARN")
        // 3. Create the StreamingContext with a 5-second batch interval
        val ssc = new StreamingContext(sc, Seconds(5))
        // Flume addresses to poll; listing several agents receives data from all of them.
        // Each host and port must match the hostname and port of that agent's SparkSink
        // (port 8888 in the configuration file above).
        val address = Seq(new InetSocketAddress("192.168.216.120", 8888), new InetSocketAddress("192.168.216.121", 8888))
        // 4. Pull the data from Flume
        val stream: ReceiverInputDStream[SparkFlumeEvent] =
          FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK_SER_2)
        // 5. Extract the body of each Flume event  {"header":xxxxx   "body":xxxxxx}
        val lineDstream: DStream[String] = stream.map(x => new String(x.event.getBody.array(), "UTF-8"))
        // 6. Split each line into words and pair each word with 1
        val wordAndOne: DStream[(String, Int)] = lineDstream.flatMap(_.split(" ")).map((_, 1))
        // 7. Sum the counts of identical words
        val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
        // 8. Print the result
        result.print()

        // Start the computation
        ssc.start()
        ssc.awaitTermination()
      }
    }
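To see what steps 6 and 7 compute for a single batch, the same transformation can be sketched on an ordinary Scala collection, with no Spark or Flume involved. The object and method names here (`BatchWordCount`, `countWords`) are hypothetical, and `groupBy` followed by a sum stands in for `reduceByKey`.

```scala
// Plain-collections sketch of the per-batch word count (no Spark needed).
object BatchWordCount {
  // Splits each line on single spaces, pairs every word with 1,
  // then sums the 1s per word, mirroring flatMap / map / reduceByKey.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

Because the stream is processed in 5-second batches, `result.print()` shows counts per batch rather than running totals across batches.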

(7) Check the output in the IDEA console


2. Push mode

(1) Write the flume-push-spark.conf configuration file

    # push mode
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # source
    a1.sources.r1.channels = c1
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /root/data
    a1.sources.r1.fileHeader = true

    # channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 20000
    a1.channels.c1.transactionCapacity = 5000

    # sink
    a1.sinks.k1.channel = c1
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = 192.168.11.25
    a1.sinks.k1.port = 8888
    a1.sinks.k1.batchSize = 2000

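Note that the hostname and port in this configuration file are the IP address and port of the server where the Spark application runs, that is, the address on which the Spark receiver listens.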
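Since the sink's hostname and port have to agree with the values passed to the Spark program, one way to avoid a mismatch is to read them from the agent configuration instead of typing them twice. The following is a minimal sketch using `java.util.Properties`; the object name `SinkEndpoint` and its `fromConfig` method are hypothetical helpers, not part of Spark or Flume.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical helper: extracts a sink's hostname and port from the text
// of a Flume agent configuration, which uses Java properties syntax.
object SinkEndpoint {
  def fromConfig(configText: String, agent: String, sink: String): (String, Int) = {
    val props = new Properties()
    props.load(new StringReader(configText))
    val prefix = s"$agent.sinks.$sink"
    val host = props.getProperty(s"$prefix.hostname")
    val port = props.getProperty(s"$prefix.port")
    require(host != null && port != null, s"missing hostname or port for $prefix")
    // Properties keeps trailing whitespace in values, so trim before use.
    (host.trim, port.trim.toInt)
  }
}
```

The returned pair could then be handed to `FlumeUtils.createStream(ssc, host, port)` so that the agent and the application always use the same endpoint.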

Start Flume:

bin/flume-ng agent -n a1 -c conf/ -f conf/flume-push-spark.conf -Dflume.root.logger=INFO,console

(2) The code is as follows:

    package cn.testdemo.dstream.flume

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
    import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

    // Word count over Flume data with Spark Streaming (push mode)
    object SparkStreamingFlume_Push {

      def main(args: Array[String]): Unit = {
        // 1. Create the SparkConf
        val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingFlume_Push").setMaster("local[2]")
        // 2. Create the SparkContext
        val sc = new SparkContext(sparkConf)
        sc.setLogLevel("WARN")
        // 3. Create the StreamingContext with a 5-second batch interval
        val ssc = new StreamingContext(sc, Seconds(5))
        // 4. Receive the data pushed by Flume. The host and port must match the
        //    hostname and port of the avro sink (192.168.11.25:8888 in the configuration file above).
        val stream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(ssc, "192.168.11.25", 8888)
        // 5. Extract the body of each Flume event  {"header":xxxxx   "body":xxxxxx}
        val lineDstream: DStream[String] = stream.map(x => new String(x.event.getBody.array(), "UTF-8"))
        // 6. Split each line into words and pair each word with 1
        val wordAndOne: DStream[(String, Int)] = lineDstream.flatMap(_.split(" ")).map((_, 1))
        // 7. Sum the counts of identical words
        val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
        // 8. Print the result
        result.print()
        // Start the computation
        ssc.start()
        ssc.awaitTermination()
      }
    }
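Both programs decode the event body with `x.event.getBody.array()`. `getBody` returns a `java.nio.ByteBuffer`, and `array()` hands back the entire backing array, which equals the body only when the buffer spans that array exactly. A more defensive decoding, sketched below under the assumption that the body is UTF-8 text, copies just the bytes between the buffer's position and limit. The names `BodyDecoding` and `decodeBody` are hypothetical.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical helper: decodes only the readable region of a ByteBuffer.
object BodyDecoding {
  def decodeBody(body: ByteBuffer): String = {
    // duplicate() shares the content but has its own position and limit,
    // so reading from the copy leaves the caller's buffer untouched.
    val view = body.duplicate()
    val bytes = new Array[Byte](view.remaining())
    view.get(bytes)
    new String(bytes, StandardCharsets.UTF_8) // assumption: UTF-8 encoded text
  }
}
```

In the streaming code this could replace the inline conversion, for example `stream.map(x => BodyDecoding.decodeBody(x.event.getBody))`.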

                

(3) Run

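Start the Spark program first, then start Flume. In push mode the Spark receiver is the side that listens on the port, so it must be running before Flume's avro sink tries to connect.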
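When the two processes are started from a script, the ordering can be enforced by checking that the receiver's port accepts connections before Flume is launched. The sketch below uses a plain TCP connection attempt; `PortCheck`, `isListening`, and `waitUntilListening` are hypothetical names, and the timeout and retry values are arbitrary defaults.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper: probes a TCP port to see whether something is listening.
object PortCheck {
  // Returns true if a TCP connection to host:port succeeds within timeoutMs.
  def isListening(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: IOException => false // refused, unreachable, or timed out
    } finally {
      socket.close()
    }
  }

  // Retries up to `attempts` times, sleeping delayMs between failed probes.
  def waitUntilListening(host: String, port: Int, attempts: Int = 30, delayMs: Long = 1000L): Boolean =
    (1 to attempts).exists { i =>
      val up = isListening(host, port)
      if (!up && i < attempts) Thread.sleep(delayMs)
      up
    }
}
```

With the configuration used in this article, a launcher script would call `PortCheck.waitUntilListening("192.168.11.25", 8888)` and start the Flume agent only once it returns true.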

(4) Check the output in the IDEA console
