Flume is a framework for real-time log collection and can be integrated with Spark Streaming for real-time processing: Flume produces data continuously, and Spark Streaming processes it as it arrives. Spark Streaming connects to Flume NG in two ways: in push mode, Flume NG pushes messages to Spark Streaming; in pull (poll) mode, Spark Streaming pulls data from Flume.
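The two modes map to two entry points in the spark-streaming-flume module. A minimal sketch of both calls (host names and ports are placeholders matching the configurations used further below):
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

def buildStreams(ssc: StreamingContext): Unit = {
  // Pull (poll) mode: Spark runs a receiver that polls the SparkSink started inside Flume.
  val pulled: ReceiverInputDStream[SparkFlumeEvent] =
    FlumeUtils.createPollingStream(ssc, "hadoop005", 7474)

  // Push mode: Spark starts an avro receiver on this host/port and Flume's avro sink pushes events to it.
  val pushed: ReceiverInputDStream[SparkFlumeEvent] =
    FlumeUtils.createStream(ssc, "192.168.100.121", 5555)
}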
Maven dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume-sink_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
Copy spark-streaming-flume-sink_2.11.jar into Flume's lib directory.
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-ipc</artifactId>
    <version>1.8.2</version>
</dependency>
Copy these two jars into Flume's lib directory as well, otherwise the error Could not initialize class org.apache.spark.streaming.flume.sink.EventBatch appears.
Flume configuration for pull (poll) mode (job/spark-streaming-flume-poll.conf):
b1.sources = r1
b1.sinks = k1
b1.channels = c1
#source
b1.sources.r1.type = netcat
b1.sources.r1.bind = localhost
b1.sources.r1.port = 44444
#channel
b1.channels.c1.type =memory
b1.channels.c1.capacity = 20000
b1.channels.c1.transactionCapacity=5000
#sinks
b1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
b1.sinks.k1.hostname=hadoop005
b1.sinks.k1.port = 7474
b1.sinks.k1.batchSize= 2000
#source-channel-sinks
b1.sources.r1.channels = c1
b1.sinks.k1.channel = c1
Start Flume:
bin/flume-ng agent -n b1 -c conf -f job/spark-streaming-flume-poll.conf -Dflume.root.logger=INFO,console
Start telnet:
telnet localhost 44444
Application code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Flume-based word count (pull mode).
 */
object SparkPollFlume {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(6))
    // Poll the SparkSink that Flume started on hadoop005:7474
    val pollData: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, "hadoop005", 7474)
    pollData.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" +"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print(1000)
    ssc.start()
    ssc.awaitTermination()
  }
}
The chained transformation above can also be written step by step:
val flumeData: DStream[String] = pollData.map(x => new String(x.event.getBody.array()).trim)
val flatData: DStream[String] = flumeData.flatMap(_.split(" +"))
val mapData: DStream[(String, Int)] = flatData.map((_, 1))
val result: DStream[(String, Int)] = mapData.reduceByKey(_ + _)
result.print(1000)
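If several Flume agents each run a SparkSink, createPollingStream also accepts a list of addresses together with a storage level; a minimal sketch, assuming a hypothetical second agent on hadoop006:
import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Poll two SparkSinks from one receiver; hadoop006 is a hypothetical second agent.
val addresses = Seq(
  new InetSocketAddress("hadoop005", 7474),
  new InetSocketAddress("hadoop006", 7474)
)
val multiPollData = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2)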
Push mode Flume configuration (job/spark-streaming-flume-push.conf); the avro sink pushes events to the address on which the Spark Streaming receiver listens:
b1.sources = r1
b1.sinks = k1
b1.channels = c1
#source
b1.sources.r1.type = netcat
b1.sources.r1.bind = localhost
b1.sources.r1.port = 44444
#channel
b1.channels.c1.type =memory
b1.channels.c1.capacity = 20000
b1.channels.c1.transactionCapacity=5000
#sinks
b1.sinks.k1.type = avro
b1.sinks.k1.hostname=192.168.100.121
b1.sinks.k1.port = 5555
b1.sinks.k1.batchSize= 2000
b1.sources.r1.channels = c1
b1.sinks.k1.channels = c1
Application code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushSpark {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(6))
    // Start an avro receiver on 192.168.100.121:5555; Flume's avro sink pushes events here
    val pushData: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(ssc, "192.168.100.121", 5555)
    val originData: DStream[String] = pushData.map(x => new String(x.event.getBody.array))
    val flatMap: DStream[String] = originData.flatMap(_.split(" +"))
    val wordWithOne: DStream[(String, Int)] = flatMap.map((_, 1))
    val result: DStream[(String, Int)] = wordWithOne.reduceByKey(_ + _)
    result.print(1000)
    ssc.start()
    ssc.awaitTermination()
  }
}
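In both modes the stream carries SparkFlumeEvent objects, which expose the Flume event headers as well as the body. A minimal sketch of printing both, to be placed before ssc.start() (header contents depend on the Flume source and may be empty for the netcat source used here):
import scala.collection.JavaConverters._

// Print the headers and body of up to 10 events per batch.
pushData.foreachRDD { rdd =>
  rdd.take(10).foreach { e =>
    val headers = e.event.getHeaders.asScala.map { case (k, v) => s"$k=$v" }.mkString(", ")
    val body = new String(e.event.getBody.array())
    println(s"headers[$headers] body[$body]")
  }
}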
Note: because this is push mode, the receiving end must be listening before Flume starts, i.e. start the Scala program first and then start Flume.
Command to start Flume:
bin/flume-ng agent -n b1 -c conf -f job/spark-streaming-flume-push.conf -Dflume.root.logger=INFO,console
Start telnet:
telnet localhost 44444
If Flume reports the error "Could not configure sink k1 due to: No channel configured for sink: k1", change channels to channel in the configuration file, i.e. b1.sinks.k1.channel = c1.