This only needs a basic understanding: connecting Flume directly to Spark Streaming is rarely done in practice.
Official docs: http://spark.apache.org/docs/latest/streaming-flume-integration.html
The official documentation covers the integration in detail and can be followed directly.
Push-based approach
Flume is designed to push data between Flume agents.
Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, and Flume can push data to it.
Choose one machine in your cluster.
When the Flume + Spark Streaming application is started, one of the Spark workers must run on that machine.
Flume can then be configured to push data to a port on that machine.
agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = <chosen machine's hostname>
agent.sinks.avroSink.port = <chosen port on the machine>
Since one machine has been chosen to run the receiver, the data has to reach that machine, so the Flume sink sends the data there as avro.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
Programming
A utility class is needed: FlumeUtils
See FlumePushApp.scala for the full code
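As a quick way to see what FlumeUtils does, here is a REPL-style sketch that could be pasted into spark-shell (started with --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 and --master local[2], matching the spark-submit options used later); host/port match the avro sink configured below, and the full FlumePushApp.scala is listed at the end of these notes:
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark-shell already provides `sc`; build a StreamingContext with 10-second batches
val ssc = new StreamingContext(sc, Seconds(10))
// Push mode: start an Avro receiver on this host/port; the Flume avro sink pushes events to it
val flumeStream = FlumeUtils.createStream(ssc, "192.168.26.131", 44443)
// Each element is a SparkFlumeEvent; decode its Avro body into a line of text
flumeStream.map(e => new String(e.event.getBody.array()).trim).print()
ssc.start()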
Implementation
Flume Agent configuration:
[$FLUME_HOME/conf/nc-memory-avro.conf]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.26.131
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.26.131
a1.sinks.k1.port = 44443
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Why do the source and sink use different ports?
Input is listened for on port 44444 (the netcat source).
The data received there is then written via avro to port 44443 on the same machine, which is where the Spark Streaming receiver listens.
flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/nc-memory-avro.conf \
-Dflume.root.logger=INFO,console
This produces an error:
Caused by: org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 44443 }: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:182)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:121)
at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:638)
at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:89)
at org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:127)
at org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:211)
at org.apache.flume.sink.AbstractRpcSink.verifyConnection(AbstractRpcSink.java:272)
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:349)
... 3 more
Cause analysis
Nothing is listening on port 44443, so the data cannot be received.
Starting Flume first produces this error; Spark Streaming must be started first.
Because Flume pushes data to Spark Streaming, the Spark Streaming application (which hosts the receiver) has to be running before the Flume agent starts.
Run with spark-submit:
spark-submit \
--class com.zhaotao.SparkStreaming.FlumePushApp \
--master local[2] \
/opt/lib/scala-train-1.0.jar
Error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/flume/FlumeUtils$
at com.zhaotao.SparkStreaming.FlumePushApp$.main(FlumePushApp.scala:16)
at com.zhaotao.SparkStreaming.FlumePushApp.main(FlumePushApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.flume.FlumeUtils$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 11 more
Cause analysis
spark-streaming-flume_2.11 was added to pom.xml,
but the application was packaged as a thin jar, so that dependency was not bundled into it.
It therefore has to be supplied manually with --packages xxxxx.
spark-submit \
--class com.zhaotao.SparkStreaming.FlumePushApp \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/opt/lib/scala-train-1.0.jar
Start the Flume agent again
Start telnet:
$>telnet 192.168.26.131 44444
Enter some input:
huhuhu
zhaotao
zhaotao
huhuhu
The console where Spark Streaming was started shows:
(zhaotao,2)
(huhuhu,2)
The data flow in this example is:
nc --> flume --> sink ip+port --> streaming
Packaging note: spark-core and spark-streaming can be marked as provided, since spark-submit supplies them at runtime.
Pull-based approach (using a custom sink)
Official docs: http://spark.apache.org/docs/latest/streaming-flume-integration.html#approach-2-pull-based-approach-using-a-custom-sink
Instead of pushing data directly into Spark Streaming, this approach runs a custom Flume sink from which Spark Streaming pulls the data.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume-sink_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.5</version>
</dependency>
Programming
See FlumePullApp.scala for the full code
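The key difference from the push version is FlumeUtils.createPollingStream, which makes Spark Streaming pull batches from the custom SparkSink instead of receiving pushed Avro events. A REPL-style sketch of the core call (again something that could be tried in spark-shell started with --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0); host/port are those of the SparkSink configured below, and the full FlumePullApp.scala is listed at the end of these notes:
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark-shell already provides `sc`
val ssc = new StreamingContext(sc, Seconds(10))
// Pull mode: poll the SparkSink running inside the Flume agent for batches of events
val flumeStream = FlumeUtils.createPollingStream(ssc, "192.168.26.131", 44443)
flumeStream.map(e => new String(e.event.getBody.array()).trim).print()
ssc.start()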
Flume Agent configuration
[$FLUME_HOME/conf/nc-memory-spark.conf]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.26.131
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.26.131
a1.sinks.k1.port = 44443
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/nc-memory-spark.conf \
-Dflume.root.logger=INFO,console
Start the Spark Streaming application
Start telnet:
$>telnet 192.168.26.131 44444
Enter some input:
huhuhu
zhaotao
huhuhu
zhaotao
zhao
Output:
-------------------------------------------
Time: 1518957240000 ms
-------------------------------------------
(zhao,1)
(zhaotao,2)
(huhuhu,2)
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushApp {
  def main(args: Array[String]): Unit = {
    // Master and app name are passed via spark-submit; uncomment for local IDE runs
    val conf = new SparkConf() //.setMaster("local[2]").setAppName("FlumePushApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Avro receiver that the Flume avro sink pushes events to (must match the sink's hostname/port)
    val lines = FlumeUtils.createStream(ssc, "192.168.26.131", 44443)

    // Decode each SparkFlumeEvent body to a string and split it into words
    val words = lines.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePullApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlumePullApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Poll the SparkSink running inside the Flume agent (must match the sink's hostname/port)
    val lines = FlumeUtils.createPollingStream(ssc, "192.168.26.131", 44443)

    // Decode each SparkFlumeEvent body to a string and split it into words
    val words = lines.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}