Two Ways to Integrate Flume with Spark Streaming

This is mainly for awareness; connecting Flume directly to Spark Streaming is rarely done in practice.
Official docs: http://spark.apache.org/docs/latest/streaming-flume-integration.html
The official documentation describes both approaches in detail and can be followed step by step.

Approach 1: Flume-style Push-based Approach

The push-based approach
Flume is designed to push data between Flume agents.
Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, so Flume can push data to it.

Prerequisites

Choose one machine in your cluster.
When the Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
Flume can then be configured to push data to a port on that machine.

Configuring Flume

agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = <chosen machine's hostname>
agent.sinks.avroSink.port = <chosen port on the machine>

Since one machine has been chosen to run the receiver, the data must reach that machine: the Flume agent sinks the data to it over Avro.

Integrating with Spark Streaming

  • Add the dependency:
<dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-flume_2.11</artifactId>
      <version>${spark.version}</version>
</dependency>
  • Coding
    A helper class is needed: FlumeUtils
    See FlumePushApp.scala for the full code (listed at the end of this post)

  • Implementation
    Flume agent configuration:

[$FLUME_HOME/conf/nc-memory-avro.conf]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.26.131
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.26.131
a1.sinks.k1.port = 44443
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1

Why the source and sink use different ports
The netcat source listens for input on port 44444.
The data received there is then written out via Avro to port 44443 on the local machine, where the Spark Streaming receiver listens.

Start the Flume agent

flume-ng agent \
--name a1  \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/nc-memory-avro.conf \
-Dflume.root.logger=INFO,console 

This produces an error:

Caused by: org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 44443 }: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:182)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:121)
at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:638)
at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:89)
at org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:127)
at org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:211)
at org.apache.flume.sink.AbstractRpcSink.verifyConnection(AbstractRpcSink.java:272)
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:349)
... 3 more

Cause
Nothing is listening on port 44443, so the Avro sink cannot deliver data.
Starting Flume first therefore fails; the Spark Streaming application must be started first.
Since Flume pushes data to Spark Streaming, the Spark Streaming receiver has to be up before the Flume agent starts.

Submit with spark-submit

spark-submit \
--class com.zhaotao.SparkStreaming.FlumePushApp \
--master local[2] \
/opt/lib/scala-train-1.0.jar

Error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/flume/FlumeUtils$
at com.zhaotao.SparkStreaming.FlumePushApp$.main(FlumePushApp.scala:16)
at com.zhaotao.SparkStreaming.FlumePushApp.main(FlumePushApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.flume.FlumeUtils$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 11 more

Cause
spark-streaming-flume_2.11 was added to pom.xml,
but the application was packaged as a thin JAR, so this dependency was not bundled.
It therefore has to be supplied manually with --packages.

spark-submit \
--class com.zhaotao.SparkStreaming.FlumePushApp \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/opt/lib/scala-train-1.0.jar

Start the Flume agent again

Start telnet
$>telnet 192.168.26.131 44444
Type some input:
huhuhu
zhaotao
zhaotao
huhuhu
The console of the Spark Streaming application shows:
(zhaotao,2)
(huhuhu,2)

Summary

The flow in this example is:
nc --> flume --> sink ip+port --> streaming

Steps

  1. Start the Spark Streaming application first
    Notes:
    • If you build a fat JAR, the provided scope must be added in pom.xml
    • If you build a thin JAR, use the --packages option
      However, --packages is discouraged in real work: it downloads the artifacts from the internet (fine if the company runs an internal repository, painful otherwise). Note: use --packages with caution in production.
      There is a second solution, using this case as an example:
      first download spark-streaming-flume-assembly from the Maven repository to the local machine, then pass the corresponding JAR with the --jars option when submitting via spark-submit.
      This brings its own problem: with many JARs, listing them one by one is tedious. A shell script can enumerate the JARs in a directory, join them into a comma-separated string and pass that to --jars (passing a directory to --jars does not work!); a sketch of such a script follows this list.
  2. Then start Flume
  3. telnet in, send data and complete the word count
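
A minimal sketch of such a helper script, assuming the extra JARs (for example the spark-streaming-flume-assembly JAR) have been downloaded into a local directory such as /opt/lib/deps (directory and file names are illustrative):

#!/bin/bash
# directory holding the downloaded dependency JARs (illustrative path)
JAR_DIR=/opt/lib/deps
# join every JAR in the directory into a comma-separated list for --jars
JARS=$(ls ${JAR_DIR}/*.jar | tr '\n' ',' | sed 's/,$//')

spark-submit \
--class com.zhaotao.SparkStreaming.FlumePushApp \
--master local[2] \
--jars ${JARS} \
/opt/lib/scala-train-1.0.jar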

Approach 2: Pull-based Approach using a Custom Sink

Official docs: http://spark.apache.org/docs/latest/streaming-flume-integration.html#approach-2-pull-based-approach-using-a-custom-sink
Instead of pushing data directly to Spark Streaming, this approach runs a custom Flume sink.

  • Flume pushes data into the sink, where it stays buffered.
  • Spark Streaming uses a reliable Flume receiver and transactions to pull the data from the sink.
    A transaction succeeds only after the data has been received and replicated by Spark Streaming.
    This approach therefore provides stronger reliability and fault-tolerance guarantees than the previous one.
    If Flume integration is needed, this is the recommended approach.

Configuring Flume

  • Add spark-streaming-flume-sink_2.11 (see the classpath note after this list)
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-flume-sink_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
  • Add scala-library
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>${scala.version}</version>
</dependency>
  • Add commons-lang3
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
  <version>3.5</version>
</dependency>
  • Configuration file
    agent.sinks = spark
    agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
    agent.sinks.spark.hostname = <hostname of the local machine>
    agent.sinks.spark.port = <port to listen on for connection from Spark>
    agent.sinks.spark.channel = memoryChannel
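
According to the linked documentation, the custom sink JAR and its dependencies listed above must be on the classpath of the Flume agent that runs the SparkSink. A minimal sketch of one common way to do this, assuming a standard $FLUME_HOME layout and illustrative version numbers:

# copy the custom sink and its dependencies into Flume's lib directory
cp spark-streaming-flume-sink_2.11-2.2.0.jar $FLUME_HOME/lib/
cp scala-library-2.11.8.jar $FLUME_HOME/lib/
cp commons-lang3-3.5.jar $FLUME_HOME/lib/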

Configuring Spark Streaming Application

  • Coding
    See FlumePullApp.scala (listed at the end of this post)

  • Flume agent configuration

[$FLUME_HOME/conf/nc-memory-spark.conf]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.26.131
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.26.131
a1.sinks.k1.port = 44443
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1       
  • Start the Flume agent
flume-ng agent \
--name a1  \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/nc-memory-spark.conf \
-Dflume.root.logger=INFO,console
  • Start the Spark Streaming application

  • telnet
    $>telnet 192.168.26.131 44444
    Type some input:
    huhuhu
    zhaotao
    huhuhu
    zhaotao
    zhao

  • Output

-------------------------------------------
Time: 1518957240000 ms
-------------------------------------------
(zhao,1)
(zhaotao,2)
(huhuhu,2)

Code

// FlumePushApp.scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushApp {

  def main(args: Array[String]): Unit = {
    // master is supplied via spark-submit (--master local[2])
    val conf = new SparkConf()//.setMaster("local[2]").setAppName("FlumePushApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // push-based approach: the receiver listens on this host/port for Flume's Avro sink
    val lines = FlumeUtils.createStream(ssc, "192.168.26.131", 44443)
    val words = lines.map(x => new String(x.event.getBody.array()).trim)
                     .flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }

}
// FlumePullApp.scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePullApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlumePullApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // pull-based approach: poll the SparkSink that Flume buffers data into
    val lines = FlumeUtils.createPollingStream(ssc, "192.168.26.131", 44443)
    val words = lines.map(x => new String(x.event.getBody.array()).trim)
                     .flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }

}
