Integrating Flume with Spark Streaming on a Cluster Using the Pull Approach

Cluster environment

Spark version: 2.2.2
Flume version: 1.6.0

Spark cluster:

Role    IP
Master  192.168.17.200
Slave1  192.168.17.201
Slave2  192.168.17.202

Collection (single Flume node) server:

Role    IP
init01  192.168.17.100

Checking the documentation

According to the official documentation (http://spark.apache.org/docs/latest/streaming-flume-integration.html), there are two ways to integrate Spark Streaming with Flume:
Approach 1: Flume-style Push-based Approach
Approach 2: Pull-based Approach using a Custom Sink
I use the second (pull-based) approach here. Its advantage over the first is that it ensures stronger reliability and fault-tolerance guarantees; however, it requires configuring Flume to run a custom sink.

Steps

1. Place the spark-streaming-flume-sink_2.11, scala-library, and commons-lang3 JARs into Flume's lib directory (search the Maven repository at http://mvnrepository.com/ and download the versions that match your environment).
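
If you manage the job's own dependencies with sbt, the coordinates below are one way to pull in the same three artifacts. This is only a sketch: the versions shown are my assumptions for a Spark 2.2.2 / Scala 2.11 environment and should be checked against the JARs you actually download.

// build.sbt fragment (illustrative; versions are assumptions for Spark 2.2.2 / Scala 2.11)
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-streaming-flume-sink" % "2.2.2",
  "org.scala-lang"      % "scala-library"              % "2.11.8",
  "org.apache.commons"  % "commons-lang3"              % "3.5"
)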

2. Create a configuration file in Flume's conf directory:

vi flume-poll.properties
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /gew/data
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = init01
a1.sinks.k1.port = 10099

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. Start Flume:

bin/flume-ng agent --conf conf/ --conf-file conf/flume-poll.properties --name a1 -Dflume.root.logger=INFO,console >> logs/flume_launcherclick.log

The log printed after a successful start:
[screenshot: Flume startup log]

4. On the Spark Master node, open spark-shell to test connectivity and enter the following code line by line:

import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.flume.SparkFlumeEvent
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

// Build a StreamingContext on top of the shell's SparkContext, with a 10-second batch interval
val scc: StreamingContext = new StreamingContext(sc, Seconds(10))
sc.setLogLevel("WARN")
// Pull events from the SparkSink that the Flume agent exposes on init01 (192.168.17.100), port 10099
val flumeStream = FlumeUtils.createPollingStream(scc, "192.168.17.100", 10099)
// Print the number of Flume events received in each batch
flumeStream.count().map(cnt => "AAAAAA-Received " + cnt + " flume events.").print()
scc.start()

Problems encountered:
① When executing the first import, import org.apache.spark.streaming.flume.FlumeUtils, spark-shell reported that the FlumeUtils class could not be found. Make sure Spark's jars directory contains both the spark-streaming-flume and spark-streaming-flume-sink JARs.
② If import org.apache.spark.streaming.flume.SparkFlumeEvent is omitted, the FlumeUtils.createPollingStream call fails with:

scala> val flumeStream = FlumeUtils.createPollingStream(ssc,"192.168.17.100",10099)
error: missing or invalid dependency detected while loading class file 'SparkFlumeEvent.class'.

③ After starting the job, another error appeared:

scala> scc.start()
                                                                                
scala> Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: Lorg/apache/flume/source/avro/AvroFlumeEvent;
	at java.lang.Class.getDeclaredFields0(Native Method)
	at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
	at java.lang.Class.getDeclaredField(Class.java:2068)
	at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1803)
	at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:79)
	at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:494)
	at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:482)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:482)
	at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:379)
	at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1213)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1120)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1007)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:933)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:932)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:932)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1703)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1692)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.ClassNotFoundException: org.apache.flume.source.avro.AvroFlumeEvent
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 70 more

The missing class, org.apache.flume.source.avro.AvroFlumeEvent, lives in the flume-ng-sdk JAR, so flume-ng-sdk-1.6.0.jar also has to be placed in Spark's jars directory. This problem troubled me for a week: when I packaged the code into a jar and submitted it with spark-submit, dropping files into the directory Flume monitors produced no errors, but no data either, while running in local[2] mode produced data just fine.
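
For reference, the same logic as a standalone application that can be packaged and submitted with spark-submit looks roughly like the sketch below. It is a minimal sketch based on the spark-shell session above; the object name and app name are placeholders of my own, and it still relies on the JAR placement described in the summary at the end.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal standalone version of the spark-shell test above (object/app names are illustrative)
object FlumePollDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePollDemo")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.sparkContext.setLogLevel("WARN")

    // Pull events from the SparkSink exposed by the Flume agent on init01 (192.168.17.100), port 10099
    val flumeStream = FlumeUtils.createPollingStream(ssc, "192.168.17.100", 10099)
    flumeStream.count().map(cnt => "Received " + cnt + " flume events.").print()

    ssc.start()
    ssc.awaitTermination() // keep the driver alive so the streaming job keeps running
  }
}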

5. Once the steps above are complete and the connection succeeds, check Flume's log:

2019-01-10 15:16:53,438 (New I/O server boss #1 ([id: 0xaa7c85a1, /192.168.17.100:10099])) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x1181803f, /192.168.17.201:42616 => /192.168.17.100:10099] OPEN
2019-01-10 15:16:53,439 (New I/O  worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x1181803f, /192.168.17.201:42616 => /192.168.17.100:10099] BOUND: /192.168.17.100:10099
2019-01-10 15:16:53,439 (New I/O  worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x1181803f, /192.168.17.201:42616 => /192.168.17.100:10099] CONNECTED: /192.168.17.201:42616

6. Drop a file into the directory Flume is monitoring, and Spark Streaming prints the number of events it received.
[screenshot: Spark Streaming output showing the received event count]
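
The count-only output above just confirms that events are arriving. To look at the event content itself, something like the following could be added before scc.start(); this is a sketch that assumes the files dropped into the spool directory contain UTF-8 text.

import java.nio.charset.StandardCharsets

// Decode each Flume event body (a ByteBuffer inside the Avro event) into a string and print a sample per batch
flumeStream
  .map(sparkEvent => StandardCharsets.UTF_8.decode(sparkEvent.event.getBody).toString)
  .print()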

Summary

Points to note:

  1. When integrating Spark Streaming with Flume and submitting only a packaged jar via spark-submit, make sure Flume's lib directory contains the spark-streaming-flume-sink_2.11, scala-library, and commons-lang3 JARs, and that Spark's jars directory contains the flume-ng-sdk-1.6.0.jar, spark-streaming-flume, and spark-streaming-flume-sink JARs.
  2. As of Spark 2.3, the Flume integration has been marked as deprecated, because having Flume feed Spark Streaming directly can put considerable pressure on Spark Streaming.
