Spark version: 2.2.2
Flume version: 1.6.0
Spark cluster:
Role | IP |
---|---|
Master | 192.168.17.200 |
Slave1 | 192.168.17.201 |
Slave2 | 192.168.17.202 |
Collection server (single-node Flume):
Role | IP |
---|---|
init01 | 192.168.17.100 |
According to the official documentation (http://spark.apache.org/docs/latest/streaming-flume-integration.html), Spark Streaming can be integrated with Flume in two ways:
Approach 1: Flume-style Push-based Approach
Approach 2: Pull-based Approach using a Custom Sink
Here I use the second, pull-based approach. Its advantage over the push-based approach is that it ensures stronger reliability and fault-tolerance guarantees; however, it requires configuring Flume to run a custom sink.
1. Place the spark-streaming-flume-sink_2.11, scala-library, and commons-lang3 jars in Flume's lib directory (search the Maven repository at http://mvnrepository.com/ and download the versions matching your setup).
2. Create a configuration file in Flume's conf directory:
vi flume-poll.properties
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /gew/data
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = init01
a1.sinks.k1.port = 10099
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3. Start Flume:
bin/flume-ng agent --conf conf/ --conf-file conf/flume-poll.properties --name a1 -Dflume.root.logger=INFO,console >> logs/flume_launcherclick.log
Once Flume starts successfully, its startup log is printed to the console (and appended to logs/flume_launcherclick.log, as specified in the command above).
4. On the Spark master, open spark-shell to test connectivity and enter the following code line by line:
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.flume.SparkFlumeEvent
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
val ssc: StreamingContext = new StreamingContext(sc, Seconds(10))
sc.setLogLevel("WARN")
val flumeStream = FlumeUtils.createPollingStream(ssc, "192.168.17.100", 10099)
flumeStream.count().map(cnt => "AAAAAA-Received " + cnt + " flume events.").print()
ssc.start()
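The count alone only confirms that events are arriving. If you also want to see the file contents, a line like the following can be added before ssc.start() (my own addition, not part of the original session; it assumes the spooled files contain plain UTF-8 text):
// Decode each Flume event body (a ByteBuffer inside the wrapped AvroFlumeEvent) and print a few per batch
flumeStream.map(e => new String(e.event.getBody.array(), "UTF-8")).print()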
Problems encountered:
① When executing the first line, import org.apache.spark.streaming.flume.FlumeUtils, spark-shell reports that the class FlumeUtils does not exist. In that case, make sure Spark's jars directory contains both the spark-streaming-flume and spark-streaming-flume-sink jars.
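A quick way to confirm which of these classes the spark-shell driver can actually see is a small probe like the one below (my own sketch; it only checks the driver's classpath, and the class names are the ones from the jars and errors discussed in this post):
// Probe the driver classpath for the Flume-related classes
Seq(
  "org.apache.spark.streaming.flume.FlumeUtils",
  "org.apache.spark.streaming.flume.sink.SparkSink",
  "org.apache.flume.source.avro.AvroFlumeEvent"
).foreach { cn =>
  val ok = scala.util.Try(Class.forName(cn)).isSuccess
  println(cn + " -> " + (if (ok) "found" else "MISSING"))
}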
② If import org.apache.spark.streaming.flume.SparkFlumeEvent is omitted, the seventh line (the createPollingStream call) fails with:
scala> val flumeStream = FlumeUtils.createPollingStream(ssc,"192.168.17.100",10099)
error: missing or invalid dependency detected while loading class file 'SparkFlumeEvent.class'.
③ After the job is submitted, another error appears:
scala> scc.start()
scala> Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: Lorg/apache/flume/source/avro/AvroFlumeEvent;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
at java.lang.Class.getDeclaredField(Class.java:2068)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1803)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:79)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:494)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:482)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:482)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:379)
at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1213)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1120)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1007)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:933)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:932)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:932)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1703)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1692)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.ClassNotFoundException: org.apache.flume.source.avro.AvroFlumeEvent
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 70 more
The class org.apache.flume.source.avro.AvroFlumeEvent cannot be found. It lives in the flume-ng-sdk jar, so flume-ng-sdk-1.6.0.jar also has to be placed in Spark's jars directory. This problem bothered me for a week: when I packaged the code into a jar and submitted it with spark-submit, dropping a file into the directory Flume monitors produced no error but also no data, whereas local[2] mode did produce data.
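Instead of copying jars into Spark's jars directory by hand, the dependencies can also be declared when building the application jar. Below is a minimal build.sbt sketch under my own assumptions (sbt, Scala 2.11, Spark 2.2.2; the project name is made up). Anything not marked provided has to be bundled into a fat jar or passed to spark-submit via --jars; spark-streaming-flume should transitively pull in the Flume SDK classes:
// build.sbt (sketch)
name := "flume-poll-demo"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "2.2.2" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "2.2.2" % "provided",
  "org.apache.spark" %% "spark-streaming-flume" % "2.2.2"   // brings in flume-ng-sdk
)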
5. Once the steps above are done, the connection succeeds; Flume's log shows:
2019-01-10 15:16:53,438 (New I/O server boss #1 ([id: 0xaa7c85a1, /192.168.17.100:10099])) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x1181803f, /192.168.17.201:42616 => /192.168.17.100:10099] OPEN
2019-01-10 15:16:53,439 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x1181803f, /192.168.17.201:42616 => /192.168.17.100:10099] BOUND: /192.168.17.100:10099
2019-01-10 15:16:53,439 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0x1181803f, /192.168.17.201:42616 => /192.168.17.100:10099] CONNECTED: /192.168.17.201:42616
6. Drop a file into the directory Flume monitors, and Spark Streaming prints the number of events it received:
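For completeness, here is a minimal sketch of the same job as a standalone application for spark-submit, the mode where problem ③ showed up. The object name, app name and output message are my own choices; the logic mirrors the spark-shell session above:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollDemo {
  def main(args: Array[String]): Unit = {
    // The master URL is expected from spark-submit (--master), so it is not hard-coded here
    val conf = new SparkConf().setAppName("FlumePollDemo")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Pull events from the SparkSink that Flume exposes on init01 (192.168.17.100), port 10099
    val flumeStream = FlumeUtils.createPollingStream(ssc, "192.168.17.100", 10099)
    flumeStream.count().map(cnt => "Received " + cnt + " flume events.").print()

    ssc.start()
    ssc.awaitTermination()
  }
}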
Things to note: