Spark Streaming and Flume Integration: A Quick Test of the PUSH Approach

Requirement: monitor the file wctotal.log under the directory /opt/datas/spark-flume, process its contents in batches with Spark Streaming, and print the number of received events to the screen.

The experiment runs in a pseudo-distributed environment, with Spark started in local mode (CDH 5.5.0).
To follow each line of code interactively, start the spark-shell, e.g. bin/spark-shell --master local[2]; any local[N] with N >= 2 works, since the Flume receiver occupies one thread and at least one more is needed for processing (the full command below uses local[4]).
Integrating the two requires adding three jars to the Spark classpath:

flume-avro-source-1.6.0-cdh5.5.0.jar
flume-ng-sdk-1.6.0-cdh5.5.0.jar
// the two jars above are found in Flume's lib directory
spark-streaming-flume_2.10-1.5.2.jar
// this jar is found in the external/flume/target directory after Spark is compiled
Start the Spark local shell with the following command (substitute your own paths for the three jars):
bin/spark-shell \
--master local[4] \
--jars externalJars/flume-ng-sdk-1.6.0-cdh5.5.0.jar,externalJars/flume-avro-source-1.6.0-cdh5.5.0.jar,externalJars/spark-streaming-flume_2.10-1.5.2.jar
Remember to separate the jars with commas, and do not put spaces between them.
Enter the following code to run it (the job keeps receiving the stream and processing it in batches):
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
import org.apache.spark.streaming.flume._
import org.apache.spark.storage.StorageLevel


// create a StreamingContext from the spark-shell's SparkContext, with 5-second batches
val ssc = new StreamingContext(sc, Seconds(5))
// push-based receiver: listens on BPF:9999, where the Flume avro sink will send events
val stream = FlumeUtils.createStream(ssc, "BPF", 9999, StorageLevel.MEMORY_ONLY_SER_2)

stream.count().map(cnt => "Received " + cnt + " flume events." ).print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
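
The count above only tells you how many events arrived in each batch. If you also want to see the event bodies (the lines tailed from wctotal.log), a transformation like the following could be registered before ssc.start(); this is a minimal sketch that assumes the log lines are UTF-8 text:

// each element is a SparkFlumeEvent; getBody returns the raw bytes of one tailed line
stream.map(event => new String(event.event.getBody.array(), "UTF-8")).print()
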
Flume agent configuration (conf/flume_spark_push.conf):

# define agent
a2.sources = r2
a2.channels = c2
a2.sinks = k2

# define sources
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /opt/datas/spark-flume/wctotal.log
a2.sources.r2.shell = /bin/bash -c

# define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# define sinks
a2.sinks.k2.type = avro
a2.sinks.k2.hostname = BPF
a2.sinks.k2.port = 9999

# bind channels to sources and sinks
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
Start Flume. Note that with the push approach the Spark Streaming job must already be running and listening on BPF:9999 before the agent starts, otherwise the avro sink cannot deliver events.

bin/flume-ng agent -c conf -n a2 -f conf/flume_spark_push.conf -Dflume.root.logger=DEBUG,console

Manually append new lines to wctotal.log (e.g., with echo) and watch the event counts printed by Spark Streaming.
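
For example (the appended text below is arbitrary test data; each appended line becomes one Flume event):

echo "spark streaming flume test" >> /opt/datas/spark-flume/wctotal.log

Within the next 5-second batch, the spark-shell should print something like "Received 1 flume events."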
