Big Data in Practice: Integrating Spark Streaming with Flume

When integrating Spark Streaming with Flume, always check the official documentation first for compatible versions:
http://spark.apachecn.org/docs/cn/2.2.0/streaming-flume-integration.html

Part 1. Push-based integration: flume_push_streaming.conf

simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444

simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = hadoop000
simple-agent.sinks.avro-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel

Steps recommended by the official docs:
Dependency to add to pom.xml:
 groupId = org.apache.spark
 artifactId = spark-streaming-flume_2.11
 version = 2.2.0
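In pom.xml, those coordinates are declared as a standard dependency entry:

 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-streaming-flume_2.11</artifactId>
     <version>2.2.0</version>
 </dependency>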

 import org.apache.spark.streaming.flume._

 val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
 
  ./bin/spark-submit --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 ...
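The FlumePushWordCount class submitted later in this section is not listed in the article; a minimal push-mode word count along those lines might look like the sketch below. The com.imooc.spark package, the 5-second batch interval, and the argument handling are assumptions; only FlumeUtils.createStream comes from the official docs.

 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.flume.FlumeUtils
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 // Sketch of a push-mode word count (the course's actual code is not
 // shown in this article, so details here are assumptions).
 object FlumePushWordCount {
   def main(args: Array[String]): Unit = {
     if (args.length != 2) {
       System.err.println("Usage: FlumePushWordCount <hostname> <port>")
       System.exit(1)
     }
     val Array(hostname, port) = args

     val sparkConf = new SparkConf().setAppName("FlumePushWordCount")
     val ssc = new StreamingContext(sparkConf, Seconds(5))

     // Spark Streaming acts as the Avro server here: Flume's avro sink
     // pushes events to this hostname:port.
     val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)

     // Flume event bodies are byte arrays: decode, split, and count words.
     flumeStream.map(e => new String(e.event.getBody.array()).trim)
       .flatMap(_.split(" "))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
       .print()

     ssc.start()
     ssc.awaitTermination()
   }
 }

The hostname and port arguments (hadoop000 41414 in the spark-submit command below) must match the avro sink's hostname and port in flume_push_streaming.conf.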

Command to start the Flume agent:
flume-ng agent --name simple-agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume_push_streaming.conf -Dflume.root.logger=INFO,console &

telnet localhost 44444

Local test summary:
1. Start the Spark Streaming job first (in push mode, Flume's avro sink pushes to it, so the receiver must already be listening).
2. Start the Flume agent.
3. Send data via telnet and watch the output in the IDEA console.
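When the job runs inside IDEA rather than through spark-submit, the master must be set in code; local[2] is the minimum, since the Flume receiver itself occupies one thread:

 // hypothetical local-run configuration for IDEA testing
 val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePushWordCount")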

Server test summary:
1. Package the application and submit it with spark-submit:

spark-submit --class com.imooc.spark.FlumePushWordCount --master local[*] --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 /home/hadoop/lib/sparkstream-1.0-SNAPSHOT.jar hadoop000 41414


2. Start the Flume agent:

flume-ng agent --name simple-agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume_push_streaming.conf -Dflume.root.logger=INFO,console &


3. Connect to the port and type some data:
    telnet localhost 44444
    
Part 2. Pull-based integration: flume_pull_streaming.conf

simple-agent.sources = netcat-source
simple-agent.sinks = spark-sink
simple-agent.channels = memory-channel

simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444

simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
simple-agent.sinks.spark-sink.hostname = hadoop000
simple-agent.sinks.spark-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.spark-sink.channel = memory-channel

Steps recommended by the official docs (in pull mode, the following JARs must be on the classpath of the machine running the Flume agent):
(i) Custom sink JAR: Download the JAR corresponding to the following artifact.

 groupId = org.apache.spark
 artifactId = spark-streaming-flume-sink_2.11
 version = 2.2.0
(ii) Scala library JAR: Download the Scala library JAR for Scala 2.11.8. It can be found with the following artifact detail.

 groupId = org.scala-lang
 artifactId = scala-library
 version = 2.11.8
(iii) Commons Lang 3 JAR: Download the Commons Lang 3 JAR. It can be found with the following artifact detail.

 groupId = org.apache.commons
 artifactId = commons-lang3
 version = 3.5
 
(iv) Configuration file: On that machine, configure the Flume agent to send data to an Avro sink by having the following in the configuration file.

 agent.sinks = spark
 agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
 agent.sinks.spark.hostname =
 agent.sinks.spark.port =
 agent.sinks.spark.channel = memoryChannel
 
(v) Programming: In the streaming application code, import FlumeUtils and create an input DStream as follows.

 import org.apache.spark.streaming.flume._

 val flumeStream = FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port])
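Putting these steps together, a pull-mode counterpart to the push example might look like the sketch below; the class name FlumePullWordCount, the package, and the 5-second batch interval are assumptions, and only FlumeUtils.createPollingStream comes from the official docs:

 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.flume.FlumeUtils
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 // Hypothetical pull-mode word count; hostname:port must match the
 // SparkSink settings above (hadoop000:41414).
 object FlumePullWordCount {
   def main(args: Array[String]): Unit = {
     val Array(hostname, port) = args

     val sparkConf = new SparkConf().setAppName("FlumePullWordCount")
     val ssc = new StreamingContext(sparkConf, Seconds(5))

     // In pull mode Spark is the client: it polls events out of the
     // SparkSink's transactional buffer.
     val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)

     // Flume event bodies are byte arrays: decode, split, and count words.
     flumeStream.map(e => new String(e.event.getBody.array()).trim)
       .flatMap(_.split(" "))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
       .print()

     ssc.start()
     ssc.awaitTermination()
   }
 }

Note that the startup order is the reverse of push mode: start the Flume agent first so the SparkSink is buffering events, then submit the Spark job. Per the official docs, this approach provides stronger reliability and fault-tolerance guarantees than the push approach.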
