Integrating Spark Streaming with Flume

Contents

      • Goal 1: Flume-style Push-based Approach
      • Goal 2: Pull-based Approach using a Custom Sink

There are two ways to integrate Spark Streaming with Flume. This post walks through a demo of each.
GitHub: https://github.com/2NaCl/spark_flume_demo

Goal 1: Flume-style Push-based Approach

Let's start with the official documentation. The socket and file system sources covered earlier are basic sources; here the focus is on advanced sources.

The official site lists three advanced sources (Kafka, Flume, and Kinesis). Let's look at the Flume documentation.

The gist: data can be passed between multiple Flume agents, chained in series or fanned out in parallel, and Spark Streaming acts as an Avro receiver that accepts the data Flume collects.

The setup is:

  1. Run Flume and a Spark worker on the same node
  2. Configure Flume to send its data to a port on that node

 
Also, because Spark Streaming is the receiving side, it has to be started first so that it is already listening on the port Flume pushes data to.

  1. Configure Flume
# Name the components on this agent
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

# Describe/configure the source
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = linux01
simple-agent.sources.netcat-source.port = 44444

# Describe the sink
simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = linux01
simple-agent.sinks.avro-sink.port = 41414 

# Use a channel which buffers events in memory
simple-agent.channels.memory-channel.type = memory
simple-agent.channels.memory-channel.capacity = 1000
simple-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel
  1. Write the Spark Streaming application, using FlumeUtils to create the DStream

First add the new dependency:

		<dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>

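Then write a word count demo for the push approach.
As before, start with the configuration.
Next, read the input and split it. A Flume event carries headers and a body, and we only want the body, so we extract the body from each event and trim the leading and trailing whitespace.

After that it is an ordinary word count.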
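As a rough sketch of these steps (not the exact code from the repo), the body extraction and counting can be tried in plain Scala without a cluster. `SimpleEvent`, `BodyWordCount` and its methods are hypothetical names that only stand in for a Flume event. The commented lines at the end show roughly how the same steps would look on the DStream returned by `FlumeUtils.createStream`, assuming the Avro sink settings from the config above.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for a Flume event: a header map plus a binary body.
final case class SimpleEvent(headers: Map[String, String], body: ByteBuffer)

object BodyWordCount {
  // Decode only the body (headers are ignored) and trim surrounding whitespace.
  def bodyToLine(event: SimpleEvent): String = {
    val buf = event.body.duplicate() // do not move the caller's buffer position
    val bytes = new Array[Byte](buf.remaining())
    buf.get(bytes)
    new String(bytes, StandardCharsets.UTF_8).trim
  }

  // Split every line on whitespace and count how often each word occurs.
  def countWords(events: Seq[SimpleEvent]): Map[String, Int] =
    events
      .map(bodyToLine)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, hits) => word -> hits.size }

  // Helper for building test events from plain text.
  def event(text: String): SimpleEvent =
    SimpleEvent(Map("source" -> "netcat"),
      ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8)))
}

// In the Spark job the same steps would run on the DStream, roughly:
//   val stream = FlumeUtils.createStream(ssc, "linux01", 41414)
//   stream.map(e => new String(e.event.getBody.array()).trim)
//         .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
```

The trim matters in practice because input typed through telnet usually arrives with a trailing carriage return, which would otherwise end up glued to the last word of each line.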

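  1. Local testing

For local testing, change the sink hostname in the Flume config to the IP address of the local machine instead of the server's hostname. Then start the Spark Streaming application, start Flume, type some input with telnet, and watch the output in the IDEA console.
simple-agent.conf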

# Name the components on this agent
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

# Describe/configure the source
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = linux01
simple-agent.sources.netcat-source.port = 44444

# Describe the sink
simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = 192.168.1.101
simple-agent.sinks.avro-sink.port = 41414

# Use a channel which buffers events in memory
simple-agent.channels.memory-channel.type = memory
simple-agent.channels.memory-channel.capacity = 1000
simple-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel

Then start Flume:

flume-ng agent \
--name simple-agent \
--conf /home/centos01/modules/apache-flume-1.7.0-bin/conf/ \
--conf-file /home/centos01/modules/apache-flume-1.7.0-bin/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console

I hit a small snag here: to use telnet, I first had to open the port and then start telnet-server before the connection would go through.

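  1. Deploying with spark-submit

Once the local test passes, move on to deployment. Change the sink hostname in the Flume config back to linux01, build the Spark Streaming jar with mvn clean package -DskipTests, and then launch it with spark-submit.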

[centos01@linux01 spark-2.1.1-bin-hadoop2.7]$ spark-submit \
--name spark_flume \
--class com.fyj.spark.spark_flume \
--master local[*] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.1.1 \
/home/centos01/modules/apache-flume-1.7.0-bin/test_dataSource/flume_spark/target/flume_spark-1.0-SNAPSHOT.jar \
linux01 41414

This step was a bit buggy for me, so no screenshots here. It was frustrating, and it is the reason there was no update yesterday.


Goal 2: Pull-based Approach using a Custom Sink

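In contrast to the push approach, here Spark Streaming pulls the data. Flume only has to push events into a sink, where they stay buffered. Spark Streaming then uses a reliable Flume receiver to pull them out of the sink in transactions, and a transaction only succeeds once the data has been received and replicated by Spark Streaming.
This makes the second approach safer and more reliable than the first, with stronger fault-tolerance guarantees. In exchange, Flume has to be configured to run a custom sink.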
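To make the "a transaction only succeeds once the data has been received and replicated" part concrete, here is a toy in-memory model of the idea. It is not how SparkSink is implemented; `BufferingSink` and its `push`/`pull`/`ack`/`nack` methods are hypothetical names used only for illustration.

```scala
import scala.collection.mutable

// Hypothetical in-memory model of a sink that buffers events until a
// receiver pulls them and confirms that they were stored safely.
final class BufferingSink {
  private val buffer = mutable.Queue.empty[String]
  private var inFlight: List[String] = Nil

  // Flume side: push an event into the sink's buffer.
  def push(event: String): Unit = buffer.enqueue(event)

  // Receiver side: start a "transaction" by taking up to `max` events.
  def pull(max: Int): List[String] = {
    require(inFlight.isEmpty, "previous batch has not been acknowledged yet")
    inFlight = List.fill(math.min(max, buffer.size))(buffer.dequeue())
    inFlight
  }

  // Commit: the receiver stored the batch, so it can be dropped for good.
  def ack(): Unit = inFlight = Nil

  // Roll back: put the batch back in front so it gets delivered again.
  def nack(): Unit = {
    val restored = inFlight ++ buffer
    buffer.clear()
    buffer ++= restored
    inFlight = Nil
  }

  // Number of events still waiting in the buffer.
  def pending: Int = buffer.size
}
```

With three events pushed, a `pull(2)` followed by `nack()` leaves all three pending again, while `pull(2)` followed by `ack()` leaves one. That retry-until-acknowledged behaviour is what the push approach lacks: there, an event counts as delivered as soon as the Avro sink has handed it over.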

What we need: one machine running the Flume agent, with Spark Streaming then connecting to the custom sink running on it.

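  1. Add the sink jar to the Spark Streaming project's pom (since SparkSink runs inside the Flume agent, the sink jar and its dependencies also have to be on Flume's classpath)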
		<dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume-sink_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
  1. Configure the Flume agent
# Name the components on this agent
simple-agent.sources = netcat-source
simple-agent.sinks = spark-sink
simple-agent.channels = memory-channel

# Describe/configure the source
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = linux01
simple-agent.sources.netcat-source.port = 44444

# Describe the sink
simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
simple-agent.sinks.spark-sink.hostname = linux01
simple-agent.sinks.spark-sink.port = 41414

# Use a channel which buffers events in memory
simple-agent.channels.memory-channel.type = memory
simple-agent.channels.memory-channel.capacity = 1000
simple-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.spark-sink.channel = memory-channel
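  1. Configure Spark Streaming

On the Spark Streaming side, the stream is created with FlumeUtils.createPollingStream instead of FlumeUtils.createStream. The rest of the word count can stay the same as in goal 1.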
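A minimal sketch of what the Spark Streaming side might look like for the pull approach, assuming the SparkSink address from the config above (linux01, port 41414) is passed in as the two program arguments. The object name `FlumePullWordCount` and the 5-second batch interval are assumptions for illustration, not taken from the repo. It needs the spark-streaming-flume_2.11 dependency from goal 1 on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePullWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args

    // The master is left to spark-submit (--master), so it is not set here.
    val conf = new SparkConf().setAppName("FlumePullWordCount")
    val ssc = new StreamingContext(conf, Seconds(5)) // batch interval is an assumption

    // Pull events from the SparkSink running inside the Flume agent.
    val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)

    flumeStream
      .map(e => new String(e.event.getBody.array()).trim) // keep only the event body
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Compared with goal 1 the startup order is typically reversed: the Flume agent goes up first so that the sink is already listening when the receiver connects.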
