Two Ways to Integrate Spark Streaming with Flume

Official documentation: http://spark.apache.org/docs/latest/streaming-flume-integration.html

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Here we describe how to configure Flume and Spark Streaming so that Spark Streaming receives data from Flume. There are two approaches.

Approach 1: Push-based

In the push-based approach (Flume-style Push-based Approach), the Spark Streaming program listens on a port of a given machine, and Flume keeps pushing data to that port through an avro sink. Here a netcat source is used as the data source for the example; the integration works as follows.
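
On the Spark side, the push approach boils down to a single FlumeUtils.createStream call, which starts an Avro server on the given host and port and turns the pushed events into a DStream. A minimal sketch, assuming an existing StreamingContext named ssc (the complete program appears later in this section):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Listen on 0.0.0.0:41414 for events pushed by Flume's avro sink.
// The storage level argument is optional; MEMORY_AND_DISK_SER_2 is the default.
val stream = FlumeUtils.createStream(ssc, "0.0.0.0", 41414, StorageLevel.MEMORY_AND_DISK_SER_2)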

Writing the Flume agent

Create a new configuration file: flume_push_streaming.conf

A netcat source listening on port 44444 is used as the data source, and an avro sink forwards the data to port 41414 on the development machine (the Spark Streaming code is run there in local mode for testing).

simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

# netcat source: the data listening port (on the Linux machine)
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = localhost
simple-agent.sources.netcat-source.port = 44444

# avro sink: points at the IP of the local (Windows) machine running the Spark Streaming program
simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = 192.168.43.233
simple-agent.sinks.avro-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel

Add the Maven dependency (the artifact's Scala suffix, here 2.11, must match your project's Scala version):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
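
If the project were built with sbt rather than Maven, the equivalent declaration in build.sbt would look roughly like this (a sketch; keep the version in line with your Spark distribution):

// build.sbt: %% appends the project's Scala binary version (e.g. _2.11) automatically
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "2.1.1"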

Application development (IDEA):

package com.zgw.spark.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Zhaogw&Lss on 2019/11/25.
  * The first way to integrate Spark Streaming with Flume (push-based)
  */
object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname,port) = args

    val sc: SparkConf = new SparkConf().setMaster("local[3]").setAppName("FlumePushWordCount").set("spark.testing.memory", "2147480000")

    Logger.getLogger("org").setLevel(Level.ERROR)
    // create the StreamingContext; it takes two parameters: the SparkConf and the batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    val flumeStream = FlumeUtils.createStream(ssc,hostname,port.toInt)

    flumeStream
      .map(x => new String(x.event.getBody.array()).trim)   // decode the Flume event body into a line of text
      .flatMap(_.split(" "))                                 // split the line into words
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()


    ssc.start()

    ssc.awaitTermination()
  }

}
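
Each record in the stream is a SparkFlumeEvent wrapping an Avro event, so the Flume headers are available alongside the body. A small debugging snippet that could replace the word-count line above (illustrative only, not part of the original program):

    // Print a few events per batch, including their Flume headers
    flumeStream.foreachRDD { rdd =>
      rdd.take(5).foreach { e =>
        val body    = new String(e.event.getBody.array()).trim
        val headers = e.event.getHeaders   // java.util.Map of Flume header key/value pairs
        println(s"headers=$headers  body=$body")
      }
    }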

Then configure the program arguments in IDEA (Edit Configurations => Program arguments): 0.0.0.0 41414

Starting everything up

  • Start the Spark Streaming program (locally, in IDEA)

  • Start the Flume agent (on the Linux machine)

    flume-ng agent \
     --name simple-agent \
     --conf conf \
     --conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
     -Dflume.root.logger=INFO,console &
    
  • Send data via telnet and observe the output in IDEA

    [hadoop@hadoop000 conf]$ telnet localhost 44444
    Trying 192.168.43.174...
    Connected to localhost.
    Escape character is '^]'.
    hadoop
    OK
    map
    OK
    hello
    OK
    

Submitting to a production environment:

1. Modify the following line (the master and app name will now be supplied by spark-submit):

    val sc: SparkConf = new SparkConf()
    /**.setMaster("local[3]").setAppName("FlumePushWordCount").set("spark.testing.memory", "2147480000")*/
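
An alternative to commenting the settings out (a sketch, not from the original code) is to keep a local fallback that applies only when spark-submit has not set a master, so the same source runs both in IDEA and on the cluster:

    val sc: SparkConf = new SparkConf().setAppName("FlumePushWordCount")
    // Fall back to local mode only when no master was supplied by spark-submit
    if (!sc.contains("spark.master")) {
      sc.setMaster("local[3]")
    }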

2. Package:

mvn clean package -DskipTests

3. Upload the packaged jar to the Linux machine.

4. Deploy: the Spark installation directory does not contain the spark-streaming-flume dependency, so it must be supplied when the job is submitted to the cluster. Either use --jars to point at a copy of the dependency uploaded to the server, or use --packages with the full Maven coordinate (here org.apache.spark:spark-streaming-flume_2.11:2.2.0, matching the Scala and Spark versions in use), in which case the package is downloaded from the central repository when the program starts.

Submit the Spark program (pulling in the spark-streaming-flume_2.11:2.2.0 package):

spark-submit \
--class com.zgw.spark.FlumePushWordCount   \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/home/hadoop/lib/SparkTrain-1.0.jar \
hadoop000 41414

Run it (network access is required because the dependency jars are downloaded; part of the log is omitted for brevity):

[hadoop@hadoop000 lib]$ spark-submit \
> --class com.zgw.spark.FlumePushWordCount   \
> --master local[2] \
> --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
> /home/hadoop/lib/SparkTrain-1.0.jar \
> hadoop000 41414
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-flume_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found org.apache.spark#spark-streaming-flume_2.11;2.2.0 in central
	found org.apache.spark#spark-streaming-flume-sink_2.11;2.2.0 in central
	found org.apache.flume#flume-ng-sdk;1.6.0 in central
	found org.apache.avro#avro;1.7.7 in local-m2-cache
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in spark-list
	found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in spark-list
	found com.thoughtworks.paranamer#paranamer;2.6 in spark-list
	found org.xerial.snappy#snappy-java;1.1.2.6 in central
	org.mortbay.jetty#jetty-util;6.1.26 from spark-list in [default]
	org.mortbay.jetty#servlet-api;2.5-20110124 from spark-list in [default]
	org.slf4j#slf4j-api;1.7.16 from central in [default]
	org.slf4j#slf4j-log4j12;1.7.16 from central in [default]
	org.spark-project.spark#unused;1.0.0 from central in [default]
	org.tukaani#xz;1.0 from spark-list in [default]
	org.xerial.snappy#snappy-java;1.1.2.6 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   32  |   3   |   3   |   0   ||   32  |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 32 already retrieved (0kB/29ms)
-------------------------------------------
Time: 1574733430000 ms
-------------------------------------------

-------------------------------------------
Time: 1574733435000 ms
-------------------------------------------

-------------------------------------------
Time: 1574733440000 ms
-------------------------------------------


Modify the Flume configuration so that the source and the avro sink both point at the server (the Spark Streaming program now runs on hadoop000 instead of the local Windows machine):

simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444

simple-agent.sinks.avro-sink.type = avro
# changed to point at the server
simple-agent.sinks.avro-sink.hostname = hadoop000
simple-agent.sinks.avro-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel

Start Flume again:

flume-ng agent \
 --name simple-agent \
 --conf conf \
 --conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
 -Dflume.root.logger=INFO,console &

Open a new window and telnet localhost 44444:

[hadoop@hadoop000 ~]$ telnet localhost 44444
Trying 192.168.43.174...
Connected to localhost.
Escape character is '^]'.
hadoop spark
OK
hadoop
OK
spark
OK

Result:

-------------------------------------------
Time: 1574733695000 ms
-------------------------------------------
(spark,2)
(hadoop,2)

-------------------------------------------
Time: 1574733700000 ms
-------------------------------------------

Approach 2: Pull-based

In the pull-based approach, Flume pushes events into a custom SparkSink where they stay buffered; Spark Streaming then uses a reliable receiver with transactions to pull the data from the sink, which gives stronger reliability and fault-tolerance guarantees than the push approach.

Writing and testing the code locally

Writing the Flume agent

Note that the pull approach relies on the custom SparkSink, so the spark-streaming-flume-sink_2.11 jar (together with the scala-library and commons-lang3 jars it depends on) has to be on the Flume agent's classpath, as described in the official integration guide linked above.

simple-agent.sources = netcat-source
simple-agent.sinks = spark-sink
simple-agent.channels = memory-channel

# netcat source: the data listening port
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444

# SparkSink: the host and port that Spark Streaming will pull data from
simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
simple-agent.sinks.spark-sink.hostname = hadoop000
simple-agent.sinks.spark-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.spark-sink.channel = memory-channel

Application code:

package com.zgw.spark.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Zhaogw&Lss on 2019/11/25.
  * The second way to integrate Spark Streaming with Flume (pull-based)
  */
object FlumePullWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePullWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname,port) = args

    val sc: SparkConf = new SparkConf().setMaster("local[3]").setAppName("FlumePullWordCount").set("spark.testing.memory", "2147480000")

    Logger.getLogger("org").setLevel(Level.ERROR)
    // create the StreamingContext; it takes two parameters: the SparkConf and the batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    val flumeStream = FlumeUtils.createPollingStream(ssc,hostname,port.toInt)

    flumeStream
      .map(x => new String(x.event.getBody.array()).trim)   // decode the Flume event body into a line of text
      .flatMap(_.split(" "))                                 // split the line into words
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()


    ssc.start()

    ssc.awaitTermination()
  }

}
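
createPollingStream also has an overload that takes several sink addresses and an explicit storage level, which is useful when more than one Flume agent runs a SparkSink. A minimal sketch that reuses the ssc defined in the program above, with two hypothetical agent hostnames:

import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Pull from two SparkSink instances in parallel (agent1/agent2 are placeholder hostnames)
val addresses = Seq(
  new InetSocketAddress("agent1", 41414),
  new InetSocketAddress("agent2", 41414)
)
val multiStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2)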

Note: with the pull-based approach, start Flume first and then the Spark Streaming program.

Start Flume:

flume-ng agent \
 --name simple-agent \
 --conf conf \
 --conf-file $FLUME_HOME/conf/flume_pull_streaming.conf \
 -Dflume.root.logger=INFO,console &

Start the program (set the program arguments to: hadoop000 41414).

Open a new window and telnet localhost 44444:

[hadoop@hadoop000 lib]$ telnet localhost 44444
Trying 192.168.43.174...
Connected to localhost.
Escape character is '^]'.
hadoop
OK
spark
OK

-------------------------------------------
Time: 1574735850000 ms
-------------------------------------------
(hadoop,1)

-------------------------------------------
Time: 1574735855000 ms
-------------------------------------------
(spark,1)

Testing on the server:

1. As before, modify the code first:

val sc: SparkConf = new SparkConf()
/*.setMaster("local[3]").setAppName("FlumePullWordCount").set("spark.testing.memory", "2147480000")*/

2. Package: same as above.

3. On the Linux machine, replace the jar built earlier for the push approach with the new one.

4. Start Flume:

flume-ng agent \
 --name simple-agent \
 --conf conf \
 --conf-file $FLUME_HOME/conf/flume_pull_streaming.conf \
 -Dflume.root.logger=INFO,console &

5. Submit:

spark-submit \
--class com.zgw.spark.FlumePullWordCount   \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/home/hadoop/lib/SparkTrain-1.0.jar \
hadoop000 41414

6. telnet localhost 44444:

[hadoop@hadoop000 ~]$ telnet localhost 44444
Trying 192.168.43.174...
Connected to localhost.
Escape character is '^]'.
hadoop spark zookeeper
OK
hadoop spark
OK

-------------------------------------------
Time: 1574736605000 ms
-------------------------------------------
(zookeeper,1)
(spark,2)
(hadoop,2)

This completes both ways of integrating Spark Streaming with Flume.

Code repository: https://github.com/daizikaikou/learningSpark
