Integrating Flume with Spark Streaming
There are two approaches. Official documentation: http://spark.apache.org/docs/latest/streaming-flume-integration.html
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. This article shows how to configure Flume and Spark Streaming so that Spark Streaming receives data from Flume. There are two approaches.
In the push-based approach (Flume-style Push-based Approach), the Spark Streaming program listens on a port of some machine, and Flume keeps pushing data to that port through an avro sink. The example below uses a netcat source as the data source; the concrete integration steps are as follows.
Writing the Flume agent
Create a new configuration file: flume_push_streaming.conf
Use a netcat source listening on port 44444 as the data source, and send the data through an avro sink to port 41414 on the development machine (the Spark Streaming code is first tested in local mode).
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel
# netcat source: port to listen on for data (Linux machine)
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = localhost
simple-agent.sources.netcat-source.port = 44444
# avro sink: IP of the development machine (Windows)
simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = 192.168.43.233
simple-agent.sinks.avro-sink.port = 41414
simple-agent.channels.memory-channel.type = memory
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel
Add the dependency (its version should match the Spark version you run against):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
Application development (IDEA):
package com.zgw.spark.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Zhaogw&Lss on 2019/11/25.
  * The first way to integrate Spark Streaming with Flume (push-based).
  */
object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args

    val sc: SparkConf = new SparkConf().setMaster("local[3]")
      .setAppName("FlumePushWordCount")
      .set("spark.testing.memory", "2147480000")
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create the StreamingContext; it takes the SparkConf and the batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    // Receive the events pushed by the avro sink and word-count their bodies
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
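For reference, FlumeUtils.createStream also has an overload that takes an explicit storage level for the received events (the call above uses MEMORY_AND_DISK_SER_2 by default). A minimal sketch, replacing only the stream-creation line in the program above:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Same stream as before, but with the receiver storage level spelled out
// explicitly (this value is also the default).
val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt,
  StorageLevel.MEMORY_AND_DISK_SER_2)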
In IDEA, set the program arguments (Edit Configurations => Program Arguments) to: 0.0.0.0 41414
Start the Spark Streaming program (locally).
Start the Flume agent (on the Linux machine):
flume-ng agent \
--name simple-agent \
--conf conf \
--conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console &
Send data via telnet and watch the output in IDEA:
[hadoop@hadoop000 conf]$ telnet localhost 44444
Trying 192.168.43.174...
Connected to localhost.
Escape character is '^]'.
hadoop
OK
map
OK
hello
OK
1. Modify the line below: comment out the local-mode settings so that the master and other options can be supplied by spark-submit.
val sc: SparkConf = new SparkConf()
/**.setMaster("local[3]").setAppName("FlumePushWordCount").set("spark.testing.memory", "2147480000")*/
2. Package:
mvn clean package -DskipTests
3. Upload the packaged jar to the Linux machine.
4. Deployment: the Spark installation directory does not contain the spark-streaming-flume dependency, so it must be provided when submitting to the cluster. You can either use --jars in the submit command to point at a copy of the dependency uploaded to the server (a sketch of that variant is shown after the submission below), or use --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 to give the dependency's full coordinates, in which case the program first downloads it from the central repository at startup.
Submit the Spark program (pulling in the spark-streaming-flume_2.11:2.2.0 package via --packages):
spark-submit \
--class com.zgw.spark.FlumePushWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/home/hadoop/lib/SparkTrain-1.0.jar \
hadoop000 41414
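If the server cannot download from the central repository at submit time, the --jars variant from step 4 can be used instead. This is only a sketch: it assumes an assembly jar bundling spark-streaming-flume and its transitive dependencies (for example spark-streaming-flume-assembly_2.11-2.2.0.jar) has already been uploaded to /home/hadoop/lib; that jar name and path are assumptions, not taken from the run below.

spark-submit \
--class com.zgw.spark.FlumePushWordCount \
--master local[2] \
--jars /home/hadoop/lib/spark-streaming-flume-assembly_2.11-2.2.0.jar \
/home/hadoop/lib/SparkTrain-1.0.jar \
hadoop000 41414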
Run it (the machine must have network access because the dependency jars need to be downloaded; part of the log is omitted here for brevity):
[hadoop@hadoop000 lib]$ spark-submit \
> --class com.zgw.spark.FlumePushWordCount \
> --master local[2] \
> --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
> /home/hadoop/lib/SparkTrain-1.0.jar \
> hadoop000 41414
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-flume_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found org.apache.spark#spark-streaming-flume_2.11;2.2.0 in central
found org.apache.spark#spark-streaming-flume-sink_2.11;2.2.0 in central
found org.apache.flume#flume-ng-sdk;1.6.0 in central
found org.apache.avro#avro;1.7.7 in local-m2-cache
found org.codehaus.jackson#jackson-core-asl;1.9.13 in spark-list
found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in spark-list
found com.thoughtworks.paranamer#paranamer;2.6 in spark-list
found org.xerial.snappy#snappy-java;1.1.2.6 in central
org.mortbay.jetty#jetty-util;6.1.26 from spark-list in [default]
org.mortbay.jetty#servlet-api;2.5-20110124 from spark-list in [default]
org.slf4j#slf4j-api;1.7.16 from central in [default]
org.slf4j#slf4j-log4j12;1.7.16 from central in [default]
org.spark-project.spark#unused;1.0.0 from central in [default]
org.tukaani#xz;1.0 from spark-list in [default]
org.xerial.snappy#snappy-java;1.1.2.6 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 32 | 3 | 3 | 0 || 32 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 32 already retrieved (0kB/29ms)
-------------------------------------------
Time: 1574733430000 ms
-------------------------------------------
-------------------------------------------
Time: 1574733435000 ms
-------------------------------------------
-------------------------------------------
Time: 1574733440000 ms
-------------------------------------------
Modify the Flume configuration so that the avro sink points at the server (hadoop000), where the Spark Streaming receiver now runs:
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444
simple-agent.sinks.avro-sink.type = avro
# changed: the sink now points at the server (hadoop000)
simple-agent.sinks.avro-sink.hostname = hadoop000
simple-agent.sinks.avro-sink.port = 41414
simple-agent.channels.memory-channel.type = memory
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel
Start Flume again:
flume-ng agent \
--name simple-agent \
--conf conf \
--conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console &
Open a new window and run telnet localhost 44444:
[hadoop@hadoop000 ~]$ telnet localhost 44444
Trying 192.168.43.174...
Connected to localhost.
Escape character is '^]'.
hadoop spark
OK
hadoop
OK
spark
OK
Result:
-------------------------------------------
Time: 1574733695000 ms
-------------------------------------------
(spark,2)
(hadoop,2)
-------------------------------------------
Time: 1574733700000 ms
-------------------------------------------
The pull-based approach: writing the Flume agent. Here Flume pushes events into a custom SparkSink, which buffers them until Spark Streaming pulls them with a reliable receiver and acknowledges receipt; this gives stronger fault-tolerance guarantees than the push approach. The spark-streaming-flume-sink jar and its dependencies must be on Flume's classpath (see the sketch after the configuration below).
simple-agent.sources = netcat-source
simple-agent.sinks = spark-sink
simple-agent.channels = memory-channel
# netcat source: port to listen on for data
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444
# custom SparkSink: the host and port that Spark Streaming will pull from
simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
simple-agent.sinks.spark-sink.hostname = hadoop000
simple-agent.sinks.spark-sink.port = 41414
simple-agent.channels.memory-channel.type = memory
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.spark-sink.channel = memory-channel
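Before starting this agent, Flume must be able to load the custom SparkSink class, e.g. by placing the sink jar and its dependencies on Flume's classpath. A sketch of one common way to do this (the exact jar versions are assumptions and must match your Spark and Scala installation):

cp spark-streaming-flume-sink_2.11-2.2.0.jar $FLUME_HOME/lib/
cp scala-library-2.11.8.jar $FLUME_HOME/lib/
cp commons-lang3-3.5.jar $FLUME_HOME/lib/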
Writing the code:
package com.zgw.spark.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Zhaogw&Lss on 2019/11/25.
  * The second way to integrate Spark Streaming with Flume (pull-based).
  */
object FlumePullWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePullWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args

    val sc: SparkConf = new SparkConf().setMaster("local[3]")
      .setAppName("FlumePullWordCount")
      .set("spark.testing.memory", "2147480000")
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create the StreamingContext; it takes the SparkConf and the batch interval
    val ssc = new StreamingContext(sc, Seconds(5))

    // Pull buffered events from the SparkSink and word-count their bodies
    val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
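FlumeUtils.createPollingStream can also pull from several Spark sinks at once, which is useful when more than one Flume agent runs a SparkSink. A minimal sketch that would replace the single-address call in the program above (the second host name is a hypothetical example):

import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Pull from two SparkSink instances instead of one
val addresses = Seq(
  new InetSocketAddress("hadoop000", 41414),
  new InetSocketAddress("hadoop001", 41414)
)
val flumeStream = FlumeUtils.createPollingStream(ssc, addresses,
  StorageLevel.MEMORY_AND_DISK_SER_2)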
Note: with this approach, start Flume first, then start the Spark Streaming program.
Start Flume:
flume-ng agent \
--name simple-agent \
--conf conf \
--conf-file $FLUME_HOME/conf/flume_pull_streaming.conf \
-Dflume.root.logger=INFO,console &
Start the program (set the program arguments to hadoop000 41414).
Open a new window and run telnet localhost 44444:
[hadoop@hadoop000 lib]$ telnet localhost 44444
Trying 192.168.43.174...
Connected to localhost.
Escape character is '^]'.
hadoop
OK
spark
OK
Time: 1574735850000 ms
-------------------------------------------
(hadoop,1)
-------------------------------------------
Time: 1574735855000 ms
-------------------------------------------
(spark,1)
1. As before, modify the code first:
val sc: SparkConf = new SparkConf()
/*.setMaster("local[3]").setAppName("FlumePullWordCount").set("spark.testing.memory", "2147480000")*/
2. Package: same as above.
3. On the Linux machine, replace the jar that was used for the push approach.
4. Start Flume:
flume-ng agent \
--name simple-agent \
--conf conf \
--conf-file $FLUME_HOME/conf/flume_pull_streaming.conf \
-Dflume.root.logger=INFO,console &
5. Submit:
spark-submit \
--class com.zgw.spark.FlumePullWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/home/hadoop/lib/SparkTrain-1.0.jar \
hadoop000 41414
6. telnet localhost 44444:
[hadoop@hadoop000 ~]$ telnet localhost 44444
Trying 192.168.43.174...
Connected to localhost.
Escape character is '^]'.
hadoop spark zookeeper
OK
hadoop spark
OK
-------------------------------------------
Time: 1574736605000 ms
-------------------------------------------
(zookeeper,1)
(spark,2)
(hadoop,2)
This completes both ways of integrating Spark Streaming with Flume. The full code is hosted at https://github.com/daizikaikou/learningSpark