Integrating Spark Streaming with Flume

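There are two ways to integrate Spark Streaming with Flume: a push-based approach and a pull-based approach that uses a custom sink.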

Approach 1: Flume-style Push-based Approach

pom.xml dependencies

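<dependencies>
    <!-- Note: the two Spark artifacts below use different versions (2.0.2 and 2.1.1).
         They would normally be kept identical and matched to the cluster's Spark version. -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.11</artifactId>
        <version>2.0.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>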

  <!-- Packaging -->
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

Flume agent configuration for the push-based approach

[root@hadoop1 conf]# vim flume_push_streaming.conf

# Add the following settings
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop1.x
simple-agent.sources.netcat-source.port = 44444

simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = 192.168.126.171
simple-agent.sinks.avro-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel
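
In a Flume agent file, every channel that a source or sink refers to must also be listed under `<agent>.channels`. The following is a minimal sketch of a check for that wiring. The object `FlumeConfCheck` and the method `undeclaredChannels` are hypothetical helpers written for this illustration; they are not part of Flume, and the parser only handles simple `key = value` lines.

```scala
object FlumeConfCheck {

  /** Parses simple "key = value" lines, skipping blank lines and '#' comments. */
  private def parse(confText: String): Map[String, String] =
    confText.split("\n")
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map(_.split("=", 2))
      .collect { case Array(k, v) => k.trim -> v.trim }
      .toMap

  /** Splits a whitespace-separated list of component names. */
  private def names(value: String): Set[String] =
    value.trim.split("\\s+").filter(_.nonEmpty).toSet

  /** Returns the channels referenced by the agent's sources or sinks
    * that are not declared under "<agent>.channels". */
  def undeclaredChannels(confText: String, agent: String): Set[String] = {
    val entries = parse(confText)
    val declared = entries.get(s"$agent.channels").map(names).getOrElse(Set.empty[String])

    val referenced = entries.toSeq.collect {
      // Sources list one or more channels under the ".channels" key.
      case (k, v) if k.startsWith(s"$agent.sources.") && k.endsWith(".channels") => names(v)
      // Sinks name exactly one channel under the ".channel" key.
      case (k, v) if k.startsWith(s"$agent.sinks.") && k.endsWith(".channel") => names(v)
    }.flatten.toSet

    referenced -- declared
  }
}
```

For example, a file that declares `simple-agent.channel` instead of `simple-agent.channels` would be reported as referencing the undeclared channel `memory-channel`, while a correctly wired file yields an empty set.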

Writing the Spark Streaming application

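package com.imooc.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

/**
 * First approach to integrating Spark Streaming with Flume (push-based).
 */
object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    // Expect exactly two arguments: the host and port the receiver listens on.
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args

    // spark-submit supplies the master and app name; uncomment the calls below
    // only when running locally in the IDE.
    val sparkConf = new SparkConf() //.setMaster("local[2]").setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Start a receiver on hostname:port that the Flume Avro sink pushes events to.
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)

    // Decode each event body, split it into words, and count the words in each 5-second batch.
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}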
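
To make the per-batch transformation easier to follow, here is a sketch of the same decode, split, and count steps on plain Scala collections, with no Spark dependency. The object `WordCountSketch` and the method `countWords` are hypothetical names used only for this illustration. Unlike the streaming code, the sketch decodes explicitly as UTF-8 and drops empty tokens; both are assumptions, not something the original code does.

```scala
import java.nio.charset.StandardCharsets

object WordCountSketch {

  /** Counts words across a batch of event bodies, mirroring
    * map -> flatMap -> map((_, 1)) -> reduceByKey(_ + _) from the streaming job. */
  def countWords(bodies: Seq[Array[Byte]]): Map[String, Int] =
    bodies
      .map(body => new String(body, StandardCharsets.UTF_8).trim) // decode each event body
      .flatMap(_.split(" "))                                      // split lines into words
      .filter(_.nonEmpty)                                         // assumption: ignore empty tokens
      .groupBy(identity)                                          // group equal words together
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Calling `countWords` on the two bodies `"a b a\n"` and `"b c"` gives `a -> 2`, `b -> 2`, `c -> 1`, which is the kind of result the streaming job prints for a batch containing those two lines.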

Passing the program arguments
[Images 1 and 2: steps for passing the program arguments]

Start the Flume agent on the virtual machine

[root@hadoop1 bin]# ./flume-ng agent --name simple-agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume_push_streaming.conf -Dflume.root.logger=INFO,console

Agent startup output:
[Image 3: Flume agent startup output]

Run the program; the word counts are printed to the console:
[Image 4: console output]
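
The walkthrough does not show how test input reaches the netcat source. One option is a small client that writes lines to the source's host and port (`hadoop1.x` and `44444` in the agent configuration above). The following is a sketch under that assumption; `NetcatSender` and `sendLines` are hypothetical names, and the sketch does not read any acknowledgements the source may send back.

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.net.Socket
import java.nio.charset.StandardCharsets

object NetcatSender {

  /** Opens a TCP connection and writes each line followed by a newline,
    * so that a line-oriented source can turn every line into one event. */
  def sendLines(host: String, port: Int, lines: Seq[String]): Unit = {
    val socket = new Socket(host, port)
    try {
      val out = new PrintWriter(
        new OutputStreamWriter(socket.getOutputStream, StandardCharsets.UTF_8))
      lines.foreach(line => out.print(line + "\n")) // explicit "\n" regardless of platform
      out.flush()
    } finally {
      socket.close()
    }
  }
}
```

With the agent and the streaming job running, a call such as `NetcatSender.sendLines("hadoop1.x", 44444, Seq("a b a", "b c"))` should lead to word counts appearing in a later batch.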

To run the job in a production environment, the project has to be packaged first:
[Image 5: packaging the project]
Upload the resulting jar to the lib directory:

[root@hadoop1 lib]# rz -be
rz waiting to receive.
Starting zmodem transfer.  Press Ctrl+C to cancel.
Transferring sparktrain-1.0-SNAPSHOT.jar...
  100%       7 KB       7 KB/sec    00:00:01       0 Errors  


[root@hadoop1 lib]# pwd
/home/hadoop/lib
[root@hadoop1 lib]# ll
total 8
-rw-r--r--. 1 root root 7608 Apr 12 13:43 sparktrain-1.0-SNAPSHOT.jar
[root@hadoop1 lib]# 
// Running processes
[root@hadoop1 spark]# jps
12801 ResourceManager
12930 NodeManager
12470 DataNode
12646 SecondaryNameNode
13750 Jps
12071 Application
12330 NameNode
[root@hadoop1 spark]#  spark-submit  --class com.imooc.spark.FlumePushWordCount  --master local[2]  --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 /home/hadoop/sparktrain-1.0-SNAPSHOT.jar  hadoop1.x 41414  

  
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/etc/hadoop/module/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-flume_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0

An error occurred:

20/04/13 16:33:29 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://hadoop1.x:9000/directory
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:531)
        at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:836)
        at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:84)
        at com.imooc.spark.FlumePushWordCount$.main(FlumePushWordCount.scala:15)
        at com.imooc.spark.FlumePushWordCount.main(FlumePushWordCount.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/04/13 16:33:29 INFO ShutdownHookManager: Shutdown hook called
20/04/13 16:33:29 INFO ShutdownHookManager: Deleting directory /tmp/spark-1c4c0da4-e097-48c2-9e91-b97d81965a0a

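To be resolved. The stack trace fails in EventLoggingListener.start, which suggests that Spark event logging is enabled and that the configured event log directory, hdfs://hadoop1.x:9000/directory, does not exist in HDFS. Creating that directory, or disabling spark.eventLog.enabled, is the likely fix.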
