Flume + Kafka + Spark Streaming WordCount Example

1. Flume

Installing and configuring Flume is not covered here; there are plenty of guides online. Here is one for reference: https://www.jianshu.com/p/82c77166b5a3
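As a quick sanity check (assuming Flume is unpacked under /opt/apache-flume-1.8.0-bin, the path used throughout this article), you can print the Flume version:

cd /opt/apache-flume-1.8.0-bin
bin/flume-ng version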
Next, write the Flume configuration file:

cd /opt/apache-flume-1.8.0-bin
vim conf/flume_kafka_and_hdfs.conf

Fill it with the following content:

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/syslogdatatest.txt
a1.sources.r1.channels = c1 c2

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/flume/flumeCheckpoint
a1.channels.c1.dataDirs = /home/flume/flumeData, /home/flume/flumeDataExt
a1.channels.c1.capacity = 2000000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 2000000
a1.channels.c2.transactionCapacity = 100

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = hdfs://cn01:9000/flume/events/%Y/%m/%d/%H/%M
a1.sinks.k1.hdfs.filePrefix = cmcc
a1.sinks.k1.hdfs.minBlockReplicas = 1
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 0

a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.topic = test1
a1.sinks.k2.brokerList = 192.168.10.101:9092,192.168.10.102:9092,192.168.10.103:9092
a1.sinks.k2.requiredAcks = 1
a1.sinks.k2.batchSize = 100
a1.sinks.k2.channel = c2

Save and exit.

2. Kafka

Similarly, here is a guide for installing and configuring Kafka: https://www.jianshu.com/p/3cb394ef41c0
Nothing extra needs to be written for Kafka itself; it only acts as the message broker. Just make sure Kafka is running and the topic has been created (test1 in this article, which must match the topic in the Flume configuration).
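A minimal sketch of creating the topic with the Kafka CLI, assuming ZooKeeper is reachable at cn01:2181 (adjust the address, partition count, and replication factor to your cluster; Kafka 2.2+ uses --bootstrap-server instead of --zookeeper):

# run from the Kafka installation directory
bin/kafka-topics.sh --create --zookeeper cn01:2181 --replication-factor 1 --partitions 3 --topic test1
# confirm the topic exists
bin/kafka-topics.sh --list --zookeeper cn01:2181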

3. Spark

For setting up a Spark cluster, see https://www.jianshu.com/p/f9a9147176a7; it is all fairly straightforward.
Now write the Scala program:

cd /opt/spark-2.2.1-bin-hadoop2.7
mkdir test    # create the project directory
cd test
mkdir -p src/main/scala 
vim src/main/scala/DirectKafkaWordCount.scala

Put the following code into DirectKafkaWordCount.scala:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

object DirectKafkaWordCount {
    def main(args: Array[String]) {
        if (args.length < 2) {
            System.err.println(s"""
                |Usage: DirectKafkaWordCount <brokers> <topics>
                |  <brokers> is a list of one or more Kafka brokers
                |  <topics> is a list of one or more Kafka topics to consume from
                |
                """.stripMargin)
            System.exit(1)
        }

        val Array(brokers, topics) = args
        val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
        val ssc = new StreamingContext(sparkConf, Seconds(2))   // 2-second micro-batches

        // Create a receiver-less ("direct") Kafka stream over the given topics
        val topicsSet = topics.split(",").toSet
        val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
        val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc, kafkaParams, topicsSet)

        // Each record is a (key, value) pair; count the words in the message values
        val lines = messages.map(_._2)
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
    }
}

Save and exit, then write a build file declaring the Spark dependencies:

vim build.sbt

Fill it with the following:

name := "Simple Project With DirectKafkaWordCount"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.2.1"

libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.2.1"

Save and exit as before.
Finally, compile the project:

sbt package

You need sbt installed first; there are plenty of guides for that online.
sbt will download a number of dependencies the first time; just wait. When the final output reports success, the build succeeded.
Two new subdirectories appear under the test directory; the jar we need is in target/scala-2.11.

4. Start everything and submit the job

Start Flume first:

cd /opt/apache-flume-1.8.0-bin
bin/flume-ng agent --conf conf/ --conf-file conf/flume_kafka_and_hdfs.conf --name a1 -Dflume.root.logger=INFO,console

Then open another terminal to run the Spark job. The commands are as follows:

cd /opt/spark-2.2.1-bin-hadoop2.7
bin/spark-submit --jars /home/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar test/target/scala-2.11/simple-project-with-directkafkawordcount_2.11-1.0.jar cn01:9092,cn02:9092,cn03:9092 test1

Here --jars points to the dependency jar: download the spark-streaming-kafka-0-8-assembly jar that matches your Spark version and upload it to the server.
Alternatively, replace --jars with --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.1, which downloads the dependency automatically.
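For example, the submit command with --packages looks like this (same jar and arguments as above):

cd /opt/spark-2.2.1-bin-hadoop2.7
bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.1 test/target/scala-2.11/simple-project-with-directkafkawordcount_2.11-1.0.jar cn01:9092,cn02:9092,cn03:9092 test1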
OK! You should now see the program running and waiting for data.
The last step is to write some content into /home/syslogdatatest.txt for the word count.
Open yet another terminal:

vim /home/syslogdatatest.txt
# write a few lines, for example:
hello flume
hello kafka
hello spark
apache spark
apache kafka
apache flume

Save and exit.
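Alternatively, since the source uses tail -F, you can simply append lines from the shell and they are picked up immediately, for example:

echo "hello flume" >> /home/syslogdatatest.txt
echo "hello spark" >> /home/syslogdatatest.txt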
If nothing goes wrong, the word-count results should appear almost immediately in the terminal where the Spark job was submitted.
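Assuming all six lines land in the same 2-second batch, the output of wordCounts.print() looks roughly like this (the timestamp is a placeholder and the ordering of the pairs may differ):

-------------------------------------------
Time: 1520000000000 ms
-------------------------------------------
(hello,3)
(apache,3)
(flume,2)
(kafka,2)
(spark,2)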
More information is available in the Spark web UI.
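The Flume agent's k1 sink also writes the same events to HDFS. Assuming the path from the configuration above, you can inspect the rolled files with:

hdfs dfs -ls -R hdfs://cn01:9000/flume/events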


END
