Spark (7): Integrating Spark Streaming with Flume

1. Spark Streaming Integration with Flume

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. It can ingest source data in many forms (files, socket packets, and so on) and write the collected data to many external storage systems such as HDFS, HBase, Hive, and Kafka. This post walks through the two ways of connecting Flume directly to Spark Streaming: push and poll. (Spark Streaming version 1.6.1, Flume version 1.6.0.)
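Both walkthroughs below need the spark-streaming-flume integration module on the application classpath. A minimal build sketch, assuming an sbt project (versions match the Spark 1.6.1 / Scala 2.10 setup used in this post):

// build.sbt (sketch)
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.1",
  "org.apache.spark" %% "spark-streaming"       % "1.6.1",
  // Provides FlumeUtils.createStream (push) and FlumeUtils.createPollingStream (poll)
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"
)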

1.1 Connecting Flume to Spark Streaming in Push Mode

  1. In the conf directory of the Flume installation, create a flume-push.conf configuration file with the following content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source 
a1.sources.r1.type = spooldir
# Directory that Flume monitors for new log files
a1.sources.r1.spoolDir = /home/hadoop/flumespool
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# Receiver side: the IP address and port where the Spark Streaming application runs locally
a1.sinks.k1.hostname = 192.168.72.1
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

  2. Write a Spark Streaming word count that reads the data Flume pushes to port 8888, and run it in local mode:
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    // Custom helper that reduces Spark's log verbosity (see the sketch below)
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePushWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: Flume sends data to Spark Streaming at this host and port
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.72.1", 8888)
    // The actual payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
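The code above calls LoggerLevels.setStreamingLogLevels(), a small custom helper that is not shown in the original post; it simply raises the log level so the per-batch output stays readable. A minimal sketch of such a helper, assuming the log4j 1.x that ships with Spark 1.6:

import org.apache.log4j.{Level, Logger}

// Hypothetical helper: silence Spark's verbose INFO logging
object LoggerLevels {
  def setStreamingLogLevels(): Unit = {
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}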
  3. Start the local Flume agent:
/home/hadoop/apps/flume-1.6.0/bin/flume-ng agent -n a1 -c conf -f /home/hadoop/apps/flume-1.6.0/conf/flume-push.conf

  4. Create a test file words.txt with the following content and place it in the monitored directory /home/hadoop/flumespool configured in flume-push.conf:
hello tom
hello jerry
hello tom
hello kitty
hello world
1,laozhao,18
2,laoduan,30
3,laomao,28

  5. Observe the console output: Flume successfully pushes the collected words.txt to Spark Streaming, and the word count is computed.
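The screenshot is omitted here. Assuming all eight lines of words.txt land in a single 5-second batch, the print() output would look roughly like this (batch time and ordering are illustrative only):

-------------------------------------------
Time: <batch time> ms
-------------------------------------------
(hello,5)
(tom,2)
(jerry,1)
(kitty,1)
(world,1)
(1,laozhao,18,1)
(2,laoduan,30,1)
(3,laomao,28,1)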

1.2 Connecting Flume to Spark Streaming in Poll Mode

  1. In the conf directory of the Flume installation, create a flume-poll.conf configuration file with the following content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
# Directory that Flume monitors for new log files
a1.sources.r1.spoolDir = /home/hadoop/flumespool
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
# IP address of the Flume agent and the port the SparkSink listens on (Spark Streaming pulls from here)
a1.sinks.k1.hostname = 192.168.72.128
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  2. Because the sink class configured in step 1 is provided by Spark, place spark-streaming-flume-sink_2.10-1.6.1.jar into flume-1.6.0/lib/, along with commons-lang3-3.3.2.jar and scala-library-2.10.5.jar.
  3. Write the Spark Streaming word count and run it in local mode:
import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollWordCount {
  def main(args: Array[String]): Unit = {
    // Custom helper that reduces Spark's log verbosity (same as in the push example)
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Poll mode: Spark Streaming pulls data from the SparkSink at the
    // host and port configured in flume-poll.conf
    val address = Seq(new InetSocketAddress("192.168.72.128", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    // The actual payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
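FlumeUtils.createPollingStream accepts a sequence of addresses, so one Streaming job can pull from several Flume agents at once. A sketch, assuming a hypothetical second agent running the same SparkSink on 192.168.72.129:8888 (drop this fragment into FlumePollWordCount in place of the single-address Seq):

val addresses = Seq(
  new InetSocketAddress("192.168.72.128", 8888),
  new InetSocketAddress("192.168.72.129", 8888)  // hypothetical second agent
)
val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)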

  4. Start the Flume agent:
/home/hadoop/apps/flume-1.6.0/bin/flume-ng agent -n a1 -c conf -f /home/hadoop/apps/flume-1.6.0/conf/flume-poll.conf
  5. Create a test file words.txt with the following content and place it in the monitored directory /home/hadoop/flumespool configured in flume-poll.conf:
hadoop spark sqoop hadoop spark hive hadoop
  6. Observe the console output: Spark Streaming successfully pulls words.txt from the Flume agent and completes the word count.
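The screenshot is omitted here. Assuming the single line of words.txt is processed in one batch, the printed counts would be roughly:

(hadoop,3)
(spark,2)
(sqoop,1)
(hive,1)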
